Groupings

A grouping is a collection of labeled groups of elements. It consists of an index of group keys and, for each key in the index, a set of integer indices that specify membership of the group. Groupings can be used to aggregate and to summarize data, to reshape data, and to perform certain calculations like moving averages.

Groupings are not tied to the data they may have been derived from.

As with other collection classes, there are two types that represent groupings. The generic Grouping<TKey> type represents a grouping where the generic type parameter specifies the type of the group keys. The IGrouping interface represents a grouping with an untyped index (represented by an IIndex). In addition, the Grouping class provides static methods to create groupings as well as some extension methods. Most of these methods have both generic and non-generic variants based on whether the type of the keys is statically known or not.

A grouping object has an Index property that returns the collection of group keys as a strongly typed index. When accessed through the IGrouping interface, the property returns an untyped index. The Count property returns the number of groups in the grouping. The GetIndexes(Int32) method returns the sequence of indexes for the group at the specified position. The GetCounts() method returns a vector that contains the number of elements in each group.

Groupings are created using static methods of the Grouping class. They are used most commonly as an argument in an aggregation operation, which is discussed in the next section.

Partitions

A partition is a grouping where each key is part of at most one group. There are several ways to create a partition:

The Partition method partitions a list into groups of equal size. This method has two to four arguments. The first argument is a list of key values. The second argument is the size of each partition. The third and fourth arguments are optional.

The third argument is a boolean value that specifies whether the partitions should be aligned to the end of the list. When omitted or set to false, the first partition starts at the first element in the list, and the first element in each partition is used as the group key. When set to true, the last partition ends at the last element in the list and the last element in each partition is used as the group key.

The fourth argument is used in conjunction with the third, and specifies whether incomplete partitions should be skipped. When omitted or set to false, one of the partitions may contain fewer elements than the requested size. When set to true, the incomplete partition is not included in the grouping.

The following example creates a partition of a list of dates. Each partition has 10 elements. Only full partitions are returned. The last partition ends on the last date in the list:

C#
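// Groups of 10 dates, aligned to the end of the list; an incomplete group is dropped.
var partition = Grouping.Partition(dates, 10, alignToEnd: true, skipIncomplete: true);
// Mean of x within each group (x is assumed to be a vector indexed by dates).
var partitionAvg = x.AggregateBy(partition, Aggregators.Mean);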
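To make the effect of the two optional arguments concrete, the sketch below computes the index ranges that such a partition could contain, using plain .NET collections rather than the library. The PartitionSketch.Ranges helper and its parameter names are hypothetical. It assumes that with end alignment the incomplete partition, if there is one, is the first one; the library's actual behavior may differ in such details.

```csharp
using System;
using System.Collections.Generic;

public static class PartitionSketch
{
    // Returns (start, count) pairs describing each partition of a list of the
    // given length. Hypothetical helper, not part of the library.
    public static List<(int Start, int Count)> Ranges(
        int length, int size, bool alignToEnd = false, bool skipIncomplete = false)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        var ranges = new List<(int Start, int Count)>();
        int remainder = length % size;

        int start = 0;
        // Assumption: when aligned to the end, the short partition comes first.
        if (alignToEnd && remainder > 0)
        {
            if (!skipIncomplete) ranges.Add((0, remainder));
            start = remainder;
        }
        // Full-size partitions.
        for (; start + size <= length; start += size)
            ranges.Add((start, size));
        // When aligned to the start, the short partition comes last.
        if (!alignToEnd && remainder > 0 && !skipIncomplete)
            ranges.Add((start, remainder));
        return ranges;
    }
}
```

With 25 elements and a size of 10, end alignment with incomplete partitions skipped gives the ranges starting at 5 and 15, so the last partition ends on the last element.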

The VariablePartition<TInput> method creates a partition where group membership depends on a condition that must hold between elements. The method takes three arguments. The first is a list of key values. The second is a delegate that evaluates the condition. This delegate takes two arguments: the first element in the current group and the next element to be considered. If the condition returns true, the next element is included in the current group. Otherwise, it becomes the first element of the next group. The last argument is a Direction value that specifies the direction in which groups grow. The index of the grouping is made up of the first or last elements in each group, depending on the grow direction.
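The rule described above can be illustrated without the library. The following is a minimal sketch of the forward direction only; VariablePartitionSketch.Forward is a hypothetical name, and the result is a plain list of position lists rather than a grouping object.

```csharp
using System;
using System.Collections.Generic;

public static class VariablePartitionSketch
{
    // Splits keys into consecutive groups of positions, growing forward.
    // sameGroup(first, next) decides whether `next` joins the group that
    // started at `first`. Hypothetical helper, not part of the library.
    public static List<List<int>> Forward<T>(
        IReadOnlyList<T> keys, Func<T, T, bool> sameGroup)
    {
        var groups = new List<List<int>>();
        int first = 0; // position of the first element of the current group
        for (int i = 0; i < keys.Count; i++)
        {
            // Start a new group at the very first element, or whenever the
            // condition fails for the current element.
            if (groups.Count == 0 || !sameGroup(keys[first], keys[i]))
            {
                groups.Add(new List<int>());
                first = i;
            }
            groups[groups.Count - 1].Add(i);
        }
        return groups;
    }
}
```

For example, passing (first, next) => first.Year == next.Year && first.Month == next.Month for a sorted list of dates would yield one group per calendar month.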

The ByValue method creates a grouping based on the values in a vector. This corresponds to GROUP BY clauses in database queries. The method is overloaded. The first argument is always a list that contains the values to group on. An optional second argument specifies an IEqualityComparer<T> that is used to compare keys for equality. The grouping's index contains the unique values in the list. The indexes of a group are the indexes in the list where the key appears. The example below creates a grouping based on the value of the passenger class column in the Titanic dataset:

C#
var valueGrouping = Grouping.ByValue(titanic["Pclass"].As<int>());
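As an illustration of what such a grouping contains, the sketch below builds the same kind of structure with a plain dictionary: each distinct value maps to the positions where it occurs. ByValueSketch.GroupPositions is a hypothetical helper written for this illustration and does not reflect how the library stores groupings internally.

```csharp
using System.Collections.Generic;

public static class ByValueSketch
{
    // Maps each distinct value to the list of positions where it occurs.
    // A null comparer means the default equality comparer is used.
    // Values must not be null, since they are used as dictionary keys.
    public static Dictionary<T, List<int>> GroupPositions<T>(
        IReadOnlyList<T> values, IEqualityComparer<T> comparer = null)
    {
        var groups = new Dictionary<T, List<int>>(comparer);
        for (int i = 0; i < values.Count; i++)
        {
            if (!groups.TryGetValue(values[i], out var positions))
            {
                positions = new List<int>();
                groups.Add(values[i], positions);
            }
            positions.Add(i);
        }
        return groups;
    }
}
```

For the values 3, 1, 3, 2, 1 this gives three groups: key 3 with positions 0 and 2, key 1 with positions 1 and 4, and key 2 with position 3.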

Partitions may also be created by quantile. The ByQuantile method creates a partition based on the order of the values. The method has two overloads. The first argument is a list that contains the values to group on. The second argument is either an integer or a list of real numbers. If it is an integer, it specifies the number of partitions or groups. Each partition will contain roughly the same number of elements. For example, with two partitions, the first partition will contain all indexes whose value in the list is less than the median, while the second partition will contain all indexes whose value is greater than the median. If the second argument is a list of real numbers, they specify the quantiles to include. The number of elements in each partition is proportional to the fraction specified by successive quantiles.

In the example below, we again use the Titanic dataset to group passengers into 5 groups of roughly equal size based on age. The 20% youngest passengers will be in the first group, the next 20% in the second group, and so on:

C#
var quantileGrouping = Grouping.ByQuantile(titanic["Age"].As<double>(), 5);
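The idea of groups of roughly equal size can be sketched by ranking the values and cutting the ranks into equal parts. This is a simplified stand-in, not the library's algorithm: QuantileSketch.ByRank is a hypothetical helper, and it splits purely by rank, so equal values may land in different groups, which a quantile-based implementation would probably avoid.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QuantileSketch
{
    // Assigns each position to one of `groups` buckets by the rank of its
    // value, so that bucket sizes differ by at most one.
    public static List<int>[] ByRank(IReadOnlyList<double> values, int groups)
    {
        if (groups <= 0) throw new ArgumentOutOfRangeException(nameof(groups));
        var buckets = new List<int>[groups];
        for (int g = 0; g < groups; g++) buckets[g] = new List<int>();

        // Positions ordered by value. OrderBy is a stable sort, so ties keep
        // their original relative order.
        int[] order = Enumerable.Range(0, values.Count)
                                .OrderBy(i => values[i])
                                .ToArray();
        for (int rank = 0; rank < order.Length; rank++)
        {
            // Rank r out of n falls in bucket floor(r * groups / n).
            int bucket = (int)((long)rank * groups / order.Length);
            buckets[bucket].Add(order[rank]);
        }
        return buckets;
    }
}
```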

Windows

A window grouping consists of overlapping segments of a list. There are several ways to define window groupings.

The Window method creates moving windows of fixed length. It takes 2 to 5 arguments. The first argument is a list of keys. The second argument is the size of the window.

The remaining arguments are optional. The third argument specifies the offset of the key in the window. A negative value means the offset is counted from the end of the window. The default value is -1, which means the last key in each window is used as the group key. The fourth argument specifies whether partial windows should be included in the grouping. When false (the default), only full size windows are included. The fifth argument specifies the minimum size of a window.

The example below creates a moving window of length 20 and computes the corresponding moving average:

C#
var window = Grouping.Window(dates, 20);
var ma20 = x.AggregateBy(window, Aggregators.Mean);
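For readers who want to check the numbers by hand, the following sketch computes the same kind of moving average directly from an array. It covers only the default case described above (full windows only, keyed by the last element of each window). MovingWindowSketch.MovingAverage is a hypothetical helper and not part of the library.

```csharp
using System;
using System.Collections.Generic;

public static class MovingWindowSketch
{
    // Moving average over full windows of the given size. Element k of the
    // result covers positions k .. k + size - 1 of the input.
    public static double[] MovingAverage(IReadOnlyList<double> x, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        if (x.Count < size) return Array.Empty<double>();

        var result = new double[x.Count - size + 1];
        double sum = 0;
        for (int i = 0; i < x.Count; i++)
        {
            sum += x[i];                       // element entering the window
            if (i >= size) sum -= x[i - size]; // element leaving the window
            if (i >= size - 1) result[i - size + 1] = sum / size;
        }
        return result;
    }
}
```

The running sum keeps the cost linear in the length of the input, at the price of some accumulated rounding error for very long series.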

Fixed size moving windows can also be created without an index. In this case, the first argument is the length of the source data. The remaining arguments have the same meaning.

The RangeWindow method creates moving windows whose range (difference between largest and smallest value) is not greater than the specified value. It takes 3 arguments. The first argument is a list of keys. The second argument is the width of the window. The last argument is the direction the window moves in. For the forward direction, each element from first to last is taken as the first element of a group, and the window is expanded until its range would exceed the specified width.
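The forward case can be sketched as follows for numeric keys. The sketch assumes that the keys are sorted in ascending order, so that the range of a window is the difference between its last and first key. RangeWindowSketch.ForwardEnds is a hypothetical helper used only for illustration.

```csharp
using System.Collections.Generic;

public static class RangeWindowSketch
{
    // For keys sorted in ascending order, returns for each start position i
    // the exclusive end position of the widest window starting at i whose
    // range (last key minus first key) does not exceed `width`.
    public static int[] ForwardEnds(IReadOnlyList<double> keys, double width)
    {
        var ends = new int[keys.Count];
        int end = 0;
        for (int i = 0; i < keys.Count; i++)
        {
            // A window always contains at least its own first element.
            if (end < i + 1) end = i + 1;
            // Because the keys are sorted, `end` never needs to move back.
            while (end < keys.Count && keys[end] - keys[i] <= width) end++;
            ends[i] = end;
        }
        return ends;
    }
}
```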

A moving range window is a special case of a variable size window, created by the VariableWindow<TInput> method. It takes 3 arguments. The first argument is a list of keys. The second argument is a delegate that evaluates the condition. This delegate takes two arguments: the first element in the current group and the next element to be considered. If the condition returns true, the next element is included in the current group. Otherwise, it becomes the first element of the next group. The third and last argument is once again the direction the window moves in.

Finally, the ExpandingWindow<TInput>(Index<TInput>) method creates a grouping of windows with the same starting point and increasing size. The only argument is an index that contains the keys. The example below creates an expanding moving window and computes the average:

C#
var expanding = Grouping.ExpandingWindow(dates);
var expAvg = x.AggregateBy(expanding, Aggregators.Mean);
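The result of aggregating over expanding windows can be reproduced by hand with a running sum. The sketch below is a plain .NET illustration; ExpandingWindowSketch.ExpandingMean is a hypothetical helper and not part of the library.

```csharp
using System.Collections.Generic;

public static class ExpandingWindowSketch
{
    // Element k of the result is the mean of x[0], ..., x[k].
    public static double[] ExpandingMean(IReadOnlyList<double> x)
    {
        var result = new double[x.Count];
        double sum = 0;
        for (int i = 0; i < x.Count; i++)
        {
            sum += x[i];
            result[i] = sum / (i + 1);
        }
        return result;
    }
}
```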

Resampling

When one index is entirely contained in another, the keys in the larger index can be grouped according to the key in the smaller index that follows or precedes them. Such a grouping is commonly used to convert data sets based on different time frequencies. The Resample method creates such a grouping. It takes three arguments. The first argument is the original (larger) index. The second argument is the new (smaller) index. The third argument is a Direction value that indicates whether the entries in the new index should be taken as the start (Forward) or end (Backward) of a sampling interval. The new index is also the index of the grouping.

In the example below, we create an index of dates on the 10th of each month. We then compute a resampling of the original dates using this index:

C#
var months = Index.CreateDateRange(new DateTime(2015, 1, 10), 10, Recurrence.Monthly);
var resampling1 = Grouping.Resample(dates, months, Direction.Backward);
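To show which original entries end up in which group, here is a minimal sketch of the backward case in plain .NET. It makes several assumptions that the text does not spell out: both lists are sorted in ascending order, each sampled key is the inclusive end of its interval, and original keys later than the last sampled key are left out. ResampleSketch.Backward is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

public static class ResampleSketch
{
    // Groups positions in `original` under the first key in `sampled` that is
    // not earlier than the original key. Returns one position list per
    // sampled key.
    public static List<int>[] Backward(
        IReadOnlyList<DateTime> original, IReadOnlyList<DateTime> sampled)
    {
        var groups = new List<int>[sampled.Count];
        for (int g = 0; g < groups.Length; g++) groups[g] = new List<int>();

        int group = 0;
        for (int i = 0; i < original.Count; i++)
        {
            // Advance to the interval whose end is on or after this key.
            while (group < sampled.Count && sampled[group] < original[i]) group++;
            if (group == sampled.Count) break; // past the last interval
            groups[group].Add(i);
        }
        return groups;
    }
}
```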

Pivots

A pivot is a two-dimensional grouping, represented by the Pivot<R, C> and IPivot types. Instead of a single index, a pivot has separate RowIndex and ColumnIndex properties. The GetCounts() method returns a matrix of group counts. The GetIndexes method has an overload, GetIndexes(Int32, Int32), that takes the row and column position as arguments.

Pivots are created using the Pivot<R, C>(IList<R>, IList<C>) method. This method takes two arguments: a list of row keys and a list of column keys. In the example below, we create a table of the proportion of survivors in each passenger class:

C#
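// Count passengers by class (rows) and survival (columns).
var survived = Grouping.Pivot(
    titanic["Pclass"].As<int>(),
    titanic["Survived"].As<bool>()).CountsMatrix();
// Divide each row by its row sum to turn counts into proportions.
survived.UnscaleRowsInPlace(survived.GetRowSums());
Console.WriteLine(survived);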
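The same computation can be written out with plain arrays, which may help when checking results. The sketch below cross-tabulates two key lists and divides each row of counts by its row total. PivotSketch.RowFractions is a hypothetical helper, and the order of the row and column keys it reports is not meant to match the library's.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PivotSketch
{
    // Cross-tabulates two equally long key lists and returns, for each
    // (row key, column key) pair, the fraction of that row's entries that
    // fall in that column. `rows` and `columns` receive the distinct keys
    // that label the matrix.
    public static double[,] RowFractions<R, C>(
        IReadOnlyList<R> rowKeys, IReadOnlyList<C> columnKeys,
        out List<R> rows, out List<C> columns)
    {
        if (rowKeys.Count != columnKeys.Count)
            throw new ArgumentException("Key lists must have the same length.");

        rows = rowKeys.Distinct().ToList();
        columns = columnKeys.Distinct().ToList();
        var fractions = new double[rows.Count, columns.Count];
        var rowSums = new double[rows.Count];

        // First pass: raw counts per cell and per row.
        for (int i = 0; i < rowKeys.Count; i++)
        {
            int r = rows.IndexOf(rowKeys[i]);
            int c = columns.IndexOf(columnKeys[i]);
            fractions[r, c]++;
            rowSums[r]++;
        }
        // Second pass: divide each row by its total. Every row key occurs at
        // least once, so the row sums are never zero.
        for (int r = 0; r < rows.Count; r++)
            for (int c = 0; c < columns.Count; c++)
                fractions[r, c] /= rowSums[r];
        return fractions;
    }
}
```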