Aggregating data frames

Aggregation operations on data frames can be performed on the data frame as a whole, or on grouped data.

Aggregating full data frames

The Aggregate<T>(AggregatorGroup<T>) method and its overloads compute aggregates of all the data in a data frame. This method has five overloads, some of which are defined as extension methods. The first overload takes a single argument: an AggregatorGroup<T> that specifies the aggregator to apply to each column. It returns a Vector<T> that contains the result of applying the aggregator to each column. If the aggregator does not support the element type of a column, a missing value is returned for that column. The following example computes the mean of all numerical columns in the Titanic dataset:

C#
var titanic = DataFrame.ReadCsv(titanicFilename);
var means = titanic.Aggregate(Aggregators.Mean);
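To make the per-column behavior concrete, here is a minimal sketch in plain C# that does not use the library. It assumes columns are stored as arrays keyed by name, and uses null to stand in for the missing value returned for a column whose element type the aggregator does not support. The names ColumnMeanSketch, MeanOrMissing, and MeanOfEachColumn are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnMeanSketch
{
    // Returns the mean of a numeric (double) column, or null as a stand-in
    // for a missing value when the column has another element type or is empty.
    public static double? MeanOrMissing(Array column) =>
        column is double[] values && values.Length > 0
            ? values.Average()
            : (double?)null;

    // Applies the same aggregator to every column and returns one result per column key.
    public static Dictionary<string, double?> MeanOfEachColumn(
        IDictionary<string, Array> columns) =>
        columns.ToDictionary(c => c.Key, c => MeanOrMissing(c.Value));
}
```

With a numeric column "Age" holding 20.0 and 30.0 and a string column "Name", MeanOfEachColumn yields 25 for "Age" and null for "Name". The library itself returns these results as a Vector<T> rather than a dictionary.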

The second overload takes a parameter array of aggregator groups. It returns a data frame with a row for each aggregator in the array. All aggregator groups must return the same type. As an example, we add the count and the standard deviation to the previous aggregation:

C#
var descriptives = titanic.Aggregate(
    Aggregators.Count,
    Aggregators.Mean,
    Aggregators.StandardDeviation);

Aggregating grouped data frames

The AggregateBy<R1>(IGrouping, AggregatorGroup[]) method and its overloads compute aggregates of the data in a data frame grouped according to some criteria. The method has many overloads. The first argument always specifies the grouping. It can be a Grouping<TKey> object, a vector, or the key of the column that is to be used for the grouping. When using the column key, the element type of the column must be specified as the generic type argument. The remaining arguments follow the same pattern as the Aggregate<T>(AggregatorGroup<T>) method.

In the example below, we compute the mean of each column in the Titanic dataset grouped by passenger class. The result is a data frame with one row for each class, indexed by the class number. We show all three ways of specifying the grouping.

C#
var key = "Pclass";
var vector = titanic[key].As<int>();
var grouping = Grouping.ByValue(vector);
var meanByClass1 = titanic.AggregateBy(grouping, Aggregators.Mean);
var meanByClass2 = titanic.AggregateBy(vector, Aggregators.Mean);
var meanByClass3 = titanic.AggregateBy<int>(key, Aggregators.Mean);
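For readers who want to see what the grouped result contains, the following plain LINQ sketch computes the same kind of per-class means without the library. The sample rows are made up for illustration, only two columns are included, and the names GroupedMeanSketch, Rows, and MeanByClass are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupedMeanSketch
{
    // Hypothetical sample rows (Pclass, Age, Fare); the values are invented.
    public static readonly (int Pclass, double Age, double Fare)[] Rows =
    {
        (1, 40.0, 80.0),
        (1, 30.0, 60.0),
        (3, 20.0, 8.0),
        (3, 24.0, 10.0),
    };

    // One entry per class, keyed by class number, holding the mean of each column.
    public static SortedDictionary<int, (double Age, double Fare)> MeanByClass()
    {
        var result = new SortedDictionary<int, (double Age, double Fare)>();
        foreach (var group in Rows.GroupBy(r => r.Pclass))
        {
            result[group.Key] = (group.Average(r => r.Age), group.Average(r => r.Fare));
        }
        return result;
    }
}
```

Each key of the returned dictionary plays the role of a row index entry in the data frame that AggregateBy returns: one row per distinct class value, with the aggregated columns as its values.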

When multiple aggregators are supplied, the resulting data frame has a hierarchical column index. The first level contains the original column keys. The second level contains the name of the aggregator. In the next example, we compute the count, the mean, and the median of each column:

C#
var manyByClass1 = titanic.AggregateBy(grouping,
    Aggregators.Count, Aggregators.Mean, Aggregators.Median);
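The two-level column index can be pictured as a set of (column, aggregator) pairs. The sketch below builds such pairs in plain C#. It illustrates the shape of the index only; the ordering of the pairs and the names HierarchicalKeysSketch and ColumnKeys are assumptions, not the library's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HierarchicalKeysSketch
{
    // Pairs every original column key (first level) with every
    // aggregator name (second level).
    public static List<(string Column, string Aggregator)> ColumnKeys(
        IEnumerable<string> columns, IEnumerable<string> aggregators) =>
        columns
            .SelectMany(c => aggregators.Select(a => (Column: c, Aggregator: a)))
            .ToList();
}
```

For the columns "Age" and "Fare" and the aggregators "Count", "Mean", and "Median", this produces six keys, starting with ("Age", "Count") and ending with ("Fare", "Median").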