Grouping and Aggregation in F# QuickStart Sample

Illustrates how to group data and how to compute aggregates over groups and entire datasets. in F#.

View this sample in: C# Visual Basic

// Illustrates how to group data and how to compute aggregates 
// over groups and entire datasets.
    
#light

open System

open Extreme.DataAnalysis
open Extreme.Mathematics
open Extreme.Data.Text

// The license is verified at runtime. We're using a demo license here.
// For more information, see:
// https://numerics.net/trial-key
let licensed = Extreme.License.Verify("Demo license")

// We work with the Titanic dataset
let titanic = DelimitedTextFile.ReadDataFrame(@"..\..\..\..\data\titanic.csv")
// We'll use these columns often:
let age = titanic.GetColumn("Age")
let survived = titanic.["Survived"].As<bool>()
// We want to group by the passenger class,
// so we make this a categorical vector.
let pclass = titanic.["Pclass"].AsCategorical()

//
// Aggregators and Aggregation
// 

// The Aggregators class defines all common aggregator functions.
// The Aggregate method applies the aggregator
// to every column in the data frame:
let means = titanic.Aggregate(Aggregators.Mean)
printfn "%O" (means.Summarize())

// We can create custom aggregators. Here we compute
// the fraction of true values of a boolean vector:
let trueFraction = Aggregators.Create(
                    fun (b:Vector<bool>) -> float (b.CountTrue()) / float b.Count)
let pctSurvived = survived.Aggregate(trueFraction)

// We can also compute more than one aggregate:
let descriptives = titanic.Aggregate(
                    Aggregators.Count,
                    Aggregators.Mean,
                    Aggregators.StandardDeviation)
printfn "%O" (descriptives.Summarize())

// Aggregations can be applied to individual vectors:
let meanAge = age.Aggregate(Aggregators.Mean)

// Or to rows or columns of a matrix:
let m = Matrix.CreateRandom(5, 8)
let meanByRow = m.AggregateRows(Aggregators.Mean)
let meanByColumn = m.AggregateColumns(Aggregators.Mean)

//
// Groupings
//

// By defining a grouping, we can compute the aggregate
// for each group.

// The simplest grouping is by value, similar to 
// GROUP BY clauses in database queries.

// Let's get the average age by class:
let ageByClass = age.AggregateBy(pclass, Aggregators.Mean)

// Grouping by quantile means we sort the values
// and divide the result into groups of the same size.
let byQuantile = Grouping.ByQuantile(age, 5)
let survivedByAgeGroup = survived.AggregateBy(byQuantile, trueFraction)
printfn "Survival rate by age group:"
printfn "%O" (survivedByAgeGroup.Summarize())

// For the remainder we will use a vector with a DateTime index:
let x = Vector.CreateRandom(200)
let dates = Index.CreateDateRange(new DateTime(2016, 1, 1), x.Length)
x.Index <- dates

// A partition is a straight division of the data into equal groups:
let partition = Grouping.Partition(dates, 10, 
                    alignToEnd=true, skipIncomplete=true)
let partitionAvg = x.AggregateBy(partition, Aggregators.Mean)
printfn "Avg. by partition:"
printfn "%O" partitionAvg

//
// Moving and expanding windows
//

// Moving or rolling averages and related statistics 
// can be computed efficiently by using moving windows:
let window = Grouping.Window(dates, 20)
let ma20 = x.AggregateBy(window, Aggregators.Mean)
printfn "ma20:"
printfn "%O" (ma20.GetSlice(0, 20))
// Moving standard deviation is just as simple:
let mstd20 = x.AggregateBy(window, Aggregators.StandardDeviation)
printfn "mstd20:"
printfn "%O" (mstd20.GetSlice(0, 20))

// Moving windows can have a fixed number of elements, as above,
// or a fixed maximum width:
let window2 = Grouping.RangeWindow(dates, TimeSpan.FromDays(20.0))
let ma20_2= x.AggregateBy(window2, Aggregators.Mean)

// Expanding windows keep the starting point and move the end point
// forward in time:
let expanding = Grouping.ExpandingWindow(dates)
let expAvg = x.AggregateBy(expanding, Aggregators.Mean)
printfn "expAvg:"
printfn "%O" (expAvg.GetSlice(0, 10))

// 
// Resampling
//

// Resampling means computing values for a series 
// with longer periods by aggregating over the values
// for shorter periods.

// We start by creating an index with the boundaries,
// in this case the 10th of each month.
let months = Index.CreateDateRange(new DateTime(2016, 1, 10), 
                12, Recurrence.Monthly)
// We then create the resampling grouping from this:
// Giving the Direction argument as Backward means that
// the last value in the time period is used as the key
// for the group.
let resampling1 = Grouping.Resample(dates, months, Direction.Backward)
// We can also obtain this grouping in one step:            
let resampling2 = Grouping.Resample(dates, 
                    Recurrence.Monthly.Day(10), Direction.Backward)
let resampled = x.AggregateBy(resampling2, Aggregators.Mean)

//
// Pivot tables
//

// A pivot table is a 2-dimensional grouping on two key columns.
// For this, we go back to the Titanic dataset, and we compute 
// the survival rate per class in a different way. We group
// by class and by whether the passenger survived:
let pivot = Grouping.Pivot(
                titanic.["Pclass"].As<int>(),
                titanic.["Survived"].As<bool>())
// We can then get the # of elements in each group
// as a matrix, with rows indexed by class and columns
// indexed by survived:
let counts = pivot.CountsMatrix()
// Scaling by the row sums gives us the fraction
// of survived/did not survive for each class:
let fractions = counts.UnscaleRowsInPlace(counts.GetRowSums())
printfn "%O" (fractions.Summarize())

printf "Press Enter key to exit..."
Console.ReadLine() |> ignore