Variable-Length Data

Variable-length data, also known as ragged data, is a common pattern in data analysis where each observation contains a sequence whose length may vary. Examples include tokenized text (where each document has a different number of words), grouped measurements (where each entity has a different number of observations), segmented time series, and feature lists per sample.

Numerics.NET provides first-class support for variable-length data through the ListVector<T> class, which efficiently stores and manipulates sequences of variable-length lists.

What is Variable-Length Data?

Variable-length data occurs when observations naturally contain sequences of different lengths. Common examples include:

  • Tokenized text: Documents contain different numbers of words or tokens.

  • Grouped measurements: Each subject or entity has a different number of measurements.

  • Segmented time series: Time periods contain different numbers of observations.

  • Feature lists: Each sample has a variable number of features or attributes.

Unlike fixed-length data, which can be represented in a rectangular matrix, variable-length data requires a more flexible representation that preserves the natural grouping structure.

Relationship to Groupings

A Grouping<TKey> defines how a flat sequence is partitioned into groups. A ListVector<T> materializes this partitioning as data: each group becomes a list, and the collection of lists becomes a vector.

This means that aggregating over a list vector is conceptually equivalent to flattening it first and then aggregating by its grouping:

C#
// A ListVector materializes a grouping
var flat = tokens.Flatten();
var grouping = tokens.Grouping;

// These two operations are equivalent: both count the elements in each list.
var listCounts = tokens.AggregateLists(Aggregators.Count);
var groupedCounts = flat.AggregateBy(grouping, Aggregators.Count);

A ListVector<T> is thus not merely a container; it is a grouped representation of data that preserves and exposes the grouping structure through its Grouping property.
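The equivalence can also be illustrated without the library. The following is a minimal sketch in plain C# with LINQ, not Numerics.NET code: jagged arrays stand in for the list vector, and the list index stands in for the grouping. It assumes every list is non-empty, since a grouped flat sequence has no entry for an empty list.

```csharp
using System;
using System.Linq;

// Jagged arrays stand in for a list vector (illustration only).
var lists = new[]
{
    new[] { "the", "quick", "brown" },
    new[] { "the", "lazy", "dog", "sleeps" },
    new[] { "pack", "my", "box" }
};

// Route 1: aggregate each list directly.
int[] perListCounts = lists.Select(items => items.Length).ToArray();

// Route 2: flatten while remembering which list each value came from,
// then aggregate the flat values grouped by that list index.
var flatPairs = lists.SelectMany((items, index) => items.Select(item => (index, item)));
int[] groupedCounts = flatPairs
    .GroupBy(pair => pair.index)
    .OrderBy(group => group.Key)
    .Select(group => group.Count())
    .ToArray();

Console.WriteLine(string.Join(", ", perListCounts));          // 3, 4, 3
Console.WriteLine(perListCounts.SequenceEqual(groupedCounts)); // True
```

Both routes yield one result per list, which is why the grouping carries all the structural information of the nested representation.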

ListVector<T> as the Core Abstraction

The ListVector<T> class represents a vector whose elements are variable-length lists. It has the following characteristics:

  • Fixed structure: The number of lists and their boundaries are immutable.

  • Variable list lengths: Each list can contain any number of elements.

  • Efficient storage: Values are stored in a flattened format similar to compressed sparse row (CSR) storage, with all values in a contiguous array and offsets indicating list boundaries.

  • Mutable values: Individual elements within lists can be modified when the vector attributes permit it.

Creating a list vector from nested data is straightforward:

C#
// Create a list vector from nested data
var documents = new[] {
    new[] { "the", "quick", "brown" },
    new[] { "the", "lazy", "dog", "sleeps" },
    new[] { "pack", "my", "box" }
};
var tokens = Vector.CopyFromAsLists(documents);
Console.WriteLine($"List vector length: {tokens.Length}");
Console.WriteLine($"Total tokens: {tokens.FlattenedLength}");
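To make the CSR-like layout described above concrete, here is a minimal sketch in plain C#. It does not show how ListVector<T> is implemented internally; the helper names ToFlatStorage and GetList are hypothetical and exist only for this illustration. All values go into one contiguous array, and an offsets array with one more entry than there are lists marks where each list begins and ends.

```csharp
using System;

// Flatten nested lists into one contiguous array plus an offsets array.
// List i occupies the half-open range [offsets[i], offsets[i + 1]).
static (T[] Values, int[] Offsets) ToFlatStorage<T>(T[][] lists)
{
    var offsets = new int[lists.Length + 1];
    for (int i = 0; i < lists.Length; i++)
        offsets[i + 1] = offsets[i] + lists[i].Length;

    var values = new T[offsets[lists.Length]];
    for (int i = 0; i < lists.Length; i++)
        lists[i].CopyTo(values, offsets[i]);

    return (values, offsets);
}

// View list i as a slice of the flat array, without copying.
static ArraySegment<T> GetList<T>(T[] values, int[] offsets, int i) =>
    new ArraySegment<T>(values, offsets[i], offsets[i + 1] - offsets[i]);

var documents = new[]
{
    new[] { "the", "quick", "brown" },
    new[] { "the", "lazy", "dog", "sleeps" },
    new[] { "pack", "my", "box" }
};

var (flatValues, listOffsets) = ToFlatStorage(documents);

Console.WriteLine(string.Join(", ", listOffsets));                        // 0, 3, 7, 10
Console.WriteLine(string.Join(" ", GetList(flatValues, listOffsets, 1))); // the lazy dog sleeps
```

With this layout, changing a value touches only the flat array, while changing a list boundary would require shifting every later offset. That is consistent with a design in which values may be mutable but the structure is fixed.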

Typical Workflows

ListVector<T> supports a variety of common operations on variable-length data:

Aggregating Per List

Use the AggregateLists method to compute statistics for each list:

C#
// Compute statistics per list
var observations = new[] {
    new[] { 1.5, 2.3, 1.8 },
    new[] { 4.1, 3.9, 4.5, 4.2 },
    new[] { 2.1, 2.0 }
};
var data = Vector.CopyFromAsLists(observations);

var means = data.AggregateLists(Aggregators.Mean);
var maxValues = data.AggregateLists(Aggregators.Maximum);
var counts = data.GetListLengths();

Console.WriteLine($"Means: {means}");
Console.WriteLine($"Max values: {maxValues}");
Console.WriteLine($"Counts: {counts}");

Transforming Elements

Use the Map method to transform elements within each list:

C#
// Transform elements within each list
var normalized = data.Map(x => x / 10.0);

// Combine each element with a per-list scalar: subtract the list's mean
var centered = data.Map(means, (value, mean) => value - mean);
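For readers who want to see what the per-list form computes, the following plain C# sketch performs the same kind of centering with LINQ on jagged arrays. It is an illustration only and does not use the Numerics.NET API; it assumes each list is non-empty, because Average throws on an empty sequence.

```csharp
using System;
using System.Linq;

var observations = new[]
{
    new[] { 1.5, 2.3, 1.8 },
    new[] { 4.1, 3.9, 4.5, 4.2 },
    new[] { 2.1, 2.0 }
};

// For each list, compute one scalar (its mean) and apply it to every element.
// The list boundaries are unchanged; only the values are transformed.
double[][] centered = observations
    .Select(items =>
    {
        double mean = items.Average();
        return items.Select(value => value - mean).ToArray();
    })
    .ToArray();

foreach (var row in centered)
    Console.WriteLine(string.Join(", ", row.Select(v => v.ToString("F3"))));
```

Each output list has the same length as its input, and its elements sum to zero up to floating-point rounding.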

Flattening to a Standard Vector

Use the Flatten method to convert the list vector back to a regular vector containing all elements in order:

C#
// Flatten back to a regular vector
var allValues = data.Flatten();
var overallMean = allValues.Mean();

Converting to a Fixed-Width Matrix

For downstream algorithms that require rectangular data, use the ToRowMatrix or ToColumnMatrix methods to convert the list vector to a matrix, padding or truncating lists as necessary:

C#
// Convert to matrix for downstream analysis
var matrix = data.ToRowMatrix(paddingValue: 0.0);
Console.WriteLine($"Matrix shape: {matrix.RowCount}x{matrix.ColumnCount}");
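The padding step can be pictured with a short plain C# sketch. The helper ToPaddedRows is hypothetical and is not part of Numerics.NET. It makes the matrix as wide as the longest list, so nothing is truncated; truncating to a chosen width would be a variation on the same loop.

```csharp
using System;

// Copy each list into one row of a rectangular array,
// filling the unused cells of shorter rows with paddingValue.
static double[,] ToPaddedRows(double[][] lists, double paddingValue)
{
    int width = 0;
    foreach (var items in lists)
        width = Math.Max(width, items.Length);

    var matrix = new double[lists.Length, width];
    for (int i = 0; i < lists.Length; i++)
        for (int j = 0; j < width; j++)
            matrix[i, j] = j < lists[i].Length ? lists[i][j] : paddingValue;

    return matrix;
}

var observations = new[]
{
    new[] { 1.5, 2.3, 1.8 },
    new[] { 4.1, 3.9, 4.5, 4.2 },
    new[] { 2.1, 2.0 }
};

var padded = ToPaddedRows(observations, 0.0);
Console.WriteLine($"{padded.GetLength(0)}x{padded.GetLength(1)}"); // 3x4
```

The padding value becomes indistinguishable from real data once it is in the matrix, so it is worth choosing one that downstream algorithms can ignore or mask.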

List-Level Operations

ListVector<T> provides several methods for manipulating entire lists:

C#
// Various list-level operations
var first3 = tokens.HeadLists(3);
var sorted = tokens.SortLists();
var reversed = tokens.ReverseLists();

Variable-Length Data in Data Frames

Variable-length lists often appear as columns in data frames: each row represents an observation, and one or more columns contain lists whose lengths may vary from row to row. This is a first-class pattern in Numerics.NET, not an edge case.

ListVector<T> is the underlying representation for such columns:

C#
// Create a data frame with a list-valued column
var ids = Vector.Create(new[] { 101, 102, 103 });
var measurements = Vector.CopyFromAsLists(new[] {
    new[] { 1.5, 2.3, 1.8 },
    new[] { 4.1, 3.9, 4.5, 4.2 },
    new[] { 2.1, 2.0 }
});

var df = DataFrame.FromColumns(
    ("ID", ids),
    ("Measurements", measurements)
);

Console.WriteLine(df);

You can compute statistics on list-valued columns and add them as new columns:

C#
// Compute statistics on list column
var measCol = df["Measurements"].As<IReadOnlyList<double>>();

// Proceed only if the column is backed by a list vector
if (measCol is ListVector<double> listVector)
{
    var means = listVector.AggregateLists(Aggregators.Mean);
    df["Mean"] = means;

    var counts = listVector.GetListLengths();
    df["Count"] = counts;
}

Console.WriteLine(df);

This pattern is particularly useful when working with grouped data that you want to preserve in a structured format rather than flattening immediately.

When to Use ListVector<T> vs Other Representations

Use ListVector<T> when:

  • List lengths vary meaningfully and should be preserved.

  • The grouping structure is important for analysis or aggregation.

  • You want to perform per-list operations or transformations.

  • The data naturally arises from a grouping operation.

Use dense matrices when:

  • Data is naturally rectangular (all rows/columns have the same length).

  • You have already padded or truncated lists to a fixed length for downstream algorithms (e.g., machine learning or statistical methods that require rectangular input).

In many workflows, you may start with a ListVector<T> to preserve the natural structure of your data, perform per-list analysis, and then convert to a matrix when needed for algorithms that require rectangular data.

See Also