Variable-Length Data
Variable-length data, also known as ragged data, is a common pattern in data analysis where each observation contains a sequence whose length may vary. Examples include tokenized text (where each document has a different number of words), grouped measurements (where each entity has a different number of observations), segmented time series, and feature lists per sample.
Numerics.NET provides first-class support for variable-length data through the ListVector<T> class, which efficiently stores and manipulates sequences of variable-length lists.
What is Variable-Length Data?
Variable-length data occurs when observations naturally contain sequences of different lengths. Common examples include:
Tokenized text: Documents contain different numbers of words or tokens.
Grouped measurements: Each subject or entity has a different number of measurements.
Segmented time series: Time periods contain different numbers of observations.
Feature lists: Each sample has a variable number of features or attributes.
Unlike fixed-length data that can be represented in a rectangular matrix, variable-length data requires a more flexible representation that preserves the natural grouping structure.
Relationship to Groupings
A Grouping<TKey> defines how a flat sequence is partitioned into groups. A ListVector<T> materializes this partitioning as data: each group becomes a list, and the collection of lists becomes a vector.
This means that aggregating over a list vector is conceptually equivalent to flattening it first and then aggregating by its grouping:
// A ListVector materializes a grouping
var flat = tokens.Flatten();
var grouping = tokens.Grouping;
// These two operations are equivalent:
var listCounts = tokens.AggregateLists(Aggregators.Count);
var groupedCounts = flat.AggregateBy(grouping, Aggregators.Count);
A ListVector<T> is thus not merely a container; it is a grouped representation of data that preserves and exposes the grouping structure through its Grouping property.
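The same equivalence can be checked on plain .NET collections, independent of the Numerics.NET API. The sketch below is only an illustration under stated assumptions: a jagged array stands in for the list vector, and `groupIds` (a hypothetical name) stands in for the grouping by recording, for each flattened element, the index of the list it came from.

```csharp
using System;
using System.Linq;

// A jagged array standing in for a list vector of tokens.
string[][] lists =
{
    new[] { "the", "quick", "brown" },
    new[] { "the", "lazy", "dog", "sleeps" },
    new[] { "pack", "my", "box" }
};

// Route 1: aggregate each list directly.
int[] perListCounts = lists.Select(list => list.Length).ToArray();

// Route 2: flatten, remember which list each element came from,
// then aggregate by that group index.
string[] flat = lists.SelectMany(list => list).ToArray();
int[] groupIds = lists
    .SelectMany((list, index) => Enumerable.Repeat(index, list.Length))
    .ToArray();

int[] groupedCounts = flat
    .Zip(groupIds, (token, id) => (token, id))
    .GroupBy(pair => pair.id)
    .OrderBy(group => group.Key)
    .Select(group => group.Count())
    .ToArray();

Console.WriteLine(string.Join(", ", perListCounts.Select(c => c.ToString())));
Console.WriteLine(string.Join(", ", groupedCounts.Select(c => c.ToString())));
Console.WriteLine(perListCounts.SequenceEqual(groupedCounts));
```

Both routes give the same counts for this input. One caveat of the LINQ version: `GroupBy` only sees group indices that actually occur, so an empty list would be missing from the second route. The sample data avoids that case.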
ListVector<T> as the Core Abstraction
The ListVector<T> class represents a vector whose elements are variable-length lists. It has the following characteristics:
Fixed structure: The number of lists and their boundaries are immutable.
Variable list lengths: Each list can contain any number of elements.
Efficient storage: Values are stored in a flattened format similar to compressed sparse row (CSR) storage, with all values in a contiguous array and offsets indicating list boundaries.
Mutable values: Individual elements within lists can be modified when the vector attributes permit it.
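The flattened layout can be pictured with a small stand-alone sketch. The code below does not use Numerics.NET and makes no claim about the library's internals beyond the description above; the names `values`, `offsets`, and `GetList` are hypothetical. It stores three lists in one contiguous array plus an offsets array, where list i occupies the half-open range from offsets[i] to offsets[i + 1].

```csharp
using System;

// Hypothetical illustration of CSR-style storage for three lists:
//   { "the", "quick", "brown" }, { "the", "lazy", "dog", "sleeps" }, { "pack", "my", "box" }
string[] values =
{
    "the", "quick", "brown",
    "the", "lazy", "dog", "sleeps",
    "pack", "my", "box"
};

// One more entry than there are lists; consecutive entries bound each list.
int[] offsets = { 0, 3, 7, 10 };

// Returns a view over list i without copying any elements.
ArraySegment<string> GetList(int i) =>
    new ArraySegment<string>(values, offsets[i], offsets[i + 1] - offsets[i]);

int listCount = offsets.Length - 1;
for (int i = 0; i < listCount; i++)
{
    Console.WriteLine($"List {i}: [{string.Join(", ", GetList(i).ToArray())}]");
}

// Writing through the flat array changes the element seen through the view,
// while the offsets (the structure) stay untouched.
values[offsets[1] + 1] = "sleepy";
Console.WriteLine(GetList(1)[1]); // prints "sleepy"
```

In this picture, "fixed structure" corresponds to never changing `offsets`, and "mutable values" corresponds to writing into `values`.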
Creating a list vector from nested data is straightforward:
// Create a list vector from nested data
var documents = new[] {
    new[] { "the", "quick", "brown" },
    new[] { "the", "lazy", "dog", "sleeps" },
    new[] { "pack", "my", "box" }
};
var tokens = Vector.CopyFromAsLists(documents);
Console.WriteLine($"List vector length: {tokens.Length}");
Console.WriteLine($"Total tokens: {tokens.FlattenedLength}");
Typical Workflows
ListVector<T> supports a variety of common operations on variable-length data:
Aggregating Per List
Use the AggregateLists method to compute statistics for each list:
// Compute statistics per list
var observations = new[] {
    new[] { 1.5, 2.3, 1.8 },
    new[] { 4.1, 3.9, 4.5, 4.2 },
    new[] { 2.1, 2.0 }
};
var data = Vector.CopyFromAsLists(observations);
var means = data.AggregateLists(Aggregators.Mean);
var maxValues = data.AggregateLists(Aggregators.Maximum);
var counts = data.GetListLengths();
Console.WriteLine($"Means: {means}");
Console.WriteLine($"Max values: {maxValues}");
Console.WriteLine($"Counts: {counts}");
Transforming Elements
Use the Map method to transform elements within each list:
// Transform elements within each list
var normalized = data.Map(x => x / 10.0);
// Transform using per-list scalars
var centered = data.Map(means, (value, mean) => value - mean);
Flattening to a Standard Vector
Use the Flatten() method to convert the list vector back to a regular vector containing all elements in order:
// Flatten back to a regular vector
var allValues = data.Flatten();
var overallMean = allValues.Mean();
Converting to a Fixed-Width Matrix
For downstream algorithms that require rectangular data, use the ToRowMatrix or ToColumnMatrix methods to convert the list vector to a matrix, padding or truncating lists as necessary:
// Convert to matrix for downstream analysis
var matrix = data.ToRowMatrix(paddingValue: 0.0);
Console.WriteLine($"Matrix shape: {matrix.RowCount}x{matrix.ColumnCount}");
List-Level Operations
ListVector<T> provides several methods for manipulating entire lists:
// Various list-level operations
var first3 = tokens.HeadLists(3);
var sorted = tokens.SortLists();
var reversed = tokens.ReverseLists();
Variable-Length Data in Data Frames
Variable-length lists often appear as columns in data frames: each row represents an observation, and one or more columns contain lists whose lengths may vary from row to row. This is a first-class pattern in Numerics.NET, not an edge case.
ListVector<T> is the underlying representation for such columns:
// Create a data frame with a list-valued column
var ids = Vector.Create(new[] { 101, 102, 103 });
var measurements = Vector.CopyFromAsLists(new[] {
    new[] { 1.5, 2.3, 1.8 },
    new[] { 4.1, 3.9, 4.5, 4.2 },
    new[] { 2.1, 2.0 }
});
var df = DataFrame.FromColumns(
    ("ID", ids),
    ("Measurements", measurements)
);
Console.WriteLine(df);
You can compute statistics on list-valued columns and add them as new columns:
// Compute statistics on list column
var measCol = df["Measurements"].As<IReadOnlyList<double>>();
if (measCol is ListVector<double> listVector)
{
    var means = listVector.AggregateLists(Aggregators.Mean);
    df["Mean"] = means;
    var counts = listVector.GetListLengths();
    df["Count"] = counts;
}
Console.WriteLine(df);
This pattern is particularly useful when working with grouped data that you want to preserve in a structured format rather than flattening immediately.
When to Use ListVector<T> vs Other Representations
Use ListVector<T> when:
List lengths vary meaningfully and should be preserved.
The grouping structure is important for analysis or aggregation.
You want to perform per-list operations or transformations.
The data naturally arises from a grouping operation.
Use dense matrices when:
Data is naturally rectangular (all rows/columns have the same length).
You have already padded or truncated lists to a fixed length for downstream algorithms (e.g., machine learning or statistical methods that require rectangular input).
In many workflows, you may start with a ListVector<T> to preserve the natural structure of your data, perform per-list analysis, and then convert to a matrix when needed for algorithms that require rectangular data.
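The final conversion step, padding or truncating lists to a fixed width, can be pictured as follows. This is a plain C# sketch of the idea, not the implementation behind ToRowMatrix; the helper `ToPaddedRows` and its `width` parameter are hypothetical names introduced for illustration.

```csharp
using System;

// Hypothetical helper: copy each list into one row of a rectangular array.
// Lists shorter than `width` are padded with `paddingValue`; longer lists are truncated.
double[,] ToPaddedRows(double[][] lists, int width, double paddingValue)
{
    var result = new double[lists.Length, width];
    for (int i = 0; i < lists.Length; i++)
    {
        for (int j = 0; j < width; j++)
        {
            result[i, j] = j < lists[i].Length ? lists[i][j] : paddingValue;
        }
    }
    return result;
}

double[][] observations =
{
    new[] { 1.5, 2.3, 1.8 },
    new[] { 4.1, 3.9, 4.5, 4.2 },
    new[] { 2.1, 2.0 }
};

// Width 3 pads the last list with one zero and drops the fourth value of the second list.
double[,] matrix = ToPaddedRows(observations, width: 3, paddingValue: 0.0);
Console.WriteLine($"Matrix shape: {matrix.GetLength(0)}x{matrix.GetLength(1)}");
```

Choosing the width is a modeling decision: the maximum list length loses no data but may add a lot of padding, while a smaller width discards the tail of longer lists.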