Categorical Vectors

Categorical vectors are vectors whose elements are taken from a limited set of values, or levels. The set of possible values is called the category index. The elements are stored as integer indexes (level indexes) into the category index. This makes it possible to represent missing values even when the element type itself has no representation for a missing value.
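The storage scheme described above can be sketched in plain C#, independently of the library. The LevelEncoding class and its Encode method are hypothetical names used only for illustration and are not part of the library's API. The sketch assumes that a value missing from the category index is encoded as -1.

C#
```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the library: it only illustrates the
// idea of storing values as integer level indexes into a category index.
public static class LevelEncoding
{
    // Returns the position of each value in categoryIndex, or -1 (missing)
    // when the value does not occur in the category index.
    public static int[] Encode<T>(IReadOnlyList<T> values, IReadOnlyList<T> categoryIndex)
        where T : notnull
    {
        // Map each category to its position for constant-time lookup.
        var positions = new Dictionary<T, int>();
        for (int i = 0; i < categoryIndex.Count; i++)
            positions[categoryIndex[i]] = i;

        return values
            .Select(v => positions.TryGetValue(v, out int level) ? level : -1)
            .ToArray();
    }
}
```

With the category index { "a", "b", "d" }, encoding the values "a", "b", "x", "b", "d" gives the level indexes 0, 1, -1, 1, 2. The value "x" is not a category, so it becomes a missing value even though string has no dedicated missing-value representation.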

Categorical vectors are implemented by the CategoricalVector<T> class. The ICategoricalVector interface defines the essential functionality of categorical vectors when the element type is not known.

Categorical vectors also implement the IGrouping interface, which means they can be used directly as grouping objects in aggregation operations.

Constructing Categorical Vectors

The CategoricalVector<T> class does not have any constructors. Instead, use the CreateCategorical<T>(Int32) method of the Vector class. This method has four overloads. It takes a generic type argument, which can usually be inferred from the actual arguments.

The first overload takes one argument: the length of the vector. This creates a categorical vector where all values are missing. The element type must be specified as the generic type argument.

C#
var c1 = Vector.CreateCategorical<int>(5);
var thisIsTrue = c1.IsMissing(2);

The second overload takes a list of values. The category index and level indexes are inferred from these values. An optional second argument specifies the mutability of the new vector.

The third overload takes two or three arguments. The first is once again a list of values. The second is the category index. The optional third argument specifies the mutability. When a value is not found in the supplied category index, the corresponding element of the result is marked as missing.

In the example below, we create two categorical vectors with the same values. Although the values are the same, the level indexes are different because the category indexes are different:

C#
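var c2 = Vector.CreateCategorical(new[] { "a", "b", "a", "b", "d" }); // categories inferred: { "a", "b", "d" }
var index = Index.Create(new[] { "a", "b", "c", "d" });
var c3 = Vector.CreateCategorical(new[] { "a", "b", "a", "b", "d" }, index); // levels: [ 0, 1, 0, 1, 3 ]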

The fourth overload also takes two or three arguments. The first argument is the category index. The second argument is a list of level indexes into that category index. The optional third argument specifies the mutability. We can create the same vector again with the following code:

C#
var c4 = Vector.CreateCategorical(Index.Create(new[] { "a", "b", "c", "d" }), new[] { 0, 1, 0, 1, 3 });

In addition, any vector can be converted to a categorical vector by calling its AsCategorical method. If the vector is already categorical, then the same vector is returned. Optionally, an Index<T> can be passed to this method. The following example constructs two versions of the same vector using this method:
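A minimal sketch of such a conversion is shown here. It assumes that Vector.Create accepts an array of values and that the library namespaces used by the other snippets on this page are already imported. The variable names are illustrative only.

C#
```csharp
// Assumption: Vector.Create builds an ordinary (non-categorical) vector from an array.
var values = Vector.Create(new[] { "a", "b", "a", "b", "d" });

// No argument: the category index is inferred from the values.
var c5 = values.AsCategorical();

// Explicit category index that also contains the unused level "c".
var fullIndex = Index.Create(new[] { "a", "b", "c", "d" });
var c6 = values.AsCategorical(fullIndex);
```

Both vectors hold the same values. They differ only in their category index, and therefore possibly in their level indexes.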

Properties and Methods

The CategoricalVector<T> class supports all standard properties and methods of vectors. It also has some properties and methods of its own.

The CategoryIndex property returns an Index<T> that contains the possible values of the elements of the vector. The LevelIndexes property returns a vector containing the position of each element in the category index. A missing value has a level index of -1. The GetLevelIndex(Int32) method returns the level index of the element at the specified position.

The GetIndexes method returns a sequence of indexes of the elements that have a specific value. You can supply the actual value to look up, or the level index. The code below illustrates all these properties and methods:

C#
var categories = c2.CategoryIndex; // { "a", "b", "d" }
var levels = c2.LevelIndexes; // [ 0, 1, 0, 1, 2 ]

var at3 = c2.GetLevelIndex(3); // 1
var indexesB = c2.GetIndexes("b").ToArray(); // [ 1, 3 ]
var indexesAt1 = c2.GetIndexes(1).ToArray(); // [ 1, 3 ]

A categorical vector is essentially a mapping from integer indexes to values contained in the category index. The WithCategories<U> method creates a new categorical vector that maps the level indexes to a different set of values. The only argument of this method is the new category index. The element type of the new index need not be the same. In the following example, we change the index of the vector we created earlier from lower-case strings to upper-case characters:

C#
var newIndex = Index.Create(new[] { 'A', 'B', 'D' });
var C2 = c2.WithCategories(newIndex); // [ 'A', 'B', 'A', 'B', 'D' ]
var counts = c2.GetCounts();