Basic Operations on Data Frames

Working with indexes

Indexes are at the heart of what makes data frames convenient and useful. See the section on indexes for more in-depth information. Indexes are used to access rows and columns of a data frame. Hierarchical indexes can be used to group rows, and subsequently perform calculations on each group.

The row and column indexes of a data frame can be accessed through the RowIndex and ColumnIndex properties. These properties are read-only. You can create a data frame with a different row or column index using the WithRowIndex and WithColumnIndex<C1> methods.

Columns can also be renamed. RenameColumn renames a single column. RenameColumns has two overloads, each taking two arguments. The first overload takes a sequence containing the keys to be replaced and a sequence of the corresponding new keys. The second overload takes a predicate that determines whether a key should be replaced and a function that turns an old key into a new key.

The WithRowIndex method has a number of overloads that let you select one or more columns to use as the index. The overloads take one to three arguments: the key(s) of the column(s) that make up the index. If more than one column is selected, a hierarchical index is created. The element types of the columns must be passed as generic type arguments:

var df2a = df2.WithRowIndex<string,int>("state", "year");

The total numbers of rows and columns are available through the RowCount and ColumnCount properties.
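As a minimal sketch of these properties, the snippet below builds a small frame with the FromColumns pattern used later in this section and prints its dimensions and indexes. The Extreme.DataAnalysis namespace in the using directive is an assumption about how the library is referenced; adjust it to match your installation.

```csharp
using System;
using System.Collections.Generic;
using Extreme.DataAnalysis; // assumed namespace for DataFrame

// Two columns of five values each.
var df = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "first", new double[] { 11, 14, 17, 93, 55 } },
        { "second", new double[] { 22, 33, 43, 51, 69 } } });

// Dimensions of the frame.
Console.WriteLine("rows = {0}, columns = {1}", df.RowCount, df.ColumnCount);

// The read-only index objects can be printed like any other value.
Console.WriteLine("row index =\n{0}", df.RowIndex);
Console.WriteLine("column index =\n{0}", df.ColumnIndex);
```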

Accessing columns

The values in a data frame are stored in vectors that make up the columns of the data frame. Columns can be accessed by key or by ordinal position using the data frame's indexer. Because the data type of each column is not encoded in the .NET type of the data frame, the object returned by the indexer is of type IVector. You can turn this into a vector of a specific type by calling its As<U> method, which converts the untyped vector to a Vector<U>. The element type is determined by the generic type argument.

Columns can also be retrieved using the GetColumn method, which has a generic type argument and returns a typed vector. This method also has a non-generic overload which returns a vector of Double. So, because the debt column has type Double, the following three expressions are equivalent:

var totalDebt1 = df2["debt"].As<double>().Sum();
var totalDebt2 = df2.GetColumn<double>("debt").Sum();
var totalDebt3 = df2.GetColumn("debt").Sum();

The column vectors are immutable: their values cannot be changed in place. They can still be used in calculations, and you can make writable copies.

To retrieve multiple columns into a new data frame, use the GetColumns method. This method takes as its only argument a sequence of column keys. It returns a new data frame that contains only the selected columns.
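The following sketch illustrates GetColumns on a small, made-up frame. The column names and values are purely illustrative, and the Extreme.DataAnalysis namespace is an assumption about how the library is referenced.

```csharp
using System;
using System.Collections.Generic;
using Extreme.DataAnalysis; // assumed namespace for DataFrame

// A frame with three columns (illustrative data).
var df = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "first", new double[] { 11, 14, 17, 93, 55 } },
        { "second", new double[] { 22, 33, 43, 51, 69 } },
        { "third", new double[] { 1, 2, 3, 4, 5 } } });

// Pass the keys of the columns to keep as a single sequence.
var subset = df.GetColumns(new string[] { "first", "third" });

// The result is a new data frame; df itself still has all three columns.
Console.WriteLine("subset =\n{0}", subset);
Console.WriteLine("df has {0} columns, subset has {1}",
    df.ColumnCount, subset.ColumnCount);
```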

Adding and removing columns

While the columns of a data frame are immutable, the collection of columns itself is not. It is possible to add columns to an existing data frame or remove columns from it. The AddColumn method takes two arguments. The first is the key of the new column. An exception is thrown if a column with the same key already exists. The second argument is a vector or collection containing the values. Exactly how the values are added to the data frame depends on the index of the column being added.

  • If the column has an index of the same type as the data frame, then the column is aligned with the data frame's index and the values are ordered accordingly.

  • If the column does not have an index or if it is of a different type, then the values are not reordered.

Columns can be removed by key using the RemoveColumn method, or by ordinal position using the RemoveColumnAt method. In the example below, we create a vector of booleans that indicates whether the state is an Eastern state (here, whether it equals "Ohio"). We add this column to the data frame, and then remove it:

var eastern = Vector.EqualTo(df2["state"].As<string>(), "Ohio");
Console.WriteLine("eastern =\n{0}", eastern);
df2.AddColumn("eastern", eastern);
Console.WriteLine("df2 =\n{0}", df2);
df2.RemoveColumn("eastern");
Console.WriteLine("df2 =\n{0}", df2);

Accessing rows

Because a data frame is a column-oriented structure, accessing values by row is much more expensive than accessing by column, and should be avoided whenever possible.

The DataFrame<R, C> class has a Rows property that returns a sequence of DataFrameRow<R, C> objects, each of which represents one row in the data frame. DataFrameRow<R, C> has a single indexer property that accepts either the key of a column or its ordinal position.
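A rough sketch of row-wise access is shown below. It assumes that the row indexer accepts a string key or an integer position when the column keys are strings, as described above, and that the library lives in the Extreme.DataAnalysis namespace. The data is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using Extreme.DataAnalysis; // assumed namespace for DataFrame

var df = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "first", new double[] { 11, 14, 17, 93, 55 } },
        { "second", new double[] { 22, 33, 43, 51, 69 } } });

// Enumerate the rows. This is convenient for inspection, but slow
// compared to column access, so avoid it on large frames.
foreach (var row in df.Rows)
{
    // Index the row by column key, then by ordinal column position.
    Console.WriteLine("first = {0}, second = {1}", row["first"], row[1]);
}
```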

A single row can be retrieved through the GetRow method. This method takes the row key as its argument and returns a DataFrameRow<R, C> object. There is also a GetRowAs<T> method which takes a generic type argument and converts the row into a vector of the specified type.

Multiple rows can be retrieved using the GetRows method. This method is overloaded and can take a sequence of keys, a sequence of ordinal indexes, or a boolean vector as its argument. Each overload returns a new data frame that contains only the selected rows.
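The key-based overload of GetRows can be sketched as follows, using a frame with a daily DateTime index built the same way as in the example further down. The namespace and the using alias are assumptions: the alias is only there to avoid a possible clash between the library's Index class and System.Index on newer .NET versions.

```csharp
using System;
using System.Collections.Generic;
using Extreme.DataAnalysis;                 // assumed namespace
using Index = Extreme.DataAnalysis.Index;   // assumed; avoids clash with System.Index

// Five rows keyed by consecutive dates starting on April 1, 2015.
var df = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "first", new double[] { 11, 14, 17, 93, 55 } },
        { "second", new double[] { 22, 33, 43, 51, 69 } } })
        .WithRowIndex(Index.CreateDateRange(new DateTime(2015, 4, 1), 5));

// Select two rows by their exact keys.
var keys = new DateTime[] { new DateTime(2015, 4, 2), new DateTime(2015, 4, 4) };
var selected = df.GetRows(keys);

// The result is a new data frame containing only the requested rows.
Console.WriteLine("selected =\n{0}", selected);
```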

Sometimes the key value is not exact. For example, you may want to get the row in a data frame nearest to a certain date. The GetRowXxx methods have companion methods that perform this task. They are called GetNearestRow, GetNearestRowAs<T>, and GetNearestRows and retrieve an individual row, an individual row as a vector, and a data frame containing multiple rows, respectively. All these methods take a second argument of type Direction that specifies whether the nearest key should be equal to or less than (Backward) or equal to or greater than (Forward) the specified key(s).

The code below first creates a new data frame containing only the rows where the year is greater than 2001. It then creates another data frame with a DateTime index, and finds a row in two ways: first using an exact lookup, and then using a nearest match.

var df3 = df2.GetRows(Vector.GreaterThan(df2["year"].As<int>(), 2001));
Console.WriteLine("df2(year > 2001) =\n{0}", df3);

var df4 = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "first", new double[] { 11, 14, 17, 93, 55 } },
        { "second", new double[] { 22, 33, 43, 51, 69 } } })
        .WithRowIndex(Index.CreateDateRange(new DateTime(2015, 4, 1), 5));
var instant = new DateTime(2015, 4, 3, 17, 11, 3);
var date = instant.Date;
var row1 = df4.GetRowAs<double>(date);
var row2 = df4.GetNearestRowAs<double>(instant, Direction.Backward);

It is also possible to select rows by specifying the rows that should be removed. The RemoveRows method takes a sequence of row keys and returns a new data frame with these keys removed. The RemoveRowsWithMissingValues method returns a new data frame with all rows that contain a missing value removed. If no column keys are specified, all columns are checked for missing values. If one or more column keys are specified, only the specified columns are checked. If none of the rows contain missing values, the data frame is returned unmodified.
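The sketch below shows both ways of removing rows. It rests on several assumptions: that the library is referenced through the Extreme.DataAnalysis namespace, that double.NaN is treated as a missing value in a column of doubles, and that column keys for RemoveRowsWithMissingValues can be passed as an array. The data is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using Extreme.DataAnalysis;                 // assumed namespace
using Index = Extreme.DataAnalysis.Index;   // assumed; avoids clash with System.Index

// NaN stands in for a missing value (assumption, see above).
var df = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "first", new double[] { 11, double.NaN, 17, 93, 55 } },
        { "second", new double[] { 22, 33, 43, double.NaN, 69 } } })
        .WithRowIndex(Index.CreateDateRange(new DateTime(2015, 4, 1), 5));

// Remove rows by key: drop the first day.
var withoutFirstDay = df.RemoveRows(new DateTime[] { new DateTime(2015, 4, 1) });
Console.WriteLine("without first day =\n{0}", withoutFirstDay);

// Check all columns for missing values.
var complete = df.RemoveRowsWithMissingValues();
Console.WriteLine("complete rows =\n{0}", complete);

// Check only the "first" column for missing values.
var completeFirst = df.RemoveRowsWithMissingValues(new string[] { "first" });
Console.WriteLine("rows with a value in 'first' =\n{0}", completeFirst);
```

In each case the original frame df is left untouched, since these methods return a new data frame.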