Transforming columns

Data frames are column based data structures. Manipulating columns is an essential operation when working with data frames.

While columns in a data frame are immutable, the list of columns itself is not. Adding and removing columns changes the existing data frame. Adding and removing rows always constructs a new data frame.

Adding and removing columns

The AddColumn method adds a column to the end of a data frame. The method takes two arguments. The first is the key of the new column. If a column with the same key already exists in the data frame, an exception is thrown. The second argument is the actual column, either as a list or a vector. In the example below, we construct a data frame with a couple of columns and then add two more columns:

C#
var dates = Index.CreateDateRange(new DateTime(2015, 11, 11), 5, Recurrence.Daily);
var df1 = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "values1", Vector.CreateRandom(5) },
        { "values2", Vector.CreateRandom(5) },
        }, dates);

df1.AddColumn("vzlues3", Vector.CreateRandom(5));
df1.AddColumn("values4", Vector.CreateRandom(5));

The exact behavior depends on whether the column vector has itself an index. If it does, and the element type is the same as the data frame's row index, then the values in the vector are inserted in the row with the matching index. This corresponds to a left join. Otherwise, the column is added as is.

The RenameColumn method allows you to change the key of an existing column or set of columns. This method has three overloads, which all take two arguments. In the first overload, the first argument is the original key, and the second argument is the new key. In the second overload, the first argument is a sequence of keys to replace, and the second overload is a sequence of keys to replace them with. In the third overload, the first argument is a predicate that returns whether a key should be replaced. The second argument is a function that transforms the current key into a new key.

In the example below, we first change the key of one column. We then replace all keys that start with a misspelled word, and replace it with the correct word:

C#
df1.RenameColumn("values4", "vzlues5");
df1.RenameColumns(s => s.StartsWith("vzlues"), s => "values" + s.Substring(6));

The RemoveColumn method removes a column from the data frame. The method takes as its only argument the key that should be removed. The RemoveColumnAt method removes the column at the specified position. Finally, theRemoveColumnsWithMissingValues method removes any columns that contain missing values. In the next example, we remove two of the columns, one by key and one by position:

C#
df1.RemoveColumn("values5");
df1.RemoveColumnAt(2);

Transforming columns

New columns are often derived from existing columns. Several methods make this process more streamlined. The methods exist in four variations: one where a new vector is returned without changing the data frame, one where the column is replaced, one where the column is inserted after the column it is derived from, and one where the new column is appended as the last column in the data frame. Each variation can be applied to individual columns or several columns at once.

Mapping single columns

The MapAndReplace method replaces a column with the result of applying a function to it. This method has three arguments. The first is the key of the column that is to be replaced. The second is a delegate that represents the mapping. The third argument is the key for the replacement column. The method replaces the column in the current data frame with the transformed version. The element type of the source column must be supplied as a generic type argument if it cannot be inferred from the transformation function.

The MapAndAppend method transforms the column, but leaves the original column and appends the result as the last column in the data frame. This method takes the same three arguments as the previous method in the same order.

The MapAndInsertAfter method again transforms the column, but instead of replacing the original, it inserts the result after the original column. This method takes the same three arguments as the two previous methods in the same order.

In the example below, we use each of the three methods on columns in the data frame defined earlier:

C#
df1.MapAndAppend<double>("values1", x => Vector.Cos(x), "cosValues1");
df1.MapAndInsertAfter<double>("values1", x => Vector.Sin(x), "sinValues1");
df1.MapAndReplace<double>("values2", x => Vector.Exp(x), "expValues2");

Mapping multiple columns

The same three methods discussed in the previous section have two overloads that operate on several columns at once. Both overloads take three arguments. The first is a list of the keys of the columns that should be transformed. The second argument is a delegate that represents the transformation. The third argument can be either a list of new column keys, or a delegate that maps old column keys to new column keys. In the example below, we illustrate both methods of specifying the keys of the new columns:

C#
var columns = new[] { "values1", "values2" };
var negColumns = new[] { "-values1", "-values2" };
df1.MapAndAppend<double>(columns, x => -x, negColumns);
df1.MapAndInsertAfter<double>(columns, x => 2.0 * x, s => "2*" + s);

In addition, the Map method applies a function to multiple columns and returns the result as a new data frame. It takes one or three arguments. In the three argument overload, the arguments have the same meaning as before. If only one argument is supplied, it is the transformation function. The element type of the source columns must be provided as a generic type argument. It applies the function to every column of the specified type and returns the result in a new data frame.

A note on performance

Since data frames are column oriented data structures, operations that operate on columns directly are much more efficient than operations that operate on rows. The Rows property is made available for the sake of convenience, but should be avoided in most instances. In particular, it is very inefficient to iterate through the rows to perform a calculation on some variables in each row. Nearly all such operations can be performed much more efficiently by operating directly on the columns.

For example, let's say we have a DataFrame<R, C> with three numerical columns: value1, value2 and value3. We want to calculate a fourth column that contains the average of each of the three variables for each observation. A naive implementation would be as follows:

C#
var avg = Vector.Create<double>(df1.RowCount);
int i = 0;
foreach (var row in df1.Rows)
{
    avg[i] = (row.Get<double>("value1")
            + row.Get<double>("value2")
            + row.Get<double>("value3")) / 3;
    i++;
}

However, the most compact and efficient expression is simply:

C#
var avg = (df1.GetColumn("value1")
         + df1.GetColumn("value1")
         + df1.GetColumn("value3")) / 3.0;
df1.AddColumn("Average", avg);