Constructing data frames

A data frame is a collection of columns that may have different element types and that has indexed access to rows and columns. The minimal functionality of a data frame is captured by the IDataFrame interface. The main implementation of a data frame is the DataFrame<R, C> class. Vectors and matrices also implement IDataFrame.

Constructing data frames

The DataFrame<R, C> class itself has no constructors. Instead, data frames are created by performing operations on existing data frames, or by calling one of the factory methods of the static DataFrame class. All these methods take the types of the row and column keys as generic type arguments. In most cases, these types can be inferred from the arguments and may be omitted.

The simplest method, CreateEmpty<R, C>(), takes no arguments. It creates an empty data frame. The type of the row keys and the column keys must be specified as generic type arguments. You can add and remove columns using the methods in the next section. The first column that is added determines the row index.
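As a minimal sketch of the call described above, the following creates an empty data frame with DateTime row keys and string column keys. The variable name dfEmpty is illustrative, and the choice of key types is arbitrary.

```csharp
using System;
// Assumes the library's data frame namespaces are imported,
// as in the other snippets on this page.

// Both key types are given explicitly: there are no arguments
// from which the compiler could infer them.
var dfEmpty = DataFrame.CreateEmpty<DateTime, string>();
```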

There are several ways to create a data frame from a set of vectors. The method is called FromColumns and has multiple overloads. There are two basic mechanisms: you can specify a dictionary that maps column keys to the column values, or you can use separate collections of column keys and columns. When using a dictionary, it is the first argument:

C#
var data = new Dictionary<string, object>() {
    { "state", new string[] { "Ohio", "Ohio", "Ohio", "Nevada", "Nevada" } },
    { "year", new int[] { 2000, 2001, 2002, 2001, 2002 } },
    { "pop", new double[] { 1.5, 1.7, 3.6, 2.4, 2.9 } }
};
var df1 = DataFrame.FromColumns(data);

The second argument is optional and specifies the row index. If no row index is provided, the first index of the correct type found in one of the columns is used. If no row index key type is specified, row numbers are used as keys.

C#
var df2 = DataFrame.FromColumns(new Dictionary<string, object>() {
    { "first", new double[] { 11, 14, 17, 93, 55 } },
    { "second", new double[] { 22, 33, 43, 51, 69 } } },
    Index.CreateDateRange(new DateTime(2015, 4, 1), 5));

It is also possible to supply a column index for the new data frame. In this case, only columns that are present in the column index are included in the new data frame. If a key in the column index cannot be found in the dictionary, the corresponding column is still included, but it will consist entirely of missing values, as in the following example where the key 'debt' is not in the dictionary:

C#
var df2a = DataFrame.FromColumns(data,
    Index.Create(new[] { "one", "two", "three", "four", "five" }),
    Index.Create(new[] { "year", "state", "pop", "debt" }));

A data frame may also be created from a sequence of columns and a separate sequence of keys. The relevant overloads take two arguments. The first is a sequence of vectors. This may be a sequence of strongly typed vectors or, if the columns have different element types, a sequence of IVector objects. The second argument is a sequence of column keys:

C#
var df3 = DataFrame.FromColumns(new Vector<double>[] {
    Vector.Create(11.0, 14.0, 17.0, 93.0, 55.0),
    Vector.Create(22.0, 33.0, 43.0, 51.0, 69.0) },
    Index.Create(new[] { "First", "Second" }));

The next option is to supply a set of tuples where the first item is the column key and the second item is a list of values. This overload takes a parameter (params) array of tuples as its only argument:

C#
var df5 = DataFrame.FromColumns(
    ("state", new string[] { "Ohio", "Ohio", "Ohio", "Nevada", "Nevada" }),
    ("year", new int[] { 2000, 2001, 2002, 2001, 2002 }),
    ("pop", new double[] { 1.5, 1.7, 3.6, 2.4, 2.9 }));

Another way to construct a data frame is from a matrix. This can be done in two ways. The simplest is to call the ToDataFrame method on the matrix. The type of the row keys and the column keys must be specified as generic type arguments and must match the type of the existing indexes of the matrix.

C#
var a = Matrix.CreateRandom(100, 5);
a.RowIndex = Index.CreateDateRange(new DateTime(2016, 1, 1), 100);
a.ColumnIndex = Index.Create(new[] { "a", "b", "c", "d", "e" });
var df7 = a.ToDataFrame<DateTime, string>();

You can also supply the row and column indexes as arguments to an overload of this method. In this case, the generic type arguments can be inferred:

C#
var b = Matrix.CreateRandom(100, 5);
var rowIndex = Index.CreateDateRange(new DateTime(2016, 1, 1), 100);
var columnIndex = Index.Create(new[] { "a", "b", "c", "d", "e" });
var df8 = b.ToDataFrame(rowIndex, columnIndex);

Alternatively, you can call the FromMatrix method. This method takes three arguments: the matrix, the row index, and the column index. If the row or column index is null, the corresponding index from the matrix is used. If the matrix does not have an index of the right type, an InvalidOperationException is thrown.
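The FromMatrix call described above might look like the following sketch. It reuses the factory methods from the earlier snippets; the variable names are illustrative, and the argument order (matrix, row index, column index) is taken from the description above.

```csharp
using System;
// Assumes the library's namespaces are imported, as in the other snippets.

var m = Matrix.CreateRandom(100, 5);
var rows = Index.CreateDateRange(new DateTime(2016, 1, 1), 100);
var cols = Index.Create(new[] { "a", "b", "c", "d", "e" });

// Matrix first, then the row index, then the column index.
// The key types (DateTime and string) are inferred from the indexes.
var dfFromMatrix = DataFrame.FromMatrix(m, rows, cols);
```

When null is passed for an index, the compiler has nothing to infer the key type from, so the generic type arguments would presumably have to be written out.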

A data frame can be created from a sequence or list of .NET objects. The FromObjects method takes one generic type argument, the type of the objects, which can usually be inferred. There are two overloads. The first takes one argument: the sequence of objects of the specified type. It returns a data frame with one row for each object in the sequence and one column for each public property. The column keys correspond to the names of the properties. The second overload takes an additional argument: a list of the properties that should be included in the data frame. The order in which the properties are listed is preserved in the data frame.
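A sketch of the two FromObjects overloads follows. The sample records are hypothetical, and passing the property names as a string array is an assumption about the parameter type of the second overload.

```csharp
// Assumes the library's namespaces are imported, as in the other snippets.

// Hypothetical sample data: an array of anonymous objects.
var records = new[] {
    new { State = "Ohio", Year = 2000, Pop = 1.5 },
    new { State = "Ohio", Year = 2001, Pop = 1.7 },
    new { State = "Nevada", Year = 2001, Pop = 2.4 }
};

// First overload: one row per object, one column per public
// property (State, Year, Pop). The object type is inferred.
var dfAll = DataFrame.FromObjects(records);

// Second overload: only the listed properties are included,
// in the order given (assumed here to accept a string array).
var dfSome = DataFrame.FromObjects(records, new[] { "Pop", "State" });
```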

Finally, a data frame can be created from a data source like a DataTable or a text file. This is discussed in the next section.

Importing and Exporting

Most data starts out in an external data source. This section outlines how to load data from external data sources into a data frame and how to save a data frame to an external data source.

Importing and exporting text files

Data frames can be read from a text file using the ReadCsv method. This method takes as its first argument the path to the file to be read or a stream to read from. It constructs a data frame containing the data in the file, with the columns indexed by the headers from the file. Optionally, you can specify the column that should be used as the row index. In this case, you must also provide the data type of that column as a generic type argument.

The WriteCsv<R, C> method lets you export a data frame to CSV format. It is defined as an extension method in the DataFrame class. The code below illustrates all these methods.

C#
var df2a = df2.WithRowIndex<string,int>("state", "year");
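The CSV round trip described in this section can be sketched as follows. The file names and contents are hypothetical, and the sketch assumes that WriteCsv accepts a destination path, which the text above does not state. The overload that selects a row index column is left out because its exact parameters are not given here.

```csharp
using System.IO;
// Assumes the library's namespaces are imported, as in the other snippets.

// Hypothetical sample file, written here so the sketch has input to read.
var inPath = Path.Combine(Path.GetTempPath(), "states.csv");
File.WriteAllText(inPath, "state,year,pop\nOhio,2000,1.5\nNevada,2001,2.4\n");

// The columns are keyed by the header row: state, year, pop.
var dfCsv = DataFrame.ReadCsv(inPath);

// Assumption: WriteCsv takes the path of the file to write.
var outPath = Path.Combine(Path.GetTempPath(), "states-copy.csv");
dfCsv.WriteCsv(outPath);
```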

Importing from data tables

A data frame can be created from a DataTable.

The FromDataTable method has four overloads, which come in two pairs. The first argument in all overloads is the DataTable that is the source of the data. In the first pair, an optional second argument is a sequence of strings that contains the names of the columns to retain in the data frame. These two overloads take no generic type arguments and return a data frame with the column names as column keys and row numbers as row keys.

The second pair of overloads take one generic type argument: the element type of the row keys. The first argument is once again the data table. The second argument is the name of the column that contains the row index. The values in this column must be convertible to the type specified by the generic type argument. An optional third argument once again specifies the names of the columns that should be included in the data frame.
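The four overloads described above might be called as in the following sketch. The table and its column names are hypothetical, and passing the column names as a string array is an assumption about the parameter type.

```csharp
using System.Data;
// Assumes the library's namespaces are imported, as in the other snippets.

// Hypothetical table with an integer key column and two data columns.
var table = new DataTable();
table.Columns.Add("id", typeof(int));
table.Columns.Add("state", typeof(string));
table.Columns.Add("pop", typeof(double));
table.Rows.Add(1, "Ohio", 1.5);
table.Rows.Add(2, "Nevada", 2.4);

// First pair: row numbers serve as row keys.
var dt1 = DataFrame.FromDataTable(table);
var dt2 = DataFrame.FromDataTable(table, new[] { "state", "pop" });

// Second pair: the "id" column supplies the row keys, so its
// values must be convertible to int.
var dt3 = DataFrame.FromDataTable<int>(table, "id");
var dt4 = DataFrame.FromDataTable<int>(table, "id", new[] { "state", "pop" });
```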

Importing from other file formats

Several other common file formats are supported, either directly or in separate assemblies. Formats that are supported or planned include R, Stata, SAS, Excel, and HDF5.