Joining and reshaping data frames

Data often comes from different sources and has to be merged, or it may come in a format that is not suitable for further analysis. In this section, we discuss the methods available to shape data frames into the format you want.

Join types and relational databases

The terminology used in this section is taken directly from the theory of relational algebra that is the basis of relational database systems. A data frame corresponds to a database table, and its row index corresponds to its primary key.

The join type determines what happens when a key in one table is not present in the other table. There are four possibilities, enumerated by the JoinType enumeration type:

Value

Description

Left

In a left join, all rows in the left table are included in the result. If a key is not present in the right table, the corresponding entries are marked missing.

Right

In a right join, the roles of the left and right tables are reversed. All rows in the right table are included in the result. If a key is not present in the left table, the corresponding entries are marked missing.

Inner

In an inner join, only rows with matching keys in both tables are included in the result. If a key in either table is not found in the other, the row is omitted.

Outer

In a (full) outer join, all rows from both tables are included in the result. If a key in either table is not found in the other, the corresponding entries are marked missing.

An outer join is the only join type that guarantees that all the data in the left and right tables is present in the result. Although joins are most commonly used on the rows of two data frames, they can be equally useful when used on the columns.

Appending data frames

When several data frames contain data about different observations with no overlap, the data frames can be combined by creating a new data frame that contains all the rows from all data frames.

In the simplest case, all data frames have the same set of columns. The Append method appends the specified data frame to the current data frame. The Append method of the DataFrame class performs the same operation, and can handle more than 2 data frames at a time.

C#
var df1 = DataFrame.FromColumns(new Dictionary<string, object>() {
    { "A", new string[] { "A0", "A1", "A2", "A3"} },
    { "B", new string[] { "B0", "B1", "B2", "B3"} },
    { "C", new string[] { "C0", "C1", "C2", "C3"} },
    { "D", new string[] { "D0", "D1", "D2", "D3" } } },
    Index.Default(0, 3));
var df2 = DataFrame.FromColumns(new Dictionary<string, object>() {
    { "A", new string[] { "A4", "A5", "A6", "A7"} },
    { "B", new string[] { "B4", "B5", "B6", "B7"} },
    { "C", new string[] { "C4", "C5", "C6", "C7"} },
    { "D", new string[] { "D4", "D5", "D6", "D7" } } },
    Index.Default(4, 7));
var df12 = df1.Append(df2);
var df3 = DataFrame.FromColumns(new Dictionary<string, object>() {
    { "A", new string[] { "A8", "A9", "A10", "A11"} },
    { "B", new string[] { "B8", "B9", "B10", "B11"} },
    { "C", new string[] { "C8", "C9", "C10", "C11"} },
    { "D", new string[] { "D8", "D9", "D10", "D11" } } },
    Index.Default(8, 11));
var df123 = DataFrame.Append(df1, df2, df3);

The question still remains how to handle situations where the columns don't match exactly. One way is to specify the columns to keep explicitly. This can be done by passing an index with the desired column keys to the Append method.

Another option in to apply a join operation to the columns. The Append method takes a JoinType value with the following meaning:

Outer join

Keep all columns. If a column is missing from a data frame, missing values appear in the result. This is the default.

Inner join

Keep only the columns that appear in all data frames.

Left join

Keep only the columns in the first data frame. If a column is missing from a data frame, missing values appear in the result.

Right join

Keep only the columns in the last data frame. If a column is missing from a data frame, missing values appear in the result.

The example below shows all these options.

C#
var df1 = DataFrame.FromColumns(new Dictionary<string, object>() {
    { "A", new string[] { "A0", "A1", "A2", "A3"} },
    { "B", new string[] { "B0", "B1", "B2", "B3"} },
    { "C", new string[] { "C0", "C1", "C2", "C3"} } },
    Index.Default(0, 3));
var df2 = DataFrame.FromColumns(new Dictionary<string, object>() {
    { "A", new string[] { "A4", "A5", "A6", "A7"} },
    { "B", new string[] { "B4", "B5", "B6", "B7"} },
    { "D", new string[] { "D4", "D5", "D6", "D7" } } },
    Index.Default(4, 7));
var df12outer = df1.Append(df2, JoinType.Outer);
var df12Inner = df1.Append(df2, JoinType.Inner);
// Left column join is equivalent to using the left column index:
var df12Left = df1.Append(df2, JoinType.Left);
var df12Left2 = df1.Append(df2, df1.ColumnIndex);
// Again, these are equivalent:
var df12Right = df1.Append(df2, JoinType.Right);
var df12Right2 = df1.Append(df2, df2.ColumnIndex);

All Append methods have an optional argument that specifies whether the resulting row index should be validated. The default is false. Setting this argument to true forces a check that all keys in the resulting index are unique. This may take a considerable amount of time.

Database-like joins

The case where the data frames that are to be combined contain information about the same set of observations is more complicated.

Data frames are joined on keys. Both the row index of a data frame and any of its columns can be used as keys. The difference is that the row index of a data frame is guaranteed to be unique, while columns, or sets of columns, are not. The result of a join operation depends on two factors: the uniqueness of the keys in each table, and the action to take when no matching key is found.

All join operations are provided as static extension methods on the DataFrame class. The first three parameters of these methods are always the same: The first is the left data frame. The second is a JoinType value that specifies the type of operation to perform. The third parameter is the right data frame. These are all extension methods, so they may be called like instance methods on the left data frame.

It is possible that the two data frames have columns with the same key even when the columns are not used as join keys. This can be addressed in two ways. When the column keys are strings, you can supply suffixes that will be appended to the column keys of the left and right data frames as needed to make the keys unique. In the general case, you can supply functions that transform the key into a unique key.

Join on indexes

When the keys in both data frames are unique, as is the case with row indexes, there is guaranteed to be a one-to-one correspondence between the rows in the two data frames. The uniqueness of the keys is preserved in the result. The resulting data frame will have a row index with keys taken from the row indexes of the data frame.

The Join method performs this operation. No additional parameters are needed. In the example below, two data frames with indexes that span overlapping date ranges are joined:

C#
var dates = Index.CreateDateRange(new DateTime(2015, 11, 11), 5, Recurrence.Daily);
var df1 = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "values1", Vector.CreateRandom(5) } }, dates);
var dates2 = Index.CreateDateRange(dates[2], 5, Recurrence.Daily);
var df2 = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "values2", Vector.CreateRandom(5) } }, dates2);
var df3 = DataFrame.Join(df1, JoinType.Outer, df2);

Join on index and columns

When one or more columns in the left data frame are used as join keys, the keys may not be unique. Multiple rows in the left data frame may correspond to one row in the right data frame and we have a one-to-many relationship. The keys from the row index of the left data frame can still serve as the row index of the result if the join is a left join or an inner join.

Two overloads of the Join method perform this operation. You must specify tHe key of the column in the left data frame that corresponds to the row index of the right data frame. If the right frame has a hierarchical (multi-)index, then the keys of all the columns that correspond to levels in the multi-index must be specified.

In the following example, we start with a data frame that lists the last 5 presidents of the United States and their home state. We join this to a data frame containing state abbreviations and their full names to obtain a data frame that also lists the full name of the presidents' home state:

C#
var numbers = Index.Create(new[] { 44, 43, 42, 41, 40 });
var names = Vector.Create("Barack Obama", "George W. Bush", "Bill Clinton",
    "George H.W. Bush", "Ronald Reagan");
var homeStates = Vector.Create("IL", "TX", "AR", "TX", "CA");
var presidents = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "Name", names }, { "Home state", homeStates } }, numbers);

var abbreviations = Index.Create(new[] { "AR", "CA", "GA", "MI", "IL", "TX" });
var stateNames = Vector.Create("Arkansas", "California", "Georgia",
    "Michigan", "Illinois", "Texas");
var states = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "Full name", stateNames} }, abbreviations);

// Get full names of states in the list:
var presidentsWithState = DataFrame.Join(presidents, JoinType.Left, states, key: "Home state");

Join on columns only

When the join keys of both data frames are made up of columns, then the keys in neither data frame are unique. We then have a many-to-many relationship. No row index or column in the result can be guaranteed to be unique. For this reason, when joining on columns only, the row index is dropped and a default (row number) index is used instead.

Once again, there are two overloads of the Join method for this operation. You must specify tHe key of the column in each data frame that serves as the join key. Multiple key columns may be specified. The number of key columns for the left and right data frame must be the same.

Join on nearest

When the row index of a data frame is sorted, it is possible to perform a left join where the left row match is found by looking for the key nearest to the right key. For example, when joining two time series where one series is offset by several hours, a join on nearest will correctly join the data frames on the date. The JoinOnNearest<R, C> method performs this operation:

C#
var dates = Index.CreateDateRange(new DateTime(2015, 11, 11), 5, Recurrence.Daily);
var df1 = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "values1", Vector.CreateRandom(5) } }, dates);
var dates2 = Index.CreateDateRange(dates[0].AddHours(3), 5, Recurrence.Daily);
var df2 = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "values2", Vector.CreateRandom(5) } }, dates2);
var df3 = df1.JoinOnNearest(df2, Direction.Backward);

There are limitations, however. If the left time series has gaps where the right index doesn't, then multiple rows in the right frame may correspond to the same row in the left frame. It may be better to resample the index of the offset data frame. That way, matching keys in both indexes will coincide and a normal join operation can be performed.