Using Formulas to Define Statistical Models

In R, and the S language from which it was derived, model formulas (or model formulae as the official documentation calls them) are the standard way of specifying the variables in a model. The patsy package brought the same convenient syntax to Python. With an update we released earlier today, you can now use formulas to specify models in the Extreme Optimization Numerical Libraries for .NET.

Formula basics

Formulas work for almost all types of models. Here, we will use regression as an example. A regression model is defined by one dependent variable and one or more independent variables. To specify a model with dependent variable y and independent variables x1 and x2, we can use the formula:

y ~ x1 + x2

Output

y ~ x1 + x2

In this formula, the ~ operator separates the dependent variable from the independent variables, and the + operator indicates that x1 and x2 should both be included in the model.

It is useful to think of formulas as operating on sets of variables. So the terms y, x1, and x2 each stand for a set of one variable. The + operator stands for the union of the set of terms on the left and the set of terms on the right. It follows that including the same variable twice results in the same formula:

y ~ x1 + x1 + x2 + x1 + x2

is exactly the same as

y ~ x1 + x2

Other operators exist that specify different combinations of variables: They are all really just set operations. We will get to those in a minute.

In many multivariate models like clustering models, factor analysis or principal component analysis (PCA), there are no dependent variables, only features. In this case, the first part of the formula is omitted.

The Intercept

The special term 1 stands for the intercept (constant) term in a model. Because nearly all linear models include an intercept term, one is included by default. So, again for a regression, the model

y ~ x1 + x2

is equivalent to

y ~ 1 + x1 + x2

To exclude the intercept term, there are two options. The - operator can be used. This operator corresponds to the set difference between terms: it returns the set of terms in its right operand with the terms on the left removed:

y ~ x1 + x2 - 1

Alternatively, the special no-intercept term, 0, can be added as the first term in the model:

y ~ 0 + x1 + x2

Interactions

Models may include not just main effects but also interactions between terms. The interaction between two numerical variables is the element-wise product of the two variables. Interactions are specified by the : operator. For example:

y ~ x1 + x2 + x1:x2

It is very common to include both the main effects and the interactions in a model. It is so common that there is a special operator for this purpose: the product operator *. So, the above model can also be written much shorter as:

y ~ x1*x2

There is another shortcut operator: ** or ^ (both forms are equivalent). The right operand has to be an integer. It specifies how many times the left operand should be repeated in a product. So, the model

y ~ (x1 + x2)**3

is equivalent to

y ~ (x1 + x2)*(x1 + x2)*(x1 + x2)

Categorical variables

Most models require variables to be numerical. To include categorical variables in a model, they have to be encoded as numerical variables. This is done automatically In a formula like

y ~ x + a

where a is a categorical variable, the term a really means “the set of variables needed to fully represent the variable a in the model.”

The standard way to encode categorical variables is called dummy encoding or one hot encoding. For each level of the categorical variable, a corresponding indicator variable is generated that has a 1 where the variable’s value equals that level, and a 0 otherwise. That’s not the full story, however.

The sum of all indicator variables for a categorical variable is a variable with ones everywhere. This is exactly the same as an intercept term. So, if an intercept term is present in a model, then for a categorical variable with 2 levels, you only need 1 indicator variable in the model. For a variable with 3 levels, you only need 2. Adding the last indicator variable does not add any more information. It is redundant.

Handling redundancies, especially for models that include interactions between variables, can get somewhat complicated. Complicated enough for R to get it wrong sometimes. No worries: we’ve done it right.

The Catch All Term

Often, the data set used to fit a model contains just the variables that are used in the model. In this case, it is convenient to use the special . term. This term captures all the variables in the data set that have not yet appeared in the model. So, for example, if the data set contains the variables y, x1, x2, …, x17, but nothing else, then instead of writing out the full formula:

y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x11 + x12 + x13 + x14 + x15 + x16 + x17

you can simply write:

y ~ .

A code example

This example is adapted from the linear regression QuickStart sample. We’ll use C#, but the code is also available in Visual Basic and F#.

The example uses information about the test scores of 200 high school students. We’ll try to predict the science score in terms of some other variables: the scores in reading, math, and social studies, and the gender of the student.

var data = DataFrame.ReadCsv(@"..\..\..\..\Data\hsb2.csv");
var formula = "math + female + socst + read";
var model = new LinearRegressionModel(data, formula);
model.Compute();
Console.WriteLine(model.Summarize());

We start by reading the data from a CSV file into a data frame. We can then go straight to setting up the model using a formula, and computing it. We then print a summary of the results. This gives us the following output:

Output

# observations: 200
R Squared:            0.4892        Log-likelihood:      -674.63
Adj. R Squared:       0.4788        AIC:                 1359.25
Standard error:       7.1482        BIC:                 1375.75
===================================================================
Source of variation   Sum.Sq.   D.o.f. Var.Est.  F-stat.  prob.
Regression            9543.72     4.00  2385.93  46.6948   0.0000
                      9963.78   195.00    51.10
                     19507.50   199.00    98.03
===================================================================
Parameter                Value   St.Err.       t   p(>t)
Constant              12.32529   3.19356    3.86  0.0002
math                   0.38931   0.07412    5.25  0.0000
female                -2.00976   1.02272   -1.97  0.0508
socst                  0.04984   0.06223    0.80  0.4241
read                   0.33530   0.07278    4.61  0.0000

We see that math an reading scores were most significant. In this group, girls scored 2 points lower than boys, on average.

Limitations

This is just our initial implementation of model formulas. Not everything works entirely as we would like it, and there are some limitations.

Expressions are not supported.

This is the biggest limitation. Formulas in R and in patsy can contain expressions of variables. So, for example, you can have

log(y) ~ x1 + x2

In this example, the dependent variable is the natural logarithm of the variable y. When the expression contains operators that could be confused with formula operators, the special I() function is used to indicate that a term is to be evaluated as a mathematical expression:

y ~ x1 + I(x1+x2)

This formula specifies a model with 2 independent variables: x1 and the sum of x1 and x2. Such expressions are not supported in this release.

Special purpose variables are not supported.

Some models have variables that play a special role. For example, Poisson regression models sometimes have an offset variable that is associated with a parameter fixed at 1. In R, the offset parameter is indicated by the expression offset(variableName). This is currently not supported.

The ‘nested in’ operatorS `/` or %in% ARE not supported

These operators are not very common and a little confusing, so we decided to leave them out, at least for now.

The no-intercept term must appear first.

Adding the no-intercept term, 0, anywhere other than at the start of the independent variables or features has undefined behavior.

Conclusion

In only a few lines of code, we’ve loaded a data set, set up a model and printed a summary of the results. We believe that specifying models using formulas is a very useful feature. You can read more in the documentation. The QuickStart samples have also been updated.

Of course, there’s nothing like trying it out yourself. Get the update here!