Defining models using formulas

Formulas are a way to compactly specify the variables that appear in a model and their roles.

Formula syntax

The syntax for formulas is very similar to that of R and its predecessor, S. For example, the formula for a linear regression of a variable y on a set of variables x, a, and b is:

y ~ x + a + b

The left-hand side of this formula, before the ~ sign, specifies the dependent variable(s). The right-hand side specifies the independent variables.

The terms in a formula can be more complicated expressions, like product terms and interactions. It is helpful to think of terms as sets of variables, where a term like x is a set consisting of one variable. The operations in the formula are operations on sets of variables.

In all, the formula language includes 6 operators. They are, in order of lowest to highest precedence:

~: Separates the dependent variables or targets on the left from the independent variables or features on the right. If not present, the formula is considered to contain only independent variables or features.
+: Union operator. Combines the terms on the left and right and computes their union.
-: Difference operator. Computes the set difference between two terms. It returns the set of items that are in the left operand but not in the right operand. This operator has the same precedence as +.
*: Product operator. Like +, it computes the union of the left and right terms, but also adds the interaction between each term on the left and each term on the right. In other words: a*b is equivalent to a + b + a:b.
:: Interaction operator. Computes the interaction between the left and right terms. The result consists of the interactions of each term in the left set with each term in the right set. In numerical terms, the interaction between two variables corresponds to their element-wise product.
^ or **: Computes a polynomial. The right operand must be an integer exponent, n. The result is the product operator * applied to n copies of the left operand. So, (a+b)**3 is equivalent to (a+b)*(a+b)*(a+b).

Parentheses can be used to change the order of operations. All operators are left-associative, so x - a - b is equivalent to (x - a) - b.
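The set view of the operators can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not the library's parser: the helper names (var, union, difference, interaction, cross, power, show) are hypothetical, and the sketch assumes that a variable repeated within one term collapses to a single occurrence, which may not match how the library treats numerical variables.

```python
from itertools import product

# A term is a frozenset of variable names; one side of a formula is a set of terms.
def var(name):
    return {frozenset([name])}

def union(left, right):          # the + operator
    return left | right

def difference(left, right):     # the - operator
    return left - right

def interaction(left, right):    # the : operator, each left term with each right term
    return {l | r for l, r in product(left, right)}

def cross(left, right):          # the * operator: a*b is a + b + a:b
    return left | right | interaction(left, right)

def power(terms, n):             # the ^ or ** operator: * applied to n copies
    result = terms
    for _ in range(n - 1):
        result = cross(result, terms)
    return result

def show(terms):
    """Render terms as strings such as 'a:b', lowest order first."""
    return sorted((":".join(sorted(t)) for t in terms),
                  key=lambda s: (s.count(":"), s))

x, a, b = var("x"), var("a"), var("b")

print(show(cross(a, b)))                                # ['a', 'b', 'a:b']
print(show(difference(cross(a, b), interaction(a, b)))) # ['a', 'b']

# Left-associativity: x - a - b is evaluated as (x - a) - b.
all_three = union(union(x, a), b)
print(show(difference(difference(all_three, a), b)))    # ['x']
```

Because each side of a formula is just a set of terms, adding a term that is already present has no effect, which is why a + a:b + a*b reduces to the same three terms as a*b.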

Variables are specified using their names. If a name contains spaces or other reserved characters, it can be quoted using back quotes (`), for example:

Result ~ `Item 1` + `Item 2 + 3`

In addition, two special terms, 1 and 0, indicate the presence or absence of an intercept term, as discussed in the next section.

Finally, the . term is a special value that represents all the variables in the dataset that have not been used up to that point. This is particularly useful for situations where the dataset contains many variables. If y is the dependent variable, and all other variables should be included in the model, then the formula is simply

y ~ .
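As a rough sketch of what the . term does, the following Python function expands it against a list of column names. The function name expand_dot and the list-based representation are hypothetical; the sketch assumes "used up to that point" means the variables that appear to the left of the . term in the formula, including the dependent variable.

```python
def expand_dot(columns, lhs, rhs):
    """Replace each '.' in rhs with the columns not used earlier in the formula."""
    used = list(lhs)      # variables seen so far, starting with the left-hand side
    expanded = []         # resulting right-hand side, without duplicates
    for name in rhs:
        new = [c for c in columns if c not in used] if name == "." else [name]
        for c in new:
            if c not in expanded:
                expanded.append(c)
            if c not in used:
                used.append(c)
    return expanded

columns = ["y", "x", "a", "b"]
print(expand_dot(columns, ["y"], ["."]))        # ['x', 'a', 'b']
print(expand_dot(columns, ["y"], ["x", "."]))   # ['x', 'a', 'b']
```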

Intercepts

Most linear models include an intercept or constant term. For convenience, formulas for such models include the term by default, even if it is not specified. So

y ~ x + a + b

is really equivalent to

y ~ 1 + x + a + b

There are two ways to exclude an intercept term from a model. The first is to explicitly remove it at the end:

y ~ x + a + b - 1

The second is to include the 'no intercept' term 0 at the start of the formula:

y ~ 0 + x + a + b

Only regression models, including logistic regression models, include the intercept by default. Models that don't generally include an intercept term, like clustering models or PCA, don't include an intercept term by default.
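The intercept rules above can be summarized in a small Python sketch. It is not the library's logic: the function has_intercept and the (sign, term) representation of the right-hand side are hypothetical, and the default argument stands in for the model type (True for regression models, False for models such as PCA).

```python
def has_intercept(rhs_terms, default=True):
    """Decide whether a model has an intercept.

    rhs_terms is a list of (sign, term) pairs, such as ('+', 'x') or ('-', '1').
    default is the model's behavior when the formula says nothing about it.
    """
    intercept = default
    for sign, term in rhs_terms:
        if term == "1":
            intercept = (sign == "+")   # '+ 1' adds the intercept, '- 1' removes it
        elif term == "0" and sign == "+":
            intercept = False           # '0 +' is the 'no intercept' term
    return intercept

# y ~ x + a + b
print(has_intercept([("+", "x"), ("+", "a"), ("+", "b")]))              # True
# y ~ x + a + b - 1
print(has_intercept([("+", "x"), ("+", "a"), ("+", "b"), ("-", "1")]))  # False
# y ~ 0 + x + a + b
print(has_intercept([("+", "0"), ("+", "x"), ("+", "a"), ("+", "b")]))  # False
```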

Categorical variables

Categorical variables are special. First, an interaction between a categorical variable and itself does not add any information to the model. This is in contrast to numerical variables.
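A small numerical illustration of this point, using made-up data: multiplying an indicator column for one level of a categorical variable by itself returns the same column, whereas squaring a numerical variable produces a genuinely new column.

```python
flag = [1, 0, 1, 1, 0]            # indicator for one level of a categorical variable
x = [2.0, 3.0, 0.5, 4.0, 1.0]     # a numerical variable

flag_squared = [v * v for v in flag]   # element-wise product of flag with itself
x_squared = [v * v for v in x]         # element-wise product of x with itself

print(flag_squared == flag)   # True: nothing new is added
print(x_squared == x)         # False: the squared column differs from x
```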

Second, most models require variables to be numerical, so in order to include categorical variables, they must be encoded into one or more indicator variables.

What complicates matters is that full encodings usually result in linear dependencies between the indicator variables. Put another way: adding the full set of indicator variables to a model would add redundant information.

For example, if a boolean variable is encoded using two indicator variables, one that has a 1 for true and zeros elsewhere, and one that has a 1 for false and zeros elsewhere, then the sum of the indicator variables will have 1 everywhere, which makes it exactly the same as the intercept term. If an intercept term is already present, then adding the second indicator variable does not add any new information, because it can be calculated from the intercept and the first indicator variable.
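The redundancy in the boolean example can be checked directly. The data below is made up for illustration:

```python
values = [True, False, False, True, True]

is_true = [1 if v else 0 for v in values]    # indicator for true
is_false = [0 if v else 1 for v in values]   # indicator for false
intercept = [1] * len(values)                # constant column

# The two indicators always sum to the intercept column.
total = [t + f for t, f in zip(is_true, is_false)]
print(total == intercept)        # True

# So the second indicator can be recovered from the intercept and the first.
recovered = [i - t for i, t in zip(intercept, is_true)]
print(recovered == is_false)     # True
```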

For this reason, in linear models not all indicator variables will end up being included in the model. Only indicator variables that add information to the model will be included.

Encodings for Categorical Variables

Categorical variables can be encoded in a variety of ways. Each encoding will produce different values for the model parameters. The interpretation of parameter values is different as well. Each encoding scheme and each encoding within a scheme brings out a different aspect of the role of the variable in the model. For this reason, an encoding of a categorical variable is sometimes referred to as a contrast.

Encodings are specific to the levels of a categorical variable (its CategoryIndex property). Categorical encodings are implemented by the CategoricalEncoding class. This class has no public constructors. Instead, one of the static methods should be used to create the encoding:

Name	Description
Dummy(IIndex, Int32)	Also called one hot encoding. Every level is compared against the reference level. Every level except the reference level is encoded using a binary variable. The first level is the default for the reference level.
Simple(IIndex, Int32)	Each level is compared to the reference level. The grand mean serves as the intercept. The first level is the default for the reference level.
Deviation(IIndex, Int32)	Every level except the reference level is encoded using one of three values: 1 if the value equals the level, -1 if the value equals the reference level, and 0 otherwise.
OrthogonalPolynomial(IIndex)	Only valid for ordinal variables where the levels are ordered. The levels are encoded as orthogonal polynomials which reflect linear, quadratic, cubic... trends in the categorical variable.
ForwardDifference(IIndex)	Only valid for ordinal variables where the levels are ordered. Each level is compared to the next level.
BackwardDifference(IIndex)	Only valid for ordinal variables where the levels are ordered. Each level is compared to the previous level.
Helmert(IIndex)	Only valid for ordinal variables where the levels are ordered. Each level is compared to the mean of subsequent levels.
ReverseHelmert(IIndex)	Only valid for ordinal variables where the levels are ordered. Each level is compared to the mean of previous levels.

Each encoding has two variants: full rank and reduced rank. The reduced-rank encoding is used when the full-rank encoding would lead to redundancies.

The GetContrastMatrix(Boolean) method returns a matrix whose columns contain the encodings of the variable.
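To make the idea of a contrast matrix concrete, the following Python sketch builds reduced-rank matrices for the dummy and deviation encodings as they are described in the table above. It is a hand-written illustration, not the output of GetContrastMatrix: the function names are hypothetical, and the layout (one row per level, one column per non-reference level) is an assumption.

```python
def dummy_contrasts(levels, reference=0):
    """One column per non-reference level: 1 for that level, 0 otherwise."""
    others = [i for i in range(len(levels)) if i != reference]
    return [[1 if row == col else 0 for col in others]
            for row in range(len(levels))]

def deviation_contrasts(levels, reference=0):
    """One column per non-reference level: 1 for that level,
    -1 for the reference level, 0 otherwise."""
    others = [i for i in range(len(levels)) if i != reference]
    return [[1 if row == col else (-1 if row == reference else 0)
             for col in others]
            for row in range(len(levels))]

levels = ["low", "medium", "high"]   # the first level is the reference level

for level, row in zip(levels, dummy_contrasts(levels)):
    print(level, row)
# low [0, 0]
# medium [1, 0]
# high [0, 1]

for level, row in zip(levels, deviation_contrasts(levels)):
    print(level, row)
# low [-1, -1]
# medium [1, 0]
# high [0, 1]
```

In the deviation encoding each column sums to zero over the levels, which is why its coefficients are read as deviations from an overall mean rather than as differences from the reference level.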

To set the encoding for a variable in a model, use the model's Data property to access the model's group of Features. You can then call the group's SetEncoding(String, Func<IIndex, Int32, CategoricalEncoding>, Int32) method to select the encoding. The first argument is the key of the variable. The second is a function that creates the encoding. This can be one of the static methods of the CategoricalEncoding class. The third argument is optional and specifies the reference level. The GetEncoding(String) method returns the current encoding.