Statistical Models

Statistical models serve two main purposes. First, they may provide insight into a phenomenon by quantifying the interactions between variables and their contribution to outcomes. Second, they may be used to predict outcomes for new data.

In either case, the process of creating and working with a statistical model consists of several steps:

1. Gathering the data;
2. Choosing the type of model to fit, and which data to include in the model;
3. Cleaning the data, including handling of missing values;
4. Fitting the model;
5. Validating the model using diagnostic information gathered from the fitting process;
6. Using the model to make predictions.

The data frame library is perfectly suited for the third step. It also provides the means to get to step 4: fitting the model. From there on, the statistical model classes provide all the required functionality.
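As an illustration of the cleaning step, the following sketch shows two common ways of handling missing values: dropping incomplete observations and replacing a missing value with the column mean. It is written in plain Python and does not use the data frame library; the function names drop_incomplete and impute_mean are hypothetical and chosen only for this example.

```python
def drop_incomplete(rows):
    """Keep only the observations that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_mean(rows, column):
    """Replace missing values in one column with the mean of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    mean = sum(observed) / len(observed)
    return [
        [mean if (i == column and v is None) else v for i, v in enumerate(row)]
        for row in rows
    ]


# Three observations of two variables; the second observation is incomplete.
rows = [[1.0, 2.0], [2.0, None], [3.0, 6.0]]

complete = drop_incomplete(rows)   # two observations remain
imputed = impute_mean(rows, 1)     # the missing value becomes (2.0 + 6.0) / 2
```

Which approach is appropriate depends on the data and on the model: dropping observations discards information, while imputation can understate the variability of the imputed variable.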

Defining models

Statistical models describe relationships between variables. Specifying the variables that appear in a model is therefore an essential step. Variables can play different roles in a model.

Features are variables that represent known properties of the phenomenon under investigation. Depending on the context, they may also be called independent, explanatory or exogenous variables, regressors, or inputs. They may be numerical (continuous) or categorical. Many models expect only numerical input, so categorical variables must be encoded into numerical variables.

Targets are variables that describe the outcome. The objective of a model is to describe how the features affect the target. Targets are also called dependent, explained or endogenous variables, outputs or (in classification) labels. There may be 0, 1, or many targets.

Weights are numerical variables that indicate the relative importance of an observation to the outcome. Not every model supports weights. Where they are supported, they are always optional.

Some models have additional variables that serve a purpose specific to that type of model.

There is a great deal of flexibility in how the input to a model is specified. Usually, the data is supplied in the constructor as an IDataFrame (a DataFrame<R, C>, Matrix<T>, or Vector<T>).

When the data is supplied as a data frame, the column keys of the model variables must be specified. This can be done either by supplying them directly or as an R-style formula.
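To make the idea of a formula concrete, the sketch below splits the simplest kind of R-style formula into target and feature names. It is a plain Python illustration, not the library's parser: the function parse_formula and the column names are hypothetical, and only the `~` and `+` operators are handled. Real formula syntax supports many more operators, such as interactions and transformations.

```python
def parse_formula(formula):
    """Split 'target ~ feature1 + feature2' into lists of column keys.

    Only '~' and '+' are understood. The left-hand side may be empty,
    which corresponds to a model without targets.
    """
    left, separator, right = formula.partition("~")
    if not separator:
        raise ValueError("a formula must contain '~'")
    targets = [name.strip() for name in left.split("+") if name.strip()]
    features = [name.strip() for name in right.split("+") if name.strip()]
    return targets, features


targets, features = parse_formula("mpg ~ weight + horsepower")
no_targets, only_features = parse_formula("~ weight + horsepower")
```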

The supplied variables are then used to prepare the input for the model fitting algorithm. The model may derive additional variables from the input. For example, a constant (intercept) term is added by default to most regression models. Models that require numerical features automatically convert categorical variables to a set of indicator variables using a suitable encoding. In polynomial regression, a single input variable is expanded to a set of powers of the variable.
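The three derivations mentioned above (an intercept column, indicator variables for a categorical variable, and powers of a numerical variable) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function names and the example columns dose and group are hypothetical, and the encoding shown (one indicator per level, with the first level dropped) is only one of several possible encodings.

```python
def one_hot(values, drop_first=True):
    """Encode a categorical column as indicator columns, one per level."""
    levels = sorted(set(values))
    if drop_first:
        # Drop one level so the indicators are not collinear with the intercept.
        levels = levels[1:]
    rows = [[1.0 if v == level else 0.0 for level in levels] for v in values]
    return levels, rows


def polynomial(values, degree):
    """Expand a numerical column into its powers 1..degree."""
    return [[x ** p for p in range(1, degree + 1)] for x in values]


def add_intercept(rows):
    """Prepend a constant column of ones to every row."""
    return [[1.0] + row for row in rows]


# Variables as supplied by the user.
dose = [1.0, 2.0, 3.0]
group = ["a", "b", "a"]

levels, indicators = one_hot(group)
powers = polynomial(dose, 2)

# Variables as seen by the fitting algorithm:
# intercept, dose, dose squared, indicator for group "b".
design = add_intercept([p + ind for p, ind in zip(powers, indicators)])
```

Two supplied variables have become four columns, which is why the variables passed to a model and the variables used by the fitting algorithm need to be distinguished.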

Fitting models

The classes that fit statistical models expect the input to be in a specific format. The input consists of one or more groups of variables.

Once a model has been fitted, it may turn out that some input variables are not used in the final model. This can happen, for example, when some variables are constant, or when there is a linear dependency between several variables.

So, in summary, a model has three sets of variables:

The variables that are supplied as input to the model. We call these the original variables.
The variables that are derived from the input variables into a form suitable for use by the fitting algorithm. We call these the input variables.
The variables that are present in the fitted model. We call these the model variables.
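The step from input variables to model variables can be illustrated with a small sketch that discards columns that are linear combinations of columns already kept. A constant column is a multiple of the intercept, so it is discarded by the same test. This is plain Python using Gram-Schmidt orthogonalization. It is an assumption made for illustration and not a description of how the library detects redundant variables. All names are hypothetical.

```python
def independent_columns(columns, tol=1e-10):
    """Return the indices of columns that are linearly independent
    of the columns kept before them (modified Gram-Schmidt)."""
    basis = []  # orthogonalized versions of the kept columns
    kept = []
    for j, col in enumerate(columns):
        residual = list(col)
        for b in basis:
            coef = sum(r * x for r, x in zip(residual, b)) / sum(x * x for x in b)
            residual = [r - coef * x for r, x in zip(residual, b)]
        scale = max(1.0, sum(c * c for c in col))
        if sum(r * r for r in residual) > tol * scale:
            basis.append(residual)
            kept.append(j)
    return kept


input_names = ["intercept", "x1", "x2", "site", "x1_plus_x2"]
input_columns = [
    [1.0, 1.0, 1.0, 1.0],   # intercept
    [1.0, 2.0, 3.0, 4.0],   # x1
    [2.0, 1.0, 4.0, 3.0],   # x2
    [5.0, 5.0, 5.0, 5.0],   # constant: a multiple of the intercept
    [3.0, 3.0, 7.0, 7.0],   # x1 + x2: linearly dependent
]

kept = independent_columns(input_columns)
model_variables = [input_names[j] for j in kept]
```

Here five input variables reduce to three model variables: the constant column and the sum of two other columns carry no additional information.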

Validating the model

Once the model has been fitted, it must be validated. Models have a large number of properties that give information about the quality of the fit, including residuals, R² values, and so on. Often some form of goodness-of-fit test is also available.
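As a concrete example of such diagnostics, the sketch below fits a straight line by ordinary least squares and computes the residuals and R² from their definitions. It is plain Python written for this illustration, with made-up data and hypothetical function names. It does not use the model classes.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def diagnostics(x, y, intercept, slope):
    """Return the residuals and R-squared of a fitted line."""
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    mean_y = sum(y) / len(y)
    ss_res = sum(r * r for r in residuals)          # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)    # total variation
    return residuals, 1.0 - ss_res / ss_tot


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
residuals, r_squared = diagnostics(x, y, intercept, slope)
```

With an intercept in the model, the residuals of a least squares fit sum to zero up to rounding, and an R² close to 1 indicates that the line explains most of the variation in y.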

Validation is specific to the type of model being fitted, and is discussed in detail in later sections.

Deploying the model

While some models are created purely for exploratory purposes, most models are used to make predictions based on new data.

Regression and classification models have an overloaded Predict method that takes a vector or data frame and produces the model's prediction for the supplied data. These methods take a ModelInputFormat argument that specifies which set of variables is being passed. The available options reflect the three sets of variables discussed earlier:

Value	Description
OriginalVariables	The data are the variables as passed to the model.
InputVariables	The data are the variables in the format expected by the model. They are derived from the original variables.
ModelVariables	The data are the variables that are present in the final model.
Automatic	The setting should be inferred from the number of variables, giving preference to original variables.

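Automatic selection is the default. However, because the choice is inferred from the number of variables, it may lead to unexpected results when two sets contain the same number of variables and there is insufficient information to distinguish between them. In that case, specify the format explicitly.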
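The kind of ambiguity involved can be illustrated with a small sketch that infers the format from the number of supplied values, preferring original variables. This is plain Python and only an assumption about what count-based inference might look like. It is not the library's actual logic, and the function infer_input_format is hypothetical. Only the option names are taken from the table above.

```python
def infer_input_format(n_supplied, n_original, n_input, n_model):
    """Guess which set of variables was supplied, based only on the count.

    Returns the chosen option and a flag that is True when more than
    one option matched, i.e. when the choice was ambiguous.
    """
    # Candidates in order of preference: original variables come first.
    candidates = [
        ("OriginalVariables", n_original),
        ("InputVariables", n_input),
        ("ModelVariables", n_model),
    ]
    matches = [name for name, count in candidates if count == n_supplied]
    if not matches:
        raise ValueError("the number of values matches no set of variables")
    return matches[0], len(matches) > 1


# A model with 3 original variables, 4 input variables (an intercept was
# added), and 3 model variables (one input variable was dropped).
choice, ambiguous = infer_input_format(3, n_original=3, n_input=4, n_model=3)
other_choice, other_ambiguous = infer_input_format(4, n_original=3, n_input=4, n_model=3)
```

In the first call, three values could be either original variables or model variables. Count-based inference picks the original variables, which gives wrong predictions if the caller intended model variables.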

Types of models

Models may be predictive in that they model an outcome in terms of known inputs. The inputs are known as independent variables, predictor variables or features. The outputs are known as dependent variables or targets.

Regression models are predictive models that express a continuous variable in terms of one or more predictor variables, which may be continuous or categorical. ANOVA models are a special case where the predictor variables are categorical. Time series models are another special case where the values of the dependent variable are correlated, and so lagged versions of the dependent variable also appear as independent variables.
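The way lagged values of the dependent variable become independent variables can be sketched as follows. This is a plain Python illustration with made-up data. The function lagged_design is hypothetical and is not part of any time series class.

```python
def lagged_design(series, lags):
    """Build features from the previous `lags` values of a series.

    Each row holds [y(t-1), y(t-2), ..., y(t-lags)], and the matching
    target is y(t). The first `lags` observations have no complete
    history and are therefore not used as targets.
    """
    rows, targets = [], []
    for t in range(lags, len(series)):
        rows.append([series[t - k] for k in range(1, lags + 1)])
        targets.append(series[t])
    return rows, targets


series = [10.0, 12.0, 11.0, 13.0, 14.0]
features, targets = lagged_design(series, lags=2)
```

Five observations with two lags yield three usable rows, each pairing a value of the series with the two values that precede it.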

Classification models are predictive models that attempt to assign observations to one of two or more classes.

Clustering models attempt to group observations purely based on some measure of similarity without reference to predefined labels. Clustering models only have features. There are no dependent variables.

Transformation models attempt to bring out the most relevant features. A common application is dimensionality reduction, i.e. reducing the total number of features that are included in a model. Dimensionality reduction may be used as a preprocessing step when building predictive models. Transformation models only have features.