Statistical Models

Statistical models serve two main purposes. First, they may provide insight into a phenomenon through quantitative indicators of the interactions between variables and their contributions to outcomes. Second, they may be used to predict outcomes for new data.

In either case, the process of creating and working with a statistical model consists of several steps:

  1. Gathering the data;

  2. Choosing the type of model to fit, and which data to include in the model;

  3. Cleaning the data, including handling of missing values;

  4. Fitting the model;

  5. Validating the model using diagnostic information gathered from the fitting process;

  6. Using the model to make predictions.

The data frame library is well suited to the third step, cleaning the data. It also provides the means to get to step 4: fitting the model. From there on, the statistical model classes provide all the required functionality.

Defining models

Statistical models describe relationships between variables. Specifying the variables that appear in a model is therefore an essential step. Variables can play different roles in a model.

Features are variables that represent known properties of the phenomenon under investigation. Depending on the context, they may also be called independent, explanatory or exogenous variables, regressors, or inputs. They may be numerical (continuous) or categorical. Many models expect only numerical input, so categorical variables must be encoded into numerical variables.

Targets are variables that describe the outcome. The objective of a model is to describe how the features affect the targets. Targets are also called dependent, explained, or endogenous variables, outputs, or (in classification) labels. A model may have zero, one, or many targets.

Weights are numerical variables that indicate the relative importance of an observation to the outcome. Not every model supports weights. Where they are supported, they are always optional.

Specific models may have variables that serve a specific purpose.

There is a great deal of flexibility in how the input to a model is specified. Usually, the data is supplied in the constructor as an IDataFrame (a DataFrame<R, C>, Matrix<T>, or Vector<T>).

When the data is supplied as a data frame, the column keys of the model variables must be specified. This can be done either by supplying them directly or as an R-style formula.

The supplied variables are then used to prepare the input for the model fitting algorithm. The model may derive additional variables from the input. For example, a constant (intercept) term is added by default to most regression models. Models that require numerical features automatically convert categorical variables to a set of indicator variables using a suitable encoding. In polynomial regression, a single input variable is expanded to a set of powers of the variable.
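The derivation of input variables from original variables can be illustrated with a small, self-contained sketch. It does not use the library: EncodeIndicators and ExpandPowers are hypothetical helpers written for this example, and the choice to drop the first level as the baseline is an assumption about the encoding, not a statement of what the library does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not a library API): expands one categorical column
// into indicator columns, dropping the first level as the baseline.
static Dictionary<string, double[]> EncodeIndicators(string name, string[] values)
{
    var levels = values.Distinct().OrderBy(v => v, StringComparer.Ordinal).ToArray();
    var columns = new Dictionary<string, double[]>();
    foreach (var level in levels.Skip(1))
    {
        columns[$"{name}[{level}]"] = values.Select(v => v == level ? 1.0 : 0.0).ToArray();
    }
    return columns;
}

// Hypothetical helper: expands a numerical column into powers 1..degree.
static Dictionary<string, double[]> ExpandPowers(string name, double[] values, int degree)
{
    var columns = new Dictionary<string, double[]>();
    for (int p = 1; p <= degree; p++)
    {
        string key = p == 1 ? name : $"{name}^{p}";
        columns[key] = values.Select(v => Math.Pow(v, p)).ToArray();
    }
    return columns;
}

// Original variables: one categorical and one numerical feature.
string[] color = { "red", "green", "blue", "green" };
double[] x = { 1.0, 2.0, 3.0, 4.0 };

// Derived input variables: intercept, indicators, and powers of x.
var input = new Dictionary<string, double[]>
{
    ["(Intercept)"] = Enumerable.Repeat(1.0, x.Length).ToArray()
};
foreach (var kv in EncodeIndicators("color", color)) input[kv.Key] = kv.Value;
foreach (var kv in ExpandPowers("x", x, 2)) input[kv.Key] = kv.Value;

foreach (var kv in input)
    Console.WriteLine($"{kv.Key}: {string.Join(", ", kv.Value)}");
```

Two original variables become five input variables here: the intercept, two indicators for the three color levels, and the first and second powers of x.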

Fitting models

The classes that fit statistical models expect the input to be in a specific format. The input consists of one or more groups of variables.

Once a model has been fitted, it may turn out that some input variables are not used in the final model. This can happen, for example, when some variables are constant, or when there is a linear dependency between several variables.

In summary, a model has three sets of variables:

  • The variables that are supplied as input to the model. We call these the original variables.

  • The variables that are derived from the input variables into a form suitable for use by the fitting algorithm. We call these the input variables.

  • The variables that are present in the fitted model. We call these the model variables.
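As a rough illustration of how input variables can differ from model variables, the sketch below drops columns that are constant, since a constant column carries no information beyond the intercept. KeepVaryingColumns is a hypothetical helper, not a library API, and it covers only the constant case. Detecting linear dependencies between several variables takes a rank-revealing decomposition, and how the library does this is not described here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not a library API): keeps only the columns whose
// values are not all identical.
static List<string> KeepVaryingColumns(IEnumerable<(string Name, double[] Values)> columns)
{
    return columns
        .Where(c => c.Values.Distinct().Count() > 1)
        .Select(c => c.Name)
        .ToList();
}

// Input variables after encoding (illustrative data).
var inputVariables = new (string Name, double[] Values)[]
{
    ("age",    new[] { 31.0, 45.0, 52.0, 28.0 }),
    ("region", new[] { 1.0, 1.0, 1.0, 1.0 }),    // constant: duplicates the intercept
    ("income", new[] { 40.0, 62.0, 75.0, 38.0 }),
};

// Model variables: the subset that would remain in the fitted model.
List<string> modelVariables = KeepVaryingColumns(inputVariables);
Console.WriteLine(string.Join(", ", modelVariables)); // age, income
```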

Validating the model

Once the model has been fitted, it must be validated. Models have a large number of properties that give information about the quality of the fit, including residuals, R² values, and so on. Often some form of goodness-of-fit test is also available.
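To make two of these quantities concrete, the following sketch computes residuals and the coefficient of determination, R² = 1 − SS_res / SS_tot, from observed and fitted values. The helper names are hypothetical and the data are made up. The library exposes such values as model properties, so this only shows what they represent.

```csharp
using System;
using System.Linq;

// Residual: observed value minus fitted value, per observation.
static double[] Residuals(double[] observed, double[] fitted) =>
    observed.Zip(fitted, (o, f) => o - f).ToArray();

// R squared: the share of the total variation explained by the fit.
static double RSquared(double[] observed, double[] fitted)
{
    double mean = observed.Average();
    double ssResidual = Residuals(observed, fitted).Sum(r => r * r);
    double ssTotal = observed.Sum(v => (v - mean) * (v - mean));
    return 1.0 - ssResidual / ssTotal;
}

// Illustrative data only.
double[] yObserved = { 1.0, 2.0, 3.0, 4.0 };
double[] yFitted = { 1.1, 1.9, 3.2, 3.8 };

Console.WriteLine(string.Join(", ", Residuals(yObserved, yFitted).Select(r => r.ToString("F2"))));
Console.WriteLine(RSquared(yObserved, yFitted).ToString("F3"));
```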

Validation is specific to the type of model being fitted, and is discussed in detail in later sections.

Model lifecycle

Every statistical model progresses through three distinct states during its lifecycle. Understanding these states is essential for working with models effectively.

Unfitted

When a model is first constructed, it is in the Unfitted state. The model contains only its options and specifications, such as the formula, input data, and configuration settings. No fitting has been performed, so there are no diagnostics or predictive outputs available.

In this state, you can:

  • Specify model options and parameters.

  • Call the Fit method to train the model.

Attempting to predict, access diagnostics, or perform transformations will throw an InvalidOperationException.

Fitted

After calling Fit, the model enters the Fitted state. This is the full-featured state where all model capabilities are available.

In this state, you can:

  • Access full diagnostics, including residuals, fitted values, and goodness-of-fit statistics.

  • Perform predictions, forecasting, or transformations.

  • Examine model parameters and their statistical properties.

  • Deploy the model for production use.

  • Save the model to JSON for persistence.

Deployed

The Deployed state represents a minimal version of the model suitable for use in applications and services. It contains only the data necessary for prediction, forecasting, or transformation.

In this state, you can:

  • Perform predictions, forecasting, or transformations.

  • Access deployed contract properties such as coefficients and class labels.

The following are not available in this state:

  • Training data and residuals.

  • Diagnostic properties and fitted values.

  • Refitting the model.

This separation provides several benefits:

  • Reduced memory footprint: Only essential data is retained.

  • Privacy and security: Training data is not retained.

  • Clear distinction: Exploratory analysis is separated from production deployment.

  • Stable persistence: JSON-based persistence is versioned and reliable.

For more information about persistence, see Model Persistence.
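The three states can be pictured as a small state machine. The sketch below is a toy stand-in, not the library's implementation: ToyMeanModel and ModelState are hypothetical types whose only "prediction" is the mean of the training data. Allowing Fit to be called again on a fitted model is an assumption made for the example.

```csharp
using System;
using System.Linq;

// Walk the toy model through its three states.
var model = new ToyMeanModel(new[] { 2.0, 4.0, 6.0 });
Console.WriteLine(model.State);            // Unfitted

model.Fit();
Console.WriteLine(model.Residuals.Length); // 3

ToyMeanModel deployed = model.Deploy();    // model itself stays Fitted
Console.WriteLine(deployed.Predict());     // 4

try
{
    _ = deployed.Residuals;                // diagnostics are gone after deployment
}
catch (InvalidOperationException e)
{
    Console.WriteLine(e.Message);
}

enum ModelState { Unfitted, Fitted, Deployed }

// Hypothetical toy model (not a library type).
sealed class ToyMeanModel
{
    private double[] _data;
    private double[] _residuals = Array.Empty<double>();
    private double _mean;

    public ModelState State { get; private set; } = ModelState.Unfitted;

    public ToyMeanModel(double[] data) => _data = data;

    // Used by Deploy(): carries over only what prediction needs.
    private ToyMeanModel(double mean)
    {
        _data = Array.Empty<double>();
        _mean = mean;
        State = ModelState.Deployed;
    }

    public void Fit()
    {
        // Assumption: refitting is ruled out only for deployed models.
        Require(ModelState.Unfitted, ModelState.Fitted);
        _mean = _data.Average();
        _residuals = _data.Select(v => v - _mean).ToArray();
        State = ModelState.Fitted;
    }

    public double[] Residuals
    {
        get { Require(ModelState.Fitted); return _residuals; }
    }

    public double Predict()
    {
        Require(ModelState.Fitted, ModelState.Deployed);
        return _mean;
    }

    public ToyMeanModel Deploy()
    {
        Require(ModelState.Fitted);
        return new ToyMeanModel(_mean);
    }

    public void DeployInPlace()
    {
        Require(ModelState.Fitted);
        _data = Array.Empty<double>();      // release training data
        _residuals = Array.Empty<double>(); // release diagnostics
        State = ModelState.Deployed;
    }

    private void Require(params ModelState[] allowed)
    {
        if (!allowed.Contains(State))
            throw new InvalidOperationException($"Not available in the {State} state.");
    }
}
```

Guarding every member with a state check keeps the failure mode uniform: an operation that is unavailable in the current state fails immediately with an InvalidOperationException rather than returning stale or empty data.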

Deploying the model

While some models are created purely for exploratory purposes, most models are used to make predictions based on new data. The deployment APIs allow you to prepare a model for production use and make predictions efficiently.

Deployment methods

Two methods are available for converting a fitted model to a deployed model:

Deploy()

Creates a new deployed instance of the model. The original fitted model remains unchanged and retains its full state. Use this method when you need to keep the original model for further analysis while also having a lightweight version for deployment.

DeployInPlace()

Converts the current instance into a deployed model by releasing any training data and diagnostic information. Use this method when you no longer need the full fitted model and want to free memory.

After deployment, members that are not available in the Deployed state throw an InvalidOperationException.

C#
// Create a deployed copy (original stays fitted)
var deployedModel = model.Deploy();

// Or convert in place (releases memory)
model.DeployInPlace();

Making predictions

Prediction works the same way whether the model is in the Fitted or Deployed state. Regression and classification models have an overloaded Predict method that takes a vector or data frame and produces the model's prediction for the supplied data. These methods take a ModelInputFormat argument that specifies which set of variables is being passed. The available options reflect the three sets of variables discussed earlier:

Value              Description
OriginalVariables  The data are the variables as passed to the model.
InputVariables     The data are the variables in the format expected by the model. They are derived from the original variables.
ModelVariables     The data are the variables that are present in the final model.
Automatic          The setting should be inferred from the number of variables, giving preference to original variables.

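Automatic selection is the default. However, because the choice is inferred from the number of variables, it may lead to unexpected results when that number matches more than one set and there is insufficient information to distinguish between them. In such cases, specify the format explicitly.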
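The ambiguity is easy to see in a sketch of count-based inference. InferFormat below is a hypothetical function written for this example. The library's actual rule is not documented here beyond "inferred from the number of variables, giving preference to original variables", so the order of the checks and the variable counts are assumptions.

```csharp
using System;

// Hypothetical inference rule (not a library API): compare the number of
// supplied variables with the size of each set, original variables first.
static string InferFormat(int supplied, int originalCount, int inputCount, int modelCount)
{
    if (supplied == originalCount) return "OriginalVariables";
    if (supplied == inputCount) return "InputVariables";
    if (supplied == modelCount) return "ModelVariables";
    throw new ArgumentException("The number of supplied variables matches no known set.");
}

// Unambiguous: only the input variables have five columns.
Console.WriteLine(InferFormat(5, originalCount: 2, inputCount: 5, modelCount: 4)); // InputVariables

// Ambiguous: both the original and the model variables have two columns.
// The original variables win, even if the caller meant model variables.
Console.WriteLine(InferFormat(2, originalCount: 2, inputCount: 3, modelCount: 2)); // OriginalVariables
```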

For more detailed examples of making predictions with deployed models, see Predicting With Deployed Models.

Diagnostics availability

Diagnostics such as residuals, coefficient tests, and goodness-of-fit statistics are only available when the model is in the Fitted state. Attempting to access these properties on a deployed model will throw an InvalidOperationException.

If you need both predictions and diagnostics, keep the model in the Fitted state or use the Deploy() method to create a separate deployed instance while retaining the original.

Model persistence

Fitted models can be saved to JSON format and later loaded for use in applications and services. The persistence mechanism saves only the deployed state of the model, which includes all data necessary for prediction but excludes training data and diagnostics.

To save a model, use the ToJson() method:

C#
using System.IO; // for File

// Save model to JSON string
string json = model.ToJson();

// Save to file
File.WriteAllText("model.json", json);

To load a previously saved model, use the static FromJson() method on the model type:

C#
using System.IO; // for File

// Load from file
string json = File.ReadAllText("model.json");
var model = SimpleRegressionModel.FromJson(json);

Models loaded from JSON are always in the Deployed state. The JSON format is versioned and designed to remain stable across library versions.

For detailed information about the persistence format and advanced usage, see Model Persistence.

Types of models

Models may be predictive in that they model an outcome in terms of known inputs. The inputs are known as independent variables, predictor variables or features. The outputs are known as dependent variables or targets.

Regression models are predictive models that express a continuous variable in terms of one or more predictor variables, which may be continuous or categorical. ANOVA models are a special case where the predictor variables are categorical. Time series models are another special case where the values of the dependent variable are correlated, and so lagged versions of the dependent variable also appear as independent variables.

Classification models are predictive models that attempt to assign observations to one of two or more classes.

Clustering models attempt to group observations purely based on some measure of similarity without reference to predefined labels. Clustering models only have features. There are no dependent variables.

Transformation models attempt to bring out the most relevant features. A common application is dimensionality reduction, i.e. reducing the total number of features that are included in a model. Dimensionality reduction may be used as a preprocessing step when building predictive models. Transformation models only have features.
