Multiple Linear Regression

Multiple linear regression is a technique for analyzing the linear relationship between one or more independent variables and a dependent variable. The values of the independent variables are considered to be exact, while the values of the dependent variable are subject to error. Multiple linear regression is implemented by the LinearRegressionModel class.

Constructing Multiple Linear Regression Models

The LinearRegressionModel class has three constructors. The first constructor takes two arguments. The first is a Vector<T> that represents the dependent variable. The second is a parameter array of vectors that represent the independent variables.

C#
var dependent = Vector.Create(yData);
var independent1 = Vector.Create(x1Data);
var independent2 = Vector.Create(x2Data);
var model1 = new LinearRegressionModel(dependent, independent1, independent2);

The second constructor takes three arguments. The first argument is an IDataFrame (a DataFrame<R, C> or Matrix<T>) that contains the variables to be used in the regression. The second argument is a string containing the name of the dependent variable. The third argument is a parameter array of strings containing the names of the independent variables. All of the names must exist in the column index of the data frame specified by the first argument.

C#
var dataFrame = DataFrame.FromColumns(new Dictionary<string, object>()
    { { "y", dependent }, { "x1", independent1 }, { "x2", independent2 } });
var model2 = new LinearRegressionModel(dataFrame, "y", "x1", "x2");

The third constructor takes two or three arguments. The first argument once again contains the data. The second is a string that contains a formula that describes the model. See the section on formulas for details. The same model as above can be defined using a formula as:

C#
var model3 = new LinearRegressionModel(dataFrame, "y ~ x1 + x2");

Computing the Regression

The Compute method performs the actual analysis. Most properties and methods throw an exception when they are accessed before the Compute method is called. You can verify that the model has been calculated by inspecting the Computed property.

C#
model1.Compute();
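// The Computed property indicates whether the model has been calculated:
Console.WriteLine("Computed: {0}", model1.Computed);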

The Predictions property returns a Vector<T> that contains the values of the dependent variable as predicted by the model. The Residuals property returns a vector containing the difference between the actual and the predicted values of the dependent variable. Both vectors contain one element for each observation.
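
As a minimal sketch, the following loops over the observations of the model computed above and prints the predicted value and residual for each, using Vector<T>'s indexer and Length property; the output formatting is illustrative:

C#
var predictions = model1.Predictions;
var residuals = model1.Residuals;
// Both vectors contain one element per observation:
for (int i = 0; i < predictions.Length; i++)
    Console.WriteLine("Observation {0}: predicted = {1}, residual = {2}",
        i, predictions[i], residuals[i]);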

Regression Parameters

The LinearRegressionModel class' Parameters property returns a ParameterVector<T> object that contains the parameters of the regression model. The elements of this vector are of type Parameter<T>. Regression parameters are created by the model. You cannot create them directly.

Parameters can be accessed by numerical index or by name. The name of a parameter is usually the name of the variable associated with it.

A multiple linear regression model has as many parameters as there are independent variables, plus one for the intercept (constant term) when it is included. The intercept, if present, is the first parameter in the collection, with index 0.
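
As a sketch, the intercept can be retrieved through its index and a slope parameter through the name of its variable; the example below uses the data frame model (model2) defined earlier:

C#
var parameters = model2.Parameters;
// The intercept, when included, is the first parameter (index 0):
var intercept = parameters[0];
// Slope parameters can also be retrieved by the name of their variable:
var x2Parameter = parameters.Get("x2");
Console.WriteLine("Intercept: {0}", intercept.Value);
Console.WriteLine("{0}: {1}", x2Parameter.Name, x2Parameter.Value);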

The Parameter<T> class has four useful properties. The Value property returns the numerical value of the parameter, while the StandardError property returns the standard deviation of the parameter's distribution.

The Statistic property returns the value of the t-statistic corresponding to the hypothesis that the parameter equals zero. The PValue property returns the corresponding p-value. A high p-value indicates that the variable associated with the parameter does not make a significant contribution to explaining the data. The p-value always corresponds to a two-tailed test.

The following example prints the properties of the parameter associated with the x1 variable in the data frame model (model2) defined earlier:

C#
var x1Parameter = model2.Parameters.Get("x1");
Console.WriteLine("Name:        {0}", x1Parameter.Name);
Console.WriteLine("Value:       {0}", x1Parameter.Value);
Console.WriteLine("St.Err.:     {0}", x1Parameter.StandardError);
Console.WriteLine("t-statistic: {0}", x1Parameter.Statistic);
Console.WriteLine("p-value:     {0}", x1Parameter.PValue);

Verifying the Quality of the Regression

The ResidualSumOfSquares property gives the sum of the squares of the residuals. The regression coefficients were found by minimizing this value. The StandardError property gives the standard error of the regression, an estimate of the standard deviation of the error in the data.

The RSquared property returns the coefficient of determination. It is the fraction of the total variation in the data that is explained by the model. Its value always lies between 0 and 1, where 0 means the model explains nothing and 1 means the model explains the data perfectly.

When the model contains many independent variables, the additional variables may be modeling the errors in the data rather than the data itself. This causes the full model to be less reliable for making predictions. The AdjustedRSquared property returns an adjusted R² value that attempts to compensate for this phenomenon.

An entirely different assessment is available through an analysis of variance. Here, the variation in the data is decomposed into a component explained by the model and a residual component. The FStatistic property returns the F-statistic for the ratio of these two variances. The PValue property returns the corresponding p-value. A low p-value means that it is unlikely that the variation explained by the model is of the same magnitude as the variation in the residuals, which indicates that the model is significant.

The results of the analysis of variance are also summarized in the regression model's ANOVA table, returned by the AnovaTable property.
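
As a minimal sketch, the following prints these quality-of-fit measures, together with the ANOVA table, for the model computed earlier; the output formatting is illustrative:

C#
Console.WriteLine("Residual sum of squares: {0}", model1.ResidualSumOfSquares);
Console.WriteLine("Standard error:          {0}", model1.StandardError);
Console.WriteLine("R-squared:               {0}", model1.RSquared);
Console.WriteLine("Adjusted R-squared:      {0}", model1.AdjustedRSquared);
Console.WriteLine("F-statistic:             {0}", model1.FStatistic);
Console.WriteLine("p-value:                 {0}", model1.PValue);
Console.WriteLine(model1.AnovaTable);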

Stepwise Regression

The LinearRegressionModel class has the ability to automatically select the 'best' set of variables through a process called stepwise regression. To run a stepwise regression, create a StepwiseOptions object and assign it to the model's StepwiseOptions property. There are five methods for stepwise regression, as enumerated by the StepwiseRegressionMethod type:

AllVariables: All variables are included in the model.
ForwardStepwise: Stepwise regression starting from an empty model, allowing variables to be added and removed.
ForwardSelection: Stepwise regression starting from an empty model, allowing variables to be added only.
BackwardStepwise: Stepwise regression starting from a complete model, allowing variables to be added and removed.
BackwardElimination: Stepwise regression starting from a complete model, allowing variables to be removed only.

To create a stepwise regression, create a new StepwiseOptions object and assign one of the above methods to its Method property. The thresholds for allowing a variable to enter or leave the model can be specified either on the basis of the F-statistic, or on the basis of the corresponding p-value. The thresholds are set through either ToEnterStatisticThreshold and ToRemoveStatisticThreshold, or ToEnterPValueThreshold and ToRemovePValueThreshold.

With the options set, the model can be computed in the same way as a standard model, by calling the Compute method. The parameters in the model's Parameters collection are listed in the order in which they were added to the model.
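
As a minimal sketch, the following runs a forward stepwise regression on the data frame from earlier; the p-value thresholds of 0.05 and 0.10 are illustrative choices, not library defaults:

C#
var options = new StepwiseOptions()
{
    Method = StepwiseRegressionMethod.ForwardStepwise,
    // Illustrative thresholds for adding and removing variables:
    ToEnterPValueThreshold = 0.05,
    ToRemovePValueThreshold = 0.10
};
var stepwiseModel = new LinearRegressionModel(dataFrame, "y ~ x1 + x2");
stepwiseModel.StepwiseOptions = options;
stepwiseModel.Compute();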