Partial Least Squares
Partial least squares is a technique that fits linear combinations of the independent variables, called factors, to one or more dependent variables. The factors are chosen to maximize the covariance between the factors and the dependent variables.
Partial least squares is useful when the number of independent variables is large compared to the number of observations, or when variables are highly correlated.
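For the case of a single dependent variable, the covariance criterion can be written down compactly. The notation here is an assumption made for illustration and is not part of the API: X is the centered matrix of independent variables, y the centered dependent variable, w a unit-length weight vector, and t the resulting factor.

```latex
w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Cov}(Xw,\, y)^2
    = \frac{X^{\mathsf T} y}{\lVert X^{\mathsf T} y \rVert},
\qquad
t_1 = X w_1
```

Later factors are obtained the same way after the contribution of the earlier factors has been removed from the data.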
Constructing Partial Least Squares Models
The PartialLeastSquaresModel class has four constructors. The first constructor takes three arguments. The first is a Vector<T> that represents the dependent variable. The second is a parameter array of vectors that represent the independent variables. The last argument is the number of factors that should be computed. This creates a Partial Least Squares model with one dependent variable, sometimes called PLS1. The second constructor takes a matrix instead of a vector as the first argument. This constructs a multivariate PLS model, sometimes called PLS2, where each column of the matrix represents a dependent variable.
In the example below, we create two Partial Least Squares models using random data. The first has one dependent variable, 10 independent variables and 20 observations. The second has 3 dependent variables. In both cases, we're asking for 5 factors:
var dependent = Vector.CreateRandom(20);
var independents = Matrix.CreateRandom(20, 10);
var model1 = new PartialLeastSquaresModel(dependent, independents, 5);
var dependents = Matrix.CreateRandom(20, 3);
var model2 = new PartialLeastSquaresModel(dependents, independents, 5);
The third constructor takes four arguments. The first argument is an IDataFrame (a DataFrame<R, C> or Matrix<T>) that contains the variables to be used in the regression. The second argument is an array of strings containing the names of the dependent variables. The third argument is an array of strings containing the names of the independent variables. All the names must exist in the column index of the data frame specified by the first argument. The last argument is once again the number of factors.
In the code that follows, we give the two matrices of dependent and independent variables a column index. We join these matrices to get a matrix that can act as a data frame. We then use this matrix, along with the arrays of column names, to construct the same PLS model:
var xNames = new string[] {
    "x1", "x2", "x3", "x4", "x5",
    "x6", "x7", "x8", "x9", "x10" };
independents.ColumnIndex = Index.Create(xNames);
var yNames = new string[] { "y1", "y2", "y3" };
dependents.ColumnIndex = Index.Create(yNames);
// A matrix can act as a data frame:
var all = Matrix.JoinHorizontal(independents, dependents);
var model3 = new PartialLeastSquaresModel(all, yNames, xNames, 5);
The fourth constructor takes three arguments. The first argument once again contains the data. The second is a string that contains a formula that describes the model. See the section on formulas for details. The third argument is the number of factors. The same model as above can be defined using a formula as:
var model4 = new PartialLeastSquaresModel(all, "y1 + y2 + y3 ~ .", 5);
We used the special . term on the right-hand side to capture all remaining columns as independent variables.
Computing the Model
The Fit method performs the actual analysis. Most properties and methods throw an exception when they are accessed before the Fit method is called. You can verify that the model has been calculated by inspecting the Computed property.
Fitting the model is done with one of two standard algorithms: NIPALS (Nonlinear Iterative PArtial Least Squares) or SIMPLS (Statistically Inspired Modification of Partial Least Squares). The two algorithms give identical results when there is only one dependent variable.
By default, the NIPALS algorithm is used. You can change this by setting the Method property. This property is of type PartialLeastSquaresMethod and can take on the following values:
| Method | Description |
|---|---|
| Nipals | Use the original Nonlinear Iterative PArtial Least Squares method (NIPALS). |
| Simpls | Use the Statistically Inspired Modification of Partial Least Squares method (SIMPLS) of de Jong. |
The number of components to compute can be changed by setting the NumberOfComponents property. In the next example, we compute the first model we created earlier using default settings. For the second model, we change the number of requested components to 7 and compute the model using the SIMPLS algorithm:
model1.Fit();
model2.NumberOfComponents = 7;
model2.Method = PartialLeastSquaresMethod.Simpls;
model2.Fit();
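To give some intuition for what the NIPALS option computes, here is a minimal, library-free sketch of the textbook algorithm for a single dependent variable (PLS1). It is not the library's implementation: the class name Pls1Sketch, the use of plain arrays, and the zero-norm tolerance are all assumptions made for illustration.

```csharp
using System;

// Illustrative NIPALS fit for one dependent variable (PLS1).
// Hypothetical helper; not part of the library described in this document.
public static class Pls1Sketch
{
    // x: one row per observation, one column per independent variable.
    // Returns the regression coefficients and intercept on the original scale.
    public static (double[] Coefficients, double Intercept) Fit(
        double[,] x, double[] y, int components)
    {
        int n = x.GetLength(0), p = x.GetLength(1);

        // Center the columns of X and the response y.
        var xMean = new double[p];
        double yMean = 0.0;
        for (int i = 0; i < n; i++)
        {
            yMean += y[i] / n;
            for (int j = 0; j < p; j++) xMean[j] += x[i, j] / n;
        }
        var e = new double[n, p];   // deflated X
        var f = new double[n];      // deflated y
        for (int i = 0; i < n; i++)
        {
            f[i] = y[i] - yMean;
            for (int j = 0; j < p; j++) e[i, j] = x[i, j] - xMean[j];
        }

        var b = new double[p];                     // accumulated coefficients
        var loadings = new double[components][];   // p_a
        var rotations = new double[components][];  // r_a, so that t_a = Xc r_a

        for (int a = 0; a < components; a++)
        {
            // Weights: the unit vector maximizing the covariance of E w with f.
            var w = new double[p];
            double norm = 0.0;
            for (int j = 0; j < p; j++)
            {
                for (int i = 0; i < n; i++) w[j] += e[i, j] * f[i];
                norm += w[j] * w[j];
            }
            norm = Math.Sqrt(norm);
            if (norm < 1e-12) break;               // nothing left to explain
            for (int j = 0; j < p; j++) w[j] /= norm;

            // Scores t = E w.
            var t = new double[n];
            double tt = 0.0;
            for (int i = 0; i < n; i++)
            {
                for (int j = 0; j < p; j++) t[i] += e[i, j] * w[j];
                tt += t[i] * t[i];
            }

            // Loadings: regress E and f on the scores.
            var pa = new double[p];
            double q = 0.0;
            for (int i = 0; i < n; i++)
            {
                q += f[i] * t[i] / tt;
                for (int j = 0; j < p; j++) pa[j] += e[i, j] * t[i] / tt;
            }

            // Deflate: remove this factor's contribution from E and f.
            for (int i = 0; i < n; i++)
            {
                f[i] -= q * t[i];
                for (int j = 0; j < p; j++) e[i, j] -= t[i] * pa[j];
            }

            // Express the scores in terms of the original centered X:
            // r_a = w_a - sum over k < a of (p_k . w_a) r_k.
            var r = (double[])w.Clone();
            for (int k = 0; k < a; k++)
            {
                double dot = 0.0;
                for (int j = 0; j < p; j++) dot += loadings[k][j] * w[j];
                for (int j = 0; j < p; j++) r[j] -= dot * rotations[k][j];
            }
            loadings[a] = pa;
            rotations[a] = r;
            for (int j = 0; j < p; j++) b[j] += q * r[j];
        }

        double intercept = yMean;
        for (int j = 0; j < p; j++) intercept -= xMean[j] * b[j];
        return (b, intercept);
    }
}
```

When the number of components equals the number of linearly independent columns of X, this sketch reproduces the ordinary least squares fit. Fewer components give a regularized fit.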
Results
The PredictedValues property returns a Matrix<T> that contains the values of the dependent variables as predicted by the model. The YResiduals property returns the corresponding residuals: the differences between the actual and the predicted values of the dependent variables. Both contain one entry per dependent variable for each observation.
The Coefficients property returns the matrix of regression coefficients of the model. The Intercepts property returns the vector of corresponding intercepts. The StandardizedCoefficients property returns a matrix of the standardized coefficients, based on centered and normalized variables.
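The relationship between the two sets of coefficients can be sketched as follows, assuming that "normalized" means scaled to unit standard deviation. The symbols are introduced here for illustration: β is the coefficient of independent variable j for dependent variable k, β* its standardized counterpart, s a sample standard deviation, and a bar denotes a sample mean.

```latex
\beta^{*}_{jk} = \beta_{jk}\,\frac{s_{x_j}}{s_{y_k}},
\qquad
\beta_{0k} = \bar{y}_k - \sum_{j} \beta_{jk}\,\bar{x}_j
```

Under that assumption, standardized coefficients can be compared across independent variables that are measured in different units.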
Several properties give information about the factors and how they relate to the dependent and independent variables. In PLS, the matrices of both the independent and the dependent variables are decomposed into components. Similar terminology is used for both.
The XScores property returns a matrix that contains the scores of the independent variables, and XLoadings returns a matrix that contains their loadings. These are the factors T and P in the decomposition of X into TPᵀ. The YScores property returns a matrix that contains the scores of the dependent variables, and YLoadings returns a matrix that contains their loadings. These are the factors U and Q in the decomposition of Y into UQᵀ. In addition, the WeightMatrix property returns a matrix containing the projection weights for the independent variables.
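Written out in the usual textbook notation, the two decompositions look as follows. The residual matrices E and F and the dimension symbols are assumptions added for illustration: n observations, p independent variables, m dependent variables, and a factors, with X and Y centered.

```latex
X = T P^{\mathsf T} + E, \qquad Y = U Q^{\mathsf T} + F,
\qquad
T, U \in \mathbb{R}^{n \times a},\quad
P \in \mathbb{R}^{p \times a},\quad
Q \in \mathbb{R}^{m \times a}
```

In the NIPALS formulation, the weights W connect the scores to the original centered data through T = XW(PᵀW)⁻¹. Which of these weight conventions WeightMatrix follows is not stated in this section.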
Making Predictions
The Predict method can be used to predict the values of the dependent variables for new data. The method has three overloads, which all take two arguments. The first overload takes a vector as its first argument. The vector contains the values of the independent variables for which a prediction should be made. The second argument, which is always optional, specifies how the values in the vector relate to the variables in the model. This overload returns a vector that contains the predictions for each of the dependent variables.
The second and third overloads take a matrix and a data frame, respectively, as their first argument. Each row in the matrix or data frame corresponds to an observation. The methods return a matrix whose rows contain the corresponding predictions for the dependent variables.
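Conceptually, a prediction from a fitted model is the intercept plus a linear combination of the independent variables. The sketch below shows that arithmetic with plain arrays. The class name LinearPredictionSketch and the coefficient layout (one row per independent variable, one column per dependent variable) are assumptions for illustration and may not match the layout of the Coefficients property.

```csharp
using System;

// Illustrative only: applies regression coefficients to one observation.
// Hypothetical helper; not part of the library described in this document.
public static class LinearPredictionSketch
{
    // coefficients[j, k]: effect of independent variable j on dependent variable k.
    public static double[] Predict(double[] x, double[,] coefficients, double[] intercepts)
    {
        int p = coefficients.GetLength(0), m = coefficients.GetLength(1);
        if (x.Length != p || intercepts.Length != m)
            throw new ArgumentException("Dimensions do not match.");

        var y = new double[m];
        for (int k = 0; k < m; k++)
        {
            // Start from the intercept and add each variable's contribution.
            y[k] = intercepts[k];
            for (int j = 0; j < p; j++) y[k] += x[j] * coefficients[j, k];
        }
        return y;
    }
}
```

Predicting for a matrix of observations amounts to applying the same calculation to each row.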
Verifying the Quality of the Model
One of the objectives of Partial Least Squares is to capture as much as possible of the variance in both the dependent and the independent variables. The XVarianceExplained and YVarianceExplained properties return vectors that contain the proportion of variance explained by each factor. The corresponding XCumulativeVarianceExplained and YCumulativeVarianceExplained properties return the cumulative proportions.
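The cumulative proportions are the running totals of the per-factor proportions. A minimal sketch of that relationship, using a hypothetical helper and plain arrays rather than the library's vector type:

```csharp
using System;

// Illustrative only: running total of per-factor variance proportions.
// Hypothetical helper; not part of the library described in this document.
public static class VarianceSketch
{
    public static double[] Cumulative(double[] explainedPerFactor)
    {
        var cumulative = new double[explainedPerFactor.Length];
        double total = 0.0;
        for (int a = 0; a < explainedPerFactor.Length; a++)
        {
            total += explainedPerFactor[a];
            cumulative[a] = total;   // proportion explained by factors 0..a
        }
        return cumulative;
    }
}
```

A cumulative curve that flattens out suggests that additional factors contribute little.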
The quality of a PLS model is often assessed using a validation test set. The Press(Matrix<Double>, Matrix<Double>) method computes the PRESS (Predicted REsidual Sum of Squares) of the model for the supplied data. It takes two arguments. The first is a matrix that contains the values of the independent variables to be tested. The second argument is a matrix that contains the values of the dependent variables. The method returns a vector of the PRESS values for each dependent variable. The RootMeanPress(Matrix<Double>, Matrix<Double>) method returns a single value: the square root of the mean of these values.
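The quantities described above can be sketched with plain arrays. This follows the wording of the paragraph literally (a sum of squared prediction errors per dependent variable, then the square root of the mean of those sums). The library may normalize differently, for example by the number of observations, so treat the class PressSketch and its exact scaling as assumptions.

```csharp
using System;

// Illustrative only: PRESS from actual and predicted values.
// Hypothetical helper; not part of the library described in this document.
public static class PressSketch
{
    // Rows are observations, columns are dependent variables.
    public static double[] Press(double[,] actual, double[,] predicted)
    {
        int n = actual.GetLength(0), m = actual.GetLength(1);
        if (predicted.GetLength(0) != n || predicted.GetLength(1) != m)
            throw new ArgumentException("Dimensions do not match.");

        var press = new double[m];
        for (int i = 0; i < n; i++)
        {
            for (int k = 0; k < m; k++)
            {
                double error = actual[i, k] - predicted[i, k];
                press[k] += error * error;   // squared prediction error
            }
        }
        return press;
    }

    // Square root of the mean of the per-variable PRESS values.
    public static double RootMeanPress(double[,] actual, double[,] predicted)
    {
        double[] press = Press(actual, predicted);
        double sum = 0.0;
        foreach (double value in press) sum += value;
        return Math.Sqrt(sum / press.Length);
    }
}
```

A lower value on held-out data indicates better predictive performance.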
These methods can be used to determine the ideal number of components using cross validation. In the example below, we split the input into a training and a test dataset. We print out the PRESS value for the test set for a model based on a varying number of components, from 0 to 10:
// Create subsets (sets of indices) for train and test data:
var trainingSet = new Subset(all.RowCount, 0, 9);
var testSet = new Subset(all.RowCount, 10, 20);
// Generate the train and test data sets:
var XTrain = independents.GetRows(trainingSet);
var YTrain = dependents.GetRows(trainingSet);
// Set up the model:
var model = new PartialLeastSquaresModel(YTrain, XTrain, 0);
// The test data does not depend on k, so extract it once:
var XTest = independents.GetRows(testSet);
var YTest = dependents.GetRows(testSet);
for (int k = 0; k <= 10; k++)
{
    model.NumberOfComponents = k;
    model.Fit();
    double rmPress = model.RootMeanPress(YTest, XTest);
    Console.WriteLine("{0}: {1:F6}", k, rmPress);
}