Principal Component Analysis

Principal component analysis (PCA) is a data reduction technique that expresses a data set in terms of components or combinations of variables that contribute most to the variation in the data. As a result, the total number of variables used to describe the data is reduced, at the cost of losing some of the fine-grained information in the data.

Defining PCA models

All classes related to principal component analysis reside in the Extreme.Statistics.Multivariate namespace. The main type is PrincipalComponentAnalysis, which represents a single principal component analysis.

The PrincipalComponentAnalysis class has three constructors. The first constructor takes one argument: a Matrix<T> whose columns contain the data to be analyzed. The second constructor also takes one argument: an array of Vector<T> objects.

C#
var matrix = Matrix.CreateRandom(100, 10);
var pca1 = new PrincipalComponentAnalysis(matrix);
var vectors = matrix.Columns.ToArray();
var pca2 = new PrincipalComponentAnalysis(vectors);

The third constructor takes two arguments. The first is an IDataFrame (a DataFrame<R, C> or Matrix<T>) that contains the variables that may be used in the analysis. The second argument is an array of strings that contains the names of the variables from the collection that should be included in the analysis. The following example creates a data frame from a matrix and then constructs a PCA object using a subset of the columns:

C#
var rowIndex = Index.Default(matrix.RowCount);
var allNames = new string[] { "x1", "x2", "x3",
    "x4", "x5", "x6", "x7", "x8", "x9", "x10" };
var columnIndex = Index.Create(allNames);
var dataFrame = matrix.ToDataFrame(rowIndex, columnIndex);
var names = new string[] { "x1", "x2", "x3", "x8", "x9", "x10" };
var pca3 = new PrincipalComponentAnalysis(dataFrame, names);

Performing the analysis

When the variables in an analysis use very different scales, the principal components give more weight to the variables with the larger values. To put all variables on an equal footing, the variables are often scaled before the analysis. The ScalingMethod property determines if and how this transformation is performed. The property is of type ScalingMethod, which can take on the following values:

Value           Description
None            No scaling is performed.
UnitVariance    The columns are scaled to have unit variance. This is the default.
VectorNorm      The columns are scaled to have unit norm.
Pareto          The columns are scaled by the square root of the standard deviation.
Range           The columns are scaled to have unit range (difference between largest and smallest value).
Level           The columns are scaled by the column mean.
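
To make the scaling methods listed above concrete, the following sketch computes the divisor that each method would apply to a single column, using plain arrays rather than the library. The class ScalingSketch, its methods, and the string method names are hypothetical and exist only for illustration. The sketch assumes the sample standard deviation (n - 1 in the denominator) and the Euclidean norm of the raw column; the library may differ on both points, and on whether columns are centered first.

```csharp
using System;
using System.Linq;

static class ScalingSketch
{
    // Sample standard deviation (n - 1 in the denominator); an assumption.
    public static double StandardDeviation(double[] x)
    {
        double mean = x.Average();
        double sumOfSquares = x.Sum(v => (v - mean) * (v - mean));
        return Math.Sqrt(sumOfSquares / (x.Length - 1));
    }

    // Returns the value a column is divided by under each scaling method.
    // The method names mirror the table; this is not the library's enum.
    public static double ScaleFactor(double[] x, string method)
    {
        switch (method)
        {
            case "None": return 1.0;
            case "UnitVariance": return StandardDeviation(x);
            case "VectorNorm": return Math.Sqrt(x.Sum(v => v * v));
            case "Pareto": return Math.Sqrt(StandardDeviation(x));
            case "Range": return x.Max() - x.Min();
            case "Level": return x.Average();
            default: throw new ArgumentException("Unknown method: " + method);
        }
    }

    static void Main()
    {
        double[] column = { 2.0, 4.0, 4.0, 6.0 };
        string[] methods = { "None", "UnitVariance", "VectorNorm", "Pareto", "Range", "Level" };
        foreach (string method in methods)
            Console.WriteLine("{0,-13}{1,10:F4}", method, ScaleFactor(column, method));
    }
}
```

Dividing every column by its own factor is what puts variables measured in different units on a comparable footing before the components are extracted.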

The Fit method performs the actual calculations. The code below sets the scaling method for the PCA object created earlier and runs the analysis:

C#
pca3.ScalingMethod = ScalingMethod.VectorNorm;
pca3.Fit();

Results of the analysis

Once the computations are complete, a number of properties and methods give access to the results in detail. The Components property provides access to a collection of PrincipalComponent objects that provide details about each of the principal components. The components are sorted in order of their contribution to the variance in the data, in descending order.

The VarianceProportions and CumulativeVarianceProportions properties summarize the contribution of the components. The GetVarianceThreshold method calculates how many components are needed to explain a certain proportion of the total variation in the data.
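
The relationship between eigenvalues, variance proportions, and the variance threshold can be sketched without the library. In the example below, the class VarianceSketch and its methods are hypothetical, and the eigenvalues are made-up illustration values. The sketch assumes that each proportion is the eigenvalue divided by the sum of all eigenvalues, and that the threshold counts components until the cumulative proportion reaches the requested value; the library's exact comparison rule may differ.

```csharp
using System;
using System.Linq;

static class VarianceSketch
{
    // Share of the total variance carried by each component.
    public static double[] Proportions(double[] eigenvalues)
    {
        double total = eigenvalues.Sum();
        return eigenvalues.Select(e => e / total).ToArray();
    }

    // Running total of the proportions, in component order.
    public static double[] Cumulative(double[] proportions)
    {
        var result = new double[proportions.Length];
        double sum = 0.0;
        for (int i = 0; i < proportions.Length; i++)
        {
            sum += proportions[i];
            result[i] = sum;
        }
        return result;
    }

    // Smallest number of leading components whose cumulative
    // proportion is at least the threshold.
    public static int ComponentsNeeded(double[] eigenvalues, double threshold)
    {
        double[] cumulative = Cumulative(Proportions(eigenvalues));
        for (int i = 0; i < cumulative.Length; i++)
            if (cumulative[i] >= threshold)
                return i + 1;
        return cumulative.Length;
    }

    static void Main()
    {
        // Eigenvalues sorted in descending order, as the components are.
        double[] eigenvalues = { 4.0, 2.0, 1.0, 1.0 };
        double[] proportions = Proportions(eigenvalues);
        double[] cumulative = Cumulative(proportions);
        for (int i = 0; i < eigenvalues.Length; i++)
            Console.WriteLine("{0,2}{1,10:F4}{2,10:F4}", i, proportions[i], cumulative[i]);
        Console.WriteLine("Components needed for 90%: {0}", ComponentsNeeded(eigenvalues, 0.9));
    }
}
```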

PrincipalComponent objects provide more detailed information. The Eigenvalue property returns the eigenvalue corresponding to the component. This is an absolute measure of the size of the contribution. The EigenvalueDifference property returns the difference between the eigenvalue of the component and that of the next most significant component. This gives another indication of the significance of a component: the greater the difference, the more important the component is compared to the remaining components. The ProportionOfVariance and CumulativeProportionOfVariance properties give the contribution of the component to the variation in the data in relative terms. Finally, the Value property returns the component as a Vector<T>. The code below illustrates these properties:

C#
Console.WriteLine(" #    Eigenvalue Difference Contribution Contrib. %");
for (int i = 0; i < 5; i++)
{
    var component = pca3.Components[i];
    Console.WriteLine("{0,2}{1,12:F4}{2,11:F4}{3,14:F3}%{4,10:F3}%",
        i, component.Eigenvalue, component.EigenvalueDifference,
        100 * component.ProportionOfVariance,
        100 * component.CumulativeProportionOfVariance);
}

The ComponentMatrix property returns the components as the columns of a matrix. The ScoreMatrix property expresses the observations in terms of the components. The GetPredictions method returns the observations as they are reconstructed when only the specified number of components is taken into account. The sample code below shows how to get the predictions for the components that explain 90% of the variation in the data:

C#
int count = pca3.GetVarianceThreshold(0.9);
Console.WriteLine("Components needed to explain 90% of variation: {0}", count);
var prediction = pca3.GetPredictions(count);
Console.WriteLine("Predictions using {0} components:", count);
Console.WriteLine("   Pr. 1  Act. 1   Pr. 2  Act. 2   Pr. 3  Act. 3   Pr. 4  Act. 4");
for (int i = 0; i < 10; i++)
    Console.WriteLine(
        "{0,8:F4}{1,8:F4}{2,8:F4}{3,8:F4}"
      + "{4,8:F4}{5,8:F4}{6,8:F4}{7,8:F4}",
        prediction[i, 0], matrix[i, 0],
        prediction[i, 1], matrix[i, 1],
        prediction[i, 2], matrix[i, 2],
        // The fourth variable in the analysis is "x8", which is column 7 of matrix.
        prediction[i, 3], matrix[i, 7]);
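
As a rough illustration of what a prediction from a limited number of components means, the sketch below projects one observation onto the first k components and maps the result back to the original variables. It does not use the library: the class ReconstructionSketch and its Reconstruct method are hypothetical, the components are hand-picked orthonormal vectors, and the observation is assumed to be already centered and scaled. GetPredictions may in addition undo the centering and scaling.

```csharp
using System;

static class ReconstructionSketch
{
    // Reconstructs an observation from its first k component scores.
    // components[j] is the j-th component, assumed to be a unit vector
    // that is orthogonal to all other components.
    public static double[] Reconstruct(double[] observation, double[][] components, int k)
    {
        var result = new double[observation.Length];
        for (int j = 0; j < k; j++)
        {
            // The score is the coordinate of the observation along component j.
            double score = 0.0;
            for (int i = 0; i < observation.Length; i++)
                score += observation[i] * components[j][i];

            // Add this component's contribution back in the original variables.
            for (int i = 0; i < observation.Length; i++)
                result[i] += score * components[j][i];
        }
        return result;
    }

    static void Main()
    {
        double s = 1.0 / Math.Sqrt(2.0);
        double[][] components = { new[] { s, s }, new[] { s, -s } };
        double[] observation = { 3.0, 1.0 };

        double[] one = Reconstruct(observation, components, 1);
        double[] two = Reconstruct(observation, components, 2);
        Console.WriteLine("1 component:  {0:F4} {1:F4}", one[0], one[1]);
        Console.WriteLine("2 components: {0:F4} {1:F4}", two[0], two[1]);
    }
}
```

With all components included, the reconstruction reproduces the observation up to rounding error; with fewer components, the part of the observation that lies along the omitted components is lost.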