Principal Component Analysis (PCA) in IronPython QuickStart Sample

Illustrates how to perform a Principal Component Analysis using classes in the Numerics.NET.Statistics.Multivariate namespace in IronPython.

This sample is also available in: C#, Visual Basic, F#.

Overview

This QuickStart sample demonstrates how to perform Principal Component Analysis (PCA) on a dataset using the Numerics.NET Statistics library.

The sample loads data from a text file containing depression study measurements and performs the following operations:

Loads and prepares the data from a delimited text file
Creates and fits a PCA model to the data
Examines the eigenvalues and variance contributions of each principal component
Shows how to determine the number of components needed to explain a given proportion of variance
Displays the component vectors themselves
Calculates and displays component scores
Demonstrates how to use the components to predict/reconstruct the original data

The sample shows key PCA concepts including variance proportions, cumulative variance explained, component vectors, and scores. It illustrates how to interpret PCA results and use them for dimensionality reduction by selecting components that explain a target proportion of variance.

The code uses real-world depression study data to make the example concrete and practical. It shows common PCA workflow steps from data loading through analysis and prediction.

The code

import numerics

from Extreme.Mathematics import *
from Extreme.Mathematics.LinearAlgebra.IO import *
from Extreme.Statistics import *
from Extreme.Statistics.Multivariate import *

# Demonstrates how to use classes that implement
# Principal Component Analysis (PCA).

# This QuickStart Sample demonstrates how to perform
# a principal component analysis on a set of data.
#
# The classes used in this sample reside in the
# Extreme.Statistics.Multivariate namespace..

# First, our dataset, 'depress.txt', which is from
#     Computer-Aided Multivariate Analysis, 4th Edition
#     by A. A. Afifi, V. Clark and S. May, chapter 16
#     See http:#www.ats.ucla.edu/stat/Stata/examples/cama4/default.htm

# The data is in delimited text format. Use a matrix reader to load it into a matrix.
reader = DelimitedTextMatrixReader(r"..\Data\Depress.txt")
reader.MergeConsecutiveDelimiters = True
reader.SetColumnDelimiters(' ')
m = reader.ReadMatrix()

# The data we want is in columns 8 through 27:
m = m.GetSubmatrix(0, m.RowCount - 1, 8, 27)

# 
# Principal component analysis
#

# We can construct PCA objects in many ways. Since we have the data in a matrix, # we use the constructor that takes a matrix as input.
pca = PrincipalComponentAnalysis(m)
# and immediately perform the analysis:
pca.Compute()
            
# We can get the contributions of each component:
print " #    Eigenvalue Difference Contribution Contrib. %"
for i in range(5):
    # We get the ith component from the model...
    component = pca.Components[i]
    # and write out its properties
    print "{0:2}{1:12.4f}{2:11.4f}{2:14.3f}%{3:10.3f}%" \
        .format( i, component.Eigenvalue, component.EigenvalueDifference, \
        100 * component.ProportionOfVariance, 100 * component.CumulativeProportionOfVariance)

# To get the proportions for all components, use the
# properties of the PCA object:
proportions = pca.VarianceProportions

# To get the number of components that explain a given proportion
# of the variation, use the GetVarianceThreshold method:
count = pca.GetVarianceThreshold(0.9)
print "Components needed to explain 90% of variation:", count
print 

# The value property gives the components themselves:
print "Components:"
print "Var.      1       2       3       4       5"
pcs = pca.Components
for i in range(pcs.Count):
    print "{0:4}{1:8.4f}{2:8.4f}{3:8.4f}{4:8.4f}{5:8.4f}" \
        .format(i, pcs[0].Value[i], pcs[1].Value[i], pcs[2].Value[i], pcs[3].Value[i], pcs[4].Value[i])
print 

# The scores are the coefficients of the observations expressed as a combination
# of principal components.
scores = pca.ScoreMatrix

# To get the predicted observations based on a specified number of components, # use the GetPredictions method.
prediction = pca.GetPredictions(count)
print "Predictions using", count, "components:"
print "   Pr. 1  Act. 1   Pr. 2  Act. 2   Pr. 3  Act. 3   Pr. 4  Act. 4", count
for i in range(0, 10):
    print "{0:8.4f}{1:8.4f}{2:8.4f}{3:8.4f}{4:8.4f}{5:8.4f}{6:8.4f}{7:8.4f}" \
        .format(prediction[0].GetValue(i), m[i, 0], prediction[1].GetValue(i), m[i, 1], \
        prediction[2].GetValue(i), m[i, 2], prediction[3].GetValue(i), m[i, 3])