Hierarchical Cluster Analysis

Cluster analysis is the collective name given to a number of algorithms for grouping similar objects into distinct categories. It is a form of exploratory data analysis aimed at grouping observations in a way that minimizes the difference within groups while maximizing the difference between groups.

Hierarchical clustering organizes observations in a tree structure based on similarity or dissimilarity between clusters. The algorithm starts with each observation as its own cluster, and successively combines the clusters that are most similar. Two important properties of the algorithm are the distance measure and the linkage method.

Distance measures

The distance measure determines how the similarity between clusters is measured. The most common distance measure is the squared Euclidean distance. The following distance measures have been predefined, and are available as properties of the DistanceMeasures class:

Value

Description

EuclideanDistance

The standard Euclidean distance.

SquaredEuclideanDistance

The square of the Euclidean distance. This is the default.

ManhattanDistance

The Manhattan or city block distance, computed as the sum of the absolute values of the difference between corresponding elements.

MaximumDistance

The distance computed as the largest absolute difference between corresponding elements.

CorrelationDistance

A measure based on the correlation between the observations.

CosineDistance

The distance computed as the cosine of the angle between two observations.

MinkowskiDistance(Double)

The general Minkowski distance for the specified power. The power must be specified as a parameter to the method.

CanberraDistance

The so-called Canberra distance, which gives more weight to elements that add up to small values.

Linkage

The linkage method determines how clusters are combined and how the similarities between the merged cluster and the remaining clusters are computed. It can take on the following values:

Value

Description

Single

The distance between two clusters is the distance between their closest neighboring points.

Complete

The distance between two clusters is the distance between their two furthest member points.

Average

Also called unweighted pair-group method using averages (UPGMA). The distance between two clusters is the average distance between all inter-cluster pairs.

Ward

The cluster to be merged is the one which will produce the least increase in the sum of squared Euclidean distances from each case in a cluster to the mean of all variables.

Median

Also called unweighted pair-group method using centroid averages (UPGMC). The cluster to be merged is the one with the smallest sum of Euclidean distances between cluster means (centroids) for all variables.

McQuitty

The distance between a cluster and a newly merged cluster is the mean of the distance between the cluster and each of the two component clusters.

Centroid

The distance between two clusters is the distance between the centroids or the means of the clusters.

Running a cluster analysis

Hierarchical clustering is implemented by the HierarchicalClusterAnalysis class. This class has three constructors. The first constructor takes one argument: a Matrix<T> whose columns contain the data to be analyzed. The second constructor also takes one argument: an array of Vector<T> objects. Both these constructors are illustrated below:

C#
var matrix = Matrix.CreateRandom(100, 10);
var hc1 = new HierarchicalClusterAnalysis(matrix);
var vectors = matrix.Columns.ToArray();
var hc2 = new HierarchicalClusterAnalysis(vectors);

The third constructor takes two arguments. The first is a IDataFrame (a DataFrame<R, C> or Matrix<T>) that contains the variables that may be used in the analysis. The second argument is an array of strings that contains the names of the variables from the collection that should be included in the analysis.

The distance measure can be set through the DistanceMeasure property. It defaults to squared Euclidean distance. To set or get the linkage method, use the LinkageMethod property. The default here is the centroid method. One last option that can be set is whether the variables should be standardized, using the Standardize property. When variables are unequally scaled, some variables will make a larger contribution to the distance than others. This can distort the clustering. To avoid this problem, the variables can be standardized so they all contribute equally to the distance. The default is to standardize variables. The following example sets up a hierarchical cluster analysis using a data frame and sets some options:

C#
var rowIndex = Index.Default(matrix.RowCount);
var names = new string[] { "x1", "x2", "x3",
    "x4", "x5", "x6", "x7", "x8", "x9", "x10" };
var columnIndex = Index.Create(names);
var dataFrame = matrix.ToDataFrame(rowIndex, columnIndex);
var hc3 = new HierarchicalClusterAnalysis(dataFrame, names);
hc3.Standardize = true;
hc3.LinkageMethod = LinkageMethod.Centroid;
hc3.DistanceMeasure = DistanceMeasures.SquaredEuclideanDistance;

Results of the analysis

The Compute method performs the actual calculations. Once the computations are complete, a number of properties and methods give access to the results in detail.

Use the GetClusterPartition method to divide the observations into the specified number of sets. This method returns a HierarchicalClusterCollection object which, as the name implies, is a collection of HierarchicalCluster objects. In addition to the usual collection properties and methods, this class also has a GetMemberships method, which returns a CategoricalVector<T> that for each observation indicates the cluster to which it belongs.

Each HierarchicalCluster describes one cluster in detail. The Size property returns the number of observations in the cluster. The MemberIndexes property returns a vector containing the indexes of its members in the original dataset. The Root property returns the DendrogramNode node that is the root of the set. The following example creates a partition of 5 clusters, prints the size of each, and also prints out which cluster each observation belongs to:

C#
hc1.Fit();
var partition = hc1.GetClusterPartition(5);
foreach (HierarchicalCluster cluster in partition)
    Console.WriteLine("Cluster {0} has {1} members.", cluster.Index, cluster.Size);
// Get a categorical variable that shows memberships:
var memberships = partition.GetMemberships();
for (int i = 15; i < memberships.Length; i++)
    Console.WriteLine("Observation {0} belongs to cluster {1}", i,
        memberships.GetLevelIndex(i));

Dendrograms

A dendrogram is a tree-like representation of the results of a hierarchical cluster analysis. The nodes of the tree are clusters that represent either original observations, or clusters that resulted from merging two clusters. The nodes of a dendrogram are represented by DendrogramNode objects. The dendrogram of a cluster analysis can be accessed through its root node, available through the a DendrogramRoot property.

The IsLeaf property indicates whether the cluster is a leaf node (an observation from the original data set), or a merged cluster. If it is a leaf node, then the ObservationIndex returns the index of this observation in the original data set. The LeftChild and RightChild properties specify the two clusters that were merged. DistanceMeasure returns the distance between the two merged clusters. This property is zero for leaf nodes. Level returns the number of nodes between the current node and the root node. The root node is at level 0. The children of the root are at level 1, and so on.

A dendrogram is also useful to give a graphical representation of the results. The Position property gives the horizontal position of the node. The leaf nodes (observations) have integer positions that correspond to their GetDendrogramOrder and can be used as one coordinate. It is centered over the two clusters that were merged. The DistanceMeasure or Level properties can be used as the other coordinate.

The next example shows how to access the dendrogram nodes and their properties:

C#
var root = hc1.DendrogramRoot;
Console.WriteLine("Root position: ({0:F4}, {1:F4})", root.Position, root.DistanceMeasure);
Console.WriteLine("Position of left child: {0:F4}", root.LeftChild.Position);
Console.WriteLine("Position of right child: {0:F4}", root.RightChild.Position);