Cluster Analysis in Visual Basic QuickStart Sample
Illustrates how to use the classes in the Numerics.NET.Statistics.Multivariate namespace to perform hierarchical clustering and K-means clustering in Visual Basic.
This sample is also available in: C#, F#, IronPython.
Overview
This QuickStart sample demonstrates how to perform cluster analysis using both hierarchical and K-means clustering methods in Visual Basic. It shows how to analyze multivariate data to discover natural groupings within the data.
The sample uses a real-world dataset of company financial metrics to demonstrate:
- How to perform hierarchical cluster analysis, including:
  - Standardizing variables using Z-scores
  - Selecting linkage methods and distance measures
  - Creating cluster partitions
  - Working with dendrograms
  - Accessing cluster memberships and ordering
- How to perform K-means clustering, including:
  - Initializing the model with a specified number of clusters
  - Standardizing variables
  - Accessing cluster assignments and distances
  - Computing cluster statistics
  - Working with cluster centers and inter-cluster distances
The sample includes detailed comments explaining each step and demonstrates best practices for working with both clustering methods.
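As background for the standardization and distance-measure steps listed above, the following is a minimal sketch in plain Visual Basic that does not use Numerics.NET. The module and function names (StandardizationSketch, ZScores, SquaredEuclidean) are hypothetical, and dividing by n - 1 (the sample standard deviation) is an assumption; the library may compute its Z-scores differently.

```vb
Option Strict On
Imports System
Imports System.Linq

' Hypothetical helper module for illustration only; not part of Numerics.NET.
Module StandardizationSketch

    ' Rescales a variable to Z-scores: subtract the mean, then divide by
    ' the standard deviation. Assumption: the sample standard deviation
    ' (divisor n - 1) is used.
    Function ZScores(values As Double()) As Double()
        Dim mean As Double = values.Average()
        Dim sumOfSquares As Double = values.Sum(Function(v) (v - mean) * (v - mean))
        Dim sd As Double = Math.Sqrt(sumOfSquares / (values.Length - 1))
        Return values.Select(Function(v) (v - mean) / sd).ToArray()
    End Function

    ' Squared Euclidean distance between two observations of equal length.
    Function SquaredEuclidean(a As Double(), b As Double()) As Double
        Dim total As Double = 0.0
        For i As Integer = 0 To a.Length - 1
            Dim diff As Double = a(i) - b(i)
            total += diff * diff
        Next
        Return total
    End Function

    Sub Main()
        Dim z As Double() = ZScores(New Double() {2.0, 4.0, 6.0})
        Console.WriteLine(String.Join(" ", z))
        Console.WriteLine(SquaredEuclidean(New Double() {0.0, 0.0}, New Double() {3.0, 4.0}))
    End Sub
End Module
```

For the values 2, 4 and 6, the mean is 4 and the sample standard deviation is 2, so the Z-scores are -1, 0 and 1. Standardizing first keeps a variable measured on a large scale from dominating the distance calculation.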
The code
Option Infer On
Imports Numerics.NET.Data.Stata
Imports Numerics.NET
Imports Numerics.NET.Statistics
Imports Numerics.NET.Statistics.Multivariate
''' <summary>
''' Demonstrates how to use classes that implement
''' hierarchical and K-means clustering.
''' </summary>
Module ClusterAnalysisExample
Sub Main()
' The license is verified at runtime. We're using
' a 30-day trial key here. For more information, see
' https://numerics.net/trial-key
Numerics.NET.License.Verify("your-trial-key-here")
' This QuickStart Sample demonstrates how to run two
' common multivariate analysis techniques:
' hierarchical cluster analysis and K-means cluster analysis.
'
' The classes used in this sample reside in the
' Numerics.NET.Statistics.Multivariate namespace.
' First, our dataset, which is from
' Computer-Aided Multivariate Analysis, 4th Edition
' by A. A. Afifi, V. Clark and S. May, chapter 16
' See http://www.ats.ucla.edu/stat/Stata/examples/cama4/default.htm
Dim frame = StataFile.ReadDataFrame("..\..\..\..\..\Data\companies.dta")
'
' Hierarchical cluster analysis
'
Console.WriteLine("Hierarchical clustering")
' Create the model:
Dim columns = {"ror5", "de", "salesgr5", "eps5", "npm1", "pe", "payoutr1"}
Dim hc = New HierarchicalClusterAnalysis(frame, columns)
' Alternatively, we could use a formula to specify the variables
Dim formula = "ror5 + de + salesgr5 + eps5 + npm1 + pe + payoutr1"
hc = New HierarchicalClusterAnalysis(frame, formula)
' Rescale the variables to their Z-scores before doing the analysis:
hc.Standardize = True
' The linkage method defaults to centroid:
hc.LinkageMethod = LinkageMethod.Centroid
' We could set the distance measure. We use the default:
hc.DistanceMeasure = DistanceMeasures.SquaredEuclideanDistance
' Fit the model:
hc.Fit()
' We can partition the cases into clusters:
Dim partition As HierarchicalClusterCollection = hc.GetClusterPartition(5)
' Individual clusters are accessed through an index, or through enumeration.
For Each cluster As HierarchicalCluster In partition
Console.WriteLine("Cluster {0} has {1} members.", cluster.Index, cluster.Size)
Next
' And get the indexes of the observations in a single cluster:
Dim indexes = partition(3).MemberIndexes
Console.WriteLine($"Number of items in cluster 3: {indexes.Length}")
' Get a variable that shows memberships:
Dim memberships = partition.GetMemberships()
For i As Integer = 15 To memberships.Length - 1
Console.WriteLine("Observation {0} belongs to cluster {1}", i, memberships.GetLevelIndex(i))
Next i
' A dendrogram is a graphical representation of the clustering in the form of a tree.
' You can get all the information you need to draw a dendrogram starting from
' the root node of the dendrogram:
Dim root As DendrogramNode = hc.DendrogramRoot
' Position and DistanceMeasure give the x and y coordinates:
Console.WriteLine("Root position: ({0:F4}, {1:F4})", root.Position, root.DistanceMeasure)
' The left and right children:
Console.WriteLine($"Position of left child: {root.LeftChild.Position:F4}")
Console.WriteLine($"Position of right child: {root.RightChild.Position:F4}")
' You can also get a filter that defines a sort order suitable for
' drawing the dendrogram:
Dim sortOrder = hc.GetDendrogramOrder()
Console.WriteLine()
'
' K-Means Clustering
'
Console.WriteLine("K-means clustering")
' Create the model. We need to specify the number of clusters up front:
Dim kmc As New KMeansClusterAnalysis(frame, columns, 3)
' Rescale the variables to their Z-scores before doing the analysis:
kmc.Standardize = True
' Fit the model:
kmc.Fit()
' The Predictions property is a categorical vector that contains
' the cluster assignments.
Dim predictions = kmc.Predictions
' The GetDistancesToCenters method returns a vector containing
' the distance of each observation to its cluster center.
Dim distances = kmc.GetDistancesToCenters()
' For example
For i = 18 To predictions.Length - 1
Console.WriteLine("Observation {0} belongs to cluster {1}, distance: {2:F4}.",
i, predictions(i), distances(i))
Next
' You can use this to compute several statistics
Dim Descriptives = distances.SplitBy(predictions).
Map(Function(x) New Descriptives(Of Double)(x))
' Individual clusters are accessed through an index, or through enumeration.
For i = 0 To Descriptives.Length - 1
Console.WriteLine("Cluster {0} has {1} members. Sum of squares: {2:F4}",
i, Descriptives(i).Count, Descriptives(i).SumOfSquares)
Console.WriteLine($"Center: {kmc.Clusters(i):F4}")
Next
' The distances between clusters are also available
Console.WriteLine(kmc.GetClusterDistances().ToString("F4"))
' You can get a filter for the observations in a single cluster.
' This uses the GetIndexes method of categorical vectors.
Dim level1Indexes = kmc.Predictions.GetIndexes(1).ToArray()
Console.WriteLine($"Number of items in cluster 1: {level1Indexes.Length}")
Console.Write("Press Enter to exit.")
Console.ReadLine()
End Sub
End Module
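To make the K-means output in the sample more concrete, here is a minimal sketch of the assignment step in plain Visual Basic, independent of Numerics.NET. It assigns each observation to the nearest center by squared Euclidean distance, which is conceptually what the cluster assignments and distances to centers describe. All names (KMeansAssignmentSketch, NearestCenter, SquaredDistance) are hypothetical, and the centers and points are made-up illustration data. Whether the library reports plain or squared distances is not stated in the sample, so the use of squared distance here is an assumption.

```vb
Option Strict On
Imports System

' Hypothetical helper module for illustration only; not part of Numerics.NET.
Module KMeansAssignmentSketch

    ' Squared Euclidean distance between two points of equal length.
    Function SquaredDistance(a As Double(), b As Double()) As Double
        Dim total As Double = 0.0
        For i As Integer = 0 To a.Length - 1
            Dim diff As Double = a(i) - b(i)
            total += diff * diff
        Next
        Return total
    End Function

    ' Returns the index of the center closest to the given point.
    Function NearestCenter(point As Double(), centers As Double()()) As Integer
        Dim best As Integer = 0
        Dim bestDistance As Double = Double.MaxValue
        For c As Integer = 0 To centers.Length - 1
            Dim d As Double = SquaredDistance(point, centers(c))
            If d < bestDistance Then
                bestDistance = d
                best = c
            End If
        Next
        Return best
    End Function

    Sub Main()
        ' Made-up centers and observations, for illustration only.
        Dim centers As Double()() = {New Double() {0.0, 0.0}, New Double() {10.0, 10.0}}
        Dim points As Double()() = {New Double() {1.0, 2.0}, New Double() {9.0, 8.0}}
        For Each p As Double() In points
            Dim k As Integer = NearestCenter(p, centers)
            Console.WriteLine("Assigned to cluster {0}, squared distance {1}",
                              k, SquaredDistance(p, centers(k)))
        Next
    End Sub
End Module
```

With these values, the first point is assigned to cluster 0 and the second to cluster 1, each at a squared distance of 5 from its center. A full K-means implementation alternates this assignment step with recomputing each center as the mean of its members until the assignments stop changing.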