Histograms

A histogram is a table that tallies how often data values occur. Each data value is mapped to a bin, and the histogram itself is a vector of real numbers (the bin totals) labeled by the bins through its index, the bin index. The Histogram<T> class represents a histogram, where the generic type argument defines the type of the bins. For categorical data, there is one bin for every category, and the type of the bins is the same as the type of the data. For continuous variables (real or date/time), the bins are defined by intervals: the bins are of type Interval<T>, and the bin index is of type IntervalIndex<T>.

Constructing histograms

There are two basic ways to create a histogram: you can create an empty histogram that is ready to receive data to tally, or you can create a histogram directly from a data source, in which case the data is tallied as the histogram is built.

Constructing empty histograms

Empty histograms are created using one of the three overloads of the Histogram.CreateEmpty method. The first two overloads can be used for continuous data. The third can be used for both continuous and categorical data.

The first overload takes one or two arguments. The first is a list of boundaries for the bins. The optional second argument is a SpecialBins value that specifies whether to create bins for values smaller than the lowest bound or larger than the highest bound. The following table lists the possible values:

Values of the SpecialBins enumeration

Name            Description
None            No special bins are included.
BelowMinimum    There is a special bin for values below the scale's minimum value.
AboveMaximum    There is a special bin for values above the scale's maximum value.

If the BelowMinimum bin is included, this bin is the first bin in the collection. If the AboveMaximum bin is included, it is the last bin in the collection. The following creates two empty histograms. The second has a bin for values smaller than 50:

C#
double[] bounds = new double[] { 50, 62, 74, 88, 100 };
var histogram1 = Histogram.CreateEmpty(bounds);
var histogram2 = Histogram.CreateEmpty(bounds, SpecialBins.BelowMinimum);
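
To make the role of the boundaries and the special bins concrete, here is a minimal sketch in plain C# of how a value could be mapped to a bin position. It is not the library's implementation: FindBinIndex and sketchBounds are hypothetical names, and the sketch assumes that each bin includes its lower bound and excludes its upper bound, which may differ from the convention the library uses.

```csharp
using System;

// Illustrative sketch only, NOT the library's implementation.
// Assumption: each bin includes its lower bound and excludes its upper bound.
// Returns the position of the bin that receives the value, or -1 if no bin does.
static int FindBinIndex(double[] bounds, double value, bool belowMinimum, bool aboveMaximum)
{
    int offset = belowMinimum ? 1 : 0;      // a below-minimum bin, if present, comes first
    int regularBins = bounds.Length - 1;    // n boundaries define n - 1 regular bins

    if (value < bounds[0])
        return belowMinimum ? 0 : -1;
    if (value >= bounds[regularBins])
        return aboveMaximum ? offset + regularBins : -1;   // an above-maximum bin comes last

    for (int i = 0; i < regularBins; i++)
        if (value < bounds[i + 1])
            return offset + i;
    return -1;                              // only reached for NaN
}

double[] sketchBounds = { 50, 62, 74, 88, 100 };
Console.WriteLine(FindBinIndex(sketchBounds, 70, false, false));  // 1
Console.WriteLine(FindBinIndex(sketchBounds, 70, true, false));   // 2
Console.WriteLine(FindBinIndex(sketchBounds, 45, true, false));   // 0
Console.WriteLine(FindBinIndex(sketchBounds, 45, false, false));  // -1
```

When the below-minimum bin is present, every regular bin shifts one position to the right, which is why the value 70 lands in position 2 instead of 1.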

The second overload takes three or four arguments. The first two are the lower bound of the lowest bin and the upper bound of the highest bin. The third argument is the total number of bins. This creates an empty histogram with the specified number of bins, all equal in width. The optional fourth argument is once again a SpecialBins value that indicates which special bins should be included in addition to those within the specified interval. The code below creates a histogram with five bins for values between 50 and 100:

C#
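// Second overload: lower bound, upper bound, and number of bins.
var histogram3 = Histogram.CreateEmpty(50.0, 100.0, 5);
// The same bins, created first as an interval index and then passed to CreateEmpty.
var bins = new IntervalIndex<double>(50.0, 100.0, 5);
var histogram4 = Histogram.CreateEmpty(bins);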
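
As a rough illustration of what "equal in width" means, the following plain C# sketch computes the boundaries that a lower bound, an upper bound, and a bin count imply. EqualWidthBounds is a hypothetical helper written for this example only. It does not call the library and is not necessarily how the library computes its boundaries.

```csharp
using System;

// Illustrative sketch only: computes equal-width bin boundaries by hand.
static double[] EqualWidthBounds(double lower, double upper, int binCount)
{
    var bounds = new double[binCount + 1];      // n bins need n + 1 boundaries
    double width = (upper - lower) / binCount;
    for (int i = 0; i < binCount; i++)
        bounds[i] = lower + i * width;
    bounds[binCount] = upper;                   // set exactly to avoid rounding drift
    return bounds;
}

Console.WriteLine(string.Join(", ", EqualWidthBounds(50.0, 100.0, 5)));
// prints: 50, 60, 70, 80, 90, 100
```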

The third overload takes one argument: the Index<T> that contains the labels for the bins. This overload is suitable for categorical data, and also for continuous data when the supplied index is a previously created IntervalIndex<T>:

C#
var index = Index.Create(new[] { "High", "Medium", "Low" });
var histogram5 = Histogram.CreateEmpty(index);

Constructing histograms from data

The Histogram class defines an extension method, CreateHistogram, that transforms the data in a vector into a histogram. This method has many overloads, which mirror to some degree the overloads of the CreateEmpty method.

For categorical data, the overload takes as its only argument a categorical vector. The method returns a count of each value in the vector's category index:

C#
var data = Vector.CreateCategorical(
    new[] { "High", "Low", "High", "High", "Medium", "Low" });
var histogram4 = data.CreateHistogram();
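
For intuition, the tally that a categorical histogram holds can be reproduced with a dictionary in plain C#. This is only a sketch that does not use the library: the names categories, observations, and counts are hypothetical, and the order of the categories is chosen for the example and may not match the order of the library's category index.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only: a hand-rolled tally, no library types involved.
string[] categories = { "High", "Medium", "Low" };   // hypothetical category order
string[] observations = { "High", "Low", "High", "High", "Medium", "Low" };

var counts = new Dictionary<string, double>();
foreach (var category in categories)
    counts[category] = 0.0;            // every category gets a bin, even if it stays empty
foreach (var observation in observations)
    counts[observation] += 1.0;        // each occurrence adds 1 to its category's total

foreach (var category in categories)
    Console.WriteLine($"{category}: {counts[category]}");
// prints:
// High: 3
// Medium: 1
// Low: 2
```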

Three more overloads work on continuous data. Each of these takes a list of values as its first argument. This may be a vector, an array, or any other type that implements IList<T>. The first overload takes three additional arguments: the lower bound, the upper bound, and the number of bins. Optionally, a SpecialBins value may be supplied as a fourth additional argument that determines which special bins to include in the histogram.

C#
var values = new double[]
    {62.0, 77.0, 61.0, 94.0, 75.0, 82.0, 86.0, 83.0, 64.0, 84.0,
     68.0, 82.0, 72.0, 71.0, 85.0, 66.0, 61.0, 79.0, 81.0, 73.0};
var histogram1 = values.CreateHistogram(50.0, 100.0, 5);

Another overload takes one additional argument: an IntervalIndex<T> that specifies the bin index for the histogram.

C#
var bins = Index.CreateBins(50.0, 100.0, 5);
var histogram2 = values.CreateHistogram(bins);

The last overload is like the previous one, but takes an additional vector argument that specifies a weight for each value. The total of the bin for each value is incremented by the corresponding weight instead of by 1:

C#
var weights = Vector.CreateRandom(20);
var histogram3 = values.CreateHistogram(bins, weights);
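
The effect of the weights can be sketched in plain C# as follows. TallyWeighted, sampleValues, and sampleWeights are hypothetical names for this example. The sketch assumes equal-width bins that include their lower bound and exclude their upper bound, with no special bins, and it is not the library's implementation.

```csharp
using System;

// Illustrative sketch only: weighted tallying into equal-width bins.
// Assumption: bins are [lower, upper) and values outside the range are skipped.
static double[] TallyWeighted(double[] values, double[] weights,
    double lower, double upper, int binCount)
{
    var totals = new double[binCount];
    double width = (upper - lower) / binCount;
    for (int i = 0; i < values.Length; i++)
    {
        // Skip values outside the range (this test also rejects NaN).
        if (!(values[i] >= lower && values[i] < upper))
            continue;
        int bin = Math.Min((int)((values[i] - lower) / width), binCount - 1);
        totals[bin] += weights[i];     // add the weight instead of 1
    }
    return totals;
}

double[] sampleValues = { 62.0, 77.0, 61.0, 94.0 };
double[] sampleWeights = { 0.5, 2.0, 1.5, 1.0 };
double[] weightedTotals = TallyWeighted(sampleValues, sampleWeights, 50.0, 100.0, 5);
Console.WriteLine(string.Join(", ", weightedTotals));
// prints: 0, 2, 2, 0, 1
```

The values 62 and 61 both fall in the second bin, so its total is the sum of their weights, 0.5 + 1.5 = 2.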

Tabulating Data

There are three ways to set the totals for the bins in a histogram.

The first way is to use the Increment method. This method takes one or two arguments. The first argument is the value to tabulate. The optional second argument is a weight, which is assumed to be 1 if it is not specified. The method adds the weight to the total of the bin that contains the value.

C#
histogram1.Increment(83.0);
histogram1.Increment(78.0, 2.5);
histogram5.Increment("High");
histogram5.Increment("Medium", 4.4);

The second way is to use the Tabulate method. This method tabulates the data specified in its first argument. This can be any list of values. An optional second argument specifies the weight for each data value. This argument is of the same type as the first argument.

C#
var data = new double[]
    {62, 77, 61, 94, 75, 82, 86, 83, 64, 84,
     68, 82, 72, 71, 85, 66, 61, 79, 81, 73};
histogram2.Tabulate(data);

Finally, you can set the totals of all bins directly using the SetTotals method. This method takes a vector of real numbers as its only argument. The length of this vector must be equal to the number of bins. The method sets the total of each bin to the corresponding value in the vector.

The AddTotals(Vector<Double>) method is similar, but adds the totals specified by the argument to the bin totals.

C#
var totals = Vector.Create(2.0, 7.0, 9.0, 8.0, 1.0);
// histogram1.SetTotals(totals);
totals.CopyTo(histogram1);
// histogram2.AddTotals(totals);
histogram2.AddInPlace(totals);

To set all totals to zero, use the Clear() method.

Histogram Bins

Individual bins are represented by Interval<T> objects, which have a LowerBound and an UpperBound property. Together, these define the interval that is covered by the bin. The Width property returns the total width of the bin. Note that this may be infinite. All these properties are read-only.

You can use a foreach loop to iterate through a histogram's bins and their totals:

C#
foreach (var pair in histogram1.BinsAndValues)
    Console.WriteLine("{0}-{1}: total = {2}",
        pair.Key.LowerBound, pair.Key.UpperBound, pair.Value);

You can find the bin corresponding to a specific value through the FindBin method. This returns the Interval<T> that contains its argument.

Other Properties and Methods

The TotalValue property returns the sum of all totals in all bins. The GetTotals() method returns a Double array containing the totals for each bin.

The GoodnessOfFitTest method returns a ChiSquareGoodnessOfFitTest object that can be used to test the hypothesis that the data in the histogram follows a certain distribution. The method takes two arguments. The first is a ContinuousDistribution object that specifies the distribution to test against. The second is an integer that specifies the number of parameters of the distribution that were estimated from the data. Each estimated parameter reduces the degrees of freedom by one.
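
To show what the test measures and how estimated parameters affect the degrees of freedom, here is a hand computation in plain C# of the standard Pearson chi-square statistic. It is a sketch under stated assumptions and does not use ChiSquareGoodnessOfFitTest: ChiSquareStatistic, DegreesOfFreedom, and the expected totals are hypothetical, and the expected totals are invented for the example rather than derived from a real distribution.

```csharp
using System;

// Illustrative sketch only: Pearson chi-square statistic computed by hand.
static double ChiSquareStatistic(double[] observed, double[] expected)
{
    double statistic = 0.0;
    for (int i = 0; i < observed.Length; i++)
    {
        double difference = observed[i] - expected[i];
        statistic += difference * difference / expected[i];
    }
    return statistic;
}

// Standard rule: (number of bins - 1), minus one for each estimated parameter.
static int DegreesOfFreedom(int binCount, int estimatedParameters)
    => binCount - 1 - estimatedParameters;

double[] observedTotals = { 2.0, 7.0, 9.0, 8.0, 1.0 };   // bin totals, sum 27
double[] expectedTotals = { 3.0, 6.0, 9.0, 6.0, 3.0 };   // hypothetical expected totals, sum 27

double chiSquare = ChiSquareStatistic(observedTotals, expectedTotals);
int degreesOfFreedom = DegreesOfFreedom(observedTotals.Length, 2);

Console.WriteLine($"statistic is about {chiSquare:F2}, degrees of freedom = {degreesOfFreedom}");
```

With five bins and two estimated parameters (for example, the mean and standard deviation of a normal distribution), two degrees of freedom remain. The statistic for these numbers is approximately 2.5.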