Kernel Density Estimation

Kernel density estimation (KDE) is a method for estimating the probability density function of a random variable from a sample. The estimated density is the sum of appropriately scaled kernels, one centered on each observation. The bandwidth specifies how far the influence of each observation extends in the density estimate.
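For reference, the estimator described above is usually written as follows. The notation is ours and is not taken from the library: x_1, ..., x_n are the observations, K is the kernel, and h > 0 is the bandwidth.

```latex
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
```

Each observation contributes one copy of the kernel, centered at x_i and stretched by h, so a larger h spreads each observation's contribution over a wider interval.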

Kernel density estimation is implemented by the KernelDensity class.

In the code examples, we will repeatedly use a sample generated from a mixture of two Gaussian distributions:

C#
``````var norm1 = new NormalDistribution(-1, 1);
var norm2 = new NormalDistribution(1, 0.3);
var X = Vector.Join(norm1.Sample(400), norm2.Sample(100));``````

Kernels

A kernel is a non-negative function with mean 0 that integrates to 1. Kernels are implemented by the Kernel class. The KernelDensity class provides several fields that represent common kernels, including the GaussianKernel, EpanechnikovKernel, and TriweightKernel fields used in the examples in this section.
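To make the definition concrete, here is a minimal standalone sketch of three textbook kernels as plain functions. The class and method names (KernelSketch, Area, and so on) are hypothetical and exist only for illustration; this is not the library's Kernel class, and the library may scale its kernels differently.

```csharp
using System;

// Standalone illustration only: these are NOT the library's Kernel objects.
public static class KernelSketch
{
    // Gaussian kernel: the standard normal density.
    public static double Gaussian(double u) =>
        Math.Exp(-0.5 * u * u) / Math.Sqrt(2.0 * Math.PI);

    // Epanechnikov kernel: 3/4 (1 - u^2) on [-1, 1], zero elsewhere.
    public static double Epanechnikov(double u) =>
        Math.Abs(u) <= 1.0 ? 0.75 * (1.0 - u * u) : 0.0;

    // Triweight kernel: 35/32 (1 - u^2)^3 on [-1, 1], zero elsewhere.
    public static double Triweight(double u)
    {
        if (Math.Abs(u) > 1.0) return 0.0;
        double t = 1.0 - u * u;
        return 35.0 / 32.0 * t * t * t;
    }

    // Midpoint-rule approximation of the area under a kernel on [a, b].
    public static double Area(Func<double, double> kernel, double a, double b,
        int steps = 100000)
    {
        double width = (b - a) / steps;
        double sum = 0.0;
        for (int i = 0; i < steps; i++)
            sum += kernel(a + (i + 0.5) * width);
        return sum * width;
    }
}
```

Calling KernelSketch.Area(KernelSketch.Epanechnikov, -1, 1) gives a value very close to 1, which is the "area 1" property. All three functions are symmetric around 0, which gives them mean 0.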

Bandwidth

The bandwidth is the parameter of the kernel density estimate that determines how far the influence of each observation reaches. Choosing a good value is important. When the bandwidth is too large, the estimate is oversmoothed and important features of the true density, such as separate modes, may be missed. When the bandwidth is too small, the estimated density is very noisy.
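The effect of the bandwidth can be seen with a small from-scratch estimator. The sketch below does not use the library; the KdeSketch class, its Density method, and the sample values are all hypothetical and chosen only to illustrate the point. It evaluates a Gaussian-kernel estimate in the gap between two clusters for three bandwidths.

```csharp
using System;

// Standalone illustration only: a direct O(n) evaluation, not the library's method.
public static class KdeSketch
{
    // Gaussian kernel: the standard normal density.
    static double Gaussian(double u) =>
        Math.Exp(-0.5 * u * u) / Math.Sqrt(2.0 * Math.PI);

    // Kernel density estimate at x: (1 / (n h)) * sum of K((x - x_i) / h).
    public static double Density(double[] data, double x, double h)
    {
        if (h <= 0.0) throw new ArgumentOutOfRangeException(nameof(h));
        double sum = 0.0;
        foreach (double xi in data)
            sum += Gaussian((x - xi) / h);
        return sum / (data.Length * h);
    }

    public static void Main()
    {
        // Hypothetical sample with two clusters, one near -1 and one near 1.
        double[] data = { -1.3, -1.1, -0.9, -0.7, 0.9, 1.0, 1.1 };

        // Evaluate the estimate at x = 0, in the gap between the clusters.
        foreach (double h in new[] { 0.05, 0.3, 2.0 })
            Console.WriteLine($"h = {h}: density at 0 = {Density(data, 0.0, h):F4}");
    }
}
```

With the smallest bandwidth the estimate at 0 is essentially zero, because no observation is close enough to contribute. With the largest bandwidth the two clusters blur together and the gap between them is no longer visible in the estimate.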

The bandwidth can be supplied directly to the kernel estimation method, or it can be estimated automatically. Three methods are available, enumerated by the KernelDensityBandwidthEstimator enumeration:

| Value | Description |
| --- | --- |
| NormalReference | The bandwidth is chosen so it minimizes the integrated square error for normal data. |
| Silverman | Use Silverman's rule of thumb. |
| Scott | Use Scott's rule of thumb. |
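As background, the following are commonly quoted textbook forms for a Gaussian kernel. They are given here only as an assumption about what such rules look like; the constants used by the library, and its handling of other kernels, may differ. Here n is the sample size, \hat{\sigma} is the sample standard deviation, and IQR is the interquartile range.

```latex
% Normal reference bandwidth for a Gaussian kernel
h_{\text{ref}} = \left(\frac{4}{3n}\right)^{1/5} \hat{\sigma} \approx 1.06\, \hat{\sigma}\, n^{-1/5}

% Silverman's rule of thumb
h_{\text{Silverman}} = 0.9 \, \min\!\left(\hat{\sigma}, \frac{\mathrm{IQR}}{1.34}\right) n^{-1/5}
```

Both rules shrink the bandwidth at the rate n^{-1/5} as the sample grows. Statements of Scott's rule vary between references in the constant they use, so no single form is given for it here.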

The EstimateBandwidth(Vector<Double>, Kernel, KernelDensityBandwidthEstimator) method returns an estimate of the bandwidth for the specified input. It takes three arguments. The first is a Vector<T> that specifies the data on which the density estimate will be based. The second argument is the kernel. The third argument is a KernelDensityBandwidthEstimator value that specifies the estimation method.

The KernelDensity class also has a dedicated method for each of the three techniques. The SilvermanBandwidth(Vector<Double>) and ScottBandwidth(Vector<Double>) methods return the bandwidth using Silverman's and Scott's rule of thumb, respectively. These methods take one argument: a Vector<T> that specifies the data on which the density estimate will be based. The NormalReferenceBandwidth(Vector<Double>, Kernel) method returns the normal reference bandwidth. It takes two arguments: a Vector<T> that specifies the data on which the density estimate will be based, and the kernel.

In the code below, we compute the normal reference bandwidth for our sample for a Gaussian kernel. We also compute the bandwidth using Silverman's rule of thumb:

C#
``````var bwRef = KernelDensity.NormalReferenceBandwidth(X, KernelDensity.GaussianKernel);
var bwSilverman = KernelDensity.EstimateBandwidth(X, KernelDensity.GaussianKernel,
    KernelDensityBandwidthEstimator.Silverman);``````

Computing Kernel Density Estimates

The Estimate method computes the estimated density for one value or a range of values. This method takes up to 5 arguments. The first is a Vector<T> that contains the observations for which the density is to be estimated. The second argument is the kernel. The third argument is the value at which to evaluate the density. If a scalar is supplied, then the density at this value is returned. If a vector is supplied, then a vector of the densities at each value of the vector is returned.

The remaining arguments are all optional. The fourth argument is the bandwidth. If it is omitted, the bandwidth is estimated using the method specified by the KernelDensityBandwidthEstimator value passed as the fifth argument. The default is to use the normal reference bandwidth. The final argument is an adjustment factor that multiplies the estimated bandwidth. This is useful when you want to specify the bandwidth as a fraction of an estimated bandwidth. The estimator and the adjustment factor are both ignored if the bandwidth is provided explicitly.

In the next example, we compute three different kernel density estimates. First, we use a Gaussian kernel and use the Silverman bandwidth we found earlier. Then we use an Epanechnikov kernel using Scott's rule to get the bandwidth. Finally, we use a tri-weight kernel and for the bandwidth we use half the normal reference bandwidth:

C#
``````var density1 = KernelDensity.Estimate(X, KernelDensity.GaussianKernel, bwSilverman);
var density2 = KernelDensity.Estimate(X, KernelDensity.EpanechnikovKernel,
    bandwidthEstimator: KernelDensityBandwidthEstimator.Scott);
var density3 = KernelDensity.Estimate(X, KernelDensity.TriweightKernel,``````

C#
``````var dist1 = KernelDensity.EstimateDistribution(X, KernelDensity.GaussianKernel);