Testing for Outliers
An outlier is an observation that appears out of place in a sample. An outlier may be present for several reasons: it may be an error in the data, it may be an unlikely but random variation, or it may indicate that some model assumptions are incorrect.
This section describes two tests for outliers in samples from a normal distribution.
Grubbs' Test
Grubbs' test is the test of choice for a single outlier. The test has one-sided and two-sided versions. In the one-sided version, the null hypothesis is that the smallest (one-tailed lower) or largest (one-tailed upper) value in the sample is not an outlier. The alternative hypothesis is that this value is an outlier.
In the two-sided version, the null hypothesis is that neither the smallest nor the largest value is an outlier. The alternative hypothesis is that one of these values is an outlier.
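For reference, the statistic and rejection rule usually given for Grubbs' test can be written as follows. This is the textbook formulation, not something taken from the library: the symbols are introduced here for illustration, with n the sample size, \bar{x} the sample mean, s the sample standard deviation (n - 1 denominator), and t_{p,\nu} the upper critical value of the t distribution with \nu degrees of freedom at tail probability p. It is assumed, not confirmed, that the GrubbsTest class computes its values this way.

```latex
G = \frac{\max_i |x_i - \bar{x}|}{s} \quad \text{(two-sided)}, \qquad
G = \frac{x_{\max} - \bar{x}}{s} \quad \text{(one-sided upper)}, \qquad
G = \frac{\bar{x} - x_{\min}}{s} \quad \text{(one-sided lower)}

\text{Reject } H_0 \text{ at significance level } \alpha \text{ if} \quad
G > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t_{p,\,n-2}^{2}}{n - 2 + t_{p,\,n-2}^{2}}},
\qquad
p = \begin{cases} \alpha / (2n) & \text{two-sided} \\ \alpha / n & \text{one-sided} \end{cases}
```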
Grubbs' test is implemented by the GrubbsTest class. It has three constructors. The first constructor has no arguments; the sample and other test properties must then be set manually. A two-tailed test with a significance level of 0.05 is assumed by default.
The second constructor takes as its only argument a vector that contains the sample that is to be tested for outliers. The third constructor takes a second argument that specifies whether the test is one-tailed or two-tailed.
The example below tests whether the largest value in a sample of 8 mass spectrometer measurements of a uranium isotope is an outlier:
// Eight measurements; the last value is the suspected outlier.
var grubbsSample = Vector.Create(199.31, 199.53,
    200.19, 200.82, 201.92, 201.95, 202.18, 245.57);
// Test only the largest value (one-tailed upper).
var grubbs = new GrubbsTest(grubbsSample, HypothesisType.OneTailedUpper);
Console.WriteLine("Grubbs G: {0:F5}", grubbs.Statistic);
Console.WriteLine(" Crit.: {0:F5}", grubbs.GetUpperCriticalValue());
Console.WriteLine(" Reject:{0}", grubbs.Reject());
The value of the test statistic is 2.46876, which is greater than the critical value of 2.03165. We therefore reject the null hypothesis and conclude that the largest value is an outlier at the 0.05 significance level.
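The reported statistic can be cross-checked without the library. The sketch below computes the one-sided upper Grubbs statistic, (maximum - mean) / sample standard deviation, using only the .NET base class library. The class and method names (GrubbsByHand, UpperStatistic) are hypothetical and exist only for this illustration. The critical value is not reproduced, because it requires a quantile of the t distribution.

```csharp
using System;
using System.Linq;

public static class GrubbsByHand
{
    // One-sided upper Grubbs statistic: (max - mean) / sample standard deviation.
    public static double UpperStatistic(double[] x)
    {
        double mean = x.Average();
        // Sample standard deviation with the n - 1 denominator.
        double s = Math.Sqrt(x.Sum(v => (v - mean) * (v - mean)) / (x.Length - 1));
        return (x.Max() - mean) / s;
    }

    public static void Main()
    {
        double[] x = { 199.31, 199.53, 200.19, 200.82, 201.92, 201.95, 202.18, 245.57 };
        Console.WriteLine("G = {0:F5}", UpperStatistic(x));
    }
}
```

The printed value should agree with the statistic of 2.46876 reported by the GrubbsTest object.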
The Generalized ESD Test
Grubbs' test can only test for a single outlier. When multiple outliers may be present, the Generalized Extreme Studentized Deviate (ESD) test is appropriate. It consists of a sequence of tests similar to Grubbs' test for a specific number of outliers from 1 to a supplied maximum.
Like Grubbs' test, the generalized ESD test has one-sided and two-sided versions. In the one-sided version, the null hypothesis is that the smallest (one-tailed lower) or largest (one-tailed upper) values are not outliers. The alternative hypothesis is that there are up to the specified number of outliers.
In the two-sided version, the null hypothesis is that there are no outliers at either end of the sample. The alternative hypothesis is that there are up to the specified number of outliers.
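For reference, the two-sided generalized ESD procedure is usually stated as follows. This is the textbook formulation with notation introduced here for illustration: n is the sample size, r the maximum number of outliers, and t_{p,\nu} the upper critical value of the t distribution with \nu degrees of freedom at tail probability p. The mean \bar{x}^{(i)} and standard deviation s^{(i)} are computed after the i - 1 most extreme observations have been removed. It is assumed, not confirmed, that the library follows exactly this formulation.

```latex
R_i = \frac{\max_j \left| x_j - \bar{x}^{(i)} \right|}{s^{(i)}},
\qquad i = 1, \dots, r

\lambda_i = \frac{(n - i)\, t_{p_i,\, n-i-1}}
                 {\sqrt{\left(n - i - 1 + t_{p_i,\, n-i-1}^{2}\right)\left(n - i + 1\right)}},
\qquad p_i = \frac{\alpha}{2\,(n - i + 1)}

\text{number of outliers} = \max \{\, i : R_i > \lambda_i \,\}
\quad (\text{0 if no } R_i \text{ exceeds } \lambda_i)
```

For i = 1 the critical value \lambda_1 reduces to the two-sided Grubbs critical value, which is why each step resembles a Grubbs' test on the reduced sample.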
The generalized ESD test is implemented by the GeneralizedEsdTest class. It has three constructors. The first constructor has no arguments; the sample and other test properties must then be set manually. A two-tailed test with a significance level of 0.05 is assumed by default. By default, the number of outliers to test for is the smaller of 10 and half the number of observations in the sample.
The second constructor takes two arguments. The first is a vector that contains the sample that is to be tested for outliers. The second argument is the number of outliers to test for. This must be at least 1. The third constructor takes a third argument that specifies whether the test is one-tailed or two-tailed.
The example below tests a sample of 54 observations for up to 10 outliers.
var sample = Vector.Create(
-0.25, 0.68, 0.94, 1.15, 1.20, 1.26, 1.26, 1.34,
1.38, 1.43, 1.49, 1.49, 1.55, 1.56, 1.58, 1.65,
1.69, 1.70, 1.76, 1.77, 1.81, 1.91, 1.94, 1.96,
1.99, 2.06, 2.09, 2.10, 2.14, 2.15, 2.23, 2.24,
2.26, 2.35, 2.37, 2.40, 2.47, 2.54, 2.62, 2.64,
2.90, 2.92, 2.92, 2.93, 3.21, 3.26, 3.30, 3.59,
3.68, 4.30, 4.64, 5.34, 5.42, 6.01);
var test = new GeneralizedEsdTest(sample, 10, HypothesisType.TwoTailed);
Because the generalized ESD test is, in fact, a collection of tests, the interpretation of the results is slightly different.
Running the test consists of running an individual test for each number of outliers from 1 to the specified maximum. The largest number for which the test is significant is retained, and the results of that test are used as the overall test results.
The NumberOfOutliers property returns the number of outliers that were found. Its value ranges from 0 to the maximum. The GetOutlierIndexes method returns a vector containing the indexes in the sample of any outliers that were found. The method optionally takes a significance level (for example 0.05) to use for the detection. Below we print the number of outliers and the values of the outliers from the example above:
Console.WriteLine("Number of outliers: {0}", test.NumberOfOutliers);
var outliers = sample[test.GetOutlierIndexes()];
Console.WriteLine("Outliers: {0}", outliers);
Individual tests can be accessed through the GetTest(Int32) method, which takes as its only argument the number of outliers. Since we detected 3 outliers in this sample, we might want to look at the test for 4 outliers:
var test4 = test.GetTest(4);
Console.WriteLine(test4.Summarize());
This shows that for 4 outliers, the value of the test statistic (2.8102) is less than the critical value (3.1362), so the test for 4 outliers is not significant.
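The sequence of test statistics can also be reproduced by hand. The sketch below is a minimal illustration of the two-sided procedure using only the .NET base class library: at each step it finds the observation farthest from the mean, records its distance in units of the sample standard deviation, and removes it before the next step. The names EsdByHand and Statistics are hypothetical. Critical values are omitted because they require a quantile of the t distribution, and maxOutliers is assumed to be well below the sample size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EsdByHand
{
    // Returns the two-sided ESD statistics for 1, 2, ..., maxOutliers outliers.
    public static double[] Statistics(IEnumerable<double> sample, int maxOutliers)
    {
        var data = sample.ToList();
        var r = new double[maxOutliers];
        for (int i = 0; i < maxOutliers; i++)
        {
            double mean = data.Average();
            // Sample standard deviation (n - 1 denominator) of the remaining data.
            double s = Math.Sqrt(data.Sum(v => (v - mean) * (v - mean)) / (data.Count - 1));
            // Locate the remaining observation farthest from the mean.
            int extreme = 0;
            for (int j = 1; j < data.Count; j++)
                if (Math.Abs(data[j] - mean) > Math.Abs(data[extreme] - mean))
                    extreme = j;
            r[i] = Math.Abs(data[extreme] - mean) / s;
            // Remove it before computing the next statistic.
            data.RemoveAt(extreme);
        }
        return r;
    }

    public static void Main()
    {
        double[] sample = {
            -0.25, 0.68, 0.94, 1.15, 1.20, 1.26, 1.26, 1.34,
            1.38, 1.43, 1.49, 1.49, 1.55, 1.56, 1.58, 1.65,
            1.69, 1.70, 1.76, 1.77, 1.81, 1.91, 1.94, 1.96,
            1.99, 2.06, 2.09, 2.10, 2.14, 2.15, 2.23, 2.24,
            2.26, 2.35, 2.37, 2.40, 2.47, 2.54, 2.62, 2.64,
            2.90, 2.92, 2.92, 2.93, 3.21, 3.26, 3.30, 3.59,
            3.68, 4.30, 4.64, 5.34, 5.42, 6.01 };
        double[] r = Statistics(sample, 10);
        for (int i = 0; i < r.Length; i++)
            Console.WriteLine("R_{0} = {1:F4}", i + 1, r[i]);
    }
}
```

The fourth value printed should agree with the statistic of 2.8102 shown in the summary for 4 outliers.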