ANOVA Models

The term "analysis of variance" (ANOVA) covers a family of techniques for identifying and measuring the sources of variation in data. Specifically, ANOVA procedures partition the total variation in a data set into its component parts.
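As a rough illustration of what partitioning the variation means, the following Python sketch splits the total sum of squares of a small data set into a between-group part and a within-group part. It uses only the standard library and does not use the Extreme Numerics.NET API; the data and all variable names are made up for the example.

```python
from statistics import mean

# Hypothetical data: observations grouped by the levels of a single factor.
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [1.0, 2.0, 3.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Total variation: squared deviations of every observation from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Variation attributed to the factor: group means versus the grand mean.
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Residual variation: observations versus the mean of their own group.
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

print(ss_total, ss_between, ss_within)
```

For this data the between-group and within-group sums of squares add up to the total sum of squares, which is the partition that every ANOVA design builds on.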

ANOVA models come in many shapes and sizes, called designs. Extreme Numerics.NET supports the three most common designs: one-way, one-way with repeated measures, and two-way analysis of variance. However, the infrastructure is in place to handle designs of any size and complexity.

Defining ANOVA models

All classes that implement ANOVA models inherit from a common base class, AnovaModel, which in turn inherits from Model, the base class of all statistical model classes.

In regression models, the dependent variable is a linear function of the independent variables. In an ANOVA design, the independent variables are categorical. The contribution of each individual combination of values of the independent variables must be estimated separately. Some dependencies exist, so the actual number of parameters is smaller than the number of combinations. Depending on the design, some combinations may be excluded from the model, further decreasing the number of parameters.

The set of all possible values of a categorical variable is called a factor. The possible values are called the levels of the factor. The purpose of an analysis of variance is to investigate how much each level of each factor, and each combination of levels, contributes to the total variation of the data.

So even though the model is initially defined in terms of the dependent and independent variables, the actual calculations are performed using the factors rather than the independent variables they are associated with.

The GetFactor method of the AnovaModel class returns the factor, as an IIndex, for the variable at the specified position. An overload allows you to retrieve the factor associated with an independent variable through the variable's name.

Cells

The first step in performing an analysis of variance is to divide the data set into groups of rows with the same values for the factors. The data that is associated with a particular combination of factor levels is called a cell.

Cells are implemented by the Cell class. This class has a number of properties that return summary statistics for the data in the cell. The most important ones are Count, which returns the number of observations in the cell; Mean, which returns the cell mean; and Variance, which returns the variance of the data in that cell only.
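The idea of a cell can be sketched in a few lines of Python. This is a conceptual illustration only, not the Cell class itself: the data, the dictionary layout, and the use of the sample variance (divisor n - 1) are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, variance

# Hypothetical observations: (level of factor 1, level of factor 2, value).
rows = [
    ("low", "x", 10.0), ("low", "x", 12.0),
    ("low", "y", 20.0), ("low", "y", 24.0),
    ("high", "x", 30.0), ("high", "x", 34.0),
    ("high", "y", 40.0), ("high", "y", 42.0),
]

# Group the values by their combination of factor levels: one list per cell.
cells = defaultdict(list)
for level1, level2, value in rows:
    cells[(level1, level2)].append(value)

# Summary statistics per cell. statistics.variance is the sample variance;
# whether the library uses the same divisor is an assumption here.
summary = {
    key: {"count": len(v), "mean": mean(v), "variance": variance(v)}
    for key, v in cells.items()
}

for key, stats in summary.items():
    print(key, stats)
```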

Cell objects can't be created directly. Instead, they are accessed through various properties of the models that return single cells or arrays of cells.

To access a specific cell, use the factor levels as indices. Using the special index Cell.All for a factor level indicates that the cell contains the totals for all levels of that factor. Setting all indices to Cell.All indicates that the cell represents summary data for the entire data set.
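The role of the special "all levels" index can be mimicked with a sentinel value. The Python sketch below is an analogy rather than the library's indexer: ALL, cell_values, and the sample data are hypothetical names introduced for illustration.

```python
from statistics import mean

ALL = object()  # hypothetical sentinel standing in for the "all levels" index

# Hypothetical observations: (level of factor 1, level of factor 2, value).
rows = [
    ("low", "x", 10.0), ("low", "x", 12.0),
    ("low", "y", 20.0), ("low", "y", 24.0),
    ("high", "x", 30.0), ("high", "x", 34.0),
    ("high", "y", 40.0), ("high", "y", 42.0),
]

def cell_values(rows, level1, level2):
    """Return the values in a cell; ALL matches every level of that factor."""
    return [
        value
        for f1, f2, value in rows
        if (level1 is ALL or level1 == f1) and (level2 is ALL or level2 == f2)
    ]

single_cell = cell_values(rows, "low", "x")   # one combination of levels
row_total = cell_values(rows, "low", ALL)     # all levels of factor 2
column_total = cell_values(rows, ALL, "x")    # all levels of factor 1
everything = cell_values(rows, ALL, ALL)      # the entire data set

print(mean(single_cell), mean(row_total), mean(column_total), mean(everything))
```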

Results of the Analysis

The results of an analysis of variance are in the same format as those of other linear models.

The AnovaTable property returns the AnovaTable object that summarizes the results. The number of rows in the table varies with the details of the design. The TotalRow property always returns the AnovaRow for the complete data. The ErrorRow property returns the row for the residuals. The CompleteModelRow property returns the row for all the factors or interactions in the model combined. Rows corresponding to the individual factors and interactions in the model can be retrieved through the GetModelRow method.
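To make the structure of such a table concrete, here is a minimal Python sketch that builds the model, error, and total rows of a one-way ANOVA table by hand. It does not call AnovaTable or AnovaRow; the dictionary keys, the data, and the variable names are assumptions chosen for the example, and no p-value is computed.

```python
from statistics import mean

# Hypothetical one-way design: one factor with two levels.
groups = {
    "control": [2.0, 4.0, 6.0],
    "treated": [6.0, 8.0, 10.0],
}

values = [x for v in groups.values() for x in v]
n, k = len(values), len(groups)
grand_mean = mean(values)

# Sums of squares for the model (between groups) and the error (within groups).
ss_model = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
ss_error = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

# Each row holds degrees of freedom and a sum of squares, as in an ANOVA table.
table = {
    "model": {"df": k - 1, "ss": ss_model},
    "error": {"df": n - k, "ss": ss_error},
    "total": {"df": n - 1, "ss": ss_model + ss_error},
}

# Mean square = sum of squares divided by degrees of freedom.
for row in table.values():
    row["ms"] = row["ss"] / row["df"]

# The F statistic compares the model mean square to the error mean square.
f_statistic = table["model"]["ms"] / table["error"]["ms"]

for name, row in table.items():
    print(name, row)
print("F =", f_statistic)
```

In a design with more factors, the model row would be broken down further into one row per factor and per interaction, while the error and total rows keep the same meaning.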