Generalized Linear Models
Generalized linear models are an extension of linear regression models to situations where the distribution of the dependent variable is not normal. The types of models that can be represented as generalized linear models include: classic linear regression, logistic regression, probit regression and Poisson regression.
Two properties define the nature of a specific generalized linear model. The ModelFamily specifies the distribution of the errors. The LinkFunction defines the relationship between the dependent variable and the linear combination of predictor variables.
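In symbols, the two ingredients can be written as follows. The notation is introduced here for illustration only and is not part of the library: y_i is the i-th observation of the dependent variable, x_{i1}, ..., x_{ip} are the corresponding predictor values, and g is the link function.

```latex
\eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \qquad
g(\mu_i) = \eta_i, \qquad
\mu_i = \operatorname{E}[y_i] = g^{-1}(\eta_i)
```

The model family describes how y_i is distributed around its mean \mu_i. For example, a normal family with the identity link g(\mu) = \mu gives classic linear regression, and a binomial family with the logit link gives logistic regression.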
Generalized linear models are implemented by the GeneralizedLinearModel class.
Constructing Generalized Linear Models
The GeneralizedLinearModel class has four constructors.
The first constructor takes three arguments. The first is a Vector<T> that represents the dependent variable. The second is an array of vectors that represent the independent variables. The third argument is the model family.
var dependent = Vector.Create(yData);
var independent1 = Vector.Create(x1Data);
var independent2 = Vector.Create(x2Data);
var model1 = new GeneralizedLinearModel(dependent,
    new[] { independent1, independent2 }, ModelFamily.Gamma);
The second constructor takes four required arguments and two optional ones. The first argument is an IDataFrame (a DataFrame<R, C> or Matrix<T>) that contains the variables to be used in the regression. The second argument is a string containing the name of the dependent variable. The third argument is an array of strings containing the names of the independent variables. All names must exist in the column index of the data frame specified by the first argument. The fourth argument is the model family. The fifth argument is optional and specifies the link function. If none is specified, the canonical link function for the selected model family is used. The sixth argument, also optional, is a vector containing weights for the observations.
var dataFrame = DataFrame.FromColumns(new Dictionary<string, object>()
    { { "y", dependent }, { "x1", independent1 }, { "x2", independent2 } });
var model2 = new GeneralizedLinearModel(dataFrame, "y", new[] { "x1", "x2" },
    ModelFamily.Gamma, LinkFunction.Log);
The third constructor takes two arguments. The first is a Vector<T> containing the data of the dependent variable. The second is a Matrix<T> whose columns contain the data for each independent variable. The length of the vector must equal the number of rows of the matrix.
Model Families
The model family specifies the distribution of the errors in the dependent variable. The model family of a generalized linear model can be accessed through the ModelFamily property. It is of type ModelFamily. All common model families can be accessed as static (Shared in Visual Basic) members of this type:
Member | Description |
---|---|
 | The normal distribution. This is the default. |
Binomial | The binomial distribution. |
Gamma | The gamma distribution. |
 | The inverse Gaussian or inverse normal distribution. |
Poisson | The Poisson distribution. |
Link Functions
The link function specifies the relationship between the dependent variable and the linear combination of predictor variables. The link function of a generalized linear model can be accessed through the LinkFunction property. It is of type LinkFunction.
The link function and the model family together determine the exact form of the distribution of the dependent variable. Not all link functions are compatible with a given model family. To check for compatibility, use the model family's IsLinkFunctionCompatible method.
Every model family has a canonical link function, which can be thought of as the natural choice of link function for the family of distributions. When no link function is specified, the canonical link function of the model family is used. The canonical link function of a model family is available through the CanonicalLinkFunction property.
All common link functions can be accessed using static (Shared in Visual Basic) members of the LinkFunction class:
Member | Description |
---|---|
Identity | The identity function. This is the canonical link function for the normal family. |
Log | The log link is the canonical link function for the Poisson family and the negative binomial family. |
Logit | The logit link is the canonical link function for the binomial family. |
Probit | The probit function is often used in logistic regression. |
 | The complementary log-log link is used in logistic regression and is related to the extreme value distribution. |
 | The log complement link function is sometimes used in logistic regression. |
 | The negative log-log link function is sometimes used in logistic regression. |
Reciprocal | The reciprocal link function is the canonical link function for the gamma family. |
ReciprocalSquared | The squared reciprocal link function is the canonical link function for the inverse Gaussian family. |
 | The power link function for a specified exponent. This is a generalization of several other link functions, like the Identity, Reciprocal, and ReciprocalSquared link functions. |
 | The odds power link function for a specified exponent. If the exponent is zero, this function is equivalent to the Logit link function. |
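To make the descriptions in the table concrete, here is a minimal standalone sketch of several link functions written in plain C#. The class LinkSketch and all of its members are hypothetical names introduced for illustration. They are not the library's LinkFunction implementation, and the exact parameterization the library uses for the power and odds power links is an assumption.

```csharp
using System;

// Hypothetical helpers that illustrate link functions g(mu).
// This is NOT the library's LinkFunction class.
public static class LinkSketch
{
    public static double Identity(double mu) => mu;
    public static double Log(double mu) => Math.Log(mu);
    public static double Logit(double mu) => Math.Log(mu / (1.0 - mu));
    public static double ComplementaryLogLog(double mu) => Math.Log(-Math.Log(1.0 - mu));
    public static double Reciprocal(double mu) => 1.0 / mu;
    public static double ReciprocalSquared(double mu) => 1.0 / (mu * mu);

    // Power link mu^p (assumed form). Exponents 1, -1 and -2 reproduce the
    // identity, reciprocal and squared reciprocal links.
    public static double Power(double mu, double exponent) => Math.Pow(mu, exponent);

    // Odds power link in one common parameterization (assumed form):
    // (odds^p - 1) / p, which tends to the logit link as p goes to 0.
    public static double OddsPower(double mu, double exponent)
    {
        double odds = mu / (1.0 - mu);
        return exponent == 0.0
            ? Math.Log(odds)
            : (Math.Pow(odds, exponent) - 1.0) / exponent;
    }

    // Inverse links map the linear predictor eta back to the mean mu.
    public static double InverseLog(double eta) => Math.Exp(eta);
    public static double InverseLogit(double eta) => 1.0 / (1.0 + Math.Exp(-eta));
}
```

For instance, LinkSketch.Power(mu, -1.0) returns the same value as LinkSketch.Reciprocal(mu), which is the sense in which the power link generalizes the other links.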
Computing the Regression
The model family and link function have to be set before the model can be computed. The following example creates a probit regression model and sets the model family and link function through the constructor:
var model3 = new GeneralizedLinearModel(dataFrame, "y", new[] { "x1", "x2" },
    ModelFamily.Binomial, LinkFunction.Probit);
When the link function is the canonical link function of the model family, it does not have to be set explicitly. The example below creates a Poisson regression model with a log link, which is the canonical link:
var model4 = new GeneralizedLinearModel(dataFrame, "y", new[] { "x1", "x2" },
    ModelFamily.Poisson);
Once the model family and link function have been set, the model can be computed. The Fit method performs the actual analysis. Most properties and methods throw an exception when they are accessed before the Fit method is called. You can verify that the model has been calculated by inspecting the Computed property.
model1.Fit();
The Predictions property returns a Vector<T> that contains the values of the dependent variable as predicted by the model. The Residuals property returns a vector containing the difference between the actual and the predicted values of the dependent variable. Both vectors contain one element for each observation.
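As a rough picture of what fitting, predictions and residuals mean numerically, the following standalone sketch fits a Poisson regression with a log link and a single predictor using Newton's method, then computes predicted values and residuals (actual minus predicted). The class PoissonSketch and its methods are hypothetical and use plain arrays. The source does not describe the library's fitting algorithm, so this should not be read as its implementation.

```csharp
using System;

// Hypothetical sketch: Poisson regression, log link, one predictor.
// Not the library's algorithm; for illustration only.
public static class PoissonSketch
{
    // Returns { intercept, slope } that maximize the Poisson log likelihood.
    public static double[] Fit(double[] x, double[] y, int maxIterations = 50)
    {
        // Start from the intercept-only model: mu = mean(y).
        double mean = 0.0;
        foreach (double v in y) mean += v / y.Length;
        double b0 = Math.Log(mean), b1 = 0.0;

        for (int iter = 0; iter < maxIterations; iter++)
        {
            // Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
            double g0 = 0, g1 = 0, h00 = 0, h01 = 0, h11 = 0;
            for (int i = 0; i < x.Length; i++)
            {
                double mu = Math.Exp(b0 + b1 * x[i]);
                g0 += y[i] - mu;
                g1 += (y[i] - mu) * x[i];
                h00 += mu;
                h01 += mu * x[i];
                h11 += mu * x[i] * x[i];
            }

            // Newton step: solve the 2x2 system H * delta = g.
            double det = h00 * h11 - h01 * h01;
            double d0 = (h11 * g0 - h01 * g1) / det;
            double d1 = (h00 * g1 - h01 * g0) / det;
            b0 += d0;
            b1 += d1;
            if (Math.Abs(d0) + Math.Abs(d1) < 1e-12) break;
        }
        return new[] { b0, b1 };
    }

    // Predicted means: the inverse link applied to the linear predictor.
    public static double[] Predict(double[] beta, double[] x)
    {
        var mu = new double[x.Length];
        for (int i = 0; i < x.Length; i++) mu[i] = Math.Exp(beta[0] + beta[1] * x[i]);
        return mu;
    }

    // Residuals: actual minus predicted, one element per observation.
    public static double[] Residuals(double[] y, double[] predictions)
    {
        var r = new double[y.Length];
        for (int i = 0; i < y.Length; i++) r[i] = y[i] - predictions[i];
        return r;
    }
}
```

Both Predict and Residuals return one element per observation, mirroring the shape of the Predictions and Residuals vectors described above.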
Regression Parameters
The GeneralizedLinearModel class' Parameters property returns a ParameterVector<T> object that contains the parameters of the regression model. The elements of this vector are of type Parameter<T>. Regression parameters are created by the model. You cannot create them directly.
Parameters can be accessed by numerical index or by name. The name of a parameter is usually the name of the variable associated with it.
A generalized linear model has as many parameters as there are independent variables, plus one for the intercept (constant term) when it is included. The intercept, if present, is the first parameter in the collection, with index 0.
The Parameter<T> class has four useful properties. The Value property returns the numerical value of the parameter, while the StandardError property returns the standard deviation of the parameter's distribution.
The Statistic property returns the value of the z-statistic corresponding to the hypothesis that the parameter equals zero. The PValue property returns the corresponding p-value. A high p-value indicates that the variable associated with the parameter does not make a significant contribution to explaining the data. The p-value always corresponds to a two-tailed test. The following example prints the properties of the parameter for x1 in our earlier example:
var x1Parameter = model1.Parameters.Get("x1");
Console.WriteLine("Name: {0}", x1Parameter.Name);
Console.WriteLine("Value: {0}", x1Parameter.Value);
Console.WriteLine("St.Err.: {0}", x1Parameter.StandardError);
Console.WriteLine("z-statistic: {0}", x1Parameter.Statistic);
Console.WriteLine("p-value: {0}", x1Parameter.PValue);
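The z-statistic and two-tailed p-value described above can be reproduced by hand from a parameter's value and standard error. The sketch below is a standalone illustration under the usual normal approximation. WaldSketch and its members are hypothetical names, and the error function uses the Abramowitz and Stegun approximation 7.1.26, so results are accurate only to about seven decimal places.

```csharp
using System;

// Hypothetical helpers; not part of the library.
public static class WaldSketch
{
    // erf(x) for x >= 0, Abramowitz and Stegun formula 7.1.26
    // (absolute error below 1.5e-7).
    private static double Erf(double x)
    {
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                    + t * (-1.453152027 + t * 1.061405429))));
        return 1.0 - poly * Math.Exp(-x * x);
    }

    // z = estimate / standard error, testing the hypothesis "parameter = 0".
    public static double ZStatistic(double value, double standardError)
        => value / standardError;

    // Two-tailed p-value: 2 * (1 - Phi(|z|)) = 1 - erf(|z| / sqrt(2)).
    public static double TwoTailedPValue(double z)
        => 1.0 - Erf(Math.Abs(z) / Math.Sqrt(2.0));
}
```

A z-statistic near 1.96 in absolute value gives a p-value near 0.05, the conventional threshold for significance at the 5% level.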
The Parameter<T> class has one method: GetConfidenceInterval. This method takes one argument: a confidence level between 0 and 1. A value of 0.95 corresponds to a confidence level of 95%. The method returns the confidence interval for the parameter at the specified confidence level as an Interval structure.
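Under a normal approximation, consistent with the z-statistic used for the parameters, an interval of this kind usually has the following form. This is the standard Wald interval, given here as an assumption. The source does not state the exact formula used by GetConfidenceInterval.

```latex
\hat{\beta}_j \pm z_{1-\alpha/2}\,\operatorname{SE}(\hat{\beta}_j),
\qquad \alpha = 1 - \text{confidence level}
```

Here z_{1-\alpha/2} is the corresponding quantile of the standard normal distribution. For a confidence level of 0.95 it is approximately 1.96.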
Verifying the Quality of the Regression
Generalized linear models are fitted by maximizing the likelihood function. The logarithm of the likelihood function of the final result is available through the LogLikelihood method. A related method, GetKernelLogLikelihood, returns the part of the log likelihood that depends on the dependent variable. The GetChiSquare method compares the log likelihood of the model to the log likelihood of the minimal model.
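A comparison of this kind is commonly expressed as a likelihood ratio statistic. The formula below is the textbook form, stated as an assumption because the source does not give the exact definition used by GetChiSquare. The symbols \ell_{\text{model}} and \ell_{0} denote the log likelihoods of the fitted model and of the minimal model.

```latex
\chi^2 = 2\left(\ell_{\text{model}} - \ell_{0}\right)
```

A larger value indicates that the predictors improve the fit more over the minimal model.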
Other measures of goodness of fit, suitable for comparing different models of the same data, are: the Akaike Information Criterion or AIC (GetAkaikeInformationCriterion), the corrected AIC (GetCorrectedAkaikeInformationCriterion), and the Bayesian Information Criterion or BIC (GetBayesianInformationCriterion).
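For reference, the textbook definitions of these criteria can be sketched in a few lines of plain C#. The class InformationCriteriaSketch and its method names are hypothetical, with k the number of estimated parameters and n the number of observations. The library's methods may differ in constants or in how k is counted, so treat this as an illustration of the standard formulas only.

```csharp
using System;

// Hypothetical helpers with the standard textbook definitions.
// Not the library's implementation.
public static class InformationCriteriaSketch
{
    // AIC = 2k - 2 ln L
    public static double Aic(double logLikelihood, int k)
        => 2.0 * k - 2.0 * logLikelihood;

    // AICc = AIC + 2k(k + 1) / (n - k - 1), a small-sample correction.
    public static double CorrectedAic(double logLikelihood, int k, int n)
        => Aic(logLikelihood, k) + 2.0 * k * (k + 1) / (n - k - 1);

    // BIC = k ln n - 2 ln L
    public static double Bic(double logLikelihood, int k, int n)
        => k * Math.Log(n) - 2.0 * logLikelihood;
}
```

For all three criteria, lower values indicate a better trade-off between fit and model complexity when comparing models fitted to the same data.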