The glm() function

The glm() function in S is designed to handle the class of models known as "Generalized Linear Models", which includes the logistic regression model. The basic format of the glm function is similar to that of the lm function. The logistic regression model specifies a linear model for the log odds, log(p/(1-p)):

	Y ~ Binomial(n,p)

	log(p/(1-p)) = a + b*X

To specify the model we need to specify the family of probability distributions (binomial), and the form of the relationship between the mean of the response variable and the covariates, known as the "link function". Standard probability families have standard link functions. For example, the standard link function for the Normal distribution is the identity function, so the generalized linear model for the Normal distribution is

	Y ~ Normal(mu, sigma^2)

	mu = a + b* X

The standard link function for the binomial distribution is the logit (log(p/(1-p))), and this is the default assumed by the S glm function. There are other models for binomial responses, including the "probit" (probability integral transform, which yields the S "qnorm" function as the link) and the complementary log-log tranform, which is a standard model used in toxicology:

	probit:  qnorm(p) = a + b*X

	cloglog: log(-log(1-p)) = a + b*X

The logistic model is a natural model to start with in most cases, but you should always bear in mind that it is a model, and may or may not be appropriate for any particular situation.

Let's fit the logistic regression model to the skin cancer data:

sc.glm<- glm(cbind(cases,n-cases)~age + Ft.Worth,
         data=SkinCancer,family=binomial)

Notes:

Unless the response variable has one trial per case, the response has two components, and is specified as a matrix. The first column is the number of successes, the second is the number of failures. The cbind function takes two (or more) datasets and "column binds" them to form a matrix whose columns are the original datasets.
The linear part of the model is specified in exactly the same way as for the lm function.
The data frame is specified exactly as in the lm function.
The binomial family must be specified, as the default family is the normal.
The link function defaults to the logit.

The resulting data structure can be used with the standard extraction functions, for example:

summary(sc.glm)

The glm summary looks similar to the lm summary, but there are some important differences. The standard errors in the table are only approximate standard errors, and may not be reliable unless the sample size is large. The glm analog of the residual sum of squares is called "residual deviance".

The definition of a residual is different for a glm, in fact there are several plausible definitions. The default here is a "deviance residual", which measures the contribution of each observation to the residual deviance. This maintains the analogy with least squares, where the contribution of an observation to the residual sum of squares is just the residual, squared.

plot(fitted.values(sc.glm),residuals(sc.glm))

The residual plot doesn't look too good: clear lack of fit. Let's try an analysis of variance model - treat the ages as groups:

sc1.glm<- glm(cbind(cases,n-cases)~agegrp + Ft.Worth,
         data=SkinCancer,family=binomial)
summary(sc1.glm)

Note that we have not specified a model with interaction this time: that would fit perfectly, having one free parameter for each data point.

In the next sections we will study hypothesis tests and residuals in more detail.

Next: Inference for glm's

Up: Math 141 Index

Back: Logistic Regression

Albyn Jones
jones@reed.edu