Y ~ Binomial(n,p) log(p/(1-p)) = a + b*XTo specify the model we need to specify the family of probability distributions (binomial), and the form of the relationship between the mean of the response variable and the covariates, known as the "link function". Standard probability families have standard link functions. For example, the standard link function for the Normal distribution is the identity function, so the generalized linear model for the Normal distribution is
Y ~ Normal(mu, sigma^2) mu = a + b* XThe standard link function for the binomial distribution is the logit (log(p/(1-p))), and this is the default assumed by the S glm function. There are other models for binomial responses, including the "probit" (probability integral transform, which yields the S "qnorm" function as the link) and the complementary log-log tranform, which is a standard model used in toxicology:
probit: qnorm(p) = a + b*X cloglog: log(-log(1-p)) = a + b*XThe logistic model is a natural model to start with in most cases, but you should always bear in mind that it is a model, and may or may not be appropriate for any particular situation.
Let's fit the logistic regression model to the skin cancer data:
sc.glm<- glm(cbind(cases,n-cases)~age + Ft.Worth, data=SkinCancer,family=binomial)Notes:
The resulting data structure can be used with the standard extraction functions, for example:
summary(sc.glm)The glm summary looks similar to the lm summary, but there are some important differences. The standard errors in the table are only approximate standard errors, and may not be reliable unless the sample size is large. The glm analog of the residual sum of squares is called "residual deviance".
The definition of a residual is different for a glm, in fact there are several plausible definitions. The default here is a "deviance residual", which measures the contribution of each observation to the residual deviance. This maintains the analogy with least squares, where the contribution of an observation to the residual sum of squares is just the residual, squared.
plot(fitted.values(sc.glm),residuals(sc.glm))The residual plot doesn't look too good: clear lack of fit. Let's try an analysis of variance model - treat the ages as groups:
sc1.glm<- glm(cbind(cases,n-cases)~agegrp + Ft.Worth, data=SkinCancer,family=binomial) summary(sc1.glm)Note that we have not specified a model with interaction this time: that would fit perfectly, having one free parameter for each data point.
In the next sections we will study hypothesis tests and residuals in more detail.