Logistic Regression

Logistic regression is a standard model for handling a dichotomous response variable with independent trials. The method of least squares, which in this case could be justified by the Central Limit Theorem for sample proportions (i.e., binomial random variables), does not yield valid inferential statements when the variance of the response is not constant. The logistic regression model is based on the following set of assumptions:

	1. The responses Y_i are independent Binomial(n_i, p_i) random variables.
	2. The success probabilities are related to the predictor through the
	   logit transformation: log( p_i/(1-p_i) ) = a + b*X_i.

We could try to fit this model by least squares, transforming the responses:

	P = Y/n

	log( P/(1-P) ) = a + b*X

Unfortunately, the transformed response still has non-constant variance.
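As an illustration, a least-squares fit on the transformed scale might look like this in R (here Y, n, and X are hypothetical vectors of success counts, numbers of trials, and predictor values):

	P <- Y/n                    # sample proportions
	logitP <- log(P/(1 - P))    # empirical logit; undefined when P is 0 or 1
	fit <- lm(logitP ~ X)       # ordinary least squares on the transformed scale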

The method of estimation we will use is called "Maximum Likelihood Estimation", MLE for short. MLE amounts to choosing the values of the parameters that maximize the probability of the observed data. In practice, this is accomplished by an algorithm similar to that used in nonlinear least squares: start with an initial guess, and then alter the parameter values to increase the objective function, i.e., the probability of the observed data. For a wide class of distributions, which includes the normal, binomial, Poisson, and gamma distributions, the initial guesses can be made automatically, relieving us of the problem of choosing them.
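As a sketch of the idea (not the algorithm glm() actually uses), the binomial log-likelihood for the model logit(p) = a + b*X could be maximized directly with optim(); again Y, n, and X are hypothetical data vectors:

	# negative log-likelihood: minimizing it maximizes the likelihood
	negloglik <- function(theta) {
	    p <- plogis(theta[1] + theta[2]*X)          # inverse logit
	    -sum(dbinom(Y, size = n, prob = p, log = TRUE))
	}
	mle <- optim(c(0, 0), negloglik)   # search starting from a = b = 0
	mle$par                            # maximum likelihood estimates of a and b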

Logistic Regression as a linear model analog

	Topic              The Linear Model                 Logistic Regression
	-----------------  -------------------------------  --------------------------------
	Call               lm(Y ~ X, data=DataFrame)        glm(cbind(Y, N-Y) ~ X,
	                                                      family=binomial, data=DataFrame)
	Summary Table      summary()                        summary()
	Residuals          residuals()                      residuals()
	Fitted Values      fitted()                         fitted()
	Model Comparisons  anova(): F-test                  anova(): chi-squared test
	Case Diagnostics   cooks.distance(), hatvalues(),   cooks.distance(), hatvalues(),
	                   dfbetas(), etc.                  dfbetas(), etc.

Example: Heart Disease

This example makes use of data from the Framingham Longitudinal Study:

"Framingham" <- structure(.Data = list(
"present" = c(2, 3, 3, 4, 3, 2, 0, 3, 8, 11, 6, 6, 7, 12, 11, 11), 
"absent" = c(117, 121, 47, 22, 85, 98, 43, 20, 119, 209, 68, 43, 
             67, 99, 46, 33), 
"SC" = structure(
   .Data = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4), 
    levels = c("<200", "200-219", "220-259", "260+"), class = "factor"), 
"BP" = structure(
    .Data = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), 
    levels = c("<127", "127-146", "147-166", "167+"), class = "factor")), 
names = c("present", "absent", "SC", "BP"), 
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", 
         "9", "10", "11", "12", "13", "14", "15", "16"), 
class = "data.frame")
You can simply paste the data above into your R session, then type "Framingham" (the name of the data.frame) to display the data. The variables are:

	present   the number of subjects in the cell with coronary heart disease (CHD)
	absent    the number of subjects in the cell without CHD
	SC        serum cholesterol, in four categories
	BP        blood pressure, in four categories
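As a quick sanity check, you might compute the observed proportion with CHD in each of the 16 cells:

	with(Framingham, present/(present + absent))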

There are several models that we might be interested in fitting here:

	m0:   Y ~ 1          (constant: neither SC nor BP matters)
	m1a:  Y ~ SC
	m1b:  Y ~ BP
	m2:   Y ~ SC + BP    (additive effects)
	m3:   Y ~ SC * BP    (main effects plus interaction)

In each case, the response "Y" is the matrix "cbind(present,absent)". For example, to fit the model "m2" use:
	m2 <- glm(cbind(present, absent) ~ SC + BP, family = binomial, data = Framingham)
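The remaining models follow the same pattern:

	m0  <- glm(cbind(present, absent) ~ 1,       family = binomial, data = Framingham)
	m1a <- glm(cbind(present, absent) ~ SC,      family = binomial, data = Framingham)
	m1b <- glm(cbind(present, absent) ~ BP,      family = binomial, data = Framingham)
	m3  <- glm(cbind(present, absent) ~ SC * BP, family = binomial, data = Framingham)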

The anova function produces the information needed to test null hypotheses that arise by restricting the parameters of a model. For example, m1a is a restriction of m2: all the parameters for BP have been set to zero. The following schematic diagram represents the nesting relationships between the models; comparable models are linked by vertical or diagonal lines:


                      m0
                     /  \
                    /    \
                  m1a    m1b
                    \    /
                     \  /
                      m2
                      |
                      |
                      m3

Models "m1a" and "m1b" are not comparable, since neither is a restriction of the other. You should look at the summary output for each model.

Here is a table giving, for each model, the residual df and the residual deviance.


	Analysis of Deviance Table

	model    df    deviance
	m0       15      58.76
	m1a      12      26.80
	m1b      12      35.16
	m2        9       8.07
	m3        0       0.00

The chi-squared test comparing two nested models is simple: the change in deviance has approximately a chi-squared distribution, with degrees of freedom equal to the change in df. For example, the test comparing model m2 to model m3 yields a change in deviance of 8.076, with a change in df of 9. The p-value is
> pchisq(8.076,9,lower.tail=FALSE)
[1] 0.5265058
You can ask anova() to compute the p-value:
> anova(m2,m3,test="Chisq")
Analysis of Deviance Table

Model 1: cbind(present, absent) ~ SC + BP
Model 2: cbind(present, absent) ~ SC * BP
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1         9     8.0762                     
2         0     0.0000  9   8.0762   0.5265
The chi-squared test is really a likelihood ratio test, so you can do the same analysis with:
        anova(m2,m3,test="LRT")
It is useful to know that the expected value of a chi-squared random variable is equal to its degrees of freedom.

Notice that model "m3" is a saturated model: there is one parameter for each case, and thus no residual df, and no residual deviance. In effect, model "m3" has fit each case with its sample proportion. Comparing model "m2" to model "m3" may be done by comparing the deviance to the 95th percentile of the chi-squared distribution with 9df. Since the expected value of a Chi-squared random variable is its degrees of freedom, the deviance is clearly not large. We can conclude that model "m2" fits the data well, or to put it in more formal language, we can not reject the null hypothesis that there is no interaction between SC and BP at the 5% level.

Compare model "m1a" to model "m2"; you should conclude that model "m1a" does not fit well, compared to "m2". Formally speaking, we can reject the null hypothesis that BP has no effect on CHD.



Albyn Jones
jones@reed.edu