p = exp(a + b*X)/(1+exp(a+b*X))
We could fit this model by least squares, by transforming the responses to proportions,

P = Y/n

applying the logit transform,

log( P/(1-P) ) = a + b*X

and fitting by least squares. Unfortunately, the transformed response still has non-constant variance.
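As a sketch of that transformation (the grouped counts here are hypothetical, not the Framingham data used later):

```r
# Hypothetical grouped data: Y successes out of n trials at each value of X
Y <- c(2, 5, 9, 14)
n <- c(20, 20, 20, 20)
X <- c(1, 2, 3, 4)

P <- Y / n                    # observed proportions
logitP <- log(P / (1 - P))    # the transformed response
fit <- lm(logitP ~ X)         # least-squares fit on the transformed scale
coef(fit)                     # rough estimates of a and b
```

Note that the transform fails outright when P is 0 or 1 (the logit is infinite), a further drawback beyond the non-constant variance.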
The method of estimation we will use is called "Maximum Likelihood Estimation", MLE for short. MLE amounts to choosing the values of the parameters that maximize the probability of the observed data. In practice, this is accomplished by an algorithm similar to that used in nonlinear least squares: start with an initial guess, and then alter the parameter values to increase the objective function, i.e. the probability of the observed data. For a wide class of distributions, which includes the normal, binomial, Poisson, and gamma distributions, the initial guesses can be made automatically, relieving us of the problem of choosing them.
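A minimal sketch of the idea, using hypothetical data and a general-purpose optimizer (glm() actually uses iteratively reweighted least squares, but the maximum it finds is the same):

```r
# Hypothetical grouped binomial data: Y successes out of n trials at predictor X
Y <- c(2, 5, 9, 14); n <- rep(20, 4); X <- 1:4

# Negative log-likelihood of the logistic model p = exp(a+b*X)/(1+exp(a+b*X))
negloglik <- function(theta) {
  p <- plogis(theta[1] + theta[2] * X)   # plogis() is the logistic function
  -sum(dbinom(Y, n, p, log = TRUE))      # minus the log probability of the data
}

# Start from an initial guess and let the optimizer climb the likelihood
mle <- optim(c(0, 0), negloglik)

# glm() locates the same maximum
fit <- glm(cbind(Y, n - Y) ~ X, family = binomial)
```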
topic | The Linear Model | Logistic Regression |
---|---|---|
call | lm(Y ~ X, data=DataFrame) | glm(cbind(Y,N-Y) ~ X, family=binomial, data=DataFrame) |
Summary Table | summary() | summary() |
Residuals | residuals() | residuals() |
Fitted Values | fitted() | fitted() |
Model Comparisons | anova(): F-test | anova(): Chi-squared test |
Case Diagnostics | cooks.distance(), hatvalues(), dfbetas(), etc. | cooks.distance(), hatvalues(), dfbetas(), etc. |
This example makes use of data from the Framingham Longitudinal Study:
"Framingham" <- structure(.Data = list( "present" = c(2, 3, 3, 4, 3, 2, 0, 3, 8, 11, 6, 6, 7, 12, 11, 11), "absent" = c(117, 121, 47, 22, 85, 98, 43, 20, 119, 209, 68, 43, 67, 99, 46, 33), "SC" = structure( .Data = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4), levels = c("<200", "200-219", "220-259", "260+"), class = "factor"), "BP" = structure( .Data = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), levels = c("<127", "127-146", "147-166", "167+"), class = "factor")), names = c("present", "absent", "SC", "BP"), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16"), class = "data.frame")You can simply paste the data above into your R session, then type "Framingham" (the name of the data.frame) to display the data. The variables are
There are several models that we might be interested in fitting here.
m2 <- glm(cbind(present,absent) ~ SC+BP,Framingham,family=binomial)
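The other four models are not written out in the text, but the discussion below identifies them: "m0" is the intercept-only model, "m1a" and "m1b" add SC and BP separately, and "m3" adds the SC by BP interaction. A reconstruction (the data frame here is a compact version of the Framingham data above):

```r
# The Framingham data from the text, in compact form
Framingham <- data.frame(
  present = c(2,3,3,4, 3,2,0,3, 8,11,6,6, 7,12,11,11),
  absent  = c(117,121,47,22, 85,98,43,20, 119,209,68,43, 67,99,46,33),
  SC = factor(rep(c("<200","200-219","220-259","260+"), each = 4),
              levels = c("<200","200-219","220-259","260+")),
  BP = factor(rep(c("<127","127-146","147-166","167+"), times = 4),
              levels = c("<127","127-146","147-166","167+"))
)

m0  <- glm(cbind(present,absent) ~ 1,     Framingham, family=binomial)
m1a <- glm(cbind(present,absent) ~ SC,    Framingham, family=binomial)
m1b <- glm(cbind(present,absent) ~ BP,    Framingham, family=binomial)
m2  <- glm(cbind(present,absent) ~ SC+BP, Framingham, family=binomial)
m3  <- glm(cbind(present,absent) ~ SC*BP, Framingham, family=binomial)
```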
The anova function produces the information needed to test null hypotheses that arise by restricting the parameters of a model. For example, m1a is a restriction of m2: all the parameters for BP have been set to zero. The following schematic diagram represents the nesting relationships between the models; comparable models are linked by vertical or diagonal lines:
        m0
       /  \
    m1a    m1b
       \  /
        m2
        |
        m3

Models "m1a" and "m1b" are not comparable, since neither is a restriction of the other. You should look at the summary output for each model.
Here is a table giving, for each model, the residual df and the residual deviance.

Analysis of Deviance Table

model | df | deviance |
---|---|---|
m0 | 15 | 58.76 |
m1a | 12 | 26.80 |
m1b | 12 | 35.16 |
m2 | 9 | 8.07 |
m3 | 0 | 0.00 |
> pchisq(8.076, 9, lower.tail=FALSE)
[1] 0.5265058

You can ask anova() to compute the p-value:
> anova(m2, m3, test="Chisq")
Analysis of Deviance Table

Model 1: cbind(present, absent) ~ SC + BP
Model 2: cbind(present, absent) ~ SC * BP
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1         9     8.0762
2         0     0.0000  9   8.0762   0.5265

The Chi-squared test is really a Likelihood Ratio test, so you can do the same analysis with:
anova(m2, m3, test="LRT")

It is useful to know that the expected value of a Chi-squared random variable is equal to its degrees of freedom.
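A quick simulated check of that fact (the sample mean is only approximately 9):

```r
set.seed(1)                  # for reproducibility
x <- rchisq(100000, df = 9)  # draws from a chi-squared distribution with 9 df
mean(x)                      # close to 9, the degrees of freedom
```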
Notice that model "m3" is a saturated model: there is one parameter for each case, and thus no residual df and no residual deviance. In effect, model "m3" has fit each case with its sample proportion. Comparing model "m2" to model "m3" may be done by comparing the deviance to the 95th percentile of the chi-squared distribution with 9 df. Since the expected value of a chi-squared random variable is its degrees of freedom, the deviance is clearly not large. We can conclude that model "m2" fits the data well, or to put it in more formal language, we cannot reject the null hypothesis that there is no interaction between SC and BP at the 5% level.
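Concretely, using the deviance of 8.076 from the table above:

```r
qchisq(0.95, df = 9)                      # the 95th percentile, about 16.92
8.076 < qchisq(0.95, df = 9)              # TRUE: deviance below the critical value
pchisq(8.076, df = 9, lower.tail = FALSE) # the p-value, about 0.53, as before
```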
Compare model "m1a" to model "m2"; you should conclude that model "m1a" does not fit well, compared to "m2". Formally speaking, we can reject the null hypothesis that BP has no effect on CHD.
Albyn Jones