F tests and anova()

The F test is the basic tool for comparing nested models. Fit a full model and a restricted model, then compare them:

   Mf <- lm(y ~ x1+x2+x3+x4)
   Mr <- lm(y ~ x1+x2)
   anova(Mr,Mf)

Note: the restricted model must be nested in the full model, and both models must be fit to the same observations.
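
As a minimal sketch of what anova() computes, the following uses simulated data. The variable names, sample size, and coefficients are invented for illustration and do not come from any course data set. Both models are fit to one data frame, so they share the same observations, and the F statistic is then reproduced by hand from the residual sums of squares.

```r
# Simulated data; all names and values here are illustrative assumptions.
set.seed(1)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

# Fit both models to the same data frame so they share observations.
Mf <- lm(y ~ x1 + x2 + x3 + x4, data = d)
Mr <- lm(y ~ x1 + x2, data = d)
tab <- anova(Mr, Mf)
print(tab)

# The same F statistic computed from the residual sums of squares.
rss_r <- sum(resid(Mr)^2)
rss_f <- sum(resid(Mf)^2)
q <- df.residual(Mr) - df.residual(Mf)   # number of restrictions tested
F_by_hand <- ((rss_r - rss_f) / q) / (rss_f / df.residual(Mf))
p_by_hand <- pf(F_by_hand, q, df.residual(Mf), lower.tail = FALSE)
```

The hand computation should agree with the second row of the anova() table; a small p-value is evidence against the restricted model.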

Assignment

Load the Brain/Body weight data set. Plot Brain.WT vs Body.WT, and fit the regression model Brain.WT~Body.WT. Plot the residuals and other diagnostics. What are the problems with this model?
Now fit the following linear models in the log scale:
- M1: log(Brain.WT)~ log(Body.WT)
- M2: log(Brain.WT)~ log(Body.WT)+ Class
- M3: log(Brain.WT)~ log(Body.WT)*Class
Use the F-test (anova() function) to compare the models. Which is the best model?
There are three classes: birds, fish, and mammals. Which group is the baseline group represented by the intercept term? Which group has the highest average log(Brain.WT) after controlling for log(Body.WT)? Which species seem unusually "big brained" or especially "small brained" relative to their groups? It will be helpful to look at plots with different symbols or colors for the different groups!
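
A small sketch of how one might check the baseline level and plot groups with different colors and symbols. The data frame is simulated, and its column names (`grp`, `x`, `y`) and level names are hypothetical; substitute the variables from the Brain/Body data. By default R orders factor levels alphabetically, and the first level is the one absorbed into the intercept.

```r
# Simulated three-group data; names and values are hypothetical.
set.seed(2)
d <- data.frame(grp = factor(rep(c("a", "b", "c"), each = 20)),
                x   = rnorm(60))
shift <- c(0, 1, 2)[as.integer(d$grp)]      # a different intercept per group
d$y <- 1 + 0.5 * d$x + shift + rnorm(60, sd = 0.3)

levels(d$grp)        # the first level listed is the baseline
fit <- lm(y ~ x + grp, data = d)
names(coef(fit))     # dummies appear for every level except the baseline

# relevel() changes the baseline without changing the fitted values.
d$grp2 <- relevel(d$grp, ref = "c")
fit2 <- lm(y ~ x + grp2, data = d)

# One color and plotting symbol per group.
plot(y ~ x, data = d, col = as.integer(d$grp), pch = as.integer(d$grp))
legend("topleft", legend = levels(d$grp), col = 1:3, pch = 1:3)
```

Each group coefficient is a difference from the baseline group at a fixed value of the covariate, so changing the baseline changes the coefficients but not the fit.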
Load the Florida election dataset with
```
FL <- read.csv("http://people.reed.edu/~jones/141/FL.dat")
```
Read the data description on the 141 website!!
We are primarily interested in over- and under-voted ballots. Define the new variable NoVote <- over+under. Your mission is to relate the sum of over and under vote counts to explanatory factors including voting technology (Tech), ballot layout (Layout), number of ballots cast, and possible socio-economic correlates of voting efficacy: education (PctHS or PctColGrad), percent elderly population, poverty, unemployment, median household income.
- a) Fit the full model
```
  lmF <- lm(log(NoVote) ~ Tech*log(Ballots) + Tech*Pct65 + Tech*PctHS)
```
  Examine the summary table: why are some coefficients not estimated? Hint: look at the names of the Tech interactions. Omit the two counties causing the trouble, and rerun the full model.
- b) Fit the restricted model
```
  lmR <- lm(log(NoVote) ~ Tech + log(Ballots) + PctHS)
```
  Note: I haven't included the case numbers of the counties to be omitted! You need to omit the same cases in the restricted model that you did in the full model. Examine the summary table.
- c) Test the null hypothesis that all omitted coefficients are zero. Which model should we prefer?
- d) What can we conclude about the various voting technologies? Which were better, which were worse? R will code dummy variables for every technology except the baseline category; make sure you know which one is the baseline!
- e) Plot fitted values vs residuals and the normal quantile plot of residuals, and check for influential cases with Cook's distance and leverage plots. Do you see any major violations of the model assumptions, or any interesting or influential cases?
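
One way to keep both models on the same cases, sketched with simulated data. The data frame, the variable names, and the omitted row numbers (3 and 7) are all invented for illustration; they are not the Florida counties in question. Dropping the rows once and fitting every model to the reduced data frame means anova() compares models fit to identical observations. The diagnostic calls that follow are standard base R tools of the kind part e) asks for.

```r
# Simulated data; names, values, and omitted rows are hypothetical.
set.seed(3)
d <- data.frame(g = factor(rep(c("p", "q"), each = 15)),
                x = rnorm(30), z = rnorm(30))
d$y <- 2 + d$x + (d$g == "q") + rnorm(30)

# Drop the cases once, then fit every model to the same reduced data.
drop_rows <- c(3, 7)
d2 <- d[-drop_rows, ]
lmF <- lm(y ~ g * x + g * z, data = d2)
lmR <- lm(y ~ g + x, data = d2)
anova(lmR, lmF)          # F test that all omitted coefficients are zero

# Residual diagnostics for the restricted model.
plot(fitted(lmR), resid(lmR), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(lmR)); qqline(resid(lmR))

# Influence measures: Cook's distance and leverage (hat values).
cd  <- cooks.distance(lmR)
lev <- hatvalues(lmR)
plot(cd, type = "h", ylab = "Cook's distance")
head(sort(cd, decreasing = TRUE), 3)   # the three largest Cook's distances
```

Calling plot() on a fitted lm object also produces a standard set of diagnostic plots, which may be a quicker first look.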
  
  Albyn Jones