Multiple Regression |
SAT <- read.csv("http://people.reed.edu/~jones/141/sat.csv")The dataset includes for each state expend (expenditure per pupil), ratio (average pupil/teacher ratio), salary (average annual salary of teachers), frac (percentage of students taking the SAT), and average sat scores verbal, math and total sat. The data are from 1994.
Let's fit linear model with sat as the response:
sat.lm <- lm(sat ~ expend + ratio + salary + frac, data=SAT)Before paying a lot of attention to the summary output, we should plot residuals:
plot(fitted(sat.lm),residuals(sat.lm),pch=18,col="blue")If the pattern isn't obvious, try
lines(lowess(fitted(sat.lm),residuals(sat.lm)),lty=2,col="red")
OK, now, let's try to isolate the source of the trend: it may help to plot the residuals against the individual explanatory variables:
plot(SAT$expend,residuals(sat.lm),pch=18,col="blue") plot(SAT$ratio,residuals(sat.lm),pch=18,col="blue") plot(SAT$salary,residuals(sat.lm),pch=18,col="blue") plot(SAT$frac,residuals(sat.lm),pch=18,col="blue")
This suggests fitting a quadratic model:
SAT$frac2 <- SAT$frac^2 sat.qlm <- lm(sat ~ expend + ratio + salary + frac + frac2, data=SAT)Again, plot the residuals before thinking about inference!
Just for fun, let's start with the first regression model:
plot(fitted(sat.lm),cooks.distance(sat.lm),pch=19,col="brown") # you may wish to identify the interesting case identify(fitted(sat.lm),cooks.distance(sat.lm),SAT$State)The identify function allows you to click on a point to label it. To exit terminate the function, hit the esc key.
Cook's Distance measures relative change in the coefficients as each case is deleted.
plot(fitted(sat.lm),dffits(sat.lm),pch=19,col="brown") identify(fitted(sat.lm),dffits(sat.lm),SAT$State)dfbetas coefficients recomputed omitting each case.
dfbetas(sat.lm)dffits measures relative change in the fitted values as each case is deleted.
plot(fitted(sat.lm),rstudent(sat.lm),pch=19,col="brown") identify(fitted(sat.lm),rstudent(sat.lm),SAT$State)rstudent gives (standardized) residuals computed as if each case were omitted in turn.
plot(fitted(sat.lm),hatvalues(sat.lm),pch=19,col="brown") identify(fitted(sat.lm),hatvalues(sat.lm),SAT$State)hatvalues measure the leverage of each case: an indication of location in joint distribution of the explanatory variables. Cases with large hat values are outliers.
Finally, you might plot some of these against each other!
plot(hatvalues(sat.lm),cooks.distance(sat.lm),pch=19,col="brown")
Now, for added fun, repeat with sat.qlm...