Multiple Regression Assignment
GunReg <- read.table("http://people.reed.edu/~jones/141/Guns.dat", header=TRUE)
The dataset contains data by state, including population, area in square miles, percent urban population, percent below the poverty line, whether or not there are gun registration laws, and the number of homicides. The socioeconomic data are from 1990/91. The gun registration indicator is taken from a USA Today article (Tuesday, January 7, 1992, page 5A). The dummy variable gunreg is 1 for states with registration laws and 0 otherwise. You may wish to attach the data frame to provide direct access to variable names: attach(GunReg); otherwise use the with() function, for example with(GunReg, plot(pop, area)).
a) The USA Today article compared the average number of homicides in states with gun registration laws to those without. They didn't do a formal analysis, but you can! Do the t-test
t.test(homicides ~ gunreg, var.equal=TRUE)
Compare to the linear model
lm(homicides ~ gunreg)
What is the connection? What conclusion is suggested by the t-test?
b) These are observational rather than experimental data, so it is necessary to consider controlling for covariates that might affect the number of homicides. Explore other regression models, including gunreg and other covariates that might be important for explaining homicides. You may find it useful to consider variations on the response variable (the log or sqrt transform, or per capita rates).
c) Check the case diagnostics. Are there any influential cases? If so, how do the coefficients change if you omit the case(s)?
d) What should we conclude about the evidence for a relationship between gun registration laws and homicides?
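The steps in parts a) through c) can be rehearsed on made-up data before turning to the real file. The sketch below is only an illustration: the data frame sim and all of its values are simulated (with no gunreg effect built in), and only the column names pop, gunreg and homicides are taken from the assignment. It lines up the t-test and lm() output, tries a per capita rate and a log response, and refits the model without the case that has the largest Cook's distance.

```r
# Simulated stand-in for GunReg; every value here is made up for illustration.
set.seed(141)
n <- 50
sim <- data.frame(
  pop    = round(exp(seq(13, 17, length.out = n))),  # hypothetical populations
  gunreg = rep(c(0, 1), length.out = n)              # 1 = registration law
)
# Homicide counts depend on population only; no gunreg effect is built in.
sim$homicides <- rpois(n, lambda = 8e-5 * sim$pop)

# (a) Pooled-variance t-test and the one-predictor linear model.
tt    <- t.test(homicides ~ gunreg, data = sim, var.equal = TRUE)
fit   <- lm(homicides ~ gunreg, data = sim)
slope <- summary(fit)$coefficients["gunreg", ]

# Put the two sets of numbers side by side. t.test() reports group 0 minus
# group 1, so its statistic is negated to match the direction of the lm slope.
comparison <- rbind(
  t.test = c(diff = unname(tt$estimate[2] - tt$estimate[1]),
             t    = unname(-tt$statistic),
             p    = tt$p.value),
  lm     = c(diff = unname(slope["Estimate"]),
             t    = unname(slope["t value"]),
             p    = unname(slope["Pr(>|t|)"]))
)
print(comparison)

# (b) Two variations on the response: a per capita rate and a log transform.
sim$rate <- 1e5 * sim$homicides / sim$pop  # homicides per 100,000 residents
fit_rate <- lm(rate ~ gunreg, data = sim)
fit_log  <- lm(log(homicides) ~ gunreg + log(pop), data = sim)
print(coef(summary(fit_rate)))
print(coef(summary(fit_log)))

# (c) Find the case with the largest Cook's distance and refit without it.
cd       <- cooks.distance(fit)
worst    <- which.max(cd)
fit_drop <- lm(homicides ~ gunreg, data = sim[-worst, ])
print(cbind(all = coef(fit), dropped = coef(fit_drop)))
```

With the real data, replace sim by GunReg and add whichever covariates you judge relevant; the log transform requires every state to have at least one homicide.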
Load the data into R with:
Bwt = read.csv("http://www.reed.edu/~jones/141/Bwt.dat")
The variables are
The mean birthweight for non-smokers was 123 oz, for smokers 113.8 oz. A t-test comparing the two indicates the difference is statistically significant (t=8.7, p << .01, 1172 df). However, these are observational data, not experimental data. The subjects were not randomly assigned to smoking and non-smoking conditions!
There are other factors known to be associated with birthweight. The length of gestation is a major determinant of birthweight, and physiological factors such as the mother's weight, height, age and parity may also be related to birthweight. Make scatterplots with birthweight on the Y axis, and explanatory variables on the X axis. You can color code for smoking in plots using:
plot(X,Y,pch=18, col=ifelse(smoke==1,"red","blue"))
or
plot(X,Y,pch=ifelse(smoke,1,4))
Since smoke is 1 for smokers and 0 for non-smokers, ifelse() treats it as TRUE for smokers and FALSE for non-smokers, so smokers are drawn in red (or with plotting symbol 1) and non-smokers in blue (or with symbol 4). Since there are almost 1200 cases, you may or may not be able to see anything obvious in these plots. If you see anything interesting, include that plot and discuss!
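As a concrete version of the colour-coded scatterplot, here is a small sketch on simulated data. The data frame sim, its values, and the effect sizes used to generate it are all made up; only the variable names bwt, gestation and smoke come from the assignment, and the gestation scale is an assumption.

```r
# Simulated stand-in for Bwt; values and effect sizes are made up.
set.seed(1)
n <- 200
sim <- data.frame(
  gestation = round(rnorm(n, mean = 280, sd = 15)),  # assumed scale
  smoke     = rbinom(n, 1, 0.4)                      # 1 = smoker, 0 = non-smoker
)
sim$bwt <- 120 + 0.45 * (sim$gestation - 280) - 9 * sim$smoke +
  rnorm(n, sd = 16)

# One colour and one plotting symbol per case, chosen by smoking status.
point_col <- ifelse(sim$smoke == 1, "red", "blue")
point_pch <- ifelse(sim$smoke == 1, 1, 4)

plot(sim$gestation, sim$bwt, pch = point_pch, col = point_col,
     xlab = "Gestation", ylab = "Birthweight (oz)")
legend("topleft", legend = c("smoker", "non-smoker"),
       col = c("red", "blue"), pch = c(1, 4))
```

Combining colour and symbol, plus a legend, can make the two groups easier to tell apart when many points overlap.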
a) Fit the regression model
bwt ~ gestation + smoke + height + weight + parity
b) Plot residuals vs fitted values, the normal quantile plot of residuals, and check for influential cases with Cook's distance and leverage plots. Do you see any major violations of the model assumptions, or worrisome cases? If so, do they affect our evaluation of the relationship between maternal smoking and birthweight?
c) Estimate the difference in birthweight between mothers who smoke, and mothers who don't smoke (assuming gestation length and other relevant factors are equal for both). Give a 95% CI for that difference.
d) Give a 95% CI for the slope in gestation.
e) Explain in substantive terms what each coefficient means.
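The mechanics of parts a) through d) can be sketched as follows. This is an illustration on simulated data, not the analysis itself: the data frame sim, its values, the generating coefficients, and the 0/1 coding of parity are all assumptions; only the variable names and the model formula come from the assignment.

```r
# Simulated stand-in for Bwt; all values and coefficients are made up.
set.seed(2)
n <- 300
sim <- data.frame(
  gestation = round(rnorm(n, 280, 15)),
  smoke     = rbinom(n, 1, 0.4),    # 1 = smoker
  height    = round(rnorm(n, 64, 2.5)),
  weight    = round(rnorm(n, 128, 20)),
  parity    = rbinom(n, 1, 0.25)    # 0/1 coding is an assumption
)
sim$bwt <- 120 + 0.45 * (sim$gestation - 280) - 9 * sim$smoke +
  1.2 * (sim$height - 64) + 0.05 * (sim$weight - 128) -
  3 * sim$parity + rnorm(n, sd = 16)

# (a) Fit the regression model from the assignment.
fit <- lm(bwt ~ gestation + smoke + height + weight + parity, data = sim)
print(summary(fit))

# (b) Standard diagnostic plots from plot.lm().
op <- par(mfrow = c(2, 2))
plot(fit, which = 1)  # residuals vs fitted values
plot(fit, which = 2)  # normal quantile plot of standardized residuals
plot(fit, which = 4)  # Cook's distance by case
plot(fit, which = 5)  # standardized residuals vs leverage
par(op)

# (c), (d) 95% confidence intervals for the smoke and gestation coefficients.
ci <- confint(fit, parm = c("smoke", "gestation"), level = 0.95)
print(ci)
```

With the real data, replace sim by Bwt. The smoke row of the interval table then addresses part c) and the gestation row part d).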