Least Squares
Perhaps the single most common data analytic tool in use is the method of least squares for fitting lines to data, often called "regression" or "ordinary least squares".
The least squares criterion is "choose the parameter values that minimize the sum of the squares of the (vertical) distances from the points to the line".
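To make the criterion concrete, here is a minimal sketch (using made-up toy data, not the CPS data) that minimizes the sum of squared vertical distances numerically and checks the answer against lm():
# toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# sum of squared vertical distances from the points to the line a + b*x
sse <- function(par) sum((y - (par[1] + par[2]*x))^2)
# minimize it numerically ...
optim(c(0, 0), sse)$par
# ... and compare with the least squares coefficients from lm()
coef(lm(y ~ x))
The two answers should agree (up to the tolerance of the numerical optimizer).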
We will explore some of the issues that arise in this context using data from the CPS, the Current Population Survey, on education level and wages.
Load the data with
CPS <- read.csv("http://people.reed.edu/~jones/141/cps.csv")
# how big is the dataset?
dim(CPS)
# what are the names of the columns in the table?
names(CPS)
# what kind of data are represented?
summary(CPS)
We will focus on two variables: "educ" (years of education) and "wage" (dollars per hour). One natural question: is education related to wages? Thinking of education as the explanatory variable, and wages as the response, make a scatterplot:
plot(educ,wage,pch=19,col="blue")
You probably got an error message like object 'educ' not found, because educ is a column inside the CPS data frame, not a free-standing variable. There are several solutions:
# 1. extract the columns into free-standing variables
educ <- CPS[,"educ"]
wage <- CPS[,"wage"]
# then plot!
# 2. use with() to evaluate the plot call inside CPS
with(CPS,plot(educ,wage,pch=19,col="blue"))
# 3. attach CPS so its columns can be found by name
# (use detach(CPS) later to undo this)
attach(CPS)
plot(educ,wage,pch=19,col="blue")
To fit the least squares line, use the linear model function lm():
wage.lm <- lm(wage ~ educ, data=CPS)
summary(wage.lm)
The object we assign the lm output to contains various useful bits of information. To see what is there, use
names(wage.lm)
Generally we don't want, or need, to access those components directly. R has various extraction functions to display the information we want, for example, the summary() function above. The most important information displayed is the table of coefficients and information about the residuals.
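A few of the most useful extractors (all standard R functions), applied to wage.lm:
coef(wage.lm)            # the estimated intercept and slope
confint(wage.lm)         # confidence intervals for the coefficients
head(fitted(wage.lm))    # the first few fitted values
head(residuals(wage.lm)) # the first few residuals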
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.74598    1.04545  -0.714    0.476
educ         0.75046    0.07873   9.532   <2e-16

Residual standard error: 4.754 on 532 degrees of freedom

The column labeled Estimate holds the estimated coefficients, labeled (Intercept) (the Y-intercept) and educ (the slope). Each coefficient comes with an estimated SE, a t-statistic to test the null hypothesis that the coefficient is equal to zero, and a p-value for that test.
The "residual standard error" is the estimated standard deviation of the residuals. The residual df equals 532, or 534-2.
Here the estimated equation is approximately
wage = -0.746 + 0.750 * educ
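To use the fitted line for prediction, plug in a value of educ; for example, for a worker with 12 years of education the prediction is about -0.746 + 0.750*12, or roughly 8.26 dollars per hour:
# predicted wage at 12 years of education
predict(wage.lm, newdata = data.frame(educ = 12))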
Next, check the fit by plotting the residuals against the fitted values:
plot(fitted(wage.lm),residuals(wage.lm), pch=18,col="red")
# add a horizontal reference line
abline(h=0, lty=2)
# add lines 2 SE's above and below the line
abline(h=c(-2,2)*4.75,lty=2)
If the model is correct, the plot of residuals against fitted values should look like symmetrically distributed random points with constant spread. Does it?
You should also make the normal quantile plot of the residuals:
qqnorm(residuals(wage.lm),pch=19,col="blue")
qqline(residuals(wage.lm),lty=2,col="red")
If all is well, it should look roughly like a straight line. Does it?
For wage data, the residual plots typically show trouble: the spread grows with the fitted values and the distribution is skewed to the right. A standard remedy is to transform the response by taking logs. Make the scatterplot of log(wage) against educ:
with(CPS,plot(educ,log(wage), pch=19,col="blue"))
Fit and print the lm() summary:
logwage.lm <- lm(log(wage) ~ educ, data = CPS)
summary(logwage.lm)
# add the fitted line to the scatterplot
abline(coef(logwage.lm),col="red")
Now, plot the residuals again:
plot(fitted(logwage.lm),residuals(logwage.lm), pch=18,col="red")
# add a horizontal reference line
abline(h=0, lty=2)
# add lines 2 SE's above and below the line
abline(h=c(-2,2)*.488,lty=2)
Make sure you can find the residual SE in the summary output! If the residuals are roughly normally distributed, then most of them (about 95 percent) should lie within 2 SE's of zero, that is, between the two outer dotted lines.
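You can check the 95 percent claim directly; sigma() extracts the residual standard error (the 0.488 above), and the proportion below should be close to 0.95 if the claim holds:
# proportion of residuals within 2 residual SE's of zero
mean(abs(residuals(logwage.lm)) < 2 * sigma(logwage.lm))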
qqnorm(residuals(logwage.lm),pch=19,col="blue")
qqline(residuals(logwage.lm),lty=2,col="red")
Does the log model look better?
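One payoff of the log transformation is an interpretable slope: since the response is log(wage), each additional year of education multiplies the predicted wage by exp(slope), that is, roughly a 100*(exp(slope) - 1) percent increase. A quick computation:
b <- coef(logwage.lm)["educ"]  # slope on the log scale
100 * (exp(b) - 1)             # approximate percent increase in wage per year of education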