Least Squares
Perhaps the single most common data analytic tool in use is the method of least squares for fitting lines to data, often called "regression" or "ordinary least squares".
The least squares criterion is: choose the parameter values that minimize the sum of the squares of the (vertical) distances from the points to the line. In symbols, choose the intercept a and slope b that minimize sum((y - (a + b*x))^2).
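As a quick sketch of what that criterion means (using made-up toy numbers, not real data), you can search for the minimizing pair numerically with optim() and compare the result to what lm() returns:
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
sse <- function(b) sum((y - (b[1] + b[2]*x))^2)  # sum of squared vertical distances
optim(c(0, 0), sse)$par   # numerical search for (intercept, slope)
coef(lm(y ~ x))           # lm() finds essentially the same pair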
We will explore some of the issues that arise in this context using data from the CPS, the Current Population Survey, on education level and wages.
Load the data with
CPS <- read.csv("http://people.reed.edu/~jones/141/cps.csv")
# how big is the dataset?
dim(CPS)
# what are the names of the columns in the table?
names(CPS)
# what kind of data are represented?
summary(CPS)
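If you also want to see a few raw rows, head() shows the first six:
head(CPS)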
We will focus on two variables: "educ" (years of education) and "wage" (dollars per hour). One natural question: is education related to wages? Thinking of education as the explanatory variable, and wages as the response, make a scatterplot:
plot(educ,wage,pch=19,col="blue")
You probably got an error message like "object 'educ' not found": the columns of a data frame are not automatically visible as free-standing variables. There are several solutions.
You can copy the columns into your workspace:
educ <- CPS[,"educ"]
wage <- CPS[,"wage"]
# then the plot command above works as typed
Or you can evaluate the plot command "inside" the data frame using with():
with(CPS,plot(educ,wage,pch=19,col="blue"))
Or you can attach() the data frame, which makes its columns visible by name (convenient, but easy to forget you have done it; detach(CPS) undoes it):
attach(CPS)
plot(educ,wage,pch=19,col="blue")
Now fit the least squares line with lm() (for "linear model") and display the results:
wage.lm <- lm(wage ~ educ, data=CPS)
summary(wage.lm)
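To draw the fitted line on the scatterplot (assuming the plot window from above is still open), pass the coefficients to abline():
abline(coef(wage.lm),col="red")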
The object we assign the lm output to contains various
useful bits of information. To see what is there, use
names(wage.lm)
Generally we don't want, or need, to access those components
directly. R has various extractor functions to
display the information we want, for example, the summary()
function used above. The most important information displayed is
the table of coefficients and information about the residuals:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.74598    1.04545  -0.714    0.476
educ          0.75046    0.07873   9.532   <2e-16

Residual standard error: 4.754 on 532 degrees of freedom
The column labeled Estimate contains the estimated coefficients,
labeled "(Intercept)" (the Y-intercept) and "educ" (the slope).
Each comes with an estimated SE, a t-statistic for testing the null
hypothesis that the coefficient is equal to zero, and a
p-value for that test.
The "residual standard error" is the estimated standard deviation of the residuals. Its 532 degrees of freedom are the 534 observations minus the 2 estimated coefficients.
Here the estimated equation is approximately
predicted wage = -0.746 + 0.750*educ
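For example, at educ = 12 the line predicts about -0.746 + 0.750*12, i.e. roughly 8.26 dollars per hour; predict() does the same arithmetic:
predict(wage.lm, newdata=data.frame(educ=12))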
The basic diagnostic is a plot of the residuals against the fitted values:
plot(fitted(wage.lm),residuals(wage.lm),
     pch=18,col="red")
# add a horizontal reference line
abline(h=0, lty=2)
# add lines 2 SE's above and below the line
abline(h=c(-2,2)*4.75,lty=2)
If the model is correct, the plot of residuals against
fitted values should look like symmetrically distributed
random points with constant
spread. Does it?
You should also make the normal quantile plot of the residuals:
qqnorm(residuals(wage.lm),pch=19,col="blue")
qqline(residuals(wage.lm),lty=2,col="red")
If all is well, it should look roughly like a straight
line. Does it?
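One way to calibrate your eye is to look at the quantile plot of data that really is normal; here is a sketch with a simulated sample of the same size (534, the number of rows in CPS):
z <- rnorm(534)   # simulated normal data
qqnorm(z,pch=19,col="gray")
qqline(z,lty=2)
Rerun it a few times to see how much wobble even a truly straight-line case shows.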
Wages, like most income measures, tend to be skewed to the right, and the diagnostic plots above suggest the model can be improved; a common remedy is to re-fit using log(wage). Make the new scatterplot:
with(CPS,plot(educ,log(wage), pch=19,col="blue"))
Fit and print the lm() summary:
logwage.lm <- lm(log(wage) ~ educ, data = CPS)
summary(logwage.lm)
# add the fitted line to the scatterplot
abline(coef(logwage.lm),col="red")
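A note on interpretation (a general fact about log-scale fits, not something printed in the summary): the slope is now the change in log(wage) per year of education, so each additional year multiplies the typical wage by about exp(slope):
exp(coef(logwage.lm)["educ"])   # multiplicative effect of one extra year of education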
Now, plot the residuals again:
plot(fitted(logwage.lm),residuals(logwage.lm),
pch=18,col="red")
# add a horizontal reference line
abline(h=0, lty=2)
# add lines 2 SE's above and below the line
abline(h=c(-2,2)*.488,lty=2)
Make sure you can find the residual SE in the
summary output! If the residuals are roughly normally distributed, then
most of the points (about 95 percent) should lie within
2 SE's of the fitted line, that is, between the two outer dotted
lines.
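You can check that claim directly by counting the residuals within 2 SE's of zero (0.488 being the residual SE read off the summary):
mean(abs(residuals(logwage.lm)) <= 2*0.488)   # proportion within 2 SE's; expect about 0.95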
Finally, make the normal quantile plot of the residuals from the log model:
qqnorm(residuals(logwage.lm),pch=19,col="blue")
qqline(residuals(logwage.lm),lty=2,col="red")
Again, if all is well, it should look roughly like a straight line. Does it?