Least Squares



Perhaps the single most common data analytic tool in use is the method of least squares for fitting lines to data, often called "regression" or "ordinary least squares".

The least squares criterion is "choose the parameter values that minimize the sum of the squares of the (vertical) distances from the points to the line".
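To see the criterion in action, here is a small sketch with made-up toy data (the numbers are invented for illustration): it computes the sum of squared vertical distances for a candidate line, and checks that the line lm() finds does at least as well as any other candidate.

```r
# toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# sum of squared vertical distances from the points
# to the candidate line y = a + b*x
sse <- function(a, b) sum((y - (a + b*x))^2)

# lm() chooses (a, b) to minimize this criterion
fit <- coef(lm(y ~ x))
sse(fit[1], fit[2])   # the minimum
sse(0, 2)             # any other candidate line does worse
```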

We will explore some of the issues that arise in this context using data from the CPS, the Current Population Survey, on education level and wages.

Load the data with

       CPS <- read.csv("http://people.reed.edu/~jones/141/cps.csv")


First Steps: Know your Data!


You should always start by familiarizing yourself with the data:
      # how big is the dataset?
          dim(CPS)

      # what are the names of the columns in the table?
          names(CPS)

      # what kind of data are represented?
          summary(CPS)
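It can also help to look at the first few rows directly with head(), and at the structure of the table with str(). A sketch using the built-in mtcars data frame, in case you have not loaded the CPS data yet; with the real data, use head(CPS) and str(CPS):

```r
# peek at the first six rows of a data frame
head(mtcars)

# compact display of the structure: one line per column
str(mtcars)
```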

We will focus on two variables: "educ" (years of education) and "wage" (dollars per hour). One natural question: is education related to wages? Thinking of education as the explanatory variable, and wages as the response, make a scatterplot:

        plot(educ,wage,pch=19,col="blue")

You probably got an error message like "object 'educ' not found". There are several solutions:
  1. create copies of the data in your workspace:
            educ <- CPS[,"educ"]
            wage <- CPS[,"wage"]
    # then plot!
    
  2. use the with() command:
            with(CPS,plot(educ,wage,pch=19,col="blue"))
    
  3. use the attach() command:
            attach(CPS)
            plot(educ,wage,pch=19,col="blue")
    
Either of the first two is preferable to the third when writing R code, since attach() changes R's name search path; still, for simple data analysis sessions I often use attach().


Fit a regression line using Least Squares


In R, the function used to fit lines by least squares is called lm(), for Linear Model. To fit a line to the scatterplot we just made, with wage on the vertical axis and educ on the horizontal axis, use
        wage.lm <- lm(wage ~ educ, data=CPS)
        summary(wage.lm)
The object we assign the lm output to contains various useful bits of information. To see what is there, use
        names(wage.lm)
Generally we don't want, or need, to access those objects directly. R has various extraction functions to display the information we want, for example, the summary() function above. The most important information displayed is the table of coefficients and information about the residuals.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.74598    1.04545  -0.714    0.476    
educ         0.75046    0.07873   9.532   <2e-16

Residual standard error: 4.754 on 532 degrees of freedom
The column labeled Estimate contains the estimated coefficients, labeled "(Intercept)" (the Y-intercept) and "educ" (the slope). Each row also shows an estimated SE, a t-statistic for testing the null hypothesis that the coefficient equals zero, and a p-value for that test.

The "residual standard error" is the estimated standard deviation of the residuals. The residual df equals 532 = 534 - 2: the number of observations minus the number of estimated coefficients.

Here the estimated equation is approximately

       wage = -.746 + .75*educ
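The coefficients can also be pulled out of the fitted object with the extraction function coef(), and predictions made with predict(). Here is a sketch using simulated data shaped roughly like the CPS variables, in case the CPS file is not at hand; with the real data, skip the simulation and use the wage.lm object fit above.

```r
# simulated stand-in for the CPS variables (skip with real data)
set.seed(1)
educ <- sample(8:18, 200, replace = TRUE)
wage <- -0.75 + 0.75*educ + rnorm(200, sd = 4.75)
wage.lm <- lm(wage ~ educ)

coef(wage.lm)   # named vector: (Intercept), educ

# predicted wage at 12 years of education
predict(wage.lm, newdata = data.frame(educ = 12))
```

With the estimates reported above, the prediction at educ = 12 would be -.746 + .75*12, about 8.25 dollars per hour.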


Plot the Residuals


Always plot the residuals! R has extraction functions so we don't have to compute the residuals and fitted values by hand: they are called residuals() and fitted()!
          plot(fitted(wage.lm),residuals(wage.lm),
               pch=18,col="red")

      # add a horizontal reference line     
          abline(h=0, lty=2)

      # add lines 2 SE's above and below the line
          abline(h=c(-2,2)*4.75,lty=2)
If the model is correct, the plot of residuals against fitted values should look like symmetrically distributed random points with constant spread. Does it?
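If you are unsure what "symmetrically distributed with constant spread" should look like, it can help to simulate a comparison picture. Here is a sketch with artificial data where the straight-line model and constant spread hold by construction:

```r
# residuals from a model that is correct by construction:
# the plot should show symmetric points with constant spread
set.seed(3)
x <- runif(200, 8, 18)
y <- 1 + 0.1*x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)
plot(fitted(fit), residuals(fit), pch = 18, col = "red")
abline(h = 0, lty = 2)
```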

You should also make the normal quantile plot of the residuals:

        qqnorm(residuals(wage.lm),pch=19,col="blue")
        qqline(residuals(wage.lm),lty=2,col="red")
If all is well, it should look roughly like a straight line. Does it?

Transforming the Response


The residual plots show two symptoms: non-constant variance and positively skewed residuals. Since the variability increases with the fitted value, it is natural to try representing the response on the log scale. Repeat the process, starting with a scatterplot:
         with(CPS,plot(educ,log(wage), pch=19,col="blue"))
Fit and print the lm() summary:
         logwage.lm <- lm(log(wage) ~ educ, data = CPS)
         summary(logwage.lm)

         #add the fitted line to the scatterplot
         abline(coef(logwage.lm),col="red")
Now, plot the residuals again:
          plot(fitted(logwage.lm),residuals(logwage.lm),
               pch=18,col="red")

      # add a horizontal reference line     
          abline(h=0, lty=2)

      # add lines 2 SE's above and below the line
          abline(h=c(-2,2)*.488,lty=2)
Make sure you can find the residual SE in the summary output! If the data are roughly normally distributed, then most of the data (about 95 percent) should lie within 2 SE's of the fitted line, that is, between the two outer dotted lines. Finally, make the normal quantile plot again:
        qqnorm(residuals(logwage.lm),pch=19,col="blue")
        qqline(residuals(logwage.lm),lty=2,col="red")
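You can also check the 95 percent rule of thumb numerically by counting residuals within 2 SE's. A sketch with simulated normal data; with the real data, use residuals(logwage.lm) in place of r:

```r
# fraction of values within 2 SD's of zero; for roughly
# normal data this should be close to 0.95
set.seed(2)
r <- rnorm(1000)
mean(abs(r) <= 2 * sd(r))
```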

Math 141 Index

Albyn Jones