Least Squares





Let's start by making a scatterplot of the data. For some reason, the tradition is to put the explanatory variable on the horizontal axis, and the response variable on the vertical axis. It seems natural to consider vehicle weight as the explanatory variable, and miles per gallon as the response, so let's plot them that way. What seems to be the nature of the relationship?
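
In R, a minimal sketch of that plot might look like the following (the axis labels are optional embellishments added here; plot() with a formula puts the variable on the right of the tilde on the horizontal axis):

	plot(Hmpg ~ Wgt, data=Cars93,
	     xlab="Weight (pounds)", ylab="Miles per Gallon")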

To fit a straight line by ordinary least squares, we use the lm() ("linear model") function:

	lm(Hmpg ~ Wgt, data=Cars93)
This should produce the following output:
	Call:
	lm(formula = Hmpg ~ Wgt, data = Cars93)

	Coefficients:
	(Intercept)          Wgt  
	  51.601365    -0.007327  
The "Call" component is just a reminder of the model you are fitting; the variable to the left of the tilde (~) is the response, and the variable to the right is the explanatory variable.

The "Coefficients" component tells us what the computed intercept and slope were (note that the slope is labeled by the name of the explanatory variable). The estimated intercept is 51.6, and the slope is -0.007. In other words, for every additional pound of vehicle weight, we expect the mileage to decrease by about 0.007 mpg. We might want to be cautious in interpreting this coefficient, since we are basing our line on the trend we see when comparing different makes of cars with different weights, not on a single make of car carrying different loads. What does the intercept term tell us?
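
As a quick sanity check on the fitted line, we can plug a weight into the equation by hand; the 3000-pound figure here is just an arbitrary illustration:

	51.601365 - 0.007327*3000    # predicted mileage for a 3000 lb car: about 29.6 mpg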

In fact, the lm() function computes a lot more information than we see above. To work with it, we need to save the results of the lm computation in a data structure:


	lm.mpg <- lm(Hmpg ~ Wgt, data=Cars93)
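
Once the fit is saved, the standard extractor functions pull individual pieces out of it; for example:

	coef(lm.mpg)        # the estimated intercept and slope
	fitted(lm.mpg)      # the fitted values, one per observation
	residuals(lm.mpg)   # the residuals: observed minus fitted values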

To see some of the possible information we can extract from the lm data structure, use the summary function:
	summary(lm.mpg)
The summary() output includes a five-number summary of the residuals, which gives you a quick check for wild outliers or skewness:
	Residuals:
	     Min       1Q   Median       3Q      Max 
	-7.65007 -1.83591 -0.07741  1.82353 11.61722 
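
Those five numbers are just the quantiles of the residual vector, so we can reproduce them (up to rounding in the printout) directly:

	quantile(residuals(lm.mpg))   # min, lower quartile, median, upper quartile, max
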
The summary function also produces a table of information about the estimated coefficients:
	Coefficients:
	              Estimate Std. Error t value Pr(>|t|)    
	(Intercept) 51.6013654  1.7355498   29.73   <2e-16 
	Wgt         -0.0073271  0.0005548  -13.21   <2e-16 
The t-value is for testing the null hypothesis that the coefficient is equal to zero; note that it is always the estimated coefficient divided by the corresponding standard error. The column labeled "Pr(>|t|)" contains p-values for those hypothesis tests.
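
We can verify that relationship between the columns ourselves: coef() applied to the summary object returns the coefficient table as a matrix (the name ctab below is just a label chosen for this example):

	ctab <- coef(summary(lm.mpg))               # the coefficient table as a matrix
	ctab[, "Estimate"] / ctab[, "Std. Error"]   # reproduces the "t value" column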

The "residual standard error" is the estimated standard deviation of the residuals; here it is about 3.14, so we might expect a typical residual to be about 3.14 mpg in magnitude. (It has become common to use the term "standard error" for any estimate of a standard deviation other than the standard deviation of the data itself, presumably to avoid confusing the standard deviation of the data with the standard deviation of the residuals.) The residual standard error is the square root of the residual sum of squares divided by its degrees of freedom. Try computing it directly:

	sqrt(sum(residuals(lm.mpg)^2)/91)
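
You can check this against the value R itself stores; the summary object keeps it in a component named sigma:

	summary(lm.mpg)$sigma   # R's stored residual standard error, about 3.14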

Degrees of freedom measures how much data we have left to estimate the residual variance, and thus the residual standard error, after fitting the line. For regression models it is n - p, where n is the number of observations and p is the number of estimated parameters. When fitting a line

      Y = B0 + B1*X
there are two parameters, the intercept (B0), and the slope (B1).
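
R also reports the residual degrees of freedom directly, so here we expect 93 - 2 = 91:

	df.residual(lm.mpg)                   # residual degrees of freedom: n - p = 91
	nrow(Cars93) - length(coef(lm.mpg))   # the same thing computed by hand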




Albyn Jones
August 2004