Inference for Regression
Least squares, or linear regression, is a technique for fitting lines to data. Regression is also a powerful tool for inference. We will explore the basic ideas of inference in the regression context: confidence intervals and hypothesis tests for regression parameters, and the model validation tools that are at our disposal.
"Manatees" <- structure(.Data = list( "year" = c(1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990), "powerboats" = c(447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719), "deaths" = c(13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47)), names = c("year", "powerboats", "deaths"), row.names = c("1","2","3","4","5","6","7","8","9", "10","11","12","13","14"), class = "data.frame")
> Manatees
   year powerboats deaths
1  1977        447     13
2  1978        460     21
3  1979        481     24
4  1980        498     16
5  1981        513     24
6  1982        512     20
7  1983        526     15
8  1984        559     34
9  1985        585     33
10 1986        614     33
11 1987        645     39
12 1988        675     43
13 1989        711     50
14 1990        719     47
The dataset consists of the number of powerboat registrations (in thousands) for each year, and the number of manatees killed by powerboats that year. You should start by plotting the data, as usual. Does it look like a straight line is appropriate for describing the relationship between boat registrations and manatee deaths? (Don't forget to "attach" the data.frame!)
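A sketch of that first plot (the data frame is recreated here so the block runs on its own; the axis labels are my own choices):

```r
# Recreate the dataset from above
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
attach(Manatees)
# Scatterplot of deaths against powerboat registrations
plot(powerboats, deaths,
     xlab = "powerboat registrations (thousands)",
     ylab = "manatee deaths")
```

The points should look roughly linear, with deaths increasing as registrations increase.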
Recall that to fit a regression line by least squares we use the lm() function, and the summary() function to display the results:
manatee.lm <- lm(deaths ~ powerboats, data=Manatees)
summary(manatee.lm)

Unless we extract the columns and store them as separate variables, we must tell R the name of the data frame containing our dataset; hence the "data=Manatees" in the lm() call above.
Don't forget to plot the residuals!
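A minimal residual plot might look like this (the data and fit are recreated so the block is self-contained):

```r
# Recreate the data and fit
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
manatee.lm <- lm(deaths ~ powerboats, data = Manatees)

# Residuals against the explanatory variable;
# they should scatter evenly about zero with no pattern
plot(Manatees$powerboats, resid(manatee.lm),
     xlab = "powerboats", ylab = "residuals")
abline(h = 0, lty = 2)
```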
You can see that we would reject the null hypothesis that each coefficient is 0 here, since the t-statistics are much larger in magnitude than the two-sided 5% critical value of the t distribution with 12 df, which is qt(.975, 12), or about 2.18. The p-values are correspondingly small. Small p-values are evidence against the null hypothesis: either the null hypothesis is false, or we have observed an extremely rare event. It is important to remember that the distribution of the t-statistics depends on having exactly normally distributed deviations from the line, which we probably don't, so we shouldn't take the precise magnitude of the p-values too seriously.
You can also use the confint() function to get confidence intervals for the coefficients:
confint(manatee.lm)

The default level is a 95% confidence interval; for a 99% CI use
confint(manatee.lm,level=.99)
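These intervals can also be reproduced by hand from the coefficient table, which is a useful check of the usual formula, estimate ± t* × SE. A sketch for the 99% interval on the slope (the data and fit are recreated so the block stands alone):

```r
# Recreate the data and fit
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
manatee.lm <- lm(deaths ~ powerboats, data = Manatees)

est <- coef(summary(manatee.lm))   # estimates, SEs, t values, p values
# 99% CI: estimate +/- t(.995, 12 df) * SE
ci99 <- est["powerboats", "Estimate"] +
  c(-1, 1) * qt(.995, 12) * est["powerboats", "Std. Error"]
ci99
```

This should agree with the "powerboats" row of confint(manatee.lm, level=.99).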
The fitted line consists of our estimates of the expected values (or means) of the responses for each value of the explanatory variable. That is:
-41.4304 + 0.1249*powerboats

We can use this equation to make predictions. For example, we might want to guess the effect of a policy decision, such as restricting powerboat registrations to 700,000 each year. According to our equation, we expect that if powerboats = 700, then there will be about
-41.4304 + 0.1249*700

or roughly 46 manatees killed each year. We would also like to get an indication of the accuracy of our estimate of the line at that point. Unfortunately, we can't just add the standard errors of the coefficients, or their variances, in general. The formula is more complicated, and takes into account the covariance between the estimates of the intercept and the slope. In principle, you can compute the variance of the fitted line at a point (say powerboats=700) using the expression:
var(fit) = var(B0) + var(700*B1) + 2*cov(B0, 700*B1)

This will get very complicated as we consider more than one explanatory variable. Fortunately R has a function that computes this for us automagically: predict(). We need to set up a data frame containing the values of the explanatory variables at which we want to compute the fitted values, then pass the lm object and the new data frame to the predict() function.
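You can also check the formula directly: vcov() returns the estimated covariance matrix of the coefficients, and the variance of the fitted line at a point is a quadratic form in that matrix. A sketch (recreating the data and fit so the block runs on its own):

```r
# Recreate the data and fit
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
manatee.lm <- lm(deaths ~ powerboats, data = Manatees)

V <- vcov(manatee.lm)        # 2x2 covariance matrix of (intercept, slope)
x <- c(1, 700)               # design row for powerboats = 700
# var(fit) = var(B0) + 700^2*var(B1) + 2*700*cov(B0,B1), as a quadratic form
se.fit.700 <- sqrt(drop(t(x) %*% V %*% x))
se.fit.700                   # standard error of the fitted line at 700
```

This should match the se.fit value that predict() reports at powerboats = 700.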
M <- data.frame(powerboats=seq(400,750,10))  # set up a data frame with the explanatory variable(s);
                                             # the names of the explanatory variables must be
                                             # the same as in the original data frame
M   # look at the data frame you just created!
predM <- predict(manatee.lm, M, interval="confidence")
plot(M[,"powerboats"], predM[,"fit"], xlab="powerboats",
     ylab="deaths", type="l")                # plot the fitted line
lines(M[,"powerboats"], predM[,"lwr"], lty=3) # plot the lower 95% confidence limit
lines(M[,"powerboats"], predM[,"upr"], lty=3) # plot the upper 95% confidence limit

We can do the same operation manually:
predM <- predict(manatee.lm, M, se.fit=TRUE)  # compute the fitted line at the new points,
                                              # along with the standard errors of the fitted values
predM   # look at the result!
plot(M[,"powerboats"], predM$fit, type="l")
lines(M[,"powerboats"], predM$fit + qt(.975,12)*predM$se.fit, lty=2)
lines(M[,"powerboats"], predM$fit - qt(.975,12)*predM$se.fit, lty=2)  # plot a 95% CI for the fitted line

The 95% CI for the fitted line is narrowest in the middle, and bends out as we move to the ends of the range of the data. This is an indication that the uncertainty is least in the middle, and greatest at the ends of the range. Remember, this is not a confidence interval for observations, but for the location of the line!
The name of "predict" function is perhaps slightly misleading; the SE's it produces are standard errors of the mean, ie. the fitted line, not standard errors for predictions, which must also incorporate the spread of the points about the line. Fortunately, the component containing the standard errors is called "se.fit". Suppose we wish to make a prediction at powerboats = 700. There are two sources of uncertainty, the uncertainty in the fitted line, and the variability of observations about the line. Fortunately, a new observation will be independent of the rest, so its variance will be additive. The estimated variance of a deviation from the line is just the square of the residual standard error, thus we have
var(prediction at powerboats = 700) = 4.276388^2 + 2.056706^2

Here 2.056706 was the value of predM$se.fit corresponding to powerboats = 700, and 4.276388 was the residual standard error.
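This hand computation can be carried out in code, pulling the residual standard error out of the summary() object. A sketch, recreating the data and fit so the block runs on its own:

```r
# Recreate the data and fit
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
manatee.lm <- lm(deaths ~ powerboats, data = Manatees)

pr <- predict(manatee.lm, data.frame(powerboats = 700), se.fit = TRUE)
s  <- summary(manatee.lm)$sigma        # residual standard error
# var(prediction) = var(fit) + var(deviation about the line)
se.pred <- sqrt(pr$se.fit^2 + s^2)     # standard error of prediction at 700
se.pred
```

A 95% prediction interval at powerboats = 700 is then pr$fit plus or minus qt(.975, 12)*se.pred, which is what predict() with interval="prediction" computes.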
To compute the standard error of prediction at all the values in the data frame "M", use the predict function again:
predM <- predict(manatee.lm, M, interval="prediction")
plot(M[,"powerboats"], predM[,"fit"], type="l")  # plot the fitted line
lines(M[,"powerboats"], predM[,"lwr"], lty=3)    # plot the lower 95% prediction limit
lines(M[,"powerboats"], predM[,"upr"], lty=3)    # plot the upper 95% prediction limit
points(powerboats, deaths)                       # add the observed data