Inference for Regression
Least squares, or linear regression, is a technique for fitting lines to data. Regression is also a powerful tool for inference. We will explore the basic ideas of inference in the regression context: confidence intervals and hypothesis tests for regression parameters, and the model validation tools that are at our disposal.
"Manatees" <- structure(.Data = list( "year" = c(1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990), "powerboats" = c(447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719), "deaths" = c(13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47)), names = c("year", "powerboats", "deaths"), row.names = c("1","2","3","4","5","6","7","8","9", "10","11","12","13","14"), class = "data.frame")
> Manatees
   year powerboats deaths
1  1977        447     13
2  1978        460     21
3  1979        481     24
4  1980        498     16
5  1981        513     24
6  1982        512     20
7  1983        526     15
8  1984        559     34
9  1985        585     33
10 1986        614     33
11 1987        645     39
12 1988        675     43
13 1989        711     50
14 1990        719     47
The dataset consists of the number of powerboat registrations (in thousands) for each year, and the number of manatees killed by powerboats that year. You should start by plotting the data, as usual. Does it look like a straight line is appropriate for describing the relationship between boat registrations and manatee deaths? (Don't forget to "attach" the data.frame!)
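A sketch of that first plot (the data frame is recreated here so the block runs on its own; the axis labels are my own choices):

```r
# Recreate the dataset from above
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
attach(Manatees)
# Scatterplot of deaths against powerboat registrations
plot(powerboats, deaths,
     xlab = "powerboat registrations (thousands)",
     ylab = "manatee deaths")
```

The points should look roughly linear, with deaths increasing as registrations increase.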
Recall that to fit a regression line by least squares we use the lm() function, and the summary() function to display the results:
manatee.lm <- lm(deaths ~ powerboats, data=Manatees)
summary(manatee.lm)

Unless we extract the columns and store them as separate variables, we must tell R the name of the data frame containing our dataset; hence the "data=Manatees" in the lm() call above.
Don't forget to plot the residuals!
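A minimal residual plot might look like this (the data and fit are recreated so the block is self-contained):

```r
# Recreate the data and fit
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
manatee.lm <- lm(deaths ~ powerboats, data = Manatees)

# Residuals against the explanatory variable;
# they should scatter evenly about zero with no pattern
plot(Manatees$powerboats, resid(manatee.lm),
     xlab = "powerboats", ylab = "residuals")
abline(h = 0, lty = 2)
```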
You can see that we would reject the null hypothesis that each coefficient is 0 here, since the t-statistics are much larger in magnitude than the two-sided 5% critical value of the t distribution with 12 df, which is qt(.975, 12), or about 2.18. The p-values are correspondingly small. Small p-values are evidence against the null hypothesis: either the null hypothesis is false, or we have observed an extremely rare event. It is important to remember that the distribution of the t-statistics depends on having exactly normally distributed deviations from the line, which we probably don't, so we shouldn't take the precise magnitude of the p-values too seriously.
You can also use the confint() function to get confidence intervals for the coefficients:
confint(manatee.lm)

The default level is a 95% confidence interval; for a 99% CI use
confint(manatee.lm,level=.99)
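These intervals can also be reproduced by hand from the coefficient table, which is a useful check of the usual formula, estimate ± t* × SE. A sketch for the 99% interval on the slope (the data and fit are recreated so the block stands alone):

```r
# Recreate the data and fit
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
manatee.lm <- lm(deaths ~ powerboats, data = Manatees)

est <- coef(summary(manatee.lm))   # estimates, SEs, t values, p values
# 99% CI: estimate +/- t(.995, 12 df) * SE
ci99 <- est["powerboats", "Estimate"] +
  c(-1, 1) * qt(.995, 12) * est["powerboats", "Std. Error"]
ci99
```

This should agree with the "powerboats" row of confint(manatee.lm, level=.99).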
The fitted line consists of our estimates of the expected values (or means) of the responses for each value of the explanatory variable. That is:
-41.4304 + 0.1249*powerboats

We can use this equation to make predictions. For example, we might want to guess the effect of a policy decision, such as restricting powerboat registrations to 700,000 each year. According to our equation, we expect that if powerboats = 700, then there will be about
-41.4304 + 0.1249*700

or roughly 46 manatees killed each year. We would also like to get an indication of the accuracy of our estimate of the line at that point. Unfortunately, we can't just add the standard errors of the coefficients, or their variances, in general. The formula is more complicated, and takes into account the covariance between the estimates of the intercept and the slope. In principle, you can compute the variance of the fitted line at a point (say powerboats=700) using the expression:
var(fit) = var(B0) + var(700*B1) + 2*cov(B0, 700*B1)

This will get very complicated as we consider more than one explanatory variable. Fortunately R has a function that computes this for us automagically: predict(). We need to set up a data frame containing the values of the explanatory variables at which we want to compute the fitted values, then pass the lm object and the new data frame to the predict() function.
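You can also check the formula directly: vcov() returns the estimated covariance matrix of the coefficients, and the variance of the fitted line at a point is a quadratic form in that matrix. A sketch (recreating the data and fit so the block runs on its own):

```r
# Recreate the data and fit
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
manatee.lm <- lm(deaths ~ powerboats, data = Manatees)

V <- vcov(manatee.lm)        # 2x2 covariance matrix of (intercept, slope)
x <- c(1, 700)               # design row for powerboats = 700
# var(fit) = var(B0) + 700^2*var(B1) + 2*700*cov(B0,B1), as a quadratic form
se.fit.700 <- sqrt(drop(t(x) %*% V %*% x))
se.fit.700                   # standard error of the fitted line at 700
```

This should match the se.fit value that predict() reports at powerboats = 700.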
M <- data.frame(powerboats=seq(400,750,10))  # set up a data frame with the explanatory variable(s);
                                             # the names of the explanatory variables must be
                                             # the same as in the original data frame
M   # look at the data frame you just created!
predM <- predict(manatee.lm, M, interval="confidence")
plot(M[,"powerboats"], predM[,"fit"], xlab="powerboats",
     ylab="deaths", type="l")                # plot the fitted line
lines(M[,"powerboats"], predM[,"lwr"], lty=3) # plot the lower 95% confidence limit
lines(M[,"powerboats"], predM[,"upr"], lty=3) # plot the upper 95% confidence limit

We can do the same operation manually:
predM <- predict(manatee.lm, M, se.fit=TRUE)  # compute the fitted line at the new points,
                                              # along with the standard errors of the fitted values
predM   # look at the result!
plot(M[,"powerboats"], predM$fit, type="l")
lines(M[,"powerboats"], predM$fit + qt(.975,12)*predM$se.fit, lty=2)
lines(M[,"powerboats"], predM$fit - qt(.975,12)*predM$se.fit, lty=2)  # plot a 95% CI for the fitted line

The 95% CI for the fitted line is narrowest in the middle, and bends out as we move to the ends of the range of the data. This is an indication that the uncertainty is least in the middle, and greatest at the ends of the range. Remember, this is not a confidence interval for observations, but for the location of the line!
The name of "predict" function is perhaps slightly misleading; the SE's it produces are standard errors of the mean, ie. the fitted line, not standard errors for predictions, which must also incorporate the spread of the points about the line. Fortunately, the component containing the standard errors is called "se.fit". Suppose we wish to make a prediction at powerboats = 700. There are two sources of uncertainty, the uncertainty in the fitted line, and the variability of observations about the line. Fortunately, a new observation will be independent of the rest, so its variance will be additive. The estimated variance of a deviation from the line is just the square of the residual standard error, thus we have
var(prediction at powerboats = 700) = 4.276388^2 + 2.056706^2

Here 2.056706 was the value of predM$se.fit corresponding to powerboats = 700, and 4.276388 was the residual standard error.
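This hand computation can be carried out in code, pulling the residual standard error out of the summary() object. A sketch, recreating the data and fit so the block runs on its own:

```r
# Recreate the data and fit
Manatees <- data.frame(
  year       = 1977:1990,
  powerboats = c(447, 460, 481, 498, 513, 512, 526, 559,
                 585, 614, 645, 675, 711, 719),
  deaths     = c(13, 21, 24, 16, 24, 20, 15, 34,
                 33, 33, 39, 43, 50, 47)
)
manatee.lm <- lm(deaths ~ powerboats, data = Manatees)

pr <- predict(manatee.lm, data.frame(powerboats = 700), se.fit = TRUE)
s  <- summary(manatee.lm)$sigma        # residual standard error
# var(prediction) = var(fit) + var(deviation about the line)
se.pred <- sqrt(pr$se.fit^2 + s^2)     # standard error of prediction at 700
se.pred
```

A 95% prediction interval at powerboats = 700 is then pr$fit plus or minus qt(.975, 12)*se.pred, which is what predict() with interval="prediction" computes.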
To compute the standard error of prediction at all the values in the data frame "M", use the predict function again:
predM <- predict(manatee.lm, M, interval="prediction")
plot(M[,"powerboats"], predM[,"fit"], type="l")  # plot the fitted line
lines(M[,"powerboats"], predM[,"lwr"], lty=3)    # plot the lower 95% prediction limit
lines(M[,"powerboats"], predM[,"upr"], lty=3)    # plot the upper 95% prediction limit
points(powerboats, deaths)                       # add the observed data