More Diagnostic Plots
Now try the following plot:
plot(Wgt,Hgpm,type="n")
text(Wgt,Hgpm,US)
The cars will be labeled by the value of the variable "US". We can also use other methods of highlighting:
plot(Wgt,Hgpm,col=rgb(US,0,0))
or
plot(Wgt,Hgpm,pch=16,col=rgb(US,0,0))
One more option:
plot(Wgt,Hgpm,pch=19,col=ifelse(US,"blue","red"))
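The colour arguments above can be checked without drawing anything. The short sketch below assumes that US is coded 0/1 (as the calls to rgb() and ifelse() suggest) and uses a made-up vector, us, in its place; the names us, cols_rgb and cols_ifelse are only for illustration.

```r
# Hypothetical 0/1 indicator standing in for the variable US
us <- c(0, 1, 1, 0)

# rgb() takes red, green and blue intensities between 0 and 1,
# so a 0/1 indicator in the red channel gives black (0) or red (1)
cols_rgb <- rgb(us, 0, 0)

# ifelse() treats 0 as FALSE and 1 as TRUE and picks a named colour
cols_ifelse <- ifelse(us, "blue", "red")

print(cols_rgb)
print(cols_ifelse)
```

Either vector can be passed as the col argument of plot(). The ifelse() form is usually easier to read and lets you choose any two colours.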
plot(Wgt,Hgpm)
identify(Wgt,Hgpm,Model)
First make a plot, then call the identify function with the same variables. Click with the mouse on any interesting points. Depending on the type of mouse, the labels (here the values of Model) will appear in the plot window next to the corresponding points either as you click on them or after you terminate. To exit from the identify function with a two-button mouse, use the right mouse button. With a one-button mouse, either hit the escape key or put the cursor in the R command window and press Enter. The case numbers of the identified points will be returned as the value of the function. You can then print the names of the cars that you picked out. For example, I identified case 60 as a car with relatively large gpm (i.e. high fuel consumption). To see which car this is, type:
Cars93[60,]
You could also label points in a residual plot using this method!
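identify() needs a mouse, so it cannot be used in a script. As a rough non-interactive counterpart, the sketch below labels the three largest residuals in a residual plot with text(). It is only an illustration: the built-in mtcars data (mpg and wt) stands in for the car data used here, and the choice of three points is arbitrary.

```r
# Assumption: the built-in mtcars data stands in for the car data above
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Case numbers of the three largest residuals in absolute value
big <- order(abs(res), decreasing = TRUE)[1:3]

# Residual plot with those cases labelled by car name
plot(mtcars$wt, res, xlab = "wt", ylab = "residual")
abline(h = 0, lty = 2)
text(mtcars$wt[big], res[big], labels = rownames(mtcars)[big], pos = 3)

# Print the labelled cases, as with Cars93[60,] above
print(mtcars[big, c("mpg", "wt")])
```

In an interactive session, identify(mtcars$wt, res, rownames(mtcars)) after the plot() call would let you pick the points by clicking instead.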
?influence.measures
Two you should use routinely are cooks.distance() and hatvalues(). The Cook's distance for each case is a measure of how much the regression coefficients would change if that case were deleted, relative to the variability of the estimates. A Cook's distance of 1 is pretty big. The hat value for each case is a measure of its distance from the center of the distribution of the explanatory variables, and thus is useful for identifying outliers in the explanatory variables which might be influential points. The term "high leverage" is often used to describe observations with a large hat value: these observations have the potential to be influential.
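To see how the two measures fit together, the sketch below rebuilds Cook's distance from the residuals and hat values and compares it with cooks.distance(). The closed form used is the standard textbook identity, not something stated in this handout, and the built-in mtcars data is an assumption standing in for the car data.

```r
# Assumption: mtcars (mpg regressed on wt) stands in for the car data
fit <- lm(mpg ~ wt, data = mtcars)

e  <- residuals(fit)         # ordinary residuals
h  <- hatvalues(fit)         # leverages
p  <- length(coef(fit))      # number of regression coefficients
s2 <- summary(fit)$sigma^2   # estimated residual variance

# Cook's distance grows with both the residual and the leverage
d_manual <- e^2 * h / (p * s2 * (1 - h)^2)

print(all.equal(d_manual, cooks.distance(fit)))

# Cases above the rough "pretty big" threshold of 1, if any
print(which(cooks.distance(fit) > 1))
```

The formula makes the point in the text concrete: a case needs both a sizeable residual and a sizeable hat value to get a large Cook's distance.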
lm.mpg <- lm(Hmpg ~ Wgt)
b <- cooks.distance(lm.mpg)
h <- hatvalues(lm.mpg)
To plot Cook's distances by case number use:
plot(b)
It may be more informative to plot Cook's distances by other variables, such as the explanatory variables:
plot(Wgt,b)
The same plots are useful for the hat values:
plot(h)
and
plot(Wgt,h)
With a single explanatory variable the hat value is essentially just the squared distance from the mean; when we work with multiple explanatory variables, the hat value is more informative.
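The single-variable case can be checked directly. The sketch below uses the standard simple-regression formula for the hat value, 1/n plus the squared distance from the mean divided by the total sum of squares, and compares it with hatvalues(). Again, mtcars is an assumed stand-in for the car data.

```r
# Assumption: mtcars$wt stands in for the explanatory variable Wgt
x <- mtcars$wt
n <- length(x)

# Hat value in simple regression: 1/n + (x - mean)^2 / sum of squares
h_manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

fit <- lm(mpg ~ wt, data = mtcars)
print(all.equal(h_manual, unname(hatvalues(fit))))

# The hat values add up to the number of coefficients (here 2)
print(sum(hatvalues(fit)))
```

So a plot of h against the explanatory variable is just a parabola centered at the mean, which is why the hat values only become really informative with several explanatory variables.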
The influence() function computes other statistics of interest, most notably "leave one out" computations for each case: coefficients and residual standard errors. For example, to look at how the estimate of the residual standard error is affected by each case, use
plot(influence(lm.mpg)$sigma)
or
plot(Wgt,influence(lm.mpg)$sigma)
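As a check on what the sigma component means, the sketch below refits the model with each case left out in turn and compares the resulting residual standard errors with influence()$sigma. It uses mtcars as an assumed stand-in for the car data; the name sigma_loo is only for illustration.

```r
# Assumption: mtcars (mpg regressed on wt) stands in for the car data
fit <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)

# Refit without case i and keep the residual standard error
sigma_loo <- sapply(seq_len(n), function(i) {
  summary(lm(mpg ~ wt, data = mtcars[-i, ]))$sigma
})

# influence() gives the same numbers without refitting n times
print(all.equal(sigma_loo, unname(influence(fit)$sigma)))
```

A case whose deletion makes the residual standard error drop noticeably is one that the fitted line describes poorly.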