More Diagnostic Plots





Plotting Indicator variables

Often we want to examine the relationship between two variables while indicating the value of a third. For example, the data include an indicator variable for domestic cars. We might want to see if there is any indication that the pattern is different for domestic and foreign cars (bearing in mind that some domestic cars are produced overseas, and some foreign cars are built in the US). The indicator variable "US" is 1 if the car is domestic, and 0 if it is a foreign make.

Now try the following plot:

	plot(Wgt,Hgpm,type="n")
	text(Wgt,Hgpm,US)  
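The cars will be labeled by the value of the variable "US". We can also highlight with color. Since rgb(US,0,0) is red where US is 1 and black where it is 0, this plots domestic cars in red and foreign cars in black:
	plot(Wgt,Hgpm,col=rgb(US,0,0))
or, with filled circles,
	plot(Wgt,Hgpm,pch=16,col=rgb(US,0,0))
One more option, with domestic cars in blue and foreign cars in red:
	plot(Wgt,Hgpm,pch=19,col=ifelse(US,"blue","red"))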
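
A colored plot is easier to read with a legend. The following is only a sketch: it uses R's built-in mtcars data as a stand-in for the course data, so the variables wt, gpm and am (and the use of the transmission indicator in place of "US") are assumptions made for illustration.

```r
# Stand-in data: the built-in mtcars set, not the course's car data.
wt  <- mtcars$wt          # weight, in 1000 lb
gpm <- 1 / mtcars$mpg     # gallons per mile
am  <- mtcars$am          # indicator: 1 = manual, 0 = automatic

# One color per case, chosen by the indicator.
cols <- ifelse(am == 1, "blue", "red")

plot(wt, gpm, pch = 19, col = cols,
     xlab = "Weight (1000 lb)", ylab = "Gallons per mile")

# The legend records which color goes with which value of the indicator.
legend("topleft", legend = c("am = 1 (manual)", "am = 0 (automatic)"),
       pch = 19, col = c("blue", "red"))
```

Computing the color vector first, rather than inside the call to plot(), makes it easy to reuse the same colors in other plots.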

Graphical Identification of cases of interest

Another very useful function is the identify() function, which allows us to click on a point in a plot and see to which case it corresponds.
	plot(Wgt,Hgpm)
	identify(Wgt,Hgpm,Model)
First make a plot, then call identify() with the same variables. The third argument supplies the labels, so here the model names are drawn; if it is omitted, the case numbers are used instead. Click with the mouse on any interesting points. Depending on your system, the labels will appear in the plot window next to the corresponding points either as you click on them or after you terminate. To exit from identify() with a two-button mouse, use the right mouse button. With a one-button mouse, either hit the escape key or put the cursor in the R command window and press Enter. The identified case numbers are returned as the value of the function, so you can then print the data for the cars that you picked out. For example, I identified case 60 as a car with relatively large gpm (i.e., high fuel consumption). To see which car this is, type:
	Cars93[60,]
You could also label points in a residual plot using this method!
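
Here is one way that might look for a residual plot. This is a sketch using the built-in mtcars data rather than the course data, so the model mpg ~ wt and the use of row names as labels are assumptions. The call to identify() is wrapped in interactive() because it needs a live plot window and a mouse.

```r
# Stand-in model on built-in data (not the course's car data).
fit <- lm(mpg ~ wt, data = mtcars)

# Residuals against fitted values, with a reference line at zero.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Click on points to label them; only meaningful in an interactive session.
if (interactive()) {
  picked <- identify(fitted(fit), resid(fit), labels = rownames(mtcars))
  print(mtcars[picked, ])   # the rows for the cases that were clicked
}
```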

Analytical Identification of cases of interest

Troublesome points may not be obvious in a residual plot if they are influential in determining the location of the fitted line. Sometimes such points can be discovered through the process of recomputing the fitted line repeatedly, each time omitting a single case. Rather than explicitly recompute the regression line each time, R has several functions that compute diagnostic measures from the lm data structure. Read the help page for influence.measures:
	?influence.measures
Two you should use routinely are cooks.distance() and hatvalues(). The Cook's distance for each case is a measure of how much the regression coefficients would change if that case were deleted, relative to the variability of the estimates. A Cook's distance of 1 is pretty big. The hat value for each case is a measure of its distance from the center of the distribution of the explanatory variables, and thus is useful for identifying outliers in the explanatory variables which might be influential points. The term "high leverage" is often used to describe observations with a large hat value: these observations have the potential to be influential.
       lm.mpg <- lm(Hmpg ~ Wgt)
       b <- cooks.distance(lm.mpg)
       h <- hatvalues(lm.mpg)
To plot Cook's distances by case number use:
       plot(b)
It may be more informative to plot Cook's distances by other variables, such as the explanatory variables:
       plot(Wgt,b)
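
Besides plotting, it can help to list the cases with the largest Cook's distances and to check whether any exceed 1. The sketch below uses the built-in mtcars data as a stand-in for the course data; the model mpg ~ wt and the choice to show the top three cases are assumptions for illustration.

```r
# Stand-in model on built-in data (not the course's car data).
fit <- lm(mpg ~ wt, data = mtcars)
b <- cooks.distance(fit)

# The three cases with the largest Cook's distance, named by car.
top3 <- head(sort(b, decreasing = TRUE), 3)
print(top3)

# Names of any cases whose Cook's distance exceeds 1 (possibly none).
big <- names(b)[b > 1]
print(big)

# Spike plot by case number, with a reference line at 1.
plot(b, type = "h", xlab = "Case number", ylab = "Cook's distance")
abline(h = 1, lty = 2)
```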
The same plots are useful for the hat values:
       plot(h)
and
       plot(Wgt,h)
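With a single explanatory variable the hat value is essentially just the squared distance from the mean of that variable, rescaled and shifted by a constant; with multiple explanatory variables the hat value is more informative, because it measures distance from the center in all of them at once.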
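
For a regression with an intercept and one explanatory variable x, the hat value of case i is 1/n + (x_i - mean(x))^2 / sum((x - mean(x))^2). The following sketch checks this against hatvalues(), using the built-in mtcars data as a stand-in for the course data; the name h_formula is made up for the example.

```r
# Stand-in model on built-in data (not the course's car data).
x   <- mtcars$wt
fit <- lm(mpg ~ wt, data = mtcars)
h   <- hatvalues(fit)

# Hat values computed directly from the single-predictor formula.
n <- length(x)
h_formula <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# TRUE if the two agree up to rounding error.
print(all.equal(unname(h), h_formula))
```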

The influence() function computes other statistics of interest, most notably "leave one out" computations for each case: the change in the coefficients and the residual standard error when that case is omitted. For example, to look at how the estimate of the residual standard error is affected by each case, use

       plot(influence(lm.mpg)$sigma)
or
       plot(Wgt,influence(lm.mpg)$sigma)
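
To see what the "leave one out" values mean, we can refit the model by hand without one case and compare. This is a sketch on the built-in mtcars data, not the course data; the model mpg ~ wt and the choice of case 17 are arbitrary assumptions.

```r
# Stand-in model on built-in data (not the course's car data).
fit   <- lm(mpg ~ wt, data = mtcars)
s_loo <- influence(fit)$sigma     # one leave-one-out sigma per case

# Refit explicitly without case i and extract the residual standard error.
i <- 17                           # arbitrary case chosen for the check
fit_minus_i <- lm(mpg ~ wt, data = mtcars[-i, ])
s_refit <- summary(fit_minus_i)$sigma

# The two numbers should agree up to rounding error.
print(c(influence = unname(s_loo[i]), refit = s_refit))
```

The point of influence() is that it produces all of these values at once, without refitting the model once per case.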






Albyn Jones
August 2004