Quantile-Quantile Plots



We have already seen that we can compare two distributions by making adjacent boxplots, or back-to-back stem and leaf plots.

For a more detailed comparison of two distributions, it is sometimes useful to make use of the whole set of order statistics. We could just list the two sets of sorted values, and examine them:

> sort(A1)
 [1] -23 -22 -19 -18 -17 -15 -14 -11 -11 -10  -8  -5  -3  -2  -1   0   0   0   1
[20]   2   4   8   9   9   9   9  11  11  12  12  14  14  15  16  16  16  18  18
[39]  18  20  24  24  25  25  26  28  30  30  30  31  32  39  50

> sort(A2)
 [1] -37 -36 -30 -26 -22 -15 -13 -13 -12 -10  -8  -6  -6  -5  -5  -4  -1   0   0
[20]   0   2   2   2   3   6   7   7   7   9  10  10  11  13  15  16  16  17  18
[39]  19  20  20  22  23  24  25  30  33  34  36  36  44  47  53
>

If the two data sets are essentially samples from the same distribution, and the samples are the same size, then the corresponding ordered values (order statistics) should roughly match up. We can examine this graphically in a qqplot, or quantile-quantile plot, which is just a scatterplot of the two sorted datasets, i.e. the smallest observations from the two samples are plotted against each other, then the next smallest, and so on.
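The claim that a Q-Q plot of equal-sized samples is just a scatterplot of the sorted values can be checked directly. The sketch below uses two small made-up vectors (x and y are illustrative, not the A1 and A2 data) and asks qqplot() for its plotting coordinates without drawing anything:

```r
# Two made-up samples of equal size (illustrative only, not the course data)
x <- c(5, -2, 9, 0, 3)
y <- c(12, 1, 7, -4, 6)

# plot.it = FALSE returns the plotting coordinates instead of drawing the plot
qq <- qqplot(x, y, plot.it = FALSE)

# For equal sample sizes the coordinates are simply the two sorted samples
same_x <- isTRUE(all.equal(qq$x, sort(x)))
same_y <- isTRUE(all.equal(qq$y, sort(y)))

# Drawing the same picture by hand would be:
# plot(sort(x), sort(y))
```

Both same_x and same_y should come out TRUE here, since nothing needs to be interpolated when the samples are the same size.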

The quantile-quantile plot is an effective display of the relationship between corresponding order statistics from two samples: plot the corresponding pairs as points in a scatter plot. If the samples differ in size, qqplot() interpolates between the sorted values of the larger set to get the quantiles to plot. If the two distributions are similar, the points should lie close to the line y = x, a straight line with slope 1. If the two distributions are the same shape, but have different spreads, then the points should lie (approximately) on a straight line with slope approximately equal to the ratio of the spreads (i.e. the ratio of the standard deviations). Since the medians are plotted against each other, the locations or centers of the two groups are easily compared, as well. If the two distributions are very different in shape, then no straight line may fit the plot well!
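The slope claim can be illustrated with a deliberately artificial case. In the sketch below, y is constructed as an exact shift and rescaling of x (the values and the names x, y, z are made up for illustration), so the two samples have identical shape and the Q-Q points fall exactly on a line whose slope is the ratio of the standard deviations. The last lines show how many points qqplot() returns when the sample sizes differ.

```r
# Made-up sample, and a shifted and rescaled copy of it (same shape)
x <- c(-3, 1, 4, 8, 10, 15, 21)
y <- 5 + 2 * x                  # spread doubled, location shifted by 5

# Q-Q coordinates without drawing the plot
qq <- qqplot(x, y, plot.it = FALSE)

# Intercept and slope of the straight line through the Q-Q points
fit <- coef(lm(qq$y ~ qq$x))

# Ratio of the spreads, which should agree with the fitted slope
sd_ratio <- sd(y) / sd(x)

# Unequal sizes: the larger sample is interpolated down to the smaller size
z <- c(2, 7, 11)
qq2 <- qqplot(x, z, plot.it = FALSE)
n_points <- length(qq2$x)       # one point per value in the smaller sample
```

With real data the agreement is only approximate, because two samples never have exactly the same shape.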

The Q-Q plot of A1 against B1 appears to be roughly a straight line, but clearly the locations and spreads differ. We can clarify the shape comparison by plotting a straight line on the Q-Q plot with the abline() function. The syntax is abline(c(intercept,slope)), with "intercept" and "slope" replaced by the appropriate values for our qqplot. The slope should be roughly the ratio of the standard deviations:

> sqrt(var(B1)/var(A1))
[1] 1.636474
The intercept is the height of the line at A1 = 0. We don't know exactly what that will be, but we do know that the two medians are plotted against each other, so the height of the line at A1 = 11 (median(A1)) must be the median of B1, i.e. 38.5. Since the equation of the line is B1 = b0 + b1*A1, b1 is approximately 1.636, and the point (A1,B1) = (11,38.5) is on the line, we have
	 38.5 = b0 + 1.636*11
or b0 = 20.5. Try plotting the line with slope 1.636 and intercept 20.5:
	qqplot(A1,B1,main="qqplot(A1,B1)")
	abline(c(20.5,1.636))
It looks like we guessed a bit low, so let's try again:
	qqplot(A1,B1,main="qqplot(A1,B1)")
	abline(c(25,1.636))
This produces the Q-Q plot with roughly the right line. There is a slight hint of curvature, but it is not dramatic. It is reasonable to conclude that the two distributions are roughly similar in shape, but differ in location and scale.
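The recipe for the starting line can be written as a small function: take the slope to be the ratio of the standard deviations, then force the line through the point of medians. This is only a sketch; median_line() is a hypothetical helper name, not part of R, and the numeric check uses only the values quoted in the text.

```r
# Hypothetical helper: line through (median(x), median(y)) with slope sd(y)/sd(x).
# Returns c(intercept, slope), the form that abline() accepts.
median_line <- function(x, y) {
  slope <- sd(y) / sd(x)
  intercept <- median(y) - slope * median(x)
  c(intercept, slope)
}

# Check with the numbers in the text: medians 11 and 38.5, slope 1.636
b0 <- 38.5 - 1.636 * 11         # 20.504, i.e. about 20.5

# Usage with the course data (assumes A1 and B1 exist in the workspace):
# qqplot(A1, B1, main = "qqplot(A1,B1)")
# abline(median_line(A1, B1))
```

As the example in the text shows, a line chosen this way is only a first guess, and the intercept may still need adjusting by eye.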

We can learn a lot about how two distributions differ from each other by examination of the departures of a Q-Q plot from a line with slope 1.




Albyn Jones