Analysis of Variance
Analysis of variance is a term used primarily to refer to the linear model when all the explanatory variables are categorical. It is also used for other models that appear similar, so it is important to understand the various situations. There is a distinction between "fixed effects" analysis of variance, which literally is the standard linear model, and "random effects" or "mixed model" analysis of variance, which includes coefficients that are not assumed to be constants. "Repeated measures" and "within subjects" designs are examples where the standard linear model is not appropriate, although for balanced designs one can still use the linear model and least squares to do analyses. For unbalanced random effects and mixed models, other techniques, represented by the R functions lme and nlme, are appropriate.
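For completeness (it is not needed anywhere below), a random effects fit for data like these might look something like the following sketch, using the lme function from the nlme package and treating the batches as a random effect. The Michelson data frame is introduced in the next paragraph.
library(nlme)
# a sketch only: random intercept for each run (batch)
Mlme <- lme(Speed ~ 1, random = ~ 1 | Run, data = Michelson)
summary(Mlme)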
The examples will make use of the data collected by Michelson and Morley on the speed of light, consisting of 5 batches of 20 measurements each. Follow the link to Michelson's data and paste the dataset into your R session.
Use the attach command to allow direct reference to the variables in the data frame (Speed and Run).
attach(Michelson)
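If you prefer not to attach the data frame, you can always refer to the variables directly with the $ operator; for example,
mean(Michelson$Speed)
computes the overall mean speed.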
Before doing any formal analysis, always start by looking at the data. For data in groups like this, boxplots are a natural choice.
boxplot(Speed ~ Run)
To highlight the boxes, add color:
boxplot(Speed ~ Run,col="yellow")
The first run looks a bit different from the others, and the third seems to have possible outliers, although its overall spread is not inconsistent with that of the other groups. Let's compute some summary statistics: the averages and standard deviations by group. The easiest way to do this in R is with the tapply function:
tapply(Speed,Run,mean)
computes the mean of each run;
tapply(Speed,Run,sd)
computes the standard deviation of each run.
The first group appears to have both a larger mean and a larger standard deviation than the other groups. Let's not worry about that for the moment; one can always use the t.test function to compare pairs of groups with the Welch test if one has to deal with differing within-group variances.
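As an aside, R also provides a Welch-style test for comparing the means of more than two groups without assuming equal variances: the oneway.test function. A quick sketch:
# Welch's one-way test; var.equal=FALSE is the default
oneway.test(Speed ~ Run, data=Michelson)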
Fitting the anova model is easy; use the aov function:
Maov <- aov(Speed ~ Run,data=Michelson)
The summary function takes the resulting object and
produces the standard analysis of variance table:
summary(Maov)
The F-test given in the table tests the null hypothesis
that all runs have the same mean; the resulting p-value
is about .003, which suggests that the runs did not all have the
same mean. You should make sure you understand why the
degrees of freedom are 4 and 95.
There is more analysis to do, but first, we should always look at some standard diagnostic plots. Since the model is based on the groups each having a normal distribution with the same variance, the residuals (the differences between the observations and their group means) should all be normally distributed with mean 0 and a common variance. We can assess these assumptions with the normal plot and a plot of residuals vs. group means:
qqnorm(residuals(Maov))
qqline(residuals(Maov))
This plot is useful for assessing the assumption that the
residuals are normally distributed. Here the left tail looks a bit
long.
The residuals function computes the residuals from the
aov data structure. We could do this directly, but some
statistical models involve more complicated computations for the
residuals, so it is useful to get into the habit of using the
residuals function.
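To convince yourself that residuals is doing the obvious computation here, you can compare it with the direct calculation (a side trip, not part of the analysis):
# should return TRUE: residual = observation minus group mean
all.equal(residuals(Maov), Speed - fitted.values(Maov), check.attributes=FALSE)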
plot(fitted.values(Maov),residuals(Maov))
abline(h=0)
The fitted.values function computes the group means.
In more complicated models it computes the fitted model
at the observed data points, which might correspond to a
line or a curve, or something more complicated.
This plot is most useful for assessing the assumption of constant
variance across the groups. You can see that the group with the
largest mean (Run 1) appears to be a little more scattered than
the others, and the third group (Run 3) appears to have outliers.
We can check more easily whether observations are outliers by adding lines at plus and minus 2 times the residual standard deviation around the horizontal line at zero. The residual variance (the variance of the residuals!) appears in the anova table as the "residual mean square", which was 5511. Thus the residual standard error (the standard deviation of the residuals) is
sqrt(5511)
or about 74.
abline(h=2*sqrt(5511),col="red")
abline(h=-2*sqrt(5511),col="red")
Our estimate of the residual standard error may be
inflated by the first group, but the procedure is still useful.
We would expect about 5 of 100 residuals to be outside these lines
if the data were normally distributed. That's about what we have, but
4 of those are from one group.
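If you would rather count the large residuals than eyeball them, something like this works:
# how many residuals lie outside the +/- 2 sd lines, and which cases they are
sum(abs(residuals(Maov)) > 2*sqrt(5511))
which(abs(residuals(Maov)) > 2*sqrt(5511))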
We can use the identify function to discover which cases are the outliers (make sure the plot of residuals on fitted values is still in your graphics window; if not, you must replot it):
identify(fitted.values(Maov),residuals(Maov))
Click on the points of interest with the left mouse button.
The case number should appear in the plot. When you are done,
clicking back in the session window should terminate the
identify function, which returns the list of case numbers.
There are various ways to test the hypothesis that the first group differs from the others. Perhaps the simplest idea is to repeat the analysis leaving out the first run:
Maov1 <- aov(Speed ~ Run,data=Michelson,subset= (Run != 1) )
The "!=" is "not equal to", so we are selecting the subset
of cases that aren't in the first run. You should repeat
the summary function and plots. Notice the big change in the
residual variance (down from 5511 to 4136), and the F statistic
which is now close to 1. In other words, there is no evidence
that the other 4 groups differ from each other.
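In case it is helpful, repeating the diagnostics for the new fit looks like this:
summary(Maov1)
qqnorm(residuals(Maov1))
qqline(residuals(Maov1))
plot(fitted.values(Maov1),residuals(Maov1))
abline(h=0)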
Recall that the two sample t statistic is the ratio of the difference between the group means to the standard error of the difference. We can compare two groups, say groups 1 and 2, using a t-test; first compute the group means:
tapply(Speed,Run,mean)
       1        2        3        4        5
299909.0 299856.0 299845.0 299820.5 299831.5
The residual standard error, assuming the groups have the same variance, is the square root of the residual mean square in the anova table:
sqrt(5511)
The standard error of a difference between two group means is s*sqrt(1/n1 + 1/n2) (see section 10.1 of the course notes), where n1 and n2 are the numbers of observations in the groups (here 20 in each). If the variances are all equal, our best estimate of the common standard deviation s is the residual standard error from the anova. Finally, our t statistic for testing the null hypothesis of no difference between the means of the first two groups is:
(299909.0 - 299856.0)/(sqrt(5511)*sqrt(1/20 + 1/20))
The rejection point is based on the t distribution with 95 degrees of freedom, since that is the number of degrees of freedom associated with our variance estimate.
qt(.975,95)
The t statistic is bigger than 1.98, so we reject the null hypothesis.
Our 95% confidence interval for the difference is given by
(299909.0 - 299856.0) + qt(.975,95)*(sqrt(5511)*sqrt(1/20 + 1/20))
(299909.0 - 299856.0) - qt(.975,95)*(sqrt(5511)*sqrt(1/20 + 1/20))
Compare to the standard two sample t-test:
t.test(Speed[Run == 1],Speed[Run == 2],var.equal=TRUE)
Notice that the degrees of freedom are now 38, not 95: we are estimating the supposedly common variance using only the first two groups. The critical t value for a 5% significance level is
qt(.975,38)
which is a bit bigger than the value with 95 df (1.98). These effects combine to make it harder to reject the null hypothesis of no difference.
Here we have good reason to suspect that the variance of group one is bigger than the variances of the other groups, so the conservative approach would be to use the Welch test:
t.test(Speed[Run == 1],Speed[Run == 2])
Since the other groups don't appear to differ from each other, let's compare the first group with all the others combined:
t.test(Speed[Run == 1],Speed[Run != 1])
More data helps!
To see the connection to linear models and analysis of variance,
compare the results of the following two procedures:
t.test(Speed[Run != 1],Speed[Run == 1],var.equal=TRUE)
D2345 <- (Run != 1)
summary(lm(Speed ~ D2345))
Construct the 95% confidence interval for the coefficient for
D2345 and compare it to the confidence interval given by t.test.
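If you want R to do the arithmetic for you, the confint function extracts confidence intervals from a fitted lm object; this is one way to check your hand computation:
confint(lm(Speed ~ D2345), level=0.95)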
Corrections for multiple comparisons: If we have an experiment with several groups, we can compare all group means to each other. The danger of doing this is that the more comparisons we make, the more likely we are to see a 'false positive'. We have to correct for the fact that the largest and smallest of a set of sample means from the same population will be farther apart than two randomly chosen sample means from that population. One procedure that makes this correction is known as "Tukey's Honest Significant Difference". In R the function TukeyHSD does this calculation for us from the aov data structure:
TukeyHSD(Maov)
The function returns a set of confidence intervals for all
group differences. The first row is labeled "2-1", meaning
the confidence interval is for the mean of group two minus the
mean of group one. As always, rejecting the null hypothesis
of no difference corresponds to a confidence interval that
does not include 0.
We can also plot the resulting confidence intervals:
plot(TukeyHSD(Maov))