Analysis of Variance

Analysis of variance is a term used primarily to refer to the linear model when all the explanatory variables are categorical variables. It is also used for other models which appear similar, so it is important to understand the various situations. There is a distinction between `fixed effects' analysis of variance, which literally is the standard linear model, and `random effects' or `mixed model' analysis of variance, which includes coefficients that aren't assumed to be constants. `Repeated measures' and `within subjects' designs are examples where the standard linear model is not appropriate, although for balanced designs one can still use the linear model and least squares to do analyses. For unbalanced random effects and mixed models other techniques, represented by the R functions lme and nlme, are appropriate.

The examples will make use of the data collected by Michelson and Morley on the speed of light, consisting of 5 batches of 20 measurements. Follow the link to Michelson's data and paste the dataset into your R session.

Use the attach command to allow direct reference to the variables in the data frame (Speed and Run).


Before doing any formal analysis, always start by looking at the data. For data in groups like this, boxplots are a natural choice.

    boxplot(Speed ~ Run)
To highlight the boxes, add color:
    boxplot(Speed ~ Run,col="yellow")
The first run looks a bit different from the others, and the third seems to have possible outliers, although its overall spread is not inconsistent with the other groups. Let's compute some summary statistics: the averages and standard deviations by group. The easiest way to do this in R is to use a function called "tapply":
computes the mean of each run;
The first group appears to have both a larger mean and a larger standard deviation than the other groups. Let's not worry about that for the moment; one can always use the t.test function to compare pairs of groups with the Welch test if one has to deal with variations in within group variances.

Fitting the anova model is easy, use the aov function:

        Maov <- aov(Speed ~ Run,data=Michelson)
The summary function takes the resulting object and produces the standard analysis of variance table:
The F-test given in the table tests the null hypothesis that all runs have the same mean; the resulting p-value is about .003, which suggests that the runs did not all have the same mean. You should make sure you understand why the degrees of freedom are 4 and 95.

There is more analysis to do, but first, we should always look at some standard diagnostic plots. Since the model is based on the groups each having a normal distribution with the same variance, the residuals (the differences between the observations and their group means) should all be normally distributed with mean 0 and a common variance. We can assess these assumptions with the normal plot and a plot of residuals vs. group means:

This plot is useful for assessing the assumption that the residuals are normally distributed. Here the left tail looks a bit long. The residuals function computes the residuals from the aov data structure. We could do this directly, but some statistical models involve more complicated computations for the residuals, so it is useful to get into the habit of using the residuals function.
The fitted.values function computes the group means. In more complicated models it computes the fitted model at the observed data points, which might correspond to a line or a curve, or something more complicated. This plot is most useful for assessing the assumption of constant variance across the groups. You can see that the group with the largest mean (Run 1) appears to be a little more scattered than the others, and the third group (Run 3) appears to have outliers.

We can check more easily if observations are outliers by adding a line at 2 times the residual standard deviation from the horizontal line at zero. The residual variance (the variance of the residuals!) appears in the anova table as the "residual mean square", which was 5511. Thus the residual standard error (the standard deviation of the residuals) is

or about 74.
Our estimate of the residual standard error may be inflated by the first group, but the procedure is still useful. We would expect about 5 of 100 residuals to be outside these lines if the data were normally distributed. That's about what we have, but 4 of those are from one group.

We can use the identify function to discover which cases are the outliers (make sure the plot of residuals on fitted values is still in your graphics window, if not you must replot it):

Click on the points of interest with the left mouse button. The case number should appear in the plot. When you are done, clicking back in the session window should terminate the identify function, which returns the list of case numbers.

There are various ways to test the hypothesis that the first group differs from the others. Perhaps the simplest idea is to repeat the analysis leaving out the first run:

        Maov1 <- aov(Speed ~ Run,data=Michelson,subset= (Run != 1) )
The "!=" is "not equal to", so we are selecting the subset of cases that aren't in the first run. You should repeat the summary function and plots. Notice the big change in the residual variance (down from 5511 to 4136), and the F statistic which is now close to 1. In other words, there is no evidence that the other 4 groups differ from each other.

Recall that the two sample t-test is the ratio of the difference between the group means to the standard error of the difference. We can compare two groups, say groups 1 and two, using a t-test ; first compute the group means

          1        2        3        4        5 
       299909.0 299856.0 299845.0 299820.5 299831.5 
The residual standard error, assuming the groups have the same variance is the square root of the residual mean square in the anova table:
The standard error of a difference is s*sqrt(1/n1 + 1/n2) (see section 10.1 of the course notes), where n1 and n2 are the number of observations in the groups (here 20 in each). If the variances are all equal, our best estimate of the common standard deviation is the residual standard error from the anova. Finally, our t statistic for testing the null hypothesis of no difference between the means of the first two groups is:
       (299909.0 - 299856.0)/(sqrt(5511)*sqrt(1/20 + 1/20))
The rejection point is based on the t distribution with 95 degrees of freedom, since that was the degrees of freedom for computing our variance estimate.
The t statistic is bigger than 1.98, so we reject the null hypothesis. Our 95% confidence interval for the difference is given by
       (299909.0 - 299856.0) + qt(.975,95)*(sqrt(5511)*sqrt(1/20 + 1/20))
       (299909.0 - 299856.0) - qt(.975,95)*(sqrt(5511)*sqrt(1/20 + 1/20))
Compare to the standard two sample t-test:
       t.test(Speed[Run == 1],Speed[Run == 2],var.equal=T)
Notice that the degrees of freedom is 38 now, not 95: we are estimating the supposedly common variance using only the first two groups. The critical t value for a 5% significance level is
Which is a bit bigger than the value with 95 df (1.98). These effects combine to make it harder to reject the null hypothesis of no difference.

Here we have good reason to suspect that the variance of group one is bigger than the variances of the other groups, so the conservative approach would be to use the Welch test:

       t.test(Speed[Run == 1],Speed[Run == 2])
Since the other groups don't appear to differ from each other, lets compare all to the first group:
       t.test(Speed[Run == 1],Speed[Run != 1])
More data helps! To see the connection to linear models and analysis of variance, compare the results of the following two procedures:
       t.test(Speed[Run != 1],Speed[Run == 1],var.equal=T)

       D2345 = Run != 1
       summary(lm(Speed ~ D2345))
Construct the 95% confidence interval for the coefficient for D2345 and compare it to the confidence interval given by t.test.

Corrections for multiple comparisons: If we have an experiment with several groups, we can compare all group means to each other. The danger of doing this is that the more comparisons we make, the more likely we are to see a 'false positive'. Here we have to correct for the fact a set of sample means from the same population has a largest and a smallest value which will be farther apart than two randomly chosen sample means from the same population, we can use a procedure known as "Tukey's Honest Significant Difference". In R the function TukeyHSD does this calculation for us from the aov data structure:

The function returns a set of confidence intervals for all group differences. The first row is labeled "2-1", meaning the confidence interval is for the mean of group two minus the mean of group one. As always, rejecting the null hypothesis of no difference corresponds to a confidence intervval that does not include 0.

We can also plot the resulting confidence intervals:



Load the Cuckoo Egg Data . The scientific question of interest is whether we have evidence that the sizes of cuckoo eggs depend on the host species.
  1. Make boxplots of the egg sizes by host species.
  2. Test the null hypothesis that the mean size of cuckoo eggs is the same for all host species. State your conclusion about the hypothesis.
  3. Make the normal plot of residuals, and evaluate it.
  4. Make the plot of residuals on fitted values, and evaluate it.
  5. Use the TukeyHSD function to construct 95% confidence intervals for the group differences.

Math 141 Index
Introduction to S