## Analysis of Variance

Analysis of variance is a term used primarily to refer
to the linear model when all the explanatory variables
are categorical variables. It is also used for other models which
appear similar, so it is important to understand the various
situations. There is a distinction between
"fixed effects" analysis of variance, which literally is
the standard linear model, and "random effects" or
"mixed model" analysis of variance, which includes coefficients
that are not assumed to be constants. "Repeated measures"
and "within subjects" designs are examples where the standard
linear model is not appropriate, although for balanced designs
one can still use the linear model and least squares to do
analyses. For unbalanced random effects and mixed models
other techniques, represented by the **R** functions
**lme** and **nlme**, are appropriate.

The examples will make use of the data collected by
Michelson and Morley on the speed of light, consisting of
5 batches of 20 measurements. Follow the link to
Michelson's data and paste the
dataset into your **R** session.
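If the linked page is unavailable, the same measurements ship with R as the built-in **morley** data frame. The sketch below is an assumption-laden convenience, not part of the original instructions: `morley` records the experiment number in a column named Expt and Speed as an offset from 299000 km/s, so we rename and rescale to match the Michelson, Speed, and Run names used in this tutorial.

```r
# A sketch, in case the pasted dataset is unavailable: R's built-in
# morley data frame holds the same 100 measurements. Its Expt column
# (1-5) plays the role of Run here, and its Speed column is recorded
# as km/s minus 299000, so we add that constant back.
Michelson <- data.frame(Speed = morley$Speed + 299000,
                        Run   = factor(morley$Expt))
str(Michelson)  # 100 obs. of 2 variables: Speed and Run (5 groups of 20)
```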

Use the **attach** function to allow direct reference to the variables in the data frame (Speed and Run):

attach(Michelson)

Before doing any formal analysis, **always** start by looking at
the data. For data in groups like this, boxplots are a natural
choice.

boxplot(Speed ~ Run)

To highlight the boxes, add color:

boxplot(Speed ~ Run,col="yellow")

The first run looks a bit different from the others, and the third seems to have possible outliers, although its overall spread is not inconsistent with the other groups. Let's compute some summary statistics: the averages and standard deviations by group. The easiest way to do this in **R** is with the **tapply** function:

tapply(Speed,Run,mean)

computes the mean of each run;

tapply(Speed,Run,sd)

computes the standard deviation of each run. The first group appears to have both a larger mean and a larger standard deviation than the other groups. Let's not worry about that for the moment; one can always use the **t.test** function to compare pairs of groups with the Welch test if one has to deal with differences in within-group variances.

Fitting the anova model is easy with the **aov** function:

Maov <- aov(Speed ~ Run,data=Michelson)

The **summary** function displays the analysis of variance table:

summary(Maov)

The F-test given in the table tests the null hypothesis that all runs have the same mean; the resulting p-value is about .003, which suggests that the runs did not all have the same mean. You should make sure you understand why the degrees of freedom are 4 and 95.

There is more analysis to do, but first, we should **always**
look at some standard **diagnostic plots**.
Since the model is based on the
groups each having a normal distribution with the same variance,
the residuals (the differences between the observations and their
group means) should all be normally distributed with mean 0 and
a common variance. We can assess these assumptions with the normal
plot and a plot of residuals vs. group means:

qqnorm(residuals(Maov))
qqline(residuals(Maov))

This plot is useful for assessing the assumption that the residuals are normally distributed. Here the left tail looks a bit long. The plot of residuals against the fitted values (here the fitted values are the group means) helps us look for outliers and unequal spread:

plot(fitted.values(Maov),residuals(Maov))
abline(h=0)

We can check more easily whether observations are outliers by adding lines at 2 residual standard deviations above and below the horizontal line at zero. The residual variance (the variance of the residuals!) appears in the anova table as the "residual mean square", which was 5511. Thus the residual standard error (the standard deviation of the residuals) is

sqrt(5511)

or about 74.

abline(h=2*sqrt(5511),col="red")
abline(h=-2*sqrt(5511),col="red")

Our estimate of the residual standard error may be inflated by the first group, but the procedure is still useful. We would expect about 5 of 100 residuals to fall outside these lines if the data were normally distributed. That's about what we have, but 4 of those are from one group.
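Rather than retyping 5511, the residual standard error can also be pulled straight from the fitted model. A small check, assuming the Maov object fitted above (this shortcut is not part of the original text):

```r
# A check, assuming Maov from aov(Speed ~ Run, data=Michelson) above:
# sigma() extracts the residual standard error directly, and it equals
# the square root of the residual sum of squares over its degrees of
# freedom, i.e. the square root of the residual mean square.
sigma(Maov)                               # about 74.2
sqrt(deviance(Maov) / df.residual(Maov))  # the same quantity by hand
```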

We can use the **identify** function to discover which
cases are the outliers (make sure the plot of residuals
on fitted values is still in your graphics window; if it
is not, you must replot it):

identify(fitted.values(Maov),residuals(Maov))

Click on the points of interest with the left mouse button. The case number should appear in the plot. When you are done, clicking back in the session window should terminate the **identify** function, which returns the list of case numbers.

There are various ways to test the hypothesis that the first group differs from the others. Perhaps the simplest idea is to repeat the analysis leaving out the first run:

Maov1 <- aov(Speed ~ Run,data=Michelson,subset= (Run != 1) )

The "!=" means "not equal to", so we are selecting the subset of cases that are not in the first run. You should repeat the summary function and the plots. Notice the big change in the residual variance (down from 5511 to 4136), and in the F statistic, which is now close to 1. In other words, there is no evidence that the other 4 groups differ from each other.

Recall that the two-sample t statistic is the ratio of the difference between the group means to the standard error of the difference. We can compare two groups, say groups 1 and 2, using a t-test; first compute the group means:

tapply(Speed,Run,mean)

        1        2        3        4        5
 299909.0 299856.0 299845.0 299820.5 299831.5

The residual standard error, assuming the groups have the same variance, is the square root of the residual mean square in the anova table:

sqrt(5511)

The standard error of a difference is s*sqrt(1/n1 + 1/n2) (see section 10.1 of the course notes), where n1 and n2 are the numbers of observations in the groups (here 20 in each). If the variances are all equal, our best estimate of the common standard deviation is the residual standard error from the anova. Finally, our t statistic for testing the null hypothesis of no difference between the means of the first two groups is:

(299909.0 - 299856.0)/(sqrt(5511)*sqrt(1/20 + 1/20))

The rejection point is based on the t distribution with 95 degrees of freedom, since that was the degrees of freedom used in computing our variance estimate.

qt(.975,95)

The t statistic is bigger than 1.98, so we reject the null hypothesis. Our 95% confidence interval for the difference is given by

(299909.0 - 299856.0) + qt(.975,95)*(sqrt(5511)*sqrt(1/20 + 1/20))
(299909.0 - 299856.0) - qt(.975,95)*(sqrt(5511)*sqrt(1/20 + 1/20))

Compare to the standard two-sample t-test:

t.test(Speed[Run == 1],Speed[Run == 2],var.equal=T)

Notice that the degrees of freedom are now 38, not 95: we are estimating the supposedly common variance using only the first two groups. The critical t value for a 5% significance level is

qt(.975,38)

which is a bit bigger than the value with 95 df (1.98). The fewer degrees of freedom make the critical value larger and the variance estimate less precise, and these effects combine to make it harder to reject the null hypothesis of no difference.

Here we have good reason to suspect that the variance of group one is bigger than the variances of the other groups, so the conservative approach would be to use the Welch test:

t.test(Speed[Run == 1],Speed[Run == 2])

Since the other groups don't appear to differ from each other, let's compare all of them to the first group:

t.test(Speed[Run == 1],Speed[Run != 1])

More data helps! To see the connection to linear models and analysis of variance, compare the results of the following two procedures:

t.test(Speed[Run != 1],Speed[Run == 1],var.equal=T)
D2345 = Run != 1
summary(lm(Speed ~ D2345))

Construct the 95% confidence interval for the coefficient of D2345 and compare it to the confidence interval given by t.test.

**Corrections for multiple comparisons:** If we have an experiment
with several groups, we can compare all group means to each other.
The danger of doing this is that the more comparisons we make, the
more likely we are to see a 'false positive'. We have to correct
for the fact that a set of sample means from the same population
has a largest and a smallest value which will be farther apart than
two randomly chosen sample means from that population. To make this
correction we can use a procedure known as "Tukey's Honest
Significant Difference". In R the function **TukeyHSD** does this
calculation for us from the aov data structure:

TukeyHSD(Maov,type="mean")

The function returns a set of confidence intervals for all group differences. The first row is labeled "2-1", meaning the confidence interval is for the mean of group 2 minus the mean of group 1. As always, rejecting the null hypothesis of no difference corresponds to a confidence interval that does not include 0.

We can also plot the resulting confidence intervals:

plot(TukeyHSD(Maov,type="mean"))
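Tukey's procedure is not the only option. As an alternative sketch (not used in the original analysis), base R's **pairwise.t.test** performs all pairwise comparisons with a chosen p-value adjustment, by default pooling a common standard deviation just as the anova does:

```r
# An alternative sketch (not from the original analysis): all pairwise
# comparisons with a Bonferroni correction, using the pooled SD.
# Assumes Speed and Run are available, e.g. via attach(Michelson).
pairwise.t.test(Speed, Run, p.adjust.method = "bonferroni")
```

The result is a table of adjusted p-values, one for each pair of groups; compare the "2-1" entry with the corresponding TukeyHSD interval.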

**Exercises** (using the cuckoo egg data):
- Make boxplots of the egg sizes by host species.
- Test the null hypothesis that the mean size of cuckoo eggs is the same for all host species. State your conclusion about the hypothesis.
- Make the normal plot of residuals, and evaluate it.
- Make the plot of residuals on fitted values, and evaluate it.
- Use the TukeyHSD function to construct 95% confidence intervals for the group differences.
