Comparing Two Samples

Often we wish to compare two samples, in order to make inferences about possible differences between the two sampled populations, or differences between subjects in different experimental conditions. It is important to remember that distributions may differ in many respects: location, spread, and shape. The standard t-test assumes that the two distributions are identical in every way except possibly for differences between mean values.

Before testing the difference between means, it is sensible to investigate the possibility that the two samples might differ in some other respect. The quickest way to compare all aspects of two samples is with a qqplot, though if you are comparing several samples you may also want to make simultaneous boxplots.

If the two samples do not appear to have the same shape, or the same spread, you should think hard about why you are using a hypothesis test to compare their means. For example, if one distribution has a standard deviation of 10, and the other 5, then even if there is no difference between the means, the first distribution will have a much greater probability of producing a value more than 10 units above the common mean (15.8% vs 2.3%).
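
You can verify those tail probabilities with pnorm, taking the common mean to be 0 for convenience (the choice of mean does not affect the answer):

        1 - pnorm(10, mean=0, sd=10)    # about 0.158: 10 units is 1 sd
        1 - pnorm(10, mean=0, sd=5)     # about 0.023: 10 units is 2 sds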

Suppose you have decided to test the null hypothesis that there is no difference between the means of the two groups. There are several options, depending on the details of the situation.

The t-test for independent samples


The conditions under which the two-sample t-test is appropriate for comparing two samples, say X's and Y's, include:

1. Each sample comes from a normal distribution.
2. The two populations have (approximately) equal variances.
3. The two samples are independent of each other.

It is best to check normality for each group with a normal quantile plot (qqnorm). There are tests to compare the variances of the two groups, but they are at least as sensitive to these conditions as the t-test is; I recommend comparing the distributions with a qqplot instead, and the variances need only be approximately equal. Independence of the two samples is hard to verify after the fact; you can attempt to guarantee it by randomization in the design of the experiment.
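
As a sketch of the normality check, once the two samples are stored in vectors (x and y are generic placeholder names here), qqnorm plots each sample against normal quantiles and qqline adds a reference line:

        qqnorm(x)
        qqline(x)
        qqnorm(y)
        qqline(y)

Roughly linear plots are consistent with normality.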

The data used in the following examples are from a longitudinal study of children in Berkeley, CA. Follow the link to the Berkeley dataset, and paste in the entire "Berkeley" data structure - everything between the horizontal lines.

Now, let's compare the mean heights in centimeters of boys and girls at age two, using the Berkeley dataset. The following commands select out the heights at age two for males and females.

        male <- Berkeley[,"sex"] == "Male"
        m.ht2 <- Berkeley[male,"ht2"]
        f.ht2 <- Berkeley[!male,"ht2"]
The exclamation point in front of "male" in the last line means "not": "not male" is of course female. As usual, there are other ways to separate the data into the two groups. If you print the whole dataset, you will see that the first 26 cases are males, and the remaining cases are females. You could use that information to select the two subsets, but it is good to have the more general tool for selecting from files where the subsets aren't already segregated.
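
For the record, the positional version would look like this, relying on the males occupying rows 1 through 26 and the females rows 27 through 58:

        m.ht2 <- Berkeley[1:26, "ht2"]
        f.ht2 <- Berkeley[27:58, "ht2"]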

Look at the qqplot for the two groups. Try to predict the outcome of the t-test before you continue.
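
One way to make the plot:

        qqplot(m.ht2, f.ht2)
        abline(0, 1)

Points falling near the line with intercept 0 and slope 1 indicate that the two groups have similar distributions.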

You might wish to plot the confidence intervals for the two groups. See this script. Copy it into an R document, then substitute your variables for x and y everywhere.

Now let's run the t-test:

        t.test(m.ht2,f.ht2,var.equal=TRUE)
If all is well, the output should look like this:

		 Two Sample t-test

data:  m.ht2 and f.ht2 
t = 1.0709, df = 56, p-value = 0.2888
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.8135485  2.6822985 
sample estimates:
mean of x mean of y 
 88.40000  87.46563 
You can see that there is hardly any difference between the two sample means - males were on average about 1 cm taller in the sample. The t-statistic is about 1.07. The p-value is big: about 0.29, and the 95 percent confidence interval includes 0. In other words, the observed difference could easily occur due solely to sampling variation - random noise. We say that the difference is "not statistically significant" as a shorthand for "not big enough to rule out chance variation as an explanation".

Power Computations

The power.t.test function can be used to do power computations for the two sample t-test. It assumes that the sample sizes in the two groups are identical, which is not actually the case for our sample. To compute a crude estimate of the observed power, pretend the sample sizes were equal:
        power.t.test(n=26,sd=3.5,delta=1)
This is equivalent to
        power.t.test(n=26,sd=3.5,delta=1,type="two.sample")
Where did "sd=3.5" come from? In general, before we do an experiment, we don't know what the standard deviation of the data will be! Here we already have a sample, so we can use the standard deviation based on our dataset. When designing an experiment, it is often helpful to look at the published literature in the field to get a sense of the level of variability to expect.
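
Since we have the sample in hand, we can compute the group standard deviations directly; each should come out in the neighborhood of 3.5:

        sd(m.ht2)
        sd(f.ht2)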

How big a sample would we need to achieve power of .9?

        power.t.test(sd=3.5,delta=1,power=.9,type="two.sample")
The power.t.test function takes the supplied values from the list (n, delta, sd, sig.level, power) and computes the one that is left unspecified. Note that "sd" defaults to 1 and "sig.level" defaults to .05, so you need to explicitly set them to "NULL" if you wish to have the function compute one of them. Remember that the "n" computed is the sample size for each group, not the total.
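
For example, to have power.t.test solve for the standard deviation at which delta = 1 would be detected with power .9 at n = 26 per group, sd must be set to NULL explicitly:

        power.t.test(n=26, delta=1, power=.9, sd=NULL, type="two.sample")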

A Non-Parametric Alternative to the t-test


One standard "non-parametric" test worth considering is the Wilcoxon rank-sum test. It also assumes that the two distributions have the same shape and spread, but does not assume they have normal distributions.
        wilcox.test(m.ht2,f.ht2)

        Wilcoxon rank sum test with continuity correction

data:  m.ht2 and f.ht2 
W = 503, p-value = 0.1756 
alternative hypothesis: true mu is not equal to 0 

Warning message: 
Cannot compute exact p-value with ties in: wilcox.test.default(m.ht2, f.ht2)
The warning is not important - it simply announces that the p-value is an approximate p-value, not an exact p-value. That is the least of our worries!

Note that the alternative hypothesis is stated as though the test is specific for comparing the means of the two distributions: "true mu is not equal to 0". Be careful here! The null hypothesis can be rejected because of differing shape even when the two distributions have the same mean. It is always important to check the assumptions underlying the test!

Randomization Tests


Use of the randomization test here assumes we have random samples from the populations in question: we didn't randomly assign subjects to be males or females! There are 26 males and 32 females in our sample, and the difference (mean(m.ht2)-mean(f.ht2)) is 0.934375.
     obs <- mean(m.ht2) - mean(f.ht2)   # the observed difference: 0.934375
     randTest <- rep(0,10000)
     x <- c(m.ht2,f.ht2)                # pool the two samples
     for(i in 1:10000)
     {
     y <- sample(x)                     # a random permutation of the pooled data
     randTest[i] <- mean(y[1:26])-mean(y[27:58])  # first 26 play the males' role
     }
     ###  compute the two-sided p-value ###
     sum(abs(randTest) >= obs)/10000
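
The result estimates the two-sided p-value: the proportion of relabelings producing a difference at least as large in absolute value as the one observed. For these data it should land in the same neighborhood as the t-test p-value of about 0.29.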

Bootstrapping


Bootstrapping is always an option. Since we have two independent samples, we have to resample each group separately. In general, for bootstrapping any statistical computation, you need to mimic the original sampling scheme when resampling the data. Here is the bootstrap code for the height data, using a relatively small number of resamplings to avoid plugging up the system during the lab:
     boot <- rep(0,10000)
     for(i in 1:10000)
     {
     m <- sample(m.ht2,replace=TRUE)   # resample the males, with replacement
     f <- sample(f.ht2,replace=TRUE)   # resample the females, with replacement
     boot[i] <- mean(m)-mean(f)        # difference of the resampled group means
     }
The 95% confidence interval for the difference between the means is
     quantile(boot,c(.025,.975))
Compare the bootstrap CI to the standard t-based CI (-0.8135, 2.682) and to the Welch CI (-0.7876, 2.656); they should be similar.
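
The Welch interval, for reference, is what t.test reports by default when var.equal=TRUE is omitted; it does not assume the two groups have equal variances:

        t.test(m.ht2, f.ht2)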


Albyn Jones