Testing independence in a 2x2 table

Consider two events A and B. A and B are independent if P(A|B)=P(A), or equivalently if P(A and B)=P(A)*P(B). The definition of course assumes that we know the probabilities. In practice, we don't know them. Instead we have counts of the occurrences of each of the possible combinations of outcomes: A and B, A and not-B, not-A and B, and not-A and not-B. These counts form a 2x2 table, where the rows are defined by A and not-A, the columns by B and not-B. For example, consider data from the Physicians' Health Study (1988 NEJM 318: 262-264). Here we could define the events as A: took aspirin, not-A: took a placebo, B: had a heart attack (myocardial infarction, MI), not-B: no MI.

	MI	no MI
Placebo	189	10845
Aspirin	104	10933

The observed proportions of MI within each row (estimates of P(MI|not-A) and P(MI|A), respectively) are:

> 189/(189+10845)
[1] 0.01712887

> 104/(104+10933)
[1] 0.00942285
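
The product form of the definition can be checked against the same counts. This is only a sketch, assuming we estimate each probability by its observed proportion; the names `counts`, `p.A`, `p.B` and `p.AB` are ours, chosen for illustration:

```r
# Counts from the table above; rows = treatment, columns = outcome
counts <- matrix(c(189, 104, 10845, 10933), ncol = 2,
                 dimnames = list(c("Placebo", "Aspirin"), c("MI", "no MI")))
n <- sum(counts)

p.A  <- sum(counts["Aspirin", ]) / n    # estimate of P(A): took aspirin
p.B  <- sum(counts[, "MI"]) / n         # estimate of P(B): had an MI
p.AB <- counts["Aspirin", "MI"] / n     # estimate of P(A and B)

# Under independence these two numbers should be close
c(observed = p.AB, product = p.A * p.B)
```

The observed joint proportion is smaller than the product, which is the same discrepancy seen in the row proportions.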

The idea of the chi-squared test for independence is essentially to ask the question: how likely are we to see proportions this different if the two factors are really independent? If it is sufficiently unlikely, usually taken to mean "occurring less than 5 percent of the time", then we conclude that we have evidence of dependence.
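
One way to get a feel for this question is by simulation. The sketch below is not the chi-squared calculation itself: it assumes that, under independence, both groups share the pooled MI rate, and it draws binomial counts for each group. The seed, the number of simulations, and all variable names are our own choices:

```r
set.seed(141)                      # arbitrary seed, for reproducibility only

n.placebo <- 189 + 10845           # row totals from the table
n.aspirin <- 104 + 10933
p.pooled  <- (189 + 104) / (n.placebo + n.aspirin)   # common MI rate if independent

obs.diff <- 189 / n.placebo - 104 / n.aspirin        # observed difference

# Simulate 10000 studies in which both groups share the pooled MI rate
n.sim    <- 10000
sim.diff <- rbinom(n.sim, n.placebo, p.pooled) / n.placebo -
            rbinom(n.sim, n.aspirin, p.pooled) / n.aspirin

# Fraction of simulated studies with a difference at least as large as observed
mean(abs(sim.diff) >= abs(obs.diff))
```

A simulation of this size cannot resolve probabilities much below 1/10000, so for a difference as large as the one observed here we would expect the fraction to be 0 or very nearly so.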

The chi-squared test in R

We will walk through the process step by step: creation of the dataset, computing the chi-squared test statistic, and evaluation of the results.

Create the dataset

There are two methods, one quick and dirty, the other creating a more self-explanatory dataset with labels.

Quick: create a dataset of type matrix directly, entering the numbers in column order:

> Aspirin = matrix(c(189,104,10845,10933),ncol=2)

> Aspirin
     [,1]  [,2]
[1,]  189 10845
[2,]  104 10933

Now the chi-squared test:

> chisq.test(Aspirin,correct=FALSE)

        Pearson's Chi-squared test

data:  Aspirin 
X-squared = 25.0139, df = 1, p-value = 5.692e-07

Not so quick: create a labeled data table, called a "data.frame". This takes slightly more work, but you are less likely to forget what the data represent. First we create a data frame:
```
> aspirin = c("no","yes","no","yes")
> MI = c("yes","yes","no","no")
> N = c(189,104,10845,10933)
> A = data.frame(aspirin,MI,N)
> A
  aspirin  MI     N
1      no yes   189
2     yes yes   104
3      no  no 10845
4     yes  no 10933
```
Now we create the cross-tabulation and compute the chi-squared statistic:
```
> Aspirin = xtabs(N~aspirin+MI,data=A)
```
The xtabs() function creates the cross-tabulation. The two factors (aspirin and MI) define the table. N contains the counts for each cell of the table.
```
> Aspirin
       MI
aspirin    no   yes
    no  10845   189
    yes 10933   104

> chisq.test(Aspirin,correct=FALSE)

        Pearson's Chi-squared test

data:  Aspirin 
X-squared = 25.0139, df = 1, p-value = 5.692e-07
```
You will note that the data are displayed in a slightly different order, but the values are the same, and the test statistic is identical.
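
The statistic reported by chisq.test() can also be computed by hand. This is a sketch, assuming the usual Pearson form: the sum over cells of (observed - expected)^2 / expected, with the expected counts under independence built from the row and column totals. The names `observed`, `expected` and `X2` are ours:

```r
# Observed counts, entered as in the matrix example
observed <- matrix(c(189, 104, 10845, 10933), ncol = 2)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected
X2 <- sum((observed - expected)^2 / expected)
X2

# Should agree with the value reported by chisq.test()
chisq.test(observed, correct = FALSE)$statistic
```

Rearranging the rows or columns changes neither the expected counts for each cell nor the sum, which is why both ways of entering the data give the same statistic.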

What does it mean?

The chi-squared statistic is 25.01 with one degree of freedom (we will explain degrees of freedom later!). The p-value is the probability we were after: how often we would see proportions at least this different if the two factors were really independent. If it is less than .05 we infer dependence. If it is not less than .05 we have failed to demonstrate dependence, which is not the same as demonstrating independence! Here it is roughly .0000006, which is really tiny. In other words, we have evidence that taking aspirin and having a heart attack are not independent. The proportion of subjects who had an MI while taking aspirin was roughly half that of the placebo group.
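
The p-value can be recovered from the statistic directly. This is a sketch, assuming the p-value is the upper-tail probability of a chi-squared distribution with one degree of freedom, evaluated at the rounded statistic from the output; `p.value` and `ratio` are names of our own:

```r
# Upper-tail probability that a chi-squared(1) variable exceeds the statistic
p.value <- pchisq(25.0139, df = 1, lower.tail = FALSE)
p.value

# Ratio of the two MI proportions (aspirin relative to placebo)
ratio <- (104 / (104 + 10933)) / (189 / (189 + 10845))
ratio
```

The ratio comes out a little above one half, which is the sense in which the aspirin group had "roughly half" the MI rate of the placebo group.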

Albyn Jones
September 2005