Simulating Chi squared statistics

To get a feel for the behavior of the chi-squared statistic under random sampling, we can do a simulation: generate random samples corresponding to a specified pattern of probablilities for a 2x2 table, compute the chi-squared statistics, and study their behavior.

Later we will learn more about probability distributions like the multinomial. For now, it suffices to imagine that we have a table of probabilities:

B not B
A P11 P12
not A P21 P22
where P11+P12+P21+P22=1. Imagine pulling N stones out of a bag of stones with 4 colors corresponding to the four cells in the table. We will generate counts for the table by using the rmultinom() function. For example, suppose we want a sample size of 100 with all the probabilities equal to 1/4:
    rmultinom(1,100,c(1/4,1/4,1/4,1/4))
We have done one replication of our experiment of 100 trials each with the specified probabilies.

To use the chisq.test() function we need to put the data into a 2x2 matrix or table:

    matrix(rmultinom(1,100,c(1/4,1/4,1/4,1/4)),ncol=2)
Every time we use the rmultinom() function we get a new sample. Finally, let's generate a sample and use the chisq.test() function:
    A = matrix(rmultinom(1,100,c(1/4,1/4,1/4,1/4)),ncol=2)
    chisq.test(A,correct=F)
Or save the results in a new dataset:
    A = matrix(rmultinom(1,100,c(1/4,1/4,1/4,1/4)),ncol=2)
    B = chisq.test(A,correct=F)
    names(B)
    B
    B$stat
Look at the output from each step to make sure you understand it!

Now, let's do our experiment 1000 times. There are various ways to do this, here is a simple one:

  1. create a dataset (X) to hold the computed chi-squared values.
  2. repeat 1000 times: generate a sample, put it in a matrix, and compute the chi-squared statistic, storing it in our dataset X.
  3. look at X.
    X = rep(0,1000)

    for(i in 1:1000)
    {
    A = matrix(rmultinom(1,100,c(1/4,1/4,1/4,1/4)),ncol=2)
    B = chisq.test(A,correct=F)
    X[i] = B$stat
    }
Now look at X; too see the 1000 numbers, just enter the name of the dataset, X. That isn't very useful, is it? How about a histogram:
hist(X)
or just compute the 95-th percentile:
quantile(X,.95)
Notes: The chisq.test() function will return a missing value (actually "Inf") whenever there is a row or column that is entirely zeros. That is because the expected count is zero in that case, and the chi-squared statistic is computed as sum((O-E)^2/E). In the simulation you can replace these missing values by some rather large number to include them in the compuation. Make sure choose a number large enough so that you won't confuse it with a finite chi-squared value that might actually arise.
     X[is.na(X)]= 1000

Homework!

  1. Do a simulation generating 10000 chi-squared values for tables corresponding to a sample size of 20 with probabilities (1/6,1/6,1/3,1/3). Create a histogram and compute the 95th percentile.
  2. Do a simulation generating 10000 chi-squared values for tables corresponding to a sample size of 200 with probabilities (1/6,1/6,1/3,1/3). Create a histogram and compute the 95th percentile.
  3. Both simulations are generating chi-squared values with 1 degree of freedom. How do they compare to the theoretical chisquared value of 3.84 for the 95-th percentile?