Simulating Chi squared statistics

To get a feel for the behavior of the chi-squared statistic under random sampling, we can do a simulation: generate random samples corresponding to a specified pattern of probablilities for a 2x2 table, compute the chi-squared statistics, and study their behavior.

Later we will learn more about probability distributions like the multinomial. For now, it suffices to imagine that we have a table of probabilities:

	B	not B
A	P11	P12
not A	P21	P22

where P11+P12+P21+P22=1. Imagine pulling N stones out of a bag of stones with 4 colors corresponding to the four cells in the table. We will generate counts for the table by using the rmultinom() function. For example, suppose we want a sample size of 100 with all the probabilities equal to 1/4:

    rmultinom(1,100,c(1/4,1/4,1/4,1/4))

We have done one replication of our experiment of 100 trials each with the specified probabilies.

To use the chisq.test() function we need to put the data into a 2x2 matrix or table:

    matrix(rmultinom(1,100,c(1/4,1/4,1/4,1/4)),ncol=2)

Every time we use the rmultinom() function we get a new sample. Finally, let's generate a sample and use the chisq.test() function:

    A = matrix(rmultinom(1,100,c(1/4,1/4,1/4,1/4)),ncol=2)
    chisq.test(A,correct=F)

Or save the results in a new dataset:

    A = matrix(rmultinom(1,100,c(1/4,1/4,1/4,1/4)),ncol=2)
    B = chisq.test(A,correct=F)
    names(B)
    B
    B$stat

Look at the output from each step to make sure you understand it!

Now, let's do our experiment 1000 times. There are various ways to do this, here is a simple one:

create a dataset (X) to hold the computed chi-squared values.
repeat 1000 times: generate a sample, put it in a matrix, and compute the chi-squared statistic, storing it in our dataset X.
look at X.

    X = rep(0,1000)

    for(i in 1:1000)
    {
    A = matrix(rmultinom(1,100,c(1/4,1/4,1/4,1/4)),ncol=2)
    B = chisq.test(A,correct=F)
    X[i] = B$stat
    }

Now look at X; too see the 1000 numbers, just enter the name of the dataset, X. That isn't very useful, is it? How about a histogram:

hist(X)

or just compute the 95-th percentile:

quantile(X,.95)

Notes: The chisq.test() function will return a missing value (actually "Inf") whenever there is a row or column that is entirely zeros. That is because the expected count is zero in that case, and the chi-squared statistic is computed as sum((O-E)^2/E). In the simulation you can replace these missing values by some rather large number to include them in the compuation. Make sure choose a number large enough so that you won't confuse it with a finite chi-squared value that might actually arise.

     X[is.na(X)]= 1000

Homework!

Do a simulation generating 10000 chi-squared values for tables corresponding to a sample size of 20 with probabilities (1/6,1/6,1/3,1/3). Create a histogram and compute the 95th percentile.
Do a simulation generating 10000 chi-squared values for tables corresponding to a sample size of 200 with probabilities (1/6,1/6,1/3,1/3). Create a histogram and compute the 95th percentile.
Both simulations are generating chi-squared values with 1 degree of freedom. How do they compare to the theoretical chisquared value of 3.84 for the 95-th percentile?