Clusters, Hot Streaks and Lab Assignments

A small rural town experiences several cases of childhood leukemia, at a rate well above the national average. The local newspaper reports the disease cluster and demands that the state health authorities investigate its cause.

A basketball player hits several baskets in a row. The announcer states that the player has a "hot hand" tonight, and that the other players on the team should get her the ball at every opportunity.

What do these situations have in common? In both cases it is possible that something interesting is going on. The water supply of the small town might be contaminated with a carcinogenic chemical. The basketball player might be experiencing an unusual conjunction of concentration and technique. It is also possible that we are simply observing variation arising naturally due to chance.

As Hume observed, people have a natural tendency to see patterns and make causal inferences from those patterns. The existence of constellations in the night sky is a testament to our ability to find patterns where none exist. In recent years cognitive psychologists have attempted to analyze the kinds of heuristic rules that people use in these settings. Those wishing to explore the literature might start with the following references:

     Tversky, A., and Kahneman, D. (1993). "Belief in the law of small
          numbers." In A Handbook for Data Analysis in the Behavioral
          Sciences: Methodological Issues, 341-349.

     Tversky, A., and Kahneman, D. (1982). Judgment under Uncertainty:
          Heuristics and Biases.

The "law of small numbers" refers to the tendency to expect that small samples will be typical or "representative" of the population from which they are taken. It is an expression of people's often false expectations, rather than a real law! Later we will meet the "law of large numbers", which in effect states that, under certain circumstances, statistics computed from large samples will accurately reflect the corresponding population values.

Suppose that I toss a coin 50 times. What do you think will be the longest run of identical outcomes? It is hard to solve this problem analytically, but easy to study by simulation. Enter the following function definition into your R session.

     CoinToss <- function(n) 
                 { 
                 # n is the number of coin tosses
                 rbinom(n,1,1/2) 
                 }
The function will generate a pseudo-random sequence of n 0's and 1's. If we associate 0 with Tails and 1 with Heads, then the function is generating a sequence of n coin-toss outcomes. Lines beginning with "#" are comments.

To use the function, you specify the number of coin tosses. For example, to generate 10 coin tosses, enter

     CoinToss(10)
You will probably want to store the resulting string of 0's and 1's in a dataset. For example:
     x <- CoinToss(10)
Examine the sequence, equating 1's with heads. A run of heads is a maximal set of consecutive 1's: it begins either at the first trial or immediately after a 0, and it ends either at the last trial or immediately before a 0. A run of tails is defined similarly.

We can use the rle (for "run length encoding") function to extract all sequences of runs from a dataset. Here is an example:

> x
 [1] 1 1 0 0 1 0 0 1 0 1

> runs <- rle(x)

> runs
Run Length Encoding
  lengths: int [1:7] 2 2 1 2 1 1 1
  values : num [1:7] 1 0 1 0 1 0 1
The rle function returns two components: "lengths" is the vector of run lengths, and "values" is the vector of repeated values, one for each run. We can also reconstruct the original dataset (x, in the example above) from the run length encoding; you should try it with a dataset you have generated.
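
One way to do the reconstruction (a sketch, assuming the rle output is stored in "runs" as in the example) is to repeat each value for the corresponding run length; the built-in inverse.rle function does the same thing:

      rep(runs$values, times = runs$lengths)   # repeat each value for its run length
      inverse.rle(runs)                        # built-in inverse of rle

Either command should reproduce the original sequence stored in x.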

To find the length of the longest run, use the max function:

       x <- CoinToss(20)
       r <- rle(x)
       max(r$lengths)
You should inspect "x" and "r" afterwards to be sure that you understand what happened!
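
An aside, not required for the assignment: max(r$lengths) gives the longest run of either kind, heads or tails. If you want the longest run of heads specifically, you can restrict attention to the runs whose value is 1:

       max(r$lengths[r$values == 1])   # longest run of 1's (heads) only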

Try it several times for 20 tosses. You will see that the longest run in 20 tosses is quite variable in length - it might easily be as short as 3 or as long as 6 or 7.

In order to get a better idea of the variability of run lengths, we need to generate lots of coin toss strings of the desired length. While we could do this by generating the datasets one at a time, that would be boring and laborious.

Here is a function that does the simulation for us and prints a table of the relative frequencies of the lengths of the longest runs. Enter it into your R session window:

RLDist <- function(n, trials)
{
  #  n      : the number of coin tosses
  #  trials : the number of sets of n coin tosses
  M <- rep(0, trials)
  for (i in 1:trials)
  {
    # for each trial, find the length of the longest run
    M[i] <- max(rle(CoinToss(n))$lengths)
  }
  print(table(M) / length(M))   # print relative frequencies of the longest run lengths
  M                             # return the whole set of longest run lengths
}
The name of the function, "RLDist", stands for "run length distribution". The argument "n" is the number of coin tosses on each trial, and "trials" is the number of repetitions of n coin tosses you want. The "RLDist" function prints a table of the relative frequencies of the lengths of the longest runs, and returns the entire set of longest run lengths as a dataset.
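
As a side note (a sketch, not required for the assignment), the same simulation can be written more compactly with the replicate function, which evaluates an expression repeatedly and collects the results:

RLDist2 <- function(n, trials)
{
  # a more compact version of RLDist using replicate() instead of a for loop
  M <- replicate(trials, max(rle(CoinToss(n))$lengths))
  print(table(M) / length(M))
  M
}

Both versions behave the same way; the rest of the lab uses RLDist.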

Try it with a small problem first to get a sense of how much time it will take:

      M <- RLDist(20,100)
The longest run should most often be around 4 for 20 coin tosses. You can treat the simulation output as data, for example:
      hist(M)        # make a frequency histogram
      hist(M,freq=F) # make a relative frequency histogram
      mean(M)        # compute the average
      median(M)      # compute the median
      summary(M)     # a bit more information
We can often improve the choice of interval boundaries for the hist function. Here we would like the boundaries to be between integers rather than on the integers:
hist(M,breaks=seq(.5,14.5,1))
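
The upper boundary 14.5 is a convenient choice that assumes no longest run exceeds 14. A variant (just a sketch) that adapts the boundaries to whatever values are actually in M:

hist(M, breaks = seq(0.5, max(M) + 0.5, by = 1))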


Exercises

Compute the mean of the length of the longest run for 10, 20, 40, 80, and 160 tosses, and plot them against the numbers of tosses.

Try using 1000 trials for each simulation. You will get more accurate results with more trials, but of course it will take longer. You can investigate the accuracy of each simulation by repeating it and seeing how close the results are to each other. If you create a dataset with the averages and a dataset with the corresponding numbers of tosses, then you can use the plot function to study the relationship. You may find it helpful to set the axes of the plot yourself. For example, suppose the numbers of tosses are stored in a dataset named "n" and the averages in a dataset named "m"; the commands below walk through the steps.

M10 <- RLDist(10,1000)
mean(M10)
produces the average length of the longest run for 1000 repetitions of 10 coin tosses.
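
If you want to see how much the estimated average itself changes from one simulation to the next (a sketch, not required), repeat the whole calculation a few times and compare:

replicate(5, mean(RLDist(10, 1000)))   # five independent estimates of the average

Each call also prints its frequency table; the five numbers returned are the averages, and they should be close to one another.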
n <- c(10,20,40,80,160)
produces a dataset with the numbers of coin tosses. You will need to collect the averages in a dataset in a similar fashion. If you have stored the output of the RLDist function in datasets called M10, M20, M40, M80, M160, combine them into a dataset with:
m <- c(mean(M10),mean(M20),mean(M40),mean(M80),mean(M160))
To plot, try:
plot(n,m)
The plot function sets the range on the axes to be just big enough to contain all the data. You might want to force the plot to display the origin too, as in the following plot:
plot(n,m,xlim=c(0,500),ylim=c(0,10), xlab="Number of Tosses",
     ylab="Average Length of Longest Run")
We can also "connect the dots" (type="l": that's an ell, not a one):
plot(n,m,xlim=c(0,500),ylim=c(0,10), xlab="Number of Tosses",
     ylab="Average Length of Longest Run", type="l",col="red")
It is often useful to plot one of the variables in the log scale. Try
plot(log(n),m,xlim=c(0,7),ylim=c(0,8))
The "log" function defaults to base e (natural logarithm), you can use a different base if you like. Since we doubled n for each trial, it may be useful to use base 2:
plot(log(n,2),m,xlim=c(0,8),ylim=c(0,8))

Exercise 0. For each exercise below copy the R commands you used into a text file or your favorite word processor. Annotate with comments so that the reader can follow the thread. Print and submit with your assignment.
########################################################
# All comments embedded in R code should look like this
#     it will make your R commands much easier to follow
########################################################
Exercise 1. Make copies of one of the plots of n and m, and one of the plots of log(n) or log(n,2) and m to turn in. (You can cut-and-paste the graphics into your favorite word processor using a menu item in R on the Macs.)

Exercise 2. Make a prediction for the average length of the longest run in 640 tosses, and then test it by simulation. You don't need to do any computations; just look at your plots. You might want to remake the plot with the axes extended to cover the necessary range. Be sure you make the prediction before doing the simulation!

Exercise 3. What is the connection between the runs example and the `hot hand' example? The disease cluster example? Can you think of any other common instances where people impute meaning to patterns that might be purely random?

Exercise 4. There is a standard statistical parlor trick based on runs in coin tossing. Ask half of your audience to toss a coin 100 times and write down the sequence of heads and tails. Ask the other half of the audience to mentally generate sequences of 100 random heads and tails. You then look at several of the sequences, announcing for each one whether it was generated by coin tosses or mentally. What would be a good rule for discriminating between the two groups? Hint: how do your own naive beliefs about random sequences of coin tosses differ from those you have been studying today?



Albyn Jones
Feb 2005