Binomial Distributions

Probability Computations Related to Binomial Distributions

For a binomial(n,p) random variable X, the R functions involve the abbreviation "binom":

    dbinom(k,n,p)  # binomial(n,p) density at k: Pr(X = k)

    pbinom(k,n,p)  # binomial(n,p) CDF at k: Pr(X <= k)

    qbinom(P,n,p)  # binomial(n,p) P-th quantile

    rbinom(N,n,p)  # N binomial(n,p) random variables

    help(Binomial) # documentation on the functions related
                   # to the Binomial distribution

n choose k

The related function "n choose k" which returns the binomial coefficients or the number of ways to choose k objects from a set of n objects without regard for order is:

    choose(n,k)

For example, choose(4,2) should be 6.

Self-check Exercise
A storyteller is given a set of 14 characters. Each night she is to tell a story involving 4 of the 14 characters, using each combination of 4 only once. For how many nights can she continue without repeating any selection of 4 characters?

Check your answer: Solutions

dbinom examples:

For a binomial(6,1/3) random variable named X, compute the probability of 2 successes, Pr(X=2):

> dbinom(2,6,1/3)
[1] 0.3292181

Compare to the direct computation (don't type all this into R, it's an equation not an R command :-):

Pr(X = 2) = ((6*5*4*3*2*1)/(2*1*4*3*2*1))*(1/3)^2*(2/3)^4 

          = 15*(1/3)^2*(2/3)^4

          =  0.3292181

or in R:

         choose(6,2)*(1/3)^2*(2/3)^4

We can compute the probabilities for the whole sample space at once if we like with:

          dbinom(0:6,6,1/3)

A nicer way to display those probabilities is in a table, using the data.frame() function:

          data.frame(X = 0:6, Prob = dbinom(0:6,6,1/3))

When printing tables like this, it is often better to round off the entries to fewer significant digits:

          M <- data.frame(X = 0:6, Prob = dbinom(0:6,6,1/3))
          round(M,4)

We can also produce the theoretical histogram for repeated trials of a given binomial experiment. Here is a function to draw the binomial density "curve", you can paste it in to your R session:

hist.binom <- function(n, p, ...)
{
# plots a relative frequency histogram of the binomial distribution
        k <- 0:n
        p <- dbinom(k, n, p)
        names(p) <- as.character(0:n)
        barplot(p,space=0, ...)
}

A relative frequency histogram is just a barplot with no space between the bars, where the total area is 1. The height of each bar here the probability associated with that binomial outcome. To use the hist.binom function, you must specify the values of n and p. For example, to get the histogram of a binomial(6,1/3) distribution, use

  
       hist.binom(6,1/3)

Try experimenting with the color ("col") argument; use help(colors) or colors() to see what colors R knows by name.

  
       hist.binom(6,1/3,col="yellow")

Self-check Exercise

First, set up the plot window to hold 4 plots using par(mfrow=c(2,2)). Then, plot the binomial densities for the following values of n and p:

Where is the peak of the density (the mode) located in each case? What is the relationship between the mode and the values of n and p?

Check your answer: Solutions

pbinom Examples:

For a binomial(6,1/3) random variable X, compute the probability that X is less than 3; in other words, Pr(X <= 2):

          pbinom(2,6,1/3)

Compare to summing the density (ie adding up the areas under the binomial histogram:

          dbinom(0,6,1/3)+dbinom(1,6,1/3)+dbinom(2,6,1/3)

or

          sum(dbinom(0:2,6,1/3))

We can find the probability that we get more than a given number of successes by subtraction: for the probability that X is greater than 2, Pr(X > 2):

          1 - pbinom(2,6,1/3)

We could of course compute the probability of that event by summing the individual probabilities:

          sum(dbinom(3:6,6,1/3))

Finally, R has another option for this computation:

          pbinom(2,6,1/3,lower.tail=FALSE)

The tails of a probability distribution are the values at either end of the range of the random variable. The pbinom function normally assumes that you want the lower tail of the distribution, that is the probability of getting less than or equal to a specified value. The specification "lower.tail=FALSE" tells S to compute the upper tail of the distribution, that is the probability of getting a value greater than the argument. Compare all three methods and make sure that they give the same probabilities. Note: since the Binomial sample space only includes integer values, Pr(X < k) is the same as Pr(X <= k-1).

qbinom Examples:

For continuous distributions, the "p" function and the "q" function are inverses (ie either one undoes the action of the other: try qnorm(pnorm(0)), or pnorm(qnorm(.5))). This doesn't work out quite so perfectly for the binomial distribution because of the discrete nature of the sample space. It is too "lumpy". Compare

      qbinom(.5,6,1/3)

      pbinom(2,6,1/3)

Self-check Exercises For a random variable X with a binomial(20,1/2) distribution, find the following probabilities.

Find Pr(X < 8)
Find Pr(X > 12)
Find Pr(8 <= X <= 12)

For a binomial(200,1/2) distribution:

Find Pr(X < 80)
Find Pr(X > 120)
Find Pr(80 <= X <= 120)

For a binomial(2000,1/2) distribution:

Find Pr(X < 800)
Find Pr(X > 1200)
Find Pr(800 <= X <= 1200)

You may use the either the dbinom or pbinom functions.

Note that these sets of computations pertain to the same range of sample proportions (number of successes divided by number of trials: 8/20 = 80/200 = 800/2000), but with different sample sizes. Remember the pbinom function gives Pr(X <= k), not Pr(X < k).

Check your answers: Solutions

Sampling without replacement: the hypergeometric distribution

    dhyper(k,r,b,n)  # hypergeometric density at k: Pr(X = k)
                     # r = number of red balls in the urn
                     # b = number of black balls in the urn
                     # n = sample size

    phyper(k,r,b,n)  # hypergeometric CDF at k: Pr(X <= k)

    rhyper(reps,r,b,n)
                     # generate random samples 
                     # reps = number of repetitions of sampling
                     # n draws from an urn with r reds and b blacks
                     #   value: a sample of size "reps" of numbers
                     #          of reds drawn in each trial of n draws

the Poisson distribution

     dpois(k, mu)    # poisson density at k: Pr(X = k)
     ppois(k, mu)    # poisson CDF at k: Pr(X <= k)
     rpois(n, mu)    # generate n poisson(mu) random variables

The Poisson distribution can be thought of as an approximation to the binomial when the number of independent trials (n) is large and the probability of an event (p) is small. It is a very important probability model, often useful when looking at counts of events like deaths per year, phone calls per minute, etc. The earliest known application is a dataset on deaths by horse-kick in Prussian Army cavalry corps (Bortkiewicz 1898)

    Deaths = c(0,1,2,3,4,5)
    N = c(109,65,22,3,1,0)
    weighted.mean(Deaths,N)
    [1] 0.61

"Deaths" correspond to the outcomes for a single unit, "N" is the number of cavalry units corresponding to each outcome (109 had 0 deaths, 65 had 1 death, etc.). Using the weighted.mean function is equivalent to computing the average of 109 0's, 65 1's, 22 2's, etc. You can verify this by createing the full dataset:

    D1 <- rep(Deaths,N)
    mean(D1)

Homework Exercises:

Let X be the number of heads in 10 tosses of a fair coin.

Find the probability of getting at least 5 heads (that is, 5 or more).
Find the probability of getting exactly 5 heads.
Find the probability of getting between 4 and 6 heads, inclusive.

Let X be the number of heads in 100 tosses of a fair coin.

Find the probability of getting at least 50 heads (that is, 50 or more).
Find the probability of getting exactly 50 heads.
Find the probability of getting between 40 and 60 heads, inclusive.

Let X be the number of heads in 1000 tosses of a fair coin.

Find the probability of getting at least 500 heads (that is, 500 or more).
Find the probability of getting exactly 500 heads.
Find the probability of getting between 400 and 600 heads, inclusive.

Describe any trends you see in the answers you gave to the preceding problems.
37 Reed students meet to plan the Renn Fayre Glow Opera. They decide to select a two person subcommittee to write the script for a version of Faust. How many ways could the Faust subcommittee be selected from the 37 students?
For the Bortkiewicz dataset, compare the observed data to the theoretical frequencies predicted by the Poisson distribution, using the observed rate (.61) as the Poisson parameter. The total count is 200 (the number of cavalry units). To compare observed counts to predicted counts, multiply the Poisson probabilities by the number of observations (200) to get the predicted counts. Display the results in a table with 3 columns: one for the number of deaths (0 to 5), one for the predicted counts, and one for the observed counts. See the data.frame() examples under dbinom if you don't recall how to create a table.
Compute the probability of getting outcomes 0, 1, 2, and 3 for the following distributions: Binomial(50, 1/25), Binomial(100,1/50),Poisson(2), Hypergeometric(r=4,b=96,n=50), Hypergeometric(r=40,b=960,n=50) and Hypergeometric(r=400,b=9600,n=50). How close are they to each other?

Math 141 Index Introduction to S

Albyn Jones
September 2004