The First Stop in Monte Carlo (Methods): Rejection Sampling

Welcome to Monte Carlo

Monte Carlo: site of the Formula 1 Monaco Grand Prix, home to a world-famous casino, and one of the coolest code names (coined by Nicholas Metropolis) for something mathematical.

According to Wikipedia,

Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.

There are a lot of different applications in fields like finance, engineering, computer graphics, and artificial intelligence. In fact, the best AIs for the ancient board game Go rely on Monte Carlo tree search instead of evaluating all possible moves like most chess AIs do. You can do a variety of interesting things with Monte Carlo methods, like estimate π by counting the number of samples that fall inside a circle, or make an AI for your game.
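As a quick taste, here’s a minimal sketch of that π estimate in base R:

# Estimate pi: sample points uniformly in the unit square, then count
# the fraction that lands inside the quarter circle of radius 1
n <- 100000
x <- runif(n)
y <- runif(n)
4 * mean(x^2 + y^2 <= 1)  # should come out near 3.14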

In this article, I’m only going to go over one algorithm: rejection sampling.

How many times have you been driving down the road thinking, “I really wish I could sample some random variables from a non-standard distribution!” Probably never. That doesn’t mean it’s not useful. Here’s a cool example of someone using approximate Bayesian computation (the sampling portion is a form of rejection sampling) to estimate the number of pairs of socks and single socks in a laundry basket based on a very small sample.

So, let’s begin!

The Basics

Let’s say we’re interested in sampling from some completely off-the-wall distribution, like, say, this zig-zag function:

library(ggplot2)

zig.zag <- function(x) {
    if (0 <= x && x <= 1) x
    else if (1 < x && x <= 2) x - 1
    else 0
}

x <- seq(0, 2, by = 0.01)
y <- sapply(x, zig.zag)

qplot(x, y, geom = 'line')


Does this even look like a probability density function? No, it doesn’t. It’s not even a continuous function, and with rejection sampling it doesn’t even have to be a valid PDF! So how do we go about drawing from it, and how does it work?

The 10,000-foot perspective

To start off, we draw a box that completely surrounds our zig-zag function. Then, we pick a point at random from our box. If it falls under our zig-zag function, we say we accept this point and return it in the sample. If it falls above the zig-zag function, we reject it and repeat the drawing process. Do this again and again until you get the number of samples that you want. It’s a very simple algorithm.

Let’s look at an example outcome of these drawings.

# We want 5000 samples from our zig-zag distribution
N <- 5000
i <- 1
draws <- c()
ret <- c()

while (i <= N) {
    # Draw a number along the width of the box
    # (the support of zig-zag)
    T <- runif(1, min = 0, max = 2)
    # Draw a number along the height of the box
    U <- runif(1, min = 0, max = 1)

    draws <- rbind(draws, c(T, U))

    # If the point is under the zig-zag, accept it
    if (U <= zig.zag(T)) {
        ret <- c(ret, T)
        i <- i + 1
    }
    # Otherwise, try again with a new draw
}

summary(ret)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0261  0.7110  0.9970  1.1700  1.7000  2.0000

A plot of our drawn samples and whether or not they were accepted:
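If you’d like to recreate it, here’s a quick sketch using the draws matrix and zig.zag from above (with ggplot2 already loaded):

# Colour each draw by whether it fell under the zig-zag
accepted <- draws[, 2] <= sapply(draws[, 1], zig.zag)
qplot(draws[, 1], draws[, 2], colour = accepted,
      xlab = 'T', ylab = 'U')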


A histogram of our accepted data points:
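Something like this one-liner will reproduce it:

qplot(ret, geom = 'histogram', binwidth = 0.05, xlab = 'x')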


That’s pretty close to our distribution. It looks like we’re sampling from our zig-zag!

Down the Rabbit Hole (Diving Into the Details)

Now, I’m going to dive down into the math behind this. I encourage you to keep reading, but if you’re just looking for an intuitive feel for this, you can stop here. The above is sufficient for the general idea.

The entire idea of rejection sampling hinges on the following: if we want to sample from a density $f$, and we can find an easy-to-sample density $g$ and a number $c$ such that $f(t) \le c\,g(t)$ for every $t$, then for any event $A$,

$$P(X \in A) = P\left(T \in A \,\middle|\, U \le \frac{f(T)}{c\,g(T)}\right)$$

where $X \sim f$, $T \sim g$, and $U \sim \text{Uniform}(0, 1)$.

Great. What’s that mean, and can you prove it?

This means that the probability of $X$ being in our event is the same as the probability that $T$ is in our event given that we have accepted this draw.

We’ll attempt to prove this by showing they are equivalent.

So if you will all take a minute to remember Bayes’ rule for conditional probability and apply it to our problem:

$$P\left(T \in A \,\middle|\, U \le \frac{f(T)}{c\,g(T)}\right) = \frac{P\left(T \in A,\; U \le \frac{f(T)}{c\,g(T)}\right)}{P\left(U \le \frac{f(T)}{c\,g(T)}\right)}$$

We can now work on simplifying the numerator, $P\left(T \in A,\; U \le \frac{f(T)}{c\,g(T)}\right)$:

$$\begin{aligned}
P\left(T \in A,\; U \le \tfrac{f(T)}{c\,g(T)}\right) &= \int_A P\left(U \le \tfrac{f(t)}{c\,g(t)},\; T = t\right)\,dt \\
&= \int_A P\left(U \le \tfrac{f(t)}{c\,g(t)} \,\middle|\, T = t\right) g(t)\,dt \\
&= \int_A \frac{f(t)}{c\,g(t)}\,g(t)\,dt \\
&= \frac{1}{c} \int_A f(t)\,dt
\end{aligned}$$

So, line by line we have:

  1. The definition of a joint probability
  2. An application of Bayes’ rule
  3. Remember, our acceptance algorithm is simply to draw a uniform random variable (pick a random number) from 0 to 1, which we’ll call $U$, and compare it to $\frac{f(t)}{c\,g(t)}$. This step was just recognizing that the probability of accepting given a specific value of $t$ is just the probability that $U$ is less than or equal to the quantity $\frac{f(t)}{c\,g(t)}$, which for a uniform $U$ is exactly $\frac{f(t)}{c\,g(t)}$. The acceptance is a Bernoulli random variable with this probability of success.
  4. The last step was just some simplifying algebra.

Now we move to the bottom, the denominator:

If $g$ is defined on the region $\mathcal{T}$ (its support),

$$\begin{aligned}
P\left(U \le \tfrac{f(T)}{c\,g(T)}\right) &= \int_{\mathcal{T}} P\left(U \le \tfrac{f(t)}{c\,g(t)},\; T = t\right)\,dt \\
&= \int_{\mathcal{T}} P\left(U \le \tfrac{f(t)}{c\,g(t)} \,\middle|\, T = t\right) g(t)\,dt \\
&= \int_{\mathcal{T}} \frac{f(t)}{c\,g(t)}\,g(t)\,dt \\
&= \frac{1}{c} \int_{\mathcal{T}} f(t)\,dt \\
&= \frac{1}{c}
\end{aligned}$$

Again going over this proof line by line:

  1. Definition of a marginal probability from a joint PDF
  2. Application of Bayes’ rule
  3. Again the probability of accepting given a specific value of t
  4. Simplifying algebra
  5. Integrating over the support of a PDF gives a value of 1

Plugging these back into our original formula leads us to:

$$P\left(T \in A \,\middle|\, U \le \frac{f(T)}{c\,g(T)}\right) = \frac{\frac{1}{c}\int_A f(t)\,dt}{\frac{1}{c}} = \int_A f(t)\,dt = P(X \in A)$$

That step is just taking our new pieces, doing some algebra, and applying the definition of the density at the end.

Now that we’ve proved that those two quantities are equal, we can apply this to any density we want.
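To make that concrete, here’s a minimal generic sampler. The name rejection.sample and the rg argument are just my own sketch, assuming $f(t) \le c\,g(t)$ holds everywhere and that rg() draws a single value from $g$:

rejection.sample <- function(n, f, g, rg, c) {
    ret <- numeric(n)
    i <- 1
    while (i <= n) {
        t <- rg()      # propose from the easy density g
        u <- runif(1)  # uniform on [0, 1]
        # Accept with probability f(t) / (c * g(t))
        if (u <= f(t) / (c * g(t))) {
            ret[i] <- t
            i <- i + 1
        }
    }
    ret
}

# The zig-zag again: g is Uniform(0, 2), so g(t) = 1/2, and c = 2
# makes c * g(t) = 1, which bounds the zig-zag everywhere
zz <- rejection.sample(5000, zig.zag,
                       function(t) dunif(t, 0, 2),
                       function() runif(1, 0, 2),
                       2)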

Notes on the choice of a bounding density and scaling constant

The easiest density to sample from is the uniform distribution, so that’s what we choose most of the time. The constant $c$ allows us to scale the height up so that $c\,g(t)$ is greater than any point on the function we want to sample from.

The larger $c$ needs to be, the larger the proportion of rejected samples will be and the longer the algorithm will take, because increasing $c$ increases the amount of rejection area. If you have something strange like an extremely sharp peak, try bounding it with a normal distribution or a gamma distribution with parameters that make the area between the two functions smaller, e.g. center the normal distribution on the peak and give it a small variance so it covers the function without a lot of extra space in between.
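For instance, here’s a sketch of that idea using the rejection.sample sketch from above. The sharp peak f.peak is made up for illustration (an unnormalized normal kernel with standard deviation 0.1); bounding it with a normal density of standard deviation 0.2 only needs $c = 0.2\sqrt{2\pi} \approx 0.5$, since the ratio f.peak(t) / dnorm(t, 0, 0.2) is largest at $t = 0$:

f.peak <- function(t) exp(-t^2 / (2 * 0.1^2))  # a very sharp peak at 0
c.peak <- 0.2 * sqrt(2 * pi)  # f.peak(t) <= c.peak * dnorm(t, 0, 0.2) everywhere

sharp <- rejection.sample(1000, f.peak,
                          function(t) dnorm(t, 0, 0.2),
                          function() rnorm(1, 0, 0.2),
                          c.peak)

Roughly half of these proposals get accepted, where a wide box around the same peak would reject far more.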

Your Turn

Now you can sample from whatever distribution you want. Can you think of a function you’d want to sample from? What if you wanted to sample from an observed sample’s distribution? (Those are called empirical density functions.) That’s easy now!
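Here’s a sketch of that last idea (the data and names are made up for illustration): estimate the density of an observed sample with density(), turn it into a function with approxfun(), and feed it to the rejection.sample sketch from earlier.

obs <- rnorm(100, mean = 3, sd = 2)  # pretend this is observed data
kde <- density(obs)                  # empirical (kernel) density estimate
f.hat <- approxfun(kde$x, kde$y, yleft = 0, yright = 0)

lo <- min(kde$x)
hi <- max(kde$x)
c.box <- max(kde$y) * (hi - lo)      # a uniform g needs c tall enough to cover the peak

new.draws <- rejection.sample(1000, f.hat,
                              function(t) dunif(t, lo, hi),
                              function() runif(1, lo, hi),
                              c.box)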

As always, if you have any questions or comments, comment below or write me an e-mail.

Markov Chains vs Simulation: Flipping a Million Little Coins


I saw an interesting question on Reddit the other day. The problem was about estimating the amount of decaying radioactive isotopes in a sample after a set amount of time. I don’t think anyone in the thread brought up Markov chains, but that’s what I immediately thought of. Most people used simulations or solved differential equations, but it sounded like a perfect example for illustrating how to do simple simulations and for tackling the very basics of Markov chains.

The Problem

This problem comes through the /r/dailyprogrammer page on Reddit.

The exact problem can be found here. I’m leaving out the parsing input portion because it’s not what I want to talk about.

It states that you are given a large amount of some radioactive isotope, Compound A. It decays into a different isotope, Compound B. That, in turn, decays into a different compound, and so on. You are tasked with finding the proportion of each compound present in the sample after some number of seconds.

What You’re Given

  • The isotopes decay as follows:
    • A -> B
    • B -> C
    • C -> D
    • D -> S
  • The probability that 1 atom of each isotope decays in a 1 second interval
  • Starting number of each atom
  • s - The number of seconds elapsed
p <- c(0.00007, # P(A -> B)
       0.0005,  # P(B -> C)
       0.00013, # P(C -> D)
       0.00022, # P(D -> S)
       0)       # S doesn't decay

atoms <- c(1000, # 1000 A
           0,    # 0 B
           0,    # 0 C
           0,    # 0 D
           0)    # 0 S

s <- 5000

# Set the seed for the simulations (any value will do)
set.seed(42)

The simple approach - Simulation

The first and simplest approach is to simulate each atom at each time point.

We’ll start with an atom that decays 50% of the time. To see whether it decays, we can treat the outcome as a Bernoulli random variable (either 0 or 1).

# Does it decay?
rbinom(1, 1, 0.5) == 1
## [1] FALSE

Nope. So now we know if one atom decays or not. Let’s test them all.

rbinom(1, atoms[1], 0.5)
## [1] 512

Great, now we know 512 atoms decay from A to B in the first one-second interval. If we repeat this for the next interval, we will have to simulate the first and second atom types.

Let’s assume we start with 56 A and 44 B atoms. We move one second forward by simulating these.

second.interval <- c(56, 44, 0, 0, 0)
a.decay <- rbinom(1, second.interval[1], 0.5)
b.decay <- rbinom(1, second.interval[2], 0.5)

(third.interval <- second.interval - c(a.decay, b.decay-a.decay, -b.decay, 0, 0))
## [1] 30 45 25  0  0

Now we have atoms of type C!

We can put all of this together to create a simple function that calculates this for our example.

sim.decay <- function(atoms, s, p) {
  n.atoms <- length(atoms)

  # Create a matrix to store the results after each interval
  results <- matrix(0, nrow = s + 1,
                       ncol = n.atoms)

  # The first row is the starting point
  results[1, ] <- atoms

  # It runs for 's' intervals after the first one
  for (i in 2:(s + 1)) {
    # Determine the number of atoms that decay of each type
    decayed <- rbinom(rep(1, n.atoms),  # 1 trial each
                      results[i - 1, ], # of the number of current atoms
                      p)                # each with its probability of decaying

    # We subtract the net change in each atom,
    # e.g. some B turn to C but some A turn to B
    results[i, ] <- results[i - 1, ] - (decayed - c(0, decayed[1:(n.atoms - 1)]))
  }

  results
}

Now we can simulate the original trial.

original.sim <- sim.decay(atoms, s, p)

Great, that gives us a rough idea of what we should expect! It’s not exact, but it’s a good start. Before we dive into another way of doing this, here’s a visualization of the process:

library(ggplot2)
library(reshape2)

plot.decay.history <- function(data) {
  df <- data.frame(data, interval = 1:nrow(data))
  names(df) <- c('A', 'B', 'C', 'D', 'S', 'interval')
  df <- melt(df, id.vars = c('interval'))
  p <- ggplot(data = df, aes(x = interval, y = value))
  p <- p + geom_area(aes(fill = variable))
  p <- p + labs(x = 'Seconds elapsed', y = 'Number of atoms')
  p <- p + guides(fill = guide_legend(title = NULL))
  p <- p + scale_y_continuous(expand = c(0, 0))
  p <- p + scale_x_continuous(expand = c(0, 0))
  p
}

plot.decay.history(original.sim)


# And another example
second.sim <- sim.decay(c(900, 200, 1000, 50, 0),      # atoms
                        500,                           # intervals
                        c(0.01, 0.03, 0.09, 0.02, 0))  # P(decay)

plot.decay.history(second.sim)



The simpler approach - Markov Chains

A brief diversion

Markov chains are really interesting, and very useful, mathematical objects. Formally, they are a sequence of random variables in which the next state depends only on the current state, but that definition means nothing to a lot of people. You’ve probably seen them in the form of random walks.
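A random walk takes two lines of R (assuming ggplot2 is loaded): each step cares only about where you stand right now, not how you got there.

# A simple random walk: flip a fair coin, step +1 or -1 each time
walk <- cumsum(sample(c(-1, 1), size = 100, replace = TRUE))
qplot(seq_along(walk), walk, geom = 'line')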

The easiest way I can explain them is with an analogy to baseball. Imagine a batter is up to the plate with two outs and no one on base. There are only a few things that can happen next. He can strike out and end the inning. He can hit a single, a double, or a triple, and be on base. Finally, he can hit a home run and increase his team’s score by one. These make up a set of discrete results for this state. Each one has a different probability of happening, hence there are things like batting averages, slugging percentages, and other fancier statistics. These are the states a Markov chain can be in. Each at-bat is one of the random variables in the sequence.

Each at-bat is also independent of the ones before it. The probability that Barry Bonds hits a home run doesn’t depend on whether or not the player before him struck out. All of the information we need to calculate the probability of each result is contained in the current state (the player at bat, the number of players on base, and the pitcher). What happened previously doesn’t affect what’s happening now; this is what we call memoryless. That means every baseball game is essentially a Markov chain played out by people instead of in some mathematician’s head.

Back to business

I hope you can see how this applies to our radioactive decay problem. Our state is just the number of atoms of each type we have. We don’t care if an atom decayed 10 minutes ago or 10 years ago. We only care what it is now. Each atom has its own little at-bat in our model. Its batting average is the probability that it decays in this interval.

Luckily, many mathematicians have tackled this problem time and time again. All we need to do is perform a little linear algebra.

Each state can be represented by the number of atoms of each type, written as a row vector:

$$x = \begin{pmatrix} n_A & n_B & n_C & n_D & n_S \end{pmatrix}$$

The probability of transitioning to a decayed state for each atom can be represented by something called a transition matrix. This is its form for our problem:

$$T = \begin{pmatrix}
1 - p_{AB} & p_{AB} & 0 & 0 & 0 \\
0 & 1 - p_{BC} & p_{BC} & 0 & 0 \\
0 & 0 & 1 - p_{CD} & p_{CD} & 0 \\
0 & 0 & 0 & 1 - p_{DS} & p_{DS} \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}$$

where $p_{AB}$ is the probability that A decays to B, and so on. S never decays, so its row leaves it in place with probability 1.

To find the proportion of each atom after a given number of intervals, just multiply by $T$ that many times. It’s that simple. So, to find the state after $s$ seconds, just find $x_s = x_0 T^s$.

To do this we need to calculate the transition matrix.

trans.mat <- function(p) {
  n.p <- length(p)

  T <- matrix(0, nrow = n.p, ncol = n.p)

  for (i in 1:(n.p - 1))
    T[i, ] <- c(rep(0, i - 1),   # Pad with i-1 0's for the offset
                1 - p[i], p[i],  # P(stay) = 1 - P(decay)
                rep(0, n.p - (i + 1)))

  # The final compound is stable, so it stays put with probability 1
  T[n.p, ] <- c(rep(0, n.p - 1), 1)

  T
}


Now we can use this to find the decay from a Markov chain.

markov.decay <- function(atoms, s, p) {
  n.atoms <- length(atoms)

  T <- trans.mat(p)
  M <- matrix(0, nrow = s + 1, ncol = n.atoms)

  M[1, ] <- atoms

  for (i in 2:(s + 1))
    M[i, ] <- M[i - 1, ] %*% T

  M
}

Let’s try out our previous examples using our new method and compare the results.

original.markov <- markov.decay(atoms, s, p)


second.markov <- markov.decay(c(900, 200, 1000, 50, 0),
                              500,
                              c(0.01, 0.03, 0.09, 0.02, 0))


# Final states, simulation vs. Markov chain
original.sim[s + 1, ]
## [1] 708 105 147  29  11
original.markov[s + 1, ]
## [1] 704.68 101.36 151.00  31.96  11.01
second.sim[501, ]
## [1]   10    3    1    6 2130
second.markov[501, ]
## [1]    5.913    2.957    1.109    9.923 2130.098

That’s pretty close. Keep in mind the simulation gives one random realization, while the Markov chain gives the expected counts, so they won’t match exactly.

I hope you enjoyed this, and that you learned something from this brief introduction to Markov chains and simulations.

If you have any questions, comments, or anything really, please comment below or send me an e-mail.

What's this blog? A Short Introduction

What’s this blog about?

This blog is simply a place for me to put some of the more coherent thoughts I have into words. I love math, statistics, and programming. I think about them pretty much all day, every day. I often find myself captivated by a problem, or I find some cool new trick or idea. I like to share these things, but often I don’t have anywhere to do more than mention to someone in passing, “Hey, isn’t this neat!”

This blog was created to give me a place to practice writing and share things that I find interesting or entertaining.

What I’ll be writing about

I’m a graduate student in biostatistics (which is just a fancier, more confusing word for statistics applied to medical or public health problems), so there will definitely be a heavy emphasis on math and statistics. I write a lot of code for analyses at work. I enjoy programming in my free time too, so there will also be a focus on programming from both a statistical and a general perspective.

These are my current interests. Most likely the subject of any given post will come from one or more of these:

  • Machine Learning/Statistical Learning
  • Programming
    • General practices and ideas (e.g. OOP and functional programming idioms)
    • Statistical applications (e.g. Basics of simulation studies, How-To for R packages)
  • Interesting mathematical concepts and questions
  • Statistical analysis of games
    • I like to look for dominant strategies in games that I play, and I also like to study things like the effect of skill in games with a large element of randomness.
  • Other random interests that may not have anything to do with this stuff

Hopefully someone other than me will get something out of reading this site. If you do, please drop me a line here.