Simulation
1 Introduction
Simulation is a powerful tool in probability and statistics. Probabilistic models are often too
complex for us to derive closed-form solutions of the distribution or expectation of quantities
of interest, as we do in homework problems.
As an example, imagine that you set up a probabilistic model to determine the probability
of winning a game of solitaire. If the cards are well shuffled, this probability equals
P(\text{Win}) = \frac{\text{Number of permutations that lead to a win}}{\text{Total number of permutations}}. \qquad (1)
The problem is that characterizing what permutations lead to a win is very difficult without
actually playing out the game to see the outcome. One could think of counting the number
of permutations that result in a win, but there are $52! \approx 8 \times 10^{67}$ permutations, so this is
computationally intractable. However, there is a simple way to approximate the probability
of interest: simulating a large number of games and recording what fraction result in wins.
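This estimate is easy to sketch in code. The toy game below is a hypothetical stand-in for playing out a round of solitaire: shuffle five cards and declare a "win" if they come out in increasing order, so the true win probability $1/5! \approx 0.83\%$ plays the role of the intractable quantity.

```python
import random

def game_won():
    # Toy stand-in for playing out one game: shuffle five cards and
    # "win" if they come out in increasing order (true prob 1/5!).
    cards = list(range(5))
    random.shuffle(cards)
    return cards == sorted(cards)

def monte_carlo_probability(trial, n):
    # Estimate P(event) by the fraction of n independent trials
    # in which the event occurs.
    return sum(trial() for _ in range(n)) / n

random.seed(0)
estimate = monte_carlo_probability(game_won, 100_000)
```

With 100 000 trials the estimate lands close to $1/120$, with fluctuations on the order of $\sqrt{p(1-p)/n}$.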
The game of solitaire was precisely what inspired Stanislaw Ulam to propose simulation-based
methods, known as Monte-Carlo techniques, in the context of nuclear-weapon research in
the 1940s:
The first thoughts and attempts I made to practice (the Monte Carlo Method) were suggested
by a question which occurred to me in 1946 as I was convalescing from an illness and playing
solitaires. The question was what are the chances that a Canfield solitaire laid out with 52
cards will come out successfully? After spending a lot of time trying to estimate them by
pure combinatorial calculations, I wondered whether a more practical method than abstract
thinking might not be to lay it out say one hundred times and simply observe and count the
number of successful plays.
This was already possible to envisage with the beginning of the new era of fast computers, and
I immediately thought of problems of neutron diffusion and other questions of mathematical
physics, and more generally how to change processes described by certain differential equa-
tions into an equivalent form interpretable as a succession of random operations. Later (in
1946), I described the idea to John von Neumann, and we began to plan actual calculations.1
The basic idea of Monte Carlo simulation is to repeatedly sample from a probabilistic model
to obtain realizations of a quantity of interest and then compute its average or its empirical
1. http://en.wikipedia.org/wiki/Monte_Carlo_method#History
distribution. The following example applies simulation to predict the upcoming presidential
election.
For each state i, let $S_i$ be a random variable that equals 1 if Clinton wins state i and $-1$ otherwise. We assume that these random variables are independent, which is of course not completely accurate. This is a simplified version of the probabilistic model used by the website fivethirtyeight to predict the outcome of the election, which does take into account historical correlation between the outcomes in different states. The model is fitted using poll data (as well as approval ratings, surveys, economic indicators, etc.). Later on in the course, we will learn how to estimate distributions from data, but for now we will assume that the distribution of the random variables $S_i$, $1 \leq i \leq 51$, is known, and use the estimates provided by fivethirtyeight on September 26, before the first presidential debate. These estimates are shown in Table 1.
In order to use the model to predict the outcome of the election, we need to compute the
probability of the event
\sum_{i=1}^{51} n_i S_i > 0, \qquad (3)
where ni denotes the number of electors assigned to state i (see Table 1). A brute-force
approach would require considering all possible outcomes, which number $2^{51} \approx 2 \times 10^{15}$.² This
is intractable. Alternatively, we can simulate the election repeatedly using the probabilistic
model and record how often Clinton or Trump win. Figure 1 plots the fraction of times Clin-
ton wins as we increase the number of samples. Eventually, the estimate of the probability
of our event of interest converges to around 60% (we perform the simulation three times to
verify convergence).
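The election simulation itself fits in a few lines of NumPy. The win probabilities and elector counts below are hypothetical stand-ins for five states (the real model uses the 51 values from Table 1); each row of S is one simulated election.

```python
import numpy as np

# Hypothetical win probabilities and elector counts; the actual model
# uses the 51 values reported in Table 1.
p_win = np.array([0.95, 0.60, 0.55, 0.48, 0.10])
electors = np.array([29, 18, 10, 15, 38])

rng = np.random.default_rng(0)
n_sims = 100_000

# S[i, j] = +1 if Clinton wins state j in simulation i, -1 otherwise.
S = np.where(rng.random((n_sims, len(p_win))) < p_win, 1, -1)

# Clinton wins an election when the elector-weighted sum is positive.
clinton_wins = (S @ electors > 0).mean()
```

The fraction `clinton_wins` converges to the probability of the event (3) as the number of simulations grows.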
2. It is possible to apply convolutions to compute the result more efficiently, but this technique is difficult to generalize when the random variables are no longer independent.
[Table 1: per-state data omitted; the columns are $n_i$, $P(S_i = 1)$, and $P(E_i)$, listed in two blocks of states.]
Table 1: The table shows the number of electors per state, the probability that Clinton wins the
state as estimated by fivethirtyeight on September 26 and an estimate of the probability that if
the outcome in the state changes then this affects the result of the election. The estimate is less
accurate if the probability of $E_i$ is very low, as we only used $10^4$ samples. The five states for which
the probability of Ei is largest are shown in blue.
Figure 1: Fraction of the number of simulations that result in Clinton winning the election. We
repeat the simulation three times to verify that it converges to the same estimate.
We can also use the probabilistic model to estimate the probability of other interesting events.
For example, a candidate's campaign could be interested in the influence of a particular state
in the final outcome of the election. One way to quantify this effect is by computing the
probability that if the outcome in state i changes then this also changes the result of the
election, i.e. the probability of the event
E_i := \left\{ \sum_{\substack{j=1 \\ j \neq i}}^{51} n_j S_j < n_i S_i,\ \sum_{j=1}^{51} n_j S_j > 0 \right\} \cup \left\{ \sum_{\substack{j=1 \\ j \neq i}}^{51} n_j S_j > n_i S_i,\ \sum_{j=1}^{51} n_j S_j \leq 0 \right\}. \qquad (4)
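Estimating $P(E_i)$ by simulation only requires flipping each simulated state outcome and checking whether the sign of the elector-weighted sum changes. A sketch with the same hypothetical five-state numbers (Table 1 holds the real values):

```python
import numpy as np

# Hypothetical stand-ins for the Table 1 values.
p_win = np.array([0.95, 0.60, 0.55, 0.48, 0.10])
electors = np.array([29, 18, 10, 15, 38])

rng = np.random.default_rng(1)
S = np.where(rng.random((100_000, len(p_win))) < p_win, 1, -1)
totals = S @ electors

p_decisive = []
for i in range(len(electors)):
    # Flipping S_i changes the total by -2 * n_i * S_i.
    flipped = totals - 2 * electors[i] * S[:, i]
    # E_i occurs when the flip changes the winner.
    p_decisive.append(((totals > 0) != (flipped > 0)).mean())
```

Each entry of `p_decisive` is the Monte Carlo estimate of $P(E_i)$ for the corresponding state.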
Monte Carlo methods require sampling from the distributions in the probabilistic model. This is typically done in two steps:
1. Generating samples that are uniformly distributed in [0, 1].
2. Transforming the uniform samples so that they have the desired distribution.
Figure 2: Illustration of the method to generate samples from an arbitrary discrete distribution
described in Example 2.1.
In these notes we will focus on the second step, assuming that we have access to a random-
number generator that produces iid samples following a uniform distribution in [0, 1]. The
construction of good uniform random generators is an important problem, but it is beyond
the scope of this course.
2 Inverse-transform sampling
Inverse-transform sampling makes it possible to sample from an arbitrary distribution with a known cdf by applying a deterministic transformation to uniform samples. To provide some intuition,
we first discuss how to sample from a discrete distribution.
Example 2.1 (Sampling from a discrete distribution). Let X be a discrete random variable
with pmf pX and U a uniform random variable in [0, 1]. Our aim is to transform a sample
from U so that it is distributed according to pX . We denote the values that have nonzero
probability under pX by x1 , x2 , . . .
For a fixed i, assume that we assign all samples of U within an interval of length pX (xi )
to xi . Then the probability that a given sample from U is assigned to xi is exactly pX (xi )!
Very conveniently, the unit interval can be partitioned into intervals of length pX (xi ). We
can consequently generate X by sampling from U and setting

X = \begin{cases} x_1 & \text{if } 0 \leq U \leq p_X(x_1), \\ x_2 & \text{if } p_X(x_1) < U \leq p_X(x_1) + p_X(x_2), \\ \ \vdots & \\ x_i & \text{if } \sum_{j=1}^{i-1} p_X(x_j) < U \leq \sum_{j=1}^{i} p_X(x_j), \\ \ \vdots & \end{cases} \qquad (5)
The thresholds in (5) are given by the cdf of X,

F_X(x) = P(X \leq x) \qquad (6)
= \sum_{x_i \leq x} p_X(x_i), \qquad (7)
so our algorithm boils down to obtaining a sample u from U and then assigning it to the $x_i$ such that $F_X(x_{i-1}) < u \leq F_X(x_i)$. This is illustrated in Figure 2 for a simple example.
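A minimal vectorized sketch of this assignment (assuming NumPy) computes the cdf once and locates each uniform draw with a binary search:

```python
import numpy as np

def sample_discrete(values, pmf, n, rng=None):
    # Partition [0, 1] into intervals of length pmf[i]; a uniform draw u
    # is assigned to the smallest i with F_X(x_i) >= u.
    rng = rng or np.random.default_rng()
    cdf = np.cumsum(pmf)
    cdf[-1] = 1.0  # guard against floating-point round-off
    u = rng.random(n)
    idx = np.searchsorted(cdf, u)  # binary search for the interval of u
    return np.asarray(values)[idx]

rng = np.random.default_rng(0)
samples = sample_discrete([1, 2, 3], [0.2, 0.5, 0.3], 50_000, rng)
```

The empirical frequencies of the returned samples approximate the pmf.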
Algorithm 2.2 (Inverse-transform sampling). Let U be a random variable that is uniformly distributed in [0, 1].
1. Obtain a sample u of U.
2. Set $x := F_X^{-1}(u)$.
The careful reader might point out that $F_X$ may not be invertible at every point. To avoid this problem we define the generalized inverse of the cdf as

F_X^{-1}(u) := \min_x \left\{ F_X(x) \geq u \right\}.

The function is well defined because all cdfs are non-decreasing, so $F_X$ is equal to a constant c in any interval $[x_1, x_2]$ where it is not invertible.
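Before proving that the algorithm works in general, here is a concrete instance: an exponential random variable with parameter $\lambda$ has cdf $F_X(x) = 1 - e^{-\lambda x}$ on $x \geq 0$, which can be inverted in closed form as $F_X^{-1}(u) = -\log(1-u)/\lambda$. A sketch in Python:

```python
import numpy as np

def exponential_inverse_transform(lam, n, rng=None):
    # F_X(x) = 1 - exp(-lam * x) for x >= 0, so the inverse is
    # F_X^{-1}(u) = -log(1 - u) / lam.
    rng = rng or np.random.default_rng()
    u = rng.random(n)  # iid uniform samples in [0, 1)
    return -np.log(1.0 - u) / lam

x = exponential_inverse_transform(2.0, 100_000, np.random.default_rng(0))
```

The sample mean of `x` should be close to $1/\lambda = 0.5$.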
We now prove that Algorithm 2.2 works.
[Figure: illustration of the generalized inverse; the uniform samples $u_1, \ldots, u_5$ on the horizontal axis are mapped to the samples $F_X^{-1}(u_1), \ldots, F_X^{-1}(u_5)$.]
3 Rejection sampling
Inverse-transform sampling requires having access to the inverse of the cdf corresponding to
the distribution of interest, which is not always the case. For instance, in the case of the
Gaussian random variable the cdf does not even have a closed-form expression. Rejection
sampling, also known as the accept-reject method, only requires knowing the pdf of the
distribution of interest. It makes it possible to obtain samples from a target pdf $f_Y$ by choosing among samples obtained according to a different pdf $f_X$ that satisfies

f_Y(y) \leq c\, f_X(y) \qquad (15)

for all y, where c is a fixed positive constant. In words, the pdf of Y must be bounded by a scaled version of the pdf of X.
Algorithm 3.1 (Rejection sampling). Let X be a random variable with pdf fX and U a
random variable that is uniformly distributed in [0, 1] and independent of X. We assume
that (15) holds.
1. Obtain a sample y of X.
2. Obtain a sample u of U .
3. Declare y to be a sample of Y if

u \leq \frac{f_Y(y)}{c\, f_X(y)}. \qquad (16)
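Algorithm 3.1 translates directly into code: repeat the propose/accept loop until n samples are accepted. The triangular target $f_Y(y) = 2y$ on $[0, 1]$ with a uniform proposal and $c = 2$ is just an illustrative choice, not part of the notes.

```python
import numpy as np

def rejection_sample(f_target, f_proposal, draw_proposal, c, n, rng=None):
    # Requires f_target(y) <= c * f_proposal(y) for all y (condition (15)).
    rng = rng or np.random.default_rng()
    out = []
    while len(out) < n:
        y = draw_proposal(rng)  # step 1: sample from f_X
        u = rng.random()        # step 2: uniform in [0, 1]
        if u <= f_target(y) / (c * f_proposal(y)):  # step 3: accept test
            out.append(y)
    return np.array(out)

rng = np.random.default_rng(0)
# Example: target f_Y(y) = 2y on [0, 1], uniform proposal, c = 2.
samples = rejection_sample(lambda y: 2.0 * y, lambda y: 1.0,
                           lambda r: r.random(), 2.0, 20_000, rng)
```

The sample mean should be close to $E[Y] = 2/3$ for the triangular target.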
The following theorem establishes that the samples obtained by rejection sampling have the
desired distribution.
Theorem 3.2 (Rejection sampling works). If assumption (15) holds, then the samples pro-
duced by rejection sampling are distributed according to fY .
Proof. Let Z denote the random variable produced by rejection sampling. The cdf of Z is
equal to
F_Z(y) = P\left( X \leq y \,\middle|\, U \leq \frac{f_Y(X)}{c\, f_X(X)} \right) \qquad (17)
= \frac{P\left( X \leq y,\ U \leq \frac{f_Y(X)}{c\, f_X(X)} \right)}{P\left( U \leq \frac{f_Y(X)}{c\, f_X(X)} \right)}. \qquad (18)
To compute the numerator we integrate the joint pdf of U and X over the region of interest
P\left( X \leq y,\ U \leq \frac{f_Y(X)}{c\, f_X(X)} \right) = \int_{x=-\infty}^{y} \int_{u=0}^{\frac{f_Y(x)}{c\, f_X(x)}} f_X(x)\, du\, dx \qquad (19)
= \int_{x=-\infty}^{y} \frac{f_Y(x)}{c\, f_X(x)}\, f_X(x)\, dx \qquad (20)
= \frac{1}{c} \int_{x=-\infty}^{y} f_Y(x)\, dx \qquad (21)
= \frac{1}{c}\, F_Y(y). \qquad (22)
The denominator is obtained in a similar way
P\left( U \leq \frac{f_Y(X)}{c\, f_X(X)} \right) = \int_{x=-\infty}^{\infty} \int_{u=0}^{\frac{f_Y(x)}{c\, f_X(x)}} f_X(x)\, du\, dx \qquad (23)
= \int_{x=-\infty}^{\infty} \frac{f_Y(x)}{c\, f_X(x)}\, f_X(x)\, dx \qquad (24)
= \frac{1}{c} \int_{x=-\infty}^{\infty} f_Y(x)\, dx \qquad (25)
= \frac{1}{c}. \qquad (26)
We conclude that
F_Z(y) = F_Y(y), \qquad (27)
so the method produces samples from the distribution of Y .
We now illustrate the method by applying it to produce a Gaussian random variable from
an exponential and a uniform random variable.
Example 3.3 (Generating a Gaussian random variable). In Example 2.5 we learned how to generate an exponential random variable using samples from a uniform distribution. In this example we will use samples from an exponential distribution to generate a standard Gaussian random variable by applying rejection sampling.
The following lemma shows that we can generate a standard Gaussian random variable Y
by:
[Figure 4: the pdf $f_H$ together with the bounding function $c\, f_X$, plotted for x between 0 and 5.]
1. Generating a random variable H with pdf given by (28).
2. Generating a random sign S that equals 1 with probability 1/2 and $-1$ with probability 1/2, independently of H.
3. Setting Y := SH.
Lemma 3.4. Let H be a continuous random variable with pdf given by (28) and S a discrete random variable, independent of H, which equals 1 with probability 1/2 and $-1$ with probability 1/2. The random variable Y := SH is a standard Gaussian.
The reason why we reduce the problem to generating H is that its pdf is only nonzero on the positive axis, which allows us to bound it with the pdf of an exponential
[Figure 5: Illustration of how to generate 50 000 samples from the random variable H defined in Example 3.3 via rejection sampling. Top: histogram of 50 000 iid samples from X ($f_X$ is shown in black). Middle: scatterplot of the samples from X and the samples from U; accepted samples are colored red and $\exp(-(x-1)^2/2)$ is shown in black. Bottom: histogram of the accepted samples ($f_H$ is shown in black).]
random variable X with parameter 1. If we set $c := \sqrt{2e/\pi}$, then $f_H(x) \leq c\, f_X(x)$ for all x,
as illustrated in Figure 4. Indeed,
\frac{f_H(x)}{f_X(x)} = \frac{\sqrt{2/\pi}\, \exp\left( -\frac{x^2}{2} \right)}{\exp(-x)} \qquad (33)
= \sqrt{\frac{2e}{\pi}}\, \exp\left( -\frac{(x-1)^2}{2} \right) \qquad (34)
\leq \sqrt{\frac{2e}{\pi}}. \qquad (35)
We can now apply rejection sampling to generate H. The steps are:
1. Obtain a sample x of X.
2. Obtain a sample u of U, uniformly distributed in [0, 1] and independent of X.
3. Accept x as a sample of H if

u \leq \exp\left( -\frac{(x-1)^2}{2} \right). \qquad (36)
This procedure is illustrated in Figure 5. The rejection mechanism ensures that the accepted
samples have the right distribution.
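Putting the pieces together, here is a sketch of the whole construction in Python (using NumPy's built-in exponential sampler in place of the inverse transform of Example 2.5, purely for brevity):

```python
import numpy as np

def sample_standard_gaussian(n, rng=None):
    # Generate the half-Gaussian H by rejection from an exponential
    # with parameter 1, then attach a random sign (Lemma 3.4).
    rng = rng or np.random.default_rng()
    out = np.empty(n)
    count = 0
    while count < n:
        x = rng.exponential(1.0)                # proposal f_X(x) = e^{-x}
        u = rng.random()
        if u <= np.exp(-(x - 1.0) ** 2 / 2.0):  # acceptance test (36)
            sign = 1.0 if rng.random() < 0.5 else -1.0
            out[count] = sign * x
            count += 1
    return out

samples = sample_standard_gaussian(50_000, np.random.default_rng(0))
```

The empirical mean and standard deviation of `samples` should be close to 0 and 1 respectively.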
Algorithm 4.1 (Metropolis-Hastings algorithm). We store the pmf $p_X$ of the target distribution in a vector $\vec{p} \in \mathbb{R}^s$, such that $\vec{p}_j = p_X(x_j)$ for $1 \leq j \leq s$. Let T denote the transition matrix of an irreducible Markov chain with the same state space $\{x_1, \ldots, x_s\}$ as $\vec{p}$.
Initialize $\tilde{X}(0)$ randomly or to a fixed state, then repeat the following steps for $i = 1, 2, 3, \ldots$

1. Generate a candidate C by applying one step of the Markov chain starting from $\tilde{X}(i-1)$, i.e. sample C according to $P(C = k \mid \tilde{X}(i-1) = j) = T_{kj}$.

2. Set

\tilde{X}(i) := \begin{cases} C & \text{with probability } p_{\text{acc}}\left( \tilde{X}(i-1), C \right), \\ \tilde{X}(i-1) & \text{otherwise}, \end{cases} \qquad (39)

where the acceptance probability is defined as

p_{\text{acc}}(j, k) := \min\left\{ \frac{T_{jk}\, \vec{p}_k}{T_{kj}\, \vec{p}_j},\ 1 \right\}. \qquad (40)
It turns out that this algorithm yields a Markov chain that is reversible with respect to the
distribution of interest, which ensures that the distribution is stationary.
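For a finite state space the algorithm can be sketched directly. The code below uses the column convention T[k, j] = P(candidate k | state j); the three-state target and uniform proposal chain are arbitrary choices for illustration, not taken from the notes.

```python
import numpy as np

def metropolis_hastings(p, T, n_steps, x0=0, rng=None):
    # p: target pmf over states 0..s-1; T[k, j] = P(candidate k | state j).
    rng = rng or np.random.default_rng()
    s = len(p)
    chain = np.empty(n_steps + 1, dtype=int)
    chain[0] = x = x0
    for i in range(1, n_steps + 1):
        c = rng.choice(s, p=T[:, x])  # propose from column x of T
        p_acc = min(T[x, c] * p[c] / (T[c, x] * p[x]), 1.0)
        if rng.random() <= p_acc:     # accept with probability p_acc
            x = c
        chain[i] = x
    return chain

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
T = np.full((3, 3), 1.0 / 3.0)  # uniform proposal chain
chain = metropolis_hastings(p, T, 50_000, rng=rng)
```

After discarding an initial stretch of the chain, the state frequencies approximate the target pmf p.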
Theorem 4.2. The pmf in $\vec{p}$ corresponds to a stationary distribution of the Markov chain $\tilde{X}$ obtained by the Metropolis-Hastings algorithm.
Proof. We show that the Markov chain $\tilde{X}$ is reversible with respect to $\vec{p}$, i.e. that

T_{\tilde{X},\, kj}\, \vec{p}_j = T_{\tilde{X},\, jk}\, \vec{p}_k \qquad (41)

holds for all $1 \leq j, k \leq s$. This establishes the result by Theorem 5.3 in Lecture Notes 6.
The detailed-balance condition holds trivially if j = k. If $j \neq k$ we have

T_{\tilde{X},\, kj} := P\left( \tilde{X}(i) = k \mid \tilde{X}(i-1) = j \right) \qquad (42)
= P\left( \tilde{X}(i) = C,\ C = k \mid \tilde{X}(i-1) = j \right) \qquad (43)
= P\left( \tilde{X}(i) = C \mid C = k,\ \tilde{X}(i-1) = j \right) P\left( C = k \mid \tilde{X}(i-1) = j \right) \qquad (44)
= p_{\text{acc}}(j, k)\, T_{kj} \qquad (45)
and by exactly the same argument $T_{\tilde{X},\, jk} = p_{\text{acc}}(k, j)\, T_{jk}$. We conclude that

T_{\tilde{X},\, kj}\, \vec{p}_j = p_{\text{acc}}(j, k)\, T_{kj}\, \vec{p}_j \qquad (46)
= T_{kj}\, \vec{p}_j \min\left\{ \frac{T_{jk}\, \vec{p}_k}{T_{kj}\, \vec{p}_j},\ 1 \right\} \qquad (47)
= \min\left\{ T_{jk}\, \vec{p}_k,\ T_{kj}\, \vec{p}_j \right\} \qquad (48)
= T_{jk}\, \vec{p}_k \min\left\{ 1,\ \frac{T_{kj}\, \vec{p}_j}{T_{jk}\, \vec{p}_k} \right\} \qquad (49)
= p_{\text{acc}}(k, j)\, T_{jk}\, \vec{p}_k \qquad (50)
= T_{\tilde{X},\, jk}\, \vec{p}_k. \qquad (51)
You might be wondering about the point of using MCMC methods if we already have access
to the desired distribution. It seems much simpler to just apply the method described in
Example 2.1 instead. However, the Metropolis-Hastings method can be applied to discrete
distributions with infinite supports and also to continuous distributions (justifying this is
beyond the scope of the course), so it provides an alternative to inverse-transform and
rejection sampling. In addition, Metropolis-Hastings does not require having access to the pmf $p_X$ or pdf $f_X$ of the target distribution, but only to the ratio $p_X(x)/p_X(y)$ or $f_X(x)/f_X(y)$ for every $x \neq y$. This is very useful when computing conditional distributions within probabilistic models.
Imagine that we have access to the marginal distribution of a continuous random variable A and the conditional distribution of another continuous random variable B given A. Computing the conditional pdf

f_{A|B}(a \mid b) = \frac{f_A(a)\, f_{B|A}(b \mid a)}{\int f_A(a')\, f_{B|A}(b \mid a')\, da'} \qquad (52)

is not necessarily feasible due to the integral in the denominator. However, if we apply Metropolis-Hastings to sample from $f_{A|B}$ we don't need to compute the normalizing factor, since for any $a_1 \neq a_2$

\frac{f_{A|B}(a_1 \mid b)}{f_{A|B}(a_2 \mid b)} = \frac{f_A(a_1)\, f_{B|A}(b \mid a_1)}{f_A(a_2)\, f_{B|A}(b \mid a_2)}. \qquad (53)
The following example is taken from Hastings's seminal paper Monte Carlo Sampling Methods Using Markov Chains and Their Applications.
Example 4.3 (Generating a Poisson random variable). Our aim is to generate a Poisson random variable X with parameter $\lambda$. Note that we don't need to know the normalizing constant in the Poisson pmf, which equals $e^{-\lambda}$, as long as we know that the pmf is proportional to

p_X(x) \propto \frac{\lambda^x}{x!}. \qquad (54)
The auxiliary Markov chain must be able to reach any possible value of X, i.e. all nonnegative integers. We will use a modified random walk that takes steps upwards and downwards with
probability 1/2, but never goes below 0. Its transition matrix equals

T_{kj} := \begin{cases} \frac{1}{2} & \text{if } j = 0 \text{ and } k = 0, \\ \frac{1}{2} & \text{if } k = j + 1, \\ \frac{1}{2} & \text{if } j > 0 \text{ and } k = j - 1, \\ 0 & \text{otherwise}. \end{cases} \qquad (55)
To compute the acceptance probability, we only consider transitions that are possible under the random walk. If j = 0 and k = 0, the candidate equals the current state, so

p_{\text{acc}}(0, 0) = 1.

If k = j + 1,

p_{\text{acc}}(j, j+1) = \min\left\{ \frac{\lambda^{j+1}/(j+1)!}{\lambda^{j}/j!},\ 1 \right\} \qquad (59)
= \min\left\{ \frac{\lambda}{j+1},\ 1 \right\}. \qquad (60)

If k = j - 1,

p_{\text{acc}}(j, j-1) = \min\left\{ \frac{\lambda^{j-1}/(j-1)!}{\lambda^{j}/j!},\ 1 \right\} \qquad (61)
= \min\left\{ \frac{j}{\lambda},\ 1 \right\}. \qquad (62)
[Figure 6: empirical distribution of the states 0 through 5 of the Markov chain as a function of the number of iterations ($10^0$ to $10^3$, logarithmic scale).]
We now spell out the steps of the Metropolis-Hastings method. To simulate the auxiliary random walk we use a sequence of Bernoulli random variables that indicate whether the random walk is trying to go up or down (or stay at zero). Initialize the chain at $x_0 = 0$. For $i = 1, 2, \ldots$

Generate a sample b from a Bernoulli distribution with parameter 1/2 and a sample u uniformly distributed in [0, 1].

If b = 0:
  If $x_{i-1} = 0$, $x_i := 0$.
  If $x_{i-1} > 0$:
    If $u < x_{i-1}/\lambda$, $x_i := x_{i-1} - 1$.
    Otherwise $x_i := x_{i-1}$.
If b = 1:
  If $u < \lambda/(x_{i-1} + 1)$, $x_i := x_{i-1} + 1$.
  Otherwise $x_i := x_{i-1}$.
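The steps above can be sketched as follows, with $\lambda := 6$ as in Figure 6; note that only ratios of the unnormalized pmf $\lambda^x/x!$ ever appear.

```python
import numpy as np

def poisson_metropolis_hastings(lam, n_steps, rng=None):
    # Random-walk Metropolis-Hastings for a Poisson(lam) target, using
    # the acceptance rules (60) and (62); the chain starts at x_0 = 0.
    rng = rng or np.random.default_rng()
    x = 0
    chain = np.empty(n_steps, dtype=int)
    for i in range(n_steps):
        b = rng.random() < 0.5  # Bernoulli(1/2): up if True, else down/stay
        u = rng.random()
        if b:                    # try to move up
            if u < lam / (x + 1):
                x += 1
        elif x > 0:              # try to move down (stay if x == 0)
            if u < x / lam:
                x -= 1
        chain[i] = x
    return chain

chain = poisson_metropolis_hastings(6.0, 200_000, np.random.default_rng(0))
samples = chain[1_000:]  # discard the burn-in period
```

After burn-in, the empirical distribution of `samples` approximates the Poisson pmf with parameter 6, whose mean is 6.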
The Markov chain that we have built is irreducible: there is nonzero probability of going from
any nonnegative integer to any other nonnegative integer (although it could take a while!).
We have not really proved that the chain should converge to the desired distribution, since we have not discussed convergence of Markov chains with infinite state spaces, but Figure 6 shows that the method indeed allows us to sample from a Poisson distribution with $\lambda := 6$.
For the example in Figure 6, approximate convergence in distribution occurs after around
100 iterations. This is called the mixing time of the Markov chain. To account for it,
MCMC methods usually discard the samples from the chain over an initial period known as
burn-in time.
Assume that $U > F_X(y)$. Then for all x such that $F_X(x) \geq U$ we have $x > y$, because the cdf is nondecreasing. In particular, $\min_x \{F_X(x) \geq U\} > y$.

Assume that $\min_x \{F_X(x) \geq U\} > y$. Then $U > F_X(y)$, because the cdf is nondecreasing. The inequality is strict because $U = F_X(y)$ would imply that y belongs to $\{F_X(x) \geq U\}$, which cannot be the case since we are assuming that y is smaller than the minimum of that set.