
DS-GA 1002 Lecture notes 7 Fall 2016

Simulation
1 Introduction
Simulation is a powerful tool in probability and statistics. Probabilistic models are often too
complex for us to derive closed-form solutions of the distribution or expectation of quantities
of interest, as we do in homework problems.
As an example, imagine that you set up a probabilistic model to determine the probability
of winning a game of solitaire. If the cards are well shuffled, this probability equals
    P(Win) = (number of permutations that lead to a win) / (total number of permutations).     (1)
The problem is that characterizing what permutations lead to a win is very difficult without
actually playing out the game to see the outcome. One could think of counting the number
of permutations that result in a win, but there are 52! ≈ 8 · 10^67 permutations, so this is
computationally intractable. However, there is a simple way to approximate the probability
of interest: simulating a large number of games and recording what fraction result in wins.
The game of solitaire was precisely what inspired Stanislaw Ulam to propose simulation-based
methods, known as Monte-Carlo techniques, in the context of nuclear-weapon research in
the 1940s:
The first thoughts and attempts I made to practice (the Monte Carlo Method) were suggested
by a question which occurred to me in 1946 as I was convalescing from an illness and playing
solitaires. The question was what are the chances that a Canfield solitaire laid out with 52
cards will come out successfully? After spending a lot of time trying to estimate them by
pure combinatorial calculations, I wondered whether a more practical method than abstract
thinking might not be to lay it out say one hundred times and simply observe and count the
number of successful plays.
This was already possible to envisage with the beginning of the new era of fast computers, and
I immediately thought of problems of neutron diffusion and other questions of mathematical
physics, and more generally how to change processes described by certain differential equa-
tions into an equivalent form interpretable as a succession of random operations. Later (in
1946), I described the idea to John von Neumann, and we began to plan actual calculations.1
The basic idea of Monte Carlo simulation is to repeatedly sample from a probabilistic model
to obtain realizations of a quantity of interest and then compute its average or its empirical
distribution. The following example applies simulation to predict the upcoming presidential
election.

1 http://en.wikipedia.org/wiki/Monte_Carlo_method#History

Example 1.1 (Presidential election). The US presidential election is an indirect election,
where citizens of the US cast ballots for electors in the Electoral College. Each state is
entitled to a number of electors equal to its number of members of Congress (in addition,
there are three electors for Washington D. C.). In every state, except Maine and Nebraska,
the electors are pledged to the candidate that wins the state.
Let us consider a probabilistic model that represents the US presidential election, where
we ignore third-party candidates and the particularities of Maine and Nebraska to simplify
matters. Each state (and Washington D.C.) is represented by a random variable S_i, 1 ≤ i ≤ 51,

    S_i = +1   if Hillary Clinton wins state i,
          −1   if Donald Trump wins state i.                                (2)

We assume that these random variables are independent, which is of course not completely
accurate. This is a simplified version of the probabilistic model used by the website
fivethirtyeight to predict the outcome of the election, which does take into account historical
correlation between the outcomes in different states. The model is fitted using poll data (as
well as approval ratings, surveys, economic indicators, etc.). Later on in the course, we will
learn how to estimate distributions from data, but for now we will assume that the distribution
of the random variables S_i, 1 ≤ i ≤ 51, is known and use the estimates provided by
fivethirtyeight on September 26, before the first presidential debate. These estimates are
shown in Table 1.
In order to use the model to predict the outcome of the election, we need to compute the
probability of the event

    ∑_{i=1}^{51} n_i S_i > 0,                                               (3)

where n_i denotes the number of electors assigned to state i (see Table 1). A brute-force
approach would require considering all possible outcomes, which number 2^51 ≈ 2 · 10^15.²
This is intractable. Alternatively, we can simulate the election repeatedly using the
probabilistic model and record how often Clinton or Trump win. Figure 1 plots the fraction of
times Clinton wins as we increase the number of samples. Eventually, the estimate of the
probability of our event of interest converges to around 60% (we perform the simulation three
times to verify convergence).

² It is possible to apply convolutions to compute the result more efficiently, but this technique is difficult
to generalize when the random variables are no longer independent.
State            n_i  P(S_i = 1)  P(E_i)       |  State            n_i  P(S_i = 1)  P(E_i)

Alabama           9   0.002   3.90·10^-4       |  Alaska            3   0.137   8.68·10^-3
Arizona          11   0.233   5.67·10^-2       |  Arkansas          6   0.012   1.82·10^-3
California       55   0.992   9.17·10^-1       |  Colorado          9   0.593   1.26·10^-1
Connecticut       7   0.904   1.52·10^-1       |  Delaware          3   0.830   5.34·10^-2
D. C.             3   0.999   6.44·10^-2       |  Florida          29   0.436   2.38·10^-1
Georgia          16   0.164   5.25·10^-2       |  Hawaii            4   0.990   8.94·10^-2
Idaho             4   0.006   4.10·10^-4       |  Illinois         20   0.938   4.44·10^-1
Indiana          11   0.041   9.65·10^-3       |  Iowa              6   0.297   4.24·10^-2
Kansas            6   0.052   7.18·10^-3       |  Kentucky          8   0.013   2.27·10^-3
Louisiana         8   0.018   3.32·10^-3       |  Maine             4   0.684   6.18·10^-2
Maryland         10   0.996   2.41·10^-1       |  Massachusetts    11   0.978   2.61·10^-1
Michigan         16   0.670   2.52·10^-1       |  Minnesota        10   0.717   1.71·10^-1
Mississippi       6   0.023   3.06·10^-3       |  Missouri         10   0.092   2.02·10^-2
Montana           3   0.100   6.12·10^-3       |  Nebraska          5   0.031   3.33·10^-3
Nevada            6   0.442   6.06·10^-2       |  New Hampshire     4   0.612   5.44·10^-2
New Jersey       14   0.876   2.96·10^-1       |  New Mexico        5   0.768   8.93·10^-2
New York         29   0.985   6.38·10^-1       |  North Carolina   15   0.401   1.33·10^-1
North Dakota      3   0.054   3.57·10^-3       |  Ohio             18   0.379   1.43·10^-1
Oklahoma          7   0.003   3.90·10^-4       |  Oregon            7   0.872   1.46·10^-1
Pennsylvania     20   0.654   3.00·10^-1       |  Rhode Island      4   0.823   7.35·10^-2
South Carolina    9   0.066   1.30·10^-2       |  South Dakota      3   0.080   4.91·10^-3
Tennessee        11   0.014   3.16·10^-3       |  Texas            38   0.059   2.35·10^-2
Utah              6   0.016   2.11·10^-3       |  Vermont           3   0.950   6.15·10^-2
Virginia         13   0.737   2.28·10^-1       |  Washington       12   0.910   2.65·10^-1
West Virginia     5   0.008   8.50·10^-4       |  Wisconsin        10   0.699   1.65·10^-1
Wyoming           3   0.014   8.50·10^-4       |

Table 1: The table shows the number of electors per state, the probability that Clinton wins the
state as estimated by fivethirtyeight on September 26, and an estimate of the probability that if
the outcome in the state changes then this affects the result of the election. The estimate is less
accurate if the probability of E_i is very low, since we only used 10^4 samples. The five states for
which the probability of E_i is largest are shown in blue.

[Figure 1 plots the estimated probability (vertical axis, from 0 to 1) against the number of samples
(horizontal axis, from 10^0 to 10^5 on a logarithmic scale).]

Figure 1: Fraction of the simulations that result in Clinton winning the election. We repeat the
simulation three times to verify that it converges to the same estimate.

We can also use the probabilistic model to estimate the probability of other interesting events.
For example, a candidate's campaign could be interested in the influence of a particular state
on the final outcome of the election. One way to quantify this effect is by computing the
probability that if the outcome in state i changes then this also changes the result of the
election, i.e. the probability of the event

    E_i := { ∑_{j=1, j≠i}^{51} n_j S_j < n_i S_i , ∑_{j=1}^{51} n_j S_j ≥ 0 }
           ∪ { ∑_{j=1, j≠i}^{51} n_j S_j > n_i S_i , ∑_{j=1}^{51} n_j S_j ≤ 0 }.        (4)

Computing this probability explicitly is again a difficult combinatorial problem. However,
we can easily estimate it through simulation. The results are shown in Table 1. The states
with the highest probability of affecting the outcome are large states supporting Clinton.
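The simulation just described is straightforward to implement. Below is a minimal Python sketch
(not the code used to produce Figure 1 or Table 1): it assumes the electoral votes and win
probabilities of Table 1 are available as numpy arrays, and uses a small made-up subset of values
as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder inputs: in the real computation these would be the 51 entries of
# Table 1 (electoral votes n_i and the probabilities P(S_i = 1)). The five
# values below are a toy subset, not the fivethirtyeight estimates.
n = np.array([9, 3, 11, 55, 29])
p = np.array([0.002, 0.137, 0.233, 0.992, 0.436])

num_samples = 10**5
# Each row is one simulated election: S_i = +1 with probability p_i, -1 otherwise.
S = np.where(rng.random((num_samples, len(p))) < p, 1, -1)
totals = S @ n                                # sum_i n_i S_i for every simulation

prob_clinton = np.mean(totals > 0)            # Monte Carlo estimate of (3)

# Estimate of P(E_i): flipping state i changes the winner, cf. (4).
prob_flip = np.empty(len(p))
for i in range(len(p)):
    flipped = totals - 2 * n[i] * S[:, i]     # total after flipping the outcome in state i
    prob_flip[i] = np.mean((totals > 0) != (flipped > 0))

print(prob_clinton, prob_flip)
```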

In order to simulate a probabilistic model, it is necessary to sample from the corresponding
distributions. The most widespread approach to generating samples from an arbitrary
distribution is to decouple the sampling process into two separate steps:

1. Generating samples uniformly from the unit interval [0, 1].

2. Transforming the uniform samples so that they have the desired distribution.

[Figure 2 shows the unit interval [0, 1] on the horizontal axis with the uniform samples
u_1, . . . , u_5 and the values F_X(x_1), F_X(x_2) marked; each sample is mapped to x_1, x_2, or x_3
depending on the subinterval in which it falls.]

Figure 2: Illustration of the method to generate samples from an arbitrary discrete distribution
described in Example 2.1.

In these notes we will focus on the second step, assuming that we have access to a random-
number generator that produces iid samples following a uniform distribution in [0, 1]. The
construction of good uniform random generators is an important problem, but it is beyond
the scope of this course.

2 Inverse-transform sampling
Inverse-transform sampling allows us to sample from an arbitrary distribution with a known cdf
by applying a deterministic transformation to uniform samples. To provide some intuition,
we first discuss how to sample from a discrete distribution.

Example 2.1 (Sampling from a discrete distribution). Let X be a discrete random variable
with pmf pX and U a uniform random variable in [0, 1]. Our aim is to transform a sample
from U so that it is distributed according to pX . We denote the values that have nonzero
probability under pX by x1 , x2 , . . .
For a fixed i, assume that we assign all samples of U within an interval of length pX (xi )
to xi . Then the probability that a given sample from U is assigned to xi is exactly pX (xi )!
Very conveniently, the unit interval can be partitioned into intervals of length pX(x_i). We
can consequently generate X by sampling from U and setting


    X = x_1   if 0 ≤ U ≤ pX(x_1),
        x_2   if pX(x_1) ≤ U ≤ pX(x_1) + pX(x_2),
        ...
        x_i   if ∑_{j=1}^{i−1} pX(x_j) ≤ U ≤ ∑_{j=1}^{i} pX(x_j),
        ...                                                                 (5)

Recall that the cdf of a discrete random variable equals

    FX(x) = P(X ≤ x)                                                        (6)
          = ∑_{x_i ≤ x} pX(x_i),                                            (7)

so our algorithm boils down to obtaining a sample u from U and then assigning it to the x_i
such that FX(x_{i−1}) ≤ u ≤ FX(x_i). This is illustrated in Figure 2 for a simple example.
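A short sketch of this procedure in Python (numpy is assumed to be available; the function name
and the example pmf are just for illustration):

```python
import numpy as np

def sample_discrete(values, pmf, num_samples, rng=np.random.default_rng()):
    """Sample from a discrete pmf by partitioning [0, 1] as in Example 2.1."""
    cdf = np.cumsum(pmf)                      # F_X(x_1), F_X(x_2), ...
    u = rng.random(num_samples)               # iid uniform samples in [0, 1)
    # For each u, find the first index i such that u <= F_X(x_i).
    indices = np.minimum(np.searchsorted(cdf, u), len(pmf) - 1)
    return np.asarray(values)[indices]

# Example: a pmf on three values.
samples = sample_discrete(values=[1, 2, 5], pmf=[0.2, 0.5, 0.3], num_samples=10)
```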

Inverse-transform sampling generalizes the method in Example 2.1 to continuous distributions.

Algorithm 2.2 (Inverse-transform sampling). Let X be a continuous random variable with
cdf FX and U a random variable that is uniformly distributed in [0, 1] and independent of X.

1. Obtain a sample u of U.

2. Set x := FX^{-1}(u).

The careful reader might point out that FX may not be invertible at every point. To avoid
this problem we define the generalized inverse of the cdf as

    FX^{-1}(u) := min {x : FX(x) = u}.                                      (8)

The function is well defined because all cdfs are nondecreasing, so FX is equal to a constant
c in any interval [x_1, x_2] where it is not invertible.
We now prove that Algorithm 2.2 works.

Theorem 2.3 (Inverse-transform sampling works). The distribution of Y = FX^{-1}(U) is the
same as the distribution of X.

[Figure 3 shows the uniform samples u_1, . . . , u_5 on the horizontal axis being mapped to the
values FX^{-1}(u_1), . . . , FX^{-1}(u_5) on the vertical axis.]

Figure 3: Samples from an exponential distribution with parameter λ = 1 obtained by inverse-
transform sampling as described in Example 2.5.

Proof. We just need to show that the cdf of Y is equal to FX. We have

    FY(y) = P(Y ≤ y)                                                        (9)
          = P(FX^{-1}(U) ≤ y)                                               (10)
          = P(U ≤ FX(y))                                                    (11)
          = ∫_{u=0}^{FX(y)} du                                              (12)
          = FX(y),                                                          (13)

where in step (11) we have to take into account that we are using the generalized inverse of
the cdf. This is resolved by the following lemma, proved in Section A of the appendix.

Lemma 2.4. The events {FX^{-1}(U) ≤ y} and {U ≤ FX(y)} are equivalent.


Example 2.5 (Sampling from an exponential distribution). Let X be an exponential random
variable with parameter λ. Its cdf FX(x) := 1 − e^{−λx} is invertible on [0, ∞). Its inverse
equals

    FX^{-1}(u) = (1/λ) log( 1 / (1 − u) ).                                   (14)

FX^{-1}(U) is an exponential random variable with parameter λ by Theorem 2.3. Figure 3
shows how the samples of U are transformed into samples of X.
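A minimal sketch of this transformation in Python (the function name is arbitrary; numpy's
uniform generator plays the role of U):

```python
import numpy as np

def sample_exponential(lam, num_samples, rng=np.random.default_rng()):
    """Inverse-transform sampling for an exponential with parameter lam, cf. (14)."""
    u = rng.random(num_samples)               # iid uniform samples in [0, 1)
    return np.log(1.0 / (1.0 - u)) / lam      # F_X^{-1}(u) = (1/lam) log(1 / (1 - u))

samples = sample_exponential(lam=1.0, num_samples=10**5)
print(samples.mean())                         # should be close to 1/lam = 1
```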

3 Rejection sampling
Inverse-transform sampling requires having access to the inverse of the cdf corresponding to
the distribution of interest, which is not always the case. For instance, in the case of the
Gaussian random variable the cdf does not even have a closed-form expression. Rejection
sampling, also known as the accept-reject method, only requires knowing the pdf of the
distribution of interest. It allows us to obtain samples distributed according to a target pdf
fY by selecting samples obtained according to a different pdf fX that satisfies

    fY(y) ≤ c fX(y)                                                         (15)

for all y, where c is a fixed positive constant. In words, the pdf of Y must be bounded by a
scaled version of the pdf of X.

Algorithm 3.1 (Rejection sampling). Let X be a random variable with pdf fX and U a
random variable that is uniformly distributed in [0, 1] and independent of X. We assume
that (15) holds.

1. Obtain a sample y of X.

2. Obtain a sample u of U.

3. Declare y to be a sample of Y if

    u ≤ fY(y) / (c fX(y)).                                                  (16)
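A generic sketch of this algorithm in Python, assuming the caller supplies the target pdf, the
proposal pdf, a routine that samples from the proposal, and a valid constant c satisfying (15)
(all names are illustrative):

```python
import numpy as np

def rejection_sample(f_target, f_proposal, sample_proposal, c, num_samples,
                     rng=np.random.default_rng()):
    """Accept-reject method of Algorithm 3.1; requires f_target(y) <= c * f_proposal(y)."""
    accepted = []
    while len(accepted) < num_samples:
        y = sample_proposal(rng)                        # step 1: sample from the proposal pdf
        u = rng.random()                                # step 2: independent uniform sample
        if u <= f_target(y) / (c * f_proposal(y)):      # step 3: accept with prob f_Y/(c f_X)
            accepted.append(y)
    return np.array(accepted)
```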

The following theorem establishes that the samples obtained by rejection sampling have the
desired distribution.

Theorem 3.2 (Rejection sampling works). If assumption (15) holds, then the samples produced
by rejection sampling are distributed according to fY.

Proof. Let Z denote the random variable produced by rejection sampling. The cdf of Z is
equal to

    FZ(y) = P( X ≤ y | U ≤ fY(X) / (c fX(X)) )                               (17)
          = P( X ≤ y, U ≤ fY(X) / (c fX(X)) ) / P( U ≤ fY(X) / (c fX(X)) ).  (18)
To compute the numerator we integrate the joint pdf of U and X over the region of interest:

    P( X ≤ y, U ≤ fY(X) / (c fX(X)) ) = ∫_{x=−∞}^{y} ∫_{u=0}^{fY(x)/(c fX(x))} fX(x) du dx   (19)
                                      = ∫_{x=−∞}^{y} (fY(x) / (c fX(x))) fX(x) dx            (20)
                                      = (1/c) ∫_{x=−∞}^{y} fY(x) dx                          (21)
                                      = (1/c) FY(y).                                         (22)

The denominator is obtained in a similar way:

    P( U ≤ fY(X) / (c fX(X)) ) = ∫_{x=−∞}^{∞} ∫_{u=0}^{fY(x)/(c fX(x))} fX(x) du dx          (23)
                               = ∫_{x=−∞}^{∞} (fY(x) / (c fX(x))) fX(x) dx                   (24)
                               = (1/c) ∫_{x=−∞}^{∞} fY(x) dx                                 (25)
                               = 1/c.                                                        (26)

We conclude that

    FZ(y) = FY(y),                                                                            (27)

so the method produces samples from the distribution of Y.

We now illustrate the method by applying it to produce a Gaussian random variable from
an exponential and a uniform random variable.

Example 3.3 (Generating a Gaussian random variable). In Example 2.5 we learned how
to generate an exponential random variable using samples from a uniform distribution. In
this example we will use samples from an exponential distribution to generate a standard
Gaussian random variable by applying rejection sampling.
The following lemma shows that we can generate a standard Gaussian random variable Y
by:

1. Generating a random variable H with pdf

    fH(h) := √(2/π) exp(−h²/2)   if h ≥ 0,
             0                    otherwise.                                 (28)

[Figure 4 plots fH(x) together with the bound c fX(x) for x between 0 and 5.]

Figure 4: Bound on the pdf of the target distribution in Example 3.3.

2. Generating a random variable S which is equal to 1 or −1 with probability 1/2, for
example by applying the method described in Example 2.1.

3. Setting Y := SH.

Lemma 3.4. Let H be a continuous random variable with pdf given by (28) and S a discrete
random variable which equals 1 with probability 1/2 and −1 with probability 1/2. The random
variable Y := SH is a standard Gaussian.

Proof. The conditional pdf of Y given S is given by

    fY|S(y|1)  = fH(y)    if y ≥ 0,   and 0 otherwise,                       (29)
    fY|S(y|−1) = fH(−y)   if y < 0,   and 0 otherwise.                       (30)

By Lemma 4.2 in Lecture Notes 3 we have

    fY(y) = pS(1) fY|S(y|1) + pS(−1) fY|S(y|−1)                              (31)
          = (1/√(2π)) exp(−y²/2).                                            (32)

The reason why we reduce the problem to generating H is that its pdf is nonzero only on
the positive axis, which allows us to bound it with the pdf of an exponential random
variable X with parameter 1. If we set c := √(2e/π), then fH(x) ≤ c fX(x) for all x,
as illustrated in Figure 4. Indeed,
    fH(x) / fX(x) = √(2/π) exp(−x²/2) / exp(−x)                              (33)
                  = √(2e/π) exp(−(x − 1)²/2)                                 (34)
                  ≤ √(2e/π).                                                 (35)
We can now apply rejection sampling to generate H. The steps are:

1. Obtain a sample x from an exponential random variable X with parameter 1.

2. Obtain a sample u from U, which is uniformly distributed in [0, 1].

3. Accept x as a sample of H if

    u ≤ exp(−(x − 1)²/2).                                                    (36)

This procedure is illustrated in Figure 5. The rejection mechanism ensures that the accepted
samples have the right distribution.

[Figure 5 contains three panels: a histogram of 50 000 iid samples from X (fX is shown in black),
a scatterplot of the samples from X against the samples from U (accepted samples are colored red,
exp(−(x − 1)²/2) is shown in black), and a histogram of the accepted samples (fH is shown in black).]

Figure 5: Illustration of how to generate 50 000 samples from the random variable H defined in
Example 3.3 via rejection sampling.
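A minimal Python sketch of this procedure, combined with the random sign of Lemma 3.4 to obtain
standard Gaussian samples (function names are illustrative, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_half_gaussian(num_samples):
    """Rejection sampling for H (pdf (28)) using an exponential proposal with parameter 1."""
    samples = []
    while len(samples) < num_samples:
        x = rng.exponential(1.0)                 # step 1: sample from the exponential proposal
        u = rng.random()                         # step 2: uniform sample in [0, 1)
        if u <= np.exp(-0.5 * (x - 1.0) ** 2):   # step 3: acceptance condition (36)
            samples.append(x)
    return np.array(samples)

# Attach a random sign to obtain standard Gaussian samples (Lemma 3.4).
h = sample_half_gaussian(50_000)
s = rng.choice([-1.0, 1.0], size=h.size)
y = s * h
print(y.mean(), y.std())                         # should be close to 0 and 1
```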

4 Markov-chain Monte Carlo


As we discussed in Lecture Notes 6, irreducible aperiodic Markov chains converge in
distribution to a unique stationary distribution. Markov-chain Monte Carlo (MCMC) methods
leverage this phenomenon to generate samples from an arbitrary distribution by constructing
a Markov chain for which the target distribution is stationary. These techniques are of huge
importance in modern statistics and in particular in Bayesian approaches. In this section we
describe one of the most popular MCMC methods and show how to apply it to a simple
example.

The key challenge in MCMC methods is to design an irreducible aperiodic Markov chain
for which the target distribution is stationary. The Metropolis-Hastings algorithm uses an
auxiliary Markov chain to achieve this.

Algorithm 4.1 (Metropolis-Hastings algorithm). We store the pmf pX of the target
distribution in a vector p ∈ R^s, such that

    p_j := pX(x_j),   1 ≤ j ≤ s.                                             (37)

Let T denote the transition matrix of an irreducible Markov chain with the same state space
{x_1, . . . , x_s} as p.

Initialize X̃(0) randomly or to a fixed state, then repeat the following steps for i = 1, 2, 3, . . .

1. Generate a candidate random variable C from X̃(i − 1) by using the transition matrix
T, i.e.

    P(C = k | X̃(i − 1) = j) = T_{kj},   1 ≤ j, k ≤ s.                        (38)

2. Set

    X̃(i) := C          with probability p_acc(X̃(i − 1), C),
             X̃(i − 1)   otherwise,                                           (39)

where the acceptance probability is defined as

    p_acc(j, k) := min { T_{jk} p_k / (T_{kj} p_j), 1 },   1 ≤ j, k ≤ s.      (40)
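A minimal Python sketch of Algorithm 4.1 for a finite state space, assuming the target pmf p and
the proposal transition matrix T are given as numpy arrays with T[k, j] = T_{kj} (so each column
of T sums to one); the states are represented by their indices 0, . . . , s − 1:

```python
import numpy as np

def metropolis_hastings(p, T, num_steps, rng=np.random.default_rng()):
    """Algorithm 4.1 on a finite state space: p is the target pmf, T[k, j] the
    probability of proposing state k when the chain is at state j."""
    s = len(p)
    x = rng.integers(s)                      # initialize the chain at a random state
    chain = [x]
    for _ in range(num_steps):
        k = rng.choice(s, p=T[:, x])         # step 1: propose a candidate from column x of T
        # step 2: accept with probability min(T[x, k] p[k] / (T[k, x] p[x]), 1), cf. (40)
        p_acc = min(T[x, k] * p[k] / (T[k, x] * p[x]), 1.0)
        if rng.random() < p_acc:
            x = k
        chain.append(x)
    return np.array(chain)
```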

It turns out that this algorithm yields a Markov chain that is reversible with respect to the
distribution of interest, which ensures that the distribution is stationary.

Theorem 4.2. The pmf stored in p corresponds to a stationary distribution of the Markov chain
X̃ obtained by the Metropolis-Hastings algorithm.

Proof. We show that the Markov chain X̃ is reversible with respect to p, i.e. that

    (T_X̃)_{kj} p_j = (T_X̃)_{jk} p_k                                          (41)

holds for all 1 ≤ j, k ≤ s. This establishes the result by Theorem 5.3 in Lecture Notes 6.
The detailed-balance condition holds trivially if j = k. If j ≠ k we have

    (T_X̃)_{kj} := P(X̃(i) = k | X̃(i − 1) = j)                                 (42)
                = P(X̃(i) = C, C = k | X̃(i − 1) = j)                          (43)
                = P(X̃(i) = C | C = k, X̃(i − 1) = j) P(C = k | X̃(i − 1) = j)  (44)
                = p_acc(j, k) T_{kj}                                          (45)

and by exactly the same argument (T_X̃)_{jk} = p_acc(k, j) T_{jk}. We conclude that

    (T_X̃)_{kj} p_j = p_acc(j, k) T_{kj} p_j                                   (46)
                   = T_{kj} p_j min { T_{jk} p_k / (T_{kj} p_j), 1 }           (47)
                   = min { T_{jk} p_k, T_{kj} p_j }                            (48)
                   = T_{jk} p_k min { 1, T_{kj} p_j / (T_{jk} p_k) }           (49)
                   = p_acc(k, j) T_{jk} p_k                                    (50)
                   = (T_X̃)_{jk} p_k.                                           (51)

You might be wondering about the point of using MCMC methods if we already have access
to the desired distribution. It seems much simpler to just apply the method described in
Example 2.1 instead. However, the Metropolis-Hastings method can be applied to discrete
distributions with infinite supports and also to continuous distributions (justifying this is
beyond the scope of the course), so it provides an alternative to inverse-transform and
rejection sampling. In addition, Metropolis-Hastings does not require having access to the
pmf pX or pdf fX of the target distribution, but rather to the ratio pX(x)/pX(y) or
fX(x)/fX(y) for every x ≠ y. This is very useful when computing conditional distributions
within probabilistic models.
Imagine that we have access to the marginal distribution of a continuous random variable
A and the conditional distribution of another continuous random variable B given A.
Computing the conditional pdf

    fA|B(a|b) = fA(a) fB|A(b|a) / ∫_{u=−∞}^{∞} fA(u) fB|A(b|u) du             (52)

is not necessarily feasible due to the integral in the denominator. However, if we apply
Metropolis-Hastings to sample from fA|B we don't need to compute the normalizing factor,
since for any a_1 ≠ a_2

    fA|B(a_1|b) / fA|B(a_2|b) = fA(a_1) fB|A(b|a_1) / ( fA(a_2) fB|A(b|a_2) ).  (53)

The following example is taken from Hastings's seminal paper Monte Carlo Sampling Methods
Using Markov Chains and Their Applications.

Example 4.3 (Generating a Poisson random variable). Our aim is to generate a Poisson
random variable X with parameter λ. Note that we don't need to know the normalizing
constant in the Poisson pmf, which equals e^{−λ}, as long as we know that it is proportional to

    pX(x) ∝ λ^x / x!                                                          (54)
The auxiliary Markov chain must be able to reach any possible value of X, i.e. all nonnegative
integers. We will use a modified random walk that takes steps upwards and downwards with
probability 1/2, but never goes below 0. Its transition matrix equals

    T_{kj} := 1/2   if j = 0 and k = 0,
              1/2   if k = j + 1,
              1/2   if j > 0 and k = j − 1,
              0     otherwise.                                                (55)

T is symmetric, so the acceptance probability is equal to the ratio of the pmfs:

    p_acc(j, k) := min { T_{jk} pX(k) / (T_{kj} pX(j)), 1 }                   (56)
                 = min { pX(k) / pX(j), 1 }.                                  (57)

To compute the acceptance probability, we only consider transitions that are possible under
the random walk. If j = 0 and k = 0,

    p_acc(j, k) = 1.                                                          (58)

If k = j + 1,

    p_acc(j, j + 1) = min { [λ^{j+1} / (j + 1)!] / [λ^j / j!], 1 }            (59)
                    = min { λ / (j + 1), 1 }.                                 (60)

If k = j − 1,

    p_acc(j, j − 1) = min { [λ^{j−1} / (j − 1)!] / [λ^j / j!], 1 }            (61)
                    = min { j / λ, 1 }.                                       (62)

[Figure 6 plots the empirical distribution of the states 0 through 5 (vertical axis, from 0 to 0.35)
as a function of the number of iterations (horizontal axis, from 10^0 to 10^3 on a logarithmic scale).]

Figure 6: Convergence in distribution of the Markov chain constructed in Example 4.3 for λ := 6.
To prevent clutter we only plot the empirical distribution of 6 states, computed by running the
Markov chain 10^4 times.

We now spell out the steps of the Metropolis-Hastings method (a code sketch implementing them
is given after the list). To simulate the auxiliary random walk we use a sequence of Bernoulli
random variables that indicate whether the random walk is trying to go up or down (or stay at
zero). Initialize the chain at x_0 = 0. For i = 1, 2, . . .

- Generate a sample b from a Bernoulli distribution with parameter 1/2 and a sample u
  uniformly distributed in [0, 1].
- If b = 0:
  - If x_{i−1} = 0, set x_i := 0.
  - If x_{i−1} > 0:
    - If u < x_{i−1} / λ, set x_i := x_{i−1} − 1.
    - Otherwise set x_i := x_{i−1}.
- If b = 1:
  - If u < λ / (x_{i−1} + 1), set x_i := x_{i−1} + 1.
  - Otherwise set x_i := x_{i−1}.
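A minimal Python sketch of these steps (the function name and the burn-in length are illustrative
choices, not part of the original notes):

```python
import numpy as np

def poisson_mh_chain(lam, num_steps, rng=np.random.default_rng()):
    """Metropolis-Hastings chain of Example 4.3 targeting a Poisson pmf with parameter lam."""
    x = 0                                    # initialize the chain at zero
    chain = [x]
    for _ in range(num_steps):
        b = rng.random() < 0.5               # Bernoulli(1/2): try to go up (True) or down (False)
        u = rng.random()
        if not b:                            # proposed move: down, or stay when already at zero
            if x > 0 and u < x / lam:        # accept with probability min(x / lam, 1), cf. (62)
                x -= 1
        else:                                # proposed move: up
            if u < lam / (x + 1):            # accept with probability min(lam / (x + 1), 1), cf. (60)
                x += 1
        chain.append(x)
    return np.array(chain)

# Discard an initial burn-in period before using the samples.
chain = poisson_mh_chain(lam=6.0, num_steps=10**5)
samples = chain[1000:]
print(samples.mean())                        # should be close to lam = 6
```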

The Markov chain that we have built is irreducible: there is nonzero probability of going from
any nonnegative integer to any other nonnegative integer (although it could take a while!).

We have not really proved that the chain should converge to the desired distribution, since
we have not discussed convergence of Markov chains with infinite state spaces, but Figure 6
shows that the method indeed allows us to sample from a Poisson distribution with λ := 6.

For the example in Figure 6, approximate convergence in distribution occurs after around
100 iterations. This is called the mixing time of the Markov chain. To account for it,
MCMC methods usually discard the samples from the chain over an initial period known as
burn-in time.

A Proof of Lemma 2.4


{FX^{-1}(U) ≤ y} implies {U ≤ FX(y)}

We argue by contraposition. Assume that U > FX(y). Then for all x such that FX(x) = U we
have x > y, because the cdf is nondecreasing. In particular, min {x : FX(x) = U} > y.

{U ≤ FX(y)} implies {FX^{-1}(U) ≤ y}

Again by contraposition, assume that min {x : FX(x) = U} > y. Then U ≥ FX(y) because the
cdf is nondecreasing, and the inequality is strict because U = FX(y) would imply that y belongs
to {x : FX(x) = U}, which cannot be the case since we are assuming that y is smaller than the
minimum of that set.
