GRADUATE STUDIES IN MATHEMATICS 199

Applied Stochastic Analysis

Weinan E
Tiejun Li
Eric Vanden-Eijnden
EDITORIAL COMMITTEE
Daniel S. Freed (Chair)
Bjorn Poonen
Gigliola Staffilani
Jeff A. Viaclovsky
2010 Mathematics Subject Classification. Primary 60-01, 60J22, 60H10, 60H35, 62P10, 62P35, 65C05, 65C10, 82-01.
Copying and reprinting. Individual readers of this publication, and nonproﬁt libraries
acting for them, are permitted to make fair use of the material, such as to copy select pages for
use in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for permission
to reuse portions of AMS publication content are handled by the Copyright Clearance Center. For
more information, please visit www.ams.org/publications/pubpermissions.
Send requests for translation rights and licensed reprints to reprintpermission@ams.org.
© 2019 by the authors. All rights reserved.
Printed in the United States of America.
∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at https://www.ams.org/
To our families
machine learning has also been added to the list. In addition to the summer
school, these courses have also been taught from time to time during regular
terms at New York University, Peking University, and Princeton University.
The lecture notes from these courses are the main origin of this textbook
series. The early participants, including some of the younger people who
were students or teaching assistants early on, have also become contributors
to this book series.
After many years, this series of courses has finally taken shape. Obviously, this has not come easily and is the joint effort of many people. I would like to express my sincere gratitude to Tiejun Li, Pingbing Ming, Shi Jin, Eric Vanden-Eijnden, Pingwen Zhang, and the other collaborators involved for
their commitment and dedication to this eﬀort. I am also very grateful to
Mrs. Yanyun Liu, Baomei Li, Tian Tian, Yuan Tian for their tireless eﬀorts
to help run the summer school. Above all, I would like to pay tribute to
someone who dedicated his life to this cause, David Cai. David had a passion for applied mathematics and its applications to science. He believed
strongly that one has to ﬁnd ways to inspire talented young people to go into
applied mathematics and to teach them applied mathematics in the right
way. For many years, he taught a course on introducing physics to applied
mathematicians in the summer school. To create a platform for practicing
the philosophy embodied in this project, he co-founded the Institute of Natural Sciences at Shanghai Jiao Tong University, which has become one of the
most active centers for applied mathematics in China. His passing away last
year was an immense loss, not just for all of us involved in this project, but
also for the applied mathematics community as a whole.
This book is the first in this series, covering probability theory and stochastic processes in a style that we believe is most suited for applied mathematicians. Other subjects covered in this series will include numerical algorithms, the calculus of variations and differential equations, a mathematical introduction to machine learning, and a mathematical introduction to physics and physical modeling. The stochastic analysis and differential equations courses summarize the most relevant aspects of pure mathematics, the algorithms course presents the most important technical tools of applied mathematics, and the learning and physical modeling courses provide a bridge to the real world and to other scientific disciplines. The selection of topics represents a possibly biased view about the true fundamentals of applied mathematics. In particular, we emphasize three themes throughout this textbook series: learning, modeling, and algorithms. Learning is about data and intelligent decision making. Modeling is concerned with physics-based models. The study of algorithms provides the practical tool for building and interrogating the models, whether machine learning–based
Weinan E
Tiejun Li
Eric Vanden-Eijnden
December 2018
Notation
(5) Functions:
(a) ⌈·⌉: The ceiling function. ⌈x⌉ = m + 1 if x ∈ [m, m + 1) for m ∈ Z.
(b) ⌊·⌋: The floor function. ⌊x⌋ = m if x ∈ [m, m + 1) for m ∈ Z.
(c) |x|: Modulus of a vector x ∈ R^d: |x| = (∑_{i=1}^d x_i^2)^{1/2}.
(d) ‖f‖: Norm of the function f in some function space.
(e) a ∨ b: Maximum of a and b: a ∨ b = max(a, b).
(f) a ∧ b: Minimum of a and b: a ∧ b = min(a, b).
(g) ⟨f⟩: The average of f with respect to a measure μ: ⟨f⟩ = ∫ f(x) μ(dx).
(h) (x, y): Inner product for x, y ∈ R^d: (x, y) = x^T y.
(i) (f, g): Inner product for L^2 functions f, g: (f, g) = ∫ f(x)g(x) dx.
(j) ⟨f, g⟩: Dual product for f ∈ B* and g ∈ B.
(k) A : B: Twice contraction for second-order tensors, i.e., A : B = ∑_{ij} a_{ij}b_{ji}.
(l) |S|: The cardinality of the set S.
(m) |Δ|: The subdivision size when Δ is a subdivision of a domain.
(n) χ_A(x): Indicator function, i.e., χ_A(x) = 1 if x ∈ A and 0 otherwise.
(o) δ(x − a): Dirac's delta function at x = a.
(p) Range(P), Null(P): The range and null space of the operator P, i.e., Range(P) = {y : y = Px}, Null(P) = {x : Px = 0}.
(q) ⊥C: Perpendicular subspace, i.e., ⊥C = {x : x ∈ B*, ⟨x, y⟩ = 0, ∀y ∈ C}, where C ⊂ B.
(6) Symbols:
(a) c_ε ≍ d_ε: Logarithmic equivalence, i.e., lim_{ε→0} log c_ε / log d_ε = 1.
(b) c_ε ∼ d_ε or c_n ∼ d_n: Equivalence, i.e., lim_{ε→0} c_ε/d_ε = 1 or lim_{n→∞} c_n/d_n = 1.
(c) Dx: Formal infinitesimal element in path space, Dx = ∏_{0≤t≤T} dx_t.
Part 1
Fundamentals
Chapter 1
Random Variables
Prob{ |S_{2n}/(2n) − 1/2| > ε } → 0,
as n → ∞. This can be seen as follows. Noting that the distribution Prob{S_{2n} = k} is unimodal and symmetric around the state k = n, we have
Prob{ |S_{2n}/(2n) − 1/2| > ε } ≤ 2 · (1/2^{2n}) ∑_{k > n+2nε} (2n)!/(k!(2n − k)!)
  ≤ 2(n − ⌈2nε⌉) · (1/2^{2n}) (2n)!/(⌈n + 2nε⌉! ⌊n − 2nε⌋!)
  ∼ (2/√π) √((1 − 2ε)/(1 + 2ε)) · √n / ((1 − 2ε)^{n(1−2ε)} (1 + 2ε)^{n(1+2ε)}) → 0
(by Stirling's formula) for sufficiently small ε and n ≫ 1, where ⌈·⌉ and ⌊·⌋ are the ceiling and floor functions, respectively, defined by ⌈x⌉ = m + 1 and ⌊x⌋ = m if x ∈ [m, m + 1) for m ∈ Z. This is the weak law of large numbers for this particular example.
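The decay of this tail probability can also be checked by direct simulation. The following sketch (plain Python; the value ε = 0.1, the sample sizes, and the trial count are our own illustrative choices, not from the text) estimates Prob{|S_{2n}/(2n) − 1/2| > ε} empirically:

```python
import random

def tail_prob(n, eps, trials=4000, seed=0):
    """Monte Carlo estimate of Prob{ |S_{2n}/(2n) - 1/2| > eps },
    where S_{2n} counts heads in 2n tosses of a fair coin."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(2 * n))
        if abs(heads / (2 * n) - 0.5) > eps:
            exceed += 1
    return exceed / trials

# The estimates decrease toward 0 as n grows, as the weak law predicts.
probs = [tail_prob(n, 0.1) for n in (10, 40, 160)]
print(probs)
```

The exponential decay predicted by the bound above is visible already for moderate n.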
In the example of a fair coin, the number of outcomes in an experiment
is ﬁnite. In contrast, the second class of examples involves a continuous set
of possible outcomes. Consider the orientation of a unit vector τ. Denote by S^2 the unit sphere in R^3. Define ρ(n), n ∈ S^2, as the orientation distribution density; i.e., for A ⊂ S^2,
Prob(τ ∈ A) = ∫_A ρ(n) dS,
and
P(A) = |A|/2^n,
where |A| is the cardinality of the set A.
Proposition 1.6 (Bayes's theorem). If A_1, A_2, . . . are disjoint sets such that ∪_{j=1}^∞ A_j = Ω, then we have
(1.6) P(A_j|B) = P(A_j)P(B|A_j) / ∑_{n=1}^∞ P(A_n)P(B|A_n) for any j ∈ N.
The Poisson distribution P(λ) may be viewed as the limit of the binomial distribution in the sense that
C_n^k p^k q^{n−k} → (λ^k/k!) e^{−λ} as n → ∞, p → 0, np = λ.
The proof is simple and is left as an exercise.
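The convergence can also be observed numerically. In the sketch below (the choices λ = 2 and k = 3 are arbitrary illustrative values), the binomial probabilities with np = λ held fixed approach the Poisson weight as n grows:

```python
import math

def binom_pmf(n, k, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

lam, k = 2.0, 3
poisson = lam**k / math.factorial(k) * math.exp(-lam)
for n in (10, 100, 1000):
    p = lam / n                      # keep np = lambda fixed
    print(n, binom_pmf(n, k, p), poisson)
```

The gap between the two probabilities shrinks steadily as n increases.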
1.6. Independence
We now come to one of the most distinctive notions in probability theory,
the notion of independence. Let us start by deﬁning the independence of
events. Two events A, B ∈ F are independent if
P(A ∩ B) = P(A)P(B).
Definition 1.21. Two random variables X and Y are said to be independent if for any two Borel sets A and B, X^{−1}(A) and Y^{−1}(B) are independent; i.e.,
(1.30) P(X^{−1}(A) ∩ Y^{−1}(B)) = P(X^{−1}(A)) P(Y^{−1}(B)).
Consequently, we have
(1.32) μ = μ_1 × μ_2;
i.e., the joint distribution of two independent random variables is the product distribution. If both μ_1 and μ_2 are absolutely continuous, with densities p_1 and p_2, respectively, then μ is also absolutely continuous, with density given by p(x, y) = p_1(x)p_2(y).
Note that pairwise independence does not imply independence (see Exercise 1.9).
To see that this definition reduces to the previous one for the case of discrete random variables, let Ω_j = {ω : Y(ω) = j} and
Ω = ∪_{j=1}^n Ω_j.
The σ-algebra G is simply the set of all possible unions of the Ω_j's. The measurability condition of E(X|Y) with respect to G means that E(X|Y) is constant on each Ω_j, which is exactly E(X|Y = j), as we will see. By definition, we have
(1.38) ∫_{Ω_j} E(X|Y) P(dω) = ∫_{Ω_j} X(ω) P(dω),
which implies
(1.39) E(X|Y)|_{Ω_j} = (1/P(Ω_j)) ∫_{Ω_j} X(ω) P(dω).
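Formula (1.39) says that, on each set Ω_j, the conditional expectation is just the average of X over that set. A small simulation illustrates this (the joint law here is invented for illustration: Y uniform on {1, 2, 3} and X = Y + noise, so that E(X|Y = j) = j):

```python
import random
from collections import defaultdict

rng = random.Random(1)
sums, counts = defaultdict(float), defaultdict(int)
for _ in range(200000):
    y = rng.randint(1, 3)            # discrete Y defines the partition {Omega_j}
    x = y + rng.gauss(0.0, 1.0)      # by construction E(X | Y = j) = j
    sums[y] += x
    counts[y] += 1

# Empirical analogue of (1.39): average X over each Omega_j = {Y = j}.
cond_exp = {j: sums[j] / counts[j] for j in sorted(sums)}
print(cond_exp)
```

Each group average is close to the corresponding conditional expectation E(X|Y = j) = j.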
Proof. We have
E(X − g(Y))^2 = E(X − E(X|Y))^2 + E(E(X|Y) − g(Y))^2 + 2E[(X − E(X|Y))(E(X|Y) − g(Y))]
and
E[(X − E(X|Y))(E(X|Y) − g(Y))] = E{ E[(X − E(X|Y))(E(X|Y) − g(Y)) | Y] } = E[(E(X|Y) − E(X|Y))(E(X|Y) − g(Y))] = 0,
which implies (1.40). □
Remark 1.31. Note that the power p does not need to be greater than 1
since it still gives a metric for the random variables. But only when p ≥ 1
do we get a norm. The statements in this section hold when 0 < p < ∞.
Proof. The proof of the ﬁrst statements is straightforward. For the second
statement, we have
|f(ξ_1) − f(ξ_2)| = |E(e^{iξ_1 X} − e^{iξ_2 X})| = |E(e^{iξ_1 X}(1 − e^{i(ξ_2−ξ_1)X}))| ≤ E|1 − e^{i(ξ_2−ξ_1)X}|.
Since the integrand |1 − e^{i(ξ_2−ξ_1)X}| depends only on the difference between
ξ1 and ξ2 and since it tends to 0 almost surely as the diﬀerence goes to
0, uniform continuity follows immediately from the dominated convergence
theorem.
Example 1.34. Here are the characteristic functions of some typical dis
tributions:
(1) Bernoulli distribution.
f (ξ) = q + peiξ .
(2) Binomial distribution B(n, p).
f (ξ) = (q + peiξ )n .
(3) Poisson distribution P(λ).
f(ξ) = e^{λ(e^{iξ} − 1)}.
(4) Exponential distribution E(λ).
f(ξ) = (1 − λ^{−1} iξ)^{−1}.
(5) Normal distribution N(μ, σ^2).
(1.47) f(ξ) = exp(iμξ − σ^2 ξ^2/2).
(6) Multivariate normal distribution N(μ, Σ).
(1.48) f(ξ) = exp(iμ^T ξ − (1/2) ξ^T Σ ξ).
Proof. We only prove the necessity part. The other part is less trivial and
readers may consult [Chu01]. Assume that f is a characteristic function.
Then
∑_{j,k=1}^n f(ξ_j − ξ_k) z_j z̄_k = E| ∑_{j=1}^n z_j e^{iξ_j X} |^2 ≥ 0
for any ξ_1, . . . , ξ_n ∈ R and z_1, . . . , z_n ∈ C; i.e., f is positive semidefinite.
For a discrete random variable with generating function G(x), the distribution can be recovered from
P(X = x_k) = (1/k!) G^{(k)}(x) |_{x=0}.
(1.52) c_k = ∑_{j=0}^k a_j b_{k−j},
with {c_k} = {a_k} ∗ {b_k} satisfying C(x) = A(x)B(x). The following result is more or less obvious.
P(X = j) = a_j, P(Y = k) = b_k.
This gives
(1.55) M(t) = ∑_{n=0}^∞ μ_n t^n/n!,
which explains why M(t) is called the moment generating function.
Theorem 1.40. Denote by M_X(t), M_Y(t), and M_{X+Y}(t) the moment generating functions of the random variables X, Y, and X + Y, respectively. If X, Y are independent, then
(1.56) M_{X+Y}(t) = M_X(t) M_Y(t).
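Both (1.52) and (1.56) are easy to verify numerically for small integer-valued distributions. The sketch below (the Bernoulli(1/2) example and the evaluation point t = 0.3 are arbitrary choices of ours) convolves two probability vectors and checks the product rule for the moment generating functions:

```python
import math

def convolve(a, b):
    """Distribution of X + Y for independent integer-valued X, Y with
    P(X = j) = a[j], P(Y = k) = b[k]; this is the convolution (1.52)."""
    c = [0.0] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            c[j + k] += aj * bk
    return c

def mgf(p, t):
    """Moment generating function of an integer-valued distribution p."""
    return sum(pk * math.exp(t * k) for k, pk in enumerate(p))

a = [0.5, 0.5]                          # Bernoulli(1/2)
c = convolve(a, a)                      # sum of two independent copies: B(2, 1/2)
print(c)                                # [0.25, 0.5, 0.25]
print(mgf(c, 0.3) - mgf(a, 0.3) ** 2)   # ~0, illustrating (1.56)
```

The second printed value vanishes up to rounding, since M_{X+Y} = M_X M_Y for independent X, Y.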
Lemma 1.41.
(1) If ∑_{n=1}^∞ P(A_n) < ∞, then P({A_n i.o.}) = 0.
(2) If ∑_{n=1}^∞ P(A_n) = ∞ and the A_n's are mutually independent, then P({A_n i.o.}) = 1.
for any n. The last term goes to 0 as n → ∞ since ∑_{k=1}^∞ P(A_k) < ∞ by assumption.
(2) Using independence, one has
P( ∪_{k=n}^∞ A_k ) = 1 − P( ∩_{k=n}^∞ A_k^c ) = 1 − ∏_{k=n}^∞ P(A_k^c) = 1 − ∏_{k=n}^∞ (1 − P(A_k)).
lim_{n→∞} X_n/n = 0 a.s.
Then
∑_{n=1}^∞ P(A_n) = ∑_{n=1}^∞ P(|X_n| > nε) = ∑_{n=1}^∞ P(|X_1| > nε)
= ∑_{n=1}^∞ ∑_{k=n}^∞ P(kε < |X_1| ≤ (k + 1)ε)
= ∑_{k=1}^∞ k P(kε < |X_1| ≤ (k + 1)ε)
= ∑_{k=1}^∞ k ∫_{kε < |X_1| ≤ (k+1)ε} P(dω)
≤ ∑_{k=1}^∞ (1/ε) ∫_{kε < |X_1| ≤ (k+1)ε} |X_1| P(dω)
= (1/ε) ∫_{ε < |X_1|} |X_1| P(dω) ≤ (1/ε) E|X_1| < ∞.
Therefore if we define
B_ε = {ω ∈ Ω : ω ∈ A_n i.o.},
then P(B_ε) = 0. Let B = ∪_{n=1}^∞ B_{1/n}. Then P(B) = 0, and
lim_{n→∞} X_n(ω)/n = 0 if ω ∉ B.
The proof is complete. □
With the BorelCantelli lemma, we can give the proof of (ii) in Theorem
1.32.
∑_{k=1}^∞ P(|X_{n_k}| ≥ ε) ≤ ∑_{k=1}^{k_0} P(|X_{n_k}| ≥ ε) + ∑_{k=k_0+1}^∞ 1/2^k < ∞.
Hence P(|X_{n_k}| ≥ ε, i.o.) = 0. Thus the subsequence X_{n_k} converges to 0 almost surely by the same argument as in the example above.
Exercises
In the following, i.i.d. stands for independent, identically distributed.
1.1. Let X = ∑_{j=1}^n X_j, where the X_j are i.i.d. random variables with Bernoulli distribution with parameter p. Prove that X ∼ B(n, p).
1.2. Let S_n be the number of heads in the outcome of n independent tosses of a fair coin. Its distribution is given by
p_j = P(S_n = j) = (1/2^n) C_n^j.
Show that the mean and the variance of this random variable are
ES_n = n/2, Var(S_n) = n/4.
This implies that Var(S_n)/(ES_n)^2 → 0 as n → ∞.
1.3. Prove that
C_n^k p^k q^{n−k} → (λ^k/k!) e^{−λ} for any fixed k
as n → ∞, p → 0, and np = λ. This is the classical Poisson approximation to the binomial distribution.
1.4. Numerically investigate the limit processes
1.14. (Variance identity) Suppose the random variable X has the partition X = (X^{(1)}, X^{(2)}). Prove the variance identity for any integrable function f:

H(X) = −∑_{i=1}^n p_i log p_i.
Prove that H(X) ≥ 0 and that the equality holds only when X is reduced to a deterministic variable; i.e., p_i = δ_{ik_0} for some fixed integer k_0 in {1, 2, . . . , n}. Here we take the convention 0 log 0 = 0.
1.17. Let X = (X_1, X_2, . . . , X_n) be i.i.d. continuous random variables with common distribution function F and density ρ(x) = F′(x). (X_{(1)}, X_{(2)}, . . . , X_{(n)}) are called the order statistics of X if X_{(k)} is the kth smallest of these random variables. Prove that the density of the order statistics is given by
p(x_1, x_2, . . . , x_n) = n! ∏_{k=1}^n ρ(x_k), x_1 < x_2 < · · · < x_n.
a(X + Y − b)
where the notation means summing of the products over all
possible partitions of X1 , . . . , Xk into pairs; e.g., if (X, Y, Z) is
jointly Gaussian, we have
(1.59) E(X^2 Y^2 Z^2) = (EX^2)(EY^2)(EZ^2) + 2(EYZ)^2 EX^2 + 2(EXY)^2 EZ^2 + 2(EXZ)^2 EY^2 + 8(EXY)(EYZ)(EXZ).
Each term in (1.59) can be schematically mapped to some graph.
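Identity (1.59) can be spot-checked by Monte Carlo. In the sketch below the jointly Gaussian triple is our own illustrative construction: X = g1, Y = g1 + g2, Z = g2 with g1, g2 independent standard normals, so that EX^2 = EZ^2 = 1, EY^2 = 2, EXY = EYZ = 1, EXZ = 0, and the right-hand side of (1.59) evaluates to 6:

```python
import random

rng = random.Random(7)
n = 400000
acc = 0.0
for _ in range(n):
    g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x, y, z = g1, g1 + g2, g2        # jointly Gaussian, mean zero
    acc += (x * y * z) ** 2

# Wick's formula (1.59) predicts E(X^2 Y^2 Z^2)
# = 1*2*1 + 2*1*1 + 2*1*1 + 0 + 0 = 6.
print(acc / n)
```

The sample average lands close to the predicted value 6, consistent with the pairing formula.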
Notes
The axiomatic approach to probability theory starts with Kolmogorov's masterpiece [Kol56]. There are already lots of excellent textbooks on the introduction to elementary probability theory [Chu01, Cin11, Dur10, KS07,
Shi96]. To know more about the measure theory approach to probability
theory, please consult [Bil79]. More on the practical side of probability
theory can be found in the classic book by Feller [Fel68] or the books by
some applied mathematicians and scientists [CH13, CT06, Gar09, VK04].
Chapter 2
Limit Theorems
S_n/n → η in probability.
P( |S_n/n| > ε ) ≤ (1/(n^2 ε^2)) E|S_n|^2
for any ε > 0. Using independence, we have
E|S_n|^2 = ∑_{i,j=1}^n E(X_i X_j) = n E|X_1|^2.
Hence
P( |S_n/n| > ε ) ≤ (1/(n ε^2)) E|X_1|^2 → 0,
as n → ∞.
Theorem 2.2 (Strong law of large numbers (SLLN)). Let {X_j}_{j=1}^∞ be a sequence of i.i.d. random variables such that E|X_j| < ∞. Then
S_n/n → η a.s.
Proof. We will only give a proof of the SLLN under the stronger assumption that E|X_j|^4 < ∞. The proof under the stated assumption can be found in [Chu01].
Without loss of generality, we can assume η = 0. Using the Chebyshev inequality, we have
P( |S_n/n| > ε ) ≤ (1/(n^4 ε^4)) E|S_n|^4.
Using independence, we get
E|S_n|^4 = ∑_{i,j,k,l=1}^n E(X_i X_j X_k X_l) = n E|X_j|^4 + 3n(n − 1)(E|X_j|^2)^2.
Hence
P( |S_n/n| > ε ) ≤ C/(n^2 ε^4)
for some constant C > 0.
Now we are in a position to use the Borel–Cantelli lemma, since the series ∑_n 1/n^2 is summable. This gives us
P( |S_n/n| > ε, i.o. ) = 0.
Hence S_n/n → 0 a.s. □
The weak and strong laws of large numbers may not hold if the moment
condition is not satisﬁed.
f(ξ/√n) = f(0) + (ξ/√n) f′(0) + (ξ^2/(2n)) f″(0) + o(1/n)
= 1 − ξ^2/(2n) + o(1/n).
Hence
(2.2) g_n(ξ) = [f(ξ/√n)]^n = (1 − ξ^2/(2n) + o(1/n))^n → e^{−ξ^2/2} as n → ∞
for every ξ ∈ R. Using Theorem 1.35, we obtain the stated result. □
The CLT gives an estimate for the rate of convergence in the law of large numbers. Since by the CLT we have
S_n/n − η ≈ (σ/√n) N(0, 1),
the rate of convergence of S_n/n to η is O(n^{−1/2}). This is the reason why most Monte Carlo methods converge at half order with respect to the sample size.
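This half-order behavior is easy to see empirically. The sketch below (averaging U[0, 1] samples; the repeat count of 200 and the two sample sizes are arbitrary choices of ours) measures the mean absolute error of the sample mean and shows it dropping by roughly a factor of 10 when n grows by a factor of 100:

```python
import random

def mc_error(n, seed):
    """|S_n/n - eta| for U[0,1] samples, where eta = 1/2."""
    rng = random.Random(seed)
    s = sum(rng.random() for _ in range(n))
    return abs(s / n - 0.5)

results = {}
for n in (100, 10000):
    errs = [mc_error(n, seed) for seed in range(200)]
    results[n] = sum(errs) / len(errs)
    print(n, results[n])
```

The ratio of the two averaged errors is close to √100 = 10, matching the O(n^{−1/2}) rate.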
Remark 2.6 (Stable laws). Theorem 2.4 requires that the variance of Xj be
ﬁnite. For random variables with unbounded variances, one can show that
the following generalized central limit theorem holds: If there exist {an } and
{bn } such that
P(an (Sn − bn ) ≤ x) → G(x) as n → ∞,
then the distribution G(x) must be a stable law. A probability distribution is
said to be a stable law if any linear combination of two independent random
variables with that distribution has again the same distribution, up to a
shift and rescaling. One simple instance of a stable law is the Cauchy distribution considered in Example 2.3, for which we can take a_n = 1/n and b_n = 0, and the limit G(x) is exactly the distribution function of the Cauchy–Lorentz distribution (the limit process is in fact an identity in this case).
For more details about the generalized central limit theorem and stable law
see [Bre92, Dur10].
Here the symbol ≍ means logarithmic equivalence; i.e., lim_{ε→0} log c_ε / log d_ε = 1 if c_ε ≍ d_ε.
This is the large deviation theorem (LDT) for i.i.d. random variables. It
states roughly that the probability of large deviation events converges to 0
exponentially fast. I(x) is called the rate function.
To study the properties of the rate function, we need some results on
the LegendreFenchel transform.
Lemma 2.8. Assume that f : R^d → R̄ = R ∪ {+∞} is a lower semicontinuous convex function. Let F be its conjugate function (the Legendre–Fenchel transform):
F(y) = sup_x {(x, y) − f(x)}.
Then:
(i) F is also a lower semicontinuous convex function.
(ii) The Fenchel inequality holds:
(x, y) ≤ f (x) + F (y).
(iii) The conjugacy relation holds:
f (x) = sup{(x, y) − F (y)}.
y
Proof. (i) The convexity of Λ(λ) follows from Hölder's inequality. For any 0 ≤ θ ≤ 1,
Λ(θλ_1 + (1 − θ)λ_2) = log E[exp(θλ_1 X_j) exp((1 − θ)λ_2 X_j)]
≤ log[(E exp(λ_1 X_j))^θ (E exp(λ_2 X_j))^{1−θ}]
= θΛ(λ_1) + (1 − θ)Λ(λ_2).
Thus Λ(λ) is a convex function. The rest is a direct application of Lemma 2.8. □
= exp(−I(x)) < ∞.
Thus μ((x, ∞)) = 0 and
exp(−I(x)) = lim_{n→∞} ∫_x^∞ exp(λ_n(y − x)) μ(dy) = μ({x}).
Since
μ_n(G) ≥ μ_n({x}) ≥ (μ({x}))^n = exp(−nI(x)),
we get
(1/n) log μ_n(G) ≥ −I(x).
A similar argument can be applied to the case when x < 0.
Case 2. Assume that the supremum in the definition of I(x) is attained at λ_0:
I(x) = λ_0 x − Λ(λ_0).
Then x = Λ′(λ_0) = M′(λ_0)/M(λ_0). Define a new probability measure
μ̃(dy) = (1/M(λ_0)) exp(λ_0 y) μ(dy).
Its expectation is given by
∫_R y μ̃(dy) = (1/M(λ_0)) ∫_R y exp(λ_0 y) μ(dy) = M′(λ_0)/M(λ_0) = x.
Thus
lim_{n→∞} (1/n) log μ_n(G) ≥ −λ_0(x + δ) + Λ(λ_0) = −I(x) − λ_0 δ for all 0 < δ ≪ 1.
A similar argument can be applied to the case when x < 0.
Example 2.10 (Cramér's theorem applied to the Bernoulli distribution with parameter p (0 < p < 1)). We have Λ(λ) = log(pe^λ + q), where q = 1 − p. The rate function is
(2.9) I(x) = x log(x/p) + (1 − x) log((1 − x)/q) for x ∈ [0, 1], and I(x) = +∞ otherwise.
It is obvious that I(x) ≥ 0, and I(x) achieves its global minimum 0 at x* = p. The function I is a special case of the so-called relative entropy, or Kullback–Leibler distance, defined between two (discrete) distributions μ and ν:
(2.10) D(μ‖ν) = ∑_{i=1}^r μ_i log(μ_i/ν_i),
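Cramér's theorem for this example can be checked with exact binomial tails. The sketch below (with p = 1/2 and x = 0.7, both our own illustrative choices) compares −(1/n) log P(S_n ≥ nx) against the rate function I(x); the estimates approach I(x) from above as n grows:

```python
import math

def rate(x, p):
    """Rate function (2.9) for the Bernoulli(p) distribution."""
    q = 1 - p
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / q)

def upper_tail(n, x, p):
    """Exact P(S_n >= n x) for S_n ~ B(n, p)."""
    k0 = math.ceil(n * x)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k0, n + 1))

p, x = 0.5, 0.7
ests = []
for n in (50, 200, 800):
    ests.append(-math.log(upper_tail(n, x, p)) / n)
    print(n, ests[-1], rate(x, p))
```

The residual gap is of order (log n)/n, coming from the subexponential prefactor in the tail.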
Here Sn (E) is the Boltzmann entropy for the system with n spins. Denote
by S(E) the limit of Sn (E)/n as n → ∞. We get
(2.12) kB I(E) = kB log 2 − S(E).
So we obtain that the rate function is the negative entropy (with factor 1/kB )
up to an additive constant in this simplest setup. In fact this is a general
statement.
In the canonical ensemble in statistical mechanics (the number of spins n and the temperature T are fixed in this setup), let us investigate the physical meaning of Λ. The logarithmic moment generating function of H_n(ω) = nh_n(ω) with normalization 1/n is
Λ(λ) = lim_{n→∞} (1/n) log E e^{λH_n},
where we take H_n instead of nh_n since it admits a more general interpretation. Take λ = −β = −(k_B T)^{−1}. We have
Λ(−β) = lim_{n→∞} (1/n) log ∑_ω e^{−βH_n(ω)} − log 2.
Define the partition function
Z_n(β) = ∑_ω e^{−βH_n(ω)}
and the Helmholtz free energy
F_n(β) = −β^{−1} log Z_n(β).
We have
Λ(−β) = −β lim_{n→∞} (1/n) F_n(β) − log 2 = −βF(β) − log 2.
Thus the free energy F(β) is the negative logarithmic moment generating function up to a constant.
According to the large deviation theory we have
−βF(β) − log 2 = sup_E {−βE − log 2 + k_B^{−1} S(E)};
i.e.,
F(β) = inf_E {E − TS(E)}.
The infimum is achieved at the critical point E* such that
∂S(E)/∂E |_{E=E*} = 1/T,
which is exactly a thermodynamic relation between S and T. Here E* is called the internal energy U. Readers may refer to [Cha87, Ell85] for more details.
as n → ∞, or equivalently,
P(n(M_n − 1) ≤ x) → e^x for x ≤ 0.
This implies that 1 − M_n = O(1/n) as n → ∞.
Exercises
2.1. Let X_j be i.i.d. U[0, 1] random variables. Prove that the limits
lim_{n→∞} n/(X_1^{−1} + X_2^{−1} + · · · + X_n^{−1}), lim_{n→∞} (X_1 X_2 · · · X_n)^{1/n}, lim_{n→∞} (X_1^2 + · · · + X_n^2)/n
exist almost surely.
Prove that under the condition EX_j = μ < ∞, we have the service rate of the queue
lim_{t→∞} N_t/t = 1/μ a.s.
2.3. The fact that the "central limit" of i.i.d. random variables must be Gaussian can be understood from the following viewpoint. Let X_1, X_2, . . . be i.i.d. random variables with mean 0. Let
(2.15) Z_n = (X_1 + · · · + X_n)/√n →^d X and Z_{2n} = (X_1 + · · · + X_{2n})/√(2n) →^d X.
Let the characteristic function of X be f(ξ).
(a) Prove that f(ξ) = f^2(ξ/√2).
(b) Prove that f(ξ) is the characteristic function of a Gaussian random variable if f ∈ C^2(R).
(c) Investigate the situation when the scaling 1/√n in (2.15) is replaced with 1/n. Prove that X corresponds to the Cauchy–Lorentz distribution under the symmetry condition f(ξ) = f(−ξ), or f(ξ) ≡ 1.
(d) If the scaling 1/√n is replaced by 1/n^α, what can we infer about the characteristic function f(ξ) if we assume f(ξ) = f(−ξ)? What is the correct range of α?
2.4. Prove the assertion in Example 2.3.
2.5. (Single-sided Laplace lemma) Suppose that h(x) attains its only maximum at x = 0, h′ ∈ C^1(0, +∞), h′(0) < 0, and h(x) < h(0) for x > 0. Here h(x) → −∞ as x → ∞, and ∫_0^∞ e^{h(x)} dx converges.
T_N = { D ∈ S^N : | |B_j|/N − p(j) | < δ(ε), j = 1, . . . , r },
where |B_j| is the number of elements in the set B_j. Prove that for any fixed δ > 0
lim_{N→∞} P(T_N) = 1.
(c) For any ε > 0, there exists δ(ε) > 0 such that for sufficiently large N
e^{N(H−ε)} ≤ |T_N| ≤ e^{N(H+ε)},
and for any D ∈ T_N, we have
e^{−N(H+ε)} ≤ P(D) ≤ e^{−N(H−ε)}.
This property is called the asymptotic equipartition principle for the elements in a typical set.
2.8. Let {X_j} be i.i.d. random variables. Suppose that the variance of X_j is finite. Then by the CLT the distribution G(x) in Remark 2.6 is Gaussian for b_n = nEX_1, a_n = (n Var(X_1))^{−1/2}. Show that one also recovers the WLLN for a different choice of {a_n} and {b_n}. What is the stable distribution G(x) in this case?
2.9. For the statistics of extrema, verify that one can take
a_n = √(2 log n), b_n = √(2 log n) − (log log n + log 4π)/(2√(2 log n))
for the normal distribution N(0, 1).
Notes
Rigorous mathematical work on the law of large numbers for general random
variables starts from that of Chebyshev in the 1860s. The strong law of
large numbers stated in this chapter is due to Kolmogorov [Shi92]. The
fact that the Gaussian distribution and stable laws are the scaling limit of
empirical averages of i.i.d. random variables can also be understood using
renormalization group theory [ZJ07]. A brief introduction to the LDT can
be found in [Var84].
Chapter 3
Markov Chains
Given X_n = i, we have
P(X_{n+1} = i ± 1 | X_n = i) = P(ξ_{n+1} = ±1) = 1/2
and P(X_{n+1} = anything else | X_n = i) = 0. We see that, knowing X_n, the distribution of X_{n+1} is completely determined. In other words,
(3.1) P(X_{n+1} = i_{n+1} | {X_m = i_m}_{m=0}^n) = P(X_{n+1} = i_{n+1} | X_n = i_n);
i.e., the probability of X_{n+1} conditional on the whole past history of the sequence, {X_m}_{m=0}^n, is equal to the probability of X_{n+1} conditional on the latest value alone, X_n. The sequence (or process) {X_n}_{n∈N} is called a Markov process. In contrast, consider instead
Y_n = ∑_{j=1}^n (1/2)(ξ_j + 1)ξ_{j+1}.
{Y_n}_{n∈N} is also a kind of random walk on Z, but with a bias: the walk stops if the last move was to the left, and it remains there until some subsequent ξ_j = +1. It is easy to see that
P(Y_{n+1} = i_{n+1} | {Y_m = i_m}_{m=0}^n) ≠ P(Y_{n+1} = i_{n+1} | Y_n = i_n)
in general, and the sequence {Y_n}_{n∈N} is not a Markov process.
Example 3.2 (Ehrenfest's diffusion model). Consider a container separated by a permeable membrane in the middle and filled with a total of K particles. At each time n = 1, 2, . . ., pick one particle at random among the K particles and place it into the other part of the container. Let X_n be the number of particles in the left part of the container. Then we have
P(X_{n+1} = i_{n+1} | {X_m = i_m}_{m=0}^n) = P(X_{n+1} = i_{n+1} | X_n = i_n)
for i_k ∈ {0, 1, . . . , K}, and this is also a Markov process.
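A quick simulation shows the long-run behavior of the Ehrenfest chain: the fraction of time spent in each state approaches the binomial distribution B(K, 1/2), which is its invariant distribution. (The values K = 10 and the step count below are arbitrary choices of ours.)

```python
import random, math

rng = random.Random(3)
K, steps = 10, 200000
x = K // 2
counts = [0] * (K + 1)
for _ in range(steps):
    # A uniformly chosen particle switches sides: with probability x/K it
    # comes from the left (x decreases by one), otherwise from the right.
    if rng.random() < x / K:
        x -= 1
    else:
        x += 1
    counts[x] += 1

empirical = [c / steps for c in counts]
binomial = [math.comb(K, j) / 2**K for j in range(K + 1)]
print(max(abs(e - b) for e, b in zip(empirical, binomial)))
```

The printed maximal discrepancy is small, even though the chain is periodic; time averages still converge to the invariant distribution.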
Figure 3.1. Graphical representation of two Markov chains in (3.4) and (3.5).
The following questions are of special interest for a given Markov chain.
(i) Existence: Is there an invariant distribution?
This is equivalent to asking whether there exists a nonnegative
left eigenvector of P with eigenvalue equal to 1. Note that 1 is
always an eigenvalue of P since it always has the right eigenvector
1 = (1, . . . , 1)T by the second property in (3.2).
(ii) Uniqueness: When is the invariant distribution unique?
The answer of these questions is closely related to the spectral structure
of the matrix P . We begin with the following preliminary result.
Lemma 3.5. The spectral radius of P is equal to 1:
ρ(P) = max_λ |λ| = 1,
where the maximum is taken over all eigenvalues λ of P. Indeed, if u is a left eigenvector, uP = λu, then
|λ| ∑_{i∈S} |u_i| = ∑_{i∈S} | ∑_{j∈S} u_j p_{ji} | ≤ ∑_{i,j∈S} |u_j| p_{ji} = ∑_{j∈S} |u_j|.
Hence |λ| ≤ 1.
Definition 3.6 (Irreducibility). If there exists a permutation matrix R such that
(3.6) R^T P R = [ A_1 B ; 0 A_2 ],
then P is called reducible. Otherwise P is called irreducible.
Theorem 3.11. Assume that the Markov chain is primitive. Then for any initial distribution μ_0,
μ_n = μ_0 P^n → π exponentially fast as n → ∞,
where π is the unique invariant distribution.
Proof. Given two distributions, μ_0 and μ̃_0, we define the total variation distance by
d(μ_0, μ̃_0) = (1/2) ∑_{i∈S} |μ_{0,i} − μ̃_{0,i}|.
Since
0 = ∑_{i∈S} (μ_{0,i} − μ̃_{0,i}) = ∑_{i∈S} (μ_{0,i} − μ̃_{0,i})_+ − ∑_{i∈S} (μ_{0,i} − μ̃_{0,i})_−,
we have
d(μ_s, μ̃_s) = (1/2) ∑_{i∈S} | ∑_{j∈S} (μ_{0,j} − μ̃_{0,j})(P^s)_{ji} | ≤ ∑_{j∈S} (μ_{0,j} − μ̃_{0,j})_+ ∑_{i∈B_+} (P^s)_{ji},
where B_+ is the subset of indices i for which ∑_{j∈S} (μ_{0,j} − μ̃_{0,j})(P^s)_{ji} > 0. We note that B_+ cannot contain all the elements of S; otherwise one must have (μ_0 P^s)_i > (μ̃_0 P^s)_i for all i and
∑_{i∈S} (μ_0 P^s)_i > ∑_{i∈S} (μ̃_0 P^s)_i,
which is impossible since both sides sum to 1. Therefore at least one element is missing in B_+. By assumption, there exist an s > 0 and α ∈ (0, 1) such that (P^s)_{ij} ≥ α for all pairs (i, j). Hence ∑_{i∈B_+} (P^s)_{ji} ≤ 1 − α < 1. Therefore
d(μ_s, μ̃_s) ≤ d(μ_0, μ̃_0)(1 − α);
i.e., the Markov chain is contractive after every s steps. Similarly for any m ≥ 0,
d(μ_n, μ_{n+m}) ≤ d(μ_{n−sk}, μ_{n+m−sk})(1 − α)^k ≤ (1 − α)^k,
where k is the largest integer such that n − sk ≥ 0. If n is sufficiently large, the right-hand side can be made arbitrarily small. Therefore the sequence {μ_n}_{n=0}^∞ is a Cauchy sequence. Hence it has to converge to a limit π, which satisfies
π = lim_{n→∞} μ_0 P^{n+1} = lim_{n→∞} (μ_0 P^n) P = πP.
The probability distribution satisfying such a property is also unique, for if there were two such distributions, π^(1) and π^(2), then
d(π^(1), π^(2)) = d(π^(1) P^s, π^(2) P^s) < d(π^(1), π^(2)).
This implies d(π^(1), π^(2)) = 0, i.e., π^(1) = π^(2). □
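The contraction argument is also a practical algorithm: iterating μ → μP from any initial distribution converges to π. A minimal sketch (the 3 × 3 transition matrix is an invented example with all entries positive, so s = 1 in the theorem):

```python
def step(mu, P):
    """One step of the chain acting on distributions: mu -> mu P."""
    n = len(P)
    return [sum(mu[j] * P[j][i] for j in range(n)) for i in range(n)]

def tv(mu, nu):
    """Total variation distance d(mu, nu) = (1/2) sum_i |mu_i - nu_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.3, 0.5]]
mu, nu = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
for _ in range(50):
    mu, nu = step(mu, P), step(nu, P)

pi = mu
print(tv(mu, nu))           # both initial data reach the same limit
print(tv(pi, step(pi, P)))  # the limit is invariant: pi = pi P
```

Both printed distances are essentially zero, illustrating the exponential contraction in total variation.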
Poisson processes are often used to model the number of telephone calls
received at a phone booth, or the number of customers waiting in a queue,
etc.
Theorem 3.14. For any ﬁxed time t, the distribution of Nt is Poisson with
parameter λt.
3.5. Q-processes
Now let us turn to a type of continuous time Markov chains called Q-processes. We assume that the trajectory {X_t}_{t∈R_+} is right-continuous and the state space S is finite, i.e., S = {1, 2, . . . , I}. Let the transition probability
p_{ij}(t) = P(X_{t+s} = j | X_s = i).
Here we have assumed that the Markov chain is stationary; i.e., the right-hand side is independent of s. By definition we have
p_{ij}(t) ≥ 0, ∑_{j∈S} p_{ij}(t) = 1.
Q is called the generator of the Markov chain. For simplicity, we also define q_i = −q_{ii} ≥ 0.
Since
(3.17) (P(t + h) − P(t))/h = P(t) (P(h) − I)/h,
letting h → 0+, we get the well-known forward equation for P(t)
(3.18) dP(t)/dt = P(t)Q.
Since P(h) and P(t) commute, we also get from (3.17) the backward equation
(3.19) dP(t)/dt = QP(t).
Both the solutions of the forward and the backward equations are given by
P(t) = e^{Qt} P(0) = e^{Qt}
since P(0) = I.
Next we discuss how the distribution of the Markov chain evolves in
time. Let μ(t) be the distribution of Xt . Then
μ_j(t + dt) = ∑_{i≠j} μ_i(t) p_{ij}(dt) + μ_j(t) p_{jj}(dt)
= ∑_{i≠j} μ_i(t) q_{ij} dt + μ_j(t)(1 + q_{jj} dt) + o(dt)
Q̃ is called the jump matrix, and the corresponding Markov chain is called the embedded chain or jump chain of the original Q-process.
Define the jump times of {X_t}_{t≥0} by
J_0 = 0, J_{n+1} = inf{t ≥ J_n : X_t ≠ X_{J_n}}, n ∈ N,
where we take the convention inf ∅ = ∞. Define the holding times by
(3.25) H_n = J_n − J_{n−1} if J_{n−1} < ∞, and H_n = ∞ otherwise,
for n = 1, 2, . . .. We let X_∞ = X_{J_n} if J_{n+1} = ∞. The jump chain induced by X_t is defined to be
Y_n = X_{J_n}, n ∈ N.
The following result provides a connection between the jump chain and
the original Qprocess.
Figure 3.2. Schematics of the jump time and holding time of a Qprocess.
Theorem 3.15. The jump chain {Y_n} is a Markov chain with Q̃ as the transition probability matrix, and the holding times H_1, H_2, . . . are independent exponential random variables with parameters q_{Y_0}, q_{Y_1}, . . ., respectively.
The proof relies on the strong Markov property of the Q-process. See Section E of the appendix and [Nor97].
A state j is said to be accessible from i if there exists t > 0 such that p_{ij}(t) > 0. Similarly to the discrete time case, one can define communication and irreducibility.
Theorem 3.16. The Q-process X is irreducible if and only if the embedded chain Y is irreducible.
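Theorem 3.15 gives a direct simulation recipe for a Q-process: hold in state i for an Exp(q_i) time, then move according to the jump chain. The sketch below uses an invented two-state generator (the rates a = 1, b = 2 are arbitrary), for which the invariant distribution is π = (b, a)/(a + b):

```python
import random

rng = random.Random(5)
a, b = 1.0, 2.0
Q = [[-a, a], [b, -b]]               # generator of a two-state Q-process

def occupation(T):
    """Simulate up to time T via holding times; return time fractions."""
    t, state = 0.0, 0
    time_in = [0.0, 0.0]
    while t < T:
        q_i = -Q[state][state]
        hold = rng.expovariate(q_i)  # Exp(q_i) holding time (Theorem 3.15)
        time_in[state] += min(hold, T - t)
        t += hold
        state = 1 - state            # with two states the jump chain alternates
    return [ti / T for ti in time_in]

frac = occupation(200000.0)
print(frac)                          # close to [b/(a+b), a/(a+b)] = [2/3, 1/3]
```

The occupation fractions illustrate the ergodic theorem below: time averages converge to averages under the invariant distribution.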
Proof. From irreducibility and Theorem 3.16, we know that p_{ij}(t) > 0 for any i ≠ j and t > 0. It is straightforward to see that
p_{ii}(t) ≥ P_i(J_1 > t) = e^{−q_i t} > 0
for any i ∈ S and t > 0, where P_i is the probability distribution conditioned on X_0 = i. From this, we conclude that there exist t_0 > 0 and α ∈ (0, 1) such that p_{ij}(t_0) ≥ α for any pairs (i, j). By following a similar induction procedure as in the proof of Theorem 3.11, we obtain, for any two initial distributions μ(0), μ̃(0),
d(μ(t), μ̃(t)) ≤ d(μ(t − kt_0), μ̃(t − kt_0))(1 − α)^k, t ≥ kt_0, k ∈ N.
This implies the existence and uniqueness of the invariant distribution π for the Q-process. Furthermore, we have
d(μ(t), π) ≤ d(μ(t − kt_0), π)(1 − α)^k, t ≥ kt_0, k ∈ N.
Letting t, k → ∞, we get the desired result. □
Theorem 3.18 (Ergodic theorem). Assume that Q is irreducible. Then for any bounded function f we have
(1/T) ∫_0^T f(X_s) ds → ⟨f⟩_π a.s.,
as T → ∞, where π is the unique invariant distribution.
Note that
P(X̂_0 = i_0, X̂_1 = i_1, . . . , X̂_N = i_N) = P(X_N = i_0, X_{N−1} = i_1, . . . , X_0 = i_N) = π_{i_N} p_{i_N i_{N−1}} · · · p_{i_1 i_0} = π_{i_0} p̂_{i_0 i_1} · · · p̂_{i_{N−1} i_N}.
In this case, we have p̂_{ij} = p_{ij} or q̂_{ij} = q_{ij}. We call the chain reversible. A reversible chain can be equipped with a variational structure and has nice spectral properties. In the discrete time setup, define the Laplacian matrix
L = P − I
and correspondingly its action on any function f:
(Lf)(i) = ∑_{j∈S} p_{ij}(f(j) − f(i)).
Let L^2_π be the space of square summable functions f endowed with the π-weighted scalar product
(3.29) (f, g)_π = ∑_{i∈S} π_i f(i)g(i).
One can show that D(f) = (f, −Lf)_π. Similar constructions can be done for Q-processes and more general stochastic processes. These formulations are particularly useful in potential theory for Markov chains [Soa94].
Figure: graphical structure of the hidden Markov model, with hidden states X_1, X_2, . . . , X_T emitting observations Y_1, Y_2, . . . , Y_T at times t = 1, 2, . . . , T.
In this setup, both P and R are time independent. The HMM is fully
speciﬁed by the parameter set:
θ = (P , R, μ).
The distribution P(Y θ) can be obtained via the following forward algo
rithm.
Algorithm 1 (Forward procedure). Finding the posterior probability under each model.
(i) Initialization. Compute α_1(i) = μ_i r_{i y_1} for i ∈ S.
(ii) Induction. Compute {α_n(i)} via (3.36).
(iii) Termination. Compute the final a posteriori probability P(Y | θ) = Σ_{i∈S} α_N(i).
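The three steps above can be sketched in a few lines of Python. The two-state parameters P, R, μ and the observation sequence below are illustrative toy choices, not taken from the text.

```python
# Minimal sketch of the forward procedure for a toy two-state HMM.

def forward(P, R, mu, y):
    """Return P(Y = y | theta) for theta = (P, R, mu)."""
    S = range(len(mu))
    # (i) Initialization: alpha_1(i) = mu_i * r_{i y_1}
    alpha = [mu[i] * R[i][y[0]] for i in S]
    # (ii) Induction: alpha_{n+1}(j) = (sum_i alpha_n(i) p_ij) * r_{j y_{n+1}}
    for obs in y[1:]:
        alpha = [sum(alpha[i] * P[i][j] for i in S) * R[j][obs] for j in S]
    # (iii) Termination: P(Y | theta) = sum_i alpha_N(i)
    return sum(alpha)

P = [[0.7, 0.3], [0.4, 0.6]]   # transition matrix (toy values)
R = [[0.9, 0.1], [0.2, 0.8]]   # emission matrix (toy values)
mu = [0.5, 0.5]                # initial distribution
likelihood = forward(P, R, mu, [0, 1, 0])
print(likelihood)
```

The cost is O(N|S|²), in contrast to the O(|S|^N) cost of naive enumeration over all paths.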
Optimal Prediction. From the joint distribution P(X, Y | θ), we have the conditional probability
  P(X | Y, θ) = P(X, Y | θ)/P(Y | θ) ∝ P(X, Y | θ)
since the parameter θ and the observation Y are already known. Thus given θ and Y, the optimal state sequence X can be defined as the most probable path that maximizes the probability P(X, Y | θ) in the path space S^N. Again, naive enumeration and maximization are not feasible. The Viterbi algorithm is designed for this task.
The Viterbi algorithm also uses ideas from dynamic programming to
decompose the optimization problem into many subproblems. First, let us
define the partial probability δ_n(i):
  δ_n(i) = max_{x_{1:n−1}} P(X_{1:n−1} = x_{1:n−1}, X_n = i, Y_{1:n} = y_{1:n} | θ).
To retrieve the state sequence, we need to keep track of the argument that maximizes δ_n(j) for each n and j. This can be done by recording the maximizer in step n in (3.37) in an array, denoted by ψ_n. The complete Viterbi algorithm is as follows.
Algorithm 2 (Viterbi algorithm). Finding the optimal state sequence X̂ = (X̂_{1:N}) given θ and Y.
(i) Initialization. Compute δ_1(i) = μ_i r_{i y_1} for i ∈ S.
(ii) Induction. Compute
  ψ_n(j) = arg max_{i∈S} {δ_n(i) p_ij},  n = 1, . . . , N − 1; j ∈ S,
  δ_n(j) = max_{i∈S} {δ_{n−1}(i) p_ij} r_{j y_n},  n = 2, . . . , N; j ∈ S.
(iii) Termination. Obtain the maximum probability and record the maximizer in the final step:
  P* = max_{i∈S} {δ_N(i)},  X̂_N = arg max_{i∈S} {δ_N(i)}.
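A minimal sketch of the dynamic programming recursion and the backtracking step, again with illustrative toy parameters:

```python
# Viterbi decoding for a toy two-state HMM (toy parameters, not from the text).

def viterbi(P, R, mu, y):
    """Return the most probable hidden path and P* = max_x P(X=x, Y=y | theta)."""
    S, N = range(len(mu)), len(y)
    delta = [mu[i] * R[i][y[0]] for i in S]   # delta_1(i) = mu_i r_{i y_1}
    back = []                                  # the arrays psi_n of maximizers
    for n in range(1, N):
        new_delta, psi = [], []
        for j in S:
            i_star = max(S, key=lambda i: delta[i] * P[i][j])
            psi.append(i_star)
            new_delta.append(delta[i_star] * P[i_star][j] * R[j][y[n]])
        delta = new_delta
        back.append(psi)
    # Termination: P*, hat X_N, then backtrack through the psi arrays
    x_last = max(S, key=lambda i: delta[i])
    path = [x_last]
    for psi in reversed(back):
        path.append(psi[path[-1]])
    path.reverse()
    return path, delta[x_last]

P = [[0.7, 0.3], [0.4, 0.6]]
R = [[0.9, 0.1], [0.2, 0.8]]
mu = [0.5, 0.5]
path, p_star = viterbi(P, R, mu, [0, 0, 1, 1])
print(path, p_star)
```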
given the observation Y, where P(Y | θ) is defined in (3.33). This maximization is nontrivial since it is generally nonconvex and involves hidden variables. Below we introduce a very simple iterative method that gives rise to a local maximizer, the well-known Baum-Welch algorithm proposed in the 1970s [BE67].
To do this, let us first introduce the backward procedure. We define the backward variable β_n(i), which represents the probability of the observation sequence from time n + 1 to N, given that the hidden state at time n is at state i and the model θ:
(3.39)  β_n(i) = P(Y_{n+1:N} = y_{n+1:N} | X_n = i, θ),  n = N − 1, . . . , 1.
The calculation of {β_n(i)} can be simplified using the dynamic programming principle:
(3.40)  β_n(i) = Σ_{j∈S} p_ij r_{j y_{n+1}} β_{n+1}(j),  n = N − 1, . . . , 1,
with the convention β_N(i) = 1.
  Σ_{n=1}^{N−1} ξ_n(i, j): expected number of transitions from state i to j given (Y, θ).
Note that the above equivalence is up to a constant, e.g., the number of transitions starting from state i in the whole observation time series.
Algorithm 4 (Baum-Welch algorithm). Parameter estimation of the HMM.
(i) Set up an initial HMM θ = (P, R, μ).
(ii) Update the parameter θ through
(3.42)  μ_i* = γ_1(i),
(3.43)  p_ij* = Σ_{n=1}^{N−1} ξ_n(i, j) / Σ_{n=1}^{N−1} γ_n(i),
(3.44)  r_jk* = Σ_{n=1, Y_n=k}^{N} γ_n(j) / Σ_{n=1}^{N} γ_n(j).
(iii) Repeat the above procedure until convergence.
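A compact sketch of one Baum-Welch iteration, combining the forward and backward recursions with the updates (3.42)-(3.44); the toy data below are illustrative. Since Baum-Welch is an instance of EM, the likelihood P(Y | θ) cannot decrease from one iteration to the next.

```python
# One Baum-Welch update for a toy two-state HMM (illustrative parameters).

def baum_welch_step(P, R, mu, y):
    """Return the updated parameters (P*, R*, mu*) given observations y."""
    S, N, K = range(len(mu)), len(y), len(R[0])
    # forward variables alpha_n(i)
    alpha = [[mu[i] * R[i][y[0]] for i in S]]
    for n in range(1, N):
        alpha.append([sum(alpha[-1][i] * P[i][j] for i in S) * R[j][y[n]] for j in S])
    # backward variables beta_n(i), with the convention beta_N(i) = 1
    beta = [[1.0] * len(mu) for _ in range(N)]
    for n in range(N - 2, -1, -1):
        beta[n] = [sum(P[i][j] * R[j][y[n + 1]] * beta[n + 1][j] for j in S) for i in S]
    PY = sum(alpha[N - 1])
    gamma = [[alpha[n][i] * beta[n][i] / PY for i in S] for n in range(N)]
    xi = [[[alpha[n][i] * P[i][j] * R[j][y[n + 1]] * beta[n + 1][j] / PY
            for j in S] for i in S] for n in range(N - 1)]
    mu_new = gamma[0][:]                                         # (3.42)
    P_new = [[sum(xi[n][i][j] for n in range(N - 1)) /
              sum(gamma[n][i] for n in range(N - 1)) for j in S] for i in S]  # (3.43)
    R_new = [[sum(gamma[n][j] for n in range(N) if y[n] == k) /
              sum(gamma[n][j] for n in range(N)) for k in range(K)] for j in S]  # (3.44)
    return P_new, R_new, mu_new

P = [[0.6, 0.4], [0.3, 0.7]]
R = [[0.8, 0.2], [0.3, 0.7]]
mu = [0.6, 0.4]
y = [0, 1, 1, 0, 1]
P2, R2, mu2 = baum_welch_step(P, R, mu, y)
```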
node i to node j. The simplest example of the weight matrix is given by the
adjacency matrix: eij = 0 or 1, depending on whether i and j are connected.
Below we will focus on the situation when the network is undirected; i.e.,
W is symmetric.
Given a network, one can deﬁne naturally a discrete time Markov chain,
with the transition probability matrix P = (pij )i,j∈S given by
(3.45)  p_ij = e_ij / d_i,  d_i = Σ_{k∈S} e_ik.
Here d_i is the degree of the node i [Chu97, Lov96]. Let
(3.46)  π_i = d_i / Σ_{k∈S} d_k.
Then it is easy to see that
(3.47)  Σ_{i∈S} π_i p_ij = π_j;
i.e., π is an invariant distribution of this Markov chain. Furthermore, one
has the detailed balance relation:
(3.48) πi pij = πj pji .
For u = (u_i)_{i∈S}, v = (v_i)_{i∈S} defined on S, introduce the inner product
  (u, v)_π = Σ_{i∈S} u_i v_i π_i.
With this inner product, P is self-adjoint in the sense that
  (P u, v)_π = (u, P v)_π.
Denote by {λ_k}_{k=0,...,I−1} the eigenvalues of P, and by {ϕ_k}_{k=0}^{I−1} and {ψ_k}_{k=0}^{I−1} the corresponding right and left eigenvectors. We have (ϕ_j, ϕ_k)_π = δ_jk, ψ_{k,i} = ϕ_{k,i} π_i, and
(3.49)  P ϕ_k = λ_k ϕ_k,  ψ_k^T P = λ_k ψ_k^T,  k = 0, 1, . . . , I − 1.
Moreover, one can easily see that 1 is an eigenvalue and all other eigenvalues lie in the interval [−1, 1]. We will order them according to 1 = λ_0 ≥ |λ_1| ≥ · · · ≥ |λ_{I−1}|. Note that ψ_0 = π and ϕ_0 is a constant vector. The spectral decomposition of P^t is then given by
(3.50)  (P^t)_ij = Σ_{k=0}^{I−1} λ_k^t ϕ_{k,i} ψ_{k,j}.
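These facts are easy to verify numerically. The sketch below (the 4-node graph is an arbitrary illustrative example) builds P from an adjacency matrix, checks the detailed balance relation (3.48), and reconstructs P^t from the spectral decomposition (3.50) via the symmetrization D_π^{1/2} P D_π^{−1/2}.

```python
import numpy as np

# Adjacency matrix of a small undirected graph (illustrative example)
E = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = E.sum(axis=1)                 # degrees d_i
P = E / d[:, None]                # p_ij = e_ij / d_i, cf. (3.45)
pi = d / d.sum()                  # invariant distribution, cf. (3.46)

# detailed balance (3.48): pi_i p_ij = pi_j p_ji
flow = pi[:, None] * P
assert np.allclose(flow, flow.T)

# P is self-adjoint in ( , )_pi, so S = D^{1/2} P D^{-1/2} is symmetric
Dh = np.diag(np.sqrt(pi))
S = Dh @ P @ np.linalg.inv(Dh)
lam, U = np.linalg.eigh(S)        # real eigenvalues, all in [-1, 1]

# spectral decomposition (3.50): reconstruct P^t from eigenpairs
t = 5
Pt = np.linalg.inv(Dh) @ U @ np.diag(lam ** t) @ U.T @ Dh
assert np.allclose(Pt, np.linalg.matrix_power(P, t))
```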
Equation (3.53) says that the probability of jumping from any state in Ck
is the same and the walker enters Cl according to the invariant distribution.
This is consistent with the idea of coarsening the original dynamics onto the
new state space C = {C1 , . . . , CK } and ignoring the details of the dynamics
within the sets Ck . Note that P̃ is a stochastic matrix on S with invariant
distribution π if P̂ is a stochastic matrix on C with invariant distribution
π̂.
One can ask further about the optimal partition, and this can be approached
by minimizing the approximation error with respect to all possible partitions.
Algebraically, the existence of community structure or clustering property of
Exercises
3.1. Write down the transition probability matrix P of Ehrenfest's diffusion model and analyze its invariant distribution.
3.2. Suppose {Xn }n∈N is a Markov chain on a ﬁnite state space S. Prove
that
  P(A ∩ {X_{m+1} = i_{m+1}, . . . , X_{m+n} = i_{m+n}} | X_m = i_m)
    = P(A | X_m = i_m) P_{i_m}({X_1 = i_{m+1}, . . . , X_n = i_{m+n}}),
where the set A is an arbitrary union of elementary events like
{(X0 , . . . , Xm−1 ) = (i0 , . . . , im−1 )}.
3.3. Prove Lemma 3.7.
3.4. Show that the exponent s in P^s in Theorem 3.11 necessarily satisfies s ≤ I − 1, where I = |S| is the number of states of the chain.
3.5. Show that the recurrence relation μ_n = μ_{n−1} P can be written as
  μ_{n,i} = μ_{n−1,i} (1 − Σ_{j∈S, j≠i} p_ij) + Σ_{j∈S, j≠i} μ_{n−1,j} p_ji.
The first term on the right-hand side gives the total probability of not making a transition from state i, while the last term is the probability of a transition from one of the states j ≠ i to state i.
3.6. Prove (3.8) by mathematical induction.
3.7. Derive (3.8) through the moment generating function
  g(t, z) = Σ_{m=0}^∞ p_m(t) z^m.
3.8. Let f be a function defined on the state space S, and let
  h_i(t) = E_i f(X_t),  i ∈ S.
Prove that dh/dt = Qh for h = (h_1(t), . . . , h_I(t))^T.
3.9. Let {Nt1 } and {Nt2 } be two independent Poisson processes with
parameters λ1 and λ2 , respectively. Prove that Nt = Nt1 + Nt2 is a
Poisson process with parameter λ1 + λ2 .
3.10. Let {Nt } be a Poisson process with rate λ and let τ ∼ E(μ) be an
independent random variable. Prove that
P(Nτ = m) = (1 − p)m p, m ∈ N,
where p = μ/(λ + μ).
3.11. Let {N_t} be a Poisson process with rate λ. Show that the distribution of the arrival time τ_k between k successive jumps is a gamma distribution with probability density
  p(t) = λ^k t^{k−1} e^{−λt} / Γ(k),  t > 0,
where Γ(k) = (k − 1)! is the gamma function.
3.12. Consider the Poisson process {N_t} with rate λ. Given that N_t = n, prove that the n arrival times J_1, J_2, . . . , J_n have the same distribution as the order statistics corresponding to n independent random variables uniformly distributed in (0, t). That is, (J_1, . . . , J_n) has the probability density
  p(x_1, x_2, . . . , x_n | N_t = n) = n!/t^n,  0 < x_1 < · · · < x_n < t.
3.13. The compound Poisson process {X_t} (t ∈ R_+) is defined by
  X_t = Σ_{k=1}^{N_t} Y_k,
where {N_t} is a Poisson process with rate λ and {Y_k} are i.i.d. random variables on R with probability density function q(y). Derive the equation for the probability density function p(x, t) of X_t.
3.14. Suppose that {N_t^{(k)}} (k = 1, 2, . . .) are independent Poisson processes with rates λ_k, respectively, and λ = Σ_{k=1}^∞ λ_k < ∞. Prove that {X_t = Σ_{k=1}^∞ k N_t^{(k)}} is a compound Poisson process.
3.15. Prove (3.36) and (3.40).
3.16. Consider an irreducible Markov chain {X_n}_{n∈N} on a finite state space S. Let H ⊂ S. Define the first passage time T_H = inf{n | X_n ∈ H, n ∈ N} and
  h_i = P_i(T_H < ∞),  i ∈ S.
Prove that h = (h_i)_{i∈S} satisfies the equation
  (I − P) · h = 0 with boundary condition h_i = 1, i ∈ H,
where P is the transition probability matrix.
3.17. Connection between resistor networks and Markov chains.
(a) Consider a random walk {X_n}_{n∈N} on a one-dimensional lattice S with states {1, 2, . . . , I}. Assume that the transition probability p_ij > 0 for |i − j| = 1 and Σ_{j: |j−i|=1} p_ij = 1, where i, j ∈ S. Let x_L = 1, x_R = I. Define the first passage times
  T_L = inf{n | X_n = x_L, n ∈ N},  T_R = inf{n | X_n = x_R, n ∈ N}
Notes
The finite state Markov chain is one of the most commonly used stochastic models in practice. More refined notions, such as recurrence and transience, can be found in [Dur10, Nor97]. Ergodicity for Markov chains with countable or continuous states is discussed in [MT93]. Our discussion of the hidden Markov model follows [Rab89]. The Baum-Welch algorithm is essentially the EM (expectation-maximization) algorithm applied to the HMM [DLR77, Wel03]. However, it was proposed long before the EM algorithm.
More theoretical studies about the random walk on graphs can be found in [Chu97, Lov96]. One very interesting application is the PageRank algorithm for ranking webpages [BP98].
Chapter 4
U. Then as β = T^{−1} → ∞,
(4.2)  (1/Z_β) e^{−βU(x)} −→ δ(x − x*),
where x* is the global minimum of U. For this reason, sampling can be viewed as a "finite temperature" version of optimization. This analogy between sampling and optimization can be used in at least two ways. The first is that ideas used in optimization algorithms, such as the Gauss-Seidel and the multigrid methods, can be adopted in the sampling setting. The second is a way of performing global optimization, by sampling the probability distribution on the left-hand side of (4.2) in the "low temperature" limit when β → ∞. This idea is the essence of one of the best-known global minimization techniques, the simulated annealing method.
(4.4)  I_N(f) = (1/N) Σ_{i=1}^N f(X_i).
(4.5)  E e_N² = (1/N²) Σ_{i,j=1}^N E(f(X_i) − I(f))(f(X_j) − I(f)) = (1/N) Var(f)
by using independence, where Var(f) = ∫_0^1 (f(x) − I(f))² dx. Therefore, we conclude that
(4.6)  |I_N(f) − I(f)| ≈ σ_f/√N,
where σ_f = (Var(f))^{1/2} is the standard deviation. We can compare this with the error bound for the trapezoidal rule [HNW08] with N points:
(4.7)  e_N ≈ (1/12) h² max_{x∈[0,1]} |f''(x)|.
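The √N scaling in (4.6) can be seen in a small experiment; a sketch using f(x) = x² on [0, 1] (an illustrative choice, for which I(f) = 1/3 and Var(f) = 4/45):

```python
import math
import random

random.seed(0)
f = lambda x: x * x                 # I(f) = 1/3 on [0, 1]
I_exact = 1.0 / 3.0
sigma_f = math.sqrt(4.0 / 45.0)     # Var(f) = E f^2 - (E f)^2 = 1/5 - 1/9

def I_N(N):
    """The Monte Carlo estimator (4.4) with N uniform samples."""
    return sum(f(random.random()) for _ in range(N)) / N

errors = {N: abs(I_N(N) - I_exact) for N in (100, 10000)}
for N, err in errors.items():
    print(N, err, sigma_f / math.sqrt(N))   # error is of order sigma_f/sqrt(N)
```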
One important fact about the LCG is that it shows very poor behavior for the s-tuples, i.e., the vectors (X_n, X_{n+1}, . . . , X_{n+s−1}). In [Mar68], Marsaglia proved the following:
Theorem 4.2. Let {Y_n} be a sequence generated by (4.8) and let X_n = Y_n/m. Then the s-tuples (X_n, X_{n+1}, . . . , X_{n+s−1}) lie on a maximum of (s!m)^{1/s} equidistant parallel hyperplanes within the s-dimensional hypercube [0, 1]^s.
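The lattice structure can be observed directly with a toy generator (the parameters m = 256, a = 5, c = 1 below are illustrative toy values with full period, not a recommended generator). Since Y_{n+1} = (aY_n + c) mod m, every pair satisfies X_{n+1} = aX_n + c/m − k_n with an integer k_n, so all pairs (X_n, X_{n+1}) lie on at most a = 5 parallel lines of slope a, consistent with the bound (2!m)^{1/2} ≈ 22.6.

```python
# Toy LCG: Y_{n+1} = (a Y_n + c) mod m, with illustrative parameters.
m, a, c = 256, 5, 1
Y = [0]
for _ in range(m - 1):
    Y.append((a * Y[-1] + c) % m)

# For X_n = Y_n/m: X_{n+1} = a X_n + c/m - k_n, where
# k_n = (a Y_n + c - Y_{n+1}) / m is an integer indexing the line.
ks = {(a * Y[n] + c - Y[n + 1]) // m for n in range(m - 1)}
print(sorted(ks))     # only a handful of distinct parallel lines
```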
The deviation from the uniform distribution over [0, 1]^s is apparent when s is large. In spite of this, the LCG is still one of the most widely used pseudorandom number generators. There are also nonlinear generators which aim to overcome this problem [RC04]. Some very recent mathematical software
Figure 4.1. Left panel: The PDF p(y) of Y. Right panel: The distribution function F(y).
The Box-Muller Method. The basic idea is to use the strategy presented above in two dimensions instead of one. Consider a two-dimensional Gaussian random vector with independent components. Using polar coordinates x_1 = r cos θ, x_2 = r sin θ, we have
  (1/2π) e^{−(x_1²+x_2²)/2} dx_1 dx_2 = (dθ/2π) · (e^{−r²/2} r dr).
This means that θ and r are also independent, θ is uniformly distributed over the interval [0, 2π], and r²/2 is exponentially distributed. Hence if X, Y ∼ U[0, 1] i.i.d. and we let R = √(−2 log X), Θ = 2πY, then
  Z_1 = R cos Θ,  Z_2 = R sin Θ
are independent and have the desired distribution N(0, 1).
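A sketch of the transform; the sample-moment check at the end is only a sanity test.

```python
import math
import random

def box_muller(x, y):
    """Map two independent U(0,1] samples to two independent N(0,1) samples."""
    r = math.sqrt(-2.0 * math.log(x))      # R = sqrt(-2 log X)
    theta = 2.0 * math.pi * y              # Theta = 2 pi Y
    return r * math.cos(theta), r * math.sin(theta)

random.seed(1)
zs = []
for _ in range(50000):
    # use 1 - U so the argument of log lies in (0, 1], avoiding log(0)
    z1, z2 = box_muller(1.0 - random.random(), random.random())
    zs.extend((z1, z2))
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs) - mean ** 2
print(mean, var)     # should be close to 0 and 1
```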
[Figure: rejection sampling — proposals are drawn uniformly in the box [a, b] × [0, d]; points below the graph of f(x) are accepted, points above are rejected.]
clearly limited by the computational cost we are willing to afford. But the variance can be manipulated in order to reduce the size of the error.
  ∫_{−10}^{10} e^{−x²/2} dx ≈ (20/N) Σ_{i=1}^N e^{−X_i²/2},
where {X_i}_{i=1}^N are i.i.d. random variables that are uniformly distributed on [−10, 10]. However, notice that the integrand e^{−x²/2} is an extremely
[Figure: an importance sampling density p(x) compared with the integrand f(x) on [0, 1].]
If the {X_i}'s are sampled according to a different distribution, say with density function p(x), then since
  ∫ f(x) dx = ∫ (f(x)/p(x)) p(x) dx = E_p (f/p)(X) = lim_{N→∞} (1/N) Σ_{i=1}^N f(X_i)/p(X_i),
we should approximate ∫ f(x) dx by
  I_N^p(f) = (1/N) Σ_{i=1}^N f(X_i)/p(X_i).
The error can be estimated in the same way as before, and we get
  E(I(f) − I_N^p(f))² = (1/N) Var(f/p) = (1/N) ( ∫ f²(x)/p(x) dx − I²(f) ).
From this, we see that an ideal importance function is
  p(x) = Z^{−1} f(x)
if f is nonnegative, where Z is the normalization factor Z = ∫ f(x) dx. In this case
  I(f) = I_N^p(f).
This is not a miracle since all the necessary work has gone into computing
Z, which was our original task.
Nevertheless, this suggests how the sample distribution should be constructed. For the example discussed earlier, we can pick p(x) that behaves like e^{−x²/2} and at the same time can be sampled with a reasonable cost. This is the basic idea behind importance sampling. One should note that importance sampling bears a lot of similarity with preconditioning in linear algebra.
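For the integral ∫_{−10}^{10} e^{−x²/2} dx, sampling from the standard normal density (which behaves like the integrand) makes the weight f(X)/p(X) = √(2π)·1_{|X|≤10} essentially constant; a sketch comparing it with the naive uniform estimator:

```python
import math
import random

random.seed(2)
N = 20000
exact = math.sqrt(2.0 * math.pi)   # erf(10/sqrt(2)) = 1 to machine precision

# naive Monte Carlo: X_i ~ U[-10, 10]
naive = 20.0 / N * sum(math.exp(-random.uniform(-10.0, 10.0) ** 2 / 2.0)
                       for _ in range(N))

# importance sampling with p(x) the N(0,1) density:
# the weight f(X)/p(X) = sqrt(2 pi) * 1_{|X| <= 10} is almost surely constant
imp = sum(math.sqrt(2.0 * math.pi) * (abs(random.gauss(0.0, 1.0)) <= 10.0)
          for _ in range(N)) / N
print(abs(naive - exact), abs(imp - exact))
```

The naive estimator wastes almost all of its samples in the region where the integrand is negligible, while the importance sampling estimator has essentially zero variance here.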
A slight variant of these ideas is as follows [Liu04]. To compute
  I = ∫ f(x) π(x) dx,
Notice that
  (1 + r)^{−1} ≈ h(x) = { 1, x ≤ 0;  0, x > 0 }.
We have
  I(f) = (1/√(2π)) ∫_{−∞}^{+∞} [ (1 + r)^{−1} − h(x) ] e^{−x²/2} dx + 1/2.
Here h(x) is the control variate.
  Î = (1/N) Σ_{i=1}^N f(X_i).
Suppose that x can be decomposed into two parts (x^{(1)}, x^{(2)}) and the conditional expectation E(f(X) | x^{(2)}) can be approximated with good accuracy and low cost. We can define another unbiased estimator of I by
  Ĩ = (1/N) Σ_{i=1}^N E(f(X) | X_i^{(2)}).
If the computational effort for obtaining the two estimates is similar, then Ĩ should be preferred. This is because [Dur10]
(4.10)  Var(f(X)) = Var(E(f(X) | X^{(2)})) + E(Var(f(X) | X^{(2)})),
which implies
where Q = (Q_ij) is a transition probability matrix with Q_ij being the probability of generating state j from state i and A_ij is the probability of accepting state j. We will call Q the proposal matrix and A the decision matrix. We will require Q to be symmetric for the moment:
  Q_ij = Q_ji.
Then the detailed balance condition is reduced to
  A_ij / A_ji = e^{−βΔH_ij}.
• Glauber dynamics.
(4.14)  A_ij = (1 + e^{βΔH_ij})^{−1}.
Both choices satisfy the detailed balance condition. Usually the Metropolis strategy results in faster convergence since its off-diagonal elements are larger than those of the Glauber dynamics [Liu04]. Note that in the Metropolis strategy, we accept the new configuration if its energy is lower than that of the current state. But if the energy of the new configuration is higher, we still accept it with some positive probability.
The choice of Q depends on the speciﬁc problem. To illustrate this, we
consider the onedimensional Ising model as an example.
  x = (x_1, x_2, . . . , x_M),
[Figure: a one-dimensional spin system with sites 1, 2, . . . , M.]
(4.17)  ⟨f⟩ ≈ f̄_N := (1/N) Σ_{j=1}^N f(X_j),
where {Xj } is the time series generated according to the Markov chain P .
The ergodic Theorem 3.12 ensures the convergence of the empirical average
(4.17) if the chain P satisﬁes the irreducibility and primitivity conditions.
Coming back to the choice of Q, let N_t = 2^M, the total number of microscopic configurations. The following are two simple options:
Proposal 1. Equiprobability proposal.
  Q(x → y) = { 1/(N_t − 1), x ≠ y;  0, x = y }.
We should remark that the symmetry condition for the proposal matrix
Q can also be relaxed, and this leads to the MetropolisHastings algorithm:
(4.19)  A_ij = min{ 1, (Q_ji π_j)/(Q_ij π_i) }.
There is an obvious balance between the proposal step and the decision step. One would like the proposal step to be more aggressive since this will allow us to sample larger parts of the configuration space. However, this may also result in more proposals being rejected.
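That (4.19) preserves detailed balance can be checked directly: build the full Metropolis-Hastings transition matrix for a small state space with an asymmetric proposal (the numbers below are arbitrary illustrative choices) and verify π_i T_ij = π_j T_ji.

```python
pi = [0.2, 0.3, 0.5]               # target distribution (illustrative)
Q = [[0.5, 0.3, 0.2],              # asymmetric proposal matrix, rows sum to 1
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
n = len(pi)

# Metropolis-Hastings decision matrix (4.19)
A = [[min(1.0, Q[j][i] * pi[j] / (Q[i][j] * pi[i])) for j in range(n)]
     for i in range(n)]

# full transition matrix: move i -> j with probability Q_ij A_ij;
# rejected proposals leave the chain at state i
T = [[Q[i][j] * A[i][j] for j in range(n)] for i in range(n)]
for i in range(n):
    T[i][i] += 1.0 - sum(T[i])
```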
Regarding the issue of convergence, we refer the readers to [Liu04,
Win03].
where Z = ∫ L(θ | X) μ(θ) dθ is simply a normalization constant that only depends on the data X. One can sample the posterior distribution of θ according to (4.21) or extract the posterior mean by
  θ̄ = EΘ = ∫ θ π(θ | X) dθ
using the Metropolis algorithm. The Markov chain Monte Carlo method is perfect here for computing θ̄ since it does not need the actual value of Z.
Table 4.1. Classiﬁcation of sites according to the spins at the site and
neighboring sites.
For each given state x, let nj be the number of sites in the jth class,
and define
  Q_i = Σ_{j=1}^i n_j P_j,  i = 1, . . . , 6,
of U(x). This issue becomes quite severe when we have multiple peaks in π(x), which is very common in real applications (see Figure 4.5).
To improve the mixing rate of the MCMC algorithm in the low temperature regime, Marinari and Parisi [MP92] and Geyer and Thompson [GT95] proposed the simulated tempering technique, by varying the temperature during the simulation. The idea of simulated tempering is to utilize the virtue of the fast exploration of the overall energy landscape at high temperature and the fine resolution of local details at low temperature, thus reducing the chances that the simulation gets trapped at local metastable states.
Figure 4.5. The Gibbs distribution at low temperature (left panel) and
high temperature (right panel).
not feasible in real applications and one only obtains some guidance through
some pilot studies [GT95].
To perform the MCMC in the augmented space, here we present a mixture-type conditional sampling instead of a deterministic alternating approach.
Algorithm 9 (Simulated tempering). Set mixture weight α_0 ∈ (0, 1).
Step 1. Given (x_n, i_n) = (x, i), draw u ∼ U[0, 1].
Step 2. If u < α_0, let i_{n+1} = i and sample x_{n+1} with the MCMC for the Gibbs distribution π_st(x | i) from the current state x_n.
Step 3. If u ≥ α_0, let x_{n+1} = x and sample i_{n+1} with the MCMC for the distribution π_st(i | x) from the current state i_n.
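A sketch of Algorithm 9 on a toy energy landscape with two wells and two temperatures. All numerical choices here, including the uniform weight on the temperature index, are illustrative assumptions, not from the text.

```python
import math
import random

random.seed(3)
H = [0.0, 2.0, 0.5, 2.0, 0.0]      # toy energy with two wells, at states 0 and 4
betas = [0.3, 3.0]                  # high- and low-temperature inverse temperatures
Z = [sum(math.exp(-b * h) for h in H) for b in betas]
alpha0 = 0.5                        # mixture weight

def pi_st(x, i):
    """Augmented target on (x, i), with uniform weight 1/2 on each temperature."""
    return math.exp(-betas[i] * H[x]) / (2.0 * Z[i])

x, i = 0, 0
visits = [0] * len(H)
for _ in range(20000):
    if random.random() < alpha0:
        # Step 2: Metropolis move in x at the current temperature i
        y = random.randrange(len(H))
        if random.random() < min(1.0, pi_st(y, i) / pi_st(x, i)):
            x = y
    else:
        # Step 3: resample the temperature index from pi_st(. | x)
        w0, w1 = pi_st(x, 0), pi_st(x, 1)
        i = 0 if random.random() < w0 / (w0 + w1) else 1
    visits[x] += 1
print(visits)
```

Excursions at the high temperature let the chain cross the barrier, so both wells are visited frequently, which is exactly the point of tempering.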
If x ∈ M, we have
  π_β(x) = 1 / ( |M| + Σ_{z∉M} e^{−β(H(z)−m)} );
hence π_β(x) monotonically increases with β.
If x ∉ M, we have
  ∂π_β(x)/∂β = (1/Z̃_β²) e^{−β(H(x)−m)} [ (m − H(x)) Z̃_β − Σ_{z∈X} e^{−β(H(z)−m)} (m − H(z)) ],
where Z̃_β := Σ_{z∈X} exp(−β(H(z) − m)). Note that
  lim_{β→+∞} [ (m − H(x)) Z̃_β − Σ_{z∈X} e^{−β(H(z)−m)} (m − H(z)) ] = |M|(m − H(x)) < 0.
Hence π_β(x) is monotonically decreasing in β.
Exercises
4.1. For different choices of N, apply the simple Monte Carlo method and importance sampling to approximate the integral
  I(f) = ∫_{−10}^{10} e^{−x²/2} dx.
  I_{2N}(f) = (1/(2N)) Σ_{i=1}^N ( f(X_i) + f(1 − X_i) ),  X_i ∼ i.i.d. U[0, 1].
  I_N = (1/N) Σ_{k=1}^M Σ_{i=1}^{N_k} f(X_i^{(k)}).
Prove that the variance is reduced with this uniformization; i.e.,
  Var(I_N) ≤ Var(I) = ∫_S (f(x) − I)² dx.
4.7. Prove that both the Metropolis strategy (4.13) and Glauber dynamics (4.14) satisfy the detailed balance condition.
4.8. Prove that when the proposal matrix Q is not symmetric, the detailed balance condition can be preserved by choosing the Metropolis-Hastings decision matrix (4.19).
4.9. Prove that the Markov chain set up by the Gibbs sampling (4.18) and Metropolis strategy (4.13) is primitive for the one-dimensional Ising model.
(b) Define the bond variable u = (u_ij) on each edge and the extended Gibbs distribution
  π(x, u) ∝ Π_{⟨i,j⟩} χ_{{0 ≤ u_ij ≤ exp{βJ(1+x_i x_j)}}}(x, u).
Notes
Buffon's needle problem is considered the most primitive form of the Monte Carlo methods [Ram69]. The modern Monte Carlo method begins with the pioneering work of von Neumann, Ulam, Fermi, Metropolis, et al. in the 1950s [And86, Eck87].
A more detailed discussion on variance reduction can be found in [Gla04]. In practical applications, the choice of the proposal is essential for the efficiency of the Markov chain Monte Carlo method. In this regard, we mention Gibbs sampling, which is easy to implement, but its efficiency for high-dimensional problems is questionable. More global moves like the Swendsen-Wang algorithm have been proposed [Tan96]. Other
Stochastic Processes
Before going further, we discuss briefly the general features of the two most important classes of stochastic processes, the Markov process and the Gaussian process.
There are two ways to think about stochastic processes:
Notice that for this process the number of all possible outcomes is uncountable. We cannot just define the probability of an event by summing up the probabilities of every atom in the event, as for the case of discrete random variables. In fact, if we define Ω = {H, T}^ℕ, the probability of an atom ω = (ω_1, ω_2, . . . , ω_n, . . .) ∈ {H, T}^ℕ is 0; i.e.,
  P(X_1(ω) = k_1, X_2(ω) = k_2, . . . , X_n(ω) = k_n, . . .) = lim_{n→∞} (1/2)^n = 0,  k_j ∈ {0, 1}, j = 1, 2, . . . ,
and events such as {X_n(ω) = 1} involve uncountably many elements.
To set up a probability space (Ω, F, P) for this process, it is natural to take Ω = {H, T}^ℕ and the σ-algebra F as the smallest σ-algebra containing all events of the form
(5.1)  C = { ω | ω ∈ Ω, ω_{i_k} = a_k ∈ {H, T}, k = 1, 2, . . . , m }
for any 1 ≤ i1 < i2 < · · · < im and m ∈ N. These are sets whose ﬁnite
time projections are speciﬁed. They are called cylinder sets, which form an
algebra in F . We denote them generically by C. The probability of an event
of the form (5.1) is given by
  P(C) = 1/2^m.
It is straightforward to check that for any cylinder set F ∈ C, the probability
P(F ) coincides with the one deﬁned in Example 5.1 for the independent coin
tossing process. From the extension theorem of measures, this probability
measure P is uniquely deﬁned on F since it is welldeﬁned on C [Bil79].
Another natural way is to take Ω = {0, 1}^ℕ, F being the smallest σ-algebra containing all cylinder sets in Ω, and P defined in the same way as above. With this choice we have
  X_n(ω) = ω_n,  ω ∈ Ω = {0, 1}^ℕ,
which is called a coordinate process in the sense that Xn (ω) is just the nth
coordinate of ω.
In general, a stochastic process is a parameterized random variable {X_t}_{t∈T} defined on a probability space (Ω, F, P) taking values in R; the parameter set T can be ℕ, [0, +∞), or some finite interval. For any fixed t ∈ T, we have a random variable
  X_t : Ω → R,  ω ↦ X_t(ω).
On the other hand, for any fixed ω ∈ Ω, we have a function of t:
  X_·(ω) : T → R,  t ↦ X_t(ω).
One simple example of a stopping time for the coin tossing process is
  T = inf{ n : there exist three consecutive 0's in {X_k}_{k≤n} }.
It is easy to show that the condition {T ≤ n} ∈ Fn is equivalent to
{T = n} ∈ Fn for discrete time processes.
  = lim_{t→0+} (1/t) [ f(n)(e^{−λt} − 1) + f(n+1) λt e^{−λt} + Σ_{k=n+2}^∞ f(k) ((λt)^{k−n}/(k−n)!) e^{−λt} ]
  = λ(f(n + 1) − f(n)).
It is straightforward to see that
  A* f(n) = λ(f(n − 1) − f(n)).
For u(t, n) = E_n f(X_t), we have
(5.7)  du/dt = Σ_{k=n}^∞ f(k) (d/dt)[ ((λt)^{k−n}/(k−n)!) e^{−λt} ] = λ(u(t, n + 1) − u(t, n)) = Au(t, n).
The facts that the equation for the expectation of a given function has the form of the backward Kolmogorov equations (5.5) and (5.7) and the distribution satisfies the form of the forward Kolmogorov equations (5.6) and (5.8) are general properties of Markov processes. This will be seen again when we study the diffusion processes in Chapter 8.
These discussions can be easily generalized to the multidimensional setting when X_t ∈ R^d.
Below we will show that a Gaussian process is determined once its mean
and covariance function
(5.9) m(t) = EXt , K(s, t) = E(Xs − m(s))(Xt − m(t))
are speciﬁed. Given these two functions, a probability space (Ω, F , P) can
be constructed and μt1 ,...,tk is exactly the ﬁnitedimensional distribution of
(Xt1 , . . . , Xtk ) under P.
For the above to hold, there must exist numbers m and σ² such that
  m = lim_k m_k,  σ² = lim_k σ_k²
and correspondingly E e^{iξX} = e^{iξm − σ²ξ²/2}. This gives the desired result.
in the L∞ ([0, T ] × [0, T ]) sense, where {λk , φk (t)}k is the eigensystem of the
operator K.
Theorem 5.13 (Karhunen-Loève expansion). Let (X_t)_{t∈[0,1]} be a Gaussian process with mean 0 and covariance function K(s, t). Assume that K is continuous. Let {λ_k}, {φ_k} be the sequence of eigenvalues and eigenfunctions for the eigenvalue problem
  ∫_0^1 K(s, t) φ_k(t) dt = λ_k φ_k(s),  k = 1, 2, . . . ,
with the normalization condition ∫_0^1 φ_k φ_j dt = δ_kj. Then X_t has the representation
(5.11)  X_t = Σ_{k=1}^∞ α_k √λ_k φ_k(t),
in the sense that the series
(5.12)  X_t^N = Σ_{k=1}^N α_k √λ_k φ_k(t) → X_t in L_t^∞ L_ω²;
i.e., we have
  lim_{N→∞} sup_{t∈[0,1]} E|X_t^N − X_t|² = 0.
Proof. We need to show that the random series is well-defined and that it is a Gaussian process with the desired mean and covariance function.
First consider the operator K : L2 ([0, 1]) → L2 ([0, 1]) deﬁned as in (5.10).
Thanks to Theorem 5.10, we know that K is nonnegative and compact.
Therefore it has countably many real eigenvalues, for which 0 is the only
possible accumulation point [Lax02]. From Mercer's theorem, we have
  Σ_{k=1}^N λ_k φ_k(s) φ_k(t) → K(s, t),  s, t ∈ [0, 1], N → ∞,
We denote the unique limit by X_t. For any t_1, t_2, . . . , t_m ∈ [0, 1], the mean square convergence of the Gaussian random vector (X_{t_1}^N, X_{t_2}^N, . . . , X_{t_m}^N) to (X_{t_1}, X_{t_2}, . . . , X_{t_m}) implies the convergence in probability. From Theorem 5.11, we have that the limit X_t is indeed a Gaussian process. Since X^N converges to X in L_t^∞ L_ω², we have
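For the Wiener process on [0, 1] (covariance K(s, t) = s ∧ t, treated in Chapter 6), the eigensystem is known in closed form: λ_k = ((k − 1/2)π)^{−2} and φ_k(t) = √2 sin((k − 1/2)πt). The Mercer convergence used above can then be checked numerically; a sketch:

```python
import math

def lam(k):
    """Eigenvalues of K(s,t) = min(s,t) on [0,1]."""
    return 1.0 / ((k - 0.5) * math.pi) ** 2

def phi(k, t):
    """Orthonormal eigenfunctions of K(s,t) = min(s,t)."""
    return math.sqrt(2.0) * math.sin((k - 0.5) * math.pi * t)

def K_trunc(s, t, N):
    """Partial Mercer sum: sum_{k<=N} lambda_k phi_k(s) phi_k(t) -> s ∧ t."""
    return sum(lam(k) * phi(k, s) * phi(k, t) for k in range(1, N + 1))

for (s, t) in [(0.3, 0.7), (0.5, 0.5), (0.9, 0.2)]:
    print(s, t, abs(K_trunc(s, t, 2000) - min(s, t)))
```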
Exercises
5.1. Prove that in Definition 5.4 the condition {T ≤ n} ∈ F_n is equivalent to {T = n} ∈ F_n for discrete time stochastic processes.
5.2. Prove Proposition 5.5.
5.3. Let T be a stopping time for the filtration {F_n}_{n∈ℕ}. Define the σ-algebra generated by T as
  F_T = { A : A ∈ F, A ∩ {T ≤ n} ∈ F_n for any n ∈ ℕ }.
Prove that F_T is indeed a σ-algebra and T is F_T measurable. Generalize this result to the continuous time case.
5.4. Let S and T be two stopping times. Prove that:
(i) If S ≤ T , then FS ⊂ FT .
(ii) FS∧T = FS ∩ FT , and the events
{S < T }, {T < S}, {S ≤ T }, {T ≤ S}, {T = S}
all belong to FS ∩ FT .
5.5. Consider the compound Poisson process Xt deﬁned in Exercise 3.13.
Find the generator of {Xt }.
5.6. Find the generator of the following Gaussian processes.
(i) Brownian bridge with the mean and covariance function
  m(t) = 0,  K(s, t) = s ∧ t − st,  s, t ∈ [0, 1].
(ii) Ornstein-Uhlenbeck process with the mean and covariance function
  m(t) = 0,  K(s, t) = σ² exp(−|t − s|/η),  η > 0, s, t ≥ 0.
5.7. Let p(t) and q(t) be positive functions on [a, b] and assume that p/q is strictly increasing. Prove that the covariance function defined by
  K(s, t) = { p(s)q(t), s ≤ t;  p(t)q(s), t < s }
is a strictly positive definite function in the sense that the matrix
  K = ( K(t_i, t_j) )_{i,j=1}^n
Notes
We did not cover martingales in the main text. Instead, we give a very
brief introduction in Section D of the appendix. The interested readers may
consult [Chu01, KS91, Mey66] for more details.
Our presentation of Markov processes in this chapter is very brief. But the three most important classes of Markov processes are discussed throughout this book: Markov chains, diffusion processes, and jump processes used to describe chemical kinetics. One of the most important tools for treating Markov processes is the master equation (also referred to as the Kolmogorov equation or Fokker-Planck equation), which is either a differential or a difference equation.
Wiener Process
However as we will see, one can understand the Wiener process in a number
of diﬀerent ways:
where x = ml and the factor 1/2 in the last equation appears since m can take only even or odd values depending on whether N is even or odd. Combining the results above we obtain
  W(x, t)Δx = (2πt l²/τ)^{−1/2} exp( −x²/(2t l²/τ) ) Δx.
(we say that the particle is on the positive side in the interval [n − 1, n] if
at least one of the values Xn−1 and Xn is positive). One can prove that
(6.4) P2k,2n = u2k · u2n−2k ,
where u2k = P(X2k = 0). Now let γ(2n) be the number of time units that
the particle spends on the positive axis in the interval [0, 2n]. Then when
x < 1,
1 γ(2n)
P < ≤x = P2k,2n .
2 2n
k,1/2<2k/2n≤x
From the asymptotics (6.2), we have
1 1
u2k ∼ √ , P2k,2n ∼
πk π k(n − k)
as k, n − k → ∞. Therefore
1 k k − 12
P2k,2n − · 1− → 0 as n → ∞,
πn n n
{k,1/2<2k/2n≤x} k,1/2<2k/2n≤x
and thus
x
1 dt
P2k,2n → , as n → ∞.
{k,1/2<2k/2n≤x}
π 1
2
t(1 − t)
From the symmetry of the random walk, we obtain the following theorem.
Theorem 6.3 (Arcsine law). The probability that the fraction of the time spent by the particle on the positive side is at most x tends to (2/π) arcsin √x; that is,
(6.5)  Σ_{k: k/n ≤ x} P_{2k,2n} → (2/π) arcsin √x  as n → ∞.
We will not prove Theorem 6.4. Rather we will take for granted that the limiting process W exists, and we shall study its properties. It is instructive to see that the spatial-temporal rescaling in (6.7) is exactly the same scaling considered in the previous section.
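The diffusive scaling can be checked in a few lines: for the rescaled walk W_t^N = S_{⌊Nt⌋}/√N built from a symmetric ±1 walk S_n, we have E(W_t^N)² = ⌊Nt⌋/N ≈ t. (The sample sizes below are illustrative.)

```python
import random

random.seed(4)

def rescaled_walk(N, t):
    """W_t^N = S_{[Nt]} / sqrt(N) for a simple symmetric random walk S_n."""
    steps = sum(1 if random.random() < 0.5 else -1 for _ in range(int(N * t)))
    return steps / N ** 0.5

M, N, t = 20000, 400, 0.5
second_moment = sum(rescaled_walk(N, t) ** 2 for _ in range(M)) / M
print(second_moment)    # approximately t = 0.5
```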
  E W_t^N W_s^N = Σ_{n,m=0}^N Σ_{k∈I_n, l∈I_m} E(α_k^{(n)} α_l^{(m)}) ∫_0^t H_k^{(n)}(τ) dτ ∫_0^s H_l^{(m)}(τ) dτ
    = Σ_{n=0}^N Σ_{k∈I_n} ∫_0^t H_k^{(n)}(τ) dτ ∫_0^s H_k^{(n)}(τ) dτ
    = Σ_{n=0}^N Σ_{k∈I_n} ∫_0^1 H_k^{(n)}(τ) χ_{[0,t]}(τ) dτ ∫_0^1 H_k^{(n)}(τ) χ_{[0,s]}(τ) dτ
(6.14)  → ∫_0^1 χ_{[0,t]}(τ) χ_{[0,s]}(τ) dτ = t ∧ s,
where χ_{[0,t]}(·) is the indicator function of the set [0, t]. In the last step, we used Parseval's identity since {H_k^{(n)}} is an orthonormal basis.
Proof. First we show that almost surely, W_t^N converges uniformly to some continuous function. For this purpose, we note that for the Gaussian random variable ξ ∼ N(0, 1),
  P(|ξ| > x) = √(2/π) ∫_x^∞ e^{−y²/2} dy ≤ √(2/π) ∫_x^∞ (y/x) e^{−y²/2} dy = √(2/π) e^{−x²/2}/x,  x > 0.
Define a_n = max_{k∈I_n} |α_k^{(n)}|. Then
  P(a_n > n) = P( ∪_{k∈I_n} { |α_k^{(n)}| > n } ) ≤ 2^{n−1} √(2/π) e^{−n²/2}/n,  n ≥ 1.
Since Σ_{n=1}^∞ 2^{n−1} √(2/π) e^{−n²/2}/n < ∞, using the Borel-Cantelli lemma, we conclude that there exists a set Ω̃ with P(Ω̃) = 1 such that for any ω ∈ Ω̃ there
is an N(ω) such that a_m(ω) ≤ m for any m ≥ N(ω). For any such ω, we have
  Σ_{m=N(ω)}^∞ | Σ_{k∈I_m} α_k^{(m)} ∫_0^t H_k^{(m)}(s) ds | ≤ Σ_{m=N(ω)}^∞ m Σ_{k∈I_m} | ∫_0^t H_k^{(m)}(s) ds | ≤ Σ_{m=N(ω)}^∞ m 2^{−(m+1)/2} < ∞,
Then
  E exp( −(1/2) ∫_0^1 W_t² dt ) = E exp( −(1/2) Σ_k λ_k α_k² ) = Π_k E exp( −(1/2) λ_k α_k² ),
where
  M^{−2} = Π_{k=1}^∞ ( 1 + 4/((2k − 1)²π²) ).
Using the identity
  cosh(x) = Π_{k=1}^∞ ( 1 + 4x²/((2k − 1)²π²) ),
we get
  M = (cosh(1))^{−1/2} = √( 2e/(1 + e²) ).
In other words,
  p_n(w_1, w_2, . . . , w_n) = (1/Z_n) exp(−I_n(w)),
where
  I_n(w) = (1/2) Σ_{j=1}^n ( (w_j − w_{j−1})/(t_j − t_{j−1}) )² (t_j − t_{j−1}),  t_0 := 0, w_0 := 0,
  Z_n = (2π)^{n/2} [ t_1 (t_2 − t_1) · · · (t_n − t_{n−1}) ]^{1/2}.
From this we see that the Wiener process is a stationary Markov process with transition kernel p(x, t | y, s) given by
  P(W_t ∈ B | W_s = y) = ∫_B (2π(t − s))^{−1/2} e^{−(x−y)²/(2(t−s))} dx = ∫_B p(x, t | y, s) dx
Proposition 6.8. Consider the Wiener process. For any t and subdivision Δ of [0, t], we have
(6.18)  E(Q_t^Δ − t)² = 2 Σ_k (t_{k+1} − t_k)².
Consequently, we have
  Q_t^Δ −→ t in L_ω²  as |Δ| → 0.
  Q_{[p,q]}^{Δ_n} → q − p  on Ω_0.
Therefore, we have
  q − p ← Σ_k (W_{t_{k+1}} − W_{t_k})² ≤ sup_k |W_{t_{k+1}} − W_{t_k}| · V(W(ω); [p, q]),  ω ∈ Ω_0.
From the uniform continuity of W on [p, q], sup_k |W_{t_{k+1}} − W_{t_k}| → 0. Hence, as a function of t, the total variation of W_t(ω) is infinite, for ω ∈ Ω_0.
Theorem 6.10 (Smoothness of the Wiener path). Let Ω_α be the set of functions that are Hölder continuous with exponent α (0 < α < 1):
  Ω_α = { f ∈ C[0, 1] : sup_{0≤s,t≤1} |f(t) − f(s)|/|t − s|^α < ∞ }.
The proof of Theorem 6.10 relies on the modiﬁcation concept and the
following Kolmogorov continuity theorem, which can be found in [RY05].
Deﬁnition 6.11 (Modiﬁcation). Two processes X and X̃ deﬁned on the
same probability space Ω are said to be modiﬁcations of each other if for
each t,
Xt = X̃t a.s.
They are called indistinguishable if for almost all ω ∈ Ω
Xt (ω) = X̃t (ω) for every t.
for every α ∈ [0, ε/γ). In particular, the paths of X̃ are Hölder continuous
with exponent α.
which is a contradiction.
The critical case of α = 1/2 is covered by Lévy’s modulus of continuity
theorem. The reader may refer to [RY05] for details.
[Figure: the reflection principle — paths from 0 to m that hit the level m_1 are reflected into paths ending at 2m_1 − m.]
  W_r |_{t=0} = δ(x),  ∂W_r/∂x |_{x=x_1} = 0.
Case 2: Absorbing wall at m = m1 .
The absorbing wall at $m = m_1$ means that once the particle hits the state $m_1$ at some time $n$, it is stuck there forever. We still assume $m_1 > 0$ and obtain the probability $W_a(m, N; m_1)$ that the particle arrives at a site $m < m_1$ after $N$ steps via the reflection principle:
$$W_a(m, N; m_1) = W(m, N) - W(2m_1 - m, N).$$
With the same rescaling and asymptotics, we have
$$W_a(m, N; m_1) \approx \sqrt{\frac{2}{\pi N}}\left[\exp\Big(-\frac{m^2}{2N}\Big) - \exp\Big(-\frac{(2m_1 - m)^2}{2N}\Big)\right],$$
$$W_a\big|_{t=0} = \delta(x), \qquad W_a\big|_{x=x_1} = 0.$$
n! dx
6.7. Wiener Chaos Expansion 133
in L2ω L2t , where {ξk } are i.i.d. Gaussian random variables with distribution
N (0, 1) and {mk (t)} is any orthonormal basis in L2 [0, T ]. This establishes
another correspondence between the Wiener paths ω ∈ Ω and an inﬁnite
sequence of i.i.d. normal random variables ξ = (ξ1 , ξ2 , . . .).
Theorem 6.14 (Wiener chaos expansion). Let W be a standard Wiener
process on (Ω, F , P). Let F be a Wiener functional in L2 (Ω). Then F can
be represented as
$$(6.21)\qquad F[W] = \sum_{\alpha\in\mathcal{I}} f_\alpha H_\alpha(\xi).$$
This is an infinite-dimensional analog of the Fourier expansion with weight function $(\sqrt{2\pi})^{-1}e^{-x^2/2}$ in each dimension. Its proof can be found in [CM47].
To help appreciate this, let us consider the onedimensional case F (ξ)
where ξ ∼ N (0, 1). Then the Wiener chaos expansion reduces to
$$(6.22)\qquad F(\xi) = \sum_{k=0}^{\infty} f_k H_k(\xi) \quad\text{in } L_\omega^2,$$
where $f_k = \mathbb{E}\big(F(\xi)H_k(\xi)\big)$. This is simply the probabilistic form of the Fourier-Hermite expansion of $F$ with the weight function $w(x) = (\sqrt{2\pi})^{-1}e^{-x^2/2}$. For the finite-dimensional case $\xi = (\xi_1, \xi_2, \ldots, \xi_K)$, we can define
$$\mathcal{I}_K = \big\{\alpha : \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K),\ \alpha_k \in \{0, 1, 2, \ldots\},\ k = 1, 2, \ldots, K\big\}$$
and the ﬁnite tensor product of Hermite polynomials
$$H_\alpha(\xi) = \prod_{k=1}^{K} H_{\alpha_k}(\xi_k).$$
Then we have
$$(6.23)\qquad F(\xi) = \sum_{\alpha\in\mathcal{I}_K} f_\alpha H_\alpha(\xi).$$
The Wiener chaos expansion can also be extended to the so-called generalized polynomial chaos (gPC) expansion when the random variables involved are non-Gaussian [XK02]. It is commonly used in uncertainty quantification (UQ) problems [GHO17].
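In the one-dimensional case (6.22) the coefficients $f_k = \mathbb{E}\,F(\xi)H_k(\xi)$ can be computed by Gauss-Hermite quadrature. A small sketch (not from the text), using probabilists' Hermite polynomials $He_k$, which satisfy $\mathbb{E}[He_j(\xi)He_k(\xi)] = k!\,\delta_{jk}$, so the normalized coefficient is $f_k = \mathbb{E}[F(\xi)He_k(\xi)]/k!$:

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

# Quadrature nodes/weights for the weight exp(-x^2/2); renormalize the weights
# so that they integrate against the standard normal density.
nodes, weights = hermegauss(40)
weights = weights / np.sqrt(2.0 * np.pi)

def chaos_coeff(F, k):
    """k-th Fourier-Hermite coefficient f_k = E[F(xi) He_k(xi)] / k!."""
    e_k = np.zeros(k + 1)
    e_k[k] = 1.0                      # coefficient vector selecting He_k
    return float(np.sum(weights * F(nodes) * hermeval(nodes, e_k))) / math.factorial(k)

# Test functional: x^3 = He_3(x) + 3 He_1(x), so f_1 = 3 and f_3 = 1.
coeffs = [chaos_coeff(lambda x: x**3, k) for k in range(5)]
```

The expansion is finite here, so the 40-point rule recovers the coefficients exactly up to round-off.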
Exercises
6.1. Let $\{\xi_n\}_{n=1}^{\infty}$ be a sequence of i.i.d. random variables taking values $+1$ with probability $2/3$ and $-1$ with probability $1/3$. Consider the (asymmetric) random walk on $\mathbb{Z}$ defined by
$$S_n = \sum_{j=1}^{n} \xi_j.$$
$$U = \frac{1}{2\pi}\int_{\mathbb{R}} \psi(\omega) \int_{\mathbb{R}} e^{-i\omega t} f(t)\,dt\,d\omega.$$
Compute the mean detecting signal EU and compare this with the
case without random perturbation.
6.13. Prove that the Borel $\sigma$-algebra on $(C[0,1], \|\cdot\|_\infty)$ is the same as the product $\sigma$-algebra on $C[0,1]$, i.e., the smallest $\sigma$-algebra generated by the cylinder sets
$$C = \{\omega : (\omega(t_1), \omega(t_2), \ldots, \omega(t_k)) \in B\}, \qquad B \in \mathcal{B}(\mathbb{R}^k),\ k \in \mathbb{N}.$$
Notes
Theoretical study of Brownian motion began with Einstein's seminal paper in 1905 [Ein05]. The mathematical framework was first established by Wiener [Wie23]. More material on Brownian motion can be found in [KS91, MP10].
Chapter 7
Stochastic Differential Equations
We now turn to the most important object in stochastic analysis: stochastic differential equations (SDEs). Outside of mathematics, SDEs are often written in the form
$$(7.1)\qquad \frac{dX_t}{dt} = b(X_t, t) + \sigma(X_t, t)\,\dot{W}_t.$$
The ﬁrst term on the righthand side is called the drift term. The second
term is called the diﬀusion term. Ẇt is formally the derivative of the Wiener
process, known as the white noise. One can think of it as a Gaussian process
with mean and covariance functions
$$m(t) = \mathbb{E}(\dot{W}_t) = \frac{d}{dt}\mathbb{E}(W_t) = 0,$$
$$K(s,t) = \mathbb{E}(\dot{W}_s\dot{W}_t) = \frac{\partial^2}{\partial s\,\partial t}\mathbb{E}(W_sW_t) = \frac{\partial^2}{\partial s\,\partial t}(s\wedge t) = \delta(t-s).$$
The name white noise comes from its power spectral density $S(\omega)$, defined as the Fourier transform of the autocorrelation function $R(t) = \mathbb{E}(\dot{W}_0\dot{W}_t) = \delta(t)$: $S(\omega) = \hat{\delta}(\omega) = 1$. We call it "white" from the analogy with the spectrum of white light. Noises whose spectrum is not a constant are called colored noises.
From the regularity theory of the Wiener paths discussed in Section 6.5,
Ẇt is not an ordinary function since Wt is not diﬀerentiable. It has to be
understood in the sense of distribution [GS66]. Consequently, deﬁning the
product in the last term in (7.1) can be a tricky task.
To make this precise, the only obvious diﬃculty is in the last term, which is
a stochastic integral. This reduces the problem of the SDEs to that of the
stochastic integrals.
It turns out that one cannot use the classical theory of the Lebesgue-Stieltjes integral directly to define stochastic integrals, since almost every Brownian path has unbounded variation on any interval $[t_1, t_2]$. Nevertheless, there
are several natural ways of deﬁning stochastic integrals. If we think of
approximating the integral by quadrature, then depending on how we pick
the quadrature points, we get diﬀerent limits. This may seem worrisome at
ﬁrst sight. If the stochastic integrals have diﬀerent meanings, which meaning
are we supposed to use in practice? Luckily so far this has not been an issue.
It is often quite clear from the context of the problem which version of the
stochastic integral one should use. In addition, it is also easy to convert the
result of one version to that for another version.
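The dependence on the quadrature points is concrete: for $\int_0^1 W_s\,dW_s$, the left-endpoint sum and the averaged-endpoint sum converge to limits that differ by exactly $1/2$. A quick numerical sketch (not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
dW = rng.normal(0.0, np.sqrt(1.0 / n), size=n)   # increments on [0, 1]
W = np.concatenate(([0.0], np.cumsum(dW)))        # sampled Wiener path

# Left endpoint (Ito) vs. average of endpoints (Stratonovich-type) quadrature.
ito = float(np.sum(W[:-1] * dW))
strat = float(np.sum(0.5 * (W[:-1] + W[1:]) * dW))

# The two limits are W_1^2/2 - 1/2 (Ito) and W_1^2/2 (ordinary calculus);
# the gap is half the quadratic variation of the path on [0, 1].
gap = strat - ito
```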
By far the most popular way of deﬁning stochastic integrals is that of
Itô.
where Δ is a subdivision of [0, t], Xj = Xt∗j , and t∗j is chosen arbitrarily from
the interval [tj , tj+1 ].
For the ordinary Riemann-Stieltjes integral $\int_a^b f(t)\,dg(t)$, $f$ and $g$ are assumed to be continuous and $g$ is assumed to have bounded variation. In that case the sum
$$(7.4)\qquad \sum_j f(t_j^*)\big(g(t_{j+1}) - g(t_j)\big)$$
where $e_j(\omega)$ is $\mathcal{F}_{t_j}$-measurable and $\chi_{[t_j,t_{j+1})}(t)$ is the indicator function of $[t_j, t_{j+1})$. For technical reasons, we always assume that $\{\mathcal{F}_t\}_{t\ge0}$ is the augmented filtration of $\{\mathcal{F}_t^W\}_{t\ge0}$; thus $W$ is $\mathcal{F}_t$-adapted. Obviously in this case the Itô integral has to be defined as
$$(7.6)\qquad \int_0^T f(\omega,t)\,dW_t = \sum_j e_j(\omega)(W_{t_{j+1}} - W_{t_j}).$$
Lemma 7.1. Assume that $0 \le S \le T$. The stochastic integral for the simple functions satisfies
$$(7.7)\qquad \mathbb{E}\int_S^T f(\omega,t)\,dW_t = 0,$$
$$(7.8)\qquad \text{(Itô isometry)}\quad \mathbb{E}\Big(\int_S^T f(\omega,t)\,dW_t\Big)^2 = \mathbb{E}\int_S^T f^2(\omega,t)\,dt.$$
where we have used the independence between δWk and ej ek δWj for j <
k.
From (7.9), the sequence {φn } is a Cauchy sequence in L2ω L2t . This implies
T
that { S φn dWt } is also a Cauchy sequence in L2ω . Therefore it has a unique
limit. We can also show that this limit is independent of the choice of the
approximating sequence {φn }. This is left as an exercise.
7.1. Itô Integral 143
$$\mathbb{E}\Big|\int_S^T f(\omega,t)\,dW_t - \int_S^T \phi_n(\omega,t)\,dW_t\Big| \le \left[\mathbb{E}\Big(\int_S^T f(\omega,t)\,dW_t - \int_S^T \phi_n(\omega,t)\,dW_t\Big)^2\right]^{\frac12} \to 0.$$
$X_n \to X$ and thus $\|X_n\|_2 \to \|X\|_2$, where $\|\cdot\|_2$ is the corresponding norm in $\mathcal{H}$.
Using this, we have
$$\mathbb{E}\Big(\int_S^T \phi_n(\omega,t)\,dW_t\Big)^2 \to \mathbb{E}\Big(\int_S^T f(\omega,t)\,dW_t\Big)^2$$
and
$$\mathbb{E}\int_S^T \phi_n^2(\omega,t)\,dt \to \mathbb{E}\int_S^T f^2(\omega,t)\,dt.$$
From the Itô isometry for simple functions, we obtain (7.13) immediately.
Proof. Take the dyadic subdivision with mesh size $2^{-n}$ on $[0,T]$. From the definition of the Itô integral,
$$\int_0^t W_s\,dW_s \approx \sum_j W_{t_j}(W_{t_{j+1}} - W_{t_j}) = \sum_j \frac{2W_{t_j}W_{t_{j+1}} - 2W_{t_j}^2}{2}$$
$$= \sum_j \frac{W_{t_{j+1}}^2 - W_{t_j}^2}{2} - \sum_j \frac{W_{t_{j+1}}^2 - 2W_{t_{j+1}}W_{t_j} + W_{t_j}^2}{2}$$
$$= \frac{W_t^2}{2} - \frac{1}{2}\sum_j (W_{t_{j+1}} - W_{t_j})^2 \longrightarrow \frac{W_t^2}{2} - \frac{t}{2},$$
in probability as the size of the subdivision goes to zero. Here t∗j ∈ [tj , tj+1 ]
but is otherwise arbitrary.
$$\Big|\sum_j \big(f(t_j^*) - f(t_j)\big)\,\delta W_{t_j}^2\Big| \le \sup_j \big|f(t_j^*) - f(t_j)\big| \cdot \sum_j \delta W_{t_j}^2.$$
The ﬁrst term on the righthand side goes to zero almost surely because
of the uniform continuity of f on [0, T ]. The second term converges to the
quadratic variation of Wt in probability.
Remark 7.8. Equation (7.17) can be derived formally using Taylor expansion and the calculation rules:
$$(7.18)\qquad dt^2 = 0,\quad dt\,dW_t = dW_t\,dt = 0,\quad (dW_t)^2 = dt.$$
First, we have
$$(7.19)\qquad dY_t = f'(X_t)\,dX_t + \frac{1}{2}f''(X_t)(dX_t)^2$$
by Taylor expanding $Y_t = f(X_t)$ to the second order. Using (7.18), we obtain
$$(dX_t)^2 = (b\,dt + \sigma\,dW_t)^2 = b^2dt^2 + 2b\sigma\,dt\,dW_t + \sigma^2(dW_t)^2 = \sigma^2\,dt.$$
Substituting this into (7.19) yields (7.17).
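The rule $(dW_t)^2 = dt$ has visible statistical consequences. For instance, Itô's formula shows that $\exp(W_t - t/2)$, not $\exp(W_t)$, is a martingale with constant mean one. A Monte Carlo sketch (not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
t = 1.0
W_t = rng.normal(0.0, np.sqrt(t), size=1_000_000)

# Ordinary calculus would suggest Y_t = exp(W_t) solves dY = Y dW; the Ito
# correction (dW)^2 = dt gives d(exp(W_t - t/2)) = exp(W_t - t/2) dW_t instead,
# so only the corrected process keeps mean 1.
naive_mean = float(np.mean(np.exp(W_t)))             # ~ exp(t/2), not 1
corrected_mean = float(np.mean(np.exp(W_t - 0.5 * t)))
```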
are bounded and continuous. If $b$ and $\sigma$ are simple functions, the second-order term reduces to sums of the form
$$\frac{1}{2}\sum_j f''(X_{t_j})\,\delta X_{t_j}^2.$$
Note that
$$\Big|\sum_j f''(X_{t_j})\,b^2(t_j)\,\delta t_j^2\Big| \le K\sum_j \delta t_j^2 \le KT\sup_j \delta t_j \to 0.$$
From Proposition 7.6, we get the limit of the remaining term.
The readers are referred to [IW81, KS91] for the full proof. The above
result can be generalized to the multidimensional case.
Theorem 7.9 (Multidimensional Itô formula). Let X t be an Itô process
deﬁned by dX t = b(ω, t)dt + σ(ω, t)dW t , where X t ∈ Rn , σ ∈ Rn×m ,
W ∈ Rm . Deﬁne Yt = f (X t ), where f is a twice diﬀerentiable function.
Then
$$(7.20)\qquad dY_t = \Big(b\cdot\nabla f + \frac{1}{2}\sigma\sigma^T : \nabla^2 f\Big)dt + \nabla f\cdot\sigma\,dW_t.$$
Remark 7.10. Equation (7.20) can be derived formally using the calcula
tion rules
$$(7.21)\qquad dt^2 = 0,\quad dt\,dW_t^i = dW_t^i\,dt = dW_t^i\,dW_t^j = 0\ (i\ne j),\quad (dW_t^i)^2 = dt.$$
Using Taylor expansion, we have
$$(7.22)\qquad dY_t = \nabla f(X_t)\cdot dX_t + \frac{1}{2}(dX_t)^T\cdot\nabla^2 f(X_t)\cdot(dX_t).$$
With (7.21), we get
$$(dX_t)^T\cdot\nabla^2 f(X_t)\cdot(dX_t) = \sum_{l,k,i,j} dW_t^l\,\sigma_{il}\,\partial_{ij}^2 f\,\sigma_{jk}\,dW_t^k$$
$$(7.23)\qquad = \sum_{k,i,j}\sigma_{ik}\sigma_{jk}\,\partial_{ij}^2 f\,dt = \sigma\sigma^T : \nabla^2 f\,dt,$$
where $A : B = \sum_{ij} a_{ij}b_{ji}$ is the contraction for second-order tensors. Equation (7.20) is obtained by substituting (7.23) into (7.22).
Example 7.11. An example of integration by parts is
$$(7.24)\qquad \int_0^t s\,dW_s = tW_t - \int_0^t W_s\,ds.$$
Finally let us mention a very useful inequality about the Itô integral. Define the martingale (see Section D of the appendix for the definition) [KS91]
$$M_t = \int_0^t \sigma(\omega,s)\,dW_s, \qquad t\in[0,T].$$
Then Doob's maximal inequality gives
$$\mathbb{E}\Big(\sup_{t\in[0,T]} |M_t|^2\Big) \le 4\,\mathbb{E}\int_0^T \sigma^2(\omega,s)\,ds.$$
Thus
$$\mathbb{P}\Big(\sup_{t\in[0,T]}\big|X_t^{(k+1)} - X_t^{(k)}\big| > 2^{-k}\Big)$$
$$\le \mathbb{P}\left(\Big(\int_0^T \big|b(X_t^{(k)},t) - b(X_t^{(k-1)},t)\big|\,dt\Big)^2 > 2^{-2k-2}\right)$$
$$+\ \mathbb{P}\Big(\sup_{t\in[0,T]}\Big|\int_0^t \big(\sigma(X_s^{(k)},s) - \sigma(X_s^{(k-1)},s)\big)\,dW_s\Big| > 2^{-k-1}\Big) := P_1 + P_2.$$
The series
$$X_t^{(n)} = X_t^{(0)} + \sum_{k=0}^{n-1}\big(X_t^{(k+1)} - X_t^{(k)}\big)$$
converges in the almost sure sense. Denote its limit by $X_t(\omega)$. Then $X_t$ is $t$-continuous and $\mathcal{F}_t$-adapted since each $X_t^{(k)}$ is $t$-continuous and $\mathcal{F}_t$-adapted. Using Fatou's lemma (see Theorem C.12) and the bound (7.32), we get
$$\mathbb{E}|X_t|^2 \le \tilde{L}(1 + \mathbb{E}|\xi|^2)\exp(\tilde{L}t).$$
Now let us show that $X_t$ satisfies the SDE. It is enough to take the limit $k\to\infty$ on both sides of (7.30). For the integral term with respect to $t$, we have
$$\Big|\int_0^t \big(b(X_s^{(k)},s) - b(X_s,s)\big)\,ds\Big| \le \int_0^t \big|b(X_s^{(k)},s) - b(X_s,s)\big|\,ds \le K\int_0^T \big|X_s^{(k)} - X_s\big|\,ds \to 0 \quad\text{as } k\to\infty \text{ a.s.}$$
Thus
$$(7.36)\qquad \int_0^t b(X_s^{(k)},s)\,ds \to \int_0^t b(X_s,s)\,ds \quad\text{a.s. for } t\in[0,T].$$
For the stochastic integral, we note that $X_t^{(k)}$ is a Cauchy sequence in $L_\omega^2$ for any fixed $t$. Based on the fact that $X_t^{(k)} \to X_t$ a.s., we obtain $X_t^{(k)} \to X_t$ in $L_\omega^2$. This gives
$$(7.37)\qquad \mathbb{E}\Big(\int_0^t \big(\sigma(X_s^{(k)},s) - \sigma(X_s,s)\big)\,dW_s\Big)^2 = \int_0^t \mathbb{E}\big(\sigma(X_s^{(k)},s) - \sigma(X_s,s)\big)^2\,ds \le K^2\int_0^t \mathbb{E}\big|X_s^{(k)} - X_s\big|^2\,ds \to 0 \quad\text{as } k\to\infty.$$
Combining (7.36), (7.37), and the fact that $X_t^{(k)} \to X_t$ almost surely, we conclude that $X_t$ is the solution of the SDE (7.27).
Simple SDEs.
Example 7.15 (Ornstein-Uhlenbeck process). The Ornstein-Uhlenbeck (OU) process is the solution of the following linear SDE with additive noise:
$$(7.38)\qquad dX_t = -\gamma X_t\,dt + \sigma\,dW_t, \qquad X_t\big|_{t=0} = X_0.$$
Here we assume $\gamma, \sigma > 0$.
If we use this to model the dynamics of a particle, then the drift term represents a restoring force that prevents the particle from getting too far away from the origin. We will see that the distribution of the particle position has a limit as time goes to infinity. This is very different behavior from that of Brownian motion.
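This limiting behavior can be checked by simulation; the sketch below (not from the text; parameter values are arbitrary) discretizes (7.38) with the Euler-Maruyama scheme discussed later in this chapter and compares the long-time sample variance with $\sigma^2/(2\gamma)$:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, sigma, dt = 1.0, 1.0, 0.01

# Euler-Maruyama for dX = -gamma X dt + sigma dW over [0, 20],
# for an ensemble of paths started away from the origin.
X = np.full(20_000, 2.0)
for _ in range(2_000):
    X += -gamma * X * dt + sigma * np.sqrt(dt) * rng.normal(size=X.size)

# The law of X_t approaches the invariant Gaussian N(0, sigma^2 / (2 gamma)).
stat_var = sigma**2 / (2.0 * gamma)   # = 0.5
sample_mean = float(np.mean(X))
sample_var = float(np.var(X))
```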
Solution. Dividing both sides by $N_t$, we have $dN_t/N_t = r\,dt + \alpha\,dW_t$. Applying Itô's formula to $\log N_t$, we get
$$d(\log N_t) = \frac{1}{N_t}dN_t - \frac{1}{2N_t^2}(dN_t)^2 = \frac{1}{N_t}dN_t - \frac{1}{2N_t^2}\alpha^2N_t^2\,dt = \frac{1}{N_t}dN_t - \frac{\alpha^2}{2}dt.$$
Therefore, we have
$$d(\log N_t) = \Big(r - \frac{\alpha^2}{2}\Big)dt + \alpha\,dW_t.$$
Integrating from $0$ to $t$, we get
$$N_t = N_0\exp\Big(\Big(r - \frac{\alpha^2}{2}\Big)t + \alpha W_t\Big).$$
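Since the solution is an explicit function of $W_t$, its statistics can be sampled directly. Note that $\mathbb{E}N_t = N_0e^{rt}$: the $-\alpha^2t/2$ in the exponent exactly offsets the lognormal factor $\mathbb{E}e^{\alpha W_t} = e^{\alpha^2t/2}$. A sketch (not from the text; parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
r, alpha, N0, t = 0.05, 0.2, 100.0, 1.0

# Sample the closed-form solution N_t = N0 exp((r - alpha^2/2) t + alpha W_t).
W_t = rng.normal(0.0, np.sqrt(t), size=500_000)
N_t = N0 * np.exp((r - 0.5 * alpha**2) * t + alpha * W_t)

mean_mc = float(np.mean(N_t))
mean_exact = N0 * np.exp(r * t)   # E exp(alpha W_t) = exp(alpha^2 t / 2)
```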
Example 7.17 (Langevin equation). The simplest model for an inertial particle in a force field subject to friction and noise is given by
$$(7.41)\qquad dX_t = V_t\,dt, \qquad m\,dV_t = -\big(\gamma V_t + \nabla U(X_t)\big)dt + \sigma\,dW_t,$$
where $\gamma$ is the frictional coefficient, $U(X)$ is the potential for the force field, $W_t$ is the standard Wiener process, and $\sigma$ is the strength of the noise force. We will show $\sigma = \sqrt{2k_BT\gamma}$ in Chapter 11 to ensure the correct physics.
Diffusion Process. Diffusion processes are Markov processes with continuous paths. From what we have just established, Itô processes defined as solutions of stochastic differential equations are diffusion processes if the coefficients satisfy the conditions presented above. It is interesting to note that the converse is also true, namely that diffusion processes are nothing but solutions of stochastic differential equations.
We will not prove this last statement. Instead, we will see formally
how a diﬀusion process can give rise to a stochastic diﬀerential equation.
Let $p(x,t|y,s)$ ($t\ge s$) be the transition probability density of the diffusion process, and let
$$(7.42)\qquad \lim_{t\to s}\frac{1}{t-s}\int_{|x-y|<\epsilon}(x-y)\,p(x,t|y,s)\,dx = b(y,s) + O(\epsilon),$$
$$(7.43)\qquad \lim_{t\to s}\frac{1}{t-s}\int_{|x-y|<\epsilon}(x-y)(x-y)^T\,p(x,t|y,s)\,dx = A(y,s) + O(\epsilon).$$
Here $b(y,s)$ is called the drift of the diffusion process and $A(y,s)$ is called the diffusion matrix at time $s$ and position $y$. The conditions (7.42) and (7.43) can also be represented as
$$(7.44)\qquad \lim_{t\to s}\frac{1}{t-s}\mathbb{E}^{y,s}(X_t - y) = b(y,s),$$
$$(7.45)\qquad \lim_{t\to s}\frac{1}{t-s}\mathbb{E}^{y,s}(X_t - y)(X_t - y)^T = A(y,s).$$
We use the special notation $\circ$ for the Stratonovich integral. One can show that for a large class of integrands, this limit is well-defined. Based on this, one can also give a treatment of the stochastic differential equation
$$(7.46)\qquad dX_t = b(X_t,t)\,dt + \sigma(X_t,t)\circ dW_t$$
in the Stratonovich sense. However, Itô isometry is lost, and the integral is no longer a martingale.
7.4. Stratonovich Integral 155
Integrating from $t_n$ to $t_{n+1}$ and applying this with $f = b$ and $f = \sigma$, we have
$$X_{t_{n+1}} = X_{t_n} + \int_{t_n}^{t_{n+1}} b(X_s)\,ds + \int_{t_n}^{t_{n+1}} \sigma(X_s)\,dW_s$$
$$(7.54)\qquad = X_{t_n} + b(X_{t_n})\,\delta t_n + \sigma(X_{t_n})(W_{t_{n+1}} - W_{t_n})$$
$$(7.55)\qquad\quad + \int_{t_n}^{t_{n+1}} dW_s \int_{t_n}^{s} (L_2\sigma)(X_\tau)\,dW_\tau$$
$$(7.56)\qquad\quad + \int_{t_n}^{t_{n+1}} dW_s \int_{t_n}^{s} (L_1\sigma)(X_\tau)\,d\tau + \int_{t_n}^{t_{n+1}} ds \int_{t_n}^{s} (L_2b)(X_\tau)\,dW_\tau$$
$$(7.57)\qquad\quad + \int_{t_n}^{t_{n+1}} ds \int_{t_n}^{s} (L_1b)(X_\tau)\,d\tau,$$
where $\delta t_n = t_{n+1} - t_n$ is the step size. The above procedure can be iterated.
One just has to expand $L_ib(X_\tau)$, $L_i\sigma(X_\tau)$ around $L_ib(X_{t_n})$, $L_i\sigma(X_{t_n})$ in the same way. This gives us the so-called Itô-Taylor expansion for SDEs. It is easy to see that the terms in the Itô-Taylor expansion have the form
$$I_{\boldsymbol{i}}(g) = \int_{t_n}^{t_{n+1}} dW_{s_1}^{i_1} \int_{t_n}^{s_1} dW_{s_2}^{i_2} \cdots \int_{t_n}^{s_{k-1}} dW_{s_k}^{i_k}\, g(X_{s_k})$$
for some k ∈ {1, 2, . . .}. Here the index i = (i1 , i2 , . . . , ik ) and ij ∈ {0, 1}
for j = 1, 2, . . . , k. The integrand g is the action of some compositions of
operators L1 and L2 on the function b or σ. We use the convention that
Wt0 := t and Wt1 := Wt .
By truncating the ItôTaylor series at diﬀerent order, we obtain diﬀerent
schemes. For example, if we only keep terms in (7.54), we get
(1) Euler-Maruyama scheme.
$$X_{n+1} = X_n + b(X_n)\,\delta t_n + \sigma(X_n)\,\delta W_n,$$
where $\delta W_n \sim N(0, \delta t_n)$. Due to its simplicity, the Euler-Maruyama scheme is the most commonly used numerical scheme.
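A minimal implementation takes a few lines. The sketch below (not from the text) applies it to the geometric Brownian motion example above, for which the exact solution driven by the same increments is available for comparison:

```python
import numpy as np

def euler_maruyama(b, sigma, x0, T, n, rng):
    """One path of dX = b(X) dt + sigma(X) dW on [0, T] with n uniform steps."""
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), size=n)
    X = np.empty(n + 1)
    X[0] = x0
    for k in range(n):
        X[k + 1] = X[k] + b(X[k]) * dt + sigma(X[k]) * dW[k]
    return X, dW

rng = np.random.default_rng(5)
r, a = 0.5, 0.3
errs = []
for _ in range(200):
    X, dW = euler_maruyama(lambda x: r * x, lambda x: a * x, 1.0, 1.0, 512, rng)
    # Exact solution of dX = rX dt + aX dW on the SAME Brownian path.
    exact = np.exp((r - 0.5 * a**2) + a * dW.sum())
    errs.append(abs(X[-1] - exact))
mean_strong_err = float(np.mean(errs))
```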
From the basic intuition $dW_t \sim \sqrt{dt}$, we have (7.55) $\sim O(\delta t)$, (7.56) $\sim O(\delta t^{3/2})$, and (7.57) $\sim O(\delta t^2)$. By extracting the leading order term (7.55), we obtain
$$\int_{t_n}^{t_{n+1}} dW_s \int_{t_n}^{s} (L_2\sigma)(X_\tau)\,dW_\tau \approx (L_2\sigma)(X_{t_n})\int_{t_n}^{t_{n+1}} dW_s \int_{t_n}^{s} dW_\tau = \frac{1}{2}(L_2\sigma)(X_{t_n})\big[(\delta W_n)^2 - \delta t_n\big].$$
Substituting this into the Itô-Taylor expansion, we obtain the well-known Milstein scheme:
(2) Milstein scheme.
$$X_{n+1} = X_n + b(X_n)\,\delta t_n + \sigma(X_n)\,\delta W_n + \frac{1}{2}\sigma\sigma'(X_n)\big[(\delta W_n)^2 - \delta t_n\big].$$
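In the scalar case the correction costs one extra line. A sketch (not from the text) comparing the strong error of the two schemes on geometric Brownian motion, where $\sigma\sigma'(x) = a^2x$:

```python
import numpy as np

rng = np.random.default_rng(6)
r, a, T, x0, n = 0.5, 0.3, 1.0, 1.0, 64

def strong_error(milstein, n_paths=400):
    """Mean |X_n - X_T| vs. the exact GBM solution on a common Brownian path."""
    dt = T / n
    errs = np.empty(n_paths)
    for p in range(n_paths):
        dW = rng.normal(0.0, np.sqrt(dt), size=n)
        X = x0
        for k in range(n):
            incr = r * X * dt + a * X * dW[k]
            if milstein:  # correction (1/2) sigma sigma' [(dW)^2 - dt]
                incr += 0.5 * a**2 * X * (dW[k]**2 - dt)
            X += incr
        exact = x0 * np.exp((r - 0.5 * a**2) * T + a * dW.sum())
        errs[p] = abs(X - exact)
    return float(errs.mean())

err_em = strong_error(milstein=False)
err_mil = strong_error(milstein=True)
```

The Milstein error is markedly smaller at the same step size, reflecting its higher strong order.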
where $W_t^i$, $W_t^j$ are independent Wiener processes. Terms of this kind are difficult to sample [KP92].
Another problem with the Milstein scheme is that it can be hard to compute $\sigma'(X_t)$ even in the one-dimensional case. To overcome this difficulty, one can borrow ideas from the Runge-Kutta method for solving ODEs.
(3) Runge-Kutta scheme.
$$\hat{X}_n = X_n + b(X_n)\,\delta t_n + \sigma(X_n)\sqrt{\delta t_n},$$
$$X_{n+1} = X_n + b(X_n)\,\delta t_n + \sigma(X_n)\,\delta W_n$$
$$(7.60)\qquad\quad + \frac{1}{2\sqrt{\delta t_n}}\big[\sigma(\hat{X}_n) - \sigma(X_n)\big]\big[(\delta W_n)^2 - \delta t_n\big].$$
If we formally take a higher-order Itô-Taylor expansion, say applying (7.53) to
$$(L_1b)(X_\tau),\quad (L_2b)(X_\tau),\quad (L_1\sigma)(X_\tau),\quad (L_2\sigma)(X_\tau)$$
and dropping higher-order terms, we get the following higher-order scheme.
(4) Higher-order scheme.
$$X_{n+1} = X_n + b\,\delta t_n + \sigma\,\delta W_n + \frac{1}{2}\sigma\sigma'\big[(\delta W_n)^2 - \delta t_n\big]$$
$$+ \sigma b'\,\delta Z_n + \frac{1}{2}\Big(bb' + \frac{1}{2}\sigma^2b''\Big)\delta t_n^2 + \Big(b\sigma' + \frac{1}{2}\sigma^2\sigma''\Big)\big(\delta W_n\,\delta t_n - \delta Z_n\big)$$
$$(7.61)\qquad + \frac{1}{2}\sigma\big(\sigma\sigma'' + (\sigma')^2\big)\Big(\frac{1}{3}(\delta W_n)^2 - \delta t_n\Big)\delta W_n,$$
where
$$\delta Z_n := \int_{t_n}^{t_{n+1}}\int_{t_n}^{s} dW_\tau\,ds.$$
7.5. Numerical Schemes and Analysis 159
for any f ∈ Cb∞ (Rn ), the space of bounded, inﬁnitely diﬀerentiable functions
(with bounded derivatives), where Cf is a constant independent of δt but
may depend on f , then we say {Xtδt } weakly converges to Xt with order β.
Proposition 7.19. β ≥ α.
Strong Convergence.
Theorem 7.20 (Convergence order). Define the length of the multi-index $\boldsymbol{i} = (i_1, i_2, \ldots, i_k)$ as
$$l(\boldsymbol{i}) := k, \qquad n(\boldsymbol{i}) := \{\text{the number of zeros in } \boldsymbol{i}\},$$
and define the sets of indices
$$S_\alpha = \Big\{\boldsymbol{i} : l(\boldsymbol{i}) + n(\boldsymbol{i}) \le 2\alpha \text{ or } l(\boldsymbol{i}) = n(\boldsymbol{i}) = \alpha + \frac12\Big\} \quad\text{for } \alpha\in\Big\{\frac12, 1, \frac32, \ldots\Big\},$$
$$W_\beta = \{\boldsymbol{i} : l(\boldsymbol{i}) \le \beta\} \quad\text{for } \beta\in\{1, 2, 3, \ldots\}.$$
Then with mild smoothness conditions on $b$, $\sigma$, and the function $f$ in the weak approximation, the scheme derived by truncating the Itô-Taylor expansion up to all indices with $\boldsymbol{i} \in S_\alpha$ has strong order $\alpha$; the scheme derived by truncating the Itô-Taylor expansion up to terms with $\boldsymbol{i} \in W_\beta$ has weak order $\beta$.
Table 7.1. The convergence order of some numerical schemes for the SDE.
$+\ \frac{1}{2}f''(\bar{X}_t)(d\bar{X}_t)^2$; i.e.,
$$f(\bar{X}_t) = f(X_n) + \int_{t_n}^{t}\Big(f'(\bar{X}_s)\,b(X_n) + \frac{1}{2}f''(\bar{X}_s)\Big)ds + \int_{t_n}^{t} f'(\bar{X}_s)\,dW_s.$$
Since
$$X_t - X_{t_n} = \int_{t_n}^{t} b(X_s)\,ds + (W_t - W_{t_n}),$$
squaring both sides and taking the expectation, we get
$$\mathbb{E}|X_t - X_{t_n}|^2 \le 2\,\mathbb{E}\Big(\int_{t_n}^{t} b(X_s)\,ds\Big)^2 + 2\delta t.$$
Hence, we have
$$\mathbb{E}|X_t - X_{t_n}|^2 \le 2L\,\delta t\int_{t_n}^{t}\big(1 + \mathbb{E}|X_s|^2\big)\,ds + 2\delta t \le 2L\,\delta t^2\big(1 + K_1(T)\big) + 2\delta t, \qquad t\in[t_n, t_{n+1}).$$
$$(7.65)\qquad Y_{h,N} = \frac{1}{N}\sum_{k=1}^{N} f\big(X_n^{(k)}\big), \qquad n = T/h\in\mathbb{N},$$
$$\hat{Y}_l = \frac{1}{N_l}\sum_{k=1}^{N_l}\big(F_l^{(k)} - F_{l-1}^{(k)}\big), \qquad l = 0, 1, \ldots, L.$$
The key point of multilevel Monte Carlo is that with the decomposition (7.68), the term $F_l - F_{l-1}$ has smaller variance at higher levels, provided that the realizations of $F_l - F_{l-1}$ come from two discrete approximations with different time step sizes but the same Brownian paths. This property suggests that we can use fewer Monte Carlo samples at higher levels, i.e., finer grids, and more samples at lower levels, i.e., coarser grids. This cost-accuracy trade-off is the origin of the improved efficiency of the multilevel Monte Carlo method.
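The variance decay that drives the method can be seen in a two-level sketch (not from the text): simulate a fine and a coarse Euler-Maruyama approximation of geometric Brownian motion from the same Brownian increments and compare $\mathrm{Var}(F_l - F_{l-1})$ with $\mathrm{Var}(F_l)$.

```python
import numpy as np

rng = np.random.default_rng(7)
r, a, T, x0, M = 0.05, 0.2, 1.0, 1.0, 2
n_fine, n_paths = 64, 20_000

dt = T / n_fine
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_fine))

# Fine-level Euler-Maruyama paths.
Xf = np.full(n_paths, x0)
for k in range(n_fine):
    Xf += r * Xf * dt + a * Xf * dW[:, k]

# Coarse level: step M*dt, driven by the SAME Brownian path (summed increments).
dWc = dW.reshape(n_paths, n_fine // M, M).sum(axis=2)
Xc = np.full(n_paths, x0)
for k in range(n_fine // M):
    Xc += r * Xc * (M * dt) + a * Xc * dWc[:, k]

F = lambda x: x**2   # a smooth payoff, Lipschitz on bounded sets
var_level_diff = float(np.var(F(Xf) - F(Xc)))
var_single_level = float(np.var(F(Xf)))
```

The level-difference variance is orders of magnitude below the single-level variance, which is why few samples suffice on the fine grids.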
Now let us consider the minimization
$$\min \mathrm{Var}(\hat{Y}_L) = \sum_{l=0}^{L}\frac{V_l}{N_l} \qquad\text{subject to the cost } K = \sum_{l=0}^{L} N_l h_l^{-1}.$$
From the strong and weak convergence results for the Euler-Maruyama scheme, we have
$$|\mathbb{E}(F_l) - Y_E| = O(h_l), \qquad \mathbb{E}|X_T - X_{l,M^l}|^2 = O(h_l).$$
By assuming the Lipschitz continuity of $f$, we obtain
$$\mathrm{Var}\big(F_l - f(X_T)\big) \le \mathbb{E}\big|f(X_{l,M^l}) - f(X_T)\big|^2 \le C\,\mathbb{E}|X_T - X_{l,M^l}|^2 = O(h_l)$$
and thus
$$V_l = \mathrm{Var}(F_l - F_{l-1}) \le 2\,\mathrm{Var}\big(F_l - f(X_T)\big) + 2\,\mathrm{Var}\big(F_{l-1} - f(X_T)\big) = O(h_l)$$
since $h_{l-1} = Mh_l$ and $M\sim O(1)$. We remark that the terms $F_l(\omega)$ and $F_{l-1}(\omega)$ should be driven by the same Brownian path $W(\omega)$; otherwise $F_l - F_{l-1}$ would not be a small quantity. This point must be strictly respected in numerical computations.
For a given tolerance $\epsilon \ll 1$, take
$$(7.72)\qquad N_l = O(\epsilon^{-2}Lh_l);$$
according to the optimal choice (7.71), we get the variance estimate
$$(7.73)\qquad \mathrm{Var}(\hat{Y}_L) = O(\epsilon^2)$$
from (7.70). Further taking $L = \log\epsilon^{-1}/\log M$, we have
$$h_L = M^{-L} = O(\epsilon).$$
So the bias error is
$$(7.74)\qquad |\mathbb{E}F_L - Y_E| = O(h_L) = O(\epsilon).$$
Combining (7.73) and (7.74), we obtain the overall mean square error
$$\mathrm{MSE} = \mathbb{E}(Y_E - \hat{Y}_L)^2 = O(\epsilon^2).$$
Exercises
7.1. Prove that the deﬁnition given in (7.10) is independent of the choice
of the approximating sequence {φn }.
7.2. Prove Proposition 7.3.
7.3. Prove that with the midpoint approximation
$$\int_0^t W_s\,dW_s \approx \sum_j W_{t_{j+\frac12}}(W_{t_{j+1}} - W_{t_j}) \to \frac{W_t^2}{2}$$
7.11. Prove that if one takes the rightmost endpoint to define the stochastic integral (backward stochastic integral)
$$\int_0^T f(\omega,t)*dW_t \approx \sum_j f(t_{j+1})(W_{t_{j+1}} - W_{t_j}),$$
Notes
Since Itô’s groundbreaking work on the stochastic integral, the theory of
SDEs has been developed into a mature subject in stochastic analysis
[Oks98, IW81, KS91, RY05, RW94a, RW94b]. For people interested in
applications, we recommend the references [Gar09, Pav14, VK04].
The numerical solution of SDEs is still a developing field. Among the topics that we did not cover, we mention strong and weak approximation of multiple Itô integrals [KP92], implicit methods for overcoming stiffness, stability analysis, and explicit methods for stiff SDEs [AC08, AL08], high-order Runge-Kutta methods [Rö03], extrapolation methods [TT90], structure-preserving methods and applications in molecular dynamics [LM15], etc. Interested readers can refer to [KP92, MT04] for more details.
Chapter 8
Fokker-Planck Equation
There are three diﬀerent but related ways of studying diﬀusion processes.
The ﬁrst is to take a pathwise viewpoint and represent the paths by solutions
of SDEs. This was done in the last chapter. The second approach is to study
the dynamics of the probability distribution and expectation values using
partial diﬀerential equations. This is the approach that we will take in this
chapter. The third is to study the probability distribution on path space
induced by the diﬀusion process. This is the approach of path integrals,
which will be taken up in the next chapter.
Let us ﬁrst consider the situation of ordinary diﬀerential equations.
Imagine a point particle in Rd whose dynamics is described by
$$(8.1)\qquad \frac{dx}{dt} = b(x), \qquad x\big|_{t=0} = x_0.$$
Assuming that the initial distribution density of the particle is p0 (x), at
later times the distribution density of the particle p(x, t) then satisﬁes the
Liouville equation:
(8.2) ∂t p + ∇x · (bp) = 0.
Alternatively, one can also think of the solutions of the ODE (8.1) as being
the characteristics of this ﬁrstorder PDE (8.2).
We will see that if (8.1) is replaced by an SDE, then the first-order PDE above is replaced by a second-order parabolic equation, the so-called Fokker-Planck equation. Alternatively, one can also think of the solutions of the SDE as the characteristics of the corresponding Fokker-Planck equation.
where the diffusion matrix $A(x,t) = \sigma(x,t)\sigma^T(x,t) = (a_{ij}(x,t))$. Taking the expectation on both sides conditioned on $X_s = y$, we have
$$(8.4)\qquad \mathbb{E}^{y,s}f(X_t) - f(y) = \mathbb{E}^{y,s}\int_s^t (Lf)(X_\tau,\tau)\,d\tau,$$
where the operator $L$ is defined by
$$(8.5)\qquad (Lf)(x,t) = b(x,t)\cdot\nabla f(x) + \frac{1}{2}\sum_{i,j} a_{ij}(x,t)\,\partial_{ij}^2 f(x).$$
Here $(\cdot,\cdot)$ denotes the standard inner product in $L^2(\mathbb{R}^d)$. An easy calculation gives
$$(8.7)\qquad (L^*f)(x,t) = -\nabla_x\cdot\big(b(x,t)f(x)\big) + \frac{1}{2}\nabla_x^2 : \big(A(x,t)f(x)\big),$$
where $\nabla_x^2 : (Af) = \sum_{ij}\partial_{ij}^2(a_{ij}f)$.
and one can integrate both sides of the forward Kolmogorov equation against
ρ0 .
Example 8.1 (Brownian motion). The SDE reads
dX t = dW t , X 0 = 0.
Thus the Fokker-Planck equation is
$$(8.8)\qquad \partial_t p = \frac{1}{2}\Delta p, \qquad p(x,0) = \delta(x).$$
It is well known that
$$p(x,t) = \frac{1}{(2\pi t)^{d/2}}\exp\Big(-\frac{|x|^2}{2t}\Big),$$
which is the probability distribution density of $N(0, tI)$.
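A quick consistency check (not from the text): sample $W_t$ in one dimension by summing independent increments and compare its moments with those of the heat-kernel solution $N(0, t)$.

```python
import numpy as np

rng = np.random.default_rng(8)
t, n_paths, n_steps = 2.0, 200_000, 100

# W_t as a sum of independent Gaussian increments.
W_t = rng.normal(0.0, np.sqrt(t / n_steps), size=(n_paths, n_steps)).sum(axis=1)

# Moments of the heat-kernel solution N(0, t): E W_t^2 = t, E W_t^4 = 3 t^2.
m2 = float(np.mean(W_t**2))
m4 = float(np.mean(W_t**4))
```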
Example 8.2 (Brownian dynamics). Consider the overdamped dynamics of a stochastic particle described by
$$(8.9)\qquad dX_t = -\frac{1}{\gamma}\nabla U(X_t)\,dt + \sqrt{\frac{2k_BT}{\gamma}}\,dW_t.$$
The Fokker-Planck equation is
$$(8.10)\qquad \partial_t p - \frac{1}{\gamma}\nabla\cdot\big(\nabla U(x)\,p\big) = \frac{k_BT}{\gamma}\Delta p = D\Delta p,$$
where $D = k_BT/\gamma$ is the diffusion coefficient.
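The stationary solution of (8.10) is the Gibbs density $Z^{-1}\exp(-U(x)/k_BT)$. A sketch (not from the text; the quartic potential and the units $\gamma = k_BT = 1$ are arbitrary choices) integrates the dynamics to equilibrium and compares $\mathbb{E}X^2$ against direct quadrature of the Gibbs density:

```python
import numpy as np

rng = np.random.default_rng(9)
U_prime = lambda x: x**3          # U(x) = x^4 / 4, with gamma = k_B T = 1
dt, n_steps = 0.005, 4_000

# Euler-Maruyama for dX = -U'(X) dt + sqrt(2) dW (so D = 1), long run.
X = rng.normal(size=20_000)
for _ in range(n_steps):
    X += -U_prime(X) * dt + np.sqrt(2.0 * dt) * rng.normal(size=X.size)

# Reference second moment of Z^{-1} exp(-x^4/4) by direct quadrature.
xs = np.linspace(-6.0, 6.0, 4001)
w = np.exp(-xs**4 / 4.0)
ref_m2 = float(np.sum(xs**2 * w) / np.sum(w))
sim_m2 = float(np.mean(X**2))
```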
and with similar reasoning the probability transfer from region $R_2$ to $R_1$ has the form
$$P_{2\to1} = \int_{R_1} dx \int_{R_2} dy\, p(x, t+\delta t; y, t).$$
we obtain
$$J_{2\to1} = \int_{R_1} dx \int_{R_2} dy\,\partial_t p(x,t;y,s=t) - \int_{R_2} dx \int_{R_1} dy\,\partial_t p(x,t;y,s=t)$$
$$= \int_{R_2} dx\,\nabla_x\cdot j(x,t;R_1,t) - \int_{R_1} dx\,\nabla_x\cdot j(x,t;R_2,t)$$
$$= \int_{S_{12}} dS\, n\cdot\big(j(x,t;R_1,t) + j(x,t;R_2,t)\big),$$
where $j(x,t;R_1,t) := \int_{R_1} dy\, j(x,t;y,t)$ and $n$ is the normal pointing from $R_2$ to $R_1$. The last equality is obtained by the divergence theorem and the fact that $j(x,t;R_2,t) = 0$ when $x\in S_1$ and $j(x,t;R_1,t) = 0$ when $x\in S_2$. From the fact that $x\in R_1\cup R_2$, we have $j(x,t) = \int_{\mathbb{R}^d} dy\,j(x,t;y,t) = j(x,t;R_1,t) + j(x,t;R_2,t)$ and thus
$$J_{2\to1} = \int_{S_{12}} dS\, n\cdot j(x,t).$$
Recall that in a discrete time Markov chain, the probability flux from state $i$ to state $j$ at time $n$ is defined as
$$J_{ij}^n = \mu_{n,i}p_{ij} - \mu_{n,j}p_{ji},$$
and similarly for a continuous time Markov chain. We see that $n\cdot j(x,t)$ is exactly the continuous space version of $J_{ij}(t)$ along a specific direction $n$.
8.3. The Backward Equation 175
The two most commonly used boundary conditions are as follows. It will
be instructive for the readers to compare them with the boundary conditions
for the Wiener process in Chapter 6.
The reﬂecting boundary condition. Particles are reﬂected at the bound
ary ∂U . Thus there will be no probability ﬂux across ∂U . Therefore we have
$$(8.20)\qquad n\cdot j(x,t) = 0, \qquad x\in\partial U.$$
In this case the total probability is conserved since
$$\frac{d}{dt}\int_U p(x,t)\,dx = -\int_U \nabla_x\cdot j(x,t)\,dx = -\int_{\partial U} n\cdot j(x,t)\,dS = 0.$$
When all the variables are of even type, i.e., $s = I$, the condition (ii) is trivial and (i) gives exactly the detailed balance condition (8.26). This general detailed balance condition is useful when we consider the invariant distribution of the Langevin process (7.41). We have
$$(8.33)\qquad y = \begin{pmatrix} x \\ v \end{pmatrix}, \qquad b = \begin{pmatrix} v \\ -\frac{\gamma}{m}v - \frac{1}{m}\nabla U \end{pmatrix}, \qquad A(y) = \begin{pmatrix} 0 & 0 \\ 0 & \frac{2k_BT\gamma}{m^2}I \end{pmatrix}.$$
$$\lim_{t\to0^+}\|T_tf - f\|_\infty = 0 \quad\text{for any } f\in C_0(\mathbb{R}^d).$$
Here Tt is called the Feller semigroup. With this setup, we can use tools
from semigroup theory to study Tt [Yos95]. We thus deﬁne the inﬁnitesimal
generator $A$ of $T_t$ by
$$Af(x) = \lim_{t\to0^+}\frac{\mathbb{E}^xf(X_t) - f(x)}{t},$$
where $f\in\mathcal{D}(A) := \{f\in C_0(\mathbb{R}^d) \text{ such that the limit exists}\}$. For $f\in C_c^2(\mathbb{R}^d)\subset\mathcal{D}(A)$ we have
$$Af(x) = Lf(x) = b(x)\cdot\nabla f(x) + \frac{1}{2}(\sigma\sigma^T) : \nabla^2 f(x)$$
from Itô's formula (8.4).
This means that the solution of the problem (8.37) in the case when $q = 0$ can be represented as averages over diffusion processes. The Feynman-Kac formula handles the situation when $q \ne 0$.
To motivate the formula, note that in the absence of the diffusion term, the solution of the PDE can be obtained along the characteristics
$$\frac{dX_t}{dt} = b(X_t), \qquad X_0 = x.$$
The Feynman-Kac formula states that
$$(8.41)\qquad v(x,t) = \mathbb{E}^x\Big[\exp\Big(\int_0^t q(X_s)\,ds\Big)f(X_t)\Big].$$
8.7. Boundary Value Problems 181
The proof goes as follows. Let $Y_t = f(X_t)$ and $Z_t = \exp\big(\int_0^t q(X_s)\,ds\big)$, and define $v(x,t) = \mathbb{E}^x(Y_tZ_t)$. Here $v(x,t)$ is differentiable with respect to $t$, and
$$\frac{1}{s}\big(\mathbb{E}^x v(X_s,t) - v(x,t)\big) = \frac{1}{s}\Big(\mathbb{E}^x\mathbb{E}^{X_s}\big(Z_tf(X_t)\big) - \mathbb{E}^x\big(Z_tf(X_t)\big)\Big)$$
$$= \frac{1}{s}\Big(\mathbb{E}^x\exp\Big(\int_0^t q(X_{r+s})\,dr\Big)f(X_{t+s}) - \mathbb{E}^x\big(Z_tf(X_t)\big)\Big)$$
$$= \frac{1}{s}\mathbb{E}^x\Big(\exp\Big(-\int_0^s q(X_r)\,dr\Big)Z_{t+s}f(X_{t+s}) - Z_tf(X_t)\Big)$$
$$= \frac{1}{s}\mathbb{E}^x\big(Z_{t+s}f(X_{t+s}) - Z_tf(X_t)\big) + \frac{1}{s}\mathbb{E}^x\Big(Z_{t+s}f(X_{t+s})\Big(\exp\Big(-\int_0^s q(X_r)\,dr\Big) - 1\Big)\Big)$$
$$\longrightarrow \partial_t v - q(x)v(x,t) \quad\text{as } s\to0.$$
The left-hand side converges to $Av(x,t)$ by definition.
[Figure: deterministic and stochastic characteristics through the point $(x,t)$.]
$$(8.45)\qquad \tau_U := \inf\{t\ge0 : X_t\notin U\}.$$
Proof. The proof of the fact that τU is a stopping time and that Ex τU < ∞
is quite technical and can be found in [KS91, Fri75a].
Standard PDE theory tells us that $u\in C^2(U)\cap C(\bar{U})$. Applying Dynkin's formula to $u(X_t)$, we get
$$(8.46)\qquad \mathbb{E}^x u(X_{\tau_U}) - u(x) = \mathbb{E}^x\int_0^{\tau_U} Au(X_t)\,dt.$$
Thus
$$u(x) = \mathbb{E}^x\Big(u(X_{\tau_U}) - \int_0^{\tau_U} Au(X_t)\,dt\Big) = \mathbb{E}^x\Big(f(X_{\tau_U}) + \int_0^{\tau_U} g(X_t)\,dt\Big).$$
From the boundary conditions for the backward equation, we have $R(x,t)\big|_{x=b} = 0$ and $\partial_xR(x,t)\big|_{x=a} = 0$, which implies the boundary conditions for $\tau(x)$:
$$(8.49)\qquad \partial_x\tau(x)\big|_{x=a} = 0, \qquad \tau(x)\big|_{x=b} = 0.$$
The study of the first passage time problem is then turned into a simple PDE problem.
In analogy with Section 3.10, we deﬁne a Hilbert space L2μ with the
invariant distribution μ = Z −1 exp(−U (x)) as the weight function. The
inner product and the norm are deﬁned by
(f, g)μ = f (x)g(x)μ(x)dx and
f
2μ = (f, f )μ .
[0,1]d
The operator $L$ is self-adjoint in $L_\mu^2$. This can be seen as follows. First note that
$$(8.51)\qquad L^*(g\mu) = \nabla\cdot(\nabla U\mu)\,g + \nabla U\cdot\nabla g\,\mu + \Delta g\,\mu + \Delta\mu\,g + 2\nabla g\cdot\nabla\mu = \mu Lg,$$
$$(8.52)\qquad (Lf,g)_\mu = \int Lf(x)\,g(x)\mu(x)\,dx = \int f(x)\,L^*(g\mu)\,dx = (f,Lg)_\mu.$$
Moreover,
$$-(Lf,f)_\mu = \|\nabla f\|_\mu^2 \ge 0.$$
Hence we conclude that the eigenvalues of $L$ and $L^*$ are real and nonpositive. Denote by $\phi_\lambda$ the eigenfunctions of $L$ with eigenvalue $\lambda$, with the normalization
$$(\phi_\lambda, \phi_{\lambda'})_\mu = \delta_{\lambda\lambda'}.$$
We immediately see that the eigenfunctions of $L^*$ can be chosen as
$$(8.53)\qquad \psi_\lambda(x) = \mu(x)\phi_\lambda(x)$$
because of (8.51). We also have as a consequence that $(\phi_\lambda, \psi_{\lambda'}) = \delta_{\lambda\lambda'}$.
In the case of the discrete Markov chain with detailed balance, there is a
simple way to symmetrize the generator. Let P be the transition probability
matrix, let μ = (μ1 , μ2 , · · · , μn ) be the invariant distribution, and let D =
$\mathrm{diag}(\mu_1, \mu_2, \ldots, \mu_n)$. Then
$$(8.54)\qquad DP = P^TD.$$
Thus we can define the symmetrization of $P$ by
$$(8.55)\qquad D^{\frac12}PD^{-\frac12} = D^{-\frac12}P^TD^{\frac12}.$$
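A small sketch (not from the text) builds a reversible chain by a Metropolis construction and checks both (8.54) and the symmetry of (8.55):

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.3, 0.4])   # target invariant distribution
n = len(mu)

# Metropolis chain with uniform proposals: reversible w.r.t. mu by construction.
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = min(1.0, mu[j] / mu[i]) / (n - 1)
    P[i, i] = 1.0 - P[i].sum()

D = np.diag(mu)
S = np.diag(np.sqrt(mu)) @ P @ np.diag(1.0 / np.sqrt(mu))   # D^{1/2} P D^{-1/2}

detailed_balance = bool(np.allclose(D @ P, P.T @ D))   # (8.54)
symmetrized = bool(np.allclose(S, S.T))                # symmetry of (8.55)
```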
Let $w = u_+ + u_-$. We have
$$\epsilon^2\frac{\partial^2 w}{\partial t^2} = \alpha^2\frac{\partial^2 w}{\partial x^2} - 2\beta\frac{\partial w}{\partial t}, \qquad w\big|_{t=0} = f_+ + f_-, \qquad \partial_tw\big|_{t=0} = \frac{\alpha}{\epsilon}\partial_x(f_+ - f_-).$$
Consider the case when f+ = f− = f . In this case the time derivative of w
vanishes at t = 0; hence we avoid the extra complication coming from the
initial layer. Following the standard approach in asymptotic analysis, we
make the ansatz
w = w0 + w1 + 2 w2 + · · · .
To leading order, this gives
∂w0 α2 ∂ 2 w0
(8.61) = , w0 t=0 = 2f.
∂t 2β ∂x2
This means that to leading order, $x$ behaves like Brownian motion with diffusion constant $D = \alpha^2/(2\beta)$. This is not surprising since it is what the central limit theorem tells us for the process
$$\tilde{x}^\epsilon(t) = \frac{1}{\epsilon}\int_0^t y^\epsilon(s)\,ds \quad\text{as } \epsilon\to0.$$
We turn now to the general case. Suppose that the backward equation for the stochastic process $\{X_t^\epsilon\}$ has the form
$$(8.62)\qquad \frac{\partial u^\epsilon}{\partial t} = \frac{1}{\epsilon^2}L_1u^\epsilon + \frac{1}{\epsilon}L_2u^\epsilon + L_3u^\epsilon, \qquad u^\epsilon(0) = f,$$
where $L_1$, $L_2$, and $L_3$ are differential operators defined on some Banach space $B$, whose properties will be specified below. We would like to study the asymptotic behavior of $u^\epsilon$ as $\epsilon\to0$ for $0\le t\le T$, $T<\infty$. This discussion
follows the work of Khasminski, Kurtz, Papanicolaou, etc. [Pap77].
As a general framework we assume that the following conditions hold.
(a) L1 is an inﬁnitesimal generator of a stationary Markov process, and
the semigroup exp(L1 t) generated by L1 converges to a projection
operator to the null space of L1 , which we will denote as P:
exp(L1 t) → P, t → ∞.
The reason and meaning for this will become clear later.
(b) Solvability condition: PL2 P = 0.
(c) Consistency condition for the initial value: Pf = f . This allows us
to avoid the initial layer problem.
Note that the following holds:
$$(8.63)\qquad \mathrm{Range}(\mathcal{P}) = \mathrm{Null}(L_1), \qquad \mathrm{Null}(\mathcal{P}) = \mathrm{Range}(L_1).$$
But $\mathcal{P}$ does not need to be an orthogonal projection since generally $\mathcal{P}^* \ne \mathcal{P}$ (see Exercise 8.13 for more details).
8.9. Asymptotic Analysis of SDEs 187
To see how this abstract framework actually works, we use it for the simple model introduced at the beginning of this section. We have
$$L_1 = A, \qquad L_2 = \begin{pmatrix} +\alpha & 0 \\ 0 & -\alpha \end{pmatrix}\frac{\partial}{\partial x}, \qquad L_3 = 0.$$
Thus the projection operator $\mathcal{P}$ is given by
$$\mathcal{P} = \lim_{t\to\infty}\exp(L_1t) = \lim_{t\to\infty}\frac{1}{2}\begin{pmatrix} 1+e^{-2\beta t} & 1-e^{-2\beta t} \\ 1-e^{-2\beta t} & 1+e^{-2\beta t} \end{pmatrix} = \begin{pmatrix} \frac12 & \frac12 \\ \frac12 & \frac12 \end{pmatrix}.$$
We have
$$Lf(y) = b(y)f'(y) + \frac{1}{2}f''(y)$$
and
$$L^2f(y) = L\Big(b(y)f'(y) + \frac{1}{2}f''(y)\Big),$$
while for the numerical scheme
$$L_Nf(y) = b(x)f'(y) + \frac{1}{2}f''(y)$$
and
$$L_N^2f(y) = L_N\Big(b(x)f'(y) + \frac{1}{2}f''(y)\Big),$$
where $x$ is the initial value. Now it is obvious that
Since the local truncation error is of second order, we expect the global
error to be of weak order 1. In fact if one can bound the expansions in (8.71)
and (8.72) up to a suitable order, the above formal derivations based on the
local error analysis can be made rigorous.
It will become clear soon that the weak convergence analysis boils down
to some estimates about the solution of the backward equation. Let us ﬁrst
consider the PDE
$$(8.76)\qquad \partial_t u(x,t) = Lu(x,t) = b(x)\partial_x u + \frac{1}{2}\partial_x^2 u, \qquad u(x,0) = f(x).$$
Let $C_P^m(\mathbb{R}^d,\mathbb{R})$ be the space of functions $w\in C^m(\mathbb{R}^d,\mathbb{R})$ for which all partial derivatives up to order $m$ have polynomial growth. More precisely, there exist a constant $K>0$ and $p\in\mathbb{N}$ such that
$$|\partial_x^j w(x)| \le K(1 + |x|^{2p}) \quad\text{for all } |j|\le m$$
for any $x\in\mathbb{R}^d$, where $j$ is a $d$-multi-index. Here we have $d=1$ and we will simply denote $C_P^m(\mathbb{R}^d,\mathbb{R})$ as $C_P^m$.
The following important lemma can be found in [KP92, Theorem 4.8.6,
p. 153].
Lemma 8.9. Suppose that $f\in C_P^{2\beta}$ for some $\beta\in\{2,3,\ldots\}$, $X_t$ is time-homogeneous, and $b\in C_P^{2\beta}$ with uniformly bounded derivatives. Then $\partial u/\partial t$ is continuous and
$$u(\cdot,t)\in C_P^{2\beta}, \qquad t\le T,$$
for any fixed $T<\infty$.
Hence
$$|e_n| = |\mathbb{E}f(X_n) - \mathbb{E}f(X_{t_n})| = |\mathbb{E}v(X_n,t_n) - \mathbb{E}v(X_0,0)|$$
$$= \Big|\int_0^{t_n}\mathbb{E}\Big(\partial_t v(\bar{X}_s,s) + b(X_{n_s})\partial_x v(\bar{X}_s,s) + \frac{1}{2}\partial_{xx}v(\bar{X}_s,s) - \tilde{L}v(\bar{X}_s,s)\Big)ds\Big|,$$
where $n_s := \{m : t_m \le s < t_{m+1}\}$ and $\{\bar{X}_s\}$ is the continuous extension of $\{X_n\}$ defined by
$$(8.78)\qquad d\bar{X}_s = b(X_{n_s})\,ds + dW_s, \qquad s\in[t_m, t_{m+1}).$$
With this definition, we obtain
$$|e_n| = \Big|\int_0^{t_n}\mathbb{E}\big(b(X_{n_s})\partial_x v(\bar{X}_s,s) - b(\bar{X}_s)\partial_x v(\bar{X}_s,s)\big)\,ds\Big|$$
$$\le \Big|\int_0^{t_n}\mathbb{E}\big(b(X_{n_s})\partial_x v(X_{n_s},t_{n_s}) - b(\bar{X}_s)\partial_x v(\bar{X}_s,s)\big)\,ds\Big|$$
$$+ \Big|\int_0^{t_n}\mathbb{E}\,b(X_{n_s})\big(\partial_x v(\bar{X}_s,s) - \partial_x v(X_{n_s},t_{n_s})\big)\,ds\Big|$$
$$= \sum_m\Big|\int_{t_m}^{t_{m+1}}\mathbb{E}\big(b(X_m)\partial_x v(X_m,t_m) - b(\bar{X}_s)\partial_x v(\bar{X}_s,s)\big)\,ds\Big|$$
$$(8.79)\qquad + \sum_m\Big|\int_{t_m}^{t_{m+1}}\mathbb{E}\,b(X_m)\big(\partial_x v(\bar{X}_s,s) - \partial_x v(X_m,t_m)\big)\,ds\Big|.$$
Using this with $g = b\,\partial_x v$ and $g = \partial_x v$ in (8.79), the highest derivatives involved are $\partial_{xxx}v,\ \partial_{xx}b\in C_P^{2\beta}$ as long as $\beta\ge2$. Notice that, conditional on $X_m$, $b(X_m)$ is independent of $\int_{t_m}^t \partial_x g(\bar{X}_s,s)\,dW_s$. Together with the fact that $\mathbb{E}|X_m|^{2r}$ and $\mathbb{E}|\bar{X}_t|^{2r} \le C$ for any $r\in\mathbb{N}$ (Exercise 8.14), we get
$$|e_n| \le C\sum_m \Delta t^2 \le C\Delta t,$$
u_{N,Δt} = (1/N) Σ_{k=1}^N (X_{M,k})².
[Figure: the weak error of the Euler scheme versus the stepsize Δt ∈ [0.01, 0.1]; the error grows approximately linearly in Δt, from about 0.005 to 0.025.]
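The weak order-1 behavior and the Monte Carlo estimator above can be checked numerically. The following sketch is our own illustration (not the book's code); it assumes the linear test SDE dX_t = −X_t dt + dW_t with X_0 = 0, for which E X_T² = (1 − e^{−2T})/2 is known in closed form:

```python
import numpy as np

def weak_euler_moment(dt, T=1.0, n_paths=400_000, seed=0):
    """Estimate E[X_T^2] for dX = -X dt + dW, X_0 = 0, by the Euler scheme."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / dt))
    x = np.zeros(n_paths)
    for _ in range(n_steps):
        x += -x * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    return np.mean(x**2)

T = 1.0
exact = 0.5 * (1.0 - np.exp(-2.0 * T))      # E X_T^2 for the exact solution
errors = {dt: abs(weak_euler_moment(dt) - exact) for dt in (0.1, 0.05, 0.025)}
```

With 4·10⁵ samples the statistical error (≈10⁻³) is well below the bias, and halving Δt roughly halves the observed error, consistent with weak order 1.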
Exercises
8.1. Prove that X_t ∈ S^{n−1} if X_0 ∈ S^{n−1} for the process defined by (8.15).
8.2. Prove that the Fokker-Planck equation for the Brownian motion on S² is given by (8.16). Prove that for Brownian motion on S¹, the Fokker-Planck operator reduces to d²/dθ² if we use the parameterization (x, y) = (cos θ, sin θ), θ ∈ [0, 2π).
8.3. Derive the Fokker-Planck equation (8.13) for Stratonovich SDEs.
8.4. Let X be the solution of the SDE
dXt = a(t)Xt dt + σ(t)dWt .
Find the PDF of Xt .
8.5. Prove that for the multidimensional OU process

dX_t = BX_t dt + σ dW_t,

if the invariant distribution is a nondegenerate Gaussian with density Z^{−1} exp(−xᵀΣ^{−1}x/2), then the detailed balance condition j_s(x) = 0 implies

BΣ = (BΣ)ᵀ.
8.6. Prove that the detailed balance condition (8.30) is equivalent to
(8.32) when the diﬀusion process has the form (8.31).
8.7. Consider the one-dimensional OU process (7.38).
(a) Find the eigenvalues and eigenfunctions of the Fokker-Planck operator.
(b) Find the transition probability density.
8.8. Find the transition PDF of the linear Langevin equation

dX_t = V_t dt,    dV_t = −X_t dt − γV_t dt + √(2γ) dW_t

starting from the initial state (X_0, V_0) = (x_0, v_0).
8.9. Suppose that Xt satisﬁes the SDE
dXt = b(Xt )dt + σ(Xt )dWt ,
where b(x) = −a0 − a1 x, σ(x) = b0 + b1 x, and ai , bi ≥ 0 (i = 0, 1).
(a) Write down the partial differential equations for the PDF p(x, t) of X_t when X_0 = x_0 and for u(x, t) = E^x f(X_t). Specify their initial conditions.
(b) Assume that X0 is a random variable with PDF p0 (x) that
has ﬁnite moments. Derive a system of diﬀerential equations
for the moments of Xt (i.e., the equations for Mn (t) = EXtn ).
194 8. Fokker-Planck Equation
8.10. Let X_t satisfy the SDE with drift b(x) = 0 and diffusion coefficient σ(x) = 1. Assume X_t ∈ [0, 1] with reflecting boundary conditions and the initial state X_t|_{t=0} = x_0 ∈ [0, 1].
(a) Solve the initial-boundary value problem for the Fokker-Planck equation to calculate the transition PDF p(x, t|x_0, 0).
(b) Find the stationary distribution of the process X_t.
8.11. Derive the Fokker-Planck equation for SDEs defined using the backward stochastic integral (see Exercise 7.11 for the definition)

dX_t = b(X_t, t) dt + σ(X_t, t) ∗ dW_t.
8.12. Prove that the operator L* defined in (8.50) is self-adjoint and nonpositive in the space L²_{μ^{−1}} with inner product

(f, g)_{μ^{−1}} = ∫_{[0,1]^d} f(x) g(x) μ^{−1}(x) dx.
Notes
The connection between PDEs and SDEs is one of the most exciting areas in stochastic analysis, with links to potential theory [Doo84, Szn98], semigroup theory [Fri75a, Fri75b, Tai04], and other topics. Historically, the study of the distributions of diffusion processes began even earlier, and this motivated the study of stochastic differential equations [Shi92].
Advanced Topics
Chapter 9
Path Integral
I_n(w) = (1/2) Σ_{j=1}^n ( (w_j − w_{j−1}) / (t_j − t_{j−1}) )² (t_j − t_{j−1}),    t_0 := 0, w_0 := 0.
We have
0 < μ(B1 ) = μ(B2 ) = · · · = μ(Bn ) = · · · < ∞, 0 < μ(B) < ∞.
9.1. Formal Wiener Measure 201
which is a contradiction!
These facts explain why the path integral is often used as a formal
tool. Nevertheless, the path integral is a very useful approach and in many
important cases, things can be made rigorous.
Example 9.1. Compute the expectation

E exp( −(1/2) ∫_0^1 W_t² dt ).

We have seen in Chapter 6 that the answer is √(2e/(1 + e²)) using the Karhunen-Loève expansion. Here we will use the path integral approach.
Step 1. Finite-dimensional approximation.

Assume that, as before, {t_j} defines a subdivision of [0, 1]. Then

exp( −(1/2) ∫_0^1 W_t² dt ) ≈ exp( −(1/2) Σ_{j=1}^n W_{t_j}² δt ) = exp( −(1/2) δt XᵀAX ).

Here {λ_i^B} and {λ_i^{A+B}} denote the eigenvalues of B and A + B, respectively.
We have formally

E exp( −(1/2) ∫_0^1 W_t² dt ) = ∫ exp( −(1/2)(Aw_t, w_t) ) · (1/Z) exp( −(1/2)(Bw_t, w_t) ) δ(w_0) Dw,

where the operator Bu(t) := −d²u/dt² and

Z = ∫ exp( −(1/2)(Bw_t, w_t) ) δ(w_0) Dw.
Now compute formally the infinite-dimensional Gaussian integrals. We obtain

E exp( −(1/2) ∫_0^1 W_t² dt ) = ( det B / det(A + B) )^{1/2},

where det B, det(A + B) mean the products of all eigenvalues for the following boundary value problems:

Bu = λu,  or  (A + B)u = λu,    u(0) = 0, u'(1) = 0.
Both determinants are infinite, but their ratio is well-defined. It is easy to check that by discretizing and taking the limit, one obtains the same result as before. In fact, the eigenvalue problem for the operator B is the same as the eigenvalue problem considered in the Karhunen-Loève expansion (6.11).
Although it seems to be more complicated than before, this example
illustrates the main steps involved in computing expectation values using
the path integral approach.
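The claim that discretization reproduces the same answer can be checked directly. In the sketch below (our own construction), the matrix B discretizes the quadratic form I_n(w) of the Brownian increments, A = δt·I discretizes the quadratic form of (1/2)∫W² dt, and the ratio of determinants is evaluated with log-determinants to avoid overflow:

```python
import numpy as np

n = 500
dt = 1.0 / n
# B: quadratic form of the increments, I_n(w) = (1/2) w^T B w, with w_0 = 0
B = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / dt
B[-1, -1] = 1.0 / dt              # the last point enters one increment only
A = dt * np.eye(n)                # from (1/2) * dt * sum_j W_{t_j}^2

# E exp(-(1/2) w^T A w) under the Gaussian density ∝ exp(-(1/2) w^T B w)
# equals sqrt(det B / det(A + B)); use log-determinants for stability.
_, logdet_B = np.linalg.slogdet(B)
_, logdet_AB = np.linalg.slogdet(A + B)
approx = np.exp(0.5 * (logdet_B - logdet_AB))

exact = np.sqrt(2 * np.e / (1 + np.e**2))    # = (cosh 1)^(-1/2)
```

For n = 500 the finite-dimensional ratio already agrees with the Karhunen-Loève answer to about three digits.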
9.2. Girsanov Transformation 203
can be viewed as a mapping between the Wiener path {W_t} and {X_t}:

Φ : {W_t} −→ {X_t}.

(1) Is the new measure absolutely continuous with respect to the Wiener measure?
(2) If it is, what is the Radon-Nikodym derivative?

Φ_n : {W_{t_1}, W_{t_2}, . . . , W_{t_n}} −→ {X_{t_1}, X_{t_2}, . . . , X_{t_n}}.

With the notation defined earlier, one can write this mapping as

(9.7)    ∂(w_1, . . . , w_n) / ∂(x_1, . . . , x_n) = 1.
Ĩ_n(x) = (1/2) Σ_{j=1}^n ( (x_j − x_{j−1}) / (t_j − t_{j−1}) )² δt_{j−1} + (1/2) Σ_{j=1}^n b²(x_{j−1}, t_{j−1}) δt_{j−1}
        − Σ_{j=1}^n b(x_{j−1}, t_{j−1}) · (x_j − x_{j−1}).
Note that from (9.9) the stochastic integral is in the Itô sense. Since (9.10)
is valid for arbitrary F , we conclude that the distribution of {Xt }, μX , is
absolutely continuous with respect to the Wiener measure μW and
dμ_X/dμ_W = exp( −(1/2) ∫_0^1 b²(W_t, t) dt + ∫_0^1 b(W_t, t) dW_t ).
(9.11)    X_t = W_t + φ(t)
One can also see from the derivation that if the condition that σ = 1 is
violated, the measure induced by Xt fails to be absolutely continuous with
respect to the Wiener measure.
The following theorems are the rigorous statements of the formal deriva
tions above.
Theorem 9.2 (Girsanov Theorem I). Consider the Itô process
(9.13)    dW̃_t = φ(t, ω) dt + dW_t,    W̃_0 = 0,

where W ∈ R^d is a d-dimensional standard Wiener process on (Ω, F, P). Assume that φ satisfies the Novikov condition

(9.14)    E exp( (1/2) ∫_0^T |φ(s, ω)|² ds ) < ∞,

where T ≤ ∞ is a fixed constant. Let

(9.15)    Z_t(ω) = exp( −∫_0^t φ(s, ω) · dW_s − (1/2) ∫_0^t |φ(s, ω)|² ds )

and

(9.16)    P̃(dω) = Z_T(ω) P(dω).

Then W̃ is a d-dimensional Wiener process on [0, T] with respect to (Ω, F_T, P̃).
To see how (9.15) and (9.12) are related to each other, we note that for any functional F

⟨F[W̃_t]⟩_P̃ = ⟨F[W̃_t] Z_T⟩_P
            = ⟨F[W̃_t] exp( −∫_0^T φ(s, ω) · dW̃_s + (1/2) ∫_0^T |φ(s, ω)|² ds )⟩_P
            = ⟨F[W_t] exp( −∫_0^T φ(s, ω) · dW_s + (1/2) ∫_0^T |φ(s, ω)|² ds ) dμ_W̃/dμ_W⟩_P
            = ⟨F[W_t]⟩_P.
where φ is defined by

φ(t, ω) = (σ(X_t, t))^{−1} · γ(t, ω).
Theorem 9.3 (Girsanov Theorem II). Let X_t, Y_t, and φ be defined as above. Assume that b and σ satisfy the same conditions as in Theorem 7.14, σ is nonsingular, γ is an F_t-adapted process, and the Novikov condition holds for φ:
(9.18)    E exp( (1/2) ∫_0^T |φ(s, ω)|² ds ) < ∞.
Define W̃_t, Z_t, and P̃ as in Theorem 9.2. Then W̃ is a standard Wiener process with respect to (Ω, F_T, P̃) and Y satisfies

dY_t = b(Y_t, t) dt + σ(Y_t, t) dW̃_t,    Y_0 = x, t ≤ T.

Thus the law of Y_t under P̃ is the same as that of X_t under P for t ≤ T.
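A minimal numerical illustration of the change of measure (our own toy example, with constant drift φ ≡ b and the functional F[X] = X_T): the expectation of X_T under the drifted dynamics dX = b dt + dW can be computed either by simulating X directly, or by simulating only W and weighting with the Radon-Nikodym factor exp(bW_T − b²T/2):

```python
import numpy as np

rng = np.random.default_rng(1)
b, T = 0.5, 1.0
n_paths, n_steps = 200_000, 50
dt = T / n_steps

# Accumulate the Wiener endpoint W_T increment by increment
W_T = np.zeros(n_paths)
for _ in range(n_steps):
    W_T += np.sqrt(dt) * rng.standard_normal(n_paths)

# Direct simulation of dX = b dt + dW:  X_T = b*T + W_T
direct = np.mean(b * T + W_T)

# Girsanov reweighting: E f(X_T) = E[ f(W_T) dmu_X/dmu_W ], with
# dmu_X/dmu_W = exp( ∫ b dW - (1/2)∫ b^2 dt ) = exp( b W_T - b^2 T / 2 )
weights = np.exp(b * W_T - 0.5 * b**2 * T)
reweighted = np.mean(W_T * weights)
```

Both estimators converge to E X_T = bT; the mean of the weights is close to 1, reflecting that Z_T is a martingale with expectation one.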
equation

(9.19)    iℏ ∂_t ψ = −(ℏ²/2m) Δψ + V(x)ψ,    ψ|_{t=0} = ψ_0(x)

can be expressed formally as

(9.20)    ψ(x, t) = (1/Z) ∫ δ(w_0 − x) exp( (i/ℏ) I[w] ) ψ_0(w_t) Dw,

where I[·] is the action functional defined in classical mechanics

I[w] = ∫_0^t ( (m/2) |ẇ_s|² − V(w_s) ) ds

and ℏ is the reduced Planck constant. Formally, if we take

m = 1,    ℏ = −i

in (9.19) and (9.20), we obtain the Feynman-Kac formula. This was what motivated Kac to propose the Feynman-Kac formula for the parabolic problem.
Feynman’s formal expression is yet to be made rigorous, even though
Kac’s reinterpretation for the parabolic equation is a rigorous mathematical
tool widely used in mathematical physics.
Exercises
9.1. Derive (9.17) via the path integral approach.
9.2. Derive the infinite-dimensional characteristic function for the Wiener process W_t,

⟨exp( i ∫_0^1 ϕ(t) dW_t )⟩ = exp( −(1/2) ∫_0^1 ϕ² dt ),

using the path integral approach.
Notes
The idea of path integral originated from Feynman's thesis, in which solutions to the Schrödinger equation were expressed formally as averages over
measures on the space of classical paths. Feynman’s formal expressions have
not yet been put on a ﬁrm mathematical basis. Kac had the insight that
Feynman’s work can be made rigorous if instead one considers the parabolic
analog of the Schrödinger equation. This can be understood as taking the
time variable to be imaginary in the Schrödinger equation.
One example of the application of the path integral approach is the
asymptotic analysis of Wiener processes (see Chapter 12 and Freidlin’s book
[Fre85]). For application of path integrals to physics, we refer to [Kle06].
Chapter 10
Random Fields
Our study of the random processes focused on two large classes: the
Gaussian processes and the Markov processes, and we devoted a large part
of this book to the study of diﬀusion processes, a particular class of Markov
processes with continuous paths. We will proceed similarly for random ﬁelds:
We will study Gaussian random ﬁelds and Markov random ﬁelds. However,
there are two important diﬀerences. The ﬁrst is that for Markov processes,
a very powerful tool is the FokkerPlanck equation. This allowed us to use
diﬀerential equation methods. This has not been the case so far for Markov
random ﬁelds. The second is that constructing nontrivial continuous random
ﬁelds, a subject in constructive quantum ﬁeld theory, is among the hardest
problems in mathematics. For this reason, there has not been much work
on continuous random ﬁelds, and for nonGaussian random ﬁelds, we are
forced to limit ourselves to discrete ﬁelds.
Figure 10.2. The original PKU (Peking University) tower image and the image degraded by adding Gaussian random noise.
some powerful tools. For random fields, perhaps the most general tool one can think of is the functional integral representation, which is a generalization of path integrals.
We will skip the details of defining random fields, except for one piece of terminology: We say that a random field is homogeneous if its statistics are translation invariant. For example, a point process is homogeneous if and only if the density ρ is a constant.
where ⟨i, j⟩ means that the sites i and j are nearest neighbors and λ > 0 is a penalty parameter. Define the probability measure

(10.14)    P(x) = (1/Z) exp(−βU(x)).

Here β^{−1} is the parameter that plays the role of temperature. As β → 0, this converges to the uniform distribution on Ω, i.e., equal weights for all configurations. As β → ∞, P converges to the point mass centered at the global minimum of U.
This has been used as a model for image segmentation with u being the
image [MD10].
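Sampling from a Gibbs measure of the form (10.14) is typically done with Markov chain Monte Carlo. The sketch below is our own minimal example; it assumes the concrete energy U(x) = λ · #{disagreeing nearest-neighbor pairs} on binary labels with periodic boundary, and implements single-site Metropolis updates:

```python
import numpy as np

def metropolis_sample(L=24, lam=1.0, beta=2.0, n_sweeps=150, seed=0):
    """Single-site Metropolis for P(x) ∝ exp(-beta*U(x)), where
    U(x) = lam * (number of disagreeing nearest-neighbor pairs),
    x_i in {0, 1} on an L x L periodic grid."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(L, L))
    for _ in range(n_sweeps * L * L):
        i, j = rng.integers(0, L, size=2)
        nbrs = (x[(i + 1) % L, j], x[(i - 1) % L, j],
                x[i, (j + 1) % L], x[i, (j - 1) % L])
        disagree = sum(int(v != x[i, j]) for v in nbrs)
        dU = lam * (4 - 2 * disagree)   # energy change if x[i, j] is flipped
        if dU <= 0 or rng.random() < np.exp(-beta * dU):
            x[i, j] = 1 - x[i, j]
    return x

def disagreement_fraction(x):
    """Fraction of nearest-neighbor pairs with different labels."""
    return 0.5 * ((x != np.roll(x, 1, axis=0)).mean()
                  + (x != np.roll(x, 1, axis=1)).mean())

smooth = metropolis_sample(beta=2.0)    # large beta: smooth configurations
rough = metropolis_sample(beta=0.01)    # small beta: nearly uniform noise
```

The two limiting regimes described in the text are visible in the samples: at small β the configuration stays close to uniform noise, while at large β the sampler produces smooth, piecewise-constant configurations of low energy.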
Exercise
10.1. Find the covariance operator K of the twodimensional Brownian
sheet.
Notes
For a mathematical introduction to random ﬁelds, we refer to the classical
book by Gelfand and Shilov [GS66]. Practical issues are discussed in the
wellknown book by Vanmarcke [Van83]. For an interesting perspective on
the application of random ﬁelds to pattern recognition, we refer to [MD10].
Chapter 11
Introduction to
Statistical Mechanics
from the premise stated above. The fact that the observed macrostates are
quite deterministic even though the microstates are very much random is a
consequence of the law of large numbers and large deviation theory.
In equilibrium statistical mechanics, we assume that the macrostate is
stationary. Our aim is to understand the connection between the micro
and macrostates. Among other things, this will provide a foundation for
thermodynamics. For simplicity, we assume that the total volume V is
ﬁxed. According to how the system exchanges particles and energy with its
environment, one can consider three kinds of situations:
(1) Isolated system: The system under consideration does not exchange
mass or energy with its environment. In this case, the particle
number N and total energy E are both ﬁxed. We call this an
(N, V, E) ensemble, or microcanonical ensemble.
(2) Closed system: The system does not exchange mass with the en
vironment but it does exchange energy with the environment. We
further assume that the system is at thermal equilibrium with the
environment; i.e., its temperature T is the same as the environment.
We call this an (N, V, T ) ensemble, or canonical ensemble.
(3) Open system: The system exchanges both mass and energy. We
further assume that it is at both thermal and chemical equilibrium
with the environment. The latter is equivalent to saying that the
chemical potential μ is the same as that of the environment. Phys
ically one can imagine particles in a box but the boundary of the
box is both permeable and perfectly conducting for heat. We call
this a (μ, V, T ) ensemble, or grand canonical ensemble.
One can also consider the situation in which the volume is allowed to change.
For example, one can imagine that the position of the boundary of the box,
such as the piston inside a cylinder, is maintained at a constant pressure. In
this case, the system under consideration is at mechanical equilibrium with
the environment. This is saying that the pressure P is the same as that
of the environment. One can then speak about, for example, the (N, P, T )
ensemble, or isothermalisobaric ensemble.
In nonequilibrium statistical mechanics, the systems are timedependent
and generally driven by frictional force and random ﬂuctuations. Since the
frictional and random forces are both induced from the same microscopic
origin, that is, the random impacts of surrounding particles, they must be
related to each other. A cornerstone in nonequilibrium statistical mechanics
is the ﬂuctuationdissipation theorem. It connects the response functions,
which are related to the frictional component of the system, with the corresponding correlation functions, which are related to the fluctuation component. When the applied force is weak, the linear response theory is adequate
for describing the relaxation of the system. From the reductionist viewpoint,
one can derive the nonequilibrium dynamical model for a system interacting
with a heat bath using the MoriZwanzig formalism. We will demonstrate
this reduction for the simplest case, the KacZwanzig model, which gives
rise to a generalized Langevin equation for a distinguished particle coupled
with many small particles via harmonic springs.
We should remark that the theory of nonequilibrium statistical mechanics is still an evolving field. For example, establishing thermodynamic laws and using them to understand experiments for biological systems is an active research field in biophysics. Even less is known for open quantum systems.
where the summation is with respect to all possible states x and p(x) is the
probability of ﬁnding the system at state x. For an isolated system, we have,
as a consequence of the basic premise stated earlier,
(11.3)    p(x) = 1/g(E) if H(x) = E, and p(x) = 0 otherwise.
In this case, the Gibbs entropy reduces to the Boltzmann entropy (11.1).
Let E1∗ be the value of E1 such that the summand attains its largest
value. Then
(11.5)    (d/dE_1) ( g_1(E_1) g_2(E − E_1) ) = 0    at E_1 = E_1*.
Let σ_1(E_1) = log g_1(E_1), σ_2(E_2) = log g_2(E_2), E_2* = E − E_1*. Then the corresponding entropies are

S_1(E_1) = k_B σ_1(E_1),    S_2(E_2) = k_B σ_2(E_2)

by (11.1). The condition (11.5) then becomes

(11.6)    (∂S_1/∂E_1)(E_1*) = (∂S_2/∂E_2)(E_2*).

Define the temperatures

(11.7)    T_1 = ( (∂S_1/∂E_1)(E_1*) )^{−1},    T_2 = ( (∂S_2/∂E_2)(E_2*) )^{−1}.

Then (11.6) says that

(11.8)    T_1 = T_2.
This means that if two systems are at thermal equilibrium, then their tem
peratures must be the same.
We postulate that the main contribution to the summation in the definition of g(E) above comes from the maximum term of the summand, at E_1*. As we will see later, this is supported by large deviation theory.
Assume that the initial energies of the two systems when they are brought into contact are E_1^0 and E_2^0, respectively. Then

(11.9)    g_1(E_1^0) g_2(E_2^0) ≤ g_1(E_1*) g_2(E_2*);
i.e., the number of states of the combined system can only increase. We will
see later that the number of states of a system is related to its entropy. So
this says that entropy can only increase. This gives an interpretation of the
second law of thermodynamics from the microscopic point of view.
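The postulate that the maximum term dominates can be made concrete in a toy model (our own illustration, assuming two Einstein solids with multiplicities g(N, E) = C(E + N − 1, E) for E energy quanta): on the entropy (logarithmic) scale, the full sum over E_1 is indistinguishable from its largest term, which sits where E_1/N_1 = E_2/N_2:

```python
from math import comb, log

# Two subsystems modeled as Einstein solids: g(N, E) = C(E + N - 1, E)
N1, N2, E = 300, 200, 400             # oscillator counts, total energy quanta

terms = [comb(E1 + N1 - 1, E1) * comb(E - E1 + N2 - 1, E - E1)
         for E1 in range(E + 1)]
E1_star = max(range(E + 1), key=lambda E1: terms[E1])

# sigma = log g: the log of the full sum is essentially the log of the max term
rel_gap = (log(sum(terms)) - log(terms[E1_star])) / log(sum(terms))
```

Here the sum over E_1 equals the multiplicity of the combined system, C(E + N_1 + N_2 − 1, E), by the Vandermonde convolution, so the relative gap between log(sum) and log(max term) is at most log(E + 1)/log(sum), which is already below one percent for these modest sizes.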
11.1. Thermodynamic Heuristics 221
The Mechanical Equilibrium States. In the same way, one can show
that if the two systems can only exchange volume, i.e., their temperature
and particle numbers remain ﬁxed, then at equilibrium we must have
(11.29)    (∂F_1/∂V_1)|_{N_1,T} = (∂F_2/∂V_2)|_{N_2,T}.

This is the mechanical equilibrium condition. It defines the pressure:

(11.30)    P = −(∂F/∂V)|_{N,T}.
This is the translational part of the temperature. If the particle has a shape, say it is shaped like a rod, and can undergo other kinds of motion such as rotation, then one can speak about the corresponding temperature, such as the rotational temperature. The definition would be similar.
To get an expression for the pressure, assuming that the force between the ith and jth particles is f_ij, the pressure of the system is given by [E11]:

(11.35)    P = (1/V) Σ_{i≠j} (x_i − x_j) · f_ij.
where |L| is the cardinality of the set L and y_j is the jth element in L. Given β > 0, define the discrete probability proportional to e^{−β y_j²/2}.
Then the canonical distribution on the N-particle state space Ω_N has the form

(11.51)    P_{N,eq}(dω) = λ(dx_1) · · · λ(dx_N) ρ_β(dv_1) · · · ρ_β(dv_N).
For this problem, one can simply integrate out the position degrees of freedom and consider the reduced canonical ensemble for the velocities ω = (v_1, . . . , v_N). The probability distribution is then defined on Ω_N = {ω = (v_1, . . . , v_N)}:

(11.52)    P_{N,eq}(dω) = (1/Z_N(β)) e^{−βU_N(ω)} ρ(dv_1) · · · ρ(dv_N),

where U_N(ω) = Σ_{j=1}^N |v_j|²/2 and

(11.53)    Z_N(β) = ∫_{Ω_N} e^{−βU_N(ω)} ρ(dv_1) · · · ρ(dv_N)
where J_0 > 0 and the first part of H_N indicates that only nearest-neighbor interactions are taken into account.
For such thermodynamic systems, we naturally define the Gibbs distribution

(11.56)    P_{N,eq}(dx) = (1/Z_N(β, h)) exp( −H_N(x; β, h) ) P_N(dx)
in the canonical ensemble, where P_N is the product measure on B(S^N) with identical one-dimensional marginals equal to (1/2)δ_1 + (1/2)δ_{−1}, and the partition function

(11.57)    Z_N(β, h) = ∫_{S^N} exp( −H_N(x; β, h) ) P_N(dx).
Macroscopically, suppose that we are interested in a particular observable, the average magnetization

Y = O^N(X) = (1/N) Σ_{i=1}^N X_i,    X ∈ S^N,
where P_Y^N(dy) is simply the Lebesgue measure in the current setup. The free energy per site F_N(y; β, h) for Y has the form

(11.59)    F_N(y; β, h) = −(1/N) log [ (1/Z_N(β, h)) ∫_{{x ∈ S^N : O^N(x) = y}} exp( −H_N(x; β, h) ) P_N(dx) ].
Evaluating the Partition Function via Transfer Matrices. First we will calculate the log-asymptotics of the partition function Z_N(β, h) using a powerful technique known as transfer matrices. We write

(11.60)    Z_N(β, h) = 2^{−N} Σ_{x∈S^N} exp( βJ_0 Σ_{i=1}^{N−1} x_i x_{i+1} + βh Σ_{i=1}^N x_i )
                    = (1/2) Σ_{x_1=±1} Σ_{x_2=±1} · · · Σ_{x_N=±1} a(x_1) B(x_1, x_2) · · · B(x_{N−1}, x_N) a(x_N).
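The identity (11.60) can be verified numerically for small N (a sketch of our own; B and a are as in (11.60), with the 1/2 factors absorbing the product measure 2^{−N}):

```python
import numpy as np
from itertools import product

def z_brute(N, beta, J0, h):
    """Z_N(beta, h) from (11.60) by summing over all 2^N configurations."""
    total = 0.0
    for x in product((-1, 1), repeat=N):
        energy = beta * J0 * sum(x[i] * x[i + 1] for i in range(N - 1))
        energy += beta * h * sum(x)
        total += np.exp(energy)
    return total / 2**N

def z_transfer(N, beta, J0, h):
    """Same quantity via the transfer matrix B and boundary vector a."""
    s = np.array([1.0, -1.0])
    B = 0.5 * np.exp(beta * J0 * np.outer(s, s)
                     + 0.5 * beta * h * (s[:, None] + s[None, :]))
    a = np.exp(0.5 * beta * h * s)
    return 0.5 * a @ np.linalg.matrix_power(B, N - 1) @ a

zb = z_brute(10, 0.7, 1.0, 0.3)
zt = z_transfer(10, 0.7, 1.0, 0.3)
```

The transfer-matrix form reduces the cost from O(2^N) to O(N), and the log-asymptotics of Z_N are then governed by the largest eigenvalue of B.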
Recall from (11.59) that the free energy function F(y; β, h) can be thought of as a rate function for a large deviation principle for the sequence of measures P^Y_{N,eq} as N → ∞. Hence, an application of Laplace's asymptotics gives

[Figures: the free energy F(y; β, h) for J_0 = 1 and h = 0 in several temperature regimes, and the average magnetization Y as a function of β (at h = 0) and as a function of h.]
Ising model. Indeed, the nearest neighbor Ising model only exhibits phase
transitions in more than one dimension [Gol92, Ell85].
In this chapter we will adopt the notation commonly used in the physics
literature for SDEs.
Consequently, we have

(11.81)    V_t = ∫_{−∞}^∞ μ(t − s) ξ_s ds,

Ĉ(ω, ω') = |μ̂(ω)|² · 2Re γ̂_+(ω) · 2πδ(ω + ω') = Ĝ(ω) · 2πδ(ω + ω').

This is exactly the Fourier transform of G(t − s). This proves (11.78).
Denote

(11.84)    δ⟨A(t)⟩ = ∫_0^t χ_AB(t − s) F(s) ds,
where we used the weighted inner product (·, ·)_eq with respect to the Gibbs distribution p_eq(x) and the self-adjoint property (8.52) of L_0 with respect to (·, ·)_eq.
On the other hand, the cross correlation satisfies

⟨A(t)B(0)⟩_eq = ∫ dx dy A(x) B(y) p_eq(y) exp(tL*_{0,x}) δ(x − y)
             = ( exp(tL_0)A(x), B(x) )_eq
(11.87)      = ( A(x), exp(tL_0)B(x) )_eq,
The relation (11.88) has another useful form in the frequency domain. Define the so-called complex admittance μ̂(ω), the Fourier transform of the response function:

(11.89)    μ̂(ω) = ∫_0^∞ χ_AB(t) e^{−iωt} dt.
Due to the stationarity of the equilibrium process, we formally have

(d/dt) C_AB(t) = ⟨Ȧ(t)B(0)⟩_eq = −⟨A(t)Ḃ(0)⟩_eq.
Thus we obtain the general form of the Green-Kubo relation

(11.90)    μ̂(ω) = β ∫_0^∞ ⟨A(t)Ḃ(0)⟩_eq e^{−iωt} dt.
If we take B(x) = x and A(x) = u for the Langevin dynamics (11.72) in one dimension, we obtain

(11.91)    μ̂(0) = β ∫_0^∞ ⟨u(t)u(0)⟩_eq dt.
Under the ergodicity and stationarity condition

lim_{t→∞} ⟨u(t)u(t + τ)⟩ = ⟨u(0)u(τ)⟩_eq,

we have

D = lim_{t→∞} ⟨x²(t)⟩/(2t) = lim_{t→∞} (1/2t) ∫_0^t ∫_0^t ⟨u(t_1)u(t_2)⟩ dt_1 dt_2 = ∫_0^∞ ⟨u(t)u(0)⟩_eq dt.
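The relation D = ∫_0^∞ ⟨u(t)u(0)⟩_eq dt can be checked by simulation. The sketch below is our own; it assumes the Langevin form du = −γu dt + √(2γ/β) dW for the velocity in (11.72), so that ⟨u(t)u(0)⟩_eq = β^{−1}e^{−γt} and D = 1/(βγ), and estimates D from the mean-square displacement:

```python
import numpy as np

gamma, beta = 1.0, 1.0
dt, T, n_paths = 0.01, 50.0, 2000
n_steps = int(T / dt)
rng = np.random.default_rng(0)

# Velocity process (assumed form): du = -gamma*u dt + sqrt(2*gamma/beta) dW,
# started in equilibrium u(0) ~ N(0, 1/beta); position x(t) = ∫_0^t u ds.
u = rng.standard_normal(n_paths) / np.sqrt(beta)
x = np.zeros(n_paths)
for _ in range(n_steps):
    x += u * dt
    u += (-gamma * u * dt
          + np.sqrt(2.0 * gamma / beta * dt) * rng.standard_normal(n_paths))

D_msd = np.mean(x**2) / (2.0 * T)
D_exact = 1.0 / (beta * gamma)   # = ∫_0^∞ <u(t)u(0)>_eq dt for this model
```

For T large compared with the correlation time 1/γ, the mean-square-displacement estimate agrees with the Green-Kubo value up to Monte Carlo noise.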
we obtain

(11.98)    dϕ/dt = e^{tL} PLϕ_0 + ∫_0^t e^{(t−s)L} K(s) ds + R(t),

where

(11.99)    R(t) = e^{tQL} QLϕ_0,    K(t) = PLR(t).
Equation (11.98) is a generalized Langevin equation; see (11.76). The first term on the right-hand side of (11.98) is a Markovian term. The second term is a memory-dependent term. The third term depends on full knowledge of the initial condition x_0. One can regard this term as a noise term by treating the initial data for the unresolved variables as noise. The fact that the memory depends on the noise corresponds to the fluctuation-dissipation relation, as will be exemplified in the next section. To eliminate the
noise, one can simply apply P to both sides and obtain

(d/dt) Pϕ = P e^{tL} PLϕ_0 + ∫_0^t P e^{(t−s)L} K(s) ds
(11.100)    H(x, v) = (1/2) v² + U(x) + Σ_{i=1}^N [ (1/2) m_i v_i² + (k/2N)(x_i − x)² ],
(11.101)    Ẍ^N = −U'(X^N) − (k/N) Σ_{i=1}^N (X^N − X_i^N),    X^N(0) = x_0,  Ẋ^N(0) = v_0,
(11.105)    γ_N(t) = (k/N) Σ_{i=1}^N cos(ω_i t)

and

(11.106)    ξ_N(t) = √β (k/N) Σ_{i=1}^N [ (x_i − x_0) cos(ω_i t) + (v_i/ω_i) sin(ω_i t) ].
i=1
The initial condition for the bath variables is random, and its distribution
can be obtained as the marginal distribution of the Gibbs distribution asso
ciated with H(x, v). This gives the density
1
(11.107) ρ(x1 , . . . , xN , v1 , . . . , vN ) = exp(−β Ĥ(x, v)),
Z
where Z is a normalization constant and Ĥ is the reduced Hamiltonian
N
k vi2
Ĥ(x, v) = + (x i − x 0 ) 2
N
i=1
ωi2
(11.109)    E ξ_N(t) ξ_N(s) = (k/N) Σ_{i=1}^N cos(ω_i(t − s)) = γ_N(t − s).

The noise term ξ_N(t) then converges to a Gaussian process ξ(t) with Eξ(t) = 0 and covariance function

E ξ(t) ξ(s) = lim_{N→∞} (k/N) Σ_{i=1}^N cos(ω_i(t − s)) = k ∫_R cos(ω(t − s)) ρ(ω) dω
(11.111)    = γ(t − s).
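The convergence γ_N → γ is easy to observe numerically. In the sketch below (our own; it assumes the frequencies ω_i are sampled from a standard Gaussian density ρ, for which k ∫ cos(ωt) ρ(ω) dω = k e^{−t²/2}):

```python
import numpy as np

k, N = 1.0, 200_000
rng = np.random.default_rng(0)
omega = rng.standard_normal(N)          # omega_i drawn from rho = N(0, 1)

t = np.linspace(0.0, 3.0, 31)
gamma_N = k * np.cos(np.outer(t, omega)).mean(axis=1)  # (k/N) sum_i cos(omega_i t)
gamma_limit = k * np.exp(-t**2 / 2.0)   # k ∫ cos(omega t) rho(omega) d omega
max_err = np.max(np.abs(gamma_N - gamma_limit))
```

The empirical kernel matches the limiting memory kernel to within the Monte Carlo fluctuation of order N^{−1/2}, illustrating how a deterministic harmonic bath produces an effectively Gaussian noise with prescribed covariance.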
Exercises
11.1. By putting the system in contact with a heat reservoir, prove that the Helmholtz free energy F achieves its minimum at equilibrium in the canonical ensemble, using the fact that the entropy S = k_B log g(E) achieves its maximum at equilibrium. Generalize this argument to other ensembles.
11.2. Derive the Gibbs distribution p(x) by the maximum entropy principle in the canonical ensemble: Maximize the Gibbs entropy

S = −k_B Σ_x p(x) log p(x)
11.4. Derive the zero mass limit of the Langevin equation in Exercise
11.3, i.e., the limit as m = ε → 0 when γ is a constant.
11.5. Prove the positivity of the energy spectrum of a stationary stochastic process X_t; i.e.,

Ĉ_X(ω) ≥ 0,

where C_X(t − s) = ⟨X_t X_s⟩ is the autocorrelation function and Ĉ_X(ω) = ∫_{−∞}^∞ C_X(t) e^{−iωt} dt.
11.6. Suppose that X_t is a stationary process and A(t) = A(X_t), B(t) = B(X_t) are two observables. Prove that the cross-correlation function ⟨A(t)B(s)⟩ satisfies the relation ⟨Ȧ(t)B(0)⟩ = −⟨A(t)Ḃ(0)⟩ if A(t), B(t) are differentiable.
11.7. Prove that under the ergodicity and stationarity condition

lim_{t→∞} ⟨u(t)u(t + τ)⟩ = ⟨u(0)u(τ)⟩_eq,

we have

lim_{t→∞} (1/2t) ∫_0^t ∫_0^t ⟨u(t_1)u(t_2)⟩ dt_1 dt_2 = ∫_0^∞ ⟨u(t)u(0)⟩_eq dt.
Notes
There are many good textbooks and monographs introducing the foundations of statistical mechanics. On the physics side we have chosen to rely mainly on [Cha87, KK80, KTH95, Rei98]. On the mathematics side,
our main reference is [Ell85], a classic text on the mathematical foundation
of equilibrium statistical mechanics. It has been widely recognized that
the mathematical language behind statistical mechanics is ergodic theory
and large deviation theory. We discussed these topics from the viewpoint
of thermodynamic limit and phase transition. The interested readers are
referred to [Gal99, Sin89, Tou09] for further material.
The linear response theory is a general approach for studying different kinds of dynamics, ranging from Hamiltonian, Langevin, and Brownian dynamics to quantum dynamics. It provides the basis for deriving response functions and sometimes equilibrium correlation functions. We only considered Brownian dynamics as a simple illustration. See [KTH95, MPRV08] for more details.
The fluctuation-dissipation theorem is a cornerstone of nonequilibrium statistical mechanics. In recent years, interest in small systems in biology and chemistry has given further impetus to this topic, resulting in the birth of a new field called stochastic thermodynamics. Representative results include Jarzynski's equality and Crooks's fluctuation theorem. See [JQQ04, Sei12, Sek10] for the latest developments.
Model reduction is an important topic in applied mathematics. The Mori-Zwanzig formalism gives a general framework for eliminating uninteresting variables from a given system. However, approximations have to be made in practical applications. How to perform such approximations effectively is still an unresolved problem in many real applications. Some interesting attempts can be found in [DSK09, HEnVEDB10, LBL16].
Chapter 12
Rare Events
Many phenomena in nature are associated with the occurrence of rare events. A typical example is a chemical reaction. The vibration of chemical bonds occurs on the time scale of femtoseconds to picoseconds (i.e., 10^{−15} to 10^{−12} seconds). But a typical reaction may take seconds or longer to complete.
This means that it takes on the order of 1012 to 1015 or more trials in order
to succeed in one reaction. Such an event is obviously a rare event. Other
examples of rare events include conformational changes of biomolecules, the
nucleation process, etc.
Rare events are caused by large deviations or large ﬂuctuations. Chemi
cal reactions succeed when the chemical bonds of the participating molecules
manage to get into some particular conﬁgurations so that a suﬃcient amount
of kinetic energy is converted to potential energy in order to bring the sys
tem over some potential energy barrier. These particular conﬁgurations are
usually quite far from typical conﬁgurations for the molecules and are only
visited very rarely. Although these events are driven by noise and conse
quently are stochastic in nature, some of their most important features, such
as the transition mechanism, are often deterministic and can be predicted
using a combination of theory and numerical algorithms.
The analysis of such rare events seems to have proceeded in separate
routes in the mathematics and science literature. Mathematicians have over
the years built a rigorous large deviation theory, starting from the work of Cramér and Schilder and ultimately leading to the work of Donsker and Varadhan and of Freidlin and Wentzell [FW98, Var84]. This theory gives
a rigorous way of computing the leading order behavior of the probability
of large deviations. It has now found applications in many branches of
mathematics and statistics.
[Figure: left, the double-well potential U(x) = (x² − 1)²/4 with basins B_− and B_+; right, a sample trajectory X_s for ε = 0.03, which switches rarely between the neighborhoods of x_− and x_+.]
In physical terms, the local minima or their basins of attraction are called metastable states. Obviously, when discussing metastability, the key issue is that of time scale. In rare event studies, one is typically concerned with the following three key questions:
(1) What is the most probable transition path? In high dimension, we
should also ask the following question: How should we compute it?
(2) Where is the transition state, i.e., the neighboring saddle point,
for a transition event starting from a metastable state? Presumably saddle points can be identified from the eigenvalue analysis of
The readers are referred to [Gol80, Arn89] for more details about these
connections.
(12.11)    Aτ(x) = −U'(x)τ'(x) + ετ''(x) = −1

for x ∈ (a, b), with τ|_{x=b} = 0 and τ'|_{x=a} = 0.
The solution to this problem is given simply by

(12.12)    τ(x) = (1/ε) ∫_x^b e^{U(y)/ε} ∫_a^y e^{−U(z)/ε} dz dy.
(12.14)    τ(x) ≈ ( 2π / √( |U''(x_s)| U''(x_−) ) ) exp(δU/ε)

for any x ≤ x_s − δ_0, where δU = U(x_s) − U(x_−) and δ_0 is a positive constant.
Equation (12.14) tells us that the transition time is exponentially large in δU/ε. This is typical for rare events. The transition time is insensitive to where the particle starts: even if the particle starts close to x_s, most likely it will first relax to the neighborhood of x_− and then transit to x_+. The choice of the first passage point after x_s does not affect the final result much.
In the case considered above, it is natural to define the transition rate

(12.15)    k = 1/τ(x_−) = ( √( |U''(x_s)| U''(x_−) ) / 2π ) exp(−δU/ε).

This is the celebrated Kramers reaction rate formula, also referred to as Arrhenius's law of reaction rates. Here δU is called the activation energy.
In the multidimensional case, when x_s is an index-one saddle point, one can also derive the reaction rate formula using similar ideas:

(12.16)    k = ( |λ_s| / 2π ) √( det H_− / |det H_s| ) exp(−δU/ε).

Here λ_s < 0 is the unique negative eigenvalue of the Hessian at x_s, H_s = ∇²U(x_s), H_− = ∇²U(x_−). Interested readers are referred to [Sch80] for more details.
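Formulas (12.12) and (12.15) can be compared numerically for the double-well potential U(x) = (x² − 1)²/4 from the figure (our own sketch; the boundary points a, b and the value of ε are our choices):

```python
import numpy as np

eps = 0.05
U = lambda x: 0.25 * (x**2 - 1.0)**2
a, b = -2.5, 1.0        # reflecting at a, absorbing at b beyond the saddle

z = np.linspace(a, b, 20001)
w = np.exp(-U(z) / eps)
# inner[i] = ∫_a^{z_i} exp(-U/eps) dz, by the trapezoidal rule
inner = np.concatenate(([0.0],
                        np.cumsum(0.5 * (w[1:] + w[:-1]) * np.diff(z))))

def tau(x):
    """Mean first passage time (12.12), evaluated by quadrature."""
    m = z >= x
    f = np.exp(U(z[m]) / eps) * inner[m]
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z[m]))) / eps

# Kramers (12.15): here x_- = -1, x_s = 0, U''(-1) = 2, |U''(0)| = 1,
# and the activation energy is dU = U(0) - U(-1) = 1/4.
tau_kramers = 2.0 * np.pi / np.sqrt(2.0) * np.exp(0.25 / eps)
ratio = tau(-1.0) / tau_kramers
```

Already at ε = 0.05 the quadrature value of τ(x_−) agrees with the Kramers asymptotics to within a modest relative error, and the agreement improves as ε decreases.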
The proof of Theorem 12.1 is beyond the scope of this book. We will
give a formal derivation using path integrals.
It is obvious that the one-dimensional SDE

dX_t^ε = √ε dW_t,    X_0 = 0,

has the solution X_t^ε = √ε W_t. Using the path integral representation, the probability distribution induced by {X_t^ε} on C[0, T] can be formally written as

(12.17)    dP^ε[ϕ] = Z^{−1} exp( −(1/2ε) ∫_0^T |ϕ̇(s)|² ds ) Dϕ = Z^{−1} exp( −(1/ε) I_T[ϕ] ) Dϕ.
This gives the statement of the theorem for the simple case in (12.4). Now let us consider the general case:

dX_t^ε = b(X_t^ε) dt + √ε σ(X_t^ε) dW_t,    X_0 = y.

To understand why the action functional I_T has the form (12.8)–(12.9), we can reason formally as follows. From the SDE we have Ẇ_t = ε^{−1/2} σ^{−1}(X_t^ε)(Ẋ_t^ε − b(X_t^ε)). Hence

∫_0^T |Ẇ_t|² dt = ε^{−1} ∫_0^T |σ^{−1}(X_t^ε)(Ẋ_t^ε − b(X_t^ε))|² dt.

Using the formal expression (12.17) for the distribution induced by √ε W_t, we obtain

dP^ε[ϕ] = Z^{−1} exp( −(1/ε) I_T[ϕ] ) Dϕ.
From Theorem 12.1, we have

−ε log P^ε(F) ∼ inf_{ϕ∈F} I_T[ϕ],    ε → 0,

for sets of paths F in C[0, T]. We can define the optimal path in F as the path ϕ* that has the maximum probability, or minimal action:

I_T[ϕ*] = inf_{ϕ∈F} I_T[ϕ].
action of the noise, and ϕ_2 simply follows the steepest descent dynamics and therefore does not require help from the noise. Both satisfy

(12.23)    ϕ̇(s) = ±∇U(ϕ(s)).

Paths that satisfy this equation are called minimum energy paths (MEP). One can write (12.23) equivalently as

(12.24)    (∇U(ϕ))^⊥ = 0,

where (∇U(ϕ))^⊥ denotes the component of ∇U(ϕ) normal to the curve described by ϕ.
An important feature of the minimum energy paths is that they pass
through saddle points, more precisely, index-one saddle points, i.e., saddle
points at which the Hessian of U has one negative eigenvalue. One can imagine
the following coarse-grained picture: The configuration space is mapped
into a network where the basins of the local minima are the nodes of the
network, neighboring basins are connected by edges, and each edge is associated
with an index-one saddle point. At an exponentially long time scale, the dynamics
of the diffusion process is effectively described by a Markov chain on
this network, with hopping rates computed by a formula similar to (12.15).
For a concrete application of this reduction in micromagnetics, the readers
are referred to [ERVE03].
To implement (12.26), one can proceed in the following two simple steps:

Step 1. Evolution of the steepest descent dynamics. For example, one
can simply apply the forward Euler scheme

\tilde\varphi_i^{n+1} = \varphi_i^n - \Delta t\, \nabla U(\varphi_i^n)

or Runge-Kutta type schemes for one or several steps.

Step 2. Reparameterization of the string. One can redistribute the
points \{\tilde\varphi_i^{n+1}\} according to equal arc length (or another parameterization).
The positions of the new points along the string can be obtained by simple interpolation:

a) Define s_0 = 0, s_i = s_{i-1} + |\tilde\varphi_i^{n+1} - \tilde\varphi_{i-1}^{n+1}| for i = 1, \ldots, N, and \tilde\alpha_i = s_i / s_N.

b) Interpolate \varphi_i^{n+1} at \alpha_i = i/N using the data \{\tilde\alpha_i, \tilde\varphi_i^{n+1}\}. Here \tilde\varphi_i^{n+1} is the value of \varphi^{n+1} at \tilde\alpha_i.

Here we have assumed that the arc length is normalized so that
the total normalized arc length is 1.
A simple illustration of the application of the string method is shown in
Figure 12.2.
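The two-step iteration above is easy to prototype. The following sketch applies it to a hypothetical test potential, not one from the book: the two-dimensional double well U(x, y) = (x² − 1)² + y², whose MEP is the segment of the x-axis joining the minima (±1, 0) through the saddle (0, 0).

```python
import numpy as np

def grad_U(p):
    # gradient of the illustrative double-well U(x, y) = (x^2 - 1)^2 + y^2
    x, y = p[:, 0], p[:, 1]
    return np.stack([4.0 * x * (x**2 - 1.0), 2.0 * y], axis=1)

def string_step(phi, dt=0.01):
    # Step 1: one forward Euler step of the steepest descent dynamics
    phi = phi - dt * grad_U(phi)
    # Step 2: redistribute the images to equal arc length by interpolation
    seg = np.linalg.norm(np.diff(phi, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    alpha_tilde = s / s[-1]                      # normalized arc lengths of current images
    alpha = np.linspace(0.0, 1.0, len(phi))      # target: uniform parameterization
    return np.stack([np.interp(alpha, alpha_tilde, phi[:, k])
                     for k in range(phi.shape[1])], axis=1)

# initial string: straight line connecting the two minima, offset in y
N = 21
phi = np.stack([np.linspace(-1.0, 1.0, N), np.full(N, 0.1)], axis=1)
for _ in range(2000):
    phi = string_step(phi)
# phi now approximates the MEP; its midpoint lies near the saddle (0, 0)
```

The converged string passes through the index-one saddle point, consistent with the discussion above.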
M = \sup_{x \in D} |b(x)|,

where \phi(x, t|y) is defined in (12.7). Since \phi(x, t|y) satisfies (12.5), is a
monotonically decreasing function of t, and is bounded from below if y is a
stable steady state of the dynamics \dot x = b(x), we obtain a Hamilton-Jacobi
equation for V(x; y):

(12.30) H(x, \nabla_x V) = 0,
256 12. Rare Events
which yields

0 = \frac{d}{d\theta}\Big[ L(\varphi, \theta^{-1}\dot\varphi)\,\theta \Big]\Big|_{\theta=1} = L(\varphi, \dot\varphi) - \frac{\partial L}{\partial \dot\varphi} \cdot \dot\varphi = -H(\varphi, p).
That is why the minimum action path starting from any critical point of
the dynamics \dot x = b(x) is also called the zero energy path.

For gradient systems, the potential energy barrier quantifies the difficulty
of transitions. The quasipotential plays a similar role for non-gradient systems.
We show this by revisiting the exit problem. Let y = 0 be a stable fixed
point of the dynamical system \dot x = b(x) and let D be a smooth subdomain of the
domain of attraction of 0 which contains 0 in its interior. Consider
the diffusion process X^\varepsilon, (12.3), initiated at X_0^\varepsilon = 0. It is obvious that
V(0; 0) = 0 by definition. We are interested in the first exit time of X^\varepsilon
from D, T_\varepsilon, and the most probable exit point on \partial D, X_{T_\varepsilon}^\varepsilon, as \varepsilon \to 0.
12.6. Quasipotential and Energy Landscape 257
Theorem 12.3. Assume that (b(x), \hat n(x)) < 0 for x \in \partial D, where \hat n is the
outward normal of \partial D. Then

\lim_{\varepsilon \to 0} \varepsilon \log E T_\varepsilon = \min_{x \in \partial D} V(x; 0),

and if there exists a unique minimizer x_0 of V on \partial D, i.e.,

V(x_0; 0) = \min_{x \in \partial D} V(x; 0),

then for any \delta > 0,

P\big( |X_{T_\varepsilon}^\varepsilon - x_0| < \delta \big) \to 1 \quad as\ \varepsilon \to 0.
where the first equality is the minimization with respect to the paths
(\varphi(t), p(t)) in R^{2d} that connect y and x such that H(\varphi, p) = 0, and the
second equality is with respect to the paths \varphi(t) in R^d that connect y and
x such that p = \partial L/\partial\dot\varphi and H(\varphi, p) = 0 [Arn89].

In this formulation, let us choose a parametrization such that \alpha \in [0, 1]
and \varphi \in (C[0,1])^d. From the constraints, we obtain

(12.34) H(\varphi, p) = b(\varphi) \cdot p + \frac{1}{2}|p|^2 = 0, \qquad p = \frac{\partial L}{\partial \dot\varphi} = \lambda \varphi'(\alpha) - b(\varphi),

where \lambda \ge 0 results from the change of parameter from t to \alpha. Solving
(12.34), we get

(12.35) \lambda = \frac{|b(\varphi)|}{|\varphi'|}, \qquad p = \frac{|b(\varphi)|}{|\varphi'|}\, \varphi' - b(\varphi).
Thus

(12.36) V(x; y) = \inf_{\varphi(0)=y,\ \varphi(1)=x} \hat I[\varphi],

where

(12.37) \hat I[\varphi] = \int_0^1 \big( |b(\varphi)|\,|\varphi'| - b(\varphi) \cdot \varphi' \big)\, d\alpha.

The formulation (12.36)–(12.37) is preferable in numerical computations
since it removes the additional optimization with respect to time and can
be realized in an intrinsic way in the space of curves.
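The geometric formulation is also easy to check numerically. In the following sketch (an illustrative one-dimensional gradient example, with a made-up potential), b = −U′, and any monotone discretized path from the minimum −1 to the saddle 0 gives Î = 2ΔU, twice the barrier height, since the geometric action is parameterization independent.

```python
import numpy as np

def U(x):
    # illustrative double-well potential with minima at x = ±1, saddle at 0
    return (x**2 - 1.0)**2 / 4.0

def b(x):
    # gradient dynamics, b = -U'
    return -x * (x**2 - 1.0)

def geometric_action(phi):
    # trapezoidal discretization of (12.37) in one dimension
    alpha = np.linspace(0.0, 1.0, len(phi))
    dphi = np.gradient(phi, alpha)
    integrand = np.abs(b(phi)) * np.abs(dphi) - b(phi) * dphi
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(alpha)))

phi = np.linspace(-1.0, 0.0, 2001)   # a monotone path from the minimum to the saddle
I_hat = geometric_action(phi)
# expected value: 2 * (U(0) - U(-1)) = 2 * (1/4 - 0) = 0.5
```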
This result can also be obtained from a direct calculation.
where \psi(\alpha) := \varphi(\tau(\alpha)) and \tau : [0,1] \to [0,t] is any differentiable and
monotonically increasing parameterization. The equality holds if and only
if |b(\varphi)| = |\dot\varphi|. This proves

\inf_{t>0}\ \inf_{\varphi(0)=y,\ \varphi(t)=x} I_t[\varphi] \ge \inf_{\psi(0)=y,\ \psi(1)=x} \hat I[\psi].
Consider again the exit problem from the basin of attraction of the
metastable state y = 0. We suppose that the quasipotential can be written as
V(x; 0) = 2U(x), where V satisfies the steady Hamilton-Jacobi equation. Define

(12.40) l(x) = b(x) + \nabla U(x).

We have \nabla U(x) \cdot l(x) = 0 from the Hamilton-Jacobi equation (12.30). Assume
U(0) = 0 and thus U \ge 0 elsewhere. Using strategies similar to the
proof of Lemma 12.2, we get

I_T[\varphi] = \frac{1}{2} \int_0^T |\dot\varphi - \nabla U(\varphi) - l(\varphi)|^2\, dt + 2 \int_0^T \dot\varphi \cdot \nabla U(\varphi)\, dt \ge 2(U(x) - U(0)),

where the equality holds only when

\dot\varphi = \nabla U(\varphi) + l(\varphi).
This derivation shows that the quasipotential with respect to 0 has the form

V(x; 0) = 2(U(x) - U(0)),

and the most probable path from 0 to x is not of a pure gradient type such
as \dot\varphi = \nabla U(\varphi) but involves the non-gradient term l(\varphi). The decomposition
(12.40) suggests that the quasipotential plays the role of a Lyapunov
function for the deterministic dynamics \dot x = b(x):

\frac{dU(x(t))}{dt} = \nabla U(x) \cdot \frac{dx}{dt} = \nabla U(x) \cdot \big( -\nabla U(x) + l(x) \big) = -|\nabla U(x)|^2 \le 0.
The quasipotential measures the action required to hop from one point to
another. It depends on the choice of the starting point y, which is usually
taken to be a stable fixed point of b. In this sense, the V(x; y) should be thought
of as local quasipotentials of the system if it has more than one stable fixed
point. On the other hand, we can introduce another potential by defining

(12.41) U(x) = -\lim_{\varepsilon \to 0} \varepsilon \log p_\infty^\varepsilon(x).

This definition of U does not depend on the starting point y; U is called
the global quasipotential of the dynamics (12.3). Note that the two limits
cannot be exchanged:

U(x) = -\lim_{\varepsilon \to 0} \lim_{t \to \infty} \varepsilon \log p^\varepsilon(x, t|y) \ne -\lim_{t \to \infty} \lim_{\varepsilon \to 0} \varepsilon \log p^\varepsilon(x, t|y) = V(x; y).

However, it can be shown that the local functions V(x; y) can be pruned
and stuck together to form the global quasipotential U(x). The readers
may refer to [FW98, ZL16] for more details.
Exercises
12.1. Assume that x_s is an extremal point of the potential function U
and that the Hessian H_s = \nabla^2 U(x_s) has n distinct eigenvalues
\{\lambda_i\}_{i=1,\ldots,n} and orthonormal eigenvectors \{v_i\}_{i=1,\ldots,n} such that
H_s v_i = \lambda_i v_i. Prove that the (x_s, v_i) are steady states of the
following dynamics:

\dot x = (I - 2\hat v \otimes \hat v) \cdot (-\nabla U(x)),
\dot v = (I - \hat v \otimes \hat v) \cdot (-\nabla^2 U(x)\, v),

imposing the normalization condition |v(0)| = 1 on the
initial value v(0). Prove that (x_s, v_i) is linearly stable if and
Notes
The study of rare events is a classical but still fast-developing field. The
classical transition state theory has been a cornerstone in chemical physics
[HPM90]. The latest developments include the transition path theory
[EVE10], Markov state models [BPN14], and advanced sampling methods
in molecular dynamics and Monte Carlo methods.
From a mathematical viewpoint, the large deviation theory is the starting
point for analyzing rare transition events. In this regard, the Freidlin-Wentzell
theory provides the theoretical basis for understanding optimal
transition paths [FW98, SW95, Var84]. In particular, the notion of quasipotential
is useful for extending the concept of energy landscape, very important
for gradient systems, to non-gradient systems [Cam12, LLLL15].
From an algorithmic viewpoint, the string method is the most popular
path-based algorithm for computing optimal transition paths. Another
closely related approach is the nudged elastic band (NEB) method [HJ00],
which is the most popular chain-of-state algorithm. The convergence analysis
of the string method can be found in [CKVE11]. Besides GAD, other
ideas for finding saddle points include the dimer method [HJ99], the shrinking
dimer method [ZD12], the climbing string method [RVE13], etc.
Chapter 13
Introduction to
Chemical Reaction
Kinetics
In this chapter, we will study a class of jump processes that are often used
to model the dynamics of chemical reactions, particularly biochemical reactions.
We will consider only the homogeneous case; i.e., we will assume that
the reactants are well mixed in space, and we will neglect the spatial aspects of
the problem.

Chemical reactions can be modeled at several levels. At the most detailed
level, we are interested in the mechanism of the reactions and the
reaction rates. This is the main theme of the previous chapter, except that
to model the system accurately enough, one often has to resort to quantum
mechanics. At the crudest level, we are interested in the concentrations of the
reactants, the products, and the intermediate states. In engineering applications
such as combustion, where one often has an abundance of participating
species, the dynamics of the concentrations is usually modeled by the reaction
rate equations, a system of ODEs for the concentrations
of the different species. The rate equations model the result of averaging over
many reaction events. However, in situations when discrete or stochastic
effects are important, e.g., when there are only a limited number of some
of the crucial species, one has to model each individual molecule and
reaction event. This is often the case in cell biology, where there are usually
a limited number of relevant proteins or RNAs. In this case, the preferred
model is a Markov jump process. As we will see, in the large volume
limit, one recovers the reaction rate equations from the jump process.
where c_1, c_2, c_3, c_4 are reaction constants and the symbol \emptyset denotes species
that we are not interested in. For example, for the reaction
2S_1 \to S_2, two S_1 molecules are consumed and one S_2 molecule is generated
after each firing of the forward reaction; two S_1 molecules are generated and
one S_2 molecule is consumed after each firing of the reverse reaction. This
is an example of a reversible reaction. The other reactions in this system are
irreversible.
Denote by x_j the concentration of the species S_j. The law of mass
action states that the rate of a reaction is proportional to the reaction constant
and the concentrations of the participating molecules [KS09]. This gives us
the following reaction rate equations for the above decaying-dimerization
reaction:

(13.1) \frac{dx_1}{dt} = -c_1 x_1 - 2c_2 x_1^2 + 2c_3 x_2,

(13.2) \frac{dx_2}{dt} = c_2 x_1^2 - c_3 x_2 - c_4 x_2,

(13.3) \frac{dx_3}{dt} = c_4 x_2.

The term -2c_2 x_1^2 + 2c_3 x_2 in (13.1) and the term c_2 x_1^2 - c_3 x_2 in (13.2) are
the consequence of the reversible reaction discussed above. Notice that, as a
consequence of the reversibility of this particular reaction, its contributions
cancel in the combination

\frac{dx_2}{dt} + \frac{1}{2} \frac{dx_1}{dt}.
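The cancellation just noted is easy to verify numerically. In the sketch below (rate constants are arbitrary illustrative values), (13.1)–(13.3) are integrated with only the reversible pair 2S₁ ⇌ S₂ active; the combination x₂ + x₁/2 is then conserved by the dynamics, and in fact exactly by the forward Euler scheme, since the invariant is linear.

```python
import numpy as np

def rhs(x, c1, c2, c3, c4):
    # right-hand side of the reaction rate equations (13.1)-(13.3)
    x1, x2, x3 = x
    return np.array([-c1 * x1 - 2.0 * c2 * x1**2 + 2.0 * c3 * x2,
                     c2 * x1**2 - c3 * x2 - c4 * x2,
                     c4 * x2])

def integrate(x0, c, dt=1e-4, T=1.0):
    # forward Euler with illustrative rate constants c
    x = np.array(x0, dtype=float)
    for _ in range(int(T / dt)):
        x = x + dt * rhs(x, *c)
    return x

# only the reversible pair 2 S1 <-> S2 is active (c1 = c4 = 0),
# so x2 + x1/2 is conserved
x = integrate([1.0, 0.0, 0.0], c=(0.0, 1.0, 1.0, 0.0))
conserved = x[1] + 0.5 * x[0]   # stays at its initial value 0.5
```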
Note that the constants kj in (13.6) are generally diﬀerent from cj in the
reaction rate equations. We will come back to this in Section 13.5.
|\nu_j^-| = \sum_{m=1}^N \nu_j^{-,m}, \qquad \nu_j^-! = \prod_{m=1}^N \nu_j^{-,m}!, \qquad x^{\nu_j^-} = \prod_{m=1}^N x_m^{\nu_j^{-,m}}.
P(x, t+dt\,|\,x_0, t_0) = \sum_{j=1}^M P(x - \nu_j, t\,|\,x_0, t_0)\, a_j(x - \nu_j)\, dt + \Big( 1 - \sum_{j=1}^M a_j(x)\, dt \Big) P(x, t\,|\,x_0, t_0),
(13.9) X_t = X_0 + \sum_{j=1}^M \nu_j\, P_j\Big( \int_0^t a_j(X_s)\, ds \Big),

X_{t+dt} = X_t + \sum_{j=1}^M \nu_j \Big[ P_j\Big( \int_0^{t+dt} a_j(X_s)\, ds \Big) - P_j\Big( \int_0^t a_j(X_s)\, ds \Big) \Big].
From the independence and Markovianity of the P_j(\cdot), we know that the firing probability
of the jth reaction channel in [t, t+dt) is exactly a_j(X_t)\, dt, since X
remains at X_t before jumping when dt is an infinitesimal time. This explains
why (13.9) makes sense as the SDE describing the chemical reaction
kinetics. The readers are referred to [EK86] for rigorous statements.
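In practice, the jump process (13.9) is simulated by Gillespie's stochastic simulation algorithm (SSA). The following minimal sketch does this for a single reversible channel pair 2S₁ ⇌ S₂; the propensities and rate values are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative reversible system 2 S1 <-> S2
nu = np.array([[-2, 1],     # forward:  2 S1 -> S2
               [ 2, -1]])   # backward: S2 -> 2 S1

def propensities(x, kf=0.01, kb=1.0):
    return np.array([kf * x[0] * (x[0] - 1) / 2.0, kb * x[1]])

def ssa(x0, T):
    # Gillespie's direct method: exponential waiting time with total rate a0,
    # then channel j fires with probability a_j / a0
    t, x = 0.0, np.array(x0, dtype=int)
    while True:
        a = propensities(x)
        a0 = a.sum()
        if a0 == 0.0:
            return x
        t += rng.exponential(1.0 / a0)
        if t > T:
            return x
        j = rng.choice(len(a), p=a / a0)
        x = x + nu[j]

x = ssa([100, 0], T=10.0)
# each firing changes (x1, x2) by -/+ (2, -1), so x1 + 2 x2 = 100 throughout
```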
Another less direct but more general SDE formulation uses a Poisson
random measure:

(13.10) dX_t = \sum_{j=1}^M \nu_j \int_0^A c_j(a; X_{t-})\, \lambda(dt \times da), \qquad X_t|_{t=0} = X_0.

Here A := \max_{X_t}\{a_0(X_t)\}, which is assumed to be finite for a closed system,
and \lambda(dt \times da) is called the reference Poisson random measure associated
with a Poisson point process (q_t, t \ge 0) taking values in (0, A]. That is,
for any Borel set B \subset (0, A] and 0 \le s < t. It is assumed that \lambda(dt \times da)
satisfies E\lambda(dt \times da) = dt \times da. The firing rates for the M reaction channels are
defined through the acceptance-rejection functions

(13.12) c_j(a; X_t) = 1 if a \in (h_{j-1}(X_t), h_j(X_t)], and 0 otherwise, \qquad j = 1, 2, \ldots, M,

where h_j(x) = \sum_{i=1}^j a_i(x) for j = 0, 1, \ldots, M. One can see from the above
constructions that X_t satisfies the same law for the chemical reaction kinetics.
For more on Poisson random measures and SDEs for jump processes,
see [App04, Li07].
where

(13.14) a_j^V(x) = \tilde a_j(x) + O(V^{-1}), \qquad \tilde a_j(x) = \frac{\kappa_j}{\nu_j^-!}\, x^{\nu_j^-} = \kappa_j \prod_{m=1}^N \frac{x_m^{\nu_j^{-,m}}}{\nu_j^{-,m}!},

and a_j^V(x), \tilde a_j(x) \sim O(1) if \kappa_j, x \sim O(1). For simplicity, we will only
consider the case when a_j^V(x) = \tilde a_j(x). The general case can be analyzed
similarly.
Consider the rescaled process X_t^V = X_t / V, i.e., the concentration variable,
with the original jump intensities a_j(X_t). We have

(13.15) X_t^V = X_0^V + \int_0^t b(X_s^V)\, ds + M_t^V,

where

b(x) := \sum_{j=1}^M \nu_j \tilde a_j(x), \qquad M_t^V := \sum_{j=1}^M \nu_j V^{-1} \tilde P_j\Big( V \int_0^t \tilde a_j(X_s^V)\, ds \Big),

and \tilde P_j(t) := P_j(t) - t is the centered Poisson process with mean 0. We will
assume that b(x) satisfies the Lipschitz condition |b(x) - b(y)| \le L|x - y|.
Define the deterministic process x(t):

(13.16) x(t) = x_0 + \int_0^t b(x(s))\, ds.
We have

|X_t^V - x(t)| \le |X_0^V - x_0| + L \int_0^t |X_s^V - x(s)|\, ds + |M_t^V|.

Using Gronwall's inequality, we obtain

\sup_{s \in [0,t]} |X_s^V - x(s)| \le \Big( |X_0^V - x_0| + \sup_{s \in [0,t]} |M_s^V| \Big) e^{Lt}

for any t > 0. From the law of large numbers for Poisson processes (see
Exercise 13.1), we immediately get \sup_{s \in [0,t]} |X_s^V - x(s)| \to 0 almost surely
as V \to \infty, provided the initial condition X_0^V \to x_0. This is the celebrated large
volume limit for chemical reaction kinetics, also known as the Kurtz limit
in the literature [Kur72, EK86].
By comparing (13.16) with the reaction rate equations in Section 13.1,
we also obtain the relation

(13.17) c_j = \frac{\kappa_j}{\nu_j^-!} = \frac{k_j}{\nu_j^-!}\, V^{|\nu_j^-| - 1}

between k_j, \kappa_j, and c_j. This connects the definitions of the rate constants at
different scales.
The limit from (13.15) to (13.16) can also be derived using asymptotic
analysis of the backward operator introduced in Section 8.9. To this end,
we first note that the backward operator for the original process X_t has the
form

\mathcal{A} f(x) = \sum_{j=1}^M a_j(x) \big( f(x + \nu_j) - f(x) \big).

Under the same assumption as (13.13) and with the ansatz u(x, t) = u_0(x, t)
+ V^{-1} u_1(x, t) + o(V^{-1}), we get, to leading order,

\partial_t u_0(x, t) = \mathcal{A}_0 u_0 = \sum_{j=1}^M \tilde a_j(x)\, \nu_j \cdot \nabla u_0(x, t).

This is the backward equation for the reaction rate equations (13.16).
Y_t^V = Y_0^V + \sum_{j=1}^M \int_0^t \nu_j\, \nabla \tilde a_j(x(s)) \cdot Y_s^V\, ds + \sum_{j=1}^M \nu_j\, W_j\Big( \int_0^t \tilde a_j(x(s))\, ds \Big) + h.o.t.

This leads to a central limit theorem for the chemical jump process, Y_t^V \Rightarrow
Y(t), where Y satisfies the following linear diffusion process:

(13.19) \frac{dY}{dt} = \sum_{j=1}^M \nu_j\, \nabla \tilde a_j(x(t)) \cdot Y(t) + \sum_{j=1}^M \nu_j \sqrt{\tilde a_j(x(t))}\, \dot W_j(t), \qquad Y_0 = 0,

where we have used the definition Z^V = Z/V. This is the chemical Langevin
equation in the chemistry literature [Gil00]. It can be shown that as V \to
\infty, Z_t^V \to x(t) and V^{1/2}(Z_t^V - x(t)) \to Y_t on a fixed time interval [0, T]. In
this sense, Z^V approximates X^V at the same order as x(t) + V^{-1/2} Y_t.

13.7. The Tau-leaping Algorithm
(13.25) X_{t+\tau} = X_t + \sum_{j=1}^M \nu_j\, P_j(a_j(X_t)\tau),

where the P_j(a_j(X_t)\tau) are independent Poisson random variables with parameters
\lambda_j = a_j(X_t)\tau, respectively. The idea behind the tau-leaping
algorithm is that if the reaction rate a_j(X_t) is almost constant during
[t, t + \tau), then the number of fired reactions in the jth reaction channel will
be approximately P_j(a_j(X_t)\tau).
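A minimal sketch of (13.25) for an illustrative reversible system follows. The crude non-negativity guard is not part of the scheme itself; avoiding negative populations robustly is the subject of the literature cited in the Notes.

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative system: 2 S1 <-> S2, with made-up rates
nu = np.array([[-2, 1], [2, -1]])

def propensities(x, kf=0.01, kb=1.0):
    return np.array([kf * x[0] * (x[0] - 1) / 2.0, kb * x[1]])

def tau_leap(x0, tau, n_steps):
    # (13.25): freeze propensities over [t, t + tau) and draw the number of
    # firings of channel j as an independent Poisson(a_j(X_t) * tau) variable
    x = np.array(x0, dtype=int)
    for _ in range(n_steps):
        k = rng.poisson(propensities(x) * tau)
        x = x + k @ nu
        x = np.maximum(x, 0)   # crude guard against negative populations
    return x

x = tau_leap([1000, 0], tau=0.01, n_steps=1000)
```

Each step costs one Poisson draw per channel instead of one exponential clock per reaction event, which is why tau-leaping is much cheaper than the SSA when propensities are large.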
To make the tau-leaping idea a practical scheme, one needs a robust
step-size selection strategy. For this purpose, note that the state after one
tau-leaping step is, on average,

X \to X + \sum_{j=1}^M \nu_j a_j(X)\tau := X + \tau \xi.
It can be shown that the invariant distribution is unique and is the multinomial
distribution

p(x) = \frac{n!}{x_1! x_2! \cdots x_N!}\, c_1^{x_1} c_2^{x_2} \cdots c_N^{x_N},

where x_i is the number of molecules of species S_i, n is the total
number of molecules of all species, and c = (c_1, \ldots, c_N) satisfies

c \cdot Q = 0

with the constraint \sum_{i=1}^N c_i = 1. Here Q \in R^{N \times N} is the generator

Q = \begin{pmatrix}
-k_{1+} & k_{1+} & \cdots & 0 & 0 \\
k_{1-} & -(k_{1-} + k_{2+}) & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & -(k_{(N-2)-} + k_{(N-1)+}) & k_{(N-1)+} \\
0 & 0 & \cdots & k_{(N-1)-} & -k_{(N-1)-}
\end{pmatrix}.
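The stationarity condition c · Q = 0 is easy to verify numerically for a tridiagonal generator of this form. The sketch below uses random illustrative rates and one consistent indexing of the k_{j±}; it finds the null vector of Qᵀ and checks the birth-death detailed-balance relation c_j k_{j+} = c_{j+1} k_{j−}.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 5
kp = rng.uniform(0.5, 2.0, N - 1)   # k_{j+}, illustrative random rates
km = rng.uniform(0.5, 2.0, N - 1)   # k_{j-}

# tridiagonal generator as in the text: upper diagonal k_{j+}, lower k_{j-}
Q = np.zeros((N, N))
for j in range(N - 1):
    Q[j, j + 1] = kp[j]
    Q[j + 1, j] = km[j]
Q -= np.diag(Q.sum(axis=1))         # each row sums to zero

# stationary vector: c Q = 0 with sum(c) = 1, via the null space of Q^T
w, v = np.linalg.eig(Q.T)
c = np.real(v[:, np.argmin(np.abs(w))])
c = c / c.sum()
# detailed balance for this birth-death chain: c_j k_{j+} = c_{j+1} k_{j-}
```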
k_1 = k_2 = 10^5, \quad k_3 = k_4 = 1, \quad k_5 = k_6 = 10^5.
13.9. Multiscale Analysis of a Chemical Kinetic System 273
Let us partition the reactions into two sets: the slow ones and the fast ones.
The fast reactions are R^f = \{(a^f, \nu^f)\}.

(13.36) \partial_t u(x, t) = \sum_{j=1}^M a_j(x) \big( u(x + \nu_j, t) - u(x, t) \big) = \mathcal{L} u(x, t).

A slow variable is a linear combination b \cdot x whose coefficient vector b satisfies b \cdot \nu_j^f = 0
for all \nu_j^f. The set of such vectors forms a linear subspace in R^N. Let
{b1 , b2 , . . . , bJ } be a set of basis vectors of this subspace, and let
(13.41) z_j = b_j \cdot x \quad for\ j = 1, \ldots, J.
Let z = (z1 , . . . , zJ ) and denote by Z the space over which z is deﬁned.
The stoichiometric vectors associated with these slow variables are naturally
deﬁned as
(13.42) \bar\nu_i^s = (b_1 \cdot \nu_i^s, \ldots, b_J \cdot \nu_i^s), \quad i = 1, \ldots, N_s.
We now derive the effective dynamics on the slow time scale using singular
perturbation theory. We assume, for the moment, that z is a complete
set of slow variables; i.e., for each fixed value of the slow variable z, the
virtual fast process admits a unique equilibrium distribution \mu_z(x) on the
state space x \in \mathcal{X}.
The backward Kolmogorov equation for the multiscale chemical kinetic
system reads

(13.43) \partial_t u = \mathcal{L}_0 u + \frac{1}{\varepsilon} \mathcal{L}_1 u,
where \mathcal{L}_0 and \mathcal{L}_1 are the infinitesimal generators associated with the slow
and fast reactions, respectively. For any f : \mathcal{X} \to R,

(13.44) (\mathcal{L}_0 f)(x) = \sum_{j=1}^{M_s} a_j^s(x) \big( f(x + \nu_j^s) - f(x) \big), \qquad (\mathcal{L}_1 f)(x) = \sum_{j=1}^{M_f} a_j^f(x) \big( f(x + \nu_j^f) - f(x) \big),
where

(13.53) \bar a_i^s(z) = (P a_i^s)(z) = \sum_{x \in \mathcal{X}} a_i^s(x)\, \mu_z(x).

Equation (13.53) gives the effective rates on the slow time scale. The effective
reaction kinetics on this time scale are given in terms of the slow
variables by

(13.55) \partial_t u_0(x, t) = \sum_{i=1}^{M_s} \tilde a_i^s(x) \big( u_0(x + \nu_i^s, t) - u_0(x, t) \big),
where \tilde a_i^s(x) = \bar a_i^s(b \cdot x), in the sense that if u_0(x, t = 0) = f(b \cdot x), then

(13.56) u_0(x, t) = \bar U(b \cdot x, t),

where \bar U(z, t) solves (13.52) with the initial condition \bar U(z, 0) = f(z). Note
that in (13.55), all of the quantities can be formulated over the original state
space using the original variables. This fact is useful for developing numerical
algorithms for this kind of problem [CGP05, ELVE05b, ELVE07], which
will be introduced in the next section.
From a numerical viewpoint, parallel to the discussion in the last section,
one would like to have algorithms that automatically capture the slow
dynamics of the system, without the need to precompute the effective dynamics
or make ad hoc approximations. We will discuss an algorithm proposed
in [ELVE05b] which is a simple modification of the original SSA, adding
a nested structure according to the time scales of the rates. For simple
systems with only two time scales, the nested SSA consists of two SSAs
organized with one nested inside the other: an outer SSA for the slow reactions
only, but with modified slow rates which are computed in an inner SSA
that models the fast reactions only. Averaging is performed automatically
"on the fly".
where X_\tau^k is the result of the kth replica of the virtual fast process
at virtual time \tau, whose initial value is X_{t=0}^k = X_n, and T_0 is a
parameter chosen to minimize the effect of the transients
to the equilibrium in the virtual fast process.

• Outer SSA: Run one step of the SSA for the modified slow reactions
\tilde R^s = (\tilde a^s, \nu^s) to generate (t_{n+1}, X_{n+1}) from (t_n, X_n).

• Repeat the above nested procedure until the final time T.

Note that the algorithm is formulated in the original state space. Obviously
it can be extended to situations with more time scales. Error and
cost analysis can also be performed. See [ELVE05b, ELVE07].
Exercises
13.1. Prove the strong law of large numbers for the unit-rate Poisson
process P(t) using Doob's martingale inequality; i.e., for any
fixed T > 0,

\lim_{n \to \infty} \sup_{s \le T} n^{-1} |\tilde P(ns)| = 0, \quad a.s.,

where \tilde P(t) := P(t) - t is the centered process.
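A numerical illustration of this law of large numbers (not a proof): the sup-deviation of the centered unit-rate Poisson process decays toward 0, roughly like n^{-1/2}.

```python
import numpy as np

rng = np.random.default_rng(3)

def sup_deviation(n, T=1.0):
    # simulate a unit-rate Poisson path on [0, nT] from exponential gaps and
    # return sup_{s <= T} n^{-1} |P(ns) - ns|
    total = n * T
    arrivals = np.cumsum(rng.exponential(1.0, size=int(2 * total) + 100))
    arrivals = arrivals[arrivals <= total]
    k = np.arange(1, len(arrivals) + 1)
    # the sup is attained at jump times (right limit k, left limit k - 1)
    # or at the right endpoint of the interval
    dev = max(np.abs(k - arrivals).max(),
              np.abs(k - 1 - arrivals).max(),
              abs(len(arrivals) - total))
    return dev / n

devs = [sup_deviation(n) for n in (10**2, 10**4, 10**6)]
# devs decreases toward 0 as n grows
```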
Notes
Chemical reaction kinetics modeled by jump processes has become a fundamental
language in systems biology. See [ARM98, ELSS02, GHP13,
LX11], etc. In this chapter, we only considered cases with algebraic propensity
functions. In more general situations, such as enzymatic reactions,
the propensity functions may take the form of rational functions. The most
common example is the Michaelis-Menten kinetics. Readers may consult
[KS09] for more details.
The tau-leaping algorithm was proposed in 2001 [Gil01]. Since then,
much work has been done to improve its performance and robustness. These
efforts include robust step-size selection [CGP06], avoiding negative populations
[CVK05, MTV16, TB04], multilevel Monte Carlo methods
[AH12], etc.
h'(0) = 0. We
further assume that for any c > 0, there exists b > 0 such that
h(x) - h(0) \le -b if |x| \ge c.
Assume also that h(x) \to -\infty fast enough as |x| \to \infty to ensure the convergence
of F for t = 1.

Lemma A.1 (Laplace method). We have the asymptotics

(A.1) F(t) \sim \sqrt{2\pi}\, (-t h''(0))^{-1/2} \exp(t h(0)) \quad as\ t \to \infty,

where the equivalence f(t) \sim g(t) means that \lim_{t \to \infty} f(t)/g(t) = 1.

This formulation is what we will use in the large deviation theory. Its
abstract form in the infinite-dimensional setting is embodied in the so-called
Varadhan's lemma to be discussed later [DZ98, DS84, Var84].
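The asymptotics (A.1) can be checked numerically for a concrete h; the choice below is illustrative, with h(0) = 0 and h''(0) = −1, so (A.1) predicts F(t) ~ √(2π/t).

```python
import numpy as np

def h(x):
    # illustrative h with h(0) = 0, h'(0) = 0, h''(0) = -1
    return -x**2 / 2.0 - x**4

xs = np.linspace(-5.0, 5.0, 200001)

def F(t):
    # F(t) = int_R exp(t h(x)) dx by the trapezoidal rule on a wide grid
    y = np.exp(t * h(xs))
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(xs)))

# the ratio of F(t) to the prediction of (A.1) tends to 1 as t grows
ratios = [F(t) / np.sqrt(2.0 * np.pi / t) for t in (10.0, 100.0, 1000.0)]
```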
Proof. If h(x) = h''(0) x^2 / 2 with h''(0) < 0, we have

\int_R e^{t h(x)}\, dx = \sqrt{2\pi}\, (-t h''(0))^{-1/2}.

In general (taking h(0) = 0 without loss of generality), for any \epsilon > 0, there exists \delta > 0 such that if |x| \le \delta,

\Big| h(x) - \frac{h''(0)}{2} x^2 \Big| \le \epsilon x^2.

It follows that

\int_{[-\delta, \delta]} e^{t h(x)}\, dx \le \int_{[-\delta, \delta]} \exp\Big( \frac{t x^2}{2} (h''(0) + 2\epsilon) \Big)\, dx.

By assumption, for this \delta > 0, there exists \eta > 0 such that h(x) \le -\eta if
|x| \ge \delta. Thus

\int_{|x| \ge \delta} \exp\big( t h(x) \big)\, dx \le e^{-(t-1)\eta} \int_R e^{h(x)}\, dx \sim O(e^{-\alpha t}), \quad \alpha > 0, \ for\ t > 1.

Combining the two estimates,

F(t) \le \int_R \exp\Big( \frac{t x^2}{2} (h''(0) + 2\epsilon) \Big)\, dx - \int_{|x| \ge \delta} \exp\Big( \frac{t x^2}{2} (h''(0) + 2\epsilon) \Big)\, dx + O(e^{-\alpha t})
= \sqrt{2\pi}\, \big( t(-h''(0) - 2\epsilon) \big)^{-1/2} + O(e^{-\beta t}),

where we require \epsilon < -h''(0)/2 here.
The proof of the lower bound is similar. By the arbitrary smallness of
\epsilon, we have

\lim_{t \to \infty} F(t) \big/ \big( \sqrt{2\pi}\, (-t h''(0))^{-1/2} \big) = 1,

which completes the proof.
Definition A.2 (Large deviation principle). Let \mathcal{X} be a complete separable
metric space and let \{P_\varepsilon\}_{\varepsilon > 0} be a family of probability measures on the Borel
subsets of \mathcal{X}. We say that P_\varepsilon satisfies the large deviation principle if there
exists a rate functional I : \mathcal{X} \to [0, \infty] such that:

(i) For any \ell < \infty, the level set \{x : I(x) \le \ell\} is compact.

(ii) Upper bound. For each closed set F \subset \mathcal{X},

\limsup_{\varepsilon \to 0} \varepsilon \ln P_\varepsilon(F) \le -\inf_{x \in F} I(x).
Theorem A.3 (Varadhan's lemma). Suppose that P_\varepsilon satisfies the large deviation
principle with rate functional I(\cdot) and F \in C_b(\mathcal{X}). Then

(A.2) \lim_{\varepsilon \to 0} \varepsilon \ln \int_{\mathcal{X}} \exp\Big( \frac{1}{\varepsilon} F(x) \Big)\, P_\varepsilon(dx) = \sup_{x \in \mathcal{X}} \big( F(x) - I(x) \big).
B. Gronwall’s Inequality
Theorem B.1 (Gronwall's inequality). Assume that the function f : [0, \infty)
\to R_+ satisfies the inequality

f(t) \le a(t) + \int_0^t b(s) f(s)\, ds,

where a(t), b(t) \ge 0. Then we have

(B.1) f(t) \le a(t) + \int_0^t a(s) b(s) \exp\Big( \int_s^t b(u)\, du \Big)\, ds.

Proof. Let g(t) = \int_0^t b(s) f(s)\, ds. We have

g'(t) \le a(t) b(t) + b(t) g(t).

Define h(t) = g(t) \exp\big( -\int_0^t b(s)\, ds \big). We obtain
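Independently of the proof, the bound (B.1) can be checked numerically. With the illustrative choices below, f is taken to satisfy the hypothesis with equality, in which case (B.1) is attained up to discretization error.

```python
import numpy as np

# illustrative data: a(t) = 1 + sin^2(t), b(t) = 1 on [0, 3]
dt = 1e-3
ts = np.arange(0.0, 3.0, dt)
a = 1.0 + np.sin(ts)**2
b = np.ones_like(ts)

# let f satisfy the hypothesis with equality: f(t) = a(t) + int_0^t b f ds
f = np.empty_like(ts)
acc = 0.0
for n in range(len(ts)):
    f[n] = a[n] + acc
    acc += dt * b[n] * f[n]

# right-hand side of (B.1): a(t) + int_0^t a(s) b(s) exp(int_s^t b du) ds
B = np.cumsum(b) * dt
bound = a + np.array([np.sum(a[:n] * b[:n] * np.exp(B[n] - B[:n])) * dt
                      for n in range(len(ts))])
# f stays below the Gronwall bound (here with near equality)
```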
where we assume the arithmetic rules (2.7) on the extended reals R̄+ .
When the set function μ takes values in R̄ = [−∞, ∞] and only the
countable additivity condition is assumed, μ is called a signed measure.
The readers may refer to [Bil79, Cin11, Hal50] for more details.
D. Martingales
Consider the probability space (Ω, F , P) and a ﬁltration {Fn }n∈N or {Ft }t≥0 .
Definition D.1 (Martingale). A continuous time stochastic process X_t is
called an F_t-martingale if X_t \in L^1_\omega for any t, X_t is F_t-adapted, and

(D.1) E(X_t | F_s) = X_s \quad for\ all\ s \le t.

X_t is called a submartingale or supermartingale if the above equality is replaced
by \ge or \le, respectively. These concepts can be defined for discrete time stochastic
processes if we replace (D.1) by E(X_n | F_m) = X_m for any m \le n. Discrete
submartingales and supermartingales can be defined similarly.
E. Strong Markov Property 285
F. Semigroup of Operators
Let B be a Banach space equipped with the norm \|\cdot\|.

Definition F.1 (Operator semigroup). A family of bounded linear operators
\{S(t)\}_{t \ge 0} : B \to B forms a strongly continuous semigroup if for any
f \in B:

(i) S(0)f = f; i.e., S(0) = I.
(ii) S(t)S(s)f = S(s)S(t)f = S(t+s)f for any s, t \ge 0.
(iii) \|S(t)f - f\| \to 0 for any f \in B as t \to 0+.

We will call \{S(\cdot)\} a contraction semigroup if \|S(t)\| \le 1 for any t, where
\|S(t)\| is the operator norm induced by \|\cdot\|.
The closed graph theorem ensures that (\lambda I - A)^{-1} is a bounded linear
operator if \lambda \in \rho(A).

Theorem F.6 (Hille-Yosida theorem). Let A be a closed, densely defined
linear operator on B. Then A generates a contraction semigroup \{S(t)\}_{t \ge 0}
if and only if

\lambda \in \rho(A) \quad and \quad \|R_\lambda\| \le \frac{1}{\lambda} \quad for\ all\ \lambda > 0.
Interested readers are referred to [Eva10, Paz83, Yos95] for more details
on semigroup theory.
Bibliography
[AC08] A. Abdulle and S. Cirilli, SROCK: Chebyshev methods for stiff stochastic differential equations, SIAM J. Sci. Comput. 30 (2008), no. 2, 997–1014, DOI 10.1137/070679375. MR2385896
[ACK10] D. F. Anderson, G. Craciun, and T. G. Kurtz, Product-form stationary distributions for deficiency zero chemical reaction networks, Bull. Math. Biol. 72 (2010), no. 8, 1947–1970. MR2734052
[AGK11] D. F. Anderson, A. Ganguly, and T. G. Kurtz, Error analysis of tau-leap simulation methods, Ann. Appl. Probab. 21 (2011), no. 6, 2226–2262, DOI 10.1214/10-AAP756. MR2895415
[AH12] D. F. Anderson and D. J. Higham, Multilevel Monte Carlo for continuous time
Markov chains, with applications in biochemical kinetics, Multiscale Model. Simul.
10 (2012), no. 1, 146–179, DOI 10.1137/110840546. MR2902602
[AL08] A. Abdulle and T. Li, SROCK methods for stiﬀ Itô SDEs, Commun. Math. Sci.
6 (2008), no. 4, 845–868. MR2511696
[And86] H. L. Anderson, Metropolis, Monte Carlo and the MANIAC, Los Alamos Science
14 (1986), 96–108.
[App04] D. Applebaum, Lévy processes and stochastic calculus, Cambridge Studies in
Advanced Mathematics, vol. 93, Cambridge University Press, Cambridge, 2004.
MR2072890
[ARLS11] M. Assaf, E. Roberts, and Z. Luthey-Schulten, Determining the stability of genetic switches: Explicitly accounting for mRNA noise, Phys. Rev. Lett. 106 (2011), 248102.
[ARM98] A. Arkin, J. Ross, and H. H. McAdams, Stochastic kinetic analysis of developmental pathway bifurcation in phage lambda-infected Escherichia coli cells, Genetics 149 (1998), 1633–1648.
[Arn89] V. I. Arnold, Mathematical methods of classical mechanics, 2nd ed., translated from the Russian by K. Vogtmann and A. Weinstein, Graduate Texts in Mathematics, vol. 60, Springer-Verlag, New York, 1989. MR997295
[BB01] P. Baldi and S. Brunak, Bioinformatics, 2nd ed., The machine learning approach;
A Bradford Book, Adaptive Computation and Machine Learning, MIT Press,
Cambridge, MA, 2001. MR1849633
[KS76] J. G. Kemeny and J. L. Snell, Finite Markov chains, reprinting of the 1960 original, Undergraduate Texts in Mathematics, Springer-Verlag, New York-Heidelberg, 1976. MR0410929
[KS91] I. Karatzas and S. E. Shreve, Brownian motion and stochastic calculus, 2nd ed., Graduate Texts in Mathematics, vol. 113, Springer-Verlag, New York, 1991. MR1121940
[KS07] L. B. Koralov and Y. G. Sinai, Theory of probability and random processes, 2nd ed., Universitext, Springer, Berlin, 2007. MR2343262
[KS09] J. Keener and J. Sneyd, Mathematical physiology, 2nd ed., Springer-Verlag, New York, 2009.
[KTH95] R. Kubo, M. Toda, and N. Hashitsume, Statistical physics, vol. 2, Springer-Verlag, Berlin, Heidelberg and New York, 1995.
[Kub66] R. Kubo, The fluctuation-dissipation theorem, Rep. Prog. Phys. 29 (1966), 255.
[Kur72] T. G. Kurtz, The relation between stochastic and deterministic models for chemical reactions, J. Chem. Phys. 57 (1972), 2976–2978.
[KW08] M. H. Kalos and P. A. Whitlock, Monte Carlo methods, 2nd ed., Wiley-Blackwell, Weinheim, 2008. MR2503174
[KZW06] S. C. Kou, Q. Zhou, and W. H. Wong, Equi-energy sampler with applications in statistical inference and statistical mechanics, with discussions and a rejoinder by the authors, Ann. Statist. 34 (2006), no. 4, 1581–1652, DOI 10.1214/009053606000000515. MR2283711
[Lam77] J. Lamperti, Stochastic processes: A survey of the mathematical theory, Applied Mathematical Sciences, Vol. 23, Springer-Verlag, New York-Heidelberg, 1977. MR0461600
[Lax02] P. D. Lax, Functional analysis, Pure and Applied Mathematics (New York), Wiley-Interscience [John Wiley & Sons], New York, 2002. MR1892228
[LB05] D. P. Landau and K. Binder, A guide to Monte Carlo simulations in statistical
physics, 2nd ed., Cambridge University Press, Cambridge, 2005.
[LBL16] H. Lei, N. A. Baker, and X. Li, Data-driven parameterization of the generalized Langevin equation, Proc. Natl. Acad. Sci. USA 113 (2016), no. 50, 14183–14188, DOI 10.1073/pnas.1609587113. MR3600515
[Li07] T. Li, Analysis of explicit tau-leaping schemes for simulating chemically reacting systems, Multiscale Model. Simul. 6 (2007), no. 2, 417–436, DOI 10.1137/06066792X. MR2338489
[Liu04] J. S. Liu, Monte Carlo strategies in scientific computing, Springer-Verlag, New York, 2004.
[LL17] T. Li and F. Lin, Large deviations for two-scale chemical kinetic processes, Commun. Math. Sci. 15 (2017), no. 1, 123–163, DOI 10.4310/CMS.2017.v15.n1.a6. MR3605551
[LLLL14] C. Lv, X. Li, F. Li, and T. Li, Constructing the energy landscape for genetic switching system driven by intrinsic noise, PLoS One 9 (2014), e88167.
[LLLL15] C. Lv, X. Li, F. Li, and T. Li, Energy landscape reveals that the budding yeast cell cycle is a robust and adaptive multi-stage process, PLoS Comp. Biol. 11 (2015), e1004156.
[LM15] B. Leimkuhler and C. Matthews, Molecular dynamics: With deterministic and stochastic numerical methods, Interdisciplinary Applied Mathematics, vol. 39, Springer, Cham, 2015. MR3362507
[Loè77] M. Loève, Probability theory. I, 4th ed., Graduate Texts in Mathematics, Vol. 45, Springer-Verlag, New York-Heidelberg, 1977. MR0651017
[Lov96] L. Lovász, Random walks on graphs: a survey, Combinatorics, Paul Erdős is eighty, Vol. 2 (Keszthely, 1993), Bolyai Soc. Math. Stud., vol. 2, János Bolyai Math. Soc., Budapest, 1996, pp. 353–397. MR1395866
[LX11] G. Li and X. S. Xie, Central dogma at the single-molecule level in living cells, Nature 475 (2011), 308–315.
[Mar68] G. Marsaglia, Random numbers fall mainly in the planes, Proc. Nat. Acad. Sci. U.S.A. 61 (1968), 25–28, DOI 10.1073/pnas.61.1.25. MR0235695
[MD10] D. Mumford and A. Desolneux, Pattern theory: The stochastic analysis of real-world signals, Applying Mathematics, A K Peters, Ltd., Natick, MA, 2010. MR2723182
[Mey66] P.-A. Meyer, Probability and potentials, Blaisdell Publishing Co. Ginn and Co., Waltham, Mass.-Toronto, Ont.-London, 1966. MR0205288
[MN98] M. Matsumoto and T. Nishimura, Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator, ACM Trans. Mod. Comp. Simul. 8 (1998), 3–30.
[MP92] E. Marinari and G. Parisi, Simulated tempering: A new Monte Carlo scheme,
Europhys. Lett. 19 (1992), 451–458.
[MP10] P. Mörters and Y. Peres, Brownian motion, with an appendix by Oded Schramm and Wendelin Werner, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 30, Cambridge University Press, Cambridge, 2010. MR2604525
[MPRV08] U. M. B. Marconi, A. Puglisi, L. Rondoni, and A. Vulpiani, Fluctuation-dissipation: Response theory in statistical physics, Phys. Rep. 461 (2008), 111–195.
[MS99] C. D. Manning and H. Schütze, Foundations of statistical natural language processing, MIT Press, Cambridge, MA, 1999. MR1722790
[MS01] M. Meila and J. Shi, A random walks view of spectral segmentation, Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (San Francisco), 2001, pp. 92–97.
[MT93] S. P. Meyn and R. L. Tweedie, Markov chains and stochastic stability, Communications and Control Engineering Series, Springer-Verlag London, Ltd., London, 1993. MR1287609
[MT04] G. N. Milstein and M. V. Tretyakov, Stochastic numerics for mathematical
physics, Scientiﬁc Computation, Springer-Verlag, Berlin, 2004. MR2069903
[MTV16] A. Moraes, R. Tempone, and P. Vilanova, Multilevel hybrid Chernoﬀ tau-leap,
BIT 56 (2016), no. 1, 189–239, DOI 10.1007/s10543-015-0556-y. MR3486459
[Mul56] M. E. Muller, Some continuous Monte Carlo methods for the Dirichlet problem,
Ann. Math. Statist. 27 (1956), 569–589, DOI 10.1214/aoms/1177728169.
MR0088786
[MY71] A. S. Monin and A. M. Yaglom, Statistical ﬂuid mechanics: Mechanics of turbulence,
vol. 1, MIT Press, Cambridge, 1971.
[Nor97] J. R. Norris, Markov chains, Cambridge University Press, Cambridge, 1997.
[Oks98] B. Øksendal, Stochastic diﬀerential equations: An introduction with applications,
5th ed., Universitext, Springer-Verlag, Berlin, 1998. MR1619188
[Pap77] G. C. Papanicolaou, Introduction to the asymptotic analysis of stochastic equations,
Modern modeling of continuum phenomena (Ninth Summer Sem. Appl.
Math., Rensselaer Polytech. Inst., Troy, N.Y., 1975), Lectures in Appl. Math.,
Vol. 16, Amer. Math. Soc., Providence, R.I., 1977, pp. 109–147. MR0458590
[Pav14] G. A. Pavliotis, Stochastic processes and applications: Diﬀusion processes, the
Fokker-Planck and Langevin equations, Texts in Applied Mathematics, vol. 60,
Springer, New York, 2014. MR3288096
[Shi92] A. N. Shiryaev (ed.), Selected works of A.N. Kolmogorov, vol. II, Kluwer Academic
Publishers, Dordrecht, 1992.
[Shi96] A. N. Shiryaev, Probability, 2nd ed., translated from the ﬁrst (1980) Russian
edition by R. P. Boas, Graduate Texts in Mathematics, vol. 95, Springer-Verlag,
New York, 1996. MR1368405
[Sin89] Y. Sinai (ed.), Dynamical systems II: Ergodic theory with applications to dynamical
systems and statistical mechanics, Encyclopaedia of Mathematical Sciences,
vol. 2, Springer-Verlag, Berlin and Heidelberg, 1989.
[SM00] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Trans.
Pattern Anal. Mach. Intell. 22 (2000), 888–905.
[Soa94] P. M. Soardi, Potential theory on inﬁnite networks, Lecture Notes in Mathematics,
vol. 1590, Springer-Verlag, Berlin, 1994. MR1324344
[Sus78] H. J. Sussmann, On the gap between deterministic and stochastic ordinary diﬀerential
equations, Ann. Probability 6 (1978), no. 1, 19–41. MR0461664
[SW95] A. Shwartz and A. Weiss, Large deviations for performance analysis, Queues,
communications, and computing, with an appendix by Robert J. Vanderbei,
Stochastic Modeling Series, Chapman &amp; Hall, London, 1995. MR1335456
[Szn98] A.-S. Sznitman, Brownian motion, obstacles and random media, Springer
Monographs in Mathematics, Springer-Verlag, Berlin, 1998. MR1717054
[Tai04] K. Taira, Semigroups, boundary value problems and Markov processes, Springer
Monographs in Mathematics, Springer-Verlag, Berlin, 2004. MR2019537
[Tan96] M. A. Tanner, Tools for statistical inference: Methods for the exploration of posterior
distributions and likelihood functions, 3rd ed., Springer Series in Statistics,
Springer-Verlag, New York, 1996. MR1396311
[TB04] T. Tian and K. Burrage, Binomial leap methods for simulating stochastic chemical
kinetics, J. Chem. Phys. 121 (2004), 10356–10364.
[TKS95] M. Toda, R. Kubo, and N. Saitô, Statistical physics, vol. 1, Springer-Verlag, Berlin,
Heidelberg and New York, 1995.
[Tou09] H. Touchette, The large deviation approach to statistical mechanics, Phys. Rep.
478 (2009), no. 1–3, 1–69, DOI 10.1016/j.physrep.2009.05.002. MR2560411
[TT90] D. Talay and L. Tubaro, Expansion of the global error for numerical schemes
solving stochastic diﬀerential equations, Stochastic Anal. Appl. 8 (1990), no. 4,
483–509 (1991), DOI 10.1080/07362999008809220. MR1091544
[Van83] E. Vanmarcke, Random ﬁelds: Analysis and synthesis, MIT Press, Cambridge,
MA, 1983. MR761904
[Var84] S. R. S. Varadhan, Large deviations and applications, CBMS-NSF Regional
Conference Series in Applied Mathematics, vol. 46, Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA, 1984. MR758258
[VK04] N. G. Van Kampen, Stochastic processes in physics and chemistry, 2nd ed.,
Elsevier, Amsterdam, 2004.
[Wal02] D. F. Walnut, An introduction to wavelet analysis, Applied and Numerical
Harmonic Analysis, Birkhäuser Boston, Inc., Boston, MA, 2002. MR1854350
[Wel03] L. R. Welch, Hidden Markov models and the Baum-Welch algorithm, IEEE Infor.
Theory Soc. Newslett. 53 (2003), 10–13.
[Wie23] N. Wiener, Diﬀerential space, J. Math. and Phys. 2 (1923), 131–174.
[Win03] G. Winkler, Image analysis, random ﬁelds and Markov chain Monte Carlo methods,
2nd ed., A mathematical introduction, with 1 CD-ROM (Windows), Stochastic
Modelling and Applied Probability, Applications of Mathematics (New York),
vol. 27, Springer-Verlag, Berlin, 2003. MR1950762
[WZ65] E. Wong and M. Zakai, On the convergence of ordinary integrals to stochastic
integrals, Ann. Math. Statist. 36 (1965), 1560–1564, DOI 10.1214/aoms/1177699916.
MR0195142
[XK02] D. Xiu and G. E. Karniadakis, The Wiener-Askey polynomial chaos for stochastic
diﬀerential equations, SIAM J. Sci. Comput. 24 (2002), no. 2, 619–644, DOI
10.1137/S1064827501387826. MR1951058
[YLAVE16] T. Yu, J. Lu, C. F. Abrams, and E. Vanden-Eijnden, Multiscale implementation
of inﬁnite-swap replica exchange molecular dynamics, Proc. Nat. Acad. Sci. USA
113 (2016), 11744–11749.
[Yos95] K. Yosida, Functional analysis, reprint of the sixth (1980) edition, Classics in
Mathematics, Springer-Verlag, Berlin, 1995. MR1336382
[ZD12] J. Zhang and Q. Du, Shrinking dimer dynamics and its applications to saddle
point search, SIAM J. Numer. Anal. 50 (2012), no. 4, 1899–1921, DOI
10.1137/110843149. MR3022203
[ZJ07] J. Zinn-Justin, Phase transitions and renormalization group, Oxford Graduate
Texts, Oxford University Press, Oxford, 2007. MR2345069
[ZL16] P. Zhou and T. Li, Construction of the landscape for multistable systems: Potential
landscape, quasi-potential, A-type integral and beyond, J. Chem. Phys. 144
(2016), 094109.