Matthias Löwe
Contents

5 Independence
8 Zero-One Laws
11 Conditional Expectation
12 Martingales
13 Brownian motion
13.1 Construction of Brownian Motion
14 Appendix
1 Introduction
Chance, luck, and fortune have been at the centre of the human mind ever since people started to think. At the latest when people began to play cards or roll dice for money, there was also a desire for a mathematical description and understanding of chance. Experience tells us that even in a situation governed purely by chance there seems to be some regularity. For example, in a long sequence of fair coin tosses we will typically see heads about half of the time and tails about half of the time. This was formulated already by Jakob Bernoulli (published 1713) and is called a law of large numbers. Not much later the French mathematician de Moivre analyzed how much the typical number of heads in a series of n fair coin tosses fluctuates around n/2. He thereby discovered the first form of what nowadays is known as the Central Limit Theorem.
From this starting point the whole framework of probability theory was developed: from the laws of large numbers over the Central Limit Theorem to the very young field of mathematical finance (and many, many others).
In this course we will meet the most important highlights of probability theory and then turn towards stochastic processes, which are basic for mathematical finance.
F is a σ-algebra over Ω; any A ∈ F is called an event. A measurable map X : Ω → R^d is called a random variable.
The important fact about random variables is that the underlying probability space (Ω, F, P) does not really matter. For example, consider the two experiments

Ω_1 = {0, 1},  F_1 = P(Ω_1),  and  P_1({0}) = 1/2

and

Ω_2 = {1, 2, ..., 6},  F_2 = P(Ω_2),  and  P_2({i}) = 1/6,  i ∈ Ω_2.

Consider the random variables

X_1 : Ω_1 → R,  ω ↦ ω

and

X_2 : Ω_2 → R,  i ↦ 0 if i is even, 1 if i is odd.

Then

P_1(X_1 = 0) = P_2(X_2 = 0) = 1/2

and therefore X_1 and X_2 have the same behavior even though they are defined on completely different spaces. What we learn from this example is that what really matters for a random variable is the distribution P ∘ X^{-1}:
Definition 2.3 The distribution P_X of a random variable X : Ω → R^d is the following probability measure on (R^d, B^d):

P_X(A) := P(X ∈ A) = P ∘ X^{-1}(A),  A ∈ B^d.
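The point of this example, that only the distribution matters, can be illustrated computationally. The following sketch (plain Python; the sample size and seed are arbitrary choices) simulates both experiments and compares the frequencies of the event {X = 0}.

```python
import random
from fractions import Fraction

random.seed(0)

# Exact model probabilities for the two spaces of the example above.
p1 = Fraction(1, 2)                                          # P1(X1 = 0) = P1({0})
p2 = Fraction(sum(1 for i in range(1, 7) if i % 2 == 0), 6)  # P2({2, 4, 6})

# Empirical check: simulate both experiments and compare frequencies.
n = 100_000
x1 = [random.choice([0, 1]) for _ in range(n)]                      # X1(w) = w
x2 = [0 if random.randint(1, 6) % 2 == 0 else 1 for _ in range(n)]  # parity

freq1 = sum(1 for v in x1 if v == 0) / n
freq2 = sum(1 for v in x2 if v == 0) / n
print(p1 == p2, freq1, freq2)   # True, and both frequencies near 0.5
```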
Example 2.4 Important distributions of random variables X (we have already met in introductory courses in probability and statistics) are:

1. The Binomial distribution with parameters n and p, i.e. a random variable X is Binomially distributed with parameters n and p (B(n, p)-distributed), if

P(X = k) = (n choose k) p^k (1 − p)^{n−k},  0 ≤ k ≤ n.

The binomial distribution is the distribution of the number of 1s in n independent coin tosses with success probability p.

2. The Dirac distribution δ_b concentrated in a single point; here b ∈ R.

3. The Poisson distribution with parameter λ > 0:

P(X = k) = λ^k e^{−λ} / k!,  k ∈ N_0 = N ∪ {0}.

4. The d-dimensional normal distribution with mean μ ∈ R^d and covariance matrix Σ: for A = (−∞, a_1] × ⋯ × (−∞, a_d],

P(X ∈ A) = 1/√((2π)^d det Σ) ∫_{−∞}^{a_1} ⋯ ∫_{−∞}^{a_d} exp(−(1/2) ⟨Σ^{-1}(x − μ), (x − μ)⟩) dx_1 ⋯ dx_d.
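The discrete distributions above can be checked for consistency numerically; the following sketch (the parameter values n = 10, p = 0.3 and λ = 4 are arbitrary choices) verifies that the Binomial and Poisson probabilities sum to 1.

```python
import math

# Binomial and Poisson probability mass functions as in Example 2.4.
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

# The binomial probabilities sum to 1 exactly (binomial theorem) ...
total_binom = sum(binom_pmf(k, 10, 0.3) for k in range(11))
# ... and the Poisson tail beyond k = 100 is negligible for lambda = 4.
total_poisson = sum(poisson_pmf(k, 4.0) for k in range(101))
print(total_binom, total_poisson)
```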
Notice that for A ∈ F one has E(1_A) = ∫ 1_A dP = P(A). Quite often one may want to integrate f(X) for a function f : R^d → R^m. How does this work?

Proof. If f = 1_A, A ∈ B^d, we have

E[f ∘ X] = E(1_A ∘ X) = P(X ∈ A) = P_X(A) = ∫ 1_A dP_X.

Hence (3.1) holds true for simple functions f = Σ_{i=1}^n α_i 1_{A_i}, α_i ∈ R, A_i ∈ B^d. The standard approximation techniques yield (3.1) for general integrable f.

In particular Proposition 3.2 yields that

E[X] = ∫ x dP_X(x).
V(X) := E(X − EX)².
Proposition 3.6 V(X) < ∞ if and only if X ∈ L²(P). In this case

V(X) = E(X²) − (E(X))²  (3.2)

as well as

V(X) ≤ E(X²)  (3.3)

and

(EX)² ≤ E(X²).  (3.4)

Proof. If V(X) < ∞, then X − EX ∈ L²(P). But L²(P) is a vector space and the constant EX belongs to L²(P), hence X = (X − EX) + EX ∈ L²(P).
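The identities (3.2)-(3.4) can be verified exactly for a small discrete example, here a fair die (an arbitrary choice), using rational arithmetic:

```python
from fractions import Fraction

# A fair die: P({i}) = 1/6, X(i) = i, computed in exact arithmetic.
outcomes = range(1, 7)
p = Fraction(1, 6)

ex = sum(p * i for i in outcomes)                 # E(X)   = 7/2
ex2 = sum(p * i * i for i in outcomes)            # E(X^2) = 91/6
var = sum(p * (i - ex) ** 2 for i in outcomes)    # E(X - EX)^2

assert var == ex2 - ex**2     # (3.2)
assert var <= ex2             # (3.3)
assert ex**2 <= ex2           # (3.4)
print(ex, var)                # 7/2 and 35/12
```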
It will turn out in the next step that (3.4) is a special case of a much more general principle. To this end recall the concept of a convex function: a function φ : R → R is convex if for all λ ∈ (0, 1) we have

φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y)

for all x, y ∈ R. In a first course in analysis one learns that convexity is implied by φ″ ≥ 0. On the other hand convex functions do not need to be differentiable. The following exercise shows that they are close to being differentiable.

Exercise 3.8 Let I be an interval and φ : I → R be a convex function. Show that the right derivative φ′₊(x) exists for all x ∈ I° (the interior of I) and that the left derivative φ′₋(x) exists for all x ∈ I°. Hence φ is continuous on I°. Show moreover that φ′₊ is monotonically increasing on I° and that it holds:

φ(y) ≥ φ(x) + φ′₊(x)(y − x)

for x ∈ I°, y ∈ I.
Theorem 3.9 (Jensen's inequality) Let X : Ω → R be a random variable on (Ω, F, P) and assume X is P-integrable and takes values in an open interval I ⊆ R. Then EX ∈ I and for every convex

φ : I → R

φ ∘ X is a random variable. If this random variable φ ∘ X is P-integrable it holds:

φ(E(X)) ≤ E(φ ∘ X).

Proof. Assume I = (α, β). Thus X(ω) < β for all ω ∈ Ω. But then E(X) ≤ β, and then also E(X) < β. Indeed, E(X) = β implies that the strictly positive random variable β − X equals 0 on a set of P-measure one, i.e. P-almost surely. This is a contradiction. Analogously EX > α. According to Exercise 3.8, φ is continuous on I = I°, hence Borel-measurable. Now we know

φ(y) ≥ φ(x) + φ′₊(x)(y − x)  (3.5)

and by integration

E(φ ∘ X) ≥ φ(x) + φ′₊(x)(E(X) − x)

for all x ∈ I. Together with (3.6) this gives

E(φ ∘ X) ≥ sup_{x∈I} [φ(x) + φ′₊(x)(EX − x)] = φ(E(X)).
|E(X)|^p ≤ E(|X|^p).
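Jensen's inequality can be observed empirically; the sketch below (the sample size and the convex functions x² and e^x are arbitrary choices) estimates both sides for a uniform variable on {1, ..., 6}:

```python
import math
import random

random.seed(1)

# Estimate phi(E X) and E(phi o X) for X uniform on {1,...,6}.
samples = [random.randint(1, 6) for _ in range(50_000)]
mean = sum(samples) / len(samples)

mean_sq = sum(x * x for x in samples) / len(samples)         # phi(x) = x^2
mean_exp = sum(math.exp(x) for x in samples) / len(samples)  # phi(x) = e^x

print(mean**2, mean_sq)              # phi(EX) <= E(phi(X))
print(math.exp(mean), mean_exp)
```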
Exercise 3.11 Let I be an open interval and φ : I → R be convex. For x_1, ..., x_n ∈ I and λ_1, ..., λ_n ∈ R₊ with Σ_{i=1}^n λ_i = 1, show that

φ(Σ_{i=1}^n λ_i x_i) ≤ Σ_{i=1}^n λ_i φ(x_i).
4 Convergence of random variables
Already in the course on measure theory we met three different types of convergence:

2. Almost sure convergence does not imply convergence in L^p and vice versa.

Definition 4.3 Let (Ω, F) be a measurable, topological space endowed with its Borel σ-algebra F. This means F is generated by the topology on Ω. Moreover for each n ∈ N let μ_n and μ be probability measures on (Ω, F). We say that μ_n converges weakly to μ, if for each bounded, continuous, real-valued function f : Ω → R (we will write C_b(Ω) for the space of all such functions) it holds that

μ_n(f) := ∫ f dμ_n → ∫ f dμ =: μ(f)  as n → ∞.  (4.1)
Theorem 4.4 Let (X_n)_{n∈N} be a sequence of real-valued random variables on a space (Ω, F, P). Assume (X_n) converges to a random variable X stochastically. Then P_{X_n} (the sequence of distributions) converges weakly to P_X, i.e.

lim_{n→∞} ∫ f dP_{X_n} = ∫ f dP_X

or equivalently

lim_{n→∞} E(f ∘ X_n) = E(f ∘ X)

for all f ∈ C_b(R).
Proof. First assume f ∈ C_b(R) is uniformly continuous. Then for ε > 0 there is a δ > 0 such that for any x′, x″ ∈ R with |x′ − x″| < δ we have |f(x′) − f(x″)| < ε.

Eventually put f̃ := u_{n_0} f. Since f̃ ≡ 0 outside the compact set [−n_0 − 1, n_0 + 1], the function f̃ is uniformly continuous (and so is u_{n_0}) and hence

lim_{n→∞} ∫ f̃ dP_{X_n} = ∫ f̃ dP_X

as well as

lim_{n→∞} ∫ u_{n_0} dP_{X_n} = ∫ u_{n_0} dP_X;

thus also

lim_{n→∞} ∫ (1 − u_{n_0}) dP_{X_n} = ∫ (1 − u_{n_0}) dP_X.

By the triangle inequality

|∫ f dP_{X_n} − ∫ f dP_X|  (4.2)
≤ ∫ |f − f̃| dP_{X_n} + |∫ f̃ dP_{X_n} − ∫ f̃ dP_X| + ∫ |f − f̃| dP_X.

We obtain

∫ (1 − u_{n_0}) dP_X ≤ P_X(R \ I_{n_0}) < ε.

This yields

∫ |f − f̃| dP_X = ∫ |f| (1 − u_{n_0}) dP_X ≤ ‖f‖ ε.
Definition 4.5 Let X_n, X be random variables on a probability space (Ω, F, P). If P_{X_n} converges weakly to P_X we also say that X_n converges to X in distribution.

Remark 4.6 If the random variables in Theorem 4.4 are R^d-valued and the functions f : R^d → R belong to C_b(R^d), the statement of the theorem stays valid.
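Theorem 4.4 can be illustrated by simulation. In the following sketch (the perturbation U_n/n and the test function f are arbitrary choices) X_n := X + U_n/n converges to X stochastically, and the empirical mean of f ∘ X_n approaches that of f ∘ X:

```python
import random

random.seed(2)

# X is a fair die; X_n := X + U_n / n with U_n uniform on [0, 1), so
# X_n -> X stochastically. f is bounded and continuous.
def f(x):
    return 1.0 / (1.0 + x * x)

m = 50_000
xs = [random.randint(1, 6) for _ in range(m)]

def mean_f_xn(n):
    return sum(f(x + random.random() / n) for x in xs) / m

limit = sum(f(x) for x in xs) / m        # empirical E(f o X)
gap_small_n = abs(mean_f_xn(1) - limit)
gap_large_n = abs(mean_f_xn(1000) - limit)
print(gap_small_n, gap_large_n)          # the gap shrinks as n grows
```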
Exercise 4.8 Show that the converse direction in Theorem 4.4 is not true, if we drop the assumption that X is constant P-a.s.

Exercise 4.9 Assume the following holds for a sequence (X_n) of random variables on a probability space (Ω, F, P):

P(|X_n| > ε) < ε

for all n large enough (larger than n_0(ε)) for each given ε > 0. Is this equivalent with stochastic convergence of X_n to 0?

Exercise 4.10 For a sequence of Poisson distributions (π_{λ_n}) with parameters λ_n > 0, show that

lim_{n→∞} π_{λ_n} = δ_0  (weakly),

if lim_{n→∞} λ_n = 0.
5 Independence
The concept of independence of events is one of the most essential in probability theory. It
is met already in the introductory courses. Its background is the following:
Assume we are given a probability space (Ω, F, P). For two events A, B ∈ F with P(B) > 0 one may ask how the probability of the event A changes if we already know that B has happened: we only need to consider the probability of A ∩ B. To obtain a probability again we normalize by P(B) and get the conditional probability of A given B:

P(A | B) = P(A ∩ B) / P(B).

Now we would call A and B independent if the knowledge that B has happened does not change the probability that A will happen or not, i.e. if P(A | B) and P(A) are the same:

P(A | B) = P(A),

which is equivalent to

P(A ∩ B) = P(A) P(B).
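A concrete instance of this product formula, computed exactly with rational arithmetic (the die and the two events are arbitrary choices):

```python
from fractions import Fraction

# Fair die, Omega = {1,...,6} with the uniform measure.
omega = range(1, 7)
p = Fraction(1, 6)

def prob(event):
    return sum(p for w in omega if event(w))

A = lambda w: w % 2 == 0        # "even"
B = lambda w: w <= 4            # "at most 4"

pA, pB = prob(A), prob(B)
pAB = prob(lambda w: A(w) and B(w))   # P(A n B) = P({2, 4}) = 1/3

assert pAB == pA * pB                 # A and B are independent
print(pA, pB, pAB)                    # 1/2, 2/3, 1/3
```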
Definition 5.1 A family (A_i)_{i∈I} of events in F is called independent if for each choice of distinct indices i_1, ..., i_n ∈ I

P(A_{i_1} ∩ ⋯ ∩ A_{i_n}) = P(A_{i_1}) ⋯ P(A_{i_n}).  (5.1)

Exercise 5.2 Give an example of a sequence of events that are pairwise independent, i.e. each pair of events from this sequence is independent, but not independent (i.e. all events together are not independent).

Definition 5.3 For each i ∈ I let E_i ⊆ F be a collection of events. (E_i)_{i∈I} are called independent if (5.1) holds for each choice of distinct indices i_1, ..., i_n ∈ I, each n ∈ N, and each A_{i_ν} ∈ E_{i_ν}, ν = 1, ..., n.
1. A family (E_i)_{i∈I} is independent if and only if every finite sub-family is independent.

2. Independence of (E_i)_{i∈I} is maintained if we reduce the families (E_i). More precisely, let (E_i) be independent and E_i′ ⊆ E_i; then also the families (E_i′) are independent.

3. If for all n ∈ N the families (E_i^n)_{i∈I} are independent, and for all n ∈ N and i ∈ I we have E_i^n ⊆ E_i^{n+1}, then (⋃_n E_i^n)_{i∈I} are independent.
Exercise 5.5 If (E_i)_{i∈I} are independent, then so are the Dynkin systems (D(E_i))_{i∈I}. Here D(A) is the Dynkin system generated by A; it coincides with the intersection of all Dynkin systems containing A. (See the Appendix for definitions.)

Corollary 5.6 Let (E_i)_{i∈I} be an independent family of ∩-stable sets E_i ⊆ F. Then also the families (σ(E_i))_{i∈I} of the σ-algebras generated by the E_i are independent.

Theorem 5.7 Let (E_i)_{i∈I} be an independent family of ∩-stable sets E_i ⊆ F and

I = ⋃_{j∈J} I_j

with I_i ∩ I_j = ∅, i ≠ j. Let A_j := σ(⋃_{i∈I_j} E_i). Then also (A_j)_{j∈J} is independent.

P(A) = 0 or P(A) = 1.
Proof. Let A ∈ T and let D be the system of all sets D ∈ F that are independent of A. We want to show that A ∈ D.

In the exercise below we show that D is a Dynkin system. By Theorem 5.7 the σ-algebra T_{n+1} is independent of the σ-algebra

A_n := σ(A_1, ..., A_n).

Exercise 5.11 Show that D from the proof of Theorem 5.10 is a Dynkin system.

for all j ∈ N. Hence lim sup A_n ∈ T. Hence the assertion follows from Kolmogorov's zero-one law.
Exercise 5.13 In every F the pairs (A, B) (any A, B ∈ F with P(A) = 0 or P(A) = 1 or P(B) = 0 or P(B) = 1) are pairs of independent sets. If these are the only pairs of independent sets we call F independence-free. Show that the following space is independence-free:
A special case of the above abstract setting is the concept of independent random variables. This will be introduced next. Again we work over a probability space (Ω, F, P).

Definition 5.14 A family of random variables (X_i)_i is called independent if the σ-algebras (σ(X_i))_i generated by them are independent.

For finite families there is another criterion (which is important, since by definition of independence we only need to check the independence of finite families).

for all E_i ∈ E_i, i = 1, ..., n.

Proof. Put

G_i := X_i^{-1}(E_i),  E_i ∈ E_i.

for all choices of G_i ∈ G_i. Sufficiency is evident, since we may choose G_i = Ω for appropriate i.
Theorem 5.17 Let (X_i)_{i∈I} be a family of independent random variables X_i with values in (Ω_i, A_i) and let

f_i : (Ω_i, A_i) → (Ω_i′, A_i′)

be measurable. Then (f_i(X_i))_{i∈I} is independent.
Proof. Let i_1, ..., i_n ∈ I. Then

P(⋂_{ν=1}^n {f_{i_ν}(X_{i_ν}) ∈ A′_{i_ν}}) = P(⋂_{ν=1}^n {X_{i_ν} ∈ f_{i_ν}^{-1}(A′_{i_ν})})
= ∏_{ν=1}^n P(X_{i_ν} ∈ f_{i_ν}^{-1}(A′_{i_ν}))
= ∏_{ν=1}^n P(f_{i_ν}(X_{i_ν}) ∈ A′_{i_ν})

by the independence of (X_i)_{i∈I}. Here the A′_{i_ν} ∈ A′_{i_ν} were arbitrary.
Already Theorem 5.15 gives rise to the idea that independence of random variables may be somehow related to product measures. This is made more precise in the following theorem. To this end let X_1, ..., X_n be random variables such that

X_i : (Ω, F) → (Ω_i, A_i).

Define

Y := X_1 ⊗ ⋯ ⊗ X_n : Ω → Ω_1 × ⋯ × Ω_n.

Then the distribution of Y, which we denote by P_Y, can be computed as P_Y = P_{X_1 ⊗ ⋯ ⊗ X_n}. Note that P_Y is a probability measure on ⊗_{i=1}^n A_i.
Theorem 5.18 The random variables X_1, ..., X_n are independent if and only if their joint distribution is the product measure of the individual distributions, i.e. if

P_Y = P_{X_1} ⊗ ⋯ ⊗ P_{X_n}  for  Y = X_1 ⊗ ⋯ ⊗ X_n.

Proof. For A_i ∈ A_i we have

P_Y(∏_{i=1}^n A_i) = P(Y ∈ ∏_{i=1}^n A_i) = P(X_1 ∈ A_1, ..., X_n ∈ A_n)

as well as

P_{X_i}(A_i) = P(X_i ∈ A_i),  i = 1, ..., n.

Now P_Y is the product measure of the P_{X_i} if and only if

P(X_1 ∈ A_1, ..., X_n ∈ A_n) = ∏_{i=1}^n P(X_i ∈ A_i).

But according to Theorem 5.15 this is equivalent with the independence of the X_i.
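Theorem 5.18 can be observed empirically: for independent coordinates the joint relative frequencies approximately factor into the product of the marginal frequencies. A minimal sketch (the coin/die pair and the sample size are arbitrary choices):

```python
import random
from collections import Counter

random.seed(3)

# X1 a fair coin, X2 an independent fair die; compare joint and
# product-of-marginal frequencies over all 12 outcome pairs.
n = 200_000
pairs = [(random.randint(0, 1), random.randint(1, 6)) for _ in range(n)]

joint = Counter(pairs)
m1 = Counter(x for x, _ in pairs)
m2 = Counter(y for _, y in pairs)

max_gap = max(
    abs(joint[(x, y)] / n - (m1[x] / n) * (m2[y] / n))
    for x in (0, 1)
    for y in range(1, 7)
)
print(max_gap)   # small: the joint distribution factorizes
```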
6 Products and Sums of Independent Random Variables
In this section we will study independent random variables in greater detail.
Theorem 6.1 Let X_1, ..., X_n be independent, real-valued random variables. Then

E(∏_{i=1}^n X_i) = ∏_{i=1}^n E(X_i)  (6.1)

if EX_i is well defined (and finite) for all i. (6.1) shows that then also E(∏_{i=1}^n X_i) is well defined.

Proof. We know that Q := ⊗_{i=1}^n P_{X_i} is the joint distribution of X_1, ..., X_n. By Proposition 3.2 and Fubini's theorem

E|∏_{i=1}^n X_i| = ∫ |x_1 ⋯ x_n| dQ(x_1, ..., x_n)
= ∫ ⋯ ∫ |x_1| ⋯ |x_n| dP_{X_1}(x_1) ⋯ dP_{X_n}(x_n)
= ∫ |x_1| dP_{X_1}(x_1) ⋯ ∫ |x_n| dP_{X_n}(x_n).

This shows that integrability of the X_i implies integrability of ∏_{i=1}^n X_i. In this case the equalities are also true without absolute values. This proves the result.
Exercise 6.2 For any two random variables X, Y that are integrable, Theorem 6.1 tells us that independence of X, Y implies that

E(X Y) = E(X) E(Y).

Show that the converse is not true.

Definition 6.3 For any two random variables X, Y that are integrable and have an integrable product we define the covariance of X and Y to be

cov(X, Y) = E[(X − EX)(Y − EY)] = E(XY) − EX EY.

X and Y are called uncorrelated if cov(X, Y) = 0.

Remark 6.4 If X, Y are independent then cov(X, Y) = 0.
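The covariance of Definition 6.3 can be estimated from samples; in the sketch below (the variables are arbitrary choices) Y depends on X, while Z is independent of X, so the estimated cov(X, Z) is near zero:

```python
import random

random.seed(4)

# X a fair die; Y = X + coin (dependent on X); Z an independent die.
n = 100_000
xs = [random.randint(1, 6) for _ in range(n)]
ys = [x + random.randint(0, 1) for x in xs]
zs = [random.randint(1, 6) for _ in range(n)]

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

print(cov(xs, ys), cov(xs, zs))   # ~ V(X) = 35/12 ~ 2.92, and ~ 0
```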
Theorem 6.5 Let X_1, ..., X_n be square integrable random variables. Then

V(Σ_{i=1}^n X_i) = Σ_{i=1}^n V(X_i) + Σ_{i≠j} cov(X_i, X_j).  (6.2)
Proof. We have

V(Σ_{i=1}^n X_i) = E(Σ_{i=1}^n (X_i − EX_i))²
= E[Σ_{i=1}^n (X_i − EX_i)² + Σ_{i≠j} (X_i − EX_i)(X_j − EX_j)]
= Σ_{i=1}^n V(X_i) + Σ_{i≠j} cov(X_i, X_j).

This proves (6.2). For (6.3) just note that for uncorrelated random variables X, Y one has cov(X, Y) = 0.
Eventually we turn to determining the distribution of the sum of independent random variables.

Theorem 6.6 Let X_1, ..., X_n be independent R^d-valued random variables. Then the distribution of the sum S_n := X_1 + ⋯ + X_n is given by the convolution product of the distributions of the X_i, i.e.

P_{S_n} = P_{X_1} ∗ P_{X_2} ∗ ⋯ ∗ P_{X_n}.

Proof. Again let Y := X_1 ⊗ ⋯ ⊗ X_n : Ω → (R^d)^n, and let

A_n : R^d × ⋯ × R^d → R^d

denote vector addition. Then S_n = A_n ∘ Y, hence a random variable. Now P_{S_n} is the image measure of P under A_n ∘ Y, which we denote by (A_n ∘ Y)(P). Thus

P_{S_n} = (A_n ∘ Y)(P) = A_n(P_Y).

Now P_Y = ⊗_i P_{X_i}. So by the definition of the convolution product

P_{X_1} ∗ ⋯ ∗ P_{X_n} = A_n(P_Y) = P_{S_n}.

More explicitly, in the case d = 1, let g(x_1, ..., x_n) = 1 for x_1 + ⋯ + x_n ≤ s and g(x_1, ..., x_n) = 0 otherwise. Then an application of Fubini's theorem yields

P(S_n ≤ s) = E(g(X_1, ..., X_n)) = ∫∫ g(x_1, x_2, ..., x_n) dP_{X_1}(x_1) dP_{(X_2,...,X_n)}(x_2, ..., x_n)
= ∫ P(X_1 ≤ s − x_2 − ⋯ − x_n) dP_{(X_2,...,X_n)}(x_2, ..., x_n).

In the case that X_1 has a density f_{X_1} with respect to Lebesgue measure, and the order of differentiation with respect to s and integration can be exchanged, it follows that S_n has a density f_{S_n} and

f_{S_n}(s) = ∫ f_{X_1}(s − x_2 − ⋯ − x_n) dP_{(X_2,...,X_n)}(x_2, ..., x_n).

The same formula holds in the case that X_1, ..., X_n have a discrete distribution (that is, almost surely assume values in a fixed countable subset of R) if densities are taken with respect to the counting measure.
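For discrete distributions the convolution product of Theorem 6.6 is a finite sum, (p ∗ q)(s) = Σ_x p(x) q(s − x). A minimal sketch with two fair dice (an arbitrary choice), using exact rational arithmetic:

```python
from fractions import Fraction

# (p * q)(s) = sum_x p(x) q(s - x) for distributions given as dicts.
def convolve(p, q):
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * qy
    return out

die = {i: Fraction(1, 6) for i in range(1, 7)}
two_dice = convolve(die, die)           # distribution of the sum S_2

assert sum(two_dice.values()) == 1      # again a probability distribution
print(two_dice[2], two_dice[7], two_dice[12])   # 1/36, 1/6, 1/36
```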
Example 6.7 1. As we learned in Introduction to Statistics, the convolution of a Binomial distribution with parameters n and p, B(n, p), and a Binomial distribution B(m, p) is a B(n + m, p) distribution.

2. The convolution of two normal distributions is again normal:

N(μ, σ²) ∗ N(ν, τ²) = N(μ + ν, σ² + τ²).
Let (Ω_n, A_n, μ_n), n ∈ N, be probability spaces, i.e. μ_n(Ω_n) = 1 for all n.

Moreover we have the idea that a probability measure on Ω should be determined by what happens on the first n coordinates, n ∈ N. So for A_1 ∈ A_1, A_2 ∈ A_2, ..., A_n ∈ A_n, n ∈ N, we want

A := A_1 × ⋯ × A_n × Ω_{n+1} × Ω_{n+2} × ⋯  (7.1)

to be in A. By independence we want to define a measure μ on (Ω, A) that assigns to A defined in (7.1) the mass

μ(A) = μ_1(A_1) ⋯ μ_n(A_n).
We will solve this problem in greater generality. Let I be an index set and (Ω_i, A_i, μ_i)_{i∈I} be measure spaces with μ_i(Ω_i) = 1. For ∅ ≠ K ⊆ I define

Ω_K := ∏_{i∈K} Ω_i,  (7.2)

in particular Ω := Ω_I. Let p^K_J for J ⊆ K denote the canonical projection from Ω_K to Ω_J. For J = {i} we will also write p^K_i instead of p^K_{{i}} and p_i in place of p^I_i. Obviously

p^L_J = p^K_J ∘ p^L_K  (J ⊆ K ⊆ L)  (7.3)

and

p_J := p^I_J = p^K_J ∘ p_K  (J ⊆ K).  (7.4)

Moreover denote by

H(I) := {J ⊆ I : J ≠ ∅, |J| is finite}.

For J ∈ H(I) we introduce the σ-algebras and measures

A_J := ⊗_{i∈J} A_i  and  μ_J := ⊗_{i∈J} μ_i.

Definition 7.1 The product σ-algebra ⊗_{i∈I} A_i of the σ-algebras (A_i)_{i∈I} is defined as the smallest σ-algebra A on Ω such that all projections p_i : Ω → Ω_i are (A, A_i)-measurable. Hence

⊗_{i∈I} A_i := σ(p_i, i ∈ I).  (7.5)

According to the above we are now looking for a measure μ on (Ω, A) that assigns mass μ_1(A_1) ⋯ μ_n(A_n) to each A as defined in (7.1). In other words

μ(p_J^{-1}(∏_{i∈J} A_i)) = μ_J(∏_{i∈J} A_i),  (7.6)

i.e.

p_J(μ) := μ ∘ p_J^{-1} = μ_J.  (7.7)
Proof. We may assume |I| = ∞, since otherwise the result is known from Fubini's theorem. We start with some preparatory considerations:

In Exercise 7.4 below it will be shown that p^K_J is (A_K, A_J)-measurable for J ⊆ K and that p^K_J(μ_K) = μ_J (J ⊆ K, J, K ∈ H(I)).

Hence, if we introduce the σ-algebra of the J-cylinder sets

Z_J := p_J^{-1}(A_J)  (J ∈ H(I)),  (7.8)

the measurability of p^K_J implies (p^K_J)^{-1}(A_J) ⊆ A_K and thus

Z_J ⊆ Z_K  (J ⊆ K, J, K ∈ H(I)).  (7.9)

Eventually we introduce the system of all cylinder sets

Z := ⋃_{J∈H(I)} Z_J.

Note that due to (7.9), for Z_1, Z_2 ∈ Z we have Z_1, Z_2 ∈ Z_J for a suitably chosen J ∈ H(I). Hence Z is an algebra (but in general not a σ-algebra). From (7.5) and (7.6) it follows that

A = σ(Z).
Now we come to the main part of the proof. This will be divided into four parts.
1. Assume Z ∈ Z, Z = p_J^{-1}(A), J ∈ H(I), A ∈ A_J. According to (7.7), Z must get mass

μ_0(Z) := μ_J(A).  (7.10)

We have to show that this is well defined. So let

Z = p_J^{-1}(A) = p_K^{-1}(B).

First assume J ⊆ K. Then

p_J^{-1}(A) = p_K^{-1}(B′)  with  B′ := (p^K_J)^{-1}(A),

and thus p_K^{-1}(B) = p_K^{-1}(B′). Since p_K(Ω) = Ω_K we obtain

B = B′ = (p^K_J)^{-1}(A).

Thus by the introductory considerations

μ_K(B) = μ_J(A).

For arbitrary J, K define L := J ∪ K. Since J, K ⊆ L, (7.9) implies the existence of C ∈ A_L with p_L^{-1}(C) = p_J^{-1}(A) = p_K^{-1}(B). Therefore from what we have just seen:

μ_J(A) = μ_L(C) = μ_K(B).
2. Now we show that μ_0 as defined in (7.10) is a volume on Z. Trivially it holds that μ_0 ≥ 0 and μ_0(∅) = 0. Moreover, as shown above, for Y, Z ∈ Z with Y ∩ Z = ∅ there are J ∈ H(I) and A, B ∈ A_J such that Y = p_J^{-1}(A), Z = p_J^{-1}(B). Now Y ∩ Z = ∅ implies A ∩ B = ∅, and due to

Y ∪ Z = p_J^{-1}(A ∪ B)

we obtain μ_0(Y ∪ Z) = μ_J(A ∪ B) = μ_J(A) + μ_J(B) = μ_0(Y) + μ_0(Z). In particular

μ_0(Ω) = μ_J(Ω_J) = 1.

3. For Z ∈ Z, J ∈ H(I), and ω_J ∈ Ω_J, the section

Z^{ω_J} := {ω ∈ Ω : (ω_J, p_{I\J}(ω)) ∈ Z}

is a cylinder set. This set consists of all ω ∈ Ω with the following property: if we replace the coordinates ω_i with i ∈ J by the corresponding coordinates of ω_J, we obtain a point in Z. Moreover

μ_0(Z) = ∫ μ_0(Z^{ω_J}) dμ_J(ω_J).  (7.11)

This is shown by the following consideration. For Z ∈ Z there are K ∈ H(I) and A ∈ A_K such that Z = p_K^{-1}(A); this means that μ_0(Z) = μ_K(A). Since I is infinite we may assume J ⊆ K and J ≠ K. For the ω_J-section of A in Ω_K, which we call A^{ω_J}, i.e. for the set of all ω′ ∈ Ω_{K\J} with (ω_J, ω′) ∈ A, it holds that

Z^{ω_J} = p_{K\J}^{-1}(A^{ω_J}).
4. Eventually we show that μ_0 is ∅-continuous and thus σ-additive. To this end let (Z_n) be a decreasing sequence of cylinder sets in Z with α := inf_n μ_0(Z_n) > 0. We will show that

⋂_{n=1}^∞ Z_n ≠ ∅.  (7.13)

Now each Z_n is of the form Z_n = p_{J_n}^{-1}(A_n), J_n ∈ H(I), A_n ∈ A_{J_n}. Due to (7.9) we may assume J_1 ⊆ J_2 ⊆ J_3 ⊆ ⋯. We apply the result proved in 3. to J = J_1 and Z = Z_n. As ω_{J_1} ↦ μ_0(Z_n^{ω_{J_1}}) is A_{J_1}-measurable,

Q_n := {ω_{J_1} ∈ Ω_{J_1} : μ_0(Z_n^{ω_{J_1}}) ≥ α/2} ∈ A_{J_1}.

Since all the μ_J's have mass one we obtain from (7.11):

μ_0(Z_n) ≤ μ_{J_1}(Q_n) + α/2,

hence μ_{J_1}(Q_n) ≥ α/2 > 0 for all n ∈ N. Together with (Z_n) also (Q_n) is decreasing. A finite measure is ∅-continuous, which implies ⋂_{n=1}^∞ Q_n ≠ ∅. Hence there is ω_{J_1} ∈ ⋂_{n=1}^∞ Q_n with

μ_0(Z_n^{ω_{J_1}}) ≥ α/2 > 0  for all n.  (7.14)

Successive application of 3. implies via induction that for each k ∈ N there is ω_{J_k} with

μ_0(Z_n^{ω_{J_k}}) ≥ α/2^k > 0  and  p^{J_{k+1}}_{J_k}(ω_{J_{k+1}}) = ω_{J_k}.
μ_0(Ω) = μ_J(Ω_J) = 1.
Exercise 7.4 With the notations of this section, in particular of Theorem 7.3, show that p^K_J is (A_K, A_J)-measurable (J ⊆ K, J, K ∈ H(I)) and that

p^K_J(μ_K) = μ_J.
8 Zero-One Laws
Already in Section 5 we encountered the prototype of a zero-one law: for a sequence of independent events (A_n)_n we have Borel's Zero-One Law (Theorem 5.12).

In a first step we will now ask when the probability in question is zero and when it is one. This leads to the following frequently used lemma:

Lemma 8.1 (Borel-Cantelli Lemma) Let (A_n) be a sequence of events over a probability space (Ω, F, P). Then

Σ_{n=1}^∞ P(A_n) < ∞  ⟹  P(lim sup A_n) = 0.  (8.1)

Remark 8.2 The Borel-Cantelli Lemma is most often used in the form of (8.1). Note that this part does not require any knowledge about the dependence structure of the A_n.
Proof. Put A := lim sup A_n. This implies

A ⊆ ⋃_{i=n}^∞ A_i  for all n ∈ N,

and thus

P(A) ≤ P(⋃_{i=n}^∞ A_i) ≤ Σ_{i=n}^∞ P(A_i).  (8.3)

Since Σ_{i=1}^∞ P(A_i) converges, Σ_{i=n}^∞ P(A_i) converges to zero as n → ∞. This implies P(A) = 0, hence (8.1).
For (8.2) again put A := lim sup A_n and furthermore

I_n := 1_{A_n},  S_n := Σ_{j=1}^n I_j,

and eventually

S := Σ_{j=1}^∞ I_j.
Since the A_n are assumed to be pairwise independent they are pairwise uncorrelated as well. Hence

V(S_n) = Σ_{j=1}^n V(I_j) = Σ_{j=1}^n (E(I_j²) − E(I_j)²) = E(S_n) − Σ_{j=1}^n E(I_j)² ≤ ES_n,

where the last equality follows since I_j² = I_j. Now by assumption Σ_{n=1}^∞ E(I_n) = +∞. Since S_n ↑ S this is equivalent with

lim_{n→∞} E(S_n) = E(S) = ∞.  (8.4)

On the other hand ω ∈ A if and only if ω ∈ A_n for infinitely many n, which is the case if and only if S(ω) = +∞. The assertion thus is

P(S = +∞) = 1.

This can be seen as follows. By Chebyshev's inequality

P(|S_n − E(S_n)| ≤ ε) ≥ 1 − V(S_n)/ε²

for all ε > 0. Because of (8.4) we may assume that ES_n > 0 and choose ε = (1/2) ES_n. Hence

P(S_n ≥ (1/2) E(S_n)) ≥ P(|S_n − ES_n| ≤ (1/2) ES_n) ≥ 1 − 4 V(S_n)/E(S_n)².

But V(S_n) ≤ E(S_n) and E(S_n) → ∞. Thus

lim_{n→∞} V(S_n)/E(S_n)² = 0.

Therefore for all ε > 0 and all n large enough

P(S_n ≥ (1/2) ES_n) ≥ 1 − ε.

But now S ≥ S_n and hence also

P(S ≥ (1/2) ES_n) ≥ 1 − ε

for all ε > 0. But this implies P(S = +∞) = 1, which is what we wanted to show.
Example 8.3 Let (X_n) be a sequence of real-valued random variables which satisfies

Σ_{n=1}^∞ P(|X_n| > ε) < ∞  (8.5)

for all ε > 0. Then X_n → 0 P-a.s. Indeed the Borel-Cantelli Lemma says that (8.5) implies that

P(|X_n| > ε infinitely often in n) = 0.

But this is exactly the definition of almost sure convergence of X_n to 0.
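The first part (8.1) of the Borel-Cantelli Lemma can be watched in simulation: for independent events with P(A_n) = 1/n² (an arbitrary summable choice), a typical sample path sees only finitely many, in fact typically only very early, A_n occur:

```python
import random

random.seed(5)

# Independent events A_n with P(A_n) = 1/n**2 (summable). Record, for
# each simulated sample path, the index of the last A_n that occurs.
def last_occurrence(n_max=10_000):
    last = 0
    for n in range(1, n_max + 1):
        if random.random() < 1.0 / n**2:
            last = n
    return last

lasts = [last_occurrence() for _ in range(200)]
share_early = sum(1 for l in lasts if l <= 10) / len(lasts)
print(max(lasts), share_early)   # most paths see no A_n beyond n = 10
```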
Exercise 8.4 Is (8.5) equivalent with P-almost sure convergence of X_n to 0?
Corollary 8.7 Let (X_n)_{n∈N} be a sequence of independent, real-valued random variables. Define

T := ⋂_{m=1}^∞ σ(X_i, i ≥ m)

to be the tail σ-algebra. If T is a real-valued random variable that is measurable with respect to T, then T is P-almost surely constant, i.e. there is an α ∈ R such that

P(T = α) = 1.

Such random variables T : Ω → R that are T-measurable are called tail functions.

For every α ∈ R,

{T ≤ α} ∈ T.

This implies P(T ≤ α) ∈ {0, 1}. On the other hand, α ↦ P(T ≤ α) is a distribution function, so for α := inf{x ∈ R : P(T ≤ x) = 1} we have

P(T < α) = 0,

which implies

P(T = α) = 1.
Exercise 8.8 A coin is tossed infinitely often. Show that every finite sequence

(ω_1, ..., ω_k),  ω_i ∈ {H, T},  k ∈ N,

occurs infinitely often with probability one.
Exercise 8.9 Try to prove (8.2) in the Borel-Cantelli Lemma for independent events (A_n) as follows:

This implies

Σ_{n=1}^∞ α_n = ∞  ⟹  lim_{n→∞} ∏_{i=1}^n (1 − α_i) = 0.

3. As Σ P(A_n) diverges, because of 1. we have

lim_{N→∞} ∏_{m=n}^N (1 − P(A_m)) = 0.
9 Laws of Large Numbers

Theorem 9.1 (Khintchine) Let (X_n)_{n∈N} be a sequence of square integrable, real-valued random variables that are pairwise uncorrelated. Assume

lim_{n→∞} (1/n²) Σ_{i=1}^n V(X_i) = 0.

Then (1/n) Σ_{i=1}^n (X_i − EX_i) converges to 0 stochastically.
Proof. By Chebyshev's inequality, for each ε > 0:

P(|(1/n) Σ_{i=1}^n (X_i − EX_i)| > ε) ≤ (1/ε²) V((1/n) Σ_{i=1}^n (X_i − EX_i))
= (1/ε²)(1/n²) V(Σ_{i=1}^n (X_i − EX_i))
= (1/ε²)(1/n²) Σ_{i=1}^n V(X_i − EX_i)
= (1/ε²)(1/n²) Σ_{i=1}^n V(X_i).

Here we used that the random variables are pairwise uncorrelated. By assumption the latter expression converges to zero.
Remark 9.2 As we will learn in the next theorem, for an independent sequence square integrability is not even required.

Theorem 9.1 raises the question whether we can replace the stochastic convergence there by almost sure convergence. This will be shown in the following theorem. Such a theorem is called a Strong Law of Large Numbers. Its first form was proved by Kolmogorov. We will present a proof due to Etemadi from 1981.

Theorem 9.3 (Strong Law of Large Numbers, Etemadi 1981) For each sequence (X_n)_n of real-valued, pairwise independent, identically distributed, integrable random variables the Strong Law of Large Numbers holds, i.e.

P(lim sup_{n→∞} |(1/n) Σ_{i=1}^n X_i − EX_1| > ε) = 0  for each ε > 0.
Before we prove Theorem 9.3 let us make a couple of remarks. These should reveal the structure of the proof a bit:

1. Denote S_n = Σ_{i=1}^n X_i. Then Theorem 9.3 asserts that (1/n) S_n → μ := EX_1, P-almost surely.

2. Together with X_n also X_n⁺ and X_n⁻ (where X_n⁺ = max(X_n, 0) and X_n⁻ = (−X_n)⁺) satisfy the assumptions of Theorem 9.3. Since X_n = X_n⁺ − X_n⁻ it therefore suffices to prove Theorem 9.3 for positive random variables. We therefore assume X_n ≥ 0 for the rest of the proof.

3. All proofs of the Strong Law of Large Numbers use the following trick: we truncate the random variables X_n by cutting off values that are too large. We therefore introduce

Y_n := X_n 1_{{|X_n| < n}} = X_n 1_{{X_n < n}}.

Of course, if μ is the distribution of X_n and μ_n is the distribution of Y_n, then μ_n ≠ μ. Indeed μ_n = f_n(μ), where

f_n(x) := x if 0 ≤ x < n, and 0 otherwise.

The idea behind truncation is that we gain square integrability of the sequence. Indeed:

E(Y_n²) = E((f_n ∘ X_n)²) = ∫ f_n(x)² dμ(x) = ∫_0^n x² dμ(x) < ∞.

4. Of course, after having gained information about the Y_n we need to translate these results back to the X_n. To this end we will apply the Borel-Cantelli Lemma and show that

Σ_{n=1}^∞ P(X_n ≠ Y_n) < ∞,

so that P-a.s. X_n = Y_n for all but finitely many n.

5. For the purposes of the proof we remark the following: let α > 1 and for n ∈ N let

k_n := [α^n],

so that k_n ≤ α^n < k_n + 1. Since

lim_{n→∞} (α^n − 1)/α^n = 1,

there is a number c_α, 0 < c_α < 1, such that k_n ≥ c_α α^n for all n.
We now turn to the

Proof of Theorem 9.3. Step 1: Without loss of generality X_n ≥ 0. Define Y_n = 1_{{X_n < n}} X_n. Then the Y_n are independent and square integrable. Define

S_n′ := Σ_{i=1}^n (Y_i − EY_i).

Let ε > 0 and α > 1. Using Chebyshev's inequality and the independence of the random variables (Y_n) we obtain

P(|S_n′/n| > ε) ≤ (1/ε²) V(S_n′/n) = (1/ε²)(1/n²) V(S_n′) = (1/ε²)(1/n²) Σ_{i=1}^n V(Y_i).

Observe that V(Y_i) = E(Y_i²) − (E(Y_i))² ≤ E(Y_i²). Thus

P(|S_n′/n| > ε) ≤ (1/ε²)(1/n²) Σ_{i=1}^n E(Y_i²).
Step 2: Summing this bound along the subsequence k_n = [α^n] and interchanging the order of summation gives

Σ_{n=1}^∞ P(|S′_{k_n}| > ε k_n) ≤ (1/ε²) Σ_{j=1}^∞ E(Y_j²) t_j,

where

t_j := Σ_{n=n_j}^∞ 1/k_n²

and n_j := min{n : k_n ≥ j}. Since k_n ≥ c_α α^n and α^{n_j} ≥ k_{n_j} ≥ j,

t_j ≤ c_α^{-2} Σ_{n=n_j}^∞ α^{-2n} = c_α^{-2} (1 − α^{-2})^{-1} α^{-2n_j} ≤ d j^{-2},

where d = c_α^{-2} (1 − α^{-2})^{-1} > 0. By using the above and E(Y_j²) = ∫_0^j x² dμ(x),

Σ_{n=1}^∞ P(|S′_{k_n}| > ε k_n) ≤ (d/ε²) Σ_{j=1}^∞ (1/j²) Σ_{k=1}^j ∫_{k−1}^k x² dμ(x).

Interchanging the order of summation once more,

Σ_{j=1}^∞ (1/j²) Σ_{k=1}^j ∫_{k−1}^k x² dμ(x) = Σ_{k=1}^∞ (∫_{k−1}^k x² dμ(x)) Σ_{j=k}^∞ 1/j².

Since

Σ_{j=k}^∞ 1/j² < 1/k² + 1/(k(k+1)) + 1/((k+1)(k+2)) + ⋯
= 1/k² + (1/k − 1/(k+1)) + (1/(k+1) − 1/(k+2)) + ⋯ = 1/k² + 1/k ≤ 2/k,

this yields

Σ_{n=1}^∞ P(|S′_{k_n}| > ε k_n) ≤ (2d/ε²) Σ_{k=1}^∞ ∫_{k−1}^k (x²/k) dμ(x) ≤ (2d/ε²) Σ_{k=1}^∞ ∫_{k−1}^k x dμ(x) = (2d/ε²) E(X_1) < ∞.

By the Borel-Cantelli Lemma it follows that (1/k_n) S′_{k_n} → 0 P-a.s.
Step 3: Now we are aiming at removing the truncation from the X_n. Consider the sum

Σ_{n=1}^∞ P(X_n ≠ Y_n) = Σ_{n=1}^∞ P(X_n ≥ n).

According to Exercise 3.4 this is smaller than E(X_1), so that it is bounded. Therefore the Borel-Cantelli Lemma applies.
Hence there is an n_0 (random) such that with probability one X_n = Y_n for all n ≥ n_0. But the finitely many differences drop out when averaging, hence

lim_{n→∞} (1/k_n) S_{k_n} = EX_1  P-a.s.

Step 4: Eventually we show that the theorem holds not only for the subsequences k_n chosen as above, but also for the whole sequence.

For fixed α > 1, of course, the sequence (k_n)_n is fixed and diverges to +∞. Hence for every m ∈ N there exists an n ∈ N such that

k_n < m ≤ k_{n+1}.

Since the X_i are non-negative,

S_{k_n} ≤ S_m ≤ S_{k_{n+1}}.

Hence

(S_{k_n}/k_n)(k_n/m) ≤ S_m/m ≤ (S_{k_{n+1}}/k_{n+1})(k_{n+1}/m).

The definition of k_n yields k_n ≤ α^n < k_n + 1 ≤ m. This gives

k_{n+1}/m < α^{n+1}/α^n = α

as well as

k_n/m > (α^n − 1)/α^{n+1}.

Now, given α, for all n ≥ n_1 = n_1(α) we have α^n − 1 ≥ α^{n−1}. Hence, if m ≥ k_{n_1} and thus n ≥ n_1, we obtain

k_n/m > (α^n − 1)/α^{n+1} > α^{n−1}/α^{n+1} = α^{-2}.

Now for each α we have a set Ω_α with P(Ω_α) = 1 and

lim_{n→∞} (1/k_n) S_{k_n}(ω) = EX_1  for all ω ∈ Ω_α.

Without loss of generality we may assume that the X_i are not identically equal to zero P-a.s., otherwise the assertion of the Strong Law of Large Numbers is trivially true. Therefore we may assume without loss of generality that EX_1 > 0. Since α > 1 we then have

(1/α) EX_1 < (1/k_n) S_{k_n}(ω) < α EX_1

for all ω ∈ Ω_α and all n large enough. For such m and ω this means

(α^{-3} − 1) EX_1 < (1/m) S_m(ω) − EX_1 < (α² − 1) EX_1.
Define

Ω_1 := ⋂_{n=1}^∞ Ω_{1+1/n}.

Then P(Ω_1) = 1 and

lim_{m→∞} (1/m) S_m(ω) = EX_1  for all ω ∈ Ω_1.
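The Strong Law of Large Numbers is easy to watch in simulation; the sketch below (die rolls and the sample sizes are arbitrary choices) tracks the running mean along a single sample path:

```python
import random

random.seed(6)

# Running means of i.i.d. die rolls along one sample path; EX_1 = 3.5.
rolls = [random.randint(1, 6) for _ in range(200_000)]

def running_mean(n):
    return sum(rolls[:n]) / n

errors = [abs(running_mean(n) - 3.5) for n in (100, 10_000, 200_000)]
print(errors)   # the error is small for large n
```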
Remark 9.4 Theorem 9.3 in particular implies that for i.i.d. sequences of random variables (X_n) with a finite first moment the Strong Law of Large Numbers holds true. Since stochastic convergence is implied by almost sure convergence, Theorem 9.1, the Weak Law of Large Numbers, holds true for such sequences as well. Therefore the finiteness of the second moment is not necessary for Theorem 9.1 to hold for i.i.d. sequences.

Remark 9.5 One might, of course, ask whether a finite first moment is necessary for Theorem 9.3 to hold. Indeed one can prove that, if a sequence of i.i.d. random variables is such that (1/n) Σ_{i=1}^n X_i converges almost surely to some random variable Y (necessarily a tail function as in Corollary 8.7!), then EX_1 exists and Y = EX_1 almost surely. This will not be shown in the context of this course.

Exercise 9.6 Let (a_m)_m be real numbers such that lim_{m→∞} a_m = a. Show that this implies that their Cesàro mean converges:

lim_{n→∞} (1/n)(a_1 + a_2 + ⋯ + a_n) = a.

Exercise 9.7 Prove the Strong Law of Large Numbers for a sequence of i.i.d. random variables (X_n)_n with a finite fourth moment, i.e. for random variables with E(X_1⁴) < ∞. Do not use the statement of Theorem 9.3 explicitly.
Remark 9.8 A very natural question to ask in the context of Theorem 9.3 is: how fast does (1/n) S_n converge to EX_1? I.e., given a sequence of i.i.d. random variables (X_n)_n, what is

P(|(1/n) Σ_{i=1}^n X_i − EX_1| ≥ ε)?

If X_1 has a finite moment generating function, i.e. if

M(t) := E(e^{tX_1}) < ∞  for all t ∈ R,

the answer is: exponentially fast. Indeed, Cramér's theorem (which cannot be proven in the context of this course) asserts the following: let I : R → R be given by

I(x) := sup_{t∈R} (tx − log M(t));

then for every closed set C ⊆ R

lim sup_{n→∞} (1/n) log P((1/n) Σ_{i=1}^n X_i ∈ C) ≤ − inf_{x∈C} I(x),

and for every open set O ⊆ R

lim inf_{n→∞} (1/n) log P((1/n) Σ_{i=1}^n X_i ∈ O) ≥ − inf_{x∈O} I(x).

This is called a principle of large deviations for the random variables (1/n) Σ_{i=1}^n X_i. In particular, one can show that the function I is convex and non-negative, with

I(x) = 0 ⟺ x = EX_1.

We therefore obtain

∀δ > 0 ∃N ∀n > N: P(|(1/n) Σ_{i=1}^n X_i − EX_1| ≥ ε) ≤ e^{−n min(I(EX_1+ε), I(EX_1−ε)) + nδ},

where I is the I-function introduced above evaluated for the random variables X_i − EX_1. The speed of convergence is thus exponentially fast.
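The exponential decay can be estimated by simulation. For fair coin flips the rate function is known explicitly, I(x) = x log(2x) + (1 − x) log(2(1 − x)); the sketch below (threshold 0.6 and the sample sizes are arbitrary choices) compares empirical decay rates with I(0.6):

```python
import math
import random

random.seed(7)

def estimate(n, trials=20_000):
    # Monte-Carlo estimate of P(S_n / n >= 0.6) for fair coin flips.
    hits = sum(
        1 for _ in range(trials)
        if sum(random.randint(0, 1) for _ in range(n)) / n >= 0.6
    )
    return hits / trials

rate20 = -math.log(estimate(20)) / 20
rate80 = -math.log(estimate(80)) / 80
I = 0.6 * math.log(1.2) + 0.4 * math.log(0.8)   # I(0.6) ~ 0.0201
print(rate20, rate80, I)   # empirical rates approach I(0.6) from above
```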
Exercise 9.10 Determine the functions M and I for a normally distributed random variable.

Exercise 9.11 Argue that if the moment generating function of a random variable X is finite, all its moments are finite. In particular both Laws of Large Numbers are applicable to a sequence X_1, X_2, ... of i.i.d. random variables distributed like X.
At the end of this section we will turn to two applications of the Law of Large Numbers which are interesting in their own right.

The first of these two applications is in number theory. Let (Ω, F, P) be given by Ω = [0, 1), F = B¹ ∩ [0, 1), and P = λ¹ (Lebesgue measure). For every number ω ∈ Ω we may consider its g-adic representation

ω = Σ_{n=1}^∞ ω_n g^{−n}.  (9.4)
for all ν = 0, ..., g − 1. Hence ω is g-normal if in the long run all of its digits occur with the same frequency. We will call ω absolutely normal if ω is g-normal for all g ∈ N, g ≥ 2.

Now for a number ω ∈ [0, 1) randomly chosen according to Lebesgue measure the ω_i(ω) are i.i.d. random variables; they have as their distribution the uniform distribution on the set {0, ..., g − 1}. This has to be shown in Exercise 9.13 below and is a consequence of the uniformity of Lebesgue measure. Hence the random variables

X_n^{ν,g}(ω) = 1 if ω_n(ω) = ν, and 0 otherwise

are i.i.d. random variables for each g and ν. Moreover put S_n^{ν,g}(ω) = Σ_{i=1}^n X_i^{ν,g}(ω). According to the Strong Law of Large Numbers (Theorem 9.3)

(1/n) S_n^{ν,g}(ω) → E(X_1^{ν,g}) = 1/g  λ¹-a.s.

for all ν ∈ {0, ..., g − 1} and all g ≥ 2. This means λ¹-almost every number ω is g-normal, i.e. there is a set N_g with λ¹(N_g) = 0, such that ω is g-normal for all ω ∈ N_g^c. Now

N := ⋃_{g=2}^∞ N_g

is also a λ¹-null set, and every ω ∉ N is absolutely normal.
It is rather surprising that hardly any normal numbers (in the usual meaning, see Footnote page 33) are known explicitly. Champernowne (1933) showed that

ω = 0.1234567891011121314...

is 10-normal. Whether √2, log 2, e or π are normal of any kind has not been shown yet. No explicitly given absolutely normal numbers are known at all.
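The digit frequencies of Champernowne's number can be inspected empirically; a small sketch (the cut-off N is an arbitrary choice, and convergence is slow):

```python
from collections import Counter

# Count the frequency of each digit among the first N digits of
# Champernowne's constant 0.123456789101112...
N = 200_000
digits = []
k = 1
while len(digits) < N:
    digits.extend(str(k))   # append the decimal digits of k
    k += 1
counts = Counter(digits[:N])

for d in "0123456789":
    print(d, counts[d] / N)
```

All ten frequencies are already close to 1/10, although the leading digits of the integers make the convergence noticeably slow.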
Exercise 9.13 Show that for every g ≥ 2, the random variables ε_n(ω) introduced above are i.i.d. random variables that are uniformly distributed on {0, ..., g − 1}.
The second application is to derive a classical result from analysis which in principle has nothing to do with probability theory, but which is closely related to the Strong Law of Large Numbers. As may be well known, the approximation theorem of Stone and Weierstraß asserts that every continuous function on [a, b] (more generally, on every compact set) can be approximated uniformly by polynomials. Obviously it suffices to prove this for [a, b] = [0, 1]. So let f ∈ C([0, 1]) be a continuous function on [0, 1]. Define the nth Bernstein polynomial for f as

B_n f(x) = Σ_{k=0}^n f(k/n) (n choose k) x^k (1 − x)^{n−k}.
Theorem 9.14 For each f ∈ C([0, 1]) the polynomials B_n f converge to f uniformly on [0, 1].
Proof. Since f is continuous and [0, 1] is compact, f is uniformly continuous on [0, 1], i.e. for each ε > 0 there exists δ > 0 such that |f(x) − f(y)| ≤ ε/2 whenever |x − y| ≤ δ. Writing B_n f(p) = E[f(S_n/n)] for a Binomial(n, p)-distributed random variable S_n and splitting the expectation according to whether |S_n/n − p| ≤ δ or not, one obtains with Chebyshev's inequality (note that V(S_n/n) = p(1 − p)/n ≤ 1/(4n))

|B_n f(p) − f(p)| ≤ ε/2 + 2‖f‖∞ P(|S_n/n − p| ≥ δ) ≤ ε/2 + 2‖f‖∞/(4nδ²).

Here ‖f‖∞ is the sup-norm of f. Hence

sup_{p∈[0,1]} |B_n f(p) − f(p)| ≤ ε/2 + 2‖f‖∞/(4nδ²),

and the right hand side is at most ε for n large enough, uniformly in p. Note that if one instead applied the Strong Law of Large Numbers pointwise, it would not be clear that N can be chosen independently of p, so that we would only get pointwise convergence.
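The convergence of Theorem 9.14 can be watched numerically; a sketch for the (arbitrarily chosen) test function f(x) = |x − 1/2|:

```python
import math

# Bernstein polynomial B_n f(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k)
def bernstein(f, n, x):
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)            # continuous, but not smooth at 1/2
grid = [i / 200 for i in range(201)]
errs = {n: max(abs(bernstein(f, n, x) - f(x)) for x in grid)
        for n in (10, 40, 160)}
print(errs)                           # sup-error shrinks like 1/sqrt(n)
```

The sup-error decreases roughly like 1/√n, matching the Chebyshev bound in the proof.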
The Central Limit Theorem analyzes the fine structure of the Law of Large Numbers. Its name is due to Pólya; the proof of the theorem given below goes back to Charles Stein.

First of all notice that, in a certain sense, in order to analyze the fine structure of Σ_{i=1}^n X_i the scaling 1/n of the Weak Law of Large Numbers is already an overscaling. On this scale we just cannot see the shape of the distribution anymore: by scaling Σ_{i=1}^n X_i with a factor 1/n we have reduced its variance to the scale 1/n, which converges to zero. What we see in the Law of Large Numbers is a bell shaped curve with a tiny, tiny width. Here is what we get, if we scale the variance to one:
Remark 10.2 Indeed, the Central Limit Theorem also holds under the following weaker assumptions. Assume given for n = 1, 2, ... an independent family of random variables X_{ni}, i = 1, ..., n. For j = 1, ..., n let

μ_{nj} := EX_{nj}

and

s_n := √( Σ_{i=1}^n V(X_{ni}) ).

The array (X_{ni}) is said to satisfy the Lindeberg condition if for every ε > 0

L_n(ε) := (1/s_n²) Σ_{j=1}^n E[ (X_{nj} − μ_{nj})² 1_{|X_{nj} − μ_{nj}| > ε s_n} ] → 0 as n → ∞.

Intuitively speaking the Lindeberg condition asks that none of the variables dominates the whole sum.

The generalized form of the CLT stated above now asserts that if the array (X_{ni}) satisfies the Lindeberg condition, it also satisfies the CLT, i.e.

lim_{n→∞} P( Σ_{i=1}^n (X_{ni} − μ_{ni}) / √( Σ_{i=1}^n V(X_{ni}) ) ≤ a ) = (1/√(2π)) ∫_{−∞}^a e^{−x²/2} dx.

The proof of this more general theorem basically mimics the proof we will give below for Theorem 10.1. We spare ourselves the additional technical work.
We will present a proof of the CLT that goes back to Charles Stein. It is based on a couple of facts:

Fact 1: It suffices to prove the CLT for i.i.d. random variables with EX_1 = 0. Otherwise one just subtracts EX_1 from each of the X_i.

Fact 2: Define

S_n := Σ_{i=1}^n X_i and σ² := V(X_1).
The importance of the above lemma becomes obvious, if we substitute a random variable X into (10.3) and take expectations:

E[g′(X) − Xg(X)] = E[f(X) − N(f)].

If X ~ N(0, 1) is standard normal, the right hand side is zero and so is the left hand side. The idea is thus that instead of showing that

E[f(U_n)] − N(f)

converges to zero, we may show the same for

E[g′(U_n) − U_n g(U_n)].

The next step discusses the function g introduced above.
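Stein's identity E[g′(Z)] = E[Z g(Z)] for Z ~ N(0, 1) can be checked by simple numerical quadrature; here a sketch with the (arbitrarily chosen) test function g(x) = sin x, for which both sides equal e^{−1/2}:

```python
import math

def gauss_expect(h, a=-10.0, b=10.0, m=20001):
    """Trapezoidal approximation of E[h(Z)] for Z ~ N(0,1)."""
    dx = (b - a) / (m - 1)
    total = 0.0
    for i in range(m):
        x = a + i * dx
        w = 0.5 if i in (0, m - 1) else 1.0
        total += w * h(x) * math.exp(-x * x / 2)
    return total * dx / math.sqrt(2 * math.pi)

lhs = gauss_expect(math.cos)                    # E[g'(Z)] with g = sin
rhs = gauss_expect(lambda x: x * math.sin(x))   # E[Z g(Z)]
print(lhs, rhs, math.exp(-0.5))
```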
Lemma 10.4 Let f : R → R be bounded and uniformly continuous and let g be the solution of

g′(x) − xg(x) = f(x) − N(f).   (10.4)

Then g(x), xg(x) and g′(x) are bounded and continuous.

Proof. Obviously g is even differentiable, hence continuous. But then also xg(x) is continuous. Finally

g′(x) = xg(x) + f(x) − N(f)

is continuous as the sum of continuous functions. For the boundedness part first note that any continuous function on a compact set is bounded, hence we only need to check that the functions g, xg, and g′ are bounded as x → ±∞.
To this end first note that

g(x) = e^{x²/2} ∫_{−∞}^x (f(y) − N(f)) e^{−y²/2} dy = −e^{x²/2} ∫_x^∞ (f(y) − N(f)) e^{−y²/2} dy.

Now for x ≤ 0

e^{x²/2} ∫_{−∞}^x e^{−y²/2} dy ≤ e^{x²/2} ∫_{−∞}^x (|y|/|x|) e^{−y²/2} dy = 1/|x|,

and similarly for x ≥ 0

e^{x²/2} ∫_x^∞ e^{−y²/2} dy ≤ e^{x²/2} ∫_x^∞ (y/x) e^{−y²/2} dy = 1/|x|.   (10.5)

Thus we see that for x ≤ −1 as well as for x ≥ 1

|g(x)| ≤ |xg(x)| ≤ sup_{y∈R} |f(y) − N(f)|.
Now we turn to proving the CLT.

Proof of Theorem 10.1. Besides assuming that EX_i = 0 for all i we may also assume that σ² = V(X_1) = 1. Otherwise we just replace X_i by X_i/σ. We write S_n := Σ_{i=1}^n X_i and recall that in order to prove the assertion it suffices to show that for all f bounded and continuous

E[ g′(S_n/√n) − (S_n/√n) g(S_n/√n) ] → 0

as n → ∞. Here g is defined as above.
But using the identity

(X_j/√n) ∫₀¹ [ g′((S_n − (1−s)X_j)/√n) − g′((S_n − X_j)/√n) ] ds
  = g(S_n/√n) − g((S_n − X_j)/√n) − (X_j/√n) g′((S_n − X_j)/√n),

which is to be proven in Exercise 10.5 below, we arrive at

E[ g′(S_n/√n) − (S_n/√n) g(S_n/√n) ]
  = Σ_{j=1}^n E[ (1/n) g′(S_n/√n) − (X_j/√n) g(S_n/√n) ]
  = Σ_{j=1}^n E[ (1/n) g′(S_n/√n) − (X_j/√n) g((S_n − X_j)/√n) − (X_j²/n) g′((S_n − X_j)/√n)
        − (X_j²/n) ∫₀¹ ( g′((S_n − (1−s)X_j)/√n) − g′((S_n − X_j)/√n) ) ds ]
  = Σ_{j=1}^n E[ (1/n) g′(S_n/√n) − (1/n) g′((S_n − X_j)/√n)
        − (X_j²/n) ∫₀¹ ( g′((S_n − (1−s)X_j)/√n) − g′((S_n − X_j)/√n) ) ds ].

In the last step we used linearity of expectation together with EX_i = 0 for all i, as well as independence of X_i and S_n − X_i for all i, together with EX_i² = 1. Let us define

T_j := (1/n) g′(S_n/√n) − (1/n) g′((S_n − X_j)/√n) − (X_j²/n) ∫₀¹ ( g′((S_n − (1−s)X_j)/√n) − g′((S_n − X_j)/√n) ) ds

and δ_j := E[T_j]. The idea will now be that the continuous function g′ is uniformly continuous on every compact set. So, if X_j/√n is small, so are g′(S_n/√n) − g′((S_n − X_j)/√n) and g′((S_n − (1−s)X_j)/√n) − g′((S_n − X_j)/√n), as long as S_n/√n stays inside the chosen compact set. On the other hand the probabilities that S_n/√n is outside a chosen large compact set or that X_j/√n is large are very small. This together with the boundedness of g and g′ will basically yield the proof. For K > 0, ε > 0 we write

δ_{1j} := E[ T_j 1_{|X_j/√n| ≤ ε} 1_{|S_n/√n| ≤ K} ],
δ_{2j} := E[ T_j 1_{|X_j/√n| ≤ ε} 1_{|S_n/√n| > K} ],
δ_{3j} := E[ T_j 1_{|X_j/√n| > ε} ].

Hence

Σ_{j=1}^n δ_j = Σ_{j=1}^n δ_{1j} + Σ_{j=1}^n δ_{2j} + Σ_{j=1}^n δ_{3j}.
For the δ_{2j}-terms note that |T_j| ≤ 2‖g′‖∞/n + 2‖g′‖∞ X_j²/n, so that

Σ_{j=1}^n |δ_{2j}| ≤ Σ_{j=1}^n [ (2‖g′‖∞/n) E( 1_{|S_n/√n| > K} ) + (2‖g′‖∞/n) E( X_j² 1_{|X_j/√n| ≤ ε} 1_{|S_n/√n| > K} ) ].

On the event {|X_j/√n| ≤ ε} ∩ {|S_n/√n| > K} we have |(S_n − X_j)/√n| > K − ε, and X_j is independent of S_n − X_j; hence

Σ_{j=1}^n |δ_{2j}| ≤ 2‖g′‖∞ P(|S_n/√n| > K) + (2‖g′‖∞/n) Σ_{j=1}^n E(X_j²) P(|(S_n − X_j)/√n| > K − ε)
  ≤ 2‖g′‖∞ P(|S_n/√n| > K) + 2‖g′‖∞ max_j P(|(S_n − X_j)/√n| > K − ε).

Since V(S_n/√n) ≤ 1 and V((S_n − X_j)/√n) ≤ 1, Chebyshev's inequality shows that both probabilities are at most ε, provided K (depending on ε only) is chosen large enough. For such K

Σ_{j=1}^n |δ_{2j}| ≤ 4‖g′‖∞ ε.

For the δ_{1j}-terms observe that for every fixed K > 0 the continuous function g′ is uniformly continuous on [−K − 1, K + 1]. This means that given ε > 0, there is δ > 0 such that |g′(x) − g′(y)| ≤ ε for all x, y ∈ [−K − 1, K + 1] with |x − y| ≤ δ.
For given ε > 0 we choose δ and K as in the first step, and we may assume ε ≤ min(δ, 1). Then on the event {|X_j/√n| ≤ ε} ∩ {|S_n/√n| ≤ K} all arguments of g′ appearing in T_j lie in [−K − 1, K + 1] and differ from each other by at most ε ≤ δ, so each difference of g′-values is bounded by ε. Hence

Σ_{j=1}^n |δ_{1j}| ≤ Σ_{j=1}^n E[ (1/n) |g′(S_n/√n) − g′((S_n − X_j)/√n)| 1_{|X_j/√n| ≤ ε} 1_{|S_n/√n| ≤ K} ]
    + Σ_{j=1}^n E[ (X_j²/n) ∫₀¹ |g′((S_n − (1−s)X_j)/√n) − g′((S_n − X_j)/√n)| ds · 1_{|X_j/√n| ≤ ε} 1_{|S_n/√n| ≤ K} ]
  ≤ Σ_{j=1}^n ( ε/n + (ε/n) E(X_j²) )
  = n · (2ε/n) = 2ε.
Similarly, for the δ_{3j}-terms,

Σ_{j=1}^n |δ_{3j}| ≤ Σ_{j=1}^n [ (2‖g′‖∞/n) E( 1_{|X_j/√n| > ε} ) + (2‖g′‖∞/n) E( X_j² 1_{|X_j/√n| > ε} ) ]
  = 2‖g′‖∞ P(|X_1| > ε√n) + 2‖g′‖∞ E( X_1² 1_{|X_1| > ε√n} ).

By Chebyshev's inequality the first term is at most 2‖g′‖∞/(ε²n), and the second term converges to zero as n → ∞ by dominated convergence, since EX_1² < ∞. Hence for n large enough

Σ_{j=1}^n |δ_{3j}| ≤ 4‖g′‖∞ ε.
Hence for a given ε > 0 with the choice of δ > 0 and K as above we obtain for n large enough

| E[ g′(S_n/√n) − (S_n/√n) g(S_n/√n) ] | = | Σ_{j=1}^n δ_{1j} + Σ_{j=1}^n δ_{2j} + Σ_{j=1}^n δ_{3j} | ≤ 2ε + 8‖g′‖∞ ε.

This can be made arbitrarily small by letting ε → 0. This proves the theorem.
Exercise 10.5 Let X_1, ..., X_n be i.i.d. random variables and S_n = Σ_{i=1}^n X_i. Let

g : R → R

be a continuously differentiable function. Show that for all j

∫₀¹ [ g′(S_n − (1−s)X_j) − g′(S_n − X_j) ] X_j ds = g(S_n) − g(S_n − X_j) − X_j g′(S_n − X_j).
We conclude the section with an informal discussion of two extensions of the Central Limit Theorem. The first is of practical importance, the second is of more theoretical interest.

When one tries to apply the Central Limit Theorem, e.g. for a sequence of i.i.d. random variables, it is of course not only important to know that

X̄_n := Σ_{i=1}^n (X_i − EX_1) / √(n V(X_1))

converges to a random variable Z ~ N(0, 1). One also needs to know how close the distributions of X̄_n and Z are. This is stated in the following theorem due to Berry and Esseen:

Theorem 10.6 (Berry-Esseen) Let (X_i)_{i∈N} be a sequence of i.i.d. random variables with E(|X_1|³) < ∞. Then for a N(0, 1)-distributed random variable Z it holds:

sup_{a∈R} | P( Σ_{i=1}^n (X_i − EX_1) / √(n V(X_1)) ≤ a ) − P(Z ≤ a) | ≤ C E(|X_1 − EX_1|³) / ( √n (V(X_1))^{3/2} ).

The numerical value of C is below 6 and larger than 0.4; the upper bound is rather easy to prove.
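For a lattice example the left hand side of the Berry-Esseen bound can be computed exactly: for a sum of n fair ±1 signs (so σ = E|X_1|³ = 1) the distribution of S_n/√n is a shifted binomial, and the sup-distance to the normal CDF is a finite maximum over the atoms. A sketch:

```python
import math

def phi(a):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def sup_distance(n):
    """Exact sup_a |P(S_n/sqrt(n) <= a) - Phi(a)| for fair +-1 signs."""
    probs = [math.comb(n, k) / 2**n for k in range(n + 1)]
    cdf, dist = 0.0, 0.0
    for k, p in enumerate(probs):
        a = (2 * k - n) / math.sqrt(n)        # atom of S_n/sqrt(n)
        dist = max(dist, abs(cdf - phi(a)))   # just below the atom
        cdf += p
        dist = max(dist, abs(cdf - phi(a)))   # at the atom
    return dist

dists = {n: sup_distance(n) for n in (25, 100, 400)}
print(dists, {n: 6 / math.sqrt(n) for n in dists})
```

The exact distances decay like 1/√n and stay well below the bound with C = 6.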
The second extension of the Central Limit Theorem starts with the following observation: Let X_1, X_2, ... be a sequence of i.i.d. random variables with finite variance and expectation zero. Then the law of large numbers says that (1/n) Σ_{i=1}^n X_i converges to EX_1 = 0 in probability and almost surely. But it tells nothing about the size of the fluctuations. This is considered in greater detail by the Central Limit Theorem. The latter describes the asymptotic probabilities

P( Σ_{i=1}^n (X_i − EX_1) / √(n V(X_1)) ≥ a ).

Since these probabilities are positive for all a ∈ R according to the Central Limit Theorem, it can be shown that the fluctuations of Σ_{i=1}^n X_i are larger than √n; more precisely, for each positive a ∈ R it holds with probability one that

lim sup_{n→∞} Σ_{i=1}^n X_i / √n ≥ a.

The question for the precise size of the fluctuations, i.e. for the right scaling (a_n) such that

lim sup_{n→∞} Σ_{i=1}^n X_i / a_n

is almost surely finite, is answered by the law of the iterated logarithm:
Theorem 10.7 (Law of the Iterated Logarithm, Hartman-Wintner) Let (X_i)_{i∈N} be a sequence of i.i.d. centered random variables with σ² := V(X_1) < ∞ (σ > 0). Then for S_n := Σ_{i=1}^n X_i it holds

lim sup_{n→∞} S_n / √(2n log log n) = +σ  P-a.s.

and

lim inf_{n→∞} S_n / √(2n log log n) = −σ  P-a.s.

Due to the restricted time we will not be able to prove the Law of the Iterated Logarithm in the context of this course. Despite its theoretical interest its practical relevance is rather limited. To understand why, notice that the correction to the √(n V(X_1)) scaling from the Central Limit Theorem to the Law of the Iterated Logarithm is of order √(log log n). Even for a fantastically large number of observations, 10^100 (which is more than one observation per atom in the universe), √(log log n) is really small, e.g.

√(log log 10^100) = √(log(100 log 10)) = √(log 100 + log log 10) ≈ √5.44 ≈ 2.33.
11 Conditional Expectation
To understand the concept of conditional expectation, we will start with a little example.
Example 11.1 Let Ω be a finite population and let the random variable X(ω) denote the income of person ω. So, if we are only interested in income, X contains the full information of our experiment. Now assume we are a sociologist and want to measure the influence of a person's religion on his income. So we are not interested in the full information given by X, but only in how X behaves on each of the sets of people sharing the same religion.

The basic idea of conditional expectation will be that, given a random variable

X : (Ω, F) → R

and a sub-σ-algebra A ⊆ F, we look for an A-measurable random variable X_0 with

∫_C X_0 dP = ∫_C X dP

for all C ∈ A. So X_0 contains all information necessary when we only consider events in A. First we need to see that such an X_0 can be found in a unique way.
Theorem 11.2 Let (Ω, F, P) be a probability space and X an integrable random variable. Let C ⊆ F be a sub-σ-algebra. Then (up to P-a.s. equality) there is a unique random variable X_0, which is C-measurable and satisfies

∫_C X_0 dP = ∫_C X dP for all C ∈ C.   (11.1)

If X ≥ 0, then X_0 ≥ 0 P-a.s.

Proof. First we treat the case X ≥ 0. Denote P_0 := P|_C and Q := (XP)|_C. Both P_0 and Q are measures on C; P_0 even is a probability measure. By definition

Q(C) = ∫_C X dP.

Hence Q(C) = 0 for all C with P(C) = 0 = P_0(C). Hence Q ≪ P_0. By the theorem of Radon-Nikodym there is a C-measurable function X_0 ≥ 0 on Ω such that Q = X_0 P_0. Thus

∫_C X_0 dP_0 = ∫_C X dP for all C ∈ C,

hence

∫_C X_0 dP = ∫_C X dP for all C ∈ C,

so X_0 satisfies (11.1). For uniqueness, let X_0′ be C-measurable and satisfy (11.1). The set C = {X_0′ < X_0} is in C and ∫_C X_0′ dP = ∫_C X_0 dP, whence P(C) = 0. In the same way P(X_0′ > X_0) = 0. Therefore X_0′ is P-a.s. equal to X_0.

The proof for arbitrary, integrable X is left to the reader.
Definition 11.4 Under the conditions of Theorem 11.2 the random variable X_0 (which is P-a.s. unique) is called the conditional expectation of X given C. It is denoted by

X_0 =: E[X | C] =: E^C[X].

Note that (11.1) determines E[X | C] only P-a.s. on sets C ∈ C. We therefore also speak about different versions of the conditional expectation.

Example 11.5 1. If C = {∅, Ω}, then the constant random variable EX is a version of E[X | C]. Indeed, if C = ∅, then any random variable does the job. If C = Ω, then

∫_C X dP = EX = ∫_C EX dP.
Exercise 11.7 Show that the following assertions for the conditional expectation E[X | C] of random variables

X, Y : (Ω, A) → (R, B¹)

(C ⊆ A a sub-σ-algebra) are true:

1. E[E[X | C]] = EX.

2. If X is C-measurable, then E[X | C] = X P-a.s.

4. If X = α P-a.s. for some α ∈ R, then E[X | C] = α P-a.s.

5. E[αX + βY | C] = αE[X | C] + βE[Y | C] P-a.s. Here α, β ∈ R.

The following theorems have proofs that are almost identical with the proofs of the corresponding theorems for expectations; for instance, for 0 ≤ X_n ↑ X one has

lim_{n→∞} E[X_n | C] = E[X | C] P-a.s.
Theorem 11.10 (Jensen's inequality) Let X be an integrable random variable taking values in an open interval I ⊆ R and let

q : I → R

be convex. Then E[X | C] takes its values in I and

q(E[X | C]) ≤ E[q ∘ X | C].

In particular, for p ≥ 1,

|E[X | C]|^p ≤ E[|X|^p | C],

which implies

E(|E[X | C]|^p) ≤ E(|X|^p).

Denoting

N_p(f) = ( ∫ |f|^p dP )^{1/p},

this means

N_p(E[X | C]) ≤ N_p(X), X ∈ L^p(P).

This holds for 1 ≤ p < ∞. N_p(f) is called the L^p-norm of f. The case p = ∞, which means that if X is bounded P-a.s. by some M ≥ 0, then so is E[X | C], follows from Exercise 11.7.
We slightly reformulate the definition of conditional expectation to discuss its further properties.

Proof. From (11.1) we obtain (11.2) for step functions. The general result follows from monotone convergence.

We are now prepared to show a number of properties of conditional expectations which we will call smoothing properties:

1. If X is C-measurable (and XY is integrable), then

E[XY | C] = X E[Y | C].   (11.3)

2. If C_1, C_2 ⊆ F with C_1 ⊆ C_2, then

E[E[X | C_2] | C_1] = E[E[X | C_1] | C_2] = E[X | C_1].

Proof.

1. First assume that X, Y ≥ 0. Let X be C-measurable and C ∈ C. Then

∫_C XY dP = ∫ 1_C XY dP = ∫ 1_C X E[Y | C] dP = ∫_C X E[Y | C] dP.

Since X E[Y | C] is C-measurable, this means

E[XY | C] = X E[Y | C] P-a.s.

In the case X ∈ L^p(P), Y ∈ L^q(P) we observe that then XY ∈ L¹(P) and conclude as above.

2. Since E[X | C_1] is C_1-measurable, hence C_2-measurable, part 1 gives

E[E[X | C_1] | C_2] = E[X | C_1] P-a.s.
The previous theorem leads to yet another characterization of the conditional expectation. To this end take X ∈ L²(P) and denote X_0 := E[X | C] for a sub-σ-algebra C ⊆ F. Let Z ∈ L²(P) be C-measurable. Then X_0 ∈ L²(P) and by (11.3)

E[Z(X − X_0) | C] = Z E[X − X_0 | C] = Z(E[X | C] − X_0) = Z(X_0 − X_0) = 0.

Taking expectations, E[Z(X − X_0)] = 0, i.e. X − X_0 is orthogonal in L²(P) to every C-measurable Z ∈ L²(P).

Theorem 11.13 For all X ∈ L²(P) and each sub-σ-algebra C ⊆ F the conditional expectation E[X | C] is (up to a.s. equality) the unique C-measurable random variable X_0 ∈ L²(P) that minimizes E[(X − Y)²] over all C-measurable Y ∈ L²(P).

Proof. Let Y ∈ L²(P) be C-measurable. Put X_0 := E[X | C]. By the orthogonality above,

E[(X − Y)²] = E[(X − X_0)²] + E[(X_0 − Y)²] ≥ E[(X − X_0)²],

with equality if and only if

E[(X_0 − Y)²] = 0,

i.e. Y = X_0 P-a.s.

Exercise 11.14 Prove that for X ∈ L²(P), μ = E(X) is the number that minimizes E((X − μ)²).
With the help of conditional expectation we can also give a new definition of conditional probability.

In a last step we will introduce (but not prove) conditional expectations on events with zero probability. Of course, in general this will just give nonsense, but in the case of a conditional expectation E[X | Y = y], where X, Y are random variables such that (X, Y) has a Lebesgue density, we can give this expression a meaning.

Theorem 11.17 Let X, Y be real valued random variables such that (X, Y) has a density f : R² → R⁺ ∪ {0} with respect to two-dimensional Lebesgue measure λ². Assume that X is integrable and that

f_0(y) := ∫ f(x, y) dx > 0 for all y ∈ R.

Then

y ↦ E(X | Y = y) := (1/f_0(y)) ∫ x f(x, y) dx

defines a version of the conditional expectation of X given Y. In particular

E(X | Y) = (1/f_0(Y)) ∫ x f(x, Y) dx P-a.s.

We will also need the following relationship between conditional expectation and independence, which is a generalization of Example 11.5, case 1.

Lemma 11.18 Let X be an integrable real valued random variable and C ⊆ F a sub-σ-algebra such that X is independent of C, that is σ(X) and C are independent. Then E[X | C] = EX P-a.s.

Exercise 11.19 Let X and Y be as in Theorem 11.17, such that X and Y are independent. Then X is independent of σ(Y), and by Lemma 11.18 we have E(X | Y) = E(X | σ(Y)) = E(X). Apply Theorem 11.17 to give an alternative derivation of this fact.
12 Martingales
In this section we are going to define a notion that will turn out to be of central interest in all of so-called stochastic analysis and mathematical finance. A key role in its definition
will be taken by conditional expectation. In this section we will just give the definition and
a couple of examples. There is a rich theory of martingales. Parts of this theory we will
meet in a class on Stochastic Calculus.
Definition 12.1 Let (Ω, F, P) be a probability space and I a linearly ordered set, i.e. for s, t ∈ I either s ≤ t or t ≤ s, where s ≤ t and t ≤ s implies s = t, and s ≤ t, t ≤ u implies s ≤ u. For t ∈ I let F_t ⊆ F be a σ-algebra. (F_t)_{t∈I} is called a filtration, if s ≤ t implies F_s ⊆ F_t. A sequence of random variables (X_t)_{t∈I} is called (F_t)_{t∈I}-adapted if X_t is F_t-measurable for all t ∈ I.
Example 12.3 Let (X_t)_{t∈I} be a family of random variables, and I a linearly ordered set. Then

F_t = σ{X_s, s ≤ t}

is a filtration and (X_t) is adapted with respect to (F_t). (F_t)_{t∈I} is called the canonical filtration with respect to (X_t)_{t∈I}.
Definition 12.4 Let (Ω, F, P) be a probability space and I a linearly ordered set. Let (F_t)_{t∈I} be a filtration and (X_t)_{t∈I} an (F_t)-adapted sequence of random variables. (X_t) is called an (F_t)-supermartingale, if for all s ≤ t

E[X_t | F_s] ≤ X_s P-a.s.,

an (F_t)-submartingale, if for all s ≤ t

E[X_t | F_s] ≥ X_s P-a.s.,

and an (F_t)-martingale, if for all s ≤ t

E[X_t | F_s] = X_s P-a.s.   (12.1)

The corresponding integrated condition reads

∫_C X_t dP ≤ ∫_C X_s dP for all s ≤ t and all C ∈ F_s   (12.2)

(with "≥" and "=" for submartingales and martingales, respectively).
Exercise 12.5 Show that the conditions (12.1) and (12.2) are equivalent.
Remark 12.6 1. If (F_t) is the canonical filtration with respect to (X_t)_{t∈I}, then (X_t) is often simply called a supermartingale, submartingale, or a martingale.

2. (12.1) and (12.2) are evidently correct for s = t (with equality). Hence these properties only need to be checked for s < t.

3. Putting C = Ω in (12.2) we obtain for a supermartingale (X_t)_t

s ≤ t ⟹ E(X_s) ≥ E(X_t).

Hence for supermartingales (E(X_s))_s is a decreasing sequence, while for a submartingale (E(X_s))_s is an increasing sequence.

4. In particular, if each of the random variables X_s is almost surely constant, e.g. if Ω is a singleton (a set with just one element), then (X_s) is a decreasing sequence, if (X_s) is a supermartingale, and it is an increasing sequence, if (X_s) is a submartingale. Hence martingales are (in a certain sense) the stochastic generalization of constant sequences.
Exercise 12.7 Let (X_t), (Y_t) be adapted to the same filtration and α, β ∈ R. Show the following:

1. If (X_t) and (Y_t) are martingales, then (αX_t + βY_t) is a martingale.

2. If (X_t) and (Y_t) are supermartingales, then so is (X_t ∧ Y_t) = (min(X_t, Y_t)).

3. If (X_t) is a submartingale, so is (X_t⁺) (with respect to the same filtration (F_t)).
Of course, at first glance the definition of a martingale may look a bit weird. We will therefore give a couple of examples to show that it is not as strange as expected.

Example 12.8 Let (X_n) be an i.i.d. sequence of integrable R-valued random variables. Put S_n = X_1 + ... + X_n and consider the canonical filtration F_n = σ(S_m, m ≤ n). By Lemma 11.18 we have

E[X_{n+1} | S_1, ..., S_n] = E[X_{n+1}] P-a.s.

and by part 2 of Exercise 11.7

E[X_i | S_1, ..., S_n] = X_i P-a.s.

for all i = 1, ..., n. Adding these n + 1 equations gives

E[S_{n+1} | F_n] = S_n + E[X_{n+1}] P-a.s.

If EX_i = 0 for all i, then

E[S_{n+1} | F_n] = S_n,

i.e. (S_n) is a martingale. If E[X_i] ≤ 0, then

E[S_{n+1} | F_n] ≤ S_n,

i.e. (S_n) is a supermartingale. In the same way (S_n) is a submartingale if EX_i ≥ 0.
Example 12.9 Consider the following game. For each n ∈ N a coin with probability p for heads is tossed. If it shows heads (X_n = +1) our player wins money, otherwise (X_n = −1) he loses money. The way he wins or loses is determined in the following way. Before the game starts he chooses a sequence (ϱ_n)_n of functions

ϱ_n : {H, T}^n → R⁺.

In round number n + 1 he plays for ϱ_n(X_1, ..., X_n) Euros, depending on how the first n games ended. If we denote by S_n his capital at time n, then

S_{n+1} = S_n + ϱ_n(X_1, ..., X_n) X_{n+1}.

Hence for p = 1/2

E[S_{n+1} | X_1, ..., X_n] = S_n,

so (S_n) is a martingale, while for p > 1/2

E[S_{n+1} | X_1, ..., X_n] ≥ S_n,

hence (S_n) is a submartingale, and for p < 1/2

E[S_{n+1} | X_1, ..., X_n] ≤ S_n,

so (S_n) is a supermartingale. This explains the idea that martingales are generalizations of fair games.
Exercise 12.11 Consider the gambler's martingale. Consider an i.i.d. sequence (X_n)_{n=1}^∞ of Bernoulli variables with values −1 and 1, each with probability 1/2. Consider the sequence (Y_n) such that Y_n = 2^{n−1} if X_1 = ⋯ = X_{n−1} = −1, and Y_n = 0 if X_i = 1 for some i ≤ n − 1. Show that S_n = Σ_{i=1}^n X_i Y_i is a martingale. Show that S_n almost surely converges and determine its limit S_∞. Observe that S_n ≠ E(S_∞ | F_n).
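The point of Exercise 12.11 can be checked by exhaustive enumeration over a finite horizon (assuming, as above, the doubling convention Y_n = 2^{n−1} until the first win): every path is equally likely, E(S_n) = 0 for every n, yet S_n = 1 on all paths containing at least one win. A sketch:

```python
from itertools import product

n = 10
total, wins = 0, 0
for path in product((-1, 1), repeat=n):   # 2^n equally likely paths
    s, stake = 0, 1
    for x in path:
        s += stake * x
        if x == 1:
            stake = 0        # stop playing after the first win
        elif stake:
            stake *= 2       # double the stake after a loss
    total += s
    wins += (s == 1)
print(total / 2**n, wins / 2**n)   # mean 0, but S_n = 1 on almost every path
```

The mean is exactly 0 (martingale), while S_n = 1 on all paths except the single all-loss path, illustrating S_∞ = 1 a.s. together with E(S_n) = 0.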
Example 12.12 In a sense Example 12.8 is both a special case and a generalization of the following example. To this end let X_1, ..., X_n, ... denote an i.i.d. sequence of R^d-valued random variables. Assume

P(X_i = +ϱ_k) = P(X_i = −ϱ_k) = 1/(2d)

for all i = 1, 2, ... and all k = 1, ..., d. Here ϱ_k denotes the k-th unit vector. Define the stochastic process (S_n) by

S_0 = 0

and

S_n = Σ_{i=1}^n X_i.

This process is called the random walk in d dimensions. Some of its properties will be discussed below. First we will see that indeed (S_n) is a martingale. Indeed,

E[S_{n+1} | X_1, ..., X_n] = E[X_{n+1}] + S_n = S_n.

As a matter of fact, not only is (S_n) a martingale but, in a certain sense, it is the discrete time martingale.
Since the random walk in d dimensions is the model of a discrete time martingale (the standard model of a continuous time martingale will be introduced in the following section), it is worthwhile studying some of its properties. This has been done in thousands of research papers in the past 50 years. We will just mention one interesting property here,
that reveals a dichotomy in the random walk's behavior for dimensions d = 1, 2 versus d ≥ 3. The walk is called recurrent in x, if with probability one it visits x infinitely often; it is called transient in x, if with probability one it visits x only finitely often.

To prove a version of Theorem 12.16 we will first discuss the property of recurrence:

Lemma 12.17 Let f_k denote the probability that the random walk returns to the origin after k steps for the first time, and let p_k denote the probability that the random walk is at the origin after k steps. Then the random walk (S_n) is recurrent if and only if Σ_k f_k = 1, and this is the case if and only if Σ_k p_k = ∞.
Proof. The first equivalence is easy. Denote by Ω_k the set of all realizations of the random walk returning to the origin for the first time after k steps. If the random walk (S_n) is recurrent, then with probability one there exists a k > 0 such that S_k = 0 and S_l ≠ 0 for all 0 < l < k. Hence Σ_k f_k = 1. On the other hand, if Σ_k f_k = 1, then P(∪_k Ω_k) = 1. Hence with probability one there exists a k > 0 such that S_k = 0 and S_l ≠ 0 for all 0 < l < k. But then the situation at times 0 and k is completely the same, and hence there exists k′ > k such that S_{k′} = 0 and S_l ≠ 0 for all k < l < k′. Iterating this gives that S_k = 0 for infinitely many k's with probability one.

In order to relate f_k and p_k we derive the following recursion:

p_k = f_k p_0 + f_{k−1} p_1 + ⋯ + f_0 p_k   (12.3)

(the last summand is just added for completeness; we have f_0 = 0). Indeed this is again easy to see. The left hand side is just the probability to be at the origin at time k. This event is the disjoint union of the events to be at 0 for the first time after 1 ≤ l ≤ k steps and to walk from zero to zero in the remaining k − l steps. Hence we obtain

p_k = Σ_{i=1}^k f_i p_{k−i} and p_0 = 1.   (12.4)

Introduce the generating functions P(z) := Σ_{k=0}^∞ p_k z^k and F(z) := Σ_{k=0}^∞ f_k z^k for 0 ≤ z < 1. Multiplying the left and right hand sides in (12.4) with z^k and summing over k from 0 to infinity gives

P(z) = 1 + P(z)F(z),

i.e.

F(z) = 1 − 1/P(z).

By Abel's theorem

Σ_{k=1}^∞ f_k = F(1) = lim_{z↑1} F(z) = 1 − lim_{z↑1} 1/P(z).

First assume that Σ_k p_k < ∞. Then

lim_{z↑1} P(z) = P(1) = Σ_k p_k < ∞

and thus

lim_{z↑1} 1/P(z) = 1/Σ_k p_k > 0.

Hence Σ_{k=1}^∞ f_k < 1 and the random walk (S_n) is transient.

Next assume that Σ_k p_k = ∞ and fix ε > 0. Then we find N such that

Σ_{k=0}^N p_k ≥ 2/ε.

Then for z sufficiently close to one we have Σ_{k=0}^N p_k z^k ≥ 1/ε, and consequently for such z

1/P(z) ≤ 1/( Σ_{k=0}^N p_k z^k ) ≤ ε.

As ε > 0 was arbitrary, lim_{z↑1} 1/P(z) = 0, hence Σ_k f_k = 1 and the random walk is recurrent.
Exercise 12.18 What has the Borel-Cantelli Lemma 8.1 to say in the above situation?
Don't overlook that the events {S_n = x} (n ∈ N) may be dependent.
We will now apply this criterion to analyze recurrence and transience for a random walk
similar to the one defined in Example 12.12.
To this end define the following random walk (R_n) in d dimensions. For k ∈ N let Y_1^k, ..., Y_d^k be i.i.d. random variables, taking values in {−1, +1} with P(Y_1^k = 1) = P(Y_1^k = −1) = 1/2. Let X_k be the random vector X_k = (Y_1^k, ..., Y_d^k). Define R_0 ≡ 0 and for n ≥ 1

R_n = Σ_{k=1}^n X_k.
Theorem 12.19 The random walk (R_n) defined above is recurrent in dimensions d = 1, 2 and transient for d ≥ 3.
Proof. Each coordinate of (R_n) is a simple random walk on Z, and the d coordinates are independent. The probability q_k that a single coordinate is back at 0 after 2k steps can be computed with Stirling's formula

k! ≈ √(2πk) (k/e)^k

to obtain

q_k = (2k choose k) 2^{−2k} = (2k)!/(k! k!) 2^{−2k} ≈ [ √(4πk) (2k/e)^{2k} / (2πk (k/e)^{2k}) ] 2^{−2k} = 1/√(πk).

Hence the probability of a single coordinate of R_{2k} to be zero (R_n cannot be zero if n is odd) asymptotically behaves like √(1/(πk)). By independence of the coordinates,

P(R_{2k} = 0) ≈ (1/(πk))^{d/2}.
But

Σ_n (1/n)^{d/2} = ∞

for d = 1 and d = 2, while

Σ_n (1/n)^{d/2} < ∞

for d ≥ 3. By the criterion of Lemma 12.17 this proves the theorem.
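The dichotomy can be made visible numerically: with q_k = C(2k, k) 4^{−k} and P(R_{2k} = 0) = q_k^d, the partial sums of the return probabilities keep growing for d = 1, 2 but essentially saturate for d = 3. A sketch (the recurrence q_k = q_{k−1}(2k − 1)/(2k) avoids huge binomials):

```python
# Partial sums of the return probabilities P(R_{2k} = 0) = q_k^d,
# with q_k = C(2k,k) 4^{-k} computed via q_k = q_{k-1} (2k-1)/(2k).
def partial_sum(d, N):
    s, qk = 0.0, 1.0
    for k in range(1, N + 1):
        qk *= (2 * k - 1) / (2 * k)
        s += qk ** d
    return s

for d in (1, 2, 3):
    print(d, partial_sum(d, 10**3), partial_sum(d, 10**5))
```

For d = 1 the sums grow like √N, for d = 2 like log N, while for d = 3 they barely change between N = 10³ and N = 10⁵.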
We will give two results about martingales. The first one is inspired by the fact that, given a random variable M and a filtration (F_t) of F, the family of conditional expectations E(M | F_t) yields a martingale. Can a martingale always be described in this way? That is the content of the Martingale Limit Theorem.

Theorem 12.20 Suppose (M_t) is a martingale with respect to the filtration (F_t) and that the martingale is (uniformly) square integrable, that is lim sup_t E(M_t²) < ∞. Then there is a square integrable random variable M_∞ such that M_t = E(M_∞ | F_t) a.s. Moreover lim_{t→∞} M_t = M_∞ in the L² sense.

Proof. The basic property that we will use is that the space of square integrable random variables L²(Ω, F, P) with the L² inner product is a complete vector space. In other words, a Cauchy sequence converges. Recall that for any t < s it holds that M_t = E(M_s | F_t), and we have seen in Theorem 11.13 that then M_s − M_t is perpendicular to M_t. In particular we have the Pythagoras formula

E(M_s²) = E(M_t²) + E((M_s − M_t)²).

This implies that E(M_s²) is increasing in s, and therefore its limit exists and equals lim sup_t E(M_t²), which is finite. Therefore given ε > 0 there is a u such that for t > s > u we have E((M_s − M_t)²) < ε. That means that {M_s}_s is a Cauchy sequence. Let M_∞ be its limit in L²; in particular M_∞ is a square integrable random variable. Since orthogonal projection onto the subspace of F_t-measurable functions is a continuous map, it holds that E(M_∞ | F_t) = lim_{s→∞} E(M_s | F_t) = M_t.

With some extra effort one may show that M_∞ is also the limit in the sense of almost sure convergence. The Martingale Limit Theorem is valid under more general circumstances; for example, lim sup_t E(|M_t|) < ∞ already guarantees an almost sure limit M_∞ (for convergence in the L¹ sense one needs in addition uniform integrability).
An important concept for random processes is the concept of a stopping time.
Definition 12.21 A stopping time is a random variable τ : Ω → I ∪ {∞}, such that for all t ∈ I, {ω ; τ(ω) ≤ t} is F_t-measurable. Here I ∪ {∞} is ordered such that t < ∞ for all t ∈ I. Given a process (M_t), the stopped process (M_t^τ) is given by M_t^τ(ω) = M_s(ω) where s = τ(ω) ∧ t = min(τ(ω), t).
Exercise 12.23 If T : Ω → R is constant, then T is a stopping time. If S and T are stopping times, then max(S, T) and min(S, T) are stopping times.
Exercise 12.25 Modify the proof of Theorem 12.24 to show that a stopped supermartingale
is a supermartingale.
Exercise 12.26 Consider the roulette game. There are several possibilities for a bet, given by a number p ∈ (0, 1) such that with probability (36/37)p the return is p^{−1} times the stake, and the return is zero with probability 1 − (36/37)p. The probabilities p are such that p^{−1} ∈ N. Suppose you start with an initial fortune X_0 ∈ N, and perform a sequence of bets until this fortune is reduced to zero. We are interested in the expected value of the total sum of stakes. To determine this, consider the sequence of subsequent fortunes X_i, and consider the sequence of stakes Y_i, meaning that the stake in bet i is Y_i = Y_i(X_1, ..., X_{i−1}) (Y_i ≤ X_{i−1}). In particular, if for this stake the probability p is chosen, either X_i = X_{i−1} − Y_i + p^{−1} Y_i (with probability (36/37)p) or X_i = X_{i−1} − Y_i (with probability 1 − (36/37)p). Show that (X_i + (1/37) Σ_{j=1}^i Y_j)_i is a martingale with respect to the filtration F_i = σ(X_1, ..., X_i). The stopping time N is the first time i such that X_i = 0. Show that E(Σ_{j=1}^N Y_j) = 37 X_0.
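The martingale of Exercise 12.26 rests on a one-step identity that can be checked by exact arithmetic: staking Y on any admissible bet p lowers the expected fortune by exactly Y/37, i.e. E[X_i | F_{i−1}] + Y_i/37 = X_{i−1}. A sketch with a few made-up stakes and bets:

```python
from fractions import Fraction

def expected_next(x, y, p):
    """E[X_i] after staking y of fortune x on a bet with win probability
    (36/37)p and payout y/p; exact rational arithmetic."""
    win = Fraction(36, 37) * p
    return win * (x - y + y / p) + (1 - win) * (x - y)

for p in (Fraction(1, 36), Fraction(1, 2), Fraction(1, 3)):
    for y in (Fraction(1), Fraction(5)):
        x = Fraction(10)
        # E[X_i] + y/37 equals the previous fortune x exactly
        print(p, y, expected_next(x, y, p) + y / 37 == x)
```

Algebraically E[X_i] = x − y + (36/37)y = x − y/37, independently of p, which is exactly the martingale compensation 1/37 per Euro staked.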
13 Brownian motion
In this section we will construct the continuous time martingale, Brownian motion. Besides
this, Brownian motion is also a building block of stochastic calculus and stochastic analysis.
In stochastic analysis one studies random functions of one variable and various kinds
of integrals and derivatives thereof. The argument of these functions is usually interpreted
as time, so the functions themselves can be thought of as the path of a random process.
Here, like in other areas of mathematics, going from the discrete to the continuous yields a pay-off in simplicity and smoothness, at the price of a formally more complicated analysis. Compare, to make an analogy, the integral ∫₀^n x³ dx with the sum Σ_{k=1}^n k³. The integral requires a more refined analysis for its definition and its properties, but once this has been done the integral is easier to calculate. Similarly, in stochastic analysis you will become acquainted with a convenient differential calculus as a reward for some hard work in analysis.
Stochastic analysis can be applied in a wide variety of situations. We sketch a few
examples below.
1. Some differential equations become more realistic when we allow some randomness in their coefficients. Consider for example the following growth equation, used among other places in population biology:

(d/dt) S_t = (r + N_t) S_t.   (13.1)

Here, S_t is the size of the population at time t, r is the average growth rate of the population, and the noise N_t models random fluctuations in the growth rate.
2. At time t = 0 an investor buys stocks and bonds on the financial market, i.e., he divides his initial capital C_0 into A_0 shares of stock and B_0 shares of bonds. The bonds will yield a guaranteed interest rate r′. If we assume that the stock price S_t satisfies the growth equation (13.1), then his capital C_t at time t is

C_t = A_t S_t + B_t e^{r′t},   (13.2)

where A_t and B_t are the amounts of stocks and bonds held at time t. With a keen eye on the market the investor sells stocks to buy bonds and vice versa. If his tradings are self-financing, then dC_t = A_t dS_t + B_t d(e^{r′t}). An interesting question is:

- What would he be prepared to pay for a so-called European call option, i.e., the right (bought at time 0) to purchase at time T > 0 a share of stock at a predetermined price K?

The rational answer, q say, was found by Black and Scholes (1973) through an analysis of the possible strategies leading from an initial investment q to a payoff C_T. Their formula is being used on the stock markets all over the world.
3. The Langevin equation describes the behaviour of a dust particle suspended in a fluid:

m (d/dt) V_t = −γ V_t + N_t.   (13.3)

Here, V_t is the velocity at time t of the dust particle, the friction exerted on the particle due to the viscosity of the fluid is −γ V_t, and the noise N_t stands for the disturbance due to the thermal motion of the surrounding fluid molecules colliding with the particle.
58
4. The path of the dust particle in example 3 is observed with some inaccuracy. One measures the perturbed signal Z_t given by

Z_t = V_t + Ñ_t.   (13.4)

Here Ñ_t is again a noise. One is interested in the best guess for the actual value of V_t, given the observations Z_s for 0 ≤ s ≤ t. This is called a filtering problem: how to filter away the noise Ñ_t. Kalman and Bucy (1961) found a linear algorithm, which was almost immediately applied in aerospace engineering. Filtering theory is now a flourishing and extremely useful discipline.
5. Stochastic analysis can help solve boundary value problems such as the Dirichlet
problem. If the value of a harmonic function f on the boundary of some bounded
regular region D ⊆ Rn is known, then one can express the value of f in the interior
of D as follows:

    E (f (B^x_τ)) = f (x),    (13.5)

where B^x_t := x + ∫_0^t Ns ds is an integrated noise or Brownian motion, starting at x,
and τ denotes the time when this Brownian motion first reaches the boundary of D. (A
harmonic function f is a function satisfying ∆f = 0, with ∆ the Laplacian.)
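Formula (13.5) suggests a numerical method: simulate many approximate Brownian paths started at x until they leave D and average the boundary values of f at the exit points. The sketch below does this on the unit disc; all names and parameters are our illustration, assuming small Gaussian steps approximate the Brownian path:

```python
import math
import random

def dirichlet_mc(f_boundary, x0, y0, n_paths=1000, dt=1e-3, seed=1):
    # Monte Carlo solution of the Dirichlet problem on the unit disc via (13.5):
    # f(x) = E[f(B_tau^x)], tau = first exit time of Brownian motion from D.
    rng = random.Random(seed)
    s = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        x, y = x0, y0
        while x * x + y * y < 1.0:  # run until the path leaves the disc
            x += s * rng.gauss(0.0, 1.0)
            y += s * rng.gauss(0.0, 1.0)
        total += f_boundary(x, y)
    return total / n_paths
```

As a check, f(x, y) = x is harmonic, so its harmonic extension is itself and the estimate at the interior point (0.3, 0) should be close to 0.3.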
The goal of the course Stochastic Analysis is to make sense of the above equations, and to
work with them.
In all the above examples the unexplained symbol Nt occurs, which is to be thought of
as a completely random function of t, in other words, the continuous time analogue of
a sequence of independent identically distributed random variables. In a first attempt to
catch this concept, let us formulate the following requirements:
1. Nt is independent of Ns for t ≠ s;
2. the distribution of Nt does not depend on t;
3. E (Nt) = 0.
However, when taken literally these requirements do not produce what we want. This is
seen by the following argument. By requirement 1 we have for every point in time an
independent value of Nt . We shall show that such a continuous i.i.d. sequence Nt is not
measurable in t, unless it is identically 0.
Let μ denote the probability distribution of Nt, which by requirement 2 does not depend
on t, i.e., μ([a, b]) := P[a ≤ Nt ≤ b]. Divide R into two half lines, one extending from a to
∞ and the other extending from −∞ to a. If Nt is not a constant function of t, then there
must be a value of a such that each of the half lines has positive measure. So

    p := μ((−∞, a]) = P[Nt ≤ a] ∈ (0, 1).    (13.6)
Now consider the set of time points where the noise Nt is low: E := { t ≥ 0 : Nt ≤ a }.
It can be shown that with probability 1 the set E is not Lebesgue measurable. Without
giving a full proof we can understand this as follows. Let λ denote the Lebesgue measure
on R. If E were measurable, then by requirement 1 and Eq. (13.6) it would be reasonable
to expect its relative share in any interval (c, d) to be p, i.e.,

    λ(E ∩ (c, d)) = p (d − c).    (13.7)

On the other hand, it is known from measure theory that every measurable set E is
arbitrarily thick somewhere with respect to the Lebesgue measure λ, i.e., for all γ < 1 an
interval (c, d) can be found such that

    λ(E ∩ (c, d)) ≥ γ (d − c)

(cf. Halmos (1974) Th. III.16.A). Since p < 1, this clearly contradicts Eq. (13.7), so E is not measurable.
This is a bad property of Nt : for, in view of (13.1), (13.3), (13.4) and (13.5), we would like
to integrate Nt .
For this reason, let us approach the problem from another angle. Instead of Nt, let us
consider the integral of Nt, and give it a name:

    Bt := ∫_0^t Ns ds.
The three requirements on the evasive object Nt then translate into three quite sensible
requirements for Bt:
BM1. Bt has independent increments: for 0 ≤ t1 < t2 < . . . < tn the random variables
     B_{t2} − B_{t1}, . . . , B_{tn} − B_{tn−1} are independent;
BM2. Bt has stationary increments: the distribution of B_{s+t} − B_s does not depend on s;
BM3. E (Bt) = 0 for all t.
We add a normalisation:
BM4. B0 = 0 and E (B1²) = 1.
Still, these four requirements do not determine Bt. For example, the compensated Poisson
jump process also satisfies them. Our fifth requirement fixes the process Bt uniquely:
BM5. t ↦ Bt is continuous with probability 1.
The object Bt so defined is called the Wiener process, or (by a slight abuse of physical
terminology) Brownian motion. In the next section we shall give a rigorous and explicit
construction of this process.
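The role of BM5 can be illustrated by the counterexample just mentioned: the compensated Poisson process Nt − λt (with rate λ = 1, so that its variance at time t equals t) is centered with independent, stationary increments, yet it jumps. A small simulation sketch, with names of our own choosing:

```python
import random

def compensated_poisson(lam, t, rng):
    # Sample X_t = N_t - lam*t, where N_t is a Poisson process of rate lam:
    # count exponential interarrival times until their sum exceeds t.
    n, s = 0, rng.expovariate(lam)
    while s <= t:
        n += 1
        s += rng.expovariate(lam)
    return n - lam * t
```

Averaging many samples confirms E(Xt) = 0 and Var(Xt) = λt, so the first four requirements cannot distinguish this jump process from Brownian motion; only the continuity requirement BM5 does.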
Before we go into details we remark the following
Exercise 13.1 Show that BM5, together with BM1 and BM2, implies the following:
For any ε > 0

    n P (|B_{t+1/n} − Bt| > ε) → 0  as n → ∞.    (13.8)
Exercise 13.1 helps us to specify the increments of Brownian motion in the following way.
Exercise 13.2 Suppose BM1, BM2, BM4 and (13.8) hold. Apply the Central Limit
Theorem (Lindeberg's condition, page 36) to

    X_{n,k} := B_{kt/n} − B_{(k−1)t/n}

and conclude that B_{s+t} − B_s, t > 0, has a normal distribution with variance t, i.e.

    P (B_{s+t} − B_s ∈ A) = (1/√(2πt)) ∫_A e^{−x²/(2t)} dx.
As a matter of fact, BM1 and BM5 already imply that the increments B_{s+t} − B_s are
normally distributed.
More precisely, we start with the following observation. Suppose we already had
constructed Brownian motion, say (Bt)_{0≤t≤T}. Take two times 0 ≤ s < t ≤ T, put τ := (s + t)/2,
and let

    p(ν, x, y) := (1/√(2πν)) e^{−(y−x)²/(2ν)},    ν > 0, x, y ∈ R,

be the Gaussian kernel centered in x with variance ν. Then, conditioned on Bs = x and
Bt = z, the random variable B_τ is normal with mean μ := (x + z)/2 and variance σ² := (t − s)/4.
Indeed, since Bs, B_τ − Bs, and Bt − B_τ are independent, we obtain

    P [Bs ∈ dx, B_τ ∈ dy, Bt ∈ dz] = p(s, 0, x) p((t−s)/2, x, y) p((t−s)/2, y, z) dx dy dz
                                   = p(s, 0, x) p(t − s, x, z) (1/√(2πσ²)) e^{−(y−μ)²/(2σ²)} dx dy dz

(which is just a bit of algebra). Dividing by

    P [Bs ∈ dx, Bt ∈ dz] = p(s, 0, x) p(t − s, x, z) dx dz,

we obtain

    P [B_τ ∈ dy | Bs ∈ dx, Bt ∈ dz] = (1/√(2πσ²)) e^{−(y−μ)²/(2σ²)} dy,

which is our claim.
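The "bit of algebra" is completing the square in y: with ν := (t − s)/2, μ = (x + z)/2 and σ² = ν/2,

```latex
\begin{aligned}
(y-x)^2 + (z-y)^2
  &= 2\Big(y - \tfrac{x+z}{2}\Big)^2 + \tfrac{(z-x)^2}{2}
   = 2(y-\mu)^2 + \tfrac{(z-x)^2}{2},\\[4pt]
p(\nu, x, y)\, p(\nu, y, z)
  &= \frac{1}{2\pi\nu}\,
     \exp\Big(-\frac{(y-\mu)^2}{\nu} - \frac{(z-x)^2}{4\nu}\Big)\\
  &= \underbrace{\frac{1}{\sqrt{2\pi\cdot 2\nu}}\,
     e^{-(z-x)^2/(2\cdot 2\nu)}}_{=\,p(t-s,\,x,\,z)}\;
     \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}\,
     e^{-(y-\mu)^2/(2\sigma^2)}}_{\text{normal density, variance }\sigma^2 = \nu/2},
\end{aligned}
```

since 2ν = t − s and σ² = ν/2 = (t − s)/4, as claimed.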
This suggests that we might be able to construct Brownian motion on [0, 1] by
interpolation.
To carry out this program, we begin with a sequence {ξ_k^{(n)}, k ∈ I(n), n ∈ N0} of
independent, standard normal random variables on some probability space (Ω, F, P). Here
I(0) := {1}, and for n ≥ 1, I(n) denotes the set of odd, positive integers less than 2^n.
For each n ∈ N0 we define a process
B^{(n)} := {B_t^{(n)} : 0 ≤ t ≤ 1} by recursion and linear interpolation of the preceding process,
as follows. For n ∈ N, B^{(n)}_{k/2^{n−1}} will agree with B^{(n−1)}_{k/2^{n−1}} for all k = 0, 1, . . . , 2^{n−1}. Thus for
each n we only need to specify the values of B^{(n)}_{k/2^n} for k ∈ I(n). We start with

    B_0^{(0)} := 0,    B_1^{(0)} := ξ_1^{(0)},

and, guided by the observation above, set for n ∈ N and k ∈ I(n)

    B^{(n)}_{k/2^n} := (B^{(n−1)}_{(k−1)/2^n} + B^{(n−1)}_{(k+1)/2^n})/2 + 2^{−(n+1)/2} ξ_k^{(n)}.

We shall show that, almost surely, B_t^{(n)} converges uniformly in t to a continuous function
Bt (as n → ∞) and that Bt is a Brownian motion.
We start by giving a more convenient representation of the processes B^{(n)}, n = 0, 1, . . ..
We define the Haar functions by H_1^{(0)}(t) ≡ 1, and for n ∈ N, k ∈ I(n)

    H_k^{(n)}(t) :=   2^{(n−1)/2},   (k−1)/2^n ≤ t < k/2^n,
                     −2^{(n−1)/2},   k/2^n ≤ t < (k+1)/2^n,
                      0,             otherwise,

and the Schauder functions

    S_k^{(n)}(t) := ∫_0^t H_k^{(n)}(u) du.
Note that S_1^{(0)}(t) = t, and that for n ≥ 1 the graphs of S_k^{(n)} are little tents of height
2^{−(n+1)/2} centered at k/2^n and non-overlapping for different values of k ∈ I(n). Clearly,
B_t^{(0)} = ξ_1^{(0)} S_1^{(0)}(t), and by induction on n, it is readily verified that

    B_t^{(n)}(ω) = Σ_{m=0}^{n} Σ_{k∈I(m)} ξ_k^{(m)}(ω) S_k^{(m)}(t),    0 ≤ t ≤ 1, n ∈ N.    (13.9)
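The representation (13.9) is straightforward to put into code. The sketch below (function names are ours) evaluates the Schauder tents and the partial sums B_t^{(n)}:

```python
def schauder(n, k, t):
    # S_k^(n)(t): integral of the Haar function H_k^(n); for n >= 1 this is a
    # tent of height 2^{-(n+1)/2} centered at k/2^n, supported on
    # [(k-1)/2^n, (k+1)/2^n]. For n = 0, S_1^(0)(t) = t.
    if n == 0:
        return t
    h = 2.0 ** n
    left, mid, right = (k - 1) / h, k / h, (k + 1) / h
    if left <= t < mid:
        return 2 ** ((n - 1) / 2) * (t - left)    # rising edge
    if mid <= t < right:
        return 2 ** ((n - 1) / 2) * (right - t)   # falling edge
    return 0.0

def brownian_levels(max_n, t, xi):
    # Partial sum B_t^(max_n) of (13.9); xi[(n, k)] are i.i.d. N(0,1) values.
    total = 0.0
    for n in range(max_n + 1):
        ks = [1] if n == 0 else range(1, 2 ** n, 2)   # I(0) = {1}, else odd k < 2^n
        for k in ks:
            total += xi[(n, k)] * schauder(n, k, t)
    return total
```

Sampling xi from a standard normal generator and increasing max_n yields the successive piecewise-linear approximations B^{(n)} of the construction.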
Lemma 13.4 As n → ∞, the sequence of functions {B_t^{(n)}(ω), 0 ≤ t ≤ 1}, n ∈ N0, given
by (13.9) converges uniformly in t to a continuous function {B_t(ω), 0 ≤ t ≤ 1} for almost
every ω ∈ Ω.
Proof. Let b_n := max_{k∈I(n)} |ξ_k^{(n)}|. Observe that for x > 0 and each n, k

    P (|ξ_k^{(n)}| > x) = √(2/π) ∫_x^∞ e^{−u²/2} du
                        ≤ √(2/π) ∫_x^∞ (u/x) e^{−u²/2} du = √(2/π) (1/x) e^{−x²/2},

which gives

    P (b_n > n) = P (∪_{k∈I(n)} {|ξ_k^{(n)}| > n}) ≤ 2^n P (|ξ_1^{(n)}| > n) ≤ √(2/π) (2^n/n) e^{−n²/2}.

Since the right-hand side is summable in n, the Borel-Cantelli Lemma implies that there is a set Ω̃ with P (Ω̃) = 1 such that for ω ∈ Ω̃
there is an n_0(ω) such that for all n ≥ n_0(ω) it holds true that b_n(ω) ≤ n. But then

    Σ_{n≥n_0(ω)} Σ_{k∈I(n)} |ξ_k^{(n)}(ω) S_k^{(n)}(t)| ≤ Σ_{n≥n_0(ω)} n 2^{−(n+1)/2} < ∞;

so for ω ∈ Ω̃, B_t^{(n)}(ω) converges uniformly in t to a limit B_t. The uniformity of the
convergence implies the continuity of the limit B_t.
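The Gaussian tail estimate at the start of the proof can be checked numerically against the exact tail probability P(|ξ| > x) = erfc(x/√2); the helper names below are ours:

```python
from math import erfc, exp, pi, sqrt

def gauss_tail_exact(x):
    # P(|xi| > x) for a standard normal xi, via the complementary error function.
    return erfc(x / sqrt(2.0))

def gauss_tail_bound(x):
    # The bound used in the proof: sqrt(2/pi) * e^{-x^2/2} / x, valid for x > 0,
    # obtained by inserting the factor u/x >= 1 into the integrand on [x, inf).
    return sqrt(2.0 / pi) * exp(-x * x / 2.0) / x
```

The bound is crude for small x but, as the proof exploits, decays like e^{−x²/2}, which makes Σ_n 2^n P(|ξ| > n) summable.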
The following exercise facilitates the construction of Brownian motion substantially:
Exercise 13.5 Check the following in a textbook of functional analysis:
The inner product

    ⟨f, g⟩ := ∫_0^1 f(t) g(t) dt

turns L²[0, 1] into a Hilbert space, and the Haar functions {H_k^{(n)} ; k ∈ I(n), n ∈ N0} form a
complete, orthonormal system.
Thus the Parseval equality

    ⟨f, g⟩ = Σ_{n=0}^{∞} Σ_{k∈I(n)} ⟨f, H_k^{(n)}⟩ ⟨g, H_k^{(n)}⟩    (13.10)

holds true.
The limit process (Bt)_{0≤t≤1} of Lemma 13.4 is in fact a Brownian motion, which we now prove.
Proof. In view of our definition of Brownian motion it suffices to prove that for 0 = t0 <
t1 < . . . < tn ≤ 1, the increments (B_{t_j} − B_{t_{j−1}})_{j=1,...,n} are independent and normally distributed
with mean zero and variance (t_j − t_{j−1}). For this we will show that the Fourier transforms
satisfy the appropriate condition, namely that for λ_j ∈ R (and as usual i := √(−1))

    E exp( i Σ_{j=1}^{n} λ_j (B_{t_j} − B_{t_{j−1}}) ) = Π_{j=1}^{n} exp( −(1/2) λ_j² (t_j − t_{j−1}) ).    (13.12)
To derive (13.12) it is most natural to exploit the construction of Bt from Gaussian random
variables. Set λ_{n+1} := 0 and use the independence and normality of the ξ_k^{(n)} to compute for
M ∈ N

    E exp( i Σ_{j=1}^{n} (λ_{j+1} − λ_j) B^{(M)}_{t_j} )
      = E exp( i Σ_{m=0}^{M} Σ_{k∈I(m)} ξ_k^{(m)} Σ_{j=1}^{n} (λ_{j+1} − λ_j) S_k^{(m)}(t_j) )
      = Π_{m=0}^{M} Π_{k∈I(m)} E exp( i ξ_k^{(m)} Σ_{j=1}^{n} (λ_{j+1} − λ_j) S_k^{(m)}(t_j) )
      = Π_{m=0}^{M} Π_{k∈I(m)} exp( −(1/2) ( Σ_{j=1}^{n} (λ_{j+1} − λ_j) S_k^{(m)}(t_j) )² )
      = exp( −(1/2) Σ_{j=1}^{n} Σ_{l=1}^{n} (λ_{j+1} − λ_j)(λ_{l+1} − λ_l) Σ_{m=0}^{M} Σ_{k∈I(m)} S_k^{(m)}(t_j) S_k^{(m)}(t_l) ).
14 Appendix
Let Ω be a set and P(Ω) the collection of subsets of Ω.
A system of sets R ⊆ P(Ω) containing ∅ is called a ring if

    A, B ∈ R ⇒ A \ B ∈ R,
    A, B ∈ R ⇒ A ∪ B ∈ R.

If additionally

    Ω ∈ R,

then R is called an algebra.
A system of sets D ⊆ P(Ω) is called a Dynkin system if

    Ω ∈ D,
    D ∈ D ⇒ D^c ∈ D,

and for every sequence (Dn)_{n∈N} of pairwise disjoint sets Dn ∈ D, their union ∪_n Dn is also in
D.
Theorem 14.3 A Dynkin system D is a σ-algebra if and only if for any two A, B ∈ D we
have

    A ∩ B ∈ D.
Similar to the case of σ-algebras, for every system of sets E ⊆ P(Ω) there is a smallest
Dynkin system D(E) generated by (and containing) E. The importance of Dynkin systems
is mainly due to the following: if E is ∩-stable, i.e.

    A, B ∈ E ⇒ A ∩ B ∈ E,

we have

    D(E) = σ(E).
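On a finite Ω the Dynkin axioms can be checked mechanically, and the ∩-stability condition of Theorem 14.3 is genuinely needed: the subsets of Ω = {1, 2, 3, 4} of even cardinality form a Dynkin system that is not closed under intersection, hence not a σ-algebra. A small sketch (function name ours):

```python
def is_dynkin(omega, family):
    # Checks the Dynkin-system axioms on a finite ground set omega; closure
    # under countable disjoint unions reduces, by induction, to closure under
    # disjoint unions of pairs.
    fam = set(family)
    if frozenset(omega) not in fam:
        return False  # Omega must belong to the system
    if any(frozenset(omega) - d not in fam for d in fam):
        return False  # closed under complements
    return all(a | b in fam for a in fam for b in fam if a.isdisjoint(b))
```

The even-cardinality family passes all three axioms, while the intersection {1,2} ∩ {2,3} = {2} falls outside it, illustrating Theorem 14.3.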
Finally we recall the extension of measures. Consider a set function

    μ : R → [0, ∞]

on a ring R with μ(∅) = 0,

    μ(∪_{i=1}^{n} A_i) = Σ_{i=1}^{n} μ(A_i)    (14.1)

for all pairwise disjoint A_1, . . . , A_n ∈ R, and

    μ(∪_{i∈N} A_i) = Σ_{i=1}^{∞} μ(A_i)    (14.2)

for every pairwise disjoint sequence of sets (A_i)_{i∈N} ⊆ R with ∪_i A_i ∈ R. We will call (14.1) finite additivity
and (14.2) σ-additivity. Such a μ can be extended to a measure on the σ-algebra σ(R) generated by R.
In the case that R is an algebra and μ is σ-finite (i.e. Ω is the countable union of subsets
of finite measure), this extension is unique.