
Probability Theory

Matthias Löwe

Academic Year 2001/2002


Contents
1 Introduction 1

2 Basics, Random Variables 1

3 Expectation, Moments, and Jensen's inequality 3

4 Convergence of random variables 7

5 Independence 11

6 Products and Sums of Independent Random Variables 16

7 Infinite product probability spaces 18

8 Zero-One Laws 23

9 Laws of Large Numbers 26

10 The Central Limit Theorem 35

11 Conditional Expectation 43

12 Martingales 50

13 Brownian motion 57
13.1 Construction of Brownian Motion . . . . . . . . . . . . . . . . . . . . . . . 61

14 Appendix 66
1 Introduction
Chance, luck, and fortune have been at the centre of the human mind ever since people
started to think. At the latest when people began to play cards or roll dice for money,
there arose a desire for a mathematical description and understanding of chance.
Experience tells us that even in a situation that is governed purely by chance there seems
to be some regularity. For example, in a long sequence of fair coin tosses we will typically
see heads about half of the time and tails about half of the time. This was formulated
already by Jakob Bernoulli (published 1713) and is called a law of large numbers. Not
much later the French mathematician de Moivre analyzed how much the typical number
of heads in a series of n fair coin tosses fluctuates around n/2. He thereby discovered the
first form of what nowadays has become known as the Central Limit Theorem.

However, a mathematically clean description of these results could not be given by
either of the two authors. The problem with such a description is, of course, that the
average number of heads in a sequence of fair coin tosses will only typically converge to 1/2.
One could, e.g., imagine that we are rather unlucky and toss only heads. So the principal
question was: what is the probability of an event? The standard idea in the early days
was to define it as the limit of the relative frequency of the event in a long
row of typical experiments. But here we are back again at the original problem of what
a typical sequence is. It is natural that, even if we can define the concept of typical
sequences, they are very difficult to work with. This is why Hilbert, in his famous talk
at the International Congress of Mathematicians 1900 in Paris, mentioned the axiomatic
foundation of probability theory as the sixth of his 23 open problems in mathematics.
This problem was solved by A. N. Kolmogorov in 1933 by ingeniously making use of
the newly developed field of measure theory: A probability is understood as a measure on
the space of all outcomes of the random experiment. This measure is chosen in such a way
that it has total mass one.

From this starting point the whole framework of probability theory was developed: from
the laws of large numbers over the Central Limit Theorem to the very new field of
mathematical finance (and many, many others).
In this course we will meet the most important highlights of probability theory and
then turn in the direction of stochastic processes, which are basic for mathematical finance.

2 Basics, Random Variables


As mentioned in the introduction, Kolmogorov's concept understands a probability on a
space of outcomes as a measure with total mass one on this space. So in the framework of
probability theory we will always consider a triple (Ω, F, P) and call it a probability space:

Definition 2.1 A probability space is a triple (Ω, F, P), where

Ω is a set; ω ∈ Ω is called a state of the world.

F is a σ-algebra over Ω; A ∈ F is called an event.

P is a measure on F with P(Ω) = 1; P(A) is called the probability of the event A.

A probability space can be considered as an experiment we perform. Of course, in
an experiment in physics one is not always interested in measuring everything one could in
principle measure. Such a measurement in probability theory is called a random variable.

Definition 2.2 A random variable X is a mapping

X : Ω → R^d

that is measurable. Here we endow R^d with its Borel σ-algebra B^d.

The important fact about random variables is that the underlying probability space
(Ω, F, P) does not really matter. For example, consider the two experiments

Ω_1 = {0, 1}, F_1 = P(Ω_1), and P_1({0}) = 1/2

and

Ω_2 = {1, 2, . . . , 6}, F_2 = P(Ω_2), and P_2({i}) = 1/6, i ∈ Ω_2.

Consider the random variables

X_1 : Ω_1 → R, ω ↦ ω,

and

X_2 : Ω_2 → R, i ↦ 0 if i is even, i ↦ 1 if i is odd.

Then

P_1(X_1 = 0) = P_2(X_2 = 0) = 1/2

and therefore X_1 and X_2 have the same behavior even though they are defined on completely
different spaces. What we learn from this example is that what really matters for a random
variable is its distribution P ∘ X^{-1}:
Definition 2.3 The distribution P_X of a random variable X : Ω → R^d is the following
probability measure on (R^d, B^d):

P_X(A) := P(X ∈ A) = P(X^{-1}(A)), A ∈ B^d.

So the distribution of a random variable is its image measure on R^d.

Example 2.4 Important distributions of random variables X (which we have already met in
introductory courses in probability and statistics) are:

The Binomial distribution with parameters n and p, i.e. a random variable X is
binomially distributed with parameters n and p (B(n, p)-distributed), if

P(X = k) = (n choose k) p^k (1 − p)^{n−k}, 0 ≤ k ≤ n.

The binomial distribution is the distribution of the number of 1s in n independent
coin tosses with success probability p.

The Normal distribution with parameters μ and σ², i.e. a random variable X is
normally distributed with parameters μ and σ² (N(μ, σ²)-distributed), if

P(X ≤ a) = 1/√(2πσ²) ∫_{−∞}^{a} e^{−(1/2)((x−μ)/σ)²} dx.

Here a ∈ R.

The Dirac distribution with atom in a ∈ R, i.e. a random variable X is Dirac
distributed with parameter a, if

P(X = b) = δ_a(b) = 1 if b = a, and 0 otherwise.

Here b ∈ R.

The Poisson distribution with parameter λ > 0, i.e. a random variable X is Poisson
distributed with parameter λ (P(λ)-distributed), if

P(X = k) = λ^k e^{−λ} / k!, k ∈ N_0 = N ∪ {0}.

The Multivariate Normal distribution with parameters μ ∈ R^d and Σ, i.e. a random
variable X is normally distributed in dimension d with parameters μ and Σ (N(μ, Σ)-
distributed), if μ ∈ R^d, Σ is a symmetric, positive definite d × d matrix, and for
A = (−∞, a_1] × . . . × (−∞, a_d]

P(X ∈ A) = 1/√((2π)^d det Σ) ∫_{−∞}^{a_1} . . . ∫_{−∞}^{a_d} exp(−(1/2) ⟨Σ^{-1}(x − μ), (x − μ)⟩) dx_1 . . . dx_d.
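The one-dimensional formulas above are easy to check numerically. The following sketch (ours, not part of the lecture notes; all function names are our own, using only Python's standard library) evaluates the binomial and Poisson mass functions and the normal distribution function, and verifies that each mass function has total mass one:

```python
import math

def binom_pmf(n, p, k):
    # P(X = k) = C(n, k) p^k (1 - p)^(n - k) for X ~ B(n, p)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    # P(X = k) = lam^k e^(-lam) / k! for X ~ P(lam)
    return lam**k * math.exp(-lam) / math.factorial(k)

def normal_cdf(a, mu=0.0, sigma=1.0):
    # P(X <= a) for X ~ N(mu, sigma^2), expressed via the error function
    return 0.5 * (1 + math.erf((a - mu) / (sigma * math.sqrt(2))))

# each mass function sums to (essentially) one over its support:
assert abs(sum(binom_pmf(10, 0.3, k) for k in range(11)) - 1) < 1e-12
assert abs(sum(poisson_pmf(2.5, k) for k in range(100)) - 1) < 1e-12
# the N(0, 1) distribution function equals 1/2 at the mean:
assert abs(normal_cdf(0.0) - 0.5) < 1e-12
```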

3 Expectation, Moments, and Jensen's inequality


We are now going to consider important characteristics of random variables

Definition 3.1 The expectation of a random variable is defined as

E(X) := E_P(X) := ∫ X dP = ∫ X(ω) dP(ω)

if this integral is well defined.

Notice that for A ∈ F one has E(1_A) = ∫ 1_A dP = P(A). Quite often one may want to
integrate f(X) for a function f : R^d → R^m. How does this work?

Proposition 3.2 Let X : Ω → R^d be a random variable and f : R^d → R be a measurable
function. Then

∫ f ∘ X dP = E[f ∘ X] = ∫ f dP_X.    (3.1)

Proof. If f = 1_A, A ∈ B^d, we have

E[f ∘ X] = E(1_A ∘ X) = P(X ∈ A) = P_X(A) = ∫ 1_A dP_X.

Hence (3.1) holds true for functions f = Σ_{i=1}^n α_i 1_{A_i}, α_i ∈ R, A_i ∈ B^d. The standard
approximation techniques yield (3.1) for general integrable f.
In particular Proposition 3.2 yields that

E[X] = ∫ x dP_X(x).

Exercise 3.3 If X : Ω → N_0, then

EX = Σ_{n=0}^∞ n P(X = n) = Σ_{n=1}^∞ P(X ≥ n).

Exercise 3.4 If X : Ω → R, then

Σ_{n=1}^∞ P(|X| ≥ n) ≤ E(|X|) ≤ Σ_{n=0}^∞ P(|X| ≥ n).

Thus X has an expectation if and only if Σ_{n=0}^∞ P(|X| ≥ n) converges.
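The tail-sum formula of Exercise 3.3 can be verified exactly on a small discrete distribution. A minimal sketch (ours; the fair-die distribution is just an illustrative choice):

```python
# distribution of a fair die: values 1..6, each with probability 1/6
pmf = {k: 1 / 6 for k in range(1, 7)}

mean_direct = sum(n * p for n, p in pmf.items())           # Σ n P(X = n)
tail = lambda n: sum(p for k, p in pmf.items() if k >= n)  # P(X >= n)
mean_tail = sum(tail(n) for n in range(1, 7))              # Σ_{n>=1} P(X >= n)

assert abs(mean_direct - 3.5) < 1e-12
assert abs(mean_tail - mean_direct) < 1e-12
```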

Further characteristics of random variables are the p-th moments.

Definition 3.5
1. For p ≥ 1, the p-th moment of a random variable is defined as E(X^p).

2. The centered p-th moment of a random variable is defined as E[(X − EX)^p].

3. The variance of a random variable X is its centered second moment, hence

V(X) := E[(X − EX)²].

Its standard deviation is defined as

σ := √V(X).

Proposition 3.6 V(X) < ∞ if and only if X ∈ L²(P). In this case

V(X) = E(X²) − (E(X))²    (3.2)

as well as

V(X) ≤ E(X²)    (3.3)

and

(EX)² ≤ E(X²).    (3.4)

Proof. If V(X) < ∞, then X − EX ∈ L²(P). But L²(P) is a vector space and the constant
EX ∈ L²(P), so X = (X − EX) + EX ∈ L²(P).
On the other hand, if X ∈ L²(P), then X ∈ L¹(P). Hence EX exists and is a constant.
Therefore also

X − EX ∈ L²(P).

By linearity of expectation then:

V(X) = E(X − EX)² = EX² − 2(EX)² + (EX)² = EX² − (EX)².

This immediately implies (3.3) and (3.4).
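The identity (3.2) and the bounds (3.3), (3.4) can be checked exactly on any finite distribution. A minimal sketch (ours; the three-point distribution is a hypothetical example):

```python
# a small hypothetical distribution: P(X = 0) = 0.2, P(X = 1) = 0.5, P(X = 3) = 0.3
pmf = {0: 0.2, 1: 0.5, 3: 0.3}

EX = sum(x * p for x, p in pmf.items())
EX2 = sum(x**2 * p for x, p in pmf.items())
var_centered = sum((x - EX)**2 * p for x, p in pmf.items())  # E(X - EX)^2
var_identity = EX2 - EX**2                                   # right-hand side of (3.2)

assert abs(var_centered - var_identity) < 1e-12   # (3.2)
assert var_centered <= EX2 + 1e-12                # (3.3)
assert EX**2 <= EX2 + 1e-12                       # (3.4)
```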

Exercise 3.7 Show that X and X EX have the same variance.

It will turn out in the next step that (3.4) is a special case of a much more general
principle. To this end recall the concept of a convex function. Recall that a function
φ : R → R is convex if for all λ ∈ (0, 1) we have

φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y)

for all x, y ∈ R. In a first course in analysis one learns that convexity is implied by φ'' ≥ 0.
On the other hand, convex functions need not be differentiable. The following exercise
shows that they are close to being differentiable.

Exercise 3.8 Let I be an interval and φ : I → R be a convex function. Show that the
right derivative φ'₊(x) exists for all x ∈ I° (the interior of I) and the left derivative φ'₋(x)
exists for all x ∈ I°. Hence φ is continuous on I°. Show moreover that φ'₊ is monotonically
increasing on I° and it holds:

φ(y) ≥ φ(x) + φ'₊(x)(y − x)

for x ∈ I°, y ∈ I.

Applying Exercise 3.8 yields the generalization of (3.4) mentioned above.

Theorem 3.9 (Jensen's inequality) Let X : Ω → R be a random variable on (Ω, F, P)
and assume X is P-integrable and takes values in an open interval I ⊆ R. Then EX ∈ I,
and for every convex

φ : I → R

φ ∘ X is a random variable. If this random variable φ ∘ X is P-integrable it holds:

φ(E(X)) ≤ E(φ ∘ X).

Proof. Assume I = (α, β). Thus X(ω) < β for all ω ∈ Ω. But then E(X) ≤ β, and
indeed E(X) < β: E(X) = β would imply that the strictly positive random variable
β − X equals 0 on a set of P-measure one, i.e. P-almost surely. This is a contradiction.
Analogously EX > α. According to Exercise 3.8, φ is continuous on I = I°, hence Borel-
measurable. Now we know

φ(y) ≥ φ(x) + φ'₊(x)(y − x)    (3.5)

for all x, y ∈ I, with equality for x = y. Hence

φ(y) = sup_{x∈I} [φ(x) + φ'₊(x)(y − x)]    (3.6)

for all y ∈ I. Putting y = X(ω) in (3.5) yields

φ ∘ X ≥ φ(x) + φ'₊(x)(X − x)

and by integration

E(φ ∘ X) ≥ φ(x) + φ'₊(x)(E(X) − x)

for all x ∈ I. Together with (3.6) this gives

E(φ ∘ X) ≥ sup_{x∈I} [φ(x) + φ'₊(x)(EX − x)] = φ(E(X)).

This is the assertion of Jensen's inequality.

Corollary 3.10 Let X ∈ L^p(P) for some p ≥ 1. Then

|E(X)|^p ≤ E(|X|^p).
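Jensen's inequality and Corollary 3.10 can be illustrated on a finite distribution with an explicitly convex φ. A minimal sketch (ours; both the distribution and the choice φ(x) = |x|³ are hypothetical examples):

```python
# a small discrete distribution and the convex function phi(x) = |x|^3
pmf = {-1: 0.25, 0: 0.25, 2: 0.5}
phi = lambda x: abs(x) ** 3

EX = sum(x * p for x, p in pmf.items())
E_phi = sum(phi(x) * p for x, p in pmf.items())   # E(phi o X)

# Jensen: phi(EX) <= E(phi o X)
assert phi(EX) <= E_phi + 1e-12
# Corollary 3.10 with p = 3: |EX|^3 <= E|X|^3
assert abs(EX) ** 3 <= E_phi + 1e-12
```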

Exercise 3.11 Let I be an open interval and φ : I → R be convex. For x_1, . . . , x_n ∈ I and
λ_1, . . . , λ_n ≥ 0 with Σ_{i=1}^n λ_i = 1, show that

φ(Σ_{i=1}^n λ_i x_i) ≤ Σ_{i=1}^n λ_i φ(x_i).

4 Convergence of random variables
Already in the course in measure theory we met three different types of convergence:

Definition 4.1 Let (X_n) be a sequence of random variables.

1. X_n is stochastically convergent (or convergent in probability) to a random variable
X, if for each ε > 0

lim_{n→∞} P(|X_n − X| ≥ ε) = 0.

2. X_n converges almost surely to a random variable X, if for each ε > 0

P(lim sup_{n→∞} {|X_n − X| ≥ ε}) = 0.

3. X_n converges to a random variable X in L^p or in p-norm, if

lim_{n→∞} E(|X_n − X|^p) = 0.

Already in measure theory we proved:

Theorem 4.2
1. If X_n converges to X almost surely or in L^p, then it also converges
stochastically. Neither of the converses is true.

2. Almost sure convergence does not imply convergence in L^p and vice versa.

Definition 4.3 Let (Ω, F) be a measurable topological space endowed with its Borel σ-
algebra F. This means F is generated by the topology on Ω. Moreover, for each n ∈ N let
μ_n and μ be probability measures on (Ω, F). We say that μ_n converges weakly to μ if for
each bounded, continuous, real-valued function f : Ω → R (we will write C_b(Ω) for the
space of all such functions) it holds

μ_n(f) := ∫ f dμ_n → ∫ f dμ =: μ(f) as n → ∞.    (4.1)

Theorem 4.4 Let (X_n)_{n∈N} be a sequence of real-valued random variables on a space
(Ω, F, P). Assume (X_n) converges to a random variable X stochastically. Then P_{X_n} (the
sequence of distributions) converges weakly to P_X, i.e.

lim_{n→∞} ∫ f dP_{X_n} = ∫ f dP_X

or, equivalently,

lim_{n→∞} E(f ∘ X_n) = E(f ∘ X)

for all f ∈ C_b(R).

If X is constant P-a.s. the converse also holds true.

Proof. First assume f ∈ C_b(R) is uniformly continuous. Then for ε > 0 there is a δ > 0
such that for any x', x'' ∈ R

|x' − x''| < δ ⇒ |f(x') − f(x'')| < ε.

Define A_n := {|X_n − X| ≥ δ}, n ∈ N. Then

|∫ f dP_{X_n} − ∫ f dP_X| = |E(f ∘ X_n) − E(f ∘ X)|
≤ E(|f ∘ X_n − f ∘ X|)
= E(|f ∘ X_n − f ∘ X| 1_{A_n}) + E(|f ∘ X_n − f ∘ X| 1_{A_n^c})
≤ 2‖f‖ P(A_n) + ε P(A_n^c)
≤ 2‖f‖ P(A_n) + ε.

Here we used the notation

‖f‖ := sup{|f(x)| : x ∈ R}

and that

|f ∘ X_n − f ∘ X| ≤ |f ∘ X_n| + |f ∘ X| ≤ 2‖f‖.

But since X_n → X stochastically, P(A_n) → 0 as n → ∞, so for n large enough
P(A_n) ≤ ε/(2‖f‖). Thus for such n

|∫ f dP_{X_n} − ∫ f dP_X| ≤ 2ε.

Now let f ∈ C_b(R) be arbitrary and denote I_n := [−n, n]. Since I_n ↑ R as n → ∞ we
have P_X(I_n) → 1 as n → ∞. Thus for all ε > 0 there is n_0(ε) =: n_0 such that

1 − P_X(I_{n_0}) = P_X(R \ I_{n_0}) < ε.

We choose the continuous function u_{n_0} in the following way:

u_{n_0}(x) = 1 for x ∈ I_{n_0},
u_{n_0}(x) = 0 for |x| ≥ n_0 + 1,
u_{n_0}(x) = −x + n_0 + 1 for n_0 < x < n_0 + 1,
u_{n_0}(x) = x + n_0 + 1 for −n_0 − 1 < x < −n_0.

Eventually put f̃ := u_{n_0} f. Since f̃ ≡ 0 outside the compact set [−n_0 − 1, n_0 + 1], the
function f̃ is uniformly continuous (and so is u_{n_0}) and hence

lim_{n→∞} ∫ f̃ dP_{X_n} = ∫ f̃ dP_X

as well as

lim_{n→∞} ∫ u_{n_0} dP_{X_n} = ∫ u_{n_0} dP_X;

thus also

lim_{n→∞} ∫ (1 − u_{n_0}) dP_{X_n} = ∫ (1 − u_{n_0}) dP_X.

By the triangle inequality

|∫ f dP_{X_n} − ∫ f dP_X|    (4.2)
≤ ∫ |f − f̃| dP_{X_n} + |∫ f̃ dP_{X_n} − ∫ f̃ dP_X| + ∫ |f − f̃| dP_X.

For large n ≥ n_1(ε), |∫ f̃ dP_{X_n} − ∫ f̃ dP_X| ≤ ε. Furthermore from 0 ≤ 1 − u_{n_0} ≤ 1_{R\I_{n_0}} we
obtain

∫ (1 − u_{n_0}) dP_X ≤ P_X(R \ I_{n_0}) < ε,

so that for all n ≥ n_2(ε) also

∫ (1 − u_{n_0}) dP_{X_n} < ε.

This yields

∫ |f − f̃| dP_X = ∫ |f| (1 − u_{n_0}) dP_X ≤ ε ‖f‖

on the one hand and

∫ |f − f̃| dP_{X_n} ≤ ε ‖f‖

for all n ≥ n_2(ε) on the other. Hence we obtain from (4.2) for large n:

|∫ f dP_{X_n} − ∫ f dP_X| ≤ 2ε‖f‖ + ε.

This proves weak convergence of the distributions.


For the converse let α ∈ R and assume X ≡ α (X is identically equal to α ∈ R, P-almost
surely). This means P_X = δ_α, where δ_α is the Dirac measure concentrated in α. For the
open interval I = (α − ε, α + ε), ε > 0, we may find f ∈ C_b(R) with f ≤ 1_I and f(α) = 1.
Then

∫ f dP_{X_n} ≤ P_{X_n}(I) = P(X_n ∈ I) ≤ 1.

Since we assumed weak convergence of P_{X_n} to P_X we know that

∫ f dP_{X_n} → ∫ f dP_X = f(α) = 1

as n → ∞. Since ∫ f dP_{X_n} ≤ P(X_n ∈ I) ≤ 1, this implies

P(X_n ∈ I) → 1

as n → ∞. But

{X_n ∈ I} = {|X_n − α| < ε}

and thus

P(|X_n − X| ≥ ε) = P(|X_n − α| ≥ ε) = 1 − P(|X_n − α| < ε) → 0

for all ε > 0. This means X_n converges stochastically to X.

Definition 4.5 Let X_n, X be random variables on a probability space (Ω, F, P). If P_{X_n}
converges weakly to P_X we also say that X_n converges to X in distribution.

Remark 4.6 If the random variables in Theorem 4.4 are R^d-valued and the f : R^d → R
belong to C_b(R^d), the statement of the theorem stays valid.

Example 4.7 For each sequence σ_n > 0 with lim σ_n = 0 we have

lim_{n→∞} N(0, σ_n²) = δ_0 (weakly).

Here δ_0 denotes the Dirac measure concentrated in 0 ∈ R. Indeed, substituting x = σy we
obtain

1/√(2πσ²) ∫ f(x) e^{−x²/(2σ²)} dx = 1/√(2π) ∫ e^{−y²/2} f(σy) dy.

Now for all y ∈ R

|1/√(2π) e^{−y²/2} f(σy)| ≤ ‖f‖ 1/√(2π) e^{−y²/2},

which is integrable. Thus by dominated convergence

lim_{n→∞} 1/√(2πσ_n²) ∫ e^{−x²/(2σ_n²)} f(x) dx = lim_{n→∞} 1/√(2π) ∫ e^{−y²/2} f(σ_n y) dy
= 1/√(2π) ∫ e^{−y²/2} f(0) dy = f(0) = ∫ f dδ_0.
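This weak convergence can be observed numerically: for a bounded continuous test function f, the integral ∫ f dN(0, σ²) approaches f(0) as σ → 0. A minimal sketch (ours; the Riemann-sum integrator and the test function cos are illustrative choices):

```python
import math

def gauss_expect(f, sigma, grid=20001, span=12.0):
    # Riemann-sum approximation of the integral of f against N(0, sigma^2),
    # integrating over [-span*sigma, span*sigma] (the remaining mass is negligible)
    h = 2 * span * sigma / (grid - 1)
    total = 0.0
    for i in range(grid):
        x = -span * sigma + i * h
        total += f(x) * math.exp(-x * x / (2 * sigma**2)) * h
    return total / math.sqrt(2 * math.pi * sigma**2)

f = math.cos                      # a bounded continuous test function
vals = [gauss_expect(f, s) for s in (1.0, 0.3, 0.03)]

# the integrals approach f(0) = 1 as sigma shrinks:
assert abs(vals[-1] - f(0.0)) < 1e-3
assert abs(vals[-1] - 1.0) < abs(vals[0] - 1.0)
```

(For f = cos the exact value is e^{−σ²/2}, which indeed tends to 1 = f(0).)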

Exercise 4.8 Show that the converse direction in Theorem 4.4 is not true if we drop the
assumption that X is constant P-a.s.

Exercise 4.9 Assume the following holds for a sequence (X_n) of random variables on a
probability space (Ω, F, P):

P(|X_n| > ε) < ε

for all n large enough (larger than n_0(ε)) for each given ε > 0. Is this equivalent with
stochastic convergence of X_n to 0?

Exercise 4.10 For a sequence of Poisson distributions (P(λ_n)) with parameters λ_n > 0 show
that

lim_{n→∞} P(λ_n) = δ_0 (weakly)

if lim_{n→∞} λ_n = 0. Is there a probability measure μ on B¹ with

lim_{n→∞} P(λ_n) = μ (weakly)

if lim λ_n = ∞?

5 Independence
The concept of independence of events is one of the most essential in probability theory. It
is met already in the introductory courses. Its background is the following:
Assume we are given a probability space (Ω, F, P). For two events A, B ∈ F with
P(B) > 0 one may ask how the probability of the event A changes if we already know
that B has happened; we then only need to consider the probability of A ∩ B. To obtain a
probability again we normalize by P(B) and get the conditional probability of A given B:

P(A | B) = P(A ∩ B) / P(B).

Now we would call A and B independent if the knowledge that B has happened does not
change the probability that A will happen or not, i.e. if P(A | B) and P(A) are the same:

P(A | B) = P(A).

This in other words means A and B are independent if

P(A ∩ B) = P(A) P(B).

More generally we define

Definition 5.1 A family (A_i)_{i∈I} of events in F is called independent if for each choice of
finitely many different indices i_1, . . . , i_n ∈ I

P(A_{i_1} ∩ . . . ∩ A_{i_n}) = P(A_{i_1}) · . . . · P(A_{i_n}).    (5.1)

Exercise 5.2 Give an example of a sequence of events that are pairwise independent, i.e.
each pair of events from this sequence is independent, but not independent (i.e. all events
together are not independent).
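One standard example behind Exercise 5.2 uses two fair coin tosses: the events "first coin heads", "second coin heads", and "both coins equal". A minimal sketch (ours) verifies pairwise independence but not joint independence by exact enumeration:

```python
from itertools import product

# sample space: two fair coin tosses, each of the 4 outcomes has probability 1/4
omega = list(product("HT", repeat=2))
P = lambda event: sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == "H"        # first toss heads
B = lambda w: w[1] == "H"        # second toss heads
C = lambda w: w[0] == w[1]       # both tosses equal
both = lambda e1, e2: (lambda w: e1(w) and e2(w))
all3 = lambda w: A(w) and B(w) and C(w)

# pairwise independent:
for e1, e2 in [(A, B), (A, C), (B, C)]:
    assert P(both(e1, e2)) == P(e1) * P(e2)
# but not independent as a triple: P(A ∩ B ∩ C) = 1/4 while the product is 1/8
assert P(all3) != P(A) * P(B) * P(C)
```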

We generalize Definition 5.1 to set systems in the following way:

Definition 5.3 For each i ∈ I let E_i ⊆ F be a collection of events. (E_i)_{i∈I} are called
independent if (5.1) holds for each i_1, . . . , i_n ∈ I, each n ∈ N, and each A_{i_ν} ∈ E_{i_ν},
ν = 1, . . . , n.

Exercise 5.4 Show the following:

1. A family (E_i)_{i∈I} is independent if and only if every finite sub-family is independent.

2. Independence of (E_i)_{i∈I} is maintained if we reduce the families (E_i). More precisely,
let (E_i) be independent and E_i' ⊆ E_i; then also the families (E_i') are independent.

3. If for all n ∈ N, (E_i^n)_{i∈I} are independent and for all n ∈ N and i ∈ I, E_i^n ⊆ E_i^{n+1},
then (∪_n E_i^n)_{i∈I} are independent.

Exercise 5.5 If (E_i)_{i∈I} are independent, then so are the Dynkin systems (D(E_i))_{i∈I}. Here
D(A) is the Dynkin system generated by A; it coincides with the intersection of all Dynkin
systems containing A. (See the Appendix for definitions.)

Corollary 5.6 Let (E_i)_{i∈I} be an independent family of ∩-stable sets E_i ⊆ F. Then also the
families (σ(E_i))_{i∈I} of the σ-algebras generated by the E_i are independent.

Theorem 5.7 Let (E_i)_{i∈I} be an independent family of ∩-stable sets E_i ⊆ F and

I = ∪_{j∈J} I_j

with I_i ∩ I_j = ∅, i ≠ j. Let A_j := σ(∪_{i∈I_j} E_i). Then also (A_j)_{j∈J} is independent.

Proof. For j ∈ J let Ẽ_j be the system of all sets of the form

E_{i_1} ∩ . . . ∩ E_{i_n}

where ∅ ≠ {i_1, . . . , i_n} ⊆ I_j and E_{i_ν} ∈ E_{i_ν}, ν = 1, . . . , n, are arbitrary. Then Ẽ_j is ∩-
stable. As an immediate consequence of the independence of the (E_i)_{i∈I} also the (Ẽ_j)_{j∈J}
are independent. Eventually A_j = σ(Ẽ_j). Thus the assertion follows from Corollary 5.6.
In the next step we want to show that events that depend on all but finitely many
σ-algebras of a countable family of independent σ-algebras can only have probability zero
or one. To this end we need the following definition.

Definition 5.8 Let (A_n)_n be a sequence of σ-algebras from F and

T_n := σ(∪_{m=n}^∞ A_m)

the σ-algebra generated by A_n, A_{n+1}, . . .. Then

T := ∩_{n=1}^∞ T_n

is called the σ-algebra of the tail events.

Exercise 5.9 Why is T a σ-algebra?

This is now the result announced above:

Theorem 5.10 (Kolmogorov's Zero-One Law) Let (A_n) be an independent sequence
of σ-algebras A_n ⊆ F. Then for every tail event A ∈ T it holds

P(A) = 0 or P(A) = 1.

Proof. Let A ∈ T and D the system of all sets D ∈ F that are independent of A. We
want to show that A ∈ D.
In the exercise below we show that D is a Dynkin system. By Theorem 5.7 the σ-algebra
T_{n+1} is independent of the σ-algebra

A^n := σ(A_1 ∪ . . . ∪ A_n).

Since A ∈ T_{n+1} we know that A^n ⊆ D for every n ∈ N. Thus

Â := ∪_{n=1}^∞ A^n ⊆ D.

Obviously A^n is increasing. For E, F ∈ Â there thus exists n with E, F ∈ A^n, and hence
E ∩ F ∈ A^n and thus E ∩ F ∈ Â. This means that Â is ∩-stable. Since Â ⊆ D, i.e.
(Â, {A}) is independent, from Exercise 5.5 we conclude that (D(Â), {A}) is independent,
so that σ(Â) = D(Â) ⊆ D. Moreover A_n ⊆ Â for all n. Hence T_1 = σ(∪_n A_n) ⊆ σ(Â).
Therefore

A ∈ T ⊆ T_1 ⊆ σ(Â) ⊆ D.

Therefore A is independent of A, i.e. it holds

P(A) = P(A ∩ A) = P(A) · P(A) = (P(A))².

Hence P(A) ∈ {0, 1}, as asserted.

Exercise 5.11 Show that D from the proof of Theorem 5.10 is a Dynkin system.

As an immediate consequence of the Kolmogorov Zero-One Law (Theorem 5.10) we
obtain

Theorem 5.12 (Borel's Zero-One Law) For each independent sequence (A_n)_n of events
A_n ∈ F we have

P(A_n for infinitely many n) = 0

or

P(A_n for infinitely many n) = 1,

i.e.

P(lim sup A_n) ∈ {0, 1}.

Proof. Let A_n = σ(A_n), i.e. A_n = {∅, Ω, A_n, A_n^c}. It follows that (A_n)_n is independent.
For Q_n := ∪_{m=n}^∞ A_m we have Q_n ∈ T_n. Since (T_n)_n is decreasing we even have Q_m ∈ T_n for
all m ≥ n, n ∈ N. Since (Q_n)_n is decreasing we obtain

lim sup_n A_n = ∩_{k=1}^∞ Q_k = ∩_{k=j}^∞ Q_k ∈ T_j

for all j ∈ N. Hence lim sup A_n ∈ T. Hence the assertion follows from Kolmogorov's
zero-one law.

Exercise 5.13 In every F the pairs (A, B) (any A, B ∈ F with P(A) = 0 or P(A) = 1 or
P(B) = 0 or P(B) = 1) are pairs of independent sets. If these are the only pairs of indepen-
dent sets we call F independence-free. Show that the following space is independence-free:

Ω = N, F = P(Ω), and P({k}) = 2^{−k!}

for each k ≥ 2, and P({1}) = 1 − Σ_{k=2}^∞ P({k}). (Hint: By passing to complements if
necessary, you may assume that 1 ∉ A and 1 ∉ B.)

A special case of the above abstract setting is the concept of independent random
variables. This will be introduced next. Again we work over a probability space (Ω, F, P).

Definition 5.14 A family of random variables (X_i)_i is called independent if the σ-algebras
(σ(X_i))_i generated by them are independent.

For finite families there is another criterion (which is important, since by definition of
independence we only need to check the independence of finite families).

Theorem 5.15 Let X_1, . . . , X_n be a sequence of n random variables with values in mea-
surable spaces (Ω_i, A_i) with ∩-stable generators E_i of A_i. X_1, . . . , X_n are independent if
and only if

P(X_1 ∈ E_1, . . . , X_n ∈ E_n) = ∏_{i=1}^n P(X_i ∈ E_i)

for all E_i ∈ E_i, i = 1, . . . , n.

Proof. Put

G_i := {X_i^{-1}(E_i) : E_i ∈ E_i}.

Then G_i generates σ(X_i). G_i is ∩-stable and Ω ∈ G_i. According to Corollary 5.6 we need
to show the independence of (G_i)_{i=1,...,n}, which is equivalent with

P(G_1 ∩ . . . ∩ G_n) = P(G_1) · . . . · P(G_n)

for all choices of G_i ∈ G_i. Sufficiency is evident, since we may choose G_i = Ω for appropriate
i.

Exercise 5.16 Random variables X_1, . . . , X_{n+1} are independent with values in (Ω_i, F_i) if
and only if X_1, . . . , X_n are independent and X_{n+1} is independent of σ(X_1, . . . , X_n).

The following theorem states that a measurable deformation of an independent family
of random variables stays independent:

Theorem 5.17 Let (X_i)_{i∈I} be a family of independent random variables X_i with values in
(Ω_i, A_i) and let

f_i : (Ω_i, A_i) → (Ω_i', A_i')

be measurable. Then (f_i(X_i))_{i∈I} is independent.

Proof. Let i_1, . . . , i_n ∈ I. Then

P(f_{i_1}(X_{i_1}) ∈ A'_{i_1}, . . . , f_{i_n}(X_{i_n}) ∈ A'_{i_n})
= P(X_{i_1} ∈ f_{i_1}^{-1}(A'_{i_1}), . . . , X_{i_n} ∈ f_{i_n}^{-1}(A'_{i_n}))
= ∏_{ν=1}^n P(X_{i_ν} ∈ f_{i_ν}^{-1}(A'_{i_ν}))
= ∏_{ν=1}^n P(f_{i_ν}(X_{i_ν}) ∈ A'_{i_ν})

by the independence of (X_i)_{i∈I}. Here the A'_{i_ν} ∈ A'_{i_ν} were arbitrary.

Already Theorem 5.15 gives rise to the idea that independence of random variables
may be somehow related to product measures. This is made more precise in the following
theorem. To this end let X_1, . . . , X_n be random variables such that

X_i : (Ω, F) → (Ω_i, A_i).

Define

Y := X_1 ⊗ . . . ⊗ X_n : Ω → Ω_1 × . . . × Ω_n.

Then the distribution of Y, which we denote by P_Y, can be computed as P_Y = P_{X_1 ⊗ ... ⊗ X_n}.
Note that P_Y is a probability measure on ⊗_{i=1}^n A_i.

Theorem 5.18 The random variables X_1, . . . , X_n are independent if and only if their joint
distribution is the product measure of the individual distributions, i.e. if

P_{X_1 ⊗ ... ⊗ X_n} = P_{X_1} ⊗ . . . ⊗ P_{X_n}.

Proof. Let A_i ∈ A_i, i = 1, . . . , n. Then with Y = X_1 ⊗ . . . ⊗ X_n:

P_Y(∏_{i=1}^n A_i) = P(Y ∈ ∏_{i=1}^n A_i) = P(X_1 ∈ A_1, . . . , X_n ∈ A_n)

as well as

P_{X_i}(A_i) = P(X_i ∈ A_i), i = 1, . . . , n.

Now P_Y is the product measure of the P_{X_i} if and only if

P_Y(A_1 × . . . × A_n) = P_{X_1}(A_1) · . . . · P_{X_n}(A_n).

But this is identical with

P(X_1 ∈ A_1, . . . , X_n ∈ A_n) = ∏_{i=1}^n P(X_i ∈ A_i).

But according to Theorem 5.15 this is equivalent with the independence of the X_i.
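The factorization in Theorem 5.18 can be checked by exact enumeration on a finite product space. A minimal sketch (ours; two fair dice serve as the hypothetical example):

```python
from itertools import product

# two independent fair dice, modeled by the uniform product space {1..6} x {1..6}
omega = list(product(range(1, 7), repeat=2))
P = lambda event: sum(1 for w in omega if event(w)) / len(omega)

X1 = lambda w: w[0]        # value of the first die
X2 = lambda w: w[1] % 2    # parity of the second die

# the joint distribution factorizes: P(X1 in A1, X2 in A2) = P(X1 in A1) P(X2 in A2)
for A1 in [{1}, {2, 4, 6}, {1, 2, 3}]:
    for A2 in [{0}, {1}, {0, 1}]:
        joint = P(lambda w: X1(w) in A1 and X2(w) in A2)
        prod_ = P(lambda w: X1(w) in A1) * P(lambda w: X2(w) in A2)
        assert abs(joint - prod_) < 1e-12
```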

6 Products and Sums of Independent Random Variables
In this section we will study independent random variables in greater detail.
Theorem 6.1 Let X_1, . . . , X_n be independent, real-valued random variables. Then

E(∏_{i=1}^n X_i) = ∏_{i=1}^n E(X_i)    (6.1)

if EX_i is well defined (and finite) for all i. (6.1) shows that then also E(∏_{i=1}^n X_i) is well
defined.
Proof. We know that Q := ⊗_{i=1}^n P_{X_i} is the joint distribution of the X_1, . . . , X_n. By
Proposition 3.2 and Fubini's theorem

E(∏_{i=1}^n |X_i|) = ∫ |x_1 · . . . · x_n| dQ(x_1, . . . , x_n)
= ∫ . . . ∫ |x_1| · . . . · |x_n| dP_{X_1}(x_1) . . . dP_{X_n}(x_n)
= ∫ |x_1| dP_{X_1}(x_1) · . . . · ∫ |x_n| dP_{X_n}(x_n).

This shows that integrability of the X_i implies integrability of ∏_{i=1}^n X_i. In this case the
equalities are also true without absolute values. This proves the result.
Exercise 6.2 For any two integrable random variables X, Y, Theorem 6.1 tells us that
independence of X and Y implies

E(X · Y) = E(X) · E(Y).

Show that the converse is not true.
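A standard counterexample for the converse takes X uniform on {−1, 0, 1} and Y = X². A minimal sketch (ours) verifies E(XY) = EX · EY exactly, even though X and Y are clearly dependent:

```python
# X uniform on {-1, 0, 1}, Y = X^2: E(XY) = EX * EY, yet X and Y are dependent
support = [-1, 0, 1]
p = 1 / 3

EX = sum(x * p for x in support)            # = 0
EY = sum(x**2 * p for x in support)         # = 2/3
EXY = sum(x * x**2 * p for x in support)    # E(X * X^2) = E(X^3) = 0

assert abs(EXY - EX * EY) < 1e-12           # product rule holds
# but X and Y are not independent:
# P(X = 1, Y = 0) = 0 while P(X = 1) * P(Y = 0) = (1/3)(1/3) > 0
assert 0 != p * p
```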
Definition 6.3 For any two random variables X, Y that are integrable and have an inte-
grable product we define the covariance of X and Y to be

cov(X, Y) = E[(X − EX)(Y − EY)] = E(XY) − EX · EY.

X and Y are called uncorrelated if cov(X, Y) = 0.
Remark 6.4 If X, Y are independent, then cov(X, Y) = 0.
Theorem 6.5 Let X_1, . . . , X_n be square integrable random variables. Then

V(Σ_{i=1}^n X_i) = Σ_{i=1}^n V(X_i) + Σ_{i≠j} cov(X_i, X_j).    (6.2)

In particular, if X_1, . . . , X_n are uncorrelated,

V(Σ_{i=1}^n X_i) = Σ_{i=1}^n V(X_i).    (6.3)

Proof. We have

V(Σ_{i=1}^n X_i) = E[(Σ_{i=1}^n (X_i − EX_i))²]
= E[Σ_{i=1}^n (X_i − EX_i)² + Σ_{i≠j} (X_i − EX_i)(X_j − EX_j)]
= Σ_{i=1}^n V(X_i) + Σ_{i≠j} cov(X_i, X_j).

This proves (6.2). For (6.3) just note that for uncorrelated random variables X, Y one has
cov(X, Y) = 0.
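The decomposition (6.2) can be verified exactly for n = 2 on an explicit joint distribution. A minimal sketch (ours; the 2×2 joint pmf is a hypothetical example with positive correlation):

```python
# an explicit joint pmf of (X, Y) on {0,1} x {0,1}, with positive dependence
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

E = lambda f: sum(f(x, y) * p for (x, y), p in joint.items())
EX, EY = E(lambda x, y: x), E(lambda x, y: y)
VX = E(lambda x, y: (x - EX) ** 2)
VY = E(lambda x, y: (y - EY) ** 2)
cov = E(lambda x, y: (x - EX) * (y - EY))
ES = E(lambda x, y: x + y)
VS = E(lambda x, y: (x + y - ES) ** 2)   # V(X + Y) computed directly

# (6.2) for n = 2: V(X + Y) = V(X) + V(Y) + 2 cov(X, Y)
assert abs(VS - (VX + VY + 2 * cov)) < 1e-12
assert cov > 0   # X and Y are positively correlated here, so VS > VX + VY
```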
Eventually we turn to determining the distribution of the sum of independent random
variables.
Theorem 6.6 Let X_1, . . . , X_n be independent R^d-valued random variables. Then the distri-
bution of the sum S_n := X_1 + . . . + X_n is given by the convolution product of the distributions
of the X_i, i.e.

P_{S_n} = P_{X_1} ∗ P_{X_2} ∗ . . . ∗ P_{X_n}.

Proof. Again let Y := X_1 ⊗ . . . ⊗ X_n : Ω → (R^d)^n, and let

A_n : R^d × . . . × R^d → R^d

denote vector addition. Then S_n = A_n ∘ Y, hence a random variable. Now P_{S_n} is the image
measure of P under A_n ∘ Y, which we denote by (A_n ∘ Y)(P). Thus

P_{S_n} = (A_n ∘ Y)(P) = A_n(P_Y).

Now P_Y = ⊗ P_{X_i}. So by the definition of the convolution product

P_{X_1} ∗ . . . ∗ P_{X_n} = A_n(P_Y) = P_{S_n}.
More explicitly, in the case d = 1, let g(x_1, . . . , x_n) = 1 for x_1 + · · · + x_n ≤ s and
g(x_1, . . . , x_n) = 0 otherwise. Then an application of Fubini's theorem yields

P(S_n ≤ s) = E(g(X_1, . . . , X_n)) = ∫∫ g(x_1, x_2, . . . , x_n) dP_{X_1}(x_1) dP_{(X_2,...,X_n)}(x_2, . . . , x_n)
= ∫ P(X_1 ≤ s − x_2 − · · · − x_n) dP_{(X_2,...,X_n)}(x_2, . . . , x_n).

In the case that X_1 has a density f_{X_1} with respect to Lebesgue measure, and the order of
differentiation with respect to s and integration can be exchanged, it follows that S_n has a
density f_{S_n} and

f_{S_n}(s) = ∫ f_{X_1}(s − x_2 − · · · − x_n) dP_{(X_2,...,X_n)}(x_2, . . . , x_n).

The same formula holds in the case that X_1, . . . , X_n have a discrete distribution (that is,
almost surely assume values in a fixed countable subset of R) if densities are taken with
respect to the counting measure.

Example 6.7
1. As we learned in Introduction to Statistics, the convolution of a Bi-
nomial distribution with parameters n and p, B(n, p), and a Binomial distribution
B(m, p) is a B(n + m, p) distribution:

B(n, p) ∗ B(m, p) = B(n + m, p).

2. As we learned in Introduction to Statistics, the convolution of a P(λ)-distribution
(a Poisson distribution with parameter λ) with a P(μ)-distribution is a P(λ + μ)-
distribution:

P(λ) ∗ P(μ) = P(λ + μ).

3. As has been communicated in Introduction to Probability:

N(μ, σ²) ∗ N(ν, τ²) = N(μ + ν, σ² + τ²).

7 Infinite product probability spaces


Many theorems in probability theory start with: Let X1 , X2 , . . . , Xn , . . . be a sequence
of i.i.d. random variables. But how do we know that such sequences really exist? This
will be shown in this section. In the last section we established the framework for some
of the most important theorems from probabilities, as the Weak Law of Large Numbers or
the Central Limit Theorem. Those are the theorems that assume: Let X1 , X2 , . . . Xn be
i.i.d. random variables. Others, as the Strong Law of Large Numbers ask for the behavior
of a sequence of independent and identically distributed (i.i.d.) random variables; they
usually start like Let X1 , X2 , . . . be a sequence of i.i.d. random variables. The natural
first question to ask is: Does such a sequence exist at all?
In the same way as the existence of a finite sequence of i.i.d. random variables is related
to finite product measures, the answer to the above question is related to infinite product
measures. So, we assume that we are given a sequence of measure spaces (n , An , n ) of
which we moreover assume that

n (n ) = 1 for all n.

We construct (, A) as follows: We want each to be a sequence (n )n where n n .


So we put
Y
:= n .
n=1

Moreover we have the idea that a probability measure on should be defined by what
happens on the first n coordinates, n N. So for A1 A1 , A2 A2 , . . . , An An , n N
we want
A := A1 . . . An n+1 n+2 . . . (7.1)
to be in A. By independence we want to define a measure on (, A) that assigns to A
defined in (7.1) the mass
(A) = 1 (A) . . . n (An ).

We will solve this problem in greater generality. Let I be an index set and (Ω_i, A_i, μ_i)_{i∈I}
be measure spaces with μ_i(Ω_i) = 1. For ∅ ≠ K ⊆ I define

Ω_K := ∏_{i∈K} Ω_i,    (7.2)

in particular Ω := Ω_I. Let p^K_J for J ⊆ K denote the canonical projection from Ω_K to Ω_J.
For J = {i} we will also write p^K_i instead of p^K_{{i}} and p_i in place of p^I_i. Obviously

p^L_J = p^K_J ∘ p^L_K    (J ⊆ K ⊆ L)    (7.3)

and

p_J := p^I_J = p^K_J ∘ p_K    (J ⊆ K).    (7.4)

Moreover denote by

H(I) := {J ⊆ I : J ≠ ∅, |J| is finite}.

For J ∈ H(I) the σ-algebras and measures

A_J := ⊗_{i∈J} A_i and μ_J := ⊗_{i∈J} μ_i

are defined by Fubini's theorem in measure theory.


In analogy to the finite dimensional case we define

Definition 7.1 The product σ-algebra ⊗_{i∈I} A_i of the σ-algebras (A_i)_{i∈I} is defined as the
smallest σ-algebra A on Ω such that all projections p_i : Ω → Ω_i are (A, A_i)-measurable.
Hence

⊗_{i∈I} A_i := σ(p_i, i ∈ I).    (7.5)

Exercise 7.2 Show that

⊗_{i∈I} A_i = σ(p_J, J ∈ H(I)).    (7.6)

According to the above we are now looking for a measure μ on (Ω, A) that assigns mass
μ_1(A_1) · . . . · μ_n(A_n) to each A as defined in (7.1). In other words

μ(p_J^{-1}(∏_{i∈J} A_i)) = μ_J(∏_{i∈J} A_i).

The question whether such a measure exists is solved in

Theorem 7.3 On A := ⊗_{i∈I} A_i there is a unique measure μ with

p_J(μ) := μ ∘ p_J^{-1} = μ_J    (7.7)

for all J ∈ H(I). It holds μ(Ω) = 1.

Proof. We may assume |I| = ∞, since otherwise the result is known from Fubini's theorem.
We start with some preparatory considerations:
In Exercise 7.4 below it will be shown that p^K_J is (A_K, A_J)-measurable for J ⊆ K and
that p^K_J(μ_K) = μ_J (J ⊆ K; J, K ∈ H(I)).
Hence, if we introduce the σ-algebra of the J-cylinder sets

Z_J := p_J^{-1}(A_J)    (J ∈ H(I)),    (7.8)

the measurability of p^K_J implies (p^K_J)^{-1}(A_J) ⊆ A_K and thus

Z_J ⊆ Z_K    (J ⊆ K; J, K ∈ H(I)).    (7.9)

Eventually we introduce the system of all cylinder sets

Z := ∪_{J∈H(I)} Z_J.

Note that due to (7.9), for Z_1, Z_2 ∈ Z we have Z_1, Z_2 ∈ Z_J for suitably chosen J ∈ H(I).
Hence Z is an algebra (but generally not a σ-algebra). From (7.5) and (7.6) it follows that

A = σ(Z).
Now we come to the main part of the proof. This will be divided into four parts.
1. Assume Z ∋ Z = p_J^{-1}(A), J ∈ H(I), A ∈ A_J. According to (7.7), Z must get mass
μ(Z) = μ_J(A). We have to show that this is well defined. So let

Z = p_J^{-1}(A) = p_K^{-1}(B)

for J, K ∈ H(I), A ∈ A_J, B ∈ A_K. If J ⊆ K we obtain:

p_J^{-1}(A) = p_K^{-1}((p^K_J)^{-1}(A))

and thus

p_K^{-1}(B) = p_K^{-1}(B') with B' := (p^K_J)^{-1}(A).

Since p_K(Ω) = Ω_K we obtain

B = B' = (p^K_J)^{-1}(A).

Thus by the introductory considerations

μ_K(B) = μ_J(A).

For arbitrary J, K define L := J ∪ K. Since J, K ⊆ L, (7.9) implies the existence of
C ∈ A_L with p_L^{-1}(C) = p_J^{-1}(A) = p_K^{-1}(B). Therefore, from what we have just seen:

μ_L(C) = μ_J(A) and μ_L(C) = μ_K(B).

Hence

μ_J(A) = μ_K(B).

Thus the function

μ_0(p_J^{-1}(A)) := μ_J(A)    (J ∈ H(I), A ∈ A_J)    (7.10)

is well defined on Z.

2. Now we show that \pi_0 as defined in (7.10) is a volume on \mathcal{Z}. Trivially it holds \pi_0 \ge 0
and \pi_0(\emptyset) = 0. Moreover, as shown above, for Y, Z \in \mathcal{Z} with Y \cap Z = \emptyset there are
J \in H(I) and A, B \in \mathcal{A}_J such that Y = p_J^{-1}(A), Z = p_J^{-1}(B). Now Y \cap Z = \emptyset implies
A \cap B = \emptyset and due to

    Y \cup Z = p_J^{-1}(A \cup B)

we obtain

    \pi_0(Y \cup Z) = \pi_J(A \cup B) = \pi_J(A) + \pi_J(B) = \pi_0(Y) + \pi_0(Z),

hence the finite additivity of \pi_0.

It remains to show that \pi_0 is also \sigma-additive. Then the general principles from
measure theory yield that \pi_0 can be uniquely extended to a measure \pi on \sigma(\mathcal{Z}) = \mathcal{A}.
\pi also is a probability measure, because \Omega = p_J^{-1}(\Omega_J) for all J \in H(I) and therefore

    \pi(\Omega) = \pi_0(\Omega) = \pi_J(\Omega_J) = 1.

To prove the \sigma-additivity of \pi_0 we first show:

3. Let Z \in \mathcal{Z} and J \in H(I). Then for all \omega_J \in \Omega_J the set

    Z_{\omega_J} := \{\omega \in \Omega : (\omega_J, p_{I \setminus J}(\omega)) \in Z\}

is a cylinder set. This set consists of all \omega \in \Omega with the following property: if we
replace the coordinates \omega_i with i \in J by the corresponding coordinates of \omega_J, we
obtain a point in Z. Moreover

    \pi_0(Z) = \int \pi_0(Z_{\omega_J}) \, d\pi_J(\omega_J).                      (7.11)

This is shown by the following consideration. For Z \in \mathcal{Z} there are K \in H(I) and
A \in \mathcal{A}_K such that Z = p_K^{-1}(A), which means that \pi_0(Z) = \pi_K(A). Since I is infinite
we may assume J \subseteq K and J \ne K. For the \omega_J-section of A in \Omega_K, which we
call A_{\omega_J}, i.e. for the set of all \omega' \in \Omega_{K \setminus J} with (\omega_J, \omega') \in A, it holds

    Z_{\omega_J} = p_{K \setminus J}^{-1}(A_{\omega_J}).

By Fubini's theorem A_{\omega_J} \in \mathcal{A}_{K \setminus J}, and hence the Z_{\omega_J} = p_{K \setminus J}^{-1}(A_{\omega_J}) are (K \setminus J)-cylinder sets.
Since \pi_K = \pi_J \otimes \pi_{K \setminus J}, Fubini's theorem implies

    \pi_0(Z) = \pi_K(A) = \int \pi_{K \setminus J}(A_{\omega_J}) \, d\pi_J(\omega_J).    (7.12)

But this is (7.11), since

    \pi_0(Z_{\omega_J}) = \pi_{K \setminus J}(A_{\omega_J})

(because of Z_{\omega_J} = p_{K \setminus J}^{-1}(A_{\omega_J})).

4. Eventually we show that \pi_0 is \emptyset-continuous and thus \sigma-additive. To this end let (Z_n)
be a decreasing sequence of cylinder sets in \mathcal{Z} with \delta := \inf_n \pi_0(Z_n) > 0. We will show
that

    \bigcap_{n=1}^{\infty} Z_n \ne \emptyset.                                     (7.13)

Now each Z_n is of the form Z_n = p_{J_n}^{-1}(A_n), J_n \in H(I), A_n \in \mathcal{A}_{J_n}. Due to (7.9) we may
assume J_1 \subseteq J_2 \subseteq J_3 \subseteq \ldots. We apply the result proved in 3. to J = J_1 and Z = Z_n.
As \omega_{J_1} \mapsto \pi_0\big((Z_n)_{\omega_{J_1}}\big) is \mathcal{A}_{J_1}-measurable,

    Q_n := \Big\{\omega_{J_1} \in \Omega_{J_1} : \pi_0\big((Z_n)_{\omega_{J_1}}\big) \ge \frac{\delta}{2}\Big\} \in \mathcal{A}_{J_1}.

Since all the \pi_J have mass one, we obtain from (7.11):

    \pi_0(Z_n) \le \pi_{J_1}(Q_n) + \frac{\delta}{2},

hence \pi_{J_1}(Q_n) \ge \frac{\delta}{2} > 0 for all n \in \mathbb{N}. Together with (Z_n) also (Q_n) is decreasing.
As a finite measure, \pi_{J_1} is \emptyset-continuous, which implies \bigcap_{n=1}^{\infty} Q_n \ne \emptyset. Hence there is
\omega_{J_1} \in \bigcap_{n=1}^{\infty} Q_n with

    \pi_0\big((Z_n)_{\omega_{J_1}}\big) \ge \frac{\delta}{2} > 0  for all n.       (7.14)

Successive application of 3. implies via induction that for each k \in \mathbb{N} there is \omega_{J_k} \in \Omega_{J_k}
with (analogously to (7.14)) \pi_0\big((Z_n)_{\omega_{J_k}}\big) \ge \frac{\delta}{2^k} > 0 for all n, and

    p_{J_k}^{J_{k+1}}(\omega_{J_{k+1}}) = \omega_{J_k}.

Due to this second property there is \omega_0 \in \Omega with p_{J_k}(\omega_0) = \omega_{J_k}. Because of (7.14)
we have (Z_n)_{\omega_{J_n}} \ne \emptyset, such that there is \omega_n \in \Omega with (\omega_{J_n}, p_{I \setminus J_n}(\omega_n)) \in Z_n. But then
also

    (\omega_{J_n}, p_{I \setminus J_n}(\omega_0)) = \omega_0 \in Z_n.

Thus \omega_0 \in \bigcap_{n=1}^{\infty} Z_n, which proves (7.13).

Therefore \pi_0 is \sigma-additive and hence has an extension \pi to \mathcal{A} by Caratheodory's theorem
(Theorem 14.7). It is clear that \pi_0 has mass one (i.e. \pi_0(\Omega) = 1), since for J \in H(I)
we have \Omega = p_J^{-1}(\Omega_J) and hence

    \pi_0(\Omega) = \pi_J(\Omega_J) = 1.

In particular \pi_0 is finite, the extension is unique, and \pi is a probability
measure, that is \pi(\Omega) = \pi_0(\Omega) = 1.

This proves the theorem.


We conclude the chapter with an exercise that was left open during this proof.

Exercise 7.4 With the notations of this section, in particular of Theorem 7.3, show that
p_J^K is (\mathcal{A}_K, \mathcal{A}_J)-measurable (J \subseteq K, J, K \in H(I)) and that

    p_J^K(\pi_K) = \pi_J.
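Theorem 7.3 is purely measure-theoretic, but the identity p_J(\pi) = \pi_J in (7.7) is easy to probe numerically: under a product measure, the mass of a cylinder set depends only on the finitely many fixed coordinates. The following sketch (the function name and toy marginals are ours, not from the notes) estimates a cylinder probability for a Bernoulli product measure on \{0,1\}^I by Monte Carlo.

```python
import random

rng = random.Random(4)

def cylinder_probability(marginals, fixed, n_samples=200_000):
    """Monte-Carlo estimate of pi(p_J^{-1}(A)) for a Bernoulli product measure.

    marginals[i] is P(coordinate i = 1); `fixed` prescribes values for the
    finitely many coordinates in J, i.e. A = {omega_J = fixed}.  Only the
    coordinates in J need to be simulated -- the cylinder probability does
    not depend on the remaining (infinitely many) coordinates."""
    hits = 0
    for _ in range(n_samples):
        if all((rng.random() < marginals[i]) == bool(v) for i, v in fixed.items()):
            hits += 1
    return hits / n_samples

marginals = {0: 0.5, 1: 0.25, 2: 0.5}   # coordinate 2 is deliberately unused
fixed = {0: 1, 1: 0}
est = cylinder_probability(marginals, fixed)
print(est)   # near pi_0(1) * pi_1(0) = 0.5 * 0.75 = 0.375
```

The estimate matches the product of the one-dimensional marginal masses, which is exactly what (7.7) asserts for this cylinder.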
8 Zero-One Laws
Already in Section 5 we encountered the prototype of a zero-one law: for a sequence of
independent events (A_n)_n we have Borel's Zero-One Law (Theorem 5.12):

    P(\limsup A_n) \in \{0, 1\}.

In a first step we will now ask when the probability in question is zero and when it is one.
This leads to the following frequently used lemma:

Lemma 8.1 (Borel-Cantelli Lemma) Let (A_n) be a sequence of events over a probability
space (\Omega, \mathcal{F}, P). Then

    \sum_{n=1}^{\infty} P(A_n) < \infty \implies P(\limsup A_n) = 0.              (8.1)

If the events (A_n) are pairwise independent, then also

    \sum_{n=1}^{\infty} P(A_n) = \infty \implies P(\limsup A_n) = 1.              (8.2)

Remark 8.2 The Borel-Cantelli Lemma is most often used in the form of (8.1). Note that
this part does not require any knowledge about the dependence structure of the A_n.

Proof of Lemma 8.1. (8.1) is easy. Define

    A := \limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i.

This implies

    A \subseteq \bigcup_{i=n}^{\infty} A_i  for all n \in \mathbb{N},

and thus

    P(A) \le P\Big(\bigcup_{i=n}^{\infty} A_i\Big) \le \sum_{i=n}^{\infty} P(A_i).    (8.3)

Since \sum_{i=1}^{\infty} P(A_i) converges, \sum_{i=n}^{\infty} P(A_i) converges to zero as n \to \infty. This implies P(A) = 0,
hence (8.1).

For (8.2) again put A := \limsup A_n and furthermore

    I_n := 1_{A_n},  S_n := \sum_{j=1}^{n} I_j

and eventually

    S := \sum_{j=1}^{\infty} I_j.

Since the A_n are assumed to be pairwise independent, they are pairwise uncorrelated as
well. Hence

    V(S_n) = \sum_{j=1}^{n} V(I_j) = \sum_{j=1}^{n} \big( E(I_j^2) - E(I_j)^2 \big) = E(S_n) - \sum_{j=1}^{n} E(I_j)^2 \le E(S_n),

where we used that I_j^2 = I_j. Now by assumption \sum_{n=1}^{\infty} E(I_n) = +\infty.
Since S_n \uparrow S this is equivalent with

    \lim_{n \to \infty} E(S_n) = E(S) = \infty.                                   (8.4)

On the other hand, \omega \in A if and only if \omega \in A_n for infinitely many n, which is the case if
and only if S(\omega) = +\infty. The assertion thus is

    P(S = +\infty) = 1.

This can be seen as follows. By Chebyshev's inequality

    P(|S_n - E(S_n)| < \varepsilon) \ge 1 - \frac{V(S_n)}{\varepsilon^2}

for all \varepsilon > 0. Because of (8.4) we may assume that E S_n > 0 and choose \varepsilon = \frac{1}{2} E S_n. Hence

    P\Big(S_n \ge \frac{1}{2} E(S_n)\Big) \ge P\Big(|S_n - E S_n| < \frac{1}{2} E S_n\Big) \ge 1 - 4 \frac{V(S_n)}{E(S_n)^2}.

But V(S_n) \le E(S_n) and E(S_n) \to \infty. Thus

    \lim_{n \to \infty} \frac{V(S_n)}{E(S_n)^2} = 0.

Therefore, for all \varepsilon > 0 and all n large enough,

    P\Big(S_n \ge \frac{1}{2} E S_n\Big) \ge 1 - \varepsilon.

But now S \ge S_n and hence also

    P\Big(S \ge \frac{1}{2} E S_n\Big) \ge 1 - \varepsilon

for all \varepsilon > 0. But this implies P(S = +\infty) = 1, which is what we wanted to show.
Example 8.3 Let (X_n) be a sequence of real-valued random variables which satisfies

    \sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty                           (8.5)

for all \varepsilon > 0. Then X_n \to 0 P-a.s. Indeed the Borel-Cantelli Lemma says that (8.5)
implies that

    P(|X_n| > \varepsilon  infinitely often in n) = 0.

But this is exactly the definition of almost sure convergence of X_n to 0.
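Statement (8.1) lends itself to a quick simulation (a sketch with our own naming and parameters, not part of the notes): with P(A_n) = 1/n^2 the series converges, so along almost every sample path only finitely many of the events occur, no matter how many events we watch.

```python
import random

rng = random.Random(0)

def occurrences(p, n_events=5000, n_paths=500):
    """For each simulated path, count how many of the independent events
    A_1, ..., A_N occur, where P(A_n) = p(n)."""
    counts = []
    for _ in range(n_paths):
        c = sum(1 for n in range(1, n_events + 1) if rng.random() < p(n))
        counts.append(c)
    return counts

# Summable case: P(A_n) = 1/n^2, so sum P(A_n) < infinity.  By (8.1) almost
# every path sees only finitely many of the A_n; the counts stay small.
counts = occurrences(lambda n: 1.0 / n ** 2)
print(max(counts), sum(counts) / len(counts))
```

The average count stays near \sum 1/n^2 \approx 1.64 even though 5000 events were observed; with P(A_n) = 1/n the counts would instead grow without bound, as (8.2) predicts.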

Exercise 8.4 Is (8.5) equivalent with P-almost sure convergence of X_n to 0?

Here is how Theorem 5.10 translates to random variables.


Theorem 8.5 (Kolmogorov's 0-1 Law) Let (X_n)_n be a sequence of independent random
variables with values in arbitrary measurable spaces. Then for every tail event A, i.e. for
each A with

    A \in \bigcap_{n=1}^{\infty} \sigma(X_m, m \ge n),

it holds that P(A) \in \{0, 1\}.

Exercise 8.6 Derive Theorem 8.5 from Theorem 5.10.

Corollary 8.7 Let (X_n)_{n \in \mathbb{N}} be a sequence of independent, real-valued random variables.
Define

    \mathcal{T} := \bigcap_{m=1}^{\infty} \sigma(X_i, i \ge m)

to be the tail \sigma-algebra. If T is a real-valued random variable that is measurable with
respect to \mathcal{T}, then T is P-almost surely constant, i.e. there is \alpha \in \mathbb{R} such that

    P(T = \alpha) = 1.

Such random variables T : \Omega \to \mathbb{R} that are \mathcal{T}-measurable are called tail functions.

Proof. For each \alpha \in \mathbb{R} we have that

    \{T \le \alpha\} \in \mathcal{T}.

This implies P(T \le \alpha) \in \{0, 1\}. On the other hand, \alpha \mapsto P(T \le \alpha) being a distribution
function, we have

    \lim_{\alpha \to -\infty} P(T \le \alpha) = 0  and  \lim_{\alpha \to +\infty} P(T \le \alpha) = 1.

Let C := \{\alpha \in \mathbb{R} : P(T \le \alpha) = 1\} and \alpha_0 := \inf C. Then for
an appropriately chosen decreasing sequence (\alpha_n) \subseteq C we have \alpha_n \downarrow \alpha_0, and since \{T \le \alpha_n\} \downarrow
\{T \le \alpha_0\} we have \alpha_0 \in C. Hence \alpha_0 = \min C. This implies

    P(T < \alpha_0) = 0,

which implies

    P(T = \alpha_0) = 1.

Exercise 8.8 A coin is tossed infinitely often. Show that every finite sequence

    (\varepsilon_1, \ldots, \varepsilon_k),  \varepsilon_i \in \{H, T\},  k \in \mathbb{N},

occurs infinitely often with probability one.
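A numerical companion to Exercise 8.8 (the code and its parameters are ours): disjoint blocks of length k are independent, so the second Borel-Cantelli statement (8.2) forces a fixed pattern such as HTH to recur forever; in n tosses one expects roughly n/2^k overlapping occurrences of a length-k pattern.

```python
import random

rng = random.Random(1)

def pattern_count(pattern, n_tosses):
    """Count overlapping occurrences of a fixed H/T pattern in n_tosses fair coin tosses."""
    tosses = "".join(rng.choice("HT") for _ in range(n_tosses))
    k = len(pattern)
    return sum(1 for i in range(n_tosses - k + 1) if tosses[i:i + k] == pattern)

# Each length-3 window matches "HTH" with probability 1/8.
c = pattern_count("HTH", 100_000)
print(c)   # roughly 100_000 / 8
```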

Exercise 8.9 Try to prove (8.2) in the Borel-Cantelli Lemma for independent events
(A_n) as follows:

1. For each sequence (\alpha_n) of real numbers with 0 \le \alpha_n \le 1 we have

    \prod_{i=1}^{n} (1 - \alpha_i) \le \exp\Big(-\sum_{i=1}^{n} \alpha_i\Big).    (8.6)

This implies

    \sum_{n=1}^{\infty} \alpha_n = \infty \implies \lim_{n \to \infty} \prod_{i=1}^{n} (1 - \alpha_i) = 0.

2. For A := \limsup A_n we have

    1 - P(A) = \lim_{n \to \infty} P\Big(\bigcap_{m=n}^{\infty} A_m^c\Big) = \lim_{n \to \infty} \lim_{N \to \infty} \prod_{m=n}^{N} (1 - P(A_m)).

3. As \sum P(A_n) diverges, we have because of 1.

    \lim_{N \to \infty} \prod_{m=n}^{N} (1 - P(A_m)) = 0

and hence P(A) = 1 because of 2. Fill in the missing details!

9 Laws of Large Numbers


The central goal of probability theory is to describe the asymptotic behavior of a sequence
of random variables. In its easiest form this has already been done for i.i.d. sequences
in Introduction to Probability and Statistics. In the first theorem of this section this is
slightly generalized.

Theorem 9.1 (Khintchine) Let (X_n)_{n \in \mathbb{N}} be a sequence of square integrable, real-valued
random variables that are pairwise uncorrelated. Assume

    \lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} V(X_i) = 0.

Then the weak law of large numbers holds, i.e.

    \lim_{n \to \infty} P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} X_i - E\Big(\frac{1}{n} \sum_{i=1}^{n} X_i\Big) \Big| > \varepsilon \Big) = 0  for all \varepsilon > 0.

Proof. By Chebyshev's inequality, for each \varepsilon > 0:

    P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} (X_i - E X_i) \Big| > \varepsilon \Big)
        \le \frac{1}{\varepsilon^2} V\Big( \frac{1}{n} \sum_{i=1}^{n} (X_i - E X_i) \Big)
        = \frac{1}{\varepsilon^2} \frac{1}{n^2} V\Big( \sum_{i=1}^{n} (X_i - E X_i) \Big)
        = \frac{1}{\varepsilon^2} \frac{1}{n^2} \sum_{i=1}^{n} V(X_i - E X_i)
        = \frac{1}{\varepsilon^2} \frac{1}{n^2} \sum_{i=1}^{n} V(X_i).

Here we used that the random variables are pairwise uncorrelated. By assumption the
latter expression converges to zero.
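Theorem 9.1 can be watched at work numerically. In this sketch (our own, with Uniform(0,1) summands, so V(X_i) = 1/12 and E X_i = 1/2), the fraction of sample paths whose average deviates from the mean by more than \varepsilon shrinks as n grows, in line with the Chebyshev bound in the proof.

```python
import random

rng = random.Random(5)

def deviation_fraction(n, eps=0.05, trials=400):
    """Fraction of trials with |S_n/n - E X_1| > eps for X_i ~ Uniform(0, 1)."""
    bad = 0
    for _ in range(trials):
        m = sum(rng.random() for _ in range(n)) / n
        if abs(m - 0.5) > eps:
            bad += 1
    return bad / trials

for n in (10, 100, 1000):
    print(n, deviation_fraction(n))   # decreases toward 0 as n grows
```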

Remark 9.2 As we will learn in the next theorem, for a pairwise independent, identically
distributed sequence square integrability is not even required; integrability suffices.

Theorem 9.1 raises the question whether we can replace the stochastic convergence there
by almost sure convergence. This will be shown in the following theorem. Such a theorem
is called a strong law of large numbers. Its first form was proved by Kolmogorov. We
will present a proof due to Etemadi from 1981.

Theorem 9.3 (Strong Law of Large Numbers, Etemadi 1981) For each sequence
(X_n)_n of real-valued, pairwise independent, identically distributed, integrable random vari-
ables the Strong Law of Large Numbers holds, i.e.

    P\Big( \limsup_{n} \Big| \frac{1}{n} \sum_{i=1}^{n} X_i - E X_1 \Big| > \varepsilon \Big) = 0  for each \varepsilon > 0.

Before we prove Theorem 9.3 let us make a couple of remarks. These should reveal the
structure of the proof a bit:

1. Denote S_n = \sum_{i=1}^{n} X_i. Then Theorem 9.3 asserts that \frac{1}{n} S_n \to E X_1, P-almost
surely.

2. Together with X_n also X_n^+ and X_n^- (where X_n^+ = \max(X_n, 0) and X_n^- = (-X_n)^+)
satisfy the assumptions of Theorem 9.3. Since X_n = X_n^+ - X_n^-, it therefore suffices to
prove Theorem 9.3 for positive random variables. We therefore assume X_n \ge 0 for
the rest of the proof.

3. All proofs of the Strong Law of Large Numbers use the following trick: we truncate
the random variables X_n by cutting off values that are too large. We therefore
introduce

    Y_n := X_n 1_{\{|X_n| < n\}} = X_n 1_{\{X_n < n\}}.

Of course, if \mu is the distribution of X_n and \mu_n is the distribution of Y_n, then \mu_n \ne \mu.
Indeed \mu_n = f_n(\mu), where

    f_n(x) := x  if 0 \le x < n,  and  f_n(x) := 0  otherwise.

The idea behind truncation is that we gain square integrability of the sequence.
Indeed:

    E(Y_n^2) = E(f_n^2 \circ X_n) = \int f_n^2(x) \, d\mu(x) = \int_0^n x^2 \, d\mu(x) < \infty.

4. Of course, after having gained information about the Y_n we need to translate these
results back to the X_n. To this end we will apply the Borel-Cantelli Lemma and show
that

    \sum_{n=1}^{\infty} P(X_n \ne Y_n) < \infty.

This implies that X_n \ne Y_n only for finitely many n with probability one. In particular,
if we can show that \frac{1}{n} \sum_{i=1}^{n} Y_i converges P-a.s., then \frac{1}{n} \sum_{i=1}^{n} X_i converges P-a.s.
to the same limit.

5. For the purposes of the proof we remark the following: let \alpha > 1 and for n \in \mathbb{N} let

    k_n := [\alpha^n]

denote the largest integer \le \alpha^n. This means k_n \in \mathbb{N} and

    k_n \le \alpha^n < k_n + 1.

Since

    \lim_{n \to \infty} \frac{\alpha^n - 1}{\alpha^n} = 1,

there is a number c_\alpha, 0 < c_\alpha < 1, such that

    k_n > \alpha^n - 1 \ge c_\alpha \alpha^n  for all n \in \mathbb{N}.            (9.1)

We now turn to the

Proof of Theorem 9.3.
Step 1: Without loss of generality X_n \ge 0 (see Remark 2 above). Define Y_n := 1_{\{X_n < n\}} X_n.
Then the Y_n are pairwise independent and square integrable. Define

    S_n' := \sum_{i=1}^{n} (Y_i - E Y_i).

Let \varepsilon > 0 and \alpha > 1. Using Chebyshev's inequality and the pairwise independence of the
random variables (Y_n) we obtain

    P\Big( \Big| \frac{1}{n} S_n' \Big| > \varepsilon \Big) \le \frac{1}{\varepsilon^2} V\Big( \frac{1}{n} S_n' \Big) = \frac{1}{\varepsilon^2} \frac{1}{n^2} V(S_n') = \frac{1}{\varepsilon^2} \frac{1}{n^2} \sum_{i=1}^{n} V(Y_i).

Observe that V(Y_i) = E(Y_i^2) - (E(Y_i))^2 \le E(Y_i^2). Thus

    P\Big( \Big| \frac{1}{n} S_n' \Big| > \varepsilon \Big) \le \frac{1}{\varepsilon^2} \frac{1}{n^2} \sum_{i=1}^{n} E(Y_i^2).

For k_n = [\alpha^n] this gives

    P\Big( \Big| \frac{1}{k_n} S_{k_n}' \Big| > \varepsilon \Big) \le \frac{1}{\varepsilon^2} \frac{1}{k_n^2} \sum_{i=1}^{k_n} E(Y_i^2)

for all n \in \mathbb{N}. Thus

    \sum_{n=1}^{\infty} P\Big( \Big| \frac{1}{k_n} S_{k_n}' \Big| > \varepsilon \Big) \le \frac{1}{\varepsilon^2} \sum_{n=1}^{\infty} \sum_{i=1}^{k_n} \frac{1}{k_n^2} E(Y_i^2).

By rearranging the order of summation we obtain

    \sum_{n=1}^{\infty} P\Big( \Big| \frac{1}{k_n} S_{k_n}' \Big| > \varepsilon \Big) \le \frac{1}{\varepsilon^2} \sum_{j=1}^{\infty} t_j \, E(Y_j^2),

where

    t_j := \sum_{n=n_j}^{\infty} \frac{1}{k_n^2}

and n_j is the smallest n with k_n \ge j. From (9.1) we obtain

    t_j \le \frac{1}{c_\alpha^2} \sum_{n=n_j}^{\infty} \alpha^{-2n} = \frac{1}{c_\alpha^2 (1 - \alpha^{-2})} \, \alpha^{-2 n_j} =: d_\alpha \, \alpha^{-2 n_j},

where d_\alpha = c_\alpha^{-2} (1 - \alpha^{-2})^{-1} > 0. Since \alpha^{n_j} \ge k_{n_j} \ge j, this implies

    t_j \le d_\alpha \, j^{-2}.

By using the above and E(Y_j^2) = \int_0^j x^2 \, d\mu(x) = \sum_{k=1}^{j} \int_{k-1}^{k} x^2 \, d\mu(x):

    \sum_{n=1}^{\infty} P\Big( \Big| \frac{1}{k_n} S_{k_n}' \Big| > \varepsilon \Big) \le \frac{d_\alpha}{\varepsilon^2} \sum_{j=1}^{\infty} \frac{1}{j^2} \sum_{k=1}^{j} \int_{k-1}^{k} x^2 \, d\mu(x).

Again rearranging the order of summation yields:

    \sum_{j=1}^{\infty} \frac{1}{j^2} \sum_{k=1}^{j} \int_{k-1}^{k} x^2 \, d\mu(x) = \sum_{k=1}^{\infty} \Big( \sum_{j=k}^{\infty} \frac{1}{j^2} \Big) \int_{k-1}^{k} x^2 \, d\mu(x).

Since

    \sum_{j=k}^{\infty} \frac{1}{j^2} < \frac{1}{k^2} + \frac{1}{k(k+1)} + \frac{1}{(k+1)(k+2)} + \ldots
        = \frac{1}{k^2} + \Big( \frac{1}{k} - \frac{1}{k+1} \Big) + \Big( \frac{1}{k+1} - \frac{1}{k+2} \Big) + \ldots = \frac{1}{k^2} + \frac{1}{k} \le \frac{2}{k},

this yields

    \sum_{n=1}^{\infty} P\Big( \Big| \frac{1}{k_n} S_{k_n}' \Big| > \varepsilon \Big) \le \frac{2 d_\alpha}{\varepsilon^2} \sum_{k=1}^{\infty} \int_{k-1}^{k} \frac{x^2}{k} \, d\mu(x) \le \frac{2 d_\alpha}{\varepsilon^2} \sum_{k=1}^{\infty} \int_{k-1}^{k} x \, d\mu(x) = \frac{2 d_\alpha}{\varepsilon^2} E(X_1) < \infty.

Thus by the Borel-Cantelli Lemma

    P\Big( \Big| \frac{1}{k_n} S_{k_n}' \Big| > \varepsilon  infinitely often in n \Big) = 0.

But this is equivalent with the almost sure convergence

    \lim_{n \to \infty} \frac{1}{k_n} S_{k_n}' = 0  P-a.s.                        (9.2)

Step 2: Next let us see that \frac{1}{k_n} \sum_{i=1}^{k_n} Y_i can indeed only converge to E(X_1). By
definition of Y_n we have that

    E(Y_n) = \int x \, d\mu_n(x) = \int_0^n x \, d\mu(x).

Thus by monotone convergence

    E(X_1) = \lim_{n \to \infty} E(Y_n).

By Exercise 9.6 below this implies

    E(X_1) = \lim_{n \to \infty} \frac{1}{n} (E Y_1 + \ldots + E Y_n).            (9.3)

Since by definition of the sums S_n'

    \frac{1}{k_n} S_{k_n}' = \frac{1}{k_n} \sum_{i=1}^{k_n} Y_i - \frac{1}{k_n} \sum_{i=1}^{k_n} E(Y_i),

(9.2) and (9.3) together imply

    \lim_{n \to \infty} \frac{1}{k_n} \sum_{i=1}^{k_n} Y_i = \lim_{n \to \infty} \frac{1}{k_n} S_{k_n}' + \lim_{n \to \infty} \frac{1}{k_n} \sum_{i=1}^{k_n} E Y_i = E X_1  P-a.s.,

which is what we wanted to show in this step.

Step 3: Now we are aiming at removing the truncation from the X_n. Consider the sum

    \sum_{n=1}^{\infty} P(X_n \ne Y_n) = \sum_{n=1}^{\infty} P(X_n \ge n).

According to Exercise 3.4 this is smaller than E(X_1), so that it is bounded. Therefore

    P(X_n \ne Y_n  infinitely often) = 0.

Hence there is an n_0 (random) such that with probability one X_n = Y_n for all n \ge n_0. But
the finitely many differences drop out when averaging, hence

    \lim_{n \to \infty} \frac{1}{k_n} S_{k_n} = E X_1  P-a.s.

Step 4: Eventually we show that the theorem holds not only for subsequences (k_n) chosen
as above, but also for the whole sequence.
For fixed \alpha > 1 the sequence (k_n)_n is fixed and diverges to +\infty. Hence for
every m \in \mathbb{N} there exists n \in \mathbb{N} such that

    k_n < m \le k_{n+1}.

Since we assumed the X_i to be non-negative, this implies

    S_{k_n} \le S_m \le S_{k_{n+1}}.

Hence

    \frac{S_{k_n}}{k_n} \cdot \frac{k_n}{m} \le \frac{S_m}{m} \le \frac{S_{k_{n+1}}}{k_{n+1}} \cdot \frac{k_{n+1}}{m}.

The definition of k_n yields

    k_n \le \alpha^n < k_n + 1 \le m \le k_{n+1} \le \alpha^{n+1}.

This gives

    \frac{k_{n+1}}{m} < \frac{\alpha^{n+1}}{\alpha^n} = \alpha

as well as

    \frac{k_n}{m} > \frac{\alpha^n - 1}{\alpha^{n+1}}.

Now, given \alpha, for all n \ge n_1 = n_1(\alpha) we have \alpha^n - 1 \ge \alpha^{n-1}. Hence, if m \ge k_{n_1} and thus
n \ge n_1, we obtain

    \frac{k_n}{m} > \frac{\alpha^n - 1}{\alpha^{n+1}} \ge \frac{\alpha^{n-1}}{\alpha^{n+1}} = \frac{1}{\alpha^2}.

Now for each \alpha we have a set \Omega_\alpha with P(\Omega_\alpha) = 1 and

    \lim_{n \to \infty} \frac{1}{k_n} S_{k_n}(\omega) = E X_1  for all \omega \in \Omega_\alpha.

Without loss of generality we may assume that the X_i are not identically equal to zero P-a.s.,
otherwise the assertion of the Strong Law of Large Numbers is trivially true. Therefore we
may assume without loss of generality that E X_1 > 0. Since \alpha > 1, we then have

    \frac{1}{\alpha} E X_1 < \frac{1}{k_n} S_{k_n}(\omega) < \alpha \, E X_1

for all \omega \in \Omega_\alpha and all n large enough. For such m and \omega this means

    \alpha^{-3} E X_1 < \frac{1}{m} S_m(\omega) < \alpha^2 \, E X_1.

Define

    \Omega_1 := \bigcap_{n=1}^{\infty} \Omega_{1 + \frac{1}{n}}.

Then P(\Omega_1) = 1, and letting \alpha = 1 + \frac{1}{n} \downarrow 1 in the bounds above,

    \lim_{m \to \infty} \frac{1}{m} S_m(\omega) = E X_1

for all \omega \in \Omega_1. This proves the theorem.
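As a numerical sanity check of Theorem 9.3 (a sketch under our own choices, not part of the notes: Exp(1) variables, so E X_1 = 1), the running averages S_n/n settle near the mean simultaneously over many simulated paths.

```python
import random

rng = random.Random(2)

def running_average_error(n, trials=50):
    """Max over several simulated paths of |S_n/n - E X_1| for i.i.d. Exp(1)
    random variables (E X_1 = 1)."""
    errs = []
    for _ in range(trials):
        s = sum(rng.expovariate(1.0) for _ in range(n))
        errs.append(abs(s / n - 1.0))
    return max(errs)

for n in (100, 10_000):
    print(n, running_average_error(n))   # shrinks as n grows
```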

Remark 9.4 Theorem 9.3 in particular implies that for i.i.d. sequences of random vari-
ables (X_n) with a finite first moment the Strong Law of Large Numbers holds true. Since
stochastic convergence is implied by almost sure convergence, the Weak Law of Large
Numbers of Theorem 9.1 holds true for such sequences as well. Therefore the finiteness of
the second moment is not necessary for Theorem 9.1 to hold true for i.i.d. sequences.

Remark 9.5 One might, of course, ask whether a finite first moment is necessary for
Theorem 9.3 to hold. Indeed one can prove that, if a sequence of i.i.d. random variables is
such that \frac{1}{n} \sum_{i=1}^{n} X_i converges almost surely to some random variable Y (necessarily a tail
function as in Corollary 8.7!), then E X_1 exists and Y = E X_1 almost surely. This will not
be shown in the context of this course.

Exercise 9.6 Let (a_m)_m be real numbers such that \lim_{m \to \infty} a_m = a. Show that this implies
convergence of the Cesaro means:

    \lim_{n \to \infty} \frac{1}{n} (a_1 + a_2 + \ldots + a_n) = a.

Exercise 9.7 Prove the Strong Law of Large Numbers for a sequence of i.i.d. random
variables (X_n)_n with a finite fourth moment, i.e. for random variables with E(X_1^4) < \infty.
Do not use the statement of Theorem 9.3 explicitly.

Remark 9.8 A very natural question to ask in the context of Theorem 9.3 is: how fast
does \frac{1}{n} S_n converge to E X_1? I.e., given a sequence of i.i.d. random variables (X_n)_n, what is

    P\Big( \Big| \frac{1}{n} \sum X_i - E X_1 \Big| \ge \varepsilon \Big)?

If X_1 has a finite moment generating function, i.e. if

    M(t) := \log E e^{t X_1} < \infty  for all t,

the answer is: exponentially fast. Indeed, Cramer's theorem (which cannot be proven in the
context of this course) asserts the following: let I : \mathbb{R} \to \mathbb{R} be given by

    I(x) = \sup_t \, [x t - M(t)].

Then for every closed set A \subseteq \mathbb{R}

    \limsup_{n \to \infty} \frac{1}{n} \log P\Big( \frac{1}{n} \sum_{i=1}^{n} X_i \in A \Big) \le -\inf_{x \in A} I(x)

and for every open set O \subseteq \mathbb{R}

    \liminf_{n \to \infty} \frac{1}{n} \log P\Big( \frac{1}{n} \sum_{i=1}^{n} X_i \in O \Big) \ge -\inf_{x \in O} I(x).

This is called a principle of large deviations for the random variables \frac{1}{n} \sum_{i=1}^{n} X_i. In
particular, one can show that the function I is convex and non-negative, with

    I(x) = 0 \iff x = E X_1.

We therefore obtain, for every \varepsilon > 0 and every \delta > 0 and all n large enough,

    P\Big( \Big| \frac{1}{n} \sum X_i - E X_1 \Big| \ge \varepsilon \Big) \le e^{-n \min(I(E X_1 + \varepsilon), \, I(E X_1 - \varepsilon)) + n \delta}.

The speed of convergence is thus exponentially fast.

Example 9.9 For a Bernoulli B(1, p) random variable X

    e^{M(t)} = E(e^{t X}) = p e^t + (1 - p);  I(x) = -x \log\frac{p}{x} - (1 - x) \log\frac{1 - p}{1 - x}  (0 < x < 1).
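Cramer's theorem itself is not proven here, but the Legendre transform in Remark 9.8 is easy to check numerically. The sketch below (the function names and grid parameters are ours) compares a crude numerical supremum over t with the closed form of Example 9.9.

```python
import math

def M(t, p):
    """Log-moment generating function log E[e^{tX}] of X ~ Bernoulli(p)."""
    return math.log(p * math.exp(t) + 1.0 - p)

def I_numeric(x, p, lo=-30.0, hi=30.0, steps=60_001):
    """Crude numerical Legendre transform sup_t [x*t - M(t)] on a grid."""
    best = -float("inf")
    for i in range(steps):
        t = lo + (hi - lo) * i / (steps - 1)
        best = max(best, x * t - M(t, p))
    return best

def I_closed(x, p):
    """Closed form from Example 9.9, valid for 0 < x < 1."""
    return x * math.log(x / p) + (1.0 - x) * math.log((1.0 - x) / (1.0 - p))

print(I_numeric(0.7, 0.5), I_closed(0.7, 0.5))   # the two values agree closely
```

Note also that I(p) = 0, consistent with I(x) = 0 iff x = E X_1 = p.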

Exercise 9.10 Determine the functions M and I for a normally distributed random variable.

Exercise 9.11 Argue that if the moment generating function of a random variable X is
finite, all its moments are finite. In particular both Laws of Large Numbers are applicable
to a sequence X_1, X_2, \ldots of i.i.d. random variables distributed like X.

At the end of this section we will turn to two applications of the Law of Large Numbers
which are interesting in their own right.
The first of these two applications is in number theory. Let (\Omega, \mathcal{F}, P) be given by
\Omega = [0, 1), \mathcal{F} = \mathcal{B}_{[0,1)} and P = \lambda^1 (Lebesgue measure). For every number \omega \in \Omega we may
consider its g-adic representation

    \omega = \sum_{n=1}^{\infty} \xi_n(\omega) \, g^{-n}.                         (9.4)

Here g \ge 2 is a natural number and \xi_n(\omega) \in \{0, \ldots, g-1\}. This representation is unique
if we exclude those representations in which all but finitely many of the digits \xi_n are equal
to g - 1. For each \eta \in \{0, \ldots, g-1\} let S_n^{\eta,g}(\omega) be the number of all i \in \{1, \ldots, n\} with
\xi_i(\omega) = \eta in the g-adic representation (9.4). We will call a number \omega \in [0, 1) g-normal
(the usual, stronger meaning of g-normality is that each string of digits \eta_1 \eta_2 \ldots \eta_k occurs
with frequency g^{-k}) if

    \lim_{n \to \infty} \frac{1}{n} S_n^{\eta,g}(\omega) = \frac{1}{g}

for all \eta = 0, \ldots, g-1. Hence \omega is g-normal if in the long run all of its digits occur with the
same frequency. We will call \omega absolutely normal if \omega is g-normal for all g \in \mathbb{N}, g \ge 2.
Now for a number \omega \in [0, 1) randomly chosen according to Lebesgue measure, the \xi_i(\omega) are
i.i.d. random variables; they have as their distribution the uniform distribution on the set
\{0, \ldots, g-1\}. This has to be shown in Exercise 9.13 below and is a consequence of the
uniformity of Lebesgue measure. Hence the random variables

    X_n^{\eta,g}(\omega) := 1  if \xi_n(\omega) = \eta,  and  X_n^{\eta,g}(\omega) := 0  otherwise,

are i.i.d. for each g and \eta. Moreover S_n^{\eta,g}(\omega) = \sum_{i=1}^{n} X_i^{\eta,g}(\omega). According
to the Strong Law of Large Numbers (Theorem 9.3)

    \frac{1}{n} S_n^{\eta,g} \to E(X_1^{\eta,g}) = \frac{1}{g}  \lambda^1-a.s.

for all \eta \in \{0, \ldots, g-1\} and all g \ge 2. This means that \lambda^1-almost every number \omega is g-normal,
i.e. there is a set N_g with \lambda^1(N_g) = 0 such that \omega is g-normal for all \omega \in N_g^c. Now

    N := \bigcup_{g=2}^{\infty} N_g

is a set of Lebesgue measure zero as well. This readily implies

Theorem 9.12 (E. Borel) \lambda^1-almost every \omega \in [0, 1) is absolutely normal.

It is rather surprising that hardly any normal numbers in the usual, stronger sense (every
digit string of length k occurring with frequency g^{-k}) are known explicitly. Champernowne
(1933) showed that

    \omega = 0.1234567891011121314\ldots

is 10-normal. Whether \sqrt{2}, \log 2, e or \pi are normal of any kind has not been shown yet.
There are no absolutely normal numbers known at all.
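Champernowne's constant is explicit enough to check digit counts directly. This sketch (our own code) counts digit frequencies in a finite prefix; note that the convergence to 1/10 is slow, and the digit '1' is still overrepresented at this prefix length because of the leading digits of the concatenated integers.

```python
from collections import Counter

def champernowne_digits(n_digits):
    """First n_digits decimal digits of Champernowne's constant 0.12345678910111213..."""
    out = []
    total = 0
    k = 1
    while total < n_digits:
        d = str(k)
        out.append(d)
        total += len(d)
        k += 1
    return "".join(out)[:n_digits]

digits = champernowne_digits(100_000)
freqs = {d: c / len(digits) for d, c in Counter(digits).items()}
print(sorted(freqs.items()))   # all ten frequencies roughly balanced, slowly approaching 0.1
```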

Exercise 9.13 Show that for every g \ge 2, the random variables \xi_n(\omega) introduced above
are i.i.d. random variables that are uniformly distributed on \{0, \ldots, g-1\}.

The second application is to derive a classical result from analysis which in principle
has nothing to do with probability theory, but which is closely related to the Law of Large
Numbers. As may be well known, the approximation theorem of Stone and Weierstrass
asserts that every continuous function on [a, b] (more generally, on every compact set) can be
approximated uniformly by polynomials. Obviously it suffices to prove this for [a, b] = [0, 1].
So let f \in C([0, 1]) be a continuous function on [0, 1]. Define the n-th Bernstein polynomial
for f as

    B_n f(x) = \sum_{k=0}^{n} f\Big(\frac{k}{n}\Big) \binom{n}{k} x^k (1 - x)^{n-k}.

Theorem 9.14 For each f \in C([0, 1]) the polynomials B_n f converge to f uniformly on
[0, 1].

Proof. Since f is continuous and [0, 1] is compact, f is uniformly continuous on [0, 1], i.e.
for each \varepsilon > 0 there exists \delta > 0 such that

    |x - y| < \delta \implies |f(x) - f(y)| < \varepsilon.

Now consider a sequence (X_n)_n of i.i.d. Bernoulli random variables with parameter p. Call

    \bar{S}_n := \frac{1}{n} S_n := \frac{1}{n} \sum_{i=1}^{n} X_i.

Then by Chebyshev's inequality

    P(|\bar{S}_n - p| \ge \delta) \le \frac{1}{\delta^2} V(\bar{S}_n) = \frac{1}{n^2 \delta^2} V(S_n) = \frac{p(1-p)}{n \delta^2} \le \frac{1}{4 n \delta^2}.    (9.5)

This yields

    |B_n f(p) - f(p)| = |E(f \circ \bar{S}_n) - f(p)| = \Big| \int f(x) \, dP_{\bar{S}_n}(x) - f(p) \Big|
        \le \int_{\{|x - p| < \delta\}} |f(x) - f(p)| \, dP_{\bar{S}_n}(x) + \int_{\{|x - p| \ge \delta\}} |f(x) - f(p)| \, dP_{\bar{S}_n}(x)
        \le \varepsilon + 2 \|f\|_\infty P(|\bar{S}_n - p| \ge \delta) \le \varepsilon + \frac{2 \|f\|_\infty}{4 n \delta^2}.

Here \|f\|_\infty is the sup-norm of f. Hence

    \sup_{p \in [0, 1]} |B_n f(p) - f(p)| \le \varepsilon + \frac{2 \|f\|_\infty}{4 n \delta^2}.

This can be made smaller than 2\varepsilon by choosing n large enough.

Notice that the Weak Law of Large Numbers by itself would yield, instead of inequality (9.5),
only an inequality of the kind

    \forall \varepsilon > 0 \; \exists N  such that  \forall n \ge N : P(|\bar{S}_n - p| \ge \delta) \le \varepsilon.

In this approach it is not clear that N can be chosen independently of p, so that we would
only get pointwise convergence.
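The Bernstein polynomials from the proof above are easy to evaluate directly (a sketch in our own code; f(x) = |x - 1/2| is continuous but not smooth): the maximal error on a grid decreases with n, in line with the 1/(4n\delta^2)-term in (9.5).

```python
import math

def bernstein(f, n):
    """Return the n-th Bernstein polynomial B_n f as a Python function."""
    def Bnf(x):
        return sum(f(k / n) * math.comb(n, k) * x ** k * (1 - x) ** (n - k)
                   for k in range(n + 1))
    return Bnf

f = lambda x: abs(x - 0.5)          # continuous, not differentiable at 1/2
grid = [i / 200 for i in range(201)]
for n in (10, 100, 400):
    err = max(abs(bernstein(f, n)(x) - f(x)) for x in grid)
    print(n, err)   # the uniform error shrinks as n grows
```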

10 The Central Limit Theorem


In the previous section we met one of the central theorems of probability theory, the
Law of Large Numbers: if E X_1 exists, for a sequence of i.i.d. random variables (X_n) the
average \frac{1}{n} \sum_{i=1}^{n} X_i converges to E X_1 (a.s.). The following theorem, called the Central Limit
Theorem, analyzes the fine structure in the Law of Large Numbers. Its name is due to Polya;
the proof given below goes back to Charles Stein.

First of all notice that, in a certain sense, in order to analyze the fine structure of \sum_{i=1}^{n} X_i,
the scaling \frac{1}{n} of the Weak Law of Large Numbers is already an overscaling. On this scale
we just cannot see the shape of the distribution anymore, since by scaling \sum_{i=1}^{n} X_i by a
factor \frac{1}{n} we have reduced its variance to a scale \frac{1}{n}, which converges to zero. What we see
in the Law of Large Numbers is a bell-shaped curve with a tiny, tiny width. Here is what
we get if we scale the variance to one:

Theorem 10.1 (Central Limit Theorem - CLT) Let X_1, \ldots, X_n be a sequence of ran-
dom variables that are independent and have identical distribution (the same for all n) with
E X_1^2 < \infty. Then

    \lim_{n \to \infty} P\Big( \frac{\sum_{i=1}^{n} (X_i - E X_1)}{\sqrt{n V X_1}} \le a \Big) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx.    (10.1)
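Theorem 10.1 can be illustrated empirically (our own sketch, not part of the notes; uniform summands with \mu = 1/2 and \sigma^2 = 1/12): the empirical distribution function of the normalized sum is close to the standard normal distribution function \Phi.

```python
import math
import random

rng = random.Random(3)

def clt_probability(a, n=1000, trials=3000):
    """Empirical P( (S_n - n*mu) / sqrt(n*var) <= a ) for X_i ~ Uniform(0, 1)."""
    mu, var = 0.5, 1.0 / 12.0
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        if (s - n * mu) / math.sqrt(n * var) <= a:
            hits += 1
    return hits / trials

def phi(a):
    """Standard normal distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

est = clt_probability(1.0)
print(est, phi(1.0))   # both near 0.84
```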
Before proving the Central Limit Theorem let us remark that it holds under weaker
assumptions as well.

Remark 10.2 Indeed, the Central Limit Theorem also holds under the following weaker
assumption. Assume given for n = 1, 2, \ldots an independent family of random variables X_{ni},
i = 1, \ldots, n. For j = 1, \ldots, n let

    \mu_{nj} = E X_{nj}

and

    s_n := \sqrt{ \sum_{i=1}^{n} V X_{ni} }.

The sequence ((X_{ni})_{i=1}^{n})_n is said to satisfy the Lindeberg condition if

    L_n(\varepsilon) \to 0  as n \to \infty

for all \varepsilon > 0. Here

    L_n(\varepsilon) = \frac{1}{s_n^2} \sum_{j=1}^{n} E\big[ (X_{nj} - \mu_{nj})^2 ; \, |X_{nj} - \mu_{nj}| \ge \varepsilon s_n \big].

Intuitively speaking, the Lindeberg condition asks that none of the variables dominates the
whole sum.
The generalized form of the CLT stated above now asserts that if the sequence ((X_{ni}))
satisfies the Lindeberg condition, it also satisfies the CLT, i.e.

    \lim_{n \to \infty} P\Big( \frac{\sum_{i=1}^{n} (X_{ni} - \mu_{ni})}{\sqrt{\sum_{i=1}^{n} V X_{ni}}} \le a \Big) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx.

The proof of this more general theorem basically mimicks the proof we will give below for
Theorem 10.1. We spare ourselves the additional technical work.

We will present a proof of the CLT that goes back to Charles Stein. It is based on a
couple of facts.
Fact 1: It suffices to prove the CLT for i.i.d. random variables with E X_1 = 0. Otherwise
one just subtracts E X_1 from each of the X_i.
Fact 2: Define

    S_n := \sum_{i=1}^{n} X_i  and  \sigma^2 := V(X_1).

Theorem 10.1 asserts the convergence in distribution of the (normalized) S_n to a Gaussian
random variable. What we need to show is thus

    E\Big[ f\Big( \frac{S_n}{\sigma \sqrt{n}} \Big) \Big] \to \frac{1}{\sqrt{2\pi}} \int f(x) e^{-x^2/2} \, dx = E[f(Y)]    (10.2)

as n \to \infty for all f : \mathbb{R} \to \mathbb{R} that are uniformly continuous and bounded. Here Y is a
standard normal random variable, i.e. it is N(0, 1)-distributed.
We prepare the proof of the CLT with two lemmata.

Lemma 10.3 Let f : \mathbb{R} \to \mathbb{R} be bounded and uniformly continuous. Define

    N(f) := \frac{1}{\sqrt{2\pi}} \int f(y) e^{-y^2/2} \, dy

and

    g(x) := e^{x^2/2} \int_{-\infty}^{x} (f(y) - N(f)) \, e^{-y^2/2} \, dy.

Then g fulfills

    g'(x) - x g(x) = f(x) - N(f).                                                 (10.3)

Proof. Differentiating g gives

    g'(x) = x e^{x^2/2} \int_{-\infty}^{x} (f(y) - N(f)) \, e^{-y^2/2} \, dy + e^{x^2/2} (f(x) - N(f)) \, e^{-x^2/2}
          = x g(x) + f(x) - N(f).

The importance of the above lemma becomes obvious if we substitute a random variable
X into (10.3) and take expectations:

    E[g'(X) - X g(X)] = E[f(X) - N(f)].

If X \sim N(0, 1) is standard normal, the right-hand side is zero and so is the left-hand
side. The idea is thus that instead of showing that

    E[f(U_n) - N(f)]

converges to zero, we may show the same for

    E[g'(U_n) - U_n g(U_n)].

The next step discusses the function g introduced above.

Lemma 10.4 Let f : \mathbb{R} \to \mathbb{R} be bounded and uniformly continuous and let g be the solution
of

    g'(x) - x g(x) = f(x) - N(f).                                                 (10.4)

Then g(x), x g(x) and g'(x) are bounded and continuous.

Proof. Obviously g is even differentiable, hence continuous. But then also x g(x) is con-
tinuous. Eventually

    g'(x) = x g(x) + f(x) - N(f)

is continuous as the sum of continuous functions. For the boundedness part first note that
any continuous function on a compact set is bounded; hence we only need to check that
the functions g, x g and g' are bounded as x \to \pm\infty.
To this end first note that

    g(x) = e^{x^2/2} \int_{-\infty}^{x} (f(y) - N(f)) \, e^{-y^2/2} \, dy
         = -e^{x^2/2} \int_{x}^{\infty} (f(y) - N(f)) \, e^{-y^2/2} \, dy

(this is true since the whole integral over \mathbb{R} must equal zero).

For x \le 0 we have

    |g(x)| \le \sup_{y \le 0} |f(y) - N(f)| \, e^{x^2/2} \int_{-\infty}^{x} e^{-y^2/2} \, dy,

while for x \ge 0 we have

    |g(x)| \le \sup_{y \ge 0} |f(y) - N(f)| \, e^{x^2/2} \int_{x}^{\infty} e^{-y^2/2} \, dy.

Now for x \le 0

    e^{x^2/2} \int_{-\infty}^{x} e^{-y^2/2} \, dy \le e^{x^2/2} \int_{-\infty}^{x} \frac{|y|}{|x|} e^{-y^2/2} \, dy = \frac{1}{|x|},

and similarly for x \ge 0

    e^{x^2/2} \int_{x}^{\infty} e^{-y^2/2} \, dy \le e^{x^2/2} \int_{x}^{\infty} \frac{y}{x} e^{-y^2/2} \, dy = \frac{1}{x}.    (10.5)

Thus we see that for x \le -1

    |g(x)| \le |x g(x)| \le \sup_{y \le 0} |f(y) - N(f)|,

as well as for x \ge 1

    |g(x)| \le |x g(x)| \le \sup_{y \ge 0} |f(y) - N(f)|.

Hence g and x g are bounded. But then also g' is bounded, since

    g'(x) = x g(x) + f(x) - N(f).

Now we turn to proving the CLT.
Proof of Theorem 10.1. Besides assuming that E X_i = 0 for all i we may also assume
that \sigma^2 = V X_1 = 1; otherwise we just replace X_i by X_i / \sigma. We write S_n := \sum_{i=1}^{n} X_i and
recall that in order to prove the assertion it suffices to show that for all bounded, uniformly
continuous f

    E\Big[ g'\Big(\frac{S_n}{\sqrt{n}}\Big) - \frac{S_n}{\sqrt{n}} g\Big(\frac{S_n}{\sqrt{n}}\Big) \Big] \to 0

as n \to \infty. Here g is defined as above.
But using the identity

    \int_0^1 \Big[ g'\Big(\frac{S_n - (1-s) X_j}{\sqrt{n}}\Big) - g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) \Big] \frac{X_j}{\sqrt{n}} \, ds
        = g\Big(\frac{S_n}{\sqrt{n}}\Big) - g\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) - \frac{X_j}{\sqrt{n}} g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big),

which is to be proven in Exercise 10.5 below, we arrive at

    E\Big[ g'\Big(\frac{S_n}{\sqrt{n}}\Big) - \frac{S_n}{\sqrt{n}} g\Big(\frac{S_n}{\sqrt{n}}\Big) \Big]
    = \sum_{j=1}^{n} E\Big[ \frac{1}{n} g'\Big(\frac{S_n}{\sqrt{n}}\Big) - \frac{X_j}{\sqrt{n}} g\Big(\frac{S_n}{\sqrt{n}}\Big) \Big]
    = \sum_{j=1}^{n} E\Big[ \frac{1}{n} g'\Big(\frac{S_n}{\sqrt{n}}\Big) - \frac{X_j}{\sqrt{n}} g\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) - \frac{X_j^2}{n} g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big)
          - \frac{X_j^2}{n} \int_0^1 \Big[ g'\Big(\frac{S_n - (1-s) X_j}{\sqrt{n}}\Big) - g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) \Big] ds \Big]
    = \sum_{j=1}^{n} E\Big[ \frac{1}{n} g'\Big(\frac{S_n}{\sqrt{n}}\Big) - \frac{1}{n} g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big)
          - \frac{X_j^2}{n} \int_0^1 \Big[ g'\Big(\frac{S_n - (1-s) X_j}{\sqrt{n}}\Big) - g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) \Big] ds \Big].

In the last step we used linearity of expectation together with E X_j = 0 for all j (so that,
by the independence of X_j and S_n - X_j, the term containing g((S_n - X_j)/\sqrt{n}) has vanishing
expectation) as well as E X_j^2 = 1 (which lets us replace X_j^2/n by 1/n in front of the term
g'((S_n - X_j)/\sqrt{n}) inside the expectation). Let us define the random variables

    \Delta_j := \frac{1}{n} g'\Big(\frac{S_n}{\sqrt{n}}\Big) - \frac{1}{n} g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) - \frac{X_j^2}{n} \int_0^1 \Big[ g'\Big(\frac{S_n - (1-s) X_j}{\sqrt{n}}\Big) - g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) \Big] ds,

so that the computation above reads E[g'(S_n/\sqrt{n}) - (S_n/\sqrt{n}) g(S_n/\sqrt{n})] = \sum_{j=1}^{n} E[\Delta_j].
The idea will now be the following: the continuous function g' is uniformly continuous on
every compact set. So, if X_j/\sqrt{n} is small, then both g'(S_n/\sqrt{n}) - g'((S_n - X_j)/\sqrt{n}) and
g'((S_n - (1-s) X_j)/\sqrt{n}) - g'((S_n - X_j)/\sqrt{n}) are small, as long as S_n/\sqrt{n} is inside the
chosen compact set. On the other hand, the probabilities that S_n/\sqrt{n} is outside a chosen
large compact set, or that X_j/\sqrt{n} is large, are very small. This together with the bounded-
ness of g and g' will basically yield the proof. For K > 0, \delta > 0 we write

    \Delta_j^1 := \Delta_j \, 1_{\{|X_j|/\sqrt{n} \le \delta\}} 1_{\{|S_n|/\sqrt{n} \le K\}},
    \Delta_j^2 := \Delta_j \, 1_{\{|X_j|/\sqrt{n} \le \delta\}} 1_{\{|S_n|/\sqrt{n} > K\}},
    \Delta_j^3 := \Delta_j \, 1_{\{|X_j|/\sqrt{n} > \delta\}}.

Hence

    \sum_{j=1}^{n} E[\Delta_j] = \sum_{j=1}^{n} E[\Delta_j^1] + \sum_{j=1}^{n} E[\Delta_j^2] + \sum_{j=1}^{n} E[\Delta_j^3].

We first consider the \Delta_j^2-terms. By Chebyshev's inequality

    P\Big( \Big|\frac{S_n}{\sqrt{n}}\Big| > K \Big) \le \frac{V(S_n/\sqrt{n})}{K^2} = \frac{1}{K^2}
    and
    P\Big( \Big|\frac{S_{n-1}}{\sqrt{n}}\Big| > K - \delta \Big) \le \frac{V(S_{n-1}/\sqrt{n})}{(K - \delta)^2} = \frac{n-1}{n (K - \delta)^2}.

Hence for given \eta > 0 we can find K (with K > \delta) so large that

    P\Big( \Big|\frac{S_n}{\sqrt{n}}\Big| > K \Big) \le \eta  and  P\Big( \Big|\frac{S_{n-1}}{\sqrt{n}}\Big| > K - \delta \Big) \le \eta.

Since g' is bounded by \|g'\| := \sup_{x \in \mathbb{R}} |g'(x)|, we have |\Delta_j| \le \frac{2\|g'\|}{n} (1 + X_j^2), and on
the event \{|X_j|/\sqrt{n} \le \delta, \, |S_n|/\sqrt{n} > K\} we moreover have |S_n - X_j|/\sqrt{n} > K - \delta. We
obtain:

    \sum_{j=1}^{n} |E[\Delta_j^2]|
    \le \sum_{j=1}^{n} \Big( \frac{2\|g'\|}{n} E\big[ 1_{\{|S_n|/\sqrt{n} > K\}} \big] + \frac{2\|g'\|}{n} E\big[ X_j^2 \, 1_{\{|X_j|/\sqrt{n} \le \delta\}} 1_{\{|S_n - X_j|/\sqrt{n} > K - \delta\}} \big] \Big)
    \le 2\|g'\| \, P\Big( \Big|\frac{S_n}{\sqrt{n}}\Big| > K \Big) + \frac{2\|g'\|}{n} \sum_{j=1}^{n} E[X_j^2] \, P\Big( \Big|\frac{S_n - X_j}{\sqrt{n}}\Big| > K - \delta \Big)
    = 2\|g'\| \, P\Big( \Big|\frac{S_n}{\sqrt{n}}\Big| > K \Big) + 2\|g'\| \, P\Big( \Big|\frac{S_{n-1}}{\sqrt{n}}\Big| > K - \delta \Big)
    \le 4\|g'\| \eta.

Here we used the independence of X_j and S_n - X_j and the fact that S_n - X_j is distributed
like S_{n-1}.

For the \Delta_j^1-terms observe that for every fixed K > 0 the continuous function g' is
uniformly continuous on the compact interval [-K - 1, K + 1]. This means that, given
\varepsilon > 0, there is \delta \in (0, 1] such that

    |x - y| < \delta, \; x, y \in [-K - 1, K + 1] \implies |g'(x) - g'(y)| < \varepsilon.

For given \eta > 0 we choose such \varepsilon, \delta and K as in the first step. On the event
\{|X_j|/\sqrt{n} \le \delta, \, |S_n|/\sqrt{n} \le K\} all arguments appearing in \Delta_j lie in [-K - 1, K + 1],
and the relevant pairs of arguments differ by at most \delta. Then

    \sum_{j=1}^{n} |E[\Delta_j^1]|
    \le \sum_{j=1}^{n} E\Big[ \frac{1}{n} \Big| g'\Big(\frac{S_n}{\sqrt{n}}\Big) - g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) \Big| \, 1_{\{|X_j|/\sqrt{n} \le \delta\}} 1_{\{|S_n|/\sqrt{n} \le K\}} \Big]
        + \sum_{j=1}^{n} E\Big[ \frac{X_j^2}{n} \int_0^1 \Big| g'\Big(\frac{S_n - (1-s) X_j}{\sqrt{n}}\Big) - g'\Big(\frac{S_n - X_j}{\sqrt{n}}\Big) \Big| ds \; 1_{\{|X_j|/\sqrt{n} \le \delta\}} 1_{\{|S_n|/\sqrt{n} \le K\}} \Big]
    \le \sum_{j=1}^{n} \Big( \frac{\varepsilon}{n} + \frac{\varepsilon \, E[X_j^2]}{n} \Big) = n \cdot \frac{2\varepsilon}{n} = 2\varepsilon.
n

Eventually we turn to the \Delta_j^3-terms. Since E X_1^2 < \infty, for a given \eta > 0 and \delta as
above there exists an n_0 such that for all n \ge n_0

    E\big[ X_1^2 \, 1_{\{|X_1|/\sqrt{n} > \delta\}} \big] \le \eta
    and
    P\Big( \frac{|X_1|}{\sqrt{n}} > \delta \Big) \le \frac{1}{n \delta^2} E\big[ X_1^2 \, 1_{\{|X_1|/\sqrt{n} > \delta\}} \big] \le \eta.

Together with the bound |\Delta_j| \le \frac{2\|g'\|}{n} (1 + X_j^2) this implies

    \sum_{j=1}^{n} |E[\Delta_j^3]|
    \le \sum_{j=1}^{n} \Big( \frac{2\|g'\|}{n} P\Big( \frac{|X_j|}{\sqrt{n}} > \delta \Big) + \frac{2\|g'\|}{n} E\big[ X_j^2 \, 1_{\{|X_j|/\sqrt{n} > \delta\}} \big] \Big)
    = 2\|g'\| \, P\Big( \frac{|X_1|}{\sqrt{n}} > \delta \Big) + 2\|g'\| \, E\big[ X_1^2 \, 1_{\{|X_1|/\sqrt{n} > \delta\}} \big]
    \le 4\|g'\| \eta

for all n \ge n_0. Hence for a given \eta > 0, with the choice of \varepsilon, \delta > 0 and K as above, we
obtain

    \Big| E\Big[ g'\Big(\frac{S_n}{\sqrt{n}}\Big) - \frac{S_n}{\sqrt{n}} g\Big(\frac{S_n}{\sqrt{n}}\Big) \Big] \Big|
    \le \sum_{j=1}^{n} |E[\Delta_j^1]| + \sum_{j=1}^{n} |E[\Delta_j^2]| + \sum_{j=1}^{n} |E[\Delta_j^3]|
    \le 2\varepsilon + 8\|g'\| \eta.

This can be made arbitrarily small by letting \varepsilon, \eta \to 0. This proves the theorem.

Exercise 10.5 Let X_1, \ldots, X_n be i.i.d. random variables and S_n = \sum_{i=1}^{n} X_i. Let
g : \mathbb{R} \to \mathbb{R} be a continuously differentiable function. Show that for all j

    \int_0^1 [g'(S_n - (1-s) X_j) - g'(S_n - X_j)] \, X_j \, ds = g(S_n) - g(S_n - X_j) - X_j g'(S_n - X_j).

We conclude the section with an informal discussion of two extensions of the Central
Limit Theorem. The first is of practical importance, the second is of more theoretical
interest.

When one tries to apply the Central Limit Theorem, e.g. for a sequence of i.i.d. random
variables, it is of course not only important to know that

    X_n^* := \frac{\sum_{i=1}^{n} (X_i - E X_1)}{\sqrt{n V X_1}}

converges to a random variable Z \sim N(0, 1). One also needs to know how close the
distributions of X_n^* and Z are. This is stated in the following theorem due to Berry and
Esseen:

Theorem 10.6 (Berry-Esseen) Let (X_i)_{i \in \mathbb{N}} be a sequence of i.i.d. random variables with
E(|X_1|^3) < \infty. Then for a N(0, 1)-distributed random variable Z it holds:

    \sup_{a \in \mathbb{R}} \Big| P\Big( \frac{\sum_{i=1}^{n} (X_i - E X_1)}{\sqrt{n V X_1}} \le a \Big) - P(Z \le a) \Big| \le \frac{C}{\sqrt{n}} \cdot \frac{E(|X_1 - E X_1|^3)}{(V X_1)^{3/2}}.
The numerical value of C is below 6 (which is rather easy to prove) and larger than 0.4.

The second extension of the Central Limit Theorem starts with the following obser-
vation: let X_1, X_2, \ldots be a sequence of i.i.d. random variables with finite variance and
expectation zero. Then the law of large numbers says that \frac{1}{n} \sum_{i=1}^{n} X_i converges to E X_1 = 0
in probability and almost surely. But it tells nothing about the size of the fluctuations.
This is considered in greater detail by the Central Limit Theorem. The latter describes the
asymptotic probabilities

    P\Big( \frac{\sum_{i=1}^{n} (X_i - E X_1)}{\sqrt{n V(X_1)}} \ge a \Big).

Since these probabilities are positive for all a \in \mathbb{R} according to the Central Limit Theorem,
it can be shown that the fluctuations of \sum_{i=1}^{n} X_i are larger than \sqrt{n}; more precisely, for
each positive a \in \mathbb{R} it holds with probability one that

    \limsup_{n \to \infty} \frac{\sum_{i=1}^{n} X_i}{\sqrt{n}} \ge a.

The question for the precise size of the fluctuations, i.e. for the right scaling (a_n) such
that

    \limsup_{n \to \infty} \frac{\sum_{i=1}^{n} X_i}{a_n}

is almost surely finite, is answered by the law of the iterated logarithm:

Theorem 10.7 (Law of the Iterated Logarithm by Hartman and Wintner) Let $(X_i)_{i \in \mathbb{N}}$ be a sequence of i.i.d. random variables with $\sigma^2 := V X_1 < \infty$ ($\sigma > 0$). Then for $S_n := \sum_{i=1}^n X_i$ it holds
$$\limsup_n \frac{S_n}{\sqrt{2 n \log\log n}} = +\sigma \quad P\text{-a.s.}$$
and
$$\liminf_n \frac{S_n}{\sqrt{2 n \log\log n}} = -\sigma \quad P\text{-a.s.}$$

Due to the restricted time we will not be able to prove the Law of the Iterated Logarithm in the context of this course. Despite its theoretical interest its practical relevance is rather limited. To understand why, notice that the correction to the $\sqrt{n\, V X_1}$ from the Central Limit Theorem to the Law of the Iterated Logarithm is of order $\sqrt{\log\log n}$. Even for a fantastically large number of observations, $10^{100}$ (which is more than one observation per atom in the universe), $\sqrt{\log\log n}$ is really small, e.g.
$$\sqrt{\log\log 10^{100}} = \sqrt{\log(100 \log 10)} = \sqrt{\log 100 + \log\log 10} \simeq \sqrt{5.44} \simeq 2.33.$$
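This arithmetic (with natural logarithms throughout) is quickly confirmed:

```python
import math

# sqrt(log log 10^100): since log(10^100) = 100*log(10),
# we need sqrt(log(100*log(10))).
value = math.sqrt(math.log(100 * math.log(10)))
print(round(value, 2))  # 2.33
```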

11 Conditional Expectation
To understand the concept of conditional expectation, we will start with a little example.

Example 11.1 Let $\Omega$ be a finite population and let the random variable $X(\omega)$ denote the income of person $\omega$. So, if we are only interested in income, $X$ contains the full information of our experiment. Now assume we are a sociologist and want to measure the influence of a person's religion on his income. So we are not interested in the full information given by $X$, but only in how $X$ behaves on each of the sets

{catholic}, {protestant}, {islamic}, {jewish}, {atheist},

etc. This leads to the concept of conditional expectation.

The basic idea of conditional expectation will be, given a random variable
$$X : (\Omega, \mathcal{F}) \to \mathbb{R}$$
and a sub-$\sigma$-algebra $\mathcal{A}$ of $\mathcal{F}$, to introduce a new random variable $E[X \mid \mathcal{A}] =: X_0$ such that $X_0$ is $\mathcal{A}$-measurable and
$$\int_C X_0 \, dP = \int_C X \, dP$$
for all $C \in \mathcal{A}$. So $X_0$ contains all information necessary when we only consider events in $\mathcal{A}$. First we need to see that such an $X_0$ can be found in a unique way.

Theorem 11.2 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X$ an integrable random variable. Let $\mathcal{C} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. Then (up to $P$-a.s. equality) there is a unique random variable $X_0$, which is $\mathcal{C}$-measurable and satisfies
$$\int_C X_0 \, dP = \int_C X \, dP \quad \text{for all } C \in \mathcal{C}. \tag{11.1}$$
If $X \ge 0$, then $X_0 \ge 0$ $P$-a.s.

Proof. First we treat the case $X \ge 0$. Denote $P_0 := P|_{\mathcal{C}}$ and $Q := (XP)|_{\mathcal{C}}$. Both $P_0$ and $Q$ are measures on $\mathcal{C}$, and $P_0$ even is a probability measure. By definition
$$Q(C) = \int_C X \, dP.$$
Hence $Q(C) = 0$ for all $C$ with $P(C) = 0 = P_0(C)$, i.e. $Q \ll P_0$. By the theorem of Radon-Nikodym there is a $\mathcal{C}$-measurable function $X_0 \ge 0$ on $\Omega$ such that $Q = X_0 P_0$. Thus
$$\int_C X_0 \, dP_0 = \int_C X \, dP \quad \text{for all } C \in \mathcal{C},$$
and hence
$$\int_C X_0 \, dP = \int_C X \, dP \quad \text{for all } C \in \mathcal{C},$$
so $X_0$ satisfies (11.1). For another $\tilde X_0$ that is $\mathcal{C}$-measurable and satisfies (11.1) the set $C = \{\tilde X_0 < X_0\}$ is in $\mathcal{C}$ and $\int_C \tilde X_0 \, dP = \int_C X_0 \, dP$, whence $P(C) = 0$. In the same way $P(\tilde X_0 > X_0) = 0$. Therefore $\tilde X_0$ is $P$-a.s. equal to $X_0$.
The proof for arbitrary, integrable $X$ is left to the reader.

Exercise 11.3 Prove Theorem 11.2 for arbitrary, integrable $X : \Omega \to \mathbb{R}$.

Definition 11.4 Under the conditions of Theorem 11.2 the random variable $X_0$ (which is $P$-a.s. unique) is called the conditional expectation of $X$ given $\mathcal{C}$. It is denoted by
$$X_0 =: E[X \mid \mathcal{C}] =: E^{\mathcal{C}}[X].$$
If $\mathcal{C}$ is generated by a family of random variables $(Y_i)_{i \in I}$, such that $\mathcal{C} = \sigma(Y_i, i \in I)$, we write
$$E\left[X \mid (Y_i)_{i \in I}\right] = E[X \mid \mathcal{C}].$$
If $I = \{1, \ldots, n\}$ we also write $E[X \mid Y_1, \ldots, Y_n]$.

Note that, in order to check whether a $\mathcal{C}$-measurable random variable $Y$ is a conditional expectation of $X$ given the sub-$\sigma$-algebra $\mathcal{C}$, we need to check
$$\int_C Y \, dP = \int_C X \, dP$$
for all $C \in \mathcal{C}$. This determines $E[X \mid \mathcal{C}]$ only $P$-a.s. on sets $C \in \mathcal{C}$. We therefore also speak about different versions of the conditional expectation.

Example 11.5 1. If $\mathcal{C} = \{\emptyset, \Omega\}$, then the constant random variable $EX$ is a version of $E[X \mid \mathcal{C}]$. Indeed, if $C = \emptyset$, then any variable does the job. If $C = \Omega$,
$$\int_\Omega X \, dP = EX = \int_\Omega EX \, dP.$$

2. If $\mathcal{C}$ is generated by the family $(B_i)_{i \in I}$ of mutually disjoint sets (i.e. $B_i \cap B_j = \emptyset$ if $i \ne j$), where $I$ is countable, $B_i \in \mathcal{A}$ (the original space being $(\Omega, \mathcal{A}, P)$), and $P(B_i) > 0$, then
$$E[X \mid \mathcal{C}] = \sum_{i \in I} \frac{1}{P(B_i)}\, 1_{B_i} \int_{B_i} X \, dP \quad P\text{-a.s.}$$
This will be checked in the following exercise.

Exercise 11.6 Show that the assertion of Example 11.5.2. is true.
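On a finite probability space the formula of Example 11.5.2 can be verified mechanically. The sketch below (with a made-up space, random variable and partition, not taken from the text) computes the block averages and checks the defining property (11.1) on each generating block:

```python
# Finite probability space: Omega = {0,...,11} with the uniform measure.
# The partition blocks B_i generate the sub-sigma-algebra C.
omega = range(12)
P = {w: 1 / 12 for w in omega}
X = {w: (w % 5) ** 2 for w in omega}                    # an arbitrary random variable
blocks = [[0, 1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]]  # mutually disjoint B_i

def cond_exp(X, blocks, P):
    """E[X | C]: on each block B_i, the P-weighted average of X over B_i."""
    X0 = {}
    for B in blocks:
        avg = sum(X[w] * P[w] for w in B) / sum(P[w] for w in B)
        for w in B:
            X0[w] = avg
    return X0

X0 = cond_exp(X, blocks, P)
# Defining property (11.1): every C in C is a union of blocks, so it is
# enough to compare the integrals over each generating block B_i.
for B in blocks:
    assert abs(sum(X0[w] * P[w] for w in B) - sum(X[w] * P[w] for w in B)) < 1e-12
```

Since $X_0$ is constant on each block, it is automatically $\mathcal{C}$-measurable, which is the other half of the definition.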

Exercise 11.7 Show that the following assertions for the conditional expectation $E[X \mid \mathcal{C}]$ of random variables $X, Y : (\Omega, \mathcal{A}) \to (\mathbb{R}, \mathcal{B}^1)$ ($\mathcal{C} \subseteq \mathcal{A}$) are true:

1. $E[E[X \mid \mathcal{C}]] = EX$.

2. If $X$ is $\mathcal{C}$-measurable then $E[X \mid \mathcal{C}] = X$ $P$-a.s.

3. If $X = Y$ $P$-a.s., then $E[X \mid \mathcal{C}] = E[Y \mid \mathcal{C}]$ $P$-a.s.

4. If $X \equiv \alpha$ is constant, then $E[X \mid \mathcal{C}] = \alpha$ $P$-a.s.

5. $E[\alpha X + \beta Y \mid \mathcal{C}] = \alpha E[X \mid \mathcal{C}] + \beta E[Y \mid \mathcal{C}]$ $P$-a.s. Here $\alpha, \beta \in \mathbb{R}$.

6. $X \le Y$ $P$-a.s. implies $E[X \mid \mathcal{C}] \le E[Y \mid \mathcal{C}]$ $P$-a.s.

The following theorems have proofs that are almost identical with the proofs of the corresponding theorems for expectations:

Theorem 11.8 (monotone convergence) Let $(X_n)$ be an increasing sequence of positive random variables with $X = \sup_n X_n$, $X$ integrable. Then
$$\sup_n E[X_n \mid \mathcal{C}] = \lim_n E[X_n \mid \mathcal{C}] = E\left[\lim_n X_n \mid \mathcal{C}\right] = E[X \mid \mathcal{C}].$$

Theorem 11.9 (dominated convergence) Let $(X_n)$ be a sequence of random variables converging pointwise to an (integrable) random variable $X$, such that there is an integrable random variable $Y$ with $Y \ge |X_n|$ for all $n$. Then
$$\lim_n E[X_n \mid \mathcal{C}] = E[X \mid \mathcal{C}].$$

Also Jensen's inequality has a generalization to conditional expectations:

Theorem 11.10 (Jensen's inequality) Let $X$ be an integrable random variable taking values in an open interval $I \subseteq \mathbb{R}$ and let
$$q : I \to \mathbb{R}$$
be a convex function. Then for each sub-$\sigma$-algebra $\mathcal{C} \subseteq \mathcal{A}$ it holds that $E[X \mid \mathcal{C}]$ takes values in $I$ and
$$q(E[X \mid \mathcal{C}]) \le E[q \circ X \mid \mathcal{C}].$$

An immediate consequence of Theorem 11.10 is the following (for $p \ge 1$):
$$|E[X \mid \mathcal{C}]|^p \le E[|X|^p \mid \mathcal{C}],$$
which implies
$$E(|E[X \mid \mathcal{C}]|^p) \le E(|X|^p).$$
Denoting
$$N_p(f) = \left( \int |f|^p \, dP \right)^{1/p},$$
this means
$$N_p(E[X \mid \mathcal{C}]) \le N_p(X), \quad X \in L^p(P).$$
This holds for $1 \le p < \infty$. $N_p(f)$ is called the $L^p$-norm of $f$. The case $p = \infty$, which means that if $X$ is bounded $P$-a.s. by some $M \ge 0$ then so is $E[X \mid \mathcal{C}]$, follows from Exercise 11.7.
We slightly reformulate the definition of conditional expectation to discuss its further
properties.

Lemma 11.11 Let $X$ be a positive integrable random variable and let $X_0 : (\Omega, \mathcal{A}) \to (\mathbb{R}, \mathcal{B}^1)$ be a positive $\mathcal{C}$-measurable integrable random variable that is a version of $E[X \mid \mathcal{C}]$. Then
$$\int Z X_0 \, dP = \int Z X \, dP \tag{11.2}$$
for all $\mathcal{C}$-measurable, positive random variables $Z$.

Proof. From (11.1) we obtain (11.2) for step functions $Z$. The general result follows from monotone convergence.
We are now prepared to show a number of properties of conditional expectations which we will call smoothing properties.

Theorem 11.12 (Smoothing properties of conditional expectations) Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X \in L^p(P)$, $Y \in L^q(P)$, $1 \le p \le \infty$, $\frac{1}{p} + \frac{1}{q} = 1$.

1. If $\mathcal{C} \subseteq \mathcal{F}$ and $X$ is $\mathcal{C}$-measurable then
$$E[XY \mid \mathcal{C}] = X E[Y \mid \mathcal{C}]. \tag{11.3}$$

46
2. If $\mathcal{C}_1 \subseteq \mathcal{C}_2 \subseteq \mathcal{F}$ then
$$E[E[X \mid \mathcal{C}_2] \mid \mathcal{C}_1] = E[E[X \mid \mathcal{C}_1] \mid \mathcal{C}_2] = E[X \mid \mathcal{C}_1].$$

Proof.
1. First assume that $X, Y \ge 0$. Let $X$ be $\mathcal{C}$-measurable and $C \in \mathcal{C}$. Then
$$\int_C XY \, dP = \int 1_C XY \, dP = \int 1_C X E[Y \mid \mathcal{C}] \, dP = \int_C X E[Y \mid \mathcal{C}] \, dP.$$
Indeed, this follows immediately from Lemma 11.11, since $1_C X$ is $\mathcal{C}$-measurable. On the other hand, we also have $XY \in L^1(P)$ and
$$\int_C XY \, dP = \int_C E[XY \mid \mathcal{C}] \, dP.$$
Since $X E[Y \mid \mathcal{C}]$ is $\mathcal{C}$-measurable we obtain
$$E[XY \mid \mathcal{C}] = X E[Y \mid \mathcal{C}] \quad P\text{-a.s.}$$
In the case $X \in L^p(P)$, $Y \in L^q(P)$ we observe that then $XY \in L^1(P)$ and conclude as above.

2. Observe that, of course, $E[X \mid \mathcal{C}_1]$ is $\mathcal{C}_1$-measurable and, since $\mathcal{C}_1 \subseteq \mathcal{C}_2$, also $\mathcal{C}_2$-measurable. Property 2 in Exercise 11.7 then implies
$$E[E[X \mid \mathcal{C}_1] \mid \mathcal{C}_2] = E[X \mid \mathcal{C}_1] \quad P\text{-a.s.}$$
Moreover, for all $C \in \mathcal{C}_1$
$$\int_C E[X \mid \mathcal{C}_1] \, dP = \int_C X \, dP.$$
Hence for all $C \in \mathcal{C}_1$
$$\int_C E[X \mid \mathcal{C}_1] \, dP = \int_C E[X \mid \mathcal{C}_2] \, dP.$$
But this means
$$E[E[X \mid \mathcal{C}_2] \mid \mathcal{C}_1] = E[X \mid \mathcal{C}_1] \quad P\text{-a.s.} \qquad \Box$$

The previous theorem leads to yet another characterization of the conditional expectation. To this end take $X \in L^2(P)$ and denote $X_0 := E[X \mid \mathcal{C}]$ for a sub-$\sigma$-algebra $\mathcal{C} \subseteq \mathcal{F}$. Let $Z \in L^2(P)$ be $\mathcal{C}$-measurable. Then $X_0 \in L^2(P)$ and by (11.3)
$$E[Z(X - X_0) \mid \mathcal{C}] = Z E[X - X_0 \mid \mathcal{C}] = Z(E[X \mid \mathcal{C}] - X_0) = Z(X_0 - X_0) = 0.$$

Theorem 11.13 For all $X \in L^2(P)$ and each sub-$\sigma$-algebra $\mathcal{C} \subseteq \mathcal{F}$ the conditional expectation $E[X \mid \mathcal{C}]$ is (up to a.s. equality) the unique $\mathcal{C}$-measurable random variable $X_0 \in L^2(P)$ with
$$E\left[(X - X_0)^2\right] = \min\left\{ E\left[(X - Y)^2\right] : Y \in L^2(P),\ Y\ \mathcal{C}\text{-measurable} \right\}.$$
Proof. Let $Y \in L^2(P)$ be $\mathcal{C}$-measurable. Put $X_0 := E[X \mid \mathcal{C}]$. Then
$$E((X - Y)^2) = E((X - X_0 + X_0 - Y)^2) = E((X - X_0)^2) + E((X_0 - Y)^2) + 2E((X - X_0)(X_0 - Y)).$$
But $E((X - X_0)(X_0 - Y)) = 0$, since $X_0 - Y$ is $\mathcal{C}$-measurable. This gives
$$E\left[(X - Y)^2\right] - E\left[(X - X_0)^2\right] = E\left[(X_0 - Y)^2\right]. \tag{11.4}$$
Due to positivity of squares we hence obtain
$$E\left[(X - X_0)^2\right] \le E\left[(X - Y)^2\right].$$
If, on the other hand,
$$E\left[(X - X_0)^2\right] = E\left[(X - Y)^2\right],$$
then
$$E\left[(X_0 - Y)^2\right] = 0,$$
which implies $Y = X_0 = E[X \mid \mathcal{C}]$ $P$-a.s. $\Box$

The last theorem states that $E[X \mid \mathcal{C}]$ for $X \in L^2(P)$ is the best approximation of $X$ in the space of $\mathcal{C}$-measurable functions in the sense of a least squares approximation. It is the projection of $X$ onto the space of square integrable, $\mathcal{C}$-measurable functions.

Exercise 11.14 Prove that for $X \in L^2(P)$, $\mu = E(X)$ is the number that minimizes $E((X - \mu)^2)$.
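Theorem 11.13 can also be illustrated numerically. The sketch below (finite space, uniform measure and partition chosen ad hoc) computes $E[X \mid \mathcal{C}]$ as block averages and checks that no other $\mathcal{C}$-measurable, i.e. block-constant, function comes closer to $X$ in the least-squares sense:

```python
import random

random.seed(1)
omega = range(10)
p = 1 / 10                                      # uniform probability measure
X = {w: float(w * w % 7) for w in omega}
blocks = [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]   # generators of C

def l2_dist(A, B):
    """E((A - B)^2) on the finite space."""
    return sum((A[w] - B[w]) ** 2 * p for w in omega)

# E[X | C]: block averages (P is uniform, so plain averages)
X0 = {}
for B in blocks:
    avg = sum(X[w] for w in B) / len(B)
    for w in B:
        X0[w] = avg

# Any other block-constant Y is at least as far from X (Pythagoras, Eq. (11.4))
best = l2_dist(X, X0)
for _ in range(200):
    Y = {}
    for B in blocks:
        c = random.uniform(-10.0, 10.0)
        for w in B:
            Y[w] = c
    assert l2_dist(X, Y) >= best - 1e-12
```

This is exactly the orthogonal-projection picture: by (11.4), the excess $E((X-Y)^2) - E((X-X_0)^2)$ equals $E((X_0-Y)^2) \ge 0$.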

With the help of conditional expectation we can also give a new definition of conditional probability.

Definition 11.15 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{C} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. For $A \in \mathcal{F}$,
$$P[A \mid \mathcal{C}] := E[1_A \mid \mathcal{C}]$$
is called the conditional probability of $A$ given $\mathcal{C}$.

Example 11.16 In the situation of Example 11.5.2 the conditional probability of $A \in \mathcal{F}$ is given by
$$P(A \mid \mathcal{C}) = \sum_{i \in I} P(A \mid B_i)\, 1_{B_i} := \sum_{i \in I} \frac{P(A \cap B_i)}{P(B_i)}\, 1_{B_i}.$$

In a last step we will introduce (but not prove) conditional expectations on events with zero probability. Of course, in general this will just give nonsense, but in the case of a conditional expectation $E[X \mid Y = y]$, where $X, Y$ are random variables such that $(X, Y)$ has a Lebesgue density, we can give this expression a meaning.

Theorem 11.17 Let $X, Y$ be real valued random variables such that $(X, Y)$ has a density $f : \mathbb{R}^2 \to \mathbb{R}_+ \cup \{0\}$ with respect to two dimensional Lebesgue measure $\lambda^2$. Assume that $X$ is integrable and that
$$f_0(y) := \int f(x, y) \, dx > 0 \quad \text{for all } y \in \mathbb{R}.$$
Then $E(X \mid Y)$ can be written as a function of $Y$; this function will be denoted by
$$y \mapsto E(X \mid Y = y),$$
and one has
$$E(X \mid Y = y) = \frac{1}{f_0(y)} \int x f(x, y) \, dx \quad \text{for } P_Y\text{-a.e. } y \in \mathbb{R}.$$
In particular
$$E(X \mid Y) = \frac{1}{f_0(Y)} \int x f(x, Y) \, dx \quad P\text{-a.s.}$$
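For a concrete (made-up) example, take the density $f(x,y) = x + y$ on the unit square; then $f_0(y) = \tfrac12 + y$ and the formula gives $E[X \mid Y = y] = (\tfrac13 + \tfrac{y}{2})/(\tfrac12 + y)$. A short numerical integration confirms this:

```python
def cond_exp_at(y, f, n=20000):
    """Midpoint-rule evaluation of E(X | Y=y) = int x f(x,y) dx / int f(x,y) dx on [0,1]."""
    h = 1.0 / n
    num = den = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        num += x * f(x, y) * h
        den += f(x, y) * h
    return num / den

f = lambda x, y: x + y          # a density on the unit square (it integrates to 1)
y = 0.25
exact = (1 / 3 + y / 2) / (1 / 2 + y)
assert abs(cond_exp_at(y, f) - exact) < 1e-6
```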

We will also need the following relationship between conditional expectation and independence, which is a generalization of Example 11.5, case 1.

Lemma 11.18 Let $X$ be an integrable real valued random variable and $\mathcal{C} \subseteq \mathcal{F}$ a sub-$\sigma$-algebra such that $X$ is independent of $\mathcal{C}$, that is, $\sigma(X)$ and $\mathcal{C}$ are independent. Then
$$E(X \mid \mathcal{C}) = E(X) \quad P\text{-a.s.}$$

Proof. Suppose $X \ge 0$. Then an increasing sequence of step functions $X_n$ can be constructed by $X_n = \lfloor 2^n X \rfloor / 2^n$, and $X_n$ converges monotonically to $X$. Notice that $X_n$ is a linear combination of indicator functions $1_A$ with $A \in \sigma(X)$, and
$$\int_C 1_A \, dP = P(C \cap A) = P(C) P(A) = \int_C E(1_A) \, dP.$$
Thus $E(1_A \mid \mathcal{C}) = E(1_A)$, by linearity $E(X_n \mid \mathcal{C}) = E(X_n)$, and by the monotone convergence theorem $E(X \mid \mathcal{C}) = E(X)$. The general case follows by linearity, $X = X^+ - X^-$. $\Box$

Exercise 11.19 Let $X$ and $Y$ be as in Theorem 11.17, such that $X$ and $Y$ are independent. Then $X$ is independent of $\sigma(Y)$, and by Lemma 11.18 we have $E(X \mid Y) = E(X \mid \sigma(Y)) = E(X)$. Apply Theorem 11.17 to give an alternative derivation of this fact.

12 Martingales
In this section we are going to define a notion that will turn out to be of central interest in all of so-called stochastic analysis and mathematical finance. A key role in its definition will be taken by conditional expectation. In this section we will just give the definition and a couple of examples. There is a rich theory of martingales; parts of this theory we will meet in a class on Stochastic Calculus.

Definition 12.1 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $I$ be a linearly ordered set, i.e. for $s, t \in I$ either $s \le t$ or $t \le s$; $s \le t$ and $t \le s$ implies $s = t$; and $s \le t$, $t \le u$ implies $s \le u$. For $t \in I$ let $\mathcal{F}_t \subseteq \mathcal{F}$ be a $\sigma$-algebra. $(\mathcal{F}_t)_{t \in I}$ is called a filtration, if $s \le t$ implies $\mathcal{F}_s \subseteq \mathcal{F}_t$. A sequence of random variables $(X_t)_{t \in I}$ is called $(\mathcal{F}_t)_{t \in I}$-adapted if $X_t$ is $\mathcal{F}_t$-measurable for all $t \in I$.

Exercise 12.2 Construct a filtration on a probability space with $|I| \ge 3$.

Example 12.3 Let $(X_t)_{t \in I}$ be a family of random variables and $I$ a linearly ordered set. Then
$$\mathcal{F}_t = \sigma(X_s, s \le t)$$
is a filtration and $(X_t)$ is adapted with respect to $(\mathcal{F}_t)$. $(\mathcal{F}_t)_{t \in I}$ is called the canonical filtration with respect to $(X_t)_{t \in I}$.

Definition 12.4 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $I$ a linearly ordered set. Let $(\mathcal{F}_t)_{t \in I}$ be a filtration and $(X_t)_{t \in I}$ be an $(\mathcal{F}_t)$-adapted sequence of integrable random variables. $(X_t)$ is called an $(\mathcal{F}_t)$-supermartingale, if
$$E[X_t \mid \mathcal{F}_s] \le X_s \quad P\text{-a.s.} \tag{12.1}$$
for all $s \le t$. (12.1) is equivalent with
$$\int_C X_t \, dP \le \int_C X_s \, dP \quad \text{for all } C \in \mathcal{F}_s. \tag{12.2}$$
$(X_t)$ is called an $(\mathcal{F}_t)$-submartingale, if $(-X_t)$ is an $(\mathcal{F}_t)$-supermartingale. Eventually $(X_t)$ is called a martingale, if it is both a submartingale and a supermartingale. This means that
$$E[X_t \mid \mathcal{F}_s] = X_s \quad P\text{-a.s.}$$
for $s \le t$ or, equivalently,
$$\int_C X_t \, dP = \int_C X_s \, dP, \quad C \in \mathcal{F}_s.$$

Exercise 12.5 Show that the conditions (12.1) and (12.2) are equivalent.

Remark 12.6 1. If $(\mathcal{F}_t)$ is the canonical filtration with respect to $(X_t)_{t \in I}$, then $(X_t)$ is often simply called a supermartingale, submartingale, or a martingale.

2. (12.1) and (12.2) are evidently correct for $s = t$ (with equality). Hence these properties only need to be checked for $s < t$.

3. Putting $C = \Omega$ in (12.2) we obtain for a supermartingale $(X_t)_t$
$$s \le t \implies E(X_s) \ge E(X_t).$$
Hence for supermartingales $(E(X_s))_s$ is a decreasing sequence, while for a submartingale $(E(X_s))_s$ is an increasing sequence.

4. In particular, if each of the random variables $X_s$ is almost surely constant, e.g. if $\Omega$ is a singleton (a set with just one element), then $(X_s)$ is a decreasing sequence if $(X_s)$ is a supermartingale, and an increasing sequence if $(X_s)$ is a submartingale. Hence martingales are (in a certain sense) the stochastic generalization of constant sequences.
Exercise 12.7 Let $(X_t)$, $(Y_t)$ be adapted to the same filtration and $\alpha, \beta \in \mathbb{R}$. Show the following:

1. If $(X_t)$ and $(Y_t)$ are martingales, then $(\alpha X_t + \beta Y_t)$ is a martingale.

2. If $(X_t)$ and $(Y_t)$ are supermartingales, then so is $(X_t \wedge Y_t) = (\min(X_t, Y_t))$.

3. If $(X_t)$ is a submartingale, so is $(X_t^+, \mathcal{F}_t)$.

4. If $(X_t)$ is a martingale taking values in an open set $J \subseteq \mathbb{R}$ and $q : J \to \mathbb{R}$ is convex, then $(q \circ X_t, \mathcal{F}_t)$ is a submartingale, if $q(X_t)$ is integrable for all $t$.

Of course, at first glance the definition of a martingale may look a bit weird. We will therefore give a couple of examples to show that it is not as strange as it may seem.
Example 12.8 Let $(X_n)$ be an i.i.d. sequence of integrable $\mathbb{R}$-valued random variables. Put $S_n = X_1 + \ldots + X_n$ and consider the canonical filtration $\mathcal{F}_n = \sigma(S_m, m \le n)$. By Lemma 11.18 we have
$$E[X_{n+1} \mid S_1, \ldots, S_n] = E[X_{n+1}] \quad P\text{-a.s.}$$
and by part 2 of Exercise 11.7
$$E[X_i \mid S_1, \ldots, S_n] = X_i \quad P\text{-a.s.}$$
for all $i = 1, \ldots, n$. Adding these $n + 1$ equations gives
$$E[S_{n+1} \mid \mathcal{F}_n] = S_n + E[X_{n+1}] \quad P\text{-a.s.}$$
If $E X_i = 0$ for all $i$, then
$$E[S_{n+1} \mid \mathcal{F}_n] = S_n,$$
i.e. $(S_n)$ is a martingale. If $E[X_i] \le 0$, then
$$E[S_{n+1} \mid \mathcal{F}_n] \le S_n,$$
i.e. $(S_n)$ is a supermartingale. In the same way $(S_n)$ is a submartingale if $E X_i \ge 0$.
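For a fair $\pm 1$ walk the martingale property can be checked exactly by enumerating all paths (a toy verification, not part of the original text):

```python
from itertools import product

# Exact check of E[S_{k+1} | X_1,...,X_k] = S_k for the fair +-1 random walk,
# by brute-force enumeration of all equally likely paths in {-1,+1}^n.
n = 3
paths = list(product([-1, 1], repeat=n))
for k in range(1, n):
    for prefix in product([-1, 1], repeat=k):
        matching = [p for p in paths if p[:k] == prefix]
        s_k = sum(prefix)
        # conditioning on the first k tosses = averaging over the matching paths
        cond = sum(sum(p[:k + 1]) for p in matching) / len(matching)
        assert cond == s_k
```

Each prefix has equally many continuations with $X_{k+1} = +1$ and $X_{k+1} = -1$, so the conditional average is exactly $S_k$.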

Example 12.9 Consider the following game. For each $n \in \mathbb{N}$ a coin with probability $p$ for heads is tossed. If it shows heads ($X_n = +1$) our player receives money, otherwise ($X_n = -1$) he loses money. The amount he wins or loses is determined in the following way. Before the game starts he determines a sequence $(\varrho_n)_n$ of functions
$$\varrho_n : \{H, T\}^n \to \mathbb{R}_+.$$
In round number $n + 1$ he plays for $\varrho_n(X_1, \ldots, X_n)$ Euros, depending on how the first $n$ games ended. If we denote by $S_n$ his capital at time $n$, then
$$S_1 = X_1 \quad \text{and} \quad S_{n+1} = S_n + \varrho_n(X_1, \ldots, X_n)\, X_{n+1}.$$
Hence
$$E[S_{n+1} \mid X_1, \ldots, X_n] = S_n + \varrho_n(X_1, \ldots, X_n)\, E[X_{n+1} \mid X_1, \ldots, X_n] = S_n + \varrho_n(X_1, \ldots, X_n)\, E(X_{n+1}) = S_n + (2p - 1)\, \varrho_n(X_1, \ldots, X_n),$$
since $X_{n+1}$ is independent of $X_1, \ldots, X_n$ and $E(X_{n+1}) = 2p - 1$. Hence for $p = \frac{1}{2}$
$$E[S_{n+1} \mid X_1, \ldots, X_n] = S_n,$$
so $(S_n)$ is a martingale, while for $p > \frac{1}{2}$
$$E[S_{n+1} \mid X_1, \ldots, X_n] \ge S_n,$$
hence $(S_n)$ is a submartingale, and for $p < \frac{1}{2}$
$$E[S_{n+1} \mid X_1, \ldots, X_n] \le S_n,$$
so $(S_n)$ is a supermartingale. This explains the idea that martingales are generalizations of fair games.

Exercise 12.10 Let $X_1, X_2, \ldots$ be a sequence of independent random variables with finite variance $V(X_i) = \sigma_i^2$. Then
$$\left\{ \sum_{i=1}^n (X_i - E(X_i)) \right\}^2 - \sum_{i=1}^n \sigma_i^2$$
is a martingale with respect to the filtration $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$.

Exercise 12.11 Consider the gambler's martingale. Consider an i.i.d. sequence $(X_n)_{n=1}^\infty$ of Bernoulli variables with values $-1$ and $1$, each with probability $1/2$. Consider the sequence $(Y_n)$ such that $Y_n = 2^{n-1}$ if $X_1 = \cdots = X_{n-1} = -1$, and $Y_n = 0$ if $X_i = 1$ for some $i \le n - 1$. Show that $S_n = \sum_{i=1}^n X_i Y_i$ is a martingale. Show that $S_n$ almost surely converges and determine its limit $S_\infty$. Observe that $S_n \ne E(S_\infty \mid \mathcal{F}_n)$.

Example 12.12 In a sense Example 12.8 is both a special case and a generalization of the following example. To this end let $X_1, \ldots, X_n, \ldots$ denote an i.i.d. sequence of $\mathbb{R}^d$-valued random variables. Assume
$$P(X_i = +e_k) = P(X_i = -e_k) = \frac{1}{2d}$$
for all $i = 1, 2, \ldots$ and all $k = 1, \ldots, d$. Here $e_k$ denotes the $k$-th unit vector. Define the stochastic process $S_n$ by
$$S_0 = 0 \quad \text{and} \quad S_n = \sum_{i=1}^n X_i.$$
This process is called the simple random walk in $d$ dimensions. Some of its properties will be discussed below. First we will see that indeed $(S_n)$ is a martingale. Indeed,
$$E[S_{n+1} \mid X_1, \ldots, X_n] = E[X_{n+1}] + S_n = S_n.$$
As a matter of fact, not only is $(S_n)$ a martingale but, in a certain sense, it is the discrete time martingale.

Since the random walk in $d$ dimensions is the model for a discrete time martingale (the standard model of a continuous time martingale will be introduced in the following section), it is worthwhile studying some of its properties. This has been done in thousands of research papers in the past 50 years. We will just mention one interesting property here, which reveals a dichotomy in the random walk's behavior between dimensions $d = 1, 2$ and $d \ge 3$.

Definition 12.13 Let $(S_n)$ be a stochastic process in $\mathbb{Z}^d$, i.e. for each $n \in \mathbb{N}$, $S_n$ is a random variable with values in $\mathbb{Z}^d$. $(S_n)$ is called recurrent in a state $x \in \mathbb{Z}^d$, if
$$P(S_n = x \text{ infinitely often in } n) = 1.$$
It is called transient in $x$, if
$$P(S_n = x \text{ infinitely often in } n) < 1.$$
$(S_n)$ is called recurrent (transient), if it is recurrent (transient) in each $x \in \mathbb{Z}^d$.

Proposition 12.14 In the situation of Example 12.12, if $x \in \mathbb{Z}^d$ is recurrent, then all $y \in \mathbb{Z}^d$ are recurrent.

Exercise 12.15 Show Proposition 12.14.

We will show a variant of the following


Theorem 12.16 The random walk $(S_n)$ introduced in Example 12.12 is recurrent in dimensions $d = 1, 2$ and transient for $d \ge 3$.

To prove a version of Theorem 12.16 we will first discuss the property of recurrence:

Lemma 12.17 Let $f_k$ denote the probability that the random walk returns to the origin for the first time after $k$ steps, and let $p_k$ denote the probability that the random walk is at the origin after $k$ steps. Then the random walk $(S_n)$ is recurrent if and only if $\sum_k f_k = 1$, and this is the case if and only if $\sum_k p_k = \infty$.

Proof. The first equivalence is easy. Denote by $\Omega_k$ the set of all realizations of the random walk returning to the origin for the first time after $k$ steps. If the random walk $(S_n)$ is recurrent, then with probability one there exists a $k > 0$ such that $S_k = 0$ and $S_l \ne 0$ for all $0 < l < k$. Hence $\sum_k f_k = 1$. On the other hand, if $\sum_k f_k = 1$, then $P(\bigcup_k \Omega_k) = 1$. Hence with probability one there exists a $k > 0$ such that $S_k = 0$ and $S_l \ne 0$ for all $0 < l < k$. But then the situation at times $0$ and $k$ is completely the same and hence there exists $k' > k$ such that $S_{k'} = 0$ and $S_l \ne 0$ for all $k < l < k'$. Iterating this gives that $S_k = 0$ for infinitely many $k$'s with probability one.
In order to relate $f_k$ and $p_k$ we derive the following recursion:
$$p_k = f_k p_0 + f_{k-1} p_1 + \cdots + f_0 p_k \tag{12.3}$$
(the last summand is just added for completeness; we have $f_0 = 0$). Indeed this is again easy to see. The left hand side is just the probability to be at the origin at time $k$. This event is the disjoint union of the events to be at $0$ for the first time after $1 \le l \le k$ steps and to walk from zero to zero in the remaining $k - l$ steps. Hence we obtain
$$p_k = \sum_{i=1}^k f_i p_{k-i} \quad \text{and} \quad p_0 = 1. \tag{12.4}$$

Define the generating functions
$$F(z) = \sum_{k \ge 0} f_k z^k \quad \text{and} \quad P(z) = \sum_{k \ge 0} p_k z^k.$$
Multiplying the left and right sides in (12.4) with $z^k$ and summing from $k = 0$ to infinity gives
$$P(z) = 1 + P(z) F(z),$$
i.e.
$$F(z) = 1 - 1/P(z).$$
By Abel's theorem
$$\sum_{k=1}^\infty f_k = F(1) = \lim_{z \uparrow 1} F(z) = 1 - \lim_{z \uparrow 1} \frac{1}{P(z)}.$$
First assume that $\sum_k p_k < \infty$. Then
$$\lim_{z \uparrow 1} P(z) = P(1) = \sum_k p_k < \infty$$
and thus
$$\lim_{z \uparrow 1} \frac{1}{P(z)} = 1 \Big/ \sum_k p_k > 0.$$
Hence $\sum_{k=1}^\infty f_k < 1$ and the random walk $(S_n)$ is transient.
Next assume that $\sum_k p_k = \infty$ and fix $\varepsilon > 0$. Then we find $N$ such that
$$\sum_{k=0}^N p_k \ge \frac{2}{\varepsilon}.$$
Then for $z$ sufficiently close to one we have $\sum_{k=0}^N p_k z^k \ge \frac{1}{\varepsilon}$, and consequently for such $z$
$$\frac{1}{P(z)} \le \frac{1}{\sum_{k=0}^N p_k z^k} \le \varepsilon.$$
But this implies that
$$\lim_{z \uparrow 1} \frac{1}{P(z)} = 0,$$
and therefore $\sum_{k=1}^\infty f_k = 1$ and the random walk $(S_n)$ is recurrent. $\Box$
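The recursion (12.4) can be used to compute the first-return probabilities from the return probabilities. The sketch below (an illustration, not part of the text) does this in exact arithmetic for the simple random walk on $\mathbb{Z}$, where $p_{2m} = \binom{2m}{m} 4^{-m}$; the partial sums of the $f_k$ indeed creep up towards $1$, as recurrence demands:

```python
from fractions import Fraction
from math import comb

# Return probabilities p_k of the simple random walk on Z (p_k = 0 for odd k)
N = 40
p = [Fraction(0)] * (N + 1)
p[0] = Fraction(1)
for m in range(1, N // 2 + 1):
    p[2 * m] = Fraction(comb(2 * m, m), 4 ** m)

# First-return probabilities f_k from the recursion p_k = sum_{i=1}^k f_i p_{k-i}
f = [Fraction(0)] * (N + 1)
for k in range(1, N + 1):
    f[k] = p[k] - sum(f[i] * p[k - i] for i in range(1, k))

# Known small values: f_2 = 1/2, f_4 = 1/8, f_6 = 1/16
assert f[2] == Fraction(1, 2) and f[4] == Fraction(1, 8) and f[6] == Fraction(1, 16)
partial = float(sum(f))
assert 0.8 < partial < 1          # sum_k f_k = 1 is approached only slowly
```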

Exercise 12.18 What does the Borel-Cantelli Lemma 8.1 say in the above situation? Do not overlook that the events $\{S_n = x\}$ ($n \in \mathbb{N}$) may be dependent.

We will now apply this criterion to analyze recurrence and transience for a random walk similar to the one defined in Example 12.12. To this end define the following random walk $(R_n)$ in $d$ dimensions. For $k \in \mathbb{N}$ let $Y_1^k, \ldots, Y_d^k$ be i.i.d. random variables taking values in $\{-1, +1\}$ with $P(Y_1^k = 1) = P(Y_1^k = -1) = 1/2$. Let $X_k$ be the random vector $X_k = (Y_1^k, \ldots, Y_d^k)$. Define $R_0 \equiv 0$ and for $n \ge 1$
$$R_n = \sum_{k=1}^n X_k.$$

Theorem 12.19 The random walk $(R_n)$ defined above is recurrent in dimensions $d = 1, 2$ and transient for $d \ge 3$.

Proof. Consider a sequence $(Z_k)$ of i.i.d. random variables taking values in $\{-1, +1\}$ with $P(Z_k = 1) = P(Z_k = -1) = 1/2$. Write $q_k = P(\sum_{i=1}^{2k} Z_i = 0)$. Then we apply Stirling's formula,
$$\lim_{n \to \infty} \frac{n!}{\sqrt{2\pi}\, n^{n + 1/2} e^{-n}} = 1,$$
to obtain
$$q_k = \binom{2k}{k} 2^{-2k} = \frac{(2k)!}{k!\,k!}\, 2^{-2k} \sim \frac{\sqrt{2\pi}\,(2k)^{2k + 1/2} e^{-2k}}{2\pi\, k^{2k+1} e^{-2k}}\, 2^{-2k} = \sqrt{\frac{1}{\pi k}}.$$
Hence the probability of a single coordinate of $R_{2n}$ to be zero ($R_n$ cannot be zero if $n$ is odd) asymptotically behaves like $\sqrt{\frac{1}{\pi n}}$. Since the $d$ coordinates of $(R_n)$ are independent,
$$P(R_{2n} = 0) \sim \left( \frac{1}{\pi n} \right)^{d/2}.$$
But
$$\sum_n \left( \frac{1}{n} \right)^{d/2} = \infty$$
for $d = 1$ and $d = 2$, while
$$\sum_n \left( \frac{1}{n} \right)^{d/2} < \infty$$
for $d \ge 3$. Together with Lemma 12.17 this proves the theorem. $\Box$
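The asymptotics $q_k \approx 1/\sqrt{\pi k}$ is easy to check numerically (the $O(1/k)$ margin used below is a sanity tolerance, relying on the known refinement of Stirling's formula):

```python
import math

def q(k):
    """P(a sum of 2k fair +-1 steps equals 0) = C(2k, k) * 2^(-2k)."""
    return math.comb(2 * k, k) / 4 ** k

for k in (10, 100, 1000):
    ratio = q(k) * math.sqrt(math.pi * k)   # should tend to 1 as k grows
    assert abs(ratio - 1) < 1 / (4 * k)
```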

We will give two results about martingales. The first one is inspired by the fact that, given a random variable $M$ and a filtration $(\mathcal{F}_t)$ of $\mathcal{F}$, the family of conditional expectations $E(M \mid \mathcal{F}_t)$ yields a martingale. Can a martingale always be described in this way? That is the content of the Martingale Limit Theorem.

Theorem 12.20 Suppose $(M_t)$ is a martingale with respect to the filtration $(\mathcal{F}_t)$ and that the martingale is (uniformly) square integrable, that is, $\limsup_t E(M_t^2) < \infty$. Then there is a square integrable random variable $M_\infty$ such that $M_t = E(M_\infty \mid \mathcal{F}_t)$ a.s. Moreover $\lim_t M_t = M_\infty$ in the $L^2$ sense.
Proof. The basic property that we will use is that the space of square integrable random variables $L^2(\Omega, \mathcal{F}, P)$ with the $L^2$ inner product is a complete vector space; in other words, every Cauchy sequence converges. Recall that for any $t < s$ it holds that $M_t = E(M_s \mid \mathcal{F}_t)$, and we have seen in Theorem 11.13 that then $M_s - M_t$ is perpendicular to $M_t$. In particular we have the Pythagoras formula
$$E(M_s^2) = E(M_t^2) + E((M_s - M_t)^2).$$
This implies that $E(M_s^2)$ is increasing in $s$, and therefore its limit exists and equals $\limsup_t E(M_t^2)$, which is finite. Therefore, given $\varepsilon > 0$ there is a $u$ such that for $t > s > u$ we have $E((M_s - M_t)^2) < \varepsilon$. That means that $\{M_s\}_s$ is a Cauchy sequence. Let $M_\infty$ be its limit; in particular $M_\infty$ is a square integrable random variable. Since orthogonal projection onto the subspace of $\mathcal{F}_t$-measurable functions is a continuous map, it holds that $E(M_\infty \mid \mathcal{F}_t) = \lim_s E(M_s \mid \mathcal{F}_t) = M_t$. $\Box$

With some extra effort one may show that $M_\infty$ is also the limit in the sense of almost sure convergence. The Martingale Limit Theorem is valid under more general circumstances: for example, if only $\limsup_t E(|M_t|) < \infty$ the martingale still converges almost surely, and if the family $(M_t)$ is moreover uniformly integrable, $M_\infty$ is also the limit in the $L^1$ sense.
An important concept for random processes is the concept of a stopping time.
Definition 12.21 A stopping time is a random variable : I {}, such that for
all t I, { ; () t} is Ft measurable. Here I {} is ordered such that t < for all
t I. Given a process Mt , the stopped process M t is given by M t () = Ms () where
s = t = min(, t).

Example 12.22 Given A Fu a stopping time is constructed by = 1Ac + u 1A , that


is, A () = if
/ A, and () = u if A.

Exercise 12.23 If $T : \Omega \to \mathbb{R}$ is constant, then $T$ is a stopping time. If $S$ and $T$ are stopping times, then $\max(S, T)$ and $\min(S, T)$ are stopping times.

A very important property of martingales is the following Martingale Stopping Theorem.

Theorem 12.24 Let $(M_t)_t$ be a martingale and $\tau$ a stopping time. Then the stopped process $(M_t^\tau)_t$ is a martingale.

Proof. It is easy to see that an adapted process stopped at a stopping time is again an adapted process. We will give a proof of the martingale property for the simple stopping time $\tau = \tau_A$ given above. If $s > t \ge u$, let $B \in \mathcal{F}_t$. Then
$$\int_B E(M_s^\tau \mid \mathcal{F}_t) \, dP = \int_B M_s^\tau \, dP = \int_{B \cap A} M_s^\tau \, dP + \int_{B \cap A^c} M_s^\tau \, dP = \int_{B \cap A} M_u \, dP + \int_{B \cap A^c} M_s \, dP$$
$$= \int_{B \cap A} M_u \, dP + \int_{B \cap A^c} E(M_s \mid \mathcal{F}_t) \, dP = \int_{B \cap A} M_u \, dP + \int_{B \cap A^c} M_t \, dP = \int_B M_t^\tau \, dP.$$
If $u \ge t$ and $s > t$, then one can apply Theorem 11.12 (note that $M_u^\tau = M_u$ and $M_t^\tau = M_t$, since $\tau_A \ge u \ge t$):
$$E(M_s^\tau \mid \mathcal{F}_t) = E(E(M_s^\tau \mid \mathcal{F}_u) \mid \mathcal{F}_t) = E(M_u^\tau \mid \mathcal{F}_t) = M_t = M_t^\tau. \qquad \Box$$

Exercise 12.25 Modify the proof of Theorem 12.24 to show that a stopped supermartingale
is a supermartingale.

Exercise 12.26 Consider the roulette game. There are several possibilities for a bet, given by a number $p \in (0, 1)$ such that with probability $\frac{36}{37} p$ the return is $p^{-1}$ times the stake, and the return is zero with probability $1 - \frac{36}{37} p$. The probabilities $p$ are such that $p^{-1} \in \mathbb{N}$. Suppose you start with an initial fortune $X_0 \in \mathbb{N}$ and perform a sequence of bets until this fortune is reduced to zero. We are interested in the expected value of the total sum of stakes. To determine this, consider the sequence of subsequent fortunes $X_i$, and the sequence of stakes $Y_i$, meaning that the stake in bet $i$ is $Y_i = Y_i(X_1, \ldots, X_{i-1})$ ($Y_i \le X_{i-1}$). In particular, if for this stake the probability $p$ is chosen, either $X_i = X_{i-1} - Y_i + p^{-1} Y_i$ (with probability $\frac{36}{37} p$) or $X_i = X_{i-1} - Y_i$ (with probability $1 - \frac{36}{37} p$). Show that $(X_i + \frac{1}{37} \sum_{j=1}^i Y_j)_i$ is a martingale with respect to the filtration $\mathcal{F}_i = \sigma(X_1, \ldots, X_i)$. The stopping time $N$ is the first time $i$ such that $X_i = 0$. Show that $E(\sum_{j=1}^N Y_j) = 37\, X_0$.

13 Brownian motion
In this section we will construct the continuous time martingale, Brownian motion. Besides
this, Brownian motion is also a building block of stochastic calculus and stochastic analysis.
In stochastic analysis one studies random functions of one variable and various kinds
of integrals and derivatives thereof. The argument of these functions is usually interpreted
as time, so the functions themselves can be thought of as the path of a random process.

Here, like in other areas of mathematics, going from the discrete to the continuous yields a pay-off in simplicity and smoothness, at the price of a formally more complicated analysis. Compare, to make an analogy, the integral $\int_0^n x^3 \, dx$ with the sum $\sum_{k=1}^n k^3$. The integral requires a more refined analysis for its definition and its properties, but once this has been done the integral is easier to calculate. Similarly, in stochastic analysis you will become acquainted with a convenient differential calculus as a reward for some hard work in analysis.
Stochastic analysis can be applied in a wide variety of situations. We sketch a few
examples below.

1. Some differential equations become more realistic when we allow some randomness in their coefficients. Consider for example the following growth equation, used among other places in population biology:
$$\frac{d}{dt} S_t = (r + N_t) S_t. \tag{13.1}$$
Here, $S_t$ is the size of the population at time $t$, $r$ is the average growth rate of the population, and the noise $N_t$ models random fluctuations in the growth rate.

2. At time $t = 0$ an investor buys stocks and bonds on the financial market, i.e., he divides his initial capital $C_0$ into $A_0$ shares of stock and $B_0$ shares of bonds. The bonds will yield a guaranteed interest rate $r'$. If we assume that the stock price $S_t$ satisfies the growth equation (13.1), then his capital $C_t$ at time $t$ is
$$C_t = A_t S_t + B_t e^{r' t}, \tag{13.2}$$
where $A_t$ and $B_t$ are the amounts of stocks and bonds held at time $t$. With a keen eye on the market the investor sells stocks to buy bonds and vice versa. If his tradings are self-financing, then $dC_t = A_t \, dS_t + B_t \, d(e^{r' t})$. An interesting question is:

- What would he be prepared to pay for a so-called European call option, i.e., the right (bought at time $0$) to purchase at time $T > 0$ a share of stock at a predetermined price $K$?

The rational answer, $q$ say, was found by Black and Scholes (1973) through an analysis of the possible strategies leading from an initial investment $q$ to a payoff $C_T$. Their formula is being used on the stock markets all over the world.

3. The Langevin equation describes the behaviour of a dust particle suspended in a fluid:
$$m \frac{d}{dt} V_t = -\gamma V_t + N_t. \tag{13.3}$$
Here, $V_t$ is the velocity at time $t$ of the dust particle, the friction exerted on the particle due to the viscosity of the fluid is $-\gamma V_t$, and the noise $N_t$ stands for the disturbance due to the thermal motion of the surrounding fluid molecules colliding with the particle.

58
4. The path of the dust particle in example 3 is observed with some inaccuracy. One measures the perturbed signal $Z_t$ given by
$$Z_t = V_t + \tilde N_t. \tag{13.4}$$
Here $\tilde N_t$ is again a noise. One is interested in the best guess for the actual value of $V_t$, given the observations $Z_s$ for $0 \le s \le t$. This is called a filtering problem: how to filter away the noise $\tilde N_t$. Kalman and Bucy (1961) found a linear algorithm, which was almost immediately applied in aerospace engineering. Filtering theory is now a flourishing and extremely useful discipline.

5. Stochastic analysis can help solve boundary value problems such as the Dirichlet problem. If the value of a harmonic function $f$ on the boundary of some bounded regular region $D \subseteq \mathbb{R}^n$ is known, then one can express the value of $f$ in the interior of $D$ as follows:
$$E(f(B_\tau^x)) = f(x), \tag{13.5}$$
where $B_t^x := x + \int_0^t N_s \, ds$ is an integrated noise or Brownian motion, starting at $x$, and $\tau$ denotes the time when this Brownian motion first reaches the boundary. (A harmonic function $f$ is a function satisfying $\Delta f = 0$, with $\Delta$ the Laplacian.)

The goal of the course Stochastic Analysis is to make sense of the above equations, and to work with them.
In all the above examples the unexplained symbol $N_t$ occurs, which is to be thought of as a completely random function of $t$, in other words, the continuous time analogue of a sequence of independent identically distributed random variables. In a first attempt to catch this concept, let us formulate the following requirements:

1. $N_t$ is independent of $N_s$ for $t \ne s$;

2. The random variables $N_t$ ($t \ge 0$) all have the same probability distribution $\mu$;

3. $E(N_t) = 0$.

However, when taken literally these requirements do not produce what we want. This is seen by the following argument. By requirement 1 we have for every point in time an independent value of $N_t$. We shall show that such a continuous i.i.d. sequence $N_t$ is not measurable in $t$, unless it is identically $0$.
Let $\mu$ denote the probability distribution of $N_t$, which by requirement 2 does not depend on $t$, i.e., $\mu([a, b]) := P[a \le N_t \le b]$. Divide $\mathbb{R}$ into two half lines, one extending from $-\infty$ to $a$ and the other extending from $a$ to $\infty$. If $N_t$ is not a constant function of $t$, then there must be a value of $a$ such that each of the half lines has positive measure. So
$$p := P(N_t \le a) = \mu((-\infty, a]) \in (0, 1). \tag{13.6}$$
Now consider the set of time points where the noise $N_t$ is low: $E := \{ t \ge 0 : N_t \le a \}$. It can be shown that with probability $1$ the set $E$ is not Lebesgue measurable. Without
giving a full proof we can understand this as follows. Let denote the Lebesgue measure
on R. If E were measurable, then by requirement 1 and Eq. (13.6) it would be reasonable
to expect its relative share in any interval (c, d) to be p, i.e.,

(E (c, d)) = p (d c) . (13.7)

On the other hand, it is known from measure theory that every measurable set $E$ is arbitrarily
thick somewhere with respect to the Lebesgue measure $\lambda$, i.e., for all $\alpha < 1$ an
interval $(c, d)$ can be found such that

$$\lambda(E \cap (c, d)) > \alpha\,(d - c)$$

(cf. Halmos (1974), Th. III.16.A). This clearly contradicts Eq. (13.7), so $E$ is not measurable.
This is a bad property of $N_t$: for, in view of (13.1), (13.3), (13.4) and (13.5), we would like
to integrate $N_t$.
For this reason, let us approach the problem from another angle. Instead of $N_t$, let us
consider the integral of $N_t$, and give it a name:

$$B_t := \int_0^t N_s\,ds.$$

The three requirements on the evasive object Nt then translate into three quite sensible
requirements for Bt .

BM1. For $0 = t_0 \leq t_1 \leq \cdots \leq t_n$ the random variables $B_{t_{j+1}} - B_{t_j}$ ($j = 0, \ldots, n-1$) are
independent;

BM2. $B_t$ has stationary increments, i.e., the joint probability distribution of

$$(B_{t_1+s} - B_{u_1+s},\ B_{t_2+s} - B_{u_2+s},\ \ldots,\ B_{t_n+s} - B_{u_n+s})$$

does not depend on $s \geq 0$, where $t_i > u_i$ for $i = 1, 2, \ldots, n$ are arbitrary.

BM3. $E(B_t - B_0) = 0$ for all $t$.

We add a normalisation:

BM4. $B_0 = 0$ and $E(B_1^2) = 1$.

Still, these four requirements do not determine $B_t$. For example, the compensated Poisson
jump process also satisfies them. Our fifth requirement fixes the process $B_t$ uniquely:

BM5. $t \mapsto B_t$ is continuous a.s.

The object Bt so defined is called the Wiener process, or (by a slight abuse of physical
terminology) Brownian motion. In the next section we shall give a rigorous and explicit
construction of this process.
Before we go into details we remark the following

Exercise 13.1 Show that BM5, together with BM1 and BM2, implies the following:
for any $\varepsilon > 0$

$$n\,P\big(|B_{t+\frac{1}{n}} - B_t| > \varepsilon\big) \to 0 \tag{13.8}$$

as $n \to \infty$. Hint: compare with inequality (8.6).

Exercise 13.1 helps us to specify the increments of Brownian motion in the following way².

Exercise 13.2 Suppose BM1, BM2, BM4 and (13.8) hold. Apply the Central Limit
Theorem (Lindeberg's condition, page 36) to

$$X_{n,k} := B_{\frac{kt}{n}} - B_{\frac{(k-1)t}{n}}$$

and conclude that $B_{s+t} - B_s$, $t > 0$, has a normal distribution with variance $t$, i.e.

$$P(B_{s+t} - B_s \in A) = \frac{1}{\sqrt{2\pi t}} \int_A e^{-\frac{x^2}{2t}}\,dx.$$

As a matter of fact, BM1 and BM5 already imply that the increments $B_{s+t} - B_s$ are
normally distributed³.

BM2'. If $s \geq 0$ and $t > 0$, then

$$P(B_{s+t} - B_s \in A) = \frac{1}{\sqrt{2\pi t}} \int_A e^{-\frac{x^2}{2t}}\,dx.$$

We can now define Brownian motion as follows.

Definition 13.3 A one-dimensional Brownian motion is a real-valued process $(B_t)_{t \geq 0}$
with the properties BM1, BM2', and BM5.
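As an illustration (not part of the original notes), a process with independent Gaussian increments can be approximated on a finite grid by cumulating independent $N(0, dt)$ draws. The sketch below, assuming NumPy, checks the normalisation $E(B_1^2) = 1$ and the independence of increments over disjoint intervals empirically; grid size, path count, and all names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_bm(n_steps, n_paths, T=1.0):
    """Discretised paths with independent N(0, dt) increments
    (properties BM1, BM2' and the normalisation BM4)."""
    dt = T / n_steps
    inc = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    paths = np.cumsum(inc, axis=1)
    return np.hstack([np.zeros((n_paths, 1)), paths])  # prepend B_0 = 0

paths = simulate_bm(n_steps=1000, n_paths=20000)
var_B1 = float(paths[:, -1].var())            # should be close to 1
inc1 = paths[:, 250] - paths[:, 0]            # increment on [0, 0.25]
inc2 = paths[:, 1000] - paths[:, 500]         # increment on [0.5, 1]
corr = float(np.corrcoef(inc1, inc2)[0, 1])   # should be close to 0
print(round(var_B1, 2), round(corr, 2))
```

Such a discretisation only produces the process on a grid; the content of this chapter is precisely to show that a genuine continuous-time limit exists.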

13.1 Construction of Brownian Motion


Whenever a stochastic process with certain properties is defined, the most natural question
to ask is: does such a process exist? Of course, the answer is yes, otherwise these lecture
notes would not have been written.
In this section we shall construct Brownian motion on $[0, T]$. For the sake of simplicity
we will take $T = 1$; the construction for general $T$ can be carried out along the same lines,
or by just concatenating independent Brownian motions.
The construction we shall use was given by P. Lévy in 1948. Since we saw that the
increments of Brownian motion are independent Gaussian random variables, the idea is to
construct Brownian motion from these Gaussian increments.
² See R. Durrett (1991), Probability: Theory and Examples, Section 7.1, Exercise 1.1, p. 334. Unfortunately, there is something wrong with this exercise. See the 3rd edition (2005) for a correct treatment.
³ See e.g. I. Gihman, A. Skorohod, The Theory of Stochastic Processes I, Ch. III, §5, Theorem 5, p. 189. For a high-tech approach, see N. Ikeda, S. Watanabe, Stochastic Differential Equations and Diffusion Processes, Ch. II, Theorem 6.1, p. 74.

More precisely, we start with the following observation. Suppose we had already constructed
Brownian motion, say $(B_t)_{0 \leq t \leq T}$. Take two times $0 \leq s < t \leq T$, put $\tau := \frac{s+t}{2}$,
and let

$$p(v, x, y) := \frac{1}{\sqrt{2\pi v}}\,e^{-(y-x)^2/2v}, \qquad v > 0,\ x, y \in \mathbb{R}$$

be the Gaussian kernel centered in $x$ with variance $v$. Then, conditioned on $B_s = x$ and
$B_t = z$, the random variable $B_\tau$ is normal with mean $\mu := \frac{x+z}{2}$ and variance $\sigma^2 := \frac{t-s}{4}$.
Indeed, since $B_s$, $B_\tau - B_s$, and $B_t - B_\tau$ are independent we obtain

$$P[B_s \in dx,\ B_\tau \in dy,\ B_t \in dz] = p(s, 0, x)\,p\Big(\frac{t-s}{2}, x, y\Big)\,p\Big(\frac{t-s}{2}, y, z\Big)\,dx\,dy\,dz$$
$$= p(s, 0, x)\,p(t-s, x, z)\,\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dx\,dy\,dz$$

(which is just a bit of algebra). Dividing by

$$P[B_s \in dx,\ B_t \in dz] = p(s, 0, x)\,p(t-s, x, z)\,dx\,dz$$

we obtain

$$P[B_\tau \in dy \mid B_s \in dx,\ B_t \in dz] = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dy,$$

which is our claim.
This suggests that we might be able to construct Brownian motion on [0, 1] by interpolation.
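The observation above can be checked by simulation: if, given the endpoints, the midpoint is filled in as $\mu + \sigma\xi$ with $\mu = (x+z)/2$ and $\sigma^2 = (t-s)/4$, then the two half-increments come out independent with variance $(t-s)/2$, exactly as for Brownian motion. A minimal Monte-Carlo sketch (assuming NumPy; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
s, t = 0.25, 0.75
tau = (s + t) / 2
n = 200_000

# sample the endpoints from the known marginals of Brownian motion
B_s = rng.normal(0.0, np.sqrt(s), n)
B_t = B_s + rng.normal(0.0, np.sqrt(t - s), n)

# fill in the midpoint as in the observation above:
# given B_s = x and B_t = z, B_tau is N((x + z)/2, (t - s)/4)
xi = rng.normal(0.0, 1.0, n)
B_tau = (B_s + B_t) / 2 + np.sqrt((t - s) / 4) * xi

# the two half-increments should be independent N(0, (t - s)/2)
inc1, inc2 = B_tau - B_s, B_t - B_tau
print(round(float(inc1.var()), 3), round(float(inc2.var()), 3),
      round(float(np.corrcoef(inc1, inc2)[0, 1]), 3))
```

Here $(t-s)/2 = 0.25$, so both empirical variances should be near 0.25 and the empirical correlation near 0.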
To carry out this program, we begin with a sequence $\{\xi_k^{(n)},\ k \in I(n),\ n \in \mathbb{N}_0\}$ of independent,
standard normal random variables on some probability space $(\Omega, \mathcal{F}, P)$. Here

$$I(n) := \{k \in \mathbb{N} :\ k \leq 2^n,\ k = 2l + 1 \text{ for some } l \in \mathbb{N}_0\}$$

denotes the set of odd, positive integers less than or equal to $2^n$. For each $n \in \mathbb{N}_0$ we define a process
$B^{(n)} := \{B_t^{(n)} : 0 \leq t \leq 1\}$ by recursion and linear interpolation of the preceding process,
as follows. For $n \in \mathbb{N}$, $B^{(n)}_{k/2^{n-1}}$ will agree with $B^{(n-1)}_{k/2^{n-1}}$ for all $k = 0, 1, \ldots, 2^{n-1}$. Thus for
each $n$ we only need to specify the values of $B^{(n)}_{k/2^n}$ for $k \in I(n)$. We start with

$$B_0^{(0)} = 0 \quad \text{and} \quad B_1^{(0)} = \xi_1^{(0)}.$$

If the values of $B^{(n-1)}_{k/2^{n-1}}$, $k = 0, 1, \ldots, 2^{n-1}$, have been defined (and thus $B_t^{(n-1)}$, $k/2^{n-1} \leq t \leq
(k+1)/2^{n-1}$, is the linear interpolation between $B^{(n-1)}_{k/2^{n-1}}$ and $B^{(n-1)}_{(k+1)/2^{n-1}}$) and $k \in I(n)$, we
denote $s = (k-1)/2^n$, $t = (k+1)/2^n$, $\mu = \frac{1}{2}\big(B_s^{(n-1)} + B_t^{(n-1)}\big)$ and $\sigma^2 = \frac{t-s}{4} = 2^{-(n+1)}$, and
set, in accordance with the above observations,

$$B^{(n)}_{k/2^n} := B^{(n)}_{(t+s)/2} := \mu + \sigma\,\xi_k^{(n)}.$$

We shall show that, almost surely, $B_t^{(n)}$ converges uniformly in $t$ to a continuous function
$B_t$ (as $n \to \infty$) and that $B_t$ is a Brownian motion.
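The recursion just described is easy to carry out numerically. The following sketch (NumPy; all names are illustrative) runs Lévy's midpoint construction down to the dyadic grid of mesh $2^{-10}$ and checks that the finest-level increments have variance $2^{-10}$, as they should for Brownian motion:

```python
import numpy as np

rng = np.random.default_rng(2)

def levy_construction(n_levels):
    """Levy's construction on the dyadic grid of mesh 2**-n_levels.
    Returns the array (B_{k/2^n})_{k=0,...,2^n} for n = n_levels."""
    B = np.array([0.0, rng.normal()])        # B^(0) on the grid {0, 1}
    for n in range(1, n_levels + 1):
        sigma = np.sqrt(2.0 ** -(n + 1))     # sigma^2 = (t-s)/4 = 2^-(n+1)
        mid = (B[:-1] + B[1:]) / 2           # mu = (B_s + B_t)/2
        xi = rng.normal(size=mid.shape)      # the xi_k^(n), k in I(n)
        new = np.empty(2 * len(B) - 1)
        new[0::2] = B                        # old grid points are kept
        new[1::2] = mid + sigma * xi         # new midpoint values
        B = new
    return B

B = levy_construction(10)        # 2**10 + 1 = 1025 grid points
increments = np.diff(B)          # should look like iid N(0, 2**-10)
print(len(B), round(float(increments.var()) * 2**10, 2))
```

The rescaled empirical variance of the 1024 finest increments should be close to 1, in agreement with the prescription $\sigma^2 = 2^{-(n+1)}$ at each level.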

We start by giving a more convenient representation of the processes $B^{(n)}$, $n = 0, 1, \ldots$.
We define the Haar functions by $H_1^{(0)}(t) \equiv 1$, and for $n \in \mathbb{N}$, $k \in I(n)$,

$$H_k^{(n)}(t) := \begin{cases} 2^{(n-1)/2}, & \frac{k-1}{2^n} \leq t < \frac{k}{2^n} \\ -2^{(n-1)/2}, & \frac{k}{2^n} \leq t < \frac{k+1}{2^n} \\ 0 & \text{otherwise.} \end{cases}$$

The Schauder functions are defined by

$$S_k^{(n)}(t) := \int_0^t H_k^{(n)}(u)\,du, \qquad 0 \leq t \leq 1,\ n \in \mathbb{N}_0,\ k \in I(n).$$

Note that $S_1^{(0)}(t) = t$, and that for $n \geq 1$ the graphs of $S_k^{(n)}$ are little tents of height
$2^{-(n+1)/2}$ centered at $k/2^n$ and non-overlapping for different values of $k \in I(n)$. Clearly,
$B_t^{(0)} = \xi_1^{(0)} S_1^{(0)}(t)$, and by induction on $n$, it is readily verified that

$$B_t^{(n)}(\omega) = \sum_{m=0}^{n} \sum_{k \in I(m)} \xi_k^{(m)}(\omega)\,S_k^{(m)}(t), \qquad 0 \leq t \leq 1,\ n \in \mathbb{N}. \tag{13.9}$$

Lemma 13.4 As $n \to \infty$, the sequence of functions $\{B_t^{(n)}(\omega),\ 0 \leq t \leq 1\}$, $n \in \mathbb{N}_0$, given
by (13.9) converges uniformly in $t$ to a continuous function $\{B_t(\omega),\ 0 \leq t \leq 1\}$ for almost
every $\omega \in \Omega$.

Proof. Let $b_n := \max_{k \in I(n)} |\xi_k^{(n)}|$. Observe that for $x > 0$ and each $n, k$

$$P(|\xi_k^{(n)}| > x) = \sqrt{\frac{2}{\pi}} \int_x^\infty e^{-u^2/2}\,du
\leq \sqrt{\frac{2}{\pi}} \int_x^\infty \frac{u}{x}\,e^{-u^2/2}\,du = \sqrt{\frac{2}{\pi}}\,\frac{1}{x}\,e^{-x^2/2},$$

which gives

$$P(b_n > n) = P\Big(\bigcup_{k \in I(n)} \{|\xi_k^{(n)}| > n\}\Big) \leq 2^n\,P(|\xi_1^{(n)}| > n) \leq \sqrt{\frac{2}{\pi}}\,\frac{2^n}{n}\,e^{-n^2/2},$$

for all $n \in \mathbb{N}$. Since

$$\sum_n \sqrt{\frac{2}{\pi}}\,\frac{2^n}{n}\,e^{-n^2/2} < \infty,$$

the Borel-Cantelli Lemma implies that there is a set $\tilde\Omega$ with $P(\tilde\Omega) = 1$ such that for $\omega \in \tilde\Omega$
there is an $n_0(\omega)$ such that for all $n \geq n_0(\omega)$ it holds true that $b_n(\omega) \leq n$. But then

$$\sum_{n \geq n_0(\omega)} \sum_{k \in I(n)} |\xi_k^{(n)}(\omega)\,S_k^{(n)}(t)| \leq \sum_{n \geq n_0(\omega)} n\,2^{-(n+1)/2} < \infty;$$

so for $\omega \in \tilde\Omega$, $B_t^{(n)}(\omega)$ converges uniformly in $t$ to a limit $B_t$. The uniformity of the
convergence implies the continuity of the limit $B_t$.
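The Gaussian tail bound used in the proof, and the summability of $2^n$ times the bound, can both be checked numerically. A small sketch using only the Python standard library ($P(|\xi| > x)$ is expressed via the complementary error function):

```python
import math

def tail(x):
    """Exact P(|xi| > x) for a standard normal xi."""
    return math.erfc(x / math.sqrt(2.0))

def bound(x):
    """The bound from the proof of Lemma 13.4:
    sqrt(2/pi) * exp(-x**2 / 2) / x, valid for x > 0."""
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0) / x

# the bound dominates the exact tail for every positive x we try
for x in [0.5, 1.0, 2.0, 3.0, 5.0]:
    assert tail(x) <= bound(x)

# and sum_n 2^n * bound(n) is finite, as needed for Borel-Cantelli
total = sum((2.0 ** n) * bound(n) for n in range(1, 60))
print(round(total, 4))
```

The domination holds because $e^{-u^2/2} \leq (u/x)\,e^{-u^2/2}$ on the integration range $u \geq x$, which is exactly the step in the displayed inequality.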
The following exercise facilitates the construction of Brownian motion substantially:

Exercise 13.5 Check the following in a textbook of functional analysis:
The inner product

$$\langle f, g \rangle := \int_0^1 f(t)\,g(t)\,dt$$

turns $L^2[0,1]$ into a Hilbert space, and the Haar functions $\{H_k^{(n)};\ k \in I(n),\ n \in \mathbb{N}_0\}$ form a
complete, orthonormal system.
Thus the Parseval equality

$$\langle f, g \rangle = \sum_{n=0}^{\infty} \sum_{k \in I(n)} \langle f, H_k^{(n)} \rangle \langle g, H_k^{(n)} \rangle \tag{13.10}$$

holds true.

Applying (13.10) to $f = 1_{[0,t]}$ and $g = 1_{[0,s]}$ yields

$$\sum_{n=0}^{\infty} \sum_{k \in I(n)} S_k^{(n)}(t)\,S_k^{(n)}(s) = s \wedge t. \tag{13.11}$$
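Identity (13.11) can be probed numerically from the tent description of the Schauder functions. In the sketch below (Python; the closed form of $S_k^{(n)}$ is read off from the tent picture above, an assumption of this illustration rather than a formula from the notes), the truncated series is evaluated for a few pairs $(s, t)$; for these particular pairs only finitely many terms are nonzero, so the truncation is essentially exact.

```python
def schauder(n, k, t):
    """Closed form of S_k^(n)(t): for n >= 1 a tent of height
    2**(-(n+1)/2) centered at k/2^n with support ((k-1)/2^n, (k+1)/2^n);
    S_1^(0)(t) = t."""
    if n == 0:
        return t
    h = 2.0 ** (-(n + 1) / 2.0)
    return h * max(0.0, 1.0 - abs(t - k / 2**n) * 2**n)

def series(s, t, levels=18):
    """Truncation of the left-hand side of (13.11) after `levels` levels."""
    total = schauder(0, 1, s) * schauder(0, 1, t)
    for n in range(1, levels):
        for k in range(1, 2**n, 2):      # k in I(n): odd k <= 2^n
            total += schauder(n, k, s) * schauder(n, k, t)
    return total

for s, t in [(0.3, 0.7), (0.5, 0.5), (0.9, 0.2)]:
    print(round(series(s, t), 6), min(s, t))
```

Since the level-$n$ tents have supports of width $2^{1-n}$, terms with both arguments in one support die out quickly once $2^{1-n} < |s - t|$, which is why the truncated sum already matches $s \wedge t$.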

Now we are able to prove

Theorem 13.6 With the above notations,

$$B_t := \lim_{n \to \infty} B_t^{(n)}$$

is a Brownian motion on $[0, 1]$.

Proof. In view of our definition of Brownian motion it suffices to prove that for $0 = t_0 <
t_1 < \ldots < t_n \leq 1$, the increments $(B_{t_j} - B_{t_{j-1}})_{j=1,\ldots,n}$ are independent, normally distributed
with mean zero and variance $(t_j - t_{j-1})$. For this we will show that the Fourier transforms
satisfy the appropriate condition, namely that for $\lambda_j \in \mathbb{R}$ (and as usual $i := \sqrt{-1}$)

$$E\Big[\exp\Big(i \sum_{j=1}^n \lambda_j (B_{t_j} - B_{t_{j-1}})\Big)\Big] = \prod_{j=1}^n \exp\Big(-\frac{1}{2}\lambda_j^2 (t_j - t_{j-1})\Big). \tag{13.12}$$

To derive (13.12) it is most natural to exploit the construction of $B_t$ from Gaussian random
variables. Set $\lambda_{n+1} = 0$ and use the independence and normality of the $\xi_k^{(n)}$ to compute, for
$M \in \mathbb{N}$,
$$E\Big[\exp\Big(-i \sum_{j=1}^n (\lambda_{j+1} - \lambda_j)\,B_{t_j}^{(M)}\Big)\Big]$$
$$= E\Big[\exp\Big(-i \sum_{m=0}^{M} \sum_{k \in I(m)} \xi_k^{(m)} \sum_{j=1}^n (\lambda_{j+1} - \lambda_j)\,S_k^{(m)}(t_j)\Big)\Big]$$
$$= \prod_{m=0}^{M} \prod_{k \in I(m)} E\Big[\exp\Big(-i\,\xi_k^{(m)} \sum_{j=1}^n (\lambda_{j+1} - \lambda_j)\,S_k^{(m)}(t_j)\Big)\Big]$$
$$= \prod_{m=0}^{M} \prod_{k \in I(m)} \exp\Big(-\frac{1}{2}\Big(\sum_{j=1}^n (\lambda_{j+1} - \lambda_j)\,S_k^{(m)}(t_j)\Big)^2\Big)$$
$$= \exp\Big(-\frac{1}{2} \sum_{j=1}^n \sum_{l=1}^n (\lambda_{j+1} - \lambda_j)(\lambda_{l+1} - \lambda_l) \sum_{m=0}^{M} \sum_{k \in I(m)} S_k^{(m)}(t_j)\,S_k^{(m)}(t_l)\Big).$$

Now we send $M \to \infty$ and apply (13.11) to obtain

$$E\Big[\exp\Big(i \sum_{j=1}^n \lambda_j (B_{t_j} - B_{t_{j-1}})\Big)\Big] = E\Big[\exp\Big(-i \sum_{j=1}^n (\lambda_{j+1} - \lambda_j)\,B_{t_j}\Big)\Big]$$
$$= \exp\Big(-\sum_{j=1}^{n-1} \sum_{l=j+1}^{n} (\lambda_{j+1} - \lambda_j)(\lambda_{l+1} - \lambda_l)\,t_j - \frac{1}{2}\sum_{j=1}^n (\lambda_{j+1} - \lambda_j)^2\,t_j\Big)$$
$$= \exp\Big(\sum_{j=1}^{n-1} (\lambda_{j+1} - \lambda_j)\,\lambda_{j+1}\,t_j - \frac{1}{2}\sum_{j=1}^n (\lambda_{j+1} - \lambda_j)^2\,t_j\Big)$$
$$= \exp\Big(\frac{1}{2}\sum_{j=1}^{n-1} (\lambda_{j+1}^2 - \lambda_j^2)\,t_j - \frac{1}{2}\lambda_n^2\,t_n\Big)$$
$$= \prod_{j=1}^n \exp\Big(-\frac{1}{2}\lambda_j^2 (t_j - t_{j-1})\Big).$$
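The characterisation just proved can be probed by simulation: for paths built from independent Gaussian increments, the empirical covariance $E[B_s B_t]$ should be close to $s \wedge t$, and the empirical Fourier transform of an increment close to $\exp(-\frac{1}{2}\lambda^2(t-s))$, matching (13.12). A rough Monte-Carlo sketch (NumPy; all parameters are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, n_steps = 10000, 256
dt = 1.0 / n_steps
inc = rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps))
B = np.cumsum(inc, axis=1)        # B[:, j-1] approximates B_{j/256}

s, t = 0.25, 0.75
B_s = B[:, int(s * n_steps) - 1]
B_t = B[:, int(t * n_steps) - 1]

# covariance fingerprint of Brownian motion: E[B_s B_t] = min(s, t)
cov = float(np.mean(B_s * B_t))

# empirical Fourier transform of an increment, cf. (13.12) with n = 1
lam = 1.7
emp = np.mean(np.exp(1j * lam * (B_t - B_s)))
theory = np.exp(-0.5 * lam**2 * (t - s))
print(round(cov, 2), round(abs(emp - theory), 3))
```

Here the empirical covariance should land near $s \wedge t = 0.25$, and the complex average near the real number $e^{-\lambda^2(t-s)/2}$.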

14 Appendix
Let $\Omega$ be a set and $\mathcal{P}(\Omega)$ the collection of all subsets of $\Omega$.

Definition 14.1 A system of sets $\mathcal{R} \subseteq \mathcal{P}(\Omega)$ is called a ring if it satisfies

$$A, B \in \mathcal{R} \Rightarrow A \setminus B \in \mathcal{R}$$
$$A, B \in \mathcal{R} \Rightarrow A \cup B \in \mathcal{R}$$

If additionally
$$\Omega \in \mathcal{R}$$
then $\mathcal{R}$ is called an algebra.

Note that for $A, B \in \mathcal{R}$ the intersection $A \cap B = A \setminus (A \setminus B)$ is again in $\mathcal{R}$.
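The identity $A \cap B = A \setminus (A \setminus B)$, which shows that a ring is automatically closed under intersections, is easy to sanity-check with finite sets (Python's built-in sets standing in for members of a ring; purely illustrative):

```python
# Python sets standing in for elements of a ring: closed under \ and
# union; the identity below then gives closure under intersection too.
A = {1, 2, 3, 4}
B = {3, 4, 5}
assert A & B == A - (A - B)   # A ∩ B = A \ (A \ B)
print(sorted(A & B))          # → [3, 4]
```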

Definition 14.2 A system $\mathcal{D} \subseteq \mathcal{P}(\Omega)$ is called a Dynkin system if it satisfies

$$\Omega \in \mathcal{D}$$
$$D \in \mathcal{D} \Rightarrow D^c \in \mathcal{D}$$

and: for every sequence $(D_n)_{n \in \mathbb{N}}$ of pairwise disjoint sets $D_n \in \mathcal{D}$, their union $\bigcup_n D_n$ is also in
$\mathcal{D}$.

The following theorem holds:

Theorem 14.3 A Dynkin system $\mathcal{D}$ is a $\sigma$-algebra if and only if for any two $A, B \in \mathcal{D}$ we
have
$$A \cap B \in \mathcal{D}.$$

Similar to the case of $\sigma$-algebras, for every system of sets $\mathcal{E} \subseteq \mathcal{P}(\Omega)$ there is a smallest
Dynkin system $\mathcal{D}(\mathcal{E})$ generated by (and containing) $\mathcal{E}$. The importance of Dynkin systems
is mainly due to the following

Theorem 14.4 For every $\mathcal{E} \subseteq \mathcal{P}(\Omega)$ with

$$A, B \in \mathcal{E} \Rightarrow A \cap B \in \mathcal{E}$$

we have
$$\mathcal{D}(\mathcal{E}) = \sigma(\mathcal{E}).$$

Definition 14.5 Let $\mathcal{R}$ be a ring. A function

$$\mu : \mathcal{R} \to [0, \infty]$$

is called a volume, if it satisfies

$$\mu(\emptyset) = 0$$

and

$$\mu\Big(\bigcup_{i=1}^n A_i\Big) = \sum_{i=1}^n \mu(A_i) \tag{14.1}$$

for all pairwise disjoint sets $A_1, \ldots, A_n \in \mathcal{R}$ and all $n \in \mathbb{N}$. A volume $\mu$ is called a
pre-measure if

$$\mu\Big(\bigcup_{i=1}^\infty A_i\Big) = \sum_{i=1}^\infty \mu(A_i) \tag{14.2}$$

for every pairwise disjoint sequence of sets $(A_i)_{i \in \mathbb{N}} \subseteq \mathcal{R}$ with $\bigcup_{i=1}^\infty A_i \in \mathcal{R}$. We will call (14.1) finite additivity
and (14.2) $\sigma$-additivity.

A pre-measure on a $\sigma$-algebra $\mathcal{A}$ is called a measure.

Theorem 14.6 Let $\mathcal{R}$ be a ring and $\mu$ be a volume on $\mathcal{R}$. If $\mu$ is a pre-measure, then
it is $\emptyset$-continuous, i.e. for all $(A_n)_n$, $A_n \in \mathcal{R}$, with $\mu(A_n) < \infty$ and $A_n \downarrow \emptyset$ it holds that
$\lim_{n \to \infty} \mu(A_n) = 0$. If $\mathcal{R}$ is an algebra and $\mu(\Omega) < \infty$, then the reverse also holds: an
$\emptyset$-continuous volume is a pre-measure.

Theorem 14.7 (Carathéodory) For every pre-measure $\mu$ on a ring $\mathcal{R}$ over $\Omega$ there is
at least one way to extend $\mu$ to a measure on $\sigma(\mathcal{R})$.

In the case that $\mathcal{R}$ is an algebra and $\mu$ is $\sigma$-finite (i.e. $\Omega$ is the countable union of subsets
of finite measure), this extension is unique.