
Probability Theory

Muhammad Waliji
August 11, 2006
Abstract
This paper introduces some elementary notions in Measure-Theoretic
Probability Theory. Several probabilistic notions of the convergence of
a sequence of random variables are discussed. The theory is then used
to prove the Law of Large Numbers. Finally, the notions of conditional
expectation and conditional probability are introduced.
1 Heuristic Introduction
Probability theory is concerned with the outcome of experiments that are random in nature, that is, experiments whose outcomes cannot be predicted in advance. The set of possible outcomes, $\omega$, of an experiment is called the sample space, denoted by $\Omega$. For instance, if our experiment consists of rolling a die, we will have $\Omega = \{1, 2, 3, 4, 5, 6\}$. A subset, $A$, of $\Omega$ is called an event. For instance, $A = \{1, 3, 5\}$ corresponds to the event "an odd number is rolled."

In elementary probability theory, one is normally concerned with sample spaces that are either finite or countable. In this case, one often assigns a probability to every single outcome. That is, we have a probability function $P : \Omega \to [0, 1]$, where $P(\omega)$ is the probability that $\omega$ occurs. Here, we insist that
$$\sum_{\omega \in \Omega} P(\omega) = 1.$$
However, if the sample space is uncountable, then this condition no longer makes sense. Two elementary types of problems fall into this category and hence cannot be dealt with by elementary probability theory: an infinite number of repeated coin tosses (or die rolls), and a number drawn at random from $[0, 1]$. This illustrates the importance of uncountable sample spaces.

The solution to this problem is to use the theory of measures. Instead of assigning probabilities to outcomes in the sample space, one restricts attention to a certain class of events that forms a structure known as a $\sigma$-field, and assigns probabilities to these special kinds of events.
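For example, if a number is drawn uniformly at random from $[0, 1]$, then by symmetry every individual outcome must receive the same probability, and that common value can only be $0$; the probability of an event such as an interval $[a, b] \subseteq [0, 1]$, namely $b - a$, therefore cannot be recovered by summing the probabilities of its outcomes.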
2 $\sigma$-Fields, Probability Measures, and Distribution Functions

Definition 2.1. A class of subsets of $\Omega$, $\mathcal{F}$, is a $\sigma$-field if the following hold:
(i) $\emptyset \in \mathcal{F}$ and $\Omega \in \mathcal{F}$,
(ii) $A \in \mathcal{F} \implies A^c \in \mathcal{F}$,
(iii) $A_1, A_2, \ldots \in \mathcal{F} \implies \bigcup_n A_n \in \mathcal{F}$.
Note that this implies that $\sigma$-fields are also closed under countable intersections.

Definition 2.2. The $\sigma$-field generated by a class of sets, $\mathcal{A}$, is the smallest $\sigma$-field containing $\mathcal{A}$. It is denoted $\sigma(\mathcal{A})$.
Definition 2.3. Let $\mathcal{F}$ be a $\sigma$-field. A function $P : \mathcal{F} \to [0, 1]$ is a probability measure if $P(\emptyset) = 0$, $P(\Omega) = 1$, and whenever $(A_n)_{n \in \mathbb{N}}$ is a disjoint collection of sets in $\mathcal{F}$, we have
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n).$$
Throughout this paper, unless otherwise noted, the words increasing, decreasing, and monotone are always meant in their weak sense. Suppose $A_n$ is a sequence of sets. We say that $A_n$ is an increasing sequence if $A_1 \subseteq A_2 \subseteq \cdots$. We say that $A_n$ is a decreasing sequence if $A_1 \supseteq A_2 \supseteq \cdots$. In both of these cases, the sequence $A_n$ is said to be monotone. If $A_n$ is increasing, then set $\lim_n A_n := \bigcup_n A_n$. If $A_n$ is decreasing, then set $\lim_n A_n := \bigcap_n A_n$. The following properties follow immediately from the definitions.
Lemma 2.4. Let $\mathcal{F}$ be a $\sigma$-field, and let $P$ be a probability measure on it.
(i) $P(A^c) = 1 - P(A)$.
(ii) If $A \subseteq B$, then $P(A) \le P(B)$.
(iii) $P\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} P(A_i)$.
(iv) If $A_n$ is a monotone sequence in $\mathcal{F}$, then $\lim_n P(A_n) = P(\lim_n A_n)$.
Definition 2.5. Suppose $\Omega$ is a set, $\mathcal{F}$ is a $\sigma$-field on $\Omega$, and $P$ is a probability measure on $\mathcal{F}$. Then, the ordered pair $(\Omega, \mathcal{F})$ is called a measurable space. The triple $(\Omega, \mathcal{F}, P)$ is called a probability space. A probability space is finitely additive or countably additive depending on whether $P$ is finitely or countably additive.

Definition 2.6. Let $(X, \tau)$ be a topological space. The $\sigma$-field $\mathcal{B}(X, \tau)$ generated by $\tau$ is called the Borel $\sigma$-field. In particular, $\mathcal{B}(X, \tau)$ is the smallest $\sigma$-field containing all open and closed sets of $X$. The sets of $\mathcal{B}(X, \tau)$ are called Borel sets.
When the topology $\tau$, or even the space $X$, is obvious from the context, $\mathcal{B}(X, \tau)$ will often be abbreviated $\mathcal{B}(X)$ or even just $\mathcal{B}$.

A particularly important situation in probability theory is when $\Omega = \mathbb{R}$ and $\mathcal{F}$ is the Borel $\sigma$-field of $\mathbb{R}$.

Definition 2.7. A distribution function is an increasing right-continuous function $F : \mathbb{R} \to [0, 1]$ such that
$$\lim_{x \to -\infty} F(x) = 0 \quad \text{and} \quad \lim_{x \to +\infty} F(x) = 1.$$
We can associate probability measures on $(\mathbb{R}, \mathcal{B})$ with distribution functions. Namely, the distribution function associated with $P$ is $F(x) := P((-\infty, x])$. Conversely, each distribution function defines a probability measure on the reals.
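For example, the uniform distribution on $[0, 1]$ corresponds to the distribution function $F(x) = 0$ for $x < 0$, $F(x) = x$ for $0 \le x \le 1$, and $F(x) = 1$ for $x > 1$; the associated measure assigns to each interval $[a, b] \subseteq [0, 1]$ the probability $b - a$.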
3 Random Variables, Transformations, and Expectation

We have now stated the basic objects that we will be studying and discussed their elementary properties. We next introduce the concept of a random variable. Let $\Omega$ be the set of all possible drawings of lottery numbers. The function $X : \Omega \to \mathbb{R}$ which indicates the payoff $X(\omega)$ to a player associated with a drawing $\omega$ is an example of a random variable. The expectation of a random variable is the average or expected value of $X$.
Definition 3.1. Let $(\Omega_1, \mathcal{F}_1)$ and $(\Omega_2, \mathcal{F}_2)$ be measurable spaces. A function $T : \Omega_1 \to \Omega_2$ is a measurable transformation if the preimage of any measurable set is a measurable set. That is, $T$ is a measurable transformation if $(\forall A \in \mathcal{F}_2)(T^{-1}(A) \in \mathcal{F}_1)$.

Lemma 3.2. It is sufficient to check the condition in Definition 3.1 for those $A$ in a class that generates $\mathcal{F}_2$. More precisely, suppose that $\mathcal{A}$ generates $\mathcal{F}_2$. Then, if $(\forall A \in \mathcal{A})(T^{-1}(A) \in \mathcal{F}_1)$, then $T$ is a measurable transformation.
Proof. Let $\mathcal{C} := \{A \in 2^{\Omega_2} : T^{-1}(A) \in \mathcal{F}_1\}$. Then, $\mathcal{C}$ is a $\sigma$-field, and $\mathcal{A} \subseteq \mathcal{C}$. But then, $\sigma(\mathcal{A}) = \mathcal{F}_2 \subseteq \mathcal{C}$, which is exactly what we wanted.
Definition 3.3. Let $(\Omega, \mathcal{F})$ be a measurable space. A measurable function or a random variable is a measurable transformation from $(\Omega, \mathcal{F})$ into $(\mathbb{R}, \mathcal{B})$.

Lemma 3.4. If $f : \mathbb{R} \to \mathbb{R}$ is a continuous function, then $f$ is a measurable transformation from $(\mathbb{R}, \mathcal{B})$ to $(\mathbb{R}, \mathcal{B})$.

Definition 3.5. Given a set $A$, the indicator function for $A$ is the function
$$I_A(\omega) := \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases}$$
If $A \in \mathcal{F}$, then $I_A$ is a measurable function.

Note that many elementary operations, including composition, arithmetic, max, min, and others, when performed upon measurable functions, again yield measurable functions.
Let $(\Omega_1, \mathcal{F}_1, P)$ be a probability space and $(\Omega_2, \mathcal{F}_2)$ a measurable space. A measurable transformation $T : \Omega_1 \to \Omega_2$ naturally induces a probability measure $PT^{-1}$ on $(\Omega_2, \mathcal{F}_2)$. In the case of a random variable $X$, the induced measure on $\mathbb{R}$ will generally be denoted $\mu$. The distribution function associated with $\mu$ will be denoted $F_X$. $\mu$ will sometimes be called a probability distribution.
Now that we have a notion of measure and of measurable functions, we can develop a notion of the integral of a function. The integral will have the probabilistic interpretation of being an expected (or average) value. For the precise definition of the Lebesgue integral, see any textbook on measure theory.

Definition 3.6. Suppose $X$ is a random variable. Then the expectation of $X$ is $EX := \int_{\Omega} X(\omega)\, dP$.
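For example, if $A \in \mathcal{F}$, then $E I_A = P(A)$, and more generally, for a simple random variable $X = \sum_{k=1}^{n} c_k I_{A_k}$ with the $A_k$ disjoint, $EX = \sum_{k=1}^{n} c_k P(A_k)$.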
We conclude this section with a useful change of variables formula for integrals.

Proposition 3.7. Let $(\Omega_1, \mathcal{F}_1, P)$ be a probability space and let $(\Omega_2, \mathcal{F}_2)$ be a measurable space. Suppose $T : \Omega_1 \to \Omega_2$ is a measurable transformation. Suppose $f : \Omega_2 \to \mathbb{R}$ is a measurable function. Then, $PT^{-1}$ is a probability measure on $(\Omega_2, \mathcal{F}_2)$ and $f \circ T : \Omega_1 \to \mathbb{R}$ is a measurable function. Furthermore, $f$ is integrable iff $f \circ T$ is integrable, and
$$\int_{\Omega_1} f(T(\omega_1))\, dP = \int_{\Omega_2} f(\omega_2)\, dPT^{-1}.$$
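In particular, taking $T = X$ to be a random variable with induced distribution $\mu = PX^{-1}$ on $\mathbb{R}$, the formula gives
$$E[f(X)] = \int_{\Omega} f(X(\omega))\, dP = \int_{\mathbb{R}} f(x)\, d\mu,$$
so expectations of functions of $X$ can be computed directly from the distribution of $X$.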
4 Notions of Convergence

We will now introduce some notions of the convergence of random variables. Note that we will often not explicitly state the dependence of a function $X(\omega)$ on $\omega$. Hence, sets of the form $\{\omega : X(\omega) > 0\}$ will often be abbreviated $\{X > 0\}$. For the remainder of this section, let $X_n$ be a sequence of random variables.

Definition 4.1. The sequence $X_n$ converges almost surely (almost everywhere) to a random variable $X$ if $X_n(\omega) \to X(\omega)$ for all $\omega$ outside of a set of probability $0$.

Definition 4.2. The sequence $X_n$ converges in probability (in measure) to a function $X$ if,
$$\text{for every } \varepsilon > 0, \quad \lim_n P\{\omega : |X_n(\omega) - X(\omega)| \ge \varepsilon\} = 0.$$
This is denoted $X_n \xrightarrow{P} X$.
Proposition 4.3. If $X_n$ converges almost surely to $X$, then $X_n$ converges in probability to $X$.

Proof. We have $\{\omega : X_n(\omega) \not\to X(\omega)\} \subseteq N$ with $P(N) = 0$. That is, for every $\varepsilon > 0$,
$$\bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \{|X_m - X| \ge \varepsilon\} \subseteq N.$$
Therefore, given $\varepsilon > 0$, we have
$$\lim_n P\{|X_n - X| \ge \varepsilon\} \le \lim_n P\left(\bigcup_{m=n}^{\infty} \{|X_m - X| \ge \varepsilon\}\right) = P\left(\bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} \{|X_m - X| \ge \varepsilon\}\right) \le P(N) = 0,$$
where the middle equality uses Lemma 2.4(iv) applied to the decreasing sequence of unions, thereby completing the proof.
Note, however, that the converse is not true. Let $\Omega = [0, 1]$ with Lebesgue measure. Consider the sequence of sets $A_1 = [0, \frac{1}{2}]$, $A_2 = [\frac{1}{2}, 1]$, $A_3 = [0, \frac{1}{3}]$, $A_4 = [\frac{1}{3}, \frac{2}{3}]$, and so on. Then, the indicator functions $I_{A_n}$ converge in probability to $0$. However, $I_{A_n}(\omega)$ does not converge for any $\omega$, and in particular the sequence does not converge almost surely. However, the following holds as a sort of converse:
Proposition 4.4. Suppose $f_n$ converges in probability to $f$. Then, there is a subsequence $f_{n_k}$ of $f_n$ such that $f_{n_k}$ converges almost surely to $f$.
Proof. Let $B^{\varepsilon}_n := \{\omega : |f_n(\omega) - f(\omega)| \ge \varepsilon\}$. Then,
$$f_{n_i} \to f \text{ almost surely} \iff P\left(\bigcap_i \bigcup_{j > i} B^{\varepsilon}_{n_j}\right) = 0 \text{ for every } \varepsilon > 0.$$
We know that for any $\varepsilon$,
$$\lim_n P(B^{\varepsilon}_n) = 0.$$
Now, notice that
$$P\left(\bigcap_n \bigcup_{m \ge n} B^{\varepsilon}_m\right) \le \inf_n P\left(\bigcup_{m \ge n} B^{\varepsilon}_m\right) \le \inf_n \sum_{m=n}^{\infty} P(B^{\varepsilon}_m) = \lim_n \sum_{m=n}^{\infty} P(B^{\varepsilon}_m).$$
Furthermore,
$$\varepsilon_1 < \varepsilon_2 \implies B^{\varepsilon_1}_n \supseteq B^{\varepsilon_2}_n \implies P(B^{\varepsilon_1}_n) \ge P(B^{\varepsilon_2}_n).$$
Let $\varepsilon_i := 1/2^i$. Now, note that $(\forall i)(\exists n^{\varepsilon_i})(\forall n \ge n^{\varepsilon_i})(P(B^{\varepsilon_i}_n) < \varepsilon_i)$. Let $n_i := \max(n^{\varepsilon_i}, i)$. Choose $\varepsilon > 0$. Note, $(\exists m)(\varepsilon_m < \varepsilon)$. Hence,
$$P\left(\bigcap_i \bigcup_{j \ge i} B^{\varepsilon}_{n_j}\right) \le \lim_i \sum_{j=i}^{\infty} P(B^{\varepsilon}_{n_j}) \le \lim_i \sum_{j=i}^{\infty} P(B^{\varepsilon_j}_{n_j}) \le \lim_i \sum_{j=i}^{\infty} \varepsilon_j = 0,$$
which is what we wanted.
Definition 4.5. A sequence of probability measures $\mu_n$ on $\mathbb{R}$ converges weakly to $\mu$ if whenever $\mu(\{a\}) = \mu(\{b\}) = 0$, for $a < b \in \mathbb{R}$, we have
$$\lim_n \mu_n[a, b] = \mu[a, b].$$
A sequence of random variables $X_n$ converges weakly to $X$ if the induced probability measures $\mu_n$ converge weakly to $\mu$. This is denoted $\mu_n \Rightarrow \mu$ or $X_n \Rightarrow X$.
Lemma 4.6. Suppose $\mu_n$ and $\mu$ are probability measures on $\mathbb{R}$ with associated distribution functions $F_n$ and $F$. Then, $\mu_n \Rightarrow \mu$ iff $F_n(x) \to F(x)$ for each continuity point $x$ of $F$.

Proof. First, note that $x$ is a continuity point of $F$ iff $\mu(\{x\}) = 0$. Let $a < b$ be continuity points of $F$. Suppose $F_n(x) \to F(x)$ for each continuity point $x$ of $F$. Then,
$$\lim_n \mu_n[a, b] = \lim_n F_n(b) - F_n(a) = F(b) - F(a) = \mu[a, b].$$
For the converse, suppose $\mu_n \Rightarrow \mu$. Then,
$$\lim_n F_n(b) - F_n(a) = \lim_n \mu_n[a, b] = \mu[a, b].$$
Now, we can let $a \to -\infty$ in such a way that $a$ is always a continuity point of $F$. Then, we get $\lim_n F_n(b) = \mu(-\infty, b] = F(b)$.
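For example, let $\mu_n$ be the point mass at $1/n$ and $\mu$ the point mass at $0$. Then $F_n(x) \to F(x)$ at every $x \ne 0$, so $\mu_n \Rightarrow \mu$, even though $F_n(0) = 0$ for all $n$ while $F(0) = 1$; this is why convergence is only required at continuity points of $F$.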
The next result shows that weak convergence is actually weak:

Proposition 4.7. Suppose $X_n$ converges in probability to $X$. Then, $X_n$ converges weakly to $X$.

Proof. Let $F_n$, $F$ be the distribution functions of $X_n$, $X$ respectively. Suppose $x$ is a continuity point of $F$. Note that
$$\{X \le x - \varepsilon\} \subseteq \{|X_n - X| \ge \varepsilon\} \cup \{X_n \le x\}$$
and
$$\{X_n \le x\} = \{X_n \le x \text{ and } X \le x + \varepsilon\} \cup \{X_n \le x \text{ and } X > x + \varepsilon\} \subseteq \{X \le x + \varepsilon\} \cup \{|X_n - X| \ge \varepsilon\}.$$
Therefore,
$$P\{X \le x - \varepsilon\} - P\{|X_n - X| \ge \varepsilon\} \le P\{X_n \le x\} \le P\{X \le x + \varepsilon\} + P\{|X_n - X| \ge \varepsilon\}.$$
Since for each $\varepsilon > 0$, $\lim_n P\{|X_n - X| \ge \varepsilon\} = 0$, when we let $n \to \infty$, we have
$$F(x - \varepsilon) \le \liminf_n F_n(x) \le \limsup_n F_n(x) \le F(x + \varepsilon).$$
Finally, since $F$ is continuous at $x$, letting $\varepsilon \to 0$, we have
$$\lim_n F_n(x) = F(x),$$
so that $X_n \Rightarrow X$.
The converse is not true in general. However, if $X$ has a degenerate distribution (takes a single value with probability one), then the converse is true.

Proposition 4.8. Suppose $X_n \Rightarrow X$, and $X$ has a degenerate distribution such that $P\{X = a\} = 1$. Then, $X_n \xrightarrow{P} X$.

Proof. Let $\mu_n$ and $\mu$ be the distributions on $\mathbb{R}$ induced by $X_n$ and $X$ respectively. Given $\varepsilon > 0$, we have
$$\lim_n \mu_n[a - \varepsilon, a + \varepsilon] = \mu[a - \varepsilon, a + \varepsilon] = 1.$$
Hence,
$$\lim_n P\{|X_n - X| \le \varepsilon\} = 1,$$
and so
$$\lim_n P\{|X_n - X| > \varepsilon\} = 0.$$
5 Product Measures and Independence

Suppose $(\Omega_1, \mathcal{F}_1)$ and $(\Omega_2, \mathcal{F}_2)$ are two measurable spaces. We want to construct a product measurable space with sample space $\Omega_1 \times \Omega_2$.

Definition 5.1. Let $\mathcal{A} = \{A \times B : A \in \mathcal{F}_1, B \in \mathcal{F}_2\}$. Let $\mathcal{F}_1 \times \mathcal{F}_2$ be the $\sigma$-field generated by $\mathcal{A}$. $\mathcal{F}_1 \times \mathcal{F}_2$ is called the product $\sigma$-field of $\mathcal{F}_1$ and $\mathcal{F}_2$.

If $P_1$ and $P_2$ are probability measures on the measurable spaces above, then $P_1 \times P_2(A \times B) := P_1(A) P_2(B)$ gives a probability measure on $\mathcal{A}$. This can be extended in a canonical way to the $\sigma$-field $\mathcal{F}_1 \times \mathcal{F}_2$.

Definition 5.2. $P_1 \times P_2$ is called the product probability measure of $P_1$ and $P_2$.
Let $\Omega := \Omega_1 \times \Omega_2$, $\mathcal{F} := \mathcal{F}_1 \times \mathcal{F}_2$, and $P := P_1 \times P_2$.

Note that when calculating integrals with respect to a product probability measure, we can normally perform an iterated integral in any order with respect to the component probability measures. This result is known as Fubini's Theorem.

Before we define a notion of independence, we will give some heuristic considerations. Two events $A$ and $B$ should be independent if $A$ occurring has nothing to do with $B$ occurring. If we denote by $P_A(X)$ the probability that $X$ occurs given that $A$ has occurred, then we see that $P_A(X) = \frac{P(A \cap X)}{P(A)}$. Now, suppose that $A$ and $B$ are indeed independent. This means that $P_A(B) = P(B)$. But then, $P(B) = \frac{P(A \cap B)}{P(A)}$, so that $P(A \cap B) = P(A)P(B)$. This leads us to define,
Definition 5.3. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $A_i \in \mathcal{F}$ for every $i$. Let $X_i$ be a random variable for every $i$.
(i) $A_1, \ldots, A_n$ are independent if $P(A_1 \cap \cdots \cap A_n) = P(A_1) \cdots P(A_n)$.
(ii) A collection of events $\{A_i\}_{i \in I}$ is independent if every finite subcollection is independent.
(iii) $X_1, \ldots, X_n$ are independent if for any $n$ sets $A_1, \ldots, A_n \in \mathcal{B}(\mathbb{R})$, the events $\{X_i \in A_i\}_{i=1}^{n}$ are independent.
(iv) A collection of random variables $\{X_i\}_{i \in I}$ is independent if every finite subcollection is independent.
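For example, consider two independent rolls of a fair die, modelled on the product space $\Omega = \{1, \ldots, 6\} \times \{1, \ldots, 6\}$ with the product of the two uniform measures. The events $A = \{\text{first roll is even}\}$ and $B = \{\text{second roll is } 6\}$ satisfy $P(A \cap B) = \frac{3 \cdot 1}{36} = \frac{1}{12} = \frac{1}{2} \cdot \frac{1}{6} = P(A)P(B)$, so they are independent.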
Lemma 5.4. Suppose $X, Y$ are random variables on $(\Omega, \mathcal{F}, P)$, with induced distributions $\mu, \nu$ on $\mathbb{R}$ respectively. Then, $X$ and $Y$ are independent if and only if the distribution induced on $\mathbb{R}^2$ by $(X, Y)$ is $\mu \times \nu$.

Lemma 5.5. Suppose $X, Y$ are independent random variables, and suppose that $f, g$ are measurable functions. Then, $f(X)$ and $g(Y)$ are also independent random variables.

Proposition 5.6. Let $X, Y$ be independent random variables, and let $f, g$ be measurable functions. Suppose that $E|f(X)|$ and $E|g(Y)|$ are both finite. Then, $E[f(X)g(Y)] = E[f(X)]E[g(Y)]$.

Proof. Let $\mu$ be the distribution on $\mathbb{R}$ induced by $f(X)$, and let $\nu$ be the distribution induced by $g(Y)$. Then, the distribution on $\mathbb{R}^2$ induced by $(f(X), g(Y))$ is $\mu \times \nu$. So,
$$E[f(X)g(Y)] = \int_{\Omega} f(X(\omega))g(Y(\omega))\, dP = \int_{\mathbb{R}} \int_{\mathbb{R}} uv\, d\mu\, d\nu = \int_{\mathbb{R}} u\, d\mu \int_{\mathbb{R}} v\, d\nu = E[f(X)]E[g(Y)].$$
6 Characteristic Functions

The inverse Fourier transform of a probability distribution plays a central role in probability theory.

Definition 6.1. Let $\mu$ be a probability measure on $\mathbb{R}$. Then, the characteristic function of $\mu$ is
$$\varphi_{\mu}(t) = \int_{\mathbb{R}} e^{itx}\, d\mu.$$
If $X$ is a random variable, the characteristic function of the distribution on $\mathbb{R}$ induced by $X$ will sometimes be denoted $\varphi_X$. The following results demonstrate the importance of the characteristic function in probability.
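For example, if $\mu$ is the point mass at $a$ (that is, $\mu\{a\} = 1$), then $\varphi_{\mu}(t) = e^{ita}$; this degenerate case reappears in the proof of the Weak Law of Large Numbers.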
Proposition 6.2. Suppose $\mu$ and $\nu$ are probability measures on $\mathbb{R}$ with characteristic functions $\varphi$ and $\psi$ respectively. Suppose further that for each $t \in \mathbb{R}$, $\varphi(t) = \psi(t)$. Then, $\mu = \nu$.

Theorem 6.3. Let $\mu_n$, $\mu$ be probability measures on $\mathbb{R}$ with distribution functions $F_n$ and $F$ and characteristic functions $\varphi_n$ and $\varphi$. Then, the following are equivalent:
(i) $\mu_n \Rightarrow \mu$.
(ii) For any bounded continuous function $f : \mathbb{R} \to \mathbb{R}$,
$$\lim_n \int_{\mathbb{R}} f(x)\, d\mu_n = \int_{\mathbb{R}} f(x)\, d\mu.$$
(iii) For every $t \in \mathbb{R}$,
$$\lim_n \varphi_n(t) = \varphi(t).$$
Theorem 6.4. Suppose $\mu_n$ is a sequence of probability measures on $\mathbb{R}$, with characteristic functions $\varphi_n$. Suppose that for each $t \in \mathbb{R}$, $\lim_n \varphi_n(t) =: \varphi(t)$ exists and is continuous at $0$. Then, there is a probability distribution $\mu$ such that $\varphi$ is the characteristic function of $\mu$. Furthermore, $\mu_n \Rightarrow \mu$.

Next, we show how to recover the moments of a random variable from its characteristic function.

Definition 6.5. Suppose $X$ is a random variable. Then, the $k$th moment of $X$ is $EX^k$. The $k$th absolute moment of $X$ is $E|X|^k$.

Proposition 6.6. Let $X$ be a random variable. Suppose that the $k$th moment of $X$ exists. Then, the characteristic function of $X$ is $k$ times continuously differentiable, and
$$\varphi^{(k)}(0) = i^k EX^k.$$
Now, a result on affine transforms of a random variable:

Proposition 6.7. Suppose $X$ is a random variable, and $Y = aX + b$. Let $\varphi_X$ and $\varphi_Y$ be the characteristic functions of $X$ and $Y$. Then, $\varphi_Y(t) = e^{itb} \varphi_X(at)$.
We will often be interested in the sums of independent random variables. Suppose that $X$ and $Y$ are independent random variables with induced distributions $\mu$ and $\nu$ on $\mathbb{R}$ respectively. Then, the induced distribution of $(X, Y)$ on $\mathbb{R}^2$ is $\mu \times \nu$. Consider the map $f : \mathbb{R}^2 \to \mathbb{R}$ given by $f(x, y) = x + y$. Then, the distribution on $\mathbb{R}$ induced by $f$ is denoted $\mu * \nu$, and is called the convolution of $\mu$ and $\nu$. $\mu * \nu$ is the distribution of the sum of $X$ and $Y$.

Proposition 6.8. Suppose $X$ and $Y$ are independent random variables with distributions $\mu$ and $\nu$ respectively. Then, $\varphi_{X+Y}(t) = \varphi_X(t) \varphi_Y(t)$.
Proof.
$$\varphi_{\mu * \nu}(t) = \int_{\mathbb{R}} e^{itz}\, d(\mu * \nu) = \int_{\mathbb{R}} \int_{\mathbb{R}} e^{it(x+y)}\, d\mu\, d\nu = \int_{\mathbb{R}} \int_{\mathbb{R}} e^{itx} e^{ity}\, d\mu\, d\nu = \int_{\mathbb{R}} e^{itx}\, d\mu \int_{\mathbb{R}} e^{ity}\, d\nu = \varphi_{\mu}(t)\, \varphi_{\nu}(t).$$
7 Useful Bounds and Inequalities

Here, we will prove some useful bounds regarding random variables and their moments.

Definition 7.1. Let $X$ be a random variable. Then, the variance of $X$ is $\mathrm{Var}(X) := E[(X - EX)^2] = EX^2 - (EX)^2$.

The variance is a measure of how far spread $X$ is on average from its mean. It exists if $X$ has a finite second moment. It is often denoted $\sigma^2$.
Lemma 7.2. Suppose $X, Y$ are independent random variables. Then, $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

Proposition 7.3 (Markov's Inequality). Let $\varepsilon > 0$. Suppose $X$ is a random variable with finite $k$th absolute moment. Then, $P\{|X| \ge \varepsilon\} \le \frac{1}{\varepsilon^k} E|X|^k$.

Proof.
$$P\{|X| \ge \varepsilon\} = \int_{\{|X| \ge \varepsilon\}} dP \le \frac{1}{\varepsilon^k} \int_{\{|X| \ge \varepsilon\}} |X|^k\, dP \le \frac{1}{\varepsilon^k} \int_{\Omega} |X|^k\, dP = \frac{1}{\varepsilon^k} E|X|^k.$$
Corollary 7.4 (Chebyshev's Inequality). Suppose $X$ is a random variable with finite 2nd moment. Then,
$$P\{|X - EX| \ge \varepsilon\} \le \frac{1}{\varepsilon^2} \mathrm{Var}(X).$$
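For instance, taking $\varepsilon = k\sigma$, where $\sigma^2 = \mathrm{Var}(X)$, Chebyshev's inequality gives $P\{|X - EX| \ge k\sigma\} \le 1/k^2$: a random variable with finite variance lies within $3$ standard deviations of its mean with probability at least $8/9$.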
The following is also a useful fact:

Lemma 7.5. Suppose $X$ is a nonnegative random variable. Then,
$$\sum_{m=1}^{\infty} P\{X \ge m\} \le EX.$$
Proof.
$$E\lfloor X \rfloor = \sum_{n=1}^{\infty} n\, P\{n \le X < n + 1\} = \sum_{m=1}^{\infty} \sum_{n=m}^{\infty} P\{n \le X < n + 1\} = \sum_{m=1}^{\infty} P\{X \ge m\},$$
and since $\lfloor X \rfloor \le X$, we have $E\lfloor X \rfloor \le EX$.
8 The Borel-Cantelli Lemma

First, let us introduce some terminology. Let $A_1, A_2, \ldots$ be sets. Then,
$$\limsup_n A_n := \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m.$$
$\limsup_n A_n$ consists of those $\omega$ that appear in $A_n$ infinitely often (i.o.). Also,
$$\liminf_n A_n := \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m.$$
$\liminf_n A_n$ consists of those $\omega$ that appear in all but finitely many $A_n$.
Theorem 8.1 (Borel-Cantelli Lemma). Let $A_1, A_2, \ldots \in \mathcal{F}$. If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(\limsup_n A_n) = 0$. Furthermore, suppose that the $A_i$ are independent. Then, if $\sum_{n=1}^{\infty} P(A_n) = \infty$, then $P(\limsup_n A_n) = 1$.
Proof. Suppose $\sum_{n=1}^{\infty} P(A_n) < \infty$. Then,
$$P\left(\bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m\right) = \lim_n P\left(\bigcup_{m=n}^{\infty} A_m\right) \le \lim_n \sum_{m=n}^{\infty} P(A_m) = 0.$$
For the converse, it is enough to show that
$$P\left(\bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m^c\right) = 0,$$
and so it is also enough to show that
$$P\left(\bigcap_{m=n}^{\infty} A_m^c\right) = 0$$
for all $n$. By independence, and since $1 - x \le e^{-x}$, we have
$$P\left(\bigcap_{m=n}^{\infty} A_m^c\right) \le P\left(\bigcap_{m=n}^{n+k} A_m^c\right) = \prod_{m=n}^{n+k} (1 - P(A_m)) \le \exp\left(-\sum_{m=n}^{n+k} P(A_m)\right).$$
Since the last sum diverges, taking the limit as $k \to \infty$, we get
$$P\left(\bigcap_{m=n}^{\infty} A_m^c\right) = 0.$$
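As an illustration, consider an infinite sequence of independent tosses of a fair coin, and let $A_n$ be the event that the $n$th toss is heads. The $A_n$ are independent and $\sum_n P(A_n) = \sum_n \frac{1}{2} = \infty$, so by the second part of the lemma, heads occur infinitely often with probability one.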
9 The Law of Large Numbers

Let $X_1, X_2, \ldots$ be random variables that are independent and identically distributed (iid). Let $S_n := X_1 + \cdots + X_n$. We will be interested in the asymptotic behavior of the average $\frac{S_n}{n}$. If $X_i$ has a finite expectation, then we would expect $\frac{S_n}{n}$ to settle down to $EX_i$. This is known as the Law of Large Numbers. There are two varieties of this law: the Weak Law of Large Numbers and the Strong Law of Large Numbers. The weak law states that the average converges in probability to $EX_i$. The strong law, however, states that the average converges almost surely to $EX_i$. The strong law is significantly harder to prove, and requires a bit of additional machinery. For the rest of this section, fix a probability space $(\Omega, \mathcal{F}, P)$.
Theorem 9.1 (The Weak Law of Large Numbers). Suppose $X_1, X_2, \ldots$ are iid random variables with mean $EX_i = m < \infty$. Then, $\frac{S_n}{n} \xrightarrow{P} m$.

Proof. Let $\varphi$ be the characteristic function of $X_i$. Then, by Proposition 6.8, the characteristic function of $S_n$ is $[\varphi(t)]^n$. Then, by Proposition 6.7, the characteristic function of $\frac{S_n}{n}$ is $\varphi_n(t) = [\varphi(\frac{t}{n})]^n$. Furthermore, by Proposition 6.6, $\varphi$ is differentiable, and $\varphi'(0) = im$. Therefore, we can form the Taylor expansion
$$\varphi\left(\frac{t}{n}\right) = 1 + \frac{imt}{n} + o\left(\frac{1}{n}\right),$$
and so
$$\varphi_n(t) = \left[1 + \frac{imt}{n} + o\left(\frac{1}{n}\right)\right]^n.$$
Taking the limit as $n \to \infty$, we get
$$\lim_n \varphi_n(t) = e^{imt},$$
which is the characteristic function of the distribution degenerate at $m$. Therefore, by Theorem 6.4, $\frac{S_n}{n}$ converges weakly to this degenerate distribution, and so by Proposition 4.8, $\frac{S_n}{n}$ converges in probability to $m$.
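For example, if the $X_i$ are independent fair coin tosses taking the values $0$ and $1$ with probability $\frac{1}{2}$ each, then $m = \frac{1}{2}$, and the weak law says that for any $\varepsilon > 0$ the probability that the observed fraction of heads $\frac{S_n}{n}$ differs from $\frac{1}{2}$ by more than $\varepsilon$ tends to $0$ as $n \to \infty$.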
Theorem 9.2 (The Strong Law of Large Numbers). Suppose $X_1, X_2, \ldots$ are iid random variables with $EX_i = m < \infty$. Let $S_n = X_1 + \cdots + X_n$. Then, $\frac{S_n}{n}$ converges to $m$ almost surely.
Proof. We can decompose an arbitrary random variable $X_i$ into its positive and negative parts: $X_i^+ := X_i I_{\{X_i \ge 0\}}$ and $X_i^- := -X_i I_{\{X_i < 0\}}$, so that $X_i = X_i^+ - X_i^-$. Then, we have $S_n = X_1^+ + \cdots + X_n^+ - (X_1^- + \cdots + X_n^-) =: S_n^+ - S_n^-$. Hence, it is enough to prove the theorem for nonnegative $X_i$.

Now, let $Y_i := X_i I_{\{X_i \le i\}}$. Let $S^*_n := Y_1 + \cdots + Y_n$. Furthermore, let $\alpha > 1$, and set $u_n := \lfloor \alpha^n \rfloor$. We shall first establish the inequality
$$\sum_{n=1}^{\infty} P\left\{|S^*_{u_n} - ES^*_{u_n}| \ge u_n \varepsilon\right\} < \infty \quad \text{for every } \varepsilon > 0. \qquad (9.1)$$
Since the $X_i$ are independent, we have
$$\mathrm{Var}(S^*_n) = \sum_{k=1}^{n} \mathrm{Var}(Y_k) \le \sum_{k=1}^{n} EY_k^2 \le \sum_{k=1}^{n} E[X_i^2 I_{\{X_i \le k\}}] \le n\, E[X_i^2 I_{\{X_i \le n\}}].$$
By Chebyshev's inequality, we have
$$\sum_{n=1}^{\infty} P\left\{|S^*_{u_n} - ES^*_{u_n}| \ge u_n \varepsilon\right\} \le \sum_{n=1}^{\infty} \frac{\mathrm{Var}(S^*_{u_n})}{\varepsilon^2 u_n^2} \le \frac{1}{\varepsilon^2} \sum_{n=1}^{\infty} \frac{E[X_i^2 I_{\{X_i \le u_n\}}]}{u_n} = \frac{1}{\varepsilon^2} E\left[X_i^2 \sum_{n=1}^{\infty} \frac{1}{u_n} I_{\{X_i \le u_n\}}\right]. \qquad (9.2)$$
Now, let $K := \frac{2\alpha}{\alpha - 1}$. Let $x > 0$, and let $N := \inf\{n : u_n \ge x\}$. Then, $\alpha^N \ge x$. Also, note that $\alpha^n \le 2 u_n$, and so $\frac{1}{u_n} \le \frac{2}{\alpha^n}$. Then,
$$\sum_{u_n \ge x} \frac{1}{u_n} \le 2 \sum_{n \ge N} \frac{1}{\alpha^n} = \frac{2}{\alpha^N} \sum_{n=0}^{\infty} \left(\frac{1}{\alpha}\right)^n = \frac{K}{\alpha^N} \le \frac{K}{x},$$
and hence,
$$\sum_{n=1}^{\infty} \frac{1}{u_n} I_{\{X_i \le u_n\}} \le \frac{K}{X_i} \quad \text{if } X_i > 0,$$
and so, putting this into (9.2), we get
$$\frac{1}{\varepsilon^2} E\left[X_i^2 \sum_{n=1}^{\infty} \frac{1}{u_n} I_{\{X_i \le u_n\}}\right] \le \frac{1}{\varepsilon^2} E\left[X_i^2 \cdot \frac{K}{X_i}\right] = \frac{K}{\varepsilon^2} EX_i < \infty,$$
thereby establishing inequality (9.1).
Therefore, by the Borel-Cantelli Lemma, we have
$$P\left(\limsup_n \left\{|S^*_{u_n} - ES^*_{u_n}| \ge u_n \varepsilon\right\}\right) = 0 \quad \text{for every } \varepsilon > 0.$$
Taking an intersection over all rational $\varepsilon$, we get that
$$\frac{S^*_{u_n} - ES^*_{u_n}}{u_n} \to 0 \quad \text{almost surely}.$$
However, $\frac{1}{n} ES^*_n = \frac{1}{n} \sum_{k=1}^{n} EY_k$, and since $EY_k \to EX_i$ (by monotone convergence), the Cesàro averages converge to the same limit; taking the limit as $n \to \infty$, we have that $\frac{1}{n} ES^*_n \to EX_i$. Therefore, we have that
$$\frac{S^*_{u_n}}{u_n} \to EX_i \quad \text{almost surely}. \qquad (9.3)$$
Now, notice that by Lemma 7.5,
$$\sum_{n=1}^{\infty} P\{X_n \ne Y_n\} = \sum_{n=1}^{\infty} P\{X_i > n\} \le EX_i < \infty.$$
Again, by the Borel-Cantelli Lemma, we have
$$P\left(\limsup_n \{X_n \ne Y_n\}\right) = 0.$$
Therefore, $\frac{S^*_n - S_n}{n} \to 0$ almost surely, and so by (9.3),
$$\frac{S_{u_n}}{u_n} \to EX_i \quad \text{almost surely}. \qquad (9.4)$$
Now, to get that the entire sequence $\frac{S_n}{n} \to EX_i$ almost surely, note that $S_m$ is an increasing sequence. Suppose $u_n \le k \le u_{n+1}$. Then,
$$\frac{u_n}{u_{n+1}} \cdot \frac{S_{u_n}}{u_n} \le \frac{S_k}{k} \le \frac{u_{n+1}}{u_n} \cdot \frac{S_{u_{n+1}}}{u_{n+1}},$$
and so,
$$\frac{1}{\alpha} EX_i \le \liminf_k \frac{S_k}{k} \le \limsup_k \frac{S_k}{k} \le \alpha\, EX_i \quad \text{almost surely}.$$
Taking $\alpha \to 1$, we get by (9.4)
$$\lim_k \frac{S_k}{k} = EX_i \quad \text{almost surely}.$$
10 Conditional Expectation and Probability

Before defining conditional expectation and probability, we will make a few observations about the probabilistic interpretation of $\sigma$-fields.

Consider a process where a random number between zero and one is chosen. More precisely, an outcome $\omega$ is chosen according to some probability law from the set of all possible outcomes, $\Omega = [0, 1)$. We may be able to observe this number to some amount of precision, say up to one digit. The $\sigma$-field that represents this amount of precision is $\mathcal{F}_1 := \sigma\{[0, .1), [.1, .2), \ldots, [.9, 1)\}$. The $\sigma$-field $\mathcal{F}_1$ represents all the information that we can know about $\omega$ by observing it to one digit of precision. That is, an observer who can observe the number to one digit will be able to determine exactly which sets $A \in \mathcal{F}_1$ that $\omega$ belongs to, but he will not be able to give any information more precise than that. Similarly, if we can observe $\omega$ up to $n$ digits of precision, the $\sigma$-field which corresponds to this is
$$\mathcal{F}_n := \sigma\left\{\left[\frac{i}{10^n}, \frac{i+1}{10^n}\right) : 0 \le i < 10^n\right\}.$$
This example illustrates a general concept: the $\sigma$-field that is used represents the amount of information that an observer has about the random process.
Definition 10.1. If $\mathcal{F}$ is a $\sigma$-field, an $\mathcal{F}$-observer is an observer who can determine precisely which sets $A \in \mathcal{F}$ a random outcome $\omega$ belongs to, but has no more information about $\omega$.

Therefore, a $2^{\Omega}$-observer has complete information about the outcome $\omega$, whereas an $\mathcal{F}$-observer has less information. Similarly, if $\mathcal{G} \subseteq \mathcal{F}$, then an $\mathcal{F}$-observer has more information than a $\mathcal{G}$-observer.

Suppose that a random variable $X$ is $\mathcal{F}$-measurable. This means that the preimage of any Borel set under $X$ is in $\mathcal{F}$. Therefore, an $\mathcal{F}$-observer will have complete information about $X$, or any other $\mathcal{F}$-measurable random variable. Note that if $\mathcal{G} \subseteq \mathcal{F}$, a $\mathcal{G}$-measurable function is also $\mathcal{F}$-measurable.
Suppose that $X$ is an $\mathcal{F}$-measurable random variable, and that you are a $\mathcal{G}$-observer. You do not have complete information about $X$. However, given your information $\mathcal{G}$, you would like to make a best guess about the value of $X$. That is, you want to create another random variable, $Y$, that is $\mathcal{G}$-measurable, but which approximates $X$. $Y$ is called the conditional expectation of $X$ with respect to $\mathcal{G}$, and is denoted $E[X|\mathcal{G}]$.

We will require that
$$\int_A X(\omega)\, dP = \int_A E[X|\mathcal{G}](\omega)\, dP \quad \text{for all } A \in \mathcal{G}. \qquad (10.1)$$
Lemma 10.2. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{F}$. Let $P|_{\mathcal{G}}$ denote the restriction of $P$ to $\mathcal{G}$. Suppose $f$ is a $\mathcal{G}$-measurable function and $A \in \mathcal{G}$. Then,
$$\int_A f(\omega)\, dP|_{\mathcal{G}} = \int_A f(\omega)\, dP.$$
Justified by the previous lemma, we will often be sloppy and not explicitly say which $\sigma$-field a particular integral is taken over. In order to prove that a function satisfying (10.1) exists, we will have to discuss the Radon-Nikodym Theorem. First, a definition.
Definition 10.3. A signed measure on a measurable space $(\Omega, \mathcal{F})$ is a function $\nu : \mathcal{F} \to \mathbb{R}$ such that whenever $A_1, A_2, \ldots$ is a finite or countable sequence of disjoint sets in $\mathcal{F}$, we have
$$\nu\left(\bigcup_i A_i\right) = \sum_i \nu(A_i).$$
In particular, we have for a signed measure, $\nu(\emptyset) = 0$. All probability measures are also signed measures. Note that $\nu$ is permitted to take on negative values. However, it is not permitted to take on the values $+\infty$ or $-\infty$.

Definition 10.4. A signed measure $\nu$ on $(\Omega, \mathcal{F})$ is absolutely continuous with respect to a probability measure $P$ if, whenever $P(A) = 0$, we have also $\nu(A) = 0$. This is denoted $\nu \ll P$.
For example, if $f$ is an integrable function with respect to $P$, then $\nu(A) = \int_A f(\omega)\, dP$ is a signed measure that is absolutely continuous with respect to $P$. In fact, all absolutely continuous signed measures arise in this way.

Theorem 10.5 (Radon-Nikodym). Suppose $\nu \ll P$. Then, there is an integrable function $f$ such that
$$\nu(A) = \int_A f(\omega)\, dP. \qquad (10.2)$$
Furthermore, if $f'$ is another function satisfying (10.2), then $f = f'$ $P$-almost-everywhere.

Definition 10.6. The function $f$ in Theorem 10.5 is called the Radon-Nikodym derivative of $\nu$ with respect to $P$. It is denoted $\frac{d\nu}{dP}$.
Note that the Radon-Nikodym derivative is only defined up to equality almost everywhere. We can use the Radon-Nikodym derivative to define the conditional expectation satisfying (10.1).

Definition 10.7. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{F}$. Let $X$ be an $\mathcal{F}$-integrable random variable. Let $\nu$ be the signed measure defined by $\nu(A) = \int_A X(\omega)\, dP$. The conditional expectation of $X$ with respect to $\mathcal{G}$ is
$$E[X|\mathcal{G}] := \frac{d\nu|_{\mathcal{G}}}{dP|_{\mathcal{G}}}.$$
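For example, if $\mathcal{G}$ is generated by a finite partition $A_1, \ldots, A_m$ of $\Omega$ with $P(A_j) > 0$ for each $j$, then $E[X|\mathcal{G}]$ is the $\mathcal{G}$-measurable random variable that is constant on each $A_j$, with value $\frac{1}{P(A_j)} \int_{A_j} X\, dP$ there; one checks directly that this function satisfies (10.1). In the digits-of-precision example above, $E[X|\mathcal{F}_1]$ is obtained by averaging $X$ over each interval $[k/10, (k+1)/10)$.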
We now state some of the elementary properties of conditional expectation.

Lemma 10.8. Let $X$ and $X_i$ be random variables on $(\Omega, \mathcal{F}, P)$. Let $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{F}$.
(i) $E[E[X|\mathcal{G}]] = E[X]$.
(ii) If $X$ is nonnegative, then $E[X|\mathcal{G}]$ is nonnegative almost surely.
(iii) Suppose $a_1, a_2 \in \mathbb{R}$. Then
$$E[a_1 X_1 + a_2 X_2 \,|\, \mathcal{G}] = a_1 E[X_1|\mathcal{G}] + a_2 E[X_2|\mathcal{G}] \quad \text{almost surely}.$$
(iv) $\int |E[X|\mathcal{G}]|\, dP \le \int |X|\, dP$.
(v) If $Y$ is bounded and $\mathcal{G}$-measurable, then $E[XY|\mathcal{G}] = Y E[X|\mathcal{G}]$ almost surely.
(vi) If $\mathcal{G}_2 \subseteq \mathcal{G}_1 \subseteq \mathcal{F}$ are sub-$\sigma$-fields, then $E[X|\mathcal{G}_2] = E[E[X|\mathcal{G}_1]|\mathcal{G}_2]$ almost surely.
As a special case of conditional expectation, we have conditional probability.

Definition 10.9. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $\mathcal{G}$ be a sub-$\sigma$-field of $\mathcal{F}$. Then, the conditional probability of an event $A \in \mathcal{F}$ given $\mathcal{G}$ is $P[A|\mathcal{G}] := E[I_A|\mathcal{G}]$.

$P[A|\mathcal{G}](\omega)$ is also sometimes written $P(\omega, A)$. We now state some of the elementary properties of conditional probability.

Lemma 10.10. The following hold almost surely:
(i) $P(\omega, \Omega) = 1$ and $P(\omega, \emptyset) = 0$.
(ii) For any $A \in \mathcal{F}$, $0 \le P(\omega, A) \le 1$.
(iii) If $A_1, A_2, \ldots$ is a finite or countable sequence of disjoint sets in $\mathcal{F}$, then
$$P\left(\omega, \bigcup_i A_i\right) = \sum_i P(\omega, A_i).$$
(iv) If $A \in \mathcal{G}$, then $P(\omega, A) = I_A(\omega)$.

Lemma 10.10 in particular implies that given $\omega$, $P(\omega, \cdot)$ is a probability measure on $(\Omega, \mathcal{F})$.