
Course: Theory of Probability I


Term: Fall 2013
Instructor: Gordan Zitkovic
Lecture 7
Conditional Expectation
The definition and existence of conditional expectation
For events A, B with $P[B] > 0$, we recall the familiar object

$$P[A|B] = \frac{P[A \cap B]}{P[B]}.$$
We call $P[A|B]$ the conditional probability of A, given B. It is
important to note that the condition $P[B] > 0$ is crucial. When X and Y
are random variables defined on the same probability space, we often
want to give a meaning to the expression $P[X \in A|Y = y]$, even though
it is usually the case that $P[Y = y] = 0$. When the random vector
$(X, Y)$ admits a joint density $f_{X,Y}(x, y)$, and $f_Y(y) > 0$, the concept
of the conditional density $f_{X|Y=y}(x) = f_{X,Y}(x, y)/f_Y(y)$ is introduced,
and the quantity $P[X \in A|Y = y]$ is given meaning via

$$\int_A f_{X|Y=y}(x)\, dx.$$

While this procedure works well in the restrictive case of absolutely
continuous random vectors, we will see how it is encompassed by
the general concept of a conditional expectation. Since a probability is
simply the expectation of an indicator, and expectations are linear, it
will be easier to work with expectations, and no generality will be lost.
The two main conceptual leaps here are: 1) we condition with respect
to a $\sigma$-algebra, and 2) we view the conditional expectation itself as a
random variable. Before we illustrate the concept in discrete time, here
is the definition.
Definition 7.1. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$, and let $X \in \mathcal{L}^1$ be a
random variable. We say that the random variable $\xi$ is (a version of)
the conditional expectation of X with respect to $\mathcal{G}$ - and denote it by
$E[X|\mathcal{G}]$ - if

1. $\xi \in \mathcal{L}^1$,

2. $\xi$ is $\mathcal{G}$-measurable,

3. $E[\xi 1_A] = E[X 1_A]$, for all $A \in \mathcal{G}$.
Example 7.2. Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space where
$\Omega = \{a, b, c, d, e, f\}$, $\mathcal{F} = 2^{\Omega}$ and $P$ is uniform. Let X, Y and Z be random
variables given by (in the obvious notation)

$$X = \begin{pmatrix} a & b & c & d & e & f \\ 1 & 3 & 3 & 5 & 5 & 7 \end{pmatrix}, \quad
Y = \begin{pmatrix} a & b & c & d & e & f \\ 2 & 2 & 1 & 1 & 7 & 7 \end{pmatrix} \quad \text{and} \quad
Z = \begin{pmatrix} a & b & c & d & e & f \\ 3 & 3 & 3 & 3 & 2 & 2 \end{pmatrix}.$$

[Margin figures: plots of $X$, $E[X|\sigma(Y)]$ and $E[X|\sigma(Z)]$.]
We would like to think about $E[X|\mathcal{G}]$ as the average of $X(\omega)$ over all $\omega$
which are consistent with our current information (which is $\mathcal{G}$). For
example, if $\mathcal{G} = \sigma(Y)$, then the information contained in $\mathcal{G}$ is exactly
the information about the exact value of Y. Knowledge of the fact that
$Y = y$ does not necessarily reveal the true $\omega$, but certainly rules out
all those $\omega$ for which $Y(\omega) \neq y$.

In our specific case, if we know that $Y = 2$, then $\omega = a$ or $\omega = b$,
and the expected value of X, given that $Y = 2$, is

$$\tfrac{1}{2} X(a) + \tfrac{1}{2} X(b) = 2.$$
Similarly, this average equals 4 for $Y = 1$, and 6 for $Y = 7$. Let us show
that the random variable $\xi$ defined by this averaging, i.e.,

$$\xi = \begin{pmatrix} a & b & c & d & e & f \\ 2 & 2 & 4 & 4 & 6 & 6 \end{pmatrix},$$

satisfies the definition of $E[X|\sigma(Y)]$, as given above. Integrability
is not an issue (we are on a finite probability space), and it is clear
that $\xi$ is measurable with respect to $\sigma(Y)$. Indeed, the atoms of $\sigma(Y)$
are $\{a, b\}$, $\{c, d\}$ and $\{e, f\}$, and $\xi$ is constant over each one of them.
Finally, we need to check that

$$E[\xi 1_A] = E[X 1_A], \text{ for all } A \in \sigma(Y),$$

which for an atom A translates into

$$\xi(\omega) = \frac{1}{P[A]} E[X 1_A] = \sum_{\omega' \in A} X(\omega')\, P[\{\omega'\}|A], \text{ for all } \omega \in A.$$

The moral of the story is that when A is an atom, part 3. of Definition
7.1 translates into the requirement that $\xi$ be constant on A, with
value equal to the expectation of X over A with respect to the conditional
probability $P[\cdot|A]$. In the general case, when there are no atoms,
3. still makes sense and conveys the same message.
By the way, since the atoms of $\sigma(Z)$ are $\{a, b, c, d\}$ and $\{e, f\}$, it is clear
that

$$E[X|\sigma(Z)](\omega) = \begin{cases} 3, & \omega \in \{a, b, c, d\}, \\ 6, & \omega \in \{e, f\}. \end{cases}$$
Look at the illustrations above and convince yourself that

$$E\big[E[X|\sigma(Y)]\,\big|\,\sigma(Z)\big] = E[X|\sigma(Z)].$$

A general result along the same lines - called the tower property of
conditional expectation - will be stated and proved below.
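For readers who like to compute, here is a small sketch - our own code, not part of the original notes - which implements the averaging-over-atoms recipe of Example 7.2 and checks the identity above; the helper cond_exp and all variable names are ours.

```python
# Conditional expectation on a finite uniform probability space, computed by
# averaging over the atoms of sigma(G); data taken from Example 7.2.

omega = ['a', 'b', 'c', 'd', 'e', 'f']
X = {'a': 1, 'b': 3, 'c': 3, 'd': 5, 'e': 5, 'f': 7}
Y = {'a': 2, 'b': 2, 'c': 1, 'd': 1, 'e': 7, 'f': 7}
Z = {'a': 3, 'b': 3, 'c': 3, 'd': 3, 'e': 2, 'f': 2}

def cond_exp(V, G):
    """E[V | sigma(G)]: average V over the atom of sigma(G) containing each omega."""
    xi = {}
    for w in omega:
        atom = [v for v in omega if G[v] == G[w]]   # atom containing w
        xi[w] = sum(V[v] for v in atom) / len(atom)
    return xi

E_X_Y = cond_exp(X, Y)   # 2.0 on {a,b}, 4.0 on {c,d}, 6.0 on {e,f}
E_X_Z = cond_exp(X, Z)   # 3.0 on {a,b,c,d}, 6.0 on {e,f}

# the identity displayed above: E[ E[X|sigma(Y)] | sigma(Z) ] = E[X|sigma(Z)]
assert cond_exp(E_X_Y, Z) == E_X_Z
```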
Our first task is to prove that conditional expectations always exist.
When $\Omega$ is finite (as explained above) or countable, we can always
construct them by averaging over atoms. In the general case, a different
argument is needed. In fact, here are two:
Proposition 7.3. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Then

1. there exists a conditional expectation $E[X|\mathcal{G}]$ for any $X \in \mathcal{L}^1$, and

2. any two conditional expectations of $X \in \mathcal{L}^1$ are equal P-a.s.
Proof. (Uniqueness): Suppose that $\xi$ and $\xi'$ both satisfy 1., 2. and 3. of
Definition 7.1. Then

$$E[\xi 1_A] = E[\xi' 1_A], \text{ for all } A \in \mathcal{G}.$$

For $A_n = \{\xi - \xi' \geq \tfrac{1}{n}\}$, we have $A_n \in \mathcal{G}$ and so

$$E[\xi' 1_{A_n}] = E[\xi 1_{A_n}] \geq E[(\xi' + \tfrac{1}{n}) 1_{A_n}] = E[\xi' 1_{A_n}] + \tfrac{1}{n} P[A_n].$$

Consequently, $P[A_n] = 0$, for all $n \in \mathbb{N}$, so that $P[\xi > \xi'] = 0$. By a
symmetric argument, we also have $P[\xi < \xi'] = 0$.
(Existence): By linearity, it will be enough to prove that the conditional
expectation exists for $X \in \mathcal{L}^1_+$.

1. A Radon-Nikodym argument. Suppose, first, that $X \geq 0$ and
$E[X] = 1$, as the general case follows by additivity and scaling. Then
the prescription

$$Q[A] = E[X 1_A]$$

defines a probability measure on $(\Omega, \mathcal{F})$, which is absolutely continuous
with respect to P. Let $Q_{\mathcal{G}}$ be the restriction of Q to $\mathcal{G}$; it is trivially
absolutely continuous with respect to the restriction $P_{\mathcal{G}}$ of P to $\mathcal{G}$. The
Radon-Nikodym theorem - applied to the measure space $(\Omega, \mathcal{G}, P_{\mathcal{G}})$
and the measure $Q_{\mathcal{G}} \ll P_{\mathcal{G}}$ - guarantees the existence of the Radon-Nikodym
derivative

$$\xi = \frac{dQ_{\mathcal{G}}}{dP_{\mathcal{G}}} \in \mathcal{L}^1_+(\Omega, \mathcal{G}, P_{\mathcal{G}}).$$

For $A \in \mathcal{G}$, we thus have

$$E[X 1_A] = Q[A] = Q_{\mathcal{G}}[A] = E^{P_{\mathcal{G}}}[\xi 1_A] = E[\xi 1_A],$$
where the last equality follows from the fact that $\xi 1_A$ is $\mathcal{G}$-measurable.
Therefore, $\xi$ is (a version of) the conditional expectation $E[X|\mathcal{G}]$.
2. An $\mathcal{L}^2$-argument. Suppose, first, that $X \in \mathcal{L}^2$. Let $\mathcal{H}$ be the
family of all $\mathcal{G}$-measurable elements in $\mathcal{L}^2$. Let $\bar{\mathcal{H}}$ denote the closure
of $\mathcal{H}$ in the topology induced by $\mathcal{L}^2$-convergence. Being a closed and
convex (why?) subset of $\mathcal{L}^2$, $\bar{\mathcal{H}}$ satisfies all the conditions of Problem
4.9, so that there exists $\xi \in \bar{\mathcal{H}}$ at the minimal $\mathcal{L}^2$-distance from X (when
$X \in \bar{\mathcal{H}}$, we take $\xi = X$). The same problem states that $\xi$ has the
following property:

$$E[(\eta - \xi)(X - \xi)] \leq 0 \text{ for all } \eta \in \bar{\mathcal{H}},$$

and, since $\bar{\mathcal{H}}$ is a linear space, we have

$$E[(\eta - \xi)(X - \xi)] = 0, \text{ for all } \eta \in \bar{\mathcal{H}}.$$

It remains to pick $\eta$ of the form $\eta = \xi + 1_A \in \bar{\mathcal{H}}$, $A \in \mathcal{G}$, to conclude
that

$$E[X 1_A] = E[\xi 1_A], \text{ for all } A \in \mathcal{G}.$$
Our next step is to show that $\xi$ is $\mathcal{G}$-measurable (after a modification
on a null set, perhaps). Since $\xi \in \bar{\mathcal{H}}$, there exists a sequence
$\{\xi_n\}_{n \in \mathbb{N}}$ in $\mathcal{H}$ such that $\xi_n \to \xi$ in $\mathcal{L}^2$. By Corollary 4.14, $\xi_{n_k} \to \xi$, a.s.,
for some subsequence $\{\xi_{n_k}\}_{k \in \mathbb{N}}$ of $\{\xi_n\}_{n \in \mathbb{N}}$. Set

$$\xi' = \liminf_{k} \xi_{n_k} \in \mathcal{L}^0([-\infty, \infty], \mathcal{G})$$

and $\hat{\xi} = \xi' 1_{\{|\xi'| < \infty\}}$, so that $\hat{\xi} = \xi$, a.s., and $\hat{\xi}$ is $\mathcal{G}$-measurable.
We still need to remove the restriction $X \in \mathcal{L}^2_+$. We start with
a general $X \in \mathcal{L}^1_+$ and define $X_n = \min(X, n) \in \mathcal{L}^{\infty}_+ \subseteq \mathcal{L}^2_+$. Let
$\xi_n = E[X_n|\mathcal{G}]$, and note that $E[\xi_{n+1} 1_A] = E[X_{n+1} 1_A] \geq E[X_n 1_A] = E[\xi_n 1_A]$.
It follows (just like in the proof of uniqueness above) that
$\xi_n \leq \xi_{n+1}$, a.s. We define $\xi = \sup_n \xi_n$, so that $\xi_n \nearrow \xi$, a.s. Then, for
$A \in \mathcal{G}$, the monotone-convergence theorem implies that

$$E[X 1_A] = \lim_n E[X_n 1_A] = \lim_n E[\xi_n 1_A] = E[\xi 1_A],$$

and it is easy to check that $\xi 1_{\{\xi < \infty\}} \in \mathcal{L}^1(\mathcal{G})$ is a version of $E[X|\mathcal{G}]$.
Remark 7.4. There is no canonical way to choose "the version" of the
conditional expectation. We follow the convention started with Radon-Nikodym
derivatives, and interpret a statement such as $\xi \leq E[X|\mathcal{G}]$,
a.s., to mean that $\xi \leq \xi'$, a.s., for any version $\xi'$ of the conditional
expectation of X with respect to $\mathcal{G}$.

If we use the symbol $L^1$ to denote the set of all a.s.-equivalence
classes of random variables in $\mathcal{L}^1$, we can write:

$$E[\cdot|\mathcal{G}] : \mathcal{L}^1(\mathcal{F}) \to L^1(\mathcal{G}),$$
but $L^1(\mathcal{G})$ cannot be replaced by $\mathcal{L}^1(\mathcal{G})$ in a natural way. Since $X = X'$,
a.s., implies that $E[X|\mathcal{G}] = E[X'|\mathcal{G}]$, a.s. (why?), we may also consider
conditional expectation as a map from $L^1(\mathcal{F})$ to $L^1(\mathcal{G})$:

$$E[\cdot|\mathcal{G}] : L^1(\mathcal{F}) \to L^1(\mathcal{G}).$$
Properties
Conditional expectation inherits many of the properties of the ordinary
expectation. Here are some familiar and some new ones:
Proposition 7.5. Let $X$, $Y$, $\{X_n\}_{n \in \mathbb{N}}$ be random variables in $\mathcal{L}^1$, and let $\mathcal{G}$
and $\mathcal{H}$ be sub-$\sigma$-algebras of $\mathcal{F}$. Then

1. (linearity) $E[\alpha X + \beta Y|\mathcal{G}] = \alpha E[X|\mathcal{G}] + \beta E[Y|\mathcal{G}]$, a.s., for $\alpha, \beta \in \mathbb{R}$.

2. (monotonicity) $X \leq Y$, a.s., implies $E[X|\mathcal{G}] \leq E[Y|\mathcal{G}]$, a.s.

3. (identity on $L^1(\mathcal{G})$) If X is $\mathcal{G}$-measurable, then $X = E[X|\mathcal{G}]$, a.s. In
particular, $c = E[c|\mathcal{G}]$, for any constant $c \in \mathbb{R}$.

4. (conditional Jensen's inequality) If $\varphi : \mathbb{R} \to \mathbb{R}$ is convex and
$E[|\varphi(X)|] < \infty$, then

$$E[\varphi(X)|\mathcal{G}] \geq \varphi(E[X|\mathcal{G}]), \text{ a.s.}$$

5. ($\mathcal{L}^p$-nonexpansivity) If $X \in \mathcal{L}^p$, for $p \in [1, \infty]$, then $E[X|\mathcal{G}] \in \mathcal{L}^p$ and

$$\|E[X|\mathcal{G}]\|_{\mathcal{L}^p} \leq \|X\|_{\mathcal{L}^p}.$$

In particular,

$$E[|X|\,|\mathcal{G}] \geq |E[X|\mathcal{G}]|, \text{ a.s.}$$

6. (pulling out what's known) If Y is $\mathcal{G}$-measurable and $XY \in \mathcal{L}^1$, then

$$E[XY|\mathcal{G}] = Y\, E[X|\mathcal{G}], \text{ a.s.}$$

7. ($L^2$-projection) If $X \in \mathcal{L}^2$, then $\xi = E[X|\mathcal{G}]$ minimizes $E[(X - \eta)^2]$
over all $\mathcal{G}$-measurable random variables $\eta \in \mathcal{L}^2$.

8. (tower property) If $\mathcal{H} \subseteq \mathcal{G}$, then

$$E\big[E[X|\mathcal{G}]\,\big|\,\mathcal{H}\big] = E[X|\mathcal{H}], \text{ a.s.}$$

9. (irrelevance of independent information) If $\mathcal{H}$ is independent of $\sigma(\mathcal{G}, \sigma(X))$,
then

$$E[X|\sigma(\mathcal{G}, \mathcal{H})] = E[X|\mathcal{G}], \text{ a.s.}$$

In particular, if X is independent of $\mathcal{H}$, then $E[X|\mathcal{H}] = E[X]$, a.s.
10. (conditional monotone-convergence theorem) If $0 \leq X_n \leq X_{n+1}$, a.s., for
all $n \in \mathbb{N}$ and $X_n \to X \in \mathcal{L}^1$, a.s., then

$$E[X_n|\mathcal{G}] \nearrow E[X|\mathcal{G}], \text{ a.s.}$$

11. (conditional Fatou's lemma) If $X_n \geq 0$, a.s., for all $n \in \mathbb{N}$, and
$\liminf_n X_n \in \mathcal{L}^1$, then

$$E[\liminf_n X_n|\mathcal{G}] \leq \liminf_n E[X_n|\mathcal{G}], \text{ a.s.}$$

12. (conditional dominated-convergence theorem) If $|X_n| \leq Z$, for all $n \in \mathbb{N}$
and some $Z \in \mathcal{L}^1$, and if $X_n \to X$, a.s., then

$$E[X_n|\mathcal{G}] \to E[X|\mathcal{G}], \text{ a.s. and in } \mathcal{L}^1.$$
Proof. Note: some of the properties are proved in detail; the others are
only commented upon, since they are either similar to the others or
otherwise not hard.

1. (linearity) $E[(\alpha X + \beta Y)1_A] = E[(\alpha E[X|\mathcal{G}] + \beta E[Y|\mathcal{G}])1_A]$, for $A \in \mathcal{G}$.

2. (monotonicity) Use $A = \{E[X|\mathcal{G}] > E[Y|\mathcal{G}]\} \in \mathcal{G}$ to obtain a
contradiction if $P[A] > 0$.

3. (identity on $L^1(\mathcal{G})$) Check the definition.

4. (conditional Jensen's inequality) Use the result of Lemma 4.16 which
states that $\varphi(x) = \sup_{n \in \mathbb{N}} (a_n + b_n x)$, where $\{a_n\}_{n \in \mathbb{N}}$ and $\{b_n\}_{n \in \mathbb{N}}$
are sequences of real numbers.

5. ($\mathcal{L}^p$-nonexpansivity) For $p \in [1, \infty)$, apply the conditional Jensen
inequality with $\varphi(x) = |x|^p$. The case $p = \infty$ follows directly.

6. (pulling out what's known) For Y $\mathcal{G}$-measurable and $XY \in \mathcal{L}^1$, we need
to show that

$$E[XY 1_A] = E[Y\, E[X|\mathcal{G}]\, 1_A], \text{ for all } A \in \mathcal{G}. \quad (7.1)$$

Let us prove a seemingly less general statement:

$$E[ZX] = E[Z\, E[X|\mathcal{G}]], \text{ for all } \mathcal{G}\text{-measurable } Z \text{ with } ZX \in \mathcal{L}^1. \quad (7.2)$$

The statement (7.1) will follow from it by taking $Z = Y 1_A$. For
$Z = \sum_{k=1}^{n} \alpha_k 1_{A_k}$, (7.2) is a consequence of the definition of
conditional expectation and linearity. Let us assume that both Z and X
are non-negative and $ZX \in \mathcal{L}^1$. In that case we can find a non-decreasing
sequence $\{Z_n\}_{n \in \mathbb{N}}$ of non-negative simple random variables
with $Z_n \nearrow Z$. Then $Z_n X \in \mathcal{L}^1$ for all $n \in \mathbb{N}$ and the monotone
convergence theorem implies that

$$E[ZX] = \lim_n E[Z_n X] = \lim_n E[Z_n\, E[X|\mathcal{G}]] = E[Z\, E[X|\mathcal{G}]].$$
Our next task is to relax the assumption $X \in \mathcal{L}^1_+$ to the original one,
$X \in \mathcal{L}^1$. In that case, the $\mathcal{L}^p$-nonexpansivity for $p = 1$ implies that

$$|E[X|\mathcal{G}]| \leq E[|X|\,|\mathcal{G}], \text{ a.s.},$$

and so

$$|Z_n\, E[X|\mathcal{G}]| \leq Z_n\, E[|X|\,|\mathcal{G}] \leq Z\, E[|X|\,|\mathcal{G}].$$

We know from the previous case that

$$E[Z\, E[|X|\,|\mathcal{G}]] = E[Z\,|X|], \text{ so that } Z\, E[|X|\,|\mathcal{G}] \in \mathcal{L}^1.$$

We can, therefore, use the dominated convergence theorem to conclude
that

$$E[Z\, E[X|\mathcal{G}]] = \lim_n E[Z_n\, E[X|\mathcal{G}]] = \lim_n E[Z_n X] = E[ZX].$$

Finally, the case of a general Z follows by linearity.
7. ($L^2$-projection) It is enough to show that $X - E[X|\mathcal{G}]$ is orthogonal to
every $\mathcal{G}$-measurable $\eta \in \mathcal{L}^2$. For that we simply note that, for $\eta \in \mathcal{L}^2(\mathcal{G})$,
we have

$$E[(X - E[X|\mathcal{G}])\eta] = E[X\eta] - E[\eta\, E[X|\mathcal{G}]] = E[X\eta] - E[E[X\eta|\mathcal{G}]] = 0,$$

where the second equality uses part 6.

8. (tower property) Use the definition.
9. (irrelevance of independent information) We assume $X \geq 0$ and show
that

$$E[X 1_A] = E[E[X|\mathcal{G}]\, 1_A], \text{ for all } A \in \sigma(\mathcal{G}, \mathcal{H}). \quad (7.3)$$

Let $\Lambda$ be the collection of all $A \in \sigma(\mathcal{G}, \mathcal{H})$ such that (7.3) holds. It is
straightforward that $\Lambda$ is a $\lambda$-system, so it will be enough to establish
(7.3) for some $\pi$-system that generates $\sigma(\mathcal{G}, \mathcal{H})$. One possibility
is $\mathcal{P} = \{G \cap H : G \in \mathcal{G}, H \in \mathcal{H}\}$, and for $G \cap H \in \mathcal{P}$ we use the
independence of $1_H$ and $E[X|\mathcal{G}] 1_G$, as well as the independence of $1_H$
and $X 1_G$, to get

$$E[E[X|\mathcal{G}]\, 1_{G \cap H}] = E[E[X|\mathcal{G}]\, 1_G 1_H] = E[E[X|\mathcal{G}]\, 1_G]\, E[1_H]
= E[X 1_G]\, E[1_H] = E[X 1_{G \cap H}]. \quad (7.4)$$
10. (conditional monotone-convergence theorem) By monotonicity, we have
$E[X_n|\mathcal{G}] \nearrow \xi$, a.s., for some $\xi \in \mathcal{L}^0_+(\mathcal{G})$. The monotone convergence
theorem implies that, for each $A \in \mathcal{G}$,

$$E[\xi 1_A] = \lim_n E[1_A\, E[X_n|\mathcal{G}]] = \lim_n E[1_A X_n] = E[1_A X],$$

so that $\xi$ is a version of $E[X|\mathcal{G}]$.
11. (conditional Fatou's lemma) Set $Y_n = \inf_{k \geq n} X_k$, so that $Y_n \nearrow Y = \liminf_k X_k$.
By monotonicity,

$$E[Y_n|\mathcal{G}] \leq \inf_{k \geq n} E[X_k|\mathcal{G}], \text{ a.s.},$$

and the conditional monotone-convergence theorem implies that

$$E[Y|\mathcal{G}] = \lim_n E[Y_n|\mathcal{G}] \leq \liminf_n E[X_n|\mathcal{G}], \text{ a.s.}$$
12. (conditional dominated-convergence theorem) By the conditional Fatou
lemma, we have

$$E[Z + X|\mathcal{G}] \leq \liminf_n E[Z + X_n|\mathcal{G}], \text{ a.s.},$$

as well as

$$E[Z - X|\mathcal{G}] \leq \liminf_n E[Z - X_n|\mathcal{G}], \text{ a.s.},$$

and the a.s.-statement follows.
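The finite-space picture of Example 7.2 also gives a cheap way to sanity-check some of these properties numerically. The sketch below is our own code (the helper cond_exp repeats the averaging-over-atoms recipe from the earlier sketch) and checks the conditional Jensen inequality and "pulling out what's known":

```python
# Check properties 4. and 6. of Proposition 7.5 on the data of Example 7.2.

omega = ['a', 'b', 'c', 'd', 'e', 'f']
X = {'a': 1, 'b': 3, 'c': 3, 'd': 5, 'e': 5, 'f': 7}
Y = {'a': 2, 'b': 2, 'c': 1, 'd': 1, 'e': 7, 'f': 7}

def cond_exp(V, G):
    """E[V | sigma(G)] on the uniform finite space: average over atoms of sigma(G)."""
    return {w: sum(V[v] for v in omega if G[v] == G[w]) /
               sum(1 for v in omega if G[v] == G[w]) for w in omega}

E_X_Y = cond_exp(X, Y)

# 6. pulling out what's known: E[XY | sigma(Y)] = Y * E[X | sigma(Y)]
XY = {w: X[w] * Y[w] for w in omega}
assert cond_exp(XY, Y) == {w: Y[w] * E_X_Y[w] for w in omega}

# 4. conditional Jensen with phi(x) = x^2: E[X^2 | sigma(Y)] >= (E[X | sigma(Y)])^2
X_sq = {w: X[w] ** 2 for w in omega}
assert all(cond_exp(X_sq, Y)[w] >= E_X_Y[w] ** 2 for w in omega)
```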
Problem 7.1.

1. Show that the condition $\mathcal{H} \subseteq \mathcal{G}$ is necessary for the tower property
to hold in general. (Hint: take $\Omega = \{a, b, c\}$.)

2. For $X, Y \in \mathcal{L}^2$ and a sub-$\sigma$-algebra $\mathcal{G}$ of $\mathcal{F}$, show that the following
self-adjointness property holds:

$$E[X\, E[Y|\mathcal{G}]] = E[E[X|\mathcal{G}]\, Y] = E[E[X|\mathcal{G}]\, E[Y|\mathcal{G}]].$$

3. Let $\mathcal{H}$ and $\mathcal{G}$ be two sub-$\sigma$-algebras of $\mathcal{F}$. Is it true that $\mathcal{H} = \mathcal{G}$ if and
only if $E[X|\mathcal{G}] = E[X|\mathcal{H}]$, a.s., for all $X \in \mathcal{L}^1$?

4. Construct two random variables X and Y in $\mathcal{L}^1$ such that $E[X|\sigma(Y)] = E[X]$,
a.s., but X and Y are not independent.
Regular conditional distributions
Once we have the notion of conditional expectation defined and analyzed,
we can use it to define other, related, conditional quantities.
The most important of these is the conditional probability:
Definition 7.6. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. The conditional probability
of $A \in \mathcal{F}$, given $\mathcal{G}$ - denoted by $P[A|\mathcal{G}]$ - is defined by

$$P[A|\mathcal{G}] = E[1_A|\mathcal{G}].$$
It is clear (from the conditional version of the monotone-convergence
theorem) that

$$P\Big[\bigcup_{n \in \mathbb{N}} A_n \Big|\, \mathcal{G}\Big] = \sum_{n \in \mathbb{N}} P[A_n|\mathcal{G}], \text{ a.s.}, \quad (7.5)$$

for pairwise-disjoint $\{A_n\}_{n \in \mathbb{N}}$. We can, therefore, think of the conditional
probability as a countably-additive map from events to (equivalence classes
of) random variables, $A \mapsto P[A|\mathcal{G}]$. In fact, this map has the structure of a
vector measure:
Definition 7.7. Let $(B, \|\cdot\|)$ be a Banach space, and let $(S, \mathcal{S})$ be a
measurable space. A map $\mu : \mathcal{S} \to B$ is called a vector measure if

1. $\mu(\emptyset) = 0$, and

2. for each pairwise-disjoint sequence $\{A_n\}_{n \in \mathbb{N}}$ in $\mathcal{S}$, we have

$$\mu\Big(\bigcup_n A_n\Big) = \sum_{n \in \mathbb{N}} \mu(A_n)$$

(where the series in B converges absolutely).
Proposition 7.8. The conditional probability $A \mapsto P[A|\mathcal{G}] \in L^1$ is a vector
measure with values in $B = L^1$.
Proof. Clearly $P[\emptyset|\mathcal{G}] = 0$, a.s. Let $\{A_n\}_{n \in \mathbb{N}}$ be a pairwise-disjoint
sequence in $\mathcal{F}$. Then

$$\big\|P[A_n|\mathcal{G}]\big\|_{L^1} = E\big[\,|E[1_{A_n}|\mathcal{G}]|\,\big] = E[1_{A_n}] = P[A_n],$$

and so

$$\sum_{n \in \mathbb{N}} \big\|P[A_n|\mathcal{G}]\big\|_{L^1} = \sum_{n \in \mathbb{N}} P[A_n] = P\Big[\bigcup_n A_n\Big] \leq 1 < \infty,$$

which implies that $\sum_{n \in \mathbb{N}} P[A_n|\mathcal{G}]$ converges absolutely in $L^1$. Finally,
for $A = \bigcup_{n \in \mathbb{N}} A_n$, we have

$$\Big\|P[A|\mathcal{G}] - \sum_{n=1}^{N} P[A_n|\mathcal{G}]\Big\|_{L^1} = \Big\|E\Big[\sum_{n=N+1}^{\infty} 1_{A_n} \Big|\, \mathcal{G}\Big]\Big\|_{L^1}
= P\Big[\bigcup_{n=N+1}^{\infty} A_n\Big] \to 0 \text{ as } N \to \infty.$$
It is tempting to try to interpret the map $A \mapsto P[A|\mathcal{G}](\omega)$ as a
probability measure for a fixed $\omega$. It will not work in general; the
reason is that $P[A|\mathcal{G}]$ is defined only a.s., and the exceptional sets pile
up when uncountable families of events A are considered. Even if we
fixed versions $P[A|\mathcal{G}] \in \mathcal{L}^0_+$, for each $A \in \mathcal{F}$, the countable-additivity
relation (7.5) holds only almost surely, so there is no guarantee that, for
a fixed $\omega$, $P[\bigcup_{n \in \mathbb{N}} A_n|\mathcal{G}](\omega) = \sum_{n \in \mathbb{N}} P[A_n|\mathcal{G}](\omega)$, for all pairwise-disjoint
sequences $\{A_n\}_{n \in \mathbb{N}}$ in $\mathcal{F}$.
There is a way out of this predicament in certain situations, and
we start with a description of an abstract object that corresponds to a
well-behaved conditional probability:
Definition 7.9. Let $(R, \mathcal{R})$ and $(S, \mathcal{S})$ be measurable spaces. A map
$\nu : R \times \mathcal{S} \to [0, \infty]$ is called a (measurable) kernel if

1. $x \mapsto \nu(x, B)$ is $\mathcal{R}$-measurable for each $B \in \mathcal{S}$, and

2. $B \mapsto \nu(x, B)$ is a measure on $\mathcal{S}$ for each $x \in R$.
Definition 7.10. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$, let $(S, \mathcal{S})$ be a measurable
space, and let $e : \Omega \to S$ be a random element in S. A kernel
$\mu_{e|\mathcal{G}} : \Omega \times \mathcal{S} \to [0, 1]$ is called the regular conditional distribution of
e, given $\mathcal{G}$, if

$$\mu_{e|\mathcal{G}}(\omega, B) = P[e \in B|\mathcal{G}](\omega), \text{ a.s., for all } B \in \mathcal{S}.$$
Remark 7.11.

1. When $(S, \mathcal{S}) = (\Omega, \mathcal{F})$ and $e(\omega) = \omega$, the regular conditional
distribution of e (if it exists) is called the regular conditional probability.
Indeed, in this case, $\mu_{e|\mathcal{G}}(\omega, B) = P[e \in B|\mathcal{G}] = P[B|\mathcal{G}]$, a.s.

2. It can be shown that regular conditional distributions need not
exist in general if S is too large.
When $(S, \mathcal{S})$ is small enough, however, regular conditional distributions
can be constructed. Here is what we mean by "small enough":
Definition 7.12. A measurable space $(S, \mathcal{S})$ is said to be a Borel space
(or a nice space) if it is isomorphic to a Borel subset of $\mathbb{R}$, i.e., if there
exists a one-to-one map $\varphi : S \to \mathbb{R}$ such that both $\varphi$ and $\varphi^{-1}$ are
measurable.
Problem 7.2. Show that $\mathbb{R}^n$, $n \in \mathbb{N}$ (together with their Borel $\sigma$-algebras)
are Borel spaces. (Hint: show, first, that there is a measurable bijection
$\varphi : [0, 1] \to [0, 1] \times [0, 1]$ such that $\varphi^{-1}$ is also measurable. Use
binary (or decimal, or ...) expansions.)
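The digit-interleaving idea behind the hint can be prototyped in a few lines. The sketch below is ours, not the notes'; it works at a fixed finite precision and ignores the measure-zero complications caused by non-unique binary expansions, which a full solution has to handle:

```python
# Split x in [0,1) into two coordinates using its even- and odd-indexed binary
# digits; merge_digits inverts the construction (up to n_digits of precision).

def split_digits(x, n_digits=52):
    u = v = 0.0
    pu = pv = 0.5
    for k in range(n_digits):
        x *= 2
        bit, x = int(x), x - int(x)          # next binary digit of the input
        if k % 2 == 0:
            u += bit * pu; pu /= 2           # even-indexed digits build u
        else:
            v += bit * pv; pv /= 2           # odd-indexed digits build v
    return u, v

def merge_digits(u, v, n_digits=52):
    x, p = 0.0, 0.5
    for k in range(n_digits):
        if k % 2 == 0:
            u *= 2; bit, u = int(u), u - int(u)
        else:
            v *= 2; bit, v = int(v), v - int(v)
        x += bit * p; p /= 2                 # interleave the digits back
    return x

u, v = split_digits(0.625)                   # 0.625 = 0.101 in binary
assert abs(merge_digits(u, v) - 0.625) < 1e-12
```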
Remark 7.13. It can be shown that any Borel subset of any complete and
separable metric space is a Borel space. In particular, the coin-toss
space is a Borel space.
Proposition 7.14. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$, and let $(S, \mathcal{S})$ be a Borel
space. Any random element $e : \Omega \to S$ admits a regular conditional
distribution.
Proof. Let us first deal with the case $S = \mathbb{R}$, so that $e = X$ is a random
variable. Let Q be a countable dense set in $\mathbb{R}$. For $q \in Q$, consider the
random variable $P_q$, defined as an arbitrary version of

$$P_q = P[X \leq q|\mathcal{G}].$$

By redefining each $P_q$ on a null set (and aggregating the countably
many null sets - one for each $q \in Q$), we may suppose that $P_q(\omega) \leq P_r(\omega)$,
for $q \leq r$, $q, r \in Q$, for all $\omega \in \Omega$, and that $\lim_{q \to \infty} P_q(\omega) = 1$
and $\lim_{q \to -\infty} P_q(\omega) = 0$, for all $\omega \in \Omega$. For $x \in \mathbb{R}$, we set

$$F(\omega, x) = \inf_{q \in Q,\, q > x} P_q(\omega),$$

so that, for each $\omega$, $F(\omega, \cdot)$ is a right-continuous non-decreasing
function from $\mathbb{R}$ to $[0, 1]$ which satisfies $\lim_{x \to \infty} F(\omega, x) = 1$ and
$\lim_{x \to -\infty} F(\omega, x) = 0$, for all $\omega \in \Omega$. Moreover, as an infimum of
countably many random variables, the map $\omega \mapsto F(\omega, x)$ is a random
variable for each $x \in \mathbb{R}$.

By (the proof of) Proposition 6.24, for each $\omega \in \Omega$, there exists a
unique probability measure $\mu_{e|\mathcal{G}}(\omega, \cdot)$ on $\mathbb{R}$ such that
$\mu_{e|\mathcal{G}}(\omega, (-\infty, x]) = F(\omega, x)$, for all $x \in \mathbb{R}$. Let $\Lambda$ denote the set of all
$B \in \mathcal{B}(\mathbb{R})$ such that

1. $\omega \mapsto \mu_{e|\mathcal{G}}(\omega, B)$ is a random variable, and

2. $\mu_{e|\mathcal{G}}(\cdot, B)$ is a version of $P[X \in B|\mathcal{G}]$.

It is not hard to check that $\Lambda$ is a $\lambda$-system, so we need to prove that
1. and 2. hold for all B in some $\pi$-system which generates $\mathcal{B}(\mathbb{R})$. A
convenient $\pi$-system to use is $\mathcal{P} = \{(-\infty, x] : x \in \mathbb{R}\}$. For
$B = (-\infty, x] \in \mathcal{P}$, we have $\mu_{e|\mathcal{G}}(\omega, B) = F(\omega, x)$, so that 1. holds. To check
2., we need to show that $F(\cdot, x) = P[X \leq x|\mathcal{G}]$, a.s. This follows from
the fact that

$$F(\cdot, x) = \inf_{q > x} P_q = \lim_{q \searrow x} P_q = \lim_{q \searrow x} P[X \leq q|\mathcal{G}] = P[X \leq x|\mathcal{G}], \text{ a.s.,}$$

by the conditional dominated convergence theorem.
Turning to the case of a general random element e which takes
values in a Borel space $(S, \mathcal{S})$, we pick a one-to-one measurable map
$\varphi : S \to \mathbb{R}$ whose inverse $\varphi^{-1}$ is also measurable. Then $X = \varphi(e)$ is
a random variable, and so, by the above, there exists a kernel
$\mu_{X|\mathcal{G}} : \Omega \times \mathcal{B}(\mathbb{R}) \to [0, 1]$ such that

$$\mu_{X|\mathcal{G}}(\omega, A) = P[\varphi(e) \in A|\mathcal{G}], \text{ a.s.}$$

We define the kernel $\mu_{e|\mathcal{G}} : \Omega \times \mathcal{S} \to [0, 1]$ by

$$\mu_{e|\mathcal{G}}(\omega, B) = \mu_{X|\mathcal{G}}(\omega, \varphi(B)).$$
Then $\mu_{e|\mathcal{G}}(\cdot, B)$ is a random variable for each $B \in \mathcal{S}$, and for a pairwise-disjoint
sequence $\{B_n\}_{n \in \mathbb{N}}$ in $\mathcal{S}$ we have

$$\mu_{e|\mathcal{G}}\Big(\omega, \bigcup_n B_n\Big) = \mu_{X|\mathcal{G}}\Big(\omega, \varphi\Big(\bigcup_n B_n\Big)\Big)
= \mu_{X|\mathcal{G}}\Big(\omega, \bigcup_n \varphi(B_n)\Big)
= \sum_{n \in \mathbb{N}} \mu_{X|\mathcal{G}}(\omega, \varphi(B_n)) = \sum_{n \in \mathbb{N}} \mu_{e|\mathcal{G}}(\omega, B_n),$$
which shows that $\mu_{e|\mathcal{G}}$ is a kernel; we used the measurability of $\varphi^{-1}$
to conclude that $\varphi(B_n) \in \mathcal{B}(\mathbb{R})$, and the injectivity of $\varphi$ to ensure
that $\{\varphi(B_n)\}_{n \in \mathbb{N}}$ is pairwise disjoint. Finally, we need to show that
$\mu_{e|\mathcal{G}}(\cdot, B)$ is a version of the conditional probability $P[e \in B|\mathcal{G}]$. By the
injectivity of $\varphi$, we have

$$P[e \in B|\mathcal{G}] = P[\varphi(e) \in \varphi(B)|\mathcal{G}] = \mu_{X|\mathcal{G}}(\cdot, \varphi(B)) = \mu_{e|\mathcal{G}}(\cdot, B), \text{ a.s.}$$
Remark 7.15. Note that the conditional distribution, even in its regular
version, is not unique in general. Indeed, we can redefine it arbitrarily
(as long as it remains a kernel) on a set of the form $N \times \mathcal{S}$, where
$P[N] = 0$, without changing any of its defining properties. This will, in
these notes, never be an issue.
One of the many reasons why regular conditional distributions are
useful is that they sometimes allow "non-conditional" thinking to be
transferred to the conditional case:
Proposition 7.16. Let X be an $\mathbb{R}^n$-valued random vector, let $\mathcal{G}$ be a
sub-$\sigma$-algebra of $\mathcal{F}$, and let $g : \mathbb{R}^n \to \mathbb{R}$ be a Borel function with the property
$g(X) \in \mathcal{L}^1$. Then $\omega \mapsto \int_{\mathbb{R}^n} g(x)\, \mu_{X|\mathcal{G}}(\omega, dx)$ is a $\mathcal{G}$-measurable random
variable and

$$E[g(X)|\mathcal{G}] = \int_{\mathbb{R}^n} g(x)\, \mu_{X|\mathcal{G}}(\omega, dx), \text{ a.s.}$$
Proof. When $g = 1_B$, for $B \in \mathcal{B}(\mathbb{R}^n)$, the statement follows by the very
definition of the regular conditional distribution. For the general case,
we simply use the standard machine.
Just like we sometimes express the distribution of a random variable
or a vector in terms of its density, cdf or characteristic function,
we can talk about the conditional density, the conditional cdf or the
conditional characteristic function. All of these correspond to the case
covered in Proposition 7.14, and all conditional distributions will be
assumed to be regular. For $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, $y \leq x$
means $y_1 \leq x_1, \ldots, y_n \leq x_n$.
Definition 7.17. Let $X : \Omega \to \mathbb{R}^n$ be a random vector, let $\mathcal{G}$ be a
sub-$\sigma$-algebra of $\mathcal{F}$, and let $\mu_{X|\mathcal{G}} : \Omega \times \mathcal{B}(\mathbb{R}^n) \to [0, 1]$ be the regular
conditional distribution of X given $\mathcal{G}$.

1. The (regular) conditional cdf of X, given $\mathcal{G}$, is the map
$F : \Omega \times \mathbb{R}^n \to [0, 1]$ given by

$$F(\omega, x) = \mu_{X|\mathcal{G}}(\omega, \{y \in \mathbb{R}^n : y \leq x\}), \text{ for } x \in \mathbb{R}^n.$$

2. A map $f_{X|\mathcal{G}} : \Omega \times \mathbb{R}^n \to [0, \infty)$ is called the conditional density of
X with respect to $\mathcal{G}$ if
(a) $f_{X|\mathcal{G}}(\omega, \cdot)$ is Borel measurable for all $\omega \in \Omega$,

(b) $\omega \mapsto f_{X|\mathcal{G}}(\omega, x)$ is $\mathcal{G}$-measurable for each $x \in \mathbb{R}^n$, and

(c) $\int_B f_{X|\mathcal{G}}(\omega, x)\, dx = \mu_{X|\mathcal{G}}(\omega, B)$, for all $\omega \in \Omega$ and all $B \in \mathcal{B}(\mathbb{R}^n)$.

3. The conditional characteristic function of X, given $\mathcal{G}$, is the map
$\varphi_{X|\mathcal{G}} : \Omega \times \mathbb{R}^n \to \mathbb{C}$ given by

$$\varphi_{X|\mathcal{G}}(\omega, t) = \int_{\mathbb{R}^n} e^{i t \cdot x}\, \mu_{X|\mathcal{G}}(\omega, dx), \text{ for } t \in \mathbb{R}^n \text{ and } \omega \in \Omega.$$
To illustrate the utility of the above concepts, here is a versatile
result (see Example 7.20 below):
Proposition 7.18. Let X be a random vector in $\mathbb{R}^n$, and let $\mathcal{G}$ be a
sub-$\sigma$-algebra of $\mathcal{F}$. The following two statements are equivalent:

1. There exists a (deterministic) function $\varphi : \mathbb{R}^n \to \mathbb{C}$ such that for
P-almost all $\omega \in \Omega$,

$$\varphi_{X|\mathcal{G}}(\omega, t) = \varphi(t), \text{ for all } t \in \mathbb{R}^n.$$

2. $\sigma(X)$ is independent of $\mathcal{G}$.

Moreover, whenever the two equivalent statements hold, $\varphi$ is the characteristic
function of X.
Proof. 1. $\Rightarrow$ 2. By Proposition 7.16, we have $\varphi_{X|\mathcal{G}}(\omega, t) = E[e^{i t \cdot X}|\mathcal{G}]$,
a.s. If we replace $\varphi_{X|\mathcal{G}}$ by $\varphi$, multiply both sides by a bounded
$\mathcal{G}$-measurable random variable Y and take expectations, we get

$$\varphi(t)\, E[Y] = E[Y e^{i t \cdot X}].$$

In particular, for $Y = 1$ we get $\varphi(t) = E[e^{i t \cdot X}]$, so that

$$E[Y e^{i t \cdot X}] = E[Y]\, E[e^{i t \cdot X}], \quad (7.6)$$

for all $\mathcal{G}$-measurable and bounded Y, and all $t \in \mathbb{R}^n$. For Y of the
form $Y = e^{i s \cdot Z}$, where Z is a $\mathcal{G}$-measurable random vector, relation
(7.6) and (a minimal extension of) part 1. of Problem 8.7 imply
that X and Z are independent. Since Z is arbitrary and $\mathcal{G}$-measurable,
X and $\mathcal{G}$ are independent.

2. $\Rightarrow$ 1. If $\sigma(X)$ is independent of $\mathcal{G}$, so is $e^{i t \cdot X}$, and so the
irrelevance-of-independent-information property of conditional expectation
implies that

$$\varphi(t) = E[e^{i t \cdot X}] = E[e^{i t \cdot X}|\mathcal{G}] = \varphi_{X|\mathcal{G}}(\omega, t), \text{ a.s.}$$
One of the most important cases used in practice is when a random
vector $(X_1, \ldots, X_n)$ admits a density and we condition on the $\sigma$-algebra
generated by several of its components. To make the notation more
intuitive, we denote the first d components $(X_1, \ldots, X_d)$ by $X^o$ (for
"observed") and the remaining $n - d$ components $(X_{d+1}, \ldots, X_n)$ by $X^u$
(for "unobserved").
Proposition 7.19. Suppose that the random vector

$$X = (X^o, X^u) = (\underbrace{X_1, \ldots, X_d}_{X^o}, \underbrace{X_{d+1}, \ldots, X_n}_{X^u})$$

admits a density $f_X : \mathbb{R}^n \to [0, \infty)$ and that the $\sigma$-algebra $\mathcal{G} = \sigma(X^o)$ is
generated by the random vector $X^o = (X_1, \ldots, X_d)$, for some
$d \in \{1, \ldots, n-1\}$. Then, for $X^u = (X_{d+1}, \ldots, X_n)$, there exists a conditional
density $f_{X^u|\mathcal{G}} : \Omega \times \mathbb{R}^{n-d} \to [0, \infty)$ of $X^u$ given $\mathcal{G}$, and (a version of) it
is given by

$$f_{X^u|\mathcal{G}}(\omega, x^u) = \begin{cases}
\dfrac{f_X(X^o(\omega), x^u)}{\int_{\mathbb{R}^{n-d}} f_X(X^o(\omega), y)\, dy}, & \int_{\mathbb{R}^{n-d}} f_X(X^o(\omega), y)\, dy > 0, \\[2ex]
f_0(x^u), & \text{otherwise},
\end{cases}$$

for $x^u \in \mathbb{R}^{n-d}$ and $\omega \in \Omega$, where $f_0 : \mathbb{R}^{n-d} \to [0, \infty)$ is an arbitrary
density function.
Proof. First, we note that $f_{X^u|\mathcal{G}}$ is constructed from the jointly Borel-measurable
function $f_X$ and the random vector $X^o$ in an elementary
way, and is, thus, jointly measurable in $\mathcal{G} \otimes \mathcal{B}(\mathbb{R}^{n-d})$. It remains to
show that $\int_A f_{X^u|\mathcal{G}}(\omega, x^u)\, dx^u$ is a version of $P[X^u \in A|\mathcal{G}]$, for all
$A \in \mathcal{B}(\mathbb{R}^{n-d})$.
Equivalently, we need to show that

$$E\Big[1_{\{X^o \in A^o\}} \int_{A^u} f_{X^u|\mathcal{G}}(\omega, x^u)\, dx^u\Big] = E\big[1_{\{X^o \in A^o\}}\, 1_{\{X^u \in A^u\}}\big],$$

for all $A^o \in \mathcal{B}(\mathbb{R}^d)$ and $A^u \in \mathcal{B}(\mathbb{R}^{n-d})$.
Fubini's theorem, and the fact that $f_{X^o}(x^o) = \int_{\mathbb{R}^{n-d}} f_X(x^o, y)\, dy$ is
the density of $X^o$, yield

$$E\Big[1_{\{X^o \in A^o\}} \int_{A^u} f_{X^u|\mathcal{G}}(\omega, x^u)\, dx^u\Big]
= \int_{A^u} E\big[1_{\{X^o \in A^o\}}\, f_{X^u|\mathcal{G}}(\omega, x^u)\big]\, dx^u$$
$$= \int_{A^u} \int_{A^o} f_{X^u|\mathcal{G}}(x^o, x^u)\, f_{X^o}(x^o)\, dx^o\, dx^u
= \int_{A^u} \int_{A^o} f_X(x^o, x^u)\, dx^o\, dx^u = P[X^o \in A^o, X^u \in A^u].$$
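To see the formula of Proposition 7.19 in action, here is a small numerical sanity check on a toy example of our own (nothing here comes from the notes): for the joint density $f_X(x, y) = x + y$ on $[0,1]^2$, the recipe gives the conditional density $f(y\,|\,X^o = x) = (x + y)/(x + \tfrac{1}{2})$.

```python
import numpy as np

f_joint = lambda x, y: x + y                  # a genuine density on [0,1]^2
ys = np.linspace(0.0, 1.0, 10001)

def integrate(vals, grid):
    """Trapezoidal rule, kept explicit to avoid version-specific numpy helpers."""
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)))

def f_cond(x, y):
    marginal = integrate(f_joint(x, ys), ys)   # f_{X^o}(x); positive on (0,1] here
    return f_joint(x, y) / marginal            # first branch of Proposition 7.19

x0 = 0.3
# the conditional density integrates to 1 ...
assert abs(integrate(f_cond(x0, ys), ys) - 1.0) < 1e-9
# ... and agrees with the closed form (x + y)/(x + 1/2)
assert np.allclose(f_cond(x0, ys), (x0 + ys) / (x0 + 0.5))
```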
The above result expresses a conditional density, given $\mathcal{G} = \sigma(X^o)$,
as a (deterministic) function of $X^o$. Such a representation is possible
even when there is no joint density. The core of the argument is
contained in the following problem:
Problem 7.3. Let X be a random vector in $\mathbb{R}^d$, and let $\mathcal{G} = \sigma(X)$
be the $\sigma$-algebra generated by X. Show that a random variable Z is
$\mathcal{G}$-measurable if and only if there exists a Borel function $f : \mathbb{R}^d \to \mathbb{R}$
with the property that $Z = f(X)$.
Let $X^o$ be a random vector in $\mathbb{R}^d$. For $X \in \mathcal{L}^1$, the conditional
expectation $E[X|\sigma(X^o)]$ is $\sigma(X^o)$-measurable, so there exists a Borel
function $f : \mathbb{R}^d \to \mathbb{R}$ such that $E[X|\sigma(X^o)] = f(X^o)$, a.s. Note that f is
uniquely defined only up to $\mu_{X^o}$-null sets. The value $f(x^o)$ at $x^o \in \mathbb{R}^d$
is usually denoted by $E[X|X^o = x^o]$.
Example 7.20 (Conditioning normals on their components). Let
$X = (X^o, X^u) \in \mathbb{R}^d \times \mathbb{R}^{n-d}$ be a multivariate normal random vector with
mean $\mu = (\mu^o, \mu^u)$ and variance-covariance matrix $\Sigma = E[\tilde{X} \tilde{X}^T]$,
where $\tilde{X} = X - \mu$. A block form of the matrix $\Sigma$ is given by

$$\Sigma = \begin{pmatrix} \Sigma_{oo} & \Sigma_{ou} \\ \Sigma_{uo} & \Sigma_{uu} \end{pmatrix},$$

where

$$\Sigma_{oo} = E[\tilde{X}^o (\tilde{X}^o)^T] \in \mathbb{R}^{d \times d}, \qquad
\Sigma_{ou} = E[\tilde{X}^o (\tilde{X}^u)^T] \in \mathbb{R}^{d \times (n-d)},$$
$$\Sigma_{uo} = E[\tilde{X}^u (\tilde{X}^o)^T] \in \mathbb{R}^{(n-d) \times d}, \qquad
\Sigma_{uu} = E[\tilde{X}^u (\tilde{X}^u)^T] \in \mathbb{R}^{(n-d) \times (n-d)}.$$
We assume that $\Sigma_{oo}$ is invertible; otherwise, we can find a subset of
components of $X^o$ whose variance-covariance matrix is invertible and
which generate the same $\sigma$-algebra (why?). The matrix $A = \Sigma_{uo} \Sigma_{oo}^{-1}$
has the property that $E[(\tilde{X}^u - A \tilde{X}^o)(\tilde{X}^o)^T] = 0$, i.e., the random
vectors $\tilde{X}^u - A \tilde{X}^o$ and $\tilde{X}^o$ are uncorrelated. We know, however, that
$\tilde{X} = (\tilde{X}^o, \tilde{X}^u)$ is a Gaussian random vector, so, by Problem 8.7, part 3.,
$\tilde{X}^u - A \tilde{X}^o$ is independent of $\tilde{X}^o$. It follows from Proposition 7.18 that
the conditional characteristic function of $\tilde{X}^u - A \tilde{X}^o$, given $\mathcal{G} = \sigma(X^o)$,
is deterministic and given by

$$E[e^{i t \cdot (\tilde{X}^u - A \tilde{X}^o)}|\mathcal{G}] = \varphi_{\tilde{X}^u - A \tilde{X}^o}(t), \text{ for } t \in \mathbb{R}^{n-d}.$$
Since $A \tilde{X}^o$ is $\mathcal{G}$-measurable, we have

$$E[e^{i t \cdot X^u}|\mathcal{G}] = e^{i t \cdot \mu^u}\, e^{i t \cdot A \tilde{X}^o}\, e^{-\frac{1}{2} t^T \tilde{\Sigma} t}, \text{ for } t \in \mathbb{R}^{n-d},$$

where $\tilde{\Sigma} = E[(\tilde{X}^u - A \tilde{X}^o)(\tilde{X}^u - A \tilde{X}^o)^T]$. A simple calculation yields
that, conditionally on $\mathcal{G}$, $X^u$ is multivariate normal with mean $\mu_{X^u|\mathcal{G}}$
and variance-covariance matrix $\Sigma_{X^u|\mathcal{G}}$ given by

$$\mu_{X^u|\mathcal{G}} = \mu^u + A(X^o - \mu^o), \qquad
\Sigma_{X^u|\mathcal{G}} = \Sigma_{uu} - \Sigma_{uo} \Sigma_{oo}^{-1} \Sigma_{ou}.$$
Note how the mean gets corrected by a multiple of the difference
between the observed value $X^o$ and its (unconditional) expected value.
Similarly, the variance-covariance matrix gets corrected by the subtraction
of $\Sigma_{uo} \Sigma_{oo}^{-1} \Sigma_{ou}$, but this quantity does not depend on the
observation $X^o$.
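The correction formulas translate directly into a few lines of linear algebra. The following sketch, with made-up numbers of our own choosing, computes them and checks the conditional mean against a crude Monte-Carlo estimate:

```python
import numpy as np

mu = np.array([1.0, -1.0, 0.5])                   # n = 3, with d = 2 observed components
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, 0.2],
                  [0.5, 0.2, 1.5]])               # positive definite
d = 2
Soo, Sou = Sigma[:d, :d], Sigma[:d, d:]
Suo, Suu = Sigma[d:, :d], Sigma[d:, d:]

A = Suo @ np.linalg.inv(Soo)                      # A = Sigma_uo Sigma_oo^{-1}

x_o = np.array([2.0, 0.0])                        # an observed value of X^o
mu_cond = mu[d:] + A @ (x_o - mu[:d])             # mu^u + A (x^o - mu^o)
Sigma_cond = Suu - A @ Sou                        # Sigma_uu - Sigma_uo Sigma_oo^{-1} Sigma_ou

# Monte-Carlo check: among samples with X^o near x_o, the average of X^u
# should be close to mu_cond (the finite window makes this only approximate).
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = np.linalg.norm(sample[:, :d] - x_o, axis=1) < 0.05
print(sample[near, d:].mean(axis=0), mu_cond)
```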
Problem 7.4. Let $(X_1, X_2)$ be a bivariate normal vector with
$\mathrm{Var}[X_1] > 0$. Work out the exact form of the conditional distribution of $X_2$,
given $X_1$, in terms of $\mu_i = E[X_i]$, $\sigma_i^2 = \mathrm{Var}[X_i]$, $i = 1, 2$, and the
correlation coefficient $\rho = \mathrm{corr}(X_1, X_2)$.
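A simulation can serve as a check of whatever closed form you derive; the snippet below (ours, with arbitrary parameter values) estimates the conditional mean and variance of $X_2$ near a fixed value of $X_1$ without giving the answer away:

```python
import numpy as np

mu1, mu2, s1, s2, rho = 0.5, -1.0, 1.2, 2.0, 0.6   # arbitrary test parameters
cov = np.array([[s1**2, rho*s1*s2],
                [rho*s1*s2, s2**2]])

rng = np.random.default_rng(1)
sample = rng.multivariate_normal([mu1, mu2], cov, size=4_000_000)

x1 = 1.0
sel = np.abs(sample[:, 0] - x1) < 0.01             # condition on X_1 being near x1
print(sample[sel, 1].mean(), sample[sel, 1].var()) # compare with your formulas at x1
```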
Additional Problems
Problem 7.5 (Conditional expectation for non-negative random variables).
A parallel definition of conditional expectation can be given
for random variables in $\mathcal{L}^0_+$. For $X \in \mathcal{L}^0_+$, we say that the random
variable Y is a conditional expectation of X with respect to $\mathcal{G}$ - and
denote it by $E[X|\mathcal{G}]$ - if

(a) Y is $\mathcal{G}$-measurable and $[0, \infty]$-valued, and

(b) $E[Y 1_A] = E[X 1_A] \in [0, \infty]$, for all $A \in \mathcal{G}$.

Show that

1. $E[X|\mathcal{G}]$ exists for each $X \in \mathcal{L}^0_+$. (Hint: the argument in the proof of
Proposition 7.3 needs to be modified before it can be used.)

2. $E[X|\mathcal{G}]$ is unique a.s.

3. $E[X|\mathcal{G}]$ no longer necessarily exists for all $X \in \mathcal{L}^0_+$ if we insist that
$E[X|\mathcal{G}] < \infty$, a.s., instead of $E[X|\mathcal{G}] \in [0, \infty]$, a.s.
Problem 7.6 (How to deal with the independent component). Let
$f : \mathbb{R}^2 \to \mathbb{R}$ be a bounded Borel-measurable function, and let X and Y
be independent random variables. Define the function $g : \mathbb{R} \to \mathbb{R}$ by

$$g(y) = E[f(X, y)].$$

Show that the function g is Borel-measurable, and that

$$E[f(X, Y)|Y = y] = g(y), \quad \mu_Y\text{-a.s.}$$
Problem 7.7 (Some exercises in conditional probability).

1. Let $X, Y_1, Y_2$ be random variables. Show that the random vectors
$(X, Y_1)$ and $(X, Y_2)$ have the same distribution if and only if
$P[Y_1 \in B|\sigma(X)] = P[Y_2 \in B|\sigma(X)]$, a.s., for all $B \in \mathcal{B}(\mathbb{R})$.

2. Let $\{X_n\}_{n \in \mathbb{N}}$ be a sequence of non-negative integrable random
variables, and let $\{\mathcal{F}_n\}_{n \in \mathbb{N}}$ be sub-$\sigma$-algebras of $\mathcal{F}$. Show that
$X_n \overset{P}{\to} 0$ if $E[X_n|\mathcal{F}_n] \overset{P}{\to} 0$. Does the converse hold? (Hint: prove
that for $X_n \in \mathcal{L}^0_+$, we have $X_n \overset{P}{\to} 0$ if and only if $E[\min(X_n, 1)] \to 0$.)
3. Let $\mathcal{G}$ be a complete sub-$\sigma$-algebra of $\mathcal{F}$. Suppose that for $X \in \mathcal{L}^1$,
$E[X|\mathcal{G}]$ and X have the same distribution. Show that X is
$\mathcal{G}$-measurable. (Hint: use the conditional Jensen inequality.)
Problem 7.8 (A characterization of $\mathcal{G}$-measurability). Let $(\Omega, \mathcal{F}, P)$ be
a complete probability space and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Show
that for a random variable $X \in \mathcal{L}^1$ the following two statements are
equivalent:

1. X is $\mathcal{G}$-measurable.

2. For all $\eta \in \mathcal{L}^{\infty}$, $E[X\eta] = E[X\, E[\eta|\mathcal{G}]]$.
Problem 7.9 (Conditioning a part with respect to the sum). Let $X_1, X_2, \ldots$
be a sequence of iid random variables with a finite first moment, and let
$S_n = X_1 + X_2 + \cdots + X_n$. Define $\mathcal{G} = \sigma(S_n)$.

1. Compute $E[X_1|\mathcal{G}]$.

2. Supposing, additionally, that $X_1$ is normally distributed, compute
$E[f(X_1)|\mathcal{G}]$, where $f : \mathbb{R} \to \mathbb{R}$ is a Borel function with $f(X_1) \in \mathcal{L}^1$.