
Lecture 1 9/3/2008

Contents

1. Probabilistic experiments

2. Sample space

3. Discrete probability spaces

4. σ-ﬁelds

5. Probability measures

6. Continuity of probabilities

1 PROBABILISTIC EXPERIMENTS

Probability theory is concerned with phenomena or experiments whose outcome is uncertain. A probabilistic model is a mathematical model of a probabilistic experiment that satisfies certain mathematical properties (the axioms of probability theory), and which allows us to calculate probabilities and to reason about the likely outcomes of the experiment.

A probabilistic model is defined formally by a triple (Ω, F, P), called a probability space, comprised of the following three elements:

(a) Ω is the sample space, the set of possible outcomes of the experiment.

(b) F is a σ-field, a collection of subsets of Ω.

(c) P is a probability measure, a function that assigns a nonnegative probability to every set in the σ-field F.

In the remainder of this lecture, we define these concepts formally and explore some of their properties.


2 SAMPLE SPACE

The sample space is a set Ω comprised of all the possible outcomes of the experiment. Typical elements of Ω are often denoted by ω, and are called elementary outcomes, or simply outcomes. The sample space can be finite, e.g., Ω = {ω1, . . . , ωn}, countable, e.g., Ω = N, or uncountable, e.g., Ω = R or Ω = {0, 1}∞.

As a practical matter, the elements of Ω must be mutually exclusive and collectively exhaustive, in the sense that once the experiment is carried out, there is exactly one element of Ω that occurs.

Examples

(a) If the experiment consists of a single roll of an ordinary die, the natural sample space is the set Ω = {1, 2, . . . , 6}, consisting of 6 elements. The outcome ω = 2 indicates that the result of the roll was 2.

(b) If the experiment consists of five consecutive rolls of an ordinary die, the natural sample space is the set Ω = {1, 2, . . . , 6}^5. The element ω = (3, 1, 1, 2, 5) is an example of a possible outcome.

(c) If the experiment consists of an infinite number of consecutive rolls of an ordinary die, the natural sample space is the set Ω = {1, 2, . . . , 6}∞. In this case, an elementary outcome is an infinite sequence, e.g., ω = (3, 1, 1, 5, . . .). Such a sample space would be appropriate if we intend to roll a die indefinitely and we are interested in studying, say, the number of rolls until a 4 is obtained for the first time.

(d) If the experiment consists of measuring the velocity of a vehicle with infinite precision, a natural sample space is the set R of real numbers.

The first step in the construction of a probabilistic model is thus to choose a sample space that specifies the possible outcomes.

3 DISCRETE PROBABILITY SPACES

Before studying σ-fields and probability measures in their full generality, it is helpful to consider the simpler case where the sample space Ω is finite or countable.

Definition 1. A discrete probability space is a triple (Ω, F, P) such that:

(a) The sample space Ω is finite or countable.

(b) The σ-field F is the set of all subsets of Ω.

(c) The probability measure assigns a number in the set [0, 1] to every subset of Ω. It is defined in terms of the probabilities P({ω}) of the elementary outcomes, and satisfies

P(A) = ∑_{ω∈A} P({ω}), for every A ⊂ Ω,

and

∑_{ω∈Ω} P({ω}) = 1.

For simplicity, we will usually employ the notation P(ω) instead of P({ω}),

and we will often denote P(ωi ) by pi .

The following are some examples of discrete probability spaces. Note that

typically we do not provide an explicit expression for P(A) for every A ⊂ Ω. It

sufﬁces to specify the probability of elementary outcomes, from which P(A) is

readily obtained for any A.

Examples.

(a) Consider a single toss of a coin. If we believe that heads (H) and tails (T) are equally likely, the following is an appropriate model. We set Ω = {ω1, ω2}, where ω1 = H and ω2 = T, and let p1 = p2 = 1/2. Here, F = {Ø, {H}, {T}, {H, T}}, and

P(Ø) = 0, P(H) = P(T) = 1/2, P({H, T}) = 1.

(b) Consider a single roll of a die. If we believe that all six outcomes are equally likely, the following is an appropriate model. We set Ω = {1, 2, . . . , 6} and p1 = · · · = p6 = 1/6.

(c) This example is not necessarily motivated by a meaningful experiment, yet it is a legitimate discrete probability space. Let Ω = {1, 2, 5, a, v, aaa, ∗}, and P(1) = .1, P(2) = .1, P(5) = .3, P(a) = .15, P(v) = .15, P(aaa) = .2, P(∗) = 0.

(d) Let Ω = N, and pk = (1/2)^k, for k = 1, 2, . . . . More generally, given a parameter p ∈ [0, 1), we can define pk = (1 − p)p^{k−1}, for k = 1, 2, . . . . This results in a legitimate probability space because ∑_{k=1}^∞ (1 − p)p^{k−1} = 1.

(e) Let Ω = {0, 1, 2, . . .}. We fix a parameter λ > 0, and let pk = e^{−λ} λ^k / k!, for every k ∈ Ω. This results in a legitimate probability space because ∑_{k=0}^∞ e^{−λ} λ^k / k! = 1.

(f) We toss an unbiased coin n times. We let Ω = {0, 1}^n, and if we believe that all sequences of heads and tails are equally likely, we let P(ω) = 1/2^n for every ω ∈ Ω.


(g) We roll a die n times. We let Ω = {1, 2, . . . , 6}^n, and if we believe that all elementary outcomes (n-long sequences) are equally likely, we let P(ω) = 1/6^n for every ω ∈ Ω.
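As a sanity check, the normalization conditions in examples (d) and (e) can be verified numerically. A minimal Python sketch (not part of the notes; the function names are ours, and the infinite series are truncated at a point where the neglected tails are negligible):

```python
import math

# Check numerically that the pmfs of examples (d) and (e) sum to (nearly) 1.

def geometric_total(p, N=1000):
    # sum_{k=1}^{N} (1 - p) p^(k-1), which tends to 1 as N -> infinity
    return sum((1 - p) * p ** (k - 1) for k in range(1, N + 1))

def poisson_total(lam, N=200):
    # sum_{k=0}^{N-1} e^(-lam) lam^k / k!, computed with a running term
    # (term *= lam/(k+1)) to avoid overflowing factorials
    total, term = 0.0, math.exp(-lam)
    for k in range(N):
        total += term
        term *= lam / (k + 1)
    return total

print(geometric_total(0.5))  # close to 1
print(poisson_total(3.0))    # close to 1
```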

Given the probabilities pi, the problem of determining P(A) for some subset of Ω is conceptually straightforward. However, the calculations involved in determining the value of the sum ∑_{ω∈A} P(ω) can range from straightforward to daunting. Various methods that can simplify such calculations will be explored in future lectures.
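For a small discrete space, this summation is easy to carry out explicitly. A minimal Python sketch for the die model of example (b) (the helper name prob is ours; exact rational arithmetic is used so the results are exact):

```python
from fractions import Fraction

# Discrete probability space for a single roll of a fair die (example (b)):
# the probability of an event A is the sum of its elementary probabilities.
p = {omega: Fraction(1, 6) for omega in range(1, 7)}

def prob(A):
    # P(A) = sum of P({omega}) over omega in A
    return sum(p[omega] for omega in A)

print(prob({2, 4, 6}))  # 1/2: probability of an even roll
print(prob(set(p)))     # 1: total probability of the sample space
```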

4 σ-FIELDS

When the sample space Ω is uncountable, the idea of defining the probability of a general subset of Ω in terms of the probabilities of elementary outcomes runs into difficulties. Suppose, for example, that the experiment consists of drawing a number from the interval [0, 1], and that we wish to model a situation where all elementary outcomes are “equally likely.” If we were to assign a probability of zero to every ω, this alone would not be of much help in determining the probability of a subset such as [1/2, 3/4]. If we were to assign the same positive value to every ω, we would obtain P({1, 1/2, 1/3, . . .}) = ∞, which is undesirable.

A way out of this difficulty is to work directly with the probabilities of more general subsets of Ω (not just subsets consisting of a single element).

Ideally, we would like to specify the probability P(A) of every subset of Ω. However, if we wish our probabilities to have certain intuitive mathematical properties, we run into some insurmountable mathematical difficulties. A solution is provided by the following compromise: assign probabilities to only a partial collection of subsets of Ω. The sets in this collection are to be thought of as the “nice” subsets of Ω, or, alternatively, as the subsets of Ω of interest. Mathematically, we will require this collection to be a σ-field, a term that we define next.


Definition 2. Given a sample space Ω, a σ-field is a collection F of subsets of Ω, with the following properties:

(a) Ø ∈ F.

(b) If A ∈ F, then Ac ∈ F.

(c) If Ai ∈ F for every i ∈ N, then ∪_{i=1}^∞ Ai ∈ F.

A set A ∈ F is called an event, or a measurable set. The pair (Ω, F) is called a measurable space. Once the experiment is concluded, the realized outcome ω either belongs to A, in which case we say that the event A has occurred, or it does not, in which case we say that the event did not occur.

Exercise 1.

(a) Let F be a σ-field. Prove that if A, B ∈ F, then A ∩ B ∈ F. More generally, given a countably infinite sequence of events Ai ∈ F, prove that ∩_{i=1}^∞ Ai ∈ F.

(b) Prove that property (a) of σ-fields (that is, Ø ∈ F) can be derived from properties (b) and (c), assuming that the σ-field F is non-empty.

The following are some examples of σ-ﬁelds. (Check that this is indeed the

case.)

Examples.

(a) The trivial σ-ﬁeld, F = {Ø, Ω}.

(b) The collection F = {Ø, A, Ac , Ω}, where A is a ﬁxed subset of Ω.

(c) The set of all subsets of Ω: F = 2^Ω = {A | A ⊂ Ω}.

(d) Let Ω = {1, 2, . . . , 6}^n, the sample space associated with n rolls of a die. Let A = {ω = (ω1, . . . , ωn) | ω1 ≤ 2}, B = {ω = (ω1, . . . , ωn) | 3 ≤ ω1 ≤ 4}, and C = {ω = (ω1, . . . , ωn) | ω1 ≥ 5}, and F = {Ø, A, B, C, A ∪ B, A ∪ C, B ∪ C, Ω}.

Example (d) above can be thought of as follows. We start with a number

of subsets of Ω that we wish to have included in a σ-ﬁeld. We then include

more subsets, as needed, until a σ-ﬁeld is constructed. More generally, given a

collection of subsets of Ω, we can contemplate forming complements, countable

unions, and countable intersections of these subsets, to form a new collection.

We continue this process until no more sets are included in the collection, at

which point we obtain a σ-ﬁeld. This process is hard to formalize in a rigorous

manner. An alternative way of deﬁning this σ-ﬁeld is provided below. We will

need the following fact.


Proposition 1. Let S be an index set (possibly infinite, or even uncountable), and suppose that for every s ∈ S we have a σ-field Fs of subsets of the same sample space. Let F = ∩_{s∈S} Fs, i.e., a set A belongs to F if and only if A ∈ Fs for every s ∈ S. Then F is a σ-field.

Proof. We need to verify that F has the three required properties. Since each Fs is a σ-field, we have Ø ∈ Fs, for every s, which implies that Ø ∈ F. To establish the second property, suppose that A ∈ F. Then, A ∈ Fs, for every s. Since each Fs is a σ-field, we have Ac ∈ Fs, for every s. Therefore, Ac ∈ F, as desired. Finally, to establish the third property, consider a sequence {Ai} of elements of F. In particular, for a given s ∈ S, every set Ai belongs to Fs. Since Fs is a σ-field, it follows that ∪_{i=1}^∞ Ai ∈ Fs. Since this is true for every s ∈ S, it follows that ∪_{i=1}^∞ Ai ∈ F. This verifies the third property and establishes that F is indeed a σ-field.

Suppose now that we start with a collection C of subsets of Ω, which is not necessarily a σ-field. We wish to form a σ-field that contains C. This is always possible, a simple choice being to just let F = 2^Ω. However, for technical reasons, we may wish the σ-field to contain no more sets than necessary. This leads us to define F as the intersection of all σ-fields that contain C. Note that if H is any other σ-field that contains C, then F ⊂ H. (This is because F was defined as the intersection of various σ-fields, one of which is H.) In this sense, F is the smallest σ-field containing C. The σ-field F constructed in this manner is called the σ-field generated by C, and is often denoted by σ(C).
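When Ω is finite, the informal closure process described above terminates (and, since countable unions of finitely many distinct sets reduce to finite unions, a field is automatically a σ-field), so the generated σ-field can be computed by brute force. A Python sketch (ours, not part of the notes):

```python
# Compute the sigma-field generated by a collection C of subsets of a
# finite sample space, by repeatedly closing under complements and
# pairwise unions until no new sets appear.
def generated_sigma_field(Omega, C):
    Omega = frozenset(Omega)
    F = {frozenset(), Omega} | {frozenset(A) for A in C}
    while True:
        new = {Omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:          # nothing new: F is closed, hence a sigma-field
            return F
        F |= new

F = generated_sigma_field({1, 2, 3, 4, 5, 6}, [{1, 2}])
print(len(F))  # 4 sets: Ø, {1,2}, {3,4,5,6}, Ω — as in example (b)
```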

Example. Let Ω = [0, 1]. The smallest σ-ﬁeld that includes every interval [a, b] ⊂ [0, 1]

is hard to describe explicitly (it includes fairly complicated sets), but is still well-deﬁned,

by the above discussion. It is called the Borel σ-ﬁeld, and is denoted by B. A set

A ⊂ [0, 1] that belongs to this σ-ﬁeld is called a Borel set.

5 PROBABILITY MEASURES

We now come to the last element of a probabilistic model, the probability measure. We have already seen that when the sample space Ω is countable, this can be accomplished by assigning probabilities to individual elements ω ∈ Ω. However, as discussed before, this does not work when Ω is uncountable. We are then led to assign probabilities to certain subsets of Ω, specifically to the elements of a σ-field F, and require that these probabilities have certain “natural” properties.

Besides probability measures, it is convenient to define the notion of a measure more generally.


Definition 3. Let (Ω, F) be a measurable space. A measure is a function µ : F → [0, ∞], which assigns a nonnegative extended real number µ(A) to every set A in F, and which satisfies the following two conditions:

(a) µ(Ø) = 0;

(b) (Countable additivity) If {Ai} is a sequence of disjoint sets in F, then µ(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ µ(Ai).

A probability measure is a measure P with the additional property P(Ω) = 1. In that case, the triple (Ω, F, P) is called a probability space.

For any A ∈ F, P(A) is called the probability of the event A. The assignment of unit probability to the event Ω expresses our certainty that the outcome of the experiment, no matter what it is, will be an element of Ω. Similarly, the outcome cannot be an element of the empty set; thus, the empty set cannot occur and is assigned zero probability. If an event A ∈ F satisfies P(A) = 1, we say that A occurs almost surely. Note, however, that A happening almost surely is not the same as the condition A = Ω. For a trivial example, let Ω = {1, 2, 3}, p1 = .5, p2 = .5, p3 = 0. Then the event A = {1, 2} occurs almost surely, since P(A) = .5 + .5 = 1, but A ≠ Ω. The outcome 3 has zero probability, but is still possible.

The countable additivity property is very important. Its intuitive meaning

is the following. If we have several events A1 , A2 , . . ., out of which at most

one can occur, then the probability that “one of them will occur” is equal to

the sum of their individual probabilities. In this sense, probabilities (and more

generally, measures) behave like the familiar notions of area or volume: the area

or volume of a countable union of disjoint sets is the sum of their individual areas

or volumes. Indeed, a measure is to be understood as some generalized notion

of a volume. In this light, allowing the measure µ(A) of a set to be inﬁnite is

natural, since one can easily think of sets with inﬁnite volume.

The properties of probability measures that are required by Deﬁnition 3 are

often called the axioms of probability theory. Starting from these axioms, many

other properties can be derived, as in the next proposition.


Proposition 2. Probability measures have the following properties.

(a) (Finite additivity) If the events A1, . . . , An are disjoint, then P(∪_{i=1}^n Ai) = ∑_{i=1}^n P(Ai).

(b) For any event A, we have P(Ac) = 1 − P(A).

(c) If the events A and B satisfy A ⊂ B, then P(A) ≤ P(B).

(d) (Union bound) For any sequence {Ai} of events, we have

P(∪_{i=1}^∞ Ai) ≤ ∑_{i=1}^∞ P(Ai).

(e) (Inclusion-exclusion formula) For any events A1, . . . , An, we have

P(∪_{i=1}^n Ai) = ∑_{i=1}^n P(Ai) − ∑_{(i,j): i<j} P(Ai ∩ Aj) + ∑_{(i,j,k): i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ · · · ∩ An).

Proof.

(a) This property is almost identical to condition (b) in the definition of a measure, except that it deals with a finite instead of a countably infinite collection of events. Given a finite collection of disjoint events A1, . . . , An, let us define Ak = Ø for k > n, to obtain an infinite sequence of disjoint events. Then,

P(∪_{i=1}^n Ai) = P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai) = ∑_{i=1}^n P(Ai).

Countable additivity was used in the second equality, and the fact P(Ø) = 0 was used in the last equality.

(b) The events A and Ac are disjoint. Using part (a), we have P(A ∪ Ac) = P(A) + P(Ac). But A ∪ Ac = Ω, whose measure is equal to one, and the result follows.

(c) The events A and B \ A are disjoint. Also, A ∪ (B \ A) = B. Therefore, using also part (a), we obtain P(A) ≤ P(A) + P(B \ A) = P(B).

(d) Left as an exercise.

(e) Left as an exercise; a simple proof will be provided later, using random variables.

Part (e) of Proposition 2 admits a rather simple proof that relies on random variables and their expectations, a topic to be visited later on. For the special case where n = 2, the formula simplifies to

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2).

Let us note that all properties (a), (c), and (d) in Proposition 2 are also valid for general measures (the proof is the same). Let us also note that for a probability measure, the property P(Ø) = 0 need not be assumed, but can be derived from the other properties. Indeed, consider a sequence of sets Ai, each of which is equal to the empty set. These sets are disjoint, since Ø ∩ Ø = Ø. Applying the countable additivity property, we obtain ∑_{i=1}^∞ P(Ø) = P(Ø) ≤ P(Ω) = 1, which can only hold if P(Ø) = 0.
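The inclusion-exclusion formula of Proposition 2(e) is easy to check on a small discrete space. A Python sketch with three events in the fair-die model (ours, not part of the notes; exact rational arithmetic, so the comparison is exact):

```python
from fractions import Fraction

# Verify inclusion-exclusion for n = 3 on the space of a single fair die roll.
p = {omega: Fraction(1, 6) for omega in range(1, 7)}

def prob(E):
    return sum(p[omega] for omega in E)

A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
lhs = prob(A | B | C)
rhs = (prob(A) + prob(B) + prob(C)
       - prob(A & B) - prob(A & C) - prob(B & C)
       + prob(A & B & C))
print(lhs == rhs)  # True
```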

Finite Additivity

Our definitions of σ-fields and of probability measures involve countable unions and a countable additivity property. A different mathematical structure is obtained if we replace countable unions and sums by finite ones. This leads us to the following definitions.

Definition 4.

(a) A field is a collection F of subsets of Ω, with the following properties:

(i) Ø ∈ F.

(ii) If A ∈ F, then Ac ∈ F.

(iii) If A ∈ F and B ∈ F, then A ∪ B ∈ F.

(b) Let F be a field of subsets of Ω. A function P : F → [0, 1] is said to be finitely additive if

P(A ∪ B) = P(A) + P(B), whenever A, B ∈ F are disjoint.

We note that finite additivity, for the case of two events, easily implies finite additivity for a general finite number n of events, namely, the property in part (a) of Proposition 2. To see this, note that finite additivity for n = 2 allows us to write, for the case of three disjoint events,

P(A1 ∪ A2 ∪ A3) = P(A1 ∪ A2) + P(A3) = P(A1) + P(A2) + P(A3),

and we can proceed inductively for any finite number of events.

Finite additivity is strictly weaker than the countable additivity property of probability measures. In particular, finite additivity on a field, or even for the special case of a σ-field, does not, in general, imply countable additivity. The reason for introducing the stronger countable additivity property is that without it, we are severely limited in the types of probability calculations that are possible. On the other hand, finite additivity is often easier to verify.

6 CONTINUITY OF PROBABILITIES

Intuitively, probabilities should possess a certain continuity property. For example, with Ω = R, the sequence of events [1, n] converges to the event A = [1, ∞), and it is reasonable to expect that the probability of [1, n] converges to the probability of [1, ∞). Such a property is established in greater generality in the result that follows. This result provides us with a few alternative versions of such a continuity property, together with a converse which states that finite additivity together with continuity implies countable additivity. This last result is a useful tool that often simplifies the verification of the countable additivity property.

Theorem 1. Let F be a σ-field of subsets of Ω, and suppose that P : F → [0, 1] satisfies P(Ω) = 1 as well as the finite additivity property. Then, the following are equivalent:

(a) P is countably additive (and is therefore a probability measure).

(b) If {Ai} is an increasing sequence of sets in F (i.e., Ai ⊂ Ai+1, for all i), and A = ∪_{i=1}^∞ Ai, then lim_{i→∞} P(Ai) = P(A).

(c) If {Ai} is a decreasing sequence of sets in F (i.e., Ai ⊃ Ai+1, for all i), and A = ∩_{i=1}^∞ Ai, then lim_{i→∞} P(Ai) = P(A).

(d) If {Ai} is a decreasing sequence of sets in F (i.e., Ai ⊃ Ai+1, for all i) and ∩_{i=1}^∞ Ai is empty, then lim_{i→∞} P(Ai) = 0.

Proof. We first assume that (a) holds and establish (b). Observe that A = A1 ∪ (A2 \ A1) ∪ (A3 \ A2) ∪ · · ·, and that the events A1, (A2 \ A1), (A3 \ A2), . . . are disjoint (check this). Therefore, using countable additivity,

P(A) = P(A1) + ∑_{i=2}^∞ P(Ai \ Ai−1)
     = P(A1) + lim_{n→∞} ∑_{i=2}^n P(Ai \ Ai−1)
     = P(A1) + lim_{n→∞} ∑_{i=2}^n (P(Ai) − P(Ai−1))
     = P(A1) + lim_{n→∞} (P(An) − P(A1))
     = lim_{n→∞} P(An).

Suppose now that property (b) holds, let {Ai} be a decreasing sequence of sets in F, and let A = ∩_{i=1}^∞ Ai. Then, the sequence {Ai^c} is increasing, and De Morgan's law, together with property (b), imply that

P(Ac) = P((∩_{i=1}^∞ Ai)^c) = P(∪_{i=1}^∞ Ai^c) = lim_{n→∞} P(An^c),

and

P(A) = 1 − P(Ac) = 1 − lim_{n→∞} P(An^c) = lim_{n→∞} (1 − P(An^c)) = lim_{n→∞} P(An).

Property (d) follows from property (c), because (d) is just the special case of

(c) in which the set A is empty.

To complete the proof, we now assume that property (d) holds and establish that property (a) holds as well. Let Bi ∈ F be disjoint events. Let An = ∪_{i=n}^∞ Bi. Note that {An} is a decreasing sequence of events. We claim that ∩_{n=1}^∞ An = Ø. Indeed, if ω ∈ A1, then ω ∈ Bn for some n, which implies that ω ∉ ∪_{i=n+1}^∞ Bi = An+1. Therefore, no element of A1 can belong to all of the sets An, which means that the intersection of the sets An is empty. Property (d) then implies that lim_{n→∞} P(An) = 0.

Applying finite additivity to the n disjoint sets B1, B2, . . . , Bn−1, ∪_{i=n}^∞ Bi, we have

P(∪_{i=1}^∞ Bi) = ∑_{i=1}^{n−1} P(Bi) + P(∪_{i=n}^∞ Bi).

We now take the limit as n → ∞. The first term on the right-hand side converges to ∑_{i=1}^∞ P(Bi). The second term is P(An), and converges to zero, by property (d). We conclude that

P(∪_{i=1}^∞ Bi) = ∑_{i=1}^∞ P(Bi),

which establishes countable additivity, i.e., property (a).
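The continuity property can be illustrated numerically in the discrete model pk = (1/2)^k on Ω = {1, 2, . . .}: for the increasing events An = {1, . . . , n}, whose union is Ω, we have P(An) = 1 − 2^{−n} → 1 = P(Ω). A Python sketch (ours, not part of the notes):

```python
# Continuity of probabilities: for the model p_k = (1/2)^k on {1, 2, ...},
# the increasing events A_n = {1, ..., n} satisfy P(A_n) = 1 - 2^(-n) -> 1.
def P_An(n):
    return sum(0.5 ** k for k in range(1, n + 1))

for n in (1, 2, 5, 20):
    print(n, P_An(n))   # monotonically approaching P(Omega) = 1
```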

References

(b) Williams, Chapter 1, Sections 1.0-1.5, 1.9-1.10.


MIT OpenCourseWare

http://ocw.mit.edu

Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 2 9/8/2008

Contents

1. Carathéodory's extension theorem

2. Lebesgue measure on [0, 1] and R

3. Coin tosses: a “uniform” measure on {0, 1}∞

4. Completion of a probability space

5. Further remarks

6. Appendix: strange sets

The following are two fundamental probabilistic models that can serve as

building blocks for more complex models:

(a) The uniform distribution on [0, 1], which assigns probability b − a to every

interval [a, b] ⊂ [0, 1].

(b) A model of an inﬁnite sequence of fair coin tosses that assigns equal proba

bility, 1/2n , to every possible sequence of length n.

These two models are often encountered in elementary probability and used

without further discussion. Strictly speaking, however, we need to make sure

that these two models are well-posed, that is, consistent with the axioms of

probability. To this effect, we need to deﬁne appropriate σ-ﬁelds and probability

measures on the corresponding sample spaces. In what follows, we describe the

required construction, while omitting the proofs of the more technical steps.

1 CARATHÉODORY'S EXTENSION THEOREM

The general outline of the construction we will use is as follows. We are interested in defining a probability measure with certain properties on a given measurable space (Ω, F). We consider a smaller collection, F0 ⊂ F, of subsets of Ω, which is a field, and on which the desired probabilities are easy to define.

(Recall that a ﬁeld is a collection of subsets of the sample space that includes

the empty set, and which is closed under taking complements and under ﬁnite

unions.) Furthermore, we make sure that F0 is rich enough, so that the σ-ﬁeld

it generates is the same as the desired σ-ﬁeld F. We then extend the deﬁnition

of the probability measure from F0 to the entire σ-field F. This is possible, under certain conditions, by virtue of the following fundamental result from measure theory.

Theorem 1. (Carathéodory's extension theorem) Let F0 be a field of subsets of a sample space Ω, and let F = σ(F0) be the σ-field that it generates. Suppose that P0 is a mapping from F0 to [0, 1] that satisfies P0(Ω) = 1, as well as countable additivity on F0.

Then, P0 can be extended uniquely to a probability measure on (Ω, F). That is, there exists a unique probability measure P on (Ω, F) such that P(A) = P0(A) for all A ∈ F0.

Remarks:

(a) The proof of the extension theorem is fairly long and technical; see, e.g.,

Appendix A of [Williams].

(b) The main hurdle in applying the extension theorem is the verification of the countable additivity property of P0 on F0; that is, one needs to show that if {Ai} is a sequence of disjoint sets in F0, and if ∪_{i=1}^∞ Ai ∈ F0, then P0(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P0(Ai). Alternatively, in the spirit of Theorem 1 from Lecture 1, it suffices to verify that if {Ai} is a decreasing sequence of sets in F0 and if ∩_{i=1}^∞ Ai is empty, then lim_{i→∞} P0(Ai) = 0. Indeed, while Theorem 1 of Lecture 1 was stated for the case where F is a σ-field, an inspection of its proof indicates that it remains valid even if F is replaced by a field F0.

In the next two sections, we consider the two models of interest. We deﬁne

appropriate ﬁelds, deﬁne probabilities for the events in those ﬁelds, and then use

the extension theorem to obtain a probability measure.

2 LEBESGUE MEASURE ON [0, 1] AND R

In this section, we construct the uniform probability measure on [0, 1], also known as Lebesgue measure. Under the Lebesgue measure, the measure assigned to any subset of [0, 1] is meant to be equal to its length. While the definition of length is immediate for simple sets (e.g., the set [a, b] has length b − a), more general sets present more of a challenge.

We start by considering the sample space Ω = (0, 1], which is slightly more

convenient than the sample space [0, 1].

Consider the collection C of all intervals [a, b] contained in (0, 1], and let F be

the σ-ﬁeld generated by C. As mentioned in the Lecture 1 notes, this is called

the Borel σ-ﬁeld, and is denoted by B. Sets in this σ-ﬁeld are called Borel sets

or Borel measurable sets.

Any set that can be formed by starting with intervals [a, b] and using a countable number of set-theoretic operations (taking complements, or forming countable unions and intersections of previously formed sets) is a Borel set. For example, it can be verified that single-element sets, {a}, are Borel sets. Furthermore, intervals (a, b] are also Borel sets since they are of the form [a, b] \ {a}. Every countable set is also a Borel set, since it is the union of countably many single-element sets. In particular, the set of rational numbers in (0, 1], as well as its complement, the set of irrational numbers in (0, 1], is a Borel set. While Borel sets can be fairly complicated, not every set is a Borel set; see Sections 5-6.

Defining a probability measure directly for all Borel sets is difficult, so we start by considering a smaller collection, F0, of subsets of (0, 1]. We let F0 consist of the empty set and all sets that are finite unions of intervals of the form (a, b]. In more detail, if a set A ∈ F0 is nonempty, it is of the form

A = (a1, b1] ∪ · · · ∪ (an, bn],

where 0 ≤ a1 < b1 ≤ a2 < b2 ≤ · · · ≤ an < bn ≤ 1, and n ∈ N.

Lemma 1. We have σ(F0) = σ(C) = B.

Proof. We have already argued that every interval of the form (a, b] is a Borel set. Hence, a typical element of F0 (a finite union of such intervals) is also a Borel set. Therefore, F0 ⊂ B, which implies that σ(F0) ⊂ σ(B) = B. (The last equality holds because B is already a σ-field and is therefore equal to the smallest σ-field that contains B.)

Note that for a > 0, we have [a, b] = ∩_{n=1}^∞ (a − 1/n, b]. Since (a − 1/n, b] ∈ F0 ⊂ σ(F0), it follows that [a, b] ∈ σ(F0). Thus, C ⊂ σ(F0), which implies that

B = σ(C) ⊂ σ(σ(F0)) = σ(F0) ⊂ B.


(The second equality holds because the smallest σ-field containing σ(F0) is σ(F0) itself.) The first equality in the statement of the lemma follows. Finally, the equality σ(C) = B is just the definition of B.

Lemma 2.

(a) The collection F0 is a field.

(b) The collection F0 is not a σ-field.

Proof.

(a) By deﬁnition, Ø ∈ F0 . Note that Øc = (0, 1] ∈ F0 . More generally, if A is

of the form A = (a1 , b1 ]∪· · ·∪(an , bn ], its complement is (0, a1 ]∪(b1 , a2 ]∪

· · · ∪ (bn , 1], which is also in F0 . Furthermore, the union of two sets that

are unions of ﬁnitely many intervals of the form (a, b] is also a union of

ﬁnitely many such intervals. For example, if A = (1/8, 2/8] ∪ (4/8, 7/8]

and B = (3/8, 5/8], then A ∪ B = (1/8, 2/8] ∪ (3/8, 7/8].

(b) To see that F0 is not a σ-ﬁeld, note that (0, n/(n + 1)] ∈ F0 , for every

n ∈ N, but the union of these sets, which is (0, 1), does not belong to

F0 .
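The complement formula used in the proof of part (a) can be sketched in Python, with a set in F0 stored as a sorted list of disjoint intervals (a, b] (our own toy representation, not part of the notes):

```python
# A set in F0 is stored as a sorted list of disjoint pairs (a, b),
# each standing for the interval (a, b] in (0, 1]. Its complement is
# (0, a1] U (b1, a2] U ... U (bn, 1], with any empty pieces dropped.
def complement(intervals):
    result, left = [], 0.0
    for a, b in intervals:
        if left < a:                 # gap (left, a] belongs to the complement
            result.append((left, a))
        left = b
    if left < 1.0:                   # final piece (bn, 1]
        result.append((left, 1.0))
    return result

A = [(1 / 8, 2 / 8), (4 / 8, 7 / 8)]
print(complement(A))  # [(0.0, 0.125), (0.25, 0.5), (0.875, 1.0)]
```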

For every A ∈ F0 of the form

A = (a1, b1] ∪ · · · ∪ (an, bn],

we define

P0(A) = ∑_{i=1}^n (bi − ai),

which is its total length. Note that P0(Ω) = P0((0, 1]) = 1. Also, P0 is finitely additive. Indeed, if A1, . . . , An are disjoint finite unions of intervals of the form (a, b], then A = ∪_{1≤i≤n} Ai is also a finite union of such intervals and its total length is the sum of the lengths of the sets Ai. It turns out that P0 is also countably additive on F0. This essentially boils down to checking the following. If (a, b] = ∪_{i=1}^∞ (ai, bi], where the intervals (ai, bi] are disjoint, then b − a = ∑_{i=1}^∞ (bi − ai). This may appear intuitively obvious, but a formal proof is nontrivial; see, e.g., [Williams, Chapter A1].

We can now apply the Extension Theorem and conclude that there exists a probability measure P, called the Lebesgue or uniform measure, defined on the entire Borel σ-field B, that agrees with P0 on F0. In particular, P((a, b]) = b − a, for every interval (a, b] ⊂ (0, 1].

By augmenting the sample space Ω to include 0, and assigning zero probability to it, we obtain a new probability model with sample space Ω = [0, 1]. (Exercise: define formally the σ-field on [0, 1], starting from the σ-field on (0, 1].)

Exercise 1. Let A be the set of irrational numbers in [0, 1]. Show that P(A) = 1.

Example. Let A be the set of points in [0, 1] whose decimal representation contains

only odd digits. (We disallow decimal representations that end with an inﬁnite string of

nines. Under this condition, every number has a unique decimal representation.) What

is the Lebesgue measure of this set?

Observe that A = ∩_{n=1}^∞ An, where An is the set of points whose first n digits are odd. It can be checked that An is a union of 5^n intervals, each with length 1/10^n. Thus, P(An) = 5^n/10^n = 1/2^n. Since A ⊂ An, we obtain P(A) ≤ P(An) = 1/2^n. Since this is true for every n, we conclude that P(A) = 0.
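The structure of the sets An can be made concrete with a short Python sketch that enumerates the 5^n intervals explicitly (ours, for illustration only):

```python
from itertools import product

# A_n is the union of 5^n disjoint intervals, one for each choice of n odd
# leading decimal digits; each interval has length 10^(-n).
def intervals_An(n):
    for digits in product((1, 3, 5, 7, 9), repeat=n):
        a = sum(d * 10 ** -(i + 1) for i, d in enumerate(digits))
        yield (a, a + 10 ** -n)

def measure_An(n):
    return sum(b - a for a, b in intervals_An(n))

for n in (1, 2, 3):
    print(n, measure_An(n))   # 1/2, 1/4, 1/8 (up to floating-point rounding)
```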

Exercise 2. Let A be the set of points in [0, 1] whose decimal representation contains

at least one digit equal to 9. Find the Lebesgue measure of that set.

Note that there is nothing special about the interval (0, 1]. For example, if we let Ω = (c, d], where c < d, and if (a, b] ⊂ (c, d], we can define P0((a, b]) = (b − a)/(d − c) and proceed as above to obtain a uniform probability measure on the set (c, d], as well as on the set [c, d].

On the other hand, a “uniform” probability measure on the entire real line,

R, that assigns equal probability to intervals of equal length, is incompatible

with the requirement P(Ω) = 1. What we obtain instead, in the next section, is

a notion of length which becomes inﬁnite for certain sets.

Let Ω = R. We ﬁrst deﬁne a σ-ﬁeld of subsets of R. This can be done in several

ways. It can be veriﬁed that the three alternatives below are equivalent.

(a) Let C be the collection of all intervals of the form [a, b], and let B = σ(C)

be the σ-ﬁeld that it generates.

(b) Let D be the collection of all intervals of the form (−∞, b], and let B =

σ(D) be the σ-ﬁeld that it generates.

(c) For any n, we deﬁne the Borel σ-ﬁeld of (n, n + 1] as the σ-ﬁeld generated

by sets of the form [a, b] ⊂ (n, n + 1]. We then say that A is a Borel subset

of R if A ∩ (n, n + 1] is a Borel subset of (n, n + 1], for every n.

Let Pn be the uniform measure on (n, n + 1] (defined on the Borel sets in (n, n + 1]). Given a Borel set A ⊂ R, we decompose it into countably many pieces, each piece contained in some interval (n, n + 1], and define its “length” µ(A) using countable additivity:

µ(A) = ∑_{n=−∞}^∞ Pn(A ∩ (n, n + 1]).

It turns out that µ is a measure on (R, B), called again Lebesgue measure.

However, it is not a probability measure because µ(R) = ∞.

Exercise 4. Show that µ is a measure on (R, B). Hint: Use the countable additivity of the measures Pn to establish the countable additivity of µ. You can also use the fact that if the numbers aij are nonnegative, then ∑_{i=1}^∞ ∑_{j=1}^∞ aij = ∑_{j=1}^∞ ∑_{i=1}^∞ aij.

3 COIN TOSSES: A “UNIFORM” MEASURE ON {0, 1}∞

Consider an experiment involving an infinite sequence of tosses of a fair coin. We wish to construct a probabilistic model of this experiment under which every possible sequence of results of the first n tosses has the same probability, 1/2^n.

The sample space for this experiment is the set {0, 1}∞ of all infinite sequences ω = (ω1, ω2, . . .) of zeroes and ones (we use zeroes and ones instead of heads and tails).

For technical reasons, it is not possible to assign a probability to every subset

of the sample space. Instead, we proceed as in Section 2, i.e., ﬁrst deﬁne a ﬁeld

of subsets, assign probabilities to sets that belong to this ﬁeld, and then extend

them to a probability measure on the σ-ﬁeld generated by that ﬁeld.

Let Fn be the collection of events whose occurrence can be decided by looking at the results of the first n tosses. For example, the event {ω | ω1 = 1 and ω2 = ω4} belongs to F4 (as well as to Fk for every k ≥ 4).

Let B be an arbitrary subset of {0, 1}^n, and consider the set A = {ω ∈ {0, 1}∞ | (ω1, . . . , ωn) ∈ B}. We can express A in the form A = B × {0, 1}∞. (This is simply saying that any sequence in A can be viewed as a pair consisting of an n-long sequence that belongs to B, followed by an arbitrary infinite sequence.) The event A belongs to Fn, and all elements of Fn are of this form, for some B. It is easily verified that Fn is a σ-field.

Exercise 5. Provide a formal proof that Fn is a σ-ﬁeld.

The σ-field Fn, for any fixed n, is too small; it can only serve to model the first n coin tosses. We are interested instead in sets that belong to Fn, for arbitrary n, and this leads us to define F0 = ∪_{n=1}^∞ Fn, the collection of sets that belong to Fn for some n. Intuitively, A ∈ F0 if the occurrence or nonoccurrence of A can be decided after a fixed number of coin tosses.¹

Example. Let An = {ω | ωn = 1}, the event that the nth toss results in a “1”. Note that An ∈ Fn. Let A = ∪_{n=1}^{∞} An, which is the event that there is at least one “1” in the infinite toss sequence. The event A does not belong to Fn, for any n. (Intuitively, having observed a sequence of n zeroes does not allow us to decide whether there will be a subsequent “1” or not.) Consider also the complement of A, which is the event that the outcome of the experiment is an infinite string of zeroes. Once more, we see that Ac does not belong to F0.

The preceding example shows that F0 is not a σ-ﬁeld. On the other hand, it

can be veriﬁed that F0 is a ﬁeld.

Exercise 6. Prove that F0 is a ﬁeld.

We would like to be able to assign probabilities to all of the events in Fn, for every n. This means that we need a σ-field that includes F0. On the other hand, we would like our σ-field to be as small as possible, i.e., contain as few subsets of {0, 1}∞ as possible, to minimize the possibility that it includes pathological sets to which probabilities cannot be assigned. This leads us to define F as the sigma-field σ(F0) generated by F0.


The next step is to define a probability P0 on the field F0, which satisfies P0({0, 1}∞) = 1. This is accomplished as follows. Every set A in F0 is of the form B × {0, 1}∞, for some n and some B ⊂ {0, 1}^n. We then let

P0(A) = |B|/2^n.² Note that the event {(ω1, ω2, . . . , ωn)} × {0, 1}∞, which is the event that the first n tosses resulted in the particular sequence (ω1, ω2, . . . , ωn), has probability 1/2^n; thus, every possible sequence of results of the first n tosses is assigned equal probability, as desired.

¹ The union ∪_{i=1}^{∞} Fi = F0 is not the same as the collection of sets of the form ∪_{i=1}^{∞} Ai, for Ai ∈ Fi. For an illustration, if F1 = {{a}, {b, c}} and F2 = {{d}}, then F1 ∪ F2 = {{a}, {b, c}, {d}}. Note that {b, c} ∪ {d} = {b, c, d} is not in F1 ∪ F2.

² For any set A, |A| denotes its cardinality, the number of elements it contains.

Before proceeding further, we need to verify that the above definition is consistent. Note that the same set A can belong to Fn for several values of n. We therefore need to check that when we apply the definition of P0(A) for different choices of n, we obtain the same value. Indeed, suppose that A ∈ Fm, which implies that A ∈ Fn, for n > m. In this case, A = B × {0, 1}∞ = C × {0, 1}∞, where B ⊂ {0, 1}^n and C ⊂ {0, 1}^m. Thus, B = C × {0, 1}^{n−m}, and |B| = |C| · 2^{n−m}. One application of the definition yields P0(A) = |B|/2^n, and another yields P0(A) = |C|/2^m. Since |B| = |C| · 2^{n−m}, they both yield the same value.

It is easily verified that P0(Ω) = 1, and that P0 is finitely additive: if A, B ∈ F0 are disjoint, then P0(A ∪ B) = P0(A) + P0(B). It also turns out that P0

and is omitted.) We can now invoke the Extension Theorem and conclude that

there exists a unique probability measure on F, the σ-ﬁeld generated by F0 ,

that agrees with P0 on F0 . This probability measure assigns equal probability,

1/2n , to every possible sequence of length n, as desired. This conﬁrms that the

intuitive process of an inﬁnite sequence of coin ﬂips can be captured rigorously

within the framework of probability theory.
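As a sanity check of the construction, the probability P0(A) = |B|/2^n of a cylinder set can be computed by brute force; the sketch below (an illustration, not part of the lecture) uses the event {ω | ω1 = 1 and ω2 = ω4} from the text, with 0-based indices:

```python
from fractions import Fraction
from itertools import product

def cylinder_probability(B, n):
    # P0 of the cylinder set A = B x {0,1}^inf, where B is a set of
    # n-long tuples of bits: P0(A) = |B| / 2^n.
    return Fraction(len(B), 2 ** n)

# The event "first toss is 1 and second toss equals fourth toss" is decided
# by the first 4 tosses, so it belongs to F_4 (indices below are 0-based).
B = {w for w in product((0, 1), repeat=4) if w[0] == 1 and w[1] == w[3]}
print(cylinder_probability(B, 4))  # 1/4

# Consistency check: viewing the same event inside F_5 gives the same value.
B5 = {w for w in product((0, 1), repeat=5) if w[0] == 1 and w[1] == w[3]}
assert cylinder_probability(B5, 5) == cylinder_probability(B, 4)
```

The consistency assertion mirrors the argument in the text: a set in Fm also belongs to Fn for n > m, and both applications of the definition agree.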

Exercise 7. Consider the probability space ({0, 1}∞ , F, P). Let A be the set of all

inﬁnite sequences ω for which ωn = 0 for every odd n.

(a) Establish that A ∉ F0, but A ∈ F.

(b) Compute P(A).

Similar to the case of Borel sets in [0, 1], there exist subsets of {0, 1}∞ that

do not belong to F. In fact the similarities between the models of Sections 2 and

3 are much deeper; the two models are essentially equivalent, although we will

not elaborate on the meaning of this. Let us only say that the equivalence relies

on the one-to-one correspondence of the sets [0, 1] and {0, 1}∞ obtained through

the binary representation of real numbers. Intuitively, generating a real number

at random, according to the uniform distribution (Lebesgue measure) on [0, 1],

is probabilistically equivalent to generating each bit in its binary expansion at

random.

Starting with a ﬁeld F0 and a countably additive function P0 on that ﬁeld, the

Extension Theorem leads to a measure on the smallest σ-ﬁeld containing F0 .

Can we extend the measure further, to a larger σ-ﬁeld? If so, is the extension

unique, or will there have to be some arbitrary choices? We describe here a

generic extension that assigns probabilities to certain additional sets A for which

there is little choice.

Consider a probability space (Ω, F, P). Suppose that B ∈ F, and P(B) =

0. Any set B with this property is called a null set. (Note that in this context,

“null” is not the same as “empty.”) Suppose now that A ⊂ B. If the set A is not

in F, it is not assigned a probability; were it to be assigned one, the only choice

that would not lead to a contradiction is a value of zero.

The ﬁrst step is to augment the σ-ﬁeld F so that it includes all subsets of

null sets. This is accomplished as follows:

(a) Let N be the collection of all subsets of null sets;

(b) Deﬁne F ∗ = σ(F ∪ N ), the smallest σ-ﬁeld that contains F as well as all

subsets of null sets.

(c) Extend P in a natural manner to obtain a new probability measure P∗ on

(Ω, F ∗ ). In particular, we let P∗ (A) = 0 for every subset A ⊂ B of every

null set B ∈ F. It turns out that such an extension is always possible and

unique.

The resulting probability space is said to be complete. It has the property that

all subsets of null sets are included in the σ-ﬁeld and are also null sets.

When Ω = [0, 1] (or Ω = R), F is the Borel σ-ﬁeld, and P is Lebesgue

measure, we obtain an augmented σ-ﬁeld F ∗ and a measure P∗ . The sets in F ∗

are called Lebesgue measurable sets. The new measure P∗ is referred to by the

same name as the measure P (“Lebesgue measure”).

5 FURTHER REMARKS

We record here a few interesting facts related to Borel σ-ﬁelds and the Lebesgue

measure. Their proofs tend to be fairly involved.

(a) There exist sets that are Lebesgue measurable but not Borel measurable,

i.e., F is a proper subset of F ∗ .

(b) There are as many Borel measurable sets as there are points on the real

line (this is the “cardinality of the continuum”), but there are as many

Lebesgue measurable sets as there are subsets of the real line (which is a

higher cardinality) [Billingsley]

(c) There exist subsets of [0, 1] that are not Lebesgue measurable; see Section

6 and [Williams, p. 192].


(d) It is not possible to construct a probability space in which the σ-field includes all subsets of [0, 1], with the property that P({x}) = 0 for every x ∈ (0, 1] [Billingsley, pp. 45-46].

In this appendix, we provide some evidence that not every subset of (0, 1] is Lebesgue measurable, and, furthermore, that Lebesgue measure cannot be extended to a measure defined for all subsets of (0, 1].

Let “+” stand for addition modulo 1 in (0, 1]. For example, 0.5 + 0.7 = 0.2,

instead of 1.2. You may want to visualize (0, 1] as a circle that wraps around so

that after 1, one starts again at 0. If A ⊂ (0, 1], and x is a number, then A + x

stands for the set of all numbers of the form y + x where y ∈ A.

Deﬁne x and y to be equivalent if x + r = y for some rational number r.

Then, (0, 1] can be partitioned into equivalence classes. (That is, all elements

in the same equivalence class are equivalent, elements belonging to different

equivalent classes are not equivalent, and every x ∈ (0, 1] belongs to exactly

one equivalence class.) Let us pick exactly one element from each equivalence

class, and let H be the set of the elements picked this way. (The fact that a set H

can be legitimately formed this way involves the Axiom of Choice, a generally

accepted axiom of set theory.) We will now consider the sets of the form H + r,

where r ranges over the rational numbers in (0, 1]. Note that there are countably

many such sets.

The sets H + r are disjoint. (Indeed, if r1 ≠ r2, and the two sets H + r1, H + r2 share the point h1 + r1 = h2 + r2, with h1, h2 ∈ H, then h1 and h2 differ by a rational number and are equivalent. If h1 ≠ h2, this contradicts the construction of H, which contains exactly one element from each equivalence class. If h1 = h2, then r1 = r2, which is again a contradiction.) Therefore, (0, 1] is the union of the countably many disjoint sets H + r.

The sets H + r, for different r, are “translations” of each other (they are

all formed by starting from the set H and adding a number, modulo 1). Let us

say that a measure is translation-invariant if it has the following property: if A

and A + x are measurable sets, then P(A) = P(A + x). Suppose that P is a

translation invariant probability measure, deﬁned on all subsets of (0, 1]. Then,

1 = P((0, 1]) = Σ_r P(H + r) = Σ_r P(H),

where the sum is taken over all rational numbers r in (0, 1]. But this is impossible: a countable sum of identical terms P(H) is either 0 (if P(H) = 0) or ∞ (if P(H) > 0), and can never equal 1. We conclude that a translation-invariant measure, defined on all subsets of (0, 1], does not exist.


On the other hand, it can be veriﬁed that the Lebesgue measure is translation-

invariant on the Borel σ-ﬁeld, as well as its extension, the Lebesgue σ-ﬁeld. This

implies that the Lebesgue σ-ﬁeld does not include all subsets of (0, 1].

An even stronger, and more counterintuitive, example is the following. It indicates that the ordinary notion of area or volume cannot be applied to arbitrary sets.

The Banach-Tarski Paradox. Let S be the two-dimensional surface of the unit

sphere in three dimensions. There exists a subset F of S such that for any k ≥ 3,

S = (τ1 F ) ∪ · · · ∪ (τk F ),

where each τi is a rigid rotation and the sets τi F are disjoint. For example, S can be made up of three rotated copies of F (suggesting probability equal to 1/3), but also of four rotated copies of F (suggesting probability equal to 1/4).

Ordinary geometric intuition clearly fails when dealing with arbitrary sets.


MIT OpenCourseWare

http://ocw.mit.edu

Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 3 9/10/2008

Most of the material in this lecture is covered in [Bertsekas & Tsitsiklis] Sections 1.3-1.5 and Problem 48 (or Problem 43, in the 1st edition), available at http://athenasc.com/Prob-2nd-Ch1.pdf. Solutions to the end-of-chapter problems are available at http://athenasc.com/prob-solved_2ndedition.pdf. These lecture notes provide some additional details and twists.

Contents

1. Conditional probability

2. Independence

3. The Borel-Cantelli lemma

1 CONDITIONAL PROBABILITY

Definition 1. Consider a probability space (Ω, F, P), and an event B ∈ F with P(B) > 0. For every event A ∈ F, the conditional probability that A occurs given that B occurs is denoted by P(A | B) and is defined by

P(A | B) = P(A ∩ B) / P(B).

Theorem 1.

(a) If B is an event with P(B) > 0, then P(Ω | B) = 1, and for any sequence {Ai} of disjoint events, we have

P(∪_{i=1}^{∞} Ai | B) = Σ_{i=1}^{∞} P(Ai | B).

(b) Suppose that B is an event with P(B) > 0. For every A ∈ F, deﬁne

PB (A) = P(A | B). Then, PB is a probability measure on (Ω, F).

(c) Let A be an event. If the events Bi , i ∈ N, form a partition of Ω, and

P(Bi ) > 0 for every i, then

P(A) = Σ_{i=1}^{∞} P(A | Bi) P(Bi).

(d) (Bayes’ rule) Let A be an event with P(A) > 0. If the events Bi , i ∈ N,

form a partition of Ω, and P(Bi ) > 0 for every i, then

P(Bi | A) = P(Bi)P(A | Bi) / P(A) = P(Bi)P(A | Bi) / Σ_{j=1}^{∞} P(Bj)P(A | Bj).

(e) (Multiplication rule) For any sequence {Ai} of events such that all the conditioning events below have positive probability, we have

P(∩_{i=1}^{∞} Ai) = P(A1) ∏_{i=2}^{∞} P(Ai | A1 ∩ · · · ∩ Ai−1).
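As a sanity check, the total probability theorem and Bayes' rule above can be verified exhaustively on a small finite probability space; the two-dice space below is an illustrative assumption, not part of the lecture:

```python
from fractions import Fraction
from itertools import product

# Toy probability space: two fair dice with the uniform measure.
Omega = list(product(range(1, 7), repeat=2))

def P(E):
    return Fraction(len(E), len(Omega))

def cond(A, B):
    # P(A | B) = P(A & B) / P(B), for P(B) > 0
    return P(A & B) / P(B)

A = {w for w in Omega if w[0] + w[1] == 7}                    # "sum is 7"
Bs = [{w for w in Omega if w[0] == i} for i in range(1, 7)]   # partition by first die

# Total probability theorem: P(A) = sum_i P(A | B_i) P(B_i).
assert P(A) == sum(cond(A, B) * P(B) for B in Bs)

# Bayes' rule: P(B_1 | A) = P(B_1) P(A | B_1) / sum_j P(B_j) P(A | B_j).
assert cond(Bs[0], A) == P(Bs[0]) * cond(A, Bs[0]) / sum(P(B) * cond(A, B) for B in Bs)
print(P(A))  # 1/6
```

Exact rational arithmetic (`Fraction`) makes the identities hold with equality rather than up to rounding.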

Proof.

(a) We have

P(∪_{i=1}^{∞} Ai | B) = P(B ∩ (∪_{i=1}^{∞} Ai)) / P(B) = P(∪_{i=1}^{∞} (B ∩ Ai)) / P(B).

Since the sets B ∩ Ai , i ∈ N are disjoint, countable additivity, applied to the

right-hand side, yields

P(∪_{i=1}^{∞} Ai | B) = Σ_{i=1}^{∞} P(B ∩ Ai) / P(B) = Σ_{i=1}^{∞} P(Ai | B),

as claimed.

(b) This is immediate from part (a).

(c) We have

P(A) = P(A ∩ Ω) = P(A ∩ (∪_{i=1}^{∞} Bi)) = P(∪_{i=1}^{∞} (A ∩ Bi)) = Σ_{i=1}^{∞} P(A ∩ Bi) = Σ_{i=1}^{∞} P(A | Bi) P(Bi).

In the second equality, we used the fact that the sets Bi form a partition of

Ω. In the next to last equality, we used the fact that the sets Bi are disjoint

and countable additivity.

(d) This follows from the fact that P(Bi | A) = P(A ∩ Bi)/P(A) = P(A | Bi)P(Bi)/P(A), together with part (c), which expresses P(A) as Σ_{j=1}^{∞} P(Bj)P(A | Bj).

(e) Note that the sequence of events ∩_{i=1}^{n} Ai is decreasing and converges to ∩_{i=1}^{∞} Ai. By the continuity property of probability measures, we have P(∩_{i=1}^{∞} Ai) = lim_{n→∞} P(∩_{i=1}^{n} Ai). Note that

P(∩_{i=1}^{n} Ai) = P(A1) · [P(A1 ∩ A2)/P(A1)] · · · [P(A1 ∩ · · · ∩ An)/P(A1 ∩ · · · ∩ An−1)]
= P(A1) ∏_{i=2}^{n} P(Ai | A1 ∩ · · · ∩ Ai−1).

2 INDEPENDENCE

Intuitively, two events are independent if the occurrence of one does not affect the probability assigned to the other. The following definition formalizes and generalizes the notion of independence.

Definition 2. Let (Ω, F, P) be a probability space.

(a) Two events, A and B, are said to be independent if P(A ∩ B) = P(A)P(B). If P(B) > 0, an equivalent condition is P(A) = P(A | B).

(b) Let S be an index set (possibly inﬁnite, or even uncountable), and let {As |

s ∈ S} be a family (set) of events. The events in this family are said to be

independent if for every ﬁnite subset S0 of S, we have

P(∩_{s∈S0} As) = ∏_{s∈S0} P(As).

(c) Two σ-fields, F1 and F2, contained in F are said to be independent if any two events A1 ∈ F1 and A2 ∈ F2 are independent.

(d) More generally, let S be an index set, and for every s ∈ S, let Fs be a σ-field contained in F. We say that the σ-fields Fs are independent if the

following holds. If we pick one event As from each Fs , the events in the

resulting family {As | s ∈ S} are independent.

Example. Consider an inﬁnite sequence of fair coin tosses, under the model constructed

in the Lecture 2 notes. The following statements are intuitively obvious (although a

formal proof would require a few steps).

(a) Let Ai be the event that the ith toss resulted in a “1”. If i ≠ j, the events Ai and Aj are independent.

(b) The events in the (inﬁnite) family {Ai | i ∈ N} are independent. This statement

captures the intuitive idea of “independent” coin tosses.

(c) Let F1 (respectively, F2 ) be the collection of all events whose occurrence can be

decided by looking at the results of the coin tosses at odd (respectively, even) times

n. More formally, let Hi be the event that the ith toss resulted in a 1. Let C be the collection of events C = {Hi | i is odd}, and finally let F1 = σ(C), so that F1 is the smallest σ-field that contains all the events Hi, for odd i. We define F2

similarly, using even times instead of odd times. Then, the two σ-ﬁelds F1 and F2

turn out to be independent. This statement captures the intuitive idea that knowing

the results of the tosses at odd times provides no information on the results of the

tosses at even times.

(d) Let Fn be the collection of all events whose occurrence can be decided by looking

at the results of tosses 2n and 2n + 1. (Note that each Fn is a σ-ﬁeld comprised of

ﬁnitely many events.) Then, the families Fn , n ∈ N, are independent.

Remark: How can one establish that two σ-ﬁelds (e.g., as in the above coin


tossing example) are independent? It turns out that one only needs to check

independence for smaller collections of sets; see the theorem below (the proof

is omitted and can be found in p. 39 of [W]).

Theorem: If G1 and G2 are two collections of measurable sets, that are closed

under intersection (that is, if A, B ∈ Gi , then A ∩ B ∈ Gi ), if Fi = σ(Gi ),

i = 1, 2, and if P(A ∩ B) = P(A)P(B) for every A ∈ G1 , B ∈ G2 , then F1 and

F2 are independent.

3 THE BOREL-CANTELLI LEMMA

The Borel-Cantelli lemma is a tool that is often used to establish that a certain event has probability zero or one. Given a sequence of events An, n ∈ N, recall that {An i.o.} (read as “An occurs infinitely often”) is the event consisting of all ω ∈ Ω that belong to infinitely many An, and that

{An i.o.} = lim sup_{n→∞} An = ∩_{n=1}^{∞} ∪_{i=n}^{∞} Ai.

Theorem 2. (Borel-Cantelli lemma) Let {An} be a sequence of events, and let A = {An i.o.}.

(a) If Σ_{n=1}^{∞} P(An) < ∞, then P(A) = 0.

(b) If Σ_{n=1}^{∞} P(An) = ∞ and the events An, n ∈ N, are independent, then P(A) = 1.

Remark: The result in part (b) is not true without the independence assumption.

Indeed, consider an arbitrary event C such that 0 < P(C) < 1 and let An = C for all n. Then P({An i.o.}) = P(C) < 1, even though Σ_n P(An) = ∞.

The following lemma is useful here and in many other contexts.

Lemma 1. Suppose that 0 ≤ pi ≤ 1 for every i ∈ N, and that Σ_{i=1}^{∞} pi = ∞. Then, ∏_{i=1}^{∞} (1 − pi) = 0.

Proof. Note that log(1 − x) is a concave function of its argument, and its derivative at x = 0 is −1. It follows that log(1 − x) ≤ −x, for x ∈ [0, 1]. We then have

log ∏_{i=1}^{∞} (1 − pi) = log ( lim_{n→∞} ∏_{i=1}^{n} (1 − pi) )
≤ log ∏_{i=1}^{k} (1 − pi) (since each factor lies in [0, 1], the partial products are nonincreasing)
= Σ_{i=1}^{k} log(1 − pi)
≤ Σ_{i=1}^{k} (−pi).

This is true for every k. By taking the limit as k → ∞, we obtain log ∏_{i=1}^{∞} (1 − pi) = −∞, and ∏_{i=1}^{∞} (1 − pi) = 0.
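Lemma 1 can be illustrated numerically; the particular choice p_i = 1/(i + 1) below is an assumption made for the sketch (it satisfies 0 ≤ p_i ≤ 1 and Σ p_i = ∞), and the partial products happen to telescope to 1/(k + 1):

```python
import math

# Partial products prod_{i=1}^{k} (1 - p_i) with p_i = 1/(i+1);
# they telescope: prod (i/(i+1)) = 1/(k+1), which tends to 0.
def partial_product(k):
    return math.prod(1 - 1 / (i + 1) for i in range(1, k + 1))

for k in (10, 100, 1000):
    print(k, partial_product(k))  # decreases toward 0 like 1/(k+1)
```

The divergence of Σ p_i is essential: for a summable sequence such as p_i = 1/2^i, the same product converges to a strictly positive limit.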

Proof of Theorem 2.

(a) The assumption Σ_{n=1}^{∞} P(An) < ∞ implies that lim_{n→∞} Σ_{i=n}^{∞} P(Ai) = 0. Note that for every n, we have A ⊂ ∪_{i=n}^{∞} Ai. Then, the union bound implies that

P(A) ≤ P(∪_{i=n}^{∞} Ai) ≤ Σ_{i=n}^{∞} P(Ai).

We take the limit of both sides as n → ∞. Since the right-hand side converges to zero, P(A) must be equal to zero.

(b) Let Bn = ∪_{i=n}^{∞} Ai, and note that A = ∩_{n=1}^{∞} Bn. We claim that P(Bn^c) = 0, for every n. This will imply the desired result because

P(A^c) = P(∪_{n=1}^{∞} Bn^c) ≤ Σ_{n=1}^{∞} P(Bn^c) = 0.

By independence, for every m ≥ n, we have

P(∩_{i=n}^{m} Ai^c) = ∏_{i=n}^{m} P(Ai^c) = ∏_{i=n}^{m} (1 − P(Ai)).

Since Σ_{n=1}^{∞} P(An) = ∞, we also have Σ_{i=n}^{∞} P(Ai) = ∞. Using Lemma 1, with the sequence {P(Ai) | i ≥ n} replacing the sequence {pi}, we obtain

P(Bn^c) = P(∩_{i=n}^{∞} Ai^c) = lim_{m→∞} P(∩_{i=n}^{m} Ai^c) = lim_{m→∞} ∏_{i=n}^{m} (1 − P(Ai)) = 0,

where the second equality made use of the continuity property of probability

measures.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 4 9/15/2008

COUNTING

Readings: [Bertsekas & Tsitsiklis], Section 1.6, and solved problems 57-58 (in

1st edition) or problems 61-62 (in 2nd edition). These notes only cover the part

of the lecture that is not covered in [BT].

1 THE BANACH MATCHBOX PROBLEM

A mathematician carries two matchboxes, one in each pocket, each box initially containing n matches. Each time a match is needed, the mathematician reaches into a “random” pocket and takes a match out of the corresponding box. We are interested in the probability that the first time that the mathematician reaches into a pocket and finds an empty box, the other box contains exactly k matches.

Solution: The event of interest can happen in two ways:

(a) In the ﬁrst 2n − k times, the mathematician reached n times into the right

pocket, n − k times into the left pocket, and then, at time 2n − k + 1, into

the right pocket.

(b) In the ﬁrst 2n − k times, the mathematician reached n times into the left

pocket, n − k times into the right pocket, and then, at time 2n − k + 1, into

the left pocket.

The probability of scenario (a) is

C(2n − k, n) · (1/2^{2n−k}) · (1/2),

where C(2n − k, n) denotes the binomial coefficient “2n − k choose n”.

Scenario (b) has the same probability. Thus, the overall probability is

C(2n − k, n) · (1/2^{2n−k}).
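The formula above can be cross-checked against a direct simulation of the experiment; the sketch below (with an arbitrarily chosen n = 5) is an illustration, not part of the lecture:

```python
from math import comb
import random

def matchbox_probability(n, k):
    # P(other box holds exactly k matches when a box is first found empty)
    # = C(2n - k, n) / 2^(2n - k), from the solution above.
    return comb(2 * n - k, n) / 2 ** (2 * n - k)

def simulate(n, trials=100_000, seed=0):
    # Reach into a random pocket until a box is found empty; record how many
    # matches the other box holds at that moment.
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(trials):
        boxes = [n, n]
        while True:
            i = rng.randrange(2)
            if boxes[i] == 0:
                counts[boxes[1 - i]] += 1
                break
            boxes[i] -= 1
    return [c / trials for c in counts]

n = 5
est = simulate(n)
for k in range(n + 1):
    print(k, round(matchbox_probability(n, k), 4), round(est[k], 4))
```

The probabilities over k = 0, . . . , n sum to 1, which is a useful consistency check on the formula.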

2 MULTINOMIAL PROBABILITIES

Problem: Suppose that each one of n independent trials can result in one of r possible results, a1, a2, . . . , ar, and that the ith result is obtained with probability pi. What is the probability that in n trials there were exactly n1 results equal to a1, n2 results equal to a2, etc., where the ni are given nonnegative integers that add to n?

Solution: Note that every possible outcome (n-long sequence of results) that involves ni results equal to ai, for all i, has the same probability, p1^{n1} · · · pr^{nr}. How many such sequences are there? Any such sequence corresponds to a partition of the set {1, . . . , n} of trials into subsets of sizes n1, . . . , nr: the ith subset, of size ni, indicates the trials at which the result was equal to ai. Thus, using the formula for the number of partitions, the desired probability is equal to

(n! / (n1! · · · nr!)) · p1^{n1} · · · pr^{nr}.
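The multinomial formula is a one-liner to compute; the die-rolling example below is an illustrative assumption, and the r = 2 assertion checks that the formula reduces to the familiar binomial probability:

```python
from math import comb, factorial, prod

def multinomial_probability(ns, ps):
    # n!/(n_1! ... n_r!) * p_1^{n_1} ... p_r^{n_r}
    n = sum(ns)
    coeff = factorial(n) // prod(factorial(ni) for ni in ns)
    return coeff * prod(p ** ni for p, ni in zip(ps, ns))

# A fair die rolled 6 times, each face appearing exactly once:
print(multinomial_probability([1] * 6, [1 / 6] * 6))  # 6!/6^6, about 0.0154

# For r = 2 this reduces to the binomial formula.
assert abs(multinomial_probability([2, 3], [0.4, 0.6])
           - comb(5, 2) * 0.4**2 * 0.6**3) < 1e-12
```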


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 5 9/17/2008

RANDOM VARIABLES

Contents

1. Random variables
2. Cumulative distribution functions
3. Discrete random variables
4. Continuous random variables
A. Continuity
B. The Cantor set, an unusual random variable, and a singular measure

1 RANDOM VARIABLES

Loosely speaking, a random variable is a quantity whose value depends on the outcome of an experiment. More precisely, a random variable can be viewed as a function from the sample space to the real numbers, and we will use the notation X(ω) to denote the numerical value of a random variable X, when the outcome of the experiment is some particular ω. We may be interested in the probability that the outcome of the experiment is such that X is no larger than some c, i.e., that the outcome belongs to the set {ω | X(ω) ≤ c}. Of course, in order to have a probability assigned to that set, we need to make sure that it is F-measurable. This motivates Definition 1 below.

Example 1. Consider a sequence of five consecutive coin tosses. An appropriate sample space is Ω = {0, 1}^5, where “1” stands for heads and “0” for tails. Let F be the collection of all subsets of Ω, and suppose that a probability measure P has been assigned to (Ω, F). We are interested in the number of heads obtained in this experiment. This quantity can be described by the function X : Ω → R, defined by

X(ω1, . . . , ω5) = ω1 + · · · + ω5.


With this definition, the set {ω | X(ω) < 4} is just the event that there were fewer than 4 heads overall; it belongs to the σ-field F, and therefore has a well-defined probability.
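Since Ω is finite, the event in Example 1 can be checked exhaustively; the sketch below assumes the uniform measure on {0,1}^5, which is one possible choice of P (the example itself allows any P):

```python
from fractions import Fraction
from itertools import product

# Exhaustive check of Example 1 under the uniform measure on {0,1}^5.
Omega = list(product((0, 1), repeat=5))

def X(w):
    # number of heads in the outcome w
    return sum(w)

event = [w for w in Omega if X(w) < 4]
print(Fraction(len(event), len(Omega)))  # 13/16
```

Out of 32 outcomes, only the 6 with four or five heads are excluded, giving 26/32 = 13/16.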

Consider the real line, and let B be the associated Borel σ-field. Sometimes, we will also allow random variables that take values in the extended real line, R̄ = R ∪ {−∞, ∞}. We define the Borel σ-field on R̄, also denoted by B, as the smallest σ-field that contains all Borel subsets of R and the sets {−∞} and {∞}.

Definition 1.

(a) A function X : Ω → R is a random variable if the set {ω | X(ω) ≤ c} is F-measurable for every c ∈ R.

(b) A function X : Ω → R̄ is an extended-valued random variable if the set {ω | X(ω) ≤ c} is F-measurable for every c ∈ R.

We will generally use upper case letters to denote random variables and lower case letters to denote

numerical values (elements of R). Thus, a statement such as “X(ω) = x = 5”

means that when the outcome happens to be ω, then the realized value of the

random variable is a particular number x, equal to 5.

Example 2. (Indicator functions) Suppose that A ⊂ Ω, and let IA : Ω → {0, 1} be

the indicator function of that set; i.e., IA (ω) = 1, if ω ∈ A, and IA (ω) = 0, otherwise.

If A ∈ F, then IA is a random variable. But if A ∉ F, then IA is not a random variable.

Example 3. Let X be a random variable, and let us define a function Y : Ω → R by letting Y (ω) = X^3(ω), for every ω ∈ Ω, or Y = X^3 for short. We claim that Y is also a random variable. Indeed, for any c ∈ R, the set {ω | Y (ω) ≤ c} is the same as the set {ω | X(ω) ≤ c^{1/3}}, which is in F, since X is a random variable.

For a random variable X, the event {ω | X(ω) ≤ c} is often written as {X ≤ c},

and is sometimes just called “the event that X ≤ c.” The probability of this event

is well deﬁned, since this event belongs to F. Let now B be a more general

subset of the real line. We use the notation X −1 (B) or {X ∈ B} to denote the

set {ω | X(ω) ∈ B}.


Because the collection of intervals of the form (−∞, c] generates the Borel σ-field in R, it can be shown that if X is a random variable, then for any Borel set B, the set X^{−1}(B) is F-measurable. It follows that the probability P(X^{−1}(B)) = P({ω | X(ω) ∈ B}) is well-defined. It is often denoted by P(X ∈ B).

Definition 2. Let (Ω, F, P) be a probability space, and let X : Ω → R be a random variable.

(a) For every Borel subset B of the real line (i.e., B ∈ B), we define PX(B) = P(X ∈ B).

(b) The resulting function PX : B → [0, 1] is called the probability law of X.

The probability law of X is related to, but should not be confused with, the cumulative distribution function defined in the next section.

According to the next result, the law PX of X is also a probability measure.

Notice here that PX is a measure on (R, B), as opposed to the original measure

P, which is a measure on (Ω, F). In many instances, the original probability

space (Ω, F, P) remains in the background, hidden or unused, and one works

directly with the much more tangible probability space (R, B, PX ). Indeed, if

we are only interested in the statistical properties of the random variable X, the

latter space will do.

Proposition 1. Let X be a random variable. Then, the law PX of X is a probability measure on (R, B).

Proof: Clearly, PX (B) ≥ 0, for every Borel set B. Also, PX (R) = P(X ∈

R) = P(Ω) = 1. We now verify countable additivity. Let {Bi } be a countable

sequence of disjoint Borel subsets of R. Note that the sets X^{−1}(Bi) are also disjoint, and that

X^{−1}(∪_{i=1}^{∞} Bi) = ∪_{i=1}^{∞} X^{−1}(Bi),

or, in different notation,

{X ∈ ∪_{i=1}^{∞} Bi} = ∪_{i=1}^{∞} {X ∈ Bi}.

Therefore, using the countable additivity of P,

PX(∪_{i=1}^{∞} Bi) = P(X ∈ ∪_{i=1}^{∞} Bi) = Σ_{i=1}^{∞} P(X ∈ Bi) = Σ_{i=1}^{∞} PX(Bi).


1.3 Technical digression: measurable functions

The following generalizes the definition of a random variable.

Definition 3. Let (Ω1, F1) and (Ω2, F2) be two measurable spaces. A function f : Ω1 → Ω2 is called (F1, F2)-measurable (or just measurable, if the relevant σ-fields are clear from the context) if f^{−1}(B) ∈ F1 for every B ∈ F2.

According to the above deﬁnition, given a probability space (Ω, F, P), and

taking into account the discussion in Section 1.1, a random variable on a proba

bility space is a function X : Ω → R that is (F, B)-measurable.

As a general rule, functions constructed from other measurable functions

using certain simple operations are measurable. We collect, without proof, a

number of relevant facts below.

(a) (Simple random variables) If A ∈ F, the corresponding indicator function

IA is measurable (more precisely, it is (F, B)-measurable).

(b) If A1, . . . , An are F-measurable sets, and x1, . . . , xn are real numbers, the function X = Σ_{i=1}^{n} xi IAi, or in more detail,

X(ω) = Σ_{i=1}^{n} xi IAi(ω), ∀ ω ∈ Ω,

is measurable.

(c) Suppose that (Ω, F) = (R, B), and that X : R → R is continuous. Then,

X is a random variable.

(d) (Functions of a random variable) Let X be a random variable, and sup

pose that f : R → R is continuous (or more generally, (B, B)-measurable).

Then, f (X) is a random variable.

(e) (Functions of multiple random variables) Let X1 , . . . , Xn be random

variables, and suppose that f : Rn → R is continuous. Then, f (X1 , . . . , Xn )

is a random variable. In particular, X1 +X2 and X1 X2 are random variables.

Another way that we can form a random variable is by taking the limit of

a sequence of random variables. Let us ﬁrst introduce some terminology. Let

each fn be a function from some set Ω into R. Consider a new function f =

inf n fn deﬁned by f (ω) = inf n fn (ω), for every ω ∈ Ω. The functions supn fn ,

lim inf n→∞ fn , and lim supn→∞ fn are deﬁned similarly. (Note that even if

the fn are everywhere ﬁnite, the above deﬁned functions may turn out to be

extended-valued.) If the limit lim_{n→∞} fn(ω) exists for every ω, we say that the

sequence of functions {fn } converges pointwise, and deﬁne its pointwise limit

to be the function f deﬁned by f (ω) = limn→∞ fn (ω). For example, suppose

that Ω = [0, 1] and that fn (ω) = ω n . Then, the pointwise limit f = limn→∞ fn

exists, and satisﬁes f (1) = 1, and f (ω) = 0 for ω ∈ [0, 1).

Theorem 2. If Xn is a random variable for every n, then inf_n Xn, sup_n Xn, lim inf_{n→∞} Xn, and lim sup_{n→∞} Xn are random variables. Furthermore, if the sequence {Xn} converges pointwise, and X = lim_{n→∞} Xn, then X is also a random variable.

In particular, the pointwise limit of a sequence of random variables is measurable. On the other hand, we note that measurable

functions can be highly discontinuous. For example the function which equals

1 at every rational number, and equals zero otherwise, is measurable, because it

is the indicator function of a measurable set.

2 CUMULATIVE DISTRIBUTION FUNCTIONS

The probability law of a random variable can be described (in part) in terms of the cumulative distribution function, which we now define.

Definition 4. Let X be a random variable. The function FX : R → [0, 1], defined by

FX(x) = P(X ≤ x),

is called the cumulative distribution function (CDF) of X.

Example 4. Let X be the number of heads in two independent tosses of a fair coin. In particular, P(X = 0) = P(X = 2) = 1/4, and P(X = 1) = 1/2. Then,

FX(x) = 0, if x < 0;
FX(x) = 1/4, if 0 ≤ x < 1;
FX(x) = 3/4, if 1 ≤ x < 2;
FX(x) = 1, if x ≥ 2.

Example 5. (A uniform random variable and its square) Consider a probability space (Ω, B, P), where Ω = [0, 1], B is the Borel σ-field, and P is the Lebesgue measure. The random variable U defined by U (ω) = ω is said to be uniformly distributed. Its CDF is given by

FU(x) = 0, if x < 0;
FU(x) = x, if 0 ≤ x < 1;
FU(x) = 1, if x ≥ 1.

Consider now the random variable X = U^2. We have P(U^2 ≤ x) = 0, when x < 0, and P(U^2 ≤ x) = 1, when x ≥ 1. For x ∈ [0, 1), we have

P(U^2 ≤ x) = P({ω ∈ [0, 1] | ω^2 ≤ x}) = P({ω ∈ [0, 1] | ω ≤ √x}) = √x.

Thus,

FX(x) = 0, if x < 0;
FX(x) = √x, if 0 ≤ x < 1;
FX(x) = 1, if x ≥ 1.
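The computation in Example 5 can be checked by Monte Carlo simulation; the sketch below (an illustration, not part of the lecture) estimates P(U^2 ≤ x) empirically and compares it with √x:

```python
import math
import random

# Monte Carlo check of Example 5: if U is uniform on [0,1], then
# P(U^2 <= x) = sqrt(x) for x in [0, 1).
rng = random.Random(0)
samples = [rng.random() ** 2 for _ in range(100_000)]
for x in (0.1, 0.25, 0.5, 0.9):
    empirical = sum(s <= x for s in samples) / len(samples)
    print(x, round(empirical, 3), round(math.sqrt(x), 3))
```

With 100,000 samples, the empirical frequencies agree with √x to within a few parts in a thousand.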

The cumulative distribution function of a random variable always possesses certain properties.

Theorem 3. Let X be a random variable, and let FX be its CDF.

(a) (Monotonicity) If x ≤ y, then FX(x) ≤ FX(y).
(b) (Limiting values) We have lim_{x→−∞} FX(x) = 0, and lim_{x→∞} FX(x) = 1.
(c) (Right-continuity) For every x, we have lim_{y↓x} FX(y) = FX(x).

Proof:

(a) Suppose that x ≤ y. Then, {X ≤ x} ⊂ {X ≤ y}, which implies that P(X ≤ x) ≤ P(X ≤ y), i.e., FX(x) ≤ FX(y).

(b) Since FX(x) is monotonic in x and bounded below by zero, it converges as x → −∞, and the limit is the same for every sequence {xn} converging to −∞. So, let xn = −n, and note that the decreasing sequence of events {X ≤ −n} converges to the empty set: ∩_{n=1}^{∞} {X ≤ −n} = Ø. Using the continuity of probabilities, we obtain

lim_{x→−∞} FX(x) = lim_{n→∞} P(X ≤ −n) = P(Ø) = 0.

The limit lim_{x→∞} FX(x) = 1 is established similarly, using the fact that the increasing sequence of events {X ≤ n} converges to Ω.

(c) Consider a decreasing sequence {xn} that converges to x. The sequence of events {X ≤ xn} is decreasing and ∩_{n=1}^{∞} {X ≤ xn} = {X ≤ x}. Using the continuity of probabilities, we obtain

lim_{n→∞} FX(xn) = lim_{n→∞} P(X ≤ xn) = P(X ≤ x) = FX(x).

Since this is true for every such sequence {xn}, we conclude that lim_{y↓x} FX(y) = FX(x).

A CDF need not be left-continuous. In Example 4, we have lim_{x↑0} FX(x) = 0, but FX(0) = 1/4.

Consider a function F : R → [0, 1] that satisﬁes the three properties in Theorem

3; we call such a function a distribution function. Given a distribution function

F , does there exist a random variable X, deﬁned on some probability space,

whose CDF, FX , is equal to the given distribution F ? This is certainly the

case for the distribution function F function that satisﬁes F (x) = x, for x ∈

(0, 1): the uniform random variable U in Example 5 will do. More generally,

the objective FX = F can be accomplished by letting X = g(U ), for a suitable

function g : (0, 1) → R.

Theorem 4. Let F be a given distribution function. Consider the probability space ([0, 1], B, P), where B is the Borel σ-field, and P is the Lebesgue measure. There exists a measurable function X : Ω → R whose CDF FX satisfies FX = F.

Proof: We only give the proof for a special case. In particular, we assume that F is continuous and strictly increasing. Then, the

range of F is the entire interval (0, 1). Furthermore, F is invertible: for ev

ery y ∈ (0, 1), there exists a unique x, denoted F −1 (y), such that F (x) = y.

We deﬁne U (ω) = ω and X(ω) = F −1 (ω), for every ω ∈ (0, 1), so that

X = F −1 (U ). Note that F (F −1 (ω)) = ω for every ω ∈ (0, 1), so that

F (X) = U. Since F is strictly increasing, we have X ≤ x if and only if F (X) ≤ F (x), or U ≤ F (x). (Note that this also establishes that the event

{X ≤ x} is measurable, so that X is indeed a random variable.) Thus, for every

x ∈ R, we have

FX(x) = P(X ≤ x) = P(F (X) ≤ F (x)) = P(U ≤ F (x)) = F (x),

as desired.
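The construction in the proof (often called inverse transform sampling) is directly usable for simulation. The sketch below instantiates it for one concrete continuous, strictly increasing choice of F, the exponential CDF F(x) = 1 − exp(−x); this particular F is an assumption made for the illustration:

```python
import math
import random

# X = F^{-1}(U) has CDF F, per Theorem 4.
# Here F(x) = 1 - exp(-x), so F^{-1}(u) = -log(1 - u).
def sample_from_F(rng):
    u = rng.random()          # U uniform on [0, 1)
    return -math.log(1 - u)   # X = F^{-1}(U)

rng = random.Random(0)
xs = [sample_from_F(rng) for _ in range(100_000)]
empirical = sum(v <= 1.0 for v in xs) / len(xs)
print(round(empirical, 3), round(1 - math.exp(-1), 3))  # both near 0.632
```

The empirical frequency of {X ≤ 1} matches F(1) = 1 − e^{−1}, as the identity FX = F requires.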

Note that the probability law of X assigns probabilities to all Borel sets,

whereas the CDF only speciﬁes the probabilities of certain intervals. Neverthe

less, the CDF contains enough information to recover the law of X.

Proposition 2. The probability law PX of a random variable X is completely determined by the CDF FX.

Proof: (Outline) Let F0 be the collection of all subsets of the real line that are

unions of ﬁnitely many intervals of the form (a, b]. Then, F0 is a ﬁeld. Note

that the CDF FX can be used to completely determine the probability PX(A) of any set A ∈ F0. Indeed, this is done using relations such as

PX((a, b]) = FX(b) − FX(a).

Using the Extension Theorem, it follows that there exists a unique measure on (R, B) (the probability law of X) that agrees with the values determined by FX on F0.

3 DISCRETE RANDOM VARIABLES

Discrete random variables take values in a countable set. We need some notation. Given a function f : Ω → R, its range is the set

f(Ω) = {f(ω) | ω ∈ Ω}.

Definition 5. (Discrete random variables and PMFs)

(a) A random variable X, deﬁned on a probability space (Ω, F, P), is said to be

discrete if its range X(Ω) is countable.

(b) If X is a discrete random variable, the function pX : R → [0, 1] deﬁned by

pX (x) = P(X = x), for every x, is called the (probability) mass function

of X, or PMF for short.

Consider a discrete random variable X whose range is a countable set C. In this case, for any Borel set A, countable additivity yields

P(X ∈ A) = P(X ∈ A ∩ C) = Σ_{x ∈ A∩C} P(X = x).

In particular, the CDF of X is given by

FX(x) = Σ_{y ∈ C, y ≤ x} P(X = y).

A random variable that takes only integer values is discrete. For instance, the

random variable in Example 4 (number of heads in two coin tosses) is discrete.

Also, every simple random variable is discrete, since it takes a ﬁnite number of

values. However, more complicated discrete random variables are also possible.

Example 6. Let the sample space be the set N of natural numbers, and consider a

measure that satisfies P({n}) = 1/2^n, for every n ∈ N. The random variable X defined

by X(n) = n is discrete.

Suppose now that the rational numbers have been arranged in a sequence, and that

xn is the nth rational number, according to this sequence. Consider the random variable

Y deﬁned by Y (n) = xn . The range of this random variable is countable, so Y is a

discrete random variable. Its range is the set of rational numbers, every rational number

has positive probability, and the set of irrational numbers has zero probability.

We close by noting that discrete random variables can be represented in

terms of indicator functions. Indeed, given a discrete random variable X, with

range {x1 , x2 , . . .}, we deﬁne An = {X = xn }, for every n ∈ N. Observe that

each set An is measurable (why?). Furthermore, the sets An , n ∈ N, form a

partition of the sample space. Using indicator functions, we can write

X(ω) = Σ_{n=1}^{∞} xn I_{An}(ω).


Conversely, suppose we are given a sequence {An } of disjoint events, and a

real sequence {xn }. Deﬁne X : Ω → R by letting X(ω) = xn if and only if

ω ∈ An . Then X is a discrete random variable, and P(X = xn ) = P(An ), for

every n.
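This representation can be sketched directly; the sample space, partition, and values below are illustrative choices of our own, not from the text:

```python
# Omega = {0, ..., 5} (a die roll), with the partition A_1 = {even
# outcomes}, A_2 = {odd outcomes}, and values x_1 = 10.0, x_2 = -1.0.
Omega = range(6)
partition = {1: {0, 2, 4}, 2: {1, 3, 5}}    # disjoint events A_n
values = {1: 10.0, 2: -1.0}                 # real numbers x_n

def indicator(A, omega):
    return 1 if omega in A else 0

def X(omega):
    # X(omega) = sum over n of x_n * I_{A_n}(omega)
    return sum(values[n] * indicator(partition[n], omega) for n in partition)

assert [X(w) for w in Omega] == [10.0, -1.0, 10.0, -1.0, 10.0, -1.0]
```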

The definition of a continuous random variable is more subtle; it is not enough for a random variable to have a “continuous range.”

Definition 6. (Continuous random variables) A random variable X, defined on a probability space (Ω, F, P), is said to be continuous if there exists a nonnegative measurable function f : R → [0, ∞) such that

FX(x) = ∫_{−∞}^{x} f(t) dt,  ∀ x ∈ R.   (1)

The function f is called a (probability) density function (or PDF, for short) for X.

At this point, the meaning of the integral of a measurable function may be unclear. We will see later in this

course how such an integral is deﬁned. For now, we just note that the integral

is well-deﬁned, and agrees with the Riemann integral encountered in calculus,

if the function is continuous, or more generally, if it has a ﬁnite number of

discontinuities.
Since lim_{x→∞} FX(x) = 1, we must have lim_{x→∞} ∫_{−∞}^{x} f(t) dt = 1, or

∫_{−∞}^{∞} f(t) dt = 1.   (2)

Any nonnegative measurable function that satisfies Eq. (2) is called a density function. Conversely, given a density function f, we can define F(x) = ∫_{−∞}^{x} f(t) dt, and verify that F is a distribution function. It follows that given a density function, there always exists a random variable whose PDF is the given density.

If a CDF FX is differentiable at some x, the corresponding value fX (x) can

be found by taking the derivative of FX at that point. However, CDFs need not

be differentiable, so this will not always work. Let us also note that a PDF of

a continuous random variable is not uniquely deﬁned. We can always change

the PDF at a ﬁnite set of points, without affecting its integral, hence multiple


PDFs can be associated to the same CDF. However, this nonuniqueness rarely

becomes an issue. In the sequel, we will often refer to “the PDF” of X, ignoring

the fact that it is nonunique.

Example 7. For a uniform random variable, we have FX (x) = P(X ≤ x) = x, for

every x ∈ (0, 1). By differentiating, we ﬁnd fX (x) = 1, for x ∈ (0, 1). For x < 0

we have FX (x) = 0, and for x > 1 we have FX (x) = 1; in both cases, we obtain

fX (x) = 0. At x = 0, the CDF is not differentiable. We are free to deﬁne fX (0) to be

0, or 1, or in fact any real number; the value of the integral of fX will remain unaffected.

Using the PDF of a continuous random variable, we can calculate the prob

ability of various subsets of the real line. For example, we have P(X = x) = 0,

for all x, and if a < b,

P(a < X < b) = P(a ≤ X ≤ b) = ∫_a^b fX(t) dt.

More generally, for any Borel set B,

P(X ∈ B) = ∫_B fX(t) dt = ∫ I_B(t) fX(t) dt,

but more detail on this subject must wait until we develop the theory of integration of measurable functions.

We close by pointing out that not all random variables are continuous or

discrete. For example, suppose that X is a discrete random variable, and that Y

is a continuous random variable. Fix some λ ∈ (0, 1), and define

FZ(x) = λFX(x) + (1 − λ)FY(x),  ∀ x ∈ R.   (3)

It can be verified that FZ is a valid CDF, and it can therefore be viewed as the CDF of some new random variable Z. However, the random variable Z is

neither discrete, nor continuous. For an interpretation, we can visualize Z being

generated as follows: we ﬁrst generate the random variables X and Y ; then,

with probability λ, we set Z = X, and with probability 1 − λ, we set Z = Y .
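This mixture interpretation can be checked by simulation. A sketch, with illustrative choices of X (a fair coin with values 0 and 1) and Y (uniform on (0, 1)) that are our own, not from the text:

```python
import random

rng = random.Random(1)
lam = 0.3
n = 200_000
# With probability lam draw Z from X; otherwise draw Z from Y.
zs = [rng.choice([0.0, 1.0]) if rng.random() < lam else rng.random()
      for _ in range(n)]

def F_X(x):   # CDF of the fair coin on {0, 1}
    return 0.0 if x < 0 else (0.5 if x < 1 else 1.0)

def F_Y(x):   # CDF of uniform(0, 1)
    return min(max(x, 0.0), 1.0)

# The empirical CDF of Z should match lam*F_X(x) + (1-lam)*F_Y(x).
for x in (0.25, 0.5, 0.9):
    empirical = sum(z <= x for z in zs) / n
    assert abs(empirical - (lam * F_X(x) + (1 - lam) * F_Y(x))) < 0.01
```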

Even more pathological random variables are possible. Appendix B dis

cusses a particularly interesting one.

We next review the definitions of convergence of function values, and continuity.


(a) Consider a function f : Rm → R, and ﬁx some x ∈ Rm . We say that f (y)

converges to a value c, as y tends to x, if we have limn→∞ f (xn ) = c, for

every sequence {xn} of elements of Rm such that xn ≠ x for all n, and

limn→∞ xn = x. In this case, we write limy→x f (y) = c.

(b) If f(y) converges to f(x) as y tends to x, we say that f is continuous at x. If this holds for every x, we say that f is continuous.

(c) Consider a function f : R → R, and fix some x ∈ R. We say that f(y) converges to a value c, as y decreases to x, if we have lim_{n→∞} f(xn) = c, for every decreasing sequence {xn} of elements of R such that xn > x for all n, and lim_{n→∞} xn = x. In this case, we write lim_{y↓x} f(y) = c.

(d) If f(y) converges to f(x) as y decreases to x, we say that f is right-continuous at x. If this holds for every x ∈ R, we say that f is right-continuous.

(e) Similarly, we say that f(y) converges to a value c, as y increases to x, if we have lim_{n→∞} f(xn) = c, for every increasing sequence {xn} of elements of R such that xn < x for all n, and lim_{n→∞} xn = x. In this case, we write lim_{y↑x} f(y) = c.

(f) If f(y) converges to f(x) as y increases to x, we say that f is left-continuous at x. If this holds for every x ∈ R, we say that f is left-continuous.

It can be seen, from the above definitions, that f : R → R is continuous at x if and only if it is both left-continuous and right-continuous.

APPENDIX B: THE CANTOR SET, AN UNUSUAL RANDOM VARIABLE, AND A SINGULAR MEASURE

Every number x ∈ [0, 1] admits a ternary expansion of the form

x = Σ_{i=1}^{∞} xi / 3^i,  with xi ∈ {0, 1, 2}.   (4)

This expansion is not unique. For example, 1/3 admits two expansions, namely

.10000 · · · and .022222 · · · . Nonuniqueness occurs only for those x that admit

an expansion ending with an inﬁnite sequence of 2s. The set of such unusual x

is countable, and therefore has Lebesgue measure zero.


The Cantor set C is deﬁned as the set of all x ∈ [0, 1] that have a ternary ex

pansion that uses only 0s and 2s (no 1s allowed). The set C can be constructed as

follows. Start with the interval [0, 1] and remove the “middle third” (1/3, 2/3).

Then, from each of the remaining closed intervals, [0, 1/3] and [2/3, 1], remove

their middle thirds, (1/9, 2/9) and (7/9, 8/9), resulting in four closed intervals,

and continue this process indeﬁnitely. Note that C is measurable, since it is

constructed by removing a countable sequence of intervals. Also, the length

(Lebesgue measure) of C is 0, since at each stage its length is multiplied by a

factor of 2/3. On the other hand, the set C has the same cardinality as the set

{0, 2}∞ , and is uncountable.

Consider now an inﬁnite sequence of independent rolls of a 3-sided die,

whose faces are labeled 0, 1, and 2. Assume that at each roll, each of the three

possible results has the same probability, 1/3. If we use the sequence of these

rolls to form a number x, then the probability law of the resulting random vari

able is the Lebesgue measure (i.e., picking a ternary expansion “at random”

leads to a uniform random variable).

The Cantor set can be identiﬁed with the event consisting of all roll se

quences in which a 1 never occurs. (This event has zero probability, which is

consistent with the fact that C has zero Lebesgue measure.)

Consider now an inﬁnite sequence of independent tosses of a fair coin. If

the ith toss results in tails, record xi = 0; if it results in heads, record xi = 2.

Use the xi s to form a number x, using Eq. (4). This deﬁnes a random variable

X on ([0, 1], B), whose range is the set C. The probability law of this random

variable is therefore concentrated on the “zero-length” set C. At the same time,

P(X = x) = 0 for every x, because any particular sequence of heads and tails

has zero probability. A measure with this property is called singular.
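The coin-flip construction can be sketched by simulation, truncating the ternary expansion at a finite depth for illustration (the depth and helper name are our own choices):

```python
import random

def cantor_sample(rng, depth=30):
    # Use fair-coin flips b_i to form digits x_i = 2*b_i, and set
    # X = sum_i x_i / 3^i, truncated at `depth` digits.
    return sum((2 * rng.randint(0, 1)) / 3**i for i in range(1, depth + 1))

rng = random.Random(2)
samples = [cantor_sample(rng) for _ in range(10_000)]

# Every sample lies in [0, 1], and none falls in the removed middle
# third (1/3, 2/3), the first interval deleted when constructing C.
assert all(0.0 <= x <= 1.0 for x in samples)
assert not any(1/3 < x < 2/3 for x in samples)

# By symmetry of the construction, the mean of X is 1/2.
assert abs(sum(samples) / len(samples) - 0.5) < 0.01
```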

The random variable X that we have constructed here is neither discrete nor

continuous. Moreover, the CDF of X cannot be written as a mixture of the kind

considered in Eq. (3).


MIT OpenCourseWare

http://ocw.mit.edu

Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 6 9/24/2008

Contents

2. Joint, marginal, and conditional PMFs

3. Independence of random variables

4. Expected values

Recall that a random variable X : Ω → R is called discrete if its range (i.e., the

set of values that it can take) is a countable set. The PMF of X is a function

pX : R → [0, 1], deﬁned by pX (x) = P(X = x), and completely determines

the probability law of X.

The following are some important PMFs.

(a) Discrete uniform with parameters a and b, where a and b are integers with a < b. Here,

pX(k) = 1/(b − a + 1),  k = a, a + 1, . . . , b.

(b) Bernoulli with parameter p, where 0 ≤ p ≤ 1. Here, pX(1) = p, pX(0) = 1 − p.

(c) Binomial with parameters n and p, where n ∈ N and p ∈ [0, 1]. Here,

pX(k) = [n! / (k! (n − k)!)] p^k (1 − p)^{n−k},  k = 0, 1, . . . , n.

(In the remaining examples, the qualification “pX(k) = 0, otherwise,” will be omitted for brevity.)

A binomial random variable with parameters n and p represents the number

of heads observed in n independent tosses of a coin if the probability of

heads at each toss is p.

(d) Geometric with parameter p, where 0 < p ≤ 1. Here,

pX(k) = (1 − p)^{k−1} p,  k = 1, 2, . . . .

A geometric random variable with parameter p represents the number of independent tosses of a coin until heads are observed for the first time, if the probability of heads at each toss is p.

(e) Poisson with parameter λ, where λ > 0. Here,

pX(k) = e^{−λ} λ^k / k!,  k = 0, 1, . . . .

As will be seen shortly, a Poisson random variable can be thought of as a

limiting case of a binomial random variable. Note that this is a legitimate

PMF (i.e., it sums to one), because of the series expansion of the exponential function, e^λ = Σ_{k=0}^{∞} λ^k / k!.

(f) Power law with parameter α, where α > 0. Here,

pX(k) = 1/k^α − 1/(k + 1)^α,  k = 1, 2, . . . .

An equivalent description is given by the formula

P(X ≥ k) = 1/k^α,  k = 1, 2, . . . .

Note that when α is small, the “tail” P(X ≥ k) of the distribution decays

slowly (slower than an exponential) as k increases, and in some sense such

a distribution has “heavy” tails.

Notation: Let us use the abbreviations dU(a, b), Ber(p), Bin(n, p), Geo(p), Pois(λ), and Pow(α) to refer to the above defined PMFs. We will use notation such as X =ᵈ dU(a, b) or X ∼ dU(a, b) as a shorthand for the statement that X is a discrete random variable whose PMF is uniform on (a, b), and similarly for the other PMFs we defined. We will also use the notation X =ᵈ Y to indicate that two random variables have the same PMF.


1.1 Poisson distribution as a limit of the binomial

To get a feel for the Poisson random variable, think of a binomial random vari

able with very small p and very large n. For example, consider the number of

typos in a book with a total of n words, when the probability p that any one

word is misspelled is very small (associate a word with a coin toss that results in

a head when the word is misspelled), or the number of cars involved in accidents

in a city on a given day (associate a car with a coin toss that results in a head

when the car has an accident). Such random variables can be well modeled with

a Poisson PMF.

More precisely, the Poisson PMF with parameter λ is a good approximation

for a binomial PMF with parameters n and p, i.e.,

e^{−λ} λ^k / k! ≈ [n! / (k! (n − k)!)] p^k (1 − p)^{n−k},  k = 0, 1, . . . , n,

provided λ = np, n is large, and p is small. In this case, using the Poisson

PMF may result in simpler models and calculations. For example, let n = 100

and p = 0.01. Then the probability of k = 5 successes in n = 100 trials is

calculated using the binomial PMF as

[100! / (95! 5!)] · 0.01^5 · (1 − 0.01)^95 = 0.00290.

Using the Poisson PMF with λ = np = 100 · 0.01 = 1, this probability is

approximated by

e^{−1} · (1 / 5!) = 0.00306.
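The numerical comparison above can be reproduced in a few lines (using Python's standard-library `math.comb`):

```python
import math

# P(k = 5 successes) for Bin(100, 0.01) versus its Pois(1) approximation.
n, p, k = 100, 0.01, 5
lam = n * p

binomial = math.comb(n, k) * p**k * (1 - p)**(n - k)
poisson = math.exp(-lam) * lam**k / math.factorial(k)

assert abs(binomial - 0.00290) < 1e-5
assert abs(poisson - 0.00306) < 1e-5
```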

Proposition. Fix some λ > 0, and suppose that Xn =ᵈ Bin(n, λ/n), for every n. Let X =ᵈ Pois(λ). Then, as n → ∞, the PMF of Xn converges to the PMF of X, in the sense that lim_{n→∞} P(Xn = k) = P(X = k), for any k ≥ 0.

Proof: We have

P(Xn = k) = [n(n − 1) · · · (n − k + 1) / n^k] · (λ^k / k!) · (1 − λ/n)^{n−k}.

Fix k and let n → ∞. We have, for j = 1, . . . , k,

(n − k + j)/n → 1,   (1 − λ/n)^{−k} → 1,   (1 − λ/n)^n → e^{−λ}.


Thus, for any fixed k, we obtain

lim_{n→∞} P(Xn = k) = e^{−λ} λ^k / k! = P(X = k),

as claimed.

In most applications, one typically deals with several random variables at once.

In this section, we introduce a few concepts that are useful in such a context.

Consider two discrete random variables X and Y associated with the same ex

periment. The probability law of each one of them is described by the corre

sponding PMF, pX or pY , called a marginal PMF. However, the marginal PMFs

do not provide any information on possible relations between these two random

variables. For example, suppose that the PMF of X is symmetric around the

origin. If we have either Y = X or Y = −X, the PMF of Y remains the same,

and fails to capture the speciﬁcs of the dependence between X and Y .

The statistical properties of two random variables X and Y are captured by a function pX,Y : R² → [0, 1], defined by

pX,Y(x, y) = P(X = x, Y = y),

called the joint PMF of X and Y. Here and in the sequel, we will use the

abbreviated notation P(X = x, Y = y) instead of the more precise notations

P({X = x} ∩ {Y = y}) or P(X = x and Y = y). More generally, the PMF

of finitely many discrete random variables, X1, . . . , Xn, on the same probability space is defined by

pX1,...,Xn(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn).

Sometimes, we use the vector notation X = (X1, . . . , Xn), in which case the joint PMF will be denoted simply as pX(x), where now the argument x is an n-dimensional vector.


The joint PMF of X and Y determines the probability of any event that can

be speciﬁed in terms of the random variables X and Y . For example if A is the

set of all pairs (x, y) that have a certain property, then

P((X, Y) ∈ A) = Σ_{(x,y)∈A} pX,Y(x, y).

In fact, we can calculate the marginal PMFs of X and Y by using the formulas

pX(x) = Σ_y pX,Y(x, y),   pY(y) = Σ_x pX,Y(x, y).

Indeed,

pX(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y pX,Y(x, y),

where the second equality follows by noting that the event {X = x} is the union

of the countably many disjoint events {X = x, Y = y}, as y ranges over all the

different values of Y . The formula for pY (y) is veriﬁed similarly.
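These marginalization formulas can be sketched directly, with the joint PMF stored as a dictionary; the particular numbers are an illustrative choice of our own:

```python
from collections import defaultdict

p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p   # p_X(x) = sum over y of p_{X,Y}(x, y)
    p_y[y] += p   # p_Y(y) = sum over x of p_{X,Y}(x, y)

assert abs(p_x[0] - 0.4) < 1e-12 and abs(p_x[1] - 0.6) < 1e-12
assert abs(p_y[0] - 0.3) < 1e-12 and abs(p_y[1] - 0.7) < 1e-12
```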

Let X and Y be two discrete random variables, deﬁned on the same probability

space, with joint PMF pX,Y . The conditional PMF of X given Y is a function

pX|Y, defined by

pX|Y(x | y) = P(X = x | Y = y).

Using the definition of conditional probabilities, we obtain

pX|Y(x | y) = pX,Y(x, y) / pY(y),

whenever pY(y) > 0.

More generally, if we have random variables X1, . . . , Xn and Y1, . . . , Ym,

deﬁned on the same probability space, we deﬁne a conditional PMF by letting

pX1,...,Xn|Y1,...,Ym(x1, . . . , xn | y1, . . . , ym)
= P(X1 = x1, . . . , Xn = xn | Y1 = y1, . . . , Ym = ym)
= pX1,...,Xn,Y1,...,Ym(x1, . . . , xn, y1, . . . , ym) / pY1,...,Ym(y1, . . . , ym),

whenever pY1,...,Ym(y1, . . . , ym) > 0. When we use the vector notation X = (X1, . . . , Xn) and Y = (Y1, . . . , Ym), the shorthand notation pX|Y(x | y) can be used.
Note that if pY(y) > 0, then Σ_x pX|Y(x | y) = 1, where the sum is over

all x in the range of the random variable X. Thus, the conditional PMF is es

sentially the same as an ordinary PMF, but with redeﬁned probabilities that take

into account the conditioning event Y = y. Visually, if we ﬁx y, the conditional

PMF pX|Y (x | y), viewed as a function of x is a “slice” of the joint PMF pX,Y ,

renormalized so that its entries sum to one.
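The "slice and renormalize" picture can be sketched in a few lines; the joint PMF below is an illustrative choice, and `conditional_pmf` is our own helper name:

```python
p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def conditional_pmf(p_xy, y):
    # Take the slice p_{X,Y}(., y) of the joint PMF and divide by p_Y(y).
    p_y = sum(p for (_, yy), p in p_xy.items() if yy == y)
    assert p_y > 0, "conditioning on a zero-probability event"
    return {x: p / p_y for (x, yy), p in p_xy.items() if yy == y}

slice_1 = conditional_pmf(p_xy, y=1)   # p_{X|Y}(x | 1)
assert abs(slice_1[0] - 3/7) < 1e-12 and abs(slice_1[1] - 4/7) < 1e-12
assert abs(sum(slice_1.values()) - 1.0) < 1e-12  # renormalized to one
```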

We start with a general definition that applies to all types of random variables, including discrete and continuous ones. We then specialize to the case of discrete random variables.

Intuitively, two random variables are independent if any partial information on

the realized value of one random variable does not change the distribution of the

other. This notion is formalized in the following deﬁnition.

(a) Let X1, . . . , Xn be random variables defined on the same probability space. We say that these random variables are independent if

P(X1 ∈ B1, . . . , Xn ∈ Bn) = P(X1 ∈ B1) · · · P(Xn ∈ Bn),

for every collection B1, . . . , Bn of Borel subsets of the real line.

(b) Consider a collection of random variables Xs, indexed by the elements of a (possibly infinite) index set S. We say that these random variables are independent if for every finite subset {s1, . . . , sn} of S, the random variables Xs1, . . . , Xsn are independent.

Note that requiring independence of the random variables Xs, s ∈ S, is the same as requiring independence of the events {Xs ∈ Bs}, s ∈ S, for every choice of Borel subsets Bs, s ∈ S, of the real line.


Verifying the independence of random variables using the above deﬁnition

(which involves arbitrary Borel sets) is rather difﬁcult. It turns out that one only

needs to examine Borel sets of the form (−∞, x].

Proposition. Suppose that for every choice of x1, . . . , xn ∈ R and every finite subset {s1, . . . , sn} of S, the events {Xsi ≤ xi}, i = 1, . . . , n, are independent. Then, the random variables Xs, s ∈ S, are independent.

We omit the proof of the above proposition. However, we remark that ul

timately, the proof relies on the fact that the collection of events of the form

(−∞, x] generate the Borel σ-ﬁeld.

Let us define the joint CDF of the random variables X1, . . . , Xn by

FX1,...,Xn(x1, . . . , xn) = P(X1 ≤ x1, . . . , Xn ≤ xn).

The above proposition implies that the random variables X1, . . . , Xn are independent if and only if their joint CDF satisfies the condition

FX1,...,Xn(x1, . . . , xn) = FX1(x1) · · · FXn(xn),  for all x1, . . . , xn ∈ R.

Exercise 1. Let As, s ∈ S, be a collection of events, indexed by a (possibly infinite) index set. Prove that the events As are independent if and only if the corresponding indicator functions IAs, s ∈ S, are independent random variables.

For a ﬁnite number of discrete random variables, independence is equivalent to

having a joint PMF which factors into a product of marginal PMFs.

Theorem 1. Let X and Y be discrete random variables defined on the same probability space. The following are equivalent.

(a) The random variables X and Y are independent.

(b) For any x, y ∈ R, the events {X = x} and {Y = y} are independent.

(c) For any x, y ∈ R, we have pX,Y (x, y) = pX (x)pY (y).

(d) For any x, y ∈ R such that pY (y) > 0, we have pX|Y (x | y) = pX (x).

Proof: The fact that (a) implies (b) is immediate from the deﬁnition of indepen

dence. The equivalence of (b), (c), and (d) is also an immediate consequence of

our deﬁnitions. We complete the proof by verifying that (c) implies (a).


Suppose that condition (c) holds, and let A, B be two Borel subsets of the real line. We then have

P(X ∈ A, Y ∈ B) = Σ_{x∈A, y∈B} P(X = x, Y = y)
= Σ_{x∈A, y∈B} pX,Y(x, y)
= Σ_{x∈A, y∈B} pX(x) pY(y)
= (Σ_{x∈A} pX(x)) (Σ_{y∈B} pY(y))
= P(X ∈ A) P(Y ∈ B).

Since this is true for any Borel sets A and B, we conclude that X and Y are

independent.
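Condition (c) gives a direct numerical test of independence for finitely supported joint PMFs. A sketch (the helper name and example PMFs are our own illustrative choices):

```python
import itertools

def is_independent(p_xy, tol=1e-12):
    # Check p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y (condition (c)).
    xs = {x for x, _ in p_xy}
    ys = {y for _, y in p_xy}
    p_x = {x: sum(p_xy.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(p_xy.get((x, y), 0.0) for x in xs) for y in ys}
    return all(abs(p_xy.get((x, y), 0.0) - p_x[x] * p_y[y]) < tol
               for x, y in itertools.product(xs, ys))

# Two independent fair coins: the joint PMF factors.
assert is_independent({(x, y): 0.25 for x in (0, 1) for y in (0, 1)})
# Perfectly correlated coins (Y = X): the joint PMF does not factor.
assert not is_independent({(0, 0): 0.5, (1, 1): 0.5})
```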

We note that Theorem 1 generalizes to the case of multiple, but ﬁnitely many,

random variables. The generalization of conditions (a)-(c) should be obvious.

As for condition (d), it can be generalized to a few different forms, one of which

is the following: given any subset S0 of the random variables under considera

tion, the conditional joint PMF of the random variables Xs , s ∈ S0 , given the

values of the remaining random variables, is the same as the unconditional joint

PMF of the random variables Xs , s ∈ S0 , as long as we are conditioning on an

event with positive probability.

We ﬁnally note that functions g(X) and h(Y ) of two independent random

variables X and Y must themselves be independent. This should be expected

on intuitive grounds: If X is independent from Y , then the information provided

by the value of g(X) should not affect the distribution of Y , and consequently

should not affect the distribution of h(Y ). Observe that when X and Y are

discrete, then g(X) and h(Y ) are random variables (the required measurability

conditions are satisﬁed) even if the functions g and h are not measurable (why?).

Proposition. Let X and Y be independent discrete random variables, and let g and h be some functions from R into itself. Then, the random variables g(X) and h(Y) are independent.


3.3 Examples

Example. Let X1, . . . , Xn be independent Bernoulli random variables with a common parameter p. Then, the random variable X defined by X = X1 + · · · + Xn is binomial

with parameters n and p. To see this, consider n independent tosses of a coin in which

every toss has probability p of resulting in a one, and let Xi be the result of the ith coin

toss. Then, X is the number of ones observed in n independent tosses, and is therefore

a binomial random variable.

Example. Let X and Y be independent binomial random variables with parameters

(n, p) and (m, p), respectively. Then, the random variable Z, deﬁned by Z = X + Y is

binomial with parameters (n + m, p). To see this, consider n + m independent tosses of

a coin in which every toss has probability p of resulting in a one. Let X be the number

of ones in the ﬁrst n tosses, and let Y be the number of ones in the last m tosses. Then,

Z is the number of ones in n+m independent tosses, which is binomial with parameters

(n + m, p).

Example. Consider n independent tosses of a coin in which every toss has probability

p of resulting in a one. Let X be the number of ones obtained, and let Y = n − X,

which is the number of zeros. The random variables X and Y are not independent. For

example, P(X = 0) = (1 − p)^n and P(Y = 0) = p^n, but P(X = 0, Y = 0) = 0 ≠

P(X = 0)P(Y = 0). Intuitively, knowing that there was a small number of heads gives

us information that the number of tails must be large.

However, in sharp contrast to the intuition from the preceding example, we

obtain independence when the number of coin tosses is itself random, with a

Poisson distribution. More precisely, let N be a Poisson random variable with

parameter λ. We assume that the conditional PMF pX|N (· | n) is binomial with

parameters n and p (representing the number of ones observed in n coin tosses),

and deﬁne Y = N − X, which represents the number of zeros obtained. We

have the following surprising result. An intuitive justiﬁcation will have to wait

until we consider the Poisson process, later in this course. The proof is left as

an exercise.

Proposition. The random variables X and Y are independent. Moreover, X =ᵈ Pois(λp) and Y =ᵈ Pois(λ(1 − p)).
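A simulation sketch of this proposition; the Poisson sampler below, based on CDF inversion, is our own illustrative implementation:

```python
import math
import random

def sample_poisson(rng, lam):
    # Sample Pois(lam) by inverting the CDF; adequate for moderate lam.
    u, k = rng.random(), 0
    prob = math.exp(-lam)   # P(N = 0)
    cdf = prob
    while u > cdf:
        k += 1
        prob *= lam / k     # P(N = k) = P(N = k-1) * lam / k
        cdf += prob
    return k

rng = random.Random(3)
lam, p, trials = 5.0, 0.3, 50_000
xs, ys = [], []
for _ in range(trials):
    n = sample_poisson(rng, lam)             # number of coin tosses
    x = sum(rng.random() < p for _ in range(n))   # number of ones
    xs.append(x)
    ys.append(n - x)                         # number of zeros

# Empirical means should be close to lam*p and lam*(1-p).
assert abs(sum(xs) / trials - lam * p) < 0.05
assert abs(sum(ys) / trials - lam * (1 - p)) < 0.05

# Empirical covariance should be close to zero, consistent with
# the claimed independence of X and Y.
mx, my = sum(xs) / trials, sum(ys) / trials
cov = sum(x * y for x, y in zip(xs, ys)) / trials - mx * my
assert abs(cov) < 0.05
```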

4 EXPECTED VALUES

Consider a sequence {an} of nonnegative real numbers and the infinite sum Σ_{i=1}^{∞} ai, defined as the limit, lim_{n→∞} Σ_{i=1}^{n} ai, of the partial sums. The infinite sum can be finite or infinite; in either case, it is well defined, as long as we allow the limit to be an extended real number. Furthermore, it can be verified that the value of the infinite sum is the same even if we reorder the elements of the sequence {an} and carry out the summation according to this different order. Because the order of the summation does not matter, we can use the notation Σ_{n∈N} an for the infinite sum. More generally, if C is a countable set and g : C → [0, ∞) is a nonnegative function, we can use the notation Σ_{x∈C} g(x), which is unambiguous even without specifying a particular order in which the values g(x) are to be summed.

When we consider a sequence of nonpositive real numbers, the discussion

remains the same, and inﬁnite sums can be unambiguously deﬁned. However,

when we consider sequences that involve both positive and negative numbers,

the situation is more complicated. In particular, the order at which the elements

of the sequence are added can make a difference.

�n

Example. Let an = (−1)n /n. It can be veriﬁed that the limit limn→∞ i=1 an exists,

and is ﬁnite, but that the elements of the sequence {an } can be reordered to form a new

�n

sequence {bn } for which the limit of i=1 bn does not exist.

In order to deal with the general case, we proceed as follows. Let S be a countable set, and consider a collection of real numbers as, s ∈ S. Let S+ (respectively, S−) be the set of indices s for which as ≥ 0 (respectively, as < 0). Let Σ+ = Σ_{s∈S+} as and Σ− = Σ_{s∈S−} |as|. We distinguish four cases.

(a) If both Σ+ and Σ− are finite (or equivalently, if Σ_{s∈S} |as| < ∞), we say that the sum Σ_{s∈S} as is absolutely convergent, and is equal to Σ+ − Σ−. In this case, for every possible arrangement of the elements of S in a sequence {sn}, we have

lim_{n→∞} Σ_{i=1}^{n} a_{si} = Σ+ − Σ−.

(b) If Σ+ = ∞ and Σ− < ∞, the sum Σ_{s∈S} as is not absolutely convergent; we define it to be equal to ∞. In this case, for every possible arrangement of the elements of S in a sequence {sn}, we have

lim_{n→∞} Σ_{i=1}^{n} a_{si} = ∞.

(c) If Σ+ < ∞ and Σ− = ∞, the sum Σ_{s∈S} as is not absolutely convergent; we define it to be equal to −∞. In this case, for every possible arrangement of the elements of S in a sequence {sn}, we have

lim_{n→∞} Σ_{i=1}^{n} a_{si} = −∞.

(d) If Σ+ = ∞ and Σ− = ∞, the sum Σ_{s∈S} as is left undefined. In fact, in this case, different arrangements of the elements of S in a sequence {sn} will result in different or even nonexistent values of the limit lim_{n→∞} Σ_{i=1}^{n} a_{si}.

We say that the sum Σ_{s∈S} as is well defined in cases (a)-(c), and call it absolutely convergent only in case (a).

We close by recording a related useful fact. If we have a doubly indexed family of numbers aij, i, j ∈ N, and if either (i) the numbers are nonnegative, or (ii) the sum Σ_{(i,j)} aij is absolutely convergent, then

Σ_{i=1}^{∞} Σ_{j=1}^{∞} aij = Σ_{j=1}^{∞} Σ_{i=1}^{∞} aij = Σ_{(i,j)∈N²} aij.   (1)

More important, we stress that the ﬁrst equality need not hold in the absence of

conditions (i) or (ii) above.

The PMF of a random variable X provides us with several numbers, the prob

abilities of all the possible values of X. It is often desirable to summarize this

information in a single representative number. This is accomplished by the ex

pectation of X, which is a weighted (in proportion to probabilities) average of

the possible values of X.

As motivation, suppose you spin a wheel of fortune many times. At each

spin, one of the numbers m1 , m2 , . . . , mn comes up with corresponding proba

bility p1 , p2 , . . . , pn , and this is your monetary reward from that spin. What is

the amount of money that you “expect” to get “per spin”? The terms “expect”

and “per spin” are a little ambiguous, but here is a reasonable interpretation.


Suppose that you spin the wheel k times, and that ki is the number of times

that the outcome is mi . Then, the total amount received is m1 k1 + m2 k2 + · · · +

mn kn . The amount received per spin is

M = (m1 k1 + m2 k2 + · · · + mn kn) / k.

If the number of spins k is very large, and if we are willing to interpret proba

bilities as relative frequencies, it is reasonable to anticipate that mi comes up a

fraction of times that is roughly equal to pi :

ki / k ≈ pi,   i = 1, . . . , n.

Thus, the amount of money per spin that you “expect” to receive is

M = (m1 k1 + m2 k2 + · · · + mn kn) / k ≈ m1 p1 + m2 p2 + · · · + mn pn.
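The relative-frequency argument can be checked by simulation; the rewards and probabilities below are illustrative choices of our own:

```python
import random

rewards = [1.0, 2.0, 10.0]
probs = [0.5, 0.3, 0.2]
# The weighted average m1*p1 + ... + mn*pn = 0.5 + 0.6 + 2.0 = 3.1.
expected = sum(m * p for m, p in zip(rewards, probs))

rng = random.Random(4)
k = 100_000
total = 0.0
for _ in range(k):
    # Spin the wheel: pick a reward according to its probability.
    u, cdf = rng.random(), 0.0
    for m, p in zip(rewards, probs):
        cdf += p
        if u < cdf:
            total += m
            break

# The long-run average per spin should be close to the weighted average.
assert abs(total / k - expected) < 0.05
```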

Motivated by this example, we introduce the following deﬁnition.

Definition 1. (Expectation) We define the expected value (also called the expectation or the mean) of a discrete random variable X, with PMF pX, as

E[X] = Σ_x x pX(x),

whenever the sum is well defined, and where the sum is taken over the countable set of values in the range of X.

We start by pointing out an alternative formula for the expectation, and leave its proof as an exercise. In particular, if X can only take nonnegative integer values, then

E[X] = Σ_{n≥0} P(X > n).   (2)

Example. Using this formula, it is easy to give an example of a random variable for which the expected value is infinite. Consider X =ᵈ Pow(α), where α ≤ 1. Then, it can be verified, using the fact Σ_{n=1}^{∞} 1/n = ∞, that E[X] = Σ_{n≥0} 1/(n + 1)^α = ∞. On the other hand, if α > 1, then E[X] is finite.
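Formula (2) can be checked numerically; here we use a Geo(p) random variable, truncated at a large cutoff for illustration:

```python
p = 0.4
cutoff = 1000

# Truncated geometric PMF: p_X(k) = (1-p)^{k-1} p, k = 1, 2, ...
pmf = {k: (1 - p) ** (k - 1) * p for k in range(1, cutoff)}

# Direct expectation: sum of k * p_X(k).
direct = sum(k * q for k, q in pmf.items())
# Tail-sum formula (2): sum over n >= 0 of P(X > n).
tail_sum = sum(sum(q for k, q in pmf.items() if k > n) for n in range(cutoff))

# Both computations agree with each other and with 1/p = 2.5.
assert abs(direct - tail_sum) < 1e-9
assert abs(direct - 1 / p) < 1e-9
```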

Here is another useful fact, whose proof is again left as an exercise.

Proposition 3. Let X be a discrete random variable, and let g : R → R. Then,

E[g(X)] = Σ_{{x | pX(x)>0}} g(x) pX(x).   (3)

A similar formula applies to a vector X = (X1, . . . , Xn) of random variables with joint PMF pX = pX1,...,Xn, and a function g : Rn → R.

Example. Consider a discrete random variable X and the function g : R → R defined by g(x) = x². Let Y = g(X). In order to calculate the expectation E[Y] according to Definition 1, we need to first find the PMF of Y, and then use the formula E[Y] = Σ_y y pY(y). However, according to Proposition 3, we can work directly with the PMF of X, and write

E[Y] = E[X²] = Σ_x x² pX(x).

The quantity E[X²] is called the second moment of X. More generally, if r ∈ N, the quantity E[X^r] is called the rth moment of X. Furthermore, E[(X − E[X])^r] is called the rth central moment of X. The second central moment, E[(X − E[X])²], is called the variance of X, and is denoted by var(X). The square root of the variance is called the standard deviation of X, and is often denoted by σX, or just σ. Note that for every even r, the rth moment and the rth central moment are always nonnegative; in particular, the standard deviation is always well defined.

We continue with a few more important properties of expectations. In the

sequel, notations such as X ≥ 0 or X = c mean that X(ω) ≥ 0 or X(ω) = c,

respectively, for all ω ∈ Ω. Similarly, a statement such as “X ≥ 0, almost

surely” or “X ≥ 0, a.s.,” means that P(X ≥ 0) = 1.


Proposition 4. Let X and Y be discrete random variables defined on the same probability space.

(a) If X ≥ 0, a.s., then E[X] ≥ 0.

(b) If X = c, a.s., for some constant c ∈ R, then E[X] = c.

(c) For any a, b ∈ R, we have E[aX + bY ] = aE[X] + bE[Y ] (as long as the

sum aE[X] + bE[Y ] is well-deﬁned).

(d) If E[X] is ﬁnite, then var(X) = E[X 2 ] − (E[X])2 .

(e) For every a ∈ R, we have var(aX) = a2 var(X).

(f) If X and Y are independent, then E[XY ] = E[X]E[Y ], and var(X + Y ) =

var(X)+var(Y ), provided all of the expectations involved are well deﬁned.

(g) More generally, if X1 , . . . , Xn are independent, then

E[Π_{i=1}^{n} Xi] = Π_{i=1}^{n} E[Xi],

and

var(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} var(Xi),

provided all of the expectations involved are well defined.

Proof: We only give the proof for the case where all expectations involved are

well deﬁned and ﬁnite, and leave it to the reader to verify that the results extend

to the case where all expectations involved are well deﬁned but possibly inﬁnite.

Parts (a) and (b) are immediate consequences of the definitions. For part (c), using the second part of Proposition 3 and then Eq. (1), we obtain

E[aX + bY] = Σ_{x,y} (ax + by) pX,Y(x, y)
= Σ_x (Σ_y ax pX,Y(x, y)) + Σ_y (Σ_x by pX,Y(x, y))
= a Σ_x x pX(x) + b Σ_y y pY(y)
= a E[X] + b E[Y].


For part (d), we have

var(X) = E[(X − E[X])²] = E[X² − 2X E[X] + (E[X])²]
= E[X²] − 2 E[X] E[X] + (E[X])²
= E[X²] − (E[X])².

Part (e) follows easily from (d) and (c). For part (f), we apply Proposition 3

and then use independence to obtain

E[XY] = Σ_{x,y} x y pX,Y(x, y)
= Σ_{x,y} x y pX(x) pY(y)
= (Σ_x x pX(x)) (Σ_y y pY(y))
= E[X] E[Y].

Furthermore, var(X + Y) = E[(X + Y)²] − (E[X] + E[Y])²
= E[X²] + E[Y²] + 2E[XY] − (E[X])² − (E[Y])² − 2E[X]E[Y].

Using the equality E[XY ] = E[X]E[Y ], the above expression becomes var(X)+

var(Y ). The proof of part (g) is similar and is omitted.

Remark: The equalities in part (f) need not hold in the absence of independence.

For example, consider a random variable X that takes either value 1 or −1, with probability 1/2. Then, E[X] = 0, but E[X²] = 1. If we let Y = X, we see that E[XY] = E[X²] = 1 ≠ 0 = E[X] E[Y]. Furthermore, var(X + Y) = var(2X) = 4 var(X), while var(X) + var(Y) = 2 var(X) = 2.

Exercise 2. Show that var(X) = 0 if and only if there exists a constant c such that

P(X = c) = 1.



MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 7 9/29/2008

EXPECTATIONS

Contents

2. Expected values of some common random variables

3. Covariance and correlation

4. Indicator variables and the inclusion-exclusion formula

5. Conditional expectations

(a) Recall that E[X] is well defined unless both sums Σ_{x:x<0} x pX(x) and Σ_{x:x>0} x pX(x) are infinite. Furthermore, E[X] is well-defined and finite if and only if both sums are finite. This is the same as requiring that

E[|X|] = Σ_x |x| pX(x) < ∞.

In this case, we say that X is integrable.

(b) Note that for any random variable X, E[X²] is always well-defined (whether finite or infinite), because all the terms in the sum Σ_x x² pX(x) are nonnegative. If we have E[X²] < ∞, we say that X is square integrable.

(c) Using the inequality |x| ≤ 1 + x2 , we have E[|X|] ≤ 1 + E[X 2 ], which

shows that a square integrable random variable is always integrable.

(d) Because of the formula var(X) = E[X 2 ] − (E[X])2 , we see that: (i) if X is

square integrable, the variance is ﬁnite; (ii) if X is integrable, but not square

integrable, the variance is inﬁnite; (iii) if X is not integrable, the variance is

undeﬁned.


2 EXPECTED VALUES OF SOME COMMON RANDOM VARIABLES

We calculate the mean and variance of a few common discrete random variables.

(a) Bernoulli(p). Let X be a Bernoulli random variable with parameter p.
Then,

E[X] = 1 · p + 0 · (1 − p) = p,
var(X) = E[X²] − (E[X])² = 1² · p + 0² · (1 − p) − p² = p(1 − p).

(b) Binomial(n, p). Let X be a binomial random variable with parameters n
and p. We note that X can be expressed in the form X = Σ_{i=1}^n X_i, where
X_1, . . . , X_n are independent Bernoulli random variables with a common
parameter p. It follows that

E[X] = Σ_{i=1}^n E[X_i] = np.

Similarly, using independence,

var(X) = Σ_{i=1}^n var(X_i) = np(1 − p).
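As a quick sanity check, the binomial mean and variance formulas can be verified by exact enumeration of the PMF (the values n = 10, p = 0.3 below are arbitrary):

```python
# Exact enumeration of the Binomial(n, p) PMF, compared against the
# formulas E[X] = np and var(X) = np(1 - p).
from math import comb

n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

assert abs(mean - n * p) < 1e-12
assert abs(var - n * p * (1 - p)) < 1e-12
```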

(c) Geometric(p). Let X be a geometric random variable with parameter p.
We will use the formula E[X] = Σ_{n=0}^∞ P(X > n). We observe that

P(X > n) = Σ_{j=n+1}^∞ (1 − p)^{j−1} p = (1 − p)^n,

so that

E[X] = Σ_{n=0}^∞ (1 − p)^n = 1/p.

The variance of X turns out to satisfy

var(X) = (1 − p)/p²,

but we defer the derivation to a later section.
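The tail-sum formula and the stated variance can be checked numerically with a truncated sum (p = 0.25 is arbitrary; the tails decay geometrically, so a modest truncation suffices):

```python
# Numeric check for Geometric(p): E[X] = sum_n P(X > n) = 1/p,
# and var(X) = (1 - p)/p^2, using truncated sums.
p = 0.25
N = 2000  # truncation point; (1 - p)^N is negligible

mean = sum((1 - p) ** n for n in range(N))           # sum of tail probabilities

# direct computation from the PMF p_X(k) = (1 - p)^(k-1) p, k = 1, 2, ...
second_moment = sum(k * k * (1 - p) ** (k - 1) * p for k in range(1, N))
var = second_moment - mean**2

assert abs(mean - 1 / p) < 1e-9
assert abs(var - (1 - p) / p**2) < 1e-6
```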


(d) Poisson(λ). Let X be a Poisson random variable with parameter λ. A direct
calculation yields

E[X] = Σ_{n=0}^∞ n e^{−λ} λ^n/n!
     = Σ_{n=1}^∞ n e^{−λ} λ^n/n!
     = Σ_{n=1}^∞ e^{−λ} λ^n/(n − 1)!
     = λ e^{−λ} Σ_{n=0}^∞ λ^n/n!
     = λ.

The variance of X turns out to satisfy var(X) = λ, but we defer the derivation
to a later section. We note, however, that the mean and the variance of a
Poisson random variable are exactly what one would expect, on the basis of
the formulae for the mean and variance of a binomial random variable, and
taking the limit as n → ∞, p → 0, while keeping np fixed at λ.
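The series manipulation above can be checked numerically with a truncated sum; the value λ = 3.7 below is arbitrary:

```python
# Truncated-series check that the Poisson(lambda) mean equals lambda.
# The PMF is computed by the recurrence P(X = n) = P(X = n-1) * lam / n
# to avoid large factorials.
from math import exp

lam = 3.7
term = exp(-lam)      # P(X = 0)
mean = 0.0
for n in range(1, 100):
    term *= lam / n   # P(X = n)
    mean += n * term

assert abs(mean - lam) < 1e-9
```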

(e) Power(α). Let X be a random variable with a power law distribution with
parameter α. We have

E[X] = Σ_{k=0}^∞ P(X > k) = Σ_{k=0}^∞ 1/(k + 1)^α.

If α ≤ 1, we obtain E[X] = ∞. If α > 1, the sum
is finite, but a closed form expression is not available; it is known as the
Riemann zeta function, and is denoted by ζ(α).

3 COVARIANCE AND CORRELATION

3.1 Covariance

The covariance of two square integrable random variables, X and Y, is denoted
by cov(X, Y), and is defined by

cov(X, Y) = E[(X − E[X])(Y − E[Y])].

Note that, under the square integrability assumption, the covariance is al

ways well-deﬁned and ﬁnite. This is a consequence of the fact that |XY | ≤

(X 2 + Y 2 )/2, which implies that XY , as well as (X − E[X])(Y − E[Y ]), are

integrable.

Roughly speaking, a positive or negative covariance indicates that the values

of X − E[X] and Y − E[Y ] obtained in a single experiment “tend” to have the

same or the opposite sign, respectively. Thus, the sign of the covariance provides

an important qualitative indicator of the relation between X and Y .

We record a few properties of the covariance, which are immediate consequences
of its definition:

(a) cov(X, X) = var(X);
(b) cov(X, Y + a) = cov(X, Y);
(c) cov(X, Y) = cov(Y, X);
(d) cov(X, aY + bZ) = a · cov(X, Y) + b · cov(X, Z).

Recall that if X and Y are independent,
we have E[XY] = E[X] E[Y], which implies that cov(X, Y) = 0. Thus, if X
and Y are independent, they are also uncorrelated. However, the reverse is not
true, as illustrated by the following example.

Example. Suppose that the pair of random variables (X, Y ) takes the values (1, 0),

(0, 1), (−1, 0), and (0, −1), each with probability 1/4. Thus, the marginal PMFs of X

and Y are symmetric around 0, and E[X] = E[Y ] = 0. Furthermore, for all possible

value pairs (x, y), either x or y is equal to 0, which implies that XY = 0 and E[XY ] =

0. Therefore,

cov(X, Y ) = E[XY ] − E[X] E[Y ] = 0,

and X and Y are uncorrelated. However, X and Y are not independent since, for

example, a nonzero value of X ﬁxes the value of Y to zero.
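The four-point example can be checked by enumeration; the dependence is witnessed by comparing P(Y = 0) with P(Y = 0 | X = 1):

```python
# (X, Y) uniform on {(1,0), (0,1), (-1,0), (0,-1)}: covariance is 0,
# yet X and Y are dependent.
support = [(1, 0), (0, 1), (-1, 0), (0, -1)]
p = 0.25

E_X = sum(p * x for x, y in support)
E_Y = sum(p * y for x, y in support)
cov = sum(p * (x - E_X) * (y - E_Y) for x, y in support)
assert cov == 0.0                       # uncorrelated

# Dependence: P(Y = 0 | X = 1) = 1, while P(Y = 0) = 1/2.
p_y0 = sum(p for x, y in support if y == 0)
p_y0_given_x1 = sum(p for x, y in support if x == 1 and y == 0) / sum(
    p for x, y in support if x == 1
)
assert p_y0 == 0.5 and p_y0_given_x1 == 1.0
```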

The covariance can be used to obtain a formula for the variance of the sum of
several (not necessarily independent) random variables. In particular, if X1, X2,
. . . , Xn are random variables with finite variance, we have

var(X1 + X2) = var(X1) + var(X2) + 2 cov(X1, X2),

and, more generally,

var(Σ_{i=1}^n X_i) = Σ_{i=1}^n var(X_i) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(X_i, X_j).

This can be seen from the following calculation, where for brevity, we denote
X̃i = Xi − E[Xi]:

var(Σ_{i=1}^n X_i) = E[(Σ_{i=1}^n X̃_i)²]
                  = E[Σ_{i=1}^n Σ_{j=1}^n X̃_i X̃_j]
                  = Σ_{i=1}^n Σ_{j=1}^n E[X̃_i X̃_j]
                  = Σ_{i=1}^n E[X̃_i²] + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n E[X̃_i X̃_j]
                  = Σ_{i=1}^n var(X_i) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(X_i, X_j).
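The variance-of-a-sum formula can be verified by exact enumeration on a small joint PMF; the weights below are made up purely for illustration (any valid joint PMF works):

```python
# Check of var(X1 + ... + Xn) = sum var(Xi) + 2 sum_{i<j} cov(Xi, Xj)
# on an arbitrary dependent joint PMF over {0,1}^3.
from itertools import product

pmf = {v: w for v, w in zip(product([0, 1], repeat=3),
                            [0.10, 0.05, 0.20, 0.15, 0.05, 0.25, 0.10, 0.10])}
assert abs(sum(pmf.values()) - 1.0) < 1e-12

def E(f):
    return sum(w * f(v) for v, w in pmf.items())

n = 3
means = [E(lambda v, i=i: v[i]) for i in range(n)]
var = [E(lambda v, i=i: (v[i] - means[i]) ** 2) for i in range(n)]
cov = [[E(lambda v, i=i, j=j: (v[i] - means[i]) * (v[j] - means[j]))
        for j in range(n)] for i in range(n)]

lhs = E(lambda v: (sum(v) - sum(means)) ** 2)        # var(X1 + X2 + X3)
rhs = sum(var) + 2 * sum(cov[i][j] for i in range(n) for j in range(i + 1, n))
assert abs(lhs - rhs) < 1e-12
```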

The correlation coefficient ρ(X, Y) of two random variables X and Y that
have nonzero and finite variances is defined as

ρ(X, Y) = cov(X, Y)/√(var(X) var(Y)).

(The simpler notation ρ will also be used when X and Y are clear from the
context.) It may be viewed as a normalized version of the covariance cov(X, Y).

Theorem 1. Let X and Y be random variables with nonzero and finite variance,
and correlation coefficient equal to ρ.

(a) We have −1 ≤ ρ ≤ 1.
(b) We have ρ = 1 (respectively, ρ = −1) if and only if there exists a positive
(respectively, negative) constant a such that Y − E[Y] = a(X − E[X]), with
probability 1.

The proof of Theorem 1 relies on the Cauchy-Schwarz inequality, given below.

Proposition. (Cauchy-Schwarz inequality) For any two random variables,
X and Y, with finite variance, we have

(E[XY])² ≤ E[X²] E[Y²].

Proof: Assume that E[Y²] ≠ 0; otherwise, we have Y = 0 with probability
1, and hence E[XY] = 0, so the inequality holds. We have

0 ≤ E[(X − (E[XY]/E[Y²]) Y)²]
  = E[X² − 2 (E[XY]/E[Y²]) XY + (E[XY]/E[Y²])² Y²]
  = E[X²] − 2 (E[XY]/E[Y²]) E[XY] + (E[XY]/E[Y²])² E[Y²]
  = E[X²] − (E[XY])²/E[Y²],

i.e., (E[XY])² ≤ E[X²] E[Y²].

Proof of Theorem 1: (a) Let X̃ = X − E[X] and Ỹ = Y − E[Y]. Applying the
Cauchy-Schwarz inequality, we get

(ρ(X, Y))² = (E[X̃Ỹ])²/(E[X̃²] E[Ỹ²]) ≤ 1,

and hence |ρ(X, Y)| ≤ 1.


(b) One direction is straightforward. If Ỹ = aX̃, then

ρ(X, Y) = E[X̃ · aX̃]/√(E[X̃²] E[(aX̃)²]) = a/|a|.

To establish the reverse direction, let us assume that (ρ(X, Y))² = 1, which
implies that E[X̃²] E[Ỹ²] = (E[X̃Ỹ])². Using the inequality established in
the proof of the Cauchy-Schwarz inequality, we obtain

E[(X̃ − (E[X̃Ỹ]/E[Ỹ²]) Ỹ)²] = 0,

which implies that, with probability 1,

X̃ = (E[X̃Ỹ]/E[Ỹ²]) Ỹ = ρ(X, Y) √(E[X̃²]/E[Ỹ²]) Ỹ.

Note that the sign of the constant ratio of X̃ and Ỹ is determined by the sign
of ρ(X, Y), as claimed.

Example. Consider n independent tosses of a coin whose probability of heads is
p. Let X and Y be the numbers of heads and of tails, respectively, and let us look at the
correlation coefficient of X and Y. Here, we have X + Y = n, and also E[X] + E[Y] =
n. Thus,

X − E[X] = −(Y − E[Y]).

We will calculate the correlation coefficient of X and Y, and verify that it is indeed
equal to −1.

We have

cov(X, Y) = E[(X − E[X])(Y − E[Y])]
          = −E[(X − E[X])²]
          = −var(X).

Hence,

ρ(X, Y) = cov(X, Y)/√(var(X) var(Y)) = −var(X)/√(var(X) var(X)) = −1.
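The conclusion can be checked by exact enumeration of the binomial PMF (n = 8 and p = 0.3 below are arbitrary):

```python
# Exact check that heads count X and tails count Y = n - X have
# correlation coefficient -1.
from math import comb, sqrt

n, p = 8, 0.3
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

E_X = sum(k * pk for k, pk in enumerate(pmf))
E_Y = n - E_X                                    # since Y = n - X
var_X = sum((k - E_X) ** 2 * pk for k, pk in enumerate(pmf))
cov = sum((k - E_X) * ((n - k) - E_Y) * pk for k, pk in enumerate(pmf))

rho = cov / sqrt(var_X * var_X)                  # var(Y) = var(X)
assert abs(rho - (-1.0)) < 1e-12
```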


4 INDICATOR VARIABLES AND THE INCLUSION-EXCLUSION FORMULA

Indicator functions are special discrete random variables that can be useful in

simplifying certain derivations or proofs. In this section, we develop the inclusion-

exclusion formula and apply it to a matching problem.

Recall that with every event A, we can associate its indicator function,

which is a discrete random variable IA : Ω → {0, 1}, deﬁned by IA (ω) = 1 if

ω ∈ A, and IA (ω) = 0 otherwise. Note that IAc = 1 − IA and that E[IA ] =

P(A). These simple observations, together with the linearity of expectations

turn out to be quite useful.

Note that IA∩B = IA IB, for every A, B ∈ F. Therefore,

IA∪B = 1 − IAc IBc = 1 − (1 − IA)(1 − IB) = IA + IB − IA IB.

We now derive a generalization, known as the inclusion-exclusion formula.
Suppose we have a collection of events Aj, j = 1, . . . , n, and that we are
interested in the probability of the event B = ∪_{j=1}^n Aj. Note that

IB = 1 − Π_{j=1}^n (1 − IAj).

We begin with the easily verifiable fact that for any real numbers a1, . . . , an, we
have

Π_{j=1}^n (1 − aj) = 1 − Σ_{1≤j≤n} aj + Σ_{1≤i<j≤n} ai aj − Σ_{1≤i<j<k≤n} ai aj ak
                   + · · · + (−1)^n a1 · · · an.

Substituting aj = IAj, taking expectations of both sides, and using the linearity
of expectations, we obtain

P(B) = Σ_{1≤j≤n} P(Aj) − Σ_{1≤i<j≤n} P(Ai ∩ Aj) + Σ_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak)
     − · · · + (−1)^{n+1} P(A1 ∩ · · · ∩ An).
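The formula can be verified by brute force on a small discrete sample space; the sets below are chosen arbitrarily:

```python
# Brute-force check of inclusion-exclusion on a 12-point uniform space.
from itertools import combinations

omega = set(range(12))
A = [set(range(0, 6)), set(range(4, 9)), {1, 8, 10, 11}]
n = len(A)

def prob(S):
    return len(S) / len(omega)

lhs = prob(set().union(*A))                  # P(A1 U ... U An)
rhs = sum(
    (-1) ** (k + 1) * prob(set.intersection(*subset))
    for k in range(1, n + 1)
    for subset in combinations(A, k)
)
assert abs(lhs - rhs) < 1e-12
```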

4.1 The matching problem

Suppose that n people throw their hats in a box, where n ≥ 2, and then each
person picks one hat at random. (Each hat will be picked by exactly one person.)

We interpret “at random” to mean that every permutation of the n hats is equally

likely, and therefore has probability 1/n!.

In an alternative model, we can visualize the experiment sequentially: the

ﬁrst person picks one of the n hats, with all hats being equally likely; then,

the second person picks one of the n − 1 remaining hats, with every

remaining hat being equally likely, etc. It can be veriﬁed that the second model

is equivalent to the ﬁrst, in the sense that all permutations are again equally

likely.

We are interested in the mean, variance, and PMF of a random variable X,

deﬁned as the number of people that get back their own hat.1 This problem is

best approached using indicator variables.

For the ith person, we introduce a random variable Xi that takes the value 1

if the person selects his/her own hat, and takes the value 0 otherwise. Note that

X = X1 + X2 + · · · + Xn .

Since P(Xi = 1) = 1/n and P(Xi = 0) = 1 − 1/n, the mean of Xi is

E[Xi] = 1 · (1/n) + 0 · (1 − 1/n) = 1/n,

which implies that

E[X] = E[X1] + E[X2] + · · · + E[Xn] = n · (1/n) = 1.

In order to find the variance of X, we first find the variance and covariances
of the random variables Xi. We have

var(Xi) = (1/n)(1 − 1/n).

¹ For more results on various extensions of the matching problem, see L.A. Zager and G.C.
Verghese, "Caps and robbers: what can you expect?," College Mathematics Journal, v. 38, n. 3,
2007, pp. 185-191.


For i ≠ j, we have

cov(Xi, Xj) = E[(Xi − E[Xi])(Xj − E[Xj])]
            = E[Xi Xj] − E[Xi] E[Xj]
            = P(Xi = 1 and Xj = 1) − P(Xi = 1)P(Xj = 1)
            = P(Xi = 1)P(Xj = 1 | Xi = 1) − P(Xi = 1)P(Xj = 1)
            = (1/n) · (1/(n − 1)) − 1/n²
            = 1/(n²(n − 1)).

Therefore,

var(X) = var(Σ_{i=1}^n Xi)
       = Σ_{i=1}^n var(Xi) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(Xi, Xj)
       = n · (1/n)(1 − 1/n) + 2 · (n(n − 1)/2) · 1/(n²(n − 1))
       = 1.

Finding the PMF of X is a little harder. Let us ﬁrst dispense with some

easy cases. We have P(X = n) = 1/n!, because there is only one (out of

the n! possible) permutations under which every person receives their own hat.

Furthermore, the event X = n − 1 is impossible: if n − 1 persons have received

their own hat, the remaining person must also have received their own hat.

Let us continue by finding the probability that X = 0. Let Ai be the event
that the ith person gets their own hat, i.e., Xi = 1. Note that the event {X = 0}
is the same as the event ∩i Aic. Thus, P(X = 0) = 1 − P(∪_{i=1}^n Ai). Using the
inclusion-exclusion formula, we have

P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · .

Furthermore, for any distinct indices i1, . . . , ik, we have

P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik) = (1/n) · (1/(n − 1)) · · · (1/(n − k + 1)) = (n − k)!/n!.   (1)


Thus,

P(∪_{i=1}^n Ai) = n · (1/n) − C(n, 2) (n − 2)!/n! + C(n, 3) (n − 3)!/n!
                − · · · + (−1)^{n+1} C(n, n) (n − n)!/n!
              = 1 − 1/2! + 1/3! − · · · + (−1)^{n+1} (1/n!).

We conclude that

P(X = 0) = 1/2! − 1/3! + · · · + (−1)^n (1/n!).   (2)

Note that P(X = 0) → e^{−1}, as n → ∞.

To conclude, let us now fix some integer r, with 0 < r ≤ n − 2, and calculate
P(X = r). The event {X = r} can only occur as follows: for some subset S of
{1, . . . , n}, of cardinality r, the following two events, BS and CS, occur:

BS: every person i ∈ S receives their own hat;
CS: for every i ∉ S, person i does not receive their own hat.

We then have

{X = r} = ∪_{S: |S|=r} (BS ∩ CS).

The events BS ∩ CS, for different subsets S, are disjoint. Furthermore, by symmetry,
P(BS ∩ CS) is the same for every S of cardinality r. Thus,

P(X = r) = Σ_{S: |S|=r} P(BS ∩ CS) = C(n, r) P(BS) P(CS | BS).

Note that

P(BS) = (n − r)!/n!,

by the same argument as in Eq. (1). Conditioned on the event that the r persons

in the set S have received their own hats, the event CS will materialize if and

only if none of the remaining n − r persons receive their own hat. But this is

the same situation as the one analyzed when we calculated the probability that

X = 0, except that n needs to be replaced by n − r. We conclude that

P(CS | BS) = 1/2! − 1/3! + · · · + (−1)^{n−r} (1/(n − r)!).


Putting everything together, we conclude that

P(X = r) = C(n, r) ((n − r)!/n!) (1/2! − 1/3! + · · · + (−1)^{n−r} (1/(n − r)!))
         = (1/r!) (1/2! − 1/3! + · · · + (−1)^{n−r} (1/(n − r)!)).
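For small n, the PMF formula can be checked against a brute-force enumeration of all n! permutations (n = 6 below is arbitrary); the same enumeration also confirms that the mean and variance of X are both exactly 1:

```python
# Brute-force check of the matching PMF
# P(X = r) = (1/r!) * (1/2! - 1/3! + ... + (-1)^(n-r)/(n-r)!), 0 < r <= n-2.
from itertools import permutations
from math import factorial

n = 6
counts = [0] * (n + 1)
for perm in permutations(range(n)):
    counts[sum(perm[i] == i for i in range(n))] += 1
pmf = [c / factorial(n) for c in counts]

def formula(r):
    s = sum((-1) ** j / factorial(j) for j in range(2, n - r + 1))
    return s / factorial(r)

for r in range(1, n - 1):
    assert abs(pmf[r] - formula(r)) < 1e-9

mean = sum(r * q for r, q in enumerate(pmf))
var = sum((r - mean) ** 2 * q for r, q in enumerate(pmf))
assert abs(mean - 1) < 1e-9 and abs(var - 1) < 1e-9
```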

Note that for each ﬁxed r, the probability P(X = r) converges to e−1 /r!,

as n → ∞, which corresponds to a Poisson distribution with parameter 1. An

intuitive justiﬁcation is as follows. The random variables Xi are not independent

(in particular, their covariance is nonzero). On the other hand, as n → ∞, they

are “approximately independent”. Furthermore, the success probability for each

person is 1/n, and the situation is similar to the one in our earlier proof that the

binomial PMF approaches the Poisson PMF.

5 CONDITIONAL EXPECTATIONS

Recall that the conditional PMF of a discrete random variable X is defined by
pX|Y(x | y) = P(X = x | Y = y), for every y such that pY(y) > 0, i.e., given
the value of a random variable Y. Similarly, given an event A, we can define a
conditional PMF pX|A, by letting pX|A(x) = P(X = x | A). In either case, the
conditional PMF, as a function of x, is a bona fide PMF (a nonnegative function
that sums to one). As such, it is natural to associate a (conditional) expectation
to the (conditional) PMF.

Definition 1. Given an event A, such that P(A) > 0, and a discrete random
variable X, the conditional expectation of X given A is defined as

E[X | A] = Σ_x x pX|A(x),

whenever the sum is well defined.

Note that the preceding also provides a definition for a conditional expectation
of the form E[X | Y = y], for any y such that pY(y) > 0: just let A be the
event {Y = y}, which yields

E[X | Y = y] = Σ_x x pX|Y(x | y).

We note that the conditional expectation is always well deﬁned when either

the random variable X is nonnegative, or when the random variable X is inte

grable. In particular, whenever E[|X|] < ∞, we also have E[|X| | Y = y] < ∞,


for every y such that pY (y) > 0. To verify the latter assertion, note that for

every y such that pY (y) > 0, we have

� � pX,Y (x, y) 1 � E[|X |]

|x|pX|Y (x | y) = |x| ≤ |x|pX (x) = .

x x

pY (y) pY (y) x pY (y )

The converse, however, is not true: it is possible that E[|X| | Y = y] is ﬁnite for

every y that has positive probability, while E[|X|] = ∞.

The conditional expectation is essentially the same as an ordinary expecta

tion, except that the original PMF is replaced by the conditional PMF. As such,

the conditional expectation inherits all the properties of ordinary expectations

(cf. Proposition 4 in the notes for Lecture 6).

A simple calculation yields

Σ_y E[X | Y = y] pY(y) = Σ_y Σ_x x pX|Y(x | y) pY(y)
                       = Σ_y Σ_x x pX,Y(x, y)
                       = E[X].

Suppose now that {Ai} is a countable family of disjoint events that forms a
partition of the probability space Ω. Define a random variable Y by letting Y = i
if and only if Ai occurs. Then, pY(i) = P(Ai), and E[X | Y = i] = E[X | Ai],
which yields

E[X] = Σ_i E[X | Ai] P(Ai).

Example. (The mean of the geometric.) Let X be a geometric random variable with
parameter p, so that pX(k) = (1 − p)^{k−1} p, for k ∈ N. We first observe that the
geometric distribution is memoryless: for k ∈ N, we have

P(X − 1 = k | X > 1) = P(X = k + 1, X > 1)/P(X > 1)
                     = P(X = k + 1)/P(X > 1)
                     = (1 − p)^k p/(1 − p) = (1 − p)^{k−1} p
                     = P(X = k).


In words, in a sequence of repeated i.i.d., trials, given that the ﬁrst trial was a failure,

the distribution of the remaining trials, X − 1, until the ﬁrst success is the same as the

unconditional distribution of the number of trials, X, until the ﬁrst success. In particular,

E[X − 1 | X > 1] = E[X].

Using the total expectation theorem, we can write

E[X] = E[X | X > 1]P(X > 1)+E[X | X = 1]P(X = 1) = (1+E[X])(1−p)+1·p.

We solve for E[X], and ﬁnd that E[X] = 1/p.

Similarly,

E[X 2 ] = E[X 2 | X > 1]P(X > 1) + E[X 2 | X = 1]P(X = 1).

Note that

E[X 2 | X > 1] = E[(X −1)2 | X > 1]+E[2(X −1)+1 | X > 1] = E[X 2 ]+(2/p)+1.

Thus,

E[X 2 ] = (1 − p)(E[X 2 ] + (2/p) + 1) + p,

which yields

E[X²] = 2/p² − 1/p.

We conclude that

var(X) = E[X²] − (E[X])² = 2/p² − 1/p − 1/p² = (1 − p)/p².

Example. We flip a coin a random number N of times, where N is a Poisson
random variable with parameter λ. The probability of heads at each flip is p. Let X be
the number of heads, and let Y be the number of tails. Then,

E[X | N = n] = Σ_{m=0}^∞ m P(X = m | N = n) = Σ_{m=0}^n m C(n, m) p^m (1 − p)^{n−m}.

But this is just the expected number of heads in n independent trials, so that
E[X | N = n] = np.

Let us now calculate E[N | X = m]. We have

E[N | X = m] = Σ_{n=0}^∞ n P(N = n | X = m)
             = Σ_{n=m}^∞ n P(N = n, X = m)/P(X = m)
             = Σ_{n=m}^∞ n P(X = m | N = n) P(N = n)/P(X = m)
             = Σ_{n=m}^∞ n C(n, m) p^m (1 − p)^{n−m} (λ^n/n!) e^{−λ}/P(X = m).


Recall that X ∼ Pois(λp), so that P(X = m) = e^{−λp} (λp)^m/m!. Thus, after some
cancellations, we obtain

E[N | X = m] = Σ_{n=m}^∞ n (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
             = Σ_{n=m}^∞ (n − m) (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
               + m Σ_{n=m}^∞ (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
             = λ(1 − p) + m.

A faster way of obtaining this result is as follows. From Theorem 3 in the notes for
Lecture 6, we have that X and Y are independent, and that Y is Poisson with parameter
λ(1 − p). Therefore,

E[N | X = m] = E[X + Y | X = m] = m + E[Y | X = m] = m + E[Y] = m + λ(1 − p).
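A simulation sketch of this example is below; the parameter values and the inverse-transform Poisson sampler are illustrative choices, not part of the notes:

```python
# Simulation check of E[N | X = m] = lam*(1 - p) + m for the
# Poisson-splitting example (lam, p, m chosen arbitrarily).
import random
from math import exp

random.seed(0)
lam, p, trials = 6.0, 0.4, 200_000

def poisson(l):
    # inverse-transform sampling of a Poisson(l) random variable
    u, k, term = random.random(), 0, exp(-l)
    cum = term
    while u > cum and term > 0:
        k += 1
        term *= l / k
        cum += term
    return k

samples = {}                       # samples of N, grouped by the observed X
for _ in range(trials):
    n = poisson(lam)
    x = sum(random.random() < p for _ in range(n))
    samples.setdefault(x, []).append(n)

m = 2
cond_mean = sum(samples[m]) / len(samples[m])
assert abs(cond_mean - (lam * (1 - p) + m)) < 0.1
```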

Let X and Y be two discrete random variables. For any ﬁxed value of y, the

expression E[X | Y = y] is a real number, which however depends on y, and

can be used to deﬁne a function φ : R → R, by letting φ(y) = E[X | Y = y].

Consider now the random variable φ(Y ); this random variable takes the value

E[X | Y = y] whenever Y takes the value y, which happens with probability

P(Y = y). This random variable will be denoted as E[X | Y ]. (Strictly speak

ing, one needs to verify that this is a measurable function, which is left as an

exercise.)

Example. Let us return to the last example and ﬁnd E[X | N ] and E[N | X]. We found

that E[X | N = n] = np. Thus E[X | N ] = N p, i.e., it is a random variable that takes

the value np with probability P(N = n) = (λn /n!)e−λ . We found that E[N | X =

m] = λ(1 − p) + m. Thus E[N | X] = λ(1 − p) + X.

Note further that

E[E[X | N]] = E[N p] = λp = E[X],

and

E[E[N | X]] = λ(1 − p) + E[X] = λ(1 − p) + λp = λ = E[N].

This is not a coincidence; the equality E[E[X | Y ]] = E[X] is always true, as we shall

now see. In fact, this is just the total expectation theorem, written in more abstract

notation.


Theorem 2. Let g : R → R be a measurable function such that Xg(Y) is
either nonnegative or integrable. Then,

E[E[X | Y] g(Y)] = E[X g(Y)].

Proof: We have

E[E[X | Y] g(Y)] = Σ_y E[X | Y = y] g(y) pY(y)
                 = Σ_y Σ_x x pX|Y(x | y) g(y) pY(y)
                 = Σ_{x,y} x g(y) pX,Y(x, y) = E[X g(Y)].

In particular,

E[(E[X | Y] − X) g(Y)] = 0.   (3)

We can think of E[X | Y] as an estimate of X, constructed on the
basis of Y, and E[X | Y] − X as an estimation error. The above formula says
that the estimation error is uncorrelated with every function of the original data.

Equation (3) can be used as the basis for an abstract deﬁnition of conditional

expectations. Namely, we deﬁne the conditional expectation as a random vari

able of the form φ(Y ), where φ is a measurable function, that has the property

E[(φ(Y) − X) g(Y)] = 0,

for every measurable function g. The merit of this definition is that it can

be used for all kinds of random variables (discrete, continuous, mixed, etc.).

However, for this deﬁnition to be sound, there are two facts that need to be

veriﬁed:

(a) Existence: It turns out that as long as X is integrable, a function φ with the

above properties is guaranteed to exist. We already know that this is the

case for discrete random variables: the conditional expectation as deﬁned in

the beginning of this section does have the desired properties. For general

random variables, this is a nontrivial and deep result. It will be revisited

later in this course.


(b) Uniqueness: It turns out that there is essentially only one function φ with

the above properties. More precisely, any two functions with the above

properties are equal with probability 1.



MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 8 10/1/2008

Contents

1. Continuous random variables

2. Examples

3. Expected values

4. Joint distributions

5. Independence

Readings: For a less technical version of this material, but with more discussion

and examples, see Sections 3.1-3.5 of [BT] and Sections 4.1-4.5 of [GS].

1 CONTINUOUS RANDOM VARIABLES

A random variable X is said to be continuous if its CDF
can be written in the form

P(X ≤ x) = FX(x) = ∫_{−∞}^x fX(t) dt,

for some nonnegative measurable function f : R → [0, ∞), which is called the
PDF of X. We then have, for any Borel set B,

P(X ∈ B) = ∫_B fX(x) dx.

Technical remark: Since we have not yet defined the notion of an integral of
a measurable function, the discussion in these notes will be rigorous only when
we deal with integrals that can be interpreted in the usual sense of calculus,
namely, Riemann integrals. For now, let us just concentrate on functions that are
piecewise continuous, with a finite number of discontinuities.¹

¹ The reader should revisit Section 4 of the notes for Lecture 5.

We note that fX should be more appropriately called “a” (as opposed to

“the”) PDF of X, because it is not unique. For example, if we modify fX at a

ﬁnite number of points, its integral is unaffected, so multiple densities can corre

spond to the same CDF. It turns out, however, that any two densities associated

with the same CDF are equal except on a set of Lebesgue measure zero.

A PDF is in some ways similar to a PMF, except that the value fX (x) cannot

be interpreted as a probability. In particular, the value of fX (x) is allowed to

be greater than one for some x. Instead, the proper intuitive interpretation is the

fact that if fX is continuous over a small interval [x, x + δ], then

P(x ≤ X ≤ x + δ) ≈ fX (x)δ.

Remark: The fact that a random variable X is continuous has no bearing on the

continuity of X as a function from Ω into R. In fact, we have not even deﬁned

what it means for a function on Ω to be continuous. But even in the special case

where Ω = R, we can have a discontinuous function X : R → R which is a

continuous random variable. Here is an example. Let the underlying probability

measure on Ω be the Lebesgue measure on the unit interval. Let

�

ω, 0 ≤ ω ≤ 1/2,

X(ω) =

1 + ω, 1/2 < ω ≤ 1.

The function X is discontinuous. The random variable X takes values in the set

[0, 1/2] ∪ (3/2, 2]. Furthermore, it is not hard to check that X is a continuous

random variable with PDF given by

�

1, x ∈ [0, 1/2] ∪ (3/2, 2]

fX (x) =

0 otherwise.

2 EXAMPLES

We now describe some important examples of continuous random variables.

2.1 Uniform

This is perhaps the simplest continuous random variable. Consider an interval
[a, b], and let

FX(x) = 0,                 if x ≤ a,
FX(x) = (x − a)/(b − a),   if a < x ≤ b,
FX(x) = 1,                 if x > b.

We say that X is uniformly distributed on [a, b], and denote
this distribution by U(a, b). We find that a corresponding PDF is given
by fX(x) = (dFX/dx)(x) = 1/(b − a) for x ∈ [a, b], and fX(x) = 0, otherwise.
When [a, b] = [0, 1], the probability law of a uniform random variable is just the
Lebesgue measure on [0, 1].

2.2 Exponential

Fix some λ > 0. Let FX (x) = 1 − e−λx , for x ≥ 0, and FX (x) = 0, for

x < 0. It is easy to check that FX satisﬁes the required properties of CDFs.

A corresponding PDF is fX (x) = λe−λx , for x ≥ 0, and fX (x) = 0, for

x < 0. We denote this distribution by Exp(λ). The exponential distribution

can be viewed as a “limit” of a geometric distribution. Indeed, if we ﬁx some δ

and consider the values of FX (kδ), for k = 1, 2, . . ., these values agree with the

values of a geometric CDF. Intuitively, the exponential distribution corresponds

to a limit of a situation where every δ time units, we toss a coin whose success

probability is λδ, and let X be the time elapsed until the ﬁrst success.

The distribution Exp(λ) has the following very important memorylessness
property: if X is an exponential random variable, then
for every x, t ≥ 0, we have P(X > x + t | X > x) = P(X > t). Indeed,

P(X > x + t | X > x) = P(X > x + t, X > x)/P(X > x) = P(X > x + t)/P(X > x)
                     = e^{−λ(x+t)}/e^{−λx} = e^{−λt} = P(X > t).

The memorylessness property is fundamental in the study of arrival
processes, in which the elapsed waiting time does not affect our probabilistic

model of the remaining time until an arrival. For example, suppose that the

time until the next bus arrival is an exponential random variable with parameter

λ = 1/5 (in minutes). Thus, there is probability e−1 that you will have to wait

for at least 5 minutes. Suppose that you have already waited for 10 minutes. The

probability that you will have to wait for at least another ﬁve minutes is still the

same, e−1 .
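The bus example, and the memorylessness identity behind it, can be checked numerically from the exponential CDF:

```python
# Numeric check of P(X > x + t | X > x) = P(X > t) for Exp(lambda),
# with lambda = 1/5 per minute as in the bus example.
from math import exp

lam = 0.2

def tail(x):            # P(X > x) = exp(-lam * x)
    return exp(-lam * x)

x, t = 10.0, 5.0
cond = tail(x + t) / tail(x)             # P(X > x + t | X > x)
assert abs(cond - tail(t)) < 1e-12
assert abs(tail(5.0) - exp(-1)) < 1e-12  # the e^{-1} from the example
```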

2.3 Normal

Perhaps the most widely used distribution is the normal (or Gaussian) distribution.
It involves parameters µ ∈ R and σ > 0, and the density

fX(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.

It can be checked that this is a legitimate PDF, i.e., that it integrates to one. Note

also that this PDF is symmetric around x = µ. We use the notation N (µ, σ 2 ) to

denote the normal distribution with parameters µ, σ. The distribution N (0, 1) is

referred to as the standard normal distribution; a corresponding random vari

able is also said to be standard normal.

There is no closed form formula for the corresponding CDF, but numerical

tables are available. These tables can also be used to ﬁnd probabilities associated

with general normal variables. This is because of the fact (to be veriﬁed later)

that if X ∼ N (µ, σ 2 ), then (X − µ)/σ ∼ N (0, 1). Thus,

P(X ≤ c) = P((X − µ)/σ ≤ (c − µ)/σ) = Φ((c − µ)/σ),

where Φ is the CDF of the standard normal, available from the normal tables.
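The standardization identity can be checked with the standard library's `NormalDist` (the parameter values below are arbitrary):

```python
# Check of P(X <= c) = Phi((c - mu)/sigma) using statistics.NormalDist
# (available in the Python standard library since 3.8).
from statistics import NormalDist

mu, sigma, c = 2.0, 3.0, 4.5
lhs = NormalDist(mu, sigma).cdf(c)
rhs = NormalDist(0, 1).cdf((c - mu) / sigma)
assert abs(lhs - rhs) < 1e-12
```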

2.4 Cauchy

Here, fX(x) = 1/(π(1 + x²)), x ∈ R. It is an exercise in calculus to show that
∫_{−∞}^∞ fX(t) dt = 1, so that fX is indeed a PDF. The corresponding distribution is
called a Cauchy distribution.

2.5 Power law

We have already defined discrete power law distributions. We present here a
continuous analog. Our starting point is to introduce tail probabilities that decay
according to a power law: P(X > x) = β/x^α, for x ≥ c > 0, for some parameters
α, c > 0. In this case, the CDF is given by FX(x) = 1 − β/x^α, for x ≥ c, and
FX(x) = 0, for x < c. For X to be a continuous random variable, the CDF
cannot have a jump at x = c, and we therefore need β = c^α. The corresponding
density is of the form

fX(t) = (dFX/dx)(t) = αc^α/t^{α+1}, for t ≥ c.

3 EXPECTED VALUES

Similar to the discrete case, given a continuous random variable X with PDF
fX, we define

E[X] = ∫_{−∞}^∞ x fX(x) dx.

This integral is well defined and finite if ∫_{−∞}^∞ |x| fX(x) dx < ∞, in which case
we say that the random variable X is integrable. The integral is also well defined,
but infinite, if one, but not both, of the integrals ∫_{−∞}^0 x fX(x) dx and
∫_0^∞ x fX(x) dx is infinite. If both of these integrals are infinite, the expected
value is not defined.

Practically all of the results developed for discrete random variables carry

over to the continuous case. Many of them, e.g., E[X + Y ] = E[X] + E[Y ],

have the exact same form. We list below two results in which sums need to be

replaced by integrals.

Proposition 1. Let X be a nonnegative continuous random variable with
CDF FX. Then

E[X] = ∫_0^∞ (1 − FX(t)) dt.

Proof: We have

∫_0^∞ (1 − FX(t)) dt = ∫_0^∞ P(X > t) dt = ∫_0^∞ ∫_t^∞ fX(x) dx dt
                    = ∫_0^∞ ∫_0^x fX(x) dt dx = ∫_0^∞ x fX(x) dx = E[X].

(The interchange of the order of integration turns out to be justified because the
integrand is nonnegative.)
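The tail-integral formula can be illustrated numerically; the sketch below uses a left Riemann sum for X ~ Exp(λ), for which both sides equal 1/λ (λ and the discretization are arbitrary):

```python
# Numeric check of E[X] = integral_0^inf P(X > t) dt for X ~ Exp(lam),
# where P(X > t) = exp(-lam * t) and E[X] = 1/lam.
from math import exp

lam = 0.5
dt, T = 1e-3, 60.0   # step size and truncation point of the Riemann sum

tail_integral = sum(exp(-lam * (k * dt)) * dt for k in range(int(T / dt)))
assert abs(tail_integral - 1 / lam) < 5e-3
```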


Proposition 2. Let X be a continuous random variable with density fX, and
suppose that g : R → R is a (Borel) measurable function such that g(X) is
integrable. Then,

E[g(X)] = ∫_{−∞}^∞ g(t) fX(t) dt.

Proof: Let us express the function g as the difference of two nonnegative
functions,

g(x) = g⁺(x) − g⁻(x),

where g⁺(x) = max{g(x), 0}, and g⁻(x) = max{−g(x), 0}. In particular, for
any t ≥ 0, we have g(x) > t if and only if g⁺(x) > t.

We will use the result

E[g(X)] = ∫_0^∞ P(g(X) > t) dt − ∫_0^∞ P(g(X) < −t) dt

from Proposition 1. The first term in the right-hand side is equal to

∫_0^∞ ∫_{{x | g(x)>t}} fX(x) dx dt = ∫_{−∞}^∞ ∫_{{t | 0≤t<g(x)}} fX(x) dt dx = ∫_{−∞}^∞ g⁺(x) fX(x) dx.

Similarly, the second term is equal to

∫_0^∞ P(g(X) < −t) dt = ∫_{−∞}^∞ g⁻(x) fX(x) dx.

Therefore,

E[g(X)] = ∫_{−∞}^∞ g⁺(x) fX(x) dx − ∫_{−∞}^∞ g⁻(x) fX(x) dx = ∫_{−∞}^∞ g(x) fX(x) dx.

Note that for this result to hold, the random variable g(X) need not be con

tinuous. The proof is similar to the one for Proposition 1, and involves an in

terchange of the order of integration; see [GS] for a proof for the special case

where g ≥ 0.

4 JOINT DISTRIBUTIONS

Given two random variables X and Y, defined on the same probability
space, we say that they are jointly continuous if there exists a measurable²
fX,Y : R² → [0, ∞) such that their joint CDF satisfies

FX,Y(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y fX,Y(u, v) du dv.

² Measurability here means that for every Borel subset B of R, the set f⁻¹(B) is a Borel subset
of R². The Borel σ-field in R² is the one generated by sets of the form [a, b] × [c, d].

At those points where the joint PDF is continuous, we have

(∂²F/∂x∂y)(x, y) = fX,Y(x, y).

Similar to what was mentioned for the case of a single random variable, for
any Borel subset B of R², we have

P((X, Y) ∈ B) = ∫∫_B fX,Y(x, y) dx dy = ∫∫_{R²} I_B(x, y) fX,Y(x, y) dx dy.

We observe that

P(X ≤ x) = ∫_{−∞}^x ∫_{−∞}^∞ fX,Y(u, v) dv du.

Hence, X is a continuous random variable, with marginal PDF given by

fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy.

We have just argued that if X and Y are jointly continuous, then X (and,

similarly, Y ) is a continuous random variable. The converse is not true. For a

trivial counterexample, let X be a continuous random variable, and let Y =

X. Then the set {(x, y) ∈ R2 | x = y} has zero area (zero Lebesgue measure),

but unit probability, which is impossible for jointly continuous random variables.

In particular, the corresponding probability law on R2 is neither discrete nor

continuous, hence qualiﬁes as “singular.”

Proposition 2 has a natural extension to the case of multiple random variables.

³ The Lebesgue measure on R² is the unique measure µ defined on the Borel subsets of R²
that satisfies µ([a, b] × [c, d]) = (b − a)(d − c), i.e., agrees with the elementary notion of "area"
on rectangular sets. Existence and uniqueness of such a measure is obtained from the Extension
Theorem, in a manner similar to the one used in our construction of the Lebesgue measure on R.

Proposition 3. Let X and Y be jointly continuous with PDF fX,Y, and
suppose that g : R² → R is a (Borel) measurable function such that g(X, Y) is
integrable. Then,

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(u, v) fX,Y(u, v) du dv.

5 INDEPENDENCE

Recall that two random variables, X and Y , are said to be independent if for any

two Borel subsets, B1 and B2 , of the real line, we have P(X ∈ B1 , Y ∈ B2 ) =

P(X ∈ B1 )P(Y ∈ B2 ).

Similar to the discrete case (cf. Proposition 1 and Theorem 1 in Section 3 of

Lecture 6), simpler criteria for independence are available.

Proposition. Let X and Y be random variables, defined on
the same probability space. The following are equivalent.

(a) The random variables X and Y are independent.

(b) For any x, y ∈ R, the events {X ≤ x} and {Y ≤ y} are independent.

(c) For any x, y ∈ R, we have FX,Y (x, y) = FX (x)FY (y).

(d) If fX , fY , and fX,Y are corresponding marginal and joint densities, then

fX,Y (x, y) = fX (x)fY (y), for all (x, y) except possibly on a set that has

Lebesgue measure zero.

The proof parallels the proofs in Lecture 6, except for the last condition,

for which the argument is simple when the densities are continuous functions

(simply differentiate the CDF), but requires more care otherwise.



MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 9 10/6/2008

Contents

2. Conditional PDFs

3. The bivariate normal distribution

4. Conditional expectation as a random variable

5. Mixed versions of Bayes’ rule

Recall that two random variables X and Y are said to be jointly continuous if
there exists a nonnegative measurable function fX,Y such that

P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y fX,Y(u, v) dv du.

Once we have in our hands a general definition of integrals, this can be used to
establish that for every Borel subset B of R², we have

P((X, Y) ∈ B) = ∫∫_B fX,Y(u, v) du dv.

Furthermore, the marginal PDF of X is given by

fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy.

Finally, recall that E[g(X)] = ∫ g(x) fX(x) dx. Similar to the discrete
case, the expectation of g(X) = X^m and g(X) = (X − E[X])^m is called
the mth moment and the mth central moment, respectively, of X. In particular,
var(X) = E[(X − E[X])²] is the variance of X.

The properties of expectations developed for discrete random variables in

Lecture 6 (such as linearity) apply to the continuous case as well. The subsequent development, e.g., for the covariance and correlation, also applies to

the continuous case, practically without any changes. The same is true for the

Cauchy-Schwarz inequality.

Finally, we note that all of the definitions and formulas have obvious extensions to the case of more than two random variables.

2 CONDITIONAL PDFS

For the case of discrete random variables, the conditional CDF is deﬁned by

FX|Y (x | y) = P(X ≤ x | Y = y), for any y such that P(Y = y) > 0. However,

this deﬁnition cannot be extended to the continuous case because P(Y = y) = 0,

for every y. Instead, we should think of FX|Y (x | y) as a limit of P(X ≤ x | y ≤

Y ≤ y + δ), as δ decreases to zero. Note that

FX|Y(x | y) ≈ P(X ≤ x | y ≤ Y ≤ y + δ)
= P(X ≤ x, y ≤ Y ≤ y + δ) / P(y ≤ Y ≤ y + δ)
≈ ( ∫_{−∞}^{x} ∫_{y}^{y+δ} fX,Y(u, v) dv du ) / (δ fY(y))
≈ ( δ ∫_{−∞}^{x} fX,Y(u, y) du ) / (δ fY(y))
= ( ∫_{−∞}^{x} fX,Y(u, y) du ) / fY(y).

We are thus led to the following definitions.

(a) The conditional CDF of X given Y is defined by

FX|Y(x | y) = ∫_{−∞}^{x} ( fX,Y(u, y) / fY(y) ) du,

for every y such that fY (y) > 0, where fY is the marginal PDF of Y .

(b) The conditional PDF of X given Y is defined by

fX|Y(x | y) = fX,Y(x, y) / fY(y),

(c) The conditional expectation of X given Y = y is defined by

E[X | Y = y] = ∫ x fX|Y(x | y) dx,

(d) The conditional probability of the event {X ∈ A}, given Y = y, is deﬁned

by

P(X ∈ A | Y = y) = ∫_A fX|Y(x | y) dx,

It can be checked that FX | Y is indeed a CDF (it satisﬁes the required prop

erties such as monotonicity, right-continuity, etc.) For example, observe that

lim_{x→∞} FX|Y(x | y) = ∫_{−∞}^{∞} ( fX,Y(u, y) / fY(y) ) du = 1.

Finally, we note that

E[X] = ∫ E[X | Y = y] fY(y) dy,

and

P(X ∈ A) = ∫ P(X ∈ A | Y = y) fY(y) dy.

These two relations are established as in the discrete case, by just replacing summations with integrals. They can be rigorously justified if the random variable

X is nonnegative or integrable.
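These relations are easy to check numerically. The sketch below (a toy model chosen for illustration, not from the notes) uses the joint PDF fX,Y(x, y) = x + y on the unit square, for which E[X] = 7/12, and verifies the total expectation formula with midpoint-rule integrals:

```python
# Toy joint PDF on the unit square: f(x, y) = x + y (integrates to 1).
def f(x, y):
    return x + y

m = 400                      # grid resolution for midpoint-rule integration
h = 1.0 / m
grid = [(i + 0.5) * h for i in range(m)]

def f_Y(y):                  # marginal PDF of Y
    return sum(f(x, y) for x in grid) * h

def cond_exp_X(y):           # E[X | Y = y] = integral of x f_{X|Y}(x | y) dx
    return sum(x * f(x, y) for x in grid) * h / f_Y(y)

lhs = sum(x * f(x, y) for x in grid for y in grid) * h * h   # E[X] directly
rhs = sum(cond_exp_X(y) * f_Y(y) for y in grid) * h          # total expectation
print(lhs, rhs)  # both ≈ 7/12 ≈ 0.5833
```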

3 THE BIVARIATE NORMAL DISTRIBUTION

Let us fix some ρ ∈ (−1, 1) and consider the function, called the standard

bivariate normal PDF,

f(x, y) = (1/(2π√(1−ρ²))) exp( −(x² − 2ρxy + y²) / (2(1−ρ²)) ).

Let X and Y be two jointly continuous random variables, deﬁned on the same

probability space, whose joint PDF is f. Then:

(a) The function f is a legitimate joint PDF (it is nonnegative and integrates to one).

(b) The marginal densities of X and Y are both N(0, 1), the standard normal PDF.

(c) We have ρ(X, Y) = ρ. Also, X and Y are independent iff ρ = 0.

(d) The conditional density of X, given Y = y, is N (ρy, 1 − ρ2 ).

Proof: We will use repeatedly the fact that (1/(√(2π)σ)) exp(−(x−µ)²/(2σ²)) is a PDF (namely, the PDF of the N(µ, σ²) distribution), and thus integrates to one.

To compute the marginal of Y, we complete the square in the exponent and obtain

fY(y) = ∫_{−∞}^{∞} f(x, y) dx = ( exp(−(1−ρ²)y²/(2(1−ρ²))) / (2π√(1−ρ²)) ) ∫_{−∞}^{∞} exp( −(x − ρy)²/(2(1−ρ²)) ) dx

= ( exp(−y²/2) / (2π√(1−ρ²)) ) ∫_{−∞}^{∞} exp( −(x − ρy)²/(2(1−ρ²)) ) dx.

But we recognize

(1/√(2π(1−ρ²))) exp( −(x − ρy)²/(2(1−ρ²)) )

as the PDF of the N(ρy, 1−ρ²) distribution. Thus, the integral of this density equals one, and we obtain

fY(y) = exp(−y²/2) / √(2π),

which is the standard normal PDF. Since ∫_{−∞}^{∞} fY(y) dy = 1, we conclude

that f (x, y) integrates to one, and is a legitimate joint PDF. Furthermore,

we have veriﬁed that the marginal PDF of Y (and by symmetry, also the

marginal PDF of X) is the standard normal PDF, N (0, 1).

(c) The marginals of X and Y are standard normal, and therefore have zero mean. We now have

E[XY] = ∫∫ xy f(x, y) dx dy.

The inner integral is

∫_{−∞}^{∞} x f(x, y) dx = ( exp(−y²/2) / (2π√(1−ρ²)) ) ∫_{−∞}^{∞} x exp( −(x − ρy)²/(2(1−ρ²)) ) dx.

But

(1/√(2π(1−ρ²))) ∫_{−∞}^{∞} x exp( −(x − ρy)²/(2(1−ρ²)) ) dx = ρy,

since this is the expected value of the N(ρy, 1−ρ²) distribution. Thus,

E[XY] = ∫∫ xy f(x, y) dx dy = ∫ y·ρy fY(y) dy = ρ ∫ y² fY(y) dy = ρ,

since the integral is the second moment of the standard normal, which

is equal to one. We have established that Cov(X, Y ) = ρ. Since the

variances of X and Y are equal to unity, we obtain ρ(X, Y ) = ρ. If X and

Y are independent, then ρ(X, Y ) = 0, implying that ρ = 0. Conversely,

if ρ = 0, then

f(x, y) = (1/(2π)) exp( −(x² + y²)/2 ) = fX(x) fY(y),

and therefore X and Y are independent. Note that the condition ρ(X, Y ) =

0 implies independence, for the special case of the bivariate normal, whereas

this implication is not always true, for general random variables.

(d) Let us now compute the conditional PDF. Using the expression for fY (y),

we have

fX|Y(x | y) = f(x, y) / fY(y)
= (1/(2π√(1−ρ²))) exp( −(x² − 2ρxy + y²)/(2(1−ρ²)) ) · √(2π) exp(y²/2)
= (1/√(2π(1−ρ²))) exp( −(x² − 2ρxy + ρ²y²)/(2(1−ρ²)) )
= (1/√(2π(1−ρ²))) exp( −(x − ρy)²/(2(1−ρ²)) ),

which we recognize as the N(ρy, 1−ρ²) PDF.
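The computations in parts (b) and (d) amount to the pointwise factorization f(x, y) = fY(y) · fX|Y(x | y), with fX|Y(· | y) the N(ρy, 1 − ρ²) density. A small script (added here for illustration; the test points are arbitrary) confirms the identity numerically:

```python
import math

def f_biv(x, y, rho):
    # standard bivariate normal PDF
    c = 2 * math.pi * math.sqrt(1 - rho ** 2)
    return math.exp(-(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2))) / c

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

rho = 0.6
for x, y in [(0.0, 0.0), (1.0, -0.5), (-2.0, 1.5)]:
    lhs = f_biv(x, y, rho)
    # f(x, y) = f_Y(y) * f_{X|Y}(x | y), with f_{X|Y} the N(rho*y, 1 - rho^2) PDF
    rhs = normal_pdf(y, 0.0, 1.0) * normal_pdf(x, rho * y, 1 - rho ** 2)
    print(lhs, rhs)  # equal up to rounding
```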

The PDF considered so far is the special case in which the means are zero and the variances are equal to one. More generally,

the bivariate normal PDF is speciﬁed by ﬁve parameters, µ1 , µ2 , σ1 , σ2 , ρ, and

is given by

f(x, y) = (1/(2πσ1σ2√(1−ρ²))) exp( −Q(x, y)/2 ),

where

Q(x, y) = (1/(1−ρ²)) [ (x − µ1)²/σ1² − 2ρ (x − µ1)(y − µ2)/(σ1σ2) + (y − µ2)²/σ2² ].

For this case, it can be veriﬁed that

E[X] = µ1, var(X) = σ1², E[Y] = µ2, var(Y) = σ2², ρ(X, Y) = ρ.

These properties can be derived by extending the tedious calculations in the

preceding proof.

There is a further generalization to more than two random variables, resulting in the multivariate normal distribution. It will be carried out in a more

elegant manner in a later lecture.

4 CONDITIONAL EXPECTATION AS A RANDOM VARIABLE

Similar to the discrete case, we define E[X | Y] as a random variable that takes

the value E[X | Y = y], whenever Y = y, where fY (y) > 0. Formally, E[X | Y ]

is a function ψ : Ω → R that satisﬁes

ψ(ω) = ∫ x fX|Y(x | y) dx,

for every ω such that Y (ω) = y, where fY (y) > 0. Note that nothing is said

about the value of ψ(ω) for those ω that result in a y at which fY (y) = 0.

However, the set of such ω has zero probability measure. Because the value

φ(Y (ω)), for some function φ : R → R. It turns out that both functions ψ and

φ can be taken to be measurable.

One might expect that when X and Y are jointly continuous, then E[X | Y ]

is a continuous random variable, but this is not the case. To see this, suppose

that X and Y are independent, in which case E[X | Y = y] = E[X], which also

implies that E[X | Y ] = E[X]. Thus, E[X | Y ] takes a constant value, and is

therefore a trivial case of a discrete random variable.

Similar to the discrete case, for every measurable function g, we have

E[ E[X | Y] g(Y) ] = E[X g(Y)]. (1)

The proof is the same as in the discrete case, with integrals replacing summations. In fact, this property can be

taken as a more abstract deﬁnition of the conditional expectation. By letting g

be identically equal to 1, we obtain

E[ E[X | Y] ] = E[X].

Example. We have a stick of unit length [0, 1], and break it at X, where X is uniformly

distributed on [0, 1]. Given the value x of X, we let Y be uniformly distributed on [0, x],

and let Z be uniformly distributed on [0, 1 − x]. We assume that conditioned on X = x,

the random variables Y and Z are independent. We are interested in the distribution of

Y and Z, their expected values, and the expected value of their product.

It is clear from symmetry that Y and Z have the same marginal distribution, so we

focus on Y . Let us ﬁrst ﬁnd the joint distribution of Y and X. We have fX (x) = 1, for

x ∈ [0, 1], and fY |X (y | x) = 1/x, for y ∈ [0, x]. Thus, the joint PDF is

fX,Y(x, y) = fY|X(y | x) fX(x) = (1/x) · 1 = 1/x, 0 ≤ y ≤ x ≤ 1.

We can now ﬁnd the PDF of Y :

fY(y) = ∫_{y}^{1} fX,Y(x, y) dx = ∫_{y}^{1} (1/x) dx = log x |_{y}^{1} = log(1/y)

(check that this indeed integrates to unity). Integrating by parts, we then obtain

E[Y] = ∫_{0}^{1} y fY(y) dy = ∫_{0}^{1} y log(1/y) dy = 1/4.

The above calculation is more involved than necessary. For a simpler argument,

simply observe that E[Y | X = x] = x/2, since Y conditioned on X = x is uniform on

[0, x]. In particular, E[Y | X] = X/2. It follows that E[Y ] = E[E[Y | X]] = E[X/2] =

1/4.

For an alternative version of this argument, consider the random variable Y /X.

Conditioned on the event X = x, this random variable takes values in the range [0, 1],

is uniformly distributed on that range, and has mean 1/2. Thus, the conditional PDF of

Y /X is not affected by the value x of X. This implies that Y /X is independent of X,

and we have

E[Y] = E[(Y/X) · X] = E[Y/X] · E[X] = (1/2) · (1/2) = 1/4.

To ﬁnd E[Y Z] we use the fact that, conditional on X = x, Y and Z are independent,

and obtain

E[Y Z] = E[ E[Y Z | X] ] = E[ E[Y | X] · E[Z | X] ]
= E[ (X/2) · ((1 − X)/2) ] = ∫_{0}^{1} ( x(1 − x)/4 ) dx = 1/24.
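A Monte Carlo simulation of the stick example (an illustration added here; the sample size and seed are arbitrary) reproduces E[Y] = 1/4 and E[Y Z] = 1/24:

```python
import random

random.seed(1)
n = 200_000
sum_y = sum_yz = 0.0
for _ in range(n):
    x = random.random()             # break point X ~ U(0, 1)
    y = random.uniform(0.0, x)      # Y | X = x  ~  U(0, x)
    z = random.uniform(0.0, 1 - x)  # Z | X = x  ~  U(0, 1 - x), independent of Y
    sum_y += y
    sum_yz += y * z
print(sum_y / n, sum_yz / n)  # ≈ 1/4 = 0.25 and ≈ 1/24 ≈ 0.0417
```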

Exercise 1. Find the joint PDF of Y and Z. Find the probability P(Y + Z ≤ 1/3).

Find E[X|Y ], E[X|Z], and ρ(Y, Z).

The conditional expectation E[X | Y ] can be viewed as an estimate of X, based

on the value of Y . In fact, it is an optimal estimate, in the sense that the mean

square of the resulting estimation error, X − E[X | Y ], is as small as possible.

Theorem 1. Suppose that E[X 2 ] < ∞. Then, for any measurable function

g : R → R, we have

E[ (X − E[X | Y])² ] ≤ E[ (X − g(Y))² ].

Proof: We have

E[ (X − g(Y))² ] = E[ (X − E[X | Y] + E[X | Y] − g(Y))² ]
= E[ (X − E[X | Y])² ] + E[ (E[X | Y] − g(Y))² ]
+ 2E[ (X − E[X | Y])(E[X | Y] − g(Y)) ]
≥ E[ (X − E[X | Y])² ].

The inequality above is obtained by noticing that the term E[ (E[X | Y] − g(Y))² ] is always nonnegative, and that the term E[ (X − E[X | Y])(E[X | Y] − g(Y)) ] is of the form E[ (X − E[X | Y]) ψ(Y) ], for ψ(Y) = E[X | Y] − g(Y), and is therefore equal to zero, by Eq. (1).

Notice that the preceding proof only relies on the property (1). As we have

discussed, we can view this as the deﬁning property of conditional expectations,

for general random variables. It follows that the preceding theorem is true for

all kinds of random variables.
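To see the theorem in action, the sketch below (a toy model of our own choosing, added for illustration) compares the mean squared error of the optimal estimator E[X | Y] with that of another estimator g(Y), in a model where E[X | Y] is known in closed form:

```python
import random

random.seed(2)
n = 100_000
mse_opt = mse_alt = 0.0
for _ in range(n):
    y = random.random()                    # Y ~ U(0, 1)
    x = y * y + random.gauss(0.0, 0.1)     # X = Y^2 + noise, so E[X | Y] = Y^2
    mse_opt += (x - y * y) ** 2            # error of the optimal estimator E[X | Y]
    mse_alt += (x - y) ** 2                # error of another estimator g(Y) = Y
mse_opt /= n
mse_alt /= n
print(mse_opt, mse_alt)  # ≈ 0.010 vs ≈ 0.043
```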

5 MIXED VERSIONS OF BAYES' RULE

Suppose that X is an unobserved random variable, and that we observe the value of a related random variable, Y, whose distribution depends on the

value of X. This dependence can be captured by a conditional CDF, FY |X .

On the basis of the observed value y of Y, we would like to make an inference on the unknown value of X. While sometimes this inference aims at a numerical estimate for X, the most complete answer, which includes everything that can

be said about X, is the conditional distribution of X, given Y . This conditional

distribution can be obtained by using an appropriate form of Bayes’ rule.

When X and Y are both discrete, Bayes’ rule takes the simple form

pX|Y(x | y) = pX(x) pY|X(y | x) / pY(y) = pX(x) pY|X(y | x) / Σ_{x′} pX(x′) pY|X(y | x′).

When X and Y are both continuous, Bayes’ rule takes a similar form,

fX|Y(x | y) = fX(x) fY|X(y | x) / fY(y) = fX(x) fY|X(y | x) / ∫ fX(x′) fY|X(y | x′) dx′,

It remains to consider the case where one random variable is discrete and

the other continuous. Suppose that K is a discrete random variable and Z is a

continuous random variable. We describe their joint distribution in terms of a

function fK,Z (k, z) that satisﬁes

P(K = k, Z ≤ z) = ∫_{−∞}^{z} fK,Z(k, t) dt.

We then have

pK(k) = P(K = k) = ∫_{−∞}^{∞} fK,Z(k, t) dt,

and¹

FZ(z) = P(Z ≤ z) = Σ_k ∫_{−∞}^{z} fK,Z(k, t) dt = ∫_{−∞}^{z} Σ_k fK,Z(k, t) dt,

so that

fZ(z) = Σ_k fK,Z(k, z)

is the PDF of Z.

Note that if P(K = k) > 0, then

P(Z ≤ z | K = k) = ∫_{−∞}^{z} ( fK,Z(k, t) / pK(k) ) dt,

Finally, for z such that fZ(z) > 0, we define pK|Z(k | z) = fK,Z(k, z)/fZ(z), and interpret it as the conditional probability of the event K = k, given that Z = z. (Note that we are conditioning on a zero-probability event; a more accurate interpretation is obtained by conditioning on the event z ≤ Z ≤ z + δ, and letting δ → 0.) With these definitions, we have

fK,Z(k, z) = pK(k) fZ|K(z | k) = fZ(z) pK|Z(k | z),

for every (k, z) for which fK,Z (k, z) > 0. By rearranging, we obtain two more

versions of the Bayes’ rule:

fZ|K(z | k) = fZ(z) pK|Z(k | z) / pK(k) = fZ(z) pK|Z(k | z) / ∫ fZ(z′) pK|Z(k | z′) dz′,

and

pK|Z(k | z) = pK(k) fZ|K(z | k) / fZ(z) = pK(k) fZ|K(z | k) / Σ_{k′} pK(k′) fZ|K(z | k′).

Note that all four versions of Bayes’ rule take the exact same form; the only

difference is that we use PMFs and summations for discrete random variables,

as opposed to PDFs and integrals for continuous random variables.
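As a concrete illustration of the mixed case (a hypothetical model added here, not from the notes), let K be Bernoulli(1/2) and let Z given K = k be N(k, 1); the last version of Bayes' rule then gives the posterior PMF of K:

```python
import math

# Hypothetical model: K ~ Bernoulli(1/2), and Z | K = k is N(k, 1).
p_K = {0: 0.5, 1: 0.5}

def f_Z_given_K(z, k):
    return math.exp(-(z - k) ** 2 / 2) / math.sqrt(2 * math.pi)

def p_K_given_Z(k, z):
    # pK|Z(k | z) = pK(k) fZ|K(z | k) / sum over k' of pK(k') fZ|K(z | k')
    den = sum(p_K[j] * f_Z_given_K(z, j) for j in p_K)
    return p_K[k] * f_Z_given_K(z, k) / den

print(p_K_given_Z(1, 0.5))  # exactly 1/2, by symmetry
print(p_K_given_Z(1, 3.0))  # ≈ 0.92: an observation z = 3 favors K = 1
```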

¹ The interchange of the summation and the integration can be rigorously justified, because the terms inside are nonnegative.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 10 10/8/2008

DERIVED DISTRIBUTIONS

Contents

1. Functions of a single random variable

2. Multivariate transformations

3. A single function of multiple random variables

4. Maximum and minimum of random variables

5. Sum of independent random variables – Convolution

1 FUNCTIONS OF A SINGLE RANDOM VARIABLE

Given a random variable X with a known distribution, and a function g, we are often interested in the distribution (CDF, PDF, or PMF) of the random variable Y = g(X). For the case of a discrete random variable X, this is

straightforward:

pY(y) = Σ_{{x | g(x)=y}} pX(x).

The continuous case is more interesting. Note that even if X is continuous, g(X) is not necessarily a continuous random variable, e.g., if the range of the function g is discrete. However, in many cases, Y

is continuous and its PDF can be found by following a systematic procedure.

The principal method for deriving the PDF of g(X) is the following two-step

approach.

Calculation of the PDF of a Function Y = g(X) of a Continuous Random Variable X

1. Calculate the CDF of Y:

FY(y) = P( g(X) ≤ y ) = ∫_{{x | g(x)≤y}} fX(x) dx.

2. Differentiate:

fY(y) = (dFY/dy)(y).

Example. Let Y = X², where X is a continuous random variable with a known PDF. For any y > 0, we have

FY(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = FX(√y) − FX(−√y),

and, by differentiating,

fY(y) = (1/(2√y)) fX(√y) + (1/(2√y)) fX(−√y), y > 0.
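The formula can be checked numerically for a standard normal X (a sketch added here for illustration), by comparing it with a finite-difference derivative of FY; in this case FY(y) = Φ(√y) − Φ(−√y) = erf(√(y/2)):

```python
import math

def phi(x):                       # N(0, 1) PDF
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def F_Y(y):                       # P(X^2 <= y) for X ~ N(0, 1)
    return math.erf(math.sqrt(y / 2))

def f_Y(y):                       # the formula derived above
    s = math.sqrt(y)
    return (phi(s) + phi(-s)) / (2 * s)

y, h = 1.7, 1e-5
numeric = (F_Y(y + h) - F_Y(y - h)) / (2 * h)
print(f_Y(y), numeric)  # the two values agree
```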

Example. Suppose now that Y = e^{X²}. Note that FY(y) = 0 for y < 1. For y ≥ 1, we have

FY(y) = P( e^{X²} ≤ y ) = P(X² ≤ log y) = P(X ≤ √(log y)).

Differentiating, we obtain

fY(y) = fX(√(log y)) · ( 1/(2y√(log y)) ), y > 1.

The calculation in the last example can be generalized as follows. Assume that

the range of the random variable X contains an open interval A. Suppose that

g is strictly monotone (say increasing), and also differentiable on the interval A. Let B be the set of values of g(x), as x ranges over A. Let g⁻¹ be the inverse

function of g, so that g(g −1 (y)) = y, for y ∈ B. Then, for y ∈ B, and using the

chain rule in the last step, we have

fY(y) = (d/dy) P( g(X) ≤ y ) = (d/dy) P( X ≤ g⁻¹(y) ) = fX(g⁻¹(y)) · (dg⁻¹/dy)(y).

Recall from calculus that the derivative of an inverse function satisﬁes

(dg⁻¹/dy)(y) = 1 / g′(g⁻¹(y)),

where g′ is the derivative of g. Therefore,

fY(y) = fX(g⁻¹(y)) · ( 1 / g′(g⁻¹(y)) ).

When g is strictly monotone decreasing, the only change is a minus sign in front of g′(g⁻¹(y)). Thus, the two cases can be summarized in the single formula:

fY(y) = fX(g⁻¹(y)) · ( 1 / |g′(g⁻¹(y))| ), y ∈ B. (1)
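As a sanity check of formula (1) (an illustration added here), take g(x) = e^x with X standard normal, so that g⁻¹(y) = log y and g′(g⁻¹(y)) = e^{log y} = y; the resulting density can be compared with a finite-difference derivative of FY(y) = Φ(log y):

```python
import math

def Phi(t):                       # standard normal CDF
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def phi(t):                       # standard normal PDF
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def f_Y(y):
    # formula (1) with g(x) = e^x: f_X(g^{-1}(y)) / |g'(g^{-1}(y))| = phi(log y) / y
    return phi(math.log(y)) / y

y, h = 2.0, 1e-5
numeric = (Phi(math.log(y + h)) - Phi(math.log(y - h))) / (2 * h)
print(f_Y(y), numeric)  # the two values agree
```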

A useful mnemonic for this formula is

fY(y) |dy| = fX(x) |dx|,

where x and y are related by y = g(x); since |dy| = |g′(x)| · |dx|, this gives

fY(y) |g′(x)| = fX(x).

Consider now the special case where g(x) = ax + b, i.e., Y = aX + b. We assume that a ≠ 0. Then, g′(x) = a and g⁻¹(y) = (y − b)/a. We obtain

fY(y) = (1/|a|) fX( (y − b)/a ).

Example. Suppose that X ∼ N(0, 1) and Y = aX + b. Then,

fY(y) = ( 1/(√(2π)|a|) ) exp( −(y − b)²/(2a²) ),

so that Y is N(b, a²). More generally, if X ∼ N(µ, σ²), then the same argument shows that Y = aX + b ∼ N(aµ + b, a²σ²). We conclude that a linear (more precisely, affine) function of a normal random variable is normal.

2 MULTIVARIATE TRANSFORMATIONS

Let X = (X1, . . . , Xn) be jointly continuous, with joint PDF fX(x) = fX(x1, . . . , xn). Consider a function g : Rⁿ → Rⁿ, and the random vector Y = (Y1, . . . , Yn) = g(X). Let gi

be the components of g, so that Yi = gi (X) = gi (X1 , . . . , Xn ). Suppose that

the function g is continuously differentiable on some open set A ⊂ Rn . Let

B = g(A) be the image of A under g. Assume that the inverse function g −1 is

well-deﬁned on B; that is, for every y ∈ B, there is a unique x ∈ Rn such that

g(x) = y.

The formula that we develop here is an extension of the formula fY(y) |g′(x)| = fX(x) that we derived for the one-dimensional case, with the derivative g′(x)

being replaced by a matrix of partial derivatives, and with the absolute value

being replaced by the absolute value of the determinant. It can be justiﬁed by

appealing to the change of variables theorem from multivariate calculus, but we

provide here a more transparent argument.

Let us ﬁrst assume that g is a linear function, of the form g(x) = M x, for

some n × n matrix M . Fix some x ∈ A and some δ > 0. Consider the cube

C = [x, x + δ]n , and assume that δ is small enough so that C ⊂ A. The image

D = {M x | x ∈ C} of the cube C under the mapping g is a parallelepiped.

Furthermore, the volume of D is known to be equal to |M | · δ n , where we use

| · | to denote the absolute value of the determinant of a matrix.

Having ﬁxed x, let us also ﬁx y = M x. Assuming that fX (x) is continuous

at x, we have

P(X ∈ C) = ∫_C fX(t) dt = fX(x)δⁿ + o(δⁿ) ≈ fX(x)δⁿ,

where o(δ n ) stands for a function such that limδ↓0 o(δ n )/δ n = 0, and where the

symbol ≈ indicates that the difference between the two sides is o(δ n ). Thus,

fX (x) · δ n ≈ P(X ∈ C)

= P(g(X) ∈ g(C))

= P(Y ∈ D)

≈ fY (y) · vol(D)

= fY (y) · |M | · δ n .

Let us now assume that the matrix M is invertible, so that its determinant is

nonzero. Using the relation y = Mx and the fact det(M⁻¹) = 1/det(M), we obtain

fY(y) = fX(M⁻¹y) / |M| = fX(M⁻¹y) · |M⁻¹|.

If, on the other hand, M is not invertible, then Y takes values in a proper subspace S of Rⁿ. Then, Y is not jointly continuous (cannot be described

by a joint PDF). On the other hand, if we restrict our attention to S, and since S

is isomorphic to Rm for some m < n, we could describe the distribution of Y

in terms of a joint PDF on Rm .

Let us now generalize to the case where g is continuously differentiable at x.

We deﬁne M (x) as the Jacobian matrix (∂g/∂x)(x), with entries (∂gi /∂xj )(x).

The image D = g(C) of the cube C is not a parallelepiped. However, from a

ﬁrst order Taylor series expansion, g is approximately linear in the vicinity of

x. It can then be shown that the set D has volume |M (x)| · δ n + o(δ n ). It then

follows, as in the linear case, that

fY(y) = fX(g⁻¹(y)) / |M(g⁻¹(y))| = fX(g⁻¹(y)) · |M⁻¹(g⁻¹(y))|.

We note a useful fact from calculus that sometimes simpliﬁes the application

of the above formula. If we deﬁne J(y) as the Jacobian (the matrix of partial

derivatives) of the mapping g −1 (y), and if some particular x and y are related

by y = g(x), then J(y) = M⁻¹(x). Therefore,

fY(y) = fX(g⁻¹(y)) · |J(y)|. (2)

Example. Let X and Y be independent standard normal random variables. Let g be

the mapping that transforms Cartesian coordinates to polar coordinates, and let

(R, Θ) = g(X, Y ). The mapping g is either undeﬁned or discontinuous at

(x, y) = (0, 0). So, strictly speaking, in order to apply the multivariate trans

formation formula (2), we should work with A = R2 \ {(0, 0)}. The inverse

mapping g⁻¹ is given by (x, y) = g⁻¹(r, θ) = (r cos θ, r sin θ). Its Jacobian matrix is of the form

⎡  cos θ      sin θ  ⎤
⎣ −r sin θ   r cos θ ⎦

and therefore, |J(r, θ)| = r cos²θ + r sin²θ = r. From the multivariate trans

formation formula, we obtain

fR,Θ(r, θ) = (1/(2π)) r e^{−(r² cos²θ + r² sin²θ)/2} = (1/(2π)) r e^{−r²/2}, r > 0.

We observe that the joint PDF of R and Θ is of the form fR,Θ = fR fΘ , where

fR(r) = r e^{−r²/2}, r > 0,

and

fΘ(θ) = 1/(2π), θ ∈ [0, 2π].

In particular, R and Θ are independent. The random variable R is said to have a

Rayleigh distribution.

We can also find the density of Z = R². For the mapping g defined by g(r) = r², we have g⁻¹(z) = √z and g′(r) = 2r, which leads to 1/g′(g⁻¹(z)) = 1/(2√z). We conclude that

fZ(z) = √z e^{−z/2} · ( 1/(2√z) ) = (1/2) e^{−z/2}, z > 0,

which we recognize as an exponential PDF with parameter 1/2.

An interesting consequence of the above results is that in order to simulate normal random variables, it suffices to generate two independent random variables, one uniform and one exponential. Furthermore, an exponential random variable is easy to generate using a nonlinear transformation of another independent uniform random variable.
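The steps above can be sketched as follows (an illustrative implementation; the sample size and seed are arbitrary): draw Θ uniform on [0, 2π], draw R with the Rayleigh density via R = √(−2 log U) (so that R² is exponential with parameter 1/2), and check that X = R cos Θ behaves like a standard normal:

```python
import math
import random

random.seed(3)
n = 200_000
xs = []
for _ in range(n):
    u1, u2 = random.random(), random.random()
    r = math.sqrt(-2.0 * math.log(1.0 - u1))  # R^2 ~ Exp(1/2); R has PDF r*exp(-r^2/2)
    theta = 2.0 * math.pi * u2                # Theta ~ U[0, 2*pi], independent of R
    xs.append(r * math.cos(theta))            # X = R cos(Theta) should be N(0, 1)

mean = sum(xs) / n
var = sum(x * x for x in xs) / n - mean ** 2
frac_neg = sum(1 for x in xs if x < 0) / n
print(mean, var, frac_neg)  # ≈ 0, ≈ 1, ≈ 0.5
```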


3 A SINGLE FUNCTION OF MULTIPLE RANDOM VARIABLES

Let X = (X1, . . . , Xn) be a vector of jointly continuous random variables with known PDF fX. Consider a function g1 : Rⁿ → R and the random variable Y1 = g1(X). Note that the formulas from Section 2 cannot be used

directly. In order to find the PDF of Y1, one possibility is to calculate the multidimensional integral

FY1(y) = ∫_{{x | g1(x) ≤ y}} fX(x) dx,

and then differentiate.

Another possibility is to introduce additional functions g2 , . . . , gn : Rn →

R, and deﬁne Yi = gi (X), for i ≥ 2. As long as the resulting function g =

(g1 , . . . , gn ) is invertible, we can appeal to our earlier formula to ﬁnd the joint

PDF of Y , and then integrate to ﬁnd the marginal PDF of Y1 .

The simplest choice in the above described method is to let Yi = Xi, for i ≠ 1, so that g(x) = (g1(x), x2, . . . , xn). Let h : Rⁿ → R be a function that corresponds to the first component of g⁻¹. That is, if y = g(x), then x1 = h(y). Then, the inverse mapping g⁻¹ is of the form

g⁻¹(y) = (h(y), y2, . . . , yn),

and its Jacobian matrix takes the form

J(y) = ⎡ ∂h/∂y1  ∂h/∂y2  · · ·  ∂h/∂yn ⎤
       ⎢    0       1    · · ·     0   ⎥
       ⎢    ⋮       ⋮     ⋱        ⋮   ⎥
       ⎣    0       0    · · ·     1   ⎦

It follows that |J(y)| = |∂h/∂y1(y)|, and

fY(y) = fX(h(y), y2, . . . , yn) · |∂h/∂y1(y)|.

Integrating, we obtain

fY1(y1) = ∫ fX(h(y), y2, . . . , yn) · |∂h/∂y1(y)| dy2 · · · dyn.

Example. Let X1 and X2 be positive, jointly continuous, random variables, and sup

pose that we wish to derive the PDF of Y1 = g(X1 , X2 ) = X1 X2 . We deﬁne Y2 = X2 .

From the relation x1 = y1/x2 we see that h(y1, y2) = y1/y2. The partial derivative ∂h/∂y1 is 1/y2. We obtain

fY1(y1) = ∫ fX(y1/y2, y2) (1/y2) dy2 = ∫ fX(y1/x2, x2) (1/x2) dx2.

For a special case, suppose that X1 and X2 are independent, with X1, X2 ∼ U(0, 1). Their common PDF is fXi(xi) = 1, for xi ∈ [0, 1]. Note that fY1(y1) = 0 for y1 ∉ (0, 1). Furthermore, fX1(y1/x2) is positive (and equal to 1) only in the range x2 ≥ y1. Also, fX2(x2) is positive, and equal to 1, iff x2 ∈ (0, 1). In particular,

fX(y1/x2, x2) = fX1(y1/x2) fX2(x2) = 1, for x2 ≥ y1.

We then obtain

fY1(y1) = ∫_{y1}^{1} (1/x2) dx2 = − log y1, y1 ∈ (0, 1).

The direct approach to this problem would first involve the calculation of FY1(y1) = P(X1X2 ≤ y1). It is actually easier to calculate

1 − FY1(y1) = P(X1X2 ≥ y1) = ∫_{y1}^{1} ∫_{y1/x1}^{1} dx2 dx1
= ∫_{y1}^{1} ( 1 − y1/x1 ) dx1
= (x1 − y1 log x1) |_{y1}^{1} = (1 − y1) + y1 log y1.

Thus, FY1(y1) = y1 − y1 log y1. Differentiating, we find that fY1(y1) = − log y1.

An even easier solution for this particular problem (along the lines of the stick

example in Lecture 9) is to realize that conditioned on X1 = x1 , the random variable

Y1 = X1 X2 is uniform on [0, x1 ], and using the total probability theorem,

fY1(y1) = ∫_{y1}^{1} fX1(x1) fY1|X1(y1 | x1) dx1 = ∫_{y1}^{1} (1/x1) dx1 = − log y1.
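A quick Monte Carlo check of the CDF FY1(y1) = y1 − y1 log y1 (a sketch added for illustration; the test point y0 = 0.25 and seed are arbitrary):

```python
import math
import random

random.seed(4)
n = 100_000
y0 = 0.25
count = sum(1 for _ in range(n) if random.random() * random.random() <= y0)
empirical = count / n
exact = y0 - y0 * math.log(y0)    # F_{Y1}(y0) = y0 - y0 log y0
print(empirical, exact)  # both ≈ 0.597
```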

4 MAXIMUM AND MINIMUM OF RANDOM VARIABLES

Let X1, . . . , Xn be independent random variables, and let X(1) ≤ · · · ≤ X(n) denote the corresponding order statistics. Namely, X(1) = minj Xj, X(2) is the second smallest of the values X1, . . . , Xn, and X(n) = maxj Xj. We would like to find the joint distribution of the order statistics and specifically the distribution of minj Xj and maxj Xj. Note that

P(maxj Xj ≤ x) = P(X1 ≤ x, . . . , Xn ≤ x) = P(X1 ≤ x) · · · P(Xn ≤ x) = FX1(x) · · · FXn(x).

For the minimum, we have

P(minj Xj ≤ x) = 1 − P(minj Xj > x)
= 1 − P(X1 > x, . . . , Xn > x)
= 1 − (1 − FX1(x)) · · · (1 − FXn(x)).

Let us consider the special case where the X1 , . . . , Xn are i.i.d., with common

CDF F and PDF f . For simplicity, assume that F is differentiable everywhere.

Then,

P(maxj Xj ≤ x) = Fⁿ(x),   P(minj Xj ≤ x) = 1 − (1 − F(x))ⁿ,

implying that

fmaxj Xj(x) = n F^{n−1}(x) f(x),   fminj Xj(x) = n (1 − F(x))^{n−1} f(x).

Exercise 1. For i.i.d. random variables X1, . . . , Xn with common PDF f, establish that the joint distribution of X(1), . . . , X(n) is given by

fX(1),...,X(n)(x1, . . . , xn) = n! f(x1) · · · f(xn), x1 < x2 < · · · < xn,

and fX(1),...,X(n)(x1, . . . , xn) = 0, otherwise. Use this to derive the densities for maxj Xj and minj Xj.
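For instance, for n i.i.d. U(0, 1) variables, F(x) = x and f(x) = 1 on [0, 1], so the density n F^{n−1}(x) f(x) = n x^{n−1} gives E[maxj Xj] = n/(n + 1). A short numerical integration (added for illustration, with n = 5) confirms this:

```python
# For n i.i.d. U(0, 1) variables: f_max(x) = n x^(n-1) on [0, 1],
# so E[max_j X_j] = integral of x * n x^(n-1) dx = n / (n + 1).
n = 5
m = 200_000
h = 1.0 / m
e_max = sum(((i + 0.5) * h) * n * ((i + 0.5) * h) ** (n - 1) for i in range(m)) * h
print(e_max)  # ≈ 5/6 ≈ 0.8333
```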

5 SUM OF INDEPENDENT RANDOM VARIABLES – CONVOLUTION

If X and Y are independent discrete random variables, the PMF of their sum Z = X + Y is easy to find:

pX+Y(z) = Σ_{{(x,y) | x+y=z}} P(X = x, Y = y)
= Σ_x P(X = x, Y = z − x)
= Σ_x pX(x) pY(z − x).

When X and Y are jointly continuous, a similar formula can be expected to hold. We derive it in two different ways.

A first derivation involves plain calculus. Let fX,Y be the joint PDF of X and Y. Then,

P(X + Y ≤ z) = ∫∫_{{x,y | x+y≤z}} fX,Y(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{z−x} fX,Y(x, y) dy dx.

The change of variables t = x + y in the inner integral yields

P(X + Y ≤ z) = ∫_{−∞}^{∞} ∫_{−∞}^{z} fX,Y(x, t − x) dt dx,

which, after interchanging the order of integration and differentiating with respect to z, gives

fX+Y(z) = ∫_{−∞}^{∞} fX,Y(x, z − x) dx.

In the special case where X and Y are independent, we have fX,Y (x, y) =

fX(x) fY(y), resulting in

fX+Y(z) = ∫_{−∞}^{∞} fX(x) fY(z − x) dx.
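For example (an illustration added here), convolving two U(0, 1) densities numerically recovers the triangular density of the sum, which equals z on [0, 1] and 2 − z on [1, 2]:

```python
def f_U(x):                       # U(0, 1) density
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def f_sum(z, m=100_000):          # numerical convolution integral over [0, 1]
    h = 1.0 / m
    return sum(f_U(x) * f_U(z - x) for x in ((i + 0.5) * h for i in range(m))) * h

print(f_sum(0.5), f_sum(1.5))  # triangular density: both equal 0.5
```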

For a second derivation, consider the linear function g that maps (X, Y) to (X, X + Y). It is easily seen that the associated determinant is equal to 1. Thus, with Z = X + Y, we have fX,Z(x, z) = fX,Y(x, z − x), and integrating over x yields the same formula for fZ.

Exercise 2. Suppose that X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²) are independent. Establish that X1 + X2 ∼ N(µ1 + µ2, σ1² + σ2²).


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 11 10/15/2008

ABSTRACT INTEGRATION — I

Contents

1. Preliminaries

2. The main result

3. The Riemann integral

4. The integral of a nonnegative simple function

5. The integral of a nonnegative function

6. The general case

The material in these notes can be found in practically every textbook that

includes basic measure theory, although the order with which various properties

are proved can be somewhat different in different sources.

1 PRELIMINARIES

The objective of these notes is to define the integral ∫ g dµ [sometimes also denoted ∫ g(ω) dµ(ω)] of a measurable function g : Ω → R, defined on a measure space (Ω, F, µ).

Special cases:

(a) If (Ω, F, P) is a probability space and X is a random variable (or even an extended-valued random variable), the integral ∫ X dP is also denoted E[X], and is called the expectation of X.

(b) If we are dealing with the measure space (R, B, λ), where B is the Borel σ-field and λ is the Borel measure, the integral ∫ g dλ is often denoted as ∫ g(x) dx, and is meant to be a generalization of the usual integral encountered in calculus.

The program: We will define the integral ∫ g dµ for progressively more general classes of measurable functions g:

(a) Finite nonnegative functions g that take ﬁnitely many values (“simple

functions”). In this case, the integral is just a suitably weighted sum of

the values of g.

(b) Nonnegative functions g. Here, the integral will be deﬁned by approxi

mating g from below by a sequence of simple functions.

(c) General functions g. This is done by decomposing g in the form g = g+ − g−, where g+ and g− are nonnegative functions, and letting ∫ g dµ = ∫ g+ dµ − ∫ g− dµ.

We will be focusing on the integral over the entire set Ω. The integral over a

(measurable) subset B of Ω will be deﬁned by letting

∫_B g dµ = ∫ (1B g) dµ,

where

(1B g)(ω) = g(ω), if ω ∈ B, and (1B g)(ω) = 0, if ω ∉ B.

Throughout, we will use the term “almost everywhere” to mean “for all ω

outside a zero-measure subset of Ω.” For the special case of probability mea

sures, we will often use the alternative terminology “almost surely,” or “a.s.” for

short. Thus, if X and Y are random variables, we have X = Y, a.s., if and only if P(X ≠ Y) = P( {ω : X(ω) ≠ Y(ω)} ) = 0.

In the sequel, an inequality g ≤ h between two functions will be interpreted

as “g(ω) ≤ h(ω), for all ω.” Similarly, “g ≤ h, a.e.,” means that “g(ω) ≤ h(ω),

for all ω outside a zero-measure set.” The notation “gn ↑ g” will mean that for

every ω, the sequence gn (ω) is monotone nondecreasing and converges to g(ω).

Finally, “gn ↑ g, a.e.,” will mean that the monotonic convergence of gn (ω) to

g(ω) holds for all ω outside a zero-measure set.

2 THE MAIN RESULT

Once the construction is carried out, integrals of nonnegative functions will always be well-defined. For general functions, integrals will be left undefined

only when an expression of the form ∞ − ∞ is encountered.

The following properties will turn out to be true, whenever the integrals or

expectations involved are well-deﬁned. On the left, we show the general version;

on the right, we show the same property, specialized to the case of probability

measures. In property 8, the convention 0 · ∞ will be in effect when needed.

1. ∫ 1B dµ = µ(B)    E[1B] = P(B)

2. g ≥ 0 ⇒ ∫ g dµ ≥ 0    X ≥ 0 ⇒ E[X] ≥ 0

3. g = 0, a.e. ⇒ ∫ g dµ = 0    X = 0, a.s. ⇒ E[X] = 0

4. g ≤ h ⇒ ∫ g dµ ≤ ∫ h dµ    X ≤ Y ⇒ E[X] ≤ E[Y]

4′. g ≤ h, a.e. ⇒ ∫ g dµ ≤ ∫ h dµ    X ≤ Y, a.s. ⇒ E[X] ≤ E[Y]

5. g = h, a.e. ⇒ ∫ g dµ = ∫ h dµ    X = Y, a.s. ⇒ E[X] = E[Y]

6. [g ≥ 0, a.e., and ∫ g dµ = 0] ⇒ g = 0, a.e.    [X ≥ 0, a.s., and E[X] = 0] ⇒ X = 0, a.s.

7. ∫ (g + h) dµ = ∫ g dµ + ∫ h dµ    E[X + Y] = E[X] + E[Y]

8. ∫ (ag) dµ = a ∫ g dµ    E[aX] = aE[X]

9. 0 ≤ gn ↑ g ⇒ ∫ gn dµ ↑ ∫ g dµ    0 ≤ Xn ↑ X ⇒ E[Xn] ↑ E[X]

9′. 0 ≤ gn ↑ g, a.e. ⇒ ∫ gn dµ ↑ ∫ g dµ    0 ≤ Xn ↑ X, a.s. ⇒ E[Xn] ↑ E[X]

10. g ≥ 0 ⇒ ν(B) = ∫_B g dµ is a measure    [f ≥ 0 and ∫ f dP = 1] ⇒ ν(B) = ∫_B f dP is a probability measure

3 THE RIEMANN INTEGRAL

The Riemann integral from calculus turns out to be inadequate for our purposes. Let us recall the definition of the (Riemann) integral ∫_a^b g(x) dx. We subdivide the interval [a, b] using a sequence σ = (x1, x2, . . . , xn) of points satisfying a = x1 < x2 < · · · < xn = b, and define

U(σ) = Σ_{i=1}^{n−1} ( max_{xi ≤ x < xi+1} g(x) ) · (xi+1 − xi),

L(σ) = Σ_{i=1}^{n−1} ( min_{xi ≤ x < xi+1} g(x) ) · (xi+1 − xi).

Thus, U (σ) and L(σ) are approximations of the “area under the curve g,” from

above and from below, respectively. We say that the integral ∫_a^b g(x) dx is well-defined, and equal to a constant c, if

sup_σ L(σ) = inf_σ U(σ) = c.

In this case, we also say that g is Riemann-integrable over [a, b]. Intuitively, we

want the upper and lower approximants U (σ) and L(σ) to agree, in the limit of

very ﬁne subdivisions of the interval [a, b].

It is known that if g : R → R is Riemann-integrable over every interval [a, b], then g is continuous almost everywhere (i.e., there exists a set S of

Lebesgue measure zero, such that g is continuous at every x ∈ / S). This is a

severe limitation on the class of Riemann-integrable functions.

Example. Let Q be the set of rational numbers in [0, 1]. Let g = 1Q . For any σ =

(x1 , x2 , . . . , xn ), and every i, the interval [xi , xi+1 ) contains a rational number, and

also an irrational number. Thus, maxxi ≤x<xi+1 g(x) = 1 and minxi ≤x<xi+1 g(x) = 0.

It follows that U (σ) = 1 and L(σ) = 0, for all σ, and supσ L(σ) �= inf σ U (σ).

Therefore, 1Q is not Riemann integrable. On the other hand, if we consider a uniform distribution over [0, 1] and the binary random variable 1Q, we have P(1Q = 1) = 0, and we would like to be able to say that E[1Q] = ∫_{[0,1]} 1Q(x) dx = 0. This indicates

that a different deﬁnition is in order.

4 THE INTEGRAL OF A NONNEGATIVE SIMPLE FUNCTION

A function g : Ω → R is called simple if it is measurable and takes finitely many different values. In particular, a simple function can be written as

g(ω) = Σ_{i=1}^{k} ai 1Ai(ω), ∀ ω, (1)

where k is a positive integer, the coefficients ai are finite real numbers, and the Ai are measurable sets.

Note that a simple function can have several representations of the form

(1). For example, 1[0,2] and 1[0,1] + 1(1,2] are two representations of the same

function. For another example, note that 1[0,2] + 1[1,2] = 1[0,1) + 2 · 1[1,2] .

On the other hand, if we require the ai to be distinct and the sets Ai to be

disjoint, it is not hard to see that there is only one possible representation, which

we will call the canonical representation. More concretely, in the canonical

representation, we let {a1 , . . . , ak } be the range of g, where the ai are distinct,

and Ai = {ω | g(ω) = ai }.

Definition 1. The integral of a nonnegative simple function g, represented as in Eq. (1), is defined by

∫ g dµ = Σ_{i=1}^{k} ai µ(Ai).

This definition needs to be shown consistent, in the following sense. If we consider two alternative representations of the same simple function, we need to ensure that the resulting value of ∫ g dµ is the same. Technically, we need to show the following:

if Σ_{i=1}^{k} ai 1Ai = Σ_{i=1}^{m} bi 1Bi, then Σ_{i=1}^{k} ai µ(Ai) = Σ_{i=1}^{m} bi µ(Bi).

This is left as an exercise for the reader.

Example. We have 1[0,2] + 1[1,2] = 1[0,1) + 2 · 1[1,2]. The first representation leads to µ([0, 2]) + µ([1, 2]), the second to µ([0, 1)) + 2µ([1, 2]). Using the fact µ([0, 2]) = µ([0, 1)) + µ([1, 2]) (finite additivity), we see that the two values are indeed equal.

When µ is a probability measure P, a simple function X : Ω → R is called a simple random variable, and its integral ∫ X dP is also denoted as E[X]. We then have

    E[X] = Σ_{i=1}^{k} a_i P(A_i).

If the coefficients a_i are distinct, equal to the possible values of X, and we take A_i = {ω | X(ω) = a_i}, we obtain

    E[X] = Σ_{i=1}^{k} a_i P({ω | X(ω) = a_i}) = Σ_{i=1}^{k} a_i P(X = a_i),

which agrees with the elementary definition of E[X] for discrete random variables.

Note, for future reference, that the sum or difference of two simple functions

is also simple.

For the various properties listed in Section 2, we will use the shorthand “property

S-A” and “property N-A”, to refer to “property A for the special case of simple

functions” and “property A for nonnegative measurable functions,” respectively.


We note a few immediate consequences of the definition. For any B ∈ F, the function 1B is simple and ∫ 1B dµ = µ(B), which verifies property 1. In particular, when Q is the set of rational numbers and µ is Lebesgue measure, we have ∫ 1Q dµ = µ(Q) = 0, as desired. Note that a nonnegative simple function has a representation of the form (1) with all a_i positive. It follows that ∫ g dµ ≥ 0, which verifies property S-2.

Suppose now that a simple function satisfies g = 0, a.e. Then, it has a canonical representation of the form g = Σ_{i=1}^{k} a_i 1_{A_i}, where µ(A_i) = 0, for every i. Definition 1 implies that ∫ g dµ = 0, which verifies property S-3.

Let us now verify the linearity property S-7. Let g and h be nonnegative simple functions. Using canonical representations, we can write

    g = Σ_{i=1}^{k} a_i 1_{A_i},    h = Σ_{j=1}^{m} b_j 1_{B_j},

where the sets A_i are disjoint, and the sets B_j are also disjoint. Then, the sets A_i ∩ B_j are disjoint, and

    g + h = Σ_{i=1}^{k} Σ_{j=1}^{m} (a_i + b_j) 1_{A_i ∩ B_j}.

Therefore,

    ∫ (g + h) dµ = Σ_{i=1}^{k} Σ_{j=1}^{m} (a_i + b_j) µ(A_i ∩ B_j)
                 = Σ_{i=1}^{k} Σ_{j=1}^{m} a_i µ(A_i ∩ B_j) + Σ_{j=1}^{m} Σ_{i=1}^{k} b_j µ(A_i ∩ B_j)
                 = Σ_{i=1}^{k} a_i µ(A_i) + Σ_{j=1}^{m} b_j µ(B_j)
                 = ∫ g dµ + ∫ h dµ.

(The first and fourth equalities follow from Definition 1. The third equality made use of finite additivity for µ.)

Property S-8 is an immediate consequence of Definition 1. We only need to be careful for the case where ∫ g dµ = ∞ and a = 0. Using the convention 0 · ∞ = 0, we see that ag = 0, so that ∫ (ag) dµ = 0 = 0 · ∞ = a ∫ g dµ, and the property holds in this case as well. Combining properties S-7 and S-8, we see that, for simple functions g and h, we have ∫ (g − h) dµ = ∫ g dµ − ∫ h dµ.

We now verify property S-4′, which also implies property S-4 as a special case. Suppose that g ≤ h, a.e. We then have h = g + q, for a simple function q such that q ≥ 0, a.e. In particular, q = q+ − q−, where q+ ≥ 0, q− ≥ 0, and q− = 0, a.e. Thus, h = g + q+ − q−. Note that q, q+, and q− are all simple functions. Using the linearity property S-7, and then properties S-3 and S-2, we obtain

    ∫ h dµ = ∫ g dµ + ∫ q+ dµ − ∫ q− dµ = ∫ g dµ + ∫ q+ dµ ≥ ∫ g dµ.

We now verify property S-5. If g = h, a.e., then we have both g ≤ h, a.e., and h ≤ g, a.e. Thus, ∫ g dµ ≤ ∫ h dµ and ∫ g dµ ≥ ∫ h dµ, which implies that ∫ g dµ = ∫ h dµ.

We finally verify property S-6. Suppose that g ≥ 0, a.e., and ∫ g dµ = 0. We write g = g+ − g−, where g+ ≥ 0 and g− ≥ 0. Then, g− = 0, a.e., and ∫ g− dµ = 0. Thus, using property S-7, ∫ g+ dµ = ∫ g dµ + ∫ g− dµ = 0. Note that g+ is simple. Hence, its canonical representation is of the form g+ = Σ_{i=1}^{k} a_i 1_{A_i}, with a_i > 0. Since Σ_{i=1}^{k} a_i µ(A_i) = 0, it follows that µ(A_i) = 0, for every i. From finite additivity, we conclude that µ(∪_{i=1}^{k} A_i) = 0. Therefore, g+ = 0, a.e., and also g = 0, a.e.

We now define the integral of a nonnegative measurable function g by approximating it from below, using simple functions.

Definition 2. For a nonnegative measurable function g, we let S(g) be the set of all nonnegative simple (hence automatically measurable) functions q that satisfy q ≤ g, and define

    ∫ g dµ = sup_{q∈S(g)} ∫ q dµ.

We will now verify that with this deﬁnition, properties N-2 to N-10 are all

satisﬁed. This is easy for some (e.g., property N-2). Most of our effort will

be devoted to establishing properties N-7 (linearity) and N-9 (monotone conver

gence theorem).

The arguments that follow will make occasional use of the following conti

nuity property for monotonic sequences of measurable sets Bi : If Bi ↑ B, then


µ(Bi ) ↑ µ(B). This property was established in the notes for Lecture 1, for the

special case where µ is a probability measure, but the same proof applies to the

general case.

Throughout this subsection, we assume that g is measurable and nonnegative.

Property N-2: For every q ∈ S(g), we have ∫ q dµ ≥ 0 (property S-2). Thus, ∫ g dµ = sup_{q∈S(g)} ∫ q dµ ≥ 0.

Property N-3: Suppose that g = 0, a.e. Then, every q ∈ S(g) satisfies q = 0, a.e., so that ∫ q dµ = 0 for every q ∈ S(g) (by property S-3), which implies that ∫ g dµ = 0.

Property N-4: Suppose that 0 ≤ g ≤ h. Then, S(g) ⊂ S(h), which implies that

    ∫ g dµ = sup_{q∈S(g)} ∫ q dµ ≤ sup_{q∈S(h)} ∫ q dµ = ∫ h dµ.

Property N-5: Suppose that g = h, a.e. Let A = {ω | g(ω) = h(ω)}, and note that the complement of A has zero measure, so that q = 1A q, a.e., for any function q. Then,

    ∫ g dµ = sup_{q∈S(g)} ∫ q dµ = sup_{q∈S(g)} ∫ 1A q dµ ≤ sup_{q∈S(1A g)} ∫ q dµ ≤ sup_{q∈S(h)} ∫ q dµ = ∫ h dµ.

A symmetrical argument yields ∫ h dµ ≤ ∫ g dµ.

Exercise: Justify the above sequence of equalities and inequalities.

Property N-4′: Suppose that g ≤ h, a.e. Then, there exists a function g′ such that g′ ≤ h and g = g′, a.e. Property N-5 yields ∫ g dµ = ∫ g′ dµ. Property N-4 yields ∫ g′ dµ ≤ ∫ h dµ. These imply that ∫ g dµ ≤ ∫ h dµ.

Property N-6: Suppose that g ≥ 0 but the relation g = 0, a.e., is not true. We will show that ∫ g dµ > 0. Let B = {ω | g(ω) > 0}. Then, µ(B) > 0. Let Bn = {ω | g(ω) > 1/n}. Then, Bn ↑ B and, therefore, µ(Bn) ↑ µ(B) > 0. This shows that for some n we have µ(Bn) > 0. Note that g ≥ (1/n)1_{Bn}. Then, properties S-4, S-8, and 1 yield

    ∫ g dµ ≥ ∫ (1/n)1_{Bn} dµ = (1/n) ∫ 1_{Bn} dµ = (1/n)µ(Bn) > 0.


Property N-8, when a ≥ 0: If a = 0, the result is immediate. Assume that a > 0. It is not hard to see that q ∈ S(g) if and only if aq ∈ S(ag). Thus,

    ∫ (ag) dµ = sup_{q′∈S(ag)} ∫ q′ dµ = sup_{q∈S(g)} ∫ (aq) dµ = a sup_{q∈S(g)} ∫ q dµ = a ∫ g dµ.

We ﬁrst provide the proof of property 9, for the special case where g is equal to

a simple function q, and then generalize.

Let q be a nonnegative simple function, represented in the form q = Σ_{i=1}^{k} a_i 1_{A_i}, where the a_i are finite positive numbers, and the sets A_i are measurable and disjoint. Let gn be a sequence of nonnegative measurable functions such that gn ↑ q. We distinguish between two different cases, depending on whether ∫ q dµ is finite or infinite.

(i) Suppose that ∫ q dµ = ∞. This implies that a_i µ(A_i) = ∞ for some i. It follows that µ(A_i) = ∞ for some i. Fix such an i, and let

    Bn = {ω ∈ A_i | gn(ω) > a_i/2}.

For every ω ∈ A_i, there exists some n such that gn(ω) > a_i/2. Therefore, Bn ↑ A_i. From the continuity of measures, we obtain µ(Bn) ↑ ∞. Now, note that gn ≥ (a_i/2)1_{Bn}. Then, using property N-4, we have

    ∫ gn dµ ≥ (a_i/2) µ(Bn) ↑ ∞ = ∫ q dµ.

(ii) Suppose now that ∫ q dµ < ∞. Then, µ(A_i) < ∞, for all i. Let A = ∪_{i=1}^{k} A_i, let a = max_i a_i, and fix a positive integer r. Let

    Bn = {ω ∈ A | gn(ω) ≥ q(ω) − (1/r)}.

Since gn ↑ q, we have Bn ↑ A and, by continuity of measures, µ(Bn) ↑ µ(A). Since µ(A) = µ(Bn) + µ(A \ Bn), and µ(A) < ∞, this also yields µ(A \ Bn) ↓ 0.

Note that 1A q = q, a.e. Using properties S-5 and S-7, we have

    ∫ q dµ = ∫ 1A q dµ = ∫ 1_{Bn} q dµ + ∫ 1_{A\Bn} q dµ.   (2)

For ω ∈ Bn, we have gn(ω) + (1/r) ≥ q(ω). Thus, gn + (1/r)1_{Bn} ≥ 1_{Bn} q. Using properties N-4 and S-7, together with Eq. (2), we have

    ∫ gn dµ + (1/r) ∫ 1_{Bn} dµ ≥ ∫ 1_{Bn} q dµ = ∫ q dµ − ∫ 1_{A\Bn} q dµ ≥ ∫ q dµ − a µ(A \ Bn).

Taking the limit as n → ∞, we obtain

    lim_{n→∞} ∫ gn dµ + (1/r) µ(A) ≥ ∫ q dµ.

Since this is true for every positive integer r, and µ(A) < ∞, we must have

    lim_{n→∞} ∫ gn dµ ≥ ∫ q dµ.

On the other hand, we have gn ≤ q, so that ∫ gn dµ ≤ ∫ q dµ, and lim_{n→∞} ∫ gn dµ ≤ ∫ q dµ. This concludes the proof for the case where the limit function is simple.

Consider now the general case: let g be a nonnegative measurable function, and let gn be nonnegative measurable functions with gn ↑ g. Fix some q ∈ S(g), so that 0 ≤ q ≤ g. We have

    0 ≤ min{gn, q} ↑ min{g, q} = q.

Therefore,

    lim_{n→∞} ∫ gn dµ ≥ lim_{n→∞} ∫ min{gn, q} dµ = ∫ q dµ.

(The inequality above uses property N-4; the equality relies on the fact that we already proved the MCT for the case where the limit function is simple.) By taking the supremum over q ∈ S(g), we obtain

    lim_{n→∞} ∫ gn dµ ≥ sup_{q∈S(g)} ∫ q dµ = ∫ g dµ.

On the other hand, we have gn ≤ g, so that ∫ gn dµ ≤ ∫ g dµ. Therefore, lim_{n→∞} ∫ gn dµ ≤ ∫ g dµ, which concludes the proof of property N-9.

To prove property N-9′, suppose that gn ↑ g, a.e. Then, there exist functions g′n and g′, such that gn = g′n, a.e., g = g′, a.e., and g′n ↑ g′ (everywhere). By combining properties N-5 and N-9, we obtain

    lim_{n→∞} ∫ gn dµ = lim_{n→∞} ∫ g′n dµ = ∫ g′ dµ = ∫ g dµ.


Let g be a nonnegative measurable function. From the definition of ∫ g dµ, it follows that there exists a sequence qn ∈ S(g) such that ∫ qn dµ → ∫ g dµ. This does not provide us with much information on the sequence qn. In contrast, the construction that follows provides us with a concrete way of approximating ∫ g dµ.

For every positive integer r, define the function gr by

    gr(ω) = r,        if g(ω) ≥ r,
    gr(ω) = i/2^r,    if i/2^r ≤ g(ω) < (i + 1)/2^r,  for i = 0, 1, . . . , r2^r − 1.

In words, the function gr is a quantized version of g. For every ω, the value of g(ω) is first capped at r, and then rounded down to the nearest multiple of 2^{−r}.

We note a few properties of gr that are direct consequences of its deﬁnition.

(a) For every r, the function gr is simple (and, in particular, measurable).

(b) We have 0 ≤ gr ↑ g; that is, for every ω, we have gr (ω) ↑ g(ω).

(c) If g is bounded above by c and r ≥ c, then |gr(ω) − g(ω)| ≤ 1/2^r, for every ω.

Statement (b) above gives us a transparent characterization of the set of measurable functions. Namely, a nonnegative function is measurable if and only if it is the monotonic and pointwise limit of simple functions. Furthermore, the MCT indicates that ∫ gr dµ ↑ ∫ g dµ, for this particular choice of simple functions gr. (In an alternative line of development of the subject, some texts start by defining ∫ g dµ as the limit of ∫ gr dµ.)
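The quantized approximants gr lend themselves to a direct numerical sketch. The helper `quantize` below is our own illustration (not code from the notes); it caps g at r and rounds down to a multiple of 2^{−r}, and the grid check confirms 0 ≤ gr ≤ g ≤ gr + 2^{−r} on [0, 1] for a bounded g:

```python
def quantize(g, r):
    """Return the simple function g_r: cap g at r, then round down
    to the nearest multiple of 2**-r (the construction in the notes)."""
    def g_r(x):
        v = g(x)
        if v >= r:
            return r
        step = 2.0 ** (-r)
        return int(v / step) * step  # floor, since v >= 0
    return g_r

g = lambda x: x * x  # a nonnegative function, bounded by 1 on [0, 1]
for r in (1, 2, 8):
    g_r = quantize(g, r)
    for i in range(101):
        x = i / 100.0
        # 0 <= g_r <= g, and the gap is at most the quantization step
        assert 0.0 <= g_r(x) <= g(x) <= g_r(x) + 2.0 ** (-r)
```

Each `g_r` takes at most r·2^r + 1 distinct values, so it is simple, and increasing r refines the approximation monotonically.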

5.4 Linearity

We now prove linearity (property N-7). Let gr and hr be the approximants of

g and h, respectively, deﬁned in Section 5.3. Since gr ↑ g and hr ↑ h, we have

(gr + hr ) ↑ (g + h). Therefore, using the MCT and property S-7 (linearity for

simple functions),

    ∫ (g + h) dµ = lim_{r→∞} ∫ (gr + hr) dµ
                 = lim_{r→∞} ( ∫ gr dµ + ∫ hr dµ )
                 = lim_{r→∞} ∫ gr dµ + lim_{r→∞} ∫ hr dµ
                 = ∫ g dµ + ∫ h dµ.

6 THE GENERAL CASE

Let g be a measurable function. Define A+ = {ω | g(ω) ≥ 0} and A− = {ω | g(ω) < 0}; note that these are measurable sets. Let g+ = g · 1_{A+} and g− = −g · 1_{A−}; note that these are nonnegative measurable functions. We then have g = g+ − g−, and we define

    ∫ g dµ = ∫ g+ dµ − ∫ g− dµ.

The integral ∫ g dµ is well-defined, as long as we do not have both ∫ g+ dµ and ∫ g− dµ equal to infinity.

With this deﬁnition, verifying properties 3-8 is not too difﬁcult, and there

are no surprises. We decompose the functions g and h into negative and positive

parts, and apply the properties already proved for the nonnegative case. The

details are left as an exercise.

In order to prove the last property (property 10) one uses countable additivity

for the measure µ, a limiting argument based on the approximation of g by

simple functions, and the MCT. The detailed proof is left as an exercise for the

reader.


MIT OpenCourseWare

http://ocw.mit.edu

Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 12 10/20/2008

ABSTRACT INTEGRATION — II

Contents

1. Borel-Cantelli revisited

2. Connections between abstract integration and elementary deﬁnitions of

integrals and expectations

3. Fatou’s lemma

4. Dominated convergence theorem

(a) We defined the notion of an integral of a measurable function with respect to a measure (∫ g dµ), which subsumes the special case of expectations (E[X] = ∫ X dP), where X is a random variable, and P is a probability measure.

(b) We saw that integrals are always well-deﬁned, though possibly inﬁnite, if

the function being integrated is nonnegative.

(c) For a general function g, we decompose it as the difference g = g+ − g− of two nonnegative functions, and integrate each piece separately. The integral is well-defined unless both ∫ g+ dµ and ∫ g− dµ happen to be infinite.

(d) We saw that integrals obey a long list of natural properties, including linearity: ∫ (g + h) dµ = ∫ g dµ + ∫ h dµ.

(e) We stated the Monotone Convergence Theorem (MCT), according to which, if {gn} is a nondecreasing sequence of nonnegative measurable functions that converges pointwise to a function g, then lim_{n→∞} ∫ gn dµ = ∫ g dµ.

(f) Finally, we saw that for every nonnegative measurable function g, we can find a nondecreasing sequence of nonnegative simple functions that converges (pointwise) to g.


1 BOREL-CANTELLI REVISITED

Recall the first Borel-Cantelli lemma: if {Ai} is a sequence of events such that Σ_{i=1}^{∞} P(Ai) < ∞, then P(Ai i.o.) = 0. In this section, we rederive this result using the new machinery that we have available.

Let Xi be the indicator function of the event Ai, so that E[Xi] = P(Ai). Thus, by assumption, Σ_{i=1}^{∞} E[Xi] < ∞. The random variables Σ_{i=1}^{n} Xi are nonnegative and form an increasing sequence, as n increases. Furthermore,

    lim_{n→∞} Σ_{i=1}^{n} Xi = Σ_{i=1}^{∞} Xi,

where the limit and the infinite sum are interpreted pointwise: for every ω, lim_{n→∞} Σ_{i=1}^{n} Xi(ω) = Σ_{i=1}^{∞} Xi(ω).

We can now apply the MCT, and then the linearity property of expectations (for finite sums), to obtain

    E[ Σ_{i=1}^{∞} Xi ] = lim_{n→∞} E[ Σ_{i=1}^{n} Xi ]
                        = lim_{n→∞} Σ_{i=1}^{n} E[Xi]
                        = lim_{n→∞} Σ_{i=1}^{n} P(Ai)
                        = Σ_{i=1}^{∞} P(Ai)
                        < ∞.

This implies that Σ_{i=1}^{∞} Xi < ∞, a.s. (This is intuitively obvious, but a short formal proof is actually needed.) It follows that, with probability 1, only finitely many of the events Ai can occur. Equivalently, the probability that infinitely many of the events Ai occur is zero, i.e., P(Ai i.o.) = 0.
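A quick Monte Carlo sketch of the situation (our own illustration, not from the notes): with independent events A_i of probability 2^{−i}, we have Σ_i P(A_i) = 1 < ∞, so almost surely only finitely many A_i occur, and the expected total number of occurrences is exactly 1:

```python
import random

random.seed(0)

def count_occurrences(max_i=50):
    # Number of the independent events A_1, ..., A_max_i that occur,
    # where P(A_i) = 2**-i; the tail beyond max_i is negligible.
    return sum(1 for i in range(1, max_i + 1) if random.random() < 2.0 ** (-i))

n = 100_000
est = sum(count_occurrences() for _ in range(n)) / n
print(round(est, 2))  # close to E[sum_i X_i] = sum_i 2**-i = 1
```

The estimate concentrates near 1, consistent with E[Σ_i X_i] = Σ_i P(A_i) computed via the MCT above.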

2 CONNECTIONS BETWEEN ABSTRACT INTEGRATION AND ELEMENTARY DEFINITIONS OF INTEGRALS AND EXPECTATIONS

Abstract integration would not be a useful theory if it were inconsistent with the

more elementary notions of integration. For discrete random variables taking

values in a ﬁnite range, this consistency is automatic because of the deﬁnition of

an integral of a simple function. We will now verify some additional aspects of

this consistency.


2.1 Connection with Riemann integration.

We state here, without proof, the following reassuring result. Let Ω = R, endowed with the usual Borel σ-field, and let λ be the Lebesgue measure. Suppose that g is a measurable function which is Riemann integrable on some interval [a, b]. Then, the Riemann integral ∫_a^b g(x) dx is equal to ∫_{[a,b]} g dλ = ∫ 1_{[a,b]} g dλ.

Consider a probability space (Ω, F, P). Let X : Ω → R be a random variable. We then obtain a second probability space (R, B, PX), where B is the Borel σ-field, and PX is the probability law of X, defined by PX(A) = P(X ∈ A), for A ∈ B. Consider now a measurable function g : R → R. It defines a new random variable Y = g(X), and a corresponding probability space (R, B, PY). The expectation of Y can be evaluated in three different ways, that is, by integrating over either of the three spaces we have introduced.

Theorem 1. We have

    ∫ Y dP = ∫ g dPX = ∫ y dPY,

in the sense that whenever one of the integrals is well-defined, so are the others, and all three are equal.

Proof: We follow the “standard program”: ﬁrst establish the result for simple

functions, then take the limit to deal with nonnegative functions, and ﬁnally

generalize.

Let g be a simple function, which takes values in a finite set {y1, . . . , yk}. Using the definition of the integral of a simple function, we have

    ∫ Y dP = Σ_{yi} yi P({ω | Y(ω) = yi}) = Σ_{yi} yi P({ω | g(X(ω)) = yi}).

Similarly,

    ∫ g dPX = Σ_{yi} yi PX({x | g(x) = yi}).


However, from the definition of PX, we obtain

    PX({x | g(x) = yi}) = P({ω | X(ω) ∈ g^{−1}(yi)}) = P({ω | g(X(ω)) = yi}),

and the first equality in the theorem follows, for simple functions.

Let now g be a nonnegative function, and let {gn} be an increasing sequence of nonnegative simple functions that converges to g. Note that gn(X) converges monotonically to g(X). We then have

    ∫ Y dP = ∫ g(X) dP = lim_{n→∞} ∫ gn(X) dP = lim_{n→∞} ∫ gn dPX = ∫ g dPX.

(The second equality is the MCT; the third is the result that we already proved for simple functions; the last equality is once more the MCT.)

The case of general (not just nonnegative) functions follows easily from the

above – the details are omitted. This proves the ﬁrst equality in the theorem.

For the second equality, note that by considering the special case g(x) = x, we obtain Y = X and ∫ X dP = ∫ x dPX. By a change of notation, we have also established that ∫ Y dP = ∫ y dPY.
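On a small finite space, Theorem 1 can be checked directly. The snippet below (our own illustration) computes E[g(X)] three ways: over Ω, over R with the law PX, and over R with the law PY:

```python
from fractions import Fraction
from collections import Counter

# Finite probability space: a fair die roll.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

X = lambda w: w % 3          # a random variable on Omega
g = lambda x: x * x          # Y = g(X)

# (1) integrate over Omega:  E[Y] = sum_w g(X(w)) P({w})
e1 = sum(g(X(w)) * P[w] for w in omega)

# (2) integrate over R with the law P_X:  E[Y] = sum_x g(x) P_X({x})
PX = Counter()
for w in omega:
    PX[X(w)] += P[w]
e2 = sum(g(x) * p for x, p in PX.items())

# (3) integrate over R with the law P_Y:  E[Y] = sum_y y P_Y({y})
PY = Counter()
for w in omega:
    PY[g(X(w))] += P[w]
e3 = sum(y * p for y, p in PY.items())

assert e1 == e2 == e3
print(e1)  # -> 5/3
```

All three integrals are finite sums here, so the agreement is exact rather than numerical.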

We can now revisit the development of continuous random variables (Lecture 8), in a more rigorous manner. We say that a random variable X : Ω → R is continuous if its CDF can be written in the form

    FX(x) = P(X ≤ x) = ∫ 1_{(−∞,x]} f dλ,  ∀ x ∈ R,

for some nonnegative measurable function f (the PDF of X). It can then be shown that for any Borel subset A of the real line, we have

    PX(A) = ∫_A f dλ.   (1)

When f is Riemann integrable and the set A is an interval, we can also write PX(A) = ∫_A f(x) dx, where the latter integral is an ordinary Riemann integral.

Theorem 2. Let X be a continuous random variable with PDF f, and let g : R → R be a measurable function that satisfies ∫ |g| dPX < ∞. Then,

    E[g(X)] = ∫ g dPX = ∫ (gf) dλ.

Proof: The first equality holds by definition. The second was shown in Theorem 1. So, let us concentrate on the third. Following the usual program, let us first consider the case where g is a simple function, of the form g = Σ_{i=1}^{k} ai 1_{Ai}, for some measurable disjoint subsets Ai of the real line. We have

    ∫ g dPX = Σ_{i=1}^{k} ai PX(Ai)
            = Σ_{i=1}^{k} ai ∫_{Ai} f dλ
            = Σ_{i=1}^{k} ai ∫ 1_{Ai} f dλ
            = ∫ ( Σ_{i=1}^{k} ai 1_{Ai} ) f dλ
            = ∫ (gf) dλ.

The first equality is the definition of the integral for simple functions. The second uses Eq. (1). The fourth uses linearity of integrals. The fifth uses the definition of g.

Suppose now that g is a nonnegative function, and let {gn} be an increasing sequence of nonnegative simple functions that converges to g, pointwise. Since f is nonnegative, note that gn f also increases monotonically and converges to gf. Then,

    ∫ g dPX = lim_{n→∞} ∫ gn dPX = lim_{n→∞} ∫ (gn f) dλ = ∫ (gf) dλ.

The first and the third equalities above are the MCT. The middle equality is the result we already proved, for the case of a simple function gn.

Finally, if g is not nonnegative, the result is proved by considering separately the positive and negative parts of g.


When g and f are “nice” functions, e.g., piecewise continuous, Theorem 2 yields the familiar formula

    E[g(X)] = ∫ g(x)f(x) dx,

where the latter is an ordinary Riemann integral.
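When f is Riemann integrable, the abstract integral ∫ gf dλ can be approximated by an ordinary Riemann sum. A midpoint-rule sketch (the helper `expect_midpoint` is our own, not from the notes) for X ~ Uniform(0, 1) and g(x) = x^2:

```python
# Midpoint-rule approximation of E[g(X)] = integral of g(x) f(x) dx,
# for a continuous X with density f on [a, b].
def expect_midpoint(g, f, a, b, n=100_000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) * f(a + (i + 0.5) * h) * h
               for i in range(n))

# X ~ Uniform(0, 1): f = 1 on (0, 1); g(x) = x**2, so E[g(X)] = 1/3.
e = expect_midpoint(lambda x: x * x, lambda x: 1.0, 0.0, 1.0)
print(round(e, 6))  # -> 0.333333
```

The midpoint rule has O(h^2) error for smooth integrands, so with n = 100,000 the approximation is accurate to far more than six digits here.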

3 FATOU’S LEMMA

Note that for any two random variables, we have min{X, Y} ≤ X and min{X, Y} ≤ Y. Taking expectations, we obtain E[min{X, Y}] ≤ min{E[X], E[Y]}. Fatou’s lemma is in the same spirit, except that infinitely many random variables are involved, as well as a limiting operation, so some additional technical conditions are needed.

Theorem 3. (Fatou’s lemma) Consider a sequence {Xn} of random variables and a random variable Y that satisfies E[|Y|] < ∞.

(a) If Y ≤ Xn, for all n, then E[lim inf_{n→∞} Xn] ≤ lim inf_{n→∞} E[Xn].

(b) If Xn ≤ Y, for all n, then E[lim sup_{n→∞} Xn] ≥ lim sup_{n→∞} E[Xn].

By considering the case where Y = 0, we see that parts (a) and (b) ap

ply in particular to the case of nonnegative (respectively, nonpositive) random

variables.

Proof: Let us only prove the first part; the second part follows from a symmetrical argument, or by simply applying the first part to the random variables −Y and −Xn.

Fix some n. We have

    inf_{k≥n} Xk − Y ≤ Xm − Y,  ∀ m ≥ n,

which implies that

    E[inf_{k≥n} Xk − Y] ≤ E[Xm − Y],  ∀ m ≥ n,

and, therefore,

    E[inf_{k≥n} Xk − Y] ≤ inf_{m≥n} E[Xm − Y].

Note that the sequence inf_{k≥n} Xk − Y is nonnegative and nondecreasing with n, and converges to lim inf_{n→∞} Xn − Y. Taking the limit of both sides of the preceding inequality, and using the MCT for the left-hand side term, we obtain

    E[lim inf_{n→∞} Xn − Y] ≤ lim inf_{n→∞} E[Xn − Y].

Note that E[Xn − Y ] = E[Xn ] − E[Y ]. (This step makes use of the as

sumption that E[|Y |] < ∞.) For similar reasons, the term −E[Y ] can also be

removed from the right-hand side. After canceling the terms E[Y ] from the two

sides, we obtain the desired result.

We note that Fatou’s lemma remains valid for the case of general measures

and integrals.

4 DOMINATED CONVERGENCE THEOREM

The dominated convergence theorem (DCT) gives an alternative set of conditions under which a limit and an expectation can be interchanged.

Theorem 4. (Dominated convergence theorem) Consider a sequence {Xn} of random variables that converges to X (pointwise). Suppose that |Xn| ≤ Y, for all n, where Y is a random variable that satisfies E[|Y|] < ∞. Then, lim_{n→∞} E[Xn] = E[X].

Proof: Applying both parts of Fatou’s lemma (with the dominating random variable Y), we obtain

    E[X] = E[lim inf_{n→∞} Xn] ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ E[lim sup_{n→∞} Xn] = E[X].

Therefore,

    E[X] = lim inf_{n→∞} E[Xn] = lim sup_{n→∞} E[Xn] = lim_{n→∞} E[Xn].

A special case of the DCT is the BCT (Bounded Convergence Theorem), which asserts that if there exists a constant c ∈ R such that |Xn| ≤ c, a.s., for all n, then E[Xn] → E[X].
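The domination hypothesis cannot be dropped. A standard counterexample (our own rendering, consistent with the theorem but not taken from the notes): on Ω = (0, 1) with Lebesgue measure, take Xn = n·1_{(0,1/n)}. Then Xn → 0 pointwise, yet E[Xn] = 1 for every n, and no integrable Y dominates all the Xn:

```python
from fractions import Fraction

def X(n, w):
    # X_n(w) = n on (0, 1/n), and 0 elsewhere.
    return n if w < 1.0 / n else 0

def E(n):
    # Exact expectation: value n on a set of Lebesgue measure 1/n.
    return n * Fraction(1, n)

assert all(E(n) == 1 for n in (1, 10, 1000))     # E[X_n] = 1 for all n
w = 0.2
assert [X(n, w) for n in (1, 2, 3, 4)] == [1, 2, 3, 4]
assert X(50, w) == 0  # eventually X_n(w) = 0 for every fixed w in (0, 1)
```

So lim E[Xn] = 1 while E[lim Xn] = E[0] = 0; the DCT does not apply because sup_n Xn fails to be integrable.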

Corollary 1. Suppose that Σ_{n=1}^{∞} E[|Zn|] < ∞. Then,

    Σ_{n=1}^{∞} E[Zn] = E[ Σ_{n=1}^{∞} Zn ].

Proof: By the monotone convergence theorem, applied to Yn = Σ_{k=1}^{n} |Zk|, we have

    E[ Σ_{n=1}^{∞} |Zn| ] = Σ_{n=1}^{∞} E[|Zn|] < ∞.

Let Xn = Σ_{i=1}^{n} Zi and note that lim_{n→∞} Xn = Σ_{i=1}^{∞} Zi. We observe that |Xn| ≤ Σ_{i=1}^{∞} |Zi|, which has finite expectation, as shown earlier. The result follows from the dominated convergence theorem.

Exercise: Can you prove Corollary 1 directly from the monotone convergence theorem, without appealing to the DCT or Fatou’s lemma?
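Here is a small exact check of Corollary 1 (our own example): Zn(ω) = ω/2^n on a uniform four-point space, so Σ_n E[|Zn|] = 3/2 < ∞. Truncated sums of any length agree exactly on both sides, and the corollary guarantees that the limits agree too:

```python
from fractions import Fraction

# Z_n(w) = w / 2**n on Omega = {0, 1, 2, 3} with uniform probability 1/4.
omega = [0, 1, 2, 3]
p = Fraction(1, 4)

EW = sum(w * p for w in omega)  # E[w] = 3/2

N = 60  # truncation level; finite sums always swap, the corollary
        # is what lets us pass to the limit N -> infinity.
lhs = sum(EW * Fraction(1, 2 ** n) for n in range(1, N))          # sum_n E[Z_n]
geo = sum(Fraction(1, 2 ** n) for n in range(1, N))               # sum_n 2**-n
rhs = sum(w * geo * p for w in omega)                             # E[sum_n Z_n]

assert lhs == rhs
print(float(lhs))  # converges to E[sum_n Z_n] = 3/2 as N grows
```

Both sides equal (3/2)(1 − 2^{−59}) here; with exact rationals the agreement is term-for-term, not approximate.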

Remark: We note that the MCT, DCT, and Fatou’s lemma are also valid for general measures, not just for probability measures (the proofs are the same); just replace expressions such as E[X] by integrals ∫ g dµ.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 13 10/22/2008

Contents

1. Product measure

2. Fubini’s theorem

Fubini’s theorem deals with interchanging the order of summation and integration. The discussion here is concerned with conditions under which this is legitimate.

1 PRODUCT MEASURE

Consider two probabilistic experiments, described by probability spaces (Ω1, F1, P1) and (Ω2, F2, P2), respectively. We are interested in forming a probabilistic model of a “joint experiment” in which the original two experiments are carried out independently.

If the first experiment has an outcome ω1, and the second has an outcome ω2, then the outcome of the joint experiment is the pair (ω1, ω2). This leads us to define a new sample space Ω = Ω1 × Ω2.

Next, we need a σ-field on Ω. If A1 ∈ F1, we certainly want to be able to talk about the event {ω1 ∈ A1} and its probability. In terms of the joint experiment, this would be the same as the event

    A1 × Ω2 = {(ω1, ω2) | ω1 ∈ A1, ω2 ∈ Ω2}.

Thus, we would like our σ-field on Ω to include all sets of the form A1 × Ω2 (with A1 ∈ F1) and, by symmetry, all sets of the form Ω1 × A2 (with A2 ∈ F2). This leads us to the following definition.

Definition 1. We define F1 × F2 as the smallest σ-field of subsets of Ω1 × Ω2 that contains all sets of the form A1 × Ω2 and Ω1 × A2, where A1 ∈ F1 and A2 ∈ F2.

Note that the notation F1 × F2 is misleading: this is not the Cartesian product of F1 and F2!

Since σ-fields are closed under intersection, we observe that if Ai ∈ Fi, then A1 × A2 = (A1 × Ω2) ∩ (Ω1 × A2) ∈ F1 × F2. It turns out (and is not hard to show) that F1 × F2 can also be defined as the smallest σ-field containing all sets of the form A1 × A2, where Ai ∈ Fi.

We now define a measure, to be denoted by P1 × P2 (or just P, for short) on the measurable space (Ω1 × Ω2, F1 × F2). To capture the notion of independence, we require that

    P(A1 × A2) = P1(A1) P2(A2),  ∀ A1 ∈ F1, A2 ∈ F2.   (1)

Theorem 1. There exists a unique measure P on (Ω1 × Ω2, F1 × F2) that has property (1).

The proof proceeds by defining the measure on certain subsets that generate the σ-field F1 × F2, and then extending it to the entire σ-field. However, Caratheodory’s extension theorem involves certain conditions, and checking them does take some nontrivial work. Various proofs can be found in most measure-theoretic probability texts.

Everything in these notes extends to the case where instead of probability measures Pi, we are dealing with general measures µi, under the assumption that the measures µi are σ-finite. (A measure µ is called σ-finite if the set Ω can be partitioned into a countable union of sets, each of which has finite measure.)

The most relevant example of a σ-finite measure is the Lebesgue measure on the real line. Indeed, the real line can be broken into a countable sequence of intervals (n, n + 1], each of which has finite Lebesgue measure.

The two-dimensional plane R2 is the Cartesian product of R with itself. We endow each copy of R with the Borel σ-field B and one-dimensional Lebesgue measure. The resulting σ-field B × B is called the Borel σ-field on R2. The resulting product measure on R2 is called two-dimensional Lebesgue measure, to be denoted here by λ2. The measure λ2 corresponds to the natural notion of area. For example,

    λ2([a, b] × [c, d]) = (b − a)(d − c).

More generally, for any “nice” set of the form encountered in calculus, e.g., sets of the form A = {(x, y) | f(x, y) ≤ c}, where f is a continuous function, λ2(A) coincides with the usual notion of the area of A.

Remark for those of you who know a little bit of topology (otherwise ignore it). We could define the Borel σ-field on R2 as the σ-field generated by the collection of open subsets of R2. (This is the standard way of defining Borel sets in topological spaces.) It turns out that this definition results in the same σ-field as the method of Section 1.2.

2 FUBINI’S THEOREM

Fubini’s theorem provides conditions for interchanging the order of integration in a double integral. Given that sums are essentially special cases of integrals (with respect to discrete measures), it also gives conditions for interchanging the order of summations, or the order of a summation and an integration. In this respect, it subsumes results such as Corollary 1 at the end of the notes for Lecture 12.

In the sequel, we will assume that g : Ω1 × Ω2 → R is a measurable function. This means that for any Borel set A ⊂ R, the set {(ω1, ω2) | g(ω1, ω2) ∈ A} belongs to the σ-field F1 × F2. As a practical matter, it is enough to verify that for any scalar c, the set {(ω1, ω2) | g(ω1, ω2) ≤ c} is measurable. Other than using this definition directly, how else can we verify that such a function g is measurable? The basic tools at hand are the following:


(b) indicator functions of measurable sets are measurable;

(c) combining measurable functions in the usual ways (e.g., adding them, mul

tiplying them, taking limits, etc.) results in measurable functions.

Fubini’s theorem holds under two different sets of conditions: (a) nonnega

tive functions g (compare with the MCT); (b) functions g whose absolute value

has a ﬁnite integral (compare with the DCT). We state the two versions sepa

rately, because of some subtle differences.

The two statements below are taken verbatim from the text by Adams &

Guillemin, with minor changes to conform to our notation.

Theorem 2. (Fubini’s theorem; nonnegative case) Let g : Ω1 × Ω2 → [0, ∞] be a nonnegative measurable function, and let P = P1 × P2 be a product measure. Then,

(e) We have

    ∫_{Ω1} ( ∫_{Ω2} g(ω1, ω2) dP2 ) dP1 = ∫_{Ω2} ( ∫_{Ω1} g(ω1, ω2) dP1 ) dP2 = ∫_{Ω1×Ω2} g(ω1, ω2) dP.

(The earlier parts of the theorem statement assert, in particular, that the inner integrals above are measurable functions of the remaining variable, so that the outer integrals are well-defined.)

Note that some of the integrals above may be infinite, but this is not a problem; since everything is nonnegative, expressions of the form ∞ − ∞ do not arise.

Recall now that a function is said to be integrable if it is measurable and the integral of its absolute value is finite.

Theorem 3. (Fubini’s theorem; integrable case) Let g be a measurable function on Ω1 × Ω2 that satisfies

    ∫_{Ω1×Ω2} |g(ω1, ω2)| dP < ∞,

where P = P1 × P2. Then, the function h(ω1) = ∫_{Ω2} g(ω1, ω2) dP2 is well-defined and finite for almost all ω1 (i.e., it is only on a set of zero P1-measure that ∫_{Ω2} g(ω1, ω2) dP2 is undefined or infinite), and similarly, the function h(ω2) = ∫_{Ω1} g(ω1, ω2) dP1 is well-defined and finite for almost all ω2 (i.e., it is only on a set of zero P2-measure that ∫_{Ω1} g(ω1, ω2) dP1 is undefined or infinite).

(e) We have

    ∫_{Ω1} ( ∫_{Ω2} g(ω1, ω2) dP2 ) dP1 = ∫_{Ω2} ( ∫_{Ω1} g(ω1, ω2) dP1 ) dP2 = ∫_{Ω1×Ω2} g(ω1, ω2) dP.

We repeat that all of these results remain valid when dealing with σ-finite measures, such as the Lebesgue measure on R2. This provides us with conditions for the familiar calculus formula

    ∫ ( ∫ g(x, y) dx ) dy = ∫ ( ∫ g(x, y) dy ) dx.

In order to apply Theorem 3, we need to first verify the integrability condition

    ∫_{Ω1×Ω2} |g(ω1, ω2)| dP < ∞.

This is typically done by applying Theorem 2 to the nonnegative function |g|: we have

    ∫_{Ω1×Ω2} |g(ω1, ω2)| dP = ∫_{Ω1} ( ∫_{Ω2} |g(ω1, ω2)| dP2 ) dP1,

so all we need is to work with the right-hand side, and integrate one variable at a time, possibly also using some bounds on the way.

Finally, let us note that all the hard work goes into proving Theorem 2. Theorem 3 is relatively easy to derive once Theorem 2 is available: Given a function g, decompose it into its positive and negative parts, apply Theorem 2 to each part, and in the process make sure that you do not encounter expressions of the form ∞ − ∞.

3 COUNTEREXAMPLES

3.1 Integrability

Suppose both of our sample spaces are the positive integers: Ω1 = Ω2 = {1, 2, . . .}. The σ-fields F1 and F2 will be all subsets of Ω1 and Ω2, respectively. Then, F1 × F2 will be composed of all subsets of {1, 2, . . .}^2. Both P1 and P2 will be the counting measure, i.e., P(A) = |A|. This means that

    ∫_A g dP1 = Σ_{a∈A} g(a),   ∫_B h dP2 = Σ_{b∈B} h(b),   ∫_C f d(P1 × P2) = Σ_{c∈C} f(c).

Let f(n, n) = 1 and f(n, n + 1) = −1, for every n, and f = 0 elsewhere. It is easier to visualize f with a picture:

    1 −1  0  0 · · ·
    0  1 −1  0 · · ·
    0  0  1 −1 · · ·
    0  0  0  1 · · ·
    .  .  .  .

So

    ∫_{Ω1} ( ∫_{Ω2} f dP2 ) dP1 = Σ_n Σ_m f(n, m) = 0 ≠ 1 = Σ_m Σ_n f(n, m) = ∫_{Ω2} ( ∫_{Ω1} f dP1 ) dP2.

The problem is that the function we are integrating is neither nonnegative nor integrable.


3.2 σ-finiteness

Let Ω1 = (0, 1), let F1 be the Borel sets, and let P1 be the Lebesgue measure. Let Ω2 = (0, 1), let F2 be the set of all subsets of (0, 1), and let P2 be the counting measure.

Define f(x, y) = 1 if x = y, and 0 otherwise. Then,

    ∫_{Ω1} ( ∫_{Ω2} f(x, y) dP2(y) ) dP1(x) = ∫_{Ω1} 1 dP1(x) = 1,

but

    ∫_{Ω2} ( ∫_{Ω1} f(x, y) dP1(x) ) dP2(y) = ∫_{Ω2} 0 dP2(y) = 0.

The problem is that the counting measure on (0, 1) is not σ-finite.

4 An application

We close with an application of Fubini’s theorem, rigorously justifying a formula that is usually derived informally in a beginning probability course.

Let X be a nonnegative integer-valued random variable. Then,

    E[X] = Σ_{i=1}^{∞} P(X ≥ i).

Indeed, writing p(i) = P(X = i), we have

    E[X] = Σ_{i=1}^{∞} i p(i)
         = Σ_{i=1}^{∞} Σ_{k=1}^{i} p(i)
         = Σ_{k=1}^{∞} Σ_{i=k}^{∞} p(i)
         = Σ_{k=1}^{∞} P(X ≥ k).

Let us now rigorously prove a justification of this relation in the most general case. We will show that if X is a nonnegative random variable, then

    E[X] = ∫_0^∞ P(X ≥ x) dx.


Proof: Define A = {(ω, x) | 0 ≤ x ≤ X(ω)}. Intuitively, if Ω = R, then A would be the region under the curve X(ω). We argue that

    E[X] = ∫_Ω X(ω) dP = ∫_Ω ( ∫_0^∞ 1_A(ω, x) dx ) dP,

and now let’s postpone the technical issues for a moment and interchange the integrals to get

    E[X] = ∫_0^∞ ( ∫_Ω 1_A(ω, x) dP ) dx = ∫_0^∞ P(X ≥ x) dx.

Now let’s consider the technical details necessary to make the above argument work. The integral interchange can be justified on account of the function 1_A being nonnegative, so we just need to show that all the functions we deal with are measurable. In particular, we need to show that:

1. For fixed ω, 1_A(ω, x) is a measurable function of x. Indeed, it is the indicator function of the interval [0, X(ω)], so it is Lebesgue measurable.

2. P(X ≥ x) is a measurable function of x. Indeed, the function Z(x) = P(X ≥ x) is nonincreasing: if a belongs to the set {x | Z(x) ≥ z}, then so is every number below a. It follows that the set {Z ≥ z} is always an interval, so it is Lebesgue measurable.

3. The function 1_A is measurable. To show this, we argue that A is measurable. Indeed, the function g : Ω × R → R defined by g(ω, x) = X(ω) is measurable, since for any Borel set B, g^{−1}(B) = X^{−1}(B) × (−∞, +∞). Similarly, h : Ω × R → R defined as h(ω, x) = x is measurable for the same reason. Since

    A = {g ≥ h} ∩ {h ≥ 0},

it follows that A is measurable.
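For a nonnegative integer-valued X, the integral formula reduces to the tail sum E[X] = Σ_{i≥1} P(X ≥ i), which is easy to verify exactly (our own small check, here for X ~ Binomial(3, 1/2)):

```python
from fractions import Fraction

# PMF of X ~ Binomial(3, 1/2).
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8),
       2: Fraction(3, 8), 3: Fraction(1, 8)}

# Elementary definition: E[X] = sum_x x P(X = x).
mean = sum(x * p for x, p in pmf.items())

# Tail-sum formula: E[X] = sum_{i>=1} P(X >= i).
tail_sum = sum(sum(p for x, p in pmf.items() if x >= i) for i in range(1, 4))

assert mean == tail_sum == Fraction(3, 2)
```

Both computations are finite sums of exact rationals, so the agreement is term-for-term, with no rounding involved.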



MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 14 10/27/2008

Contents

1. Moment generating functions

2. Sum of a random number of random variables

3. Transforms associated with joint distributions

The transforms that we consider here (moment generating functions and characteristic functions) provide an alternative way of representing a probability distribution by means of a certain function of a single variable.

These functions turn out to be useful in many different ways:

(a) They provide an easy way of calculating the moments of a distribution.

(b) They provide some powerful tools for addressing certain counting and

combinatorial problems.

(c) They provide an easy way of characterizing the distribution of the sum of

independent random variables.

(d) They provide tools for dealing with the distribution of the sum of a random

number of independent random variables.

(e) They play a central role in the study of branching processes.

(f) They play a key role in large deviations theory, that is, in studying the

asymptotics of tail probabilities of the form P(X ≥ c), when c is a large

number.

(g) They provide a bridge between complex analysis and probability, so that

complex analysis methods can be brought to bear on probability problems.

(h) They provide powerful tools for proving limit theorems, such as laws of

large numbers and the central limit theorem.


1 MOMENT GENERATING FUNCTIONS

1.1 Deﬁnition

Deﬁnition 1. The moment generating function associated with a random
variable X is a function MX : R → [0, ∞] deﬁned by

MX (s) = E[e^{sX}].

When X is discrete with PMF pX , this reads

MX (s) = Σ_x e^{sx} pX (x),

and when X is continuous with PDF fX ,

MX (s) = ∫ e^{sx} fX (x) dx.

Note that this is essentially the same as the deﬁnition of the Laplace transform

of a function fX , except that we are using s instead of −s in the exponent.

Let DX = {s : MX (s) < ∞} denote the set of s at which MX is ﬁnite. Note
that 0 ∈ DX , because MX (0) = E[e^{0X}] = 1. For a discrete random
variable that takes only a ﬁnite number of different values, we have DX = R.
For example, if X takes the values 1, 2, and 3, with probabilities 1/2, 1/3, and
1/6, respectively, then

MX (s) = (1/2)e^s + (1/3)e^{2s} + (1/6)e^{3s} ,          (1)

which is ﬁnite for every s ∈ R. On the other hand, for the Cauchy distribution,
fX (x) = 1/(π(1 + x^2 )), for all x, it is easily seen that MX (s) = ∞, for all
s ≠ 0.

In general, DX is an interval (possibly inﬁnite or semi-inﬁnite) that contains

zero.
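As a quick numerical sanity check of Eq. (1) (a sketch added here, not part of the original notes; the sample size and the value s = 0.7 are arbitrary choices), the closed-form MGF can be compared with a Monte Carlo estimate of E[e^{sX}]:

```python
import math, random

random.seed(0)

# PMF of the example in Eq. (1): X takes values 1, 2, 3 w.p. 1/2, 1/3, 1/6
values, probs = [1, 2, 3], [0.5, 1/3, 1/6]

def mgf_exact(s):
    # MX(s) = (1/2)e^s + (1/3)e^{2s} + (1/6)e^{3s}
    return sum(p * math.exp(s * x) for x, p in zip(values, probs))

# Monte Carlo estimate of E[exp(sX)]
samples = random.choices(values, weights=probs, k=200_000)
s = 0.7
estimate = sum(math.exp(s * x) for x in samples) / len(samples)

assert abs(estimate - mgf_exact(s)) / mgf_exact(s) < 0.02  # agree to ~1%
assert abs(mgf_exact(0.0) - 1.0) < 1e-12                   # MX(0) = 1 always
```

The second assertion illustrates the general fact noted above: MX(0) = 1 for every random variable.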

Exercise 1. Suppose that MX (s) < ∞ for some s > 0. Show that MX (t) < ∞ for all

t ∈ [0, s]. Similarly, suppose that MX (s) < ∞ for some s < 0. Show that MX (t) < ∞

for all t ∈ [s, 0].


Exercise 2. Suppose that

lim sup_{x→∞} (log P(X > x))/x ≤ −ν < 0.

Show that MX (s) < ∞ for all s ∈ [0, ν).

By inspection of the formula for MX (s) in Eq. (1), it is clear that the
distribution of X is readily determined. The various powers e^{sx} indicate the possible

values of the random variable X, and the associated coefﬁcients provide the

corresponding probabilities.

At the other extreme, if we are told that MX (s) = ∞ for every s �= 0, this

is certainly not enough information to determine the distribution of X.

On this subject, there is the following fundamental result. It is intimately
related to the inversion properties of Laplace transforms. Its proof requires
sophisticated analytical machinery and is omitted.

Theorem 1.

(a) Suppose that MX (s) is ﬁnite for all s in an interval of the form [−a, a],

where a is a positive number. Then, MX determines uniquely the CDF of

the random variable X.

(b) If MX (s) = MY (s) < ∞, for all s ∈ [−a, a], where a is a positive number,

then the random variables X and Y have the same CDF.

There are explicit formulas that allow us to recover the PMF or PDF of a
random variable starting from the associated transform, but they are quite difﬁcult

to use (e.g., involving “contour integrals”). In practice, transforms are usually

inverted by “pattern matching,” based on tables of known distribution-transform

pairs.

There is a reason why MX is called a moment generating function. Let us
consider the derivatives of MX at zero. Assuming for a moment we can interchange


the order of integration and differentiation, we obtain

d/ds MX (s) |_{s=0} = E[ d/ds e^{sX} ] |_{s=0} = E[X e^{sX}] |_{s=0} = E[X],

and, more generally,

d^m/ds^m MX (s) |_{s=0} = E[ d^m/ds^m e^{sX} ] |_{s=0} = E[X^m e^{sX}] |_{s=0} = E[X^m ].

Thus, knowledge of the transform MX allows for an easy calculation of the

moments of a random variable X.
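The moment-generating property can be checked numerically. The sketch below (not from the notes; the step size h is an arbitrary choice) differentiates the MGF of Eq. (1) at s = 0 by central differences and compares with the directly computed moments E[X] = 5/3 and E[X^2] = 10/3:

```python
import math

# MGF from Eq. (1): MX(s) = (1/2)e^s + (1/3)e^{2s} + (1/6)e^{3s}
def M(s):
    return 0.5 * math.exp(s) + (1/3) * math.exp(2*s) + (1/6) * math.exp(3*s)

# central differences approximate the first two derivatives at s = 0
h = 1e-5
m1 = (M(h) - M(-h)) / (2 * h)             # ≈ E[X]
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2     # ≈ E[X^2]

EX  = 1 * 0.5 + 2 * (1/3) + 3 * (1/6)     # = 5/3, computed directly
EX2 = 1 * 0.5 + 4 * (1/3) + 9 * (1/6)     # = 10/3, computed directly
assert abs(m1 - EX) < 1e-6
assert abs(m2 - EX2) < 1e-4
```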

Justifying the interchange of the expectation and the differentiation does

require some work. The steps are outlined in the following exercise. For
simplicity, we restrict to the case of nonnegative random variables.

Exercise 3. Suppose that X is a nonnegative random variable and that MX (s) < ∞

for all s ∈ (−∞, a], where a is a positive number.

(a) Show that E[X k ] < ∞, for every k.

(b) Show that E[X k esX ] < ∞, for every s < a.

(c) Show that (e^{hX} − 1)/h ≤ X e^{hX} , for h > 0.

(d) Use the DCT to argue that

E[X] = E[ lim_{h↓0} (e^{hX} − 1)/h ] = lim_{h↓0} (E[e^{hX}] − 1)/h.

For discrete random variables, the following probability generating function

is sometimes useful. It is deﬁned by

gX (s) = E[s^X ],

with s usually restricted to positive values. It is of course closely related to the

moment generating function in that, for s > 0, we have gX (s) = MX (log s).

One difference is that when X is a positive random variable, we can deﬁne

gX (s), as well as its derivatives, for s = 0. So, suppose that X has a PMF

pX (m), for m = 1, 2, . . .. Then,

gX (s) = Σ_{m=1}^{∞} s^m pX (m),

resulting in

d^m/ds^m gX (s) |_{s=0} = m! pX (m).

(The interchange of the summation and the differentiation needs justiﬁcation,

but is indeed legitimate for small s.) Thus, we can use gX to easily recover the

PMF pX , when X is a positive integer random variable.
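A small sketch of this (using the same PMF as the earlier example; the numbers are illustrative): the first derivative of gX at s = 0 recovers pX(1), and the relation gX(s) = MX(log s) can be verified pointwise:

```python
import math

# PMF on {1, 2, 3} with probabilities 1/2, 1/3, 1/6 (the running example)
p = {1: 0.5, 2: 1/3, 3: 1/6}

def g(s):   # probability generating function E[s^X]
    return sum(prob * s**m for m, prob in p.items())

def M(s):   # moment generating function E[e^{sX}]
    return sum(prob * math.exp(s * m) for m, prob in p.items())

# g'(0) = 1! * pX(1); a one-sided difference suffices since g is a polynomial
h = 1e-6
assert abs((g(h) - g(0)) / h - p[1]) < 1e-5

# relation gX(s) = MX(log s), for s > 0
assert abs(g(0.3) - M(math.log(0.3))) < 1e-12
```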


1.6 Examples

Example : X ∼ Exp(λ). Then,

MX (s) = ∫_0^∞ e^{sx} λ e^{−λx} dx = λ/(λ − s), for s < λ,

and MX (s) = ∞, otherwise.

Example : X ∼ Ge(p). Then,

MX (s) = Σ_{m=1}^{∞} e^{sm} p(1 − p)^{m−1} = p e^s /(1 − (1 − p)e^s ), for e^s < 1/(1 − p),

and MX (s) = ∞, otherwise. In this case, we also ﬁnd gX (s) = ps/(1 − (1 − p)s),
for s < 1/(1 − p), and gX (s) = ∞, otherwise.

Example : X ∼ N (0, 1). Then,

MX (s) = (1/√(2π)) ∫_{−∞}^{∞} e^{sx} e^{−x^2/2} dx
       = e^{s^2/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−s)^2/2} dx
       = e^{s^2/2}.
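The Gaussian integral above can be confirmed by brute-force numerical integration; the following sketch (added here, with arbitrary integration limits and grid size) checks MX(s) = e^{s^2/2} at a few values of s:

```python
import math

def mgf_std_normal(s, lim=12.0, n=40_000):
    # trapezoidal approximation of (1/sqrt(2*pi)) ∫ e^{sx} e^{-x^2/2} dx
    h = 2 * lim / n
    total = 0.0
    for k in range(n + 1):
        x = -lim + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * math.exp(s * x - x * x / 2)
    return total * h / math.sqrt(2 * math.pi)

for s in (0.0, 0.5, 1.5):
    assert abs(mgf_std_normal(s) - math.exp(s * s / 2)) < 1e-6
```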

We record some useful properties of moment generating functions.

Theorem 2.

(a) For any scalars a and b, M_{aX+b}(s) = e^{sb} MX (as).

(b) If X and Y are independent, then M_{X+Y}(s) = MX (s)MY (s).

(c) Let X and Y be independent random variables, and let Z be equal to X, with
probability p, and equal to Y , with probability 1 − p. Then,

MZ (s) = pMX (s) + (1 − p)MY (s).

Proof: For part (a), we have

M_{aX+b}(s) = E[exp(saX + sb)] = exp(sb)E[exp(saX)] = exp(sb)MX (as).

For part (b), using independence, we have

M_{X+Y}(s) = E[e^{s(X+Y)}] = E[e^{sX}]E[e^{sY}] = MX (s)MY (s).

For part (c), by conditioning on the random choice between X and Y , we have

MZ (s) = pE[e^{sX}] + (1 − p)E[e^{sY}] = pMX (s) + (1 − p)MY (s).

Example :

(a) Let X be a standard normal random variable, and let Y = σX + µ, which we know
to have a N (µ, σ^2 ) distribution. We then ﬁnd that MY (s) = exp(sµ + s^2 σ^2 /2).

(b) Let X ∼ N (µ1 , σ1^2 ) and Y ∼ N (µ2 , σ2^2 ) be independent. Then,

M_{X+Y}(s) = exp( s(µ1 + µ2 ) + s^2 (σ1^2 + σ2^2 )/2 ).

Using the inversion property of transforms, we conclude that X + Y ∼ N (µ1 +
µ2 , σ1^2 + σ2^2 ), thus corroborating a result we ﬁrst obtained using convolutions.

2 SUM OF A RANDOM NUMBER OF RANDOM VARIABLES

Let X1 , X2 , . . . be i.i.d. random variables with common mean µ and variance
σ^2 . Let N be another, independent random variable that takes nonnegative
integer values. Let Y = Σ_{i=1}^{N} Xi . Let us derive the mean, variance, and
moment generating function of Y .

We have

E[Y ] = E[ E[Y | N ] ] = E[N µ] = µE[N ],

and

var(Y ) = E[ var(Y | N ) ] + var( E[Y | N ] )
        = E[N σ^2 ] + var(N µ)
        = E[N ]σ^2 + µ^2 var(N ).

Furthermore, for every n,

E[exp(sY ) | N = n] = MX (s)^n = exp(n log MX (s)),


implying that

MY (s) = Σ_n exp(n log MX (s)) P(N = n) = MN (log MX (s)).

The reader is encouraged to take the derivative of the above expression, and

evaluate it at s = 0, to recover the formula E[Y ] = E[N ]E[X].
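For distributions with small finite support, the formulas E[Y] = E[N]E[X] and var(Y) = E[N]σ^2 + µ^2 var(N) can also be verified by exact enumeration. The sketch below (the PMFs of X and N are made up for illustration) enumerates every outcome of Y = X1 + ... + XN:

```python
from itertools import product

x_vals, x_p = [1, 2], [0.5, 0.5]          # mean mu = 1.5, variance = 0.25
n_vals, n_p = [0, 1, 2], [0.2, 0.5, 0.3]  # E[N] = 1.1, var(N) = 0.49

# enumerate the distribution of Y = X1 + ... + XN exactly
EY = EY2 = 0.0
for n, pn in zip(n_vals, n_p):
    for combo in product(range(len(x_vals)), repeat=n):
        y = sum(x_vals[i] for i in combo)
        prob = pn
        for i in combo:
            prob *= x_p[i]
        EY += prob * y
        EY2 += prob * y * y

mu, var_x = 1.5, 0.25
EN, var_n = 1.1, 0.49
assert abs(EY - EN * mu) < 1e-12                                 # E[Y] = E[N]E[X]
assert abs((EY2 - EY**2) - (EN * var_x + mu**2 * var_n)) < 1e-12  # variance formula
```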

Example : Suppose that each Xi is exponentially distributed, with parameter λ, and

that N is geometrically distributed, with parameter p ∈ (0, 1). We ﬁnd that

MY (s) = MN (log MX (s)) = p MX (s)/(1 − (1 − p)MX (s))
       = (pλ/(λ − s)) / (1 − (1 − p)λ/(λ − s)) = λp/(λp − s),

which is the transform of an exponential random variable with parameter λp.
Using the inversion theorem, we conclude that Y is exponentially distributed,
with parameter λp. In view of the fact that the sum of a ﬁxed number of
exponential random variables is far from exponential, this result is rather
surprising. An intuitive explanation will be provided later in terms of the
Poisson process.
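The result is easy to test by simulation. The sketch below (parameter values and sample size are arbitrary) draws a Ge(p)-distributed number of Exp(λ) summands and checks that both the sample mean and a tail probability match the Exp(λp) distribution:

```python
import math, random

random.seed(1)
lam, p = 2.0, 0.3   # Exp(lam) summands, Ge(p) count (support 1, 2, ...)

def geometric(p):
    # number of trials up to and including the first success
    n = 1
    while random.random() > p:
        n += 1
    return n

samples = []
for _ in range(100_000):
    n = geometric(p)
    samples.append(sum(random.expovariate(lam) for _ in range(n)))

mean = sum(samples) / len(samples)
assert abs(mean - 1 / (lam * p)) < 0.03        # Exp(lam*p) has mean 1/(lam*p)

t = 1.0                                        # also check P(Y > t) = e^{-lam*p*t}
tail = sum(y > t for y in samples) / len(samples)
assert abs(tail - math.exp(-lam * p * t)) < 0.01
```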

3 TRANSFORMS ASSOCIATED WITH JOINT DISTRIBUTIONS

If two random variables X and Y are described by some joint distribution (e.g.,

a joint PDF), then each one is associated with a transform MX (s) or MY (s).

These are the transforms of the marginal distributions and do not convey
information on the dependence between the two random variables. Such information

is contained in a multivariate transform, which we now deﬁne.

Consider n random variables X1 , . . . , Xn related to the same experiment.

Let s1 , . . . , sn be real parameters. The associated multivariate transform is a
function of these n parameters and is deﬁned by

M_{X1 ,...,Xn}(s1 , . . . , sn ) = E[ exp(s1 X1 + · · · + sn Xn ) ].

The inversion property of transforms remains valid in the multivariate case.
That is, if Y1 , . . . , Yn is another set of random variables and
M_{X1 ,...,Xn}(s1 , . . . , sn ), M_{Y1 ,...,Yn}(s1 , . . . , sn ) are the same functions of s1 , . . . , sn ,
in a neighborhood of the origin, then the joint distribution of X1 , . . . , Xn is the
same as the joint distribution of Y1 , . . . , Yn .

Example :

(a) Consider two random variables X and Y . Their joint transform is

MX,Y (s, t) = E[e^{sX+tY}].

For any ﬁxed (s, t), evaluating it amounts
to calculating the univariate transform associated with a single random variable
(namely, sX + tY ) that is a linear combination of the original random variables.

(b) If X and Y are independent, then MX,Y (s, t) = MX (s)MY (t).


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 15 10/29/2008

Contents

1. Background on positive deﬁnite matrices

2. Deﬁnition of the multivariate normal distribution

3. Means and covariances of vector random variables

4. Key properties of the multivariate normal

In an earlier lecture, we deﬁned the bivariate normal distribution and derived
its properties, relying mostly on algebraic manipulation and integration of
normal PDFs. Here, we revisit the subject in more generality (n dimensions),
while using more elegant tools. First, some background.

Deﬁnition 1. Let A be an n × n symmetric matrix.

(a) We say that A is positive deﬁnite, and write A > 0, if x^T Ax > 0, for every

nonzero x ∈ Rn .

(b) We say that A is nonnegative deﬁnite, and write A ≥ 0, if xT Ax ≥ 0, for

every x ∈ Rn .

We will use the following facts from linear algebra without proof:

(a) A symmetric matrix has n real eigenvalues.

(b) A positive deﬁnite matrix has n real and positive eigenvalues.

(c) A nonnegative deﬁnite matrix has n real and nonnegative eigenvalues.


(d) To each eigenvalue of a symmetric matrix, we can associate a real
eigenvector. Eigenvectors associated with distinct eigenvalues are orthogonal;
eigenvectors associated with repeated eigenvalues can always be taken to be
orthogonal. Without loss of generality, all these eigenvectors can be
normalized so that they have unit length, resulting in an orthonormal basis.

(e) The above essentially states that a symmetric matrix becomes diagonal
after a suitable orthogonal change of basis.

These facts lead to the spectral decomposition formula: Every symmetric matrix
A can be expressed in the form

A = Σ_{i=1}^{n} λi zi zi^T ,

where the λi are the eigenvalues of A and the zi form an associated
collection of orthonormal eigenvectors. (Note here that zi zi^T is an n × n matrix, of
rank 1.)

For nonnegative deﬁnite matrices, we have λi ≥ 0, which allows us to take
square roots and deﬁne

B = Σ_{i=1}^{n} √λi zi zi^T .

(a) The matrix B is symmetric.

(b) We have B 2 = A (this is an easy calculation). Thus B is a symmetric

square root of A.

(c) The matrix B has eigenvalues √λi . Therefore, it is positive (respectively,
nonnegative) deﬁnite if and only if A is positive (respectively, nonnegative)
deﬁnite.
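As a concrete illustration (the matrix A = [[2, 1], [1, 2]] and its eigenpairs are chosen by hand for this sketch, which is not part of the notes), the symmetric square root B = Σ √λi zi zi^T can be assembled and checked against B^2 = A in a few lines:

```python
import math

# A = [[2, 1], [1, 2]] has eigenvalues 3 and 1, with orthonormal eigenvectors
# z1 = (1, 1)/sqrt(2) and z2 = (1, -1)/sqrt(2)
lam = [3.0, 1.0]
z = [(1/math.sqrt(2), 1/math.sqrt(2)), (1/math.sqrt(2), -1/math.sqrt(2))]

# B = sum_i sqrt(lam_i) z_i z_i^T
B = [[sum(math.sqrt(lam[i]) * z[i][r] * z[i][c] for i in range(2))
      for c in range(2)] for r in range(2)]

# check that B is symmetric and that B @ B recovers A
BB = [[sum(B[r][k] * B[k][c] for k in range(2)) for c in range(2)]
      for r in range(2)]
A = [[2.0, 1.0], [1.0, 2.0]]
assert abs(B[0][1] - B[1][0]) < 1e-12
for r in range(2):
    for c in range(2):
        assert abs(BB[r][c] - A[r][c]) < 1e-12
```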

When A is positive deﬁnite, the λi are positive, and the inverse of A is given by
the matrix

C = Σ_{i=1}^{n} (1/λi ) zi zi^T .

Our interest in positive deﬁnite matrices stems from the following. When A is
positive deﬁnite, the quadratic form q(x) = x^T Ax goes to inﬁnity as ||x|| → ∞,
so that e^{−q(x)} decays to zero, as ||x|| → ∞, and therefore can be used to deﬁne
a multivariate PDF.

2 DEFINITION OF THE MULTIVARIATE NORMAL DISTRIBUTION

There are multiple ways of deﬁning multivariate normal distributions. We

will present three, and will eventually show that they are consistent with each

other.

The ﬁrst generalizes our deﬁnition of the bivariate normal. It is the most

explicit and transparent; on the downside it can lead to unpleasant algebraic

manipulations. Recall that |V | stands for the absolute value of the determinant

of a square matrix V .

Deﬁnition 2. A random vector X has a nondegenerate (multivariate) normal
distribution if it has a joint PDF of the form

fX (x) = (1/√((2π)^n |V |)) exp( −(x − µ)V^{−1}(x − µ)^T /2 ),

for some vector µ and some positive deﬁnite matrix V .

Deﬁnition 3. A random vector X is (multivariate) normal if it can be
expressed in the form

X = DW + µ,

for some matrix D and some real vector µ, where W is a random vector
whose components are independent N (0, 1) random variables.

The last deﬁnition is possibly the hardest to penetrate, but in the eyes of

some, it is the most elegant.

Deﬁnition 4. A random vector X is (multivariate) normal if for every real
vector a, the random variable a^T X is normal.

Note that under Deﬁnition 2, fX (x) > 0 for all x. On the other hand, consider the following

example. Let X1 ∼ N (0, 1) and let X2 = 0. The random vector X = (X1 , X2 )

is normal according to Deﬁnitions 3 or 4, but cannot be described by a joint

PDF (all of the probability is concentrated on the horizontal axis, a set of zero

area). This is an example of a degenerate normal distribution: the distribution

is concentrated on a proper subspace of Rn . The most extreme example is a

one-dimensional random variable, which is identically equal to zero. This
qualiﬁes as normal under Deﬁnitions 3 and 4. One may question the wisdom of

calling the number “zero” a “normal random variable;” the reason for doing so

is that it allows us to state results such as “a linear function of a normal random

variable is normal”, etc., without having to worry about exceptions and special

conditions that will prevent degeneracy.

3 MEANS AND COVARIANCES OF VECTOR RANDOM VARIABLES

If X = (X1 , . . . , Xn ) is a random vector, we deﬁne

E[X] = (E[X1 ], . . . , E[Xn ]),

which we treat as a column vector. Similarly, if A is a random matrix (a matrix

with each entry being a random variable Aij ), we use the notation E[A], to

denote the matrix whose entries are E[Aij ].

Given two random vectors X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Ym ), we

can consider all the possible covariances Cov(Xi , Yj ), arranged in the n × m
covariance matrix

Cov(X, Y) = E[ (X − E[X])(Y − E[Y])^T ].

Exercise 1. Prove that Cov(X, X) is nonnegative deﬁnite.


4 KEY PROPERTIES OF THE MULTIVARIATE NORMAL

The theorem below includes almost everything useful there is to know about

multivariate normals. We will prove and state the theorem, while working

mostly with Deﬁnition 3. The proof of equivalence of the three deﬁnitions will

be completed in the next lecture, together with some additional observations.

Theorem 1. Let X = DW + µ be multivariate normal, in the
sense of Deﬁnition 3, and let µi be the ith component of µ. Then:

(a) Each component Xi of X is normal, with mean µi .

(b) We have Cov(X, X) = DD^T .

(c) If C is a m × n matrix and d is a vector in Rm , then Y = CX + d is

multivariate normal in the sense of Deﬁnition 3, with mean Cµ + d and

covariance matrix CDDT C T .

(d) If |D| ≠ 0, then X is a nondegenerate multivariate normal in the sense of
Deﬁnition 2, with V = DD^T = Cov(X, X).

(e) The joint CDF FX of X is completely determined by the mean and
covariance of X.

(f) The components of X are uncorrelated (i.e., the covariance matrix is
diagonal) if and only if they are independent.

(g) If

(X, Y) ∼ N( (µX , µY ), [ VXX VXY ; VY X VY Y ] ),

and VY Y > 0, then:

(i) E[X | Y] = µX + VXY V_{Y Y}^{−1}(Y − µY ).

(ii) Let X̃ = X − E[X | Y]. Then, X̃ is independent of Y, and
independent of E[X | Y].

(iii) Cov(X̃, X̃ | Y) = Cov(X̃, X̃) = VXX − VXY V_{Y Y}^{−1} VY X .

Proof:

(a) Under Deﬁnition 3, Xi is a linear function of independent normal random
variables, hence normal. Since E[W] = 0, we have E[Xi ] = µi .

(b) For simplicity, let us just consider the zero mean case. We have

Cov(X, X) = E[XXT ] = E[DWWT DT ] = DE[WWT ]DT = DDT ,


where the last equality follows because the components of W are
independent (hence the covariance matrix is diagonal), with unit variance
(hence the diagonal entries are all equal to 1).

(c) We have Y = CX + d = (CD)W + (Cµ + d), which is a linear
function of independent standard normal random variables. Thus, Y is
multivariate normal. The formula for E[Y] is immediate. The formula
for the covariance matrix follows from part (b), with D being replaced by
(CD).

(d) This is an exercise in derived distributions. Let us again just consider the

case of µ = 0. We already know (Lecture 10) that when X = DW , with

D invertible, then

fX (x) = fW (D^{−1} x) / |det D|.

In our case, since the Wi are i.i.d. N (0, 1), we have

fW (w) = (1/√((2π)^n)) exp( −w^T w/2 ),

leading to

fX (x) = (1/√((2π)^n |DD^T |)) exp( −x^T (D^{−1})^T D^{−1} x/2 ).

Since (D^{−1})^T D^{−1} = (DD^T )^{−1} = V^{−1}, this is the PDF of Deﬁnition 2,
with V = DD^T . Combined with part (b), we also have Cov(X, X) = V . The
argument for the nonzero mean case is essentially the same.

(e) Using part (d), the joint PDF of X is completely determined by the matrix

V , which happens to be equal to Cov(X, X), together with the vector µ.

The degenerate case is a little harder, because of the absence of a convenient
closed form formula. One could think of a limiting argument that
involves injecting a tiny bit of noise in all directions, to make the
distribution nondegenerate, and then taking the limit. This type of argument

can be made to work, but will involve tedious technicalities. Instead, we

will take a shortcut, based on the inversion property of transforms. This

argument is simpler, but relies on the heavy machinery behind the proof

of the inversion property.

Let us ﬁnd the multivariate transform MX (s) = E[e^{s^T X}]. We note that
s^T X is normal with mean s^T µ. Letting X̃ = X − µ, the variance of s^T X
is

E[(s^T X̃)^2 ] = E[s^T X̃ X̃^T s] = s^T V s.

Using the formula for the transform of a single normal random variable
(s^T X in this case), we have

MX (s) = E[e^{s^T X}] = M_{s^T X}(1) = e^{s^T µ} e^{s^T V s/2}.

Thus, MX is completely determined by µ and V . By the inversion
property of transforms, µ and V completely determine the distribution
(e.g., the CDF) of X.

(f) Independent random variables are always uncorrelated, which gives one direction.
For the converse, suppose that the components of X are uncorrelated, i.e.,

the matrix V is diagonal. Consider another random vector Y that has
the same mean as X, whose components are independent normal, and

such that the variance of Yi is the same as the variance of Xi . Then, X

and Y have the same mean and covariance. By part (e), X and Y have the

same distribution. Since the components of Y are independent, it follows

that the components of X are also independent.

(g) Once more, to simplify notation, let us just deal with the zero-mean case.

Let us deﬁne

X̂ = VXY V_{Y Y}^{−1} Y.

We then have

E[(X − X̂)Y^T ] = E[XY^T ] − VXY V_{Y Y}^{−1} E[YY^T ] = VXY − VXY V_{Y Y}^{−1} VY Y = 0,

so X − X̂ is uncorrelated with Y. Furthermore, (X − X̂, Y) is
a linear function of (X, Y), so, by part (c), it is also multivariate normal.

Using an argument similar to the one in the proof of part (f), we conclude

that X − X̂ is independent of Y, and therefore independent from any

function of Y. Recall now the abstract deﬁnition of conditional
expectations. The relation E[(X − X̂)g(Y)] = 0, for every function g, implies

that X̂ = E[X | Y], which proves part (i).

For part (ii), note that we already proved that X̃ = X − X̂ = X − E[X |
Y] is independent of Y. Since E[X | Y] is a function of Y, it follows
that X̃ is independent of E[X | Y].


For part (iii), note that X̃ is independent of Y, which implies that
Cov(X̃, X̃ | Y) = Cov(X̃, X̃). Finally,

Cov(X̃, X̃) = E[X̃ X̃^T ] = E[(X − X̂)(X − X̂)^T ] = E[(X − X̂)X^T ]
           = VXX − E[VXY V_{Y Y}^{−1} YX^T ] = VXX − VXY V_{Y Y}^{−1} E[YX^T ]
           = VXX − VXY V_{Y Y}^{−1} VY X .

Note that in the case of the bivariate normal, we have Cov(X, Y ) = ρσX σY ,
VXX = σX^2 , VY Y = σY^2 . Then, part (g) of the preceding theorem, for the zero-
mean case, reduces to

E[X | Y ] = ρ (σX /σY ) Y,     var(X̃) = σX^2 (1 − ρ^2 ),

which agrees with the formula we derived through elementary means in
Lecture 9, for the special case of unit variances.
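This regression formula can be checked by simulation. In the sketch below (ρ, σX, σY, and the sample size are made up), X is constructed from Y plus independent noise so that the pair is bivariate normal, and the empirical regression coefficient E[XY]/E[Y^2] is compared with ρσX/σY:

```python
import math, random

random.seed(2)
rho, sx, sy = 0.6, 2.0, 1.0
n = 200_000

num = den = 0.0   # accumulate E[XY] and E[Y^2]
for _ in range(n):
    y = random.gauss(0, sy)
    # X = rho*(sx/sy)*Y + independent noise gives Cov(X, Y) = rho*sx*sy
    x = rho * (sx / sy) * y + math.sqrt(1 - rho**2) * sx * random.gauss(0, 1)
    num += x * y
    den += y * y

# the regression coefficient should approach rho * sx / sy = 1.2
assert abs(num / den - rho * sx / sy) < 0.03
```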


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 16 11/3/2008

CHARACTERISTIC FUNCTIONS

Contents

1. Equivalence of the deﬁnitions of the multivariate normal distribution

2. Proof of equivalence

3. Whitening of a sequence of normal random variables

4. Characteristic functions

1 EQUIVALENCE OF THE DEFINITIONS OF THE MULTIVARIATE NORMAL DISTRIBUTION

Recall the following three deﬁnitions from the previous lecture.

Deﬁnition 1. A random vector X has a nondegenerate (multivariate) normal
distribution if it has a joint PDF of the form

fX (x) = (1/√((2π)^n |V |)) exp( −(x − µ)V^{−1}(x − µ)^T /2 ),

for some vector µ and some positive deﬁnite matrix V .

Deﬁnition 2. A random vector X is (multivariate) normal if it can be
expressed in the form

X = DW + µ,

for some matrix D and some real vector µ, where W is a random vector
whose components are independent N (0, 1) random variables.

Deﬁnition 3. A random vector X is (multivariate) normal if for every real
vector a, the random variable a^T X is normal.

2 PROOF OF EQUIVALENCE

In the course of the proof of Theorem 1 in the previous lecture, we argued that

if X is multivariate normal, in the sense of Deﬁnition 2, then:

(a) Since the components of W are independent, a^T X is a linear function of
independent normals, hence normal. Thus, X satisﬁes Deﬁnition 3.

(b) If the matrix D (equivalently, V = DD^T ) is nonsingular, X also satisﬁes
Deﬁnition 1. (We used the derived distributions formula.)

The theorem below establishes the converse of these two statements.

Theorem 1.

(a) If X satisﬁes Deﬁnition 1, then it satisﬁes Deﬁnition 2.

(b) If X satisﬁes Deﬁnition 3, then it satisﬁes Deﬁnition 2.

Proof:

(a) Suppose that X has a joint PDF of the form in Deﬁnition 1, with V positive
deﬁnite. Let D be a symmetric matrix such that D^2 = V . Since
(det D)^2 = det V ≠ 0,
we see that D is nonsingular, and therefore invertible. Let

W = D^{−1}(X − µ).

Then, E[W] = 0, and

Cov(W, W) = D^{−1} E[(X − µ)(X − µ)^T ] D^{−1}
          = D^{−1} V D^{−1} = I.

We have shown thus far that the Wi are normal and uncorrelated. We now

proceed to show that they are independent. Using the formula for the PDF

of X and the change of variables formula, we ﬁnd that the PDF of W is of

the form

fW (w) = c · exp( −w^T w/2 ),

for some normalizing constant c, which is the joint PDF of a vector of
independent normal random variables. It follows that X = DW + µ is a

multivariate normal in the sense of Deﬁnition 2.

(b) Suppose that X satisﬁes Deﬁnition 3, i.e., any linear function a^T X is
normal. Let V = Cov(X, X), and let D be a symmetric matrix such that

D2 = V . We ﬁrst give the proof for the easier case where V (and therefore

D) is invertible.

Let W = D−1 (X − µ). As before, E[W] = 0, and Cov(W, W) = I. Fix

a vector s. Then, s^T W is a linear function of X, and is therefore normal,
with mean zero and variance s^T Cov(W, W)s = s^T s. Note that

E[e^{s^T W}] = M_{s^T W}(1) = e^{s^T s/2},

which is the same as the corresponding transform of a vector of
independent standard normal random variables. By the inversion property of

transforms, it follows that W is a vector of independent standard normal

random variables. Therefore, X = DW + µ is multivariate normal in the

sense of Deﬁnition 2.


(b)' Suppose now that V is singular (as opposed to positive deﬁnite). For
simplicity, we will assume that the mean of X is zero. Then, there exists some
a ≠ 0, such that V a = 0, and a^T V a = 0. Note that

a^T V a = E[ (a^T X)^2 ],

so that a^T X = 0, with probability 1. This means that some component
of X is a deterministic linear function of the remaining components.

By possibly rearranging the components of X, let us assume that Xn is a
linear function of (X1 , . . . , Xn−1 ). If the covariance matrix of (X1 , . . . , Xn−1 )

is also singular, we repeat the same argument, until eventually a nonsingular
covariance matrix is obtained. At that point we have reached the situation

where X is partitioned as X = (Y, Z), with Cov(Y, Y) > 0, and with Z a

linear function of Y (i.e., Z = AY, for some matrix A, with probability 1).

The vector Y also satisﬁes Deﬁnition 3. Since its covariance matrix is non-

singular, the previous part of the proof shows that it also satisﬁes
Deﬁnition 2. Let k be the dimension of Y. Then, Y = DW, where W consists

of k independent standard normals, and D is a k × k matrix. Let W' be a
vector of n − k independent standard normals, independent of W. Then, we can write

X = (Y, Z) = [ D 0 ; AD 0 ] (W, W'),

which expresses X in the form required by Deﬁnition 2.

We should also consider the extreme possibility that in the process of
eliminating components of X, a nonsingular covariance matrix is never obtained.

But in that case, we have X = 0, which also satisﬁes Deﬁnition 2, with

D = 0. (This is the most degenerate case of a multivariate normal.)

3 WHITENING OF A SEQUENCE OF NORMAL RANDOM VARIABLES

The last part of the proof in the previous section provides some interesting
intuition. Given a multivariate normal vector X, we can always perform a change of

coordinates, and obtain a representation of that vector in terms of independent

normal random variables. Our process of going from X to W involved factoring

the covariance matrix V of X in the form V = D2 , where D was a symmetric

square root of V . However, other factorizations are also possible. The most

useful one is described below.


Let

W1 = X1 ,

W2 = X2 − E[X2 | X1 ],

W3 = X3 − E[X3 | X1 , X2 ],

...

Wn = Xn − E[Xn | X1 , . . . , Xn−1 ].

(a) Each Wi is the part of Xi that cannot be predicted on the basis of
the past, (X1 , . . . , Xi−1 ). The Wi are sometimes called the innovations.

(b) When we deal with multivariate normals, conditional expectations are linear

functions of the conditioning variables. Thus, the Wi are linear functions of

the Xi . Furthermore, we have W = LX, where L is a lower triangular
matrix (all entries above the diagonal are zero). The diagonal entries of L are

all equal to 1, so L is invertible. The inverse of L is also lower triangular.

This means that the transformation from X to W is causal (Wi can be
determined from X1 , . . . , Xi ) and causally invertible (Xi can be determined

from W1 , . . . , Wi ). Engineers sometimes call this a “causal and causally

invertible whitening ﬁlter.”

(c) The Wi are independent of each other. This is a consequence of the general

fact E[(X − E[X | Y ])Y ] = 0, which shows that Wi is uncorrelated
with X1 , . . . , Xi−1 , hence uncorrelated with W1 , . . . , Wi−1 . For multivariate
normals, we know that zero correlation implies independence. As long

as the Wi have nonzero variance, we can also normalize them so that their

variance is equal to 1.

(d) Let B = Cov(W, W), which is diagonal. The relation X = L^{−1}W
shows that Cov(X, X) = L^{−1}B(L^{−1})^T . This kind of factorization into a
product of a lower triangular (L^{−1}B^{1/2}) and upper triangular (B^{1/2}(L^{−1})^T )
matrix is called a Cholesky factorization.
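For two dimensions the whitening matrix can be written in closed form. The sketch below (covariance entries are made up for illustration) forms L from the innovation W2 = X2 − (v12/v11)X1 and checks that LVL^T is diagonal, with var(W2) equal to the residual variance v22 − v12^2/v11:

```python
# covariance matrix of (X1, X2): V = [[v11, v12], [v12, v22]]
v11, v12, v22 = 4.0, 1.2, 2.0

# innovations: W1 = X1, W2 = X2 - E[X2 | X1] = X2 - (v12/v11) X1,
# i.e. W = L X with a unit lower-triangular L
L = [[1.0, 0.0], [-v12 / v11, 1.0]]

# Cov(W, W) = L V L^T
V = [[v11, v12], [v12, v22]]
LV = [[sum(L[r][k] * V[k][c] for k in range(2)) for c in range(2)]
      for r in range(2)]
C = [[sum(LV[r][k] * L[c][k] for k in range(2)) for c in range(2)]
     for r in range(2)]

assert abs(C[0][1]) < 1e-12 and abs(C[1][0]) < 1e-12   # W1, W2 uncorrelated
assert abs(C[0][0] - v11) < 1e-12                       # var(W1) = v11
assert abs(C[1][1] - (v22 - v12**2 / v11)) < 1e-12      # var(W2) = residual
```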

4 CHARACTERISTIC FUNCTIONS

We have deﬁned the moment generating function MX (s), for real values of

s, and noted that it may be inﬁnite for some values of s. In particular, if

MX (s) = ∞ for every s ≠ 0, then the moment generating function does not
provide enough information to determine the distribution of X. (As an example,
recall the Cauchy distribution, fX (x) = 1/(π(1 + x^2 )), where 1/π is a
normalizing constant.) A way out of this difﬁculty is to consider complex values
of s, and in particular, the case where s is a purely imaginary number: s = it,
where i = √−1, and t ∈ R. The resulting function is called the characteristic
function, formally deﬁned by

φX (t) = E[e^{itX}].

For example, when X is a continuous random variable with PDF f , we have

φX (t) = ∫ e^{ixt} f (x) dx,

which is very similar to the Fourier transform of f (except for the absence of a
minus sign in the exponent). Thus, the relation between moment generating
functions and characteristic functions is of the same kind as the relation between
Laplace and Fourier transforms.

Note that eitX is a complex-valued random variable, a new concept for us.

However, using the relation eitX = cos(tX)+i sin(tX), deﬁning its expectation

is straightforward:

φX (t) = E[cos(tX)] + iE[sin(tX)].

We make a few key observations:

(a) Because |eitX | ≤ 1 for every t, its expectation, φX (t) is well-deﬁned and

ﬁnite for every t ∈ R. In fact, |φX (t)| ≤ 1, for every t.

(b) The key properties of moment generating functions (cf. Lecture 14) are also

valid for characteristic functions (same proof).

Theorem 2.

(a) For any scalars a and b, φ_{aX+b}(t) = e^{itb} φX (at).

(b) If X and Y are independent, then φ_{X+Y}(t) = φX (t)φY (t).

(c) Let X and Y be independent random variables. Let Z be equal to X,
with probability p, and equal to Y , with probability 1 − p. Then,

φZ (t) = pφX (t) + (1 − p)φY (t).

(c) Inversion theorem: If two random variables have the same characteristic

function, then their distributions are the same. (The proof of this is
nontrivial.)


(d) The above inversion theorem remains valid for multivariate characteristic
functions, deﬁned by φX (t) = E[e^{it^T X}].

(e) For the univariate case, if X is a continuous random variable with PDF fX ,

there is an explicit inversion formula, namely

fX (x) = (1/2π) lim_{T→∞} ∫_{−T}^{T} e^{−itx} φX (t) dt,

for every x at which fX is differentiable. (Note the similarity with inversion

formulas for Fourier transforms.)

(f) The DCT remains valid for complex-valued random
variables (simply apply the DCT separately to the real and imaginary
parts). Thus, if limn→∞ Xn = X, a.s., then, for every t ∈ R,

lim_{n→∞} φ_{Xn}(t) = lim_{n→∞} E[e^{itXn}] = E[e^{itX}] = φX (t).

The DCT applies here, because the random variables |e^{itXn}| are bounded
by 1.

(g) Moments: we have

d^k/dt^k φX (t) |_{t=0} = i^k E[X^k ].

(As in the case of moment generating functions,
a formal justiﬁcation is needed.)

Two useful characteristic functions:

(a) Exponential: If fX (x) = λe^{−λx} , x ≥ 0, then

φX (t) = λ/(λ − it).

Note that this is the same as starting with MX (s) = λ/(λ − s) and replacing

s by it; however, this is not a valid proof. One must either use tools from

complex analysis (contour integration), or evaluate separately E[cos(tX)],

E[sin(tX)], which can be done using integration by parts.

(b) Normal: If X ∼ N (µ, σ^2 ), then

φX (t) = e^{itµ} e^{−t^2 σ^2 /2}.
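Both characteristic functions can be checked numerically without contour integration. The sketch below (truncation point and grid size are arbitrary) approximates E[e^{itX}] for an Exp(λ) random variable by a trapezoidal sum, using Python's complex arithmetic, and compares with λ/(λ − it):

```python
import cmath, math

lam, t = 1.5, 0.8

# numerically approximate E[e^{itX}] = ∫_0^∞ e^{itx} lam e^{-lam x} dx
n, upper = 60_000, 30.0
h = upper / n
total = 0 + 0j
for k in range(n + 1):
    x = k * h
    w = 0.5 if k in (0, n) else 1.0
    total += w * cmath.exp(1j * t * x) * lam * math.exp(-lam * x)
total *= h

exact = lam / (lam - 1j * t)
assert abs(total - exact) < 1e-4
```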



MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 17 11/5/2008

Contents

1. Deﬁnitions

2. Convergence in distribution

3. The hierarchy of convergence concepts

1 DEFINITIONS

Deﬁnition 1. We say that a sequence of random variables Xn converges to X
almost surely, and write Xn →a.s. X, if there is a (measurable) set A ⊂ Ω
such that:

(a) limn→∞ Xn (ω) = X(ω), for every ω ∈ A;

(b) P(A) = 1.

Note that for a.s. convergence to be relevant, all random variables need to

be deﬁned on the same probability space (one experiment). Furthermore, the

different random variables Xn are generally highly dependent.

Two common cases where a.s. convergence arises are the following.

(a) The probabilistic experiment runs over time. To each time n, we associate
a nonnegative random variable Zn (e.g., income on day n). Let Xn =
Σ_{k=1}^{n} Zk be the income on the ﬁrst n days. Let X = Σ_{k=1}^{∞} Zk be the
lifetime income. Note that X is well deﬁned (as an extended real number)
for every ω ∈ Ω, because of our assumption that Zk ≥ 0, and Xn →a.s. X.


(b) The various random variables are deﬁned as different functions of a single
underlying random variable. More precisely, suppose that Y is a random
variable, and let gn : R → R be measurable functions. Let Xn = gn (Y )
[which really means, Xn (ω) = gn (Y (ω)), for all ω]. Suppose that
limn→∞ gn (y) = g(y) for every y. Then, Xn →a.s. g(Y ). For example, let
gn (y) = y + y^2 /n, which converges to g(y) = y. We then have
Y + Y^2 /n →a.s. Y .

When Xn →a.s. X, we would like to conclude that

E[Xn ] → E[X].

This, however, is not always true; sufﬁcient conditions are provided by the
monotone and dominated convergence theorems. For an example where
convergence of expectations fails to hold, consider a random variable U which
is uniform on [0, 1], and let:

Xn = { n,  if U ≤ 1/n;
       0,  if U > 1/n.                (1)

We have

lim_{n→∞} E[Xn ] = lim_{n→∞} n P(U ≤ 1/n) = 1.

On the other hand, for any outcome ω for which U (ω) > 0 (which happens with
probability one), Xn (ω) converges to zero. Thus, Xn →a.s. 0, but E[Xn ] does not
converge to zero.
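The example in Eq. (1) is easy to reproduce. The sketch below (seed and sample sizes are arbitrary) checks both claims: along a fixed outcome U = u > 0, the sequence Xn(u) is eventually zero, while E[Xn] stays equal to 1:

```python
import random

random.seed(3)

# Eq. (1): Xn = n if U <= 1/n, else 0, where U ~ Uniform[0, 1]
def X(n, u):
    return n if u <= 1.0 / n else 0.0

# pathwise: for a fixed outcome u > 0, Xn(u) = 0 for all n > 1/u
u = random.random()
assert all(X(n, u) == 0.0 for n in range(int(1 / u) + 1, int(1 / u) + 100))

# in expectation: E[Xn] = n P(U <= 1/n) = 1 for every n
n, trials = 50, 400_000
est = sum(X(n, random.random()) for _ in range(trials)) / trials
assert abs(est - 1.0) < 0.1
```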

Deﬁnition 2. Let X and Xn , n ∈ N, be random variables with CDFs F and
Fn , respectively. We say that the sequence Xn converges to X in distribution,
and write Xn →d X, if

lim_{n→∞} Fn (x) = F (x),

for every x ∈ R at which F is continuous.

(a) Recall that CDFs have discontinuities (“jumps”) only at the points that have

positive probability mass. More precisely, F is continuous at x if and only

if P(X = x) = 0.

(b) Let Xn = 1/n, and X = 0, with probability 1. Note that FXn (0) =
P(Xn ≤ 0) = 0, for every n, but FX (0) = 1. Still, because of the exception
in the above deﬁnition, we have Xn →d X. More generally, if Xn = an
and X = a, with probability 1, and an → a, then Xn →d X. Thus,
convergence in distribution is consistent with the deﬁnition of convergence of
real numbers. This would not have been the case if the deﬁnition required the
condition limn→∞ Fn (x) = F (x) to hold at every x.

(c) Note that this deﬁnition just involves the marginal distributions of the
random variables involved. These random variables may even be deﬁned on

different probability spaces.

(d) Let Y be a continuous random variable whose PDF is symmetric around 0.

Let Xn = (−1)n Y . Then, every Xn has the same distribution, so, trivially,

Xn converges to Y in distribution. However, for almost all ω, the sequence

Xn (ω) does not converge.

(e) If we are dealing with random variables whose distribution is in a parametric
class (e.g., if every Xn is exponential with parameter λn ), and the parameters
converge (e.g., if λn → λ > 0 and X is exponential with parameter λ),

then we usually have convergence of Xn to X, in distribution.

(f) It is possible for a sequence of discrete random variables to converge in
distribution to a continuous one. For example, if Yn is uniform on {1, . . . , n}

and Xn = Yn /n, then Xn converges in distribution to a random variable

which is uniform on [0, 1].

(g) Similarly, it is possible for a sequence of continuous random variables to

converge in distribution to a discrete one. For example if Xn is uniform

on [0, 1/n], then Xn converges in distribution to a discrete random variable

which is identically equal to zero.

(h) If X and all Xn are continuous, convergence in distribution does not imply

convergence of the corresponding PDFs. (Find an example, by emulating

the example in (f).)

(i) If X and all Xn are integer-valued, convergence in distribution turns out

to be equivalent to convergence of the corresponding PMFs: pXn (k) →

pX (k), for all k.
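Remark (f) can be made concrete: if Yn is uniform on {1, ..., n} and Xn = Yn/n, then F_{Xn}(x) = floor(nx)/n on [0, 1], which is within 1/n of the Uniform[0, 1] CDF F(x) = x. A short sketch (added here; the evaluation grid is arbitrary):

```python
import math

# Yn uniform on {1, ..., n}, Xn = Yn / n; CDF of Xn on [0, 1] is floor(nx)/n
def F_n(x, n):
    if x < 0:
        return 0.0
    if x >= 1:
        return 1.0
    return math.floor(n * x) / n

for n in [10, 100, 10_000]:
    # maximum deviation from the Uniform[0, 1] CDF F(x) = x on a grid
    gap = max(abs(F_n(k / 1000, n) - k / 1000) for k in range(1001))
    assert gap <= 1.0 / n + 1e-12   # convergence at rate 1/n
```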

Deﬁnition 3.

(a) We say that a sequence of random variables Xn (not necessarily
deﬁned on the same probability space) converges in probability to a
real number c, and write Xn →i.p. c, if

lim_{n→∞} P(|Xn − c| ≥ ε) = 0, for every ε > 0.

(b) Suppose that X and Xn , n ∈ N, are all deﬁned on the same probability
space. We say that the sequence Xn converges to X, in probability, and
write Xn →i.p. X, if Xn − X converges to zero, in probability, i.e.,

lim_{n→∞} P(|Xn − X| ≥ ε) = 0, for every ε > 0.

(a) When X in part (b) of the deﬁnition is deterministic, say equal to some

constant c, then the two parts of the above deﬁnition are consistent with

each other.

i.p.

(b) The intuitive content of the statement Xn → c is that in the limit as n in

creases, almost all of the probability mass becomes concentrated in a small

interval around c, no matter how small this interval is. On the other hand,

for any ﬁxed n, there can be a small probability mass outside this interval,

with a slowly decaying tail. Such a tail can have a strong impact on expected

values. For this reason, convergence in probability does not have any im

plications on expected values. See for instance the example in Eq. (1). We

i.p.

have Xn → X, but E[Xn ] does not converge to E[X].

(c) If Xn →(i.p.) X and Yn →(i.p.) Y, and all random variables are defined on the same probability space, then (Xn + Yn) →(i.p.) (X + Y). (Can you prove it?)

2 CONVERGENCE IN DISTRIBUTION

The following result provides insights into the meaning of convergence in dis

tribution, by showing a close relation with almost sure convergence.


Theorem 1. Suppose that Xn →(d) X. Then, there exists a probability space and random variables Y, Yn defined on that space with the following properties:

(a) For every n, the random variables Xn and Yn have the same CDF; similarly, X and Y have the same CDF.

(b) Yn →(a.s.) Y.

Convergence in distribution does not depend on whether the random variables Xn are independent or not; they do not even need to be defined on the

same probability space. On the other hand, almost sure convergence implies a

strong form of dependence between the random variables involved. The idea in

the preceding theorem is to preserve the marginal distributions, but introduce a

particular form of dependence between the Xn , which then results in almost sure

convergence. This dependence is introduced by generating random variables Yn

and Y with the desired distributions, using a common random number generator,

e.g., a single random variable U , uniformly distributed on [0, 1].

If we assume that all CDFs involved are continuous and strictly increasing, then we can let Yn = F_{Xn}^{-1}(U) and Y = F_X^{-1}(U). This guarantees the first condition in the theorem. It then follows that Yn →(a.s.) Y, as discussed in item (b) in Section 1.1. For the case of more general CDFs, the argument is similar in spirit but takes more work; see [GS], p. 315.
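The coupling idea can be illustrated concretely. The sketch below (illustrative only) assumes exponential random variables Xn with parameters λn = λ + 1/n, whose CDFs are continuous and strictly increasing, and feeds a single uniform draw U through all the inverse CDFs; the coupled values then converge as n grows.

```python
# A sketch of the coupling in Theorem 1, for exponential CDFs
# F(x) = 1 - exp(-lam*x), whose inverse is F^{-1}(u) = -ln(1-u)/lam.
# One shared uniform draw U couples the whole sequence: Yn = F_{Xn}^{-1}(U)
# converges pointwise (hence almost surely) to Y = F_X^{-1}(U) when lam_n -> lam.
import math, random

def inv_cdf_exponential(u, lam):
    return -math.log(1.0 - u) / lam

random.seed(0)
lam = 2.0
u = random.random()                     # the single shared random draw
y = inv_cdf_exponential(u, lam)
errors = [abs(inv_cdf_exponential(u, lam + 1.0 / n) - y) for n in (1, 10, 100, 1000)]
print(errors)   # decreasing to 0: the coupled sequence converges for this omega
```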

Theorem 2. We have

[Xn →(a.s.) X] ⇒ [Xn →(i.p.) X] ⇒ [Xn →(d) X] ⇒ [φXn(t) → φX(t), ∀ t].

(The first two implications assume that all random variables are defined on the same probability space.)

Proof:

(a) [Xn →(a.s.) X] ⇒ [Xn →(i.p.) X]:

We give a short proof, based on the DCT, but more elementary proofs are also possible. Fix some ε > 0. Let

Yn = ε · I{|Xn − X| ≥ ε}.

If Xn →(a.s.) X, then Yn →(a.s.) 0. By the DCT, E[Yn] → 0. On the other hand,

E[Yn] = ε · P(|Xn − X| ≥ ε).

This implies that P(|Xn − X| ≥ ε) → 0, and therefore, Xn →(i.p.) X.

(b) [Xn →(i.p.) X] ⇒ [Xn →(d) X]:

The proof is omitted; see, e.g., [GS].

(c) [Xn →(d) X] ⇒ [φXn(t) → φX(t), ∀ t]:

Suppose that Xn →(d) X. Let Yn and Y be as in Theorem 1, so that Yn →(a.s.) Y. Then, for any t ∈ R,

lim_{n→∞} φXn(t) = lim_{n→∞} φYn(t) = lim_{n→∞} E[e^{itYn}] = E[e^{itY}] = φY(t) = φX(t),

where we have made use of the facts Yn →(a.s.) Y, e^{itYn} →(a.s.) e^{itY}, and the DCT.

We now examine whether the converses of the implications in Theorem 2 hold. For the first two, the answer is, in general, “no”, although we will also note some exceptions.

3.1 Convergence in probability versus almost sure convergence

[Xn →(i.p.) X] does not imply [Xn →(a.s.) X]:

Let Xn be equal to 1, with probability 1/n, and equal to zero otherwise. Suppose that the Xn are independent. We have Xn →(i.p.) 0. On the other hand, by the Borel-Cantelli lemma, the event {Xn = 1, i.o.} has probability 1. Thus, for almost all ω, the sequence Xn(ω) does not converge to zero.

Nevertheless, a weaker form of the converse implication turns out to be true. If Xn →(i.p.) X, then there exists an increasing (deterministic) sequence nk of integers, such that lim_{k→∞} Xnk = X, a.s. (We omit the proof.)

For an illustration of the last statement in action, consider the preceding counterexample. If we let nk = k², then we note that P(Xnk ≠ 0) = 1/k², which is summable. By the Borel-Cantelli lemma, the event {Xnk ≠ 0} occurs for only finitely many k, with probability 1. Therefore, Xnk converges, a.s., to the zero random variable.


3.2 Convergence in probability versus in distribution

Convergence in probability implies convergence in distribution. The converse turns out to be false in general, but true when the limit is deterministic.

[Xn →(d) X] does not imply [Xn →(i.p.) X]:

Let the random variables X, Xn be i.i.d. and nonconstant, in which case we have (trivially) Xn →(d) X. Fix some ε > 0. Then, P(|Xn − X| ≥ ε) is positive and the same for all n, which shows that Xn does not converge to X, in probability.

[Xn →(d) c] implies [Xn →(i.p.) c]:

The proof is omitted; see, e.g., [GS].

Finally, the converse of the last implication in Theorem 2 is always true. We

know that equality of two characteristic functions implies equality of the corre

sponding distributions. It is then plausible to hope that “near-equality” of char

acteristic functions implies “near equality” of corresponding distributions. This

would be essentially a statement that the mapping from characteristic functions

to distributions is a continuous one. Even though such a result is plausible, its

proof is beyond our scope.

Theorem 3. Let X and Xn, n ∈ N, be random variables with given CDFs and corresponding characteristic functions. We have

[φXn(t) → φX(t), ∀ t] ⇒ [Xn →(d) X].

The preceding theorem involves two separate conditions: (i) the sequence

of characteristic functions φXn converges (pointwise), and (ii) the limit is the

characteristic function associated with some other random variable. If we are

only given the ﬁrst condition (pointwise convergence), how can we tell if the

limit is indeed a legitimate characteristic function associated with some random

variable? One way is to check for various properties that every legitimate characteristic function must possess. One such property is continuity: if t → t∗, then (using dominated convergence),

lim_{t→t∗} φX(t) = lim_{t→t∗} E[e^{itX}] = E[e^{it∗X}] = φX(t∗).

7

Theorem 4. (Continuity of inverse transforms) Let Xn be random variables with characteristic functions φXn, and suppose that the limit φ(t) = lim_{n→∞} φXn(t) exists for every t. Then, either

(i) the function φ is discontinuous at zero (in this case Xn does not converge in distribution); or

(ii) there exists a random variable X whose characteristic function is φ, and Xn →(d) X.

Example: Let λn be positive parameters, and assume that Xn is exponential with parameter λn, so that φXn(t) = λn/(λn − it).

(a) If λn converges to some λ > 0, then the sequence of characteristic functions φXn converges to the function φ defined by φ(t) = λ/(λ − it). We recognize this as the characteristic function of an exponential distribution with parameter λ. In particular, we conclude that Xn converges in distribution to an exponential random variable with parameter λ.

(b) Suppose now that λn converges to zero. Then,

lim_{n→∞} φXn(t) = lim_{n→∞} λn/(λn − it) = lim_{λ↓0} λ/(λ − it) = 1, if t = 0, and 0, if t ≠ 0.

The limit function is discontinuous at zero, and Xn does not converge in distribution. Intuitively, this is because the distribution of Xn keeps spreading in a manner that does not yield a limiting distribution.


MIT OpenCourseWare

http://ocw.mit.edu

Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 18 11/12/2008

Contents

1. Useful inequalities

2. The weak law of large numbers

3. The central limit theorem

1 USEFUL INEQUALITIES

Markov inequality: If X ≥ 0 and a > 0, then P(X ≥ a) ≤ E[X]/a.

Proof: Let I be the indicator function of the event {X ≥ a}. Then, aI ≤ X. Taking expectations of both sides, we obtain aP(X ≥ a) ≤ E[X], which is the claimed result.

Chebyshev inequality: P(|X − E[X]| ≥ ε) ≤ var(X)/ε².

Proof: Apply the Markov inequality to the random variable |X − E[X]|², with a = ε².
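Both inequalities can be sanity-checked numerically. The sketch below (an illustration, not part of the notes) uses X ~ exponential(1), for which E[X] = 1, var(X) = 1, and the tail P(X ≥ a) = e^{−a} is available in closed form.

```python
# Numerical sanity check of Markov and Chebyshev for X ~ exponential(1):
# E[X] = 1, var(X) = 1, and P(X >= a) = exp(-a) exactly.
import math

for a in (1.0, 2.0, 5.0):
    tail = math.exp(-a)                # exact P(X >= a)
    assert tail <= 1.0 / a             # Markov: P(X >= a) <= E[X]/a

for eps in (1.5, 2.0, 3.0):
    # for eps >= 1, P(|X - 1| >= eps) = P(X >= 1 + eps) = exp(-(1 + eps))
    spread = math.exp(-(1.0 + eps))
    assert spread <= 1.0 / eps ** 2    # Chebyshev: var(X)/eps^2
print("Markov and Chebyshev hold in these cases")
```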

2 THE WEAK LAW OF LARGE NUMBERS

Intuitively, the expectation of a random variable should be the average observed in an infinite repetition of the same experiment. If so, the observed average in a finite number of repetitions (which is called the sample mean) should approach the expectation, as the number of repetitions increases. This is a vague statement, which is made more precise by so-called laws of large numbers.

Theorem: (Weak law of large numbers) Let X1, X2, . . . be i.i.d. random variables, and assume that E[|X1|] < ∞. Let Sn = X1 + · · · + Xn. Then,

Sn/n →(i.p.) E[X1].

This is called the “weak law” in order to distinguish it from the “strong law” of large numbers, which asserts, under the same assumptions, that Sn/n →(a.s.) E[X1]. Of course, since almost sure convergence implies convergence in probability, the strong law implies the weak law. On the other hand, the weak law can be easier to prove, especially in the presence of additional assumptions. Indeed, in the special case where the Xi have mean µ and finite variance, Chebyshev's inequality yields, for every ε > 0,

P(|(Sn/n) − µ| ≥ ε) ≤ var(Sn/n)/ε² = var(X1)/(nε²),

which converges to zero, as n → ∞, thus establishing convergence in probability.
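The Chebyshev-based argument can be illustrated with a small Monte Carlo experiment (a sketch; the parameters and sample sizes are arbitrary choices): for Xi ~ Uniform[0,1], the frequency of deviations |Sn/n − µ| ≥ ε should fall well below the bound var(X1)/(nε²) and shrink as n grows.

```python
# Monte Carlo illustration of the weak law: for X_i ~ Uniform[0,1],
# mu = 1/2 and var(X_1) = 1/12, so P(|S_n/n - 1/2| >= eps) <= 1/(12*n*eps^2).
import random

random.seed(1)
eps, trials = 0.05, 2000

def deviation_freq(n):
    count = 0
    for _ in range(trials):
        mean = sum(random.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= eps:
            count += 1
    return count / trials

freqs = [deviation_freq(n) for n in (10, 100, 1000)]
print(freqs)   # decreasing toward zero as n grows
```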

Before we proceed to the proof for the general case, we note two important

facts that we will use.

(a) First-order Taylor series expansion. Let g : R → R be a function that has a derivative at zero, denoted by d, and let h be the function that represents the error in a first-order Taylor series approximation: g(ε) = g(0) + dε + h(ε). Then,

d = lim_{ε→0} (g(ε) − g(0))/ε = lim_{ε→0} (dε + h(ε))/ε = d + lim_{ε→0} h(ε)/ε,

so that lim_{ε→0} h(ε)/ε = 0. A function h with this property is often written as o(ε). This discussion also applies to complex-valued functions, by considering separately the real and imaginary parts.

(b) A classical sequence. Recall the well-known fact

lim_{n→∞} (1 + a/n)^n = e^a,  a ∈ R.   (1)

We note (without proof) that this fact remains true even when a is a complex number. Furthermore, with little additional work, it can be shown that if {an} is a sequence of complex numbers that converges to a, then

lim_{n→∞} (1 + an/n)^n = e^a.
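Both versions of the limit are easy to check numerically; the sketch below (illustrative only) uses a purely imaginary a and the convergent sequence an = a + 1/n, both arbitrary choices.

```python
# Numeric check of (1 + a_n/n)^n -> e^a for a complex exponent, with
# a_n = a + 1/n converging to a = i.
import cmath

a = 1j
for n in (10, 1000, 100000):
    a_n = a + 1.0 / n
    value = (1 + a_n / n) ** n
    print(n, abs(value - cmath.exp(a)))   # error shrinks as n grows
```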

Proof of the weak law: Using the assumption that the Xi are independent, and the fact that the derivative of φX1 at t = 0 equals iµ, the characteristic function of Sn/n is of the form

φn(t) = (E[e^{itX1/n}])^n = (φX1(t/n))^n = (1 + iµt/n + o(t/n))^n,

where the function o satisfies lim_{ε→0} o(ε)/ε = 0. Therefore,

lim_{n→∞} φn(t) = e^{iµt},

which is the characteristic function of a random variable which is equal to µ, with probability one.

Applying Theorem 3 from the previous lecture (continuity of inverse trans

forms), we conclude that Sn /n converges to µ, in distribution. Furthermore,

as mentioned in the previous lecture, convergence in distribution to a constant

implies convergence in probability.

Remark: It turns out that the assumption E[|X1 |] < ∞ can be relaxed, although

not by much. Suppose that the distribution of X1 is symmetric around zero. It is

known that Sn/n → 0, in probability, if and only if lim_{n→∞} nP(|X1| > n) = 0. There exist distributions that satisfy this condition even though E[|X1|] = ∞. On the other hand, it can be shown that any such distribution satisfies E[|X1|^{1−ε}] < ∞, for every ε > 0, so the condition lim_{n→∞} nP(|X1| > n) = 0 is not much weaker than the assumption of a finite mean.

3 THE CENTRAL LIMIT THEOREM

Suppose that X1, X2, . . . are i.i.d. with common (and finite) mean µ and variance σ². Let Sn = X1 + · · · + Xn. The central limit theorem (CLT) asserts that

(Sn − nµ)/(σ√n)

converges in distribution to a standard normal random variable. For a discussion of the uses of the central limit theorem, see the handout from [BT] (pages 388-394).


Proof of the CLT: For simplicity, suppose that the random variables Xi have zero mean and unit variance. Finiteness of the first two moments of X1 implies that φX1(t) is twice differentiable at zero. The first derivative at zero equals iE[X] (assumed zero), and the second derivative equals −E[X²] (assumed equal to one), so we can write

φX(t) = 1 − t²/2 + o(t²),

where o(t²) indicates a function such that o(t²)/t² → 0, as t → 0. The characteristic function of Sn/√n is of the form

(φX(t/√n))^n = (1 − t²/(2n) + o(t²/n))^n.

For any fixed t, the limit as n → ∞ is e^{−t²/2}, which is the characteristic function φZ of a standard normal random variable Z. Since φ_{Sn/√n}(t) → φZ(t) for every t, we conclude that Sn/√n converges to Z, in distribution.
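The theorem can be visualized with a quick simulation (a sketch with arbitrary illustrative parameters): standardize sums of uniform random variables and compare the empirical CDF with the standard normal CDF Φ.

```python
# Monte Carlo illustration of the CLT: X_i ~ Uniform[0,1] have mu = 1/2 and
# sigma^2 = 1/12; we compare P((S_n - n*mu)/(sigma*sqrt(n)) <= z) against
# Phi(z) = (1 + erf(z/sqrt(2)))/2.
import math, random

random.seed(2)
n, trials = 30, 20000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

samples = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    samples.append((s - n * mu) / (sigma * math.sqrt(n)))

for z in (-1.0, 0.0, 1.0):
    empirical = sum(1 for x in samples if x <= z) / trials
    print(z, round(empirical, 3), round(phi(z), 3))   # close to each other
```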

The central limit theorem, as stated above, does not give any information on

the PDF or PMF of Sn . However, some further reﬁnements are possible, under

some additional assumptions. We state, without proof, two such results.

(a) Suppose that ∫ |φX1(t)|^r dt < ∞, for some positive integer r. Then, for all sufficiently large n, the random variable (Sn − nµ)/(σ√n) has a PDF fn, and fn converges pointwise to the standard normal PDF:

lim_{n→∞} fn(z) = (1/√(2π)) e^{−z²/2},  ∀ z.

In fact, convergence is uniform over all z:

lim_{n→∞} sup_z |fn(z) − (1/√(2π)) e^{−z²/2}| = 0.

(b) Suppose that each Xi is a discrete random variable that takes values of the form a + kh, where a and h are constants, and k ranges over the integers. Suppose furthermore that Xi has zero mean and unit variance. Then, for any z of the form z = (na + kh)/√n (these are the possible values of Sn/√n), we have

lim_{n→∞} (√n/h) P(Sn/√n = z) = (1/√(2π)) e^{−z²/2}.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 19 11/17/2008

Contents

1. The strong law of large numbers

2. The Chernoff bound

1 THE STRONG LAW OF LARGE NUMBERS

While the weak law of large numbers establishes convergence of the sample mean, in probability, the strong law establishes almost sure convergence.

Before we proceed, we point out two common methods for proving almost

sure convergence.

Proposition: Let {Xn} be a sequence of random variables, not necessarily independent.

(i) If Σ_{n=1}^∞ E[|Xn|^s] < ∞ for some s > 0, then Xn →(a.s.) 0.

(ii) If Σ_{n=1}^∞ P(|Xn| > ε) < ∞, for every ε > 0, then Xn →(a.s.) 0.

Proof. (i) By the monotone convergence theorem, we obtain E[Σ_{n=1}^∞ |Xn|^s] < ∞, which implies that the random variable Σ_{n=1}^∞ |Xn|^s is finite, with probability 1. Therefore, |Xn|^s →(a.s.) 0, which also implies that Xn →(a.s.) 0.

(ii) Setting ε = 1/k, for any positive integer k, the Borel-Cantelli lemma shows that the event {|Xn| > 1/k} occurs only a finite number of times, with probability 1. Thus, P(lim sup_{n→∞} |Xn| > 1/k) = 0, for every positive integer k. Note that the sequence of events {lim sup_{n→∞} |Xn| > 1/k} is monotone and converges to the event {lim sup_{n→∞} |Xn| > 0}. The continuity of probability measures implies that P(lim sup_{n→∞} |Xn| > 0) = 0. This establishes that Xn →(a.s.) 0.


Theorem 1: Let X, X1 , X2 , . . . be i.i.d. random variables, and assume that

E[|X|] < ∞. Let Sn = X1 + · · · + Xn . Then, Sn /n converges almost surely

to E[X].

Proof, assuming finite fourth moments. Let us make the additional assumption that E[X⁴] < ∞, and, without loss of generality, that E[X] = 0. Note that the fourth-moment assumption implies E[|X|] < ∞: indeed, using the inequality |x| ≤ 1 + x⁴, we obtain E[|X|] ≤ 1 + E[X⁴] < ∞. We will show that

Σ_{n=1}^∞ E[(X1 + · · · + Xn)⁴/n⁴] < ∞.

We have

E[(X1 + · · · + Xn)⁴/n⁴] = (1/n⁴) Σ_{i1=1}^n Σ_{i2=1}^n Σ_{i3=1}^n Σ_{i4=1}^n E[X_{i1} X_{i2} X_{i3} X_{i4}].

Let us consider the various terms in this sum. If one of the indices is different from all of the other indices, the corresponding term is equal to zero. For example, if i1 is different from i2, i3, and i4, the assumption E[Xi] = 0 yields E[X_{i1} X_{i2} X_{i3} X_{i4}] = E[X_{i1}] · E[X_{i2} X_{i3} X_{i4}] = 0.

Therefore, the nonzero terms in the above sum are either of the form E[Xi⁴] (there are n such terms), or of the form E[Xi² Xj²], with i ≠ j. Let us count the number of terms of the second type. Such terms are obtained in three different ways: by setting i1 = i2 ≠ i3 = i4, or by setting i1 = i3 ≠ i2 = i4, or by setting i1 = i4 ≠ i2 = i3. For each one of these three ways, we have n choices for the first pair of indices, and n − 1 choices for the second pair. We conclude that there are 3n(n − 1) terms of this type. Thus,

E[(X1 + · · · + Xn)⁴/n⁴] = (n E[X1⁴] + 3n(n − 1) E[X1² X2²])/n⁴.

Using the inequality xy ≤ (x² + y²)/2, we obtain E[X1² X2²] ≤ E[X1⁴], and

E[(X1 + · · · + Xn)⁴/n⁴] ≤ (n + 3n(n − 1)) E[X1⁴]/n⁴ ≤ 3n² E[X1⁴]/n⁴ = 3E[X1⁴]/n².


It follows that

Σ_{n=1}^∞ E[(X1 + · · · + Xn)⁴/n⁴] = Σ_{n=1}^∞ (1/n⁴) E[(X1 + · · · + Xn)⁴] ≤ Σ_{n=1}^∞ 3E[X1⁴]/n² < ∞,

where the last step uses the well-known property Σ_{n=1}^∞ n^{−2} < ∞. This implies that (X1 + · · · + Xn)⁴/n⁴ converges to zero with probability 1, and therefore, (X1 + · · · + Xn)/n also converges to zero with probability 1, which is the strong law of large numbers.

For the more general case where the mean of the random variables Xi is nonzero, the preceding argument establishes that (X1 + · · · + Xn − nE[X1])/n converges to zero, which is the same as (X1 + · · · + Xn)/n converging to E[X1], with probability 1.

Proof, assuming finite second moments. We now consider the case where we only assume that E[X²] < ∞. Let µ = E[X]. We have

E[((Sn/n) − µ)²] = var(Sn/n) = var(X)/n.

If we only consider values of n that are perfect squares, we obtain

Σ_{i=1}^∞ E[((S_{i²}/i²) − µ)²] = Σ_{i=1}^∞ var(X)/i² < ∞,

which implies that (S_{i²}/i²) − µ converges to zero, with probability 1.

Suppose that the random variables Xi are nonnegative. Consider some n such that i² ≤ n < (i + 1)². We then have S_{i²} ≤ Sn ≤ S_{(i+1)²}. It follows that

S_{i²}/(i + 1)² ≤ Sn/n ≤ S_{(i+1)²}/i²,

or

(i²/(i + 1)²) · (S_{i²}/i²) ≤ Sn/n ≤ ((i + 1)²/i²) · (S_{(i+1)²}/(i + 1)²).

As n → ∞, we also have i → ∞. Since i/(i + 1) → 1, and since S_{i²}/i² converges to E[X], with probability 1, we see that for almost all sample points, Sn/n is sandwiched between two sequences that converge to E[X]. This proves that Sn/n → E[X], with probability 1.

For a general random variable X, we write it in the form X = X⁺ − X⁻, where X⁺ and X⁻ are nonnegative. The strong law applied to X⁺ and X⁻ separately implies the strong law for X as well.


The proof for the most general case (ﬁnite mean, but possibly inﬁnite vari

ance) is omitted. It involves truncating the distribution of X, so that its moments

are all ﬁnite, and then verifying that the “errors” due to such truncation are not

signiﬁcant in the limit.
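A single simulated trajectory illustrates the almost sure statement (a sketch with arbitrary parameters): along one sample path of fair die rolls, the running means Sn/n settle down near E[X] = 3.5 and stay there.

```python
# One-trajectory illustration of the strong law of large numbers:
# running means S_n/n of fair die rolls along a single sample path.
import random

random.seed(3)
total, path = 0.0, []
for n in range(1, 100001):
    total += random.randint(1, 6)          # one die roll, E[X] = 3.5
    if n in (10, 100, 1000, 10000, 100000):
        path.append(total / n)
print(path)   # running means approaching 3.5
```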

2 THE CHERNOFF BOUND

Let X, X1, X2, . . . be i.i.d. random variables, and assume, for simplicity, that E[X] = 0. According to the weak law of large numbers, we know that P(Sn ≥ na) → 0, for every a > 0. We are interested in a more detailed estimate of P(Sn ≥ na), involving the rate at which this probability converges to zero. It turns out that if the moment generating function of X is finite on some interval [0, c] (where c > 0), then P(Sn ≥ na) decays exponentially with n, and much is known about the precise rate of exponential decay.

Let M(s) = E[e^{sX}], and assume that M(s) < ∞, for s ∈ [0, c], where c > 0. Recall that MSn(s) = E[e^{s(X1+···+Xn)}] = (M(s))^n. For any s > 0, the Markov inequality yields

P(Sn ≥ na) = P(e^{sSn} ≥ e^{nsa}) ≤ e^{−nsa} E[e^{sSn}] = e^{−nsa} (M(s))^n.

Every nonnegative value of s gives us a particular bound on P(Sn ≥ na). To obtain the tightest possible bound, we minimize over s, and obtain the following result.

Theorem 2. (Chernoff upper bound) Suppose that E[e^{sX}] < ∞ for some s > 0, and that a > 0. Then,

P(Sn ≥ na) ≤ e^{−nφ(a)},

where

φ(a) = sup_{s≥0} (sa − log M(s)).

For s = 0, we have

sa − log M(s) = 0 − log 1 = 0,

where we have used the generic property M(0) = 1. Furthermore,

(d/ds)(sa − log M(s))|_{s=0} = a − M′(0)/M(0) = a − E[X] > 0,

since M(0) = 1, M′(0) = E[X], and E[X] = 0 < a.

Since the function sa − log M (s) is zero and has a positive derivative at s = 0,

it must be positive when s is positive and small. It follows that the supremum

φ(a) of the function sa − log M (s) over all s ≥ 0 is also positive. In particular,

for any ﬁxed a > 0, the probability P(Sn ≥ na) decays at least exponentially

fast with n.

Example: For a standard normal random variable X, we have M(s) = e^{s²/2}. Therefore, sa − log M(s) = sa − s²/2. To maximize this expression over all s ≥ 0, we form the derivative, which is a − s, and set it to zero, resulting in s = a. Thus, φ(a) = a²/2, which leads to the bound

P(X ≥ a) ≤ e^{−a²/2}.
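For the standard normal, the exact tail is available through the complementary error function, so the bound can be compared directly (a sketch; erfc comes from Python's standard library).

```python
# Comparing the Chernoff bound exp(-a^2/2) with the exact standard normal
# tail P(X >= a) = erfc(a/sqrt(2))/2.
import math

for a in (0.5, 1.0, 2.0, 3.0):
    exact = 0.5 * math.erfc(a / math.sqrt(2.0))
    bound = math.exp(-a ** 2 / 2.0)
    print(a, round(exact, 5), round(bound, 5))
    assert exact <= bound          # the bound is always valid
```

As a grows, log(exact) and log(bound) both behave like −a²/2, which is the sense in which the decay-rate estimate φ(a) is tight.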

Remarkably, it turns out that the estimate φ(a) of the decay rate is tight, under

minimal assumptions. To keep the argument simple, we introduce some simpli

fying assumptions.

Assumption 1.

(i) M (s) = E[esX ] < ∞, for all s ∈ R.

(ii) The random variable X is continuous, with PDF fX .

(iii) The random variable X does not admit ﬁnite upper and lower bounds.

(Formally, 0 < FX (x) < 1, for all x ∈ R.)

Theorem 3. Under Assumption 1, we have

lim_{n→∞} (1/n) log P(Sn ≥ na) = −φ(a),   (1)

for every a > 0.

The proof of the matching lower bound relies on the following two properties, whose derivation is left as an exercise:

(a) lim_{s→∞} log M(s)/s = ∞;

(b) M(s) is differentiable at every s.

The first property guarantees that for any a > 0 we have lim_{s→∞} (log M(s) − sa) = ∞. Since M(s) > 0 for all s, and since M(s) is differentiable, it follows that log M(s) is also differentiable and that there exists some s∗ ≥ 0 at which log M(s) − sa is minimized over all s ≥ 0. Taking derivatives, we see that such an s∗ satisfies a = M′(s∗)/M(s∗), where M′ stands for the derivative of M. In particular,

φ(a) = s∗a − log M(s∗).   (2)

Let us introduce a new PDF

fY(x) = (e^{s∗x}/M(s∗)) fX(x).

This is a legitimate PDF because

∫ fY(x) dx = (1/M(s∗)) ∫ e^{s∗x} fX(x) dx = (1/M(s∗)) · M(s∗) = 1.

The moment generating function associated with the new PDF is

MY(s) = (1/M(s∗)) ∫ e^{sx} e^{s∗x} fX(x) dx = M(s + s∗)/M(s∗).

Thus,

E[Y] = (1/M(s∗)) · (d/ds) M(s + s∗)|_{s=0} = M′(s∗)/M(s∗) = a,

where the last equality follows from our definition of s∗. The distribution of Y is called a “tilted” version of the distribution of X.
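The tilting construction can be verified numerically in a case where everything is explicit. The sketch below takes X standard normal, for which M(s) = e^{s²/2}, picks the arbitrary illustrative value s∗ = 1.3 (corresponding to a = M′(s∗)/M(s∗) = s∗ = 1.3), and checks by crude numerical integration that fY integrates to 1 and has mean a.

```python
# Numeric check of exponential tilting for a standard normal X:
# f_Y(x) = exp(s*x) f_X(x) / M(s*), with M(s) = exp(s^2/2), should
# integrate to 1 and have mean s* (here the tilted law is N(s*, 1)).
import math

s_star = 1.3                     # illustrative choice; plays the role of s*
M = math.exp(s_star ** 2 / 2.0)

def f_X(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def f_Y(x):
    return math.exp(s_star * x) * f_X(x) / M

# crude Riemann sums on a wide grid (tails beyond [-10, 10] are negligible)
dx = 0.001
xs = [-10.0 + i * dx for i in range(int(20.0 / dx) + 1)]
mass = sum(f_Y(x) for x in xs) * dx
mean = sum(x * f_Y(x) for x in xs) * dx
print(round(mass, 4), round(mean, 4))   # close to 1 and to s* = 1.3
```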

Let Y1 , . . . , Yn be i.i.d. random variables with PDF fY . Because of the

close relation between fX and fY , approximate probabilities of events involving

Y1 , . . . , Yn can be used to obtain approximate probabilities of events involving

X1 , . . . , Xn .

We keep assuming that a > 0, and fix some δ > 0. Let

B = { (x1, . . . , xn) : a − δ ≤ (1/n) Σ_{i=1}^n xi ≤ a + δ } ⊂ R^n,

and let Tn = (Y1, . . . , Yn).

P(Sn ≥ n(a − δ)) ≥ P(n(a − δ) ≤ Sn ≤ n(a + δ))

= ∫_{(x1,...,xn)∈B} fX(x1) · · · fX(xn) dx1 · · · dxn

= (M(s∗))^n ∫_{(x1,...,xn)∈B} e^{−s∗x1} fY(x1) · · · e^{−s∗xn} fY(xn) dx1 · · · dxn

≥ (M(s∗))^n e^{−ns∗(a+δ)} ∫_{(x1,...,xn)∈B} fY(x1) · · · fY(xn) dx1 · · · dxn

= (M(s∗))^n e^{−ns∗(a+δ)} P(Tn ∈ B).   (3)


The second inequality above was obtained because for every (x1, . . . , xn) ∈ B, we have x1 + · · · + xn ≤ n(a + δ), so that e^{−s∗x1} · · · e^{−s∗xn} ≥ e^{−ns∗(a+δ)}.

By the weak law of large numbers, we have

P(Tn ∈ B) = P((Y1 + · · · + Yn)/n ∈ [a − δ, a + δ]) → 1,

n

as n → ∞. Taking logarithms, dividing by n, and then taking the limit of the two sides of Eq. (3), and finally using Eq. (2), we obtain

lim inf_{n→∞} (1/n) log P(Sn ≥ n(a − δ)) ≥ log M(s∗) − s∗(a + δ) = −φ(a) − s∗δ.

Since this is true for every δ > 0, and φ is continuous, the lower bound in Eq. (1) follows.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 20 11/19/2008

Contents

1. The Bernoulli process

2. The Poisson process

We now turn to the study of some simple classes of stochastic processes. Exam

ples and a more leisurely discussion of this material can be found in the corre

sponding chapter of [BT].

A discrete-time stochastic process is a sequence of random variables {Xn} defined

on a common probability space (Ω, F, P). In more detail, a stochastic process is

a function X of two variables n and ω. For every n, the function ω �→ Xn (ω) is a

random variable (a measurable function). An alternative perspective is provided

by ﬁxing some ω ∈ Ω and viewing Xn (ω) as a function of n (a “time function,”

or “sample path,” or “trajectory”).

A continuous-time stochastic process is defined similarly, as a collection of random variables {Xt, t ≥ 0} defined on a common probability space (Ω, F, P).

1 THE BERNOULLI PROCESS

In the Bernoulli process, the random variables Xn are i.i.d. Bernoulli, with common parameter p ∈ (0, 1). The natural sample space in this case is Ω = {0, 1}^∞.

Let Sn = X1 + · · · + Xn (the number of “successes” or “arrivals” in n steps). The random variable Sn is binomial, with parameters n and p, so that

pSn(k) = (n choose k) p^k (1 − p)^{n−k},  k = 0, 1, . . . , n.


Let T1 be the time of the first success. Formally, T1 = min{n | Xn = 1}. We already know that T1 is geometric:

pT1(k) = (1 − p)^{k−1} p,  k = 1, 2, . . . ;   E[T1] = 1/p.
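These distributional facts are easy to confirm by simulation (a sketch; the parameters p = 0.3 and n = 20 are arbitrary illustrative choices).

```python
# Simulation of the Bernoulli process: S_n should have mean n*p
# (binomial), and the first arrival time T_1 mean 1/p (geometric).
import random

random.seed(4)
p, n, trials = 0.3, 20, 5000

mean_Sn = sum(sum(1 for _ in range(n) if random.random() < p)
              for _ in range(trials)) / trials

def first_success(p):
    k = 1
    while random.random() >= p:
        k += 1
    return k

mean_T1 = sum(first_success(p) for _ in range(trials)) / trials
print(round(mean_Sn, 2), round(mean_T1, 2))   # near n*p = 6.0 and 1/p = 3.33
```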

The Bernoulli process has a very special structure. The discussion below is

meant to capture some of its special properties in an abstract manner.

Consider a Bernoulli process {Xn }. Fix a particular positive integer m, and

let Yn = Xm+n . Then, {Yn } is the process seen by an observer who starts

watching the process {Xn } at time m + 1, as opposed to time 1. Clearly, the

process {Yn } also involves a sequence of i.i.d. Bernoulli trials, with the same pa

rameter p. Hence, it is also a Bernoulli process, and has the same distribution as

the process {Xn }. More precisely, for every k, the distribution of (Y1 , . . . , Yk )

is the same as the distribution of (X1, . . . , Xk). This property is called stationarity.

In fact, a stronger property holds. Namely, even if we are given the values of X1, . . . , Xn, the distribution of the process {Yn} does not change. Formally, for any measurable set A ⊂ Ω, we have

P((Xn+1, Xn+2, . . .) ∈ A | X1, . . . , Xn) = P((Xn+1, Xn+2, . . .) ∈ A) = P((X1, X2, . . .) ∈ A).

We refer to the first equality as a memorylessness property. (The second equality above is just a restatement of the stationarity property.)

We just discussed a situation where we start “watching” the process at some time

m + 1, where m is an integer constant. We next consider the case where we start

watching the process at some random time N + 1. So, let N be a nonnegative

integer random variable. Is the process {Yn } deﬁned by Yn = XN +n a Bernoulli

process with the same parameter? In general, this is not the case. For example,

if N = min{n | Xn+1 = 1}, then P(Y1 = 1) = P(XN +1 = 1) = 1 �= p. This

inequality is due to the fact that we chose the special time N by “looking into

the future” of the process; that was determined by the future value Xn+1 .

This motivates us to consider random variables N that are determined causally,

by looking only into the past and present of the process. Formally, a nonneg

ative random variable N is called a stopping time if, for every n, the occur

rence or not of the event {N = n} is completely determined by the values of


X1 , . . . , Xn . Even more formally, for every n, there exists a function hn such

that

I{N =n} = hn (X1 , . . . , Xn ).

We are now in a position to state a stronger version of the memorylessness property. If N is a stopping time, then for all n, we have

P((XN+1, XN+2, . . .) ∈ A | N = n, X1, . . . , Xn) = P((X1, X2, . . .) ∈ A).

In words, the process seen if we start watching right after a stopping time is also

Bernoulli with the same parameter p.

For k ≥ 1, let Yk be the kth arrival time. Formally, Yk = min{n | Sn = k}.

For convenience, we deﬁne Y0 = 0. The kth interarrival time is deﬁned as

Tk = Yk − Yk−1 .

We already mentioned that T1 is geometric. Note that T1 is a stopping time,

so the process (XT1 +1 , XT1 +2 , . . .) is also a Bernoulli process. Note that the

second interarrival time T2 in the original process is the first arrival time in

this new process. This shows that T2 is also geometric. Furthermore, the new

process is independent from (X1 , . . . , XT1 ). Thus, T2 (a function of the new

process) is independent from (X1 , . . . , XT1 ). In particular, T2 is independent

from T1 .

By repeating the above argument, we see that the interarrival times Tk are i.i.d. geometric. As a consequence, Yk is the sum of k i.i.d. geometric random variables, and its PMF can be found by repeated convolution. In fact, a simpler derivation is possible. For t = k, k + 1, . . ., we have

pYk(t) = P(St−1 = k − 1) · P(Xt = 1) = (t−1 choose k−1) p^{k−1} (1 − p)^{t−k} · p = (t−1 choose k−1) p^k (1 − p)^{t−k}.

Suppose that {Xn } and {Yn } are independent Bernoulli processes with param

eters p and q, respectively. Consider a “merged” process {Zn } which records


an arrival at time n if and only if one or both of the original processes record an

arrival. Formally,

Zn = max{Xn , Yn }.

The random variables Zn are i.i.d. Bernoulli, with parameter p + q − pq.

“Splitting” is, in some sense, the reverse of merging. We start from a single Bernoulli process {Zn} with parameter p. If there is an arrival at time n (i.e., Zn = 1), we flip an independent coin, with parameter q, and record an arrival of “type I” or “type II”, depending on the coin's outcome. Let {Xn} and {Yn}

be the processes of arrivals of the two different types. Formally, let {Un } be

a Bernoulli process with parameter q, independent from the original process

{Zn }. We then let

Xn = Zn · Un , Yn = Zn · (1 − Un ).

Note that the random variables Xn are i.i.d. Bernoulli, with parameter pq, so that

{Xn } is a Bernoulli process with parameter pq. Similarly, {Yn } is a Bernoulli

process with parameter p(1 − q). Note however that the two processes are de

pendent. In particular, P(Xn = 1 | Yn = 1) = 0 �= pq = P(Xn = 1).
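Merging and splitting can be checked by simulation (a sketch; the parameter values are arbitrary). Note in particular that the two split processes never record an arrival at the same time slot, which is the dependence just mentioned.

```python
# Merged process: Z_n = max(X_n, Y_n) is Bernoulli(p + q - p*q).
# Split process: X_n = Z_n * U_n is Bernoulli(p*q), and it can never
# record an arrival together with Y_n = Z_n * (1 - U_n).
import random

random.seed(5)
p, q, trials = 0.4, 0.25, 20000
merged = split_x = joint = 0
for _ in range(trials):
    x = random.random() < p
    y = random.random() < q
    merged += x or y                       # merging two independent processes
    z = random.random() < p                # splitting one Bernoulli(p) process
    u = random.random() < q                # independent coin of bias q
    split_x += z and u
    joint += (z and u) and (z and not u)   # simultaneous arrival: impossible
print(merged / trials, split_x / trials, joint)
```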

2 THE POISSON PROCESS

The Poisson process is a continuous-time analog of the Bernoulli process. The process starts at time zero, and involves a sequence of arrivals, at random times. It is described in terms of a collection of random variables N(t), for t ≥ 0, all defined on the same probability space, where N(0) = 0 and N(t), t > 0, represents the number of arrivals during the interval (0, t].

If we ﬁx a particular outcome (sample path) ω, we obtain a time function

whose value at time t is the realized value of N (t). This time function has dis

continuities (unit jumps) whenever an arrival occurs. Furthermore, this time

function is right-continuous: formally, lim_{τ↓t} N(τ) = N(t); intuitively, the value of N(t) incorporates the jump due to an arrival (if any) at time t.

We introduce some notation, analogous to the one used for the Bernoulli

process:

We also let

P (k; t) = P(N (t) = k).


The Poisson process, with parameter λ > 0, is deﬁned implicitly by the

following properties:

(a) The numbers of arrivals in disjoint intervals are independent. Formally,

if 0 < t1 < t2 < · · · < tk , then the random variables N (t1 ), N (t2 ) −

N (t1 ), . . . , N (tk ) − N (tk−1 ) are independent. This is an analog of the

independence of trials in the Bernoulli process.

(b) The probability distribution of the number of arrivals during an interval is determined by λ and the length of the interval. Formally, if t1 < t2, then P(N(t2) − N(t1) = k) = P(k; t2 − t1), for every k.

(c) There exist functions ok that satisfy

lim_{δ↓0} ok(δ)/δ = 0,

and

P(0; δ) = 1 − λδ + o1(δ),
P(1; δ) = λδ + o2(δ),
Σ_{k=2}^∞ P(k; δ) = o3(δ).

The ok functions are meant to capture second and higher order terms in a Taylor series approximation.

Let us ﬁx the parameter λ of the process, as well as some time t > 0. We wish

to derive a closed form expression for P (k; t). We do this by dividing the time

interval (0, t] into small intervals, using the assumption that the probability of

two or more arrivals in a small interval is negligible, and then approximate the

process by a Bernoulli process.

Having fixed t > 0, let us choose a large integer n, and let δ = t/n. We partition the interval [0, t] into n “slots” of length δ. The probability of at least one arrival during a particular slot is

p = 1 − P(0; δ) = λδ + o(δ) = λt/n + o(1/n),

for some function o that satisfies o(δ)/δ → 0.

We ﬁx k and deﬁne the following events:

A: exactly k arrivals occur in (0, t];

B: exactly k slots have one or more arrivals;

C: at least one of the slots has two or more arrivals.

The events A and B coincide unless event C occurs. We have

B ⊂ A ∪ C,  A ⊂ B ∪ C,

and, therefore,

P(B) − P(C) ≤ P(A) ≤ P(B) + P(C).

Note that

P(C) ≤ n · o3 (δ) = (t/δ) · o3 (δ),

which converges to zero, as n → ∞ or, equivalently, δ → 0. Thus, P(A), which

is the same as P (k; t) is equal to the limit of P(B), as we let n → ∞.

The number of slots that record an arrival is binomial, with parameters n and p = λt/n + o(1/n). Thus, using the binomial probabilities,

P(B) = (n choose k) (λt/n + o(1/n))^k (1 − λt/n + o(1/n))^{n−k}.

When we let n → ∞, essentially the same calculation as the one carried out in Lecture 6 shows that the right-hand side converges to the Poisson PMF, and

P(k; t) = ((λt)^k/k!) e^{−λt}.

This establishes that N (t) is a Poisson random variable with parameter λt, and

E[N (t)] = var(N (t)) = λt.
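The binomial-to-Poisson limit used above can be checked numerically (a sketch; the values of λ, t, and k are arbitrary illustrative choices).

```python
# Numeric check that the binomial "slot" probabilities with p = lam*t/n
# approach the Poisson PMF P(k; t) = (lam*t)^k e^{-lam*t} / k!.
import math

lam, t, k = 2.0, 1.5, 4
poisson = (lam * t) ** k * math.exp(-lam * t) / math.factorial(k)
for n in (10, 100, 10000):
    p = lam * t / n
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    print(n, round(binom, 6), "->", round(poisson, 6))
```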

In full analogy with the Bernoulli process, we will now argue that the interarrival

times Tk are i.i.d. exponential random variables.

We have

P(T1 > t) = P(N (t) = 0) = P (0; t) = e−λt .

We recognize this as an exponential tail probability; thus, T1 is an exponential random variable with parameter λ.

Let us now ﬁnd the joint PDF of the ﬁrst two interarrival times. We give a

heuristic argument, in which we ignore the probability of two or more arrivals

during a small interval and any o(δ) terms. Let t1 > 0, t2 > 0, and let δ be a

small positive number, with δ < t2 . We have

P(t1 ≤ T1 ≤ t1 + δ, t2 ≤ T2 ≤ t2 + δ)
≈ P(0; t1) · P(1; δ) · P(0; t2 − δ) · P(1; δ)
= e^{−λt1} · λδ · e^{−λ(t2−δ)} · λδ.

This shows that T2 is independent of T1 , and has the same exponential distribu

tion. This argument is easily generalized to argue that the random variables Tk

are i.i.d. exponential, with common parameter λ.

We will first find the joint PDF of Y1 and Y2, where Yk denotes the time of the k-th arrival. Suppose for simplicity that λ = 1, and let us fix some s and t that satisfy 0 < s ≤ t. We have

P(Y1 ≤ s, Y2 ≤ t) = P(N(s) ≥ 1, N(t) ≥ 2)
= P(N(s) = 1) P(N(t) − N(s) ≥ 1) + P(N(s) ≥ 2)
= se^{−s}(1 − e^{−(t−s)}) + (1 − e^{−s} − se^{−s})
= −se^{−t} + 1 − e^{−s}.

Differentiating, we obtain

fY1,Y2(s, t) = ∂²/∂t∂s P(Y1 ≤ s, Y2 ≤ t) = e^{−t},   0 ≤ s ≤ t.

We point out an interesting consequence: conditioned on Y2 = t, Y1 is

uniform on (0, t); that is given the time of the second arrival, all possible times

of the ﬁrst arrival are “equally likely.”
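As a sanity check of the formula P(Y1 ≤ s, Y2 ≤ t) = 1 − e^{−s} − se^{−t} (with λ = 1), one can simulate the first two arrival times as sums of exponentials. A sketch; the values of s, t and the sample size are arbitrary choices:

```python
import math
import random

rng = random.Random(3)
s, t, trials = 0.8, 1.5, 200000

hits = 0
for _ in range(trials):
    y1 = rng.expovariate(1.0)        # first arrival time
    y2 = y1 + rng.expovariate(1.0)   # second arrival time
    if y1 <= s and y2 <= t:
        hits += 1

estimate = hits / trials
exact = 1 - math.exp(-s) - s * math.exp(-t)   # the CDF derived above
```

With 200,000 samples the Monte Carlo estimate agrees with the closed form to about two decimal places.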

We now use the linear relations

T1 = Y1,   T2 = Y2 − Y1.

The determinant of the matrix involved in this linear transformation is equal to 1.

Thus, the Jacobian formula yields

fT1,T2(t1, t2) = fY1,Y2(t1, t1 + t2) = e^{−t1} e^{−t2},

confirming our earlier independence conclusion. Once more, this approach can be generalized to deal with more than two interarrival times, although the calculations become more complicated.

The characterization of the interarrival times leads to an alternative, but equiva

lent, way of describing the Poisson process. Start with a sequence of indepen

dent exponential random variables T1 , T2 ,. . ., with common parameter λ, and

record an arrival at times T1 , T1 + T2 , T1 + T2 + T3 , etc. It can be veriﬁed

that starting with this new deﬁnition, we can derive the properties postulated

in our original deﬁnition. Furthermore, this new deﬁnition, being constructive,

establishes that a process with the claimed properties does indeed exist.
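This constructive definition translates directly into a simulator; a minimal sketch (λ, t, and the number of trials are arbitrary choices):

```python
import random

def poisson_arrivals(lam, t, rng):
    """Arrival times in (0, t], built by summing i.i.d. Exp(lam) interarrival gaps."""
    times, s = [], 0.0
    while True:
        s += rng.expovariate(lam)
        if s > t:
            return times
        times.append(s)

rng = random.Random(0)
lam, t, trials = 2.0, 3.0, 20000
counts = [len(poisson_arrivals(lam, t, rng)) for _ in range(trials)]
mean_count = sum(counts) / trials    # should be close to E[N(t)] = lam * t = 6
```

The empirical mean of N(t) over many runs closely matches λt, as the Poisson characterization predicts.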

Since Yk is the sum of k i.i.d. exponential random variables, its PDF can be found by repeated convolution.

A second, somewhat heuristic, derivation proceeds as follows. If we ignore the possibility of two arrivals during a small interval, we have

P(y ≤ Yk ≤ y + δ) = P(k − 1; y) P(1; δ) = (λ^{k−1} y^{k−1} e^{−λy} / (k − 1)!) · λδ.

We divide by δ, and take the limit as δ ↓ 0, to obtain

fYk(y) = λ^k y^{k−1} e^{−λy} / (k − 1)!,   y > 0.

This is called a Gamma or Erlang distribution, with k degrees of freedom.

For an alternative derivation that does not rely on approximation arguments,

note that for a given y ≥ 0, the event {Yk ≤ y} is the same as the event

{number of arrivals in the interval [0, y] is at least k}.

Thus, the CDF of Yk is given by

FYk(y) = P(Yk ≤ y) = Σ_{n=k}^∞ P(n; y) = 1 − Σ_{n=0}^{k−1} P(n; y) = 1 − Σ_{n=0}^{k−1} (λy)^n e^{−λy} / n!.

The PDF is then obtained by differentiating the CDF and moving the differentiation inside the summation (this can be justified). After some straightforward calculation we obtain the Erlang PDF formula

fYk(y) = (d/dy) FYk(y) = λ^k y^{k−1} e^{−λy} / (k − 1)!.
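One can verify numerically that the Erlang PDF above integrates to the closed-form CDF; a small sketch using midpoint-rule integration with arbitrary parameter values:

```python
import math

def erlang_pdf(lam, k, y):
    """Erlang (Gamma with integer shape) density with k degrees of freedom."""
    return lam**k * y**(k - 1) * math.exp(-lam * y) / math.factorial(k - 1)

def erlang_cdf(lam, k, y):
    # F(y) = 1 - sum_{n=0}^{k-1} (lam*y)^n e^{-lam*y} / n!
    return 1 - sum((lam * y)**n * math.exp(-lam * y) / math.factorial(n)
                   for n in range(k))

lam, k, y = 1.5, 4, 2.0
steps = 100000
h = y / steps
# midpoint rule: integral of the PDF over [0, y]
integral = sum(erlang_pdf(lam, k, (i + 0.5) * h) for i in range(steps)) * h
```

The numerical integral of fYk over [0, y] matches FYk(y) to high precision.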

MIT OpenCourseWare

http://ocw.mit.edu

Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

6.436J/15.085J Fall 2008

Lecture 21 11/24/2008

Readings:

Posted excerpt from Chapter 6 of [BT, 2nd edition], as well as starred exercises

therein.

process. In particular, if we start watching a Poisson process at some time t∗ , or

at some causally determined random time S, the process we will see is a Poisson

process in its own right, and is independent from the past. This is an intuitive

consequence of the close relation of the Poisson and Bernoulli processes. We

will give a more precise statement below, but we will not provide a formal proof.

Nevertheless, we will use these properties freely in the sequel.

We ﬁrst introduce a suitable deﬁnition of the notion of a stopping time. A

nonnegative random variable S is called a stopping time if for every s ≥ 0, the

occurrence or not of the event {S ≤ s} can be determined from knowledge of

the values of the random variables N (t), t ≤ s.1

Example: The ﬁrst arrival time T1 is a stopping time. To see this, note that the

event {T1 ≤ s} can be written as {N (s) ≥ 1}; the occurrence of the latter event

can be determined from knowledge of the value of the random variable N (s).

The arrival process {M (t)} seen by an observer who starts watching at a

stopping time S is deﬁned by M (t) = N (S +t)−N (S). If S is a stopping time,

this new process is a Poisson process (with the same parameter λ). Furthermore,

the collection of random variables {M(t) | t ≥ 0} (the "future" after S) is independent from the collection of random variables {N(min{t, S}) | t ≥ 0} (the "past" until S).

1

In a more formal deﬁnition, we ﬁrst deﬁne the σ-ﬁeld Fs generated by all events of the

form {N (t) = k}, as t ranges over [0, s] and k ranges over the integers. We then require that

{S ≤ s} ∈ Fs , for all s ≥ 0.
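The fresh-start property rests on the memorylessness of the exponential interarrival distribution, P(T > s + t | T > s) = P(T > t). A quick empirical check (all parameter values are arbitrary choices):

```python
import math
import random

rng = random.Random(42)
lam, s, t = 1.0, 0.7, 1.2
samples = [rng.expovariate(lam) for _ in range(200000)]

# conditional tail P(T > s+t | T > s) versus unconditional tail P(T > t)
survive_s = [x for x in samples if x > s]
cond_tail = sum(1 for x in survive_s if x > s + t) / len(survive_s)
plain_tail = sum(1 for x in samples if x > t) / len(samples)
# both should be close to exp(-lam * t)
```

Both empirical tails agree with e^{−λt}, illustrating why an observer who starts watching at time S sees a statistically fresh exponential clock.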



Fundamentals of probability.

6.436/15.085

LECTURE 23

Markov chains

23.1. Introduction

Recall a model we considered earlier: the random walk. We have Xn =d Be(p), i.i.d. Then Sn = Σ_{1≤j≤n} Xj was defined to be a simple random walk. One of its key properties is that the distribution of Sn+1 conditioned on the state Sn = x at time n is independent from the past history, namely Sm, m ≤ n − 1. To see this formally, note

P(Sn+1 = y | Sn = x, Sn−1 = z1, . . . , S1 = zn−1)
= P(Xn+1 = y − x, Sn = x, Sn−1 = z1, . . . , S1 = zn−1) / P(Sn = x, Sn−1 = z1, . . . , S1 = zn−1)
= P(Xn+1 = y − x) P(Sn = x, Sn−1 = z1, . . . , S1 = zn−1) / P(Sn = x, Sn−1 = z1, . . . , S1 = zn−1)
= P(Xn+1 = y − x),

where the second equality follows from the independence assumption for the sequence Xn , n ≥ 1.

A similar derivation gives P(Sn+1 = y|Sn = x) = P(Xn+1 = y − x) and we get the required

equality: P(Sn+1 = y|Sn = x, Sn−1 = z1 , . . . , S1 = zn−1 ) = P(Sn+1 = y|Sn = x).

Deﬁnition 23.1. A discrete time stochastic process (Xn , n ≥ 1) is deﬁned to be a Markov chain

if it takes values in some countable set X , and for every x1 , x2 , . . . , xn ∈ X it satisﬁes the property

P(Xn = xn |Xn−1 = xn−1 , Xn−2 = xn−2 , . . . , X1 = x1 ) = P(Xn = xn |Xn−1 = xn−1 )

The elements of X are called states. We say that the Markov chain is in state s ∈ X at time

n if Xn = s. Mostly we will consider the case when X is ﬁnite. In this case we call Xn a ﬁnite

state Markov chain and, without loss of generality, we will assume that X = {1, 2, . . . , N}.

Let us establish some properties of Markov chains.

Proposition 1. Let Xn, n ≥ 1, be a Markov chain. Then:

(a) For every collection of states s, x1, x2, . . . , xn−1 and every m,

P(Xn+m = s | Xn−1 = xn−1, . . . , X1 = x1) = P(Xn+m = s | Xn−1 = xn−1).

(b) For every k < n, the past and the future are conditionally independent given Xk = xk:

P(Xn = xn, . . . , Xk+1 = xk+1, Xk−1 = xk−1, . . . , X1 = x1 | Xk = xk)
= P(Xn = xn, Xn−1 = xn−1, . . . , Xk+1 = xk+1 | Xk = xk) P(Xk−1 = xk−1, . . . , X1 = x1 | Xk = xk).

Proof. Exercise. �

23.2. Examples

We already have an example of a Markov chain: the random walk.

Consider now the following example (Exercise 2, Section 6.1 [2]). Suppose we roll a die repeatedly and Xn is the number of sixes we have seen so far. Then Xn is a Markov chain with P(Xn = x + 1 | Xn−1 = x) = 1/6, P(Xn = x | Xn−1 = x) = 5/6, and P(Xn = y | Xn−1 = x) = 0 for all y ≠ x, x + 1. Note that we can think of Xn as a random walk, where the transition to the right occurs with probability 1/6 and the transition to the same state with probability 5/6.

Also, let Xn be the largest outcome seen so far. Then Xn is again a Markov chain. What are

its transition probabilities?
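One way to tabulate the answer: from state x, the maximum stays at x when the roll is at most x (probability x/6), and jumps to each y > x with probability 1/6. A sketch using exact rational arithmetic:

```python
from fractions import Fraction

def max_chain_row(x):
    """Transition probabilities out of state x for X_n = largest outcome so far."""
    row = {y: Fraction(0) for y in range(1, 7)}
    row[x] = Fraction(x, 6)           # any roll <= x leaves the maximum unchanged
    for y in range(x + 1, 7):
        row[y] = Fraction(1, 6)       # a roll of y > x becomes the new maximum
    return row
```

For example, from state 3 the chain stays put with probability 1/2 and moves to each of 4, 5, 6 with probability 1/6; each row sums to one, as any stochastic matrix row must.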

Now consider the following model of an inventory process. The inventory can hold finished goods up to capacity C ∈ N. Every month n there is some current inventory level In, and a certain fixed amount of product x ∈ N is produced, as long as the limit is not reached, namely In + x ≤ C. If In + x > C, then only C − In is produced, to reach the capacity. Every month there is a random demand Dn, n ≥ 1, which we assume is i.i.d. If the current inventory level is at least as large as the demand, then the full demand is satisfied. Otherwise, as much of the demand is satisfied as possible, bringing the inventory level down to zero.

Let In be the inventory level in month n. Then In is a Markov chain taking values in 0, 1, . . . , C: the probability distribution of In+1 given In = i is independent from the values Im, m ≤ n − 1.

We say that the Markov chain Xn is homogeneous if P(Xn+1 = y | Xn = x) = P(X2 = y | X1 = x) for all n. Observe that all of our examples are homogeneous Markov chains. For a homogeneous Markov chain Xn we can specify the transition probabilities P(Xn+1 = y | Xn = x) by the values px,y = P(Xn+1 = y | Xn = x). For a finite state Markov chain, say with state space {1, 2, . . . , N}, the transition probabilities are pi,j, 1 ≤ i, j ≤ N. We call P = (pi,j) the transition matrix of Xn. The transition matrix P has the following obvious property: Σ_j pi,j = 1 for all i. Any non-negative matrix with this property is called a stochastic matrix, for obvious reasons.


Observe that

p^(2)_{i,j} = P(Xn+2 = j | Xn = i) = Σ_{1≤k≤N} P(Xn+2 = j | Xn+1 = k, Xn = i) P(Xn+1 = k | Xn = i)
= Σ_{1≤k≤N} P(Xn+2 = j | Xn+1 = k) P(Xn+1 = k | Xn = i)
= Σ_{1≤k≤N} pi,k pk,j.

This means that the matrix P² gives the two-step transition probabilities of the underlying Markov chain. Namely, the (i, j)-th entry of P², which we denote by p^(2)_{i,j}, is precisely P(Xn+2 = j | Xn = i). This is not hard to extend to the general case: for every r ≥ 1, P^r is the transition matrix of r steps of the Markov chain. One of our goals is understanding the long-term dynamics of P^r as r → ∞. We will see that for a broad class of Markov chains the following property holds: the limit lim_{r→∞} p^(r)_{i,j} exists and depends on j only. Namely, the starting state i is irrelevant, as far as the limit is concerned. This property is called mixing and is a very important property of Markov chains.
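The identity p^(2)_{i,j} = Σ_k pi,k pk,j is just matrix multiplication; a minimal sketch with an arbitrary two-state transition matrix:

```python
def matmul(A, B):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0.9, 0.1],
     [0.4, 0.6]]
P2 = matmul(P, P)    # two-step transition matrix
# p^(2)_{0,1} = 0.9 * 0.1 + 0.1 * 0.6 = 0.15
```

Note that P² is again a stochastic matrix: each of its rows sums to one.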

Now, we use ej to denote the j-th N-dimensional column vector; namely, ej has j-th coordinate equal to one, and all the other coordinates equal to zero. We also let e denote the N-dimensional column vector consisting of ones. Suppose X0 = i, for some state i ∈ {1, . . . , N}. Then the probability vector of Xn can be written as e_i^T P^n in vector form. Suppose at time zero the state of the chain is random and is given by some probability vector µ; namely, P(X0 = i) = µi, i = 1, 2, . . . , N. Then the probability vector of Xn is precisely µ^T P^n in vector form.

Consider the following simple Markov chain on states 1, 2: p1,1 = p1,2 = 1/2, p2,1 = 1, p2,2 = 0.

Suppose we start at random at time zero with the following probability distribution µ: µ1 =

P(X0 = 1) = 2/3, µ2 = P(X0 = 2) = 1/3. What is the probability distribution of X1 ? We have

P(X1 = 1) = (1/2)P(X0 = 1) + P(X0 = 2) = (1/2)(2/3) + (1/3) = 2/3. From this we ﬁnd

P(X1 = 2) = 1 − P(X1 = 1) = 1/3. We see that the probability distribution of X0 and X1 are

identical. The same applies to every n.

Deﬁnition 23.2. A probability vector π = (πi ), 1 ≤ i ≤ N is deﬁned to be a stationary

distribution if P(Xn = i) = πi for all times n ≥ 1 and states i = 1, . . . , N, provided that P(X0 = i) = πi, 1 ≤ i ≤ N. In this case we also say that the Markov chain Xn is in steady-state.

Repeating the derivation above for the case of general Markov chains, it is not hard to see that the vector π is stationary iff it satisfies the following properties: πi ≥ 0, Σi πi = 1, and

πi = Σ_{1≤k≤N} pk,i πk,   ∀i.

In matrix form,

(23.3) π^T = π^T P,

where w^T denotes the (row) transpose of a column vector w.


One of the fundamental properties of ﬁnite state Markov chains is that a stationary distri

bution always exists.

Theorem 23.4. Given a finite state Markov chain with transition matrix P, there exists at least one stationary distribution π. Namely, the system of equations (23.3) has at least one solution satisfying π ≥ 0, Σi πi = 1.

Proof. There are many proofs of this fundamental result. One possibility is to use Brouwer's Fixed Point Theorem. Later on we give a probabilistic proof which provides important intuition about the meaning of πi. For now let us give a quick proof, but one that relies on linear programming (LP). If you are not familiar with linear programming theory, you can simply ignore this proof.

Consider the following LP problem in variables π1, . . . , πN:

max Σ_{1≤i≤N} πi
subject to: P^T π − π = 0, π ≥ 0.

Note that a stationary vector π exists iff this LP has an unbounded objective value. Indeed, if π is a stationary vector, then it clearly is a feasible solution to this LP. Note that απ is also a feasible solution for every α > 0. Since α Σ_{1≤i≤N} πi = α, we can obtain a feasible solution with objective value as large as we want. On the other hand, suppose this LP has an unbounded objective value. In particular, there exists a feasible solution x satisfying Σi xi > 0. Taking πi = xi / Σi xi, we obtain a stationary distribution.

Now, using LP duality theory, this LP has an unbounded objective value iff the dual problem is infeasible. The dual problem is

min Σ_{1≤i≤N} 0 · yi
subject to: P y − y ≥ e.

Let us show that this dual LP problem is indeed infeasible. Take any y and find k* such that y_{k*} = max_i yi. Observe that Σ_i p_{k*,i} yi ≤ Σ_i p_{k*,i} y_{k*} = y_{k*} < 1 + y_{k*}, since the rows of P sum to one. Thus the constraint P y − y ≥ e is violated in the k*-th row. We conclude that the dual problem is indeed infeasible. Thus the primal LP problem is unbounded and a stationary distribution exists. □

As we mentioned, the stationary distribution π is not necessarily unique, but quite often it is. In this case it can be obtained as the unique solution to the system of equations π^T = π^T P, Σ_j πj = 1, πj ≥ 0.

Example: [Example 6.6 from [1]] An absent-minded professor has two umbrellas, used when commuting from home to work and back. If it rains and an umbrella is available, the professor takes it. If an umbrella is not available, the professor gets wet. If it does not rain, the professor does not take an umbrella. It rains on a given commute with probability p, independently for all days. What is the steady-state probability that the professor will get wet on a given day?

We model the process as a Markov chain with states j = 0, 1, 2. The state j means that the location where the professor currently is has j umbrellas. Then the corresponding transition probabilities are p0,2 = 1, p2,1 = p, p1,2 = p, p1,1 = 1 − p, p2,0 = 1 − p. The corresponding equations for πj, j = 0, 1, 2 are then π0 = π2(1 − p), π1 = (1 − p)π1 + pπ2, π2 = π0 + pπ1. From the second equation, π1 = π2. Combining with the first equation and with the fact π0 + π1 + π2 = 1, we obtain π1 = π2 = 1/(3 − p), π0 = (1 − p)/(3 − p). The steady-state probability that the professor gets wet is the probability of being in state zero times the probability that it rains on this day. Namely, it is P(wet) = (1 − p)p/(3 − p).
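The umbrella example can be double-checked numerically by power-iterating the transition matrix: for 0 < p < 1 the chain is irreducible with a self-loop at state 1, so µ^T P^n converges to π. A sketch with the arbitrary test value p = 1/2:

```python
def umbrella_stationary(p, iters=500):
    # states 0, 1, 2 = number of umbrellas at the professor's current location
    P = [[0.0, 0.0, 1.0],
         [0.0, 1 - p, p],
         [1 - p, p, 0.0]]
    pi = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
    return pi

p = 0.5
pi = umbrella_stationary(p)
wet = pi[0] * p   # should match (1 - p) * p / (3 - p)
```

For p = 1/2 this gives π0 = 0.2 and P(wet) = 0.1, in agreement with the closed-form answer.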

Given a ﬁnite state homogeneous Markov chain with transition matrix P , construct a directed

graph as follows: the nodes are i = 1, 2, . . . , N . Put edges (i, j) for every pair of states such

that pi,j > 0. Given two states i, j suppose there is a directed path from i to j. We say that

i communicates with j and write i → j. What is the probabilistic interpretation of this? It

� (n)

means there is a positive probability of getting to state j starting from i. Formally n pi,j > 0.

Suppose there is a path from i to j, but not from j to i. This means that if the chain, starting from i, got to j, then it will never return to i again. Since there is a positive chance of going from i to j, intuitively, this will happen with probability one. Thus with probability one we will never return to i. We would like to formalize this intuition.

Definition 23.5. A state i is called transient if there exists a state j such that i → j, but j ↛ i. Otherwise i is called recurrent.

We write i ↔ j if they communicate with each other. Observe that i ↔ i. Also, if i ↔ j then j ↔ i, and if i ↔ j and j ↔ k then i ↔ k. Thus ↔ is an equivalence relation and we can partition all the recurrent states into equivalence classes R1, R2, . . . , Rr. Thus the entire state space {1, 2, . . . , N} can be partitioned as T ∪ R1 ∪ · · · ∪ Rr, where T is the (possibly empty) set of transient states.

23.6. References

• Sections 6.1-6.4 [2]

• Chapter 6 [1]

BIBLIOGRAPHY

1. D. P. Bertsekas and J. N. Tsitsiklis, Introduction to Probability, 2nd edition, Athena Scientific, 2008.

2. G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes, Oxford University Press, 2005.


Fundamentals of probability.

6.436/15.085

LECTURE 24

Markov chains II. Mean recurrence times.

Recall the relations →, ↔ introduced in the previous lecture for the class of ﬁnite state Markov

chains. Recall that we deﬁned a state i to be recurrent if whenever i → j we also have j → i,

namely i ↔ j. We have observed that ↔ is an equivalence relation, so the set of recurrent states is partitioned into equivalence classes R1, . . . , Rr. The remaining states T are transient.

Lemma 24.1. For every i ∈ R and j ∉ R we must have pi,j = 0.

This means that once the chain is in some recurrent class R it stays there forever.

Proof. The proof is simple: pi,j > 0 implies i → j. Since i is recurrent, then also j → i, implying j ∈ R, a contradiction. □

Introduce the following basic random quantity. Given a state i, let

Ti = min{n ≥ 1 : Xn = i | X0 = i}.

In case no such n exists, we set Ti = ∞. (Thus the range of Ti is N ∪ {∞}.) The quantity Ti is called the first passage time of state i.

Lemma 24.2. For every state i ∈ T, P(Xn = i, i.o.) = 0. Namely, almost surely, after some finite time n0, the chain will never return to i. In addition, E[Ti] = ∞.

Proof. By definition there exists a state j ∉ T such that i → j, j ↛ i. It then follows that P(Ti = ∞) > 0, implying E[Ti] = ∞. Now, let us establish the first part.

Let Ii,m be the indicator of the event that the Markov chain returned to state i at least m times. Notice that P(Ii,1) = P(Ti < ∞) < 1. Also, by the Markov property, we have P(Ii,m | Ii,m−1) = P(Ti < ∞), as conditioning on the event that at some point the chain has returned to state i m − 1 times does not impact its likelihood of returning to this state again. Also notice Ii,m ⊂ Ii,m−1. Thus P(Ii,m) = P(Ii,m | Ii,m−1) P(Ii,m−1) = P(Ti < ∞) P(Ii,m−1) = · · · = P^m(Ti < ∞). Since P(Ti < ∞) < 1, then by the continuity of probability property we obtain P(∩m Ii,m) = lim_{m→∞} P(Ii,m) = lim_{m→∞} P^m(Ti < ∞) = 0. Notice that the event ∩m Ii,m is precisely the event {Xn = i, i.o.}. □

We now focus on the family of Markov chains with only one recurrent class. Namely X =

T ∪ R. If in addition T = ∅, then such a M.c. is called irreducible.

Exercise 1. Show that T ≠ X. Namely, in every finite state M.c. there exists at least one recurrent state.

Exercise 2. Let i ∈ T and let π be an arbitrary stationary distribution. Establish that πi = 0.

Exercise 3. Suppose the M.c. has one recurrent class R. Show that for every i ∈ R, P(Xn = i, i.o.) = 1. Moreover, show that there exist 0 < q < 1 and C > 0 such that P(Ti > t) ≤ Cq^t for all t ≥ 0. As a result, show that E[Ti] < ∞.

Let µi = E[Ti], possibly with µi = ∞. This is called the mean recurrence time of the state i.

We now establish a fundamental result on M.c. with a single recurrence class.

Theorem 24.3. A finite state M.c. with a single recurrence class has a unique stationary distribution π, which is given by πi = 1/µi for all states i. Specifically, πi > 0 iff the state i is recurrent.

Proof. Let P be the transition matrix of the chain. We let the state space be X = {1, . . . , N}. We fix an arbitrary recurrent state k. We know that one exists by Exercise 1. Let Ni be the number of visits to state i between two successive visits to state k. In case i = k, the last visit is counted but the initial one is not; namely, in the special case i = k, the number of visits is 1 with probability one. Let ρi(k) = E[Ni]. Consider the events {Xn = i, Tk ≥ n} and the indicator sum Σ_{n≥1} I{Xn = i, Tk ≥ n} = Σ_{1≤n≤Tk} I{Xn = i}. Notice that this sum is precisely Ni. Namely,

(24.4) ρi(k) = Σ_{n≥1} P(Xn = i, Tk ≥ n | X0 = k).

Then, using the formula E[Z] = Σ_{n≥1} P(Z ≥ n) for integer valued r.v.'s, we obtain

(24.5) Σ_i ρi(k) = Σ_{n≥1} P(Tk ≥ n | X0 = k) = E[Tk] = µk.

Since k is recurrent, then by Exercise 3, µk < ∞ implying ρi (k) < ∞. We let ρ(k) denote the

vector with components ρi (k).

Lemma 24.6. ρ(k) satisfies ρ^T(k) = ρ^T(k) P. In particular, for every recurrent state k, πi = ρi(k)/µk, 1 ≤ i ≤ N, defines a stationary distribution.

Proof. The second part follows from (24.5) and the fact that µk < ∞. Now we prove the first part. We have for every n ≥ 2

(24.7) P(Xn = i, Tk ≥ n | X0 = k) = Σ_{j≠k} P(Xn = i, Xn−1 = j, Tk ≥ n | X0 = k)

(24.8) = Σ_{j≠k} P(Xn−1 = j, Tk ≥ n − 1 | X0 = k) pj,i.

Observe that P(X1 = i, Tk ≥ 1 | X0 = k) = pk,i. We now sum (24.7) over n and apply it to (24.4) to obtain

ρi(k) = pk,i + Σ_{j≠k} Σ_{n≥2} P(Xn−1 = j, Tk ≥ n − 1 | X0 = k) pj,i.

We recognize Σ_{n≥2} P(Xn−1 = j, Tk ≥ n − 1 | X0 = k) as ρj(k). Using ρk(k) = 1 we obtain

ρi(k) = ρk(k) pk,i + Σ_{j≠k} ρj(k) pj,i = Σ_j ρj(k) pj,i. □

We now return to the proof of the theorem. Let π denote an arbitrary stationary distribution of our M.c. We know one exists by Lemma 24.6 and, independently, by our linear programming based proof. By Exercise 2 we already know that πi = 1/µi = 0 for every transient state i (recall that µi = ∞ for such i).

We now show that πk = 1/µk for every recurrent state k. Fix an arbitrary stationary

distribution π. Assume that at time zero we start with distribution π. Namely P(X0 = i) = πi

for all i. Of course this implies that P(Xn = i) is also πi for all n. On the other hand, ﬁx any

recurrent state k and consider

µk πk = E[Tk | X0 = k] P(X0 = k) = Σ_{n≥1} P(Tk ≥ n | X0 = k) P(X0 = k) = Σ_{n≥1} P(Tk ≥ n, X0 = k).

For n ≥ 2,

P(Tk ≥ n, X0 = k) = P(X0 = k, Xj ≠ k, 1 ≤ j ≤ n − 1)
= P(Xj ≠ k, 1 ≤ j ≤ n − 1) − P(Xj ≠ k, 0 ≤ j ≤ n − 1)
(∗) = P(Xj ≠ k, 0 ≤ j ≤ n − 2) − P(Xj ≠ k, 0 ≤ j ≤ n − 1)
= an−2 − an−1,

where an = P(Xj ≠ k, 0 ≤ j ≤ n) and (∗) follows from stationarity of π. Now a0 = P(X0 ≠ k).

Putting everything together, we obtain

µk πk = P(X0 = k) + Σ_{n≥2} (an−2 − an−1) = P(X0 = k) + a0 − lim_n an = 1 − lim_n an.

But by continuity of probabilities, lim_n an = P(Xn ≠ k, ∀n). By Exercise 3, the state k, being

recurrent is visited inﬁnitely often with probability one. We conclude that limn an = 0, which

gives µk πk = 1, implying that πk is uniquely deﬁned as 1/µk . �
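The identity πk = 1/µk can be checked by simulation. For the two-state chain with p1,1 = p1,2 = 1/2 and p2,1 = 1 considered earlier, π1 = 2/3, so the mean return time to state 1 should be µ1 = 3/2. A Monte Carlo sketch (states relabeled 0 and 1; the seed and trial count are arbitrary):

```python
import random

rng = random.Random(7)

def step(s, rng):
    # from state 0: stay or move, each w.p. 1/2; from state 1: always back to 0
    return rng.choice([0, 1]) if s == 0 else 0

trips = 50000
total = 0
for _ in range(trips):
    length, s = 1, step(0, rng)      # one excursion out of state 0
    while s != 0:
        length += 1
        s = step(s, rng)
    total += length

mean_return = total / trips          # should approach mu = 1 / pi = 1.5
```

Each excursion takes 1 step (with probability 1/2) or 2 steps (with probability 1/2), so the empirical mean settles near 3/2.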

Let Ni(t) denote the number of times the state i is visited during the times 0, 1, . . . , t. What can be said about the behavior of Ni(t)/t when t is large? The answer turns out to be very simple: it is πi. These types of results are called ergodic properties, as they show how the time average of the system, namely Ni(t)/t, relates to the spatial average, namely πi.

Theorem 24.9. For an arbitrary starting state X0 = k and for every state i, lim_{t→∞} Ni(t)/t = πi almost surely. Also, lim_{t→∞} E[Ni(t)]/t = πi.

Proof. Suppose X0 = k. If i is a transient state, then, as we have established, almost surely

after some ﬁnite time, the chain will never enter i, meaning limt Ni (t)/t = 0 almost surely. Since


also πi = 0, then we have established the required equality for the case when i is a transient

state.

Suppose now i is a recurrent state. Let T1, T2, T3, . . . denote the times of successive visits to i. Then the sequence Tn, n ≥ 2, is i.i.d. Also, T1 is independent from the rest of the sequence, although its distribution is different from that of Tm, m ≥ 2, since we have started the chain from k, which is in general different from i. By the definition of Ni(t) we have

Σ_{1≤m≤Ni(t)} Tm ≤ t < Σ_{1≤m≤Ni(t)+1} Tm,

and therefore

(24.10) Σ_{1≤m≤Ni(t)} Tm / Ni(t) ≤ t/Ni(t) < (Σ_{1≤m≤Ni(t)+1} Tm / (Ni(t) + 1)) · ((Ni(t) + 1)/Ni(t)).

We know from Exercise 3 that E[Tm] < ∞, m ≥ 2. Using a similar approach it can be shown that E[T1] < ∞; in particular, T1 < ∞ a.s. Applying the SLLN, we have that almost surely

lim_{n→∞} Σ_{2≤m≤n} Tm / n = lim_{n→∞} (Σ_{2≤m≤n} Tm / (n − 1)) · ((n − 1)/n) = E[T2],

which further implies

lim_{n→∞} Σ_{1≤m≤n} Tm / n = lim_{n→∞} Σ_{2≤m≤n} Tm / n + lim_{n→∞} T1/n = E[T2]

almost surely.

Since i is a recurrent state, then by Exercise 3, Ni(t) → ∞ almost surely as t → ∞. Combining the preceding identity with (24.10), we obtain

lim_{t→∞} t/Ni(t) = E[T2] = µi,

implying lim_{t→∞} Ni(t)/t = 1/µi = πi almost surely.

To establish the convergence in expectation, notice that Ni(t) ≤ t almost surely, implying Ni(t)/t ≤ 1. Applying the bounded convergence theorem, we obtain that lim_t E[Ni(t)]/t = πi, and the proof is complete. □
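Theorem 24.9 is easy to observe empirically: run the two-state chain with p1,1 = p1,2 = 1/2, p2,1 = 1 (stationary distribution (2/3, 1/3)) for a long time and record visit frequencies. A sketch (states relabeled 0 and 1; run length and seed are arbitrary):

```python
import random

rng = random.Random(1)
T = 200000
state, visits = 0, [0, 0]
for _ in range(T):
    state = rng.choice([0, 1]) if state == 0 else 0
    visits[state] += 1

freq = [v / T for v in visits]   # long-run fractions, approximately (2/3, 1/3)
```

The time averages Ni(t)/t line up with the spatial averages πi, exactly as the ergodic theorem asserts.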

How does the theory extend to the case when the M.c. has several recurrent classes R1, . . . , Rr? The summary of the theory is as follows (the proofs are very similar to the single recurrent class case and are omitted). It turns out that such a M.c. possesses r stationary distributions π^i = (π^i_1, . . . , π^i_N), 1 ≤ i ≤ r, each "concentrating" on the class Ri. Namely, for each i and each state k ∉ Ri we have π^i_k = 0. The i-th stationary distribution is described by π^i_k = 1/µk for all k ∈ Ri, where µk is the mean return time from state k ∈ Ri into itself. Intuitively, the stationary distribution π^i corresponds to the case when the M.c. "lives" entirely in the class Ri. One can prove that the family of all of the stationary distributions of such a M.c. can be obtained by taking all possible convex combinations of π^i, 1 ≤ i ≤ r, but we omit the proof. (Show that a convex combination of stationary distributions is a stationary distribution.)


24.5. References

• Sections 6.3-6.4 [1].

BIBLIOGRAPHY

1. G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes, Oxford University Press, 2005.


Fundamentals of probability.

6.436/15.085

LECTURE 25

Markov chains III. Periodicity, Mixing, Absorption

25.1. Periodicity

Previously we showed that when a finite state M.c. has only one recurrent class and π is the corresponding stationary distribution, then E[Ni(t) | X0 = k]/t → πi as t → ∞, irrespective of the starting state k. Since Ni(t) = Σ_{n=1}^t 1{Xn = i} is the number of times state i is visited up till time t, we have shown that (1/t) Σ_{n=1}^t P(Xn = i | X0 = k) → πi for every state k, i.e., p^(n)_{ki} converges to πi in the Cesaro sense. However, p^(n)_{ki} need not converge, as the following example shows. Consider a 2 state Markov chain with states {1, 2} and p12 = 1 = p21. Then p^(n)_{12} = 1 when n is odd and 0 when n is even.

Let x be a recurrent state and consider all the times when x is accessible from itself, i.e., the times in the set Ix = {n ≥ 1 : p^(n)_{xx} > 0} (note that this set is non-empty since x is a recurrent state). One property of Ix we will make use of is that it is closed under addition, i.e., if m, n ∈ Ix, then m + n ∈ Ix. This is easily seen by observing that p^(m+n)_{xx} ≥ p^(m)_{xx} p^(n)_{xx} > 0. Let dx be the greatest common divisor of the numbers in Ix. We call dx the period of x. We now show that all states in the same recurrent class have the same period.

Lemma 25.1. If x and y are in the same recurrent class, then dx = dy .

Proof. Let m and n be such that p^(m)_{xy}, p^(n)_{yx} > 0. Then p^(m+n)_{yy} ≥ p^(m)_{xy} p^(n)_{yx} > 0. So dy divides m + n. Let l be such that p^(l)_{xx} > 0; then p^(m+n+l)_{yy} ≥ p^(n)_{yx} p^(l)_{xx} p^(m)_{xy} > 0. Therefore dy divides m + n + l, hence it divides l. This implies that dy divides dx. A similar argument shows that dx divides dy, so dx = dy. □
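The period dx can be computed mechanically from the support of the powers of P; a small sketch (the 50-step horizon is an arbitrary cutoff, sufficient for these toy chains):

```python
import math
from functools import reduce

def positive_support(P, r):
    """Which (i, j) entries of P^r are positive, via boolean matrix powers."""
    n = len(P)
    A = [[P[i][j] > 0 for j in range(n)] for i in range(n)]
    R = [[i == j for j in range(n)] for i in range(n)]   # identity
    for _ in range(r):
        R = [[any(R[i][k] and A[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return R

def period(P, x, horizon=50):
    """gcd of the return times {r : p^(r)_{xx} > 0}, truncated at the horizon."""
    I_x = [r for r in range(1, horizon + 1) if positive_support(P, r)[x][x]]
    return reduce(math.gcd, I_x)

P_flip = [[0, 1], [1, 0]]        # the 2-state example above: period 2
P_lazy = [[0.5, 0.5], [1, 0]]    # a self-loop puts 1 in I_x, making the chain aperiodic
```

Only the pattern of zero/nonzero entries matters for the period, which is why boolean arithmetic suffices here.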

A recurrent class is said to be periodic if the period d is greater than 1, and aperiodic if d = 1. The 2 state Markov chain in the example above has a period of 2, since p^(n)_{11} > 0 iff n is even. A recurrent class with period d can be divided into d subsets, so that all transitions from one subset lead to the next subset.

Why is periodicity of interest to us? It is because periodicity is exactly what prevents the convergence of p^(n)_{xy} to πy. Suppose y is a recurrent state with period d > 1. Then p^(n)_{yy} = 0 unless n is a multiple of d, but πy > 0. However, if d = 1, we have positive probability of returning to y for all time steps n sufficiently large.


Lemma 25.2. If dy = 1, then there exists some N ≥ 1 such that p^(n)_{yy} > 0 for all n ≥ N.

Proof. We first show that Iy = {n ≥ 1 : p^(n)_{yy} > 0} contains two consecutive integers. Let n and n + k be elements of Iy. If k = 1, then we are done. If not, then since dy = 1, we can find an n1 ∈ Iy such that k is not a divisor of n1. Let n1 = mk + r where 0 < r < k. Consider (m + 1)(n + k) and (m + 1)n + n1, which are both in Iy since Iy is closed under addition. We have

(m + 1)(n + k) − ((m + 1)n + n1) = k − r < k.

So by repeating the above argument at most k times, we eventually obtain a pair of consecutive integers m, m + 1 ∈ Iy. If N = m², then for all n ≥ N, we have n − N = km + r, where 0 ≤ r < m. Then n = m² + km + r = r(1 + m) + (m − r + k)m ∈ Iy. □

When a Markov chain is irreducible (has only one recurrent class and no transient states) and aperiodic, the steady-state behavior is given by the stationary distribution. This is also known as mixing.

Theorem 25.3. Consider an irreducible, aperiodic Markov chain. Then for all states x, y,

lim_{n→∞} p^(n)_{xy} = πy.

For the case of periodic chains, there is a similar statement regarding convergence of p^(n)_{xy}, but the convergence holds only for certain subsequences of the time index n. See [1] for further details.

There are at least two generic ways to prove this theorem. One is based on the Perron-Frobenius Theorem, which characterizes eigenvalues and eigenvectors of non-negative matrices. Specifically, the largest eigenvalue of P is equal to unity and all other eigenvalues are strictly smaller than unity in absolute value. The P-F Theorem is especially useful in the special case of so-called reversible M.c. These are irreducible M.c. for which the unique stationary distribution satisfies πx pxy = πy pyx for all states x, y. Then the following important refinement of Theorem 25.3 is known.

Theorem 25.4. Consider an irreducible, aperiodic Markov chain which is reversible. Then there exists a constant C such that for all states x, y, |p^(n)_{xy} − πy| ≤ C|λ2|^n, where λ2 is the second largest (in absolute value) eigenvalue of P.

Since by the P-F Theorem |λ2| < 1, this theorem is indeed a refinement of Theorem 25.3, as it gives a concrete rate of convergence to the steady-state.

We have considered the long-term behavior of Markov chains. Now, we study the short-term

behavior. In such considerations, we are concerned with the behavior of the chain starting in a

transient state, till it enters a recurrent state. For simplicity, we can therefore assume that every

recurrent state i is absorbing, i.e., pii = 1. The Markov chain that we will work with in this

section has only transient and absorbing states.

If there is only one absorbing state i, then πi = 1, and i is reached with probability 1. If

there are multiple absorbing states, the state that is entered is random, and we are interested in

the absorption probability

aki = P(Xn eventually equals i | X0 = k),


i.e., the probability that state i is eventually reached, starting from state k. Note that aii = 1 and aji = 0 for all absorbing j ≠ i. For k a transient state, we have

aki = P(∃n : Xn = i | X0 = k)
    = Σ_{j=1}^{N} P(∃n : Xn = i | X1 = j) pkj
    = Σ_{j=1}^{N} aji pkj.

So we can ﬁnd the absorption probabilities by solving the above system of linear equations.
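A minimal numerical sketch of this computation, using a small made-up chain (in fact the symmetric gambler's ruin with m = 3, so the answers are known exactly): rather than Gaussian elimination, it solves the system by fixed-point iteration, which converges because the iteration matrix restricted to the transient states has spectral radius below one.

```python
# Absorption probabilities a_{k,i} solve the linear system
#   a_{i,i} = 1,  a_{j,i} = 0 for other absorbing j,
#   a_{k,i} = sum_j p_{k,j} a_{j,i} for transient k.
P = {  # transition probabilities; states 0 and 3 are absorbing
    0: {0: 1.0},
    1: {0: 0.5, 2: 0.5},
    2: {1: 0.5, 3: 0.5},
    3: {3: 1.0},
}
absorbing, target = {0, 3}, 0

# Start from the boundary conditions and iterate a_k <- sum_j p_kj a_j
# on the transient states; the absorbing states stay fixed.
a = {k: (1.0 if k == target else 0.0) for k in P}
for _ in range(200):
    a = {k: a[k] if k in absorbing
         else sum(p * a[j] for j, p in P[k].items())
         for k in P}

# Symmetric gambler's ruin with m = 3: a_{1,0} = 2/3, a_{2,0} = 1/3.
assert abs(a[1] - 2 / 3) < 1e-9 and abs(a[2] - 1 / 3) < 1e-9
```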

Example: Gambler's Ruin. A gambler wins 1 dollar at each round with probability p, and

loses a dollar with probability 1 − p. Diﬀerent rounds are independent. The gambler plays

continuously until he either accumulates a target amount m or loses all his money. What is the

probability of losing his fortune?

We construct a Markov chain with state space {0, 1, . . . , m}, where the state i is the amount

of money the gambler has. So state i = 0 corresponds to losing his entire fortune, and state

m corresponds to accumulating the target amount. The states 0 and m are absorbing states.

We have the transition probabilities pi,i+1 = p, pi,i−1 = 1 − p for i = 1, 2, . . . , m − 1, and

p00 = pmm = 1. To ﬁnd the absorbing probabilities for the state 0, we have

a00 = 1,
am0 = 0,
ai0 = (1 − p)a_{i−1,0} + p a_{i+1,0}, for i = 1, . . . , m − 1.

Rewriting the last equation as

(1 − p)(a_{i−1,0} − a_{i,0}) = p(a_{i,0} − a_{i+1,0}),

and setting bi = a_{i,0} − a_{i+1,0} and ρ = (1 − p)/p, we get

bi = ρ b_{i−1},

so we obtain bi = ρ^i b0. Note that b0 + b1 + · · · + b_{m−1} = a00 − am0 = 1, hence (1 + ρ + . . . + ρ^{m−1})b0 = 1, which gives us

bi = ρ^i (1 − ρ)/(1 − ρ^m),  if ρ ≠ 1,
bi = 1/m,                     otherwise.

Then, for ρ ≠ 1,

ai0 = a00 − b_{i−1} − . . . − b0
    = 1 − (ρ^{i−1} + . . . + ρ + 1)b0
    = 1 − ((1 − ρ^i)/(1 − ρ)) · ((1 − ρ)/(1 − ρ^m))
    = (ρ^i − ρ^m)/(1 − ρ^m),

and for ρ = 1,

ai0 = (m − i)/m.

This shows that, for any fixed i, if ρ > 1, i.e., p < 1/2, the probability of losing goes to 1 as m → ∞. Hence, if the gambler aims for a large target under unfavorable odds, financial ruin is almost certain.
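The closed-form answer can be sanity-checked numerically. A short sketch, with illustrative values of p and m, verifying that the formula satisfies the recurrence and boundary conditions above:

```python
def ruin_prob(i, m, p):
    """Probability of hitting 0 before m, starting with i dollars."""
    rho = (1 - p) / p
    if rho == 1:          # fair game, p = 1/2
        return (m - i) / m
    return (rho**i - rho**m) / (1 - rho**m)

m, p = 10, 0.45           # illustrative target and win probability

# The formula satisfies a_{i,0} = (1-p) a_{i-1,0} + p a_{i+1,0}.
for i in range(1, m):
    lhs = ruin_prob(i, m, p)
    rhs = (1 - p) * ruin_prob(i - 1, m, p) + p * ruin_prob(i + 1, m, p)
    assert abs(lhs - rhs) < 1e-12

# Boundary conditions, and the fair-game case (m - i)/m:
assert ruin_prob(0, m, p) == 1.0 and ruin_prob(m, m, p) == 0.0
assert abs(ruin_prob(5, 10, 0.5) - 0.5) < 1e-12
```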

The expected time to absorption µk, when starting in a transient state k, can be defined as µk = E[min{n ≥ 1 : Xn is recurrent} | X0 = k]. A similar analysis, conditioning on the first step of the Markov chain, shows that the expected time to absorption can be found by solving

µk = 0 for all recurrent states k,
µk = 1 + Σ_{j=1}^{N} pkj µj for all transient states k.

25.3. References

• Sections 6.4,6.6 [2].

• Section 5.5 [1].

BIBLIOGRAPHY

1. R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.

2. G. R. Grimmett and D. R. Stirzaker, Probability and random processes, Oxford University

Press, 2005.


MIT OpenCourseWare

http://ocw.mit.edu

Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Fundamentals of probability.

6.436/15.085

LECTURE 26

Inﬁnite Markov chains. Continuous time Markov chains.

26.1. Introduction

In this lecture we cover variants of Markov chains not covered in earlier lectures. We will discuss infinite state Markov chains. Then we consider finite and infinite state M.c. where the transitions between states happen after random time intervals, as opposed to unit time steps. Most of the time we state the results without proofs, and our treatment of this material is very brief. A more in-depth analysis of these concepts is given in the course 6.262 - Discrete Stochastic Processes.

Suppose we have a (homogeneous) Markov chain whose state space is countably infinite, X = {0, 1, 2, . . .}. In this case the theory is similar in some respects to the finite state counterpart, but different in other respects. We denote again by pi,j the probability of transition from state i to state j. We introduce the notion of i communicating with j, written i → j, in the same manner as before. Thus again we may decompose the state space into states i such that for some j, i → j but j ↛ i, and the states without this property. However, in the case of infinite M.c. a new complication appears. To discuss it, let us again define a probability distribution π on X to be stationary if it is time invariant. The necessary and sufficient condition for this is πi ≥ 0, Σ_{i∈X} πi = 1 and, for every state i,

πi = Σ_j πj pj,i.

As a result, if the M.c. Xn has the property that X0 is distributed as π, then Xn is distributed as π for all n.

Now let us consider the following M.c. on Z+ . A parameter p is ﬁxed. For every i > 0,

pi,i+1 = p, pi,i−1 = 1 − p and p0,1 = p, p0,0 = 1 − p. This M.c. is called random walk with reﬂection

at zero. Let us try to ﬁnd a stationary distribution π of this M.c. It must satisfy

πi = πi−1 pi−1,i + πi+1 pi+1,i = πi−1 p + πi+1 (1 − p), i ≥ 1.

π0 = π0 (1 − p) + π1 (1 − p).


From this we obtain π1 = (p/(1 − p)) π0, and iterating,

(26.1)  π_{i+1} = (p/(1 − p)) πi.

This gives πi = (p/(1 − p))^i π0. Now if p > 1/2 then πi → ∞ and we cannot possibly have that Σ_i πi = 1. Thus no stationary distribution exists. Note, however, that all pairs of states i, j communicate, as we can get from i to j > i in j − i steps with probability p^{j−i} > 0, and from j to i in j − i steps with probability (1 − p)^{j−i}.

We conclude that an infinite state M.c. does not necessarily have a stationary distribution even if all states communicate. Recall that in the case of a finite state M.c., if i is a recurrent state, then its recurrence time Ti has finite expected value (as it has geometrically decreasing tails). It turns out that the difficulty is the fact that, while every state i communicates with every other state j, it is possible that the chain starting from i wanders off to "infinity" forever, without ever returning to i. Furthermore, it is possible that even if the chain returns to i infinitely often with probability one, the expected return time from i to i is infinite. Recall that the return time is defined to be Ti = min{n ≥ 1 : Xn = i}, when the M.c. starts at i at time 0.

Definition 26.2. Given an infinite M.c. Xn, n ≥ 1, the state i is defined to be transient if the probability of never returning to i is positive. Namely,

P(Xn ≠ i, ∀n ≥ 1 | X0 = i) > 0.

Otherwise the state is defined to be recurrent. It is defined to be positive recurrent if E[Ti] < ∞ and null-recurrent if E[Ti] = ∞.

Thus, unlike the finite state case, a state is transient if there is a positive probability of no return, as opposed to the existence of a state from which the return to the starting state has probability zero. It is an exercise to see that the definition above, when applied to the finite state case, is consistent with the earlier definition. Also, observe that there is no notion of a null-recurrent state in the finite state case.

The following theorem holds, the proof of which we skip.

Theorem 26.3. Given an inﬁnite M.c. Xn , n ≥ 1 suppose all the states communicate. Then

there exists a stationary distribution π iﬀ there exists at least one positive recurrent state i. In

this case in fact all the states are positive recurrent and the stationary distribution π is unique.

It is given as πj = 1/E[Tj ] > 0 for every state j.

We see that in the case when all the states communicate, all states have the same status:

positive recurrent, null recurrent or transient. In this case we will say the M.c. itself is positive

recurrent, null recurrent, or transient. There is an extension of this theorem to the case when not all states communicate, but we skip the discussion of it. The main difference is that if the state i is such that for some j, i → j but j ↛ i, then the steady state probability of i is zero, just as in the case of finite state M.c. Similarly, if there are several communicating classes, then there exists at least one stationary distribution for each class which contains at least one positive recurrent state (and as a result all states of that class are positive recurrent).

Theorem 26.4. A random walk with reﬂection Xn on Z+ is positive recurrent if p < 1/2,

null-recurrent if p = 1/2 and transient if p > 1/2.

Proof. The case p < 1/2 will be resolved by exhibiting explicitly at least one steady state

distribution π. Since all the states communicate, then by Theorem 26.3 we know that the

stationary distribution is unique and E[Ti] = 1/πi < ∞ for all i. Thus the chain is positive recurrent. To construct a stationary distribution, look again at the recurrence (26.1), which suggests πi = (p/(1 − p))^i π0. From this we obtain

π0 (1 + Σ_{i>0} (p/(1 − p))^i) = 1,

πi = ((1 − 2p)/(1 − p)) · (p/(1 − p))^i, i ≥ 0.

This gives us a probability vector π with Σ_i πi = 1 and completes the proof for the case p < 1/2.
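A quick numerical check (with an illustrative value p = 0.3) that the distribution constructed here satisfies the balance equations for the reflected walk and sums to one:

```python
# Stationary distribution pi_i = ((1-2p)/(1-p)) (p/(1-p))^i, valid for p < 1/2.
p = 0.3
r = p / (1 - p)
pi = lambda i: (1 - 2 * p) / (1 - p) * r**i

# Balance at zero: pi_0 = pi_0 (1-p) + pi_1 (1-p).
assert abs(pi(0) - (pi(0) * (1 - p) + pi(1) * (1 - p))) < 1e-12
# Balance for i >= 1: pi_i = pi_{i-1} p + pi_{i+1} (1-p).
for i in range(1, 50):
    assert abs(pi(i) - (pi(i - 1) * p + pi(i + 1) * (1 - p))) < 1e-12

# The geometric series sums to 1 (truncation error is tiny since r < 1).
assert abs(sum(pi(i) for i in range(200)) - 1.0) < 1e-12
```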

The case p ≥ 1/2 will be analyzed using our earlier result on the random walk on Z. Recall that for such a r.w. the probability of return to zero equals 1 iff p = 1/2. In the case p = 1/2 we have also established that the expected return time to zero is infinite. Thus suppose p = 1/2. A r.w. without reflection makes its first step into 1 or −1 with probability 1/2 each. Conditioning on X1 = 1 or on X1 = −1, we have that the expected return time to zero is again infinite. If the first transition is into 1, then the behavior of this r.w. till the first return to zero is the same as that of our r.w. with reflection at zero. In particular, the return to zero happens with probability one and the expected return time is infinite. We conclude that the state 0 is null-recurrent.

Finally, suppose p > 1/2. We already saw that the M.c. cannot have a stationary distribution.

Thus by Theorem 26.3, since all the states communicate we have that all states are null-recurrent

or transient. We just need to reﬁne this result to show that in fact all states are transient.

For the unreflected r.w. we have that with positive probability the walk never returns to zero. Let, as usual, T0 denote the return time to 0 - the time it takes to come back to zero when the chain starts at zero. We claim that P(T0 = ∞ | X1 = 1) > 0 and P(T0 = ∞ | X1 = −1) = 0. Namely, "no return to zero" can happen only if the first step is to the right. First let us see why this implies our result: if the first step is to the right, then the r.w. behaves as the r.w. with reflection at zero until the first return to zero. Since there is a positive probability of no return, there is a positive probability of no return for the reflected r.w. as well.

Now we establish that claim. We have P(T0 = ∞) = pP(T0 = ∞|X1 = 1) + (1 − p)P(T0 =

∞|X1 = −1). We also have that P(T0 = ∞) > 0. We now establish that P(T0 = ∞|X1 =

−1) = 0. This immediately implies P(T0 = ∞|X1 = 1) > 0, which we need. Now assume

X1 = −1. Consider Yn = −Xn . Observe that, until the ﬁrst return to zero, Yn is a reﬂected

r.w. with parameter q = 1 − p. Since q < 1/2, then, as we established at the beginning

of the proof, the process Yn returns to zero with probability one (moreover the return time

has ﬁnite expected value). We conclude that Xn returns to zero with probability one, namely

P(T0 = ∞|X1 = −1) = 0. This completes the proof. �

We consider a stochastic process X(t) which is a function of a real argument t instead of integer

n. Let X be the state space of this process, which is assumed to be ﬁnite or countably inﬁnite.

Deﬁnition 26.5. X(t) is deﬁned to be a continuous time Markov chain if for every j, i1 , . . . , in−1 ∈

X and every sequence of times t1 < t2 < · · · < tn ,

(26.6) P(X(tn ) = j|X(tn−1 ) = in−1 , . . . , X(t1 ) = i1 ) = P(X(tn ) = j|X(tn−1 ) = in−1 ).

The M.c. is called homogeneous if P(X(t) = j | X(s) = i) depends on s and t only through the difference t − s, for every i, j and s < t.

From now on we assume, without explicitly saying so, that our M.c. is homogeneous. We write p(t)i,j for P(X(t) = j | X(0) = i). The continuous time Markov chain is a special case of a Markov process, the definition of which we skip. Loosely speaking, a stochastic process is a Markov process if its future trajectory is completely determined by its current state, independently from the past. We already know an example of a continuous time M.c. - the Poisson process. It is given as P(X(t) = i + k | X(s) = i) = P(Pois(λ(t − s)) = k) for k ≥ 0, and P(X(t) = i + k | X(s) = i) = 0 for k < 0.

Given a state i and time t0, introduce the "holding time" U(i, t0) as inf{s > 0 : X(t0 + s) ≠ i}, when X(t0) = i. Namely, it is the time that the chain spends in state i after time t0, assuming that it is in i at time t0. It might turn out in special cases that U(i, t0) = 0 almost surely. But in many special cases this will not happen. For now we assume that U(i, t0) > 0 a.s. In special cases we can establish this directly.

Proposition 1. For every state i and time t0, U(i, t0) is distributed as Exp(µi) for some parameter µi which depends only on the state i.

Since, per proposition above, the distribution of holding time is exponential, and therefore

memoryless, we see that the time till the next transition occurs is independent from the past

history of the chain and only depends on the current state i. The parameter µi is usually called

transition rate out of state i. This is a very fundamental (and useful) property of continuous

time Markov chains.

Proof sketch. Consider

P(U (i, t0 ) > x + y|U (i, t0 ) > x, X(t0 ) = i).

The event U (i, t0 ) > x, X(t0 ) = i implies in particular X(t0 + x) = i. Since we have a M.c. the

trajectory of X(t) for t ≥ t0 + x depends only on the state at time t0 + x which is i in our case.

Namely

P(U(i, t0) > x + y | U(i, t0) > x, X(t0) = i) = P(U(i, t0 + x) > y | X(t0 + x) = i).

But the latter expression, by homogeneity, is P(U(i, t0) > y | X(t0) = i), as it is the probability of the holding time being larger than y when the current state is i. We conclude that

P(U(i, t0) > x + y | U(i, t0) > x, X(t0) = i) = P(U(i, t0) > y | X(t0) = i),

namely

P(U(i, t0) > x + y | X(t0) = i) = P(U(i, t0) > y | X(t0) = i) P(U(i, t0) > x | X(t0) = i).

Since the exponential function is the only one satisfying this property, U(i, t0) must be exponentially distributed. □

There is an omitted subtlety in the proof. We assumed that for every t, z > 0 and state i, P(X(t + s) = i, ∀ s ∈ [0, z] | X(t) = i, F_t) = P(X(t + s) = i, ∀ s ∈ [0, z] | X(t) = i), where F_t denotes the history of the process up to time t. We deduced this based on the assumption (26.6). This requires a technical proof, which we ignored above.

Thus the evolution of a continuous M.c. X(t) can be described as follows. It stays in a given state i during some exponentially distributed time Ui, with parameter µi which only depends on the state. After this time it makes a transition to the next state j. If we consider the process only at the times of transitions, denoted say by t1 < t2 < · · ·, then we obtain an embedded discrete time process Yn = X(tn). It is an exercise to show that Yn is in fact a homogeneous Markov chain. Denote the transition probabilities of this Markov chain by pi,j. The value qi,j = µi pi,j is called the "transition rate" from state i to state j. Note that the values pi,j were introduced only for j ≠ i, as they were derived from the M.c. changing its state. Define qi,i = −Σ_{j≠i} qi,j. The matrix G = (qi,j), i, j ∈ X, is defined to be the generator of the M.c. X(t) and plays an important role, specifically in the discussion of a stationary distribution.

A stationary distribution π of a continuous M.c. is defined in the same way as in the discrete time case: it is the distribution which is time invariant. The following fact can be established.

Proposition 2. A vector (πi), i ∈ X, is a stationary distribution iff πi ≥ 0, Σ_i πi = 1 and Σ_j πj qj,i = 0 for every state i. In vector form, π'G = 0.
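Proposition 2 suggests a numerical route to π. One standard trick (uniformization, not covered in the lecture) converts G into a stochastic matrix P = I + G/Λ, for any Λ at least as large as the largest exit rate; P has the same stationary distribution as the chain, so power iteration applies. A sketch; the 3-state generator below is made up for illustration.

```python
# Rate matrix G: rows sum to zero, off-diagonal entries are transition rates.
G = [[-3.0, 2.0, 1.0],
     [1.0, -4.0, 3.0],
     [2.0, 2.0, -4.0]]
Lam = 5.0          # any Lam >= max_i |q_ii| works here
n = len(G)

# Uniformized chain: P = I + G / Lam is a stochastic matrix.
P = [[(1.0 if i == j else 0.0) + G[i][j] / Lam for j in range(n)]
     for i in range(n)]

pi = [1.0 / n] * n
for _ in range(500):   # pi <- pi P until it stops changing
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Check pi G = 0 and that pi is a probability vector.
for j in range(n):
    assert abs(sum(pi[i] * G[i][j] for i in range(n))) < 1e-9
assert abs(sum(pi) - 1.0) < 1e-9
```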

As in the discrete time case, the theory of continuous time M.c. has a lot of special structure when the state space is finite. We now summarize without proofs some of the basic results. First, there exists a stationary distribution. The conditions for uniqueness of the stationary distribution are the same - a single recurrence class, with communication between the states defined similarly. A nice "advantage" of continuous M.c. is the lack of periodicity: there is no notion of a period of a state. Moreover, and most importantly, suppose the chain has a unique recurrence class. Then, for π the corresponding unique stationary distribution, the mixing property

lim_{t→∞} p(t)i,j = πj

holds for every starting state i. For modeling purposes, it is sometimes useful to consider a continuous as opposed to a discrete M.c.

There is an alternative way to describe a continuous M.c. and the embedded discrete time M.c. Assume that to each pair of states i, j we associate an exponential "clock" - an exponentially distributed r.v. Ui,j with parameter µi,j. Each time the process jumps into i, all of the clocks are turned on simultaneously. Then at time Ui = min_j Ui,j the process jumps into state j = arg min_j Ui,j. It is not hard to establish the following: the resulting process is a continuous time finite state M.c. The embedded discrete time M.c. then has transition probabilities

P(X(t_{n+1}) = j | X(t_n) = i) = µi,j / Σ_k µi,k,

as the probability that Ui,j = min_k Ui,k is given by this expression when the r.v.s Ui,j are exponential with parameters µi,j. The holding time then has the distribution Exp(µi) where µi = Σ_k µi,k. Thus we obtain an alternative description of a M.c. The transition rates of this M.c. are qi,j = µi pi,j = µi,j. In other words, we have described the M.c. via the rates qi,j as given.

This description extends to infinite M.c., when the notion of holding times is well defined (see the comments above).
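The claim that the clock with rate µi,j wins with probability µi,j / Σ_k µi,k can be checked by simulation. A sketch with illustrative rates (2, 3, 5) and a fixed seed; note that Python's `random.expovariate` takes the rate, not the mean.

```python
import random

random.seed(0)
rates = {1: 2.0, 2: 3.0, 3: 5.0}   # mu_{i,j} for three possible next states
total = sum(rates.values())

# Repeatedly start all clocks and record which one expires first.
wins = {j: 0 for j in rates}
trials = 100_000
for _ in range(trials):
    clocks = {j: random.expovariate(mu) for j, mu in rates.items()}
    wins[min(clocks, key=clocks.get)] += 1

# Empirical winning frequencies should match mu_{i,j} / sum_k mu_{i,k}
# (i.e., 0.2, 0.3, 0.5) up to Monte Carlo error.
for j, mu in rates.items():
    assert abs(wins[j] / trials - mu / total) < 0.015
```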

26.4. References

• Sections 6.2,6.3,6.9 [1].

BIBLIOGRAPHY

1. G. R. Grimmett and D. R. Stirzaker, Probability and random processes, Oxford University Press, 2005.


Fundamentals of probability.

6.436/15.085

LECTURE 27

Birth-death processes

An important and fairly tractable class of infinite continuous time M.c. is the birth-death process. Loosely speaking, this is a process which combines the properties of the random walk with reflection at zero, studied in the previous lecture, with the continuous time nature of the transition times. The process is described as follows. The state space is Z+. To specify the transitions we use the "exponential clock" model. To each state n ≥ 0 two exponential clocks with rates λn, µn are attached. Assume the current state is X(t) = n > 0. If the clock corresponding to rate λn expires first, the chain moves to the state n + 1; otherwise, it moves to the state n − 1. Formally, let U distributed as Exp(λn) and V distributed as Exp(µn) be independent. If U < V, then X(t + U) = n + 1. If V < U, then X(t + V) = n − 1. The case U = V occurs with probability zero, so it is ignored. In the special case n = 0 we assume that we only have the r.v. U, and the chain moves into state 1 after time U. Using the language of the embedded M.c. we can alternatively describe the process as follows. Assume X(t) = n > 0. Then after a random time distributed as Exp(λn + µn) the chain moves to a new state, which is n + 1 with probability λn/(λn + µn), or n − 1 with probability µn/(λn + µn).

One can characterize the dynamics of birth-death processes in terms of a certain family of differential equations. Recall that, in addition to the memoryless property, the exponential distribution has the following property: if U is distributed as Exp(λ) then for every t, P(U ∈ [t, t + h) | U ≥ t) = λh + o(h). From this we obtain the following relation:

P(X(t + h) = n + m | X(t) = n) =
    λn h + o(h),               if m = 1;
    µn h + o(h),               if m = −1;
    1 − λn h − µn h + o(h),    if m = 0;
    o(h),                      if |m| > 1.


Fix an arbitrary state k0 and let pn(t) = P(X(t) = n | X(0) = k0). We have pn(t + h) = Σ_j pj(t) P(X(t + h) = n | X(t) = j). Assume n > 0. Then we can rewrite the expression above as

pn(t + h) = p_{n−1}(t)(λ_{n−1} h + o(h))
          + p_{n+1}(t)(µ_{n+1} h + o(h))
          + pn(t)(1 − λn h − µn h + o(h))
          + o(h) Σ_{j ≠ n−1,n,n+1} pj(t).

Note that Σ_{j ≠ n−1,n,n+1} pj(t) ≤ 1, so the last term in the sum is simply o(h). We conclude

(pn(t + h) − pn(t))/h = λ_{n−1} p_{n−1}(t) + µ_{n+1} p_{n+1}(t) − (λn + µn)pn(t) + o(1).

By taking the limit h → 0, we obtain a system of diﬀerential equations

(27.1) ṗn (t) = λn−1 pn−1 (t) + µn+1 pn+1 (t) − (λn + µn )pn (t)

The case n = 0 is obtained similarly and simpliﬁes to

(27.2) ṗ0 (t) = µ1 p1 (t) − λ0 p0 (t).

The initial condition for this system of equations is pk0(0) = 1, pj(0) = 0, j ≠ k0.

The system of differential equations we derived gives us a very good "hint" at what the stationary distribution should be, if one exists. Note that since all the states communicate, there can be only one stationary distribution. We now use this system of differential equations to derive the stationary distribution. Then we will see an alternative way of deriving it using the condition π'G = 0 stated in the last lecture.

The steady state distribution is, by definition, a distribution which is time invariant. Thus we ask the following question: what should a distribution π be such that if we initialize our M.c. at π at time zero, as opposed to a fixed state k0, we get the same distribution at every time t? Clearly, for this we must have that all derivatives ṗn(t) vanish and all pn(t) = pn are constants. Thus we must have the following system of equations:

(27.3)  λ_{n−1} p_{n−1} + µ_{n+1} p_{n+1} − (λn + µn)pn = 0, n ≥ 1,
(27.4)  µ1 p1 − λ0 p0 = 0.

On the other hand, if we initialize our M.c. at time zero with the distribution π = p = (pn), n ≥ 0, then at every time t we obtain the same distribution, as the system of functions pn(t) = pn, ∀t ≥ 0, solves (27.1), (27.2). Recursively solving this system we find that

(27.5)  pn = p0 · (λ0 λ1 · · · λ_{n−1}) / (µ1 µ2 · · · µn).

Since we must also have that Σ_{n≥0} pn = 1, we must have

(27.6)  p0 = (1 + Σ_{n≥1} (λ0 λ1 · · · λ_{n−1}) / (µ1 µ2 · · · µn))^{−1},

which requires

(27.7)  Σ_{n≥1} (λ0 λ1 · · · λ_{n−1}) / (µ1 µ2 · · · µn) < ∞.

We arrive at the following result.

Theorem 27.8. A birth-death process with parameters λn, µn has a stationary distribution if and only if condition (27.7) holds. In this case the stationary distribution is unique and is given by (27.5), (27.6).
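Formulas (27.5)-(27.6) are easy to evaluate numerically once the infinite series is truncated. A sketch, with made-up constant rates λn = 2, µn = 5, for which the formula gives a geometric distribution with ρ = 2/5:

```python
def birth_death_pi(lam, mu, N=200):
    """Stationary probabilities p_0, ..., p_{N-1} from (27.5)-(27.6).

    lam(n), mu(n) give the rates; the series is truncated at N terms,
    which is accurate when the products decay geometrically.
    """
    prods = [1.0]   # prods[n] = lam_0 ... lam_{n-1} / (mu_1 ... mu_n)
    for n in range(1, N):
        prods.append(prods[-1] * lam(n - 1) / mu(n))
    p0 = 1.0 / sum(prods)
    return [p0 * c for c in prods]

pi = birth_death_pi(lambda n: 2.0, lambda n: 5.0)
rho = 2.0 / 5.0
for n in range(10):
    assert abs(pi[n] - (1 - rho) * rho**n) < 1e-9
```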

There is a simpler alternative way of deriving the stationary distribution. Recall the condition π'G = 0, where G = (gi,j) is the generator matrix - the matrix of rates. In this case we have gn,n+1 = λn, gn,n−1 = µn, gn,n = −(λn + µn) for n ≥ 1, and g0,1 = λ0, g0,0 = −λ0. Then the equation π'G = 0 translates into (27.3), (27.4).

An even faster way, although the rigorous justification of this approach is based on the reversibility theory which we did not cover, is as follows. Observe that every time there is a transition from the state n to the state n + 1 there must be a reverse transition (otherwise the state 0 would never be visited after some finite time period). In steady state the probability of the transition n → n + 1 occurring in a given time interval [t, t + h] is πn(λn h + o(h)). The transition n + 1 → n occurs in the same time interval with probability π_{n+1}(µ_{n+1} h + o(h)). Since the two are the same, we obtain πn λn = π_{n+1} µ_{n+1}, which gives us the same solution (27.5), (27.6). These equations are sometimes called balance equations, for an obvious reason.

One of the immediate applications of birth-death processes is queueing theory. Consider the following system, known broadly as the M/M/1 queueing system (M/M standing for memoryless arrival, memoryless service distribution). We have a server which serves arriving customers. Serving the n-th customer takes a random amount of time which is distributed as Exp(µ). Customers arrive to the system according to a Poisson process with rate λ. Whenever there are no customers in the system, the server idles. Let L(t) denote the number of customers at time t. What is the distribution of L(t)? What is the steady state distribution of L(t), if any exists? It turns out that L(t) is described exactly as the birth-death process with rates λn = λ, µn = µ. Indeed if L(t) = n > 0 then the next transition will occur either if there is an arrival, or if there is a service completion. Thus the transition occurs when the earlier of the two "exponential clocks" ticks. This happens at a random time distributed as Exp(λ + µ), and the transition is to n + 1 with probability λ/(λ + µ) and to n − 1 with probability µ/(λ + µ). When L(t) = 0 the only change is due to an arrival, so after a random time distributed as Exp(λ) we have a transition to state 1. We see that indeed we have a birth-death process. Clearly, condition (27.7) holds if and only if ρ = λ/µ < 1 and, in this case, we obtain the following form of the stationary distribution:

p0 = (1 + Σ_{n≥1} ρ^n)^{−1} = 1 − ρ,

pn = (1 − ρ)ρ^n.

There is an intuitive reason why we need the condition ρ < 1. When ρ > 1 the customers arrive at a rate faster than the service rate, and on average the queue builds up at a linear rate. Contrast this with the reflected random walk (essentially the same model except it is discrete time) and the condition p < 1 − p for the existence of a steady state. The parameter ρ is called the traffic intensity and plays a very important role in the theory of queueing systems. For one thing, notice that in steady state p0 = P(L = 0) = 1 − ρ. On the other hand, from the general theory of discrete and continuous time M.c. we know that p0 is the average fraction of time the system spends in this state. Thus 1 − ρ is the average fraction of time the server is idle. Alternatively, ρ is the average fraction of time the system is busy. Hence the term "traffic intensity".

Consider now the following variant of a queueing system, known as the M/M/∞ system. Here we have infinitely many servers. Each service time is again distributed as Exp(µ) and arrivals occur according to a Poisson process with rate λ. The difference is that there is no queue any more: due to the infinity of servers, every customer instantly begins service. Let L(t) be the number of customers being served at time t. It is not hard to see that this corresponds to a birth-death process with parameters λn = λ, µn = µn. The arrival parameter is explained as before. The service rate being µn is explained as follows: when there are n customers being served, the next transition occurs at a time which is the minimum of the arrival time of the next customer and the smallest of the n service times. The former is distributed as Exp(λ), the latter as Exp(nµ), hence µn = µn. Let ρ = λ/µ. In this case we find the stationary distribution as

p0 = (Σ_{n≥0} ρ^n/n!)^{−1} = e^{−ρ},

pn = (ρ^n/n!) e^{−ρ}.

In particular, the distribution is the familiar Pois(ρ). The system always has a steady state distribution, irrespective of the values λ, µ. This is explained by the fact that we have infinitely many servers and the queue disappears.
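A direct numerical check that the general birth-death formula with λn = λ, µn = µn indeed reduces to Pois(ρ); the rates λ = 3, µ = 2 are illustrative.

```python
import math

lam, mu = 3.0, 2.0
rho = lam / mu
N = 100   # truncation point; the tail rho^N / N! is negligible

# Products lam_0 ... lam_{n-1} / (mu_1 ... mu_n) = rho^n / n!
prods = [1.0]
for n in range(1, N):
    prods.append(prods[-1] * lam / (n * mu))
p0 = 1.0 / sum(prods)
pn = [p0 * c for c in prods]

# Compare against the Poisson pmf e^{-rho} rho^n / n!.
for n in range(15):
    assert abs(pn[n] - math.exp(-rho) * rho**n / math.factorial(n)) < 1e-12
```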

27.4. References

• Section 6.11 [1].

BIBLIOGRAPHY

1. G. R. Grimmett and D. R. Stirzaker, Probability and random processes, Oxford University Press, 2005.

