Statistics 110: Intro to Probability by Joe Blitzstein — Lecture Notes

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Notes by Max Wang

Harvard University, Fall 2011

Contents (partial):

Lecture 5: 9/12/11
Lecture 6: 9/14/11
Lecture 7: 9/16/11
Lecture 8: 9/19/11
Lecture 9: 9/21/11
Lecture 10: 9/23/11
Lecture 11: 9/26/11
Lecture 12: 9/28/11
Lecture 13: 9/30/11
Lecture 22: 10/26/11
Lecture 23: 10/28/11
Lecture 24: 10/31/11
Lecture 25: 11/2/11
Lecture 26: 11/4/11
Lecture 27: 11/7/11
Lecture 28: 11/9/11
Lecture 29: 11/14/11

Introduction

Statistics 110 is an introductory statistics course offered at Harvard University. It covers all the basics of probability—

counting principles, probabilistic events, random variables, distributions, conditional probability, expectation, and

Bayesian inference. The last few lectures of the course are spent on Markov chains.

These notes were partially live-TeXed—the rest were TeXed from course videos—then edited for correctness and

clarity. I am responsible for all errata in this document, mathematical or otherwise; any merits of the material here

should be credited to the lecturer, not to me.

Feel free to email me at mxawng@gmail.com with any comments.

Acknowledgments

In addition to the course staff, acknowledgment goes to Zev Chonoles, whose online lecture notes (http://math.

uchicago.edu/~chonoles/expository-notes/) inspired me to post my own. I have also borrowed his format for

this introduction page.

The page layout for these notes is based on the layout I used back when I took notes by hand. The LaTeX styles can

be found here: https://github.com/mxw/latex-custom.

Copyright

Copyright © 2011 Max Wang.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This means you are free to edit, adapt, transform, or redistribute this work as long as you

• include an attribution of Joe Blitzstein as the instructor of the course these notes are based on, and an

attribution of Max Wang as the note-taker;

• do so in a way that does not suggest that either of us endorses you or your use of this work;

• use this work for noncommercial purposes only; and

• if you adapt or build upon this work, apply this same license to your contributions.

Definition 2.1. A sample space $S$ is the set of all possible outcomes of an experiment.

Definition 2.2. An event $A \subseteq S$ is a subset of a sample space.

Definition 2.3. Assuming that all outcomes are equally likely and that the sample space is finite,
\[ P(A) = \frac{\#\text{ favorable outcomes}}{\#\text{ possible outcomes}} \]
is the probability that $A$ occurs.

Proposition 2.4 (Multiplication Rule). If there are $r$ experiments and each experiment has $n_i$ possible outcomes, then the overall sample space has size $n_1 n_2 \cdots n_r$.

Example. The probability of a full house in a five-card poker hand (without replacement, and without other players) is
\[ P(\text{full house}) = \frac{13\binom{4}{3}\cdot 12\binom{4}{2}}{\binom{52}{5}} \]
e.g., three 7's and two 10's.

Definition 2.5. The binomial coefficient is given by
\[ \binom{n}{k} = \frac{n!}{(n-k)!\,k!} \]
or $0$ if $k > n$; it is the number of subsets of size $k$ of a set of size $n$.

Theorem 2.6 (Sampling Table). The number of ways to choose $k$ elements from a set of $n$ distinct elements is given by the following table:

                      ordered                    unordered
  replacement         $n^k$                      $\binom{n+k-1}{k}$
  no replacement      $\frac{n!}{(n-k)!}$        $\binom{n}{k}$

Lecture 3 — 9/7/11

Proposition 3.1. The number of ways to choose $k$ elements from a set of order $n$, with replacement and where order doesn't matter, is $\binom{n+k-1}{k}$.

Proof. This is equivalent to counting the ways to put $k$ indistinguishable particles in $n$ distinguishable boxes. This count is simply the number of ways to place "dividers" between the particles, e.g.,
\[ \bullet\,\bullet\,\bullet \mid \bullet \mid \bullet\,\bullet \mid\mid \bullet \mid \bullet \]
There are $n+k-1$ positions, and we need to choose $k$ of them to put the dots, so there are
\[ \binom{n+k-1}{k} = \binom{n+k-1}{n-1} \]
ways to place the particles, which determines the placement of the dividers (or vice versa); this is our result. ∎

Example.
1. $\binom{n}{k} = \binom{n}{n-k}$
2. $n\binom{n-1}{k-1} = k\binom{n}{k}$
   Pick $k$ people out of $n$, then designate one as special. The RHS represents how many ways we can do this by first picking the $k$ individuals and then making our designation. On the LHS, we see the number of ways to pick a special individual and then pick the remaining $k-1$ individuals from the remaining pool of $n-1$.
3. (Vandermonde) $\binom{n+m}{k} = \sum_{i=0}^{k}\binom{n}{i}\binom{m}{k-i}$
   On the LHS, we choose $k$ people out of $n+m$. On the RHS, we sum up, for every $i$, how to choose $i$ from the $n$ people and $k-i$ from the $m$ people.

Definition 3.2. A probability space consists of a sample space $S$ along with a function $P\colon \mathcal{P}(S)\to[0,1]$ taking events (subsets $A$ of $S$) to real numbers, such that
1. $P(\emptyset) = 0$, $P(S) = 1$
2. $P\left(\bigcup_{n=1}^{\infty}A_n\right) = \sum_{n=1}^{\infty}P(A_n)$ if the $A_n$ are disjoint

Lecture 4 — 9/9/11

Example (Birthday Problem). The probability that at least two people among a group of $k$ share the same birthday, assuming that birthdays are evenly distributed across the 365 standard days, is given by
\[ P(\text{match}) = 1 - P(\text{no match}) = 1 - \frac{365\cdot 364\cdots(365-k+1)}{365^k} \]
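A quick numerical check of the birthday formula (a minimal Python sketch, not part of the original notes; the group sizes are chosen arbitrarily):

```python
# Evaluate P(match) = 1 - 365*364*...*(365-k+1)/365^k for a few group sizes k.
def p_match(k: int) -> float:
    p_no_match = 1.0
    for i in range(k):
        p_no_match *= (365 - i) / 365
    return 1 - p_no_match

for k in [10, 23, 50]:
    print(k, round(p_match(k), 4))   # k = 23 already gives ~0.507
```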

order doesn’t matter, is Proposition 4.1.

✓ ◆

n+k 1 1. P (AC ) = 1 P (A).

k 2. If A ✓ B, then P (A) P (B).

1

3. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Proof. All immediate. ∎

Corollary 4.2 (Inclusion-Exclusion). Generalizing 3 above,
\[ P\left(\bigcup_{i=1}^{n}A_i\right) = \sum_{i=1}^{n}P(A_i) - \sum_{i<j}P(A_i\cap A_j) + \sum_{i<j<k}P(A_i\cap A_j\cap A_k) - \cdots + (-1)^{n+1}P\left(\bigcap_{i=1}^{n}A_i\right) \]

Example (de Montmort's Problem). Suppose we have $n$ cards labeled $1,\ldots,n$. We want to determine the probability that for some card in a shuffled deck of such cards, the $i$th card has value $i$. Since the number of orderings of the deck for which a given set of matches occurs is simply the number of permutations of the remaining cards, we have
\[ P(A_i) = \frac{(n-1)!}{n!} = \frac{1}{n}, \qquad P(A_1\cap A_2) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)}, \qquad P(A_1\cap\cdots\cap A_k) = \frac{(n-k)!}{n!} \]
So using the above corollary,
\[ P(A_1\cup\cdots\cup A_n) = n\cdot\frac{1}{n} - \frac{n(n-1)}{2!}\cdot\frac{1}{n(n-1)} + \frac{n(n-1)(n-2)}{3!}\cdot\frac{1}{n(n-1)(n-2)} - \cdots = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n+1}\frac{1}{n!} \approx 1 - \frac{1}{e} \]
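A simulation of the matching problem (a minimal Python sketch, not part of the original notes; the deck size and trial count are arbitrary):

```python
# Estimate P(at least one card matches its position) and compare with 1 - 1/e.
import math, random

def has_match(n: int) -> bool:
    deck = list(range(1, n + 1))
    random.shuffle(deck)
    return any(card == pos for pos, card in enumerate(deck, start=1))

n, trials = 20, 100_000
estimate = sum(has_match(n) for _ in range(trials)) / trials
print(estimate, 1 - 1 / math.e)   # both ~0.632
```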

Lecture 5 — 9/12/11

Note. Translation from English to inclusion-exclusion, given $P(A_1\cup\cdots\cup A_n)$:
- Probability that none of the $A_i$ occurs: $1 - P(A_1\cup\cdots\cup A_n)$
- Probability that all of the $A_i$ occur: $P(A_1\cap\cdots\cap A_n) = 1 - P(A_1^C\cup\cdots\cup A_n^C)$

Definition 5.1. Two events $A$ and $B$ are independent if
\[ P(A\cap B) = P(A)P(B) \]
In general, for $n$ events $A_1,\ldots,A_n$, independence requires $i$-wise independence for every $i = 2,\ldots,n$; that is, pairwise independence alone does not imply independence.

Note. We will write $P(A\cap B)$ as $P(A,B)$.

Example (Newton-Pepys Problem). Suppose we have some fair dice; we want to determine which of the following is most likely to occur:
1. At least one 6 with 6 dice.
2. At least two 6's with 12 dice.
3. At least three 6's with 18 dice.
For the first case, we have
\[ P(A) = 1 - \left(\tfrac{5}{6}\right)^6 \]
For the second,
\[ P(B) = 1 - \left(\tfrac{5}{6}\right)^{12} - 12\left(\tfrac{1}{6}\right)\left(\tfrac{5}{6}\right)^{11} \]
and for the third,
\[ P(C) = 1 - \sum_{k=0}^{2}\binom{18}{k}\left(\tfrac{1}{6}\right)^{k}\left(\tfrac{5}{6}\right)^{18-k} \]
(It turns out that the first case has the highest probability.)
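Evaluating the three Newton-Pepys probabilities numerically (a minimal Python sketch, not part of the original notes):

```python
# Compare P(at least r sixes in n fair dice) for the three cases.
from math import comb

def p_at_least(r: int, n: int) -> float:
    return 1 - sum(comb(n, k) * (1/6)**k * (5/6)**(n - k) for k in range(r))

print(p_at_least(1, 6), p_at_least(2, 12), p_at_least(3, 18))
# ~0.665, ~0.619, ~0.597 — at least one 6 in 6 dice is the most likely.
```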

Thus far, all the probabilities with which we have concerned ourselves have been unconditional. We now turn to conditional probability, which concerns how to update our beliefs (and computed probabilities) based on new evidence.

Definition 5.2. The probability of an event $A$ given $B$ is
\[ P(A|B) = \frac{P(A\cap B)}{P(B)} \]
if $P(B) > 0$. Equivalently,
\[ P(A\cap B) = P(B)P(A|B) = P(A)P(B|A) \]
or, more generally,
\[ P(A_1\cap\cdots\cap A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1,A_2)\cdots P(A_n|A_1,\ldots,A_{n-1}) \]
Rearranging the two expressions for $P(A\cap B)$ gives Bayes' rule:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]

Theorem 6.1 (Law of Total Probability). Let $S$ be a sample space and $A_1,\ldots,A_n$ a partition of $S$. Then
\[ P(B) = P(B\cap A_1) + \cdots + P(B\cap A_n) = P(B|A_1)P(A_1) + \cdots + P(B|A_n)P(A_n) \]

Example. Suppose we are given a random two-card hand from a standard deck.
1. What is the probability that both cards are aces given that we have an ace?
\[ P(\text{both aces}\mid\text{have ace}) = \frac{P(\text{both aces, have ace})}{P(\text{have ace})} = \frac{\binom{4}{2}/\binom{52}{2}}{1 - \binom{48}{2}/\binom{52}{2}} = \frac{1}{33} \]
2. What is the probability that both cards are aces given that we have the ace of spades?
\[ P(\text{both aces}\mid\text{ace of spades}) = \frac{3}{51} = \frac{1}{17} \]

Example. Suppose that a patient is being tested for a disease and it is known that 1% of similar patients have the disease. Suppose also that the patient tests positive and that the test is 95% accurate. Let $D$ be the event that the patient has the disease and $T$ the event that he tests positive. Then we know $P(T|D) = 0.95 = P(T^C|D^C)$. Using Bayes' theorem and the Law of Total Probability, we can compute
\[ P(D|T) = \frac{P(T|D)P(D)}{P(T)} = \frac{P(T|D)P(D)}{P(T|D)P(D) + P(T|D^C)P(D^C)} \approx 0.16 \]
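The same computation in code (a minimal Python sketch, not part of the original notes):

```python
# Disease-testing example: P(D|T) via Bayes' rule and the Law of Total Probability.
p_d = 0.01                      # prevalence
p_t_given_d = 0.95              # P(T|D)
p_t_given_not_d = 0.05          # P(T|D^C), since the test is "95% accurate"
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)
print(p_t_given_d * p_d / p_t)  # ~0.161
```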

Definition 6.2. Two events $A$ and $B$ are conditionally independent given an event $C$ if
\[ P((A\cap B)\mid C) = P(A|C)P(B|C) \]

Note. Conditionally independent events are not necessarily unconditionally independent. For instance, suppose we have a chess opponent of unknown strength. We might say that conditional on the opponent's strength, all game outcomes would be independent. However, without knowing the opponent's strength, earlier games would give us useful information about the opponent's strength; hence, without the conditioning, the game outcomes are not independent.

Conversely, unconditionally independent events are not necessarily conditionally independent. Suppose we know that a fire alarm goes off (event $A$). Suppose there are only two possible causes, that a fire happened, $F$, or that someone was making popcorn, $C$, and suppose moreover that these events are independent. Given, however, that the alarm went off, we have
\[ P(F\mid(A\cap C^C)) = 1 \]
and hence we do not have conditional independence.

Lecture 7 — 9/16/11

Example (Monty Hall Problem). Suppose there are three doors, behind two of which are goats and behind one of which is a car. Monty Hall, the game show host, knows the contents of each door, but we, the player, do not, and have one chance to choose the car. After choosing a door, Monty then opens one of the two remaining doors to reveal a goat (if both remaining doors have goats, he chooses with equal probability). We are then given the option to change our choice—should we do so?

In fact, we should; the chance that switching will give us the car is the same as the chance that we did not originally pick the car, which is $\frac{2}{3}$. However, we can also solve the problem by conditioning. Suppose we have chosen a door (WLOG, say the first). Let $S$ be the event of finding the car by switching, and let $D_i$ be the event that the car is in door $i$. Then by the Law of Total Probability,
\[ P(S) = P(S|D_1)\tfrac{1}{3} + P(S|D_2)\tfrac{1}{3} + P(S|D_3)\tfrac{1}{3} = 0 + 1\cdot\tfrac{1}{3} + 1\cdot\tfrac{1}{3} = \frac{2}{3} \]
By symmetry, the probability that we succeed conditioned on the door Monty opens is the same.
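A Monte Carlo check of the Monty Hall argument (a minimal Python sketch, not part of the original notes):

```python
# Switching wins with probability ~2/3, staying with probability ~1/3.
import random

def play(switch: bool) -> bool:
    car = random.randint(0, 2)
    choice = random.randint(0, 2)
    # Monty opens a goat door that is neither the car nor our choice
    opened = random.choice([d for d in range(3) if d != choice and d != car])
    if switch:
        choice = next(d for d in range(3) if d != choice and d != opened)
    return choice == car

trials = 100_000
print(sum(play(True) for _ in range(trials)) / trials,
      sum(play(False) for _ in range(trials)) / trials)
```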

Example (Simpson's Paradox). Suppose we have the two following tables for the success of two doctors (Dr. Hibbert and Dr. Nick) on two different operations:

  Hibbert        heart    band-aid
  success        70       10
  failure        20       0

  Nick           heart    band-aid
  success        2        81
  failure        8        9

Note that although Hibbert has a higher success rate conditional on each operation, Nick's success rate is higher overall. Let us denote $A$ to be the event of a successful operation, $B$ the event of being treated by Nick, and $C$ the event of having heart surgery. In other words, then, we have
\[ P(A|B,C) < P(A|B^C,C) \quad\text{and}\quad P(A|B,C^C) < P(A|B^C,C^C) \quad\text{but}\quad P(A|B) > P(A|B^C) \]
In this example, $C$ is the confounder.

Lecture 8 — 9/19/11

Definition 8.1. A one-dimensional random walk models a (possibly infinite) sequence of successive steps along the number line, where, starting from some position $i$, we have a probability $p$ of moving $+1$ and a probability $q = 1-p$ of moving $-1$.

Example. An example of a one-dimensional random walk is the gambler's ruin problem, which asks: given two individuals $A$ and $B$ playing a sequence of successive rounds of a game in which they bet \$1, with $A$ winning $B$'s dollar with probability $p$ and $A$ losing a dollar to $B$ with probability $q = 1-p$, what is the probability that $A$ wins the game (supposing $A$ has $i$ dollars and $B$ has $n-i$ dollars)? This problem can be modeled by a random walk with absorbing states at $0$ and $n$, starting at $i$.

To solve this problem, we perform first-step analysis; that is, we condition on the first step. Let $p_i = P(A\text{ wins game}\mid A\text{ starts at }i)$. Then by the Law of Total Probability, for $1 \le i \le n-1$,
\[ p_i = p\,p_{i+1} + q\,p_{i-1} \]
and of course we have $p_0 = 0$ and $p_n = 1$. This equation is a difference equation.

To solve this equation, we start by guessing $p_i = x^i$. Then we have
\[ x^i = px^{i+1} + qx^{i-1}, \qquad px^2 - x + q = 0, \]
\[ x = \frac{1\pm\sqrt{1-4pq}}{2p} = \frac{1\pm\sqrt{1-4p(1-p)}}{2p} = \frac{1\pm\sqrt{4p^2-4p+1}}{2p} = \frac{1\pm(2p-1)}{2p} = 1,\ \frac{q}{p} \]
As with differential equations, this gives a general solution of the form
\[ p_i = A\cdot 1^i + B\left(\frac{q}{p}\right)^i \]
for $p \ne q$ (to avoid a repeated root). Our boundary conditions for $p_0$ and $p_n$ give $B = -A$ and $1 = A\left(1 - \left(\frac{q}{p}\right)^n\right)$.

To solve for the case where $p = q$, we can set $x = \frac{q}{p}$ and take
\[ \lim_{x\to 1}\frac{1-x^i}{1-x^n} = \lim_{x\to 1}\frac{ix^{i-1}}{nx^{n-1}} = \frac{i}{n} \]
So we have
\[ p_i = \begin{cases} \dfrac{1-(q/p)^i}{1-(q/p)^n} & p \ne q \\[2mm] \dfrac{i}{n} & p = q \end{cases} \]
Now suppose that $p = 0.49$ and $i = n - i$. Then we have the following surprising table:

  N      P(A wins)
  20     0.40
  100    0.12
  200    0.02

Note that this table is true when the odds are only slightly against $A$ and when $A$ and $B$ start off with equal funding; it is easy to see that in a typical gambler's situation, the chance of winning is extremely small.
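The table above can be reproduced directly from the closed form (a minimal Python sketch, not part of the original notes):

```python
# Evaluate p_i = (1 - (q/p)^i) / (1 - (q/p)^N) for p = 0.49, i = N/2.
def p_win(p: float, i: int, N: int) -> float:
    q = 1 - p
    if abs(p - q) < 1e-12:
        return i / N
    r = q / p
    return (1 - r**i) / (1 - r**N)

for N in [20, 100, 200]:
    print(N, round(p_win(0.49, N // 2, N), 2))   # 0.40, 0.12, 0.02
```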

Definition 8.2. A random variable is a function
\[ X\colon S \to \mathbb{R} \]
from some sample space $S$ to the real line. A random variable acts as a "summary" of some aspect of an experiment.

Definition 8.3. A random variable $X$ is said to have the Bernoulli distribution if $X$ has only two possible values, $0$ and $1$, and there is some $p$ such that
\[ P(X=1) = p, \qquad P(X=0) = 1-p \]
We say that $X \sim \mathrm{Bern}(p)$.

Note. We write $X = 1$ to denote the event
\[ \{s \in S : X(s) = 1\} = X^{-1}(\{1\}) \]

Definition 8.4. The distribution of the number of successes in $n$ independent $\mathrm{Bern}(p)$ trials is called the binomial distribution and is given by
\[ P(X=k) = \binom{n}{k}p^k(1-p)^{n-k} \]
where $0 \le k \le n$. We write $X \sim \mathrm{Bin}(n,p)$.

Definition 8.5. The probability mass function (PMF) of a discrete random variable (one taking enumerably many values) is the function that gives the probability that the random variable takes some value. That is, given a discrete random variable $X$, its PMF is
\[ f_X(x) = P(X = x) \]

Lecture 9 — 9/21/11

In addition to our definition of the binomial distribution by its PMF, we can also express a random variable $X \sim \mathrm{Bin}(n,p)$ as a sum of indicator random variables,
\[ X = X_1 + \cdots + X_n \qquad\text{where}\qquad X_i = \begin{cases} 1 & i\text{th trial succeeds} \\ 0 & \text{otherwise} \end{cases} \]
In other words, the $X_i$ are i.i.d. (independent, identically distributed) $\mathrm{Bern}(p)$.

Definition 9.1. The cumulative distribution function (CDF) of a random variable $X$ is
\[ F_X(x) = P(X \le x) \]

Note. The requirements for a PMF with values $p_i$ are that each $p_i \ge 0$ and $\sum_i p_i = 1$. For $\mathrm{Bin}(n,p)$, we can easily verify this with the binomial theorem, which yields
\[ \sum_{k=0}^{n}\binom{n}{k}p^kq^{n-k} = (p+q)^n = 1 \]

Proposition 9.2. If $X, Y$ are independent random variables and $X \sim \mathrm{Bin}(n,p)$, $Y \sim \mathrm{Bin}(m,p)$, then
\[ X + Y \sim \mathrm{Bin}(n+m, p) \]

Proof. This is clear from our "story" definition of the binomial distribution, as well as from our indicator r.v.'s. Let us also check this using PMFs.
\[ P(X+Y=k) = \sum_{j=0}^{k}P(X+Y=k\mid X=j)P(X=j) = \sum_{j=0}^{k}P(Y=k-j\mid X=j)\binom{n}{j}p^jq^{n-j} \]
\[ = \sum_{j=0}^{k}P(Y=k-j)\binom{n}{j}p^jq^{n-j} \ \text{(independence)} = \sum_{j=0}^{k}\binom{m}{k-j}p^{k-j}q^{m-(k-j)}\binom{n}{j}p^jq^{n-j} \]
\[ = p^kq^{n+m-k}\sum_{j=0}^{k}\binom{m}{k-j}\binom{n}{j} = \binom{n+m}{k}p^kq^{n+m-k} \ \text{(Vandermonde)} \qquad\blacksquare \]

Example. Suppose we draw a random 5-card hand from a standard 52-card deck. We want to find the distribution of the number of aces in the hand. Let $X = \#\text{aces}$. We want to determine the PMF of $X$ (or the CDF—but the PMF is easier). We know that $P(X=k) = 0$ except if $k = 0,1,2,3,4$. This is clearly not binomial since the trials (of drawing cards) are not independent. For $k = 0,1,2,3,4$, we have
\[ P(X=k) = \frac{\binom{4}{k}\binom{48}{5-k}}{\binom{52}{5}} \]
which is just the probability of choosing $k$ out of the 4 aces and $5-k$ of the non-aces. This is reminiscent of the elk problem in the homework.

Definition 9.3. Suppose we have $w$ white and $b$ black marbles, out of which we choose a simple random sample of $n$. The distribution of the number of white marbles in the sample, which we will call $X$, is given by
\[ P(X=k) = \frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}} \]
where $0 \le k \le w$ and $0 \le n-k \le b$. This is called the hypergeometric distribution, denoted $\mathrm{HGeom}(w,b,n)$.

Proof. We should show that the above is a valid PMF. It is clearly nonnegative. We also have, by Vandermonde's identity,
\[ \sum_{k=0}^{w}\frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}} = \frac{\binom{w+b}{n}}{\binom{w+b}{n}} = 1 \qquad\blacksquare \]
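For example, the aces-in-a-hand PMF above is $\mathrm{HGeom}(4, 48, 5)$; a quick check that it sums to 1 (a minimal Python sketch, not part of the original notes):

```python
# PMF of the number of aces in a 5-card hand, i.e. HGeom(w=4, b=48, n=5).
from math import comb

def hgeom_pmf(k: int, w: int, b: int, n: int) -> float:
    return comb(w, k) * comb(b, n - k) / comb(w + b, n)

pmf = [hgeom_pmf(k, 4, 48, 5) for k in range(5)]
print([round(p, 4) for p in pmf], sum(pmf))   # probabilities, and their sum (= 1)
```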

Note. The difference between the hypergeometric and binomial distributions is whether or not we sample with replacement. We would expect that in the limiting case of a very large population ($w + b \to \infty$), they would behave similarly.

Lecture 10 — 9/23/11

Proposition 10.1 (Properties of CDFs). A function $F_X$ is a valid CDF iff the following hold about $F_X$:
1. monotonically nondecreasing
2. right-continuous
3. $\lim_{x\to-\infty}F_X(x) = 0$ and $\lim_{x\to\infty}F_X(x) = 1$.

Definition 10.2. Two random variables $X$ and $Y$ are independent if $\forall x, y$,
\[ P(X \le x, Y \le y) = P(X \le x)P(Y \le y) \]
In the discrete case, we can say equivalently that
\[ P(X = x, Y = y) = P(X = x)P(Y = y) \]

Note. As an aside before we move on to discuss averages and expected values, recall that
\[ \frac{1}{n}\sum_{i=1}^{n}i = \frac{n+1}{2} \]

Example. Suppose we want to find the average of $1,1,1,1,1,3,3,5$. We could just add these up and divide by 8, or we could formulate the average as a weighted average,
\[ \frac{5}{8}\cdot 1 + \frac{2}{8}\cdot 3 + \frac{1}{8}\cdot 5 \]

Definition 10.3. The expected value or average of a discrete random variable $X$ is
\[ E(X) = \sum_{x\in\mathrm{Im}(X)}x\,P(X=x) \]

Observation 10.4. Let $X \sim \mathrm{Bern}(p)$. Then
\[ E(X) = 1\cdot P(X=1) + 0\cdot P(X=0) = p \]

Definition 10.5. If $A$ is some event, then an indicator random variable for $A$ is
\[ X = \begin{cases} 1 & A\text{ occurs} \\ 0 & \text{otherwise} \end{cases} \]
By definition, $X \sim \mathrm{Bern}(P(A))$, and by the above, $E(X) = P(A)$.

The above shows that to get the probability of an event, we can simply compute the expected value of an indicator.

Observation 10.6. Let $X \sim \mathrm{Bin}(n,p)$. Then (using the binomial theorem),
\[ E(X) = \sum_{k=0}^{n}k\binom{n}{k}p^kq^{n-k} = \sum_{k=1}^{n}k\binom{n}{k}p^kq^{n-k} = \sum_{k=1}^{n}n\binom{n-1}{k-1}p^kq^{n-k} = np\sum_{j=0}^{n-1}\binom{n-1}{j}p^jq^{n-1-j} = np \]

Proposition 10.7. Expected value is linear; that is, for random variables $X$ and $Y$ and some constant $c$,
\[ E(X+Y) = E(X) + E(Y) \qquad\text{and}\qquad E(cX) = cE(X) \]

Observation 10.8. Using linearity, given $X \sim \mathrm{Bin}(n,p)$, since we know $X = X_1 + \cdots + X_n$ where the $X_i$ are i.i.d. $\mathrm{Bern}(p)$, we have
\[ E(X) = p + \cdots + p = np \]

Example. Suppose that, once again, we are choosing a five card hand out of a standard deck, with $X = \#\text{aces}$. If $X_i$ is an indicator of the $i$th card being an ace, we have
\[ E(X) = E(X_1 + \cdots + X_5) = E(X_1) + \cdots + E(X_5) = 5E(X_1) \ \text{(by symmetry)} = 5\,P(\text{first card is ace}) = \frac{5}{13} \]
Note that this holds even though the $X_i$ are dependent.
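A simulation confirming the symmetry/linearity argument even though the indicators are dependent (a minimal Python sketch, not part of the original notes):

```python
# Estimate E(#aces in a 5-card hand) and compare with 5/13.
import random

deck = [rank for rank in range(13) for _ in range(4)]   # rank 0 encodes "ace"
trials = 200_000
total = 0
for _ in range(trials):
    hand = random.sample(deck, 5)
    total += hand.count(0)
print(total / trials, 5 / 13)   # both ~0.3846
```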

Definition 10.9. The geometric distribution, $\mathrm{Geom}(p)$, is the number of failures of independent $\mathrm{Bern}(p)$ trials before the first success. Its PMF is given by (for $X \sim \mathrm{Geom}(p)$)
\[ P(X = k) = q^k p \]

for $k \in \mathbb{N}$. Note that this PMF is valid since
\[ \sum_{k=0}^{\infty}pq^k = p\cdot\frac{1}{1-q} = 1 \]

Observation 10.10. Let $X \sim \mathrm{Geom}(p)$. We have our formula for infinite geometric series,
\[ \sum_{k=0}^{\infty}q^k = \frac{1}{1-q} \]
Taking the derivative of both sides gives
\[ \sum_{k=1}^{\infty}kq^{k-1} = \frac{1}{(1-q)^2} \]
Then
\[ E(X) = \sum_{k=0}^{\infty}kpq^k = pq\sum_{k=0}^{\infty}kq^{k-1} = \frac{pq}{(1-q)^2} = \frac{q}{p} \]
Alternatively, we can use first step analysis and write a recursive formula for $E(X)$. If we condition on what happens in the first Bernoulli trial, we have
\[ E(X) = 0\cdot p + (1 + E(X))q, \qquad E(X) - qE(X) = q, \qquad E(X) = \frac{q}{1-q} = \frac{q}{p} \]

Lecture 11 — 9/26/11

Recall our assertion that $E$, the expected value function, is linear. We now prove this statement.

Proof. Let $X$ and $Y$ be discrete random variables. We want to show that $E(X+Y) = E(X) + E(Y)$.
\[ E(X+Y) = \sum_{t}t\,P(X+Y=t) = \sum_{s}(X+Y)(s)P(\{s\}) = \sum_{s}(X(s)+Y(s))P(\{s\}) \]
\[ = \sum_{s}X(s)P(\{s\}) + \sum_{s}Y(s)P(\{s\}) = \sum_{x}x\,P(X=x) + \sum_{y}y\,P(Y=y) = E(X) + E(Y) \]
The proof that $E(cX) = cE(X)$ is similar. ∎

Definition 11.1. The negative binomial distribution, $\mathrm{NB}(r,p)$, is given by the number of failures of independent $\mathrm{Bern}(p)$ trials before the $r$th success. The PMF for $X \sim \mathrm{NB}(r,p)$ is given by
\[ P(X = n) = \binom{n+r-1}{r-1}p^r(1-p)^n \]
for $n \in \mathbb{N}$.

Observation 11.2. Let $X \sim \mathrm{NB}(r,p)$. We can write $X = X_1 + \cdots + X_r$ where each $X_i$ is the number of failures between the $(i-1)$th and $i$th success. Then $X_i \sim \mathrm{Geom}(p)$. Thus,
\[ E(X) = E(X_1) + \cdots + E(X_r) = \frac{rq}{p} \]

Observation 11.3. Let $X \sim \mathrm{FS}(p)$, where $\mathrm{FS}(p)$ is the time until the first success of independent $\mathrm{Bern}(p)$ trials, counting the success. Then if we take $Y = X - 1$, we have $Y \sim \mathrm{Geom}(p)$. So,
\[ E(X) = E(Y) + 1 = \frac{q}{p} + 1 = \frac{1}{p} \]

Example. Suppose we have a random permutation of $\{1,\ldots,n\}$ with $n \ge 2$. What is the expected number of local maxima—that is, numbers greater than both their neighbors?

Let $I_j$ be the indicator random variable for position $j$ being a local maximum ($1 \le j \le n$). We are interested in
\[ E(I_1 + \cdots + I_n) = E(I_1) + \cdots + E(I_n) \]
For the non-endpoint positions, in each local neighborhood of three numbers, the probability that the largest number is in the center position is $\frac{1}{3}$, e.g.,
\[ 5, 2, \cdots, \underbrace{28, 3, 8}_{}, \cdots, 14 \]
Moreover, these positions are all symmetrical. Analogously, the probability that an endpoint position is a local maximum is $\frac{1}{2}$. Then we have
\[ E(I_1) + \cdots + E(I_n) = \frac{n-2}{3} + \frac{2}{2} = \frac{n+1}{3} \]

Example (St. Petersburg Paradox). Suppose you are given the offer to play a game where a coin is flipped until a heads is landed. Then, for the number of flips $i$ made up to and including the heads, you receive $\$2^i$. How much should you be willing to pay to play this game? That is, what price would make the game fair, or the expected value zero?

Let $X$ be the number of flips of the fair coin up to and including the first heads. Clearly, $X \sim \mathrm{FS}(\frac{1}{2})$. If we let $Y = 2^X$, we want to find $E(Y)$. We have
\[ E(Y) = \sum_{k=1}^{\infty}2^k\cdot\frac{1}{2^k} = \sum_{k=1}^{\infty}1 = \infty \]

This assumes, however, that our cash source is boundless. If we bound it at $2^K$ for some specific $K$, we should only bet $K$ dollars for a fair game—this is a sizeable difference.
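A simulation of the capped game (a minimal Python sketch, not part of the original notes; the caps and trial counts are arbitrary, and the estimate is noisy because the payoff is heavy-tailed):

```python
# Average payoff when the bank's wealth is capped at 2**K: roughly K + 1, not infinite.
import random

def capped_payoff(K: int) -> int:
    flips = 1
    while random.random() < 0.5 and flips < K:
        flips += 1
    return 2 ** flips

for K in [5, 10, 15]:
    trials = 100_000
    avg = sum(capped_payoff(K) for _ in range(trials)) / trials
    print(K, round(avg, 1))   # hovers around K + 1
```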

Lecture 12 — 9/28/11

Definition 12.1. The Poisson distribution, $\mathrm{Pois}(\lambda)$, is given by the PMF
\[ P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!} \]
for $k \in \mathbb{N}$, $X \sim \mathrm{Pois}(\lambda)$. We call $\lambda$ the rate parameter.

Observation 12.2. Checking that this PMF is indeed valid, we have
\[ \sum_{k=0}^{\infty}e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}e^{\lambda} = 1 \]
Its mean is given by
\[ E(X) = e^{-\lambda}\sum_{k=0}^{\infty}k\frac{\lambda^k}{k!} = e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^k}{(k-1)!} = \lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda}e^{\lambda} = \lambda \]

The Poisson distribution is often used for applications where we count the successes of a large number of trials where the per-trial success rate is small. For example, the Poisson distribution is a good starting point for counting the number of people who email you over the course of an hour. The number of chocolate chips in a chocolate chip cookie is another good candidate for a Poisson distribution, or the number of earthquakes in a year in some particular region.

Since the Poisson distribution is not bounded, these examples will not be precisely Poisson. However, in general, with a large number of events $A_i$ with small $P(A_i)$, and where the $A_i$ are all independent or "weakly dependent," the number of the $A_i$ that occur is approximately $\mathrm{Pois}(\lambda)$, with $\lambda \approx \sum_{i=1}^{n}P(A_i)$. We call this a Poisson approximation.

Proposition 12.3. Let $X \sim \mathrm{Bin}(n,p)$. Then as $n \to \infty$, $p \to 0$, with $\lambda = np$ held constant, the distribution of $X$ converges to $\mathrm{Pois}(\lambda)$.

Proof. Fix $k$. Then as $n \to \infty$ and $p \to 0$,
\[ \lim P(X = k) = \lim\binom{n}{k}p^k(1-p)^{n-k} = \lim\frac{n(n-1)\cdots(n-k+1)}{k!}\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^n\left(1-\frac{\lambda}{n}\right)^{-k} = \frac{\lambda^k}{k!}e^{-\lambda} \qquad\blacksquare \]

Example. Suppose we have $n$ people and we want to know the approximate probability that at least three individuals have the same birthday. There are $\binom{n}{3}$ triplets of people; for each triplet, let $I_{ijk}$ be the indicator r.v. that persons $i$, $j$, and $k$ have the same birthday. Let $X = \#\text{triple matches}$. Then we know that
\[ E(X) = \binom{n}{3}\frac{1}{365^2} \]
To approximate $P(X \ge 1)$, we approximate $X \sim \mathrm{Pois}(\lambda)$ with $\lambda = E(X)$. Then we have
\[ P(X \ge 1) = 1 - P(X = 0) = 1 - \frac{e^{-\lambda}\lambda^0}{0!} = 1 - e^{-\lambda} \]

= e

(k 1)! Definition 13.1. Let X be a random variable. Then X

k=1

= e e has a probability density function (PDF) fX (x) if

Z b

=

P (a X b) = fX (x) dx

a

The Poisson distribution is often used for applications A valid PDF must satisfy

where we count the successes of a large number of trials

where the per-trial success rate is small. For example, the 1. 8x, fX (x) 0

Poisson distribution is a good starting point for counting Z 1

the number of people who email you over the course of 2. fX (x) dx = 1

1

an hour. The number of chocolate chips in a chocolate

chip cookie is another good candidate for a Poisson dis- Note. For ✏ > 0 very small, we have

✓ ◆

tribution, or the number of earthquakes in a year in some ✏ ✏

fX (x0 ) · ✏ ⇡ P X 2 (x0 , x0 + )

particular region. 2 2

Since the Poisson distribution is not bounded, these

Theorem 13.2. If X has PDF fX , then its CDF is

examples will not be precisely Poisson. However, in gen- Z x

eral, with a large number of events Ai with small P (Ai ), FX (x) = P (X x) = fX (t) dt

and where the Ai are all independent or “weakly depen- 1

dent,” then the number of thePn Ai that occur is approx- If X is continuous and has CDF FX , then its PDF is

imately Pois( ), with ⇡ i=1 P (Ai ). We call this a 0

Poisson approximation. fX (x) = FX (x)

Moreover,

Proposition 12.3. Let X ⇠ Bin(n, p). Then as n ! 1, Z b

p ! 0, and where = np is held constant, we have

P (a < X < b) = fX (x) dx = FX (b) FX (a)

X ⇠ Pois( ). a

8

Definition 13.3. The expected value of a continuous random variable $X$ is given by
\[ E(X) = \int_{-\infty}^{\infty}x f_X(x)\,dx \]

Giving the expected value is like giving a one-number summary of the average, but it provides no information about the spread of a distribution.

Definition 13.4. The variance of a random variable $X$ is given by
\[ \mathrm{Var}(X) = E((X - EX)^2) \]
which is the expected squared distance from $X$ to its mean; that is, it measures, on average, how far $X$ is from its mean.

We can't use $E(X - EX)$ because, by linearity, we have
\[ E(X - EX) = EX - E(EX) = EX - EX = 0 \]
We would like to use $E|X - EX|$, but absolute value is hard to work with; instead, we have

Definition 13.5. The standard deviation of a random variable $X$ is
\[ \mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} \]

Note. Another way we can write variance is
\[ \mathrm{Var}(X) = E((X-EX)^2) = E(X^2 - 2X(EX) + (EX)^2) = E(X^2) - 2E(X)E(X) + (EX)^2 = E(X^2) - (EX)^2 \]

Definition 13.6. The uniform distribution, $\mathrm{Unif}(a,b)$, is given by a completely random point chosen in the interval $[a,b]$. Note that the probability of picking a given point $x_0$ is exactly 0; the uniform distribution is continuous. The PDF for $U \sim \mathrm{Unif}(a,b)$ is given by
\[ f_U(x) = \begin{cases} c & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \]
for some constant $c$. To find $c$, we note that, by the definition of PDF, we have
\[ \int_a^b c\,dx = 1, \qquad c(b-a) = 1, \qquad c = \frac{1}{b-a} \]
The CDF is
\[ F_U(x) = \int_{-\infty}^{x}f_U(t)\,dt = \int_a^x f_U(t)\,dt = \int_a^x \frac{1}{b-a}\,dt = \frac{x-a}{b-a} \]

Observation 13.7. The expected value of an r.v. $U \sim \mathrm{Unif}(a,b)$ is
\[ E(U) = \int_a^b\frac{x}{b-a}\,dx = \left.\frac{x^2}{2(b-a)}\right|_a^b = \frac{b^2-a^2}{2(b-a)} = \frac{(b+a)(b-a)}{2(b-a)} = \frac{b+a}{2} \]
This is the midpoint of the interval $[a,b]$.

Finding the variance of $U \sim \mathrm{Unif}(a,b)$, however, is a bit more trouble. We need to determine $E(U^2)$, but it is too much of a hassle to figure out the PDF of $U^2$. Ideally, things would be as simple as
\[ E(U^2) = \int_{-\infty}^{\infty}x^2 f_U(x)\,dx \]
Fortunately, this is true:

Theorem 13.8 (Law of the Unconscious Statistician (LOTUS)). Let $X$ be a continuous random variable, $g\colon\mathbb{R}\to\mathbb{R}$ continuous. Then
\[ E(g(X)) = \int_{-\infty}^{\infty}g(x)f_X(x)\,dx \]
where $f_X$ is the PDF of $X$. This allows us to determine the expected value of $g(X)$ without knowing its distribution.

Observation 13.9. The variance of $U \sim \mathrm{Unif}(a,b)$ is given by
\[ \mathrm{Var}(U) = E(U^2) - (EU)^2 = \int_a^b x^2 f_U(x)\,dx - \left(\frac{b+a}{2}\right)^2 = \frac{1}{b-a}\int_a^b x^2\,dx - \left(\frac{b+a}{2}\right)^2 \]

\[ = \frac{1}{b-a}\cdot\left.\frac{x^3}{3}\right|_a^b - \left(\frac{b+a}{2}\right)^2 = \frac{b^3-a^3}{3(b-a)} - \frac{(b+a)^2}{4} = \frac{(b-a)^2}{12} \]

The following table is useful for comparing discrete and continuous random variables:

                        discrete                         continuous
  P?F                   PMF $P(X=x)$                     PDF $f_X(x)$
  CDF                   $F_X(x) = P(X \le x)$            $F_X(x) = P(X \le x)$
  $E(X)$                $\sum_x x\,P(X=x)$               $\int_{-\infty}^{\infty}x f_X(x)\,dx$
  $\mathrm{Var}(X)$     $EX^2 - (EX)^2$                  $EX^2 - (EX)^2$
  $E(g(X))$ [LOTUS]     $\sum_x g(x)P(X=x)$              $\int_{-\infty}^{\infty}g(x)f_X(x)\,dx$

Lecture 14 — 10/3/11

Theorem 14.1 (Universality of the Uniform). Let us take $U \sim \mathrm{Unif}(0,1)$, $F$ a strictly increasing CDF. Then for $X = F^{-1}(U)$, we have $X \sim F$. Moreover, for any random variable $X$, if $X \sim F$, then $F(X) \sim \mathrm{Unif}(0,1)$.

Proof. We have
\[ P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x) \]
since $P(U \le F(x))$ is the length of the interval $[0, F(x)]$, which is $F(x)$. For the second part,
\[ P(F(X) \le x) = P(X \le F^{-1}(x)) = F(F^{-1}(x)) = x \]
since $F$ is $X$'s CDF. But this shows that $F(X) \sim \mathrm{Unif}(0,1)$. ∎

Example. Let $F(x) = 1 - e^{-x}$ with $x > 0$ be the CDF of an r.v. $X$. Then $F(X) = 1 - e^{-X} \sim \mathrm{Unif}(0,1)$ by an application of the second part of Universality of the Uniform.

Example. Let $F(x) = 1 - e^{-x}$ with $x > 0$, and also let $U \sim \mathrm{Unif}(0,1)$. Suppose we want to simulate $F$ with a random variable $X$; that is, $X \sim F$. Then computing the inverse
\[ F^{-1}(u) = -\ln(1-u) \]
yields $F^{-1}(U) = -\ln(1-U) \sim F$.

Note that the uniform distribution is symmetric; that is, if $U \sim \mathrm{Unif}(0,1)$, then also $1-U \sim \mathrm{Unif}(0,1)$. The difference is only between measuring $U$ from the right vs. from the left of $[0,1]$. The general uniform distribution also behaves well under linear maps; that is, $a + bU$ is uniform on some interval. A nonlinear transformation of a uniform, by contrast, is not uniform.

Definition 14.3. We say that random variables $X_1,\ldots,X_n$ are independent if
- for continuous, $P(X_1 \le x_1,\ldots,X_n \le x_n) = P(X_1 \le x_1)\cdots P(X_n \le x_n)$
- for discrete, $P(X_1 = x_1,\ldots,X_n = x_n) = P(X_1 = x_1)\cdots P(X_n = x_n)$
The expressions on the LHS are called joint CDFs and joint PMFs respectively. Note that pairwise independence does not imply independence.

Example. Consider the penny matching game, where $X_1, X_2 \sim \mathrm{Bern}(\frac{1}{2})$, i.i.d., and let $X_3$ be the indicator r.v. for the event $X_1 = X_2$ (the r.v. for winning the game). All of these are pairwise independent, but $X_3$ is clearly dependent on the combined outcomes of $X_1$ and $X_2$.

Definition 14.4. The (standard) normal distribution, $\mathcal{N}(0,1)$, is defined by PDF
\[ f(z) = c\,e^{-z^2/2} \]
where $c$ is the normalizing constant required to have $f$ integrate to 1.

Proof. We want to prove that our PDF is valid; to do so, we will simply determine the value of the normalizing constant that makes it so. We will integrate the square of the PDF sans constant because it is easier than integrating naively:
\[ \int_{-\infty}^{\infty}e^{-z^2/2}\,dz\int_{-\infty}^{\infty}e^{-z^2/2}\,dz = \int_{-\infty}^{\infty}e^{-x^2/2}\,dx\int_{-\infty}^{\infty}e^{-y^2/2}\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}e^{-(x^2+y^2)/2}\,dx\,dy \]
\[ = \int_0^{2\pi}\int_0^{\infty}e^{-r^2/2}\,r\,dr\,d\theta = \int_0^{2\pi}\left(\int_0^{\infty}e^{-u}\,du\right)d\theta = 2\pi \]
substituting $u = \frac{r^2}{2}$, $du = r\,dr$. So our normalizing constant is $c = \frac{1}{\sqrt{2\pi}}$. ∎

Observation 14.5. Let us compute the mean and variance of $Z \sim \mathcal{N}(0,1)$. We have
\[ EZ = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}z e^{-z^2/2}\,dz = 0 \]
by symmetry (the integrand is odd). The variance reduces to
\[ \mathrm{Var}(Z) = E(Z^2) - (EZ)^2 = E(Z^2) \]
By LOTUS,
\[ E(Z^2) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}z^2 e^{-z^2/2}\,dz = \frac{2}{\sqrt{2\pi}}\int_0^{\infty}z^2 e^{-z^2/2}\,dz \ \text{(evenness)} = \frac{2}{\sqrt{2\pi}}\int_0^{\infty}\underbrace{z}_{u}\,\underbrace{z e^{-z^2/2}\,dz}_{dv} \ \text{(by parts)} \]
\[ = \frac{2}{\sqrt{2\pi}}\left(\left.uv\right|_0^{\infty} + \int_0^{\infty}e^{-z^2/2}\,dz\right) = \frac{2}{\sqrt{2\pi}}\left(0 + \frac{\sqrt{2\pi}}{2}\right) = 1 \]
We use $\Phi$ to denote the standard normal CDF; so
\[ \Phi(z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z}e^{-t^2/2}\,dt \]
By symmetry, we also have $\Phi(-z) = 1 - \Phi(z)$.

Lecture 15 — 10/5/11

Recall the standard normal distribution. Let $Z$ be an r.v., $Z \sim \mathcal{N}(0,1)$. Then $Z$ has CDF $\Phi$; it has $E(Z) = 0$, $\mathrm{Var}(Z) = E(Z^2) = 1$, and $E(Z^3) = 0$ (these are called the first, second, and third moments). By symmetry, also $-Z \sim \mathcal{N}(0,1)$.

Definition 15.1. Let $X = \mu + \sigma Z$, with $\mu \in \mathbb{R}$ (the mean or center), $\sigma > 0$ (the SD or scale). Then we say $X \sim \mathcal{N}(\mu,\sigma^2)$. This is the general normal distribution. If $X \sim \mathcal{N}(\mu,\sigma^2)$, we have $E(X) = \mu$ and $\mathrm{Var}(\mu + \sigma Z) = \sigma^2\mathrm{Var}(Z) = \sigma^2$. We call $Z = \frac{X-\mu}{\sigma}$ the standardization of $X$. $X$ has CDF
\[ P(X \le x) = P\left(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma}\right) = \Phi\left(\frac{x-\mu}{\sigma}\right) \]
which yields a PDF of
\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\left(\frac{x-\mu}{\sigma}\right)^2/2} \]
We also have $-X = -\mu + \sigma(-Z) \sim \mathcal{N}(-\mu,\sigma^2)$. Later, we will show that if $X_i \sim \mathcal{N}(\mu_i,\sigma_i^2)$ are independent, then
\[ X_i + X_j \sim \mathcal{N}(\mu_i+\mu_j,\ \sigma_i^2+\sigma_j^2) \qquad\text{and}\qquad X_i - X_j \sim \mathcal{N}(\mu_i-\mu_j,\ \sigma_i^2+\sigma_j^2) \]

Observation 15.2. If $X \sim \mathcal{N}(\mu,\sigma^2)$, we have
\[ P(|X-\mu| \le \sigma) \approx 68\%, \qquad P(|X-\mu| \le 2\sigma) \approx 95\%, \qquad P(|X-\mu| \le 3\sigma) \approx 99.7\% \]

Observation 15.3. We observe some properties of the variance.
\[ \mathrm{Var}(X) = E((X-EX)^2) = EX^2 - (EX)^2 \]
For any constant $c$,
\[ \mathrm{Var}(X+c) = \mathrm{Var}(X), \qquad \mathrm{Var}(cX) = c^2\mathrm{Var}(X) \]
Since variance is not linear, in general, $\mathrm{Var}(X+Y) \ne \mathrm{Var}(X) + \mathrm{Var}(Y)$. However, if $X$ and $Y$ are independent, we do have equality. On the other extreme,
\[ \mathrm{Var}(X+X) = \mathrm{Var}(2X) = 4\mathrm{Var}(X) \]
Also, in general,
\[ \mathrm{Var}(X) \ge 0, \qquad \mathrm{Var}(X) = 0 \iff \exists a : P(X=a) = 1 \]

Observation 15.4. Let us compute the variance of the Poisson distribution. Let $X \sim \mathrm{Pois}(\lambda)$. We have
\[ E(X^2) = \sum_{k=0}^{\infty}k^2\frac{e^{-\lambda}\lambda^k}{k!} \]
To reduce this sum, we can do the following. Start with
\[ \sum_{k=0}^{\infty}\frac{\lambda^k}{k!} = e^{\lambda} \]
Taking the derivative w.r.t. $\lambda$ and then multiplying by $\lambda$,
\[ \sum_{k=1}^{\infty}\frac{k\lambda^{k-1}}{k!} = e^{\lambda} \quad\Longrightarrow\quad \sum_{k=1}^{\infty}\frac{k\lambda^k}{k!} = \lambda e^{\lambda} \]

Repeating (differentiating again and multiplying by $\lambda$),
\[ \sum_{k=1}^{\infty}\frac{k^2\lambda^{k-1}}{k!} = e^{\lambda} + \lambda e^{\lambda} = e^{\lambda}(\lambda+1) \quad\Longrightarrow\quad \sum_{k=1}^{\infty}\frac{k^2\lambda^k}{k!} = \lambda e^{\lambda}(\lambda+1) \]
So,
\[ E(X^2) = e^{-\lambda}\sum_{k=0}^{\infty}\frac{k^2\lambda^k}{k!} = e^{-\lambda}\cdot\lambda e^{\lambda}(\lambda+1) = \lambda^2 + \lambda \]
So for our variance, we have
\[ \mathrm{Var}(X) = (\lambda^2+\lambda) - \lambda^2 = \lambda \]

Observation 15.5. Let us compute the variance of the binomial distribution. Let $X \sim \mathrm{Bin}(n,p)$. We can write
\[ X = I_1 + \cdots + I_n \]
where the $I_j$ are i.i.d. $\mathrm{Bern}(p)$. Then,
\[ X^2 = I_1^2 + \cdots + I_n^2 + 2I_1I_2 + 2I_1I_3 + \cdots + 2I_{n-1}I_n \]
where $I_iI_j$ is the indicator of success on both $i$ and $j$.
\[ E(X^2) = nE(I_1^2) + 2\binom{n}{2}E(I_1I_2) = np + n(n-1)p^2 = np + n^2p^2 - np^2 \]
So,
\[ \mathrm{Var}(X) = (np + n^2p^2 - np^2) - n^2p^2 = np(1-p) = npq \]

Proof (of Discrete LOTUS). We want to show that $E(g(X)) = \sum_x g(x)P(X=x)$. To do so, once again we can "ungroup" the expected value over the sample space and then regroup by the value of $X$:
\[ E(g(X)) = \sum_{s\in S}g(X(s))P(\{s\}) = \sum_x\sum_{s:X(s)=x}g(X(s))P(\{s\}) = \sum_x g(x)\sum_{s:X(s)=x}P(\{s\}) = \sum_x g(x)P(X=x) \qquad\blacksquare \]

Lecture 17 — 10/14/11

Definition 17.1. The exponential distribution, $\mathrm{Expo}(\lambda)$, is defined by PDF
\[ f(x) = \lambda e^{-\lambda x} \]
for $x > 0$ and 0 elsewhere. We call $\lambda$ the rate parameter. Integrating clearly yields 1, which demonstrates validity. Our CDF is given by
\[ F(x) = \int_0^x\lambda e^{-\lambda t}\,dt = \begin{cases} 1 - e^{-\lambda x} & x > 0 \\ 0 & \text{otherwise} \end{cases} \]

Observation 17.2. We can normalize any $X \sim \mathrm{Expo}(\lambda)$ by multiplying by $\lambda$, which gives $Y = \lambda X \sim \mathrm{Expo}(1)$. We have
\[ P(Y \le y) = P\left(X \le \frac{y}{\lambda}\right) = 1 - e^{-\lambda(y/\lambda)} = 1 - e^{-y} \]
Let us now compute the mean and variance of $Y \sim \mathrm{Expo}(1)$. We have
\[ E(Y) = \int_0^{\infty}y e^{-y}\,dy = \left.(-ye^{-y})\right|_0^{\infty} + \int_0^{\infty}e^{-y}\,dy = 1 \]
So,
\[ \mathrm{Var}(Y) = EY^2 - (EY)^2 = \int_0^{\infty}y^2e^{-y}\,dy - 1 = 1 \]
Then for $X = \frac{Y}{\lambda}$, we have $E(X) = \frac{1}{\lambda}$ and $\mathrm{Var}(X) = \frac{1}{\lambda^2}$.

Definition 17.3. A random variable $X$ has a memoryless distribution if
\[ P(X \ge s+t \mid X \ge s) = P(X \ge t) \]
Intuitively, if we have a random variable that we interpret as a waiting time, memorylessness means that no matter how long we have already waited, the probability of having to wait a given time more is invariant.

Proposition 17.4. The exponential distribution is memoryless.

Proof. Let $X \sim \mathrm{Expo}(\lambda)$. We know that
\[ P(X \ge t) = 1 - P(X \le t) = e^{-\lambda t} \]
Meanwhile,
\[ P(X \ge s+t \mid X \ge s) = \frac{P(X \ge s+t,\ X \ge s)}{P(X \ge s)} = \frac{P(X \ge s+t)}{P(X \ge s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X \ge t) \]
which is our desired result. ∎

Example. Let $X \sim \mathrm{Expo}(\lambda)$. Then by linearity and by memorylessness,
\[ E(X \mid X > a) = a + E(X - a \mid X > a) = a + \frac{1}{\lambda} \]

Lecture 18 — 10/17/11

Theorem 18.1. If $X$ is a positive, continuous random variable that is memoryless (i.e., its distribution is memoryless), then there exists $\lambda \in \mathbb{R}$ such that $X \sim \mathrm{Expo}(\lambda)$.

Proof. Let $F$ be the CDF of $X$ and $G = 1 - F$. By memorylessness,
\[ G(s+t) = G(s)G(t) \]
We can easily derive from this identity that $\forall k \in \mathbb{Q}$, $G(kt) = G(t)^k$. This can be extended to all $k \in \mathbb{R}$. If we take $t = 1$, then we have
\[ G(x) = G(1)^x = e^{x\ln G(1)} \]
Taking $\lambda = -\ln G(1)$, we will have $\lambda > 0$. Then this gives us
\[ F(x) = 1 - G(x) = 1 - e^{-\lambda x} \]
as desired. ∎

Definition 18.2. A random variable $X$ has moment-generating function (MGF)
\[ M(t) = E(e^{tX}) \]
if $M(t)$ is bounded on some interval $(-\epsilon,\epsilon)$ about zero.

Observation 18.3. We might ask why we call $M$ "moment-generating." Consider the Taylor expansion of $M$:
\[ E(e^{tX}) = E\left(\sum_{n=0}^{\infty}\frac{X^nt^n}{n!}\right) = \sum_{n=0}^{\infty}\frac{E(X^n)t^n}{n!} \]
Note that we cannot simply make use of linearity since our sum is infinite; however, this equation does hold for reasons beyond the scope of the course. This observation also shows us that
\[ E(X^n) = M^{(n)}(0) \]

Claim 18.4. If $X$ and $Y$ have the same MGF, then they have the same CDF. We will not prove this claim.

Observation 18.5. If $X$ has MGF $M_X$ and $Y$ has MGF $M_Y$, with $X$ and $Y$ independent, then
\[ M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX})E(e^{tY}) = M_X(t)M_Y(t) \]
The second equality comes from the claim (which we will prove later) that for $X, Y$ independent, $E(XY) = E(X)E(Y)$.

Example. Let $X \sim \mathrm{Bern}(p)$. Then
\[ M(t) = E(e^{tX}) = pe^t + q \]
Suppose now that $X \sim \mathrm{Bin}(n,p)$. Again, we write $X = I_1 + \cdots + I_n$ where the $I_j$ are i.i.d. $\mathrm{Bern}(p)$. Then we see that
\[ M(t) = (pe^t + q)^n \]

Example. Let $Z \sim \mathcal{N}(0,1)$. We have
\[ M(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{tz - z^2/2}\,dz = e^{t^2/2}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-(z-t)^2/2}\,dz = e^{t^2/2} \]
(completing the square).

Example. Suppose $X_1, X_2, \ldots$ are conditionally independent (given $p$) random variables that are $\mathrm{Bern}(p)$. Suppose also that $p$ is unknown. In the Bayesian approach, let us treat $p$ as a random variable. Let $p \sim \mathrm{Unif}(0,1)$; we call this the prior distribution.

Let $S_n = X_1 + \cdots + X_n$. Then $S_n \mid p \sim \mathrm{Bin}(n,p)$. We want to find the posterior distribution, $p \mid S_n$, which will give us $P(X_{n+1} = 1 \mid S_n = k)$. Using "Bayes' Theorem,"
\[ f(p \mid S_n = k) = \frac{P(S_n = k \mid p)f(p)}{P(S_n = k)} \]

\[ \propto p^k(1-p)^{n-k} \]
In the specific case of $S_n = n$, normalizing is easier:
\[ f(p \mid S_n = n) = (n+1)p^n \]
Computing $P(X_{n+1} = 1 \mid S_n = k)$ simply requires finding the expected value of an indicator under the posterior; for $p \mid S_n = n$,
\[ P(X_{n+1} = 1 \mid S_n = n) = \int_0^1 p\,(n+1)p^n\,dp = \frac{n+1}{n+2} \]
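This is Laplace's rule of succession, and it can be checked by rejection sampling (a minimal Python sketch, not part of the original notes; the run length $n$ is arbitrary):

```python
# Draw p ~ Unif(0,1); conditioning on n successes in a row,
# P(next trial succeeds) should approach (n+1)/(n+2).
import random

n, hits, next_success = 5, 0, 0
for _ in range(1_000_000):
    p = random.random()
    if all(random.random() < p for _ in range(n)):   # keep only runs with S_n = n
        hits += 1
        next_success += random.random() < p
print(next_success / hits, (n + 1) / (n + 2))   # both ~0.857
```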

Lecture 19 — 10/19/11

Observation 19.1. Let $X \sim \mathrm{Expo}(1)$; let us determine the MGF $M$ of $X$. By LOTUS,
\[ M(t) = E(e^{tX}) = \int_0^{\infty}e^{tx}e^{-x}\,dx = \int_0^{\infty}e^{-x(1-t)}\,dx = \frac{1}{1-t}, \qquad t < 1 \]
If we write
\[ \frac{1}{1-t} = \sum_{n=0}^{\infty}t^n = \sum_{n=0}^{\infty}n!\frac{t^n}{n!} \]
we get immediately that
\[ E(X^n) = n! \]
Now take $Y \sim \mathrm{Expo}(\lambda)$ and let $X = \lambda Y \sim \mathrm{Expo}(1)$. So $Y^n = \frac{X^n}{\lambda^n}$, and hence
\[ E(Y^n) = \frac{n!}{\lambda^n} \]

Observation 19.2. Let $Z \sim \mathcal{N}(0,1)$, and let us determine all its moments. We know that for $n$ odd, by symmetry,
\[ E(Z^n) = 0 \]
We previously showed that $M(t) = e^{t^2/2}$. But we can write
\[ e^{t^2/2} = \sum_{n=0}^{\infty}\frac{(t^2/2)^n}{n!} = \sum_{n=0}^{\infty}\frac{t^{2n}}{2^n n!} = \sum_{n=0}^{\infty}\frac{(2n)!}{2^n n!}\cdot\frac{t^{2n}}{(2n)!} \]
So
\[ E(Z^{2n}) = \frac{(2n)!}{2^n n!} \]

Observation 19.3. Let $X \sim \mathrm{Pois}(\lambda)$. Its MGF is given by
\[ E(e^{tX}) = \sum_{k=0}^{\infty}e^{tk}\frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda}e^{\lambda e^t} = e^{\lambda(e^t - 1)} \]

Observation 19.4. Now let $X \sim \mathrm{Pois}(\lambda)$ and $Y \sim \mathrm{Pois}(\mu)$ independent. We want to find the distribution of $X + Y$. We can simply multiply their MGFs, yielding
\[ M_X(t)M_Y(t) = e^{\lambda(e^t-1)}e^{\mu(e^t-1)} = e^{(\lambda+\mu)(e^t-1)} \]
so $X + Y \sim \mathrm{Pois}(\lambda + \mu)$.

Example. Suppose $X, Y$ above are dependent; specifically, take $X = Y$. Then $X + Y = 2X$. But this cannot be Poisson since it only takes on even values. We could also compute the mean and variance
\[ E(2X) = 2\lambda, \qquad \mathrm{Var}(2X) = 4\lambda \]
but they should be equal for the Poisson.

We now turn to the study of joint distributions. Recall that joint distributions for independent random variables can be given simply by multiplying their CDFs; we want also to study cases where random variables are not independent.

Definition 19.5. Let $X, Y$ be random variables. Their joint CDF is given by
\[ F(x,y) = P(X \le x, Y \le y) \]
In the discrete case, $X$ and $Y$ have a joint PMF given by
\[ P(X = x, Y = y) \]
and in the continuous case, $X$ and $Y$ have a joint PDF given by
\[ f(x,y) = \frac{\partial^2}{\partial x\,\partial y}F(x,y) \]
and we can compute
\[ P((X,Y) \in B) = \iint_B f(x,y)\,dx\,dy \]
Their separate CDFs and PMFs (e.g., $P(X \le x)$) are referred to as marginal CDFs, PMFs, or PDFs. $X$ and $Y$ are independent precisely when the joint CDF is equal to the product of the marginal CDFs:
\[ F(x,y) = F_X(x)F_Y(y) \]

or equivalently, in the discrete and continuous cases,
\[ P(X = x, Y = y) = P(X = x)P(Y = y) \qquad\text{or}\qquad f(x,y) = f_X(x)f_Y(y) \]
for all $x, y \in \mathbb{R}$.

Definition 19.6. To get the marginal PMF or PDF of a random variable $X$ from its joint PMF or PDF with another random variable $Y$, we can marginalize over $Y$ by computing
\[ P(X = x) = \sum_y P(X = x, Y = y) \qquad\text{or}\qquad f_X(x) = \int_{-\infty}^{\infty}f_{X,Y}(x,y)\,dy \]

Example. Let $X \sim \mathrm{Bern}(p)$, $Y \sim \mathrm{Bern}(q)$. Suppose they have joint PMF given by

            Y = 0    Y = 1
  X = 0     2/6      1/6      | 3/6
  X = 1     2/6      1/6      | 3/6
            4/6      2/6

Here we have computed the marginal probabilities (in the margin), and they demonstrate that $X$ and $Y$ are independent.

Example. Let us define the uniform distribution on the unit square, $\{(x,y) : x, y \in [0,1]\}$. We want the joint PDF to be constant everywhere in the square and 0 otherwise; that is,
\[ f(x,y) = \begin{cases} c & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases} \]
Normalizing, we simply need $c = \frac{1}{\text{area}} = 1$. It is apparent that the marginal PDFs are both uniform.

Example. Let us define the uniform distribution on the unit disc, $\{(x,y) : x^2 + y^2 \le 1\}$. The joint PDF can be given by
\[ f(x,y) = \begin{cases} \frac{1}{\pi} & x^2 + y^2 \le 1 \\ 0 & \text{otherwise} \end{cases} \]
Given $X = x$, we have $-\sqrt{1-x^2} \le y \le \sqrt{1-x^2}$. We might guess that $Y$ is marginally uniform, but it turns out that this is not the case; moreover, $X$ and $Y$ are clearly dependent here.

Definition 20.1. Let $X$ and $Y$ be random variables. Then the conditional PDF of $Y|X$ is
\[ f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{f_{X|Y}(x|y)f_Y(y)}{f_X(x)} \]
Note that $Y|X$ is shorthand for $Y \mid X = x$.

Example. Recall the PDF for our uniform distribution on the disk,
\[ f(x,y) = \begin{cases} \frac{1}{\pi} & x^2 + y^2 \le 1 \\ 0 & \text{otherwise} \end{cases} \]
Marginalizing over $Y$, we have
\[ f_X(x) = \int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}}\frac{1}{\pi}\,dy = \frac{2}{\pi}\sqrt{1-x^2} \]
for $-1 \le x \le 1$. As a check, we could integrate this again with respect to $dx$ to ensure that it is 1. From this, it is easy to find the conditional PDF,
\[ f_{Y|X}(y|x) = \frac{1/\pi}{\frac{2}{\pi}\sqrt{1-x^2}} = \frac{1}{2\sqrt{1-x^2}} \]
for $-\sqrt{1-x^2} \le y \le \sqrt{1-x^2}$. Since we are holding $x$ constant, we see that $Y|X \sim \mathrm{Unif}(-\sqrt{1-x^2},\ \sqrt{1-x^2})$.

From these computations, it is clear, in many ways, that $X$ and $Y$ are not independent. It is not true that $f_{X,Y} = f_Xf_Y$, nor that $f_{Y|X} = f_Y$.

Proposition 20.2. Let $X, Y$ have joint PDF $f$, and let $g\colon\mathbb{R}^2\to\mathbb{R}$. Then
\[ E(g(X,Y)) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g(x,y)f(x,y)\,dx\,dy \]
This is LOTUS in two dimensions.

Theorem 20.3. If $X, Y$ are independent random variables, then $E(XY) = E(X)E(Y)$.

Proof. We will prove this in the continuous case. Using LOTUS, we have
\[ E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}xy\,f_{X,Y}(x,y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}xy\,f_X(x)f_Y(y)\,dx\,dy = \int_{-\infty}^{\infty}E(X)\,y f_Y(y)\,dy = E(X)E(Y) \]
(using independence in the second equality), as desired. ∎

Example. Let $X, Y$ be i.i.d. $\mathrm{Unif}(0,1)$; let us find $E|X-Y|$. By 2D LOTUS (and since the joint PDF is 1), we want to integrate
\[ E|X-Y| = \int_0^1\int_0^1|x-y|\,dx\,dy = \iint_{x>y}(x-y)\,dx\,dy + \iint_{x\le y}(y-x)\,dx\,dy = 2\iint_{x>y}(x-y)\,dx\,dy \ \text{(by symmetry)} \]
\[ = 2\int_0^1\int_y^1(x-y)\,dx\,dy = 2\int_0^1\left(\frac{x^2}{2}-yx\right)\Big|_{x=y}^{1}\,dy = \frac{1}{3} \]
If we let $M = \max\{X,Y\}$ and $L = \min\{X,Y\}$, then we would have $|X-Y| = M - L$, and hence also
\[ E(M-L) = E(M) - E(L) = \frac{1}{3} \]
We also have
\[ E(X+Y) = E(M+L) = E(M) + E(L) = 1 \]
This gives
\[ E(M) = \frac{2}{3}, \qquad E(L) = \frac{1}{3} \]

Example (Chicken-Egg Problem). Suppose there are $N \sim \mathrm{Pois}(\lambda)$ eggs, each hatching with probability $p$, independently (these are Bernoulli trials). Let $X$ be the number of eggs that hatch. Thus, $X|N \sim \mathrm{Bin}(N,p)$. Let $Y$ be the number that don't hatch. Then $X + Y = N$.

Let us find the joint PMF of $X$ and $Y$.
\[ P(X=i, Y=j) = \sum_{n=0}^{\infty}P(X=i, Y=j \mid N=n)P(N=n) = P(X=i, Y=j \mid N=i+j)P(N=i+j) = P(X=i \mid N=i+j)P(N=i+j) \]
\[ = \binom{i+j}{i}p^iq^j\cdot\frac{e^{-\lambda}\lambda^{i+j}}{(i+j)!} = \left(e^{-\lambda p}\frac{(\lambda p)^i}{i!}\right)\left(e^{-\lambda q}\frac{(\lambda q)^j}{j!}\right) \]
So $X$ and $Y$ are in fact independent, with $X \sim \mathrm{Pois}(\lambda p)$ and $Y \sim \mathrm{Pois}(\lambda q)$. In other words, the randomness of the number of eggs offsets the dependence of $Y$ on $X$ given a fixed number of eggs. This is a special property of the Poisson distribution.

Theorem 21.1. Let $X \sim \mathcal{N}(\mu_1,\sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2,\sigma_2^2)$ be independent random variables. Then $X + Y \sim \mathcal{N}(\mu_1+\mu_2,\ \sigma_1^2+\sigma_2^2)$.

Proof. Since $X$ and $Y$ are independent, we can simply multiply their MGFs. This is given by
\[ M_{X+Y}(t) = M_X(t)M_Y(t) = \exp\left(\mu_1t + \sigma_1^2\frac{t^2}{2}\right)\exp\left(\mu_2t + \sigma_2^2\frac{t^2}{2}\right) = \exp\left((\mu_1+\mu_2)t + (\sigma_1^2+\sigma_2^2)\frac{t^2}{2}\right) \]
which yields our desired result. ∎

Example. Let $Z_1, Z_2 \sim \mathcal{N}(0,1)$, i.i.d.; let us find $E|Z_1 - Z_2|$. By the above, $Z_1 - Z_2 \sim \mathcal{N}(0,2)$. Let $Z \sim \mathcal{N}(0,1)$. Then
\[ E|Z_1-Z_2| = E|\sqrt{2}Z| = \sqrt{2}\,E|Z| = \sqrt{2}\int_{-\infty}^{\infty}|z|\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz = 2\sqrt{2}\int_0^{\infty}z\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz \ \text{(evenness)} = \sqrt{2}\sqrt{\frac{2}{\pi}} = \frac{2}{\sqrt{\pi}} \]

Definition 21.2. Let $X = (X_1,\ldots,X_k)$ be a multivariate random variable, $p = (p_1,\ldots,p_k)$ a probability vector with $p_j \ge 0$ and $\sum_j p_j = 1$. The multinomial distribution is given by assorting $n$ objects into $k$ categories, each object having probability $p_j$ of being in category $j$, and taking the number of objects in each category, $X_j$. If $X$ has the multinomial distribution, we write $X \sim \mathrm{Mult}_k(n, p)$. The PMF of $X$ is given by
\[ P(X_1 = n_1,\ldots,X_k = n_k) = \frac{n!}{n_1!\cdots n_k!}p_1^{n_1}\cdots p_k^{n_k} \]
if $\sum_k n_k = n$, and 0 otherwise.

Observation 21.3. Let $X \sim \mathrm{Mult}_k(n,p)$. Then the marginal distribution of $X_j$ is simply $X_j \sim \mathrm{Bin}(n,p_j)$, since each object is either in $j$ or not, and we have
\[ E(X_j) = np_j, \qquad \mathrm{Var}(X_j) = np_j(1-p_j) \]

Observation 21.4. If we "lump" categories together for $X \sim \mathrm{Mult}_k(n,p)$, then the result is still multinomial. That is, taking
\[ Y = (X_1,\ldots,X_{l-1},\ X_l+\cdots+X_k) \qquad\text{and}\qquad p' = (p_1,\ldots,p_{l-1},\ p_l+\cdots+p_k) \]
we have $Y \sim \mathrm{Mult}_l(n, p')$, and this is true for any combinations of lumpings.

Observation 21.5. If we condition on one of the categories, say $X_1 = n_1$, the rest is again multinomial:
\[ (X_2,\ldots,X_k) \mid X_1 = n_1 \sim \mathrm{Mult}_{k-1}(n - n_1,\ (p_2',\ldots,p_k')), \qquad p_j' = \frac{p_j}{1-p_1} = \frac{p_j}{p_2+\cdots+p_k} \]
This is symmetric for all $j$.

Definition 21.6. The Cauchy distribution is the distribution of $T = \frac{X}{Y}$ with $X, Y \sim \mathcal{N}(0,1)$ i.i.d.

Note. The Cauchy distribution has no mean, but has the property that an average of many Cauchy random variables is still Cauchy.

Observation 21.7. Let us compute the PDF of $T$ with the Cauchy distribution. The CDF is given by
\[ P\left(\frac{X}{Y} \le t\right) = P\left(\frac{X}{|Y|} \le t\right) = P(X \le t|Y|) = \int_{-\infty}^{\infty}\int_{-\infty}^{t|y|}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dx\,dy \]
\[ = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\,\Phi(t|y|)\,dy = \sqrt{\frac{2}{\pi}}\int_0^{\infty}e^{-y^2/2}\,\Phi(ty)\,dy \]
There is little we can do to compute this integral. Instead, let us compute the PDF, calling the CDF above $F(t)$. Then we have
\[ F'(t) = \sqrt{\frac{2}{\pi}}\int_0^{\infty}y e^{-y^2/2}\frac{1}{\sqrt{2\pi}}e^{-t^2y^2/2}\,dy = \frac{1}{\pi}\int_0^{\infty}y e^{-(1+t^2)y^2/2}\,dy \]
Substituting $u = \frac{(1+t^2)y^2}{2} \implies du = (1+t^2)y\,dy$,
\[ F'(t) = \frac{1}{\pi(1+t^2)} \]
We could also have performed this computation using the Law of Total Probability. Let $\varphi$ be the standard normal PDF. We have
\[ P(X \le t|Y|) = \int_{-\infty}^{\infty}P(X \le t|Y| \mid Y=y)\,\varphi(y)\,dy = \int_{-\infty}^{\infty}P(X \le t|y|)\,\varphi(y)\,dy \ \text{(by independence)} = \int_{-\infty}^{\infty}\Phi(t|y|)\,\varphi(y)\,dy \]

Lecture 22 — 10/26/11

Definition 22.1. The covariance of random variables $X$ and $Y$ is
\[ \mathrm{Cov}(X,Y) = E\big((X-EX)(Y-EY)\big) \]

Note. The following properties are immediately true of the covariance:
1. $\mathrm{Cov}(X,X) = \mathrm{Var}(X)$
2. $\mathrm{Cov}(X,Y) = \mathrm{Cov}(Y,X)$
3. $\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y)$
4. $\forall c \in \mathbb{R}$, $\mathrm{Cov}(X,c) = 0$
5. $\forall c \in \mathbb{R}$, $\mathrm{Cov}(cX,Y) = c\,\mathrm{Cov}(X,Y)$
6. $\mathrm{Cov}(X,Y+Z) = \mathrm{Cov}(X,Y) + \mathrm{Cov}(X,Z)$
The last two properties demonstrate that covariance is bilinear. In general,
\[ \mathrm{Cov}\left(\sum_{i=1}^{m}a_iX_i,\ \sum_{j=1}^{n}b_jY_j\right) = \sum_{i,j}a_ib_j\,\mathrm{Cov}(X_i,Y_j) \]

Observation 22.2. We can use covariance to compute the variance of sums:
\[ \mathrm{Var}(X+Y) = \mathrm{Cov}(X,X) + \mathrm{Cov}(X,Y) + \mathrm{Cov}(Y,X) + \mathrm{Cov}(Y,Y) = \mathrm{Var}(X) + 2\,\mathrm{Cov}(X,Y) + \mathrm{Var}(Y) \]
and more generally,
\[ \mathrm{Var}\left(\sum_i X_i\right) = \sum_i\mathrm{Var}(X_i) + 2\sum_{i<j}\mathrm{Cov}(X_i,X_j) \]

Theorem 22.3. If $X, Y$ are independent, then $\mathrm{Cov}(X,Y) = 0$ (we say that they are uncorrelated).

Example. The converse of the above is false. Let $Z \sim \mathcal{N}(0,1)$, $X = Z$, $Y = Z^2$, and let us compute the covariance.
\[ \mathrm{Cov}(X,Y) = E(XY) - (EX)(EY) = E(Z^3) - (EZ)(EZ^2) = 0 \]
But $X$ and $Y$ are very dependent, since $Y$ is a function of $X$.
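A quick numerical illustration of the uncorrelated-but-dependent example (a minimal Python sketch, not part of the original notes):

```python
# X = Z and Y = Z^2 with Z ~ N(0,1): sample covariance ~0, yet Y is a function of X.
import random

zs = [random.gauss(0, 1) for _ in range(200_000)]
xs, ys = zs, [z * z for z in zs]
ex = sum(xs) / len(xs)
ey = sum(ys) / len(ys)
cov = sum(x * y for x, y in zip(xs, ys)) / len(xs) - ex * ey
print(round(cov, 3))   # ~0
```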

Definition 22.4. The correlation of two random variables $X$ and $Y$ is
\[ \mathrm{Cor}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)} = \mathrm{Cov}\left(\frac{X-EX}{\mathrm{SD}(X)},\ \frac{Y-EY}{\mathrm{SD}(Y)}\right) \]
The operation $X \mapsto \frac{X-EX}{\mathrm{SD}(X)}$ is called standardization; it gives the result a mean of 0 and a variance of 1.

Theorem 22.5. $|\mathrm{Cor}(X,Y)| \le 1$.

Proof. We could apply Cauchy-Schwarz to get this result immediately, but we shall also provide a direct proof. WLOG, assume $X$ and $Y$ are standardized. Let $\rho = \mathrm{Cor}(X,Y)$. We have
\[ \mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\rho = 2 + 2\rho \]
and we also have
\[ \mathrm{Var}(X-Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\rho = 2 - 2\rho \]
But since $\mathrm{Var} \ge 0$, this yields our result. ∎

Example. Let $(X_1,\ldots,X_k) \sim \mathrm{Mult}_k(n,p)$. We shall compute $\mathrm{Cov}(X_i,X_j)$ for all $i, j$. If $i = j$, then
\[ \mathrm{Cov}(X_i,X_i) = \mathrm{Var}(X_i) = np_i(1-p_i) \]
Suppose $i \ne j$. We can expect that the covariance will be negative, since more objects in category $i$ means fewer in category $j$. We have
\[ \mathrm{Var}(X_i+X_j) = np_i(1-p_i) + np_j(1-p_j) + 2\,\mathrm{Cov}(X_i,X_j) \]
But by "lumping" $i$ and $j$ together, we also have
\[ \mathrm{Var}(X_i+X_j) = n(p_i+p_j)(1-(p_i+p_j)) \]
Then solving for the covariance, we have
\[ \mathrm{Cov}(X_i,X_j) = -np_ip_j \]

Note. Let $A$ be an event and $I_A$ its indicator random variable. It is clear that
\[ I_A^n = I_A \]
for any $n \in \mathbb{N}$. It is also clear that $I_AI_B = I_{A\cap B}$.

Example. Let $X \sim \mathrm{Bin}(n,p)$. Write $X = X_1 + \cdots + X_n$ where the $X_j$ are i.i.d. $\mathrm{Bern}(p)$. Then
\[ \mathrm{Var}(X_j) = EX_j^2 - (EX_j)^2 = p - p^2 = p(1-p) \]
It follows that
\[ \mathrm{Var}(X) = np(1-p) \]
since $\mathrm{Cov}(X_i,X_j) = 0$ for $i \ne j$ by independence.

Lecture 23 — 10/28/11

Example. Let $X \sim \mathrm{HGeom}(w,b,n)$. Let us write $p = \frac{w}{w+b}$ and $N = w+b$. Then we can write $X = X_1 + \cdots + X_n$ where the $X_j$ are $\mathrm{Bern}(p)$. (Note, however, that unlike with the binomial, the $X_j$ are not independent.) Then
\[ \mathrm{Var}(X) = n\,\mathrm{Var}(X_1) + 2\binom{n}{2}\mathrm{Cov}(X_1,X_2) = np(1-p) + 2\binom{n}{2}\mathrm{Cov}(X_1,X_2) \]
Computing the covariance, we have
\[ \mathrm{Cov}(X_1,X_2) = E(X_1X_2) - (EX_1)(EX_2) = \frac{w}{w+b}\cdot\frac{w-1}{w+b-1} - \left(\frac{w}{w+b}\right)^2 \]
and simplifying,
\[ \mathrm{Var}(X) = \frac{N-n}{N-1}\,np(1-p) \]
The term $\frac{N-n}{N-1}$ is called the finite population correction; it represents the "offset" from the binomial due to lack of replacement.
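The finite population correction can be checked exactly against the hypergeometric PMF (a minimal Python sketch, not part of the original notes; $w$, $b$, $n$ are arbitrary):

```python
# Exact Var of HGeom(w, b, n) from its PMF vs. (N-n)/(N-1) * n*p*(1-p).
from math import comb

w, b, n = 10, 30, 12
N, p = w + b, w / (w + b)
pmf = {k: comb(w, k) * comb(b, n - k) / comb(N, n)
       for k in range(max(0, n - b), min(w, n) + 1)}
mean = sum(k * q for k, q in pmf.items())
var = sum(k * k * q for k, q in pmf.items()) - mean**2
print(var, (N - n) / (N - 1) * n * p * (1 - p))   # equal
```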

Theorem 23.1. Let $X$ be a continuous random variable with PDF $f_X$, and let $Y = g(X)$ where $g$ is differentiable and strictly increasing. Then the PDF of $Y$ is given by
\[ f_Y(y) = f_X(x)\frac{dx}{dy} \]
where $y = g(x)$ and $x = g^{-1}(y)$. (Also recall from calculus that $\frac{dx}{dy} = \left(\frac{dy}{dx}\right)^{-1}$.)

Proof. From the CDF of $Y$, we get
\[ P(Y \le y) = P(g(X) \le y) \]

\[ = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)) = F_X(x) \]
Then, differentiating, we get by the Chain Rule that
\[ f_Y(y) = f_X(x)\frac{dx}{dy} \qquad\blacksquare \]

Example. Consider the log normal distribution, which is given by $Y = e^Z$ for $Z \sim \mathcal{N}(0,1)$. We have
\[ f_Z(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2} \]
To put this in terms of $y$, we substitute $z = \ln y$. Moreover, we know that
\[ \frac{dy}{dz} = e^z = y \]
and so,
\[ f_Y(y) = \frac{1}{y}\cdot\frac{1}{\sqrt{2\pi}}e^{-(\ln y)^2/2} \]

Theorem 23.2. Suppose that $X$ is a continuous random variable in $n$ dimensions, $Y = g(X)$ where $g\colon\mathbb{R}^n\to\mathbb{R}^n$ is continuously differentiable and invertible. Then
\[ f_Y(y) = f_X(x)\left|\det\frac{dx}{dy}\right| \]
where
\[ \frac{dx}{dy} = \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \cdots & \frac{\partial x_1}{\partial y_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial x_n}{\partial y_1} & \cdots & \frac{\partial x_n}{\partial y_n} \end{pmatrix} \]
is the Jacobian matrix.

Observation 23.3. Let $T = X + Y$, where $X$ and $Y$ are independent. In the discrete case, we have
\[ P(T = t) = \sum_x P(X = x)P(Y = t - x) \]
For the continuous case, we have
\[ f_T(t) = (f_X * f_Y)(t) = \int_{-\infty}^{\infty}f_X(x)f_Y(t-x)\,dx \]
This is true because we have
\[ F_T(t) = P(T \le t) = \int_{-\infty}^{\infty}P(X+Y \le t \mid X = x)f_X(x)\,dx = \int_{-\infty}^{\infty}F_Y(t-x)f_X(x)\,dx \]
Then taking the derivative of both sides,
\[ f_T(t) = \int_{-\infty}^{\infty}f_X(x)f_Y(t-x)\,dx \]

We now briefly turn our attention to proving the existence of objects with some desired property $A$ using probability. We want to show that $P(A) > 0$ for some random object, which implies that some such object must exist. Reframing this question, suppose each object in our universe of objects has some kind of "score" associated with this property; then we want to show that there is some object with a "good" score. But we know that there is an object with score at least equal to the average score, i.e., the score of a random object. Showing that this average is "high enough" will prove the existence of an object without specifying one.

Example. Suppose there are 100 people in 15 committees of 20 people each, and that each person is on exactly 3 committees. We want to show that there exist 2 committees with overlap $\ge 3$. Let us find the average overlap of two random committees. Using indicator random variables for the probability that a given person is on both of those two committees, we get
\[ E(\text{overlap}) = 100\cdot\frac{\binom{3}{2}}{\binom{15}{2}} = \frac{300}{105} = \frac{20}{7} \]
So some pair of committees has overlap at least $\frac{20}{7}$. But since all overlaps must be integral, there is a pair of committees with overlap $\ge 3$.

Lecture 24 — 10/31/11

Definition 24.1. The beta distribution, $\mathrm{Beta}(a,b)$ for $a, b > 0$, is defined by PDF
\[ f(x) = \begin{cases} c\,x^{a-1}(1-x)^{b-1} & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \]
where $c$ is a normalizing constant (defined by the beta function).

The beta distribution is a flexible family of continuous distributions on $(0,1)$. By flexible, we mean that the appearance of the distribution varies significantly depending on the values of its parameters. If $a = b = 1$, the beta reduces to the uniform. If $a = 2$ and $b = 1$, the beta appears as a line with positive slope. If $a = b = \frac{1}{2}$, the beta appears to be concave-up and parabolic; if $a = b = 2$, it is concave down.

The beta distribution is often used as a prior distribution for some parameter on $(0,1)$. In particular, it is the conjugate prior to the binomial distribution.

Stat 110—Intro to Probability Max Wang

Observation 24.2. Suppose that, based on some data, we have X | p ~ Bin(n, p), and that our prior distribution for p is p ~ Beta(a, b). We want to determine the posterior distribution of p, that is, p | X. We have

  f(p \mid X = k) = \frac{P(X = k \mid p) f(p)}{P(X = k)}
                  = \frac{\binom{n}{k} p^k (1-p)^{n-k} \, c p^{a-1} (1-p)^{b-1}}{P(X = k)}
                  \propto p^{a+k-1} (1-p)^{b+n-k-1}

Beta is the conjugate prior to the binomial because both its prior and posterior distribution are Beta.

Observation 24.3. Let us find a specific case of the normalizing constant,

  c^{-1} = \int_0^1 x^k (1-x)^{n-k} \, dx

To do this, consider the story of "Bayes' billiards." Suppose we have n + 1 billiard balls, all white; then we paint one pink and throw them along (0, 1) all independently. Let X be the number of balls to the left of the pink ball. Then conditioning on where the pink ball ends up, we have

  P(X = k) = \int_0^1 P(X = k \mid p) \underbrace{f(p)}_{1} \, dp = \int_0^1 \binom{n}{k} p^k (1-p)^{n-k} \, dp

where, given the pink ball's location, X is simply binomial (each white ball has an independent chance p of landing to the left). Note, however, that painting a ball pink and then throwing the balls along (0, 1) is the same as first throwing all the balls along (0, 1) and then painting one of them pink. But then it is clear that there is an equal chance for any given number from 0 to n of white balls to be to the pink ball's left. So we have

  \int_0^1 \binom{n}{k} p^k (1-p)^{n-k} \, dp = \frac{1}{n+1}

Lecture 25 — 11/2/11

Definition 25.1. The gamma function is given by

  \Gamma(a) = \int_0^\infty x^{a-1} e^{-x} \, dx = \int_0^\infty x^a e^{-x} \, \frac{dx}{x}

for any a > 0. The gamma function is a continuous extension of the factorial operator on natural numbers. For n a positive integer,

  \Gamma(n) = (n-1)!

More generally,

  \Gamma(x+1) = x \Gamma(x)

Definition 25.2. The standard gamma distribution, Gamma(a, 1), is defined by PDF

  \frac{1}{\Gamma(a)} x^a e^{-x} \frac{1}{x}

for x > 0, which is simply the integrand of the normalized gamma function. More generally, let X ~ Gamma(a, 1) and Y = X/\lambda. We say that Y ~ Gamma(a, \lambda). To get the PDF of Y, we simply change variables; we have x = \lambda y, so

  f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right|
         = \frac{1}{\Gamma(a)} (\lambda y)^a e^{-\lambda y} \frac{1}{\lambda y} \cdot \lambda
         = \frac{1}{\Gamma(a)} (\lambda y)^a e^{-\lambda y} \frac{1}{y}

Definition 25.3. We define a Poisson process as a process in which events occur continuously and independently such that in any time interval t, the number of events which occur is N_t ~ Pois(\lambda t) for some fixed rate parameter \lambda.

Observation 25.4. The time T_1 until the first event occurs is Expo(\lambda):

  P(T_1 > t) = P(N_t = 0) = e^{-\lambda t}

which means that

  P(T_1 \le t) = 1 - e^{-\lambda t}

as desired. More generally, the time until the next event is always Expo(\lambda); this is clear from the memoryless property.

Proposition 25.5. Let T_n be the time of the nth event in a Poisson process with rate parameter \lambda. Then, for X_j i.i.d. Expo(\lambda), we have

  T_n = \sum_{j=1}^n X_j ~ Gamma(n, \lambda)

The exponential distribution is the continuous analogue of the geometric distribution; in this sense, the gamma distribution is the continuous analogue of the negative binomial distribution.
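Proposition 25.5 is easy to check by simulation. The following sketch (not from the lecture; the values of \lambda, n, and the sample size are arbitrary) compares a sum of n i.i.d. Expo(\lambda) draws with the Gamma(n, \lambda) distribution:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 5, 100_000

# T_n as a sum of n i.i.d. Expo(lam) waiting times
t_n = rng.exponential(scale=1/lam, size=(reps, n)).sum(axis=1)

print(t_n.mean(), n / lam)                                             # sample mean vs. n/lam
print(stats.kstest(t_n, stats.gamma(a=n, scale=1/lam).cdf).statistic)  # small KS distance from Gamma(n, lam)

(SciPy parametrizes the gamma by shape a and scale 1/\lambda.)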


Proof. One method of proof, which we will not use, would be to repeatedly convolve the PDFs of the i.i.d. X_j. Instead, we will use MGFs. Suppose that the X_j are i.i.d. Expo(1); we will show that their sum is Gamma(n, 1). The MGF of X_j is given by

  M_{X_j}(t) = \frac{1}{1-t}

for t < 1. Then the MGF of T_n is

  M_{T_n}(t) = \left( \frac{1}{1-t} \right)^n

also for t < 1. We will show that the gamma distribution has the same MGF.

Let Y ~ Gamma(n, 1). Then by LOTUS,

  E(e^{tY}) = \frac{1}{\Gamma(n)} \int_0^\infty e^{ty} y^n e^{-y} \, \frac{dy}{y} = \frac{1}{\Gamma(n)} \int_0^\infty y^n e^{-(1-t)y} \, \frac{dy}{y}

Changing variables, with x = (1-t)y, then

  E(e^{tY}) = \frac{(1-t)^{-n}}{\Gamma(n)} \int_0^\infty x^n e^{-x} \, \frac{dx}{x} = \left( \frac{1}{1-t} \right)^n \frac{\Gamma(n)}{\Gamma(n)} = \left( \frac{1}{1-t} \right)^n  ∎

Note that this is the MGF for any n > 0, although the sum-of-exponentials expression requires integral n.

Observation 25.6. Let us compute the moments of X ~ Gamma(a, 1). We want to compute E(X^c). We have

  E(X^c) = \frac{1}{\Gamma(a)} \int_0^\infty x^c x^a e^{-x} \, \frac{dx}{x}
         = \frac{1}{\Gamma(a)} \int_0^\infty x^{a+c} e^{-x} \, \frac{dx}{x}
         = \frac{\Gamma(a+c)}{\Gamma(a)}
         = a(a+1) \cdots (a+c-1)

If instead, we take X ~ Gamma(a, \lambda), then we will have

  E(X^c) = \frac{a(a+1) \cdots (a+c-1)}{\lambda^c}

Lecture 26 — 11/4/11

Observation 26.1 (Gamma-Beta). Let us take X ~ Gamma(a, \lambda) to be your waiting time in line at the bank, and Y ~ Gamma(b, \lambda) your waiting time in line at the post office. Suppose that X and Y are independent. Let T = X + Y; we know that this has distribution Gamma(a + b, \lambda).

Let us compute the joint distribution of T and of W = X/(X+Y), the fraction of time spent waiting at the bank. For simplicity of notation, we will take \lambda = 1. The joint PDF is given by

  f_{T,W}(t, w) = f_{X,Y}(x, y) \left| \det \frac{\partial(x, y)}{\partial(t, w)} \right|
                = \frac{1}{\Gamma(a)\Gamma(b)} x^a e^{-x} y^b e^{-y} \frac{1}{xy} \left| \det \frac{\partial(x, y)}{\partial(t, w)} \right|

We must find the determinant of the Jacobian (here expressed in silly-looking notation). We know that

  x + y = t, \qquad \frac{x}{x+y} = w

Solving for x and y, we easily find that

  x = tw, \qquad y = t(1 - w)

Then the determinant of our Jacobian is given by

  \det \begin{pmatrix} w & t \\ 1-w & -t \end{pmatrix} = -wt - t(1-w) = -t

Taking the absolute value, we then get

  f_{T,W}(t, w) = \frac{1}{\Gamma(a)\Gamma(b)} x^a e^{-x} y^b e^{-y} \frac{1}{xy} \, t
                = \frac{1}{\Gamma(a)\Gamma(b)} w^{a-1} (1-w)^{b-1} t^{a+b} e^{-t} \frac{1}{t}
                = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} w^{a-1} (1-w)^{b-1} \cdot \frac{1}{\Gamma(a+b)} t^{a+b} e^{-t} \frac{1}{t}

This is a product of some function of w with the PDF of T, so we see that T and W are independent. To find the marginal distribution of W, we note that the PDF of T integrates to 1 just like any PDF, so we have

  f_W(w) = \int_0^\infty f_{T,W}(t, w) \, dt = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} w^{a-1} (1-w)^{b-1}

This yields W ~ Beta(a, b) and also gives the normalizing constant of the beta distribution.

It turns out that if X were distributed according to any other distribution, we would not have independence, but proving so is out of the scope of the course.
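A short simulation sketch of the bank and post office result (not from the lecture; the values of a, b, and the sample size are arbitrary): W = X/(X+Y) should look Beta(a, b) and be essentially uncorrelated with T = X + Y.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a, b, reps = 3.0, 5.0, 100_000

x = rng.gamma(shape=a, scale=1.0, size=reps)   # X ~ Gamma(a, 1)
y = rng.gamma(shape=b, scale=1.0, size=reps)   # Y ~ Gamma(b, 1), independent of X
t, w = x + y, x / (x + y)

print(np.corrcoef(t, w)[0, 1])                            # approximately 0
print(stats.kstest(w, stats.beta(a, b).cdf).statistic)    # small: W looks Beta(a, b)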


Observation 26.2. Let us find E(W) for W ~ Beta(a, b). Let us write W = X/(X+Y) with X and Y defined as above. We have

  E\left( \frac{X}{X+Y} \right) = \frac{E(X)}{E(X+Y)} = \frac{a}{a+b}

Note that in general, the first equality is false! However, because X + Y and X/(X+Y) are independent, they are uncorrelated, and hence the expectation of their product factors. So

  E\left( \frac{X}{X+Y} \right) E(X+Y) = E(X)

Definition 26.3. Let X_1, \ldots, X_n be i.i.d. The order statistics of this sequence are

  X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}

where

  X_{(1)} = \min\{X_1, \ldots, X_n\}, \qquad X_{(n)} = \max\{X_1, \ldots, X_n\}

and the remaining X_{(j)} fill out the order. If n is odd, we have the median X_{((n+1)/2)}. The order statistics let us find arbitrary quantiles for the sequence.

The order statistics are hard to work with because they are dependent (and positively correlated), even though we started with i.i.d. random variables. They are particularly tricky in the discrete case because of ties, so we will assume that the X_j are continuous.

Observation 26.4. Let X_1, \ldots, X_n be i.i.d. continuous with PDF f and CDF F. We want to find the CDF and PDF of X_{(j)}. For the CDF, we have

  P(X_{(j)} \le x) = P(\text{at least } j \text{ of the } X_i \text{ are at most } x)
                   = \sum_{k=j}^n \binom{n}{k} P(X_1 \le x)^k (1 - P(X_1 \le x))^{n-k}
                   = \sum_{k=j}^n \binom{n}{k} F(x)^k (1 - F(x))^{n-k}

Turning now to the PDF, recall that a PDF gives a density rather than a probability. We can multiply the PDF of X_{(j)} at a point x by a tiny interval dx about x in order to obtain the probability that X_{(j)} is in that interval. Then we can simply count the number of ways to have one of the X_i be in that interval and precisely j - 1 of the X_i below the interval. So,

  f_{X_{(j)}}(x) \, dx = n (f(x) \, dx) \binom{n-1}{j-1} F(x)^{j-1} (1 - F(x))^{n-j}

  f_{X_{(j)}}(x) = n \binom{n-1}{j-1} F(x)^{j-1} (1 - F(x))^{n-j} f(x)

Example. Let U_1, \ldots, U_n be i.i.d. Unif(0, 1), and let us determine the distribution of U_{(j)}. Applying the above result, we have

  f_{U_{(j)}}(x) = n \binom{n-1}{j-1} x^{j-1} (1 - x)^{n-j}

for 0 < x < 1. Thus, we have U_{(j)} ~ Beta(j, n - j + 1). This confirms our earlier result that, for U_1 and U_2 i.i.d. Unif(0, 1), we have

  E|U_1 - U_2| = E(U_{\max}) - E(U_{\min}) = \frac{1}{3}

because U_{\max} ~ Beta(2, 1) and U_{\min} ~ Beta(1, 2), which have means 2/3 and 1/3 respectively.

Example (Two Envelopes Paradox). Suppose we have two envelopes containing sums of money X and Y, and suppose we are told that one envelope has twice as much money as the other. We choose one envelope; by symmetry, take X WLOG. Then it appears that Y has equal probabilities of containing X/2 and of containing 2X, and thus averages 1.25X. So it seems that we ought to switch to envelope Y. But then, by the same reasoning, it would seem we ought to switch back to X.

We can argue about this paradox in two ways. First, we can say, by symmetry, that

  E(X) = E(Y)

which is simple and straightforward. We might also, however, try to condition on the value of Y with respect to X using the Law of Total Probability:

  E(Y) = E(Y \mid Y = 2X) P(Y = 2X) + E\left(Y \,\middle|\, Y = \frac{X}{2}\right) P\left(Y = \frac{X}{2}\right)
       = \frac{1}{2} E(2X) + \frac{1}{2} E\left(\frac{X}{2}\right)
       = \frac{5}{4} E(X)

Assuming that E(X) is neither 0 nor infinite, these cannot both be correct, and it is the argument from symmetry that is correct.

The flaw in our second argument is that, in general,

  E(Y \mid Y = Z) \ne E(Z)

because we cannot drop the condition that Y = Z; we must write

  E(Y \mid Y = Z) = E(Z \mid Y = Z)

In other words, if we let I be the indicator for Y = 2X, we are saying that X and I are dependent.
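We can check the U_{(j)} ~ Beta(j, n - j + 1) result, and the value E|U_1 - U_2| = 1/3, with a small simulation (a sketch, not from the lecture; n, j, and the sample size are arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, j, reps = 10, 3, 100_000

u = rng.uniform(size=(reps, n))
u_j = np.sort(u, axis=1)[:, j - 1]            # the j-th order statistic U_(j)

print(stats.kstest(u_j, stats.beta(j, n - j + 1).cdf).statistic)  # small KS distance
print(np.mean(np.abs(u[:, 0] - u[:, 1])))                         # approximately 1/3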


Example (Patterns in coin flips). Suppose we repeatedly flip a fair coin. We want to determine how many flips it takes until HT is observed (including the H and T); similarly, we can ask how many flips it takes to get HH. Let us call these random variables W_{HT} and W_{HH} respectively. Note that, by symmetry,

  E(W_{HH}) = E(W_{TT})

and

  E(W_{HT}) = E(W_{TH})

Let us first consider W_{HT}. This is the time to the first H, which we will call W_1, plus the time W_2 to the next T. Then we have

  E(W_{HT}) = E(W_1) + E(W_2) = 2 + 2 = 4

Now let us consider W_{HH}. The distinction here is that no "progress" can be easily made; once we get a heads, we are not decidedly halfway to the goal, because if the next flip is tails, we lose all our work. Instead, we make use of conditional expectation. Let H_i be the event that the ith toss is heads, T_i = H_i^C the event that it is tails. Then

  E(W_{HH}) = \frac{1}{2} E(W_{HH} \mid H_1) + \frac{1}{2} E(W_{HH} \mid T_1)
            = \frac{1}{2} \left( \frac{1}{2} E(W_{HH} \mid H_1, H_2) + \frac{1}{2} E(W_{HH} \mid H_1, T_2) \right) + \frac{1}{2} (1 + E(W_{HH}))
            = \frac{1}{2} \left( \frac{1}{2} \cdot 2 + \frac{1}{2} (2 + E(W_{HH})) \right) + \frac{1}{2} (1 + E(W_{HH}))

Solving for E(W_{HH}) gives

  E(W_{HH}) = 6

So far, we have been conditioning expectations on events. Let X and Y be random variables; then this kind of conditioning includes computing E(Y \mid X = x). If Y is discrete, then

  E(Y \mid X = x) = \sum_y y P(Y = y \mid X = x)

and if Y is continuous,

  E(Y \mid X = x) = \int_{-\infty}^{\infty} y f_{Y|X=x}(y \mid x) \, dy = \int_{-\infty}^{\infty} y \frac{f_{X,Y}(x, y)}{f_X(x)} \, dy

(the latter form if X is also continuous).

Definition 27.1. Now let us write

  g(x) = E(Y \mid X = x)

Then

  E(Y \mid X) = g(X)

So, suppose for instance that g(x) = x^2; then g(X) = X^2. We can see that E(Y|X) is a random variable and a function of X. This is a conditional expectation.

Example. Let X and Y be i.i.d. Pois(\lambda). Then

  E(X + Y \mid X) = E(X \mid X) + E(Y \mid X)
                 = X + E(Y \mid X)        (X is a function of itself)
                 = X + E(Y)               (X and Y independent)
                 = X + \lambda

Note that, in general, E(Y \mid X) \ne E(Y); we could drop the conditioning here only because X and Y are independent.

Now let us determine E(X \mid X + Y). We can do this in two different ways. First, let T = X + Y and let us find the conditional PMF.

  P(X = k \mid T = n) = \frac{P(T = n \mid X = k) P(X = k)}{P(T = n)}
                      = \frac{P(Y = n - k) P(X = k)}{P(T = n)}
                      = \frac{\frac{e^{-\lambda} \lambda^{n-k}}{(n-k)!} \cdot \frac{e^{-\lambda} \lambda^k}{k!}}{\frac{e^{-2\lambda} (2\lambda)^n}{n!}}
                      = \binom{n}{k} \left( \frac{1}{2} \right)^n

That is, X \mid T = n ~ Bin(n, 1/2). Thus, we have

  E(X \mid T = n) = \frac{n}{2}

which means that

  E(X \mid T) = \frac{T}{2}

In our second method, first we note that

  E(X \mid X + Y) = E(Y \mid X + Y)

by symmetry (since they are i.i.d.). We have

  E(X \mid X + Y) + E(Y \mid X + Y) = E(X + Y \mid X + Y) = X + Y = T

So, without even using the Poisson, E(X \mid T) = T/2.

Proposition 27.2 (Adam's Law). Let X and Y be random variables. Then

  E(E(Y \mid X)) = E(Y)
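A simulation sketch of the Poisson example above (not from the lecture; \lambda, the conditioning value n, and the sample size are arbitrary): conditional on T = n, the average of X should be close to n/2, and Adam's Law says E(T/2) = E(X).

import numpy as np

rng = np.random.default_rng(4)
lam, reps = 4.0, 200_000

x = rng.poisson(lam, size=reps)
y = rng.poisson(lam, size=reps)
t = x + y

n = 10
print(x[t == n].mean(), n / 2)        # E(X | T = n) is approximately n/2
print(np.mean(t / 2), np.mean(x))     # Adam's Law: E(E(X|T)) = E(T/2) = E(X) = lam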


Example. Let X ~ N(0, 1), Y = X^2. Then

  E(Y \mid X) = E(X^2 \mid X) = X^2 = Y

On the other hand,

  E(X \mid Y) = E(X \mid X^2) = 0

since, after observing X^2 = a, then X = \pm\sqrt{a} with equal likelihood of being positive or negative (since the standard normal is symmetric about 0). Note that this doesn't mean X and X^2 are independent.

Example. Suppose we have a stick, break off a random piece, and then break off another random piece. We can model this as X ~ Unif(0, 1), Y \mid X ~ Unif(0, X). We know that

  E(Y \mid X = x) = \frac{x}{2}

and hence

  E(Y \mid X) = \frac{X}{2}

Note that

  E(E(Y \mid X)) = \frac{1}{4} = E(Y)

That is, on average, we take half the stick and then take half of that.

Proposition 28.1. Let X and Y be random variables.

1. E(h(X)Y \mid X) = h(X) E(Y \mid X).

2. E(Y \mid X) = E(Y) if X and Y are independent (the converse, however, is not true in general).

3. E(E(Y \mid X)) = E(Y). This is called iterated expectation or Adam's Law; it is usually more useful to think of this as computing E(Y) by choosing a simple X to work with.

4. E((Y - E(Y \mid X)) h(X)) = 0. In words, the residual (i.e., Y - E(Y \mid X)) is uncorrelated with h(X):

  Cov(Y - E(Y \mid X), h(X)) = \underbrace{E((Y - E(Y \mid X)) h(X))}_{0} - \underbrace{E(Y - E(Y \mid X))}_{0} E(h(X))

To better understand (4), we can think of X and Y as vectors (the vector space has inner product \langle X, Y \rangle = E(XY)). We can think of E(Y \mid X) as the projection of Y onto the plane consisting of all functions of X. In this picture, the residual vector Y - E(Y \mid X) is orthogonal to the plane of all functions of X, and thus \langle Y - E(Y \mid X), h(X) \rangle = 0.

Proof. We will prove all the properties above.

1. Given X, the factor h(X) is known, so taking it outside the conditional expectation is equivalent to factoring out a constant (by linearity).

2. Immediate.

3. We will prove the discrete case. Let E(Y \mid X) = g(X). Then by discrete LOTUS, we have

  E g(X) = \sum_x g(x) P(X = x)
         = \sum_x E(Y \mid X = x) P(X = x)
         = \sum_x \left( \sum_y y P(Y = y \mid X = x) \right) P(X = x)
         = \sum_x \sum_y y P(Y = y \mid X = x) P(X = x)
         = \sum_x \sum_y y P(Y = y, X = x)     (conditional PMF times marginal PMF = joint PMF)
         = \sum_y \sum_x y P(Y = y, X = x)
         = \sum_y y P(Y = y)
         = E(Y)

4. We have

  E((Y - E(Y \mid X)) h(X)) = E(Y h(X)) - E(E(Y \mid X) h(X))
                            = E(Y h(X)) - E(E(h(X) Y \mid X))
                            = E(Y h(X)) - E(Y h(X))
                            = 0   ∎

Definition 28.2. We can define the conditional variance much as we did conditional expectation. Let X and Y be random variables. Then

  Var(Y \mid X) = E(Y^2 \mid X) - (E(Y \mid X))^2 = E((Y - E(Y \mid X))^2 \mid X)

Proposition 28.3 (Eve's Law).

  Var(Y) = E(Var(Y \mid X)) + Var(E(Y \mid X))

Example. Suppose we have three populations, where X = 1 is the first, X = 2 the second, and X = 3 the third, and suppose we know the mean and variance of the height Y of individuals in each of the separate populations. Then Eve's Law says we can take the variance of all three means, and add it to the mean of all three variances, to get the total variance.
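The stick-breaking example is easy to simulate; the sketch below (not from the lecture; the sample size is arbitrary) checks E(Y) = 1/4 and Eve's Law, using Var(Y|X) = X^2/12 and E(Y|X) = X/2.

import numpy as np

rng = np.random.default_rng(5)
reps = 1_000_000

x = rng.uniform(0, 1, size=reps)        # X ~ Unif(0, 1)
y = rng.uniform(0, x)                   # Y | X ~ Unif(0, X)

print(np.mean(y))                                          # approximately 1/4
# Eve's Law: Var(Y) = E(Var(Y|X)) + Var(E(Y|X)) = E(X^2/12) + Var(X/2)
print(np.var(y), np.mean(x**2) / 12 + np.var(x / 2))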


Example. Suppose we choose a random city and then choose a random sample of n people in that city. Let X be the number of people with a particular disease, and Q the proportion of people in the chosen city with the disease. Let us determine E(X) and Var(X), assuming Q ~ Beta(a, b) (a mathematically convenient, flexible distribution).

Assume that X \mid Q ~ Bin(n, Q). Then

  E(X) = E(E(X \mid Q)) = E(nQ) = n \frac{a}{a+b}

  Var(X) = E(Var(X \mid Q)) + Var(E(X \mid Q)) = E(nQ(1-Q)) + n^2 Var(Q)

We have

  E(Q(1-Q)) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^1 q^a (1-q)^b \, dq
            = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \cdot \frac{\Gamma(a+1)\Gamma(b+1)}{\Gamma(a+b+2)}
            = \frac{ab \, \Gamma(a+b)}{(a+b+1)(a+b)\Gamma(a+b)}
            = \frac{ab}{(a+b)(a+b+1)}

and

  Var(Q) = \frac{\mu(1-\mu)}{a+b+1}

where \mu = a/(a+b). This gives us all the information we need to easily compute Var(X).

Example. Consider a store with a random number N of customers. Let X_j be the amount the jth customer spends, with E(X_j) = \mu and Var(X_j) = \sigma^2. Assume that N, X_1, X_2, \ldots are independent. We want to determine the mean and variance of

  X = \sum_{j=1}^N X_j

We might, at first, mistakenly invoke linearity to claim that E(X) = N\mu. But this is incoherent; the LHS is a real number whereas the RHS is a random variable. However, this error highlights something useful: we want to make N a constant, so let us condition on N. Then using the Law of Total Probability, we have

  E(X) = \sum_{n=0}^\infty E(X \mid N = n) P(N = n) = \sum_{n=0}^\infty \mu n P(N = n) = \mu E(N)

Note that we can drop the conditional because N and the X_j are independent; otherwise, this would not be true. We could also apply Adam's Law to get

  E(X) = E(E(X \mid N)) = E(\mu N) = \mu E(N)

To get the variance, we apply Eve's Law to get

  Var(X) = E(Var(X \mid N)) + Var(E(X \mid N)) = E(N\sigma^2) + Var(\mu N) = \sigma^2 E(N) + \mu^2 Var(N)

We now turn our attention to statistical inequalities.

Theorem 29.1 (Cauchy-Schwarz Inequality).

  |E(XY)| \le \sqrt{E(X^2) E(Y^2)}

If X and Y are uncorrelated, E(XY) = (EX)(EY), so we do not need the inequality.

We will not prove this inequality in general. However, if X and Y have mean 0, then

  |Corr(X, Y)| = \frac{|E(XY)|}{\sqrt{E(X^2) E(Y^2)}} \le 1

Theorem 29.2 (Jensen's Inequality). If g : R \to R is convex (i.e., g'' > 0), then

  E g(X) \ge g(EX)

If g is concave (i.e., g'' < 0), then

  E g(X) \le g(EX)

Example. If X is positive, then

  E\left( \frac{1}{X} \right) \ge \frac{1}{EX}

and

  E(\ln X) \le \ln(EX)
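A simulation sketch of the store example (not from the lecture; the choices N ~ Pois(10) and X_j ~ N(\mu, \sigma^2), and all constants, are arbitrary) checks E(X) = \mu E(N) and the Eve's Law variance formula:

import numpy as np

rng = np.random.default_rng(6)
reps = 50_000
mu, sigma = 3.0, 2.0                      # per-customer mean and SD (arbitrary)

n = rng.poisson(10, size=reps)            # N = number of customers
x = np.array([rng.normal(mu, sigma, size=k).sum() for k in n])   # X = total spent by the N customers

print(x.mean(), mu * n.mean())                                   # E(X) = mu E(N)
print(x.var(), sigma**2 * n.mean() + mu**2 * n.var())            # Var(X) = sigma^2 E(N) + mu^2 Var(N)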


Proof. It is true of any convex function g that

  g(x) \ge a + bx

whenever y = a + bx is a line tangent to the graph of g. Take x_0 = E(X) and let y = a + bx be the tangent line at x_0. Then we have

  g(x) \ge a + bx
  g(X) \ge a + bX
  E g(X) \ge E(a + bX) = a + b E(X) = g(EX)   ∎

Theorem 29.3 (Markov Inequality).

  P(|X| \ge a) \le \frac{E|X|}{a}

for any a > 0.

Proof. Let I_{|X| \ge a} be the indicator random variable for the event |X| \ge a. It is always true that

  a I_{|X| \ge a} \le |X|

because if I_{|X| \ge a} = 1, then |X| \ge a and the inequality holds, and if I_{|X| \ge a} = 0, the inequality is trivial since |X| \ge 0. Then, taking expected values, we have

  a E(I_{|X| \ge a}) \le E|X|

and since E(I_{|X| \ge a}) = P(|X| \ge a), the result follows. ∎

Example. Suppose we have 100 people. It is easily possible that at least 95% of the people are younger than average in the group. However, it is not possible that at least 50% are older than twice the average age.

Theorem 29.4 (Chebyshev Inequality).

  P(|X - \mu| \ge a) \le \frac{Var(X)}{a^2}

for \mu = EX and a > 0. Alternatively, we can write

  P(|X - \mu| \ge c \, SD(X)) \le \frac{1}{c^2}

for c > 0.

Proof.

  P(|X - \mu| \ge a) = P((X - \mu)^2 \ge a^2) \le \frac{E((X - \mu)^2)}{a^2} = \frac{Var(X)}{a^2}

where the inequality is by Markov. ∎

Definition 30.1. Let X_1, X_2, \ldots be i.i.d. random variables with mean \mu and variance \sigma^2. The sample mean of the first n random variables is

  \bar{X}_n = \frac{1}{n} \sum_{j=1}^n X_j

We want to answer the question: What happens to the sample mean when n gets large?

Theorem 30.2 (Law of Large Numbers). With probability 1, as n \to \infty,

  \bar{X}_n \to \mu

pointwise. That is, the sample mean of a collection of i.i.d. random variables converges to the true mean.

Example. Suppose that X_j ~ Bern(p). The Law of Large Numbers says that \frac{1}{n}(X_1 + \cdots + X_n) \to p.

Note that the Law of Large Numbers says nothing about the value of any individual X_j. For instance, in the above example with simple successes and failures (which we may model as a series of coin flips), flipping heads many times does not mean that a tails is on its way. Rather, it means that the large but finite string of heads is "swamped" by the infinite flips yet to come.

Theorem 30.3 (Weak Law of Large Numbers). For any c > 0, as n \to \infty,

  P(|\bar{X}_n - \mu| > c) \to 0

Proof (of Weak LoLN). By Chebyshev's inequality,

  P(|\bar{X}_n - \mu| > c) \le \frac{Var(\bar{X}_n)}{c^2} = \frac{\frac{1}{n^2} n \sigma^2}{c^2} = \frac{\sigma^2}{nc^2} \to 0   ∎

Note that the Law of Large Numbers does not tell us anything about the distribution of \bar{X}_n. To study this distribution, and in particular the rate at which \bar{X}_n - \mu \to 0, we might consider

  n^i (\bar{X}_n - \mu)

for various values of i.

Theorem 30.4 (Central Limit Theorem).

  n^{1/2} \, \frac{\bar{X}_n - \mu}{\sigma} \to N(0, 1)

in distribution; that is, the CDFs converge. Equivalently,

  \frac{\sum_{j=1}^n X_j - n\mu}{\sqrt{n} \, \sigma} \to N(0, 1)
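The Central Limit Theorem is easy to see numerically. The sketch below (not from the lecture; the Expo(1) choice and all constants are arbitrary) standardizes sample means of a skewed distribution and compares them with N(0, 1):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 1000, 50_000

# Standardized sample means of Expo(1) draws (for which mu = sigma = 1)
x = rng.exponential(size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0

print(np.mean(z), np.var(z))                       # approximately 0 and 1
print(np.mean(z < 1.0), stats.norm.cdf(1.0))       # the CDFs are close at, e.g., 1.0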


Proof. We will prove the CLT assuming that the MGF M(t) of the X_j exists (note that we have been assuming all along that the first two moments exist). We will show that the MGFs converge, which will imply that the CDFs converge (however, we will not show this fact).

Let us assume WLOG that \mu = 0 and \sigma = 1. Let

  S_n = \sum_{j=1}^n X_j

We will show that the MGF of S_n/\sqrt{n} converges to the MGF of N(0, 1). We have

  E(e^{tS_n/\sqrt{n}}) = E(e^{tX_1/\sqrt{n}}) \cdots E(e^{tX_n/\sqrt{n}})     (the factors are independent)
                       = E(e^{tX_j/\sqrt{n}})^n
                       = M\left( \frac{t}{\sqrt{n}} \right)^n

Taking the limit results in the indeterminate form 1^\infty, which is hard to work with. Instead, we take the log of both sides and then take the limit, to get

  \lim_{n \to \infty} n \ln M\left( \frac{t}{\sqrt{n}} \right) = \lim_{n \to \infty} \frac{\ln M(t/\sqrt{n})}{1/n}
      = \lim_{y \to 0} \frac{\ln M(ty)}{y^2}                  (substituting y = 1/\sqrt{n})
      = \lim_{y \to 0} \frac{t M'(ty)}{2y M(ty)}              (L'Hopital's rule)
      = \frac{t}{2} \lim_{y \to 0} \frac{M'(ty)}{y}           (M(0) = 1, M'(0) = 0)
      = \frac{t^2}{2} \lim_{y \to 0} \frac{M''(ty)}{1}        (L'Hopital's rule)
      = \frac{t^2}{2}                                         (since M''(0) = E(X_j^2) = 1)
      = \ln e^{t^2/2}

and e^{t^2/2} is the N(0, 1) MGF. ∎

Corollary 30.5. Let X ~ Bin(n, p) with X = \sum_{j=1}^n X_j, X_j ~ Bern(p) i.i.d. Then

  P(a \le X \le b) = P\left( \frac{a - np}{\sqrt{npq}} \le \frac{X - np}{\sqrt{npq}} \le \frac{b - np}{\sqrt{npq}} \right) \approx \Phi\left( \frac{b - np}{\sqrt{npq}} \right) - \Phi\left( \frac{a - np}{\sqrt{npq}} \right)

The Poisson approximation works well when n is large, p is small, and \lambda = np is moderate. In contrast, the Normal approximation works well when n is large and p is near 1/2 (to match the symmetry of the normal).

It seems a little strange that we are approximating a discrete distribution with a continuous distribution. In general, to correct for this, we can write

  P(X = a) = P(a - \epsilon < X < a + \epsilon)

where the interval (a - \epsilon, a + \epsilon) contains only the integer a.

Lecture 31 — 11/18/11

Definition 31.1. Let V = Z_1^2 + \cdots + Z_n^2 where the Z_j ~ N(0, 1) are i.i.d. Then V has the chi-squared distribution with n degrees of freedom, V ~ \chi^2_n.

Observation 31.2. It is true, but we will not prove, that

  \chi^2_1 = Gamma(1/2, 1/2)

Since \chi^2_n is the sum of n i.i.d. \chi^2_1 random variables, we have

  \chi^2_n = Gamma(n/2, 1/2)

Definition 31.3. Let Z ~ N(0, 1) and V ~ \chi^2_n be independent. Let

  T = \frac{Z}{\sqrt{V/n}}

Then T has the Student-t distribution with n degrees of freedom, T ~ t_n.

Observation 31.4. The Student-t is symmetric; that is, -T ~ t_n. Note that if n = 1, then T is the ratio of two i.i.d. standard normals, so T becomes the Cauchy distribution (and hence has no mean).

If n \ge 2, then

  E(T) = E(Z) \, E\left( \frac{1}{\sqrt{V/n}} \right) = 0

Note that in general, T ~ t_n will only have moments up to (but not including) the nth.

Observation 31.5. We proved that

  E(Z^2) = 1, \quad E(Z^4) = 1 \cdot 3, \quad E(Z^6) = 1 \cdot 3 \cdot 5

using MGFs. We can also prove this by noting that

  E(Z^{2n}) = E((Z^2)^n)

and that Z^2 ~ \chi^2_1 = Gamma(1/2, 1/2). Then we can simply use LOTUS to get our desired moments.
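A quick numerical check of Observation 31.5 (a sketch, not from the lecture; the sample size is arbitrary): the moments E(Z^{2c}) computed by Monte Carlo, by LOTUS via Z^2 ~ Gamma(1/2, 1/2), and by the product 1 · 3 ··· (2c - 1) should all agree.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
z = rng.normal(size=1_000_000)

for c in (1, 2, 3):
    mc = np.mean(z ** (2 * c))                          # Monte Carlo estimate of E(Z^(2c))
    lotus = stats.gamma(a=0.5, scale=2.0).moment(c)     # E((Z^2)^c) using Z^2 ~ Gamma(1/2, 1/2)
    odd = np.prod(np.arange(1, 2 * c, 2))               # 1 * 3 * ... * (2c - 1)
    print(mc, lotus, odd)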


Observation 31.6. The Student-t distribution looks much like the normal distribution but is heavier-tailed, especially if n is small. As n \to \infty, we claim that the Student-t converges to the standard normal.

Let

  T_n = \frac{Z}{\sqrt{V_n/n}}

where Z_1, Z_2, \ldots ~ N(0, 1) are i.i.d., V_n = Z_1^2 + \cdots + Z_n^2, and Z ~ N(0, 1) is independent of the Z_j. By the Law of Large Numbers, with probability 1,

  \lim_{n \to \infty} \frac{V_n}{n} = 1

So T_n \to Z, which is standard normal as desired.

Definition 31.7. Let X = (X_1, \ldots, X_k) be a random vector. We say that X has the multivariate normal distribution (MVN) if every linear combination

  t_1 X_1 + \cdots + t_k X_k

of the X_j is normal.

Example. Let Z, W be i.i.d. N(0, 1). Then (Z + 2W, 3Z + 5W) is MVN, since

  s(Z + 2W) + t(3Z + 5W) = (s + 3t)Z + (2s + 5t)W

is a sum of independent normals and hence normal.

Example. Let Z ~ N(0, 1). Let S be a random sign (\pm 1 with equal probabilities) independent of Z. Then Z and SZ are marginally standard normal. However, (Z, SZ) is not multivariate normal, since Z + SZ is 0 with probability 1/2.

Observation 31.8. Recall that the MGF for X ~ N(\mu, \sigma^2) is given by

  E(e^{tX}) = e^{t\mu + t^2\sigma^2/2}

Suppose that X = (X_1, \ldots, X_k) is MVN. Let \mu_j = EX_j. Then the MGF of X is given by

  E(e^{t_1 X_1 + \cdots + t_k X_k}) = \exp\left( t_1 \mu_1 + \cdots + t_k \mu_k + \frac{1}{2} Var(t_1 X_1 + \cdots + t_k X_k) \right)

Theorem 31.9. Let X = (X_1, \ldots, X_k) be MVN. Then within X, uncorrelated implies independent. For instance, if we write X = (X_1, X_2), if every component of X_1 is uncorrelated with every component of X_2, then X_1 is independent of X_2.

Example. Let X, Y be i.i.d. N(0, 1). Then (X + Y, X - Y) is MVN. We also have that

  Cov(X + Y, X - Y) = Var(X) + Cov(X, Y) - Cov(X, Y) - Var(Y) = 0

So by our above theorem, X + Y and X - Y are independent.

Definition 32.1. Let X_0, X_1, X_2, \ldots be a sequence of random variables. We think of X_n as the state of a finite system at a discrete time n (that is, the X_n have discrete indices and each has finite range). The sequence has the Markov property if

  P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)

In casual terms, in a system with the Markov property, the past and future are conditionally independent given the present. Such a system is called a Markov chain.

If P(X_{n+1} = j \mid X_n = i) does not depend on the time n, then we denote

  q_{ij} := P(X_{n+1} = j \mid X_n = i)

called the transition probability, and we call the sequence a homogeneous Markov chain.

To describe a homogeneous Markov chain we simply need to show the states of the process and the transition probabilities. We could, instead, array the q_{ij}'s as a matrix,

  Q = (q_{ij})

called the transition matrix.

Note. More generally, we could consider continuous systems (i.e., spaces) at continuous times and more broadly study stochastic processes. However, in this course, we will restrict our study to homogeneous Markov chains.

Example. The following diagram describes a (homogeneous) Markov chain:

[State diagram on states 1, 2, 3, 4, with the transition probabilities recorded in the matrix Q below.]

We could alternatively describe the same Markov chain by specifying its transition matrix

  Q = \begin{pmatrix} 1/3 & 2/3 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/2 & 0 & 1/4 & 1/4 \end{pmatrix}
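A small sketch (not from the lecture) encoding the transition matrix above: each row of Q sums to 1, and simulating the chain already hints at the long-run behavior studied below. States are labelled 0 through 3 in the code rather than 1 through 4.

import numpy as np

rng = np.random.default_rng(9)

# The example transition matrix above
Q = np.array([[1/3, 2/3, 0,   0  ],
              [1/2, 0,   1/2, 0  ],
              [0,   0,   0,   1  ],
              [1/2, 0,   1/4, 1/4]])
print(Q.sum(axis=1))     # every row sums to 1

# Simulate the chain for a while, starting from state 0
state, visits = 0, np.zeros(4)
for _ in range(100_000):
    state = rng.choice(4, p=Q[state])
    visits[state] += 1
print(visits / visits.sum())   # long-run fraction of time spent in each state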


Suppose that X_n has distribution s (a row vector, which represents the PMF of X_n). Then

  P(X_{n+1} = j) = \sum_i P(X_{n+1} = j \mid X_n = i) P(X_n = i) = \sum_i q_{ij} s_i = (sQ)_j

So sQ is the distribution of X_{n+1}; iterating, we have that sQ^j is the distribution of X_{n+j}.

We can also compute the two-step transition probability:

  P(X_{n+2} = j \mid X_n = i) = \sum_k P(X_{n+2} = j \mid X_{n+1} = k, X_n = i) P(X_{n+1} = k \mid X_n = i)
                              = \sum_k P(X_{n+2} = j \mid X_{n+1} = k) P(X_{n+1} = k \mid X_n = i)
                              = \sum_k q_{kj} q_{ik}
                              = \sum_k q_{ik} q_{kj}
                              = (Q^2)_{ij}

More generally, we have

  P(X_{n+m} = j \mid X_n = i) = (Q^m)_{ij}

Definition 32.3. Let s be some probability vector for a Markov chain with transition matrix Q. We say that s is stationary for the chain if

  sQ = s

We also call s a stationary distribution. Note that this is the transpose of an eigenvector equation.

This definition raises the following questions:

1. Does a stationary distribution exist for every Markov chain?

2. Is the stationary distribution unique?

3. Does the chain (in some sense) converge to the stationary distribution?

4. How can we compute it (efficiently)?

Example. The following are some pathological examples of Markov chains (sans transition probabilities), in state-diagram form:

1. Unpathological Markov chain [diagram: states 1, 2, 3, all connected]

2. Disconnected Markov chain [diagram: two components, states 1, 2, 3 and states 4, 5, 6]

3. Markov chain with absorbing states [diagram: states 0, 1, 2, 3, where 0 and 3 are absorbing]

4. Periodic Markov chain [diagram: states 1, 2, 3 arranged in a cycle]

Definition 33.1. A state is recurrent if, starting from that state, there is probability 1 of transitioning back to that state after a finite number of transitions. If a state is not recurrent, it is transient.

Definition 33.2. A Markov chain is irreducible if it is possible (with positive probability) to transition from any state to any other state in a finite number of transitions. Note that in an irreducible chain, all states are recurrent; over an infinite number of transitions, any nonzero probability of returning to a state means that the event of return will occur with probability 1.

Observation 33.3. In our example above, Markov chains 1 and 4 are irreducible; chains 2 and 3 are not. All the states of chain 2 are recurrent; even though the chain itself has two connected components, we will always (i.e., with probability 1) return to the state which we started from.

However, in chain 3, states 1 and 2 are transient. With probability 1, from states 1 and 2, we will at some point transition to state 0 or 3; after that point, we will never return to state 1 or 2. On the other hand, if we start in 0 or 3, we stay there forever; they are clearly recurrent.
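Before stating the general theorem, here is a computational sketch (not from the lecture): a stationary distribution of the example chain above can be found by treating sQ = s as a left-eigenvector problem.

import numpy as np

# Same Q as before (the four-state example chain)
Q = np.array([[1/3, 2/3, 0,   0  ],
              [1/2, 0,   1/2, 0  ],
              [0,   0,   0,   1  ],
              [1/2, 0,   1/4, 1/4]])

# sQ = s means s is a left eigenvector of Q with eigenvalue 1,
# i.e. an ordinary eigenvector of Q transposed.
vals, vecs = np.linalg.eig(Q.T)
s = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
s = s / s.sum()

print(s)               # the stationary distribution
print(s @ Q)           # equals s, up to floating-point error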


Theorem 33.4. For any irreducible Markov chain,

1. A stationary distribution s exists.

2. It is unique.

3. s_i = 1/r_i, where r_i is the average time to return to state i starting from state i.

4. If Q^m is strictly positive for some m, then

  \lim_{n \to \infty} P(X_n = i) = s_i

Alternatively, if t is any (starting-state) probability vector, then

  \lim_{n \to \infty} tQ^n = s

Definition 33.5. A Markov chain with transition matrix Q is reversible if there is a probability vector s such that

  s_i q_{ij} = s_j q_{ji}

Reversibility is also known as time-reversibility. Intuitively, the progression of a reversible Markov chain could be played back backwards, and the probabilities would be consistent with the original Markov chain.

Theorem 33.6. If a Markov chain is reversible with respect to s, then s is stationary.

Proof. We know that s_i q_{ij} = s_j q_{ji} for some s. Summing over all states i,

  \sum_i s_i q_{ij} = \sum_i s_j q_{ji} = s_j \sum_i q_{ji} = s_j

But since this is true for every j, this is exactly the statement of

  sQ = s

as desired. ∎

Example (Random walk on an undirected network). Consider the following example of an undirected Markov chain:

[Diagram: an undirected network on the states 1, 2, 3, 4.]

Let d_i denote the degree of state i (in the pictured network, d_4 = 1). Then we claim that (in general)

  d_i q_{ij} = d_j q_{ji}

Assume i \ne j. Since the Markov chain is undirected, q_{ij} and q_{ji} are either both zero or both nonzero. If (i, j) is an edge, then

  q_{ij} = \frac{1}{d_i}

since our Markov chain represents a random walk. But this suffices to prove our claim.

Let us now normalize d_i to a stationary vector s_i. This is easy; we can simply take

  s_i = \frac{d_i}{\sum_j d_j}

and we have thus found our desired stationary distribution.
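A sketch verifying the random-walk result numerically (not from the lecture; the particular edge set below is an assumed example and need not match the picture): s_i proportional to d_i satisfies both the reversibility condition and sQ = s.

import numpy as np

# A small undirected graph on 4 nodes (an assumed example).
# adj[i, j] = 1 if {i, j} is an edge; note that node 3 (the fourth node) has degree 1.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
deg = adj.sum(axis=1)
Q = adj / deg[:, None]          # random walk: q_ij = 1/d_i for each edge (i, j)

s = deg / deg.sum()             # claim: s_i proportional to the degree d_i
print(s @ Q)                    # equals s, so s is stationary
print(s[:, None] * Q - (s[:, None] * Q).T)   # reversibility: s_i q_ij = s_j q_ji (all zeros)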
