Acknowledgments
In addition to the course staff, acknowledgment goes to Zev Chonoles, whose online lecture notes (http://math.uchicago.edu/~chonoles/expository-notes/) inspired me to post my own. I have also borrowed his format for this introduction page.
The page layout for these notes is based on the layout I used back when I took notes by hand. The LaTeX styles can be found here: https://github.com/mxw/latex-custom.
Copyright
Copyright © 2011 Max Wang.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This means you are free to edit, adapt, transform, or redistribute this work as long as you
• include an attribution of Joe Blitzstein as the instructor of the course these notes are based on, and an
attribution of Max Wang as the note-taker;
• do so in a way that does not suggest that either of us endorses you or your use of this work;
• use this work for noncommercial purposes only; and
• if you adapt or build upon this work, apply this same license to your contributions.
Stat 110—Intro to Probability Max Wang
3. P(A \cup B) = P(A) + P(B) - P(A \cap B).

Proof. All immediate. ∎

Corollary 4.2 (Inclusion-Exclusion). Generalizing 3 above,

P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P\left(\bigcap_{i=1}^n A_i\right)

Example (de Montmort's Problem). Suppose we have n cards labeled 1, \ldots, n. We want to determine the probability that for some card in a shuffled deck of such cards, the ith card has value i. Since the number of orderings of the deck for which a given set of matches occurs is simply the permutations on the remaining cards, we have

P(A_i) = \frac{(n-1)!}{n!} = \frac{1}{n}

P(A_1 \cap A_2) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)}

P(A_1 \cap \cdots \cap A_k) = \frac{(n-k)!}{n!}

So using the above corollary,

Definition 5.1. Two events A and B are independent if

P(A \cap B) = P(A)P(B)

In general, for n events A_1, \ldots, A_n, independence requires i-wise independence for every i = 2, \ldots, n; that is, pairwise independence alone does not imply independence.

Note. We will write P(A \cap B) as P(A, B).

Example (Newton-Pepys Problem). Suppose we have some fair dice; we want to determine which of the following is most likely to occur:

1. At least one 6 with 6 dice.
2. At least two 6's with 12 dice.
3. At least three 6's with 18 dice.

For the first case, we have

P(A) = 1 - \left(\frac{5}{6}\right)^6

For the second,

P(B) = 1 - \left(\frac{5}{6}\right)^{12} - 12\left(\frac{1}{6}\right)\left(\frac{5}{6}\right)^{11}

and for the third,

P(C) = 1 - \sum_{k=0}^{2} \binom{18}{k} \left(\frac{1}{6}\right)^k \left(\frac{5}{6}\right)^{18-k}
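The three Newton-Pepys probabilities can be evaluated exactly; a quick sketch in Python (the helper name is mine, not from the notes):

```python
from math import comb

def prob_at_least(m, dice):
    """P(at least m sixes among `dice` fair dice), by complementing P(fewer than m)."""
    return 1 - sum(comb(dice, k) * (1 / 6) ** k * (5 / 6) ** (dice - k)
                   for k in range(m))

pA = prob_at_least(1, 6)    # at least one 6 with 6 dice
pB = prob_at_least(2, 12)   # at least two 6's with 12 dice
pC = prob_at_least(3, 18)   # at least three 6's with 18 dice
# The first event is the most likely: pA > pB > pC.
```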
for the success of two doctors for two different operations. Note that although Hibbert has a higher success rate conditional on each operation, Nick's success rate is higher overall. Let us denote by A the event of a successful operation, B the event of being treated by Nick, and C the event of having heart surgery. In other words, then, we have

P(A | B, C) < P(A | B^C, C)

and

P(A | B, C^C) < P(A | B^C, C^C)

but

P(A | B) > P(A | B^C)

In this example, C is the confounder.

Lecture 8 — 9/19/11

Definition 8.1. A one-dimensional random walk models a (possibly infinite) sequence of successive steps along the number line, where, starting from some position i, we have a probability p of moving +1 and a probability q = 1 - p of moving -1.

Example. An example of a one-dimensional random walk is the gambler's ruin problem, which asks: Given two individuals A and B playing a sequence of successive rounds of a game in which they bet $1, with A winning B's dollar with probability p and A losing a dollar to B with probability q = 1 - p, what is the probability that A wins the game (supposing A has i dollars and B has n - i dollars)? This problem can be modeled by a random walk with absorbing states at 0 and n, starting at i.

To solve this problem, we perform first-step analysis; that is, we condition on the first step. Let p_i = P(A wins game | A starts at i). Then by the Law of Total Probability, for 1 \le i \le n - 1,

p_i = p\,p_{i+1} + q\,p_{i-1}

and of course we have p_0 = 0 and p_n = 1. This equation is a difference equation.

To solve this equation, we start by guessing

p_i = x^i

Then we have

x^i = p x^{i+1} + q x^{i-1}

p x^2 - x + q = 0

x = \frac{1 \pm \sqrt{1 - 4pq}}{2p} = \frac{1 \pm \sqrt{1 - 4p(1-p)}}{2p} = \frac{1 \pm \sqrt{4p^2 - 4p + 1}}{2p} = \frac{1 \pm (2p - 1)}{2p} = 1, \frac{q}{p}

As with differential equations, this gives a general solution of the form

p_i = A \cdot 1^i + B \left(\frac{q}{p}\right)^i

for p \ne q (to avoid a repeated root). Our boundary conditions for p_0 and p_n give

B = -A

and

1 = A\left(1 - \left(\frac{q}{p}\right)^n\right)

To solve for the case where p = q, we can let x = q/p \to 1 and take

\lim_{x \to 1} \frac{1 - x^i}{1 - x^n} = \lim_{x \to 1} \frac{i x^{i-1}}{n x^{n-1}} = \frac{i}{n}

So we have

p_i = \begin{cases} \dfrac{1 - (q/p)^i}{1 - (q/p)^n} & p \ne q \\ \dfrac{i}{n} & p = q \end{cases}

Now suppose that p = 0.49 and i = n - i. Then we have the following surprising table:

N   | P(A wins)
20  | 0.40
100 | 0.12
200 | 0.02

Note that this table is true when the odds are only slightly against A and when A and B start off with equal funding; it is easy to see that in a typical gambler's situation, the chance of winning is extremely small.

Definition 8.2. A random variable is a function

X : S \to \mathbb{R}

from some sample space S to the real line. A random variable acts as a "summary" of some aspect of an experiment.

Definition 8.3. A random variable X is said to have the Bernoulli distribution if X has only two possible values, 0 and 1, and there is some p such that

P(X = 1) = p \qquad P(X = 0) = 1 - p

We say that

X \sim \mathrm{Bern}(p)
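The closed form for the gambler's ruin is easy to tabulate; a small sketch reproducing the table above, assuming p = 0.49 and equal starting funds (the function name is mine):

```python
def ruin_win_prob(p, i, n):
    """P(A wins) starting with i of n total dollars, via the closed form above."""
    q = 1 - p
    if p == q:
        return i / n
    r = q / p
    return (1 - r ** i) / (1 - r ** n)

# Equal starting funds, odds only slightly against A.
table = {N: ruin_win_prob(0.49, N // 2, N) for N in (20, 100, 200)}
# Roughly 0.40, 0.12, 0.02 — a small edge becomes ruinous as stakes grow.
```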
Note. We write X = 1 to denote the event

\{s \in S : X(s) = 1\} = X^{-1}(\{1\})

Definition 8.4. The distribution of the number of successes in n independent Bern(p) trials is called the binomial distribution and is given by

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}

where 0 \le k \le n. We write

X \sim \mathrm{Bin}(n, p)

Definition 8.5. The probability mass function (PMF) of a discrete random variable X is the function given by p_X(x) = P(X = x).

Definition 9.1. The cumulative distribution function (CDF) of a random variable X is

F_X(x) = P(X \le x)

Note. The requirements for a PMF with values p_i are that each p_i \ge 0 and \sum_i p_i = 1. For Bin(n, p), we can easily verify this with the binomial theorem, which yields

\sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} = (p + q)^n = 1

Proposition 9.2. If X, Y are independent random variables and X \sim Bin(n, p), Y \sim Bin(m, p), then

X + Y \sim \mathrm{Bin}(n + m, p)

Proof. This is clear from our "story" definition of the binomial distribution, as well as from our indicator r.v.'s. Let us also check this using PMFs.

P(X + Y = k) = \sum_{j=0}^{k} P(X + Y = k \mid X = j) P(X = j)
= \sum_{j=0}^{k} P(Y = k - j \mid X = j) \binom{n}{j} p^j q^{n-j}
(independence) = \sum_{j=0}^{k} P(Y = k - j) \binom{n}{j} p^j q^{n-j}
= \sum_{j=0}^{k} \binom{m}{k-j} p^{k-j} q^{m-(k-j)} \binom{n}{j} p^j q^{n-j}
= p^k q^{n+m-k} \sum_{j=0}^{k} \binom{m}{k-j} \binom{n}{j}
= \binom{n+m}{k} p^k q^{n+m-k}

by Vandermonde's identity. ∎

Definition 9.3. Suppose we have w white and b black marbles, out of which we choose a simple random sample of n. The distribution of the number of white marbles in the sample, which we will call X, is given by

P(X = k) = \frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}}

where 0 \le k \le w and 0 \le n - k \le b. This is called the hypergeometric distribution, denoted HGeom(w, b, n).

Proof. We should show that the above is a valid PMF. It is clearly nonnegative. We also have, by Vandermonde's identity,

\sum_{k=0}^{w} \frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}} = \frac{\binom{w+b}{n}}{\binom{w+b}{n}} = 1

∎
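Vandermonde's identity is exactly what makes the hypergeometric PMF sum to 1; a quick exact check with rational arithmetic (the helper name is mine):

```python
from fractions import Fraction
from math import comb

def hgeom_pmf(k, w, b, n):
    """P(X = k) for X ~ HGeom(w, b, n), as an exact fraction."""
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

w, b, n = 7, 5, 6
total = sum(hgeom_pmf(k, w, b, n)
            for k in range(max(0, n - b), min(w, n) + 1))
# Vandermonde: sum_k C(w,k) C(b,n-k) = C(w+b,n), so the PMF sums to exactly 1.
```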
Note. The difference between the hypergeometric and binomial distributions is whether or not we sample with replacement. We would expect that in the limiting case of a large population (w + b \to \infty), they would behave similarly.

Lecture 10 — 9/23/11

Proposition 10.1 (Properties of CDFs). A function F_X is a valid CDF iff the following hold about F_X:

1. monotonically nondecreasing
2. right-continuous
3. \lim_{x \to -\infty} F_X(x) = 0 and \lim_{x \to \infty} F_X(x) = 1.

Definition 10.2. Two random variables X and Y are independent if \forall x, y,

P(X \le x, Y \le y) = P(X \le x) P(Y \le y)

In the discrete case, we can say equivalently that

P(X = x, Y = y) = P(X = x) P(Y = y)

Note. As an aside before we move on to discuss averages and expected values, recall that

\frac{1}{n} \sum_{i=1}^{n} i = \frac{n+1}{2}

Example. Suppose we want to find the average of 1, 1, 1, 1, 1, 3, 3, 5. We could just add these up and divide by 8, or we could formulate the average as a weighted average,

\frac{5}{8} \cdot 1 + \frac{2}{8} \cdot 3 + \frac{1}{8} \cdot 5

Definition 10.3. The expected value or average of a discrete random variable X is

E(X) = \sum_{x \in \mathrm{Im}(X)} x P(X = x)

Observation 10.4. Let X \sim Bern(p). Then

E(X) = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = p

Definition 10.5. If A is some event, then an indicator random variable for A is

X = \begin{cases} 1 & A \text{ occurs} \\ 0 & \text{otherwise} \end{cases}

By definition, X \sim Bern(P(A)), and by the above,

E(X) = P(A)

The above shows that to get the probability of an event, we can simply compute the expected value of an indicator.

Observation 10.6. Let X \sim Bin(n, p). Then (using the binomial theorem),

E(X) = \sum_{k=0}^{n} k \binom{n}{k} p^k q^{n-k}
= \sum_{k=1}^{n} k \binom{n}{k} p^k q^{n-k}
= \sum_{k=1}^{n} n \binom{n-1}{k-1} p^k q^{n-k}
= np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} q^{n-k}
= np \sum_{j=0}^{n-1} \binom{n-1}{j} p^j q^{n-1-j}
= np

Proposition 10.7. Expected value is linear; that is, for random variables X and Y and some constant c,

E(X + Y) = E(X) + E(Y)

and

E(cX) = c E(X)

Observation 10.8. Using linearity, given X \sim Bin(n, p), since we know

X = X_1 + \cdots + X_n

where the X_i are i.i.d. Bern(p), we have

E(X) = p + \cdots + p = np

Example. Suppose that, once again, we are choosing a five card hand out of a standard deck, with X = \# aces. If X_i is an indicator of the ith card being an ace, we have

E(X) = E(X_1 + \cdots + X_5)
= E(X_1) + \cdots + E(X_5)
(by symmetry) = 5 E(X_1)
= 5 P(\text{first card is ace})
= \frac{5}{13}

Note that this holds even though the X_i are dependent.

Definition 10.9. The geometric distribution, Geom(p), is the number of failures of independent Bern(p) trials before the first success. Its PMF is given by (for X \sim Geom(p))

P(X = k) = q^k p
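The ace example can be verified exactly against the full distribution: the number of aces in a 5-card hand is HGeom(4, 48, 5), and its mean agrees with the linearity answer 5/13 even though the indicators are dependent. A quick sketch:

```python
from fractions import Fraction
from math import comb

# X = number of aces in a 5-card hand: X ~ HGeom(4, 48, 5).
# Linearity of expectation predicts E(X) = 5 * (4/52) = 5/13.
mean = sum(Fraction(k) * Fraction(comb(4, k) * comb(48, 5 - k), comb(52, 5))
           for k in range(5))
```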
for k \in \mathbb{N}. Note that this PMF is valid since

\sum_{k=0}^{\infty} p q^k = p \cdot \frac{1}{1 - q} = 1

Observation 10.10. Let X \sim Geom(p). We have our formula for infinite geometric series,

\sum_{k=0}^{\infty} q^k = \frac{1}{1-q}

Taking the derivative of both sides gives

\sum_{k=1}^{\infty} k q^{k-1} = \frac{1}{(1-q)^2}

Then

E(X) = \sum_{k=0}^{\infty} k p q^k = pq \sum_{k=1}^{\infty} k q^{k-1} = \frac{pq}{(1-q)^2} = \frac{q}{p}

Alternatively, we can use first step analysis and write a recursive formula for E(X). If we condition on what happens in the first Bernoulli trial, we have

E(X) = 0 \cdot p + (1 + E(X)) q
E(X) - q E(X) = q
E(X) = \frac{q}{1 - q} = \frac{q}{p}

Lecture 11 — 9/26/11

Recall our assertion that E, the expected value function, is linear. We now prove this statement.

Proof. Let X and Y be discrete random variables. We want to show that E(X + Y) = E(X) + E(Y).

E(X + Y) = \sum_t t P(X + Y = t)
= \sum_s (X + Y)(s) P(\{s\})
= \sum_s (X(s) + Y(s)) P(\{s\})
= \sum_s X(s) P(\{s\}) + \sum_s Y(s) P(\{s\})
= \sum_x x P(X = x) + \sum_y y P(Y = y)
= E(X) + E(Y)

The proof that E(cX) = c E(X) is similar. ∎

Definition 11.1. The negative binomial distribution, NB(r, p), is given by the number of failures of independent Bern(p) trials before the rth success. The PMF for X \sim NB(r, p) is given by

P(X = n) = \binom{n + r - 1}{r - 1} p^r (1 - p)^n

for n \in \mathbb{N}.

Observation 11.2. Let X \sim NB(r, p). We can write X = X_1 + \cdots + X_r where each X_i is the number of failures between the (i-1)th and ith success. Then X_i \sim Geom(p). Thus,

E(X) = E(X_1) + \cdots + E(X_r) = \frac{rq}{p}

Observation 11.3. Let X \sim FS(p), where FS(p) is the time until the first success of independent Bern(p) trials, counting the success. Then if we take Y = X - 1, we have Y \sim Geom(p). So,

E(X) = E(Y) + 1 = \frac{q}{p} + 1 = \frac{1}{p}

Example. Suppose we have a random permutation of \{1, \ldots, n\} with n \ge 2. What is the expected number of local maxima—that is, numbers greater than both their neighbors?

Let I_j be the indicator random variable for position j being a local maximum (1 \le j \le n). We are interested in

E(I_1 + \cdots + I_n) = E(I_1) + \cdots + E(I_n)

For the non-endpoint positions, in each local neighborhood of three numbers (e.g., the 28, 3, 8 in 5, 2, \ldots, 28, 3, 8, \ldots, 14), the probability that the largest number is in the center position is \frac{1}{3}. Moreover, these positions are all symmetrical. Analogously, the probability that an endpoint position is a local maximum is \frac{1}{2}. Then we have

E(I_1) + \cdots + E(I_n) = \frac{n-2}{3} + \frac{2}{2} = \frac{n+1}{3}

Example (St. Petersburg Paradox). Suppose you are given the offer to play a game where a coin is flipped until a heads is landed. Then, for the number of flips i made up to and including the heads, you receive $2^i. How much should you be willing to pay to play this game? That is, what price would make the game fair, or the expected value zero?

Let X be the number of flips of the fair coin up to and including the first heads. Clearly, X \sim FS(\frac{1}{2}). If we let Y = 2^X, we want to find E(Y). We have

E(Y) = \sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^{\infty} 1 = \infty
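The local-maxima answer (n + 1)/3 can be confirmed by brute-force enumeration over all permutations of a small n; a quick exact sketch (the helper name is mine):

```python
from fractions import Fraction
from itertools import permutations

def local_maxima(perm):
    """Count positions strictly greater than all their neighbors (one neighbor at ends)."""
    n = len(perm)
    count = 0
    for j in range(n):
        left_ok = j == 0 or perm[j] > perm[j - 1]
        right_ok = j == n - 1 or perm[j] > perm[j + 1]
        count += left_ok and right_ok
    return count

n = 4
perms = list(permutations(range(n)))
avg = Fraction(sum(local_maxima(p) for p in perms), len(perms))
# Indicator argument predicts (n + 1)/3 = 5/3 for n = 4.
```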
This assumes, however, that our cash source is boundless. If we bound it at 2^K for some specific K, we should only bet K dollars for a fair game—this is a sizeable difference.

Lecture 12 — 9/28/11

Definition 12.1. The Poisson distribution, Pois(\lambda), is given by the PMF

P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}

for k \in \mathbb{N}, X \sim Pois(\lambda). We call \lambda the rate parameter.

Observation 12.2. Checking that this PMF is indeed valid, we have

\sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1

Its mean is given by

E(X) = e^{-\lambda} \sum_{k=0}^{\infty} k \frac{\lambda^k}{k!}
= e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!}
= \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!}
= \lambda e^{-\lambda} e^{\lambda}
= \lambda

The Poisson distribution is often used for applications where we count the successes of a large number of trials where the per-trial success rate is small. For example, the Poisson distribution is a good starting point for counting the number of people who email you over the course of an hour. The number of chocolate chips in a chocolate chip cookie is another good candidate for a Poisson distribution, or the number of earthquakes in a year in some particular region.

Since the Poisson distribution is not bounded, these examples will not be precisely Poisson. However, in general, with a large number of events A_i with small P(A_i), and where the A_i are all independent or "weakly dependent," the number of the A_i that occur is approximately Pois(\lambda), with \lambda \approx \sum_{i=1}^{n} P(A_i). We call this a Poisson approximation.

Proposition 12.3. Let X \sim Bin(n, p). Then as n \to \infty, p \to 0, with \lambda = np held constant, the distribution of X converges to Pois(\lambda).

Proof. Fix k. Then as n \to \infty and p \to 0,

\lim P(X = k) = \lim \binom{n}{k} p^k (1-p)^{n-k}
= \lim \frac{n(n-1)\cdots(n-k+1)}{k!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^n \left(1 - \frac{\lambda}{n}\right)^{-k}
= \frac{\lambda^k}{k!} \cdot e^{-\lambda}

∎

Example. Suppose we have n people and we want to know the approximate probability that at least three individuals have the same birthday. There are \binom{n}{3} triplets of people; for each triplet, let I_{ijk} be the indicator r.v. that persons i, j, and k have the same birthday. Let X = \# triple matches. Then we know that

E(X) = \binom{n}{3} \frac{1}{365^2}

To approximate P(X \ge 1), we approximate X \sim Pois(\lambda) with \lambda = E(X). Then we have

P(X \ge 1) = 1 - P(X = 0) = 1 - \frac{e^{-\lambda} \lambda^0}{0!} = 1 - e^{-\lambda}

Lecture 13 — 9/30/11

Definition 13.1. Let X be a random variable. Then X has a probability density function (PDF) f_X(x) if

P(a \le X \le b) = \int_a^b f_X(x)\,dx

A valid PDF must satisfy

1. \forall x, f_X(x) \ge 0
2. \int_{-\infty}^{\infty} f_X(x)\,dx = 1

Note. For \epsilon > 0 very small, we have

f_X(x_0) \cdot \epsilon \approx P\left(X \in \left(x_0 - \frac{\epsilon}{2}, x_0 + \frac{\epsilon}{2}\right)\right)

Theorem 13.2. If X has PDF f_X, then its CDF is

F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt

If X is continuous and has CDF F_X, then its PDF is

f_X(x) = F_X'(x)

Moreover,

P(a < X < b) = \int_a^b f_X(x)\,dx = F_X(b) - F_X(a)
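The triple-birthday Poisson approximation is a two-line computation; a quick sketch (the function name is mine):

```python
from math import comb, exp

def p_triple_birthday(n):
    """Poisson approximation to P(at least one triple birthday match) among n people."""
    lam = comb(n, 3) / 365 ** 2   # lambda = E(X) = C(n, 3) / 365^2
    return 1 - exp(-lam)

p50 = p_triple_birthday(50)   # a bit under 14% for 50 people
```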
for some constant c. To find c, we note that, by the definition of PDF, we have

\int_a^b c\,dx = 1
c(b - a) = 1
c = \frac{1}{b - a}

Observation 13.9. The variance of U \sim Unif(a, b) is given by

\mathrm{Var}(U) = E(U^2) - (EU)^2
= \int_a^b x^2 f_U(x)\,dx - \left(\frac{b+a}{2}\right)^2
= \frac{1}{b-a} \int_a^b x^2\,dx - \left(\frac{b+a}{2}\right)^2
= \frac{(b-a)^2}{12}
since P(U \le F(x)) is the length of the interval [0, F(x)], which is F(x). For the second part,

P(F(X) \le x) = P(X \le F^{-1}(x)) = F(F^{-1}(x)) = x

since F is X's CDF. But this shows that F(X) \sim Unif(0, 1). ∎

Example. Let F(x) = 1 - e^{-x} with x > 0 be the CDF of an r.v. X. Then F(X) = 1 - e^{-X} by an application of the second part of Universality of the Uniform.

Example. Let F(x) = 1 - e^{-x} with x > 0, and also let U \sim Unif(0, 1). Suppose we want to simulate F with a random variable X; that is, X \sim F. Then computing the inverse

F^{-1}(u) = -\ln(1 - u)

yields F^{-1}(U) = -\ln(1 - U) \sim F.

Proof. We want to prove that our PDF is valid; to do so, we will simply determine the value of the normalizing constant that makes it so. We will integrate the square of the PDF sans constant because it is easier than integrating naïvely:

\int_{-\infty}^{\infty} e^{-z^2/2}\,dz \int_{-\infty}^{\infty} e^{-z^2/2}\,dz
= \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2}\,dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy
= \int_0^{2\pi} \int_0^{\infty} e^{-r^2/2}\, r\,dr\,d\theta

Substituting u = \frac{r^2}{2}, du = r\,dr,

= \int_0^{2\pi} \left(\int_0^{\infty} e^{-u}\,du\right) d\theta
= 2\pi

so the normalizing constant is \frac{1}{\sqrt{2\pi}}.
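The inverse-transform recipe F^{-1}(U) = -ln(1 - U) is directly runnable; a seeded sketch checking that the simulated Expo(1) draws have mean near 1 (the seed and sample size are arbitrary choices of mine):

```python
import random
from math import log

def sample_expo(u):
    """F^{-1}(u) = -ln(1 - u) turns U ~ Unif(0,1) into X ~ Expo(1)."""
    return -log(1 - u)

rng = random.Random(110)
samples = [sample_expo(rng.random()) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # should be close to E(Expo(1)) = 1
```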
Observation 14.5. Let us compute the mean and variance of Z \sim N(0, 1). We have

EZ = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z e^{-z^2/2}\,dz = 0

by symmetry (the integrand is odd). The variance reduces to

\mathrm{Var}(Z) = E(Z^2) - (EZ)^2 = E(Z^2)

By LOTUS,

E(Z^2) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz
(evenness) = \frac{2}{\sqrt{2\pi}} \int_0^{\infty} z^2 e^{-z^2/2}\,dz
(by parts, with u = z, dv = z e^{-z^2/2}\,dz) = \frac{2}{\sqrt{2\pi}} \left( uv \Big|_0^{\infty} + \int_0^{\infty} e^{-z^2/2}\,dz \right)
= \frac{2}{\sqrt{2\pi}} \left( 0 + \frac{\sqrt{2\pi}}{2} \right)
= 1

We use \Phi to denote the standard normal CDF; so

\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\,dt

By symmetry, we also have

\Phi(-z) = 1 - \Phi(z)

Lecture 15 — 10/5/11

Recall the standard normal distribution. Let Z be an r.v., Z \sim N(0, 1). Then Z has CDF \Phi; it has E(Z) = 0, Var(Z) = E(Z^2) = 1, and E(Z^3) = 0. (These are called the first, second, and third moments.) By symmetry, also -Z \sim N(0, 1).

Definition 15.1. Let X = \mu + \sigma Z, with \mu \in \mathbb{R} (the mean or center), \sigma > 0 (the SD or scale). Then we say X \sim N(\mu, \sigma^2). This is the general normal distribution.

If X \sim N(\mu, \sigma^2), we have E(X) = \mu and Var(\mu + \sigma Z) = \sigma^2 Var(Z) = \sigma^2. We call Z = \frac{X - \mu}{\sigma} the standardization of X. X has CDF

P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right)

which yields a PDF of

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\left(\frac{x - \mu}{\sigma}\right)^2 / 2}

We also have -X = -\mu + \sigma(-Z) \sim N(-\mu, \sigma^2).

Later, we will show that if X_i \sim N(\mu_i, \sigma_i^2) are independent, then

X_i + X_j \sim N(\mu_i + \mu_j, \sigma_i^2 + \sigma_j^2)

and

X_i - X_j \sim N(\mu_i - \mu_j, \sigma_i^2 + \sigma_j^2)

Observation 15.2. If X \sim N(\mu, \sigma^2), we have

P(|X - \mu| \le \sigma) \approx 68\%
P(|X - \mu| \le 2\sigma) \approx 95\%
P(|X - \mu| \le 3\sigma) \approx 99.7\%

Observation 15.3. We observe some properties of the variance.

\mathrm{Var}(X) = E((X - EX)^2) = EX^2 - (EX)^2

For any constant c,

\mathrm{Var}(X + c) = \mathrm{Var}(X)
\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)

Since variance is not linear, in general, Var(X + Y) \ne Var(X) + Var(Y). However, if X and Y are independent, we do have equality. On the other extreme,

\mathrm{Var}(X + X) = \mathrm{Var}(2X) = 4\,\mathrm{Var}(X)

Also, in general,

\mathrm{Var}(X) \ge 0
\mathrm{Var}(X) = 0 \iff \exists a : P(X = a) = 1

Observation 15.4. Let us compute the variance of the Poisson distribution. Let X \sim Pois(\lambda). We have

E(X^2) = \sum_{k=0}^{\infty} k^2 \frac{e^{-\lambda} \lambda^k}{k!}

To reduce this sum, we can do the following:

e^{\lambda} = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}

Taking the derivative w.r.t. \lambda,

e^{\lambda} = \sum_{k=1}^{\infty} \frac{k \lambda^{k-1}}{k!}
Multiplying by \lambda,

\lambda e^{\lambda} = \sum_{k=1}^{\infty} \frac{k \lambda^k}{k!}

Repeating,

e^{\lambda} + \lambda e^{\lambda} = \sum_{k=1}^{\infty} \frac{k^2 \lambda^{k-1}}{k!}
\lambda e^{\lambda} (\lambda + 1) = \sum_{k=1}^{\infty} \frac{k^2 \lambda^k}{k!}

So,

E(X^2) = e^{-\lambda} \sum_{k=0}^{\infty} \frac{k^2 \lambda^k}{k!} = e^{-\lambda} \lambda e^{\lambda} (\lambda + 1) = \lambda^2 + \lambda

So for our variance, we have

\mathrm{Var}(X) = (\lambda^2 + \lambda) - \lambda^2 = \lambda

Observation 15.5. Let us compute the variance of the binomial distribution. Let X \sim Bin(n, p). We can write

X = I_1 + \cdots + I_n

where the I_j are i.i.d. Bern(p). Then,

X^2 = I_1^2 + \cdots + I_n^2 + 2I_1 I_2 + 2I_1 I_3 + \cdots + 2I_{n-1} I_n

where I_i I_j is the indicator of success on both i and j.

E(X^2) = n E(I_1^2) + 2\binom{n}{2} E(I_1 I_2)
= np + n(n-1)p^2
= np + n^2 p^2 - np^2

so Var(X) = E(X^2) - (EX)^2 = np - np^2 = np(1-p).

Lecture 17 — 10/14/11

Definition 17.1. The exponential distribution, Expo(\lambda), is defined by PDF

f(x) = \lambda e^{-\lambda x}

for x > 0 and 0 elsewhere. We call \lambda the rate parameter.

Integrating clearly yields 1, which demonstrates validity. Our CDF is given by

F(x) = \int_0^x \lambda e^{-\lambda t}\,dt = \begin{cases} 1 - e^{-\lambda x} & x > 0 \\ 0 & \text{otherwise} \end{cases}

Observation 17.2. We can normalize any X \sim Expo(\lambda) by multiplying by \lambda, which gives Y = \lambda X \sim Expo(1). We have

P(Y \le y) = P\left(X \le \frac{y}{\lambda}\right) = 1 - e^{-\lambda(y/\lambda)} = 1 - e^{-y}

Let us now compute the mean and variance of Y \sim Expo(1). We have

E(Y) = \int_0^{\infty} y e^{-y}\,dy = \left(-y e^{-y}\right)\Big|_0^{\infty} + \int_0^{\infty} e^{-y}\,dy = 1
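The Poisson identity E(X^2) = \lambda^2 + \lambda can be checked by truncating the series numerically; a quick sketch (the function name and truncation point are mine):

```python
from math import exp, factorial

def pois_second_moment(lam, terms=100):
    """Truncated sum of E(X^2) = sum_k k^2 e^{-lam} lam^k / k! for X ~ Pois(lam)."""
    return sum(k ** 2 * exp(-lam) * lam ** k / factorial(k) for k in range(terms))

lam = 3.0
ex2 = pois_second_moment(lam)
var = ex2 - lam ** 2     # Var(X) = E(X^2) - (EX)^2, which should equal lam
```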
Proposition 17.4. The exponential distribution is memoryless.

Proof. Let X \sim Expo(\lambda). We know that

P(X \ge t) = 1 - P(X \le t) = e^{-\lambda t}

Meanwhile,

P(X \ge s + t \mid X \ge s) = \frac{P(X \ge s + t, X \ge s)}{P(X \ge s)}
= \frac{P(X \ge s + t)}{P(X \ge s)}
= \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}}
= e^{-\lambda t}
= P(X \ge t)

which is our desired result. ∎

Example. Let X \sim Expo(\lambda). Then by linearity and by memorylessness,

E(X \mid X > a) = a + E(X - a \mid X > a) = a + \frac{1}{\lambda}

Lecture 18 — 10/17/11

Theorem 18.1. If X is a positive, continuous random variable that is memoryless (i.e., its distribution is memoryless), then there exists \lambda \in \mathbb{R} such that X \sim Expo(\lambda).

Proof. Let F be the CDF of X and G = 1 - F. By memorylessness,

G(s + t) = G(s) G(t)

We can easily derive from this identity that \forall k \in \mathbb{Q},

G(kt) = G(t)^k

This can be extended to all k \in \mathbb{R}. If we take t = 1, then we have

G(x) = G(1)^x = e^{x \ln G(1)}

which is the survival function of Expo(\lambda) with \lambda = -\ln G(1).

Observation 18.3. We might ask why we call M "moment-generating." Consider the Taylor expansion of M:

E(e^{tX}) = E\left(\sum_{n=0}^{\infty} \frac{X^n t^n}{n!}\right) = \sum_{n=0}^{\infty} \frac{E(X^n) t^n}{n!}

Note that we cannot simply make use of linearity since our sum is infinite; however, this equation does hold for reasons beyond the scope of the course.

This observation also shows us that

E(X^n) = M^{(n)}(0)

Claim 18.4. If X and Y have the same MGF, then they have the same CDF.

We will not prove this claim.

Observation 18.5. If X has MGF M_X, Y has MGF M_Y, and X and Y are independent, then

M_{X+Y} = E(e^{t(X+Y)}) = E(e^{tX}) E(e^{tY}) = M_X M_Y

The second equality comes from the claim (which we will prove later) that for X, Y independent, E(XY) = E(X)E(Y).

Example. Let X \sim Bern(p). Then

M(t) = E(e^{tX}) = p e^t + q

Suppose now that X \sim Bin(n, p). Again, we write X = I_1 + \cdots + I_n where the I_j are i.i.d. Bern(p). Then we see that

M(t) = (p e^t + q)^n

Example. Let Z \sim N(0, 1). We have

M(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz - z^2/2}\,dz
(completing the square) = e^{t^2/2} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(z-t)^2/2}\,dz
= e^{t^2/2}
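The closed-form binomial MGF (pe^t + q)^n can be checked against the MGF computed directly from the PMF; a quick sketch (the function name is mine):

```python
from math import comb, exp

def mgf_binomial_direct(n, p, t):
    """E(e^{tX}) computed straight from the Bin(n, p) PMF."""
    q = 1 - p
    return sum(exp(t * k) * comb(n, k) * p ** k * q ** (n - k)
               for k in range(n + 1))

n, p, t = 10, 0.3, 0.7
closed_form = (p * exp(t) + (1 - p)) ** n   # from the i.i.d. Bernoulli factorization
direct = mgf_binomial_direct(n, p, t)
```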
Expanding,

e^{t^2/2} = \sum_{n=0}^{\infty} \frac{(t^2/2)^n}{n!} = \sum_{n=0}^{\infty} \frac{t^{2n}}{2^n n!} = \sum_{n=0}^{\infty} \frac{(2n)!}{2^n n!} \cdot \frac{t^{2n}}{(2n)!}

So

E(Z^{2n}) = \frac{(2n)!}{2^n n!}

Their separate CDFs and PMFs (e.g., P(X \le x)) are referred to as marginal CDFs, PMFs, or PDFs. X and Y are independent precisely when the joint CDF is equal to the product of the marginal CDFs:

F(x, y) = F_X(x) F_Y(y)
        Y = 0   Y = 1
X = 0   2/6     1/6    | 3/6
X = 1   2/6     1/6    | 3/6
        4/6     2/6

Here we have computed the marginal probabilities (in the margin), and they demonstrate that X and Y are independent.

Example. Let us define the uniform distribution on the unit square, \{(x, y) : x, y \in [0, 1]\}. We want the joint PDF to be constant everywhere in the square and 0 otherwise; that is,

f(x, y) = \begin{cases} c & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}

Normalizing, we simply need c = \frac{1}{\text{area}} = 1. It is apparent that the marginal PDFs are both uniform.

Example. Let us define the uniform distribution on the unit disc, \{(x, y) : x^2 + y^2 \le 1\}. The joint PDF can be given by

f(x, y) = \begin{cases} \frac{1}{\pi} & x^2 + y^2 \le 1 \\ 0 & \text{otherwise} \end{cases}

Given X = x, we have -\sqrt{1 - x^2} \le y \le \sqrt{1 - x^2}. We might guess that Y is uniform, but clearly X and Y are dependent in this case, and it turns out that this is not the case. The conditional density is

f_{Y|X}(y|x) = \frac{1/\pi}{2\sqrt{1 - x^2}/\pi} = \frac{1}{2\sqrt{1 - x^2}}

for -\sqrt{1 - x^2} \le y \le \sqrt{1 - x^2}. Since we are holding x constant, we see that Y|X \sim Unif(-\sqrt{1 - x^2}, \sqrt{1 - x^2}).

From these computations, it is clear, in many ways, that X and Y are not independent. It is not true that f_{X,Y} = f_X f_Y, nor that f_{Y|X} = f_Y.

Proposition 20.2. Let X, Y have joint PDF f, and let g : \mathbb{R}^2 \to \mathbb{R}. Then

E(g(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y)\,dx\,dy

This is LOTUS in two dimensions.

Theorem 20.3. If X, Y are independent random variables, then E(XY) = E(X)E(Y).

Proof. We will prove this in the continuous case. Using LOTUS, we have

E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy f_{X,Y}(x, y)\,dx\,dy
(by independence) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy f_X(x) f_Y(y)\,dx\,dy
= \int_{-\infty}^{\infty} E(X) y f_Y(y)\,dy
= E(X) E(Y)

as desired. ∎
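Two-dimensional LOTUS on the unit square can be approximated with a midpoint Riemann sum; for g(x, y) = xy and (X, Y) uniform on the square, independence predicts E(XY) = E(X)E(Y) = 1/4. A quick sketch (the function name and grid size are mine):

```python
def lotus_2d(g, m=400):
    """Midpoint Riemann sum for E(g(X, Y)), (X, Y) uniform on [0,1]^2 (f = 1)."""
    h = 1.0 / m
    return sum(g((i + 0.5) * h, (j + 0.5) * h)
               for i in range(m) for j in range(m)) * h * h

exy = lotus_2d(lambda x, y: x * y)   # should be very close to 1/4
```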
Definition 22.4. The correlation of two random variables X and Y is

\mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)} = \mathrm{Cov}\left(\frac{X - EX}{\mathrm{SD}(X)}, \frac{Y - EY}{\mathrm{SD}(Y)}\right)

The operation of

\frac{X - EX}{\mathrm{SD}(X)}

is called standardization; it gives the result a mean of 0 and a variance of 1.

Theorem 22.5. |Cor(X, Y)| \le 1.

Proof. We could apply Cauchy-Schwarz to get this result immediately, but we shall also provide a direct proof. WLOG, assume X and Y are standardized. Let \rho = Cor(X, Y). We have

\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\rho = 2 + 2\rho

and we also have

\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\rho = 2 - 2\rho

But since Var \ge 0, this yields our result. ∎

Example. Let (X_1, \ldots, X_k) \sim Mult_k(n, p). We shall compute Cov(X_i, X_j) for all i, j. If i = j, then

\mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i) = n p_i (1 - p_i)

Suppose i \ne j. We can expect that the covariance will be negative, since more objects in category i means fewer in category j. We have

\mathrm{Var}(X_i + X_j) = n p_i (1 - p_i) + n p_j (1 - p_j) + 2\,\mathrm{Cov}(X_i, X_j)

But by "lumping" i and j together, we also have

\mathrm{Var}(X_i + X_j) = n(p_i + p_j)(1 - (p_i + p_j))

Then solving for the covariance, we have

\mathrm{Cov}(X_i, X_j) = -n p_i p_j

Note. Let A be an event and I_A its indicator random variable. It is clear that

I_A^n = I_A

for any n \in \mathbb{N}. It is also clear that

I_A I_B = I_{A \cap B}

Example. Let X \sim Bin(n, p). Write X = X_1 + \cdots + X_n where the X_j are i.i.d. Bern(p). Then

\mathrm{Var}(X_j) = E X_j^2 - (E X_j)^2 = p - p^2 = p(1 - p)

It follows that

\mathrm{Var}(X) = n p (1 - p)

since Cor(X_i, X_j) = 0 for i \ne j by independence.

Lecture 23 — 10/28/11

Example. Let X \sim HGeom(w, b, n). Let us write p = \frac{w}{w+b} and N = w + b. Then we can write X = X_1 + \cdots + X_n where the X_j are Bern(p). (Note, however, that unlike with the binomial, the X_j are not independent.) Then

\mathrm{Var}(X) = n\,\mathrm{Var}(X_1) + 2\binom{n}{2} \mathrm{Cov}(X_1, X_2)
= n p (1 - p) + 2\binom{n}{2} \mathrm{Cov}(X_1, X_2)

Computing the covariance, we have

\mathrm{Cov}(X_1, X_2) = E(X_1 X_2) - (E X_1)(E X_2)
= \frac{w}{w+b} \cdot \frac{w-1}{w+b-1} - \left(\frac{w}{w+b}\right)^2
= \frac{w}{w+b}\left(\frac{w-1}{w+b-1} - p\right)

and simplifying,

\mathrm{Var}(X) = \frac{N-n}{N-1}\, n p (1 - p)

The term \frac{N-n}{N-1} is called the finite population correction; it represents the "offset" from the binomial due to lack of replacement.

Theorem 23.1. Let X be a continuous random variable with PDF f_X, and let Y = g(X) where g is differentiable and strictly increasing. Then the PDF of Y is given by

f_Y(y) = f_X(x) \frac{dx}{dy}

where y = g(x) and x = g^{-1}(y). (Also recall from calculus that \frac{dx}{dy} = \left(\frac{dy}{dx}\right)^{-1}.)

Proof. From the CDF of Y, we get

P(Y \le y) = P(g(X) \le y)
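The finite population correction can be verified exactly against a variance computed straight from the hypergeometric PMF; a quick sketch with rational arithmetic (the function name is mine):

```python
from fractions import Fraction
from math import comb

def hgeom_var(w, b, n):
    """Var(X) for X ~ HGeom(w, b, n), computed directly from the PMF."""
    pmf = {k: Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))
           for k in range(max(0, n - b), min(w, n) + 1)}
    mean = sum(k * q for k, q in pmf.items())
    return sum((k - mean) ** 2 * q for k, q in pmf.items())

w, b, n = 10, 20, 5
N, p = w + b, Fraction(w, w + b)
formula = Fraction(N - n, N - 1) * n * p * (1 - p)   # np(1-p) times the correction
```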
= P(X \le g^{-1}(y)) = F_X(g^{-1}(y)) = F_X(x)

Then, differentiating, we get by the Chain Rule that

f_Y(y) = f_X(x) \frac{dx}{dy}

∎

Example. Consider the log normal distribution, which is given by Y = e^Z for Z \sim N(0, 1). We have

f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}

To put this in terms of y, we substitute z = \ln y. Moreover, we know that

\frac{dy}{dz} = e^z = y

and so,

f_Y(y) = \frac{1}{y} \cdot \frac{1}{\sqrt{2\pi}} e^{-(\ln y)^2/2}

Theorem 23.2. Suppose that X is a continuous random variable in n dimensions, Y = g(X) where g : \mathbb{R}^n \to \mathbb{R}^n is continuously differentiable and invertible. Then

f_Y(y) = f_X(x) \left| \det \frac{\partial x}{\partial y} \right|

= \int_{-\infty}^{\infty} F_Y(t - x) f_X(x)\,dx

Then taking the derivative of both sides,

f_T(t) = \int_{-\infty}^{\infty} f_X(x) f_Y(t - x)\,dx

We now briefly turn our attention to proving the existence of objects with some desired property A using probability. We want to show that P(A) > 0 for some random object, which implies that some such object must exist.

Reframing this question, suppose each object in our universe of objects has some kind of "score" associated with this property; then we want to show that there is some object with a "good" score. But we know that there is an object with score at least equal to the average score, i.e., the expected score of a random object. Showing that this average is "high enough" will prove the existence of an object without specifying one.

Example. Suppose there are 100 people in 15 committees of 20 people each, and that each person is on exactly 3 committees. We want to show that there exist 2 committees with overlap \ge 3. Let us find the average overlap of two random committees. Using indicator random variables for the probability that a given person is on both of those two committees, we get

E(\text{overlap}) = 100 \cdot \frac{\binom{3}{2}}{\binom{15}{2}} = \frac{300}{105} = \frac{20}{7}

Since \frac{20}{7} > 2, some pair of committees must have overlap at least 3.
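The committee computation is a one-liner worth sanity-checking exactly; a quick sketch:

```python
from fractions import Fraction
from math import comb

# Each of the 100 people sits on C(3, 2) = 3 of the C(15, 2) = 105 committee pairs,
# so the expected overlap of a uniformly random pair of committees is:
expected_overlap = Fraction(100 * comb(3, 2), comb(15, 2))
# 20/7 > 2, so some pair of committees must share at least 3 members.
```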
Observation 24.2. Suppose that, based on some data, we have X \mid p \sim Bin(n, p), and that our prior distribution for p is p \sim Beta(a, b). We want to determine the posterior distribution of p, p \mid X. We have

f(p \mid X = k) = \frac{P(X = k \mid p) f(p)}{P(X = k)}
= \frac{\binom{n}{k} p^k (1-p)^{n-k} \cdot c p^{a-1} (1-p)^{b-1}}{P(X = k)}
\propto p^{a+k-1} (1-p)^{b+n-k-1}

so the posterior is p \mid X = k \sim Beta(a + k, b + n - k).

for any a > 0. The gamma function is a continuous extension of the factorial operator on natural numbers. For n a positive integer,

\Gamma(n) = (n - 1)!

More generally,

\Gamma(x + 1) = x \Gamma(x)

Definition 25.2. The standard gamma distribution, Gamma(a, 1), is defined by PDF
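The beta-binomial conjugacy above can be checked numerically: normalizing likelihood-times-prior on a grid should reproduce the Beta(a+k, b+n-k) density. A quick sketch (the parameter choices and function name are mine):

```python
from math import gamma

def beta_pdf(p, a, b):
    """Beta(a, b) density at p."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * p ** (a - 1) * (1 - p) ** (b - 1)

a, b, n, k = 2.0, 3.0, 10, 4
grid = [i / 100 for i in range(1, 100)]
# Unnormalized posterior: binomial likelihood times beta prior, dropping p-free constants.
unnorm = [p ** k * (1 - p) ** (n - k) * p ** (a - 1) * (1 - p) ** (b - 1) for p in grid]
total = sum(unnorm)
posterior = [u / total for u in unnorm]
target = [beta_pdf(p, a + k, b + n - k) for p in grid]
t_total = sum(target)
target = [t / t_total for t in target]
# After normalizing on the grid, the two curves coincide.
```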
Proof. One method of proof, which we will not use, would be to repeatedly convolve the PDFs of the i.i.d. X_j. Instead, we will use MGFs. Suppose that the X_j are i.i.d. Expo(1); we will show that their sum is Gamma(n, 1). The MGF of X_j is given by

M_{X_j}(t) = \frac{1}{1 - t}

for t < 1. Then the MGF of T_n is

M_{T_n}(t) = \left(\frac{1}{1 - t}\right)^n

also for t < 1. We will show that the gamma distribution has the same MGF.

Let Y \sim Gamma(n, 1). Then by LOTUS,

E(e^{tY}) = \frac{1}{\Gamma(n)} \int_0^{\infty} e^{ty} y^n e^{-y} \frac{dy}{y}
= \frac{1}{\Gamma(n)} \int_0^{\infty} y^n e^{-(1-t)y} \frac{dy}{y}

Changing variables, with x = (1 - t)y, then

E(e^{tY}) = \frac{(1-t)^{-n}}{\Gamma(n)} \int_0^{\infty} x^n e^{-x} \frac{dx}{x}
= \left(\frac{1}{1-t}\right)^n \frac{\Gamma(n)}{\Gamma(n)}
= \left(\frac{1}{1-t}\right)^n

∎

Note that this is the MGF for any n > 0, although the sum of exponentials expression requires integral n.

Observation 25.6. Let us compute the moments of X \sim Gamma(a, 1). We want to compute E(X^c). We have

E(X^c) = \frac{1}{\Gamma(a)} \int_0^{\infty} x^c x^a e^{-x} \frac{dx}{x}
= \frac{1}{\Gamma(a)} \int_0^{\infty} x^{a+c} e^{-x} \frac{dx}{x}
= \frac{\Gamma(a + c)}{\Gamma(a)}
= a(a+1) \cdots (a+c-1)

for integral c. If instead, we take X \sim Gamma(a, \lambda), then we will have

E(X^c) = \frac{a(a+1) \cdots (a+c-1)}{\lambda^c}

Lecture 26 — 11/4/11

Observation 26.1 (Gamma-Beta). Let us take X \sim Gamma(a, \lambda) to be your waiting time in line at the bank, and Y \sim Gamma(b, \lambda) your waiting time in line at the post office. Suppose that X and Y are independent. Let T = X + Y; we know that this has distribution Gamma(a + b, \lambda).

Let us compute the joint distribution of T and of W = \frac{X}{X+Y}, the fraction of time spent waiting at the bank. For simplicity of notation, we will take \lambda = 1. The joint PDF is given by

f_{T,W}(t, w) = f_{X,Y}(x, y) \left| \det \frac{\partial(x, y)}{\partial(t, w)} \right|
= \frac{1}{\Gamma(a)\Gamma(b)} x^a e^{-x} y^b e^{-y} \frac{1}{xy} \left| \det \frac{\partial(x, y)}{\partial(t, w)} \right|

We must find the determinant of the Jacobian (here expressed in silly-looking notation). We know that

x + y = t \qquad \frac{x}{x+y} = w

Solving for x and y, we easily find that

x = tw \qquad y = t(1 - w)

Then the determinant of our Jacobian is given by

\det \begin{pmatrix} w & t \\ 1 - w & -t \end{pmatrix} = -tw - t(1 - w) = -t

Taking the absolute value, we then get

f_{T,W}(t, w) = \frac{1}{\Gamma(a)\Gamma(b)} x^a e^{-x} y^b e^{-y} \frac{1}{xy} \cdot t
= \frac{1}{\Gamma(a)\Gamma(b)} w^{a-1} (1-w)^{b-1} t^{a+b} e^{-t} \frac{1}{t}
= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} w^{a-1} (1-w)^{b-1} \cdot \frac{1}{\Gamma(a+b)} t^{a+b} e^{-t} \frac{1}{t}

This is a product of some function of w with the PDF of T, so we see that T and W are independent. To find the marginal distribution of W, we note that the PDF of T integrates to 1 just like any PDF, so we have

f_W(w) = \int_0^{\infty} f_{T,W}(t, w)\,dt = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} w^{a-1} (1-w)^{b-1}

This yields W \sim Beta(a, b) and also gives the normalizing constant of the beta distribution.

It turns out that if X were distributed according to any other distribution, we would not have independence, but proving so is out of the scope of the course.
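The gamma moment formula E(X^c) = \Gamma(a+c)/\Gamma(a) reduces, for integral c, to the rising product a(a+1)\cdots(a+c-1); a quick numerical check (the function name is mine):

```python
from math import gamma

def gamma_moment(a, c, rate=1.0):
    """E(X^c) = Gamma(a + c) / Gamma(a) / rate^c for X ~ Gamma(a, rate)."""
    return gamma(a + c) / gamma(a) / rate ** c

a, c = 2.5, 3
rising = a * (a + 1) * (a + 2)   # a(a+1)...(a+c-1): c = 3 factors
moment = gamma_moment(a, c)
```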
Observation 26.2. Let us find $E(W)$ for $W \sim \mathrm{Beta}(a, b)$. Let us write $W = \frac{X}{X+Y}$ with $X$ and $Y$ defined as above. We have
$$E\left(\frac{X}{X+Y}\right) = \frac{E(X)}{E(X+Y)} = \frac{a}{a+b}$$
Note that in general, the first equality is false! However, because $X+Y$ and $\frac{X}{X+Y}$ are independent, they are uncorrelated, and hence the expectation of their product factors. So
$$E\left(\frac{X}{X+Y}\right) E(X+Y) = E\left(\frac{X}{X+Y} \cdot (X+Y)\right) = E(X)$$

Definition 26.3. Let $X_1, \ldots, X_n$ be i.i.d. The order statistics of this sequence are
$$X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$$
where $X_{(j)}$ is the $j$th smallest of the $X_i$.

Observation 26.4. Let $X_1, \ldots, X_n$ be i.i.d. continuous with PDF $f$ and CDF $F$. We want to find the CDF and PDF of $X_{(j)}$. For the CDF, we have
$$P(X_{(j)} \le x) = \sum_{k=j}^{n} \binom{n}{k} F(x)^k (1 - F(x))^{n-k}$$
since $X_{(j)} \le x$ means at least $j$ of the $X_i$ are at most $x$, and the corresponding PDF is
$$f_{X_{(j)}}(x) = n \binom{n-1}{j-1} F(x)^{j-1} (1 - F(x))^{n-j} f(x)$$

Example. Let $U_1, \ldots, U_n$ be i.i.d. $\mathrm{Unif}(0,1)$, and let us determine the distribution of $U_{(j)}$. Applying the above result, we have
$$f_{U_{(j)}}(x) = n \binom{n-1}{j-1} x^{j-1} (1 - x)^{n-j}$$
for $0 < x < 1$. Thus, we have $U_{(j)} \sim \mathrm{Beta}(j, n-j+1)$. This confirms our earlier result that, for $U_1$ and $U_2$ i.i.d. $\mathrm{Unif}(0,1)$, we have
$$E|U_1 - U_2| = E(U_{\max}) - E(U_{\min}) = \frac{1}{3}$$
because $U_{\max} \sim \mathrm{Beta}(2,1)$ and $U_{\min} \sim \mathrm{Beta}(1,2)$, which have means $\frac{2}{3}$ and $\frac{1}{3}$ respectively.

which is simple and straightforward. We might also, however, try to condition on the value of $Y$ with respect to $X$ using the Law of Total Probability.
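The beta distribution of uniform order statistics can be verified numerically. The following sketch (my own, not part of the notes; $n$ and $j$ are arbitrary choices) checks that the $j$th order statistic has mean $j/(n+1)$, the mean of $\mathrm{Beta}(j, n-j+1)$, and re-checks $E|U_1 - U_2| = \frac{1}{3}$.

```python
import numpy as np

# For U_1, ..., U_n i.i.d. Unif(0,1), U_(j) ~ Beta(j, n-j+1), mean j/(n+1).
rng = np.random.default_rng(0)
n, j, trials = 5, 2, 10**6

u = rng.uniform(size=(trials, n))
u_sorted = np.sort(u, axis=1)
u_j = u_sorted[:, j - 1]        # j-th smallest value in each row

print(u_j.mean())               # should be close to j/(n+1) = 1/3

# Special case from the notes: E|U1 - U2| = E(Umax) - E(Umin) = 2/3 - 1/3.
u1, u2 = rng.uniform(size=(2, 10**6))
print(np.abs(u1 - u2).mean())   # should be close to 1/3
```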
Example (Patterns in coin flips). Suppose we repeatedly flip a fair coin. We want to determine how many flips it takes until HT is observed (including the H and T); similarly, we can ask how many flips it takes to get HH. Let us call these random variables $W_{HT}$ and $W_{HH}$ respectively. Note that, by symmetry,
$$E(W_{HH}) = E(W_{TT}) \quad\text{and}\quad E(W_{HT}) = E(W_{TH})$$
Let us first consider $W_{HT}$. This is the time to the first H, which we will call $W_1$, plus the time $W_2$ to the next T. Then we have
$$E(W_{HT}) = E(W_1) + E(W_2) = 2 + 2 = 4$$

Definition 27.1. Now let us write
$$g(x) = E(Y \mid X = x)$$
Then
$$E(Y \mid X) = g(X)$$
So, suppose for instance that $g(x) = x^2$; then $g(X) = X^2$. We can see that $E(Y \mid X)$ is a random variable and a function of $X$. This is a conditional expectation.

Example. Let $X$ and $Y$ be i.i.d. $\mathrm{Pois}(\lambda)$. Then
$$E(X + Y \mid X) = E(X \mid X) + E(Y \mid X)$$
$$= X + E(Y \mid X) \qquad (X \text{ is a function of itself})$$
$$= X + E(Y) \qquad (X \text{ and } Y \text{ independent})$$
$$= X + \lambda$$
Note that, in general,
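The waiting-time result is easy to simulate. This sketch (mine, not from the lecture) estimates both expectations; $E(W_{HT}) = 4$ as computed above, while the corresponding value for HH works out to 6, so the two patterns genuinely differ.

```python
import random

# Estimate the expected number of fair-coin flips until a pattern appears.

def wait_for(pattern, rng):
    """Number of fair-coin flips until `pattern` (e.g. "HT") first appears."""
    flips = ""
    while pattern not in flips:
        flips += rng.choice("HT")
    return len(flips)

rng = random.Random(0)
trials = 10**5
mean_ht = sum(wait_for("HT", rng) for _ in range(trials)) / trials
mean_hh = sum(wait_for("HH", rng) for _ in range(trials)) / trials
print(mean_ht)   # close to 4
print(mean_hh)   # close to 6
```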
Example. Suppose we choose a random city and then choose a random sample of $n$ people in that city. Let $X$ be the number of people with a particular disease, and $Q$ the proportion of people in the chosen city with the disease. Let us determine $E(X)$ and $\mathrm{Var}(X)$, assuming $Q \sim \mathrm{Beta}(a, b)$ (a mathematically convenient, flexible distribution).

Assume that $X \mid Q \sim \mathrm{Bin}(n, Q)$. Then
$$E(X) = E(E(X \mid Q)) = E(nQ) = n\,\frac{a}{a+b}$$
$$\mathrm{Var}(X) = E(\mathrm{Var}(X \mid Q)) + \mathrm{Var}(E(X \mid Q)) = E(nQ(1-Q)) + n^2\,\mathrm{Var}(Q)$$
We have
$$E(Q(1-Q)) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^1 q^a (1-q)^b\, dq = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \frac{\Gamma(a+1)\Gamma(b+1)}{\Gamma(a+b+2)} = \frac{ab\,\Gamma(a+b)}{(a+b+1)(a+b)\,\Gamma(a+b)} = \frac{ab}{(a+b)(a+b+1)}$$
and
$$\mathrm{Var}(Q) = \frac{\mu(1-\mu)}{a+b+1}$$
where $\mu = \frac{a}{a+b}$. This gives us all the information we need to easily compute $\mathrm{Var}(X)$.

Example. Consider a store with a random number $N$ of customers. Let $X_j$ be the amount the $j$th customer spends, with $E(X_j) = \mu$ and $\mathrm{Var}(X_j) = \sigma^2$. Assume that $N, X_1, X_2, \ldots$ are independent. We want to determine the mean and variance of
$$X = \sum_{j=1}^{N} X_j$$
We might, at first, mistakenly invoke linearity to claim that $E(X) = N\mu$. But this is incoherent; the LHS is a real number whereas the RHS is a random variable. However, this error highlights something useful: we want to make $N$ a constant, so let us condition on $N$. Then using the Law of Total Probability, we have
$$E(X) = \sum_{n=0}^{\infty} E(X \mid N = n)\, P(N = n) = \sum_{n=0}^{\infty} \mu n\, P(N = n) = \mu E(N)$$
Note that we can drop the conditional because $N$ and the $X_j$ are independent; otherwise, this would not be true. We could also apply Adam's Law to get
$$E(X) = E(E(X \mid N)) = E(\mu N) = \mu E(N)$$
To get the variance, we apply Eve's Law to get
$$\mathrm{Var}(X) = E(\mathrm{Var}(X \mid N)) + \mathrm{Var}(E(X \mid N)) = E(\sigma^2 N) + \mathrm{Var}(\mu N) = \sigma^2 E(N) + \mu^2\,\mathrm{Var}(N)$$

We now turn our attention to statistical inequalities.

Theorem 29.1 (Cauchy-Schwarz Inequality).
$$|E(XY)| \le \sqrt{E(X^2)E(Y^2)}$$
If $X$ and $Y$ are uncorrelated, $E(XY) = (EX)(EY)$, so we don't need the inequality. We will not prove this inequality in general. However, if $X$ and $Y$ have mean 0, then
$$|\mathrm{Corr}(X, Y)| = \frac{|E(XY)|}{\sqrt{E(X^2)E(Y^2)}} \le 1$$

Theorem 29.2 (Jensen's Inequality). If $g : \mathbb{R} \to \mathbb{R}$ is convex (i.e., $g'' > 0$), then
$$Eg(X) \ge g(EX)$$
If $g$ is concave (i.e., $g'' < 0$), then
$$Eg(X) \le g(EX)$$

Example. If $X$ is positive, then
$$E\left(\frac{1}{X}\right) \ge \frac{1}{EX} \quad\text{and}\quad E(\ln X) \le \ln(EX)$$

Proof. It is true of any convex function $g$ that
$$g(x) \ge a + bx$$
Proof. We will prove the CLT assuming that the MGF $M(t)$ of the $X_j$ exists (note that we have been assuming all along that the first two moments exist). We will show that the MGFs converge, which will imply that the CDFs converge (however, we will not show this fact).

Let us assume WLOG that $\mu = 0$ and $\sigma = 1$. Let
$$S_n = \sum_{j=1}^{n} X_j$$
We will show that the MGF of $\frac{S_n}{\sqrt{n}}$ converges to the MGF of $N(0,1)$. By independence of the $X_j$, we have
$$E(e^{tS_n/\sqrt{n}}) = E(e^{tX_1/\sqrt{n}}) \cdots E(e^{tX_n/\sqrt{n}}) = E(e^{tX_j/\sqrt{n}})^n = M\left(\frac{t}{\sqrt{n}}\right)^n$$
Taking the limit results in the indeterminate form $1^\infty$, which is hard to work with. Instead, we take the log of both sides and then take the limit, to get
$$\lim_{n\to\infty} n \ln M\left(\frac{t}{\sqrt{n}}\right) = \lim_{n\to\infty} \frac{\ln M(\frac{t}{\sqrt{n}})}{\frac{1}{n}}$$
Substituting $y = \frac{1}{\sqrt{n}}$,
$$= \lim_{y\to 0} \frac{\ln M(ty)}{y^2}$$
$$= \lim_{y\to 0} \frac{t M'(ty)}{2y M(ty)} \qquad \text{(L'H\^opital)}$$
$$= \frac{t}{2} \lim_{y\to 0} \frac{M'(ty)}{y} \qquad [M(0) = 1,\ M'(0) = 0]$$
$$= \frac{t^2}{2} \lim_{y\to 0} \frac{M''(ty)}{1} \qquad \text{(L'H\^opital)}$$
$$= \frac{t^2}{2} = \ln e^{t^2/2}$$
using $M''(0) = E(X_j^2) = 1$ in the last limit, and $e^{t^2/2}$ is the $N(0,1)$ MGF. $\blacksquare$

Corollary 30.5. Let $X \sim \mathrm{Bin}(n, p)$ with $X = \sum_{j=1}^{n} X_j$, $X_j \sim \mathrm{Bern}(p)$ i.i.d. Then
$$P(a \le X \le b) = P\left(\frac{a - np}{\sqrt{npq}} \le \frac{X - np}{\sqrt{npq}} \le \frac{b - np}{\sqrt{npq}}\right) \approx \Phi\left(\frac{b - np}{\sqrt{npq}}\right) - \Phi\left(\frac{a - np}{\sqrt{npq}}\right)$$

The Poisson approximation works well when $n$ is large, $p$ is small, and $\lambda = np$ is moderate. In contrast, the Normal approximation works well when $n$ is large and $p$ is near $\frac{1}{2}$ (to match the symmetry of the normal).

It seems a little strange that we are approximating a discrete distribution with a continuous distribution. In general, to correct for this, we can write
$$P(X = a) = P(a - \epsilon < X < a + \epsilon)$$
where $(a - \epsilon, a + \epsilon)$ contains only $a$.

Lecture 31 — 11/18/11

Definition 31.1. Let $V = Z_1^2 + \cdots + Z_n^2$ where the $Z_j \sim N(0,1)$ i.i.d. Then $V$ has the chi-squared distribution with $n$ degrees of freedom, $V \sim \chi^2_n$.

Observation 31.2. It is true, but we will not prove, that
$$\chi^2_1 = \mathrm{Gamma}\left(\frac{1}{2}, \frac{1}{2}\right)$$
Since $\chi^2_n$ is a sum of $n$ independent $\chi^2_1$'s, we have
$$\chi^2_n = \mathrm{Gamma}\left(\frac{n}{2}, \frac{1}{2}\right)$$

Definition 31.3. Let $Z \sim N(0,1)$ and $V \sim \chi^2_n$ be independent. Let
$$T = \frac{Z}{\sqrt{V/n}}$$
Then $T$ has the Student-$t$ distribution with $n$ degrees of freedom, $T \sim t_n$.

Observation 31.4. The Student-$t$ is symmetric; that is, $-T \sim t_n$. Note that if $n = 1$, then $T$ is the ratio of two i.i.d. standard normals, so $T$ becomes the Cauchy distribution (and hence has no mean). If $n \ge 2$, then
$$E(T) = E(Z)\, E\left(\frac{1}{\sqrt{V/n}}\right) = 0$$
Note that in general, $T \sim t_n$ will only have moments up to (but not including) the $n$th.

Observation 31.5. We proved that
$$E(Z^2) = 1, \quad E(Z^4) = 1 \cdot 3, \quad E(Z^6) = 1 \cdot 3 \cdot 5$$
using MGFs. We can also prove this by noting that
$$E(Z^{2n}) = E((Z^2)^n)$$
and that $Z^2 \sim \chi^2_1 = \mathrm{Gamma}(\frac{1}{2}, \frac{1}{2})$. Then we can simply use LOTUS to get our desired mean.
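Corollary 30.5 and the continuity correction can be compared directly against exact binomial probabilities. The sketch below (my own illustration; $n = 100$, $p = \frac{1}{2}$, and the interval $[45, 55]$ are arbitrary choices) computes $P(a \le X \le b)$ three ways.

```python
from math import comb, erf, sqrt

# Compare the exact Binomial(n, p) probability P(a <= X <= b) with the
# normal approximation, with and without the continuity correction
# P(a - 1/2 <= X <= b + 1/2).

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 100, 0.5
q = 1.0 - p
a, b = 45, 55

exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(a, b + 1))
sd = sqrt(n * p * q)
plain = phi((b - n * p) / sd) - phi((a - n * p) / sd)
corrected = phi((b + 0.5 - n * p) / sd) - phi((a - 0.5 - n * p) / sd)

print(exact, plain, corrected)  # the corrected value is much closer to exact
```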
$t_1 X_1 + \cdots + t_k X_k$ of the $X_j$ is normal.

Example. Let $Z, W$ be i.i.d. $N(0,1)$. Then $(Z + 2W, 3Z + 5W)$ is MVN, since
$$s(Z + 2W) + t(3Z + 5W) = (s + 3t)Z + (2s + 5t)W$$
is a sum of independent normals and hence normal.

Example. Let $Z \sim N(0,1)$. Let $S$ be a random sign ($\pm 1$ with equal probabilities) independent of $Z$. Then $Z$ and $SZ$ are marginally standard normal. However, $(Z, SZ)$ is not multivariate normal, since $Z + SZ$ is 0 with probability $\frac{1}{2}$.

Observation 31.8. Recall that the MGF for $X \sim N(\mu, \sigma^2)$ is given by
$$E(e^{tX}) = e^{t\mu + t^2\sigma^2/2}$$
Suppose that $X = (X_1, \ldots, X_k)$ is MVN. Let $\mu_j = EX_j$. Then the MGF of $X$ is given by
$$E(e^{t_1 X_1 + \cdots + t_k X_k}) = \exp\left(t_1\mu_1 + \cdots + t_k\mu_k + \frac{1}{2}\mathrm{Var}(t_1 X_1 + \cdots + t_k X_k)\right)$$

Theorem 31.9. Let $X = (X_1, \ldots, X_k)$ be MVN. Then within $X$, uncorrelated implies independence. For instance, if we write $X = (X_1, X_2)$, if every component of $X_1$ is uncorrelated with every component of $X_2$, then $X_1$ is independent of $X_2$.

Example. Let $X, Y$ be i.i.d. $N(0,1)$. Then $(X + Y, X - Y)$ is MVN. We also have that
$$\mathrm{Cov}(X + Y, X - Y) = \mathrm{Var}(X) + \mathrm{Cov}(X, Y) - \mathrm{Cov}(X, Y) - \mathrm{Var}(Y) = 0$$
So by our above theorem, $X + Y$ and $X - Y$ are independent.

$$q_{ij} := P(X_{n+1} = j \mid X_n = i)$$
called the transition probability, and we call the sequence a homogeneous Markov chain.

To describe a homogeneous Markov chain we simply need to show the states of the process and the transition probabilities. We could, instead, array the $q_{ij}$'s as a matrix,
$$Q = (q_{ij})$$
called the transition matrix.

Note. More generally, we could consider continuous systems (i.e., spaces) at continuous times and more broadly study stochastic processes. However, in this course, we will restrict our study to homogeneous Markov chains.

Example. The following diagram describes a (homogeneous) Markov chain:

[Transition diagram on states 1, 2, 3, 4, with edge probabilities 1/3, 2/3, 1/2, 1/2, 1, 1/2, 1/4, 1/4; it corresponds to the matrix $Q$ below.]

We could alternatively describe the same Markov chain by specifying its transition matrix
$$Q = \begin{pmatrix} \frac{1}{3} & \frac{2}{3} & 0 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & 0 & 0 & 1 \\ \frac{1}{2} & 0 & \frac{1}{4} & \frac{1}{4} \end{pmatrix}$$
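The contrast between the two examples above is worth seeing numerically. This sketch (mine, not from the lecture) checks that $(X+Y, X-Y)$ has zero covariance (and, being MVN, is independent), while $(Z, SZ)$ also has zero covariance yet is clearly dependent, since $|Z| = |SZ|$.

```python
import numpy as np

# Within an MVN vector, uncorrelated implies independent; outside MVN it
# does not, as the random-sign construction (Z, SZ) shows.
rng = np.random.default_rng(0)
trials = 10**6

x, y = rng.standard_normal((2, trials))
u, v = x + y, x - y
print(np.corrcoef(u, v)[0, 1])    # near 0; (u, v) is MVN, so independent

z = rng.standard_normal(trials)
s = rng.choice([-1.0, 1.0], size=trials)
print(np.corrcoef(z, s * z)[0, 1])  # also near 0 ...
# ... yet Z and SZ are dependent: their absolute values coincide exactly.
print(np.corrcoef(np.abs(z), np.abs(s * z))[0, 1])  # exactly 1
```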
$$P(X_{n+2} = j \mid X_n = i) = \sum_k P(X_{n+2} = j \mid X_{n+1} = k, X_n = i)\, P(X_{n+1} = k \mid X_n = i)$$
$$= \sum_k P(X_{n+2} = j \mid X_{n+1} = k)\, P(X_{n+1} = k \mid X_n = i) = \sum_k q_{kj} q_{ik} = \sum_k q_{ik} q_{kj} = (Q^2)_{ij}$$
More generally, we have that
$$P(X_{n+m} = j \mid X_n = i) = (Q^m)_{ij}$$

Definition 32.3. Let $s$ be some probability vector for a Markov chain with transition matrix $Q$. We say that $s$ is stationary for the chain if
$$sQ = s$$
We also call $s$ a stationary distribution. Note that this is the transpose of an eigenvector equation.

This definition raises the following questions:

1. Does a stationary distribution exist for every Markov chain?

2. Is the stationary distribution unique?

3. Does the chain (in some sense) converge to the stationary distribution?

4. How can we compute it (efficiently)?

[Diagram: states 4, 5, 6.]

3. Markov chain with absorbing states

[Diagram: states 0, 1, 2, 3.]

4. Periodic Markov chain

[Diagram: states 1, 2, 3.]

Definition 33.1. A state is recurrent if, starting from that state, there is probability 1 of transitioning back to that state after a finite number of transitions. If a state is not recurrent, it is transient.

Definition 33.2. A Markov chain is irreducible if it is possible (with positive probability) to transition from any state to any other state in a finite number of transitions. Note that in an irreducible chain, all states are recurrent; over an infinite number of transitions, any nonzero probability of returning to a state means that the event of return will occur with probability 1.

Observation 33.3. In our example above, Markov chains 1 and 4 are irreducible; chains 2 and 3 are not. All the states of chain 2 are recurrent; even though the chain itself has two connected components, we will always (i.e., with probability 1) return to the state which we started from.

However, in chain 3, states 1 and 2 are transient. With probability 1, from states 1 and 2, we will at some point transition to state 0 or 3; after that point, we will never return to state 1 or 2. On the other hand, if we start in 0 or 3, we stay there forever; they are clearly recurrent.

Theorem 33.4. For any irreducible Markov chain,

1. A stationary distribution $s$ exists.
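Question 4 has a concrete answer for small chains: since $sQ = s$ is a left-eigenvector equation, we can compute $s$ numerically. The sketch below (my own, not from the lecture) does this for the example transition matrix above, and also forms $Q^2$ as in the Chapman–Kolmogorov argument.

```python
import numpy as np

# Example transition matrix from the notes (rows sum to 1).
Q = np.array([
    [1/3, 2/3, 0.0, 0.0],
    [1/2, 0.0, 1/2, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1/2, 0.0, 1/4, 1/4],
])

# Two-step transition probabilities: (Q^2)[i, j] = P(X_{n+2}=j | X_n=i).
Q2 = Q @ Q
print(Q2)

# Stationary distribution: left eigenvector of Q for eigenvalue 1,
# renormalized to sum to 1 (equivalently, eigenvector of Q transpose).
vals, vecs = np.linalg.eig(Q.T)
s = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
s = s / s.sum()
print(s)   # satisfies s @ Q = s (up to floating point) and sums to 1
```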
$$= s_j$$
But since this is true for every $j$, this is exactly the statement of
$$sQ = s$$
as desired. $\blacksquare$

Example (Random walk on an undirected network). Consider the following example of an undirected network:

[Diagram: nodes 1, 2, 3, 4.]

$$d_i q_{ij} = d_j q_{ji}$$
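The identity $d_i q_{ij} = d_j q_{ji}$ implies that, for a random walk on an undirected network, the distribution proportional to the degrees is stationary. Since the lecture's diagram is not reproduced here, the sketch below uses an assumed 4-node graph of my own to illustrate this.

```python
import numpy as np

# Random walk on an undirected network: from node i, move to a uniformly
# random neighbor, so q_ij = 1/d_i for each neighbor j. Then
# s_j = d_j / (sum of all degrees) satisfies sQ = s.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # hypothetical 4-node graph
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

d = A.sum(axis=1)        # degrees: [2, 2, 3, 1]
Q = A / d[:, None]       # transition matrix of the random walk

s = d / d.sum()          # candidate stationary distribution
print(s)
print(s @ Q)             # equals s: the degree distribution is stationary
```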