Probability and Statistics: Cookbook

Probability and Statistics
Cookbook
Version 0.2.6
19th December, 2017
http://statistics.zone/
Copyright
c Matthias Vallentin
Contents 14 Exponential Family 16 21.5 Spectral Analysis . . . . . . . . . . . . . 28
1 Distribution Overview 3 15 Bayesian Inference 16 22 Math 29

1.1 Discrete Distributions . . . . . . . . . . 3 15.1 Credible Intervals . . . . . . . . . . . . . 16 22.1 Gamma Function . . . . . . . . . . . . . 29
1.2 Continuous Distributions . . . . . . . . 5 15.2 Function of parameters . . . . . . . . . . 17 22.2 Beta Function . . . . . . . . . . . . . . . 29
15.3 Priors . . . . . . . . . . . . . . . . . . . 17 22.3 Series . . . . . . . . . . . . . . . . . . . 29
2 Probability Theory 8 15.3.1 Conjugate Priors . . . . . . . . . 17 22.4 Combinatorics . . . . . . . . . . . . . . 30
15.4 Bayesian Testing . . . . . . . . . . . . . 18
3 Random Variables 8
3.1 Transformations . . . . . . . . . . . . . 9 16 Sampling Methods 18
16.1 Inverse Transform Sampling . . . . . . . 18
4 Expectation 9 16.2 The Bootstrap . . . . . . . . . . . . . . 18
16.2.1 Bootstrap Confidence Intervals . 18
5 Variance 9 16.3 Rejection Sampling . . . . . . . . . . . . 19
16.4 Importance Sampling . . . . . . . . . . . 19
6 Inequalities 10
17 Decision Theory 19
7 Distribution Relationships 10 17.1 Risk . . . . . . . . . . . . . . . . . . . . 19
17.2 Admissibility . . . . . . . . . . . . . . . 20
8 Probability and Moment Generating
17.3 Bayes Rule . . . . . . . . . . . . . . . . 20
Functions 11
17.4 Minimax Rules . . . . . . . . . . . . . . 20
9 Multivariate Distributions 11 18 Linear Regression 20
9.1 Standard Bivariate Normal . . . . . . . 11 18.1 Simple Linear Regression . . . . . . . . 20
9.2 Bivariate Normal . . . . . . . . . . . . . 11 18.2 Prediction . . . . . . . . . . . . . . . . . 21
9.3 Multivariate Normal . . . . . . . . . . . 11 18.3 Multiple Regression . . . . . . . . . . . 21
18.4 Model Selection . . . . . . . . . . . . . . 22
10 Convergence 11
10.1 Law of Large Numbers (LLN) . . . . . . 12 19 Non-parametric Function Estimation 22
10.2 Central Limit Theorem (CLT) . . . . . 12 19.1 Density Estimation . . . . . . . . . . . . 22
19.1.1 Histograms . . . . . . . . . . . . 23
11 Statistical Inference 12 19.1.2 Kernel Density Estimator (KDE) 23
11.1 Point Estimation . . . . . . . . . . . . . 12 19.2 Non-parametric Regression . . . . . . . 23
11.2 Normal-Based Confidence Interval . . . 13 19.3 Smoothing Using Orthogonal Functions 24
11.3 Empirical distribution . . . . . . . . . . 13
11.4 Statistical Functionals . . . . . . . . . . 13 20 Stochastic Processes 24
20.1 Markov Chains . . . . . . . . . . . . . . 24
12 Parametric Inference 13 20.2 Poisson Processes . . . . . . . . . . . . . 25
12.1 Method of Moments . . . . . . . . . . . 13
12.2 Maximum Likelihood . . . . . . . . . . . 14 21 Time Series 25
12.2.1 Delta Method . . . . . . . . . . . 14 21.1 Stationary Time Series . . . . . . . . . . 26 This cookbook integrates various topics in probability theory
12.3 Multiparameter Models . . . . . . . . . 14 21.2 Estimation of Correlation . . . . . . . . 26 and statistics, based on literature [1, 6, 3] and in-class material
12.3.1 Multiparameter delta method . . 15 21.3 Non-Stationary Time Series . . . . . . . 26 from courses of the statistics department at the University of
12.4 Parametric Bootstrap . . . . . . . . . . 15 21.3.1 Detrending . . . . . . . . . . . . 27 California in Berkeley but also influenced by others [4, 5]. If you
21.4 ARIMA models . . . . . . . . . . . . . . 27 find errors or have suggestions for improvements, please get in
13 Hypothesis Testing 15 21.4.1 Causality and Invertibility . . . . 28 touch at http://statistics.zone/.
1 Distribution Overview
1.1 Discrete Distributions
Notation1 FX (x) fX (x) E [X] V [X] MX (s)

0 x<a
(b − a + 1)2 − 1 eas − e−(b+1)s

bxc−a+1 I(a ≤ x ≤ b) a+b
Uniform Unif {a, . . . , b} a≤x≤b
 b−a b−a+1 2 12 s(b − a)
1 x>b

Bernoulli Bern (p) (1 − p)1−x px (1 − p)1−x p p(1 − p) 1 − p + pes
!
n x
Binomial Bin (n, p) I1−p (n − x, x + 1) p (1 − p)n−x np np(1 − p) (1 − p + pes )n
x
 
np1 !n
np1 (1 − p1 ) −np1 p2
k
! k
n! x
X  .  X
Multinomial Mult (n, p) px1 · · · pk k xi = n  ..  .. pi e si
x1 ! . . . xk ! 1 i=1 −np2 p1 . i=0
npk
m N −m
!
x − np x n−x nm nm(N − n)(N − m)
Hypergeometric Hyp (N, m, n) ≈Φ N N 2 (N − 1)
p
np(1 − p) n
N
! r
x+r−1 r 1−p 1−p p
Negative Binomial NBin (r, p) Ip (r, x + 1) p (1 − p)x r r 2
r−1 p p 1 − (1 − p)es
1 1−p pes
Geometric Geo (p) 1 − (1 − p)x x ∈ N+ p(1 − p)x−1 x ∈ N+
p p2 1 − (1 − p)es
x
X λi λx e−λ s
Poisson Po (λ) e−λ λ λ eλ(e −1)
i=0
i! x!
1 We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and Ix to refer to the Beta functions (see §22.2).
3
Uniform (discrete) Binomial Geometric Poisson
● n = 40, p = 0.3 0.8 ● p = 0.2 ● ●
● ● ● λ=1
● n = 30, p = 0.6 ● p = 0.5 ● λ=4
● n = 25, p = 0.9 ● p = 0.8 ● λ = 10
●
0.3
0.2 ● 0.6
● 0.2
PMF
PMF
PMF
PMF
1 ● ●
● ● ● ● ● ● ●
● ● ● ● 0.4 ●
n ●
● ● ●
● ●
●
0.1
● ●
● ● ● ● ●
● ●
●
●
0.1 ●
● 0.2 ●
● ● ● ●
● ● ●
● ●
● ●
● ●
● ● ● ●
● ● ● ●
● ● ● ●
● ● ● ● ● ● ●
●●● ● ● ● ● ● ● ● ● ●
● ● ●
0.0 ●●●●
●●●●●●●●● ●●
●●●●●●●●●●●●●●●● ● ● ●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●● 0.0 ● ●
●
● ●
● ● ● ● ● 0.0 ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●
a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

x x x x
Uniform (discrete) Binomial Geometric Poisson
1 ● 1.00 ●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●● 1.0 ● ● ● ● ● ● ● ● 1.00 ● ● ● ● ●
● ● ● ● ● ● ● ● ●
● ●
● ● ●
● ●● ● ● ● ● ● ● ●
● ● ● ● ●
● ● ●
● ● ●
● ● ●
● ●
● ●
● ●
● ● ●
0.75 0.8 ● ● 0.75 ●
●
i ● ● ●
● ●
● ●
n ●
● ● ●
●
CDF
CDF
CDF
CDF
0.50 0.6 ● 0.50
● ●
● ●
●
i ● ● ●
● ● ●
n ●
●
0.25 ● 0.4 0.25 ●
●
● ●
●
●
● n = 40, p = 0.3 p = 0.2
● ● ● ● ● ● λ=1
n = 30, p = 0.6 p = 0.5 ●
● ● ● ● ●
●
● λ=4
● ●
0 ● 0.00 ●●●● ●
●●●●●●●●●● ●
●●●●●●●●●●●●●●●●● ● ● n = 25, p = 0.9 0.2 ● ● p = 0.8 0.00
●
● ● ● ● ● λ = 10
a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

x x x x
4
1.2 Continuous Distributions
Notation FX (x) fX (x) E [X] V [X] MX (s)

0 x<a
(b − a)2 esb − esa

x−a I(a < x < b) a+b
Uniform Unif (a, b) a<x<b
 b−a b−a 2 12 s(b − a)
1 x>b

(x − µ)2
Z x
σ 2 s2

1
N µ, σ 2 σ2

Normal Φ(x) = φ(t) dt φ(x) = √ exp − µ exp µs +
−∞ σ 2π 2σ 2 2
(ln x − µ)2

1 1 ln x − µ 1 2 2 2
ln N µ, σ 2 eµ+σ /2
(eσ − 1)e2µ+σ

Log-Normal + erf √ √ exp −
2 2 2σ 2 x 2πσ 2 2σ 2

1 T
Σ−1 (x−µ) 1
Multivariate Normal MVN (µ, Σ) (2π)−k/2 |Σ|−1/2 e− 2 (x−µ) µ Σ exp µT s + sT Σs
2
−(ν+1)/2 ( ν
Γ ν+1

ν ν
2 x2 ν−2
ν>2
Student’s t Student(ν) Ix , √ 1 + 0 ν>1
νπΓ ν2

2 2 ν ∞ 1<ν≤2

1 k x 1
Chi-square χ2k γ , xk/2−1 e−x/2 k 2k (1 − 2s)−k/2 s < 1/2
Γ(k/2) 2 2 2k/2 Γ(k/2)
r
d
(d1 x)d1 d2 2
2d22 (d1 + d2 − 2)

d1 d2 (d1 x+d2 )d1 +d2 d2
F F(d1 , d2 ) I d1 x , d1 d1 d2 − 2 d1 (d2 − 2)2 (d2 − 4)

d1 x+d2 2 2 xB 2
, 2
1 −x/β 1
Exponential∗ Exp (β) 1 − e−x/β e β β2 s (s < β)
β 1− β
!α
γ(α, βx) β α α−1 −βx α α 1
Gamma∗ Gamma (α, β) x e s (s < β)
Γ(α) Γ (α) β β2 1− β
Γ α, βx

β α −α−1 −β/x β β2 2(−βs)α/2 p
Inverse Gamma InvGamma (α, β) x e α>1 α>2 Kα −4βs
Γ (α) Γ (α) α−1 (α − 1)2 (α − 2) Γ(α)
P
k
Γ i=1 αi Y α −1
k
αi E [Xi ] (1 − E [Xi ])
Dirichlet Dir (α) Qk xi i Pk Pk
i=1 Γ (αi ) i=1 i=1 αi i=1 αi + 1
∞ k−1
!
Γ (α + β) α−1 α αβ X Y α+r sk
Beta Beta (α, β) Ix (α, β) x (1 − x)β−1 1+
Γ (α) Γ (β) α+β (α + β)2 (α + β + 1) r=0
α+β+r k!
k=1
∞
sn λn

k k x k−1 −(x/λ)k 1 2 X n
Weibull Weibull(λ, k) 1 − e−(x/λ) e λΓ 1 + λ2 Γ 1 + − µ2 Γ 1+
λ λ k k n=0
n! k
x α
m xα αxm x2m α
Pareto Pareto(xm , α) 1− x ≥ xm m
α α+1 x ≥ xm α>1 α>2 α(−xm s)α Γ(−α, −xm s) s < 0
x x α−1 (α − 1)2 (α − 2)
∗ 1
We use the rate parameterization where β = λ
. Some textbooks use β as scale parameter instead [6].
5
Uniform (continuous) Normal Log−Normal Student's t
2.0 1.00 0.4 ν=1
µ = 0, σ = 0.2
2
µ = 0, σ = 3
2
µ = 0, σ2 = 1 µ = 2, σ2 = 2 ν=2
µ = 0, σ2 = 5 µ = 0, σ2 = 1 ν=5
ν=∞
µ = −2, σ2 = 0.5 µ = 0.5, σ2 = 1
µ = 0.25, σ2 = 1
1.5 0.75 µ = 0.125, σ2 = 1 0.3
PDF
PDF
PDF
PDF
1
● ● 1.0 0.50 0.2
b−a
0.5 0.25 0.1
● ● 0.0 0.00 0.0
a b −5.0 −2.5 0.0 2.5 5.0 0 1 2 3 −5.0 −2.5 0.0 2.5 5.0
x x x x
χ 2 F Exponential Gamma
d1 = 1, d2 = 1 2.0 β = 0.5 0.5 α = 1, β = 0.5
1.00 k=1 3 d1 = 2, d2 = 1 β=1 α = 2, β = 0.5
k=2 d1 = 5, d2 = 2 β = 2.5 α = 3, β = 0.5
k=3 d1 = 100, d2 = 1 α = 5, β = 1
k=4 d1 = 100, d2 = 100 0.4 α = 9, β = 2
k=5
1.5
0.75
2
0.3
PDF
PDF
PDF
PDF
0.50 1.0
0.2
1
0.25 0.5
0.1
0.00 0 0.0 0.0
0 2 4 6 8 0 1 2 3 4 5 0 1 2 3 4 5 0 5 10 15 20
x x x x
Inverse Gamma Beta Weibull Pareto
α = 1, β = 1 5 α = 0.5, β = 0.5 2.0 λ = 1, k = 0.5 4 xm = 1, k = 1
α = 2, β = 1 α = 5, β = 1 λ = 1, k = 1 xm = 1, k = 2
α = 3, β = 1 α = 1, β = 3 λ = 1, k = 1.5 xm = 1, k = 4
4 α = 3, β = 0.5 α = 2, β = 2 λ = 1, k = 5
4 α = 2, β = 5
1.5 3
3
3
PDF
PDF
PDF
PDF
1.0 2
2
2
0.5 1
1 1
0 0 0.0 0
0 1 2 3 4 5 0.00 0.25 0.50 0.75 1.00 0.0 0.5 1.0 1.5 2.0 2.5 1.0 1.5 2.0 2.5
x x x x
6
Uniform (continuous) Normal Log−Normal Student's t
1 1.00 µ = 0, σ = 3
2 1.00
µ = 2, σ2 = 2
0.75 µ = 0, σ2 = 1
µ = 0.5, σ2 = 1
µ = 0.25, σ2 = 1
0.75 µ = 0.125, σ2 = 1 0.75
0.50
CDF
CDF
CDF
CDF
0.50 0.50
0.25
0.25 0.25
µ = 0, σ2 = 0.2 ν=1
µ = 0, σ2 = 1 ν=2
µ = 0, σ2 = 5 ν=5
0 0.00 µ = −2, σ2 = 0.5 0.00 0.00 ν=∞
a b −5.0 −2.5 0.0 2.5 5.0 0 1 2 3 −5.0 −2.5 0.0 2.5 5.0
x x x x
χ 2 F Exponential Gamma
1.00 1.00 1.00
1.00
0.75 0.75 0.75 0.75

CDF
CDF
CDF
CDF
0.50 0.50 0.50 0.50
0.25 0.25 0.25 0.25

k=1 d1 = 1, d2 = 1 α = 1, β = 0.5
k=2 d1 = 2, d2 = 1 α = 2, β = 0.5
k=3 d1 = 5, d2 = 2 β = 0.5 α = 3, β = 0.5
k=4 d1 = 100, d2 = 1 β=1 α = 5, β = 1
0.00 k=5 0.00 d1 = 100, d2 = 100 0.00 β = 2.5 0.00 α = 9, β = 2
0 2 4 6 8 0 1 2 3 4 5 0 1 2 3 4 5 0 5 10 15 20
x x x x
Inverse Gamma Beta Weibull Pareto
1.00 1.00
1.00 1.00 α = 0.5, β = 0.5
α = 5, β = 1
α = 1, β = 3
α = 2, β = 2
α = 2, β = 5
0.75 0.75 0.75 0.75
CDF
CDF
CDF
CDF
0.50 0.50 0.50 0.50
0.25 0.25 0.25 0.25
α = 1, β = 1 λ = 1, k = 0.5
α = 2, β = 1 λ = 1, k = 1 xm = 1, k = 1
α = 3, β = 1 λ = 1, k = 1.5 xm = 1, k = 2
0.00 α = 3, β = 0.5 0.00 0.00 λ = 1, k = 5 0.00 xm = 1, k = 4
0 1 2 3 4 5 0.00 0.25 0.50 0.75 1.00 0.0 0.5 1.0 1.5 2.0 2.5 1.0 1.5 2.0 2.5
x x x x
7
2 Probability Theory Law of Total Probability
n n
Definitions X G
P [B] = P [B|Ai ] P [Ai ] Ω= Ai
• Sample space Ω i=1 i=1
• Outcome (point or element) ω ∈ Ω Bayes’ Theorem

• Event A ⊆ Ω
n
• σ-algebra A P [B | Ai ] P [Ai ] G
P [Ai | B] = Pn Ω= Ai
1. ∅ ∈ A j=1 P [B | Aj ] P [Aj ] i=1
S∞
2. A1 , A2 , . . . , ∈ A =⇒ i=1 Ai ∈ A Inclusion-Exclusion Principle
3. A ∈ A =⇒ ¬A ∈ A
n n
r
[ X X \
• Probability Distribution P (−1)r−1

Ai = A ij

1. P [A] ≥ 0 ∀A i=1 r=1 i≤i1 <···<ir ≤n j=1
2. P [Ω] = 1
"∞ #
G ∞
X 3 Random Variables
3. P Ai = P [Ai ]
i=1 i=1 Random Variable (RV)
• Probability space (Ω, A, P) X:Ω→R
Properties Probability Mass Function (PMF)
• P [∅] = 0 fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}]

• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
Probability Density Function (PDF)
• P [¬A] = 1 − P [A]
b
• P [B] = P [A ∩ B] + P [¬A ∩ B]
Z
P [a ≤ X ≤ b] = f (x) dx
• P [Ω] = 1 P [∅] = 0 a
S T T S
• ¬( n An ) = n ¬An ¬( n An ) = n ¬An DeMorgan
S T Cumulative Distribution Function (CDF)
• P [ n An ] = 1 − P [ n ¬An ]
• P [A ∪ B] = P [A] + P [B] − P [A ∩ B] FX : R → [0, 1] FX (x) = P [X ≤ x]
=⇒ P [A ∪ B] ≤ P [A] + P [B]
1. Nondecreasing: x1 < x2 =⇒ F (x1 ) ≤ F (x2 )
• P [A ∪ B] = P [A ∩ ¬B] + P [¬A ∩ B] + P [A ∩ B]
2. Normalized: limx→−∞ = 0 and limx→∞ = 1
• P [A ∩ ¬B] = P [A] − P [A ∩ B]
3. Right-Continuous: limy↓x F (y) = F (x)
Continuity of Probabilities
S∞ b
• A1 ⊂ A2 ⊂ . . . =⇒ limn→∞ P [An ] = P [A]
Z
where A = i=1 Ai
T∞ P [a ≤ Y ≤ b | X = x] = fY |X (y | x)dy a≤b
• A1 ⊃ A2 ⊃ . . . =⇒ limn→∞ P [An ] = P [A] where A = i=1 Ai a
Independence ⊥
⊥ f (x, y)
fY |X (y | x) =
A⊥
⊥ B ⇐⇒ P [A ∩ B] = P [A] P [B] fX (x)
Conditional Probability Independence
P [A ∩ B] 1. P [X ≤ x, Y ≤ y] = P [X ≤ x] P [Y ≤ y]
P [A | B] = P [B] > 0 2. fX,Y (x, y) = fX (x)fY (y)
P [B] 8
Z
3.1 Transformations • E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y)
X,Y
Transformation function
• E [ϕ(Y )] 6= ϕ(E [X]) (cf. Jensen inequality)
Z = ϕ(X)
• P [X ≥ Y ] = 1 =⇒ E [X] ≥ E [Y ]
Discrete • P [X = Y ] = 1 =⇒ E [X] = E [Y ]
X ∞
fZ (z) = P [ϕ(X) = z] = P [{x : ϕ(x) = z}] = P X ∈ ϕ−1 (z) =

fX (x)
X
• E [X] = P [X ≥ x] X discrete
x∈ϕ−1 (z) x=1
Continuous Sample mean

n
Z 1X
X̄n = Xi
FZ (z) = P [ϕ(X) ≤ z] = f (x) dx with Az = {x : ϕ(x) ≤ z} n i=1
Az
Conditional expectation
Special case if ϕ strictly monotone Z

d

dx 1 • E [Y | X = x] = yf (y | x) dy
fZ (z) = fX (ϕ−1 (z)) ϕ−1 (z) = fX (x) = fX (x)

dz dz |J| • E [X] = E [E [X | Y ]]
Z ∞
The Rule of the Lazy Statistician • Eϕ(X,Y ) | X=x [=] ϕ(x, y)fY |X (y | x) dx
−∞
Z Z ∞
E [Z] = ϕ(x) dFX (x) • E [ϕ(Y, Z) | X = x] = ϕ(y, z)f(Y,Z)|X (y, z | x) dy dz
−∞
Z Z • E [Y + Z | X] = E [Y | X] + E [Z | X]
E [IA (x)] = IA (x) dFX (x) = dFX (x) = P [X ∈ A] • E [ϕ(X)Y | X] = ϕ(X)E [Y | X]
A
• E [Y | X] = c =⇒ Cov [X, Y ] = 0
Convolution
Z ∞ Z z
X,Y ≥0
• Z := X + Y fZ (z) = fX,Y (x, z − x) dx = fX,Y (x, z − x) dx
−∞ 0 5 Variance
Z ∞
• Z := |X − Y | fZ (z) = 2 fX,Y (x, z + x) dx Definition and properties
0
Z ∞ Z ∞ 2
2
X ⊥
⊥ • V [X] = σX = E (X − E [X])2 = E X 2 − E [X]
• Z := fZ (z) = |y|fX,Y (yz, y) dy = |y|fX (yz)fY (y) dy " n # n
Y −∞ −∞ X X X
• V Xi = V [Xi ] + Cov [Xi , Xj ]
i=1 i=1 i6=j
4 Expectation " n
X
# n
X
• V Xi = V [Xi ] if Xi ⊥
⊥ Xj
Definition and properties i=1 i=1
X

 xfX (x) X discrete Standard deviation p
sd[X] = V [X] = σX

Z  x

• E [X] = µX = x dFX (x) = Covariance

 Z
 xfX (x) dx X continuous


• Cov [X, Y ] = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ]
• P [X = c] = 1 =⇒ E [X] = c • Cov [X, a] = 0
• E [cX] = c E [X] • Cov [X, X] = V [X]
• E [X + Y ] = E [X] + E [Y ] • Cov [X, Y ] = Cov [Y, X]
9
• Cov [aX, bY ] = abCov [X, Y ] 7 Distribution Relationships
• Cov [X + a, Y + b] = Cov [X, Y ]

n m

n X m
Binomial
X X X
n
• Cov  Xi , Yj  = Cov [Xi , Yj ] X
i=1 j=1 i=1 j=1
• Xi ∼ Bern (p) =⇒ Xi ∼ Bin (n, p)
i=1
Correlation • X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p)
Cov [X, Y ]
ρ [X, Y ] = p • limn→∞ Bin (n, p) = Po (np) (n large, p small)
V [X] V [Y ] • limn→∞ Bin (n, p) = N (np, np(1 − p)) (n large, p far from 0 and 1)
Independence
Negative Binomial
X⊥
⊥ Y =⇒ ρ [X, Y ] = 0 ⇐⇒ Cov [X, Y ] = 0 ⇐⇒ E [XY ] = E [X] E [Y ]
• X ∼ NBin (1, p) = Geo (p)
Pr
Sample variance • X ∼ NBin (r, p) = i=1 Geo (p)
n P P
1 X • Xi ∼ NBin (ri , p) =⇒ Xi ∼ NBin ( ri , p)
S2 = (Xi − X̄n )2
n − 1 i=1 • X ∼ NBin (r, p) . Y ∼ Bin (s + r, p) =⇒ P [X ≤ s] = P [Y ≥ r]
Conditional variance Poisson
2 n n
!
• V [Y | X] = E (Y − E [Y | X])2 | X = E Y 2 | X − E [Y | X] X X
• Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi ∼ Po λi
• V [Y ] = E [V [Y | X]] + V [E [Y | X]]
i=1 i=1
 
n n
X X λ i
6 Inequalities • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi Xj ∼ Bin  Xj , Pn 
j=1 j=1 j=1 λ j
Cauchy-Schwarz
2 Exponential
E [XY ] ≤ E X 2 E Y 2

n
X
Markov • Xi ∼ Exp (β) ∧ Xi ⊥
⊥ Xj =⇒ Xi ∼ Gamma (n, β)
E [ϕ(X)]
P [ϕ(X) ≥ t] ≤ i=1
t • Memoryless property: P [X > x + y | X > y] = P [X > x]
Chebyshev
V [X] Normal
P [|X − E [X]| ≥ t] ≤
t2
X−µ

Chernoff • X ∼ N µ, σ 2 =⇒ σ ∼ N (0, 1)
δ

e
• X ∼ N µ, σ ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2
2

P [X ≥ (1 + δ)µ] ≤ δ > −1
(1 + δ)1+δ
• Xi ∼ N µi , σi2 ∧ Xi ⊥⊥ Xj =⇒
P
Xi ∼ N
P
µi , i σi2
P
i i
Hoeffding
• P [a < X ≤ b] = Φ b−µ − Φ a−µ

σ σ
X1 , . . . , Xn independent ∧ P [Xi ∈ [ai , bi ]] = 1 ∧ 1 ≤ i ≤ n • Φ(−x) = 1 − Φ(x) φ0 (x) = −xφ(x) φ00 (x) = (x2 − 1)φ(x)
−1
2 • Upper quantile of N (0, 1): zα = Φ (1 − α)
P X̄ − E X̄ ≥ t ≤ e−2nt t > 0

Gamma
2n2 t2

P |X̄ − E X̄ | ≥ t ≤ 2 exp − Pn 2
t>0
i=1 (bi − ai ) • X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1)
Pα
Jensen • Gamma (α, β) ∼ i=1 Exp (β)
P P
E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex • Xi ∼ Gamma (αi , β) ∧ Xi ⊥
⊥ Xj =⇒ i Xi ∼ Gamma ( i αi , β)
10
Z ∞
Γ(α) 9.2 Bivariate Normal
• = xα−1 e−λx dx
λα 0
Let X ∼ N µx , σx2 and Y ∼ N µy , σy2 .
Beta
1 Γ(α + β) α−1 1 z
• xα−1 (1 − x)β−1 = x (1 − x)β−1 f (x, y) = exp −
2(1 − ρ2 )
p
B(α, β) Γ(α)Γ(β) 2πσx σy 1 − ρ2
B(α + k, β) α+k−1
E X k−1
" #
• E Xk =
2 2
=

B(α, β) α+β+k−1 x − µx y − µy x − µx y − µy
z= + − 2ρ
• Beta (1, 1) ∼ Unif (0, 1) σx σy σx σy
Conditional mean and variance
8 Probability and Moment Generating Functions E [X | Y ] = E [X] + ρ
σX
(Y − E [Y ])
σY
• GX (t) = E tX |t| < 1 p
V [X | Y ] = σX 1 − ρ2
"∞ # ∞
X (Xt)i X E Xi
· ti

• MX (t) = GX (et ) = E eXt = E =
i=0
i! i=0
i!
9.3 Multivariate Normal
• P [X = 0] = GX (0)
• P [X = 1] = G0X (0) Covariance matrix Σ (Precision matrix Σ−1 )
(i)
GX (0)  
• P [X = i] = V [X1 ] · · · Cov [X1 , Xk ]
i! .. .. ..
Σ=
 
• E [X] = G0X (1− ) . . . 
(k)
• E X k = MX (0) Cov [Xk , X1 ] · · · V [Xk ]

X! (k) If X ∼ N (µ, Σ),
• E = GX (1− )
(X − k)!
2 1
• V [X] = G00X (1− ) + G0X (1− ) − (G0X (1− )) fX (x) = (2π) −n/2
|Σ|
−1/2
exp − (x − µ)T Σ−1 (x − µ)
d 2
• GX (t) = GY (t) =⇒ X = Y
Properties
9 Multivariate Distributions • Z ∼ N (0, 1) ∧ X = µ + Σ1/2 Z =⇒ X ∼ N (µ, Σ)
• X ∼ N (µ, Σ) =⇒ Σ−1/2 (X − µ) ∼ N (0, 1)
9.1 Standard Bivariate Normal • X ∼ N (µ, Σ) =⇒ AX ∼ N Aµ, AΣAT

p
Let X, Y ∼ N (0, 1) ∧ X ⊥
⊥ Z where Y = ρX + 1 − ρ2 Z • X ∼ N (µ, Σ) ∧ kak = k =⇒ aT X ∼ N aT µ, aT Σa
Joint density
1 x2 + y 2 − 2ρxy
10 Convergence
f (x, y) = exp −
2(1 − ρ2 )
p
2π 1 − ρ2 Let {X1 , X2 , . . .} be a sequence of rv’s and let X be another rv. Let Fn denote
Conditionals the cdf of Xn and let F denote the cdf of X.
Types of Convergence
(Y | X = x) ∼ N ρx, 1 − ρ2 (X | Y = y) ∼ N ρy, 1 − ρ2

and D
1. In distribution (weakly, in law): Xn → X
Independence
X⊥
⊥ Y ⇐⇒ ρ = 0 lim Fn (t) = F (t) ∀t where F continuous
n→∞ 11
P
2. In probability: Xn → X √
X̄n − µ n(X̄n − µ) D
Zn := q = →Z where Z ∼ N (0, 1)
(∀ε > 0) lim P [|Xn − X| > ε] = 0 σ
n→∞ V X̄n
as
3. Almost surely (strongly): Xn → X lim P [Zn ≤ z] = Φ(z) z∈R
n→∞
h i h i
P lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 CLT notations
n→∞ n→∞
qm
Zn ≈ N (0, 1)
4. In quadratic mean (L2 ): Xn → X
σ2

X̄n ≈ N µ,
lim E (Xn − X)2 = 0 n

n→∞
σ2

X̄n − µ ≈ N 0,
Relationships n
√ 2

qm P D n(X̄n − µ) ≈ N 0, σ
• Xn → X =⇒ Xn → X =⇒ Xn → X √
as
• Xn → X =⇒ Xn → X
P n(X̄n − µ)
≈ N (0, 1)
D P
• Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X σ
P P P
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y
qm qm qm
• Xn →X ∧ Yn → Y =⇒ Xn + Yn → X + Y Continuity correction
P P P
• Xn →X ∧ Yn → Y =⇒ Xn Yn → XY
x + 12 − µ
P P

• Xn →X =⇒ ϕ(Xn ) → ϕ(X)
P X̄n ≤ x ≈ Φ √
D
• Xn → X =⇒ ϕ(Xn ) → ϕ(X)
D σ/ n
qm
• Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0
x − 12 − µ

qm

• X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X̄n → µ P X̄n ≥ x ≈ 1 − Φ √
σ/ n
Slutzky’s Theorem Delta method
D P D
• Xn → X and Yn → c =⇒ Xn + Yn → X + c
σ2

2 σ2

D P D
• Xn → X and Yn → c =⇒ Xn Yn → cX Yn ≈ N µ, =⇒ ϕ(Yn ) ≈ N ϕ(µ), (ϕ0 (µ))
n n
D D D
• In general: Xn → X and Yn → Y =⇒
6 Xn + Yn → X + Y
11 Statistical Inference
10.1 Law of Large Numbers (LLN)
iid
Let X1 , · · · , Xn ∼ F if not otherwise noted.
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ.
Weak (WLLN)
P
X̄n → µ n→∞ 11.1 Point Estimation
Strong (SLLN) • Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn )
as
h i
X̄n → µ n→∞ • bias(θbn ) = E θbn − θ
P
• Consistency: θbn → θ
10.2 Central Limit Theorem (CLT)
• Sampling distribution: F (θbn )
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 .
r h i
• Standard error: se(θn ) = V θbn
b
12
h i h i
• Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn 11.4 Statistical Functionals
• limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent • Statistical functional: T (F )
θbn − θ D • Plug-in estimator of θ = (F ): θbn = T (Fbn )
• Asymptotic normality: → N (0, 1) R
se • Linear functional: T (F ) = ϕ(x) dFX (x)
• Slutzky’s Theorem often lets us replace se(θbn ) by some (weakly) consis- • Plug-in estimator for linear functional:
tent estimator σ
bn . Z n
1X
T (Fbn ) = ϕ(x) dFbn (x) = ϕ(Xi )
11.2 Normal-Based Confidence Interval n i=1

b 2 . Let zα/2 = Φ−1 (1 − (α/2)), i.e., P Z > zα/2 = α/2

Suppose θbn ≈ N θ, se

b 2 =⇒ T (Fbn ) ± zα/2 se
• Often: T (Fbn ) ≈ N T (F ), se b
and P −zα/2 < Z < zα/2 = 1 − α where Z ∼ N (0, 1). Then
• pth quantile: F −1 (p) = inf{x : F (x) ≥ p}
Cn = θbn ± zα/2 se
b • µb = X̄n
n
1 X
b2 =
• σ (Xi − X̄n )2
11.3 Empirical distribution n − 1 i=1
1
Pn
Empirical Distribution Function (ECDF) n i=1 (Xi − µb)3
• κ
b=
Pn
I(Xi ≤ x) b3
Pσ
Fn (x) = i=1
b n
i=1 (Xi − X̄n )(Yi − Ȳn )
n • ρb = qP qP
n 2 n 2
(X − X̄ ) i=1 (Yi − Ȳn )
(
1 Xi ≤ x i=1 i n
I(Xi ≤ x) =
0 Xi > x
Properties (for any fixed x) 12 Parametric Inference
h i
• E Fbn = F (x)

Let F = f (x; θ) : θ ∈ Θ be a parametric model with parameter space Θ ⊂ Rk
h i F (x)(1 − F (x)) and parameter θ = (θ1 , . . . , θk ).
• V Fbn =
n
F (x)(1 − F (x)) D 12.1 Method of Moments
• mse = →0
n
P j th moment
• Fbn → F (x) Z
αj (θ) = E X j = xj dFX (x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn ∼ F )

P sup F (x) − Fn (x) > ε = 2e−2nε
b 2
j th sample moment
x n
1X j
Nonparametric 1 − α confidence band for F α
bj = X
n i=1 i
L(x) = max{Fbn − n , 0}
Method of Moments estimator (MoM)
U (x) = min{Fbn + n , 1}
s α1 (θ) = α
b1
1 2
= log α2 (θ) = α
b2
2n α
.. ..
.=.
P [L(x) ≤ F (x) ≤ U (x) ∀x] ≥ 1 − α αk (θ) = α
bk
13
Properties of the MoM estimator • Equivariance: θbn is the mle =⇒ ϕ(θbn ) is the mle of ϕ(θ)
• θbn exists with probability tending to 1 • Asymptotic optimality (or efficiency), i.e., smallest variance for large sam-
P
• Consistency: θbn → θ ples. If θen is any other estimator, the asymptotic relative efficiency is:
p
• Asymptotic normality: 1. se ≈ 1/In (θ)
√ (θbn − θ) D
D
n(θb − θ) → N (0, Σ) → N (0, 1)
se
q
where Σ = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , b ≈ 1/In (θbn )
2. se
∂ −1
g = (g1 , . . . , gk ) and gj = ∂θ αj (θ)
(θbn − θ) D
→ N (0, 1)
se
b
12.2 Maximum Likelihood • Asymptotic optimality
Likelihood: Ln : Θ → [0, ∞) h i
V θbn
n
Y are(θen , θbn ) = h i ≤ 1
Ln (θ) = f (Xi ; θ) V θen
i=1
• Approximately the Bayes estimator
Log-likelihood
n
X 12.2.1 Delta Method
`n (θ) = log Ln (θ) = log f (Xi ; θ)
i=1 b where ϕ is differentiable and ϕ0 (θ) 6= 0:
If τ = ϕ(θ)
Maximum likelihood estimator (mle)
τn − τ ) D
(b
→ N (0, 1)
Ln (θbn ) = sup Ln (θ) se(b
b τ)
θ
where τb = ϕ(θ)
b is the mle of τ and
Score function
∂
s(X; θ) = log f (X; θ) b = ϕ0 (θ)
se se(
b θn )
b b
∂θ
Fisher information
I(θ) = Vθ [s(X; θ)] 12.3 Multiparameter Models
In (θ) = nI(θ) Let θ = (θ1 , . . . , θk ) and θb = (θb1 , . . . , θbk ) be the mle.
Fisher information (exponential family)
∂ 2 `n ∂ 2 `n
Hjj = Hjk =
∂ ∂θ2 ∂θj ∂θk
I(θ) = Eθ − s(X; θ)
∂θ Fisher information matrix
Observed Fisher information 
Eθ [H11 ] ··· Eθ [H1k ]

n
In (θ) = −  .. .. ..
∂2 X
 
. . .
Inobs (θ) = −

log f (Xi ; θ)
∂θ2 i=1 Eθ [Hk1 ] · · · Eθ [Hkk ]
Properties of the mle Under appropriate regularity conditions

P
• Consistency: θbn → θ (θb − θ) ≈ N (0, Jn )
14
with Jn (θ) = In−1 . Further, if θbj is the j th component of θ, then • Critical value c
• Test statistic T
(θbj − θj ) D • Rejection region R = {x : T (x) > c}
→ N (0, 1)
se
bj • Power function β(θ) = P [X ∈ R]
h i • Power of a test: 1 − P [Type II error] = 1 − β = inf β(θ)
b 2j = Jn (j, j) and Cov θbj , θbk = Jn (j, k)
where se θ∈Θ1
• Test size: α = P [Type I error] = sup β(θ)
θ∈Θ0
12.3.1 Multiparameter delta method
Let τ = ϕ(θ1 , . . . , θk ) and let the gradient of ϕ be Retain H0 Reject H0
√

∂ϕ
 H0 true Type
√ I Error (α)
 ∂θ1  H1 true Type II Error (β) (power)
 . 
p-value
 .. 
∇ϕ =  
 ∂ϕ 
∂θk

• p-value = supθ∈Θ0 Pθ [T (X) ≥ T (x)] = inf α : T (x) ∈ Rα
Pθ [T (X ? ) ≥ T (X)]

• p-value = supθ∈Θ0 = inf α : T (X) ∈ Rα
Suppose ∇ϕθ=θb 6= 0 and τb = ϕ(θ).
b Then, | {z }
1−Fθ (T (X)) since T (X ? )∼Fθ
τ − τ) D
(b
→ N (0, 1)
se(b
b τ)
p-value evidence
where r < 0.01 very strong evidence against H0
T
0.01 − 0.05 strong evidence against H0

se(b
b τ) = ∇ϕ
b Jbn ∇ϕ
b
0.05 − 0.1 weak evidence against H0
b and ∇ϕ

b = ∇ϕ b. > 0.1 little or no evidence against H0
and Jbn = Jn (θ) θ=θ
Wald test
12.4 Parametric Bootstrap
• Two-sided test
Sample from f (x; θbn ) instead of from Fbn , where θbn could be the mle or method
of moments estimator. θb − θ0
• Reject H0 when |W | > zα/2 where W =
se
b
• P |W | > zα/2 → α
13 Hypothesis Testing • p-value = Pθ0 [|W | > |w|] ≈ P [|Z| > |w|] = 2Φ(−|w|)
H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1
Likelihood ratio test
Definitions
• Null hypothesis H0 supθ∈Θ Ln (θ) Ln (θbn )

• Alternative hypothesis H1 • T (X) = =
supθ∈Θ0 Ln (θ) Ln (θbn,0 )
• Simple hypothesis θ = θ0 k
• Composite hypothesis θ > θ0 or θ < θ0 iid
D
X
• λ(X) = 2 log T (X) → χ2r−q where Zi2 ∼ χ2k and Z1 , . . . , Zk ∼ N (0, 1)
• Two-sided test: H0 : θ = θ0 versus H1 : θ 6= θ0
i=1
• One-sided test: H0 : θ ≤ θ0 versus H1 : θ > θ0 • p-value = Pθ0 [λ(X) > λ(x)] ≈ P χ2r−q > λ(x)
15
Multinomial LRT Natural form

X1 Xk
• mle: pbn = ,..., fX (x | η) = h(x) exp {η · T(x) − A(η)}
n n
k
Y pbj Xj = h(x)g(η) exp {η · T(x)}
Ln (b
pn )
• T (X) = = = h(x)g(η) exp η T T(x)

Ln (p0 ) j=1
p0j
k
X pbj D
• λ(X) = 2 Xj log → χ2k−1 15 Bayesian Inference
j=1
p 0j
• The approximate size α LRT rejects H0 when λ(X) ≥ χ2k−1,α Bayes’ Theorem
Pearson Chi-square Test f (x | θ)f (θ) f (x | θ)f (θ)
f (θ | x) = =R ∝ Ln (θ)f (θ)
k f (xn ) f (x | θ)f (θ) dθ
X (Xj − E [Xj ])2
• T = where E [Xj ] = np0j under H0
j=1
E [Xj ] Definitions
D
• T → χ2k−1 • X n = (X1 , . . . , Xn )

• p-value = P χ2k−1 > T (x) • xn = (x1 , . . . , xn )
D
2
• Faster → Xk−1 than LRT, hence preferable for small n • Prior density f (θ)
• Likelihood f (xn | θ): joint density of the data
Independence testing Yn
In particular, X n iid =⇒ f (xn | θ) = f (xi | θ) = Ln (θ)
• I rows, J columns, X multinomial sample of size n = I ∗ J i=1
X
• mles unconstrained: pbij = nij • Posterior density f (θ | xn )
X
• Normalizing constant cn = f (xn ) = f (x | θ)f (θ) dθ
R
• mles under H0 : pb0ij = pbi· pb·j = Xni· n·j
• Kernel: part of a density that dependsRon θ

PI PJ nX
• LRT: λ = 2 i=1 j=1 Xij log Xi· Xij·j θL (θ)f (θ)dθ
• Posterior mean θ̄n = θf (θ | xn ) dθ = R Lnn(θ)f (θ) dθ
R
PI PJ (X −E[X ])2
• PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij
D
• LRT and Pearson → χ2k ν, where ν = (I − 1)(J − 1) 15.1 Credible Intervals
Posterior interval
14 Exponential Family Z b
P [θ ∈ (a, b) | xn ] = f (θ | xn ) dθ = 1 − α
Scalar parameter a
fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} Equal-tail credible interval

= h(x)g(θ) exp {η(θ)T (x)} Z a Z ∞
f (θ | xn ) dθ = f (θ | xn ) dθ = α/2
Vector parameter −∞ b
Highest posterior density (HPD) region Rn

( s
)
X
fX (x | θ) = h(x) exp ηi (θ)Ti (x) − A(θ)
i=1 1. P [θ ∈ Rn ] = 1 − α
= h(x) exp {η(θ) · T (x) − A(θ)} 2. Rn = {θ : f (θ | xn ) > k} for some k
= h(x)g(θ) exp {η(θ) · T (x)} Rn is unimodal =⇒ Rn is an interval
16
15.2 Function of parameters 15.3.1 Conjugate Priors
Continuous likelihood (subscript c denotes constant)
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ }.
Likelihood Conjugate prior Posterior hyperparameters
Posterior CDF for τ
Unif (0, θ) Pareto(xm , k) max x(n) , xm , k + n
Z Xn
n n n
H(r | x ) = P [ϕ(θ) ≤ τ | x ] = f (θ | x ) dθ Exp (λ) Gamma (α, β) α + n, β + xi
A
i=1
Pn
µ0 i=1 xi 1 n
2
2

Posterior density N µ, σc N µ0 , σ0 + / + 2 ,
σ2 σ2 σ02 σc
0 c−1
1 n
h(τ | xn ) = H 0 (τ | xn ) + 2
σ02 σc
Pn
νσ02 + i=1 (xi − µ)2
Bayesian delta method N µc , σ 2 Scaled Inverse Chi- ν + n,
ν+n
square(ν, σ02 )

νλ + nx̄ n

τ | X n ≈ N ϕ(θ),
b seb ϕ0 (θ)

N µ, σ 2
b
Normal- , ν + n, α + ,
ν+n 2
scaled Inverse n 2
1X γ(x̄ − λ)
Gamma(λ, ν, α, β) β+ (xi − x̄)2 +
2 i=1 2(n + γ)
15.3 Priors −1
Σ−1 −1
Σ−1 −1

MVN(µ, Σc ) MVN(µ0 , Σ0 ) 0 + nΣc 0 µ0 + nΣ x̄ ,
−1 −1
Σ−1

Choice 0 + nΣc
Xn
MVN(µc , Σ) Inverse- n + κ, Ψ + (xi − µc )(xi − µc )T
• Subjective Bayesianism: prior should incorporate as much detail as possible Wishart(κ, Ψ) i=1
the research’s a priori knowledge—via prior elicitation n
X xi
• Objective Bayesianism: prior should incorporate as little detail as possible Pareto(xmc , k) Gamma (α, β) α + n, β + log
x mc
(non-informative prior) i=1
Pareto(xm , kc ) Pareto(x0 , k0 ) x0 , k0 − kn where k0 > kn
• Robust Bayesianism: consider various priors and determine sensitivity of Xn
our inferences to changes in the prior Gamma (αc , β) Gamma (α0 , β0 ) α0 + nαc , β0 + xi
i=1
Types
• Flat: f (θ) ∝ constant

R∞
• Proper: −∞ f (θ) dθ = 1
R∞
• Improper: −∞ f (θ) dθ = ∞
• Jeffrey’s Prior (transformation-invariant):
p p
f (θ) ∝ I(θ) f (θ) ∝ det(I(θ))
• Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family

17
Discrete likelihood Bayes factor
Likelihood Conjugate prior Posterior hyperparameters log10 BF10 BF10 evidence
n n 0 − 0.5 1 − 1.5 Weak
0.5 − 1 1.5 − 10 Moderate
X X
Bern (p) Beta (α, β) α+ xi , β + n − xi
i=1 i=1
1−2 10 − 100 Strong
Xn n
X n
X >2 > 100 Decisive
Bin (p) Beta (α, β) α+ xi , β + Ni − xi
p
i=1 i=1 i=1 1−p BF10
n
X p∗ = p where p = P [H1 ] and p∗ = P [H1 | xn ]
NBin (p) Beta (α, β) α + rn, β + xi 1 + 1−p BF10
i=1
n
16 Sampling Methods
X
Po (λ) Gamma (α, β) α+ xi , β + n
i=1
n
X 16.1 Inverse Transform Sampling
Multinomial(p) Dir (α) α+ x(i)
i=1 Setup
n
X
Geo (p) Beta (α, β) α + n, β + xi • U ∼ Unif (0, 1)
i=1 • X∼F
• F −1 (u) = inf{x | F (x) ≥ u}
15.4 Bayesian Testing Algorithm
1. Generate u ∼ Unif (0, 1)
If H0 : θ ∈ Θ0 :
2. Compute x = F −1 (u)
Z
Prior probability P [H0 ] = f (θ) dθ
Θ0 16.2 The Bootstrap
Z
Posterior probability P [H0 | xn ] = f (θ | xn ) dθ Let Tn = g(X1 , . . . , Xn ) be a statistic.
Θ0
1. Estimate VF [Tn ] with VFbn [Tn ].
2. Approximate VFbn [Tn ] using simulation:
∗ ∗
Let H0 . . .Hk−1 be k hypotheses. Suppose θ ∼ f (θ | Hk ), (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from
the sampling distribution implied by Fn b
f (xn | Hk )P [Hk ] i. Sample uniformly X1∗ , . . . , Xn∗ ∼ Fbn .
P [Hk | xn ] = PK ,
n
k=1 f (x | Hk )P [Hk ] ii. Compute Tn∗ = g(X1∗ , . . . , Xn∗ ).
(b) Then
Marginal likelihood B B
!2
1 X ∗ 1 X ∗
vboot = VFbn =
b Tn,b − T
B B r=1 n,r
Z
n
f (x | Hi ) = f (xn | θ, Hi )f (θ | Hi ) dθ b=1
Θ
16.2.1 Bootstrap Confidence Intervals
Posterior odds (of Hi relative to Hj )
Normal-based interval
n
P [Hi | x ] n
f (x | Hi ) P [Hi ] Tn ± zα/2 se
b boot
= ×
P [Hj | xn ] f (xn | Hj ) P [Hj ] Pivotal interval
| {z } | {z }
Bayes Factor BFij prior odds 1. Location parameter θ = T (F )
18
2. Pivot Rn = θbn − θ 2. Generate u ∼ Unif (0, 1)
3. Let H(r) = P [Rn ≤ r] be the cdf of Rn Ln (θcand )
∗ ∗
3. Accept θcand if u ≤
4. Let Rn,b = θbn,b − θbn . Approximate H using bootstrap: Ln (θbn )
B
1 X ∗ 16.4 Importance Sampling
H(r)
b = I(Rn,b ≤ r)
B Sample from an importance function g rather than target density h.
b=1
Algorithm to obtain an approximation to E [q(θ) | xn ]:
5. θβ∗ = β sample quantile of (θbn,1
∗ ∗
, . . . , θbn,B ) iid
1. Sample from the prior θ1 , . . . , θn ∼ f (θ)
6. rβ∗ = beta sample quantile of (Rn,1
∗ ∗
, . . . , Rn,B ), i.e., rβ∗ = θβ∗ − θbn
Ln (θi )
2. wi = PB ∀i = 1, . . . , B

7. Approximate 1 − α confidence interval Cn = â, b̂ where
i=1 Ln (θi )
PB
3. E [q(θ) | xn ] ≈ i=1 q(θi )wi
b −1 1 − α =

∗ ∗
â = θbn − H θbn − r1−α/2 = 2θbn − θ1−α/2
2
α
b̂ = θbn − Hb −1
2
= ∗
θbn − rα/2 = ∗
2θbn − θα/2 17 Decision Theory
Percentile interval Definitions
∗ ∗
Cn = θα/2 , θ1−α/2 • Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous for an estimator θb
16.3 Rejection Sampling • Action a ∈ A: possible value of the decision rule. In the estimation
context, the action is just an estimate of θ, θ(x).
b
Setup
• Loss function L: consequences of taking action a when true state is θ or
• We can easily sample from g(θ) discrepancy between θ and θ, b L : Θ × A → [−k, ∞).
• We want to sample from h(θ), but it is difficult Loss functions
k(θ)
• We know h(θ) up to a proportional constant: h(θ) = R • Squared error loss: L(θ, a) = (θ − a)2
k(θ) dθ (
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ K1 (θ − a) a − θ < 0
• Linear loss: L(θ, a) =
K2 (a − θ) a − θ ≥ 0
Algorithm
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K1 = K2 )
1. Draw θcand ∼ g(θ) • Lp loss: L(θ, a) = |θ − a|p
2. Generate u ∼ Unif (0, 1)
(
0 a=θ
k(θcand ) • Zero-one loss: L(θ, a) =
3. Accept θcand if u ≤ 1 a 6= θ
M g(θcand )
4. Repeat until B values of θcand have been accepted
17.1 Risk
Example
Posterior risk
• We can easily sample from the prior g(θ) = f (θ)
Z h i
r(θb | x) = L(θ, θ(x))f
b (θ | x) dθ = Eθ|X L(θ, θ(x))
b
• Target is the posterior h(θ) ∝ k(θ) = f (xn | θ)f (θ)
• Envelope condition: f (xn | θ) ≤ f (xn | θbn ) = Ln (θbn ) ≡ M (Frequentist) risk
• Algorithm Z h i
1. Draw θ cand
∼ f (θ) R(θ, θ)
b = L(θ, θ(x))f
b (x | θ) dx = EX|θ L(θ, θ(X))
b
19
Bayes risk 18 Linear Regression
ZZ
Definitions
h i
r(f, θ)
b = L(θ, θ(x))f
b (x, θ) dx dθ = Eθ,X L(θ, θ(X))
b
• Response variable Y
• Covariate X (aka predictor variable or feature)
h h ii h i
r(f, θ)
b = Eθ EX|θ L(θ, θ(X)
b = Eθ R(θ, θ)
b
18.1 Simple Linear Regression

h h ii h i
r(f, θ)
b = EX Eθ|X L(θ, θ(X)
b = EX r(θb | X)
Model
17.2 Admissibility Yi = β0 + β1 Xi + i E [i | Xi ] = 0, V [i | Xi ] = σ 2
Fitted line
• θb0 dominates θb if
b0 rb(x) = βb0 + βb1 x
∀θ : R(θ, θ ) ≤ R(θ, θ)
b
Predicted (fitted) values
∃θ : R(θ, θb0 ) < R(θ, θ)
b Ybi = rb(Xi )
• θb is inadmissible if there is at least one other estimator θb0 that dominates Residuals
it. Otherwise it is called admissible. ˆi = Yi − Ybi = Yi − βb0 + βb1 Xi
Residual sums of squares (rss)

17.3 Bayes Rule
n
X
Bayes rule (or Bayes estimator) rss(βb0 , βb1 ) = ˆ2i
i=1
• r(f, θ)
b = inf e r(f, θ)
θ
e
R Least square estimates
• θ(x)
b = inf r(θb | x) ∀x =⇒ r(f, θ)
b = r(θb | x)f (x) dx
βbT = (βb0 , βb1 )T : min rss
β
b0 ,β
b1
Theorems
• Squared error loss: posterior mean βb0 = Ȳn − βb1 X̄n

Pn Pn
• Absolute error loss: posterior median i=1 (Xi − X̄n )(Yi − Ȳn ) i=1 Xi Yi − nX̄Y
β1 =
b Pn = P n
• Zero-one loss: posterior mode i=1 (Xi − X̄n )
2 2 2
i=1 Xi − nX

β0
h i
E βb | X n =
17.4 Minimax Rules β1
σ 2 n−1 ni=1 Xi2 −X n
h i P
Maximum risk V βb | X n = 2
R̄(θ)
b = sup R(θ, θ)
b R̄(a) = sup R(θ, a) nsX −X n 1
θ θ r Pn
2
σ i=1 Xi
√
b
Minimax rule se(
b βb0 ) =
sX n n
sup R(θ, θ)
b = inf R̄(θ)
e = inf sup R(θ, θ)
e
θ θe θe θ σ
√
b
se(
b βb1 ) =
sX n
θb = Bayes rule ∧ ∃c : R(θ, θ)
b =c Pn Pn 2
where s2X = n−1 i=1 (Xi − X n )2 and σ b2 = n−21
i=1
ˆi (unbiased estimate).
Least favorable prior Further properties:
P P
θbf = Bayes rule ∧ R(θ, θbf ) ≤ r(f, θbf ) ∀θ • Consistency: βb0 → β0 and βb1 → β1
20
• Asymptotic normality: 18.3 Multiple Regression
βb0 − β0 D βb1 − β1 D Y = Xβ +
→ N (0, 1) and → N (0, 1)
se(
b βb0 ) se(
b βb1 )
where
• Approximate 1 − α confidence intervals for β0 and β1 :      
X11 ··· X1k β1 1
 .. ..  β =  ... 
..  .. 
βb0 ± zα/2 se( and βb1 ± zα/2 se( X= . =.
 
b βb0 ) b βb1 ) . . 
Xn1 ··· Xnk βk n
• Wald test for H0 : β1 = 0 vs. H1 : β1 6= 0: reject H0 if |W | > zα/2 where
W = βb1 /se(
b βb1 ). Likelihood

1
R2 L(µ, Σ) = (2πσ 2 )−n/2 exp − 2 rss
Pn b 2
Pn 2 2σ
i=1 (Yi − Y ) ˆ rss
2
R = Pn 2
= 1 − Pn i=1 i 2 = 1 −
i=1 (Yi − Y ) i=1 (Yi − Y )
tss
N
X
Likelihood rss = (y − Xβ)T (y − Xβ) = kY − Xβk2 = (Yi − xTi β)2
n n n i=1
Y Y Y
L= f (Xi , Yi ) = fX (Xi ) × fY |X (Yi | Xi ) = L1 × L2
i=1 i=1 i=1 If the (k × k) matrix X T X is invertible,
Yn
L1 = fX (Xi ) βb = (X T X)−1 X T Y
i=1 h i
V βb | X n = σ 2 (X T X)−1
n
( )
Y 1 X 2
−n
L2 = fY |X (Yi | Xi ) ∝ σ exp − 2 Yi − (β0 − β1 Xi )
2σ i βb ≈ N β, σ 2 (X T X)−1

i=1
Under the assumption of Normality, the least squares estimator is also the mle
Estimate regression function
but the least squares variance estimator is not the mle.
n k
1X 2 X
b2 =
σ ˆ rb(x) = βbj xj
n i=1 i j=1
18.2 Prediction Unbiased estimate for σ 2

Observe X = x∗ of the covariate and want to predict their outcome Y∗ . n
1 X 2
b2 =
σ ˆ ˆ = X βb − Y
Yb∗ = βb0 + βb1 x∗ n − k i=1 i
h i h i h i h i
V Yb∗ = V βb0 + x2∗ V βb1 + 2x∗ Cov βb0 , βb1 mle
n−k 2
Prediction interval µ
b = X̄ b2 =
σ σ
Pn 2
n
2 2 i=1 (Xi − X∗ )
ξn = σ
b P +1
n i (Xi − X̄)2 j
b
1 − α Confidence interval
Yb∗ ± zα/2 ξbn βbj ± zα/2 se(
b βbj )
21
18.4 Model Selection Akaike Information Criterion (AIC)
Consider predicting a new observation Y ∗ for covariates X ∗ and let S ⊂ J
denote a subset of the covariates in the model, where |S| = k and |J| = n. bS2 ) − k
AIC(S) = `n (βbS , σ
Issues
Bayesian Information Criterion (BIC)
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance k
bS2 ) −
BIC(S) = `n (βbS , σ log n
Procedure 2
1. Assign a score to each model Validation and training

2. Search through all models to find the one with the highest score
m
X n n
Hypothesis testing R
bV (S) = (Ybi∗ (S) − Yi∗ )2 m = |{validation data}|, often or
i=1
4 2
H0 : βj = 0 vs. H1 : βj 6= 0 ∀j ∈ J
Leave-one-out cross-validation
Mean squared prediction error (mspe)
n n
!2
h i X X Yi − Ybi (S)
mspe = E (Yb (S) − Y ∗ )2 R
bCV (S) = (Yi − Yb(i) )2 =
i=1 i=1
1 − Uii (S)
Prediction risk
n n h i
U (S) = XS (XST XS )−1 XS (“hat matrix”)
X X
R(S) = mspei = E (Ybi (S) − Yi∗ )2
i=1 i=1
Training error
n
R
btr (S) =
X
(Ybi (S) − Yi )2
19 Non-parametric Function Estimation
i=1
2 19.1 Density Estimation
R Pn b 2
R i=1 (Yi (S) − Y )
rss(S) btr (S) R
R2 (S) = 1 − =1− =1− Estimate f (x), where f (x) = P [X ∈ A] = A
f (x) dx.
P n 2
i=1 (Yi − Y )
tss tss Integrated square error (ise)
The training error is a downward-biased estimate of the prediction risk. Z 2 Z
h i L(f, fbn ) = f (x) − fn (x) dx = J(h) + f 2 (x) dx
b
E R btr (S) < R(S)
h i n
X h i Frequentist risk
bias(Rtr (S)) = E Rtr (S) − R(S) = −2
b b Cov Ybi , Yi
i=1
h i Z Z
R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx
Adjusted R2
n − 1 rss
R2 (S) = 1 −
n − k tss h i
Mallow’s Cp statistic b(x) = E fbn (x) − f (x)
h i
R(S)
b =R σ 2 = lack of fit + complexity penalty
btr (S) + 2kb v(x) = V fbn (x)
22
19.1.1 Histograms KDE
n
Definitions 1X1 x − Xi
fbn (x) = K
n i=1 h h
• Number of bins m
Z Z
1 4 00 2 1
1 R(f, fn ) ≈ (hσK )
b (f (x)) dx + K 2 (x) dx
• Binwidth h = m 4 nh
• Bin Bj has νj observations c
−2/5 −1/5 −1/5
c2 c3
Z Z
h∗ = 1 c = σ 2
, c = K 2
(x) dx, c = (f 00 (x))2 dx
R
• Define pbj = νj /n and pj = Bj f (u) du n1/5
1 K 2 3
Z 4/5 Z 1/5
∗ c4 5 2 2/5 2 00 2
Histogram estimator R (f, fn ) = 4/5
b c4 = (σK ) K (x) dx (f ) dx
n 4
| {z }
m C(K)
X pbj
fbn (x) = I(x ∈ Bj )
j=1
h Epanechnikov Kernel
h i pj
E fbn (x) = (
3
√
h √
4 5(1−x2 /5)
|x| < 5
h i p (1 − p ) K(x) =
j j
V fbn (x) = 0 otherwise
nh2
h2
Z
2 1
R(fbn , f ) ≈ (f 0 (u)) du + Cross-validation estimate of E [J(h)]
12 nh
!1/3
1 6 n n n
1 X X ∗ Xi − Xj
Z
∗
h = 1/3 R 2Xb 2
2 du JbCV (h) = fbn2 (x) dx − f(−i) (Xi ) ≈ K + K(0)
n (f 0 (u)) n i=1 hn2 i=1 j=1 h nh
2/3 Z 1/3
∗ b C 3 0 2
R (fn , f ) ≈ 2/3 C= (f (u)) du
n 4 Z
K ∗ (x) = K (2) (x) − 2K(x) K (2) (x) = K(x − y)K(y) dy
Cross-validation estimate of E [J(h)]
Z
2Xb
n
2 n+1 X 2
m 19.2 Non-parametric Regression
JbCV (h) = fbn2 (x) dx − f(−i) (Xi ) = − pb
n i=1 (n − 1)h (n − 1)h j=1 j Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points
(x1 , Y1 ), . . . , (xn , Yn ) related by
Yi = r(xi ) + i
19.1.2 Kernel Density Estimator (KDE)
E [i ] = 0
Kernel K V [i ] = σ 2
• K(x) ≥ 0 k-nearest Neighbor Estimator

R
• K(x) dx = 1
•
R
xK(x) dx = 0 1 X
rb(x) = Yi where Nk (x) = {k values of x1 , . . . , xn closest to x}
•
R 2 2
x K(x) dx ≡ σK >0 k
i:xi ∈Nk (x)
23
Nadaraya-Watson Kernel Estimator 20 Stochastic Processes
n
X
rb(x) = wi (x)Yi Stochastic Process
i=1 (
x−xi

K {0, ±1, . . . } = Z discrete
wi (x) = h ∈ [0, 1] {Xt : t ∈ T } T =
[0, ∞)

Pn
K
x−xj continuous
j=1 h
4 Z 2
h4 f 0 (x)
Z
2 2 00 0 • Notations Xt , X(t)
R(brn , r) ≈ x K (x) dx r (x) + 2r (x) dx
4 f (x) • State space X
Z 2R 2
σ K (x) dx • Index set T
+ dx
nhf (x)
c1
h∗ ≈ 1/5 20.1 Markov Chains
n
c2
R∗ (b
rn , r) ≈ 4/5 Markov chain
n
P [Xn = x | X0 , . . . , Xn−1 ] = P [Xn = x | Xn−1 ] ∀n ∈ T, x ∈ X

n
X n
X (Yi − rb(xi ))2 Transition probabilities
JbCV (h) = (Yi − rb(−i) (xi ))2 = !2
i=1 i=1 K(0) pij ≡ P [Xn+1 = j | Xn = i]
1− Pn x−x
j
K
j=1 h pij (n) ≡ P [Xm+n = j | Xm = i] n-step
19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )

Approximation
∞ J • (i, j) element is pij
X X
r(x) = βj φj (x) ≈ βj φj (x) • pij > 0
P
j=1 j=1 • i pij = 1
Multivariate regression
Y = Φβ + η Chapman-Kolmogorov
 
φ0 (x1 ) ··· φJ (x1 ) X
 .. .. ..  pij (m + n) = pij (m)pkj (n)
where ηi = i and Φ =  . . .  k
φ0 (xn ) · · · φJ (xn )
Least squares estimator Pm+n = Pm Pn
βb = (ΦT Φ)−1 ΦT Y
Pn = P × · · · × P = Pn
1
≈ ΦT Y (for equally spaced observations only)
n Marginal probability
 2 µn = (µn (1), . . . , µn (N )) where µi (i) = P [Xn = i]
n J
R
bCV (J) =
X
Yi −
X
φj (xi )βbj,(−i)  µ0 , initial distribution
i=1 j=1 µn = µ0 Pn
24
20.2 Poisson Processes Autocorrelation function (ACF)
Poisson process
Cov [xs , xt ] γ(s, t)
ρ(s, t) = p =p
• {Xt : t ∈ [0, ∞)} = number of events up to and including time t V [xs ] V [xt ] γ(s, s)γ(t, t)
• X0 = 0
• Independent increments: Cross-covariance function (CCV)
∀t0 < · · · < tn : Xt1 − Xt0 ⊥
⊥ · · · ⊥⊥ Xtn − Xtn−1
γxy (s, t) = E [(xs − µxs )(yt − µyt )]
• Intensity function λ(t)
– P [Xt+h − Xt = 1] = λ(t)h + o(h) Cross-correlation function (CCF)
– P [Xt+h − Xt = 2] = o(h)
γxy (s, t)
• Xs+t − Xs ∼ Po (m(s + t) − m(s)) where m(t) =
Rt
λ(s) ds ρxy (s, t) = p
0 γx (s, s)γy (t, t)
Homogeneous Poisson process
Backshift operator
λ(t) ≡ λ =⇒ Xt ∼ Po (λt) λ>0
B k (xt ) = xt−k
Waiting times
Wt := time at which Xt occurs Difference operator

1 ∇d = (1 − B)d
Wt ∼ Gamma t,
λ
Interarrival times White noise
St = Wt+1 − Wt
2

1 • wt ∼ wn(0, σw )
St ∼ Exp iid 2

λ • Gaussian: wt ∼ N 0, σw
• E [wt ] = 0 t ∈ T
St • V [wt ] = σ 2 t ∈ T
• γw (s, t) = 0 s 6= t ∧ s, t ∈ T
Wt−1 Wt t
Random walk
21 Time Series
• Drift δ
Pt
Mean function Z ∞
• xt = δt + j=1 wj
µxt = E [xt ] = xft (x) dx • E [xt ] = δt
−∞
Autocovariance function Symmetric moving average
γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt k

X k
X
mt = aj xt−j where aj = a−j ≥ 0 and aj = 1
γx (t, t) = E (xt − µt )2 = V [xt ]

j=−k j=−k
25
21.1 Stationary Time Series Sample variance
n
Strictly stationary 1 X |h|
V [x̄] = 1− γx (h)
n n
P [xt1 ≤ c1 , . . . , xtk ≤ ck ] = P [xt1 +h ≤ c1 , . . . , xtk +h ≤ ck ] h=−n
∀k ∈ N, tk , ck , h ∈ Z Sample autocovariance function
Weakly stationary n−h

1 X
γ
b(h) = (xt+h − x̄)(xt − x̄)
• E x2t < ∞ ∀t ∈ Z n t=1
2
• E xt = m ∀t ∈ Z
• γx (s, t) = γx (s + r, t + r) ∀r, s, t ∈ Z Sample autocorrelation function
Autocovariance function
γ
b(h)
ρb(h) =
• γ(h) = E [(xt+h − µ)(xt − µ)] ∀h ∈ Z γ
b(0)

• γ(0) = E (xt − µ)2
• γ(0) ≥ 0 Sample cross-variance function
• γ(0) ≥ |γ(h)|
n−h
• γ(h) = γ(−h) 1 X
γ
bxy (h) = (xt+h − x̄)(yt − y)
n t=1
Autocorrelation function (ACF)
Cov [xt+h , xt ] γ(t + h, t) γ(h) Sample cross-correlation function

ρx (h) = p =p =
V [xt+h ] V [xt ] γ(t + h, t + h)γ(t, t) γ(0)
γ
bxy (h)
Jointly stationary time series ρbxy (h) = p
γbx (0)b
γy (0)
γxy (h) = E [(xt+h − µx )(yt − µy )]
Properties
γxy (h)
ρxy (h) = p 1
γx (0)γy (h) • σρbx (h) = √ if xt is white noise
n
Linear process 1
• σρbxy (h) = √ if xt or yt is white noise
∞
X ∞
X n
xt = µ + ψj wt−j where |ψj | < ∞
j=−∞ j=−∞
∞
21.3 Non-Stationary Time Series
X
2
γ(h) = σw ψj+h ψj Classical decomposition model
j=−∞
xt = µt + st + wt
21.2 Estimation of Correlation
Sample mean • µt = trend
n
1X • st = seasonal component
x̄ = xt
n t=1 • wt = random noise term
26
21.3.1 Detrending Moving average polynomial
Least squares θ(z) = 1 + θ1 z + · · · + θq zq z ∈ C ∧ θq 6= 0
2
1. Choose trend model, e.g., µt = β0 + β1 t + β2 t
Moving average operator
2. Minimize rss to obtain trend estimate µ bt = βb0 + βb1 t + βb2 t2
3. Residuals , noise wt θ(B) = 1 + θ1 B + · · · + θp B p
Moving average MA (q) (moving average model order q)
1
• The low-pass filter vt is a symmetric moving average mt with aj = 2k+1 : xt = wt + θ1 wt−1 + · · · + θq wt−q ⇐⇒ xt = θ(B)wt
k q
1 X X
vt = xt−1 E [xt ] = θj E [wt−j ] = 0
2k + 1
i=−k j=0
Pk ( Pq−h
1 2
• If 2k+1 i=−k wt−j ≈ 0, a linear trend function µt = β0 + β1 t passes
σw j=0 θj θj+h 0≤h≤q
γ(h) = Cov [xt+h , xt ] =
without distortion 0 h>q
Differencing MA (1)
xt = wt + θwt−1
• µt = β0 + β1 t =⇒ ∇xt = β1 
2 2
(1 + θ )σw h = 0

2
21.4 ARIMA models γ(h) = θσw h=1

0 h>1

Autoregressive polynomial
(
θ
φ(z) = 1 − φ1 z − · · · − φp zp z ∈ C ∧ φp 6= 0 2 h=1
ρ(h) = (1+θ )
0 h>1
Autoregressive operator
ARMA (p, q)
φ(B) = 1 − φ1 B − · · · − φp B p
xt = φ1 xt−1 + · · · + φp xt−p + wt + θ1 wt−1 + · · · + θq wt−q
Autoregressive model order p, AR (p)
φ(B)xt = θ(B)wt
xt = φ1 xt−1 + · · · + φp xt−p + wt ⇐⇒ φ(B)xt = wt
Partial autocorrelation function (PACF)
AR (1) • xih−1 , regression of xi on {xh−1 , xh−2 , . . . , x1 }
k−1 ∞ • φhh = corr(xh − xh−1
h , x0 − xh−1
0 ) h≥2
X k→∞,|φ|<1 X
• xt = φk (xt−k ) + φj (wt−j ) = φj (wt−j ) • E.g., φ11 = corr(x1 , x0 ) = ρ(1)
j=0 j=0
| {z } ARIMA (p, d, q)
linear process
P∞ j
∇d xt = (1 − B)d xt is ARMA (p, q)
• E [xt ] = j=0 φ (E [wt−j ]) = 0
2 h
σw φ φ(B)(1 − B)d xt = θ(B)wt
• γ(h) = Cov [xt+h , xt ] = 1−φ2
γ(h) Exponentially Weighted Moving Average (EWMA)
• ρ(h) = γ(0) = φh
• ρ(h) = φρ(h − 1) h = 1, 2, . . . xt = xt−1 + wt − λwt−1
27
∞
X • Frequency index ω (cycles per unit time), period 1/ω
xt = (1 − λ)λj−1 xt−j + wt when |λ| < 1
j=1
• Amplitude A
• Phase φ
x̃n+1 = (1 − λ)xn + λx̃n
• U1 = A cos φ and U2 = A sin φ often normally distributed rv’s
Seasonal ARIMA
Periodic mixture
• Denoted by ARIMA (p, d, q) × (P, D, Q)s
q
• ΦP (B s )φ(B)∇D d s
s ∇ xt = δ + ΘQ (B )θ(B)wt X
xt = (Uk1 cos(2πωk t) + Uk2 sin(2πωk t))
k=1
21.4.1 Causality and Invertibility
P∞ • Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rv’s with variances σk2
ARMA (p, q) is causal (future-independent) ⇐⇒ ∃{ψj } : j=0 ψj < ∞ such that Pq
• γ(h) = k=1 σk2 cos(2πωk h)
Pq
∞
X • γ(0) = E x2t = k=1 σk2
xt = wt−j = ψ(B)wt
j=0 Spectral representation of a periodic process
P∞
ARMA (p, q) is invertible ⇐⇒ ∃{πj } : j=0 πj < ∞ such that γ(h) = σ 2 cos(2πω0 h)
∞ σ 2 −2πiω0 h σ 2 2πiω0 h
X = e + e
π(B)xt = Xt−j = wt 2 2
Z 1/2
j=0
= e2πiωh dF (ω)
Properties −1/2
• ARMA (p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle Spectral distribution function
∞

X θ(z)
j 0
 ω < −ω0
ψ(z) = ψj z = |z| ≤ 1
φ(z) F (ω) = σ 2 /2 −ω ≤ ω < ω0
j=0 
 2
σ ω ≥ ω0
• ARMA (p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle
• F (−∞) = F (−1/2) = 0
∞
X φ(z) • F (∞) = F (1/2) = γ(0)
π(z) = πj z j = |z| ≤ 1
j=0
θ(z)
Spectral density
Behavior of the ACF and PACF for causal and invertible ARMA models ∞
X 1 1
AR (p) MA (q) ARMA (p, q) f (ω) = γ(h)e−2πiωh − ≤ω≤
2 2
h=−∞
ACF tails off cuts off after lag q tails off
PACF cuts off after lag p tails off q tails off P∞ R 1/2
• Needs h=−∞ |γ(h)| < ∞ =⇒ γ(h) = −1/2
e2πiωh f (ω) dω h = 0, ±1, . . .
21.5 Spectral Analysis • f (ω) ≥ 0
• f (ω) = f (−ω)
Periodic process • f (ω) = f (1 − ω)
R 1/2
xt = A cos(2πωt + φ) • γ(0) = V [xt ] = −1/2 f (ω) dω
2
= U1 cos(2πωt) + U2 sin(2πωt) • White noise: fw (ω) = σw
28
• ARMA (p, q) , φ(B)xt = θ(B)wt : 22.2 Beta Function
Z 1
Γ(x)Γ(y)
|θ(e−2πiω )|2
2 • Ordinary: B(x, y) = B(y, x) = tx−1 (1 − t)y−1 dt =
fx (ω) = σw 0 Γ(x + y)
|φ(e−2πiω )|2 Z x
a−1 b−1
Pp Pq • Incomplete: B(x; a, b) = t (1 − t) dt
where φ(z) = 1 − k=1 φk z k and θ(z) = 1 + k=1 θk z k 0
• Regularized incomplete:
Discrete Fourier Transform (DFT) a+b−1
B(x; a, b) a,b∈N X (a + b − 1)!
Ix (a, b) = = xj (1 − x)a+b−1−j
n
X B(a, b) j=a
j!(a + b − 1 − j)!
d(ωj ) = n−1/2 xt e−2πiωj t
• I0 (a, b) = 0 I1 (a, b) = 1
i=1
• Ix (a, b) = 1 − I1−x (b, a)
Fourier/Fundamental frequencies
22.3 Series
ωj = j/n
Finite Binomial
Inverse DFT n n
n−1 X n(n + 1) X n
• = 2n
X
xt = n −1/2
d(ωj )e 2πiωj t k= •
2 k
j=0 k=1 k=0
n n
X X r+k r+n+1
Periodogram • (2k − 1) = n2 • =
I(j/n) = |d(j/n)|2 k n
k=1 k=0
n n
Scaled Periodogram
X n(n + 1)(2n + 1) X k n+1
• k2 = • =
6 m m+1
k=1 k=0
4 n
P (j/n) = I(j/n) X
n(n + 1)
2 • Vandermonde’s Identity:
n • k3 = r
m n

m+n

2
!2 !2 X
n n k=1 =
2X 2X n k r−k r
= xt cos(2πtj/n + xt sin(2πtj/n cn+1 − 1 k=0
n t=1 n t=1
X
• ck = c 6= 1 • Binomial Theorem:
c−1 n
n n−k k
k=0
X
a b = (a + b)n
22 Math k
k=0
22.1 Gamma Function Infinite

Z ∞
∞ ∞
• Ordinary: Γ(s) = ts−1 e−t dt X 1 X p
0 • pk = , pk = |p| < 1
Z ∞ 1−p 1−p
k=0 k=1
• Upper incomplete: Γ(s, x) = ts−1 e−t dt ∞ ∞
!
X d X d 1 1
Z xx • kpk−1 = pk
= = |p| < 1
dp dp 1 − p (1 − p)2
• Lower incomplete: γ(s, x) = ts−1 e−t dt k=0 k=0
0 ∞
X r+k−1 k
• Γ(α + 1) = αΓ(α) α>1 • x = (1 − x)−r r ∈ N+
k
• Γ(n) = (n − 1)! n∈N k=0
∞
• Γ(0) = Γ(−1) = ∞ X α k
√ • p = (1 + p)α |p| < 1 , α ∈ C
• Γ(1/2) = π k
k=0
• Γ(−1/2) = −2Γ(1/2)
29
22.4 Combinatorics [3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R
Examples. Springer, 2006.
Sampling [4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra.
Springer, 2001.
k out of n w/o replacement w/ replacement [5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik.
k−1 Springer, 2002.
Y n!
ordered nk = (n − i) = nk [6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
i=0
(n − k)!
nk

n n! n−1+r n−1+r
unordered = = =
k k! k!(n − k)! r n−1
Stirling numbers, 2nd kind

(
n n−1 n−1 n 1 n=0
=k + 1≤k≤n =
k k k−1 0 0 else
Partitions
n
X
Pn+k,k = Pn,i k > n : Pn,k = 0 n ≥ 1 : Pn,0 = 0, P0,0 = 1
i=1
Balls and Urns f :B→U D = distinguishable, ¬D = indistinguishable.
|B| = n, |U | = m f arbitrary f injective f surjective f bijective

( (
mn m ≥ n

n n! m = n
B : D, U : D mn m!
0 else m 0 else
(
m+n−1 m n−1 1 m=n
B : ¬D, U : D
n n m−1 0 else
m
( (
X n 1 m≥n n 1 m=n
B : D, U : ¬D
k 0 else m 0 else
k=1
m
( (
X 1 m≥n 1 m=n
B : ¬D, U : ¬D Pn,k Pn,m
k=1
0 else 0 else
References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole,
1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American
Statistician, 62(1):45–53, 2008.
30
Univariate distribution relationships, courtesy Leemis and McQueston [2].
31

Probability and Statistics: Cookbook

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Probability and Statistics: Cookbook

Загружено:

Авторское право:

Доступные форматы

Probability and Statistics

1 Distribution Overview 3 15 Bayesian Inference 16 22 Math 29

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

a b 0 10 20 30 40 0.0 2.5 5.0 7.5 10.0 0 5 10 15 20

0.5 0.25 0.1

● ● 0.0 0.00 0.0

0.00 0 0.0 0.0

0.75 0.75 0.75 0.75

0.50 0.50 0.50 0.50

0.25 0.25 0.25 0.25

0.25 0.25 0.25 0.25

• Outcome (point or element) ω ∈ Ω Bayes’ Theorem

Properties Probability Mass Function (PMF)

• P [∅] = 0 fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}]

Continuous Sample mean

Properties of the mle Under appropriate regularity conditions

• Null hypothesis H0 supθ∈Θ Ln (θ) Ln (θbn )

fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} Equal-tail credible interval

Highest posterior density (HPD) region Rn

• Flat: f (θ) ∝ constant

• Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family

18.1 Simple Linear Regression

Residual sums of squares (rss)

• Squared error loss: posterior mean βb0 = Ȳn − βb1 X̄n

18.2 Prediction Unbiased estimate for σ 2

1. Assign a score to each model Validation and training

• K(x) ≥ 0 k-nearest Neighbor Estimator

P [Xn = x | X0 , . . . , Xn−1 ] = P [Xn = x | Xn−1 ] ∀n ∈ T, x ∈ X

19.3 Smoothing Using Orthogonal Functions Transition matrix P (n-step: Pn )

Autocovariance function Symmetric moving average

γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt k

∀k ∈ N, tk , ck , h ∈ Z Sample autocovariance function

Weakly stationary n−h

Cov [xt+h , xt ] γ(t + h, t) γ(h) Sample cross-correlation function

22.1 Gamma Function Infinite

Stirling numbers, 2nd kind

Balls and Urns f :B→U D = distinguishable, ¬D = indistinguishable.

|B| = n, |U | = m f arbitrary f injective f surjective f bijective

Вам также может понравиться