Probability and Statistics Cheat Sheet

Copyright (c) Matthias Vallentin, 2010
vallentin@icir.org
November 23, 2010

This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate if you send me an email. The most recent version of this document is available at http://cs.berkeley.edu/~mavam/probstat.pdf. To reproduce, please contact me.
Contents

1  Distribution Overview
   1.1  Discrete Distributions
   1.2  Continuous Distributions
2  Probability Theory
3  Random Variables
   3.1  Transformations
4  Expectation
5  Variance
6  Inequalities
7  Distribution Relationships
8  Probability and Moment Generating Functions
9  Multivariate Distributions
   9.1  Standard Bivariate Normal
   9.2  Bivariate Normal
   9.3  Multivariate Normal
10 Convergence
   10.1 Law of Large Numbers (LLN)
   10.2 Central Limit Theorem (CLT)
11 Statistical Inference
   11.1 Point Estimation
   11.2 Normal-based Confidence Interval
   11.3 Empirical Distribution Function
   11.4 Statistical Functionals
12 Parametric Inference
   12.1 Method of Moments
   12.2 Maximum Likelihood
        12.2.1 Delta Method
   12.3 Multiparameter Models
        12.3.1 Multiparameter Delta Method
   12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
   14.1 Credible Intervals
   14.2 Function of Parameters
   14.3 Priors
        14.3.1 Conjugate Priors
   14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
   16.1 The Bootstrap
        16.1.1 Bootstrap Confidence Intervals
   16.2 Rejection Sampling
   16.3 Importance Sampling
17 Decision Theory
   17.1 Risk
   17.2 Admissibility
   17.3 Bayes Rule
   17.4 Minimax Rules
18 Linear Regression
   18.1 Simple Linear Regression
   18.2 Prediction
   18.3 Multiple Regression
   18.4 Model Selection
19 Non-parametric Function Estimation
   19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
   19.2 Non-parametric Regression
   19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
   20.1 Markov Chains
   20.2 Poisson Processes
21 Time Series
   21.1 Stationary Time Series
   21.2 Estimation of Correlation
   21.3 Non-Stationary Time Series
        21.3.1 Detrending
   21.4 ARIMA models
        21.4.1 Causality and Invertibility
22 Math
   22.1 Orthogonal Functions
   22.2 Series
   22.3 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions

For each distribution: CDF $F_X(x)$, PMF $f_X(x)$, mean $E[X]$, variance $V[X]$, and MGF $M_X(s)$.

Uniform$\{a,\dots,b\}$
  $F_X(x) = 0$ for $x < a$, $\frac{\lfloor x\rfloor - a + 1}{b - a + 1}$ for $a \le x \le b$, $1$ for $x > b$
  $f_X(x) = \frac{I(a \le x \le b)}{b - a + 1}$,  $E[X] = \frac{a+b}{2}$,  $V[X] = \frac{(b-a+1)^2 - 1}{12}$,  $M_X(s) = \frac{e^{as} - e^{(b+1)s}}{(b-a+1)(1 - e^{s})}$

Bernoulli$(p)$
  $F_X(x) = (1-p)^{1-x}$ on $x \in \{0, 1\}$,  $f_X(x) = p^x (1-p)^{1-x}$,  $E[X] = p$,  $V[X] = p(1-p)$,  $M_X(s) = 1 - p + p e^s$

Binomial$(n, p)$
  $F_X(x) = I_{1-p}(n - x, x + 1)$,  $f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$,  $E[X] = np$,  $V[X] = np(1-p)$,  $M_X(s) = (1 - p + p e^s)^n$

Multinomial$(n, p)$
  $f_X(x) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^{k} x_i = n$,  $E[X_i] = n p_i$,  $V[X_i] = n p_i (1 - p_i)$,  $M_X(s) = \left( \sum_{i=1}^{k} p_i e^{s_i} \right)^n$

Hypergeometric$(N, m, n)$
  $F_X(x) \approx \Phi\!\left( \frac{x - np}{\sqrt{np(1-p)}} \right)$ with $p = m/N$,  $f_X(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$,  $E[X] = \frac{nm}{N}$,  $V[X] = \frac{nm(N-n)(N-m)}{N^2(N-1)}$,  $M_X(s)$: N/A

NegativeBinomial$(r, p)$
  $F_X(x) = I_p(r, x + 1)$,  $f_X(x) = \binom{x + r - 1}{r - 1} p^r (1-p)^x$,  $E[X] = \frac{r(1-p)}{p}$,  $V[X] = \frac{r(1-p)}{p^2}$,  $M_X(s) = \left( \frac{p}{1 - (1-p)e^s} \right)^r$

Geometric$(p)$
  $F_X(x) = 1 - (1-p)^x$ ($x \in \mathbb{N}^+$),  $f_X(x) = p(1-p)^{x-1}$ ($x \in \mathbb{N}^+$),  $E[X] = \frac{1}{p}$,  $V[X] = \frac{1-p}{p^2}$,  $M_X(s) = \frac{p e^s}{1 - (1-p)e^s}$

Poisson$(\lambda)$
  $F_X(x) = e^{-\lambda} \sum_{i=0}^{\lfloor x\rfloor} \frac{\lambda^i}{i!}$,  $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$,  $E[X] = \lambda$,  $V[X] = \lambda$,  $M_X(s) = e^{\lambda(e^s - 1)}$
[Figure: PMFs of the discrete Uniform, Binomial (n=40, p=0.3; n=30, p=0.6; n=25, p=0.9), Geometric (p=0.2, 0.5, 0.8), and Poisson (lambda = 1, 4, 10) distributions.]
1.2 Continuous Distributions

Uniform$(a, b)$
  $F_X(x) = 0$ for $x < a$, $\frac{x-a}{b-a}$ for $a < x < b$, $1$ for $x > b$
  $f_X(x) = \frac{I(a < x < b)}{b - a}$,  $E[X] = \frac{a+b}{2}$,  $V[X] = \frac{(b-a)^2}{12}$,  $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$

Normal$(\mu, \sigma^2)$
  $F_X(x) = \Phi\!\left( \frac{x-\mu}{\sigma} \right)$ where $\Phi(x) = \int_{-\infty}^{x} \phi(t)\, dt$,  $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$
  $E[X] = \mu$,  $V[X] = \sigma^2$,  $M_X(s) = \exp\!\left( \mu s + \frac{\sigma^2 s^2}{2} \right)$

Log-Normal$(\mu, \sigma^2)$
  $F_X(x) = \frac{1}{2} + \frac{1}{2}\operatorname{erf}\!\left( \frac{\ln x - \mu}{\sqrt{2\sigma^2}} \right)$,  $f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)$,  $E[X] = e^{\mu + \sigma^2/2}$,  $V[X] = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$

Multivariate Normal$(\mu, \Sigma)$
  $f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2} \exp\!\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right)$,  $E[X] = \mu$,  $V[X] = \Sigma$,  $M_X(s) = \exp\!\left( \mu^T s + \frac{1}{2} s^T \Sigma s \right)$

Chi-square$(k)$
  $F_X(x) = \frac{1}{\Gamma(k/2)}\gamma\!\left( \frac{k}{2}, \frac{x}{2} \right)$,  $f_X(x) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$,  $E[X] = k$,  $V[X] = 2k$,  $M_X(s) = (1 - 2s)^{-k/2}$ for $s < 1/2$

Exponential$(\beta)$
  $F_X(x) = 1 - e^{-x/\beta}$,  $f_X(x) = \frac{1}{\beta} e^{-x/\beta}$,  $E[X] = \beta$,  $V[X] = \beta^2$,  $M_X(s) = \frac{1}{1 - \beta s}$ ($s < 1/\beta$)

Gamma$(\alpha, \beta)$
  $F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$,  $f_X(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} e^{-x/\beta}$,  $E[X] = \alpha\beta$,  $V[X] = \alpha\beta^2$,  $M_X(s) = \left( \frac{1}{1 - \beta s} \right)^{\alpha}$ ($s < 1/\beta$)

InverseGamma$(\alpha, \beta)$
  $F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$,  $f_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{-\alpha-1} e^{-\beta/x}$,  $E[X] = \frac{\beta}{\alpha - 1}$ ($\alpha > 1$),  $V[X] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$),  $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_{\alpha}\!\left( \sqrt{-4\beta s} \right)$

Dirichlet$(\alpha)$
  $f_X(x) = \frac{\Gamma\!\left( \sum_{i=1}^{k}\alpha_i \right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$,  $E[X_i] = \frac{\alpha_i}{\sum_{j=1}^{k}\alpha_j}$,  $V[X_i] = \frac{E[X_i](1 - E[X_i])}{\sum_{i=1}^{k}\alpha_i + 1}$

Beta$(\alpha, \beta)$
  $F_X(x) = I_x(\alpha, \beta)$,  $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$,  $E[X] = \frac{\alpha}{\alpha+\beta}$,  $V[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
  $M_X(s) = 1 + \sum_{k=1}^{\infty}\left( \prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r} \right)\frac{s^k}{k!}$

Weibull$(\lambda, k)$
  $F_X(x) = 1 - e^{-(x/\lambda)^k}$,  $f_X(x) = \frac{k}{\lambda}\left( \frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k}$,  $E[X] = \lambda\Gamma\!\left( 1 + \frac{1}{k} \right)$,  $V[X] = \lambda^2\Gamma\!\left( 1 + \frac{2}{k} \right) - E[X]^2$
  $M_X(s) = \sum_{n=0}^{\infty}\frac{s^n\lambda^n}{n!}\Gamma\!\left( 1 + \frac{n}{k} \right)$

Pareto$(x_m, \alpha)$
  $F_X(x) = 1 - \left( \frac{x_m}{x} \right)^{\alpha}$ for $x \ge x_m$,  $f_X(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}$ for $x \ge x_m$,  $E[X] = \frac{\alpha x_m}{\alpha - 1}$ ($\alpha > 1$),  $V[X] = \frac{x_m^2\alpha}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$)
  $M_X(s) = \alpha(-x_m s)^{\alpha}\Gamma(-\alpha, -x_m s)$ for $s < 0$
[Figure: PDFs of the continuous Uniform, Normal, Log-normal, Chi-square (k = 1, ..., 5), Exponential (beta = 2, 1, 0.4), Gamma, InverseGamma, Beta, Weibull, and Pareto distributions for selected parameter values.]
2 Probability Theory

Definitions
- Sample space $\Omega$
- Outcome (point or element) $\omega \in \Omega$
- Event $A \subseteq \Omega$
- $\sigma$-algebra $\mathcal{A}$:
  1. $\emptyset \in \mathcal{A}$
  2. $A_1, A_2, \ldots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
  3. $A \in \mathcal{A} \implies A^c \in \mathcal{A}$
- Probability distribution $P$:
  1. $P[A] \ge 0$ for every $A$
  2. $P[\Omega] = 1$
  3. For disjoint $A_1, A_2, \ldots$: $P\!\left[ \bigcup_{i=1}^{\infty} A_i \right] = \sum_{i=1}^{\infty} P[A_i]$
- Probability space $(\Omega, \mathcal{A}, P)$

Properties
- $P[\emptyset] = 0$
- $B = B \cap \Omega = B \cap (A \cup A^c) = (A \cap B) \cup (A^c \cap B)$
- $P[A^c] = 1 - P[A]$
- $P[B] = P[A \cap B] + P[A^c \cap B]$
- De Morgan: $\left( \bigcup_n A_n \right)^c = \bigcap_n A_n^c$ and $\left( \bigcap_n A_n \right)^c = \bigcup_n A_n^c$
- $P\!\left[ \bigcup_n A_n \right] = 1 - P\!\left[ \bigcap_n A_n^c \right]$
- $P[A \cup B] = P[A] + P[B] - P[A \cap B]$, hence $P[A \cup B] \le P[A] + P[B]$
- $P[A \cup B] = P[A \cap B^c] + P[A^c \cap B] + P[A \cap B]$
- $P[A \cap B^c] = P[A] - P[A \cap B]$
- Continuity: $A_1 \subset A_2 \subset \cdots$ and $A = \bigcup_{n=1}^{\infty} A_n \implies P[A_n] \to P[A]$; $A_1 \supset A_2 \supset \cdots$ and $A = \bigcap_{n=1}^{\infty} A_n \implies P[A_n] \to P[A]$

Independence
$A \perp B \iff P[A \cap B] = P[A]\, P[B]$

Conditional Probability
$P[A \mid B] = \frac{P[A \cap B]}{P[B]}$ if $P[B] > 0$

Law of Total Probability
$P[B] = \sum_{i=1}^{n} P[B \mid A_i]\, P[A_i]$ where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

Bayes' Theorem
$P[A_i \mid B] = \frac{P[B \mid A_i]\, P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\, P[A_j]}$ where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

Inclusion-Exclusion Principle
$P\!\left[ \bigcup_{i=1}^{n} A_i \right] = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i_1 < \cdots < i_r \le n} P\!\left[ A_{i_1} \cap \cdots \cap A_{i_r} \right]$
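As a quick numerical illustration of Bayes' theorem for a binary partition, here is a minimal Python sketch; the diagnostic-test numbers (1% prevalence, 99% sensitivity, 5% false-positive rate) are made up for the example.

```python
# Bayes' theorem for a partition {A, A^c}:
# P[A | B] = P[B | A] P[A] / (P[B | A] P[A] + P[B | A^c] P[A^c])

def bayes_posterior(prior, p_b_given_a, p_b_given_not_a):
    """Posterior P[A | B] from the prior P[A] and the two conditional probabilities."""
    numerator = p_b_given_a * prior
    evidence = numerator + p_b_given_not_a * (1.0 - prior)  # law of total probability
    return numerator / evidence

# Hypothetical test: prevalence 1%, sensitivity 99%, false-positive rate 5%.
print(bayes_posterior(prior=0.01, p_b_given_a=0.99, p_b_given_not_a=0.05))  # ~0.167
```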

3 Random Variables

Random Variable
$X : \Omega \to \mathbb{R}$

Probability Mass Function (PMF)
$f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]$

Probability Density Function (PDF)
$P[a \le X \le b] = \int_a^b f_X(x)\, dx$

Cumulative Distribution Function (CDF)
$F_X : \mathbb{R} \to [0, 1]$,  $F_X(x) = P[X \le x]$
1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
2. Normalized: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
3. Right-continuous: $\lim_{y \downarrow x} F(y) = F(x)$

Conditional density
$P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\, dy$ for $a \le b$,  $f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$

Independence
1. $P[X \le x, Y \le y] = P[X \le x]\, P[Y \le y]$
2. $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$
3.1 Transformations

Transformation function
$Z = \varphi(X)$

Discrete
$f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P\!\left[ X \in \varphi^{-1}(z) \right] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous
$F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x)\, dx$ with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ is strictly monotone
$f_Z(z) = f_X\!\left( \varphi^{-1}(z) \right)\left| \frac{d}{dz}\varphi^{-1}(z) \right| = f_X(x)\left| \frac{dx}{dz} \right| = f_X(x)\frac{1}{|J|}$

The Rule of the Lazy Statistician
$E[Z] = \int \varphi(x)\, dF_X(x)$
$E[I_A(X)] = \int I_A(x)\, dF_X(x) = \int_A dF_X(x) = P[X \in A]$

Convolution
$Z := X + Y$:  $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\, dx$, which for $X, Y \ge 0$ equals $\int_0^z f_{X,Y}(x, z - x)\, dx$
$Z := |X - Y|$:  $f_Z(z) = 2\int_0^{\infty} f_{X,Y}(x, z + x)\, dx$
$Z := \frac{X}{Y}$:  $f_Z(z) = \int_{-\infty}^{\infty} |y| f_{X,Y}(yz, y)\, dy$, which under $X \perp Y$ equals $\int_{-\infty}^{\infty} |y| f_X(yz) f_Y(y)\, dy$
4 Expectation

Expectation
$E[X] = \mu_X = \int x\, dF_X(x) = \sum_x x f_X(x)$ ($X$ discrete) or $\int x f_X(x)\, dx$ ($X$ continuous)

Properties
- $P[X = c] = 1 \implies E[c] = c$
- $E[cX] = c\, E[X]$
- $E[X + Y] = E[X] + E[Y]$
- $E[XY] = \iint xy\, f_{X,Y}(x, y)\, dx\, dy$
- $E[\varphi(X)] \ne \varphi(E[X])$ in general (cf. Jensen's inequality)
- $P[X \ge Y] = 1 \implies E[X] \ge E[Y]$;  $P[X = Y] = 1 \implies E[X] = E[Y]$
- For non-negative integer-valued $X$: $E[X] = \sum_{x=1}^{\infty} P[X \ge x]$

Sample mean
$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$

Conditional Expectation
- $E[Y \mid X = x] = \int y f(y \mid x)\, dy$
- $E[X] = E[E[X \mid Y]]$
- $E[\varphi(X, Y) \mid X = x] = \int \varphi(x, y) f_{Y|X}(y \mid x)\, dy$
- $E[\varphi(Y, Z) \mid X = x] = \iint \varphi(y, z) f_{(Y,Z)|X}(y, z \mid x)\, dy\, dz$
- $E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]$
- $E[\varphi(X)Y \mid X] = \varphi(X)\, E[Y \mid X]$
- $E[Y \mid X] = c \implies \operatorname{Cov}[X, Y] = 0$
5 Variance

Variance
$V[X] = \sigma_X^2 = E\!\left[ (X - E[X])^2 \right] = E[X^2] - E[X]^2$
$V\!\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i] + 2\sum_{i \ne j}\operatorname{Cov}[X_i, X_j]$
$V\!\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i]$ if the $X_i$ are independent

Standard deviation
$\operatorname{sd}[X] = \sqrt{V[X]} = \sigma_X$

Covariance
- $\operatorname{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\, E[Y]$
- $\operatorname{Cov}[X, a] = 0$
- $\operatorname{Cov}[X, X] = V[X]$
- $\operatorname{Cov}[X, Y] = \operatorname{Cov}[Y, X]$
- $\operatorname{Cov}[aX, bY] = ab\,\operatorname{Cov}[X, Y]$
- $\operatorname{Cov}[X + a, Y + b] = \operatorname{Cov}[X, Y]$
- $\operatorname{Cov}\!\left[ \sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j \right] = \sum_{i=1}^{n}\sum_{j=1}^{m}\operatorname{Cov}[X_i, Y_j]$

Correlation
$\rho[X, Y] = \frac{\operatorname{Cov}[X, Y]}{\sqrt{V[X]\, V[Y]}}$

Independence
$X \perp Y \implies \rho[X, Y] = 0 \iff \operatorname{Cov}[X, Y] = 0 \iff E[XY] = E[X]\, E[Y]$

Sample variance
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$

Conditional Variance
$V[Y \mid X] = E\!\left[ (Y - E[Y \mid X])^2 \mid X \right] = E[Y^2 \mid X] - E[Y \mid X]^2$
$V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$
6 Inequalities

Cauchy-Schwarz:  $E[XY]^2 \le E[X^2]\, E[Y^2]$
Markov:  $P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}$ for non-negative $\varphi$
Chebyshev:  $P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}$
Chernoff:  $P[X \ge (1 + \delta)\mu] \le \left( \frac{e^{\delta}}{(1 + \delta)^{1+\delta}} \right)^{\mu}$ for $\delta > -1$
Jensen:  $E[\varphi(X)] \ge \varphi(E[X])$ for convex $\varphi$
7 Distribution Relationships

Binomial
- $X_i \sim \text{Bernoulli}(p) \implies \sum_{i=1}^{n} X_i \sim \text{Bin}(n, p)$
- $X \sim \text{Bin}(n, p)$, $Y \sim \text{Bin}(m, p)$, $X \perp Y \implies X + Y \sim \text{Bin}(n + m, p)$
- $\lim_{n \to \infty}\text{Bin}(n, p) = \text{Poisson}(np)$ ($n$ large, $p$ small)
- $\text{Bin}(n, p) \approx \mathcal{N}(np, np(1-p))$ ($n$ large, $p$ far from 0 and 1)

Negative Binomial
- $X \sim \text{NBin}(1, p) = \text{Geometric}(p)$
- $X \sim \text{NBin}(r, p)$ is the sum of $r$ iid $\text{Geometric}(p)$ rvs
- $X_i \sim \text{NBin}(r_i, p)$, independent $\implies \sum_i X_i \sim \text{NBin}\!\left( \sum_i r_i, p \right)$
- $X \sim \text{NBin}(r, p)$, $Y \sim \text{Bin}(s + r, p) \implies P[X \le s] = P[Y \ge r]$

Poisson
- $X_i \sim \text{Poisson}(\lambda_i)$, $X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \text{Poisson}\!\left( \sum_{i=1}^{n}\lambda_i \right)$
- $X_i \sim \text{Poisson}(\lambda_i)$, $X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \text{Bin}\!\left( \sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n}\lambda_j} \right)$

Exponential
- $X_i \sim \text{Exp}(\beta)$, $X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \text{Gamma}(n, \beta)$
- Memoryless property: $P[X > x + y \mid X > y] = P[X > x]$

Normal
- $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \sigma^2)$, $Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
- $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$, $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$, $X \perp Y \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim \mathcal{N}\!\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)$
- $P[a < X \le b] = \Phi\!\left( \frac{b - \mu}{\sigma} \right) - \Phi\!\left( \frac{a - \mu}{\sigma} \right)$
- $\Phi(-x) = 1 - \Phi(x)$,  $\phi'(x) = -x\phi(x)$,  $\phi''(x) = (x^2 - 1)\phi(x)$
- Upper quantile of $\mathcal{N}(0, 1)$: $z_{\alpha} = \Phi^{-1}(1 - \alpha)$

Gamma (distribution)
- $X \sim \text{Gamma}(\alpha, \beta) \iff X/\beta \sim \text{Gamma}(\alpha, 1)$
- $\text{Gamma}(\alpha, \beta)$ with integer $\alpha$ is the sum of $\alpha$ iid $\text{Exp}(\beta)$ rvs
- $X_i \sim \text{Gamma}(\alpha_i, \beta)$, $X_i \perp X_j \implies \sum_i X_i \sim \text{Gamma}\!\left( \sum_i \alpha_i, \beta \right)$
- $\frac{\Gamma(\alpha)}{\beta^{\alpha}} = \int_0^{\infty} x^{\alpha-1} e^{-\beta x}\, dx$

Gamma (function)
- Ordinary: $\Gamma(s) = \int_0^{\infty} t^{s-1} e^{-t}\, dt$
- Upper incomplete: $\Gamma(s, x) = \int_x^{\infty} t^{s-1} e^{-t}\, dt$
- Lower incomplete: $\gamma(s, x) = \int_0^x t^{s-1} e^{-t}\, dt$
- $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$ for $\alpha > 0$
- $\Gamma(n) = (n-1)!$ for $n \in \mathbb{N}$
- $\Gamma(1/2) = \sqrt{\pi}$

Beta (distribution)
- $f_X(x) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$
- $E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1} E[X^{k-1}]$
- $\text{Beta}(1, 1) \sim \text{Unif}(0, 1)$

Beta (function)
- Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1}(1-t)^{y-1}\, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$
- Incomplete: $B(x; a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\, dt$
- Regularized incomplete: $I_x(a, b) = \frac{B(x; a, b)}{B(a, b)}$, which for $a, b \in \mathbb{N}$ equals $\sum_{j=a}^{a+b-1}\frac{(a+b-1)!}{j!(a+b-1-j)!} x^j (1-x)^{a+b-1-j}$
- $I_0(a, b) = 0$,  $I_1(a, b) = 1$,  $I_x(a, b) = 1 - I_{1-x}(b, a)$
8 Probability and Moment Generating Functions

- PGF: $G_X(t) = E[t^X]$ for $|t| < 1$
- MGF: $M_X(t) = G_X(e^t) = E[e^{Xt}] = E\!\left[ \sum_{i=0}^{\infty}\frac{(Xt)^i}{i!} \right] = \sum_{i=0}^{\infty}\frac{E[X^i]}{i!} t^i$
- $P[X = 0] = G_X(0)$
- $P[X = 1] = G'_X(0)$
- $P[X = i] = \frac{G_X^{(i)}(0)}{i!}$
- $E[X] = G'_X(1^-)$
- $E[X^k] = M_X^{(k)}(0)$
- $E\!\left[ \frac{X!}{(X-k)!} \right] = G_X^{(k)}(1^-)$
- $V[X] = G''_X(1^-) + G'_X(1^-) - \left( G'_X(1^-) \right)^2$
- $G_X(t) = G_Y(t) \implies X \stackrel{d}{=} Y$
9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp Z$, and set $Y = \rho X + \sqrt{1 - \rho^2}\, Z$.

Joint density
$f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\exp\!\left( -\frac{x^2 + y^2 - 2\rho xy}{2(1 - \rho^2)} \right)$

Conditionals
$(Y \mid X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$ and $(X \mid Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$

Independence
$X \perp Y \iff \rho = 0$

9.2 Bivariate Normal

Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$ with correlation $\rho$.
$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}}\exp\!\left( -\frac{z}{2(1 - \rho^2)} \right)$
$z = \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho\left( \frac{x - \mu_x}{\sigma_x} \right)\left( \frac{y - \mu_y}{\sigma_y} \right)$

Conditional mean and variance
$E[X \mid Y] = E[X] + \rho\frac{\sigma_X}{\sigma_Y}(Y - E[Y])$
$V[X \mid Y] = \sigma_X^2(1 - \rho^2)$

9.3 Multivariate Normal

Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)
$\Sigma = \begin{pmatrix} V[X_1] & \cdots & \operatorname{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \operatorname{Cov}[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}$

If $X \sim \mathcal{N}(\mu, \Sigma)$,
$f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2}\exp\!\left( -\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) \right)$

Properties
- $Z \sim \mathcal{N}(0, 1)$ and $X = \mu + \Sigma^{1/2}Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, I)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
- $X \sim \mathcal{N}(\mu, \Sigma)$, $a$ a vector of length $k \implies a^T X \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$
10 Convergence
Let {X
1
, X
2
, . . .} be a sequence of rvs and let X be another rv. Let F
n
denote
the cdf of X
n
and let F denote the cdf of X.
Types of Convergence
1. In distribution (weakly, in law): X
n
D
X
lim
n
F
n
(t) = F(t) t where F continuous
2. In probability: X
n
P
X
( > 0) lim
n
P [|X
n
X| > ] = 0
3. Almost surely (strongly): X
n
as
X
P

lim
n
X
n
= X

= P

: lim
n
X
n
() = X()

= 1
4. In quadratic mean (L
2
): X
n
qm
X
lim
n
E

(X
n
X)
2

= 0
Relationships
X
n
qm
X = X
n
P
X = X
n
D
X
X
n
as
X = X
n
P
X
X
n
D
X (c R) P [X = c] = 1 = X
n
P
X
X
n
P
X Y
n
P
Y = X
n
+Y
n
P
X +Y
X
n
qm
X Y
n
qm
Y = X
n
+Y
n
qm
X +Y
X
n
P
X Y
n
P
Y = X
n
Y
n
P
XY
X
n
P
X = (X
n
)
P
(X)
X
n
D
X = (X
n
)
D
(X)
X
n
qm
b lim
n
E [X
n
] = b lim
n
V [X
n
] = 0
X
1
, . . . , X
n
iid E [X] = V [X] <

X
n
qm

Slutzkys Theorem
X
n
D
X and Y
n
P
c = X
n
+Y
n
D
X +c
X
n
D
X and Y
n
P
c = X
n
Y
n
D
cX
In general: X
n
D
X and Y
n
D
Y = X
n
+Y
n
D
X +Y
10.1 Law of Large Numbers (LLN)
Let {X
1
, . . . , X
n
} be a sequence of iid rvs, E [X
1
] = , and V [X
1
] < .
Weak (WLLN)

X
n
P
as n
Strong (SLLN)

X
n
as
as n
10.2 Central Limit Theorem (CLT)
Let {X
1
, . . . , X
n
} be a sequence of iid rvs, E [X
1
] = , and V [X
1
] =
2
.
Z
n
:=

X
n

X
n

n(

X
n
)

D
Z where Z N (0, 1)
lim
n
P [Z
n
z] = (z) z R
CLT Notations
Z
n
N (0, 1)

X
n
N

,

2
n

X
n
N

0,

2
n

n(

X
n
) N

0,
2

n(

X
n
)
n
N (0, 1)
Continuity Correction
P

X
n
x

x +
1
2

/

X
n
x

x
1
2

/

Delta Method
Y
n
N

,

2
n

= (Y
n
) N

(), (

())
2
2
n

10
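A quick simulation is a useful sanity check of the CLT. This is a minimal numpy sketch; the Exponential(beta = 2) population is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta = 50, 10_000, 2.0           # sample size, replications, Exp(beta) scale
mu, sigma = beta, beta                     # for Exponential(beta): E[X] = beta, sd(X) = beta

xbar = rng.exponential(beta, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma       # standardized sample means Z_n

# Should be close to mean 0, sd 1, and upper quantile ~1.96 despite the skewed population.
print(z.mean(), z.std(), np.quantile(z, 0.975))
```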
11 Statistical Inference

Let $X_1, \ldots, X_n \stackrel{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation

- Point estimator $\hat{\theta}_n$ of $\theta$ is a rv: $\hat{\theta}_n = g(X_1, \ldots, X_n)$
- Bias: $\operatorname{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta$
- Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
- Sampling distribution: $F(\hat{\theta}_n)$
- Standard error: $\operatorname{se}(\hat{\theta}_n) = \sqrt{V[\hat{\theta}_n]}$
- Mean squared error: $\operatorname{mse} = E\!\left[ (\hat{\theta}_n - \theta)^2 \right] = \operatorname{bias}(\hat{\theta}_n)^2 + V[\hat{\theta}_n]$
- $\lim_{n \to \infty}\operatorname{bias}(\hat{\theta}_n) = 0$ and $\lim_{n \to \infty}\operatorname{se}(\hat{\theta}_n) = 0 \implies \hat{\theta}_n$ is consistent
- Asymptotic normality: $\frac{\hat{\theta}_n - \theta}{\operatorname{se}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
- Slutsky's Theorem often lets us replace $\operatorname{se}(\hat{\theta}_n)$ by some (weakly) consistent estimator $\hat{\sigma}_n$.
11.2 Normal-based Confidence Interval

Suppose $\hat{\theta}_n \approx \mathcal{N}(\theta, \widehat{\operatorname{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, i.e., $P[Z > z_{\alpha/2}] = \alpha/2$ and $P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$ where $Z \sim \mathcal{N}(0, 1)$. Then
$C_n = \hat{\theta}_n \pm z_{\alpha/2}\,\widehat{\operatorname{se}}$
11.3 Empirical Distribution Function

Empirical Distribution Function (ECDF)
$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$,  $I(X_i \le x) = 1$ if $X_i \le x$ and $0$ if $X_i > x$

Properties (for any fixed $x$)
- $E[\hat{F}_n(x)] = F(x)$
- $V[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}$
- $\operatorname{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
- $\hat{F}_n(x) \stackrel{P}{\to} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality ($X_1, \ldots, X_n \sim F$)
$P\!\left[ \sup_x\left| F(x) - \hat{F}_n(x) \right| > \varepsilon \right] \le 2e^{-2n\varepsilon^2}$

Nonparametric $1 - \alpha$ confidence band for $F$
$L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}$,  $U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}$,  $\varepsilon_n = \sqrt{\frac{1}{2n}\log\frac{2}{\alpha}}$
$P[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$
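The ECDF and its DKW band are easy to compute directly. Below is a minimal numpy sketch; the standard normal sample and the 95% level are illustrative choices.

```python
import numpy as np

def ecdf_with_dkw_band(x, alpha=0.05):
    """ECDF at the sorted sample points, plus the 1 - alpha DKW confidence band."""
    x = np.sort(np.asarray(x))
    n = x.size
    F_hat = np.arange(1, n + 1) / n                   # ECDF at the order statistics
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))    # DKW half-width epsilon_n
    lower = np.clip(F_hat - eps, 0.0, 1.0)
    upper = np.clip(F_hat + eps, 0.0, 1.0)
    return x, F_hat, lower, upper

rng = np.random.default_rng(1)
grid, F_hat, lo, hi = ecdf_with_dkw_band(rng.normal(size=200))
print(F_hat[:5], lo[:5], hi[:5])
```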


11.4 Statistical Functionals

- Statistical functional: $T(F)$
- Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
- Linear functional: $T(F) = \int \varphi(x)\, dF_X(x)$
- Plug-in estimator for a linear functional: $T(\hat{F}_n) = \int \varphi(x)\, d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\varphi(X_i)$
- Often $T(\hat{F}_n) \approx \mathcal{N}\!\left( T(F), \widehat{\operatorname{se}}^2 \right)$, giving the interval $T(\hat{F}_n) \pm z_{\alpha/2}\,\widehat{\operatorname{se}}$
- $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
- $\hat{\mu} = \bar{X}_n$
- $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$
- Skewness: $\hat{\kappa} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
- Correlation: $\hat{\rho} = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2}}$
12 Parametric Inference

Let $\mathcal{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subseteq \mathbb{R}^k$ and parameter $\theta = (\theta_1, \ldots, \theta_k)$.
12.1 Method of Moments

$j^{\text{th}}$ moment
$\alpha_j(\theta) = E[X^j] = \int x^j\, dF_X(x)$

$j^{\text{th}}$ sample moment
$\hat{\alpha}_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j$

Method of Moments estimator (MoM): solve the system
$\alpha_1(\theta) = \hat{\alpha}_1,\quad \alpha_2(\theta) = \hat{\alpha}_2,\quad \ldots,\quad \alpha_k(\theta) = \hat{\alpha}_k$

Properties of the MoM estimator
- $\hat{\theta}_n$ exists with probability tending to 1
- Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
- Asymptotic normality: $\sqrt{n}(\hat{\theta}_n - \theta) \stackrel{D}{\to} \mathcal{N}(0, \Sigma)$
  where $\Sigma = g\, E[YY^T]\, g^T$, $Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)$ and $g_j = \frac{\partial\alpha_j^{-1}(\theta)}{\partial\theta}$
12.2 Maximum Likelihood

Likelihood: $\mathcal{L}_n : \Theta \to [0, \infty)$
$\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$

Log-likelihood
$\ell_n(\theta) = \log\mathcal{L}_n(\theta) = \sum_{i=1}^{n}\log f(X_i; \theta)$

Maximum Likelihood Estimator (mle)
$\mathcal{L}_n(\hat{\theta}_n) = \sup_{\theta}\mathcal{L}_n(\theta)$

Score function
$s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

Fisher information
$I(\theta) = V_{\theta}[s(X; \theta)]$,  $I_n(\theta) = nI(\theta)$

Fisher information (exponential family)
$I(\theta) = -E_{\theta}\!\left[ \frac{\partial}{\partial\theta} s(X; \theta) \right]$

Observed Fisher information
$I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i; \theta)$

Properties of the mle
- Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
- Equivariance: $\hat{\theta}_n$ is the mle of $\theta \implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
- Asymptotic normality:
  1. $\operatorname{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat{\theta}_n - \theta}{\operatorname{se}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
  2. $\widehat{\operatorname{se}} \approx \sqrt{1/I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n - \theta}{\widehat{\operatorname{se}}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
- Asymptotic optimality (smallest variance for large samples): if $\tilde{\theta}_n$ is any other estimator, then $\operatorname{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{V[\hat{\theta}_n]}{V[\tilde{\theta}_n]} \le 1$
- Approximately the Bayes estimator
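A minimal numpy sketch of the generic recipe (maximize the log-likelihood numerically, then attach a normal-based interval via the Fisher information), using a Poisson model where the answer is known in closed form; the grid search is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.0, size=500)          # data from Poisson(3)

def loglik(lam):
    # Poisson log-likelihood up to an additive constant (the log x! terms drop out)
    return np.sum(x) * np.log(lam) - x.size * lam

grid = np.linspace(0.1, 10.0, 10_000)
lam_hat = grid[np.argmax([loglik(l) for l in grid])]   # numerical mle

# Closed-form mle and normal-based se from the Fisher information I_n(lam) = n/lam
lam_closed = x.mean()
se_hat = np.sqrt(lam_closed / x.size)
print(lam_hat, lam_closed, (lam_closed - 1.96 * se_hat, lam_closed + 1.96 * se_hat))
```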
12.2.1 Delta Method

If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
$\frac{\hat{\tau}_n - \tau}{\widehat{\operatorname{se}}(\hat{\tau}_n)} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\hat{\tau}_n = \varphi(\hat{\theta}_n)$ is the mle of $\tau$ and $\widehat{\operatorname{se}}(\hat{\tau}_n) = |\varphi'(\hat{\theta}_n)|\,\widehat{\operatorname{se}}(\hat{\theta}_n)$.
12.3 Multiparameter Models

Let $\theta = (\theta_1, \ldots, \theta_k)$ and let $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$ be the mle.
$H_{jj} = \frac{\partial^2\ell_n}{\partial\theta_j^2}$,  $H_{jk} = \frac{\partial^2\ell_n}{\partial\theta_j\partial\theta_k}$

Fisher information matrix
$I_n(\theta) = -\begin{pmatrix} E_{\theta}[H_{11}] & \cdots & E_{\theta}[H_{1k}] \\ \vdots & \ddots & \vdots \\ E_{\theta}[H_{k1}] & \cdots & E_{\theta}[H_{kk}] \end{pmatrix}$

Under appropriate regularity conditions
$(\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n)$
with $J_n(\theta) = I_n^{-1}(\theta)$. Further, if $\hat{\theta}_j$ is the $j^{\text{th}}$ component of $\hat{\theta}$, then
$\frac{\hat{\theta}_j - \theta_j}{\widehat{\operatorname{se}}_j} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\widehat{\operatorname{se}}_j^2 = J_n(j, j)$ and $\operatorname{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$.
12.3.1 Multiparameter Delta Method

Let $\tau = \varphi(\theta_1, \ldots, \theta_k)$ and let the gradient of $\varphi$ be
$\nabla\varphi = \left( \frac{\partial\varphi}{\partial\theta_1}, \ldots, \frac{\partial\varphi}{\partial\theta_k} \right)^T$

Suppose $\nabla\varphi\big|_{\theta=\hat{\theta}} \ne 0$ and let $\hat{\tau} = \varphi(\hat{\theta})$. Then
$\frac{\hat{\tau} - \tau}{\widehat{\operatorname{se}}(\hat{\tau})} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\widehat{\operatorname{se}}(\hat{\tau}) = \sqrt{\left( \hat{\nabla}\varphi \right)^T\hat{J}_n\left( \hat{\nabla}\varphi \right)}$, $\hat{J}_n = J_n(\hat{\theta})$, and $\hat{\nabla}\varphi = \nabla\varphi\big|_{\theta=\hat{\theta}}$.
12.4 Parametric Bootstrap

Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or the method of moments estimator.
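A minimal sketch of the parametric bootstrap in numpy, assuming an Exponential model for the data; the statistic (the scale mle) and the normal-based interval are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100)    # observed data, modeled as Exponential(beta)

beta_hat = x.mean()                         # mle of the scale parameter
B = 2000
boot = np.empty(B)
for b in range(B):
    # Parametric bootstrap: resample from the *fitted* model f(x; beta_hat), not from F_hat_n
    x_star = rng.exponential(scale=beta_hat, size=x.size)
    boot[b] = x_star.mean()                 # re-estimate on each synthetic sample

se_boot = boot.std(ddof=1)
print(beta_hat, se_boot, (beta_hat - 1.96 * se_boot, beta_hat + 1.96 * se_boot))
```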
13 Hypothesis Testing

$H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$

Definitions
- Null hypothesis $H_0$
- Alternative hypothesis $H_1$
- Simple hypothesis: $\theta = \theta_0$
- Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
- Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
- One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
- Critical value $c$
- Test statistic $T$
- Rejection region $R = \{x : T(x) > c\}$
- Power function $\beta(\theta) = P_{\theta}[X \in R]$
- Power of a test: $1 - P[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1}\beta(\theta)$
- Test size: $\alpha = P[\text{Type I error}] = \sup_{\theta \in \Theta_0}\beta(\theta)$

              Retain H_0               Reject H_0
  H_0 true    correct                  Type I error (alpha)
  H_1 true    Type II error (beta)     correct (power)

p-value
$\text{p-value} = \sup_{\theta \in \Theta_0} P_{\theta}[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_{\alpha}\}$
$\text{p-value} = \sup_{\theta \in \Theta_0} P_{\theta}[T(X^{\star}) \ge T(X)] = 1 - F(T(X))$ since $T(X^{\star}) \sim F$

  p-value      evidence
  < 0.01       very strong evidence against H_0
  0.01-0.05    strong evidence against H_0
  0.05-0.1     weak evidence against H_0
  > 0.1        little or no evidence against H_0

Wald Test
- Two-sided test
- Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat{\theta} - \theta_0}{\widehat{\operatorname{se}}}$
- $P\!\left[ |W| > z_{\alpha/2} \right] \to \alpha$
- p-value $= P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood Ratio Test (LRT)
$T(X) = \frac{\sup_{\theta \in \Theta}\mathcal{L}_n(\theta)}{\sup_{\theta \in \Theta_0}\mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat{\theta}_n)}{\mathcal{L}_n(\hat{\theta}_{n,0})}$
$\lambda(X) = 2\log T(X) \stackrel{D}{\to} \chi^2_{r-q}$, where $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ with $Z_1, \ldots, Z_k \stackrel{iid}{\sim} \mathcal{N}(0, 1)$
p-value $= P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P\!\left[ \chi^2_{r-q} > \lambda(x) \right]$

Multinomial LRT
Let $\hat{p}_n = \left( \frac{X_1}{n}, \ldots, \frac{X_k}{n} \right)$ be the mle. Then
$T(X) = \frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^{k}\left( \frac{\hat{p}_j}{p_{0j}} \right)^{X_j}$
$\lambda(X) = 2\sum_{j=1}^{k} X_j\log\left( \frac{\hat{p}_j}{p_{0j}} \right) \stackrel{D}{\to} \chi^2_{k-1}$
The approximate size-$\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$.

Pearson $\chi^2$ Test
$T = \sum_{j=1}^{k}\frac{(X_j - E[X_j])^2}{E[X_j]}$ where $E[X_j] = np_{0j}$ under $H_0$
$T \stackrel{D}{\to} \chi^2_{k-1}$,  p-value $= P\!\left[ \chi^2_{k-1} > T(x) \right]$
The $\chi^2_{k-1}$ approximation sets in faster than for the LRT, hence the Pearson test is preferable for small $n$.

Independence Testing
- $I$ rows, $J$ columns, $X$ a multinomial sample over the $I \cdot J$ cells
- mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
- mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}$
- LRT: $\lambda = 2\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}\log\left( \frac{nX_{ij}}{X_{i\cdot}X_{\cdot j}} \right)$
- Pearson $\chi^2$: $T = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}$
- Both LRT and Pearson $\stackrel{D}{\to} \chi^2_{\nu}$, where $\nu = (I - 1)(J - 1)$
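A minimal numpy sketch of the Pearson chi-square independence statistic for a contingency table; the 2x3 table of counts is hypothetical.

```python
import numpy as np

def pearson_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an I x J contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / n                 # E[X_ij] = X_i. * X_.j / n under independence
    T = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return T, dof

T, dof = pearson_independence([[20, 30, 25], [35, 25, 15]])
print(T, dof)   # compare T against the chi-square(dof) upper-alpha quantile
```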
14 Bayesian Inference

Bayes' Theorem
$f(\theta \mid x^n) = \frac{f(x^n \mid \theta)f(\theta)}{f(x^n)} = \frac{f(x^n \mid \theta)f(\theta)}{\int f(x^n \mid \theta)f(\theta)\, d\theta} \propto \mathcal{L}_n(\theta)f(\theta)$

Definitions
- $X^n = (X_1, \ldots, X_n)$,  $x^n = (x_1, \ldots, x_n)$
- Prior density $f(\theta)$
- Likelihood $f(x^n \mid \theta)$: joint density of the data. In particular, if $X^n$ is iid, $f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \mathcal{L}_n(\theta)$
- Posterior density $f(\theta \mid x^n)$
- Normalizing constant $c_n = f(x^n) = \int f(x^n \mid \theta)f(\theta)\, d\theta$
- Kernel: the part of a density that depends on $\theta$
- Posterior mean $\bar{\theta}_n = \int\theta f(\theta \mid x^n)\, d\theta = \frac{\int\theta\mathcal{L}_n(\theta)f(\theta)\, d\theta}{\int\mathcal{L}_n(\theta)f(\theta)\, d\theta}$
14.1 Credible Intervals

$1 - \alpha$ posterior interval
$P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\, d\theta = 1 - \alpha$

$1 - \alpha$ equal-tail credible interval
$\int_{-\infty}^{a} f(\theta \mid x^n)\, d\theta = \int_b^{\infty} f(\theta \mid x^n)\, d\theta = \alpha/2$

$1 - \alpha$ highest posterior density (HPD) region $R_n$
1. $P[\theta \in R_n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$
$R_n$ is unimodal $\implies R_n$ is an interval.
14.2 Function of Parameters

Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
$H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\, d\theta$

Posterior density
$h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian delta method
$\tau \mid X^n \approx \mathcal{N}\!\left( \varphi(\hat{\theta}),\ \left( \widehat{\operatorname{se}}\,\varphi'(\hat{\theta}) \right)^2 \right)$
14.3 Priors

Choice
- Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
- Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
- Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
- Flat: $f(\theta) \propto$ constant
- Proper: $\int f(\theta)\, d\theta = 1$
- Improper: $\int f(\theta)\, d\theta = \infty$
- Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$; in the multiparameter case $f(\theta) \propto \sqrt{\det(I(\theta))}$
- Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family
14.3.1 Conjugate Priors

Discrete likelihoods (data $x_1, \ldots, x_n$; entries list the posterior hyperparameters)
- Bernoulli$(p)$ with Beta$(\alpha, \beta)$ prior: $\alpha + \sum_{i=1}^{n} x_i$, $\beta + n - \sum_{i=1}^{n} x_i$
- Binomial$(p)$ with Beta$(\alpha, \beta)$ prior: $\alpha + \sum_{i=1}^{n} x_i$, $\beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i$
- Negative Binomial$(p)$ with Beta$(\alpha, \beta)$ prior: $\alpha + rn$, $\beta + \sum_{i=1}^{n} x_i$
- Poisson$(\lambda)$ with Gamma$(\alpha, \beta)$ prior: $\alpha + \sum_{i=1}^{n} x_i$, $\beta + n$
- Multinomial$(p)$ with Dirichlet$(\alpha)$ prior: $\alpha + \sum_{i=1}^{n} x^{(i)}$
- Geometric$(p)$ with Beta$(\alpha, \beta)$ prior: $\alpha + n$, $\beta + \sum_{i=1}^{n} x_i$

Continuous likelihoods (subscript $c$ denotes a known constant)
- Uniform$(0, \theta)$ with Pareto$(x_m, k)$ prior: $\max\{x_{(n)}, x_m\}$, $k + n$
- Exponential$(\lambda)$ with Gamma$(\alpha, \beta)$ prior: $\alpha + n$, $\beta + \sum_{i=1}^{n} x_i$
- Normal$(\mu, \sigma_c^2)$ with Normal$(\mu_0, \sigma_0^2)$ prior: $\left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_c^2} \right)\Big/\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)$, $\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)^{-1}$
- Normal$(\mu_c, \sigma^2)$ with Scaled Inverse Chi-square$(\nu, \sigma_0^2)$ prior: $\nu + n$, $\frac{\nu\sigma_0^2 + \sum_{i=1}^{n}(x_i - \mu_c)^2}{\nu + n}$
- Normal$(\mu, \sigma^2)$ with Normal-scaled Inverse Gamma$(\lambda, \nu, \alpha, \beta)$ prior: $\frac{\nu\lambda + n\bar{x}}{\nu + n}$, $\nu + n$, $\alpha + \frac{n}{2}$, $\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n\nu(\bar{x} - \lambda)^2}{2(n + \nu)}$
- MVN$(\mu, \Sigma_c)$ with MVN$(\mu_0, \Sigma_0)$ prior: $\left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}\left( \Sigma_0^{-1}\mu_0 + n\Sigma_c^{-1}\bar{x} \right)$, $\left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}$
- MVN$(\mu_c, \Sigma)$ with Inverse-Wishart$(\nu, \Psi)$ prior: $n + \nu$, $\Psi + \sum_{i=1}^{n}(x_i - \mu_c)(x_i - \mu_c)^T$
- Pareto$(x_{mc}, k)$ with Gamma$(\alpha, \beta)$ prior: $\alpha + n$, $\beta + \sum_{i=1}^{n}\log\frac{x_i}{x_{mc}}$
- Pareto$(x_m, k_c)$ with Pareto$(x_0, k_0)$ prior: $x_0$, $k_0 - k_c n$ (where $k_0 > k_c n$)
- Gamma$(\alpha_c, \beta)$ with Gamma$(\alpha_0, \beta_0)$ prior: $\alpha_0 + n\alpha_c$, $\beta_0 + \sum_{i=1}^{n} x_i$
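A minimal sketch of the first row of the table (Bernoulli likelihood with a Beta prior) in numpy; the flat Beta(1, 1) prior, true p = 0.3, and Monte Carlo credible interval are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=50)        # Bernoulli(p = 0.3) data

alpha0, beta0 = 1.0, 1.0                 # Beta(1, 1) = flat prior on p
alpha_n = alpha0 + x.sum()               # posterior hyperparameters from the table above
beta_n = beta0 + x.size - x.sum()

post_mean = alpha_n / (alpha_n + beta_n)
# Equal-tail 95% credible interval via Monte Carlo draws from the Beta posterior
draws = rng.beta(alpha_n, beta_n, size=100_000)
print(post_mean, np.quantile(draws, [0.025, 0.975]))
```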
14.4 Bayesian Testing

If $H_0 : \theta \in \Theta_0$:
- Prior probability $P[H_0] = \int_{\Theta_0} f(\theta)\, d\theta$
- Posterior probability $P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n)\, d\theta$

Let $H_0, \ldots, H_{K-1}$ be $K$ hypotheses with priors $f(\theta \mid H_k)$. Then
$P[H_k \mid x^n] = \frac{f(x^n \mid H_k)P[H_k]}{\sum_{k=0}^{K-1} f(x^n \mid H_k)P[H_k]}$

Marginal likelihood
$f(x^n \mid H_i) = \int_{\Theta} f(x^n \mid \theta, H_i)f(\theta \mid H_i)\, d\theta$

Posterior odds (of $H_i$ relative to $H_j$)
$\frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor } BF_{ij}} \times \underbrace{\frac{P[H_i]}{P[H_j]}}_{\text{prior odds}}$

Bayes factor interpretation
  log10 BF_10    BF_10       evidence
  0 - 0.5        1 - 1.5     weak
  0.5 - 1        1.5 - 10    moderate
  1 - 2          10 - 100    strong
  > 2            > 100       decisive

$p^{\star} = \frac{\frac{p}{1-p}BF_{10}}{1 + \frac{p}{1-p}BF_{10}}$ where $p = P[H_1]$ and $p^{\star} = P[H_1 \mid x^n]$
15 Exponential Family

Scalar parameter
$f_X(x \mid \theta) = h(x)\exp\{\eta(\theta)T(x) - A(\theta)\} = h(x)g(\theta)\exp\{\eta(\theta)T(x)\}$

Vector parameter
$f_X(x \mid \theta) = h(x)\exp\!\left\{ \sum_{i=1}^{s}\eta_i(\theta)T_i(x) - A(\theta) \right\} = h(x)\exp\{\eta(\theta)\cdot T(x) - A(\theta)\} = h(x)g(\theta)\exp\{\eta(\theta)\cdot T(x)\}$

Natural form
$f_X(x \mid \eta) = h(x)\exp\{\eta\cdot T(x) - A(\eta)\} = h(x)g(\eta)\exp\{\eta\cdot T(x)\} = h(x)g(\eta)\exp\!\left\{ \eta^T T(x) \right\}$
16 Sampling Methods

16.1 The Bootstrap

Let $T_n = g(X_1, \ldots, X_n)$ be a statistic.
1. Estimate $V_F[T_n]$ with $V_{\hat{F}_n}[T_n]$.
2. Approximate $V_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^{\star}_{n,1}, \ldots, T^{\star}_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
       i. Sample uniformly with replacement $X^{\star}_1, \ldots, X^{\star}_n \sim \hat{F}_n$.
       ii. Compute $T^{\star}_n = g(X^{\star}_1, \ldots, X^{\star}_n)$.
   (b) Then
       $v_{boot} = \hat{V}_{\hat{F}_n}[T_n] = \frac{1}{B}\sum_{b=1}^{B}\left( T^{\star}_{n,b} - \frac{1}{B}\sum_{r=1}^{B} T^{\star}_{n,r} \right)^2$
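A minimal numpy sketch of the nonparametric bootstrap above; the sample median as the statistic and the normal-based interval are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=2.0, size=200)   # observed sample
t_hat = np.median(x)                           # statistic T_n = g(X_1, ..., X_n)

B = 5000
t_star = np.empty(B)
for b in range(B):
    # Sample n points uniformly *with replacement* from the data, i.e. from F_hat_n
    x_star = rng.choice(x, size=x.size, replace=True)
    t_star[b] = np.median(x_star)

se_boot = t_star.std(ddof=1)
print(t_hat, se_boot, (t_hat - 1.96 * se_boot, t_hat + 1.96 * se_boot))  # normal-based interval
```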
16.1.1 Bootstrap Confidence Intervals

Normal-based interval
$T_n \pm z_{\alpha/2}\,\widehat{\operatorname{se}}_{boot}$

Pivotal interval
1. Location parameter $\theta = T(F)$
2. Pivot $R_n = \hat{\theta}_n - \theta$
3. Let $H(r) = P[R_n \le r]$ be the cdf of $R_n$
4. Let $R^{\star}_{n,b} = \hat{\theta}^{\star}_{n,b} - \hat{\theta}_n$. Approximate $H$ using the bootstrap: $\hat{H}(r) = \frac{1}{B}\sum_{b=1}^{B} I(R^{\star}_{n,b} \le r)$
5. Let $\theta^{\star}_{\beta}$ denote the $\beta$ sample quantile of $(\hat{\theta}^{\star}_{n,1}, \ldots, \hat{\theta}^{\star}_{n,B})$
6. Let $r^{\star}_{\beta}$ denote the $\beta$ sample quantile of $(R^{\star}_{n,1}, \ldots, R^{\star}_{n,B})$, i.e., $r^{\star}_{\beta} = \theta^{\star}_{\beta} - \hat{\theta}_n$
7. Then an approximate $1 - \alpha$ confidence interval is $C_n = (\hat{a}, \hat{b})$ with
   $\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\!\left( 1 - \frac{\alpha}{2} \right) = \hat{\theta}_n - r^{\star}_{1-\alpha/2} = 2\hat{\theta}_n - \theta^{\star}_{1-\alpha/2}$
   $\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\!\left( \frac{\alpha}{2} \right) = \hat{\theta}_n - r^{\star}_{\alpha/2} = 2\hat{\theta}_n - \theta^{\star}_{\alpha/2}$

Percentile interval
$C_n = \left( \theta^{\star}_{\alpha/2},\ \theta^{\star}_{1-\alpha/2} \right)$
16.2 Rejection Sampling

Setup
- We can easily sample from $g(\theta)$
- We want to sample from $h(\theta)$, but it is difficult
- We know $h(\theta)$ up to a proportionality constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\, d\theta}$
- Envelope condition: we can find $M > 0$ such that $k(\theta) \le Mg(\theta)$

Algorithm
1. Draw $\theta^{cand} \sim g(\theta)$
2. Generate $u \sim \text{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{Mg(\theta^{cand})}$
4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example
- We can easily sample from the prior $g(\theta) = f(\theta)$
- Target is the posterior: $h(\theta) \propto k(\theta) = f(x^n \mid \theta)f(\theta)$
- Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat{\theta}_n) = \mathcal{L}_n(\hat{\theta}_n) \equiv M$
- Algorithm:
  1. Draw $\theta^{cand} \sim f(\theta)$
  2. Generate $u \sim \text{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat{\theta}_n)}$
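A minimal sketch of the algorithm in numpy; the unnormalized target (a Beta(2, 5) kernel), the Uniform(0, 1) proposal, and the envelope constant M = 0.082 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def k(theta):
    # Unnormalized target: the kernel of a Beta(2, 5) density (constant assumed unknown)
    return theta * (1.0 - theta) ** 4

M = 0.082   # envelope: k(theta) <= M * g(theta), g = Unif(0,1) density (max of k ~ 0.0819)

def rejection_sample(B):
    out = []
    while len(out) < B:
        cand = rng.uniform(0.0, 1.0)          # draw from the proposal g
        u = rng.uniform(0.0, 1.0)
        if u <= k(cand) / (M * 1.0):          # accept with probability k/(M g)
            out.append(cand)
    return np.array(out)

draws = rejection_sample(10_000)
print(draws.mean())   # should be close to E[Beta(2, 5)] = 2/7 ~ 0.286
```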
16.3 Importance Sampling

Sample from an importance function $g$ rather than the target density $h$.
Algorithm to obtain an approximation to $E[q(\theta) \mid x^n]$:
1. Sample from the prior: $\theta_1, \ldots, \theta_B \stackrel{iid}{\sim} f(\theta)$
2. For each $i = 1, \ldots, B$, calculate the normalized weights $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{j=1}^{B}\mathcal{L}_n(\theta_j)}$
3. $E[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i)w_i$
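A minimal sketch in numpy: the posterior mean of a Bernoulli success probability under a flat prior, approximated by importance sampling from the prior and compared against the exact Beta posterior mean. The data-generating p = 0.7 and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.7, size=40)            # Bernoulli data, flat Unif(0, 1) prior on p

B = 50_000
theta = rng.uniform(0.0, 1.0, size=B)        # draws from the prior (the importance function)

# Log-likelihood of each draw, exponentiated stably before normalizing the weights
loglik = x.sum() * np.log(theta) + (x.size - x.sum()) * np.log1p(-theta)
w = np.exp(loglik - loglik.max())
w /= w.sum()

post_mean = np.sum(w * theta)                # approximates E[p | x^n]
exact = (1 + x.sum()) / (2 + x.size)         # Beta(1 + sum x, 1 + n - sum x) posterior mean
print(post_mean, exact)
```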
17 Decision Theory

Definitions
- Unknown quantity affecting our decision: $\theta \in \Theta$
- Decision rule: synonymous with an estimator $\hat{\theta}$
- Action $a \in A$: a possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$.
- Loss function $L$: the consequences of taking action $a$ when the true state is $\theta$, or the discrepancy between $\theta$ and $\hat{\theta}$; $L : \Theta \times A \to [-k, \infty)$.

Loss functions
- Squared error loss: $L(\theta, a) = (\theta - a)^2$
- Linear loss: $L(\theta, a) = K_1(\theta - a)$ if $a - \theta \le 0$, and $K_2(a - \theta)$ if $a - \theta > 0$
- Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2$)
- $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
- Zero-one loss: $L(\theta, a) = 0$ if $a = \theta$, $1$ if $a \ne \theta$
17.1 Risk

Posterior risk
$r(\hat{\theta} \mid x) = \int L(\theta, \hat{\theta}(x))f(\theta \mid x)\, d\theta = E_{\theta\mid X}\!\left[ L(\theta, \hat{\theta}(x)) \right]$

(Frequentist) risk
$R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x))f(x \mid \theta)\, dx = E_{X\mid\theta}\!\left[ L(\theta, \hat{\theta}(X)) \right]$

Bayes risk
$r(f, \hat{\theta}) = \iint L(\theta, \hat{\theta}(x))f(x, \theta)\, dx\, d\theta = E_{\theta,X}\!\left[ L(\theta, \hat{\theta}(X)) \right]$
$r(f, \hat{\theta}) = E_{\theta}\!\left[ E_{X\mid\theta}\!\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_{\theta}\!\left[ R(\theta, \hat{\theta}) \right]$
$r(f, \hat{\theta}) = E_{X}\!\left[ E_{\theta\mid X}\!\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_{X}\!\left[ r(\hat{\theta} \mid X) \right]$
17.2 Admissibility

$\hat{\theta}'$ dominates $\hat{\theta}$ if
$\forall\theta : R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$ and $\exists\theta : R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$

$\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.
17.3 Bayes Rule

Bayes rule (or Bayes estimator)
$r(f, \hat{\theta}) = \inf_{\tilde{\theta}} r(f, \tilde{\theta})$
$\hat{\theta}(x)$ minimizes the posterior risk $r(\hat{\theta} \mid x)$ for every $x$, and then $r(f, \hat{\theta}) = \int r(\hat{\theta} \mid x)f(x)\, dx$

Theorems
- Squared error loss: the Bayes estimator is the posterior mean
- Absolute error loss: the posterior median
- Zero-one loss: the posterior mode
17.4 Minimax Rules

Maximum risk
$\bar{R}(\hat{\theta}) = \sup_{\theta} R(\theta, \hat{\theta})$,  $\bar{R}(a) = \sup_{\theta} R(\theta, a)$

Minimax rule
$\sup_{\theta} R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}}\bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}}\sup_{\theta} R(\theta, \tilde{\theta})$

If $\hat{\theta}$ is a Bayes rule with constant risk, i.e. $\exists c : R(\theta, \hat{\theta}) = c$ for all $\theta$, then $\hat{\theta}$ is minimax.

Least favorable prior
If $\hat{\theta}^f$ is the Bayes rule for prior $f$ and $R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f)$ for all $\theta$, then $\hat{\theta}^f$ is minimax and $f$ is called a least favorable prior.
18 Linear Regression

Definitions
- Response variable $Y$
- Covariate $X$ (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$,  $E[\varepsilon_i \mid X_i] = 0$,  $V[\varepsilon_i \mid X_i] = \sigma^2$

Fitted line
$\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$

Predicted (fitted) values
$\hat{Y}_i = \hat{r}(X_i)$

Residuals
$\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 X_i \right)$

Residual sums of squares (rss)
$\operatorname{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n}\hat{\varepsilon}_i^2$

Least squares estimates
$\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T$ minimizes $\operatorname{rss}$:
$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{X}_n$
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$

$E[\hat{\beta} \mid X^n] = \begin{pmatrix}\beta_0 \\ \beta_1\end{pmatrix}$,  $V[\hat{\beta} \mid X^n] = \frac{\sigma^2}{ns_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1\end{pmatrix}$
$\widehat{\operatorname{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}}$,  $\widehat{\operatorname{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X\sqrt{n}}$
where $s_X^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$ (an unbiased estimate of $\sigma^2$).

Further properties:
- Consistency: $\hat{\beta}_0 \stackrel{P}{\to} \beta_0$ and $\hat{\beta}_1 \stackrel{P}{\to} \beta_1$
- Asymptotic normality: $\frac{\hat{\beta}_0 - \beta_0}{\widehat{\operatorname{se}}(\hat{\beta}_0)} \stackrel{D}{\to} \mathcal{N}(0, 1)$ and $\frac{\hat{\beta}_1 - \beta_1}{\widehat{\operatorname{se}}(\hat{\beta}_1)} \stackrel{D}{\to} \mathcal{N}(0, 1)$
- Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat{\beta}_0 \pm z_{\alpha/2}\,\widehat{\operatorname{se}}(\hat{\beta}_0)$ and $\hat{\beta}_1 \pm z_{\alpha/2}\,\widehat{\operatorname{se}}(\hat{\beta}_1)$
- The Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$ rejects $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1/\widehat{\operatorname{se}}(\hat{\beta}_1)$.

$R^2$
$R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n}\hat{\varepsilon}_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\operatorname{rss}}{\operatorname{tss}}$
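A minimal numpy sketch of the least squares formulas above; the simulated covariate, true coefficients (1.5, 2.0), and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=1.0, size=n)    # true beta0 = 1.5, beta1 = 2.0

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)            # unbiased estimate of sigma^2
s_x2 = np.mean((x - xbar) ** 2)
se_b1 = np.sqrt(sigma2_hat) / (np.sqrt(s_x2) * np.sqrt(n))

print(b0, b1, (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1))  # estimates and ~95% CI for beta1
```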
Likelihood
$\mathcal{L} = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i)\prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) = \mathcal{L}_1\mathcal{L}_2$
$\mathcal{L}_1 = \prod_{i=1}^{n} f_X(X_i)$
$\mathcal{L}_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n}\exp\!\left\{ -\frac{1}{2\sigma^2}\sum_i\left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2 \right\}$

Under the assumption of Normality, the least squares estimator is also the mle, with
$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$
18.2 Prediction

Observe a new value $X = x_{\star}$ of the covariate and predict the outcome $Y_{\star}$:
$\hat{Y}_{\star} = \hat{\beta}_0 + \hat{\beta}_1 x_{\star}$
$V[\hat{Y}_{\star}] = V[\hat{\beta}_0] + x_{\star}^2 V[\hat{\beta}_1] + 2x_{\star}\operatorname{Cov}[\hat{\beta}_0, \hat{\beta}_1]$

Prediction interval
$\hat{\xi}_n^2 = \hat{\sigma}^2\left( \frac{\sum_{i=1}^{n}(X_i - x_{\star})^2}{n\sum_i(X_i - \bar{X})^2} + 1 \right)$
$\hat{Y}_{\star} \pm z_{\alpha/2}\,\hat{\xi}_n$
18.3 Multiple Regression

$Y = X\beta + \varepsilon$
where
$X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix}$,  $\beta = \begin{pmatrix}\beta_1 \\ \vdots \\ \beta_k\end{pmatrix}$,  $\varepsilon = \begin{pmatrix}\varepsilon_1 \\ \vdots \\ \varepsilon_n\end{pmatrix}$

Likelihood
$\mathcal{L}(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\!\left( -\frac{1}{2\sigma^2}\operatorname{rss} \right)$
$\operatorname{rss} = (Y - X\beta)^T(Y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{n}(Y_i - x_i^T\beta)^2$

If the $(k \times k)$ matrix $X^TX$ is invertible,
$\hat{\beta} = (X^TX)^{-1}X^TY$
$V[\hat{\beta} \mid X^n] = \sigma^2(X^TX)^{-1}$
$\hat{\beta} \approx \mathcal{N}\!\left( \beta, \sigma^2(X^TX)^{-1} \right)$

Estimate of the regression function
$\hat{r}(x) = \sum_{j=1}^{k}\hat{\beta}_j x_j$

Unbiased estimate of $\sigma^2$
$\hat{\sigma}^2 = \frac{1}{n - k}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$,  $\hat{\varepsilon} = Y - X\hat{\beta}$,  $\hat{\sigma}^2_{mle} = \frac{n - k}{n}\hat{\sigma}^2$

$1 - \alpha$ confidence interval
$\hat{\beta}_j \pm z_{\alpha/2}\,\widehat{\operatorname{se}}(\hat{\beta}_j)$
18.4 Model Selection

Consider predicting a new observation $Y_{\star}$ for covariates $X_{\star}$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues
- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the best score

Hypothesis testing
$H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0$ for all $j \in J$

Mean squared prediction error (mspe)
$\operatorname{mspe} = E\!\left[ (\hat{Y}(S) - Y_{\star})^2 \right]$

Prediction risk
$R(S) = \sum_{i=1}^{n}\operatorname{mspe}_i = \sum_{i=1}^{n} E\!\left[ (\hat{Y}_i(S) - Y_i^{\star})^2 \right]$

Training error
$\hat{R}_{tr}(S) = \sum_{i=1}^{n}(\hat{Y}_i(S) - Y_i)^2$

$R^2$
$R^2(S) = 1 - \frac{\operatorname{rss}(S)}{\operatorname{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\operatorname{tss}} = \frac{\sum_{i=1}^{n}(\hat{Y}_i(S) - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

The training error is a downward-biased estimate of the prediction risk:
$E[\hat{R}_{tr}(S)] < R(S)$
$\operatorname{bias}(\hat{R}_{tr}(S)) = E[\hat{R}_{tr}(S)] - R(S) = -2\sum_{i=1}^{n}\operatorname{Cov}[\hat{Y}_i, Y_i]$

Adjusted $R^2$
$\bar{R}^2(S) = 1 - \frac{n - 1}{n - k}\frac{\operatorname{rss}}{\operatorname{tss}}$

Mallows $C_p$ statistic
$\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat{\sigma}^2 = \text{lack of fit} + \text{complexity penalty}$

Akaike Information Criterion (AIC)
$\operatorname{AIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}^2_S) - k$

Bayesian Information Criterion (BIC)
$\operatorname{BIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}^2_S) - \frac{k}{2}\log n$

Validation and training
$\hat{R}_V(S) = \sum_{i=1}^{m}(\hat{Y}_i^{\star}(S) - Y_i^{\star})^2$,  $m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

Leave-one-out cross-validation
$\hat{R}_{CV}(S) = \sum_{i=1}^{n}(Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n}\left( \frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)} \right)^2$
$U(S) = X_S(X_S^TX_S)^{-1}X_S^T$ (the hat matrix)
19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate $f(x)$, where $f$ satisfies $P[X \in A] = \int_A f(x)\, dx$.

Integrated square error (ise)
$L(f, \hat{f}_n) = \int\left( f(x) - \hat{f}_n(x) \right)^2 dx = J(h) + \int f^2(x)\, dx$

Frequentist risk
$R(f, \hat{f}_n) = E\!\left[ L(f, \hat{f}_n) \right] = \int b^2(x)\, dx + \int v(x)\, dx$
$b(x) = E\!\left[ \hat{f}_n(x) \right] - f(x)$,  $v(x) = V\!\left[ \hat{f}_n(x) \right]$
19.1.1 Histograms

Definitions
- Number of bins $m$
- Binwidth $h = \frac{1}{m}$ (data on $[0, 1]$)
- Bin $B_j$ has $\nu_j$ observations
- Define $\hat{p}_j = \nu_j/n$ and $p_j = \int_{B_j} f(u)\, du$

Histogram estimator
$\hat{f}_n(x) = \sum_{j=1}^{m}\frac{\hat{p}_j}{h}I(x \in B_j)$
$E\!\left[ \hat{f}_n(x) \right] = \frac{p_j}{h}$,  $V\!\left[ \hat{f}_n(x) \right] = \frac{p_j(1 - p_j)}{nh^2}$
$R(\hat{f}_n, f) \approx \frac{h^2}{12}\int(f'(u))^2\, du + \frac{1}{nh}$
$h^{\star} = \frac{1}{n^{1/3}}\left( \frac{6}{\int(f'(u))^2\, du} \right)^{1/3}$
$R^{\star}(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}$ with $C = \left( \frac{3}{4} \right)^{2/3}\left( \int(f'(u))^2\, du \right)^{1/3}$

Cross-validation estimate of $E[J(h)]$
$\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\, dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^{m}\hat{p}_j^2$
19.1.2 Kernel Density Estimator (KDE)

Kernel $K$
- $K(x) \ge 0$
- $\int K(x)\, dx = 1$
- $\int xK(x)\, dx = 0$
- $\int x^2K(x)\, dx \equiv \sigma_K^2 > 0$

KDE
$\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\!\left( \frac{x - X_i}{h} \right)$
$R(f, \hat{f}_n) \approx \frac{1}{4}(h\sigma_K)^4\int(f''(x))^2\, dx + \frac{1}{nh}\int K^2(x)\, dx$
$h^{\star} = \frac{c_1^{-2/5}c_2^{1/5}c_3^{-1/5}}{n^{1/5}}$ with $c_1 = \sigma_K^2$, $c_2 = \int K^2(x)\, dx$, $c_3 = \int(f''(x))^2\, dx$
$R^{\star}(f, \hat{f}_n) = \frac{c_4}{n^{4/5}}$ with $c_4 = \frac{5}{4}(\sigma_K^2)^{2/5}\underbrace{\left( \int K^2(x)\, dx \right)^{4/5}}_{C(K)}\left( \int(f'')^2\, dx \right)^{1/5}$

Epanechnikov kernel
$K(x) = \frac{3}{4\sqrt{5}}\left( 1 - \frac{x^2}{5} \right)$ for $|x| < \sqrt{5}$, and $0$ otherwise

Cross-validation estimate of $E[J(h)]$
$\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\, dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) \approx \frac{1}{hn^2}\sum_{i=1}^{n}\sum_{j=1}^{n}K^{\star}\!\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh}K(0)$
$K^{\star}(x) = K^{(2)}(x) - 2K(x)$,  $K^{(2)}(x) = \int K(x - y)K(y)\, dy$
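A minimal numpy sketch of the KDE formula with a Gaussian kernel; the bimodal sample and the normal-reference bandwidth rule of thumb are illustrative choices, not a recommendation.

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """KDE f_hat_n(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a standard normal kernel K."""
    data = np.asarray(data)
    u = (x_grid[:, None] - data[None, :]) / h             # (grid, n) scaled distances
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)      # Gaussian kernel
    return K.mean(axis=1) / h

rng = np.random.default_rng(9)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # bimodal sample
h = 1.06 * data.std() * data.size ** (-1 / 5)             # normal-reference bandwidth
grid = np.linspace(-6, 6, 400)
f_hat = gaussian_kde(grid, data, h)
print(np.trapz(f_hat, grid))                              # should integrate to ~1
```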
19.2 Non-parametric Regression

Estimate $r(x) = E[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by
$Y_i = r(x_i) + \varepsilon_i$,  $E[\varepsilon_i] = 0$,  $V[\varepsilon_i] = \sigma^2$

k-nearest-neighbor estimator
$\hat{r}(x) = \frac{1}{k}\sum_{i : x_i \in N_k(x)} Y_i$ where $N_k(x) = \{k$ values of $x_1, \ldots, x_n$ closest to $x\}$

Nadaraya-Watson kernel estimator
$\hat{r}(x) = \sum_{i=1}^{n} w_i(x)Y_i$,  $w_i(x) = \frac{K\!\left( \frac{x - x_i}{h} \right)}{\sum_{j=1}^{n} K\!\left( \frac{x - x_j}{h} \right)} \in [0, 1]$

$R(\hat{r}_n, r) \approx \frac{h^4}{4}\left( \int x^2K(x)\, dx \right)^2\int\left( r''(x) + 2r'(x)\frac{f'(x)}{f(x)} \right)^2 dx + \frac{\sigma^2\int K^2(x)\, dx}{nh}\int\frac{1}{f(x)}\, dx$
$h^{\star} \approx \frac{c_1}{n^{1/5}}$,  $R^{\star}(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}$

Cross-validation estimate of $E[J(h)]$
$\hat{J}_{CV}(h) = \sum_{i=1}^{n}(Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^{n}\frac{(Y_i - \hat{r}(x_i))^2}{\left( 1 - \frac{K(0)}{\sum_{j=1}^{n} K\!\left( \frac{x_i - x_j}{h} \right)} \right)^2}$
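A minimal numpy sketch of the Nadaraya-Watson weights above; the sine regression function, noise level, and Gaussian kernel with h = 0.3 are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Nadaraya-Watson estimate r_hat(x0) = sum_i w_i(x0) Y_i with a Gaussian kernel."""
    x0 = np.atleast_1d(x0)
    u = (x0[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u ** 2)                 # unnormalized Gaussian kernel (constants cancel)
    w = K / K.sum(axis=1, keepdims=True)      # weights sum to 1 at each evaluation point
    return w @ y

rng = np.random.default_rng(10)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(0, 2 * np.pi, 50)
print(nadaraya_watson(grid[:5], x, y, h=0.3))  # should track sin(x) near the left boundary
```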
19.3 Smoothing Using Orthogonal Functions

Approximation
$r(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x) \approx \sum_{j=1}^{J}\beta_j\phi_j(x)$

Multivariate regression form
$Y = \Phi\beta + \eta$ where $\eta_i = \varepsilon_i$ and
$\Phi = \begin{pmatrix}\phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n)\end{pmatrix}$

Least squares estimator
$\hat{\beta} = (\Phi^T\Phi)^{-1}\Phi^TY \approx \frac{1}{n}\Phi^TY$ (for equally spaced observations only)

Cross-validation estimate of $E[J(h)]$
$\hat{R}_{CV}(J) = \sum_{i=1}^{n}\left( Y_i - \sum_{j=1}^{J}\phi_j(x_i)\hat{\beta}_{j,(-i)} \right)^2$
20 Stochastic Processes

Stochastic process
$\{X_t : t \in T\}$ with $T = \{0, \pm 1, \pm 2, \ldots\} = \mathbb{Z}$ (discrete) or $T = [0, \infty)$ (continuous)
- Notations: $X_t$, $X(t)$
- State space $\mathcal{X}$
- Index set $T$
20.1 Markov Chains

Markov chain $\{X_n : n \in T\}$
$P[X_n = x \mid X_0, \ldots, X_{n-1}] = P[X_n = x \mid X_{n-1}]$ for all $n \in T$, $x \in \mathcal{X}$

Transition probabilities
$p_{ij} \equiv P[X_{n+1} = j \mid X_n = i]$
$p_{ij}(n) \equiv P[X_{m+n} = j \mid X_m = i]$ ($n$-step)

Transition matrix $P$ ($n$-step: $P_n$)
- $(i, j)$ element is $p_{ij}$
- $p_{ij} \ge 0$ and $\sum_j p_{ij} = 1$

Chapman-Kolmogorov
$p_{ij}(m + n) = \sum_k p_{ik}(m)\, p_{kj}(n)$
$P_{m+n} = P_mP_n$,  $P_n = P\cdots P = P^n$

Marginal probability
$\mu_n = (\mu_n(1), \ldots, \mu_n(N))$ where $\mu_n(i) = P[X_n = i]$
$\mu_0$: initial distribution
$\mu_n = \mu_0P^n$
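A minimal numpy sketch of the marginal recursion mu_n = mu_0 P^n and of simulating a path; the two-state transition matrix is hypothetical.

```python
import numpy as np

# Two-state chain (hypothetical transition matrix): rows sum to 1.
P = np.array([[0.9, 0.1],     # state 0 -> {0, 1}
              [0.5, 0.5]])    # state 1 -> {0, 1}
mu0 = np.array([1.0, 0.0])    # start in state 0

mu_n = mu0 @ np.linalg.matrix_power(P, 50)   # marginal mu_n = mu_0 P^n
print(mu_n)                                  # ~ stationary distribution (5/6, 1/6)

# Simulate a path from the chain
rng = np.random.default_rng(11)
state, path = 0, [0]
for _ in range(10):
    state = rng.choice(2, p=P[state])
    path.append(state)
print(path)
```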
20.2 Poisson Processes

Poisson process
- $\{X_t : t \in [0, \infty)\}$ counts the number of events up to and including time $t$
- $X_0 = 0$
- Independent increments: for $t_0 < t_1 < \cdots < t_n$, the increments $X_{t_1} - X_{t_0}, \ldots, X_{t_n} - X_{t_{n-1}}$ are independent
- Intensity function $\lambda(t)$:
  $P[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)$
  $P[X_{t+h} - X_t \ge 2] = o(h)$
- $X_{s+t} - X_s \sim \text{Poisson}(m(s + t) - m(s))$ where $m(t) = \int_0^t\lambda(s)\, ds$

Homogeneous Poisson process
$\lambda(t) \equiv \lambda > 0 \implies X_t \sim \text{Poisson}(\lambda t)$

Waiting times
$W_t :=$ time at which the $t^{\text{th}}$ event occurs,  $W_t \sim \text{Gamma}\!\left( t, \frac{1}{\lambda} \right)$

Interarrival times
$S_t = W_{t+1} - W_t$,  $S_t \sim \text{Exp}\!\left( \frac{1}{\lambda} \right)$, independent of $W_t$
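A minimal numpy sketch of a homogeneous Poisson process, built from its iid exponential interarrival times; the rate lambda = 2 and horizon t = 10 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(12)
lam, t_max = 2.0, 10.0                       # rate lambda and time horizon

# Interarrival times are iid Exponential with mean 1/lambda; event times are their cumsum.
inter = rng.exponential(scale=1.0 / lam, size=1000)
event_times = np.cumsum(inter)
event_times = event_times[event_times <= t_max]

print(len(event_times))                      # X_{t_max} ~ Poisson(lam * t_max), mean 20
print(event_times[:5])                       # waiting times W_1, ..., W_5
```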
21 Time Series

Mean function
$\mu_{X_t} = E[X_t] = \int x f_t(x)\, dx$

Autocovariance function
$\gamma_X(s, t) = E[(X_s - \mu_s)(X_t - \mu_t)] = E[X_sX_t] - \mu_s\mu_t$
$\gamma_X(t, t) = E\!\left[ (X_t - \mu_t)^2 \right] = V[X_t]$

Autocorrelation function (ACF)
$\rho(s, t) = \frac{\operatorname{Cov}[X_s, X_t]}{\sqrt{V[X_s]V[X_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}$

Cross-covariance function (CCV)
$\gamma_{XY}(s, t) = E[(X_s - \mu_{X_s})(Y_t - \mu_{Y_t})]$

Cross-correlation function (CCF)
$\rho_{XY}(s, t) = \frac{\gamma_{XY}(s, t)}{\sqrt{\gamma_X(s, s)\gamma_Y(t, t)}}$

Backshift operator
$B^k(X_t) = X_{t-k}$

Difference operator
$\nabla^d = (1 - B)^d$

White noise $W_t$
- $E[W_t] = 0$ for all $t \in T$
- $V[W_t] = \sigma^2$ for all $t \in T$
- $\gamma_W(s, t) = 0$ for $s \ne t$

Autoregression
$X_t = \sum_{i=1}^{p}\phi_iX_{t-i} + W_t$

Random walk (with drift $\delta$)
$Y_t = \delta t + \sum_{j=1}^{t} W_j$

Symmetric moving average
$M_t = \sum_{j=-k}^{k} a_jX_{t-j}$ where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^{k} a_j = 1$
21.1 Stationary Time Series

Strictly stationary
$P[X_{t_1} \le c_1, \ldots, X_{t_k} \le c_k] = P[X_{t_1+h} \le c_1, \ldots, X_{t_k+h} \le c_k]$ for all $k \in \mathbb{N}$, $t_k$, $c_k$, $h \in \mathbb{Z}$

Weakly stationary
- $E[X_t^2] < \infty$ for all $t \in \mathbb{Z}$
- $E[X_t] = m$ for all $t \in \mathbb{Z}$
- $\gamma_X(s, t) = \gamma_X(s + r, t + r)$ for all $r, s, t \in \mathbb{Z}$

Autocovariance function
- $\gamma(h) = E[(X_{t+h} - \mu)(X_t - \mu)]$ for $h \in \mathbb{Z}$
- $\gamma(0) = E\!\left[ (X_t - \mu)^2 \right]$
- $\gamma(0) \ge 0$ and $\gamma(0) \ge |\gamma(h)|$
- $\gamma(h) = \gamma(-h)$

Autocorrelation function (ACF)
$\rho_X(h) = \frac{\operatorname{Cov}[X_{t+h}, X_t]}{\sqrt{V[X_{t+h}]V[X_t]}} = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t+h, t+h)\gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}$

Jointly stationary time series
$\gamma_{XY}(h) = E[(X_{t+h} - \mu_X)(Y_t - \mu_Y)]$,  $\rho_{XY}(h) = \frac{\gamma_{XY}(h)}{\sqrt{\gamma_X(0)\gamma_Y(0)}}$

Linear process
$X_t = \mu + \sum_{j=-\infty}^{\infty}\psi_jW_{t-j}$ where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$
$\gamma(h) = \sigma_W^2\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_j$
21.2 Estimation of Correlation

Sample autocovariance function
$\hat{\gamma}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(X_{t+h} - \bar{X})(X_t - \bar{X})$

Sample autocorrelation function
$\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$

Sample cross-covariance function
$\hat{\gamma}_{XY}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(X_{t+h} - \bar{X})(Y_t - \bar{Y})$

Sample cross-correlation function
$\hat{\rho}_{XY}(h) = \frac{\hat{\gamma}_{XY}(h)}{\sqrt{\hat{\gamma}_X(0)\hat{\gamma}_Y(0)}}$

Properties
- $\sigma_{\hat{\rho}_X(h)} = \frac{1}{\sqrt{n}}$ if $X_t$ is white noise
- $\sigma_{\hat{\rho}_{XY}(h)} = \frac{1}{\sqrt{n}}$ if $X_t$ or $Y_t$ is white noise
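A minimal numpy sketch of the sample ACF formula; the white-noise input is an illustrative choice that lets the 1/sqrt(n) property above be eyeballed.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(h) = gamma_hat(h)/gamma_hat(0) for h = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    gamma0 = np.sum(xc * xc) / n
    acf = [np.sum(xc[h:] * xc[:n - h]) / n / gamma0 for h in range(max_lag + 1)]
    return np.array(acf)

rng = np.random.default_rng(13)
w = rng.normal(size=1000)                    # white noise
print(sample_acf(w, 5))                      # lags >= 1 should lie within ~ +/- 2/sqrt(n) of 0
```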
21.3 Non-Stationary Time Series

Classical decomposition model
$X_t = \mu_t + S_t + W_t$
where $\mu_t$ is the trend, $S_t$ the seasonal component, and $W_t$ the random noise term.

21.3.1 Detrending

Least squares
1. Choose a trend model, e.g., $\mu_t = \beta_0 + \beta_1t + \beta_2t^2$
2. Minimize rss to obtain the trend estimate $\hat{\mu}_t = \hat{\beta}_0 + \hat{\beta}_1t + \hat{\beta}_2t^2$
3. The residuals estimate the noise $W_t$

Moving average
The low-pass filter $V_t$ is a symmetric moving average $M_t$ with $a_j = \frac{1}{2k+1}$:
$V_t = \frac{1}{2k+1}\sum_{i=-k}^{k} X_{t-i}$
If $\frac{1}{2k+1}\sum_{i=-k}^{k} W_{t-i} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1t$ passes without distortion.

Differencing
$\mu_t = \beta_0 + \beta_1t \implies \nabla\mu_t = \beta_1$, so differencing removes a linear trend.
21.4 ARIMA models

Autoregressive polynomial
$\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$,  $z \in \mathbb{C}$, $\phi_p \ne 0$

Autoregressive operator
$\phi(B) = 1 - \phi_1B - \cdots - \phi_pB^p$

AR(p) (autoregressive model of order p)
$X_t = \phi_1X_{t-1} + \cdots + \phi_pX_{t-p} + W_t \iff \phi(B)X_t = W_t$

AR(1)
$X_t = \phi^kX_{t-k} + \sum_{j=0}^{k-1}\phi^jW_{t-j}$, and for $k \to \infty$, $|\phi| < 1$: $X_t = \sum_{j=0}^{\infty}\phi^jW_{t-j}$ (a linear process)
$E[X_t] = \sum_{j=0}^{\infty}\phi^jE[W_{t-j}] = 0$
$\gamma(h) = \operatorname{Cov}[X_{t+h}, X_t] = \frac{\sigma_W^2\phi^h}{1 - \phi^2}$
$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$,  $\rho(h) = \phi\,\rho(h-1)$ for $h = 1, 2, \ldots$

Moving average polynomial
$\theta(z) = 1 + \theta_1z + \cdots + \theta_qz^q$,  $z \in \mathbb{C}$, $\theta_q \ne 0$

Moving average operator
$\theta(B) = 1 + \theta_1B + \cdots + \theta_qB^q$

MA(q) (moving average model of order q)
$X_t = W_t + \theta_1W_{t-1} + \cdots + \theta_qW_{t-q} \iff X_t = \theta(B)W_t$
$E[X_t] = \sum_{j=0}^{q}\theta_jE[W_{t-j}] = 0$
$\gamma(h) = \operatorname{Cov}[X_{t+h}, X_t] = \sigma_W^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h}$ for $0 \le h \le q$, and $0$ for $h > q$

MA(1)
$X_t = W_t + \theta W_{t-1}$
$\gamma(h) = (1 + \theta^2)\sigma_W^2$ for $h = 0$, $\theta\sigma_W^2$ for $h = 1$, $0$ for $h > 1$
$\rho(h) = \frac{\theta}{1 + \theta^2}$ for $h = 1$, $0$ for $h > 1$

ARMA(p, q)
$X_t = \phi_1X_{t-1} + \cdots + \phi_pX_{t-p} + W_t + \theta_1W_{t-1} + \cdots + \theta_qW_{t-q} \iff \phi(B)X_t = \theta(B)W_t$

Partial autocorrelation function (PACF)
$\phi_{hh} = \operatorname{corr}(X_h - X_h^{h-1}, X_0 - X_0^{h-1})$ for $h \ge 2$,  $\phi_{11} = \operatorname{corr}(X_1, X_0) = \rho(1)$
where $X_i^{h-1}$ denotes the regression of $X_i$ on $\{X_{h-1}, X_{h-2}, \ldots, X_1\}$.

ARIMA(p, d, q)
$\nabla^dX_t = (1 - B)^dX_t$ is ARMA(p, q):  $\phi(B)(1 - B)^dX_t = \theta(B)W_t$

Exponentially Weighted Moving Average (EWMA)
$X_t = X_{t-1} + W_t - \lambda W_{t-1}$
$X_t = \sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1}X_{t-j} + W_t$ when $|\lambda| < 1$
$\tilde{X}_{n+1} = (1 - \lambda)X_n + \lambda\tilde{X}_n$
21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent) $\iff$ there exist $\{\psi_j\}$ with $\sum_{j=0}^{\infty}|\psi_j| < \infty$ such that
$X_t = \sum_{j=0}^{\infty}\psi_jW_{t-j} = \psi(B)W_t$

ARMA(p, q) is invertible $\iff$ there exist $\{\pi_j\}$ with $\sum_{j=0}^{\infty}|\pi_j| < \infty$ such that
$\pi(B)X_t = \sum_{j=0}^{\infty}\pi_jX_{t-j} = W_t$

Properties
- ARMA(p, q) is causal $\iff$ the roots of $\phi(z)$ lie outside the unit circle; then $\psi(z) = \sum_{j=0}^{\infty}\psi_jz^j = \frac{\theta(z)}{\phi(z)}$ for $|z| \le 1$
- ARMA(p, q) is invertible $\iff$ the roots of $\theta(z)$ lie outside the unit circle; then $\pi(z) = \sum_{j=0}^{\infty}\pi_jz^j = \frac{\phi(z)}{\theta(z)}$ for $|z| \le 1$
22 Math

22.1 Orthogonal Functions

$L_2$ space
$L_2(a, b) = \left\{ f : [a, b] \to \mathbb{R},\ \int_a^b f(x)^2\, dx < \infty \right\}$

Inner product: $\langle f, g \rangle = \int f(x)g(x)\, dx$
Norm: $\|f\| = \sqrt{\int f^2(x)\, dx}$

Orthogonality (for a sequence of functions $\phi_i$)
$\int\phi_j^2(x)\, dx = 1$ for all $j$,  $\int\phi_i(x)\phi_j(x)\, dx = 0$ for $i \ne j$

An orthogonal sequence $\phi_1, \phi_2, \ldots$ is complete if the only function orthogonal to each $\phi_j$ is the zero function. Then $\phi_1, \phi_2, \ldots$ form an orthogonal basis of $L_2$:
$f \in L_2 \implies f(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x)$ where $\beta_j = \int_a^b f(x)\phi_j(x)\, dx$

Cosine basis
$\phi_0(x) = 1$,  $\phi_j(x) = \sqrt{2}\cos(j\pi x)$ for $j \ge 1$

Parseval's relation
$\|f\|^2 = \int f^2(x)\, dx = \sum_{j=1}^{\infty}\beta_j^2 = \|\beta\|^2$

Legendre polynomials ($x \in [-1, 1]$)
$P_0(x) = 1$,  $P_1(x) = x$,  $P_{j+1}(x) = \frac{(2j + 1)xP_j(x) - jP_{j-1}(x)}{j + 1}$
$\phi_j(x) = \sqrt{(2j + 1)/2}\,P_j(x)$ form an orthogonal basis for $L_2(-1, 1)$
22.2 Series

Finite
$\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$
$\sum_{k=1}^{n}(2k - 1) = n^2$
$\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$
$\sum_{k=1}^{n} k^3 = \left( \frac{n(n+1)}{2} \right)^2$
$\sum_{k=1}^{n} r^{k-1} = \frac{1 - r^n}{1 - r}$
$\sum_{k=0}^{n}\binom{n}{k} = 2^n$
$\sum_{l=0}^{k}\binom{n}{l}\binom{m}{k - l} = \binom{n + m}{k}$ (Vandermonde's identity)
$\sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k = (a + b)^n$ (Binomial theorem)

Infinite
$\sum_{k=0}^{\infty} p^k = \frac{1}{1 - p}$ for $|p| < 1$
$\sum_{k=0}^{\infty} kp^{k-1} = \frac{d}{dp}\left( \sum_{k=0}^{\infty} p^k \right) = \frac{d}{dp}\left( \frac{1}{1 - p} \right) = \frac{1}{(1 - p)^2}$ for $|p| < 1$
$\sum_{k=0}^{\infty}\binom{r + k - 1}{k}x^k = (1 - x)^{-r}$ for $r \in \mathbb{N}^+$
$\sum_{k=0}^{\infty}\binom{\alpha}{k}p^k = (1 + p)^{\alpha}$ for $|p| < 1$, $\alpha \in \mathbb{C}$
22.3 Combinatorics

Sampling $k$ out of $n$
- ordered, without replacement: $n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \frac{n!}{(n - k)!}$
- ordered, with replacement: $n^k$
- unordered, without replacement: $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n - k)!}$
- unordered, with replacement: $\binom{n - 1 + k}{k} = \binom{n - 1 + k}{n - 1}$

Stirling numbers of the second kind
$\left\{ {n \atop k} \right\} = k\left\{ {n - 1 \atop k} \right\} + \left\{ {n - 1 \atop k - 1} \right\}$ for $1 \le k \le n$
$\left\{ {n \atop 0} \right\} = 1$ if $n = 0$, else $0$

Partitions
$P_{n+k,k} = \sum_{i=1}^{k} P_{n,i}$,  $P_{n,k} = 0$ for $k > n$,  $P_{n,0} = 0$ for $n \ge 1$,  $P_{0,0} = 1$

Balls and urns ($f : B \to U$ with $|B| = n$, $|U| = m$; D = distinguishable, I = indistinguishable)
- B: D, U: D. Arbitrary $f$: $m^n$; injective: $m^{\underline{n}}$ if $m \ge n$, else $0$; surjective: $m!\left\{ {n \atop m} \right\}$; bijective: $n!$ if $m = n$, else $0$
- B: I, U: D. Arbitrary: $\binom{m + n - 1}{n}$; injective: $\binom{m}{n}$; surjective: $\binom{n - 1}{m - 1}$; bijective: $1$ if $m = n$, else $0$
- B: D, U: I. Arbitrary: $\sum_{k=1}^{m}\left\{ {n \atop k} \right\}$; injective: $1$ if $m \ge n$, else $0$; surjective: $\left\{ {n \atop m} \right\}$; bijective: $1$ if $m = n$, else $0$
- B: I, U: I. Arbitrary: $\sum_{k=1}^{m} P_{n,k}$; injective: $1$ if $m \ge n$, else $0$; surjective: $P_{n,m}$; bijective: $1$ if $m = n$, else $0$
References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen, Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen, Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: univariate distribution relationship chart, created by Leemis and McQueston [2].]