Probability and Statistics Cheat Sheet

Copyright (c) Matthias Vallentin, 2010
vallentin@icir.org
November 23, 2010

This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate if you send me an email. The most recent version of this document is available at http://cs.berkeley.edu/~mavam/probstat.pdf. To reproduce, please contact me.
Contents

1  Distribution Overview
   1.1  Discrete Distributions
   1.2  Continuous Distributions
2  Probability Theory
3  Random Variables
   3.1  Transformations
4  Expectation
5  Variance
6  Inequalities
7  Distribution Relationships
8  Probability and Moment Generating Functions
9  Multivariate Distributions
   9.1  Standard Bivariate Normal
   9.2  Bivariate Normal
   9.3  Multivariate Normal
10 Convergence
   10.1 Law of Large Numbers (LLN)
   10.2 Central Limit Theorem (CLT)
11 Statistical Inference
   11.1 Point Estimation
   11.2 Normal-based Confidence Interval
   11.3 Empirical Distribution Function
   11.4 Statistical Functionals
12 Parametric Inference
   12.1 Method of Moments
   12.2 Maximum Likelihood
        12.2.1 Delta Method
   12.3 Multiparameter Models
        12.3.1 Multiparameter Delta Method
   12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
   14.1 Credible Intervals
   14.2 Function of Parameters
   14.3 Priors
        14.3.1 Conjugate Priors
   14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
   16.1 The Bootstrap
        16.1.1 Bootstrap Confidence Intervals
   16.2 Rejection Sampling
   16.3 Importance Sampling
17 Decision Theory
   17.1 Risk
   17.2 Admissibility
   17.3 Bayes Rule
   17.4 Minimax Rules
18 Linear Regression
   18.1 Simple Linear Regression
   18.2 Prediction
   18.3 Multiple Regression
   18.4 Model Selection
19 Non-parametric Function Estimation
   19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
   19.2 Non-parametric Regression
   19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
   20.1 Markov Chains
   20.2 Poisson Processes
21 Time Series
   21.1 Stationary Time Series
   21.2 Estimation of Correlation
   21.3 Non-Stationary Time Series
        21.3.1 Detrending
   21.4 ARIMA models
        21.4.1 Causality and Invertibility
22 Math
   22.1 Orthogonal Functions
   22.2 Series
   22.3 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions

For each distribution: CDF $F_X(x)$, PMF $f_X(x)$, mean $E[X]$, variance $V[X]$, and MGF $M_X(s)$.

Uniform$\{a,\dots,b\}$
  $F_X(x) = 0$ for $x < a$, $\frac{\lfloor x\rfloor - a + 1}{b - a + 1}$ for $a \le x \le b$, $1$ for $x > b$
  $f_X(x) = \frac{I(a \le x \le b)}{b - a + 1}$,  $E[X] = \frac{a+b}{2}$,  $V[X] = \frac{(b-a+1)^2 - 1}{12}$,  $M_X(s) = \frac{e^{as} - e^{(b+1)s}}{(b-a+1)(1 - e^{s})}$

Bernoulli$(p)$
  $F_X(x) = (1-p)^{1-x}$ on $x \in \{0, 1\}$,  $f_X(x) = p^x (1-p)^{1-x}$,  $E[X] = p$,  $V[X] = p(1-p)$,  $M_X(s) = 1 - p + p e^s$

Binomial$(n, p)$
  $F_X(x) = I_{1-p}(n - x, x + 1)$,  $f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$,  $E[X] = np$,  $V[X] = np(1-p)$,  $M_X(s) = (1 - p + p e^s)^n$

Multinomial$(n, p)$
  $f_X(x) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^{k} x_i = n$,  $E[X_i] = n p_i$,  $V[X_i] = n p_i (1 - p_i)$,  $M_X(s) = \left( \sum_{i=1}^{k} p_i e^{s_i} \right)^n$

Hypergeometric$(N, m, n)$
  $F_X(x) \approx \Phi\!\left( \frac{x - np}{\sqrt{np(1-p)}} \right)$ with $p = m/N$,  $f_X(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$,  $E[X] = \frac{nm}{N}$,  $V[X] = \frac{nm(N-n)(N-m)}{N^2(N-1)}$,  $M_X(s)$: N/A

NegativeBinomial$(r, p)$
  $F_X(x) = I_p(r, x + 1)$,  $f_X(x) = \binom{x + r - 1}{r - 1} p^r (1-p)^x$,  $E[X] = \frac{r(1-p)}{p}$,  $V[X] = \frac{r(1-p)}{p^2}$,  $M_X(s) = \left( \frac{p}{1 - (1-p)e^s} \right)^r$

Geometric$(p)$
  $F_X(x) = 1 - (1-p)^x$ ($x \in \mathbb{N}^+$),  $f_X(x) = p(1-p)^{x-1}$ ($x \in \mathbb{N}^+$),  $E[X] = \frac{1}{p}$,  $V[X] = \frac{1-p}{p^2}$,  $M_X(s) = \frac{p e^s}{1 - (1-p)e^s}$

Poisson$(\lambda)$
  $F_X(x) = e^{-\lambda} \sum_{i=0}^{\lfloor x\rfloor} \frac{\lambda^i}{i!}$,  $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$,  $E[X] = \lambda$,  $V[X] = \lambda$,  $M_X(s) = e^{\lambda(e^s - 1)}$
[Figure: PMFs of the discrete Uniform, Binomial (n=40, p=0.3; n=30, p=0.6; n=25, p=0.9), Geometric (p=0.2, 0.5, 0.8), and Poisson (lambda = 1, 4, 10) distributions.]
1.2 Continuous Distributions

Uniform$(a, b)$
  $F_X(x) = 0$ for $x < a$, $\frac{x-a}{b-a}$ for $a < x < b$, $1$ for $x > b$
  $f_X(x) = \frac{I(a < x < b)}{b - a}$,  $E[X] = \frac{a+b}{2}$,  $V[X] = \frac{(b-a)^2}{12}$,  $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$

Normal$(\mu, \sigma^2)$
  $F_X(x) = \Phi\!\left( \frac{x-\mu}{\sigma} \right)$ where $\Phi(x) = \int_{-\infty}^{x} \phi(t)\, dt$,  $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$
  $E[X] = \mu$,  $V[X] = \sigma^2$,  $M_X(s) = \exp\!\left( \mu s + \frac{\sigma^2 s^2}{2} \right)$

Log-Normal$(\mu, \sigma^2)$
  $F_X(x) = \frac{1}{2} + \frac{1}{2}\operatorname{erf}\!\left( \frac{\ln x - \mu}{\sqrt{2\sigma^2}} \right)$,  $f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)$,  $E[X] = e^{\mu + \sigma^2/2}$,  $V[X] = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$

Multivariate Normal$(\mu, \Sigma)$
  $f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2} \exp\!\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right)$,  $E[X] = \mu$,  $V[X] = \Sigma$,  $M_X(s) = \exp\!\left( \mu^T s + \frac{1}{2} s^T \Sigma s \right)$

Chi-square$(k)$
  $F_X(x) = \frac{1}{\Gamma(k/2)}\gamma\!\left( \frac{k}{2}, \frac{x}{2} \right)$,  $f_X(x) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$,  $E[X] = k$,  $V[X] = 2k$,  $M_X(s) = (1 - 2s)^{-k/2}$ for $s < 1/2$

Exponential$(\beta)$
  $F_X(x) = 1 - e^{-x/\beta}$,  $f_X(x) = \frac{1}{\beta} e^{-x/\beta}$,  $E[X] = \beta$,  $V[X] = \beta^2$,  $M_X(s) = \frac{1}{1 - \beta s}$ ($s < 1/\beta$)

Gamma$(\alpha, \beta)$
  $F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$,  $f_X(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} e^{-x/\beta}$,  $E[X] = \alpha\beta$,  $V[X] = \alpha\beta^2$,  $M_X(s) = \left( \frac{1}{1 - \beta s} \right)^{\alpha}$ ($s < 1/\beta$)

InverseGamma$(\alpha, \beta)$
  $F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$,  $f_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{-\alpha-1} e^{-\beta/x}$,  $E[X] = \frac{\beta}{\alpha - 1}$ ($\alpha > 1$),  $V[X] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$),  $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_{\alpha}\!\left( \sqrt{-4\beta s} \right)$

Dirichlet$(\alpha)$
  $f_X(x) = \frac{\Gamma\!\left( \sum_{i=1}^{k}\alpha_i \right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$,  $E[X_i] = \frac{\alpha_i}{\sum_{j=1}^{k}\alpha_j}$,  $V[X_i] = \frac{E[X_i](1 - E[X_i])}{\sum_{i=1}^{k}\alpha_i + 1}$

Beta$(\alpha, \beta)$
  $F_X(x) = I_x(\alpha, \beta)$,  $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$,  $E[X] = \frac{\alpha}{\alpha+\beta}$,  $V[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
  $M_X(s) = 1 + \sum_{k=1}^{\infty}\left( \prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r} \right)\frac{s^k}{k!}$

Weibull$(\lambda, k)$
  $F_X(x) = 1 - e^{-(x/\lambda)^k}$,  $f_X(x) = \frac{k}{\lambda}\left( \frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k}$,  $E[X] = \lambda\Gamma\!\left( 1 + \frac{1}{k} \right)$,  $V[X] = \lambda^2\Gamma\!\left( 1 + \frac{2}{k} \right) - E[X]^2$
  $M_X(s) = \sum_{n=0}^{\infty}\frac{s^n\lambda^n}{n!}\Gamma\!\left( 1 + \frac{n}{k} \right)$

Pareto$(x_m, \alpha)$
  $F_X(x) = 1 - \left( \frac{x_m}{x} \right)^{\alpha}$ for $x \ge x_m$,  $f_X(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}$ for $x \ge x_m$,  $E[X] = \frac{\alpha x_m}{\alpha - 1}$ ($\alpha > 1$),  $V[X] = \frac{x_m^2\alpha}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$)
  $M_X(s) = \alpha(-x_m s)^{\alpha}\Gamma(-\alpha, -x_m s)$ for $s < 0$
[Figure: PDFs of the continuous Uniform, Normal, Log-normal, Chi-square (k = 1, ..., 5), Exponential (beta = 2, 1, 0.4), Gamma, InverseGamma, Beta, Weibull, and Pareto distributions for selected parameter values.]
2 Probability Theory

Definitions
- Sample space $\Omega$
- Outcome (point or element) $\omega \in \Omega$
- Event $A \subseteq \Omega$
- $\sigma$-algebra $\mathcal{A}$:
  1. $\emptyset \in \mathcal{A}$
  2. $A_1, A_2, \ldots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
  3. $A \in \mathcal{A} \implies A^c \in \mathcal{A}$
- Probability distribution $P$:
  1. $P[A] \ge 0$ for every $A$
  2. $P[\Omega] = 1$
  3. For disjoint $A_1, A_2, \ldots$: $P\!\left[ \bigcup_{i=1}^{\infty} A_i \right] = \sum_{i=1}^{\infty} P[A_i]$
- Probability space $(\Omega, \mathcal{A}, P)$

Properties
- $P[\emptyset] = 0$
- $B = B \cap \Omega = B \cap (A \cup A^c) = (A \cap B) \cup (A^c \cap B)$
- $P[A^c] = 1 - P[A]$
- $P[B] = P[A \cap B] + P[A^c \cap B]$
- De Morgan: $\left( \bigcup_n A_n \right)^c = \bigcap_n A_n^c$ and $\left( \bigcap_n A_n \right)^c = \bigcup_n A_n^c$
- $P\!\left[ \bigcup_n A_n \right] = 1 - P\!\left[ \bigcap_n A_n^c \right]$
- $P[A \cup B] = P[A] + P[B] - P[A \cap B]$, hence $P[A \cup B] \le P[A] + P[B]$
- $P[A \cup B] = P[A \cap B^c] + P[A^c \cap B] + P[A \cap B]$
- $P[A \cap B^c] = P[A] - P[A \cap B]$
- Continuity: $A_1 \subset A_2 \subset \cdots$ and $A = \bigcup_{n=1}^{\infty} A_n \implies P[A_n] \to P[A]$; $A_1 \supset A_2 \supset \cdots$ and $A = \bigcap_{n=1}^{\infty} A_n \implies P[A_n] \to P[A]$

Independence
$A \perp B \iff P[A \cap B] = P[A]\, P[B]$

Conditional Probability
$P[A \mid B] = \frac{P[A \cap B]}{P[B]}$ if $P[B] > 0$

Law of Total Probability
$P[B] = \sum_{i=1}^{n} P[B \mid A_i]\, P[A_i]$ where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

Bayes' Theorem
$P[A_i \mid B] = \frac{P[B \mid A_i]\, P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\, P[A_j]}$ where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

Inclusion-Exclusion Principle
$P\!\left[ \bigcup_{i=1}^{n} A_i \right] = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i_1 < \cdots < i_r \le n} P\!\left[ A_{i_1} \cap \cdots \cap A_{i_r} \right]$
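As a quick numerical illustration of Bayes' theorem for a binary partition, here is a minimal Python sketch; the diagnostic-test numbers (1% prevalence, 99% sensitivity, 5% false-positive rate) are made up for the example.

```python
# Bayes' theorem for a partition {A, A^c}:
# P[A | B] = P[B | A] P[A] / (P[B | A] P[A] + P[B | A^c] P[A^c])

def bayes_posterior(prior, p_b_given_a, p_b_given_not_a):
    """Posterior P[A | B] from the prior P[A] and the two conditional probabilities."""
    numerator = p_b_given_a * prior
    evidence = numerator + p_b_given_not_a * (1.0 - prior)  # law of total probability
    return numerator / evidence

# Hypothetical test: prevalence 1%, sensitivity 99%, false-positive rate 5%.
print(bayes_posterior(prior=0.01, p_b_given_a=0.99, p_b_given_not_a=0.05))  # ~0.167
```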

3 Random Variables

Random Variable
$X : \Omega \to \mathbb{R}$

Probability Mass Function (PMF)
$f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]$

Probability Density Function (PDF)
$P[a \le X \le b] = \int_a^b f_X(x)\, dx$

Cumulative Distribution Function (CDF)
$F_X : \mathbb{R} \to [0, 1]$,  $F_X(x) = P[X \le x]$
1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
2. Normalized: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
3. Right-continuous: $\lim_{y \downarrow x} F(y) = F(x)$

Conditional density
$P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\, dy$ for $a \le b$,  $f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$

Independence
1. $P[X \le x, Y \le y] = P[X \le x]\, P[Y \le y]$
2. $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$
3.1 Transformations

Transformation function
$Z = \varphi(X)$

Discrete
$f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P\!\left[ X \in \varphi^{-1}(z) \right] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous
$F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x)\, dx$ with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ is strictly monotone
$f_Z(z) = f_X\!\left( \varphi^{-1}(z) \right)\left| \frac{d}{dz}\varphi^{-1}(z) \right| = f_X(x)\left| \frac{dx}{dz} \right| = f_X(x)\frac{1}{|J|}$

The Rule of the Lazy Statistician
$E[Z] = \int \varphi(x)\, dF_X(x)$
$E[I_A(X)] = \int I_A(x)\, dF_X(x) = \int_A dF_X(x) = P[X \in A]$

Convolution
$Z := X + Y$:  $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\, dx$, which for $X, Y \ge 0$ equals $\int_0^z f_{X,Y}(x, z - x)\, dx$
$Z := |X - Y|$:  $f_Z(z) = 2\int_0^{\infty} f_{X,Y}(x, z + x)\, dx$
$Z := \frac{X}{Y}$:  $f_Z(z) = \int_{-\infty}^{\infty} |y| f_{X,Y}(yz, y)\, dy$, which under $X \perp Y$ equals $\int_{-\infty}^{\infty} |y| f_X(yz) f_Y(y)\, dy$
4 Expectation

Expectation
$E[X] = \mu_X = \int x\, dF_X(x) = \sum_x x f_X(x)$ ($X$ discrete) or $\int x f_X(x)\, dx$ ($X$ continuous)

Properties
- $P[X = c] = 1 \implies E[c] = c$
- $E[cX] = c\, E[X]$
- $E[X + Y] = E[X] + E[Y]$
- $E[XY] = \iint xy\, f_{X,Y}(x, y)\, dx\, dy$
- $E[\varphi(X)] \ne \varphi(E[X])$ in general (cf. Jensen's inequality)
- $P[X \ge Y] = 1 \implies E[X] \ge E[Y]$;  $P[X = Y] = 1 \implies E[X] = E[Y]$
- For non-negative integer-valued $X$: $E[X] = \sum_{x=1}^{\infty} P[X \ge x]$

Sample mean
$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$

Conditional Expectation
- $E[Y \mid X = x] = \int y f(y \mid x)\, dy$
- $E[X] = E[E[X \mid Y]]$
- $E[\varphi(X, Y) \mid X = x] = \int \varphi(x, y) f_{Y|X}(y \mid x)\, dy$
- $E[\varphi(Y, Z) \mid X = x] = \iint \varphi(y, z) f_{(Y,Z)|X}(y, z \mid x)\, dy\, dz$
- $E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]$
- $E[\varphi(X)Y \mid X] = \varphi(X)\, E[Y \mid X]$
- $E[Y \mid X] = c \implies \operatorname{Cov}[X, Y] = 0$
5 Variance

Variance
$V[X] = \sigma_X^2 = E\!\left[ (X - E[X])^2 \right] = E[X^2] - E[X]^2$
$V\!\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i] + 2\sum_{i \ne j}\operatorname{Cov}[X_i, X_j]$
$V\!\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i]$ if the $X_i$ are independent

Standard deviation
$\operatorname{sd}[X] = \sqrt{V[X]} = \sigma_X$

Covariance
- $\operatorname{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\, E[Y]$
- $\operatorname{Cov}[X, a] = 0$
- $\operatorname{Cov}[X, X] = V[X]$
- $\operatorname{Cov}[X, Y] = \operatorname{Cov}[Y, X]$
- $\operatorname{Cov}[aX, bY] = ab\,\operatorname{Cov}[X, Y]$
- $\operatorname{Cov}[X + a, Y + b] = \operatorname{Cov}[X, Y]$
- $\operatorname{Cov}\!\left[ \sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j \right] = \sum_{i=1}^{n}\sum_{j=1}^{m}\operatorname{Cov}[X_i, Y_j]$

Correlation
$\rho[X, Y] = \frac{\operatorname{Cov}[X, Y]}{\sqrt{V[X]\, V[Y]}}$

Independence
$X \perp Y \implies \rho[X, Y] = 0 \iff \operatorname{Cov}[X, Y] = 0 \iff E[XY] = E[X]\, E[Y]$

Sample variance
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$

Conditional Variance
$V[Y \mid X] = E\!\left[ (Y - E[Y \mid X])^2 \mid X \right] = E[Y^2 \mid X] - E[Y \mid X]^2$
$V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$
6 Inequalities

Cauchy-Schwarz:  $E[XY]^2 \le E[X^2]\, E[Y^2]$
Markov:  $P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}$ for non-negative $\varphi$
Chebyshev:  $P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}$
Chernoff:  $P[X \ge (1 + \delta)\mu] \le \left( \frac{e^{\delta}}{(1 + \delta)^{1+\delta}} \right)^{\mu}$ for $\delta > -1$
Jensen:  $E[\varphi(X)] \ge \varphi(E[X])$ for convex $\varphi$
7 Distribution Relationships

Binomial
- $X_i \sim \text{Bernoulli}(p) \implies \sum_{i=1}^{n} X_i \sim \text{Bin}(n, p)$
- $X \sim \text{Bin}(n, p)$, $Y \sim \text{Bin}(m, p)$, $X \perp Y \implies X + Y \sim \text{Bin}(n + m, p)$
- $\lim_{n \to \infty}\text{Bin}(n, p) = \text{Poisson}(np)$ ($n$ large, $p$ small)
- $\text{Bin}(n, p) \approx \mathcal{N}(np, np(1-p))$ ($n$ large, $p$ far from 0 and 1)

Negative Binomial
- $X \sim \text{NBin}(1, p) = \text{Geometric}(p)$
- $X \sim \text{NBin}(r, p)$ is the sum of $r$ iid $\text{Geometric}(p)$ rvs
- $X_i \sim \text{NBin}(r_i, p)$, independent $\implies \sum_i X_i \sim \text{NBin}\!\left( \sum_i r_i, p \right)$
- $X \sim \text{NBin}(r, p)$, $Y \sim \text{Bin}(s + r, p) \implies P[X \le s] = P[Y \ge r]$

Poisson
- $X_i \sim \text{Poisson}(\lambda_i)$, $X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \text{Poisson}\!\left( \sum_{i=1}^{n}\lambda_i \right)$
- $X_i \sim \text{Poisson}(\lambda_i)$, $X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \text{Bin}\!\left( \sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n}\lambda_j} \right)$

Exponential
- $X_i \sim \text{Exp}(\beta)$, $X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \text{Gamma}(n, \beta)$
- Memoryless property: $P[X > x + y \mid X > y] = P[X > x]$

Normal
- $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \sigma^2)$, $Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
- $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$, $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$, $X \perp Y \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim \mathcal{N}\!\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)$
- $P[a < X \le b] = \Phi\!\left( \frac{b - \mu}{\sigma} \right) - \Phi\!\left( \frac{a - \mu}{\sigma} \right)$
- $\Phi(-x) = 1 - \Phi(x)$,  $\phi'(x) = -x\phi(x)$,  $\phi''(x) = (x^2 - 1)\phi(x)$
- Upper quantile of $\mathcal{N}(0, 1)$: $z_{\alpha} = \Phi^{-1}(1 - \alpha)$

Gamma (distribution)
- $X \sim \text{Gamma}(\alpha, \beta) \iff X/\beta \sim \text{Gamma}(\alpha, 1)$
- $\text{Gamma}(\alpha, \beta)$ with integer $\alpha$ is the sum of $\alpha$ iid $\text{Exp}(\beta)$ rvs
- $X_i \sim \text{Gamma}(\alpha_i, \beta)$, $X_i \perp X_j \implies \sum_i X_i \sim \text{Gamma}\!\left( \sum_i \alpha_i, \beta \right)$
- $\frac{\Gamma(\alpha)}{\beta^{\alpha}} = \int_0^{\infty} x^{\alpha-1} e^{-\beta x}\, dx$

Gamma (function)
- Ordinary: $\Gamma(s) = \int_0^{\infty} t^{s-1} e^{-t}\, dt$
- Upper incomplete: $\Gamma(s, x) = \int_x^{\infty} t^{s-1} e^{-t}\, dt$
- Lower incomplete: $\gamma(s, x) = \int_0^x t^{s-1} e^{-t}\, dt$
- $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$ for $\alpha > 0$
- $\Gamma(n) = (n-1)!$ for $n \in \mathbb{N}$
- $\Gamma(1/2) = \sqrt{\pi}$

Beta (distribution)
- $f_X(x) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$
- $E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1} E[X^{k-1}]$
- $\text{Beta}(1, 1) \sim \text{Unif}(0, 1)$

Beta (function)
- Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1}(1-t)^{y-1}\, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$
- Incomplete: $B(x; a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\, dt$
- Regularized incomplete: $I_x(a, b) = \frac{B(x; a, b)}{B(a, b)}$, which for $a, b \in \mathbb{N}$ equals $\sum_{j=a}^{a+b-1}\frac{(a+b-1)!}{j!(a+b-1-j)!} x^j (1-x)^{a+b-1-j}$
- $I_0(a, b) = 0$,  $I_1(a, b) = 1$,  $I_x(a, b) = 1 - I_{1-x}(b, a)$
8 Probability and Moment Generating Functions

- PGF: $G_X(t) = E[t^X]$ for $|t| < 1$
- MGF: $M_X(t) = G_X(e^t) = E[e^{Xt}] = E\!\left[ \sum_{i=0}^{\infty}\frac{(Xt)^i}{i!} \right] = \sum_{i=0}^{\infty}\frac{E[X^i]}{i!} t^i$
- $P[X = 0] = G_X(0)$
- $P[X = 1] = G'_X(0)$
- $P[X = i] = \frac{G_X^{(i)}(0)}{i!}$
- $E[X] = G'_X(1^-)$
- $E[X^k] = M_X^{(k)}(0)$
- $E\!\left[ \frac{X!}{(X-k)!} \right] = G_X^{(k)}(1^-)$
- $V[X] = G''_X(1^-) + G'_X(1^-) - \left( G'_X(1^-) \right)^2$
- $G_X(t) = G_Y(t) \implies X \stackrel{d}{=} Y$
9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp Z$, and set $Y = \rho X + \sqrt{1 - \rho^2}\, Z$.

Joint density
$f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\exp\!\left( -\frac{x^2 + y^2 - 2\rho xy}{2(1 - \rho^2)} \right)$

Conditionals
$(Y \mid X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$ and $(X \mid Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$

Independence
$X \perp Y \iff \rho = 0$

9.2 Bivariate Normal

Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$ with correlation $\rho$.
$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}}\exp\!\left( -\frac{z}{2(1 - \rho^2)} \right)$
$z = \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho\left( \frac{x - \mu_x}{\sigma_x} \right)\left( \frac{y - \mu_y}{\sigma_y} \right)$

Conditional mean and variance
$E[X \mid Y] = E[X] + \rho\frac{\sigma_X}{\sigma_Y}(Y - E[Y])$
$V[X \mid Y] = \sigma_X^2(1 - \rho^2)$

9.3 Multivariate Normal

Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)
$\Sigma = \begin{pmatrix} V[X_1] & \cdots & \operatorname{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \operatorname{Cov}[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}$

If $X \sim \mathcal{N}(\mu, \Sigma)$,
$f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2}\exp\!\left( -\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) \right)$

Properties
- $Z \sim \mathcal{N}(0, 1)$ and $X = \mu + \Sigma^{1/2}Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, I)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
- $X \sim \mathcal{N}(\mu, \Sigma)$, $a$ a vector of length $k \implies a^T X \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$
10 Convergence
Let {X
1
, X
2
, . . .} be a sequence of rvs and let X be another rv. Let F
n
denote
the cdf of X
n
and let F denote the cdf of X.
Types of Convergence
1. In distribution (weakly, in law): X
n
D
X
lim
n
F
n
(t) = F(t) t where F continuous
2. In probability: X
n
P
X
( > 0) lim
n
P [|X
n
X| > ] = 0
3. Almost surely (strongly): X
n
as
X
P

lim
n
X
n
= X

= P

: lim
n
X
n
() = X()

= 1
4. In quadratic mean (L
2
): X
n
qm
X
lim
n
E

(X
n
X)
2

= 0
Relationships
X
n
qm
X = X
n
P
X = X
n
D
X
X
n
as
X = X
n
P
X
X
n
D
X (c R) P [X = c] = 1 = X
n
P
X
X
n
P
X Y
n
P
Y = X
n
+Y
n
P
X +Y
X
n
qm
X Y
n
qm
Y = X
n
+Y
n
qm
X +Y
X
n
P
X Y
n
P
Y = X
n
Y
n
P
XY
X
n
P
X = (X
n
)
P
(X)
X
n
D
X = (X
n
)
D
(X)
X
n
qm
b lim
n
E [X
n
] = b lim
n
V [X
n
] = 0
X
1
, . . . , X
n
iid E [X] = V [X] <

X
n
qm

Slutzkys Theorem
X
n
D
X and Y
n
P
c = X
n
+Y
n
D
X +c
X
n
D
X and Y
n
P
c = X
n
Y
n
D
cX
In general: X
n
D
X and Y
n
D
Y = X
n
+Y
n
D
X +Y
10.1 Law of Large Numbers (LLN)
Let {X
1
, . . . , X
n
} be a sequence of iid rvs, E [X
1
] = , and V [X
1
] < .
Weak (WLLN)

X
n
P
as n
Strong (SLLN)

X
n
as
as n
10.2 Central Limit Theorem (CLT)
Let {X
1
, . . . , X
n
} be a sequence of iid rvs, E [X
1
] = , and V [X
1
] =
2
.
Z
n
:=

X
n

X
n

n(

X
n
)

D
Z where Z N (0, 1)
lim
n
P [Z
n
z] = (z) z R
CLT Notations
Z
n
N (0, 1)

X
n
N

,

2
n

X
n
N

0,

2
n

n(

X
n
) N

0,
2

n(

X
n
)
n
N (0, 1)
Continuity Correction
P

X
n
x

x +
1
2

/

X
n
x

x
1
2

/

Delta Method
Y
n
N

,

2
n

= (Y
n
) N

(), (

())
2
2
n

10
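A quick simulation is a useful sanity check of the CLT. This is a minimal numpy sketch; the Exponential(beta = 2) population is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta = 50, 10_000, 2.0           # sample size, replications, Exp(beta) scale
mu, sigma = beta, beta                     # for Exponential(beta): E[X] = beta, sd(X) = beta

xbar = rng.exponential(beta, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma       # standardized sample means Z_n

# Should be close to mean 0, sd 1, and upper quantile ~1.96 despite the skewed population.
print(z.mean(), z.std(), np.quantile(z, 0.975))
```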
11 Statistical Inference

Let $X_1, \ldots, X_n \stackrel{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation

- Point estimator $\hat{\theta}_n$ of $\theta$ is a rv: $\hat{\theta}_n = g(X_1, \ldots, X_n)$
- Bias: $\operatorname{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta$
- Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
- Sampling distribution: $F(\hat{\theta}_n)$
- Standard error: $\operatorname{se}(\hat{\theta}_n) = \sqrt{V[\hat{\theta}_n]}$
- Mean squared error: $\operatorname{mse} = E\!\left[ (\hat{\theta}_n - \theta)^2 \right] = \operatorname{bias}(\hat{\theta}_n)^2 + V[\hat{\theta}_n]$
- $\lim_{n \to \infty}\operatorname{bias}(\hat{\theta}_n) = 0$ and $\lim_{n \to \infty}\operatorname{se}(\hat{\theta}_n) = 0 \implies \hat{\theta}_n$ is consistent
- Asymptotic normality: $\frac{\hat{\theta}_n - \theta}{\operatorname{se}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
- Slutsky's Theorem often lets us replace $\operatorname{se}(\hat{\theta}_n)$ by some (weakly) consistent estimator $\hat{\sigma}_n$.
11.2 Normal-based Confidence Interval

Suppose $\hat{\theta}_n \approx \mathcal{N}(\theta, \widehat{\operatorname{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, i.e., $P[Z > z_{\alpha/2}] = \alpha/2$ and $P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$ where $Z \sim \mathcal{N}(0, 1)$. Then
$C_n = \hat{\theta}_n \pm z_{\alpha/2}\,\widehat{\operatorname{se}}$
11.3 Empirical Distribution Function

Empirical Distribution Function (ECDF)
$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$,  $I(X_i \le x) = 1$ if $X_i \le x$ and $0$ if $X_i > x$

Properties (for any fixed $x$)
- $E[\hat{F}_n(x)] = F(x)$
- $V[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}$
- $\operatorname{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
- $\hat{F}_n(x) \stackrel{P}{\to} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality ($X_1, \ldots, X_n \sim F$)
$P\!\left[ \sup_x\left| F(x) - \hat{F}_n(x) \right| > \varepsilon \right] \le 2e^{-2n\varepsilon^2}$

Nonparametric $1 - \alpha$ confidence band for $F$
$L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}$,  $U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}$,  $\varepsilon_n = \sqrt{\frac{1}{2n}\log\frac{2}{\alpha}}$
$P[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$
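The ECDF and its DKW band are easy to compute directly. Below is a minimal numpy sketch; the standard normal sample and the 95% level are illustrative choices.

```python
import numpy as np

def ecdf_with_dkw_band(x, alpha=0.05):
    """ECDF at the sorted sample points, plus the 1 - alpha DKW confidence band."""
    x = np.sort(np.asarray(x))
    n = x.size
    F_hat = np.arange(1, n + 1) / n                   # ECDF at the order statistics
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))    # DKW half-width epsilon_n
    lower = np.clip(F_hat - eps, 0.0, 1.0)
    upper = np.clip(F_hat + eps, 0.0, 1.0)
    return x, F_hat, lower, upper

rng = np.random.default_rng(1)
grid, F_hat, lo, hi = ecdf_with_dkw_band(rng.normal(size=200))
print(F_hat[:5], lo[:5], hi[:5])
```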


11.4 Statistical Functionals

- Statistical functional: $T(F)$
- Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
- Linear functional: $T(F) = \int \varphi(x)\, dF_X(x)$
- Plug-in estimator for a linear functional: $T(\hat{F}_n) = \int \varphi(x)\, d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\varphi(X_i)$
- Often $T(\hat{F}_n) \approx \mathcal{N}\!\left( T(F), \widehat{\operatorname{se}}^2 \right)$, giving the interval $T(\hat{F}_n) \pm z_{\alpha/2}\,\widehat{\operatorname{se}}$
- $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
- $\hat{\mu} = \bar{X}_n$
- $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$
- Skewness: $\hat{\kappa} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
- Correlation: $\hat{\rho} = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2}}$
12 Parametric Inference

Let $\mathcal{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subseteq \mathbb{R}^k$ and parameter $\theta = (\theta_1, \ldots, \theta_k)$.
12.1 Method of Moments

$j^{\text{th}}$ moment
$\alpha_j(\theta) = E[X^j] = \int x^j\, dF_X(x)$

$j^{\text{th}}$ sample moment
$\hat{\alpha}_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j$

Method of Moments estimator (MoM): solve the system
$\alpha_1(\theta) = \hat{\alpha}_1,\quad \alpha_2(\theta) = \hat{\alpha}_2,\quad \ldots,\quad \alpha_k(\theta) = \hat{\alpha}_k$

Properties of the MoM estimator
- $\hat{\theta}_n$ exists with probability tending to 1
- Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
- Asymptotic normality: $\sqrt{n}(\hat{\theta}_n - \theta) \stackrel{D}{\to} \mathcal{N}(0, \Sigma)$
  where $\Sigma = g\, E[YY^T]\, g^T$, $Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)$ and $g_j = \frac{\partial\alpha_j^{-1}(\theta)}{\partial\theta}$
12.2 Maximum Likelihood

Likelihood: $\mathcal{L}_n : \Theta \to [0, \infty)$
$\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$

Log-likelihood
$\ell_n(\theta) = \log\mathcal{L}_n(\theta) = \sum_{i=1}^{n}\log f(X_i; \theta)$

Maximum Likelihood Estimator (mle)
$\mathcal{L}_n(\hat{\theta}_n) = \sup_{\theta}\mathcal{L}_n(\theta)$

Score function
$s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

Fisher information
$I(\theta) = V_{\theta}[s(X; \theta)]$,  $I_n(\theta) = nI(\theta)$

Fisher information (exponential family)
$I(\theta) = -E_{\theta}\!\left[ \frac{\partial}{\partial\theta} s(X; \theta) \right]$

Observed Fisher information
$I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i; \theta)$

Properties of the mle
- Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
- Equivariance: $\hat{\theta}_n$ is the mle of $\theta \implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
- Asymptotic normality:
  1. $\operatorname{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat{\theta}_n - \theta}{\operatorname{se}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
  2. $\widehat{\operatorname{se}} \approx \sqrt{1/I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n - \theta}{\widehat{\operatorname{se}}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
- Asymptotic optimality (smallest variance for large samples): if $\tilde{\theta}_n$ is any other estimator, then $\operatorname{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{V[\hat{\theta}_n]}{V[\tilde{\theta}_n]} \le 1$
- Approximately the Bayes estimator
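A minimal numpy sketch of the generic recipe (maximize the log-likelihood numerically, then attach a normal-based interval via the Fisher information), using a Poisson model where the answer is known in closed form; the grid search is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.0, size=500)          # data from Poisson(3)

def loglik(lam):
    # Poisson log-likelihood up to an additive constant (the log x! terms drop out)
    return np.sum(x) * np.log(lam) - x.size * lam

grid = np.linspace(0.1, 10.0, 10_000)
lam_hat = grid[np.argmax([loglik(l) for l in grid])]   # numerical mle

# Closed-form mle and normal-based se from the Fisher information I_n(lam) = n/lam
lam_closed = x.mean()
se_hat = np.sqrt(lam_closed / x.size)
print(lam_hat, lam_closed, (lam_closed - 1.96 * se_hat, lam_closed + 1.96 * se_hat))
```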
12.2.1 Delta Method

If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
$\frac{\hat{\tau}_n - \tau}{\widehat{\operatorname{se}}(\hat{\tau}_n)} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\hat{\tau}_n = \varphi(\hat{\theta}_n)$ is the mle of $\tau$ and $\widehat{\operatorname{se}}(\hat{\tau}_n) = |\varphi'(\hat{\theta}_n)|\,\widehat{\operatorname{se}}(\hat{\theta}_n)$.
12.3 Multiparameter Models

Let $\theta = (\theta_1, \ldots, \theta_k)$ and let $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$ be the mle.
$H_{jj} = \frac{\partial^2\ell_n}{\partial\theta_j^2}$,  $H_{jk} = \frac{\partial^2\ell_n}{\partial\theta_j\partial\theta_k}$

Fisher information matrix
$I_n(\theta) = -\begin{pmatrix} E_{\theta}[H_{11}] & \cdots & E_{\theta}[H_{1k}] \\ \vdots & \ddots & \vdots \\ E_{\theta}[H_{k1}] & \cdots & E_{\theta}[H_{kk}] \end{pmatrix}$

Under appropriate regularity conditions
$(\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n)$
with $J_n(\theta) = I_n^{-1}(\theta)$. Further, if $\hat{\theta}_j$ is the $j^{\text{th}}$ component of $\hat{\theta}$, then
$\frac{\hat{\theta}_j - \theta_j}{\widehat{\operatorname{se}}_j} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\widehat{\operatorname{se}}_j^2 = J_n(j, j)$ and $\operatorname{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$.
12.3.1 Multiparameter Delta Method

Let $\tau = \varphi(\theta_1, \ldots, \theta_k)$ and let the gradient of $\varphi$ be
$\nabla\varphi = \left( \frac{\partial\varphi}{\partial\theta_1}, \ldots, \frac{\partial\varphi}{\partial\theta_k} \right)^T$

Suppose $\nabla\varphi\big|_{\theta=\hat{\theta}} \ne 0$ and let $\hat{\tau} = \varphi(\hat{\theta})$. Then
$\frac{\hat{\tau} - \tau}{\widehat{\operatorname{se}}(\hat{\tau})} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\widehat{\operatorname{se}}(\hat{\tau}) = \sqrt{\left( \hat{\nabla}\varphi \right)^T\hat{J}_n\left( \hat{\nabla}\varphi \right)}$, $\hat{J}_n = J_n(\hat{\theta})$, and $\hat{\nabla}\varphi = \nabla\varphi\big|_{\theta=\hat{\theta}}$.
12.4 Parametric Bootstrap

Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or the method of moments estimator.
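A minimal sketch of the parametric bootstrap in numpy, assuming an Exponential model for the data; the statistic (the scale mle) and the normal-based interval are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100)    # observed data, modeled as Exponential(beta)

beta_hat = x.mean()                         # mle of the scale parameter
B = 2000
boot = np.empty(B)
for b in range(B):
    # Parametric bootstrap: resample from the *fitted* model f(x; beta_hat), not from F_hat_n
    x_star = rng.exponential(scale=beta_hat, size=x.size)
    boot[b] = x_star.mean()                 # re-estimate on each synthetic sample

se_boot = boot.std(ddof=1)
print(beta_hat, se_boot, (beta_hat - 1.96 * se_boot, beta_hat + 1.96 * se_boot))
```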
13 Hypothesis Testing

$H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$

Definitions
- Null hypothesis $H_0$
- Alternative hypothesis $H_1$
- Simple hypothesis: $\theta = \theta_0$
- Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
- Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
- One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
- Critical value $c$
- Test statistic $T$
- Rejection region $R = \{x : T(x) > c\}$
- Power function $\beta(\theta) = P_{\theta}[X \in R]$
- Power of a test: $1 - P[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1}\beta(\theta)$
- Test size: $\alpha = P[\text{Type I error}] = \sup_{\theta \in \Theta_0}\beta(\theta)$

              Retain H_0               Reject H_0
  H_0 true    correct                  Type I error (alpha)
  H_1 true    Type II error (beta)     correct (power)

p-value
$\text{p-value} = \sup_{\theta \in \Theta_0} P_{\theta}[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_{\alpha}\}$
$\text{p-value} = \sup_{\theta \in \Theta_0} P_{\theta}[T(X^{\star}) \ge T(X)] = 1 - F(T(X))$ since $T(X^{\star}) \sim F$

  p-value      evidence
  < 0.01       very strong evidence against H_0
  0.01-0.05    strong evidence against H_0
  0.05-0.1     weak evidence against H_0
  > 0.1        little or no evidence against H_0

Wald Test
- Two-sided test
- Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat{\theta} - \theta_0}{\widehat{\operatorname{se}}}$
- $P\!\left[ |W| > z_{\alpha/2} \right] \to \alpha$
- p-value $= P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood Ratio Test (LRT)
$T(X) = \frac{\sup_{\theta \in \Theta}\mathcal{L}_n(\theta)}{\sup_{\theta \in \Theta_0}\mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat{\theta}_n)}{\mathcal{L}_n(\hat{\theta}_{n,0})}$
$\lambda(X) = 2\log T(X) \stackrel{D}{\to} \chi^2_{r-q}$, where $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ with $Z_1, \ldots, Z_k \stackrel{iid}{\sim} \mathcal{N}(0, 1)$
p-value $= P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P\!\left[ \chi^2_{r-q} > \lambda(x) \right]$

Multinomial LRT
Let $\hat{p}_n = \left( \frac{X_1}{n}, \ldots, \frac{X_k}{n} \right)$ be the mle. Then
$T(X) = \frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^{k}\left( \frac{\hat{p}_j}{p_{0j}} \right)^{X_j}$
$\lambda(X) = 2\sum_{j=1}^{k} X_j\log\left( \frac{\hat{p}_j}{p_{0j}} \right) \stackrel{D}{\to} \chi^2_{k-1}$
The approximate size-$\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$.

Pearson $\chi^2$ Test
$T = \sum_{j=1}^{k}\frac{(X_j - E[X_j])^2}{E[X_j]}$ where $E[X_j] = np_{0j}$ under $H_0$
$T \stackrel{D}{\to} \chi^2_{k-1}$,  p-value $= P\!\left[ \chi^2_{k-1} > T(x) \right]$
The $\chi^2_{k-1}$ approximation sets in faster than for the LRT, hence the Pearson test is preferable for small $n$.

Independence Testing
- $I$ rows, $J$ columns, $X$ a multinomial sample over the $I \cdot J$ cells
- mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
- mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}$
- LRT: $\lambda = 2\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}\log\left( \frac{nX_{ij}}{X_{i\cdot}X_{\cdot j}} \right)$
- Pearson $\chi^2$: $T = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}$
- Both LRT and Pearson $\stackrel{D}{\to} \chi^2_{\nu}$, where $\nu = (I - 1)(J - 1)$
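A minimal numpy sketch of the Pearson chi-square independence statistic for a contingency table; the 2x3 table of counts is hypothetical.

```python
import numpy as np

def pearson_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an I x J contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / n                 # E[X_ij] = X_i. * X_.j / n under independence
    T = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return T, dof

T, dof = pearson_independence([[20, 30, 25], [35, 25, 15]])
print(T, dof)   # compare T against the chi-square(dof) upper-alpha quantile
```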
14 Bayesian Inference

Bayes' Theorem
$f(\theta \mid x^n) = \frac{f(x^n \mid \theta)f(\theta)}{f(x^n)} = \frac{f(x^n \mid \theta)f(\theta)}{\int f(x^n \mid \theta)f(\theta)\, d\theta} \propto \mathcal{L}_n(\theta)f(\theta)$

Definitions
- $X^n = (X_1, \ldots, X_n)$,  $x^n = (x_1, \ldots, x_n)$
- Prior density $f(\theta)$
- Likelihood $f(x^n \mid \theta)$: joint density of the data. In particular, if $X^n$ is iid, $f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \mathcal{L}_n(\theta)$
- Posterior density $f(\theta \mid x^n)$
- Normalizing constant $c_n = f(x^n) = \int f(x^n \mid \theta)f(\theta)\, d\theta$
- Kernel: the part of a density that depends on $\theta$
- Posterior mean $\bar{\theta}_n = \int\theta f(\theta \mid x^n)\, d\theta = \frac{\int\theta\mathcal{L}_n(\theta)f(\theta)\, d\theta}{\int\mathcal{L}_n(\theta)f(\theta)\, d\theta}$
14.1 Credible Intervals

$1 - \alpha$ posterior interval
$P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\, d\theta = 1 - \alpha$

$1 - \alpha$ equal-tail credible interval
$\int_{-\infty}^{a} f(\theta \mid x^n)\, d\theta = \int_b^{\infty} f(\theta \mid x^n)\, d\theta = \alpha/2$

$1 - \alpha$ highest posterior density (HPD) region $R_n$
1. $P[\theta \in R_n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$
$R_n$ is unimodal $\implies R_n$ is an interval.
14.2 Function of Parameters

Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
$H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\, d\theta$

Posterior density
$h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian delta method
$\tau \mid X^n \approx \mathcal{N}\!\left( \varphi(\hat{\theta}),\ \left( \widehat{\operatorname{se}}\,\varphi'(\hat{\theta}) \right)^2 \right)$
14.3 Priors

Choice
- Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
- Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
- Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
- Flat: $f(\theta) \propto$ constant
- Proper: $\int f(\theta)\, d\theta = 1$
- Improper: $\int f(\theta)\, d\theta = \infty$
- Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$; in the multiparameter case $f(\theta) \propto \sqrt{\det(I(\theta))}$
- Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family
14.3.1 Conjugate Priors

Discrete likelihoods (data $x_1, \ldots, x_n$; entries list the posterior hyperparameters)
- Bernoulli$(p)$ with Beta$(\alpha, \beta)$ prior: $\alpha + \sum_{i=1}^{n} x_i$, $\beta + n - \sum_{i=1}^{n} x_i$
- Binomial$(p)$ with Beta$(\alpha, \beta)$ prior: $\alpha + \sum_{i=1}^{n} x_i$, $\beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i$
- Negative Binomial$(p)$ with Beta$(\alpha, \beta)$ prior: $\alpha + rn$, $\beta + \sum_{i=1}^{n} x_i$
- Poisson$(\lambda)$ with Gamma$(\alpha, \beta)$ prior: $\alpha + \sum_{i=1}^{n} x_i$, $\beta + n$
- Multinomial$(p)$ with Dirichlet$(\alpha)$ prior: $\alpha + \sum_{i=1}^{n} x^{(i)}$
- Geometric$(p)$ with Beta$(\alpha, \beta)$ prior: $\alpha + n$, $\beta + \sum_{i=1}^{n} x_i$

Continuous likelihoods (subscript $c$ denotes a known constant)
- Uniform$(0, \theta)$ with Pareto$(x_m, k)$ prior: $\max\{x_{(n)}, x_m\}$, $k + n$
- Exponential$(\lambda)$ with Gamma$(\alpha, \beta)$ prior: $\alpha + n$, $\beta + \sum_{i=1}^{n} x_i$
- Normal$(\mu, \sigma_c^2)$ with Normal$(\mu_0, \sigma_0^2)$ prior: $\left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_c^2} \right)\Big/\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)$, $\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)^{-1}$
- Normal$(\mu_c, \sigma^2)$ with Scaled Inverse Chi-square$(\nu, \sigma_0^2)$ prior: $\nu + n$, $\frac{\nu\sigma_0^2 + \sum_{i=1}^{n}(x_i - \mu_c)^2}{\nu + n}$
- Normal$(\mu, \sigma^2)$ with Normal-scaled Inverse Gamma$(\lambda, \nu, \alpha, \beta)$ prior: $\frac{\nu\lambda + n\bar{x}}{\nu + n}$, $\nu + n$, $\alpha + \frac{n}{2}$, $\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n\nu(\bar{x} - \lambda)^2}{2(n + \nu)}$
- MVN$(\mu, \Sigma_c)$ with MVN$(\mu_0, \Sigma_0)$ prior: $\left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}\left( \Sigma_0^{-1}\mu_0 + n\Sigma_c^{-1}\bar{x} \right)$, $\left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}$
- MVN$(\mu_c, \Sigma)$ with Inverse-Wishart$(\nu, \Psi)$ prior: $n + \nu$, $\Psi + \sum_{i=1}^{n}(x_i - \mu_c)(x_i - \mu_c)^T$
- Pareto$(x_{mc}, k)$ with Gamma$(\alpha, \beta)$ prior: $\alpha + n$, $\beta + \sum_{i=1}^{n}\log\frac{x_i}{x_{mc}}$
- Pareto$(x_m, k_c)$ with Pareto$(x_0, k_0)$ prior: $x_0$, $k_0 - k_c n$ (where $k_0 > k_c n$)
- Gamma$(\alpha_c, \beta)$ with Gamma$(\alpha_0, \beta_0)$ prior: $\alpha_0 + n\alpha_c$, $\beta_0 + \sum_{i=1}^{n} x_i$
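A minimal sketch of the first row of the table (Bernoulli likelihood with a Beta prior) in numpy; the flat Beta(1, 1) prior, true p = 0.3, and Monte Carlo credible interval are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=50)        # Bernoulli(p = 0.3) data

alpha0, beta0 = 1.0, 1.0                 # Beta(1, 1) = flat prior on p
alpha_n = alpha0 + x.sum()               # posterior hyperparameters from the table above
beta_n = beta0 + x.size - x.sum()

post_mean = alpha_n / (alpha_n + beta_n)
# Equal-tail 95% credible interval via Monte Carlo draws from the Beta posterior
draws = rng.beta(alpha_n, beta_n, size=100_000)
print(post_mean, np.quantile(draws, [0.025, 0.975]))
```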
14.4 Bayesian Testing

If $H_0 : \theta \in \Theta_0$:
- Prior probability $P[H_0] = \int_{\Theta_0} f(\theta)\, d\theta$
- Posterior probability $P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n)\, d\theta$

Let $H_0, \ldots, H_{K-1}$ be $K$ hypotheses with priors $f(\theta \mid H_k)$. Then
$P[H_k \mid x^n] = \frac{f(x^n \mid H_k)P[H_k]}{\sum_{k=0}^{K-1} f(x^n \mid H_k)P[H_k]}$

Marginal likelihood
$f(x^n \mid H_i) = \int_{\Theta} f(x^n \mid \theta, H_i)f(\theta \mid H_i)\, d\theta$

Posterior odds (of $H_i$ relative to $H_j$)
$\frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor } BF_{ij}} \times \underbrace{\frac{P[H_i]}{P[H_j]}}_{\text{prior odds}}$

Bayes factor interpretation
  log10 BF_10    BF_10       evidence
  0 - 0.5        1 - 1.5     weak
  0.5 - 1        1.5 - 10    moderate
  1 - 2          10 - 100    strong
  > 2            > 100       decisive

$p^{\star} = \frac{\frac{p}{1-p}BF_{10}}{1 + \frac{p}{1-p}BF_{10}}$ where $p = P[H_1]$ and $p^{\star} = P[H_1 \mid x^n]$
15 Exponential Family

Scalar parameter
$f_X(x \mid \theta) = h(x)\exp\{\eta(\theta)T(x) - A(\theta)\} = h(x)g(\theta)\exp\{\eta(\theta)T(x)\}$

Vector parameter
$f_X(x \mid \theta) = h(x)\exp\!\left\{ \sum_{i=1}^{s}\eta_i(\theta)T_i(x) - A(\theta) \right\} = h(x)\exp\{\eta(\theta)\cdot T(x) - A(\theta)\} = h(x)g(\theta)\exp\{\eta(\theta)\cdot T(x)\}$

Natural form
$f_X(x \mid \eta) = h(x)\exp\{\eta\cdot T(x) - A(\eta)\} = h(x)g(\eta)\exp\{\eta\cdot T(x)\} = h(x)g(\eta)\exp\!\left\{ \eta^T T(x) \right\}$
16 Sampling Methods

16.1 The Bootstrap

Let $T_n = g(X_1, \ldots, X_n)$ be a statistic.
1. Estimate $V_F[T_n]$ with $V_{\hat{F}_n}[T_n]$.
2. Approximate $V_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^{\star}_{n,1}, \ldots, T^{\star}_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
       i. Sample uniformly with replacement $X^{\star}_1, \ldots, X^{\star}_n \sim \hat{F}_n$.
       ii. Compute $T^{\star}_n = g(X^{\star}_1, \ldots, X^{\star}_n)$.
   (b) Then
       $v_{boot} = \hat{V}_{\hat{F}_n}[T_n] = \frac{1}{B}\sum_{b=1}^{B}\left( T^{\star}_{n,b} - \frac{1}{B}\sum_{r=1}^{B} T^{\star}_{n,r} \right)^2$
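A minimal numpy sketch of the nonparametric bootstrap above; the sample median as the statistic and the normal-based interval are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=2.0, size=200)   # observed sample
t_hat = np.median(x)                           # statistic T_n = g(X_1, ..., X_n)

B = 5000
t_star = np.empty(B)
for b in range(B):
    # Sample n points uniformly *with replacement* from the data, i.e. from F_hat_n
    x_star = rng.choice(x, size=x.size, replace=True)
    t_star[b] = np.median(x_star)

se_boot = t_star.std(ddof=1)
print(t_hat, se_boot, (t_hat - 1.96 * se_boot, t_hat + 1.96 * se_boot))  # normal-based interval
```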
16.1.1 Bootstrap Confidence Intervals

Normal-based interval
$T_n \pm z_{\alpha/2}\,\widehat{\operatorname{se}}_{boot}$

Pivotal interval
1. Location parameter $\theta = T(F)$
2. Pivot $R_n = \hat{\theta}_n - \theta$
3. Let $H(r) = P[R_n \le r]$ be the cdf of $R_n$
4. Let $R^{\star}_{n,b} = \hat{\theta}^{\star}_{n,b} - \hat{\theta}_n$. Approximate $H$ using the bootstrap: $\hat{H}(r) = \frac{1}{B}\sum_{b=1}^{B} I(R^{\star}_{n,b} \le r)$
5. Let $\theta^{\star}_{\beta}$ denote the $\beta$ sample quantile of $(\hat{\theta}^{\star}_{n,1}, \ldots, \hat{\theta}^{\star}_{n,B})$
6. Let $r^{\star}_{\beta}$ denote the $\beta$ sample quantile of $(R^{\star}_{n,1}, \ldots, R^{\star}_{n,B})$, i.e., $r^{\star}_{\beta} = \theta^{\star}_{\beta} - \hat{\theta}_n$
7. Then an approximate $1 - \alpha$ confidence interval is $C_n = (\hat{a}, \hat{b})$ with
   $\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\!\left( 1 - \frac{\alpha}{2} \right) = \hat{\theta}_n - r^{\star}_{1-\alpha/2} = 2\hat{\theta}_n - \theta^{\star}_{1-\alpha/2}$
   $\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\!\left( \frac{\alpha}{2} \right) = \hat{\theta}_n - r^{\star}_{\alpha/2} = 2\hat{\theta}_n - \theta^{\star}_{\alpha/2}$

Percentile interval
$C_n = \left( \theta^{\star}_{\alpha/2},\ \theta^{\star}_{1-\alpha/2} \right)$
16.2 Rejection Sampling

Setup
- We can easily sample from $g(\theta)$
- We want to sample from $h(\theta)$, but it is difficult
- We know $h(\theta)$ up to a proportionality constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\, d\theta}$
- Envelope condition: we can find $M > 0$ such that $k(\theta) \le Mg(\theta)$

Algorithm
1. Draw $\theta^{cand} \sim g(\theta)$
2. Generate $u \sim \text{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{Mg(\theta^{cand})}$
4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example
- We can easily sample from the prior $g(\theta) = f(\theta)$
- Target is the posterior: $h(\theta) \propto k(\theta) = f(x^n \mid \theta)f(\theta)$
- Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat{\theta}_n) = \mathcal{L}_n(\hat{\theta}_n) \equiv M$
- Algorithm:
  1. Draw $\theta^{cand} \sim f(\theta)$
  2. Generate $u \sim \text{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat{\theta}_n)}$
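A minimal sketch of the algorithm in numpy; the unnormalized target (a Beta(2, 5) kernel), the Uniform(0, 1) proposal, and the envelope constant M = 0.082 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def k(theta):
    # Unnormalized target: the kernel of a Beta(2, 5) density (constant assumed unknown)
    return theta * (1.0 - theta) ** 4

M = 0.082   # envelope: k(theta) <= M * g(theta), g = Unif(0,1) density (max of k ~ 0.0819)

def rejection_sample(B):
    out = []
    while len(out) < B:
        cand = rng.uniform(0.0, 1.0)          # draw from the proposal g
        u = rng.uniform(0.0, 1.0)
        if u <= k(cand) / (M * 1.0):          # accept with probability k/(M g)
            out.append(cand)
    return np.array(out)

draws = rejection_sample(10_000)
print(draws.mean())   # should be close to E[Beta(2, 5)] = 2/7 ~ 0.286
```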
16.3 Importance Sampling

Sample from an importance function $g$ rather than the target density $h$.
Algorithm to obtain an approximation to $E[q(\theta) \mid x^n]$:
1. Sample from the prior: $\theta_1, \ldots, \theta_B \stackrel{iid}{\sim} f(\theta)$
2. For each $i = 1, \ldots, B$, calculate the normalized weights $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{j=1}^{B}\mathcal{L}_n(\theta_j)}$
3. $E[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i)w_i$
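A minimal sketch in numpy: the posterior mean of a Bernoulli success probability under a flat prior, approximated by importance sampling from the prior and compared against the exact Beta posterior mean. The data-generating p = 0.7 and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.7, size=40)            # Bernoulli data, flat Unif(0, 1) prior on p

B = 50_000
theta = rng.uniform(0.0, 1.0, size=B)        # draws from the prior (the importance function)

# Log-likelihood of each draw, exponentiated stably before normalizing the weights
loglik = x.sum() * np.log(theta) + (x.size - x.sum()) * np.log1p(-theta)
w = np.exp(loglik - loglik.max())
w /= w.sum()

post_mean = np.sum(w * theta)                # approximates E[p | x^n]
exact = (1 + x.sum()) / (2 + x.size)         # Beta(1 + sum x, 1 + n - sum x) posterior mean
print(post_mean, exact)
```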
17 Decision Theory

Definitions
- Unknown quantity affecting our decision: $\theta \in \Theta$
- Decision rule: synonymous with an estimator $\hat{\theta}$
- Action $a \in A$: a possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$.
- Loss function $L$: the consequences of taking action $a$ when the true state is $\theta$, or the discrepancy between $\theta$ and $\hat{\theta}$; $L : \Theta \times A \to [-k, \infty)$.

Loss functions
- Squared error loss: $L(\theta, a) = (\theta - a)^2$
- Linear loss: $L(\theta, a) = K_1(\theta - a)$ if $a - \theta \le 0$, and $K_2(a - \theta)$ if $a - \theta > 0$
- Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2$)
- $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
- Zero-one loss: $L(\theta, a) = 0$ if $a = \theta$, $1$ if $a \ne \theta$
17.1 Risk

Posterior risk
$r(\hat{\theta} \mid x) = \int L(\theta, \hat{\theta}(x))f(\theta \mid x)\, d\theta = E_{\theta\mid X}\!\left[ L(\theta, \hat{\theta}(x)) \right]$

(Frequentist) risk
$R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x))f(x \mid \theta)\, dx = E_{X\mid\theta}\!\left[ L(\theta, \hat{\theta}(X)) \right]$

Bayes risk
$r(f, \hat{\theta}) = \iint L(\theta, \hat{\theta}(x))f(x, \theta)\, dx\, d\theta = E_{\theta,X}\!\left[ L(\theta, \hat{\theta}(X)) \right]$
$r(f, \hat{\theta}) = E_{\theta}\!\left[ E_{X\mid\theta}\!\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_{\theta}\!\left[ R(\theta, \hat{\theta}) \right]$
$r(f, \hat{\theta}) = E_{X}\!\left[ E_{\theta\mid X}\!\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_{X}\!\left[ r(\hat{\theta} \mid X) \right]$
17.2 Admissibility

$\hat{\theta}'$ dominates $\hat{\theta}$ if
$\forall\theta : R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$ and $\exists\theta : R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$

$\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.
17.3 Bayes Rule

Bayes rule (or Bayes estimator)
$r(f, \hat{\theta}) = \inf_{\tilde{\theta}} r(f, \tilde{\theta})$
$\hat{\theta}(x)$ minimizes the posterior risk $r(\hat{\theta} \mid x)$ for every $x$, and then $r(f, \hat{\theta}) = \int r(\hat{\theta} \mid x)f(x)\, dx$

Theorems
- Squared error loss: the Bayes estimator is the posterior mean
- Absolute error loss: the posterior median
- Zero-one loss: the posterior mode
17.4 Minimax Rules

Maximum risk
$\bar{R}(\hat{\theta}) = \sup_{\theta} R(\theta, \hat{\theta})$,  $\bar{R}(a) = \sup_{\theta} R(\theta, a)$

Minimax rule
$\sup_{\theta} R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}}\bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}}\sup_{\theta} R(\theta, \tilde{\theta})$

If $\hat{\theta}$ is a Bayes rule with constant risk, i.e. $\exists c : R(\theta, \hat{\theta}) = c$ for all $\theta$, then $\hat{\theta}$ is minimax.

Least favorable prior
If $\hat{\theta}^f$ is the Bayes rule for prior $f$ and $R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f)$ for all $\theta$, then $\hat{\theta}^f$ is minimax and $f$ is called a least favorable prior.
18 Linear Regression

Definitions
- Response variable $Y$
- Covariate $X$ (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$,  $E[\varepsilon_i \mid X_i] = 0$,  $V[\varepsilon_i \mid X_i] = \sigma^2$

Fitted line
$\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$

Predicted (fitted) values
$\hat{Y}_i = \hat{r}(X_i)$

Residuals
$\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 X_i \right)$

Residual sums of squares (rss)
$\operatorname{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n}\hat{\varepsilon}_i^2$

Least squares estimates
$\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T$ minimizes $\operatorname{rss}$:
$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{X}_n$
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$

$E[\hat{\beta} \mid X^n] = \begin{pmatrix}\beta_0 \\ \beta_1\end{pmatrix}$,  $V[\hat{\beta} \mid X^n] = \frac{\sigma^2}{ns_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1\end{pmatrix}$
$\widehat{\operatorname{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}}$,  $\widehat{\operatorname{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X\sqrt{n}}$
where $s_X^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$ (an unbiased estimate of $\sigma^2$).

Further properties:
- Consistency: $\hat{\beta}_0 \stackrel{P}{\to} \beta_0$ and $\hat{\beta}_1 \stackrel{P}{\to} \beta_1$
- Asymptotic normality: $\frac{\hat{\beta}_0 - \beta_0}{\widehat{\operatorname{se}}(\hat{\beta}_0)} \stackrel{D}{\to} \mathcal{N}(0, 1)$ and $\frac{\hat{\beta}_1 - \beta_1}{\widehat{\operatorname{se}}(\hat{\beta}_1)} \stackrel{D}{\to} \mathcal{N}(0, 1)$
- Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat{\beta}_0 \pm z_{\alpha/2}\,\widehat{\operatorname{se}}(\hat{\beta}_0)$ and $\hat{\beta}_1 \pm z_{\alpha/2}\,\widehat{\operatorname{se}}(\hat{\beta}_1)$
- The Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$ rejects $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1/\widehat{\operatorname{se}}(\hat{\beta}_1)$.

$R^2$
$R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n}\hat{\varepsilon}_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\operatorname{rss}}{\operatorname{tss}}$
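A minimal numpy sketch of the least squares formulas above; the simulated covariate, true coefficients (1.5, 2.0), and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=1.0, size=n)    # true beta0 = 1.5, beta1 = 2.0

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)            # unbiased estimate of sigma^2
s_x2 = np.mean((x - xbar) ** 2)
se_b1 = np.sqrt(sigma2_hat) / (np.sqrt(s_x2) * np.sqrt(n))

print(b0, b1, (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1))  # estimates and ~95% CI for beta1
```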
Likelihood
$\mathcal{L} = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i)\prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) = \mathcal{L}_1\mathcal{L}_2$
$\mathcal{L}_1 = \prod_{i=1}^{n} f_X(X_i)$
$\mathcal{L}_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n}\exp\!\left\{ -\frac{1}{2\sigma^2}\sum_i\left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2 \right\}$

Under the assumption of Normality, the least squares estimator is also the mle, with
$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$
18.2 Prediction

Observe a new value $X = x_{\star}$ of the covariate and predict the outcome $Y_{\star}$:
$\hat{Y}_{\star} = \hat{\beta}_0 + \hat{\beta}_1 x_{\star}$
$V[\hat{Y}_{\star}] = V[\hat{\beta}_0] + x_{\star}^2 V[\hat{\beta}_1] + 2x_{\star}\operatorname{Cov}[\hat{\beta}_0, \hat{\beta}_1]$

Prediction interval
$\hat{\xi}_n^2 = \hat{\sigma}^2\left( \frac{\sum_{i=1}^{n}(X_i - x_{\star})^2}{n\sum_i(X_i - \bar{X})^2} + 1 \right)$
$\hat{Y}_{\star} \pm z_{\alpha/2}\,\hat{\xi}_n$
18.3 Multiple Regression

$Y = X\beta + \varepsilon$
where
$X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix}$,  $\beta = \begin{pmatrix}\beta_1 \\ \vdots \\ \beta_k\end{pmatrix}$,  $\varepsilon = \begin{pmatrix}\varepsilon_1 \\ \vdots \\ \varepsilon_n\end{pmatrix}$

Likelihood
$\mathcal{L}(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\!\left( -\frac{1}{2\sigma^2}\operatorname{rss} \right)$
$\operatorname{rss} = (Y - X\beta)^T(Y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{n}(Y_i - x_i^T\beta)^2$

If the $(k \times k)$ matrix $X^TX$ is invertible,
$\hat{\beta} = (X^TX)^{-1}X^TY$
$V[\hat{\beta} \mid X^n] = \sigma^2(X^TX)^{-1}$
$\hat{\beta} \approx \mathcal{N}\!\left( \beta, \sigma^2(X^TX)^{-1} \right)$

Estimate of the regression function
$\hat{r}(x) = \sum_{j=1}^{k}\hat{\beta}_j x_j$

Unbiased estimate of $\sigma^2$
$\hat{\sigma}^2 = \frac{1}{n - k}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$,  $\hat{\varepsilon} = Y - X\hat{\beta}$,  $\hat{\sigma}^2_{mle} = \frac{n - k}{n}\hat{\sigma}^2$

$1 - \alpha$ confidence interval
$\hat{\beta}_j \pm z_{\alpha/2}\,\widehat{\operatorname{se}}(\hat{\beta}_j)$
18.4 Model Selection

Consider predicting a new observation $Y_{\star}$ for covariates $X_{\star}$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues
- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the best score

Hypothesis testing
$H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0$ for all $j \in J$

Mean squared prediction error (mspe)
$\operatorname{mspe} = E\!\left[ (\hat{Y}(S) - Y_{\star})^2 \right]$

Prediction risk
$R(S) = \sum_{i=1}^{n}\operatorname{mspe}_i = \sum_{i=1}^{n} E\!\left[ (\hat{Y}_i(S) - Y_i^{\star})^2 \right]$

Training error
$\hat{R}_{tr}(S) = \sum_{i=1}^{n}(\hat{Y}_i(S) - Y_i)^2$

$R^2$
$R^2(S) = 1 - \frac{\operatorname{rss}(S)}{\operatorname{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\operatorname{tss}} = \frac{\sum_{i=1}^{n}(\hat{Y}_i(S) - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

The training error is a downward-biased estimate of the prediction risk:
$E[\hat{R}_{tr}(S)] < R(S)$
$\operatorname{bias}(\hat{R}_{tr}(S)) = E[\hat{R}_{tr}(S)] - R(S) = -2\sum_{i=1}^{n}\operatorname{Cov}[\hat{Y}_i, Y_i]$

Adjusted $R^2$
$\bar{R}^2(S) = 1 - \frac{n - 1}{n - k}\frac{\operatorname{rss}}{\operatorname{tss}}$

Mallows $C_p$ statistic
$\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat{\sigma}^2 = \text{lack of fit} + \text{complexity penalty}$

Akaike Information Criterion (AIC)
$\operatorname{AIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}^2_S) - k$

Bayesian Information Criterion (BIC)
$\operatorname{BIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}^2_S) - \frac{k}{2}\log n$

Validation and training
$\hat{R}_V(S) = \sum_{i=1}^{m}(\hat{Y}_i^{\star}(S) - Y_i^{\star})^2$,  $m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

Leave-one-out cross-validation
$\hat{R}_{CV}(S) = \sum_{i=1}^{n}(Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n}\left( \frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)} \right)^2$
$U(S) = X_S(X_S^TX_S)^{-1}X_S^T$ (the hat matrix)
19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate $f(x)$, where $f$ satisfies $P[X \in A] = \int_A f(x)\, dx$.

Integrated square error (ise)
$L(f, \hat{f}_n) = \int\left( f(x) - \hat{f}_n(x) \right)^2 dx = J(h) + \int f^2(x)\, dx$

Frequentist risk
$R(f, \hat{f}_n) = E\!\left[ L(f, \hat{f}_n) \right] = \int b^2(x)\, dx + \int v(x)\, dx$
$b(x) = E\!\left[ \hat{f}_n(x) \right] - f(x)$,  $v(x) = V\!\left[ \hat{f}_n(x) \right]$
19.1.1 Histograms

Definitions
- Number of bins $m$
- Binwidth $h = \frac{1}{m}$ (data on $[0, 1]$)
- Bin $B_j$ has $\nu_j$ observations
- Define $\hat{p}_j = \nu_j/n$ and $p_j = \int_{B_j} f(u)\, du$

Histogram estimator
$\hat{f}_n(x) = \sum_{j=1}^{m}\frac{\hat{p}_j}{h}I(x \in B_j)$
$E\!\left[ \hat{f}_n(x) \right] = \frac{p_j}{h}$,  $V\!\left[ \hat{f}_n(x) \right] = \frac{p_j(1 - p_j)}{nh^2}$
$R(\hat{f}_n, f) \approx \frac{h^2}{12}\int(f'(u))^2\, du + \frac{1}{nh}$
$h^{\star} = \frac{1}{n^{1/3}}\left( \frac{6}{\int(f'(u))^2\, du} \right)^{1/3}$
$R^{\star}(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}$ with $C = \left( \frac{3}{4} \right)^{2/3}\left( \int(f'(u))^2\, du \right)^{1/3}$

Cross-validation estimate of $E[J(h)]$
$\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\, dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^{m}\hat{p}_j^2$
19.1.2 Kernel Density Estimator (KDE)

Kernel $K$
- $K(x) \ge 0$
- $\int K(x)\, dx = 1$
- $\int xK(x)\, dx = 0$
- $\int x^2K(x)\, dx \equiv \sigma_K^2 > 0$

KDE
$\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\!\left( \frac{x - X_i}{h} \right)$
$R(f, \hat{f}_n) \approx \frac{1}{4}(h\sigma_K)^4\int(f''(x))^2\, dx + \frac{1}{nh}\int K^2(x)\, dx$
$h^{\star} = \frac{c_1^{-2/5}c_2^{1/5}c_3^{-1/5}}{n^{1/5}}$ with $c_1 = \sigma_K^2$, $c_2 = \int K^2(x)\, dx$, $c_3 = \int(f''(x))^2\, dx$
$R^{\star}(f, \hat{f}_n) = \frac{c_4}{n^{4/5}}$ with $c_4 = \frac{5}{4}(\sigma_K^2)^{2/5}\underbrace{\left( \int K^2(x)\, dx \right)^{4/5}}_{C(K)}\left( \int(f'')^2\, dx \right)^{1/5}$

Epanechnikov kernel
$K(x) = \frac{3}{4\sqrt{5}}\left( 1 - \frac{x^2}{5} \right)$ for $|x| < \sqrt{5}$, and $0$ otherwise

Cross-validation estimate of $E[J(h)]$
$\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\, dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) \approx \frac{1}{hn^2}\sum_{i=1}^{n}\sum_{j=1}^{n}K^{\star}\!\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh}K(0)$
$K^{\star}(x) = K^{(2)}(x) - 2K(x)$,  $K^{(2)}(x) = \int K(x - y)K(y)\, dy$
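A minimal numpy sketch of the KDE formula with a Gaussian kernel; the bimodal sample and the normal-reference bandwidth rule of thumb are illustrative choices, not a recommendation.

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """KDE f_hat_n(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a standard normal kernel K."""
    data = np.asarray(data)
    u = (x_grid[:, None] - data[None, :]) / h             # (grid, n) scaled distances
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)      # Gaussian kernel
    return K.mean(axis=1) / h

rng = np.random.default_rng(9)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # bimodal sample
h = 1.06 * data.std() * data.size ** (-1 / 5)             # normal-reference bandwidth
grid = np.linspace(-6, 6, 400)
f_hat = gaussian_kde(grid, data, h)
print(np.trapz(f_hat, grid))                              # should integrate to ~1
```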
19.2 Non-parametric Regression

Estimate $r(x) = E[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by
$Y_i = r(x_i) + \varepsilon_i$,  $E[\varepsilon_i] = 0$,  $V[\varepsilon_i] = \sigma^2$

k-nearest-neighbor estimator
$\hat{r}(x) = \frac{1}{k}\sum_{i : x_i \in N_k(x)} Y_i$ where $N_k(x) = \{k$ values of $x_1, \ldots, x_n$ closest to $x\}$

Nadaraya-Watson kernel estimator
$\hat{r}(x) = \sum_{i=1}^{n} w_i(x)Y_i$,  $w_i(x) = \frac{K\!\left( \frac{x - x_i}{h} \right)}{\sum_{j=1}^{n} K\!\left( \frac{x - x_j}{h} \right)} \in [0, 1]$

$R(\hat{r}_n, r) \approx \frac{h^4}{4}\left( \int x^2K(x)\, dx \right)^2\int\left( r''(x) + 2r'(x)\frac{f'(x)}{f(x)} \right)^2 dx + \frac{\sigma^2\int K^2(x)\, dx}{nh}\int\frac{1}{f(x)}\, dx$
$h^{\star} \approx \frac{c_1}{n^{1/5}}$,  $R^{\star}(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}$

Cross-validation estimate of $E[J(h)]$
$\hat{J}_{CV}(h) = \sum_{i=1}^{n}(Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^{n}\frac{(Y_i - \hat{r}(x_i))^2}{\left( 1 - \frac{K(0)}{\sum_{j=1}^{n} K\!\left( \frac{x_i - x_j}{h} \right)} \right)^2}$
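A minimal numpy sketch of the Nadaraya-Watson weights above; the sine regression function, noise level, and Gaussian kernel with h = 0.3 are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Nadaraya-Watson estimate r_hat(x0) = sum_i w_i(x0) Y_i with a Gaussian kernel."""
    x0 = np.atleast_1d(x0)
    u = (x0[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u ** 2)                 # unnormalized Gaussian kernel (constants cancel)
    w = K / K.sum(axis=1, keepdims=True)      # weights sum to 1 at each evaluation point
    return w @ y

rng = np.random.default_rng(10)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(0, 2 * np.pi, 50)
print(nadaraya_watson(grid[:5], x, y, h=0.3))  # should track sin(x) near the left boundary
```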
19.3 Smoothing Using Orthogonal Functions

Approximation
$r(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x) \approx \sum_{j=1}^{J}\beta_j\phi_j(x)$

Multivariate regression form
$Y = \Phi\beta + \eta$ where $\eta_i = \varepsilon_i$ and
$\Phi = \begin{pmatrix}\phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n)\end{pmatrix}$

Least squares estimator
$\hat{\beta} = (\Phi^T\Phi)^{-1}\Phi^TY \approx \frac{1}{n}\Phi^TY$ (for equally spaced observations only)

Cross-validation estimate of $E[J(h)]$
$\hat{R}_{CV}(J) = \sum_{i=1}^{n}\left( Y_i - \sum_{j=1}^{J}\phi_j(x_i)\hat{\beta}_{j,(-i)} \right)^2$
20 Stochastic Processes

Stochastic process
$\{X_t : t \in T\}$ with $T = \{0, \pm 1, \pm 2, \ldots\} = \mathbb{Z}$ (discrete) or $T = [0, \infty)$ (continuous)
- Notations: $X_t$, $X(t)$
- State space $\mathcal{X}$
- Index set $T$
20.1 Markov Chains

Markov chain $\{X_n : n \in T\}$
$P[X_n = x \mid X_0, \ldots, X_{n-1}] = P[X_n = x \mid X_{n-1}]$ for all $n \in T$, $x \in \mathcal{X}$

Transition probabilities
$p_{ij} \equiv P[X_{n+1} = j \mid X_n = i]$
$p_{ij}(n) \equiv P[X_{m+n} = j \mid X_m = i]$ ($n$-step)

Transition matrix $P$ ($n$-step: $P_n$)
- $(i, j)$ element is $p_{ij}$
- $p_{ij} \ge 0$ and $\sum_j p_{ij} = 1$

Chapman-Kolmogorov
$p_{ij}(m + n) = \sum_k p_{ik}(m)\, p_{kj}(n)$
$P_{m+n} = P_mP_n$,  $P_n = P\cdots P = P^n$

Marginal probability
$\mu_n = (\mu_n(1), \ldots, \mu_n(N))$ where $\mu_n(i) = P[X_n = i]$
$\mu_0$: initial distribution
$\mu_n = \mu_0P^n$
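A minimal numpy sketch of the marginal recursion mu_n = mu_0 P^n and of simulating a path; the two-state transition matrix is hypothetical.

```python
import numpy as np

# Two-state chain (hypothetical transition matrix): rows sum to 1.
P = np.array([[0.9, 0.1],     # state 0 -> {0, 1}
              [0.5, 0.5]])    # state 1 -> {0, 1}
mu0 = np.array([1.0, 0.0])    # start in state 0

mu_n = mu0 @ np.linalg.matrix_power(P, 50)   # marginal mu_n = mu_0 P^n
print(mu_n)                                  # ~ stationary distribution (5/6, 1/6)

# Simulate a path from the chain
rng = np.random.default_rng(11)
state, path = 0, [0]
for _ in range(10):
    state = rng.choice(2, p=P[state])
    path.append(state)
print(path)
```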
20.2 Poisson Processes

Poisson process
- $\{X_t : t \in [0, \infty)\}$ counts the number of events up to and including time $t$
- $X_0 = 0$
- Independent increments: for $t_0 < t_1 < \cdots < t_n$, the increments $X_{t_1} - X_{t_0}, \ldots, X_{t_n} - X_{t_{n-1}}$ are independent
- Intensity function $\lambda(t)$:
  $P[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)$
  $P[X_{t+h} - X_t \ge 2] = o(h)$
- $X_{s+t} - X_s \sim \text{Poisson}(m(s + t) - m(s))$ where $m(t) = \int_0^t\lambda(s)\, ds$

Homogeneous Poisson process
$\lambda(t) \equiv \lambda > 0 \implies X_t \sim \text{Poisson}(\lambda t)$

Waiting times
$W_t :=$ time at which the $t^{\text{th}}$ event occurs,  $W_t \sim \text{Gamma}\!\left( t, \frac{1}{\lambda} \right)$

Interarrival times
$S_t = W_{t+1} - W_t$,  $S_t \sim \text{Exp}\!\left( \frac{1}{\lambda} \right)$, independent of $W_t$
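A minimal numpy sketch of a homogeneous Poisson process, built from its iid exponential interarrival times; the rate lambda = 2 and horizon t = 10 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(12)
lam, t_max = 2.0, 10.0                       # rate lambda and time horizon

# Interarrival times are iid Exponential with mean 1/lambda; event times are their cumsum.
inter = rng.exponential(scale=1.0 / lam, size=1000)
event_times = np.cumsum(inter)
event_times = event_times[event_times <= t_max]

print(len(event_times))                      # X_{t_max} ~ Poisson(lam * t_max), mean 20
print(event_times[:5])                       # waiting times W_1, ..., W_5
```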
21 Time Series

Mean function
$\mu_{X_t} = E[X_t] = \int x f_t(x)\, dx$

Autocovariance function
$\gamma_X(s, t) = E[(X_s - \mu_s)(X_t - \mu_t)] = E[X_sX_t] - \mu_s\mu_t$
$\gamma_X(t, t) = E\!\left[ (X_t - \mu_t)^2 \right] = V[X_t]$

Autocorrelation function (ACF)
$\rho(s, t) = \frac{\operatorname{Cov}[X_s, X_t]}{\sqrt{V[X_s]V[X_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}$

Cross-covariance function (CCV)
$\gamma_{XY}(s, t) = E[(X_s - \mu_{X_s})(Y_t - \mu_{Y_t})]$

Cross-correlation function (CCF)
$\rho_{XY}(s, t) = \frac{\gamma_{XY}(s, t)}{\sqrt{\gamma_X(s, s)\gamma_Y(t, t)}}$

Backshift operator
$B^k(X_t) = X_{t-k}$

Difference operator
$\nabla^d = (1 - B)^d$

White noise $W_t$
- $E[W_t] = 0$ for all $t \in T$
- $V[W_t] = \sigma^2$ for all $t \in T$
- $\gamma_W(s, t) = 0$ for $s \ne t$

Autoregression
$X_t = \sum_{i=1}^{p}\phi_iX_{t-i} + W_t$

Random walk (with drift $\delta$)
$Y_t = \delta t + \sum_{j=1}^{t} W_j$

Symmetric moving average
$M_t = \sum_{j=-k}^{k} a_jX_{t-j}$ where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^{k} a_j = 1$
21.1 Stationary Time Series

Strictly stationary
$P[X_{t_1} \le c_1, \ldots, X_{t_k} \le c_k] = P[X_{t_1+h} \le c_1, \ldots, X_{t_k+h} \le c_k]$ for all $k \in \mathbb{N}$, $t_k$, $c_k$, $h \in \mathbb{Z}$

Weakly stationary
- $E[X_t^2] < \infty$ for all $t \in \mathbb{Z}$
- $E[X_t] = m$ for all $t \in \mathbb{Z}$
- $\gamma_X(s, t) = \gamma_X(s + r, t + r)$ for all $r, s, t \in \mathbb{Z}$

Autocovariance function
- $\gamma(h) = E[(X_{t+h} - \mu)(X_t - \mu)]$ for $h \in \mathbb{Z}$
- $\gamma(0) = E\!\left[ (X_t - \mu)^2 \right]$
- $\gamma(0) \ge 0$ and $\gamma(0) \ge |\gamma(h)|$
- $\gamma(h) = \gamma(-h)$

Autocorrelation function (ACF)
$\rho_X(h) = \frac{\operatorname{Cov}[X_{t+h}, X_t]}{\sqrt{V[X_{t+h}]V[X_t]}} = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t+h, t+h)\gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}$

Jointly stationary time series
$\gamma_{XY}(h) = E[(X_{t+h} - \mu_X)(Y_t - \mu_Y)]$,  $\rho_{XY}(h) = \frac{\gamma_{XY}(h)}{\sqrt{\gamma_X(0)\gamma_Y(0)}}$

Linear process
$X_t = \mu + \sum_{j=-\infty}^{\infty}\psi_jW_{t-j}$ where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$
$\gamma(h) = \sigma_W^2\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_j$
21.2 Estimation of Correlation

Sample autocovariance function
$\hat{\gamma}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(X_{t+h} - \bar{X})(X_t - \bar{X})$

Sample autocorrelation function
$\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$

Sample cross-covariance function
$\hat{\gamma}_{XY}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(X_{t+h} - \bar{X})(Y_t - \bar{Y})$

Sample cross-correlation function
$\hat{\rho}_{XY}(h) = \frac{\hat{\gamma}_{XY}(h)}{\sqrt{\hat{\gamma}_X(0)\hat{\gamma}_Y(0)}}$

Properties
- $\sigma_{\hat{\rho}_X(h)} = \frac{1}{\sqrt{n}}$ if $X_t$ is white noise
- $\sigma_{\hat{\rho}_{XY}(h)} = \frac{1}{\sqrt{n}}$ if $X_t$ or $Y_t$ is white noise
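A minimal numpy sketch of the sample ACF formula; the white-noise input is an illustrative choice that lets the 1/sqrt(n) property above be eyeballed.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(h) = gamma_hat(h)/gamma_hat(0) for h = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    gamma0 = np.sum(xc * xc) / n
    acf = [np.sum(xc[h:] * xc[:n - h]) / n / gamma0 for h in range(max_lag + 1)]
    return np.array(acf)

rng = np.random.default_rng(13)
w = rng.normal(size=1000)                    # white noise
print(sample_acf(w, 5))                      # lags >= 1 should lie within ~ +/- 2/sqrt(n) of 0
```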
21.3 Non-Stationary Time Series

Classical decomposition model
$X_t = \mu_t + S_t + W_t$
where $\mu_t$ is the trend, $S_t$ the seasonal component, and $W_t$ the random noise term.

21.3.1 Detrending

Least squares
1. Choose a trend model, e.g., $\mu_t = \beta_0 + \beta_1t + \beta_2t^2$
2. Minimize rss to obtain the trend estimate $\hat{\mu}_t = \hat{\beta}_0 + \hat{\beta}_1t + \hat{\beta}_2t^2$
3. The residuals estimate the noise $W_t$

Moving average
The low-pass filter $V_t$ is a symmetric moving average $M_t$ with $a_j = \frac{1}{2k+1}$:
$V_t = \frac{1}{2k+1}\sum_{i=-k}^{k} X_{t-i}$
If $\frac{1}{2k+1}\sum_{i=-k}^{k} W_{t-i} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1t$ passes without distortion.

Differencing
$\mu_t = \beta_0 + \beta_1t \implies \nabla\mu_t = \beta_1$, so differencing removes a linear trend.
21.4 ARIMA models

Autoregressive polynomial
$\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$,  $z \in \mathbb{C}$, $\phi_p \ne 0$

Autoregressive operator
$\phi(B) = 1 - \phi_1B - \cdots - \phi_pB^p$

AR(p) (autoregressive model of order p)
$X_t = \phi_1X_{t-1} + \cdots + \phi_pX_{t-p} + W_t \iff \phi(B)X_t = W_t$

AR(1)
$X_t = \phi^kX_{t-k} + \sum_{j=0}^{k-1}\phi^jW_{t-j}$, and for $k \to \infty$, $|\phi| < 1$: $X_t = \sum_{j=0}^{\infty}\phi^jW_{t-j}$ (a linear process)
$E[X_t] = \sum_{j=0}^{\infty}\phi^jE[W_{t-j}] = 0$
$\gamma(h) = \operatorname{Cov}[X_{t+h}, X_t] = \frac{\sigma_W^2\phi^h}{1 - \phi^2}$
$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$,  $\rho(h) = \phi\,\rho(h-1)$ for $h = 1, 2, \ldots$

Moving average polynomial
$\theta(z) = 1 + \theta_1z + \cdots + \theta_qz^q$,  $z \in \mathbb{C}$, $\theta_q \ne 0$

Moving average operator
$\theta(B) = 1 + \theta_1B + \cdots + \theta_qB^q$

MA(q) (moving average model of order q)
$X_t = W_t + \theta_1W_{t-1} + \cdots + \theta_qW_{t-q} \iff X_t = \theta(B)W_t$
$E[X_t] = \sum_{j=0}^{q}\theta_jE[W_{t-j}] = 0$
$\gamma(h) = \operatorname{Cov}[X_{t+h}, X_t] = \sigma_W^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h}$ for $0 \le h \le q$, and $0$ for $h > q$

MA(1)
$X_t = W_t + \theta W_{t-1}$
$\gamma(h) = (1 + \theta^2)\sigma_W^2$ for $h = 0$, $\theta\sigma_W^2$ for $h = 1$, $0$ for $h > 1$
$\rho(h) = \frac{\theta}{1 + \theta^2}$ for $h = 1$, $0$ for $h > 1$

ARMA(p, q)
$X_t = \phi_1X_{t-1} + \cdots + \phi_pX_{t-p} + W_t + \theta_1W_{t-1} + \cdots + \theta_qW_{t-q} \iff \phi(B)X_t = \theta(B)W_t$

Partial autocorrelation function (PACF)
$\phi_{hh} = \operatorname{corr}(X_h - X_h^{h-1}, X_0 - X_0^{h-1})$ for $h \ge 2$,  $\phi_{11} = \operatorname{corr}(X_1, X_0) = \rho(1)$
where $X_i^{h-1}$ denotes the regression of $X_i$ on $\{X_{h-1}, X_{h-2}, \ldots, X_1\}$.

ARIMA(p, d, q)
$\nabla^dX_t = (1 - B)^dX_t$ is ARMA(p, q):  $\phi(B)(1 - B)^dX_t = \theta(B)W_t$

Exponentially Weighted Moving Average (EWMA)
$X_t = X_{t-1} + W_t - \lambda W_{t-1}$
$X_t = \sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1}X_{t-j} + W_t$ when $|\lambda| < 1$
$\tilde{X}_{n+1} = (1 - \lambda)X_n + \lambda\tilde{X}_n$
21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent) $\iff$ there exist $\{\psi_j\}$ with $\sum_{j=0}^{\infty}|\psi_j| < \infty$ such that
$X_t = \sum_{j=0}^{\infty}\psi_jW_{t-j} = \psi(B)W_t$

ARMA(p, q) is invertible $\iff$ there exist $\{\pi_j\}$ with $\sum_{j=0}^{\infty}|\pi_j| < \infty$ such that
$\pi(B)X_t = \sum_{j=0}^{\infty}\pi_jX_{t-j} = W_t$

Properties
- ARMA(p, q) is causal $\iff$ the roots of $\phi(z)$ lie outside the unit circle; then $\psi(z) = \sum_{j=0}^{\infty}\psi_jz^j = \frac{\theta(z)}{\phi(z)}$ for $|z| \le 1$
- ARMA(p, q) is invertible $\iff$ the roots of $\theta(z)$ lie outside the unit circle; then $\pi(z) = \sum_{j=0}^{\infty}\pi_jz^j = \frac{\phi(z)}{\theta(z)}$ for $|z| \le 1$
22 Math

22.1 Orthogonal Functions

$L_2$ space
$L_2(a, b) = \left\{ f : [a, b] \to \mathbb{R},\ \int_a^b f(x)^2\, dx < \infty \right\}$

Inner product: $\langle f, g \rangle = \int f(x)g(x)\, dx$
Norm: $\|f\| = \sqrt{\int f^2(x)\, dx}$

Orthogonality (for a sequence of functions $\phi_i$)
$\int\phi_j^2(x)\, dx = 1$ for all $j$,  $\int\phi_i(x)\phi_j(x)\, dx = 0$ for $i \ne j$

An orthogonal sequence $\phi_1, \phi_2, \ldots$ is complete if the only function orthogonal to each $\phi_j$ is the zero function. Then $\phi_1, \phi_2, \ldots$ form an orthogonal basis of $L_2$:
$f \in L_2 \implies f(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x)$ where $\beta_j = \int_a^b f(x)\phi_j(x)\, dx$

Cosine basis
$\phi_0(x) = 1$,  $\phi_j(x) = \sqrt{2}\cos(j\pi x)$ for $j \ge 1$

Parseval's relation
$\|f\|^2 = \int f^2(x)\, dx = \sum_{j=1}^{\infty}\beta_j^2 = \|\beta\|^2$

Legendre polynomials ($x \in [-1, 1]$)
$P_0(x) = 1$,  $P_1(x) = x$,  $P_{j+1}(x) = \frac{(2j + 1)xP_j(x) - jP_{j-1}(x)}{j + 1}$
$\phi_j(x) = \sqrt{(2j + 1)/2}\,P_j(x)$ form an orthogonal basis for $L_2(-1, 1)$
22.2 Series

Finite
$\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$
$\sum_{k=1}^{n}(2k - 1) = n^2$
$\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$
$\sum_{k=1}^{n} k^3 = \left( \frac{n(n+1)}{2} \right)^2$
$\sum_{k=1}^{n} r^{k-1} = \frac{1 - r^n}{1 - r}$
$\sum_{k=0}^{n}\binom{n}{k} = 2^n$
$\sum_{l=0}^{k}\binom{n}{l}\binom{m}{k - l} = \binom{n + m}{k}$ (Vandermonde's identity)
$\sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k = (a + b)^n$ (Binomial theorem)

Infinite
$\sum_{k=0}^{\infty} p^k = \frac{1}{1 - p}$ for $|p| < 1$
$\sum_{k=0}^{\infty} kp^{k-1} = \frac{d}{dp}\left( \sum_{k=0}^{\infty} p^k \right) = \frac{d}{dp}\left( \frac{1}{1 - p} \right) = \frac{1}{(1 - p)^2}$ for $|p| < 1$
$\sum_{k=0}^{\infty}\binom{r + k - 1}{k}x^k = (1 - x)^{-r}$ for $r \in \mathbb{N}^+$
$\sum_{k=0}^{\infty}\binom{\alpha}{k}p^k = (1 + p)^{\alpha}$ for $|p| < 1$, $\alpha \in \mathbb{C}$
22.3 Combinatorics

Sampling $k$ out of $n$
- ordered, without replacement: $n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \frac{n!}{(n - k)!}$
- ordered, with replacement: $n^k$
- unordered, without replacement: $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n - k)!}$
- unordered, with replacement: $\binom{n - 1 + k}{k} = \binom{n - 1 + k}{n - 1}$

Stirling numbers of the second kind
$\left\{ {n \atop k} \right\} = k\left\{ {n - 1 \atop k} \right\} + \left\{ {n - 1 \atop k - 1} \right\}$ for $1 \le k \le n$
$\left\{ {n \atop 0} \right\} = 1$ if $n = 0$, else $0$

Partitions
$P_{n+k,k} = \sum_{i=1}^{k} P_{n,i}$,  $P_{n,k} = 0$ for $k > n$,  $P_{n,0} = 0$ for $n \ge 1$,  $P_{0,0} = 1$

Balls and urns ($f : B \to U$ with $|B| = n$, $|U| = m$; D = distinguishable, I = indistinguishable)
- B: D, U: D. Arbitrary $f$: $m^n$; injective: $m^{\underline{n}}$ if $m \ge n$, else $0$; surjective: $m!\left\{ {n \atop m} \right\}$; bijective: $n!$ if $m = n$, else $0$
- B: I, U: D. Arbitrary: $\binom{m + n - 1}{n}$; injective: $\binom{m}{n}$; surjective: $\binom{n - 1}{m - 1}$; bijective: $1$ if $m = n$, else $0$
- B: D, U: I. Arbitrary: $\sum_{k=1}^{m}\left\{ {n \atop k} \right\}$; injective: $1$ if $m \ge n$, else $0$; surjective: $\left\{ {n \atop m} \right\}$; bijective: $1$ if $m = n$, else $0$
- B: I, U: I. Arbitrary: $\sum_{k=1}^{m} P_{n,k}$; injective: $1$ if $m \ge n$, else $0$; surjective: $P_{n,m}$; bijective: $1$ if $m = n$, else $0$
References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen, Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen, Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: univariate distribution relationship chart, created by Leemis and McQueston [2].]