
Stat 111 – Introduction to Statistical Inference


Section 8: Bayesian
Mar 26 - Mar 30, 2018
Kin Wai Chan and Sanqian Zhang
Department of Statistics, Harvard University

1 Sufficiency
Definition Let Y be a sample from the model F_Y(y|θ). The statistic T(Y) is a sufficient statistic for θ if
the conditional distribution of Y given T(Y) is free of θ.
Example 1.1. X_1, · · · , X_n are i.i.d. Po(λ). Using the definition of sufficiency, show that T = ∑_{i=1}^n X_i is a
sufficient statistic for λ.

Solution Based on the definition, we want to show that

P(X_1 = x_1, · · · , X_n = x_n | T = t)

is free of λ. By the definition of conditional probability, we have

P(X_1 = x_1, · · · , X_n = x_n | T = t) = P(X_1 = x_1, · · · , X_n = x_n, T = t) / P(T = t).

Note that
i) T ∼ Po(nλ);
ii) P(X_1 = x_1, · · · , X_n = x_n, T = t) > 0 only if x_i ≥ 0 for all i and ∑_{i=1}^n x_i = t.

For x_i ≥ 0 and ∑_{i=1}^n x_i = t, we have

P(X_1 = x_1, · · · , X_n = x_n | T = t) = P(X_1 = x_1, · · · , X_n = x_n, T = t) / P(T = t)
    = [ ∏_{i=1}^n e^{−λ} λ^{x_i} / x_i! ] / [ e^{−nλ} (nλ)^t / t! ]
    = [ t! e^{−nλ} ∏_{i=1}^n λ^{x_i} ] / [ ∏_{i=1}^n x_i! · e^{−nλ} (nλ)^{∑_{i=1}^n x_i} ]
    = [ t! / ∏_{i=1}^n x_i! ] ∏_{i=1}^n (1/n)^{x_i}.

Hence, we have
(X_1, · · · , X_n | T) ∼ Multinomial( T, (n^{−1}, · · · , n^{−1}) ).


This conditional distribution does not depend on λ. By the definition of sufficiency, T is a sufficient statistic
for λ. When n = 2, this is a special case of the Chicken and Egg story (Thm 7.1.10 in Blitzstein and
Hwang).
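
We can also check this numerically. With n = 2, the multinomial result above says that, given T = t, the first count is Bin(t, 1/2) no matter what λ is. Below is a minimal R sketch of that check; the values of n, t, λ and the number of replications are arbitrary choices for illustration.

set.seed(111)
n <- 2; t_obs <- 10
cond_x1 <- function(lambda, reps = 2e5) {
  x <- matrix(rpois(reps * n, lambda), ncol = n)   # reps simulated Poisson datasets
  x[rowSums(x) == t_obs, 1]                        # keep datasets with T = t_obs, look at X_1
}
# Both conditional distributions should be close to Bin(t_obs, 1/n), whatever lambda is:
round(prop.table(table(factor(cond_x1(4), levels = 0:t_obs))), 3)
round(prop.table(table(factor(cond_x1(6), levels = 0:t_obs))), 3)
round(dbinom(0:t_obs, t_obs, 1 / n), 3)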


Factorization Theorem If we can write

f_Y(y|θ) = g(θ, T(y)) h(y)

for non-negative functions g, h, then T(Y) is sufficient.
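
For instance, revisiting Example 1.1: f(x|λ) = ∏_{i=1}^n e^{−λ} λ^{x_i} / x_i! = e^{−nλ} λ^{∑_{i=1}^n x_i} · ∏_{i=1}^n (1/x_i!), so we may take g(λ, T(x)) = e^{−nλ} λ^{T(x)} with T(x) = ∑_{i=1}^n x_i and h(x) = ∏_{i=1}^n 1/x_i!. The factorization theorem then recovers the sufficiency of T = ∑_{i=1}^n X_i without any conditional calculation.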


Example 1.2. X_1, · · · , X_n are i.i.d. Unif(θ − 1, θ + 1). Find a sufficient statistic for θ.

Solution The pdf is given by

f_X(x|θ) = (1/2) 1{θ − 1 ≤ x ≤ θ + 1}.

As a result, the likelihood is

L(θ; x) = ∏_{i=1}^n (1/2) 1{θ − 1 ≤ x_i ≤ θ + 1}
        = (1/2)^n 1{θ − 1 ≤ x_1, · · · , x_n ≤ θ + 1}
        = (1/2)^n 1{θ − 1 ≤ x_(1)} 1{θ + 1 ≥ x_(n)}.

By the factorization theorem, (X_(1), X_(n)) is a sufficient statistic for θ. From this example, we
can also see that the dimension of a sufficient statistic can be larger than the dimension of the parameter θ.
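
A small numerical sanity check of this: two datasets with the same minimum and maximum should give identical likelihood functions of θ. A minimal sketch in R, with invented data:

x1 <- c(1.3, 1.9, 2.4, 2.9)                      # invented data
x2 <- c(1.3, 1.5, 2.8, 2.9)                      # same min and max, different interior points
loglik <- function(theta, x) sum(dunif(x, theta - 1, theta + 1, log = TRUE))
theta_grid <- seq(1.9, 2.3, by = 0.1)            # theta values compatible with both datasets
cbind(x1 = sapply(theta_grid, loglik, x = x1),
      x2 = sapply(theta_grid, loglik, x = x2))   # identical columns: the likelihood depends
                                                 # on the data only through (x_(1), x_(n))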

2 Bayesian Inference
2.1 Prior Distribution
A key feature of Bayesian inference is that we assume a prior distribution f (θ) on the parameter(s). Here
are some types of prior:
• Proper vs improper prior

– Proper priors are priors f(θ) that are themselves probability distributions on θ. A proper prior always
leads to a proper posterior.
– Improper priors are priors f(θ) that are not probability distributions, for example f(θ) ∝ 1 on
θ ∈ R, or f(θ) ∝ θ^{−1} on θ > 0. When you use an improper prior, it is important to check that
the posterior is proper.

• Conjugate prior
– Under a model f(y|θ), a family of prior distributions f(θ) is called a conjugate prior family if the
posterior distribution belongs to the same family as the prior distribution. Common examples
include Normal-Normal, Gamma-Poisson, and Beta-Binomial.

• Informative vs uninformative prior


– An informative prior gives explicit information about the parameter. In applied studies, this may come
from previous literature.
– An uninformative prior is a prior that gives little information about the parameter (so
inference is driven mainly by the likelihood). However, there is no consensus on what exactly
"uninformative" means or on which prior is uninformative for a given problem. A standard example of an
uninformative prior is the Jeffreys prior (see the note below).
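
For instance, for a Bin(N, θ) likelihood with N known, the Jeffreys prior is f(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2}, i.e. a Beta(1/2, 1/2) distribution, which is proper; for the mean µ of a N(µ, σ²) model with σ² known, the Jeffreys prior is the flat improper prior f(µ) ∝ 1. (These two standard facts are stated here without derivation.)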


2.2 Posterior Distribution


Under the Bayesian framework, the key object of interest is f (θ|y), the posterior distribution. Inferential
tasks related to θ can be completed by making reference to the posterior distribution. By Bayes’ rule, the
posterior distribution is given by
f(θ|y) = f(y|θ) f(θ) / f(y),

where f(y) = ∫ f(y|θ) f(θ) dθ. Note that the denominator f(y) does not depend on θ. When y is fixed,
it can be treated as a constant, so we typically express the posterior distribution up to a proportionality
constant, that is,

f(θ|y) ∝ f(y|θ) f(θ).
Example 2.1. We have y_1, ..., y_n i.i.d. ∼ Bin(N, θ). Consider the following inference problems:
(a) N known but θ unknown. Assume a prior θ ∼ Beta(α, β). Derive f(θ|y_{1:n}).
(b) θ known but N unknown. Assume an improper prior f(N) ∝ N^{−1}. Derive f(N|y_{1:n}).

Solution
(a)

f(θ|y_{1:n}) ∝ f(y_{1:n}|θ) f(θ)
            ∝ [ ∏_{i=1}^n θ^{y_i} (1 − θ)^{N − y_i} ] θ^{α−1} (1 − θ)^{β−1}
            ∝ θ^{α + ∑ y_i − 1} (1 − θ)^{β + nN − ∑ y_i − 1}

⇒ θ|y_{1:n} ∼ Beta( α + ∑ y_i, β + nN − ∑ y_i ).

(b)

f(N|y_{1:n}) ∝ [ ∏_{i=1}^n N! / (y_i! (N − y_i)!) θ^{y_i} (1 − θ)^{N − y_i} 1{N ≥ y_i} ] N^{−1}
            ∝ [ ∏_{i=1}^n N! / (N − y_i)! ] ((1 − θ)^n)^N N^{−1} 1{N ≥ y_(n)}.
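
The posterior in (b) has no standard named form, but since N lives on the integers it can be normalized numerically over a grid. Below is a minimal R sketch; the data, the value of θ, and the grid cutoff are invented for illustration, and the grid truncates the infinite support.

theta <- 0.3
y <- c(4, 2, 5, 3)                               # hypothetical observed counts
N_grid <- max(y):200                             # support starts at y_(n); cutoff is a practical choice
log_post <- sapply(N_grid, function(N)
  sum(lchoose(N, y)) + sum(y) * log(theta) +
  (length(y) * N - sum(y)) * log(1 - theta) - log(N))   # log of the unnormalized posterior
post <- exp(log_post - max(log_post))
post <- post / sum(post)                         # numerically normalized over the grid
N_grid[which.max(post)]                          # posterior mode of N
sum(N_grid * post)                               # approximate posterior mean of N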

2.3 Predictive Distributions


2.3.1 Prior Predictive Distribution
Before we observe any data, our prediction about unobserved data y is based on the model and the prior.
The marginal distribution of y, or the prior predictive distribution of y is given by
f(y) = ∫ f(y|θ) f(θ) dθ.

2.3.2 Posterior Predictive Distribution


After observing data y, we can make predictions about potential new observations ỹ from the same process.
This posterior predictive distribution is given by
f(ỹ|y) = ∫ f(ỹ|θ, y) f(θ|y) dθ
       = ∫ f(ỹ|θ) f(θ|y) dθ,   if ỹ and y are conditionally independent given θ.
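
In practice this integral is often handled by simulation: draw θ from f(θ|y), then draw ỹ from f(ỹ|θ). A minimal R sketch, reusing the Beta posterior from Example 2.1(a); the data and hyperparameters below are invented for illustration.

set.seed(8)
n <- 10; N <- 20; a0 <- 1; b0 <- 1
y <- rbinom(n, N, 0.4)                                      # pretend these are the observed data
theta_draws <- rbeta(1e4, a0 + sum(y), b0 + n * N - sum(y)) # theta drawn from f(theta | y)
ytilde_draws <- rbinom(1e4, N, theta_draws)                 # ytilde drawn from f(ytilde | theta), one per draw
mean(ytilde_draws)                                          # posterior predictive mean
quantile(ytilde_draws, c(0.025, 0.975))                     # 95% predictive interval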


Example 2.2 (Normal-Normal model with known variance). Suppose we observe data

y ∼ N (µ, σ 2 ) ,

with σ² known, and assume the prior µ ∼ N(µ_0, τ_0²). Derive the prior predictive distribution p(y), the posterior
distribution p(µ|y), and the posterior predictive distribution p(ỹ|y).

Solution

Method 1. Using representation and m.g.f. Using the representation, we have

y = µ + σZ,

where µ ∼ N(µ_0, τ_0²) and Z ∼ N(0, 1) with Z independent of µ. Then we have

(µ, y)ᵀ ∼ N_2( (µ_0, µ_0)ᵀ , [[ τ_0², τ_0² ], [ τ_0², τ_0² + σ² ]] ).

Using properties of the multivariate normal distribution, it follows that

y ∼ N(µ_0, τ_0² + σ²),   and
µ|y ∼ N(µ_1, τ_1²),

with B = σ² / (τ_0² + σ²), µ_1 = Bµ_0 + (1 − B)y and τ_1² = Bτ_0².
Consider the m.g.f. of ỹ given y:

E[e^{tỹ} | y] = E[ E(e^{tỹ} | y, µ) | y ]
             = E[ E(e^{tỹ} | µ) | y ]
             = E[ e^{tµ + 0.5σ²t²} | y ]
             = e^{tµ_1 + 0.5t²τ_1² + 0.5σ²t²}.

This implies
ỹ|y ∼ N(µ_1, τ_1² + σ²).


Method 2. Working with densities directly For the prior predictive distribution of y (all integrals are over µ ∈ (−∞, ∞)),

p(y) = ∫ p(y|µ) p(µ) dµ
     ∝ ∫ exp{−(y − µ)² / (2σ²)} exp{−(µ − µ_0)² / (2τ_0²)} dµ
     ∝ exp{−y² / (2σ²)} ∫ exp{ −(1/2)[ µ²(1/σ² + 1/τ_0²) − 2µ(y/σ² + µ_0/τ_0²) ] } dµ
     ∝ exp{−y² / (2σ²)} ∫ exp{ −(1/(2Bτ_0²))[ µ² − 2µ( yBτ_0²/σ² + µ_0 Bτ_0²/τ_0² ) ] } dµ   (note Bτ_0² = (1/σ² + 1/τ_0²)^{−1})
     ∝ exp{−y² / (2σ²)} ∫ exp{ −(1/(2Bτ_0²))[ µ² − 2µ( y(1 − B) + Bµ_0 ) + ( y(1 − B) + Bµ_0 )² − ( y(1 − B) + Bµ_0 )² ] } dµ
     ∝ exp{ −(1/2)[ y²/σ² − y²(1 − B)²/(Bτ_0²) − 2y(1 − B)Bµ_0/(Bτ_0²) ] }
     ∝ exp{ −(1/2)[ y²/(σ² + τ_0²) − 2yµ_0/(σ² + τ_0²) ] }
     ∝ exp{ −(y − µ_0)² / (2(σ² + τ_0²)) }.   (Complete the square)
⇒ y ∼ N(µ_0, σ² + τ_0²).

For the posterior distribution p(µ|y),

p(µ|y) ∝ p(y|µ) p(µ)
       ∝ exp{−(y − µ)² / (2σ²)} exp{−(µ − µ_0)² / (2τ_0²)}
       ∝ exp{ −(1/2)[ (µ² − 2µy + y²)/σ² + (µ² − 2µµ_0 + µ_0²)/τ_0² ] }
       ∝ exp{ −(1/2)[ µ²(1/σ² + 1/τ_0²) − 2µ(y/σ² + µ_0/τ_0²) ] }
       ∝ exp{ −(1/2)(1/σ² + 1/τ_0²)[ µ² − 2µ(1/σ² + 1/τ_0²)^{−1}(y/σ² + µ_0/τ_0²) ] }
       ∝ exp{ −(1/(2Bτ_0²)) ( µ − (Bµ_0 + (1 − B)y) )² },   (Complete the square)

so µ|y ∼ N(µ_1, τ_1²) with µ_1 = Bµ_0 + (1 − B)y and τ_1² = Bτ_0², as in Method 1.

For the posterior predictive distribution p(ỹ|y),

p(ỹ|y) = ∫ p(ỹ|µ) p(µ|y) dµ
       ∝ ∫ exp{−(ỹ − µ)² / (2σ²)} exp{−(µ − µ_1)² / (2τ_1²)} dµ,   (the calculation is the same as above)

and completing the square in µ as before gives ỹ|y ∼ N(µ_1, τ_1² + σ²).
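
The closed-form answers above can also be checked by simulation, drawing µ from the posterior and then ỹ given µ. A minimal R sketch; all numbers are invented for illustration.

set.seed(8)
mu0 <- 1; tau0_sq <- 4; sigma_sq <- 2; y <- 2.5
B <- sigma_sq / (tau0_sq + sigma_sq)
mu1 <- B * mu0 + (1 - B) * y
tau1_sq <- B * tau0_sq
mu_draws <- rnorm(1e5, mu1, sqrt(tau1_sq))                 # draws from p(mu | y)
ytilde_draws <- rnorm(1e5, mu_draws, sqrt(sigma_sq))       # draws from p(ytilde | y)
c(mean(ytilde_draws), var(ytilde_draws))                   # should be close to ...
c(mu1, tau1_sq + sigma_sq)                                 # ... (mu_1, tau_1^2 + sigma^2)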


2.4 Bayesian Inference as a Sampling Problem


Bayesian inference is performed based on the posterior distribution p(θ|y). From the posterior distribution
we can construct point estimators and credible intervals for θ. The typical point estimators that
we look at are
• MAP: the value of θ that maximizes the posterior distribution of θ;
• median(θ|y): the estimator that minimizes the posterior expected loss under the absolute loss function L(θ, c) = |θ − c|;
• E[θ|y]: the estimator that minimizes the posterior expected loss under the squared loss function L(θ, c) = (θ − c)².

Our posterior distribution looks like this:

f(θ|y) = (1/C) f(y|θ) f(θ),

where C = ∫ f(y|θ) f(θ) dθ is a normalizing constant. This normalizing constant may not be easy to
compute, yet computing estimators such as the posterior median and posterior mean would
typically require knowing this constant C.
To bypass the problem of calculating C, we can try to frame our inference problem as a sampling
problem. That is, if we can sample s_1, · · · , s_m from f(θ|y) (without knowing C), how can we construct
point estimators for θ? How about credible intervals?
Some nice theoretical results (which are beyond the scope of this course!) tell us that under appropriate sampling
procedures, we have

(1/m) ∑_{j=1}^m h(s_j) → E[h(θ)|y],   as m → ∞.

The implication is that if we want to use E[θ|y] as a point estimator, we can estimate it by
(1/m) ∑_{j=1}^m s_j. Similarly, we can use median(s_1, · · · , s_m) to estimate median(θ|y), and we can estimate
credible intervals based on appropriate quantiles of the sample.
Example 2.3. X_1, · · · , X_n are i.i.d. Unif(θ − 1, θ + 1). We assume a flat prior on θ. What is the posterior
distribution of θ?
Now suppose we have a sample s_1, · · · , s_m from the posterior distribution of θ. Describe how you would
construct:
i) A point estimate of θ.
ii) A 95% credible interval for θ. What is the interpretation of this credible interval?
iii) A point estimate of θ².
iv) A 95% credible interval for θ².

Solution From the previous exercise, we know that

L(θ; x) ∝ 1{θ − 1 ≤ x_(1)} 1{θ + 1 ≥ x_(n)}.

The question specifies a flat prior on θ. Since θ ∈ R, this is an improper prior f(θ) ∝ 1.
The posterior is given by

f(θ|x) ∝ 1 × 1{θ − 1 ≤ x_(1)} 1{θ + 1 ≥ x_(n)}
       ∝ 1{θ ≤ x_(1) + 1} 1{θ ≥ x_(n) − 1}
       ∝ 1{x_(n) − 1 ≤ θ ≤ x_(1) + 1}.

Hence, θ|x ∼ Unif(x_(n) − 1, x_(1) + 1).


i) We can choose to use either the posterior mean or the posterior median as the estimator. If we want to
use the posterior mean, then our estimator based on the samples is (1/m) ∑_{j=1}^m s_j. If we want to use the posterior
median, our estimator is the median of the m samples.
ii) A 95% credible interval can be constructed using

( s_(⌈0.025m⌉), s_(⌈0.975m⌉) ),

the empirical 2.5% and 97.5% quantiles of the sample. In R, this would be
quantile(posteriorsamples, c(0.025, 0.975)). This interval can be interpreted as: given
the data, there is a 95% probability that the parameter θ falls in this region.
iii) Our estimator can be (1/m) ∑_{j=1}^m s_j² or the empirical median of s_1², · · · , s_m², depending on which estimator
is chosen.
iv) We can take the empirical 2.5% and 97.5% quantiles of s_1², · · · , s_m².
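
Putting i)–iv) together in code: since the posterior here is Unif(x_(n) − 1, x_(1) + 1), we can sample from it directly. A minimal R sketch with simulated data; the true θ, the sample size, and the number of posterior draws are arbitrary choices for illustration.

set.seed(111)
n <- 50; theta_true <- 3                         # invented truth, used only to simulate data
x <- runif(n, theta_true - 1, theta_true + 1)
s <- runif(1e4, max(x) - 1, min(x) + 1)          # draws from the posterior Unif(x_(n) - 1, x_(1) + 1)
c(mean = mean(s), median = median(s))            # i) point estimates of theta
quantile(s, c(0.025, 0.975))                     # ii) 95% credible interval for theta
c(mean = mean(s^2), median = median(s^2))        # iii) point estimates of theta^2
quantile(s^2, c(0.025, 0.975))                   # iv) 95% credible interval for theta^2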
