
DS-GA 1002 Lecture notes 11 Fall 2016

Bayesian statistics
In the frequentist paradigm we model the data as realizations from a distribution that
depends on deterministic parameters. In contrast, in Bayesian statistics these parameters
are modeled as random variables, which allows us to quantify our uncertainty about them.

1 Learning Bayesian models


We consider the problem of learning a parametrized distribution from a set of data. In the Bayesian framework, we model the data $\vec{x} \in \mathbb{R}^n$ as a realization of a random vector $\vec{X}$, which depends on a vector of parameters $\vec{\Theta}$ that is also random. This requires two modeling choices:

1. The prior distribution is the distribution of $\vec{\Theta}$, which encodes our uncertainty about the model before seeing the data.

2. The likelihood is the conditional distribution of $\vec{X}$ given $\vec{\Theta}$, which specifies how the data depend on the parameters. In contrast to the frequentist framework, the likelihood is not interpreted as a deterministic function of the parameters.

Our goal when learning a Bayesian model is to compute the posterior distribution of the parameters $\vec{\Theta}$ given $\vec{X}$. Evaluating this posterior distribution at the realization $\vec{x}$ allows us to update our uncertainty about $\vec{\Theta}$ using the data.
The following example illustrates Bayesian inference applied to fitting the parameter of a
Bernoulli random variable from iid realizations.

Example 1.1 (Bernoulli distribution). Let $\vec{x}$ be a vector of data that we wish to model as iid samples from a Bernoulli distribution. Since we are taking a Bayesian approach, we choose a prior distribution for the parameter of the Bernoulli. We will consider two different Bayesian estimators, $\Theta_1$ and $\Theta_2$:

1. $\Theta_1$ represents a conservative estimator in terms of prior information. We assign a uniform pdf to the parameter. Any value in the unit interval has the same probability density:

$$f_{\Theta_1}(\theta) = \begin{cases} 1 & \text{for } 0 \le \theta \le 1, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
[Figure 1: four panels — the prior distributions and the posteriors for $(n_0 = 1, n_1 = 3)$, $(n_0 = 3, n_1 = 1)$, and $(n_0 = 91, n_1 = 9)$ — marking the posterior mean under the uniform prior, the posterior mean under the skewed prior, and the ML estimator.]

Figure 1: Posterior distributions of the parameter of a Bernoulli for two different priors and for different data realizations.

2. $\Theta_2$ is an estimator that assumes that the parameter is closer to 1 than to 0. We could use it for instance to capture the suspicion that a coin is biased towards heads. We choose a skewed pdf that increases linearly from zero to one,

$$f_{\Theta_2}(\theta) = \begin{cases} 2\theta & \text{for } 0 \le \theta \le 1, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

By the iid assumption, the likelihood, which is just the conditional pmf of the data given the parameter of the Bernoulli, equals

$$p_{\vec{X}\,|\,\Theta}(\vec{x}\,|\,\theta) = \theta^{n_1}(1-\theta)^{n_0}, \qquad (3)$$

where $n_1$ is the number of ones in the data and $n_0$ the number of zeros (see Example 1.3 in Lecture Notes 9). The posterior pdfs of the two estimators are consequently equal to

$$f_{\Theta_1|\vec{X}}(\theta\,|\,\vec{x}) = \frac{f_{\Theta_1}(\theta)\, p_{\vec{X}|\Theta_1}(\vec{x}\,|\,\theta)}{p_{\vec{X}}(\vec{x})} \qquad (4)$$

$$= \frac{f_{\Theta_1}(\theta)\, p_{\vec{X}|\Theta_1}(\vec{x}\,|\,\theta)}{\int_u f_{\Theta_1}(u)\, p_{\vec{X}|\Theta_1}(\vec{x}\,|\,u)\, du} \qquad (5)$$

$$= \frac{\theta^{n_1}(1-\theta)^{n_0}}{\int_u u^{n_1}(1-u)^{n_0}\, du} \qquad (6)$$

$$= \frac{\theta^{n_1}(1-\theta)^{n_0}}{\beta(n_1+1,\, n_0+1)}, \qquad (7)$$

$$f_{\Theta_2|\vec{X}}(\theta\,|\,\vec{x}) = \frac{f_{\Theta_2}(\theta)\, p_{\vec{X}|\Theta_2}(\vec{x}\,|\,\theta)}{p_{\vec{X}}(\vec{x})} \qquad (8)$$

$$= \frac{\theta^{n_1+1}(1-\theta)^{n_0}}{\int_u u^{n_1+1}(1-u)^{n_0}\, du} \qquad (9)$$

$$= \frac{\theta^{n_1+1}(1-\theta)^{n_0}}{\beta(n_1+2,\, n_0+1)}, \qquad (10)$$

where

$$\beta(a, b) := \int_u u^{a-1}(1-u)^{b-1}\, du \qquad (12)$$

is a special function called the beta function or Euler integral of the first kind, which is tabulated.
Figure 1 shows the posterior distributions for different values of $n_1$ and $n_0$. It also shows the maximum-likelihood estimate of the parameter, which is just $n_1 / (n_0 + n_1)$ (see Example 1.3 in Lecture Notes 9). For a small number of flips, the posterior pdf of $\Theta_2$ is skewed to the right with respect to that of $\Theta_1$, reflecting the prior belief that the parameter is closer to 1. However, for a large number of flips both posterior densities are very close.
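The posterior densities in (7) and (10) are easy to evaluate numerically. The following sketch (illustrative, not part of the original notes) computes the beta function via the Gamma-function identity $\beta(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$, which is available in the Python standard library; the function names are our own.

```python
from math import gamma

def beta_fn(a, b):
    # Beta function via beta(a, b) = Gamma(a) Gamma(b) / Gamma(a + b).
    return gamma(a) * gamma(b) / gamma(a + b)

def posterior_uniform(theta, n0, n1):
    # Posterior pdf of the Bernoulli parameter under the uniform prior, eq. (7).
    return theta**n1 * (1 - theta)**n0 / beta_fn(n1 + 1, n0 + 1)

def posterior_skewed(theta, n0, n1):
    # Posterior pdf under the skewed prior f(theta) = 2 theta, eq. (10).
    return theta**(n1 + 1) * (1 - theta)**n0 / beta_fn(n1 + 2, n0 + 1)

def integrate(f, n=10_000):
    # Midpoint rule on [0, 1]; enough to check the densities are proper.
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))

print(integrate(lambda t: posterior_uniform(t, n0=3, n1=1)))  # ~1.0
```

Both posteriors integrate to one, confirming that the beta-function normalization in (7) and (10) is correct.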

2 Conjugate priors
Both posterior distributions in Example 1.1 are beta distributions.

Definition 2.1 (Beta distribution). The pdf of a beta distribution with parameters a and b
is defined as
(1)b1
( a1

(a,b)
, if 0 1,
f (; a, b) := (13)
0 otherwise,

where
Z
(a, b) := ua1 (1 u)b1 du. (14)
u

In fact, the uniform prior of $\Theta_1$ in Example 1.1 is also a beta distribution, with parameters $a = 1$ and $b = 1$. This is also the case for the skewed prior of $\Theta_2$: it is a beta distribution with parameters $a = 2$ and $b = 1$. Since the prior and the posterior belong to the same family, we can view the posterior as an updated estimate of the parameters of the distribution. When the prior and posterior are guaranteed to belong to the same family of distributions for a particular likelihood, the distributions are called conjugate priors.

Definition 2.2 (Conjugate priors). A conjugate family of distributions for a certain likelihood satisfies the following property: if the prior belongs to the family, then the posterior also belongs to the family.

Beta distributions are conjugate priors when the likelihood is binomial.

Theorem 2.3 (The beta distribution is conjugate to the binomial likelihood). If the prior distribution of $\Theta$ is a beta distribution with parameters $a$ and $b$, and the likelihood of the data $X$ given $\Theta = \theta$ is binomial with parameters $n$ and $\theta$, then the posterior distribution of $\Theta$ given $X = x$ is a beta distribution with parameters $x + a$ and $n - x + b$.

[Figure 2: posterior density of $\Theta$ on the interval from 0.35 to 0.60, with 88.6% of the probability mass below 0.5 and 11.4% above.]
Figure 2: Posterior distribution of the fraction of Trump voters in New Mexico conditioned on
the poll data in Example 2.4.

Proof.

$$f_{\Theta|X}(\theta\,|\,x) = \frac{f_{\Theta}(\theta)\, p_{X|\Theta}(x\,|\,\theta)}{p_X(x)} \qquad (15)$$

$$= \frac{f_{\Theta}(\theta)\, p_{X|\Theta}(x\,|\,\theta)}{\int_u f_{\Theta}(u)\, p_{X|\Theta}(x\,|\,u)\, du} \qquad (16)$$

$$= \frac{\frac{\theta^{a-1}(1-\theta)^{b-1}}{\beta(a,b)}\binom{n}{x}\theta^{x}(1-\theta)^{n-x}}{\int_u \frac{u^{a-1}(1-u)^{b-1}}{\beta(a,b)}\binom{n}{x}u^{x}(1-u)^{n-x}\, du} \qquad (17)$$

$$= \frac{\theta^{x+a-1}(1-\theta)^{n-x+b-1}}{\int_u u^{x+a-1}(1-u)^{n-x+b-1}\, du} \qquad (18)$$

$$= f_{\beta}(\theta;\, x+a,\, n-x+b). \qquad (19)$$

Note that the posteriors obtained in Example 1.1 follow immediately from the theorem.
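As a numerical sanity check of Theorem 2.3 (a sketch of our own, using only the standard library), we can multiply a beta prior by a binomial likelihood, normalize on a grid, and compare with the beta pdf with updated parameters $x + a$ and $n - x + b$; `lgamma` keeps the normalization constants stable.

```python
from math import comb, lgamma, log, exp

def log_beta(a, b):
    # log of the beta function via lgamma.
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_pdf(theta, a, b):
    # Beta density, eq. (13), computed in log space for stability.
    return exp((a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta(a, b))

a, b, n, x = 2.0, 1.0, 10, 7

def unnorm(theta):
    # Unnormalized posterior: beta(a, b) prior times binomial(n, theta) likelihood.
    return beta_pdf(theta, a, b) * comb(n, x) * theta**x * (1 - theta)**(n - x)

# Normalize numerically on a midpoint grid and compare with beta(x + a, n - x + b).
m = 20_000
h = 1.0 / m
grid = [(i + 0.5) * h for i in range(m)]
Z = h * sum(unnorm(t) for t in grid)
max_err = max(abs(unnorm(t) / Z - beta_pdf(t, x + a, n - x + b)) for t in grid)
print(max_err)  # tiny
```

The two densities agree up to numerical integration error, as the theorem predicts.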

Example 2.4 (Poll in New Mexico). In a poll in New Mexico with 449 participants, 227 people intend to vote for Clinton and 202 for Trump (the data are from a real poll¹, but for simplicity we are ignoring the other candidates and the people that were undecided). Our aim is to use a Bayesian framework to predict the outcome of the election in New Mexico using these data.

We model the fraction of people that vote for Trump as a random variable $\Theta$. We assume that the $n$ people in the poll are chosen uniformly at random with replacement from the population, so given $\Theta = \theta$ the number of Trump voters is binomial with parameters $n$ and $\theta$. We don't have any additional information about the possible value of $\Theta$, so we assume it is uniform, or equivalently a beta distribution with parameters $a := 1$ and $b := 1$.

By Theorem 2.3 the posterior distribution of $\Theta$ given the data that we observe is a beta distribution with parameters $a := 203$ and $b := 228$. The corresponding probability that $\Theta \ge 0.5$ is 11.4%, which is our estimate for the probability that Trump wins in New Mexico.

¹ The poll results are taken from https://www.abqjournal.com/883092/clinton-still-ahead-in-new-mexico.html
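The 11.4% figure can be reproduced by integrating the beta(203, 228) density over $[0.5, 1]$. A minimal stdlib sketch (log-space evaluation via `lgamma` avoids overflow for these large parameters; the helper names are our own):

```python
from math import lgamma, log, exp

def beta_log_pdf(theta, a, b):
    # Log-density of a beta(a, b) distribution.
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)
    return (a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_norm

def prob_at_least(thresh, a, b, m=100_000):
    # P(Theta >= thresh) for Theta ~ beta(a, b), by midpoint integration.
    h = (1.0 - thresh) / m
    return h * sum(exp(beta_log_pdf(thresh + (i + 0.5) * h, a, b)) for i in range(m))

p_trump = prob_at_least(0.5, a=203, b=228)
print(round(p_trump, 3))  # ~0.114
```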

3 Bayesian estimators
As we have seen, the Bayesian approach to learning probabilistic models yields the posterior
distribution of the parameters of interest, as opposed to a single estimate. In this section we
describe two possible estimators that are derived from the posterior distribution.

3.1 Minimum mean-square-error estimation

The mean of the posterior distribution is the conditional expectation of the parameters given the data. Choosing the posterior mean as an estimator for the parameters $\vec{\Theta}$ has a strong theoretical justification: it is guaranteed to achieve the minimum mean square error (MSE) among all possible estimators.

Theorem 3.1 (The posterior mean minimizes the MSE). The posterior mean is the minimum mean-square-error (MMSE) estimate of the parameter $\vec{\Theta}$ given the data $\vec{X}$. To be more precise, let us define

$$\theta_{\rm MMSE}(\vec{x}) := {\rm E}\left(\vec{\Theta} \,\middle|\, \vec{X} = \vec{x}\right). \qquad (20)$$

For any arbitrary estimator $\theta_{\rm other}(\vec{x})$,

$$ {\rm E}\left(\left(\theta_{\rm other}(\vec{X}) - \vec{\Theta}\right)^2\right) \ge {\rm E}\left(\left(\theta_{\rm MMSE}(\vec{X}) - \vec{\Theta}\right)^2\right). \qquad (21)$$

Proof. We begin by computing the MSE of the arbitrary estimator conditioned on $\vec{X} = \vec{x}$, in terms of the conditional expectation of $\vec{\Theta}$ given $\vec{X}$:

$$ {\rm E}\left(\left(\theta_{\rm other}(\vec{X}) - \vec{\Theta}\right)^2 \,\middle|\, \vec{X} = \vec{x}\right) \qquad (22)$$

$$= {\rm E}\left(\left(\theta_{\rm other}(\vec{X}) - \theta_{\rm MMSE}(\vec{X}) + \theta_{\rm MMSE}(\vec{X}) - \vec{\Theta}\right)^2 \,\middle|\, \vec{X} = \vec{x}\right) \qquad (23)$$

$$= \left(\theta_{\rm other}(\vec{x}) - \theta_{\rm MMSE}(\vec{x})\right)^2 + {\rm E}\left(\left(\theta_{\rm MMSE}(\vec{X}) - \vec{\Theta}\right)^2 \,\middle|\, \vec{X} = \vec{x}\right)$$
$$\quad + 2\left(\theta_{\rm other}(\vec{x}) - \theta_{\rm MMSE}(\vec{x})\right)\left(\theta_{\rm MMSE}(\vec{x}) - {\rm E}\left(\vec{\Theta} \,\middle|\, \vec{X} = \vec{x}\right)\right) \qquad (24)$$

$$= \left(\theta_{\rm other}(\vec{x}) - \theta_{\rm MMSE}(\vec{x})\right)^2 + {\rm E}\left(\left(\theta_{\rm MMSE}(\vec{X}) - \vec{\Theta}\right)^2 \,\middle|\, \vec{X} = \vec{x}\right), \qquad (25)$$

where the cross term vanishes because $\theta_{\rm MMSE}(\vec{x}) = {\rm E}(\vec{\Theta}\,|\,\vec{X} = \vec{x})$ by definition. By iterated expectation,

$$ {\rm E}\left(\left(\theta_{\rm other}(\vec{X}) - \vec{\Theta}\right)^2\right) = {\rm E}\left({\rm E}\left(\left(\theta_{\rm other}(\vec{X}) - \vec{\Theta}\right)^2 \,\middle|\, \vec{X}\right)\right) \qquad (26)$$

$$= {\rm E}\left(\left(\theta_{\rm other}(\vec{X}) - \theta_{\rm MMSE}(\vec{X})\right)^2\right) + {\rm E}\left(\left(\theta_{\rm MMSE}(\vec{X}) - \vec{\Theta}\right)^2\right) \qquad (27)$$

$$\ge {\rm E}\left(\left(\theta_{\rm MMSE}(\vec{X}) - \vec{\Theta}\right)^2\right), \qquad (28)$$

since the expectation of a nonnegative quantity is nonnegative.
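The optimality of the posterior mean can be observed empirically. A small Monte Carlo sketch (our own illustration, not from the notes) for the Bernoulli model of Example 1.1 with the uniform prior: the posterior mean is then $(n_1 + 1)/(n + 2)$, the mean of a beta$(n_1+1,\, n_0+1)$, and its Bayes MSE should beat that of the ML estimate $n_1/n$.

```python
import random

random.seed(0)

def simulate_mse(n_flips=10, trials=200_000):
    # Draw Theta from the uniform prior, simulate n_flips Bernoulli(Theta)
    # samples, and accumulate squared errors for both estimators.
    mse_mmse = mse_ml = 0.0
    for _ in range(trials):
        theta = random.random()                    # Theta ~ uniform(0, 1)
        n1 = sum(random.random() < theta for _ in range(n_flips))
        post_mean = (n1 + 1) / (n_flips + 2)       # mean of beta(n1+1, n0+1)
        ml = n1 / n_flips
        mse_mmse += (post_mean - theta) ** 2
        mse_ml += (ml - theta) ** 2
    return mse_mmse / trials, mse_ml / trials

mse_mmse, mse_ml = simulate_mse()
print(mse_mmse, mse_ml)  # the posterior mean attains the lower MSE
```

With these settings the posterior mean achieves a visibly smaller average squared error, consistent with Theorem 3.1.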

Example 3.2 (Bernoulli distribution (continued)). In order to obtain point estimates for the parameter in Example 1.1, we compute the posterior means:

$$ {\rm E}\left(\Theta_1 \,\middle|\, \vec{X} = \vec{x}\right) = \int_0^1 \theta\, f_{\Theta_1|\vec{X}}(\theta\,|\,\vec{x})\, d\theta \qquad (29)$$

$$= \frac{\int_0^1 \theta^{n_1+1}(1-\theta)^{n_0}\, d\theta}{\beta(n_1+1,\, n_0+1)} \qquad (30)$$

$$= \frac{\beta(n_1+2,\, n_0+1)}{\beta(n_1+1,\, n_0+1)}, \qquad (31)$$

$$ {\rm E}\left(\Theta_2 \,\middle|\, \vec{X} = \vec{x}\right) = \int_0^1 \theta\, f_{\Theta_2|\vec{X}}(\theta\,|\,\vec{x})\, d\theta \qquad (32)$$

$$= \frac{\beta(n_1+3,\, n_0+1)}{\beta(n_1+2,\, n_0+1)}. \qquad (33)$$

Figure 1 shows the posterior means for different values of $n_0$ and $n_1$.
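The beta-function ratios in (31) and (33) are easy to evaluate; in fact they simplify, since $\beta(a+1,b)/\beta(a,b) = a/(a+b)$, to $(n_1+1)/(n_0+n_1+2)$ and $(n_1+2)/(n_0+n_1+3)$ respectively. A short sketch (our own helper names) comparing them with the ML estimate:

```python
from math import gamma

def beta_fn(a, b):
    # Beta function via beta(a, b) = Gamma(a) Gamma(b) / Gamma(a + b).
    return gamma(a) * gamma(b) / gamma(a + b)

def posterior_mean_uniform(n0, n1):
    # Posterior mean under the uniform prior, eq. (31).
    return beta_fn(n1 + 2, n0 + 1) / beta_fn(n1 + 1, n0 + 1)

def posterior_mean_skewed(n0, n1):
    # Posterior mean under the skewed prior, eq. (33).
    return beta_fn(n1 + 3, n0 + 1) / beta_fn(n1 + 2, n0 + 1)

for n0, n1 in [(3, 1), (91, 9)]:
    ml = n1 / (n0 + n1)
    print(n0, n1, posterior_mean_uniform(n0, n1), posterior_mean_skewed(n0, n1), ml)
```

For few flips the two posterior means differ noticeably; for $(n_0, n_1) = (91, 9)$ both are close to each other and to the ML estimate, matching the discussion of Figure 1.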

3.2 Maximum-a-posteriori estimation

An alternative to the posterior mean is the posterior mode, which is the maximum of the
pdf or the pmf of the posterior distribution.

Definition 3.3 (Maximum-a-posteriori estimator). The maximum-a-posteriori (MAP) estimator of a parameter $\vec{\Theta}$, given data $\vec{x}$ modeled as a realization of a random vector $\vec{X}$, is

$$\theta_{\rm MAP}(\vec{x}) := \arg\max_{\vec{\theta}}\, p_{\vec{\Theta}|\vec{X}}\left(\vec{\theta} \,\middle|\, \vec{x}\right) \qquad (34)$$

if $\vec{\Theta}$ is modeled as a discrete random variable, and

$$\theta_{\rm MAP}(\vec{x}) := \arg\max_{\vec{\theta}}\, f_{\vec{\Theta}|\vec{X}}\left(\vec{\theta} \,\middle|\, \vec{x}\right) \qquad (35)$$

if it is modeled as a continuous random variable.

In Figure 1 the ML estimate of the parameter is the mode (maximum value) of the posterior distribution when the prior is uniform. This is not a coincidence: under a uniform prior the MAP and ML estimates are the same.

Lemma 3.4. The maximum-likelihood estimator of a parameter $\vec{\Theta}$ is the mode (maximum value) of the pdf of the posterior distribution given the data $\vec{X}$ if its prior distribution is uniform.

Proof. We prove the result when the model for the data and the parameters is continuous; if either or both of them are discrete, the proof is identical (in that case the ML estimator is the mode of the pmf of the posterior). If the prior distribution of the parameters is uniform, then $f_{\vec{\Theta}}(\vec{\theta})$ is constant for any $\vec{\theta}$, which implies

$$\arg\max_{\vec{\theta}}\, f_{\vec{\Theta}|\vec{X}}\left(\vec{\theta} \,\middle|\, \vec{x}\right) = \arg\max_{\vec{\theta}}\, \frac{f_{\vec{\Theta}}(\vec{\theta})\, f_{\vec{X}|\vec{\Theta}}(\vec{x}\,|\,\vec{\theta})}{\int_u f_{\vec{\Theta}}(u)\, f_{\vec{X}|\vec{\Theta}}(\vec{x}\,|\,u)\, du} \qquad (36)$$

$$= \arg\max_{\vec{\theta}}\, f_{\vec{X}|\vec{\Theta}}\left(\vec{x}\,\middle|\,\vec{\theta}\right) \qquad \text{(the rest of the terms don't depend on } \vec{\theta})$$

$$= \arg\max_{\vec{\theta}}\, \mathcal{L}_{\vec{x}}\left(\vec{\theta}\right). \qquad (37)$$
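Lemma 3.4 is easy to check numerically for the Bernoulli model with the uniform prior: the posterior (7) is proportional to the likelihood (3), so a grid search over either function finds the same maximizer, which is the ML estimate $n_1/(n_0+n_1)$. A quick sketch (illustrative only):

```python
# Under a uniform prior the posterior is proportional to the likelihood,
# so maximizing the likelihood also locates the posterior mode.
def likelihood(theta, n0, n1):
    # Bernoulli likelihood, eq. (3).
    return theta**n1 * (1 - theta)**n0

n0, n1 = 3, 7
grid = [i / 1000 for i in range(1001)]
theta_map = max(grid, key=lambda t: likelihood(t, n0, n1))  # posterior mode
theta_ml = n1 / (n0 + n1)                                   # ML estimate
print(theta_map, theta_ml)  # both 0.7
```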

Note that uniform priors are only well defined in situations where the parameter is restricted
to a bounded set.
We now describe a situation in which the MAP estimator is optimal. If the parameter
can only take a discrete set of values, then the MAP estimator minimizes the probability of
making the wrong choice.
Theorem 3.5 (MAP estimator minimizes the probability of error). Let $\vec{\Theta}$ be a discrete random vector and let $\vec{X}$ be a random vector modeling the data. We define

$$\theta_{\rm MAP}(\vec{x}) := \arg\max_{\vec{\theta}}\, p_{\vec{\Theta}|\vec{X}}\left(\vec{\theta} \,\middle|\, \vec{x}\right). \qquad (38)$$

For any arbitrary estimator $\theta_{\rm other}(\vec{x})$,

$$ {\rm P}\left(\theta_{\rm other}(\vec{X}) \ne \vec{\Theta}\right) \ge {\rm P}\left(\theta_{\rm MAP}(\vec{X}) \ne \vec{\Theta}\right). \qquad (39)$$

In words, the MAP estimator minimizes the probability of error.

Proof. We assume that $\vec{X}$ is a continuous random vector, but the same argument applies if it is discrete. We have

$$ {\rm P}\left(\vec{\Theta} = \theta_{\rm other}(\vec{X})\right) = \int_{\vec{x}} f_{\vec{X}}(\vec{x})\, {\rm P}\left(\vec{\Theta} = \theta_{\rm other}(\vec{x}) \,\middle|\, \vec{X} = \vec{x}\right) d\vec{x} \qquad (40)$$

$$= \int_{\vec{x}} f_{\vec{X}}(\vec{x})\, p_{\vec{\Theta}|\vec{X}}\left(\theta_{\rm other}(\vec{x}) \,\middle|\, \vec{x}\right) d\vec{x} \qquad (41)$$

$$\le \int_{\vec{x}} f_{\vec{X}}(\vec{x})\, p_{\vec{\Theta}|\vec{X}}\left(\theta_{\rm MAP}(\vec{x}) \,\middle|\, \vec{x}\right) d\vec{x} \qquad (42)$$

$$= {\rm P}\left(\vec{\Theta} = \theta_{\rm MAP}(\vec{X})\right), \qquad (43)$$

where (42) follows from the definition of the MAP estimator as the mode of the posterior.

Example 3.6 (Sending bits). We consider a very simple model for a communication channel in which we aim to send a signal $\Theta$ consisting of a single bit. Our prior knowledge indicates that the signal is equal to one with probability 1/4:

$$p_{\Theta}(1) = \frac{1}{4}, \qquad p_{\Theta}(0) = \frac{3}{4}. \qquad (44)$$

Due to the presence of noise in the channel, we send the signal $n$ times. At the receiver we observe

$$\vec{X}_i = \Theta + \vec{Z}_i, \qquad 1 \le i \le n, \qquad (45)$$

where $\vec{Z}$ contains $n$ iid standard Gaussian random variables. Modeling perturbations as Gaussian is a popular choice in communications. It is justified by the central limit theorem, under the assumption that the noise is a combination of many small effects that are approximately independent.
We will now compute and compare the ML and MAP estimators of $\Theta$ given the observations. The likelihood is equal to

$$\mathcal{L}_{\vec{x}}(\theta) = \prod_{i=1}^{n} f_{\vec{X}_i|\Theta}(\vec{x}_i \,|\, \theta) \qquad (46)$$

$$= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(\vec{x}_i - \theta)^2}{2}}. \qquad (47)$$

It is easier to deal with the log-likelihood function,

$$\log \mathcal{L}_{\vec{x}}(\theta) = -\sum_{i=1}^{n} \frac{(\vec{x}_i - \theta)^2}{2} - \frac{n}{2}\log 2\pi. \qquad (48)$$

Since $\Theta$ only takes two values, we can compare directly. We will choose $\theta_{\rm ML}(\vec{x}) = 1$ if

$$\log \mathcal{L}_{\vec{x}}(1) = -\sum_{i=1}^{n} \frac{\vec{x}_i^2 - 2\vec{x}_i + 1}{2} - \frac{n}{2}\log 2\pi \qquad (49)$$

$$\ge -\sum_{i=1}^{n} \frac{\vec{x}_i^2}{2} - \frac{n}{2}\log 2\pi \qquad (50)$$

$$= \log \mathcal{L}_{\vec{x}}(0). \qquad (51)$$

Equivalently,

$$\theta_{\rm ML}(\vec{x}) = \begin{cases} 1 & \text{if } \frac{1}{n}\sum_{i=1}^{n} \vec{x}_i > \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases} \qquad (52)$$

The rule makes a lot of sense: if the sample mean of the data is closer to 1 than to 0, then our estimate is equal to 1. By the law of total probability, the probability of error of this estimator is equal to

$$ {\rm P}\left(\Theta \ne \theta_{\rm ML}(\vec{X})\right) = {\rm P}\left(\Theta \ne \theta_{\rm ML}(\vec{X}) \,\middle|\, \Theta = 0\right) {\rm P}(\Theta = 0) + {\rm P}\left(\Theta \ne \theta_{\rm ML}(\vec{X}) \,\middle|\, \Theta = 1\right) {\rm P}(\Theta = 1)$$

$$= {\rm P}\left(\frac{1}{n}\sum_{i=1}^{n} \vec{X}_i > \frac{1}{2} \,\middle|\, \Theta = 0\right) {\rm P}(\Theta = 0) + {\rm P}\left(\frac{1}{n}\sum_{i=1}^{n} \vec{X}_i < \frac{1}{2} \,\middle|\, \Theta = 1\right) {\rm P}(\Theta = 1)$$

$$= Q\left(\sqrt{n}/2\right), \qquad (53)$$

where the last equality follows from the fact that if we condition on $\Theta = \theta$ the empirical mean is Gaussian with mean $\theta$ and variance $1/n$ (see the proof of Theorem 3.2 in Lecture Notes 6).
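The closed-form error probability (53) can be verified by simulating the channel. A sketch of our own using the standard library ($Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$, with a fixed seed for reproducibility):

```python
import random
from math import erfc, sqrt

random.seed(1)

def q_func(x):
    # Gaussian tail probability Q(x) = P(N(0,1) > x).
    return 0.5 * erfc(x / sqrt(2))

def ml_error_rate(n, trials=200_000):
    # Empirical error probability of the ML rule (threshold 1/2), eq. (52).
    errors = 0
    for _ in range(trials):
        theta = 1 if random.random() < 0.25 else 0   # prior p(1) = 1/4
        mean = theta + sum(random.gauss(0, 1) for _ in range(n)) / n
        guess = 1 if mean > 0.5 else 0
        errors += guess != theta
    return errors / trials

n = 4
print(ml_error_rate(n), q_func(sqrt(n) / 2))  # both close to Q(1)
```

The empirical error rate matches $Q(\sqrt{n}/2)$ up to Monte Carlo noise.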
To compute the MAP estimate we must find the maximum of the posterior distribution of $\Theta$ given the observed data. Equivalently, we find the maximum of the logarithm of the posterior pmf (this is equivalent because the logarithm is a monotone function),

$$\log p_{\Theta|\vec{X}}(\theta\,|\,\vec{x}) = \log \frac{\prod_{i=1}^{n} f_{\vec{X}_i|\Theta}(\vec{x}_i\,|\,\theta)\, p_{\Theta}(\theta)}{f_{\vec{X}}(\vec{x})} \qquad (54)$$

$$= \sum_{i=1}^{n} \log f_{\vec{X}_i|\Theta}(\vec{x}_i\,|\,\theta) + \log p_{\Theta}(\theta) - \log f_{\vec{X}}(\vec{x}) \qquad (55)$$

$$= -\sum_{i=1}^{n} \frac{\vec{x}_i^2 - 2\vec{x}_i\theta + \theta^2}{2} - \frac{n}{2}\log 2\pi + \log p_{\Theta}(\theta) - \log f_{\vec{X}}(\vec{x}). \qquad (56)$$

We compare the value of this function for the two possible values of $\Theta$: 0 and 1. We choose $\theta_{\rm MAP}(\vec{x}) = 1$ if

$$\log p_{\Theta|\vec{X}}(1\,|\,\vec{x}) + \log f_{\vec{X}}(\vec{x}) = -\sum_{i=1}^{n} \frac{\vec{x}_i^2 - 2\vec{x}_i + 1}{2} - \frac{n}{2}\log 2\pi - \log 4 \qquad (57)$$

$$\ge -\sum_{i=1}^{n} \frac{\vec{x}_i^2}{2} - \frac{n}{2}\log 2\pi - \log 4 + \log 3 \qquad (58)$$

$$= \log p_{\Theta|\vec{X}}(0\,|\,\vec{x}) + \log f_{\vec{X}}(\vec{x}). \qquad (59)$$

Equivalently,

$$\theta_{\rm MAP}(\vec{x}) = \begin{cases} 1 & \text{if } \frac{1}{n}\sum_{i=1}^{n} \vec{x}_i > \frac{1}{2} + \frac{\log 3}{n}, \\ 0 & \text{otherwise.} \end{cases} \qquad (60)$$

The MAP estimate shifts the threshold with respect to the ML estimate to take into account that $\Theta$ is more prone to equal zero. However, the correction term tends to zero as we gather more evidence, so if a lot of data is available the two estimators will be very similar.
The probability of error of the MAP estimator is equal to

$$ {\rm P}\left(\Theta \ne \theta_{\rm MAP}(\vec{X})\right) = {\rm P}\left(\Theta \ne \theta_{\rm MAP}(\vec{X}) \,\middle|\, \Theta = 0\right) {\rm P}(\Theta = 0) + {\rm P}\left(\Theta \ne \theta_{\rm MAP}(\vec{X}) \,\middle|\, \Theta = 1\right) {\rm P}(\Theta = 1)$$

$$= {\rm P}\left(\frac{1}{n}\sum_{i=1}^{n} \vec{X}_i > \frac{1}{2} + \frac{\log 3}{n} \,\middle|\, \Theta = 0\right) {\rm P}(\Theta = 0) \qquad (61)$$

$$\quad + {\rm P}\left(\frac{1}{n}\sum_{i=1}^{n} \vec{X}_i < \frac{1}{2} + \frac{\log 3}{n} \,\middle|\, \Theta = 1\right) {\rm P}(\Theta = 1)$$

$$= \frac{3}{4}\, Q\left(\frac{\sqrt{n}}{2} + \frac{\log 3}{\sqrt{n}}\right) + \frac{1}{4}\, Q\left(\frac{\sqrt{n}}{2} - \frac{\log 3}{\sqrt{n}}\right). \qquad (62)$$

[Figure 3: probability of error (vertical axis, 0 to 0.35) of the ML and MAP estimators as a function of n (horizontal axis, 0 to 20).]
Figure 3: Probability of error of the ML and MAP estimators in Example 3.6 for different values
of n.

We compare the probability of error of the ML and MAP estimators in Figure 3. MAP
estimation results in better performance, but the difference becomes small as n increases.
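The curves in Figure 3 come directly from the closed-form expressions (53) and (62). A short sketch (our own helper names) that evaluates both and confirms the comparison:

```python
from math import erfc, log, sqrt

def q_func(x):
    # Gaussian tail probability Q(x) = P(N(0,1) > x).
    return 0.5 * erfc(x / sqrt(2))

def ml_error(n):
    # Probability of error of the ML estimator, eq. (53).
    return q_func(sqrt(n) / 2)

def map_error(n):
    # Probability of error of the MAP estimator, eq. (62).
    return 0.75 * q_func(sqrt(n) / 2 + log(3) / sqrt(n)) \
         + 0.25 * q_func(sqrt(n) / 2 - log(3) / sqrt(n))

for n in [1, 5, 10, 20]:
    print(n, ml_error(n), map_error(n))  # MAP error is never larger
```

As Theorem 3.5 guarantees, the MAP error is at most the ML error for every $n$, and the gap shrinks as $n$ grows.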

