
Maximum Likelihood and Maximum A Posteriori
We looked at the regularization term as a penalty term in the objective function. There is
another way to interpret the regularization term as well. Specifically, there is a Bayesian
interpretation.

$$
\begin{aligned}
\min E^*(w) &= \max -E^*(w) \\
&= \max \exp\left\{ -E^*(w) \right\} \\
&= \max \exp\left\{ -\frac{1}{2}\sum_{n=1}^{N}\left( y(x_n, w) - t_n \right)^2 - \frac{\lambda}{2}\|w\|_2^2 \right\} \\
&= \max \exp\left\{ -\frac{1}{2}\sum_{n=1}^{N}\left( y(x_n, w) - t_n \right)^2 \right\} \exp\left\{ -\frac{1}{2}\lambda\|w\|_2^2 \right\} \\
&= \max \prod_{n=1}^{N} \exp\left\{ -\frac{1}{2}\left( y(x_n, w) - t_n \right)^2 \right\} \exp\left\{ -\frac{1}{2}\lambda\|w\|_2^2 \right\}
\end{aligned}
$$
So, this is a maximization of the data likelihood with a prior: p(X | w)p(w)
Method of Maximum Likelihood:
A data likelihood is how likely the data is given the parameter set
So, if we want to maximize how likely the data is to have come from the model we fit,
we should find the parameters that maximize the likelihood
A common trick to maximizing the likelihood is to maximize the log likelihood. This often
makes the math much easier. Why can we maximize the log likelihood instead of the
likelihood and still get the same answer?

Because the logarithm is a monotonically increasing function, the maximizer of the log
likelihood is the same as the maximizer of the likelihood.

Consider: $\max \ln \exp\left\{ -\frac{1}{2}\left( y(x_n, w) - t_n \right)^2 \right\} = \max -\frac{1}{2}\left( y(x_n, w) - t_n \right)^2$. We get back to our original objective.
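As a quick sanity check (a minimal sketch with made-up data, not part of the original derivation), we can verify numerically that the likelihood and the log likelihood have the same maximizer:

In [ ]:
import numpy as np

# Sanity check (made-up targets): because ln is monotonically increasing, the w that
# maximizes exp{-0.5 * sum_n (w - t_n)^2} also maximizes its log.
t = np.array([1.0, 2.0, 0.5, 1.5])          # hypothetical targets t_n
w_grid = np.linspace(-2, 4, 1001)           # grid of candidate w values

neg_obj = np.array([-0.5 * np.sum((w - t) ** 2) for w in w_grid])
likelihood = np.exp(neg_obj)                # exp of the negative objective

print(w_grid[np.argmax(likelihood)])        # maximizer of the likelihood
print(w_grid[np.argmax(neg_obj)])           # maximizer of the log likelihood (same w)
print(t.mean())                             # both are close to the sample mean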

Method of Maximum A Posteriori (MAP):


Bayes' Rule: $p(Y \mid X) = \dfrac{p(X \mid Y)\,p(Y)}{p(X)}$

Consider: $p(w \mid D) = \dfrac{p(D \mid w)\,p(w)}{p(D)}$, i.e., posterior $\propto$ likelihood $\times$ prior
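As a small illustration (a minimal sketch with made-up flips and a flat prior, not part of the original notes), we can compute a posterior over $\mu$ on a grid by multiplying likelihood and prior and normalizing:

In [ ]:
import numpy as np

# Minimal sketch (made-up flips): posterior over mu on a grid as likelihood * prior,
# normalized so that it sums to one over the grid (the p(D) term).
mu_grid = np.linspace(0.001, 0.999, 999)
flips = np.array([1, 1, 0, 1])                       # hypothetical data: 3 heads, 1 tail

likelihood = mu_grid ** flips.sum() * (1 - mu_grid) ** (len(flips) - flips.sum())
prior = np.ones_like(mu_grid) / len(mu_grid)         # flat prior over the grid
posterior = likelihood * prior
posterior = posterior / posterior.sum()              # normalize by the evidence

print(mu_grid[np.argmax(posterior)])                 # posterior mode (0.75 with a flat prior)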

Maximum Likelihood vs. Maximum A Posteriori (MAP)


Let's look at this in terms of binary variables, e.g., flipping a coin: $X = 1$ is heads, $X = 0$ is tails.
Let $\mu$ be the probability of heads. If we know $\mu$, then $P(x = 1 \mid \mu) = \mu$ and $P(x = 0 \mid \mu) = 1 - \mu$.

$$
P(x \mid \mu) = \mu^x (1-\mu)^{1-x} =
\begin{cases}
\mu & \text{if } x = 1 \\
1-\mu & \text{if } x = 0
\end{cases}
$$

This is called the Bernoulli distribution. The mean and variance of a Bernoulli distribution are:

$$E[x] = \mu$$

$$E\left[(x - \mu)^2\right] = \mu(1-\mu)$$
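A quick empirical check of these two formulas (the value of $\mu$ below is an arbitrary choice):

In [ ]:
import numpy as np

# Empirical check of the Bernoulli mean and variance (mu is an arbitrary choice).
mu = 0.3
x = np.random.binomial(1, mu, size=100000)   # Bernoulli = Binomial with n = 1
print(x.mean(), mu)                          # sample mean vs. mu
print(x.var(), mu * (1 - mu))                # sample variance vs. mu(1 - mu)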

So, suppose we conducted many Bernoulli trials (e.g., coin flips) and we want to estimate μ

Method: Maximum Likelihood

$$
p(D \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n}
$$

Maximize: (What trick should we use?)

$$
L = \sum_{n=1}^{N} \left[ x_n \ln\mu + (1 - x_n)\ln(1-\mu) \right]
$$

$$
\begin{aligned}
\frac{\partial L}{\partial \mu} = 0 &= \frac{1}{\mu}\sum_{n=1}^{N} x_n - \frac{1}{1-\mu}\sum_{n=1}^{N}(1 - x_n) \\
0 &= \frac{(1-\mu)\sum_{n=1}^{N} x_n - \mu\sum_{n=1}^{N}(1 - x_n)}{\mu(1-\mu)} \\
0 &= \sum_{n=1}^{N} x_n - \mu\sum_{n=1}^{N} x_n - \mu\sum_{n=1}^{N} 1 + \mu\sum_{n=1}^{N} x_n \\
0 &= \sum_{n=1}^{N} x_n - \mu N \\
\mu &= \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{m}{N}
\end{aligned}
$$

where m is the number of successful trials.

So, if we flip a coin 1 time and get heads, then μ = 1 and probability of getting tails is 0.
Would you believe that? We need a prior!

Method: Maximum A Posteriori:


Look at several independent trials. Consider N = 3 and m = 2 (N is number of trials, m is
number of successes) and look at all ways to get 2 H and 1 T:
$$
\begin{aligned}
H\,H\,T &\rightarrow \mu\,\mu\,(1-\mu) = \mu^2(1-\mu) \\
H\,T\,H &\rightarrow \mu\,(1-\mu)\,\mu = \mu^2(1-\mu) \\
T\,H\,H &\rightarrow (1-\mu)\,\mu\,\mu = \mu^2(1-\mu)
\end{aligned}
$$

$$
\binom{3}{2}\mu^2(1-\mu) \;\rightarrow\; \binom{N}{m}\mu^m(1-\mu)^{N-m} = \frac{N!}{(N-m)!\,m!}\,\mu^m(1-\mu)^{N-m}
$$

This is the Binomial distribution, which gives the probability of m observations of x = 1 out of N
independent trials.
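As a small check of the formula (a sketch using the N = 3, m = 2 example above and an arbitrary $\mu = 0.5$):

In [ ]:
import math

# Binomial probability for the N = 3, m = 2 example with mu = 0.5 (arbitrary choice).
N, m, mu = 3, 2, 0.5
pmf = math.comb(N, m) * mu**m * (1 - mu)**(N - m)
print(pmf)   # 3 * 0.5**2 * 0.5 = 0.375: the three orderings HHT, HTH, THH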
So, what we saw is that we need a prior. We want to incorporate our prior belief. Let us
place a prior on μ

$$
\text{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1}
$$

$$
E[\mu] = \frac{a}{a+b}
$$

$$
\text{Var}[\mu] = \frac{ab}{(a+b)^2(a+b+1)}
$$

Note: $\Gamma(x) = \int_0^\infty u^{x-1}e^{-u}\,du$ and, when x is an integer, it simplifies to $(x-1)!$
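A quick sampling check of the Beta mean and variance formulas (a sketch; a and b below match the priorA and priorB used in the code later, but any values work):

In [ ]:
import numpy as np

# Check E[mu] and Var[mu] for a Beta(a, b) prior by sampling.
a, b = 2, 2
samples = np.random.beta(a, b, size=100000)
print(samples.mean(), a / (a + b))                           # ~0.5
print(samples.var(), a * b / ((a + b) ** 2 * (a + b + 1)))   # ~0.05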
Calculation of the posterior: take N = m + l observations (m successes and l failures):

$$
\begin{aligned}
p(\mu \mid m, l, a, b) &\propto \text{Bin}(m, l \mid \mu)\,\text{Beta}(\mu \mid a, b) \\
&\propto \mu^m (1-\mu)^l \, \mu^{a-1}(1-\mu)^{b-1} \\
&= \mu^{m+a-1}(1-\mu)^{l+b-1}
\end{aligned}
$$

What does this look like? Beta: a ← m + a, b ← l + b


So, what's the posterior?

$$
p(\mu \mid m, l, a, b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\,\mu^{m+a-1}(1-\mu)^{l+b-1}
$$

Conjugate prior relationship: when the posterior has the same functional form as the prior.
Now we can maximize the (log of the) posterior:

$$
\max_{\mu}\;\big( (m+a-1)\ln\mu + (l+b-1)\ln(1-\mu) \big)
$$

$$
\begin{aligned}
\frac{\partial L}{\partial \mu} = 0 &= \frac{m+a-1}{\mu} - \frac{l+b-1}{1-\mu} \\
0 &= (1-\mu)(m+a-1) - \mu(l+b-1) \\
0 &= (m+a-1) - \mu(m+a-1) - \mu(l+b-1) \\
\mu &= \frac{m+a-1}{m+a+l+b-2}
\end{aligned}
$$

This is the MAP solution. So, what happens now when you flip one head, two heads, etc.?
Discuss online updating of the prior. Eventually the data takes over the prior.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import math
%matplotlib inline

priorA = 2
priorB = 2

def plotBeta(a=priorA, b=priorB):
    '''plotBeta(a=1,b=1): Plot the Beta distribution with parameters a and b'''
    xrange = np.arange(0, 1, 0.001)  # equally spaced points in the x range
    normconst = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    beta = normconst * xrange**(a - 1) * (1 - xrange)**(b - 1)
    fig = plt.figure()
    p1 = plt.plot(xrange, beta, 'g')
    plt.show()

# Beta distribution (prior)
# plotBeta(priorA, priorB)

# Flip a coin repeatedly and compare the ML and MAP estimates of mu after each flip
trueMu = 0.5
numFlips = 100
flipResult = []
for flip in range(numFlips):
    flipResult.append(np.random.binomial(1, trueMu, 1)[0])
    print(flipResult)
    print('Frequentist/Maximum Likelihood Probability of Heads: ' +
          str(sum(flipResult) / len(flipResult)))
    print('Bayesian/MAP Probability of Heads: ' +
          str((sum(flipResult) + priorA - 1) / (len(flipResult) + priorA + priorB - 2)))
    if input("Hit enter to continue, or q to quit...\n") == "q":
        print("quitting...\n")
        break

The Gaussian Distribution:


Consider a univariate Gaussian distribution:

$$
N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right\}
$$

$\sigma^2$ is the variance, or $\frac{1}{\sigma^2}$ is the precision.
So, as the precision $\frac{1}{\sigma^2}$ gets big, the variance gets smaller/tighter. As the precision gets small, the variance gets larger/wider.
The Gaussian distribution is also called the Normal distribution.
We will often write $N(x \mid \mu, \sigma^2)$ to refer to a Gaussian with mean $\mu$ and variance $\sigma^2$.
What is the multi-variate Gaussian distribution?
What is the expected value of x for the Gaussian distribution?

$$
E[x] = \int x\,p(x)\,dx = \int x\,\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right\} dx
$$
Change of variables: Let
$$
y = \frac{x-\mu}{\sigma} \;\rightarrow\; x = \sigma y + \mu
$$

$$
dy = \frac{1}{\sigma}dx \;\rightarrow\; dx = \sigma\,dy
$$

Plugging this into the expectation:

$$
\begin{aligned}
E[x] &= \int (\sigma y + \mu)\,\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}y^2 \right\}\sigma\,dy \\
&= \int \frac{\sigma y}{\sqrt{2\pi}}\exp\left\{ -\frac{1}{2}y^2 \right\} dy + \int \frac{\mu}{\sqrt{2\pi}}\exp\left\{ -\frac{1}{2}y^2 \right\} dy
\end{aligned}
$$

The integrand of the first term is an odd function, $f(-y) = -f(y)$, so that integral is 0. The second term is $\mu$ times a standard normal density, which integrates to 1. So $E[x] = 0 + \mu = \mu$.
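A quick Monte Carlo check of this result (a sketch; $\mu$ and $\sigma$ below are arbitrary):

In [ ]:
import numpy as np

# Monte Carlo check that the Gaussian mean is mu (values are arbitrary).
mu, sigma = 2.0, 1.5
x = np.random.normal(mu, sigma, size=100000)
print(x.mean())   # close to 2.0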

MLE of Mean of Gaussian


Let $x_1, x_2, \ldots, x_N$ be samples from a multivariate Normal distribution with known
covariance matrix and an unknown mean. Given this data, obtain the ML estimate of the
mean vector.

$$
p(x_k \mid \mu) = \frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}}\exp\left( -\frac{1}{2}(x_k-\mu)^T\Sigma^{-1}(x_k-\mu) \right)
$$

where $l$ is the dimensionality of $x_k$.
We can define our likelihood given the N data points. We are assuming these data points are
drawn independently but from an identical distribution (i.i.d.):

$$
\prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}}\exp\left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right)
$$
We can apply our "trick" to simplify

$$
\begin{aligned}
L = \ln\prod_{n=1}^{N} p(x_n \mid \mu) &= \ln\prod_{n=1}^{N} \frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}}\exp\left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right) \\
&= \sum_{n=1}^{N} \ln\left[ \frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}}\exp\left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right) \right] \\
&= \sum_{n=1}^{N} \left( \ln\frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}} - \frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right) \\
&= -N\ln\left( (2\pi)^{l/2}|\Sigma|^{1/2} \right) + \sum_{n=1}^{N} \left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right)
\end{aligned}
$$

Now, let's maximize:

$$
\frac{\partial L}{\partial \mu} = \frac{\partial}{\partial \mu}\left[ -N\ln\left( (2\pi)^{l/2}|\Sigma|^{1/2} \right) + \sum_{n=1}^{N} \left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right) \right] = 0
$$

$$
\begin{aligned}
&\rightarrow \sum_{n=1}^{N}\Sigma^{-1}(x_n - \mu) = 0 \\
&\rightarrow \sum_{n=1}^{N}\Sigma^{-1}x_n = \sum_{n=1}^{N}\Sigma^{-1}\mu \\
&\rightarrow \Sigma^{-1}\sum_{n=1}^{N}x_n = \Sigma^{-1}\mu N \\
&\rightarrow \sum_{n=1}^{N}x_n = \mu N \\
&\rightarrow \frac{\sum_{n=1}^{N}x_n}{N} = \mu
\end{aligned}
$$

So, the ML estimate of $\mu$ is the sample mean!
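A small numerical sketch of this result (the mean vector and covariance below are arbitrary illustration values):

In [ ]:
import numpy as np

# The ML estimate of the mean of a multivariate Gaussian is the sample mean.
mu_true = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
X = np.random.multivariate_normal(mu_true, Sigma, size=5000)

mu_ml = X.sum(axis=0) / len(X)                  # (1/N) * sum_n x_n
print(mu_ml)                                    # close to mu_true
print(np.allclose(mu_ml, X.mean(axis=0)))       # identical to the sample mean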

MAP of Mean of Gaussian
To get a MAP estimate of the mean of a Gaussian, we apply a prior distribution and
maximize the posterior.
Let's use a Gaussian prior on the mean (because it has a conjugate prior relationship):

$$
p(\mu \mid X, \mu_0, \sigma_0^2, \sigma^2) \propto N(X \mid \mu, \sigma^2)\,N(\mu \mid \mu_0, \sigma_0^2)
= \left( \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2\sigma^2}(x_n-\mu)^2 \right\} \right) \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left\{ -\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2 \right\}
$$

$$
L = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2 - \frac{1}{2}\ln(2\pi\sigma_0^2) - \frac{1}{2\sigma_0^2}(\mu-\mu_0)^2
$$

$$
\frac{\partial L}{\partial \mu} = -\frac{N}{\sigma^2}\mu - \frac{1}{\sigma_0^2}\mu + \frac{1}{\sigma_0^2}\mu_0 + \frac{1}{\sigma^2}\sum_{n=1}^{N}x_n = 0
$$

$$
\left( \frac{N\sigma_0^2 + \sigma^2}{\sigma^2\sigma_0^2} \right)\mu = \frac{1}{\sigma^2}\sum_{n=1}^{N}x_n + \frac{1}{\sigma_0^2}\mu_0
$$

$$
\mu_{\text{MAP}} = \frac{\sigma_0^2}{N\sigma_0^2 + \sigma^2}\sum_{n=1}^{N}x_n + \frac{\mu_0\,\sigma^2}{N\sigma_0^2 + \sigma^2}
$$

Does this result make sense?
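A small sketch of the closed-form $\mu_{\text{MAP}}$ above (all numbers below are arbitrary illustration values). As N grows, the MAP estimate moves from the prior mean toward the sample mean, i.e., the data takes over the prior:

In [ ]:
import numpy as np

# MAP estimate of a Gaussian mean with a Gaussian prior (values are arbitrary).
mu_true, sigma = 1.0, 2.0      # data-generating mean, known standard deviation
mu0, sigma0 = 0.0, 0.5         # prior mean and prior standard deviation

for N in [1, 10, 100, 10000]:
    x = np.random.normal(mu_true, sigma, size=N)
    mu_map = (sigma0**2 * x.sum() + sigma**2 * mu0) / (N * sigma0**2 + sigma**2)
    print(N, x.mean(), mu_map)  # MAP starts near mu0 and approaches the sample mean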

