Posterior
We looked at the regularization term as a penalty term in the objective function. There is
another way to interpret the regularization term as well. Specifically, there is a Bayesian
interpretation.
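$$
\begin{aligned}
\min E^*(w) \;&\Longleftrightarrow\; \max \exp\left\{ -E^*(w) \right\} \\
&= \max \exp\left\{ -\frac{1}{2}\sum_{n=1}^{N}\left(y(x_n, w) - t_n\right)^2 - \frac{\lambda}{2}\|w\|_2^2 \right\} \\
&= \max \exp\left\{ -\frac{1}{2}\sum_{n=1}^{N}\left(y(x_n, w) - t_n\right)^2 \right\} \exp\left\{ -\frac{1}{2}\lambda\|w\|_2^2 \right\} \\
&= \max \prod_{n=1}^{N} \exp\left\{ -\frac{1}{2}\left(y(x_n, w) - t_n\right)^2 \right\} \exp\left\{ -\frac{1}{2}\lambda\|w\|_2^2 \right\}
\end{aligned}
$$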
So, this is a maximization of the data likelihood with a prior: p(X | w)p(w)
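As a minimal sketch of this equivalence, the code below assumes a linear model y(x, w) = Φw with a made-up design matrix `Phi`, targets `t`, and weight `lam` (all hypothetical, not from these notes). It checks that the w minimizing E*(w) also gives the largest value of exp{−E*(w)} among randomly perturbed alternatives.

```python
import numpy as np

def regularized_error(w, Phi, t, lam):
    """E*(w) = 1/2 * sum of squared errors + lam/2 * ||w||^2."""
    residual = Phi @ w - t
    return 0.5 * residual @ residual + 0.5 * lam * w @ w

def unnormalized_posterior(w, Phi, t, lam):
    """exp{-E*(w)}: likelihood term times prior term, up to constants."""
    return np.exp(-regularized_error(w, Phi, t, lam))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))            # hypothetical design matrix
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
lam = 0.5                                 # hypothetical regularization weight

# Closed-form minimizer of E*(w) for a linear model (ridge regression).
w_star = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ t)

# Any perturbation of w_star should lower exp{-E*(w)}.
perturbed = [w_star + 0.1 * rng.normal(size=3) for _ in range(100)]
p_star = unnormalized_posterior(w_star, Phi, t, lam)
best_is_w_star = all(
    unnormalized_posterior(w, Phi, t, lam) <= p_star for w in perturbed
)
print(best_is_w_star)
```

The closed-form solve is specific to the linear model assumed here; for a nonlinear y(x, w) the same comparison would need a numerical optimizer.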
Method of Maximum Likelihood:
A data likelihood is how likely the data is given the parameter set
So, if we want to maximize how likely the data is to have come from the model we fit,
we should find the parameters that maximize the likelihood
A common trick for maximizing the likelihood is to maximize the log likelihood instead, which often
makes the math much easier. Why can we maximize the log likelihood instead of the
likelihood and still get the same answer?
Consider: $\max \ln \exp\left\{ -\frac{1}{2}\left(y(x_n, w) - t_n\right)^2 \right\}$. We go back to our original objective.
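The log trick works because ln is strictly increasing, so it does not move the location of the maximum. A small sketch, assuming a toy model y(x, w) = w with one scalar parameter and made-up targets `t` (both hypothetical), compares the two maximizers on a grid.

```python
import numpy as np

t = np.array([0.8, 1.1, 1.4, 0.9])        # hypothetical targets
w_grid = np.linspace(-1.0, 3.0, 401)      # candidate values of a scalar w

# Toy model y(x, w) = w, so each residual is simply w - t_n.
sq_err = (w_grid[:, None] - t[None, :]) ** 2
likelihood = np.prod(np.exp(-0.5 * sq_err), axis=1)
log_likelihood = np.sum(-0.5 * sq_err, axis=1)

# Both curves peak at the same grid point.
same_argmax = np.argmax(likelihood) == np.argmax(log_likelihood)
w_best = w_grid[np.argmax(log_likelihood)]
print(w_best, same_argmax)
```

For this toy model the maximizer lands on the grid point closest to the mean of the targets, which is also the least-squares solution.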
$$
P(x \mid \mu) = \mu^{x}(1-\mu)^{1-x} =
\begin{cases}
\mu & \text{if } x = 1 \\
1-\mu & \text{if } x = 0
\end{cases}
$$
This is called the Bernoulli distribution. The mean and variance of a Bernoulli distribution are:
$$
E[x] = \mu
$$
$$
E\left[(x-\mu)^2\right] = \mu(1-\mu)
$$
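Because x takes only the values 0 and 1, both moments can be checked by summing over the two outcomes. The helper names below are illustrative, not part of the original notes.

```python
def bernoulli_pmf(x, mu):
    """P(x | mu) = mu^x * (1 - mu)^(1 - x) for x in {0, 1}."""
    return mu**x * (1 - mu) ** (1 - x)

def bernoulli_moments(mu):
    """Mean and variance computed directly from the two-point distribution."""
    mean = sum(x * bernoulli_pmf(x, mu) for x in (0, 1))
    var = sum((x - mean) ** 2 * bernoulli_pmf(x, mu) for x in (0, 1))
    return mean, var

mean, var = bernoulli_moments(0.3)
print(mean, var)
```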
So, suppose we conducted many Bernoulli trials (e.g., coin flips) and we want to estimate μ
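$$
\begin{aligned}
p(D \mid \mu) &= \prod_{n=1}^{N} p(x_n \mid \mu) \\
&= \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n}
\end{aligned}
$$
$$
L = \sum_{n=1}^{N} x_n \ln\mu + (1-x_n)\ln(1-\mu)
$$
$$
\begin{aligned}
\frac{\partial L}{\partial \mu} = 0 &= \frac{1}{\mu}\sum_{n=1}^{N} x_n - \frac{1}{1-\mu}\sum_{n=1}^{N}(1-x_n) \\
0 &= \frac{(1-\mu)\sum_{n=1}^{N} x_n - \mu\sum_{n=1}^{N}(1-x_n)}{\mu(1-\mu)} \\
0 &= \sum_{n=1}^{N} x_n - \mu\sum_{n=1}^{N} x_n - \mu\sum_{n=1}^{N} 1 + \mu\sum_{n=1}^{N} x_n \\
0 &= \sum_{n=1}^{N} x_n - \mu N \\
\mu &= \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{m}{N}
\end{aligned}
$$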
So, if we flip a coin 1 time and get heads, then μ = 1 and probability of getting tails is 0.
Would you believe that? We need a prior!
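A short sketch of the maximum likelihood estimate μ = m/N makes the problem concrete. The function name `bernoulli_mle` and the flip sequences are made up for illustration.

```python
def bernoulli_mle(flips):
    """Return m / N, the fraction of successes (1 = heads, 0 = tails)."""
    if len(flips) == 0:
        raise ValueError("need at least one observation")
    return sum(flips) / len(flips)

print(bernoulli_mle([1]))           # a single flip that came up heads
print(bernoulli_mle([1, 0, 1, 1]))  # three heads in four flips
```

With a single head the estimate is exactly 1, which leaves no probability for tails at all.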
Look at several independent trials. Consider N = 3 and m = 2 (N is number of trials, m is
number of successes) and look at all ways to get 2 H and 1 T:
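$$
\begin{aligned}
\text{H H T} &\rightarrow \mu\mu(1-\mu) = \mu^2(1-\mu) \\
\text{H T H} &\rightarrow \mu(1-\mu)\mu = \mu^2(1-\mu) \\
\text{T H H} &\rightarrow (1-\mu)\mu\mu = \mu^2(1-\mu)
\end{aligned}
$$
$$
\binom{3}{2}\mu^2(1-\mu) \;\rightarrow\; \binom{N}{m}\mu^m(1-\mu)^{N-m} = \frac{N!}{(N-m)!\,m!}\,\mu^m(1-\mu)^{N-m}
$$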
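The counting argument can be checked by brute force. The sketch below enumerates every heads/tails sequence of length N and compares the count with the binomial coefficient; the function names are illustrative.

```python
from itertools import product
from math import comb

def count_sequences(N, m):
    """Count length-N heads/tails sequences containing exactly m heads."""
    return sum(1 for seq in product("HT", repeat=N) if seq.count("H") == m)

def binomial_pmf(m, N, mu):
    """Probability of m successes in N independent Bernoulli(mu) trials."""
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)

print(count_sequences(3, 2), comb(3, 2))
print(sum(binomial_pmf(m, 3, 0.3) for m in range(4)))
```

Enumeration grows as 2^N, so it is only practical as a check for small N.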
$$
\text{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1}
$$
$$
E[\mu] = \frac{a}{a+b}
$$
$$
\text{Var}[\mu] = \frac{ab}{(a+b)^2(a+b+1)}
$$
Note: $\Gamma(x) = \int_0^\infty u^{x-1}e^{-u}\,du$, and when $x$ is a positive integer it simplifies to $(x-1)!$
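These closed forms can be sanity-checked numerically. The sketch below integrates the Beta density on a grid with a hand-written trapezoid rule; the parameter values a = 2, b = 5 are arbitrary choices for illustration.

```python
import math
import numpy as np

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) using the Gamma-function normalizer."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)

def trapezoid(values, grid):
    """Trapezoid rule on an equally spaced grid."""
    dx = grid[1] - grid[0]
    return float(np.sum((values[1:] + values[:-1]) / 2.0) * dx)

a, b = 2.0, 5.0                       # hypothetical parameters
mu = np.linspace(0.0, 1.0, 100001)
pdf = beta_pdf(mu, a, b)

total = trapezoid(pdf, mu)                    # should be close to 1
mean = trapezoid(mu * pdf, mu)                # compare with a / (a + b)
var = trapezoid((mu - mean) ** 2 * pdf, mu)   # compare with ab / ((a+b)^2 (a+b+1))
print(total, mean, var)
print(a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1)))
print(math.gamma(5), math.factorial(4))       # Gamma(x) versus (x - 1)!
```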
Calculation of the posterior: take $N = m + l$ observations ($m$ successes and $l$ failures):
$$
p(\mu \mid m, l, a, b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\,\mu^{m+a-1}(1-\mu)^{l+b-1}
$$
Conjugate Prior Relationship: When the posterior is the same form as the prior
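One way to see the conjugate relationship numerically: multiply the Bernoulli likelihood by a Beta(a, b) prior and compare the result with a Beta(a + m, b + l) density. If the two differ only by a constant factor, the posterior has the same form as the prior. The counts below are made up for illustration.

```python
import math
import numpy as np

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) using the Gamma-function normalizer."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)

a, b = 2, 2        # prior parameters (hypothetical)
m, l = 3, 1        # observed successes and failures (hypothetical)

mu = np.linspace(0.001, 0.999, 999)
unnormalized = mu**m * (1 - mu) ** l * beta_pdf(mu, a, b)   # likelihood * prior
candidate = beta_pdf(mu, a + m, b + l)                      # claimed posterior

# A constant ratio means the two curves have the same shape.
ratio = unnormalized / candidate
ratio_is_constant = bool(np.allclose(ratio, ratio[0]))
print(ratio_is_constant)
```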
Now we can maximize the (log of the) posterior:
$$
\begin{aligned}
\frac{\partial L}{\partial \mu} = 0 &= \frac{m+a-1}{\mu} - \frac{l+b-1}{1-\mu} \\
0 &= (1-\mu)(m+a-1) - \mu(l+b-1) \\
0 &= (m+a-1) - \mu(m+a-1) - \mu(l+b-1) \\
\mu &= \frac{m+a-1}{m+a+l+b-2}
\end{aligned}
$$
This is the MAP solution. So, what happens now when you flip one heads, two heads, etc.?
Discuss online updating of the prior. Eventually the data takes over the prior.
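Online updating can be sketched by folding one flip at a time into the Beta parameters and recomputing the MAP estimate after each flip. The flip sequence and function names below are made up for illustration; the prior a = b = 2 is assumed to match the values used in these notes.

```python
def map_estimate(a, b):
    """Mode of Beta(a, b): (a - 1) / (a + b - 2), valid when a + b > 2."""
    return (a - 1) / (a + b - 2)

def update(a, b, flip):
    """Fold one flip (1 = heads, 0 = tails) into the Beta parameters."""
    return (a + 1, b) if flip == 1 else (a, b + 1)

flips = [1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical sequence of flips
a, b = 2, 2                        # prior pseudo-counts
history = []
for flip in flips:
    a, b = update(a, b, flip)      # yesterday's posterior is today's prior
    history.append(map_estimate(a, b))

ml = sum(flips) / len(flips)
print(history)
print(ml)
```

After the first head the MAP estimate is 2/3 rather than 1, and as more flips arrive it moves toward the maximum likelihood value.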
import math
import numpy as np
import matplotlib.pyplot as plt

priorA = 2
priorB = 2

def plotBeta(a=priorA, b=priorB):
    '''plotBeta(a=2, b=2): Plot the Beta distribution with parameters a and b.'''
    xrange = np.arange(0, 1, 0.001)  # equally spaced points in [0, 1)
    normconst = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    beta = normconst * xrange**(a - 1) * (1 - xrange)**(b - 1)
    fig = plt.figure()
    p1 = plt.plot(xrange, beta, 'g')
    plt.show()

# Beta distribution
# plotBeta(priorA, priorB)
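import numpy as np

# priorA and priorB are the Beta prior parameters defined in the previous cell.
trueMu = 0.5  # assumed value; the definition of trueMu is not shown in these notes
numFlips = 100
flipResult = []
for flip in range(numFlips):
    # Draw one Bernoulli trial (1 = heads) and report both running estimates.
    flipResult.append(np.random.binomial(1, trueMu, 1)[0])
    print(flipResult)
    print('Frequentist/Maximum Likelihood Probability of Heads:' + str(sum(flipResult)/len(flipResult)))
    print('Bayesian/MAP Probability of Heads:' + str((sum(flipResult)+priorA-1)/(len(flipResult)+priorA+priorB-2)))
    if (input("Hit enter to continue, or q to quit...\n") == "q"):
        print("quitting...\n")
        break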
$$
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right\}
$$
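$\sigma^2$ is the variance, or equivalently $\frac{1}{\sigma^2}$ is the precision.
Comparing with the prior term $\exp\left\{-\frac{1}{2}\lambda\|w\|_2^2\right\}$, $\lambda$ plays the role of the precision of a zero-mean Gaussian prior on $w$.
So, as $\lambda$ gets big, the variance gets smaller (a tighter prior). As $\lambda$ gets small, the variance gets larger (a wider prior).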
The Gaussian distribution is also called the Normal distribution.
We will often write $\mathcal{N}(x \mid \mu, \sigma^2)$ to refer to a Gaussian with mean $\mu$ and variance $\sigma^2$.
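A minimal sketch of the univariate density in both parameterizations may help; the function names and the argument name `precision` are illustrative choices, not notation from these notes.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma2) with sigma2 the variance."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def gaussian_pdf_precision(x, mu, precision):
    """Same density, parameterized by precision = 1 / sigma2."""
    return gaussian_pdf(x, mu, 1.0 / precision)

# Higher precision gives a taller, narrower bump around the mean.
for precision in (0.25, 1.0, 4.0):
    print(precision, gaussian_pdf_precision(0.0, 0.0, precision))
```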
What is the multi-variate Gaussian distribution?
What is the expected value of x for the Gaussian distribution?
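$$
\begin{aligned}
E[x] &= \int x\,p(x)\,dx \\
&= \int x\,\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right\}dx
\end{aligned}
$$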
Change of variables: Let
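$$
y = \frac{x-\mu}{\sigma} \;\rightarrow\; x = \sigma y + \mu
$$
$$
dy = \frac{1}{\sigma}dx \;\rightarrow\; dx = \sigma\,dy
$$
$$
\begin{aligned}
E[x] &= \int (\sigma y + \mu)\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{ -\frac{1}{2}y^2 \right\}\sigma\,dy \\
&= \int \frac{\sigma y}{\sqrt{2\pi}}\exp\left\{ -\frac{1}{2}y^2 \right\}dy + \int \frac{\mu}{\sqrt{2\pi}}\exp\left\{ -\frac{1}{2}y^2 \right\}dy
\end{aligned}
$$
The first integrand is an odd function of $y$, so that integral is zero. The second is $\mu$ times the integral of a standard Gaussian density, which is one. Therefore $E[x] = \mu$.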
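The value of this integral can also be checked numerically. The sketch below approximates E[x] with a trapezoid rule over a window of ten standard deviations on each side of the mean; the window width, grid size, and test values are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_mean_numeric(mu, sigma, half_width=10.0, n=200001):
    """Approximate the integral of x * N(x | mu, sigma^2) on a finite grid."""
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)
    f = x * p
    dx = x[1] - x[0]
    return float(np.sum((f[1:] + f[:-1]) / 2.0) * dx)

print(gaussian_mean_numeric(1.5, 2.0))   # compare with mu = 1.5
```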
$$
p(x_k \mid \mu) = \frac{1}{(2\pi)^{\frac{l}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left( -\frac{1}{2}(x_k-\mu)^T\Sigma^{-1}(x_k-\mu) \right)
$$
We can define our likelihood given the N data points. We are assuming these data points are
drawn independently but from an identical distribution (i.i.d.):
$$
\prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \frac{1}{(2\pi)^{\frac{l}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right)
$$
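A small sketch of this likelihood, assuming the data points are stored as the rows of an array `X` and that `l` is the dimension of each point (the function names are illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density at a single point x."""
    l = mu.size
    diff = x - mu
    # Solve Sigma z = diff instead of forming the inverse explicitly.
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

def likelihood(X, mu, Sigma):
    """Product of the densities of the i.i.d. rows of X."""
    return np.prod([mvn_pdf(x, mu, Sigma) for x in X])

X = np.array([[0.0, 0.0], [1.0, -1.0]])   # hypothetical data
print(likelihood(X, np.zeros(2), np.eye(2)))
```

A product of many small densities underflows quickly as N grows, which is one practical reason to work with sums of logs.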
We can apply our "trick" to simplify
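$$
\begin{aligned}
L = \ln\prod_{n=1}^{N} p(x_n \mid \mu) &= \ln\prod_{n=1}^{N} \frac{1}{(2\pi)^{\frac{l}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right) \\
&= \sum_{n=1}^{N} \ln\left[ \frac{1}{(2\pi)^{\frac{l}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right) \right] \\
&= \sum_{n=1}^{N} \left( \ln\frac{1}{(2\pi)^{\frac{l}{2}}|\Sigma|^{\frac{1}{2}}} + \left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right) \right) \\
&= -N\ln\left( (2\pi)^{\frac{l}{2}}|\Sigma|^{\frac{1}{2}} \right) + \sum_{n=1}^{N}\left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right)
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial L}{\partial \mu} = \frac{\partial}{\partial \mu}\left[ -N\ln\left( (2\pi)^{\frac{l}{2}}|\Sigma|^{\frac{1}{2}} \right) + \sum_{n=1}^{N}\left( -\frac{1}{2}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) \right) \right] &= 0 \\
\rightarrow \sum_{n=1}^{N}\Sigma^{-1}(x_n-\mu) &= 0 \\
\rightarrow \sum_{n=1}^{N}\Sigma^{-1}x_n &= \sum_{n=1}^{N}\Sigma^{-1}\mu \\
\rightarrow \Sigma^{-1}\sum_{n=1}^{N}x_n &= \Sigma^{-1}\mu N \\
\rightarrow \sum_{n=1}^{N}x_n &= \mu N \\
\rightarrow \frac{\sum_{n=1}^{N}x_n}{N} &= \mu
\end{aligned}
$$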
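The result says the maximum likelihood mean is the sample mean. A sketch under assumed values (the covariance `Sigma`, the generating mean, and the sample size are all made up) checks that the gradient Σ⁻¹ Σₙ (xₙ − μ) vanishes there and that moving μ away lowers the log likelihood.

```python
import numpy as np

def mvn_log_likelihood(X, mu, Sigma):
    """Sum of ln N(x_n | mu, Sigma) over the rows of X."""
    N, l = X.shape
    diff = X - mu
    # Mahalanobis term for each row, using a linear solve instead of an inverse.
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (N * l * np.log(2 * np.pi) + N * logdet + maha.sum())

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])            # hypothetical covariance
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=200)

mu_ml = X.mean(axis=0)                                # sample mean
grad = np.linalg.solve(Sigma, (X - mu_ml).T).sum(axis=1)

print(mu_ml, grad)
print(mvn_log_likelihood(X, mu_ml, Sigma) >= mvn_log_likelihood(X, mu_ml + 0.1, Sigma))
```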
MAP of Mean of Gaussian
To get a MAP estimate of the mean of a Gaussian, we apply a prior distribution and
maximize the posterior.
Let's use a Gaussian prior on the mean (because it has a conjugate prior relationship with the Gaussian likelihood):
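$$
p(X \mid \mu)\,p(\mu) = \left( \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2\sigma^2}(x_n-\mu)^2 \right\} \right)\frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left\{ -\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2 \right\}
$$
The prior sits outside the product over data points, so it contributes a single term to the log:
$$
L = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2 - \frac{1}{2}\ln(2\pi\sigma_0^2) - \frac{1}{2\sigma_0^2}(\mu-\mu_0)^2
$$
$$
\frac{\partial L}{\partial \mu} = -\frac{N}{\sigma^2}\mu - \frac{1}{\sigma_0^2}\mu + \frac{1}{\sigma_0^2}\mu_0 + \frac{1}{\sigma^2}\sum_{n=1}^{N}x_n = 0
$$
$$
\left( \frac{N\sigma_0^2 + \sigma^2}{\sigma^2\sigma_0^2} \right)\mu = \frac{1}{\sigma^2}\sum_{n=1}^{N}x_n + \frac{1}{\sigma_0^2}\mu_0
$$
$$
\mu_{MAP} = \frac{\sigma_0^2}{N\sigma_0^2 + \sigma^2}\sum_{n=1}^{N}x_n + \frac{\mu_0\sigma^2}{N\sigma_0^2 + \sigma^2}
$$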
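As a sketch, assume a single Gaussian prior N(μ | μ₀, σ₀²) on the mean and a known data variance σ². Setting the derivative of the log posterior to zero then gives μ_MAP = (σ₀² Σₙ xₙ + σ² μ₀) / (Nσ₀² + σ²), which the hypothetical helper below evaluates. The argument names are illustrative.

```python
import numpy as np

def gaussian_mean_map(x, sigma2, mu0, sigma0_2):
    """MAP estimate of a Gaussian mean under a N(mu0, sigma0_2) prior.

    sigma2 is the (assumed known) data variance, sigma0_2 the prior variance.
    """
    x = np.asarray(x, dtype=float)
    N = x.size
    return (sigma0_2 * x.sum() + sigma2 * mu0) / (N * sigma0_2 + sigma2)

x = [2.0]                                               # a single hypothetical observation
print(gaussian_mean_map(x, sigma2=1.0, mu0=0.0, sigma0_2=1.0))
print(gaussian_mean_map(x, sigma2=1.0, mu0=0.0, sigma0_2=1e6))
```

With one observation and equal variances the estimate sits halfway between the prior mean and the data point. A very wide prior, or a large N, pulls it toward the sample mean, so the data eventually takes over the prior here as well.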