
Maximum Likelihood Estimation

Maximum Likelihood Estimation
• In many pattern recognition tasks we need to estimate the parameters of a given model
• We have samples of data and want to use them to estimate the model's parameters
• This arises in many pattern recognition applications, e.g.,
• Regression analysis
• Modeling biometric score distributions
• Logistic regression
• Time series analysis

Maximum Likelihood Estimation
• Assume we have data samples Y_1, Y_2, …, Y_n that are independent and identically distributed (iid)
• Let ϴ be the parameter that we seek to estimate
• Since Y_1, Y_2, …, Y_n are iid, the joint distribution of the entire sample can be expressed as:
  p(y_1, y_2, …, y_n | ϴ) = p(y_1 | ϴ) × p(y_2 | ϴ) × ⋯ × p(y_n | ϴ)
• The function p(y_1, y_2, …, y_n | ϴ), viewed as a function of ϴ, is called the likelihood function
• Given some observed data (e.g., y_1 = 5, y_2 = 6, …, y_n = 4),
  maximum likelihood estimation uses this function to find the value of ϴ
  under which the observed data are most probable
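As a small illustration of this factorization (not part of the original slides; the unit-variance Gaussian model and the candidate values of ϴ below are assumed purely for the example), the likelihood of iid data can be computed as the product of per-sample densities:

```python
import numpy as np

def joint_likelihood(per_sample_pdf, ys, theta):
    # p(y1, ..., yn | theta) for iid samples: the product of per-sample densities
    return np.prod([per_sample_pdf(y, theta) for y in ys])

# Illustrative model (an assumption for this sketch): Gaussian with unknown mean theta, known unit variance
def gaussian_pdf(y, theta):
    return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2.0 * np.pi)

ys = [5.0, 6.0, 4.0]              # observed values, as in the slide's example
for theta in (3.0, 5.0, 7.0):     # a few candidate parameter values
    print(theta, joint_likelihood(gaussian_pdf, ys, theta))
# The product is largest near theta = 5, the value best supported by these observations.
```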

Likelihood Function

• L(ϴ | y_1, y_2, …, y_n) = p(y_1, y_2, …, y_n | ϴ)

  has the same form as the joint probability density, but it is a function
  of the parameter ϴ for fixed observed values y_1, y_2, …, y_n (a pdf, on the
  other hand, is a function of y_1, y_2, …, y_n for a fixed value of ϴ).

Example
• Recall the exponential distribution: p(y | λ) = λ e^{−λy} for y > 0
• Suppose we hypothesize that the customer service waiting time at a
  call center follows this distribution
• Suppose we get one observation, y_1 = 2.0, from a single customer. We
  can try to find the value of λ using this data
• We have: L(λ | y_1 = 2) = λ e^{−2λ}
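A minimal sketch (plain NumPy assumed; the grid of candidate values is arbitrary) that evaluates this single-observation likelihood and locates its peak, which lands at λ ≈ 1/y_1 = 0.5:

```python
import numpy as np

y1 = 2.0                                  # the single observed waiting time
lams = np.linspace(0.01, 3.0, 300)        # candidate values of lambda
L = lams * np.exp(-lams * y1)             # L(lambda | y1) = lambda * exp(-lambda * y1)

print("grid maximum near lambda =", lams[np.argmax(L)])   # ~0.5, i.e. 1 / y1
```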

Example (figure): how a single data point reduces our uncertainty about the parameter of p(y | λ) = λ e^{−λy}
Likelihood function based on a single observation, y = 2.0 (figure)
What if we have more than one sample?
• Assume n = 10 and that the samples are iid.

  Data point   Value
  y_1          2.0
  y_2          1.2
  y_3          4.8
  y_4          1.0
  y_5          3.8
  y_6          0.7
  y_7          0.3
  y_8          0.2
  y_9          4.5
  y_10         1.5
  ȳ = 2.0

L(λ | y_1, y_2, …, y_10) = p(y_1, y_2, …, y_10 | λ)
  = p(y_1 | λ) × p(y_2 | λ) × ⋯ × p(y_10 | λ)
  = λ e^{−λy_1} × λ e^{−λy_2} × ⋯ × λ e^{−λy_10}
  = λ^10 e^{−λy_1 − λy_2 − ⋯ − λy_10}
  = λ^10 e^{−λ Σ_{i=1}^{10} y_i}

But ȳ = (1/10) Σ_{i=1}^{10} y_i ⇒ 10ȳ = Σ_{i=1}^{10} y_i ⇒

L(λ | y_1, y_2, …, y_10) = λ^10 e^{−10λȳ}

With ȳ = 2.0:  L(λ | ȳ = 2) = λ^10 e^{−20λ}
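A short sketch (NumPy assumed) that evaluates this likelihood using the ten values from the table; because ȳ = 2.0, the grid maximum agrees with the closed-form exponential MLE λ̂ = 1/ȳ = 0.5:

```python
import numpy as np

y = np.array([2.0, 1.2, 4.8, 1.0, 3.8, 0.7, 0.3, 0.2, 4.5, 1.5])   # the table values
ybar = y.mean()                                                     # 2.0

lams = np.linspace(0.01, 2.0, 400)
L = lams ** len(y) * np.exp(-len(y) * lams * ybar)   # L(lambda) = lambda^10 * exp(-10 * lambda * ybar)

print("grid maximum near lambda =", lams[np.argmax(L)])   # ~0.5
print("closed-form MLE 1 / ybar =", 1.0 / ybar)           # 0.5
```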
Likelihood estimates with more samples (figure: vertical axis is likelihood, horizontal axis is λ)
Computing MLE
• The previous plots helped us visualize the behavior of the likelihood
  function as the sample size increased.
• In practice, however, we may not be able to graph the function that
  easily – models often have many parameters (both in number and in kind)
• Options:
• Sometimes it is possible to use calculus to find the parameter value(s) which
  maximize the likelihood function
• Numerical methods can also be used to find the parameter values (see the sketch below)
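As a sketch of the numerical route (assuming SciPy is available; the exponential model and the ten waiting-time values from the earlier example are reused), one can minimize the negative log-likelihood with an off-the-shelf optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([2.0, 1.2, 4.8, 1.0, 3.8, 0.7, 0.3, 0.2, 4.5, 1.5])

def neg_log_likelihood(lam):
    # -ln L(lambda | y) for the exponential model p(y | lambda) = lambda * exp(-lambda * y)
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print("numerical MLE for lambda:", res.x)   # ~0.5, matching 1 / ybar
```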

Computing MLE
• To find the maxima, we could solve  ∂L(ϴ | y_1, y_2, …, y_n)/∂ϴ |_{ϴ = ϴ̂} = 0
• However, it is often much simpler to maximize the logarithm of the
  likelihood function instead.
• Log-likelihood function:  LL(ϴ | y_1, y_2, …, y_n) = ln(L(ϴ | y_1, y_2, …, y_n))
• Reasons?
• Density functions are often complex – they have exponential terms
• The log of a product of likelihoods is a sum – easier to deal with
• Likelihood values tend to be extremely small – the logarithm maps them to a
  more manageable scale and reduces the risk of precision loss (underflow); see the sketch below
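A quick sketch (NumPy assumed; the simulated data are arbitrary) of the precision point: the raw product of many small densities underflows to zero, while the sum of their logarithms remains usable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)         # simulated standard-normal data

dens = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)   # per-sample Gaussian densities

print("raw likelihood:", np.prod(dens))                # 0.0 -- underflows in double precision
print("log-likelihood:", np.sum(np.log(dens)))         # a finite (large negative) number
```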

Does the log-likelihood provide the same MLE?
• The natural log is a strictly increasing function of its argument, so
  if a_1 > a_2 then ln(a_1) > ln(a_2)
• Thus:  L(ϴ_1 | y_1, y_2, …, y_n) > L(ϴ_2 | y_1, y_2, …, y_n)
  is equivalent to
  ln(L(ϴ_1 | y_1, y_2, …, y_n)) > ln(L(ϴ_2 | y_1, y_2, …, y_n))
• So we are free to maximize the log-likelihood function: it is maximized at the same ϴ

Does the log-likelihood provide the same MLE? (figure)
Coin Toss Example
• Problem: We have a coin and want to estimate its bias – what is the probability it
  lands on heads/tails?
• Let 𝑃(𝐻𝑒𝑎𝑑𝑠) = 𝜃 and 𝑃(𝑇𝑎𝑖𝑙𝑠) = 1 − 𝜃. Assume we toss the coin 12 times
and obtain: 𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝑇𝐻𝑇𝐻𝐻.
• L(θ) = θ^10 (1 − θ)^2 = θ^10 (1 − 2θ + θ^2) = θ^10 − 2θ^11 + θ^12
• If we maximize the likelihood directly:
  dL(θ)/dθ = 10θ^9 − 22θ^10 + 12θ^11
  dL(θ)/dθ = 0 ⇒ 10θ^9 − 22θ^10 + 12θ^11 = 0 ⇒ θ^9 (10 − 22θ + 12θ^2) = 0
• ⇒ θ = 0, θ = 1, θ = 10/12. Further evaluation of each turning point confirms θ̂ = 10/12

Example: Coin Toss
LL(θ) = ln(L(θ)) = ln( θ^10 (1 − θ)^2 )
      = 10 ln θ + 2 ln(1 − θ)

dLL(θ)/dθ = 10/θ − 2/(1 − θ)

dLL(θ)/dθ = 0
⇒ 10(1 − θ) − 2θ = 0 ⇒ θ = 10/12
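A quick numerical check (SciPy assumed) that maximizing this log-likelihood reproduces θ̂ = 10/12:

```python
import numpy as np
from scipy.optimize import minimize_scalar

heads, tails = 10, 2                       # the HHHHHHHTHTHH outcome

def neg_log_likelihood(theta):
    return -(heads * np.log(theta) + tails * np.log(1.0 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, 10 / 12)                      # both ~0.8333
```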

General Case of a Binary-Valued Random Variable
• Assume a binary-valued r.v. X having: P(X = 1) = θ and P(X = 0) = 1 − θ
• L(θ) = θ^x (1 − θ)^y if we observe x ones and y zeros
• ln(L(θ)) = x ln(θ) + y ln(1 − θ)
• ∂ ln L(θ)/∂θ = x/θ − y/(1 − θ)
• ∂ ln L(θ)/∂θ = 0 ⇒ x/θ − y/(1 − θ) = 0 ⇒ x(1 − θ) − yθ = 0
• ⇒ x − θ(x + y) = 0 ⇒ θ̂ = x/(x + y)   (a quick check follows below)
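Since θ̂ = x/(x + y) is simply the fraction of ones, it coincides with the sample mean of the 0/1 observations. A tiny sketch (NumPy assumed; the sample is made up):

```python
import numpy as np

obs = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical 0/1 observations
x, y = obs.sum(), len(obs) - obs.sum()     # number of ones and zeros

print(x / (x + y))    # closed-form MLE
print(obs.mean())     # identical: the sample mean of the binary data
```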

MLE for Univariate Gaussian
• Gaussian pdf:  p(x | μ, σ) = (1/(σ√(2π))) e^{−(x − μ)^2 / (2σ^2)}
• If we have received N samples x_1, x_2, …, x_N:

  L(μ, σ) = (1/(σ√(2π))) e^{−(x_1 − μ)^2 / (2σ^2)} × (1/(σ√(2π))) e^{−(x_2 − μ)^2 / (2σ^2)} × ⋯ × (1/(σ√(2π))) e^{−(x_N − μ)^2 / (2σ^2)}

          = Π_{i=1}^{N} (1/(σ√(2π))) e^{−(x_i − μ)^2 / (2σ^2)}

MLE for Univariate Gaussian
ln L(μ, σ) = ln Π_{i=1}^{N} (1/(σ√(2π))) e^{−(x_i − μ)^2 / (2σ^2)} = Σ_{i=1}^{N} ln( (1/(σ√(2π))) e^{−(x_i − μ)^2 / (2σ^2)} )

           = K + Σ_{i=1}^{N} ( −ln σ − (x_i − μ)^2 / (2σ^2) ),   where K = −N ln √(2π) collects the constant terms

∂/∂μ ln L(μ, σ) = Σ_{i=1}^{N} (x_i − μ) / σ^2 = (1/σ^2) Σ_{i=1}^{N} (x_i − μ)

(1/σ^2) Σ_{i=1}^{N} (x_i − μ) = 0 ⇒ Σ_{i=1}^{N} (x_i − μ) = 0 ⇒ −Nμ + Σ_{i=1}^{N} x_i = 0

MLE for Univariate Gaussian
−Nμ + Σ_{i=1}^{N} x_i = 0 ⇒ Nμ = Σ_{i=1}^{N} x_i

⇒ μ_ML = (1/N) Σ_{i=1}^{N} x_i = x̄

• Recall:  ln L(μ, σ) = K + Σ_{i=1}^{N} ( −ln σ − (x_i − μ)^2 / (2σ^2) )
• ∂/∂σ ln L(μ, σ) = Σ_{i=1}^{N} ( −1/σ + (x_i − μ)^2 / σ^3 ) = 0
• ⇒ −N/σ + (1/σ^3) Σ_{i=1}^{N} (x_i − μ)^2 = 0 ⇒ Nσ^2 = Σ_{i=1}^{N} (x_i − μ)^2

σ_ML^2 = ( Σ_{i=1}^{N} (x_i − μ_ML)^2 ) / N
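In code, these two estimators are just the sample mean and the average squared deviation with a 1/N factor; a minimal sketch (NumPy assumed; the sample values are made up):

```python
import numpy as np

x = np.array([4.9, 5.3, 4.4, 5.8, 5.1, 4.7])   # made-up samples

mu_ml = x.mean()                                # (1/N) * sum(x_i)
var_ml = np.mean((x - mu_ml) ** 2)              # (1/N) * sum((x_i - mu_ml)^2)

print(mu_ml, var_ml)
print(np.var(x, ddof=0))                        # NumPy's ddof=0 variance matches the 1/N MLE
```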
Bias of Gaussian MLE Estimators -- Mean
• μ_ML is unbiased if E(μ_ML) = μ
• Recall: μ_ML = (1/N) Σ_{i=1}^{N} x_i = x̄
• E(μ_ML) = E(x̄) = E( (1/N) Σ_{i=1}^{N} x_i ) = (1/N) E( Σ_{i=1}^{N} x_i ) = (1/N) Σ_{i=1}^{N} E(x_i)
• For iid samples, E(x_i) = E(x)
• ⇒ E(μ_ML) = (1/N) Σ_{i=1}^{N} E(x) = (1/N) · N · E(x) = E(x) = μ, so μ_ML is unbiased
Bias of Gaussian MLE Estimators -- Variance
• σ_ML^2 is unbiased if E(σ_ML^2) = σ^2

  E(σ_ML^2) = E( (1/N) Σ_{i=1}^{N} (x_i − μ_ML)^2 ) = (1/N) E( Σ_{i=1}^{N} (x_i − μ_ML)^2 )

• Recall μ_ML = the sample mean, x̄.

  ⇒ Σ_{i=1}^{N} (x_i − x̄)^2 = Σ_{i=1}^{N} ( x_i^2 − 2 x_i x̄ + x̄^2 ) = N x̄^2 − 2N x̄·x̄ + Σ_{i=1}^{N} x_i^2

  The last step manipulates two of the terms in the summation using
  Σ_{i=1}^{N} x̄^2 = N x̄^2 and Σ_{i=1}^{N} x_i = N x̄ (so Σ_{i=1}^{N} x_i x̄ = N x̄^2).
  Substituting the expression back into E(σ_ML^2):

  ⇒ E(σ_ML^2) = (1/N) E( N x̄^2 − 2N x̄^2 + Σ_{i=1}^{N} x_i^2 )

Bias of Gaussian MLE Estimators
E(σ_ML^2) = (1/N) E( N x̄^2 − 2N x̄^2 + Σ_{i=1}^{N} x_i^2 ) = (1/N) E( Σ_{i=1}^{N} x_i^2 ) − E( x̄^2 )

⇒ E(σ_ML^2) = (1/N) Σ_{i=1}^{N} E(x_i^2) − E(x̄^2)

Recall the standard formulae for the variance:
σ^2 = E(x^2) − μ^2 ⇒ E(x_i^2) = σ^2 + μ^2,   and   E(x̄^2) − μ^2 = Var(x̄) = σ^2/N ⇒ E(x̄^2) = σ^2/N + μ^2

⇒ E(σ_ML^2) = (1/N) Σ_{i=1}^{N} (σ^2 + μ^2) − ( σ^2/N + μ^2 ) = (1/N)( Nσ^2 + Nμ^2 ) − σ^2/N − μ^2
Bias of Gaussian MLE Estimators
• E(σ_ML^2) = (1/N)( Nσ^2 + Nμ^2 ) − σ^2/N − μ^2 = σ^2 + μ^2 − σ^2/N − μ^2
• E(σ_ML^2) = σ^2 − σ^2/N = σ^2 (N − 1)/N
• Any observations on the bias of the MLE estimator for the variance?

Gaussian MLE Estimators – Correcting the Bias in σ_ML^2

Since E(σ_ML^2) = σ^2 − σ^2/N = σ^2 (N − 1)/N,

⇒ E( (N/(N − 1)) · σ_ML^2 ) = σ^2 · ((N − 1)/N) · (N/(N − 1)) = σ^2

Recall σ_ML^2 = (1/N) Σ_{i=1}^{N} (x_i − μ_ML)^2, so

(N/(N − 1)) · σ_ML^2 = (N/(N − 1)) · (1/N) Σ_{i=1}^{N} (x_i − μ_ML)^2 = (1/(N − 1)) Σ_{i=1}^{N} (x_i − μ_ML)^2
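A short simulation sketch (NumPy assumed; the true distribution N(0, 1) and the sample size N = 5 are arbitrary choices) illustrating the (N − 1)/N bias and the effect of the correction:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200_000
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, N))    # true sigma^2 = 1

mu_ml = samples.mean(axis=1, keepdims=True)
var_ml = np.mean((samples - mu_ml) ** 2, axis=1)               # the 1/N estimator, per trial

print(var_ml.mean())                     # ~ (N-1)/N * sigma^2 = 0.8
print((N / (N - 1)) * var_ml.mean())     # ~ sigma^2 = 1.0 after the N/(N-1) correction
```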

MLE – Issues
• Suppose we observed the following for a coin toss: TTTTTTTTTTT

• MLE would then estimate θ = P(Heads) to be zero, i.e., it would declare
  heads impossible – which is plainly wrong for a real coin.

• Other methods, such as MAP (maximum a posteriori) estimation, try
  to address this (a small sketch follows below).
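A tiny sketch of the failure mode and of one possible remedy; the Beta(2, 2) prior and the posterior-mode formula below are standard choices for a Bernoulli likelihood but are illustrative assumptions, not something specified by these slides:

```python
heads, tails = 0, 11        # the all-tails sequence TTTTTTTTTTT

theta_mle = heads / (heads + tails)
print(theta_mle)            # 0.0 -- MLE declares heads impossible

# MAP with an assumed Beta(alpha, beta) prior on theta:
# posterior mode = (alpha + heads - 1) / (alpha + beta + heads + tails - 2), for alpha, beta > 1
alpha, beta = 2.0, 2.0      # illustrative prior: "the coin is probably not extreme"
theta_map = (alpha + heads - 1) / (alpha + beta + heads + tails - 2)
print(theta_map)            # ~0.077 -- small, but no longer exactly zero
```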

