
ACTL30004

Unit 2. Likelihood Theory

1 / 97 Unit 2. Likelihood Theory ACTL30004


Objectives
I Advantages of using maximum likelihood estimation (MLE).
I Estimate parameters of statistical models using MLE.
I Understand the difference between an estimate and an
estimator.
I Understand the concept of minimum variance for an unbiased
estimator of a parameter (CRLB).
I Analyze when an unbiased estimator coincides with the ML
estimator.
I Be able to use Theorems 2.1, 2.2 and 2.3.
I Provide the asymptotic distribution of the ML estimator.
I Interpret the graphical meaning of the second derivative of the
log-likelihood.
I Approximate the expected Fisher’s information.



Objectives

I Provide the asymptotic distribution of the ML estimator for the
multivariate case.
I Apply the likelihood ratio test for nested models.
I Perform MLE with Excel.
I Understand how the Fisher Scoring algorithm works.
I Apply the Newton–Raphson approximation for single-variable
problems.
I Apply the Newton–Raphson approximation for multivariable
problems.
I Modify the Newton–Raphson algorithm to obtain the Fisher
Scoring algorithm.



Introduction

I Likelihood theory is mostly used as a method for estimating
unknown parameters associated with a random variable.

I Review of materials learnt in MAST20005.

I New Material: Fisher Scoring algorithm.

I Set the foundations for the analysis of GLM’s.



Maximum Likelihood Estimation

I Consider y as a value taken on by a RV Y.

I Use y in drawing conclusions on the unknown CDF F(·|θ ) of Y
that depends on the unknown quantity θ ∈ Θ ⊆ R.

I Quantity θ is called the parameter and the set Θ is the
parameter space.

I Class F(·|θ ) is called the parametric statistical model.



Estimates and Estimators
I An estimate of θ, i.e. θ̂, is a value close to θ.

I It is the choice which best explains why the phenomenon under
study has produced the observed data y.

I Usually "estimate" and "point estimate" are used interchangeably.

I A general criterion for constructing estimates associates an
element of Θ with every realisation of the random variable y ∈ Y.

I It defines a function Tn (Y) which is called an estimator of θ.

I Tn (Y) is a RV (or a transformation of a set of RV's).



Estimates and Estimators

I The pdf (or pmf) associated with a continuous (or discrete) RV Y
will be denoted as f (y|θ ).

I For example, if Y ∼ Geo(θ ), its pmf is given by

f (y|θ ) = θ (1 − θ )y , y = 0, 1, 2 . . .

I Let us consider a RV Y. We denote a RS of size n from Y as
y1 , . . . , yn , i.e. independent realisations.

I Capital letters: random variables.

I Lower case letters: sample values.



Maximum Likelihood Estimates
I Likelihood function is defined in terms of a RS from a RV Y
with unknown parameter θ.

I Values in the RS are assumed to be drawn from the same RV,


say Y, and they are assumed to be independent of each other.

I Equivalent to say that we have a RS of size n obtained from


RV’s Y1 , . . . , Yn where the variables are independent and iden-
tically distributed (iid).

I The likelihood function is the product of the pdfs (or pmfs) and
is given by

Ln (θ |y1 , . . . , yn ) = ∏_{i=1}^n f (yi |θ ),

which depends on n and θ.



Maximum Likelihood Estimates
I Since Ln (θ |y1 , . . . , yn ) is non–negative ∀ θ ∈ Θ, in likelihood
theory we use the log–likelihood function

ℓ(θ ) = log(Ln (θ |y1 , . . . , yn )) = log(∏_{i=1}^n f (yi |θ )) = ∑_{i=1}^n log(f (yi |θ )).

Definition: Given a RS of size n from a RV Y with a single unknown
parameter θ ∈ Θ, a maximum likelihood estimate (MLE) of θ is an
element θ̂ which attains the maximum value of Ln (θ |y1 , . . . , yn ) in
Θ, i.e. such that

Ln (θ̂ |y1 , . . . , yn ) = max_{θ∈Θ} Ln (θ |y1 , . . . , yn ).

As the log–likelihood function is a monotonic transformation of the
likelihood function, this definition is equivalent to

ℓ(θ̂ ) = max_{θ∈Θ} ℓ(θ ).



Maximum Likelihood Estimates
I ML estimation is only one method of estimating the unknown
parameter θ.
I ML estimation chooses the value θ̂ such that
Ln (θ̂ |y1 , . . . , yn ) ≥ Ln (θ |y1 , . . . , yn ) ∀ θ ∈ Θ.
I An MLE may not exist.
I An MLE may not be unique.
I The likelihood function has to be maximized in the space Θ
specified by the statistical model, not over the set of all
mathematically admissible values of θ.
I In general, the ML estimator has no closed–form expression.
I If so, the MLE has to be obtained numerically for the observed
sample. In real applications this aspect is very relevant, and it
gives rise to numerical methods. In this Unit we will consider
the Fisher–Scoring algorithm.



Maximum Likelihood Estimation – Notation

y := (y1 , y2 , . . . , yn )

1. Solve ∂ log(Ln (θ |y))/∂θ = 0 ⇒ θ̂ = . . .

2. Check ∂² log(Ln (θ |y))/∂θ² |θ=θ̂ < 0.

Tn (Y): maximum likelihood estimator.

Replace y1 , . . . , yn in the MLE θ̂ by RV's Y1 , . . . , Yn .

For example, θ̂ = ∑_{i=1}^n yi / n ⇒ Tn (Y) = ∑_{i=1}^n Yi / n.



Exercise 1
A certain type of electronic component has a lifetime Y (in hours)
with pdf given by
f (y|θ ) = (1/θ²) y e^{−y/θ} if y > 0, and 0 otherwise.

As defined above, we use θ̂ to denote MLE of θ.

Suppose that three such components, tested independently, had


lifetimes of 120, 130 and 128 hours. Based on this sample, find θ̂.



Solution to Exercise 1
The likelihood function is Ln (θ |y1 , . . . , yn ) = ∏_{i=1}^n f (yi |θ ). For our
sample of size three we can write this as:

L3 (θ |y1 , y2 , y3 ) = ∏_{i=1}^3 f (yi |θ ) = (1/θ⁶) ∏_{i=1}^3 yi e^{−yi /θ} ,

where y1 = 120, y2 = 130 and y3 = 128.


To find the MLE, θ̂, we maximise this likelihood function by varying θ.
Take the log-likelihood function,

log(L3 (θ |y1 , y2 , y3 )).



Solution to Exercise 1

log(L3 (θ |y1 , y2 , y3 )) = −6 log θ + ∑_{i=1}^3 (log yi − yi /θ ).

Maximising this expression by setting the first derivative equal to
zero, and replacing θ by θ̂, we have

−6/θ̂ + (1/θ̂²) ∑_{i=1}^3 yi = 0.

Solving this equation, we get θ̂ = ∑_{i=1}^3 yi / 6 = 378/6 = 63.



Solution to Exercise 1

Check that θ̂ corresponds to a maximum by observing the sign of
the second derivative of the log-likelihood function at θ = θ̂:

∂² log(L3 (θ |y1 , y2 , y3 ))/∂θ² |θ=θ̂ = 6/θ̂² − 2 · 378/θ̂³ = −0.001511716,

which is clearly negative at θ̂ = 63, so θ̂ is indeed a maximum.
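As a quick cross-check (not part of the original exercise), the maximisation can be reproduced numerically. The sketch below grid-searches the log-likelihood derived above; the grid bounds and step are arbitrary choices.

```python
import numpy as np

# Grid-search the Exercise 1 log-likelihood
# l(theta) = -6 log(theta) + sum(log(y_i) - y_i/theta)
# and compare the maximiser with the closed-form MLE theta_hat = 378/6 = 63.
y = np.array([120.0, 130.0, 128.0])

def log_lik(theta):
    return -6 * np.log(theta) + np.sum(np.log(y) - y / theta)

grid = np.linspace(30.0, 120.0, 9001)          # step 0.01, chosen arbitrarily
theta_hat = grid[int(np.argmax([log_lik(t) for t in grid]))]

# Second-derivative check at the maximum: 6/theta^2 - 2*378/theta^3 < 0.
d2 = 6 / theta_hat**2 - 2 * 378 / theta_hat**3
```

The grid maximiser agrees with θ̂ = 63 up to the grid spacing, and d2 reproduces the value −0.001511716 quoted on the slide.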



Exercise 2
Suppose that we have a random sample of size n from
Y ∼ N (µ, 1). We write this sample as y1 , . . . , yn .
(a) Find the MLE of µ.
(b) Write down an expression for Tn (Y), the ML estimator
of µ based on a sample of size n.



Solution to Exercise 2 (a)
 
f (y|µ) = (1/√(2π)) exp(−(y − µ)²/2).

Ln (µ|y) = (2π)^{−n/2} ∏_{i=1}^n exp(−(yi − µ)²/2)
         = (2π)^{−n/2} exp(−(1/2) ∑_{i=1}^n (yi − µ)²).

log Ln (µ|y) = −(n/2) log(2π) − (1/2) ∑_{i=1}^n (yi − µ)².

∂ log Ln (µ|y)/∂µ = ∑_{i=1}^n (yi − µ) = 0 ⇒ µ̂ = (1/n) ∑_{i=1}^n yi .



Solution to Exercise 2 (a) and (b)

(a) Check for maximum:

∂² log(Ln (µ|y))/∂µ² = −n < 0,

hence a maximum.

(b) Tn (Y) = (1/n) ∑_{i=1}^n Yi .



Notation

We adopt the following notation:

d1 = ∂ log(Ln (θ |y))/∂θ and d2 = ∂² log(Ln (θ |y))/∂θ².

Since d1 and d2 are functions of θ, it will be useful to have RV's that
relate to d1 and d2 .
These RV's are obtained by replacing the sample values in d1 and
d2 by RV's. The resulting expressions are denoted as D1 and D2 .

D1 : score statistic.

E[−D2 ]: expected Fisher's information (information function).



Score Statistic

D1 is a function of θ and it is denoted as U (θ ). From the definition
of the score statistic we have

U (θ ) = ∂ log(Ln (θ |Y))/∂θ,

where the sample values y1 , . . . , yn have been replaced by the RV's
Y1 , . . . , Yn . It will be useful to define

Uj (θ ) = ∂ log(L1 (θ |Yj ))/∂θ = ∂ log(f (Yj |θ ))/∂θ.

Clearly, U (θ ) = ∑_{j=1}^n Uj (θ ).
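A small simulation can illustrate the key property of the score, E[U(θ)] = 0 at the true θ (proved in Lemma 1 later in this unit). The Poisson choice below is an assumption of this sketch, not from the slides.

```python
import numpy as np

# Assume Y ~ Poisson(mu). For one observation, log f(y|mu) = -mu + y log(mu)
# - log(y!), so the per-observation score is U_j(mu) = Y_j/mu - 1.
# Its sample average over many draws should be close to 0 at the true mu.
rng = np.random.default_rng(1)
mu = 2.5
y = rng.poisson(mu, size=200_000)
score_mean = np.mean(y / mu - 1.0)   # should be close to 0
```

At a wrong parameter value the average score is not zero, which is exactly what drives the numerical methods later in the unit.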



Expected Fisher’s Information

The expected Fisher's information or information function is denoted
as I (θ ). We have

I (θ ) = E[−∂² log(Ln (θ |Y))/∂θ²],

where, before the expectation is taken, the sample values y1 , . . . , yn
have been replaced by Y1 , . . . , Yn .



Notation

I Tn (Y) is defined as the ML estimator for θ.

I Tn∗ (Y) denotes a general estimator of θ (e.g. derived using the
method of moments or percentiles or even MLE).

I We say that Tn∗ (Y) is an unbiased estimator of a parameter θ
if E[Tn∗ (Y)] = θ.



Lemma 1
From the definitions given above we have

D1 = ∂ log(Ln (θ |Y))/∂θ = (∂/∂θ ) log(∏_{i=1}^n f (Yi |θ )) = (∂/∂θ ) log(f (Y|θ )),

where Y denotes n iid random variables. Then E(D1 ) = 0, that is,
the expected value of the score statistic is 0.



Proof of Lemma 2.1

Proof.

E(D1 ) = E[∂ log(Ln (θ |Y))/∂θ]
       = ∫_Y (∂/∂θ ) log(f (y|θ )) f (y|θ ) dy
       = ∫_Y (1/f (y|θ )) (∂ f (y|θ )/∂θ ) f (y|θ ) dy
       = ∫_Y (∂/∂θ ) f (y|θ ) dy
       = (∂/∂θ ) ∫_Y f (y|θ ) dy
       = (∂/∂θ ) 1 = 0.



Lemma 2
The expected Fisher's information can be written as

I (θ ) = E[−∂² log(Ln (θ |Y))/∂θ²] = E[(∂ log(Ln (θ |Y))/∂θ )²].



Proof of Lemma 2.2
Proof.

E(−D2 ) = E[−∂² log(Ln (θ |Y))/∂θ²]
        = −E[(∂²/∂θ²) log f (Y|θ )] = −E[(∂/∂θ )((∂/∂θ ) log f (Y|θ ))]
        = −E[(∂/∂θ )((1/f (Y|θ )) ∂ f (Y|θ )/∂θ )]
        = −∫_Y [(1/f (y|θ )) ∂² f (y|θ )/∂θ² − ((1/f (y|θ )) ∂ f (y|θ )/∂θ )²] f (y|θ ) dy
        = ∫_Y ((1/f (y|θ )) ∂ f (y|θ )/∂θ )² f (y|θ ) dy
          (since ∫_Y ∂² f (y|θ )/∂θ² dy = (∂²/∂θ²) ∫_Y f (y|θ ) dy = 0)
        = ∫_Y (∂ log(f (y|θ ))/∂θ )² f (y|θ ) dy = E[(∂ log(Ln (θ |Y))/∂θ )²]
        = E(D1²).
Theorem 1
Suppose that Tn∗ (Y) is an unbiased estimator of a scalar parameter
θ associated with a random variable Y. Then

Var(Tn∗ (Y)) ≥ 1/I (θ ) = 1 / E[−∂² log(Ln (θ |Y))/∂θ²].

This minimum variance for an unbiased estimator of θ is called the


Cramer-Rao Lower Bound for variance (CRLB).



Proof of Theorem 2.1
Let us consider Cov(Tn∗ (Y), D1 ). Given that E(D1 ) = 0, we have

Cov(Tn∗ (Y), D1 ) = E(Tn∗ (Y) D1 ).

Now the right hand side of the above expression can be simplified
using

E(Tn∗ (Y) D1 ) = ∫_Y Tn∗ (y) (1/f (y|θ )) (∂ f (y|θ )/∂θ ) f (y|θ ) dy
             = ∫_Y Tn∗ (y) (∂/∂θ ) f (y|θ ) dy
             = (∂/∂θ ) ∫_Y Tn∗ (y) f (y|θ ) dy
             = (∂/∂θ ) E(Tn∗ (Y)) = (∂/∂θ ) θ = 1.



Proof of Theorem 2.1 (continued)

Now by using the Cauchy–Schwarz inequality, it follows that

(Cov(Tn∗ (Y), D1 ))² ≤ Var(Tn∗ (Y)) Var(D1 ),

and hence, since Cov(Tn∗ (Y), D1 ) = 1 and Var(D1 ) = E(D1²) by
Lemmas 1 and 2, we have that

Var(Tn∗ (Y)) ≥ 1/E(D1²).



Theorem 2
If there exists an unbiased estimator, Tn∗ (Y), of θ with

Var(Tn∗ (Y)) = 1 / E[−∂² log(Ln (θ |Y))/∂θ²],

then Tn∗ (Y) = Tn (Y).



Proof of Theorem 2.2
From Theorem 1 we have that

(Cov(Tn∗ (Y), D1 ))² ≤ Var(Tn∗ (Y)) Var(D1 ),

which can be written as Var(Tn∗ (Y)) ≥ 1/E(D1²).

We note that equality holds here if and only if Tn∗ (Y) and D1 are
perfectly correlated, i.e. if Tn∗ (Y) can be written as a linear function
of D1 . Equality means that we have an estimator whose variance
equals the Cramer–Rao lower bound. That is, we will have

1 = Var(Tn∗ (Y)) Var(D1 )

if Tn∗ (Y) is a linear function of D1 . We will write this general linear
function as

D1 = a(θ ) (Tn∗ (Y) − θ ) + b(θ ).

Now, taking expectations and using that E(D1 ) = 0, we have that
Proof of Theorem 2.2 (Continued)

0 = a(θ ) (E(Tn∗ (Y)) − θ ) + b(θ ),

which is only true if b(θ ) = 0, since Tn∗ (Y) is an unbiased estimator
for θ.

We therefore have

D1 = ∂ log(Ln (θ |Y))/∂θ = a(θ ) (Tn∗ (Y) − θ ).

Hence E(D1²) = a(θ )² Var(Tn∗ (Y)).

Now if Var(Tn∗ (Y)) attains the Cramer–Rao lower bound, we have
that

Var(Tn∗ (Y)) = a(θ )^{−1} .



Proof of Theorem 2.2 (Continued)

We can therefore write

D1 = ∂ log(Ln (θ |Y))/∂θ = (1/Var(Tn∗ (Y))) (Tn∗ (Y) − θ ).

From this final line, the only solution to D1 = 0 is θ̂ = Tn∗ (Y), and
hence Tn∗ (Y) is the maximum likelihood estimator of θ.



Exercise 3
Consider a random sample of size n, y1 , y2 , . . . , yn , from the
Y ∼ Bin(1, θ ).
(a) Find the maximum likelihood estimate of θ.
(b) Find expressions for E(Tn (Y)) and Var(Tn (Y)).
(c) Show that the variance of the ML estimator found in
(b) is equal to the Cramer–Rao lower bound (CRLB)
for the variance of an unbiased estimator of θ.



Solution to Exercise 2.3 (a)
The pmf and likelihood function are

f (y|θ ) = θ^y (1 − θ )^{1−y} , y = 0, 1,

Ln (θ |y) = θ^{∑_{i=1}^n yi} (1 − θ )^{n−∑_{i=1}^n yi} .

The log–likelihood function is

log Ln (θ |y) = (∑_{i=1}^n yi ) log θ + (n − ∑_{i=1}^n yi ) log(1 − θ ).

The derivative with respect to θ is

∂ log Ln (θ |y)/∂θ = ∑_{i=1}^n yi / θ − (n − ∑_{i=1}^n yi )/(1 − θ ).



Solution to Exercise 2.3 (a)
Setting this to zero and solving, we have

n ȳ/θ̂ = n (1 − ȳ)/(1 − θ̂ ) ⇒ θ̂ = ȳ.

Checking that we have found a maximum, we find

∂² log Ln (θ |y)/∂θ² = −(1/θ²) ∑_{i=1}^n yi − (n − ∑_{i=1}^n yi )/(1 − θ )²,

which, when θ is replaced with ȳ, can be written as

−n (1/ȳ + 1/(1 − ȳ)),

which is clearly negative (assuming that not all of the sample values
are 0 or 1).
Solution to Exercise 2.3 (b)

E(Tn (Y)) = E(Ȳ) = θ.

Var(Tn (Y)) = Var(Ȳ) = θ (1 − θ )/n.



Solution to Exercise 2.3 (c)

The CRLB is (E[−∂² log(Ln (θ |Y))/∂θ²])^{−1} .

E[−∂² log(Ln (θ |Y))/∂θ²] = E[(1/θ²) ∑_{i=1}^n Yi + (n − ∑_{i=1}^n Yi )/(1 − θ )²]
                          = (1/θ²) n θ + (1/(1 − θ )²)(n − n θ )
                          = n/θ + n/(1 − θ )
                          = (n (1 − θ ) + n θ )/(θ (1 − θ ))
                          = n/(θ (1 − θ )).

⇒ The CRLB for the variance of an unbiased estimator of θ is
θ (1 − θ )/n, the same as the answer from (b).
Theorem 3
For large n,

Tn (Y) ≈ N( θ0 , 1 / E[(∂ log(Ln (θ |Y))/∂θ |θ0 )²] ).

Proof.
See the Appendix.

Tn (Y) is distributed approximately normally with mean θ0 (assumed


true underlying value of θ) and variance equal to the reciprocal of
the information evaluated at θ = θ0 .
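Theorem 3 can be illustrated by simulating the sampling distribution of an ML estimator. The exponential model below is an assumed example of this sketch (the MLE is λ̂ = 1/Ȳ, with asymptotic variance 1/I(λ0) = λ0²/n).

```python
import numpy as np

# Simulate many samples of size n from Exp(rate lambda0), compute the MLE
# lambda_hat = 1/Y_bar for each, and compare the empirical mean/variance of
# the MLEs with the Theorem 3 approximation N(lambda0, lambda0^2/n).
rng = np.random.default_rng(42)
lambda0, n, reps = 2.0, 400, 10_000
samples = rng.exponential(scale=1.0 / lambda0, size=(reps, n))
mle = 1.0 / samples.mean(axis=1)

emp_mean = mle.mean()   # close to lambda0 = 2
emp_var = mle.var()     # close to lambda0**2 / n = 0.01
```

A histogram of `mle` would look approximately normal, centred near λ0, with spread matching the reciprocal of the information.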



Exercise 4
Suppose that Y1 , . . . , Yn are iid Poisson random variables with
parameter µ.
(a) Find an expression for the ML estimator of µ based on
a sample of size n.
(b) Find the mean and variance of the estimator from (a).
(c) By finding E[−∂² log L(µ|Y)/∂µ²], write down the asymptotic
distribution of the ML estimator from (a).
(d) What is the exact distribution of n Tn (Y) where Tn (Y)
is the ML estimator from (a)?



Solution to Exercise 2.4 (a)
The likelihood function is

Ln (µ|y) = ∏_{i=1}^n e^{−µ} µ^{yi} / yi ! .

The log–likelihood function is

log(Ln (µ|y)) = ∑_{i=1}^n (−µ + yi log µ − log yi !)
             = −n µ + log µ ∑_{i=1}^n yi − ∑_{i=1}^n log yi !.

Differentiating this with respect to µ and setting the result equal to
0, we obtain the ML estimator of µ, Tn (Y) = Ȳ.

Check for maximum:

∂² log(Ln (µ|y))/∂µ² = −∑_{i=1}^n yi / µ² < 0.
Solution to Exercise 2.4 (b)

E(Tn (Y)) = E(Ȳ) = (1/n) ∑_{i=1}^n E(Yi ) = µ.

Var(Tn (Y)) = Var(Ȳ) = (1/n²) ∑_{i=1}^n Var(Yi ) = µ/n.



Solution to Exercise 2.4 (c)

Using the second derivative from (a), we have

E[−∂² log(Ln (µ|Y))/∂µ²] = E[∑_{i=1}^n Yi / µ²]
                         = ∑_{i=1}^n E(Yi ) / µ² = n µ/µ² = n/µ.

Hence the asymptotic distribution of the ML estimator of µ is
N(µ, µ/n).



Solution to Exercise 2.4 (d)

n Tn (Y) = n (∑_{i=1}^n Yi / n) = ∑_{i=1}^n Yi .

This has a Poisson distribution with parameter n µ.



Remark 1 – Intuition behind (∂² log(Ln (θ |Y))/∂θ²)^{−1}

[Figure: plot of log L against θ; the log-likelihood is sharply peaked
around θ̂.]

I The maximum of ℓ is clearly defined around θ̂, so ℓ changes
gradient very sharply around θ̂.
I A greater confidence will be put on θ̂.
I log Ln (θ |y) < log Ln (θ̂ |y).
I A much larger negative second derivative value at the MLE ⇒ a
smaller variance (greater confidence) of θ̂.
Remark 1 – Intuition behind (∂² log(Ln (θ |Y))/∂θ²)^{−1} (continued)

[Figure: plot of log L against θ; the log-likelihood is flat around θ̂.]

I The maximum of ℓ is not clearly defined around θ̂, so ℓ changes
gradient slowly around θ̂.
I A lower confidence will be put on θ̂.
I log Ln (θ |y) is only slightly lower than log Ln (θ̂ |y).
I A second derivative value at the MLE that is only slightly
negative ⇒ a greater variance (lower confidence) of θ̂.
Remark 2–Observe Information instead of Expected
Information

I Sometimes it is difficult to find E[−∂² log L(θ |Y)/∂θ²].
I In these cases, we use the observed information, which can
always be calculated:

Observed information ≡ −∂² log Ln (θ |y)/∂θ² .



Remark 3–Extension to Multi–Parameter Case
I These results can be extended to the case of an unknown vector
of parameters θ (dimension p).
I In this case we have

Tn (Y) ≈ Nmulti ( θ0 , (1/n)(i(θ0 ))^{−1} ),

where Nmulti (·, ·) represents the multivariate normal distribution,
θ0 is the mean vector, and (1/n)(i(θ0 ))^{−1} is the covariance matrix,
where i(θ0 ) has (j, k)th element given by

E[−(∂²/∂θj ∂θk ) log f (Y1 |θ )], with j, k = 1, . . . , p.



Exercise 5
Based on a sample y1 , y2 , . . . , yn of size n from the normal distribu-
tion with unknown parameters µ and σ, find the maximum likelihood
estimates of µ and σ.



Solution to Exercise 2.5

We maximise the likelihood

Ln (µ, σ|y) = ∏_{i=1}^n (1/(σ √(2π))) exp(−(1/2)((yi − µ)/σ )²)
            = (1/(σ √(2π)))^n exp(−(1/2) ∑_{i=1}^n ((yi − µ)/σ )²).

Then, we find the log–likelihood function,

log(Ln (µ, σ|y)) = −n log(σ √(2π)) − (1/2) ∑_{i=1}^n ((yi − µ)/σ )².



Solution to Exercise 2.5

We then take the partial derivative with respect to µ:

∂ log(Ln (µ, σ|y))/∂µ = (1/σ²) ∑_{i=1}^n (yi − µ).

Setting this to 0 and solving for µ, we have µ̂ = ȳ.

We next take the partial derivative with respect to σ:

∂ log(Ln (µ, σ|y))/∂σ = −n/σ + ∑_{i=1}^n (yi − µ)²/σ³ .

Setting this to 0, replacing µ with µ̂ = ȳ and replacing σ with σ̂,
we have:



Solution to Exercise 2.5

∑_{i=1}^n (yi − ȳ)²/σ̂³ = n/σ̂,

which gives

σ̂² = (1/n) ∑_{i=1}^n (yi − ȳ)².

Note: for a function of two variables, we check for a maximum by
observing that the matrix of second order partial derivatives has a
positive determinant and that ∂² log Ln (µ, σ|y)/∂µ² < 0.
This will NOT be required of you in this course.
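The closed forms µ̂ = ȳ and σ̂² = (1/n)∑(yi − ȳ)² can be sanity-checked in code; note the n (not n − 1) denominator. The simulated data below are an illustrative assumption.

```python
import numpy as np

# Check that the closed-form normal MLEs maximise the log-likelihood:
# any nearby (mu, sigma) pair should give a lower log-likelihood value.
rng = np.random.default_rng(7)
y = rng.normal(loc=5.0, scale=2.0, size=10_000)

def log_lik(mu, sigma):
    n = len(y)
    return (-n * np.log(sigma * np.sqrt(2 * np.pi))
            - 0.5 * np.sum(((y - mu) / sigma) ** 2))

mu_hat = y.mean()
sigma_hat = np.sqrt(np.mean((y - mu_hat) ** 2))  # np.std(y) with ddof=0
best = log_lik(mu_hat, sigma_hat)
```

Perturbing either parameter away from its MLE strictly decreases `log_lik`, which is exactly the maximisation property the derivation establishes.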



Likelihood Ratio Test

This statistical test is used to compare the goodness–of–fit of two


models, one of which (the reduced model) is a special case of the
other (the larger model). Let us consider that the larger model is
specified in terms of a random variable Y with probability density
function f (y|θ ) where θ ∈ Θ ⊆ Rk (k ≥ 1).

If the parameter space Θ of this model is partitioned into two subsets


Θ0 and Θ1 such that Θ = Θ0 ∪ Θ1 and Θ0 ∩ Θ1 = ∅, we can
assume a reduced model where Y follows a density function f (y|θ )
with θ ∈ Θ0 .

We now provide a test procedure for choosing one and only one of
the models based on the likelihood ratio test statistic.



Theorem 4 (Likelihood Ratio Test (for nested models))
Consider a random variable Y with probability density function f (y|θ ) where
θ ∈ Θ ⊆ Rk (k ≥ 1). Our aim will be to test

H0 : θ ∈ Θ0 Vs. H1 : θ ∈ Θ1 = Θ − Θ0 .

If we define the ratio of likelihoods as

Rn = max_{θ∈Θ} Ln (θ |Y) / max_{θ∈Θ0} Ln (θ |Y),

we will reject H0 in favour of H1 if and only if

2 log Rn = 2 [log(max_{θ∈Θ} Ln (θ |Y)) − log(max_{θ∈Θ0} Ln (θ |Y))]
         = 2 [max_{θ∈Θ} log(Ln (θ |Y)) − max_{θ∈Θ0} log(Ln (θ |Y))]

is too large. Besides, for large n, if the true value of θ is in Θ0 :

2 log Rn ≈ χ²_p , where p = dim Θ − dim Θ0 .


Proof of Theorem 2.4
We assume Θ0 = {θ0 } and that Θ has dimension p.

Assume H0 : θ = θ0 is true and expand the log–likelihood function
at θ0 as a Taylor series about θ = θ̂n , where θ̂n is the MLE of θ
under the larger model:

log(Ln (θ0 |Y)) ≈ log(Ln (θ̂n |Y)) + (θ0 − θ̂n )> a(θ̂n )
               + (1/2) (θ0 − θ̂n )> b(θ̂n ) (θ0 − θ̂n ),

where a(θ̂n ) is the vector of first order partial derivatives of
log(Ln (θ |Y)) at θ̂n and b(θ̂n ) is the matrix of second order partial
derivatives of log(Ln (θ |Y)) at θ̂n .

As θ̂n is the MLE of θ ⇒ a(θ̂n ) = 0 (i.e. zero vector).



Likelihood Ratio Test – Proof
Now, by using Remark 2, we have that

−b(θ̂n ) ≈ E[−∂² log(Ln (θ |Y))/∂θi ∂θj ] |θ0 = n i(θ0 ).

Using these results we have


2 log Rn = 2[log(Ln (θ̂n |Y)) − log(Ln (θ0 |Y))]
≈ (θ0 − θ̂n )> n i(θ0 ) (θ0 − θ̂n ).
If the true underlying value of θ is θ0 from Remark 3, we have for
large n,
(θ̂n − θ0 ) ' N (0, (ni(θ0 ))−1 )
and hence

2 log Rn ≈ χ2p .
Note that in the last step we have used that if Z ∼ Nr (0, Σ), then

Z> Σ−1 Z ∼ χ2r .


Exercise 6
Losses due to vandal damage to cars (in $) over a period of six
months in a certain community are displayed in the following table.

38 56 77 110 112 138 152 168 188 210


228 241 252 273 283 288 291 299 305 317
321 356 374 422 485 527 529 559 567 656

Fit by using maximum likelihood estimation an Exponential distri-


bution with rate parameter λ and a Gamma distribution with shape
parameter α and rate parameter β to this set of data.

Use the likelihood ratio test to test the null hypothesis that Ex-
ponential distribution with rate parameter λ is appropriate against
the alternative hypothesis that Gamma distribution is appropriate.
Perform your statistical test at the 5% significance level.



Solution to Exercise 2.6
The Exponential distribution has pdf

f (y|λ) = λ e−λ y , with λ > 0 and y > 0.

The Gamma distribution has pdf

f (y|α, β) = (β^α /Γ(α)) y^{α−1} e^{−β y} , with β, α > 0 and y > 0,

and where

Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx.

When λ = β and α = 1 the Exponential distribution is nested in


the Gamma distribution.



Solution to Exercise 2.6
library(maxLik)
damage=read.csv("damage.csv",header=FALSE)
attach(damage)
x_data=damage[,1];

logLikFun <- function(param) {


lambda<- param[1]
sum(dexp(x_data, rate = lambda, log = TRUE))
}

mle <- maxLik(logLik = logLikFun, start = c(lambda = 0.003215025))


summary(mle)

LogLikFun2 <- function(param) {


alpha<- param[1]
beta<-param[2]
sum(dgamma(x_data,shape=alpha,rate = beta, log = TRUE))
}

mle2 <- maxLik(logLik = LogLikFun2, start = c(alpha=3.2388,beta =0.0110))


summary(mle2)



Solution to Exercise 2.6
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 11 iterations
Return code 1: gradient close to zero
Log-Likelihood: -200.5142
1 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
lambda 0.003401 0.000621 5.476 4.36e-08 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
--------------------------------------------
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 7 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -193.8312
2 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
alpha 2.775605 0.677443 4.097 4.18e-05 ***
beta 0.009439 0.002526 3.737 0.000186 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
--------------------------------------------
Solution to Exercise 2.6

The LRT for comparing the Exponential and Gamma distribution


gives a test statistic of

2 [−193.8312 + 200.5142] = 13.366.

The rejection region at the 5% significance level is values of the test


statistic greater than or equal to 3.84.

We therefore reject the smaller model (the Exponential model) in


favour of the model based on the Gamma distribution.
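The same test can be reproduced in Python without maxLik (an illustrative sketch): the Exponential MLE is closed form, and the Gamma fit uses the profile likelihood β = α/ȳ over an α grid. The grid bounds and spacing are choices of this sketch.

```python
import numpy as np
from math import lgamma

# Exercise 6 data (losses due to vandal damage).
y = np.array([38, 56, 77, 110, 112, 138, 152, 168, 188, 210,
              228, 241, 252, 273, 283, 288, 291, 299, 305, 317,
              321, 356, 374, 422, 485, 527, 529, 559, 567, 656], dtype=float)
n, ybar = len(y), y.mean()

# Exponential model: lambda_hat = 1/ybar, log-likelihood in closed form.
loglik_exp = n * np.log(1.0 / ybar) - (1.0 / ybar) * y.sum()

# Gamma model: for fixed alpha the MLE of beta is alpha/ybar (profile MLE).
def gamma_loglik(alpha):
    beta = alpha / ybar
    return (n * alpha * np.log(beta) - n * lgamma(alpha)
            + (alpha - 1) * np.log(y).sum() - beta * y.sum())

loglik_gam = max(gamma_loglik(a) for a in np.linspace(0.5, 8.0, 7501))

lrt = 2 * (loglik_gam - loglik_exp)   # compare with the chi2(1) 5% value 3.84
```

The two maximised log-likelihoods match the maxLik output above (about −200.51 and −193.83), and the statistic is about 13.37, so the Gamma model is again preferred.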



Result 1
The Poisson distribution has pmf f (y|θ ) = e^{−θ} θ^y /y! , y = 0, 1, 2, . . . .

The Negative Binomial distribution has pmf:

f (y|r, β) = (y+r−1 choose y) (1/(1 + β))^r (β/(1 + β))^y , y = 0, 1, . . . ; r, β > 0.

The Poisson distribution is a limiting case of the negative binomial
distribution.

That is, P (θ ) is the limiting case of the N B(r, β) distribution when
β → 0 with r β = θ held fixed (so that r → ∞).



Proof of Result 1
First, the calculation of the combinatorial term for non–integer r is
performed using

(x choose k) = x(x − 1) . . . (x − k + 1)/k! = Γ(x + 1)/(Γ(k + 1) Γ(x − k + 1)).

By re–parameterizing the negative binomial distribution with

1/(1 + β) = ω/(θ + ω ) and r = ω,

we get

f (y|θ, ω ) = (y+ω−1 choose y) (ω/(θ + ω ))^ω (θ/(θ + ω ))^y , y = 0, 1, . . . ; ω, θ > 0.

As β → 0, ω → ∞, and so we write



Proof of Result 1

lim_{ω→∞} f (y|θ, ω ) = (θ^y /y!) lim_{ω→∞} [(y + ω − 1)! ω^ω / ((ω − 1)! (θ + ω )^{ω+y} )]
                     = (θ^y /y!) lim_{ω→∞} [(y + ω − 1)! / ((ω − 1)! (θ + ω )^y )] (1 + θ/ω )^{−ω}
                     = e^{−θ} θ^y / y!

This shows that the Poisson distribution is a limiting case of the


negative binomial distribution.
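The limit can also be seen numerically. The sketch below evaluates the reparameterised NB pmf from this proof for a large ω and compares it with the Poisson pmf; the values of θ and ω are arbitrary choices.

```python
from math import lgamma, log, exp, factorial

# Reparameterised NB pmf: f(y|theta, omega) uses the combinatorial term
# Gamma(y + omega) / (Gamma(y + 1) Gamma(omega)), computed on the log scale.
def nb_pmf(y, theta, omega):
    log_p = (lgamma(y + omega) - lgamma(omega) - lgamma(y + 1)
             + omega * log(omega / (theta + omega))
             + y * log(theta / (theta + omega)))
    return exp(log_p)

def poisson_pmf(y, theta):
    return exp(-theta) * theta**y / factorial(y)

theta = 2.0
max_diff = max(abs(nb_pmf(y, theta, 1e6) - poisson_pmf(y, theta))
               for y in range(15))
```

For ω = 10⁶ the two pmfs agree to several decimal places at every support point checked, illustrating Result 1.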



Exercise 7
Consider the following data on motor vehicle claims frequency. For
simplicity, in fitting our models we will assume that the 7+ category
relates exactly to 7 claims. The error introduced by this assumption will
be negligible.

# of claims (yi ) # of drivers (ni )


0 20592
1 2651
2 297
3 41
4 7
5 0
6 1
7+ 0

(a) Fit the Poisson and negative binomial models to the above
data using the method of the maximum likelihood.
(b) Compare the fit of the two models using the likelihood ratio
test. Which model is preferable?
Solution to Exercise 2.7 (a)

Excel and R are used to estimate parameters for both models.

Poisson model:

θ̂ = 0.144220.

Negative Binomial model:

r̂ = 1.11790 and β̂ = 0.129010.



Solution to Exercise 2.7 (a)
> obs=c(0,1,2,3,4,5,6,7);
> freq=c(20592,2651,297,41,7,0,1,0)
>
> logLikFunPoi <- function(param) {
+ theta<- param[1]
+ sum(freq*dpois(obs, theta, log = TRUE))
+ }
>
> mle3 <- maxLik(logLik = logLikFunPoi, start = c(theta = 0.156))
> summary(mle3)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 3 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -10297.84
1 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
theta 0.144220 0.002473 58.33 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
--------------------------------------------



Solution to Exercise 2.7 (b)
Define n = ∑_{i=0}^{7+} ni = 23,589.

Under the Poisson model the maximum of the log–likelihood is

−n θ̂ + ln θ̂ ∑_{i=0}^{7+} yi ni − ∑_{i=0}^{7+} ni ln yi ! = −10,297.84.

Under the negative binomial model, we calculate the log–likelihood
using

∑_{i=0}^{7+} ni [ln (r+yi−1 choose yi) − r ln(1 + β) + yi ln(β/(1 + β))] = −10,223.42.

The GAMMALN function in Excel is useful in calculating this sum.



Solution to Exercise 2.7 (b)

The LRT for comparing the Poisson and NB models gives a test
statistic of

2 [−10, 223.42 + 10, 297.84] = 148.84.

The rejection region at the 5% significance level is values of the test


statistic greater than or equal to 3.84.

We therefore reject the smaller model (the Poisson model) in favour


of the model based on the negative binomial distribution.



Fisher–Scoring

Fisher–Scoring is a numerical method used to calculate the ML
estimates when the system

∂ log Ln (θ |y)/∂θ1 = 0
...
∂ log Ln (θ |y)/∂θp = 0

is difficult to solve.

The method of Fisher–Scoring is an extension of the Newton–Raphson
method for approximating the roots of non–linear equations.



Fisher–Scoring

The following steps are required:

1. Revise Newton–Raphson method for approximating the solution


to a single equation with a single unknown variable.
2. Extend Newton–Raphson method to approximate the solution
to a system of n equations in n unknowns, and
3. Introduce the Fisher–Scoring algorithm in the context of ML
estimation and look at some applications of the method to real
actuarial data.
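As a preview of step 3, the scoring update is θ_{m+1} = θ_m + U(θ_m)/I(θ_m). The sketch below applies it to the Geometric pmf f(y|θ) = θ(1 − θ)^y seen earlier in this unit; the simulated data and starting value are assumptions of this sketch.

```python
import numpy as np

# Fisher Scoring for Geometric data: score U(theta) = n/theta - sum(y)/(1-theta),
# expected information I(theta) = n / (theta^2 (1 - theta)).
rng = np.random.default_rng(3)
y = rng.geometric(0.3, size=1000) - 1   # numpy counts trials; subtract 1 for failures
n, s = len(y), y.sum()

theta = 0.5                              # crude starting value
for _ in range(25):
    score = n / theta - s / (1 - theta)
    info = n / (theta**2 * (1 - theta))
    theta = theta + score / info         # scoring update

closed_form = n / (n + s)                # here the MLE is known: 1/(1 + y_bar)
```

The iterates converge to the closed-form MLE 1/(1 + ȳ), which makes this model a convenient test case before tackling problems with no closed form.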



Newton–Raphson Single equation and single unknown

Consider the equation g(x) = 0, with g : R → R.


To approximate the solution to this equation:

1. Choose an initial estimate x = x0 , and get g(x0 ).


2. To update this value to find a better estimate of the solution
⇒ write a linear approximation of g about x = x0 (Taylor series
with two terms).
3. Set the equation equal to 0 and solve for x.
4. Updated estimate is x = x1 .



Newton–Raphson Single equation and single unknown

Our Taylor series expansion is

g(x) ≈ g(x0 ) + (x − x0 ) g′(x0 ).

Setting this to 0, replacing x with x1 and solving for x1 , we have

0 ≈ g(x0 ) + (x1 − x0 ) g′(x0 )
⇒ x1 = x0 − g(x0 )/g′(x0 ).

We can iterate this process many times to get improved estimates
of the solution to the equation g(x) = 0.



Remark 4

The Newton–Raphson procedure does not work if the initial estimate


is at a stationary point. The process also does not generally converge
if the initial estimate is on the other side of a stationary point or a
discontinuity in the function g to the location of the solution of the
equation.



Exercise 8
Use the Newton–Raphson method with four iterations to approxi-
mate the solution to the equation x2 = 2. Use an initial estimate of
x = 1.5.



Solution to Exercise 2.8

The equation to be solved is

g(x) = 0 where g(x) = x² − 2 and g′(x) = 2x.

To update our initial estimate, we use

x1 = x0 − g(x0 )/g′(x0 ) = 1.5 − (1.5² − 2)/(2 × 1.5) = 1.41667.

Iterating this process, we get the following table:



Solution to Exercise 2.8

x value Updated x value


x0 = 1.50000 1.41667
x1 = 1.41667 1.41422
x2 = 1.41422 1.41421
x3 = 1.41421 1.41421

Hence after four iterations, our estimate of the solution to the equa-
tion x2 = 2 is x = 1.41421.
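The iteration in the table can be wrapped in a few lines of Python (an illustrative sketch; the slides carry out the iterations by hand).

```python
# Newton-Raphson for g(x) = x^2 - 2 with starting value x0 = 1.5,
# reproducing the table above.
def newton_raphson(g, g_prime, x0, iterations):
    x = x0
    for _ in range(iterations):
        x = x - g(x) / g_prime(x)   # x_{k+1} = x_k - g(x_k)/g'(x_k)
    return x

root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 4)
```

After four iterations `root` agrees with √2 to far more than the five decimal places shown in the table, reflecting the quadratic convergence of the method.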



Newton–Raphson: n equations and n unknown variables

Suppose we have a set of n functions

gi : Rn → R, i = 1, 2, . . . , n.

Goal: To find an approximate solution to the system of equations

gi (x) = 0, i = 1, 2, . . . , n

where x = (x1 , x2 , . . . , xn ).

Again, we need to write a linear approximation of gi about x = x0
(a Taylor series with two terms), where x0 = (x01 , x02 , . . . , x0n ).



Newton–Raphson: n equations and n unknown variables
Let us consider f : Rn → R, with

x = (x1 , x2 , . . . , xn ).

The Taylor Series expansion about x0 = (x01 , x02 , . . . , x0n ) is

f(x) = f(x0) + (∂f/∂x1, ∂f/∂x2, . . . , ∂f/∂xn)ᵀ (x − x0)
       + (1/2) (x − x0)ᵀ H(x0) (x − x0) + . . . ,

where H(x0) is the Hessian matrix of second-order partial derivatives,
with (i, j) entry ∂²f/∂xi∂xj, evaluated at x0.



Newton–Raphson: n equations and n unknown variables

Consider the function f (x) where f : Rn → Rn ,

with f (x) = (f1 (x), f2 (x), . . . , fn (x)).

The Jacobian matrix is the matrix of first-order partial derivatives,

Jf = [ ∂fi/∂xj ],   i, j = 1, 2, . . . , n,

i.e. the n × n matrix whose i-th row contains the partial derivatives
of fi with respect to x1, x2, . . . , xn.



Example 1
Consider f : R² → R² given by f(x1, x2) = (x1² + x2³, ln x1), where

f(x1, x2) = (f1(x1, x2), f2(x1, x2)) with

f1(x1, x2) = x1² + x2³ and f2(x1, x2) = ln x1.

The Jacobian matrix is

     ( 2x1    3x2² )
Jf = (             ).
     ( 1/x1    0   )
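As an illustrative sanity check (not part of the slides), the analytic Jacobian above can be compared against a central finite-difference approximation at an arbitrary point, here (x1, x2) = (2.0, 3.0):

```python
import math

def f(x1, x2):
    # The function from Example 1: f(x1, x2) = (x1^2 + x2^3, ln x1).
    return [x1 ** 2 + x2 ** 3, math.log(x1)]

def jacobian_fd(x1, x2, h=1e-6):
    # Central finite-difference approximation to the 2x2 Jacobian.
    rows = []
    for i in range(2):
        d1 = (f(x1 + h, x2)[i] - f(x1 - h, x2)[i]) / (2 * h)
        d2 = (f(x1, x2 + h)[i] - f(x1, x2 - h)[i]) / (2 * h)
        rows.append([d1, d2])
    return rows

x1, x2 = 2.0, 3.0
analytic = [[2 * x1, 3 * x2 ** 2], [1 / x1, 0.0]]
numeric = jacobian_fd(x1, x2)
```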



Newton–Raphson: n equations and n unknown variables
Suppose we have a set of n functions

gi : Rn → R, i = 1, 2, . . . , n.

Goal: To find an approximate solution to the system of equations

gi (x) = 0, i = 1, 2, . . . , n

where x = (x1 , x2 , . . . , xn ).
We will write down n separate first order Taylor series approximations
to the equations, i.e. one for each

gi (x) = 0 for i = 1, 2, . . . , n.
We will expand the Taylor series about a vector of initial estimates

x0 = (x01 , . . . , x0n ).



Newton–Raphson: n equations and n unknown variables

Let us consider firstly a single equation gi (x) = 0.

Writing a Taylor series for the left-hand side,

gi(x) ≈ gi(x0) + (∂gi/∂x1, ∂gi/∂x2, . . . , ∂gi/∂xn)ᵀ (x − x0).



Newton–Raphson: n equations and n unknown variables
Writing similar Taylor series expansions for all n functions

g1 (x), g2 (x), . . . , gn (x).

g1(x) ≈ g1(x0) + (∂g1/∂x1, ∂g1/∂x2, . . . , ∂g1/∂xn)ᵀ (x − x0)
g2(x) ≈ g2(x0) + (∂g2/∂x1, ∂g2/∂x2, . . . , ∂g2/∂xn)ᵀ (x − x0)
  ..       ..
gn(x) ≈ gn(x0) + (∂gn/∂x1, ∂gn/∂x2, . . . , ∂gn/∂xn)ᵀ (x − x0)



Newton–Raphson: n equations and n unknown variables

Then collecting the n functions into a single function g : Rn → Rn


where

g(x) = (g1 (x), g2 (x), . . . , gn (x)).


Next, writing the Taylor series approximation in matrix form, we have

g(x) ≈ g(x0 ) + [Jg (x0 )] (x − x0 ),

where Jg (x0 ) denotes the Jacobian matrix of g at x = x0 .

Finally, setting the left-hand side equal to 0 and replacing x with x1,
we have

x1 = x0 − [Jg(x0)]⁻¹ g(x0).
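A minimal sketch of this multivariate update for a two-equation system (illustrative only; the example system, starting point and tolerance are not from the slides). To keep the sketch dependency-free, the 2×2 linear solve is done with Cramer's rule instead of forming the inverse explicitly.

```python
def newton_raphson_2d(g, jac, x, n_iter=25, tol=1e-12):
    """Iterate x <- x - [Jg(x)]^{-1} g(x) for a system of 2 equations."""
    for _ in range(n_iter):
        g1, g2 = g(x)
        (a, b), (c, d) = jac(x)
        det = a * d - b * c
        # Cramer's rule for the step s solving Jg(x) s = g(x).
        s1 = (d * g1 - b * g2) / det
        s2 = (a * g2 - c * g1) / det
        x = (x[0] - s1, x[1] - s2)
        if abs(s1) < tol and abs(s2) < tol:
            break
    return x

# Illustrative system: x1^2 + x2^2 - 4 = 0 and x1 - x2 = 0,
# whose solution with positive coordinates is x1 = x2 = sqrt(2).
g = lambda x: (x[0] ** 2 + x[1] ** 2 - 4, x[0] - x[1])
jac = lambda x: ((2 * x[0], 2 * x[1]), (1.0, -1.0))
root = newton_raphson_2d(g, jac, (1.0, 2.0))
```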



Remark 5

The derivative in the denominator of the single-equation algorithm
is replaced, in the multiple-equation algorithm, by the inverse of the
Jacobian matrix.



Exercise 9
Use the Newton–Raphson method to approximate the solution to
the system of equations:
x1² − x2 − 20 = 0
x1 x2² + x1 − 8 = 0
Use one iteration only. Use starting values of x0 = (5, 1).



Solution to Exercise 2.9

Using the multivariate Newton–Raphson algorithm, we have

x1 = x0 − [Jg(x0)]⁻¹ g(x0),

where x0 = (5, 1) and g(x0 ) = (4, 2).

The Jacobian matrix is given by


 
     ( 2x1        −1     )
Jg = (                   )
     ( x2² + 1    2x1 x2 )

and then

         ( 10   −1 )
Jg(x0) = (         ).
         (  2   10 )



Solution to Exercise 2.9

On substitution, we have

x1 = (5, 1)ᵀ − [Jg(x0)]⁻¹ (4, 2)ᵀ,

where

                     (  10    1 )
[Jg(x0)]⁻¹ = (1/102) (          ),
                     ( −2    10 )

so that

x1 = (5, 1)ᵀ − (1/102) (42, 12)ᵀ = (4.588, 0.882)ᵀ

(to 3 decimal places).
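The single step above can be verified numerically (an illustrative check, not part of the slides; the 2×2 solve again uses Cramer's rule):

```python
x1_0, x2_0 = 5.0, 1.0
g1 = x1_0 ** 2 - x2_0 - 20          # = 4
g2 = x1_0 * x2_0 ** 2 + x1_0 - 8    # = 2
a, b = 2 * x1_0, -1.0               # Jacobian entries at (5, 1)
c, d = x2_0 ** 2 + 1, 2 * x1_0 * x2_0
det = a * d - b * c                 # = 102
# Components of [Jg(x0)]^{-1} g(x0) by Cramer's rule.
s1 = (d * g1 - b * g2) / det
s2 = (a * g2 - c * g1) / det
x1_new = (x1_0 - s1, x2_0 - s2)
print(round(x1_new[0], 3), round(x1_new[1], 3))  # 4.588 0.882
```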



Fisher–Scoring Algorithm
The multivariate version of the Newton–Raphson algorithm can be
applied to solve

∂ log Ln(θ|y)/∂θ1 = 0
        ..
        .                  (a system of p equations)
∂ log Ln(θ|y)/∂θp = 0

Note that θ = (θ1 , θ2 , . . . , θp ). Application of the Newton–Raphson


method gives:

θ1 = θ0 − ( ∂² log L(θ|y)/∂θ ∂θᵀ |θ0 )⁻¹ ( ∂ log L(θ|y)/∂θ |θ0 ),

where −∂² log L(θ|y)/∂θ ∂θᵀ |θ0 is the observed information matrix at θ0
and ∂ log L(θ|y)/∂θ |θ0 is the score vector at θ0.
Fisher–Scoring (Modification)
In practice the observed information is replaced by expected Fisher’s
information

θ1 = θ0 − ( E[ ∂² log L(θ|y)/∂θ ∂θᵀ ] |θ0 )⁻¹ ( ∂ log L(θ|y)/∂θ |θ0 ),   or

θ1 = θ0 + ( E[ −∂² log L(θ|y)/∂θ ∂θᵀ ] |θ0 )⁻¹ ( ∂ log L(θ|y)/∂θ |θ0 ).

where the matrix that is inverted is Fisher's information matrix
evaluated at θ0.

This algorithm for finding the solution of the system of normal
equations (the score equations set equal to 0) is called Fisher–Scoring.
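As a toy illustration of the update (an illustrative sketch with made-up data, not from the slides): for an i.i.d. Poisson(λ) sample, the score is ∑ yi/λ − n and the expected information is n/λ, so a single Fisher–Scoring step from any starting value λ0 > 0 lands exactly on the MLE, the sample mean.

```python
y = [3, 5, 2, 4, 6]       # illustrative data
n, total = len(y), sum(y)

lam = 1.0                 # arbitrary positive starting value
score = total / lam - n   # derivative of the log-likelihood at lam
info = n / lam            # expected Fisher information at lam
lam = lam + score / info  # Fisher-scoring update
print(lam)                # 4.0, the sample mean
```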
Exercise 10
Consider the following Poisson model:

Yi ∼ Poisson(λi = α + β zi ).

You should think of Y as response variables and z as predictors of


Y. The two unknown parameters are α and β.
Data on the response variable Y and the predictor z are given below:

y 29 37 32 33 26 30 24 28 20 21
z 1 2 3 4 5 6 7 8 9 10

Note that this data can be thought of as pairs (yi , zi ) for i =


1, 2, . . . , 10.



Exercise 2.10

(a) Verify that the expected information matrix for a data set with
    10 pairs of data as given above is

              ( ∑ 1/(α + β zi)      ∑ zi/(α + β zi)  )
    I(α, β) = (                                      )
              ( ∑ zi/(α + β zi)     ∑ zi²/(α + β zi) )

    where each sum runs over i = 1, 2, . . . , 10.

(b) Derive the Fisher scoring algorithm for this data.


(c) Using initial values for α and β based on fitting a normal
    linear model, implement your Fisher scoring algorithm from (b)
    using EXCEL.



Solution to Exercise 2.10 (a)
The likelihood function is
L(α, β|y) = ∏ e^(−(α + β zi)) (α + β zi)^(yi) / yi!,

where the product (and each sum below) runs over i = 1, 2, . . . , 10.

The log–likelihood function is

log(L(α, β|y)) = ∑ [ −(α + β zi) + yi log(α + β zi) − log(yi!) ].

The first order partial derivatives of the log–likelihood function are:

∂ log(L(α, β|y))/∂α = ∑ [ −1 + yi/(α + β zi) ]       and

∂ log(L(α, β|y))/∂β = ∑ [ −zi + yi zi/(α + β zi) ].



Solution to Exercise 2.10 (a)
The Hessian matrix of second derivatives of the log–likelihood
function is

∂² log(L(α, β|y)) / ∂(α, β) ∂(α, β)ᵀ =

    ( ∑ −yi/(α + β zi)²       ∑ −yi zi/(α + β zi)²  )
    (                                               )
    ( ∑ −yi zi/(α + β zi)²    ∑ −yi zi²/(α + β zi)² )

Replacing the yi values with the random variables Yi, negating each
entry and taking expectations, using the fact that E[Yi] = α + β zi,
we get

          ( ∑ 1/(α + β zi)      ∑ zi/(α + β zi)  )
I(α, β) = (                                      )
          ( ∑ zi/(α + β zi)     ∑ zi²/(α + β zi) )

(with sums over i = 1, 2, . . . , 10), as required.
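The derivatives in part (a) can be checked numerically (an illustrative check, not part of the slides): the analytic score should match central finite differences of the log-likelihood at any valid point, e.g. the arbitrary choice (α, β) = (30, −1).

```python
import math

y = [29, 37, 32, 33, 26, 30, 24, 28, 20, 21]
z = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def loglik(a, b):
    # log(yi!) is constant in (a, b), so it is dropped here.
    return sum(-(a + b * zi) + yi * math.log(a + b * zi)
               for yi, zi in zip(y, z))

def score(a, b):
    da = sum(-1 + yi / (a + b * zi) for yi, zi in zip(y, z))
    db = sum(-zi + yi * zi / (a + b * zi) for yi, zi in zip(y, z))
    return da, db

a, b, h = 30.0, -1.0, 1e-6
ana_da, ana_db = score(a, b)
num_da = (loglik(a + h, b) - loglik(a - h, b)) / (2 * h)
num_db = (loglik(a, b + h) - loglik(a, b - h)) / (2 * h)
```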
Solution to Exercise 2.10 (b)

The Fisher–Scoring algorithm is

(α1, β1)ᵀ = (α0, β0)ᵀ + [I(α0, β0)]⁻¹ ( ∂ log(Ln(α, β|y))/∂α ,
             ∂ log(Ln(α, β|y))/∂β )ᵀ, evaluated at (α0, β0).



Solution to Exercise 2.10 (c)

See EXCEL spreadsheet Lecture Exercise 2.10 Calcs.xls.

y=c(29,37,32,33,26,30,24,28,20,21)
z=c(1,2,3,4,5,6,7,8,9,10)

initial=lm(y~z)
initial

Call:
lm(formula = y ~ z)

Coefficients:
(Intercept) z
35.800 -1.418
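The slides carry out the Fisher–Scoring iterations in EXCEL; the following is an illustrative pure-Python equivalent (not part of the course materials), starting from the lm() fit (α0, β0) = (35.800, −1.418) above and using Cramer's rule for the 2×2 solve.

```python
y = [29, 37, 32, 33, 26, 30, 24, 28, 20, 21]
z = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

a, b = 35.800, -1.418  # starting values from the normal linear model
for _ in range(25):
    lam = [a + b * zi for zi in z]
    # Score vector (first derivatives of the log-likelihood).
    u1 = sum(yi / li - 1 for yi, li in zip(y, lam))
    u2 = sum(zi * (yi / li - 1) for yi, zi, li in zip(y, z, lam))
    # Expected information matrix entries from part (a).
    i11 = sum(1 / li for li in lam)
    i12 = sum(zi / li for zi, li in zip(z, lam))
    i22 = sum(zi ** 2 / li for zi, li in zip(z, lam))
    det = i11 * i22 - i12 ** 2
    # Fisher-scoring update: (a, b) <- (a, b) + I^{-1} * score.
    a += (i22 * u1 - i12 * u2) / det
    b += (i11 * u2 - i12 * u1) / det
```

At convergence the score vector is numerically zero, which is exactly the condition the spreadsheet implementation targets.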

