
ECON509 Probability and Statistics

Slides 7

Bilkent

This Version: 16 December 2013

(Bilkent) ECON509 This Version: 16 December 2013 1 / 73


Introduction

In this part, we will talk about estimation. Our focus will almost exclusively be on
the maximum likelihood method.
We have worked with many distributions so far, calculated their expectations and
variances, or derived their moment generating functions, etc.
Importantly, the setting was such that we knew what distribution we were
considering AND we had full knowledge of the parameter values for these
distributions. Or, to put it more precisely, we never contemplated the possibility that
they might not be known.
Then, there are two implicit assumptions:
1. We know the distribution.
2. We know the parameters of the distribution.
In real life, this is rarely the case. We will first relax the second assumption and later
on will dispense with the first assumption.
The treatment will sacrifice formality and will rather focus on ideas. References
for more formal treatments will be provided at the end of this set of slides.



Introduction

Now, let's assume we have a random sample consisting of X1, X2, ..., Xn from the
density f_X(x | θ₀).
We would like to determine the value of θ 0 , which is unknown.
We could use an estimator.
An estimator is some function of the data

θ̂_n = W(X1, ..., Xn). (1)

The index n underlines the fact that the particular value of the estimate depends on
the sample (and, so, on its size). Note that usually n is dropped and instead simply
θ̂ is used.
Note the difference between the estimator and the estimate. The estimator is a
concept while the estimate is the value of the estimator for a given sample. So, if
the estimator is W(X1, ..., Xn), then the estimate for a particular realisation of
X1, ..., Xn is given by W(x1, ..., xn).



Introduction

Now, although the definition given in (1) implies that any function of the data could
be a valid estimator, we usually look for those that have desirable properties.
In other words, an estimator is a statistic (meaning that it cannot depend on θ or
any other unknown parameters) that has desirable properties.
We have actually introduced one of these desirable properties: consistency. Others
are unbiasedness, minimum mean squared error, minimum variance etc.
Let Θ be the parameter space for θ. An estimator θ̂ of θ₀ is a minimum mean
squared error estimator if, for every θ₀ ∈ Θ,

θ̂ = arg min_{θ∈Θ} E[(θ − θ₀)²].

An estimator θ̂ of θ₀ is unbiased if, for every θ₀ ∈ Θ,

E[θ̂] = θ₀.

You will learn more about these in your future econometrics courses.



Maximum Likelihood Estimation
Motivation and the Main Ideas

Let us first dissect the notation. Suppose we are dealing with some generic
distribution such that

F_Y(y; θ), θ ∈ Θ.

F is the cdf, Y is the random variable and y is a particular realisation of Y.
θ is a vector which contains the distribution parameters. This is generally known as
the parameter vector.
The parameter vector takes on values in a set, Θ, known as the parameter space.
For example, for a normal random variable,

θ = (µ, σ²) and Θ = {(µ, σ²) : −∞ < µ < ∞, σ² > 0},

where µ is the mean and σ² is the variance.



Maximum Likelihood Estimation
Motivation and the Main Ideas

Suppose that we actually know the distribution.


However, usually θ is unknown. How to find out the value of θ?
We have to distinguish between the population and the sample. The population contains
all the unknown values. The sample, on the other hand, can only provide an
approximation.
For example,

θ : population,
θ̂ : sample.

The maximum likelihood method is a very popular and powerful method for estimating
θ when the underlying distribution function, F_Y, is known (or when one believes that
one actually knows the underlying distribution).



Maximum Likelihood Estimation
Motivation and the Main Ideas

Maximum likelihood estimation (MLE) is based on the maximisation of a likelihood
function.
Where to find this "likelihood function"? It's actually pretty easy!
The likelihood function is the same as the probability density function:

fY (y ; θ ) = L (θ; y ).

The only change is the interpretation. When we consider a probability density
function, we implicitly consider θ as fixed and y as random. When we consider a
likelihood function, we assume that the data, y, are given and fixed. Instead, it is θ
which is modified.
How to make sense of this? MLE is based on the idea that, if we know the
underlying distribution function, then we should choose θ such that the probability
of the data, y , being observed is maximised.
In other words, we are trying to find the values of θ which are most likely to have
generated the observed data. This likelihood principle is due to R. A. Fisher.
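This idea can be sketched numerically (the seed, sample size and function name below are illustrative assumptions, not from the slides): simulate data from N(0, 1) and compare the joint log-likelihood of the sample under the two candidate values σ² = 1 and σ² = 4.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=10_000)   # data truly generated by N(0, 1)

def normal_loglik(y, mu, sigma2):
    """Joint log-likelihood of an iid N(mu, sigma2) sample."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

ll_1 = normal_loglik(y, 0.0, 1.0)   # candidate sigma^2 = 1 (the truth)
ll_4 = normal_loglik(y, 0.0, 4.0)   # candidate sigma^2 = 4
print(ll_1 > ll_4)                  # the true variance attains the higher likelihood
```

With overwhelming probability the sample "votes" for σ² = 1, which is exactly the comparison the next figures make visually.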



Maximum Likelihood Estimation
Motivation and the Main Ideas

[Figure: two candidate densities, µ=0, σ²=1 and µ=0, σ²=4, plotted on a histogram scale.]

Figure: We have 10,000 iid observations from the true distribution N(0, σ²). Two possible
values for σ². Which one is the correct one? Can we use the data to make a decision?



Maximum Likelihood Estimation
Motivation and the Main Ideas

[Figure: histogram of the 10,000 observations.]

Figure: This is what the data tell us.



Maximum Likelihood Estimation
Motivation and the Main Ideas

[Figure: the histogram of the data overlaid with the two candidate densities, µ=0, σ²=1
and µ=0, σ²=4.]

Figure: Can you now decide on which σ² to choose?



Maximum Likelihood Estimation
Motivation and the Main Ideas

The dataset would preferably consist of many observations on the same random
variable. This ensures that we have sufficient information to estimate θ. Consider
some simple examples.
Example: Let Yi be an iid random sequence where i = 1, ..., n. Let also
Yi ∼ N(µ, σ²), where Θ = {(µ, σ²) : −∞ < µ < ∞, σ² > 0} gives the parameter
space. Then, thanks to the independence assumption, the joint likelihood function is
given by

f_Y(y; θ) = ∏_{i=1}^n f_{Y_i}(y_i; θ),

where y = (y1, ..., yn).
Notice that the parameter vector, θ, is common to all variables.
Notice that the parameter vector, θ, is common for all variables.



Maximum Likelihood Estimation
Motivation and the Main Ideas

Example: Let Yi be an iid random sequence, conditional on Xi = xi, where
i = 1, ..., n. Let also Yi | Xi = xi ∼ N(β′xi, σ²), where xi = (xi1, ..., xik)′ and
β = (β1, ..., βk)′.
Let, also,

y = (y1, ..., yn) and x = (x1, ..., xn).

Then,

f_{Y|X=x}(y; θ) = ∏_{i=1}^n f_{Y_i|X_i=x_i}(y_i; θ).

An example? A possible structure for Yi | Xi = xi would be

yi = 0.3xi1 + 0.4xi2 + ui, ui ∼ iid N(0, 1), xi ∼ t6,

where ui is independent of xi.
Here,

β = (0.3, 0.4)′, σ² = 1 and xi = (xi1, xi2)′.



Maximum Likelihood Estimation

One of the most important models in financial econometrics (and, indeed,
econometrics) is the autoregressive conditional heteroskedasticity (ARCH) model due
to Engle (1982, Econometrica).
Example: Let Yt be the daily return on some equity on day t, where t = 1, ..., T.
The model is given by

Yt | Yt−1 = yt−1 ∼ N(0, σt²),

where

σt² = ω + α y²_{t−1}.



Maximum Likelihood Estimation

We can construct the likelihood by using a representation known as the prediction
decomposition. Omitting the arguments of the likelihood/density functions for
conciseness, we obtain

f_{Y1,...,YT} = f_{Y2,...,YT | Y1} f_{Y1}
             = f_{Y3,...,YT | Y1,Y2} f_{Y2|Y1} f_{Y1}
             = f_{Y4,...,YT | Y1,Y2,Y3} f_{Y3|Y1,Y2} f_{Y2|Y1} f_{Y1}
             ...
             = ∏_{t=1}^T f_{Yt | Yt−1,...,Y1}.

Then, for the ARCH model we have

f_{Y1,...,YT} = f_{Y1} ∏_{t=2}^T f_{Yt | Yt−1},

since, for t ≥ 2, the conditional distribution of Yt depends on Yt−1 only.
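As a sketch (the function name, parameter values and simulated series are my own illustrative assumptions), the conditional log-likelihood of an ARCH(1) model implied by this decomposition can be coded as:

```python
import numpy as np

def arch1_loglik(omega, alpha, y):
    """Conditional log-likelihood sum over t >= 2 of log f(y_t | y_{t-1}) for an
    ARCH(1) model, where Y_t | Y_{t-1} = y_{t-1} ~ N(0, omega + alpha * y_{t-1}^2)."""
    sigma2 = omega + alpha * y[:-1] ** 2          # sigma_t^2 for t = 2, ..., T
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - y[1:] ** 2 / (2 * sigma2))

# simulate a path from the model and evaluate the likelihood at the truth
rng = np.random.default_rng(1)
omega0, alpha0, T = 0.1, 0.4, 1_000
y = np.empty(T)
y[0] = rng.normal(0.0, np.sqrt(omega0 / (1 - alpha0)))   # start near the stationary scale
for t in range(1, T):
    y[t] = rng.normal(0.0, np.sqrt(omega0 + alpha0 * y[t - 1] ** 2))

print(arch1_loglik(omega0, alpha0, y))   # higher than at badly wrong parameters
print(arch1_loglik(1.0, 0.0, y))
```

Evaluating the same function at distinctly wrong parameter values (say ω = 1, α = 0, i.e. a constant unit variance) gives a visibly lower value on a sample of this length.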



Maximum Likelihood Estimation
Now that we know how to construct the joint likelihood function for a collection of
random variables Y1, ..., Yn, we can start thinking about how to estimate parameters
by MLE.
Remember our discussion about the logic behind MLE. The idea is to find the values
of the parameters that maximise the probability of obtaining the data that we
observe in the sample.
Our notation is

L(θ; y) = f_Y(y; θ),

where θ and y are the parameter and data matrices, respectively.
Usually, it is more convenient to use the log-likelihood, which is

ℓ(θ; y) = log L(θ; y).

Notice that log is a monotone transformation. Hence, as will be obvious in a
moment, for our purposes there is no difference between using ℓ(θ; y) and L(θ; y).
The maximum likelihood method is based on finding the parameter values which
maximise the likelihood (or probability) of obtaining the particular sample we have:

θ̂ = arg max_{θ∈Θ} ℓ(θ; y).



Maximum Likelihood Estimation
Hence, the likelihood function is the objective function. Consequently, there must be
a first order condition.
Caution: never confuse the estimator with the estimate!
This first order condition has a special name: the score. The score is a key concept
and deeply influences the behaviour of the ML estimator.
Let θ be a (k × 1) vector. When the derivative exists, the score is given by

∂ log L(θ; y)/∂θ = ∂ℓ(θ; y)/∂θ.

Of course, this is a (k × 1) vector, as well.
Consequently, θ̂ is the value of θ which satisfies

∂ log L(θ; y)/∂θ |_{θ=θ̂} = 0.

Importantly, one also has to ensure that

∂² log L(θ; y)/∂θ∂θ′ |_{θ=θ̂} < 0,

in the sense that the matrix is negative definite.


Maximum Likelihood Estimation

Example: Let Yi ∼ N(µ, σ²) where Yi ⊥⊥ Yj for all i ≠ j and i, j = 1, ..., n.
Let, as before, y = (y1, ..., yn). The joint likelihood is given by

L(θ; y) = ∏_{i=1}^n f_{Y_i}(y_i; θ) = [1/√(2πσ²)]^n exp{ −(1/(2σ²)) ∑_{i=1}^n (y_i − µ)² }.

Then,

ℓ(θ; y) = log L(θ; y) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (y_i − µ)².

Obviously, θ = (µ, σ²)′. Let's find the ML estimators.



Maximum Likelihood Estimation

Now,

∂ℓ(θ; y)/∂µ |_{µ=µ̂, σ²=σ̂²} = (1/σ̂²) ∑_{i=1}^n (y_i − µ̂) = 0,

and

∂ℓ(θ; y)/∂σ² |_{µ=µ̂, σ²=σ̂²} = −n/(2σ̂²) + (1/(2σ̂⁴)) ∑_{i=1}^n (y_i − µ̂)² = 0.

Solving the first-order conditions yields

µ̂ = ȳ = (1/n) ∑_{i=1}^n y_i and σ̂² = (1/n) ∑_{i=1}^n (y_i − µ̂)² = (1/n) ∑_{i=1}^n (y_i − ȳ)².

Therefore, θ̂ = ( ȳ, (1/n) ∑_{i=1}^n (y_i − ȳ)² )′.
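These closed-form estimators can be cross-checked against a brute-force maximisation of ℓ(θ; y) over a grid (a sketch; the simulated data, seed and grid are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=2.0, scale=3.0, size=2_000)

# closed-form ML estimators derived above
mu_hat = y.mean()
sigma2_hat = np.mean((y - mu_hat) ** 2)    # note: divides by n, not n - 1

def loglik(mu, s2):
    """Joint N(mu, s2) log-likelihood of the sample y."""
    return -0.5 * len(y) * np.log(2 * np.pi * s2) - np.sum((y - mu) ** 2) / (2 * s2)

# brute-force check: maximise over a grid centred on the closed-form values
mus = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 101)
s2s = np.linspace(0.5 * sigma2_hat, 1.5 * sigma2_hat, 101)
ll = np.array([[loglik(m, s) for s in s2s] for m in mus])
i, j = np.unravel_index(ll.argmax(), ll.shape)
print(mus[i] - mu_hat, s2s[j] - sigma2_hat)   # both differences are numerically zero
```

The grid maximiser lands exactly on the closed-form (µ̂, σ̂²), since both grid midpoints coincide with the analytical maximiser.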



Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

In the next few slides, we will cover some important common properties of likelihood
functions.
In this discussion, we will assume that the data generating process and the chosen
underlying distribution are the same:

g_Y(y) = f_Y(y; θ₀),

where θ₀ is, by definition, the true parameter value.
This is not necessarily true in general. In fact, thinking about what happens when

g_Y(y) ≠ f_Y(y; θ) for all possible θ

is crucial.
We will do this later. For the time being, we will stick to the simpler case.



Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Property 1 (Unbiasedness of the Score):

E_f[ ∂ log L(θ; y)/∂θ |_{θ=θ₀} ] = 0,

where the expectation is taken with respect to the distribution f_Y(y; θ₀).
Proof: Now,

∂ log L(θ; y)/∂θ = [1/L(θ; y)] ∂L(θ; y)/∂θ.

Then,

E_f[ ∂ log L(θ; y)/∂θ ] = ∫ [1/L(θ; y)] [∂L(θ; y)/∂θ] f(y; θ₀) dy,

by definition. Observe that this is a function of both θ and θ₀.



Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Now,

E_f[ ∂ log L(θ; y)/∂θ |_{θ=θ₀} ] = ∫ [∂L(θ; y)/∂θ]|_{θ=θ₀} [1/L(θ₀; y)] f(y; θ₀) dy
                                 = ∫ [∂L(θ; y)/∂θ]|_{θ=θ₀} dy
                                 = ∂/∂θ [ ∫ f(y; θ) dy ]|_{θ=θ₀}
                                 = ∂/∂θ (1)|_{θ=θ₀}
                                 = 0,

where we used L(θ₀; y) = f(y; θ₀) and implicitly assumed that the order of integration
and differentiation can be exchanged. An aside: this requires that the range of y does
not depend on θ.
Hence, the expectation of the first-order condition, evaluated at the true parameter
value, is zero!
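Property 1 is easy to illustrate by simulation. For a single N(θ₀, 1) observation the score is y − θ₀, so its average over many draws from f(y; θ₀) should be approximately zero (θ₀, the seed and the number of draws below are illustrative choices):

```python
import numpy as np

theta0 = 1.5
rng = np.random.default_rng(3)
y = rng.normal(loc=theta0, scale=1.0, size=1_000_000)   # draws from f(y; theta_0)

# score of one N(theta, 1) observation: d/dtheta log f(y; theta) = y - theta
score_at_truth = y - theta0
print(score_at_truth.mean())   # approximately 0: the score is mean-zero at theta_0
```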
Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Property 2 (The Information Equality):

Cov_f( ∂ log L(θ; Y)/∂θ |_{θ=θ₀} ) = −E_f[ ∂² log L(θ; Y)/∂θ∂θ′ |_{θ=θ₀} ],

where, as before, the expectation and the covariance are taken with respect to
f_Y(y; θ₀).
Proof: Now, one can show that

∂² log L(θ; y)/∂θ∂θ′ = [1/L(θ; y)] ∂²L(θ; y)/∂θ∂θ′ − [1/L(θ; y)²] [∂L(θ; y)/∂θ][∂L(θ; y)/∂θ′].

Then,

E_f[ ∂² log L(θ; y)/∂θ∂θ′ ] = ∫ [1/L(θ; y)] [∂²L(θ; y)/∂θ∂θ′] f(y; θ₀) dy
                             − ∫ [1/L(θ; y)²] [∂L(θ; y)/∂θ][∂L(θ; y)/∂θ′] f(y; θ₀) dy.


Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Take the first term:

∫ [1/L(θ; y)] [∂²L(θ; y)/∂θ∂θ′]|_{θ=θ₀} f(y; θ₀) dy
  = ∫ [∂²L(θ; y)/∂θ∂θ′]|_{θ=θ₀} [1/L(θ₀; y)] f(y; θ₀) dy
  = ∫ [∂²L(θ; y)/∂θ∂θ′]|_{θ=θ₀} dy
  = ∂²/∂θ∂θ′ [ ∫ L(θ; y) dy ]|_{θ=θ₀}
  = ∂²/∂θ∂θ′ (1)|_{θ=θ₀}
  = 0.

These steps hold if, of course, we can exchange the order of integration and
differentiation, which is implicitly assumed here.



Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Then,

E_f[ ∂² log L(θ; y)/∂θ∂θ′ |_{θ=θ₀} ] = −∫ [1/L(θ; y)²] [∂L(θ; y)/∂θ][∂L(θ; y)/∂θ′]|_{θ=θ₀} f(y; θ₀) dy
  = −∫ [∂ log L(θ; y)/∂θ][∂ log L(θ; y)/∂θ′]|_{θ=θ₀} f(y; θ₀) dy
  = −E_f[ (∂ log L(θ; y)/∂θ)(∂ log L(θ; y)/∂θ′) |_{θ=θ₀} ]
  = −Cov_f( ∂ log L(θ; y)/∂θ |_{θ=θ₀} ),

since ∂ log L(θ; y)/∂θ |_{θ=θ₀} is zero-mean.
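The information equality can also be checked by simulation. For one Poisson(λ) observation the score is y/λ − 1 and the Hessian is −y/λ², so Var(score) and −E[Hessian] should both be close to 1/λ (a sketch; λ and the number of draws are arbitrary choices):

```python
import numpy as np

lam0 = 3.0
rng = np.random.default_rng(4)
y = rng.poisson(lam=lam0, size=1_000_000)   # draws from the true Poisson(lam0)

score = y / lam0 - 1.0        # d/dlam log f(y; lam) evaluated at lam = lam0
hessian = -y / lam0 ** 2      # d^2/dlam^2 log f(y; lam) evaluated at lam = lam0

print(score.var(), -hessian.mean(), 1.0 / lam0)   # all three approximately equal
```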



Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator
Property 3 (Cramér-Rao Inequality): Let θ̃ be some estimator of θ₀ and assume
that E_f[θ̃] = θ₀. Then,

Var_f(θ̃) − { Var_f[ ∂ log L(θ; y)/∂θ |_{θ=θ₀} ] }^{−1} ≥ 0,

in the sense that the difference between the two matrices is non-negative definite.
Proof: For simplicity of exposition we focus on the univariate case, where θ is a
scalar. Since θ̃ is unbiased for θ₀, we have

E_f[θ̃] = ∫ θ̃ f(y; θ₀) dy = θ₀.

Now, differentiating with respect to θ₀ gives

1 = ∂/∂θ [ ∫ θ̃ f(y; θ) dy ]|_{θ=θ₀} = ∫ θ̃ [∂f(y; θ)/∂θ]|_{θ=θ₀} dy
  = ∫ θ̃ { [∂f(y; θ)/∂θ]|_{θ=θ₀} / f(y; θ₀) } f(y; θ₀) dy = ∫ θ̃ [∂ log f(y; θ)/∂θ]|_{θ=θ₀} f(y; θ₀) dy
  = E_f[ θ̃ ∂ log f(y; θ)/∂θ |_{θ=θ₀} ].


Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Now, the previous term looks like the covariance of

θ̃ and ∂ log f(y; θ)/∂θ |_{θ=θ₀}.

And, actually, it is!
Remember that

E_f[ ∂ log f(y; θ)/∂θ |_{θ=θ₀} ] = 0. (2)

Since, for any two random variables X and Y,

Cov(X, Y) = E[XY],

if either E[X] = 0 or E[Y] = 0 (you can easily verify this), by (2) we have

1 = E_f[ θ̃ ∂ log f(y; θ)/∂θ |_{θ=θ₀} ] = Cov_f( θ̃, ∂ log f(y; θ)/∂θ |_{θ=θ₀} ).


Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Remember that, by the Cauchy-Schwarz Inequality, for any two random variables X
and Y,

Cov_f(X, Y)² ≤ Var_f(X) Var_f(Y).

Then,

Cov_f( θ̃, ∂ log f(y; θ)/∂θ |_{θ=θ₀} )² = 1² ≤ Var_f(θ̃) Var_f[ ∂ log f(y; θ)/∂θ |_{θ=θ₀} ].

Rearranging gives the desired result.
We note two important results:
1. The only time the Cramér-Rao bound is achieved is when the estimator is the ML
estimator. In many problems, no estimator would actually achieve this bound.
2. For regular problems, asymptotically the ML estimator achieves the Cramér-Rao
bound. In other words, for large n, ML can achieve the Cramér-Rao bound; that is, ML
is efficient in large samples.
In general, an estimator is said to be efficient if it achieves the lowest possible
variance.



Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Example: Let Y1, Y2, ... be an iid sequence where Yi ∼ N(θ₀, 1) for all i. We will
first find the Cramér-Rao bound for unbiased estimators of θ₀ and then show that
the ML estimator achieves this bound.
First, let's construct the log-likelihood function. Let y = (y1, ..., yn)′. Then

L(θ; y) = [1/√(2π)] exp{ −(y₁ − θ)²/2 } × ... × [1/√(2π)] exp{ −(yₙ − θ)²/2 }
        = [1/√(2π)]^n exp{ −(1/2) ∑_{i=1}^n (y_i − θ)² }.

This gives

ℓ(θ; y) = −(n/2) log 2π − (1/2) ∑_{i=1}^n (y_i − θ)².



Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Then, the first order condition is given by

∂ℓ(θ; y)/∂θ |_{θ=θ̂} = −(1/2)(−2) ∑_{i=1}^n (y_i − θ̂) = 0,

implying that

∑_{i=1}^n y_i − ∑_{i=1}^n θ̂ = 0,

and, so,

θ̂ = (1/n) ∑_{i=1}^n y_i.

In addition,

Var_f[ ∂ log f(y; θ)/∂θ |_{θ=θ₀} ] = Var_f[ ∑_{i=1}^n (y_i − θ₀) ] = ∑_{i=1}^n Var_f(y_i) = n,

since Var_f(y_i) = 1 for each i, due to the iid assumption.

Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator

Therefore, as far as this problem is concerned, the Cramér-Rao bound for any
unbiased estimator θ̃ is given by

Var_f(θ̃) ≥ 1/n.

Now, the variance of the ML estimator is very easy to find:

Var(θ̂) = Var( (1/n) ∑_{i=1}^n y_i ) = (1/n²) ∑_{i=1}^n Var(y_i) = (1/n²) n = 1/n.

But this is the same as the Cramér-Rao bound. Hence, the ML estimator in this
particular case is an efficient estimator.
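A small Monte Carlo experiment illustrates that the sample mean attains the 1/n bound (the particular θ₀, n and number of replications are arbitrary choices for this sketch):

```python
import numpy as np

theta0, n, reps = 0.0, 50, 100_000
rng = np.random.default_rng(5)
samples = rng.normal(loc=theta0, scale=1.0, size=(reps, n))

theta_hat = samples.mean(axis=1)   # the ML estimator, replication by replication
print(theta_hat.var(), 1.0 / n)    # Monte Carlo variance vs the Cramer-Rao bound
```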



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Remember that we restrict ourselves to the case where our random sequence
Y1, ..., Yn is iid. Let y = (y1, ..., yn).
Let f_{Y_i}(y_i; θ) = L(θ; y_i) be the pdf (or the likelihood function) for Yi. Then, the
joint pdf or the joint likelihood function is given by

L(θ; y) = ∏_{i=1}^n L(θ; y_i),

and the joint log-likelihood function is

ℓ(θ; y) = log ∏_{i=1}^n L(θ; y_i) = ∑_{i=1}^n log L(θ; y_i).

Now, suppose that log L(θ; y₁), log L(θ; y₂), ... is an iid sequence where
E[log L(θ; y_i)] < ∞ for all i.
Then, by the strong Law of Large Numbers,

(1/n) ∑_{i=1}^n log L(θ; y_i) →a.s. E_{f(y|θ₀)}[log L(θ; y_i)].



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Let θ̂ be the maximum likelihood estimator of the true parameter value θ₀. Can we
use the previous convergence result to argue that we will also have

θ̂ →a.s. θ₀?

To be able to show this, we need to assume that

sup_{θ∈Θ} | (1/n) ∑_{i=1}^n log L(θ; y_i) − E_{f(y|θ₀)}[log L(θ; y_i)] | →a.s. 0.



Maximum Likelihood Estimation
Asymptotics of ML Estimators

The intuitive idea is as follows. Remember that

θ₀ = arg max_{θ∈Θ} E_{f(y|θ₀)}[log L(θ; y)],

which follows from the property that

E_{f(y|θ₀)}[ ∂ log L(θ; y)/∂θ |_{θ=θ₀} ] = 0.

Moreover, by definition,

θ̂ = arg max_{θ∈Θ} log L(θ; y).



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Then, since (1/n) ∑_{i=1}^n log L(θ; y_i) →a.s. E_{f(y|θ₀)}[log L(θ; y_i)] uniformly for all θ, we can
intuitively argue that the argument that maximises (1/n) ∑_{i=1}^n log L(θ; y_i) will also converge
to the argument that maximises E_{f(y|θ₀)}[log L(θ; y_i)]. In other words,

(1/n) ∑_{i=1}^n log L(θ; y_i) →a.s. E_{f(y|θ₀)}[log L(θ; y_i)] uniformly for all θ,
so θ̂ = arg max_{θ∈Θ} (1/n) ∑_{i=1}^n log L(θ; y_i) →a.s. arg max_{θ∈Θ} E_{f(y|θ₀)}[log L(θ; y)] = θ₀. (3)

Note that there is no harm in dividing log L(θ; y) by n. This is a monotonic
transformation, so both (1/n) log L(θ; y) and log L(θ; y) will yield the same estimates.
This argument is, of course, not formal. For a more formal proof, see Newey and
McFadden (1994, pp. 2121-2122). We will cover this proof later on.
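The consistency claim in (3) can be illustrated numerically: compute the ML estimate on ever larger prefixes of one simulated sequence and watch it settle at θ₀ (a sketch; θ₀, the seed and the sample sizes are illustrative choices):

```python
import numpy as np

theta0 = 2.0
rng = np.random.default_rng(6)
y = rng.normal(loc=theta0, scale=1.0, size=100_000)

# ML estimator of the mean of an N(theta, 1) sample is the sample mean
for n in (100, 1_000, 10_000, 100_000):
    print(n, y[:n].mean())   # the estimate settles down at theta_0 = 2 as n grows
```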
Asymptotic normality, on the other hand, can be proved by using the tools we have
already acquired. The main ingredient will, again, be Taylor’s expansion.



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Now, suppose that we have already proved that θ̂ →a.s. θ₀ (which implies that θ̂ →p θ₀).
How to show the asymptotic normality of θ̂?
We start with a first-order expansion of the score function about θ̂ = θ₀. Our
attention will again be on the univariate case, where θ is a scalar.
For the sake of notational simplicity, whenever we write, for some value θ̄,

(1/n) ∑_{i=1}^n ∂ log L(θ̄; y_i)/∂θ etc.,

we will mean

(1/n) ∑_{i=1}^n ∂ log L(θ; y_i)/∂θ |_{θ=θ̄}.



Maximum Likelihood Estimation
Asymptotics of ML Estimators

We also assume that, for all θ,

1. E[ ∂² log L(θ; y_i)/∂θ² ] < ∞,
2. 0 < E[ (∂ log L(θ; y_i)/∂θ)² ] < ∞.

Throughout, we also assume that the order of integration and differentiation can be
interchanged and that {∂ log L(θ; y_i)/∂θ}_{i=1,...,n} and {∂² log L(θ; y_i)/∂θ²}_{i=1,...,n}
are both iid sequences.



Maximum Likelihood Estimation
Asymptotics of ML Estimators

We will this time use a mean value expansion. This is almost the same as the Taylor
expansion. The only difference is that, instead of including a remainder term, the
final term in the expansion is evaluated at the so-called mean value.
Hence, a kth order mean value expansion of some function f(x) about x = x₀ is
given by

f(x) = f(x₀) + f⁽¹⁾(x₀)(x − x₀) + ... + [1/(k−1)!] f^(k−1)(x₀)(x − x₀)^{k−1}
       + (1/k!) f^(k)(x̃)(x − x₀)^k,

where x̃ ∈ [min(x, x₀), max(x, x₀)].



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Then, we have

(1/n) ∑_{i=1}^n ∂ log L(θ̂; y_i)/∂θ = (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ
    + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ² ] (θ̂ − θ₀),

where θ̃ ∈ [min(θ̂, θ₀), max(θ̂, θ₀)].
We could also have used

(1/n) ∑_{i=1}^n ∂ log L(θ̂; y_i)/∂θ = (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ
    + [ (1/n) ∑_{i=1}^n ∂² log L(θ₀; y_i)/∂θ² ] (θ̂ − θ₀) + R,

where R is a remainder term, instead. However, a mean value expansion will be
easier for us to work with.



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Now, since θ̂ →a.s. θ₀ and since θ̃ is always between θ̂ and θ₀, we have θ̃ →a.s. θ₀ as well.
Remember that, by definition,

(1/n) ∑_{i=1}^n ∂ log L(θ̂; y_i)/∂θ = 0.

Then, we have

0 = (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ² ] (θ̂ − θ₀),

which implies that

√n (θ̂ − θ₀) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ² ]^{−1} [ √n (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ ],

where there is an LLN for the first bracketed factor and a CLT for the second!



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Let's try to prove that there indeed is a CLT for the highlighted term on the previous
slide.
Remember what we need for the existence of a CLT: an iid sequence with finite mean
and variance.
Firstly, by assumption,

{∂ log L(θ₀; y_i)/∂θ}_{i=1,...,n}

is an iid sequence.
Also, by the unbiasedness of the score and by a moment assumption we made
previously,

E[ ∂ log L(θ₀; y_i)/∂θ ] = 0 < ∞ and 0 < Var[ ∂ log L(θ₀; y_i)/∂θ ] < ∞.

Note that the second result above follows from

E[ (∂ log L(θ₀; y_i)/∂θ)² ] = Var[ ∂ log L(θ₀; y_i)/∂θ ].



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Define

I = Var[ ∂ log L(θ₀; y_i)/∂θ ].

Then, we have the following Central Limit Theorem result:

√n [ (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ − E(∂ log L(θ₀; y_i)/∂θ) ] / √I →d N(0, 1),

where the expectation E(∂ log L(θ₀; y_i)/∂θ) is zero. Equivalently,

√n (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ →d N(0, I).



Maximum Likelihood Estimation
Asymptotics of ML Estimators

What about the LLN result?
Remember that, by assumption, the sequence

{∂² log L(θ; y_i)/∂θ²}_{i=1,...,n}

is iid while

E[ ∂² log L(θ; y_i)/∂θ² ] < ∞.

Then, under some other standard assumptions, one can show that

(1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ² →a.s. E[ ∂² log L(θ₀; y_i)/∂θ² ].

Note that the main tricky point here is that the arguments of the two functions are
different (θ̃ vs θ₀). Nevertheless, this type of result is pretty standard.



Maximum Likelihood Estimation
Asymptotics of ML Estimators

Now we have our CLT and LLN results.
Define

J = −E[ ∂² log L(θ₀; y_i)/∂θ² ].

We can now use Slutsky's Theorem:

√n (θ̂ − θ₀) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ² ]^{−1} [ √n (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ ]
             →d J^{−1} N(0, I) =d N(0, I J^{−2}),

since the first factor converges almost surely to J^{−1} and the second converges in
distribution to N(0, I). Here =d stands for "equal in distribution."
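As an illustration with a model not covered above (my own example for this sketch): for an iid Exponential sample with rate λ₀ and density f(y; λ) = λe^{−λy}, one can show I = J = 1/λ₀², so √n(λ̂ − λ₀) should be approximately N(0, I J^{−2}) = N(0, λ₀²). A simulation check:

```python
import numpy as np

lam0, n, reps = 2.0, 500, 20_000
rng = np.random.default_rng(9)
y = rng.exponential(scale=1.0 / lam0, size=(reps, n))   # iid Exponential(rate lam0)

lam_hat = 1.0 / y.mean(axis=1)        # ML estimator of the rate in each replication
z = np.sqrt(n) * (lam_hat - lam0)     # should be approximately N(0, lam0^2)
print(z.var())                        # close to lam0**2 = 4
```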



Maximum Likelihood Estimation
Asymptotics of ML Estimators

The term

I = Var[ ∂ log L(θ₀; y_i)/∂θ ]

is generally referred to as the (expected) Fisher information.
Moreover,

∂² log L(θ₀; y_i)/∂θ²

is generally known as the Hessian, with the associated expected Hessian given by

J = −E[ ∂² log L(θ₀; y_i)/∂θ² ].



Maximum Likelihood Estimation
Asymptotics of ML Estimators

The quantities I and J can easily be consistently estimated by using their sample
counterparts; that is,

Î = (1/n) ∑_{i=1}^n [ ∂ log L(θ₀; y_i)/∂θ ]² and Ĵ = −(1/n) ∑_{i=1}^n ∂² log L(θ₀; y_i)/∂θ².

Of course, when θ₀ is unknown (as is always the case), one uses the estimated
parameter value, θ̂. Therefore,

Î = (1/n) ∑_{i=1}^n [ ∂ log L(θ̂; y_i)/∂θ ]² and Ĵ = −(1/n) ∑_{i=1}^n ∂² log L(θ̂; y_i)/∂θ²,

and under general assumptions one can ensure that

Î →p I and Ĵ →p J.


Maximum Likelihood Estimation
Asymptotics of ML Estimators

Notice that, by Property (2),

I = J,

implying that

√n (θ̂ − θ₀) →d N(0, I^{−1}).

Remember that

I^{−1} = { Var[ ∂ log L(θ₀; y_i)/∂θ ] }^{−1}

is the Cramér-Rao lower bound. Hence, this confirms that, in this given framework,
the ML estimator's asymptotic variance has the minimum variance property.


Maximum Likelihood Estimation
Asymptotics of ML Estimators

Before we move on, let's do the derivation of the asymptotic distribution for the
case where θ is a (p × 1) vector rather than a scalar.
Again, we start with a Taylor expansion, but note that this time we are dealing with
vectors:

(1/n) ∑_{i=1}^n ∂ log L(θ̂; y_i)/∂θ = (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ
    + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ∂θ′ ] (θ̂ − θ₀), (4)

where the gradient terms are (p × 1) vectors and the Hessian term is (p × p). Here,

∂² log L(θ̃; y_i)/∂θ∂θ′ =
[ ∂² log L(θ̃; y_i)/∂θ₁∂θ₁   ∂² log L(θ̃; y_i)/∂θ₁∂θ₂   ...   ∂² log L(θ̃; y_i)/∂θ₁∂θ_p ]
[ ∂² log L(θ̃; y_i)/∂θ₂∂θ₁   ∂² log L(θ̃; y_i)/∂θ₂∂θ₂   ...   ∂² log L(θ̃; y_i)/∂θ₂∂θ_p ]
[ ...                        ...                        ...   ...                      ]
[ ∂² log L(θ̃; y_i)/∂θ_p∂θ₁  ∂² log L(θ̃; y_i)/∂θ_p∂θ₂  ...   ∂² log L(θ̃; y_i)/∂θ_p∂θ_p ].

A technical aside: the parameter θ̃ appearing in each entry of the above matrix is
understood to (possibly) differ from entry to entry.


Maximum Likelihood Estimation
Asymptotics of ML Estimators

All that we had assumed for the scalar terms in the scalar case is now assumed to
hold entry by entry for all matrices.
Now, rearranging (4) gives

0 = (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ∂θ′ ] (θ̂ − θ₀)
⇒ −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ∂θ′ ] (θ̂ − θ₀) = (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ
⇒ (θ̂ − θ₀) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ∂θ′ ]^{−1} [ (1/n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ ],

where 0 is a (p × 1) vector of zeroes.
Always keep in mind that we are dealing with matrices now. For example, for two
matrices A and B which are (q × q) and (q × 1), respectively, the product AB is
defined but the product cannot simply be reordered: in general, AB ≠ BA!


Maximum Likelihood Estimation
Asymptotics of ML Estimators

Of course, this time we have to use the multivariate versions of I and J:

I = Var[ ∂ log L(θ₀; Y_i)/∂θ ] = E[ (∂ log L(θ₀; Y_i)/∂θ)(∂ log L(θ₀; Y_i)/∂θ′) ],

a (p × p) matrix, and

J = −E[ ∂² log L(θ₀; Y_i)/∂θ∂θ′ ],

also (p × p). Here,

∂ log L(θ₀; y_i)/∂θ = [ ∂ log L(θ₀; y_i)/∂θ₁, ∂ log L(θ₀; y_i)/∂θ₂, ..., ∂ log L(θ₀; y_i)/∂θ_p ]′,

and ∂ log L(θ₀; y_i)/∂θ′ is simply the transpose of this vector.


Maximum Likelihood Estimation
Asymptotics of ML Estimators

These can be consistently estimated using

Î = (1/n) ∑_{i=1}^n [∂ log L(θ̂; y_i)/∂θ][∂ log L(θ̂; y_i)/∂θ′],

and

Ĵ = −(1/n) ∑_{i=1}^n ∂² log L(θ̂; y_i)/∂θ∂θ′.


Maximum Likelihood Estimation
Asymptotics of ML Estimators

Then,

√n (θ̂ − θ₀) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; y_i)/∂θ∂θ′ ]^{−1} [ (1/√n) ∑_{i=1}^n ∂ log L(θ₀; y_i)/∂θ ],

where the first factor converges in probability to J^{−1} and the second converges in
distribution to N(0, I).
Therefore, by Slutsky's Theorem,

√n (θ̂ − θ₀) →d J^{−1} N(0, I) =d N(0, J^{−1} I (J^{−1})′) = N(0, I^{−1}),

where we used the fact that J^{−1} is symmetric and that I = J.
A consistent estimator for the asymptotic variance will be

Ĵ^{−1} Î Ĵ^{−1} →p J^{−1} I J^{−1}.

Also note that this time we have J^{−1} I J^{−1} as the asymptotic covariance matrix,
rather than J^{−2} I, which is the equivalent of J^{−1} I J^{−1} in the scalar case.


Maximum Likelihood Estimation
Quasi Maximum Likelihood Estimation

Up to now, we have assumed that we know the true distribution that generates the
observed data and tried to estimate the true parameter vector, θ₀. In reality, the
data generating process would rarely (if at all) be known by the researcher.
To make this point clear, suppose that the data generating process is given by the
cdf

G_Y(y),

and by its associated density function, g_Y(y).
Do we know what the data generating process is? We might have some idea about
it, but usually the short answer is "no."
So what do we do then? When we model data, we believe (hope?) that the
distribution function we have chosen is indeed the data generating process. Let's
define this chosen distribution as

F_Y(y; θ),

with its associated density function, f_Y(y; θ).
Usually,

G_Y(y) ≠ F_Y(y; θ).

But this is not the end of the story.


Maximum Likelihood Estimation
Quasi Maximum Likelihood Estimation

When the chosen likelihood function is not the true data generating process,
maximum likelihood estimation is known as quasi or pseudo maximum likelihood
estimation.
We will only outline the main ideas without getting into formal details.
Under standard assumptions,

(1/n) ∑_{i=1}^n log L(θ; y_i) →p ∫ log f(y; θ) g(y) dy = E_g[log f(y; θ)],

and therefore, intuitively, one can argue that

θ̂ →p θ*,

where θ* is the value of θ which satisfies

E_g[ ∂ log f(y; θ)/∂θ |_{θ=θ*} ] = 0.

Now, θ* is called the pseudo-true value of θ. This value has a special statistical
meaning.


Maximum Likelihood Estimation
Quasi Maximum Likelihood Estimation

θ is the value of θ that minimises what is known as the Kullback-Leibler


discrepancy:
Z
g (y )
D (f θ , g ) = log g (y ) dy
f (y ; θ )
Z Z
= log [g (y )] g (y ) dy log [f (y ; θ )] g (y ) dy
= Eg [log g (y )] Eg [log f (y ; θ )].

This is a measure of the difference between what we use as our likelihood/density
function, f(y; θ), and the true data generating process, g(y).
This is always non-negative, provided that g(y) and f(y; θ) are continuous in y.
Therefore, the maximum likelihood estimator converges to the value of θ which
minimises the difference between the density we use and the true density.
Put differently, the maximum likelihood estimator converges to the value that
minimises our mistake.
If g(y) = f(y; θ_0), then the density we are using is correctly specified and θ* = θ_0.
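As a numerical sketch of the pseudo-true value (the Exponential(1) DGP, the N(µ, σ²) model and the grid below are illustrative choices, not part of the slides): minimising the Kullback-Leibler discrepancy over (µ, σ²) recovers the true mean and variance of g, here (1, 1).

```python
import numpy as np

# Assumed setup for this illustration: true DGP g = Exponential(1), so
# E[Y] = 1 and Var[Y] = 1; the (misspecified) model density f is N(mu, sig2).
y = np.linspace(1e-6, 30.0, 200_000)
dy = y[1] - y[0]
g = np.exp(-y)  # Exponential(1) density on [0, infinity)

def kl_discrepancy(mu, sig2):
    """D(f_theta, g) = E_g[log g(Y)] - E_g[log f(Y; theta)], via a Riemann sum."""
    log_f = -0.5 * np.log(2 * np.pi * sig2) - (y - mu) ** 2 / (2 * sig2)
    log_g = -y
    return np.sum((log_g - log_f) * g) * dy

# Grid search for the pseudo-true value minimising the discrepancy.
grid = np.linspace(0.5, 1.5, 21)
_, mu_star, sig2_star = min((kl_discrepancy(m, s), m, s)
                            for m in grid for s in grid)
print(mu_star, sig2_star)  # both 1.0: the true mean and variance of g
```

The minimiser sits at the true first two moments even though g is nothing like a normal density, which is exactly the "best misspecified approximation" interpretation of θ*.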


The asymptotic distribution is now

    √n (θ̂ − θ*) →d N(0, J_g^{-1} I_g J_g^{-1}),

where

    I_g = E_g[ (∂ log L(θ*; Y_i)/∂θ)(∂ log L(θ*; Y_i)/∂θ') ]   and   J_g = −E_g[ ∂² log L(θ*; Y_i)/∂θ∂θ' ].

These can consistently be estimated by

    Î_g = (1/n) Σ_{i=1}^n (∂ log L(θ̂; y_i)/∂θ)(∂ log L(θ̂; y_i)/∂θ')   and   Ĵ_g = −(1/n) Σ_{i=1}^n ∂² log L(θ̂; y_i)/∂θ∂θ',

where

    Ĵ_g^{-1} Î_g Ĵ_g^{-1} →p J_g^{-1} I_g J_g^{-1}.
All these results can be proved by using the same arguments as for the case where
the likelihood function is identical to the data generating process. The only
difference is that this time the expansion should be about θ* rather than θ_0.
Clearly, the main ideas are the same as before. What changes is the interpretation of
what θ̂ converges to.

It is crucial to underline that the three nice properties we have proved do not
necessarily hold for quasi maximum likelihood estimation.
In particular, generally,

    I_g ≠ J_g.

Therefore, the asymptotic variance matrix does not simplify to I_g^{-1} anymore.
The term J_g^{-1} I_g J_g^{-1} is known as the sandwich matrix. Likewise, Ĵ_g^{-1} Î_g Ĵ_g^{-1} is
generally called the sandwich estimator.
Remember that all these results are for the iid case. As we relax the iid assumption,
we may have to deal with further issues, especially in terms of consistent estimation
of the sandwich matrix. We will not deal with these.
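A minimal sketch of the sandwich estimator in action (the Exponential(1) data and the normal quasi-likelihood are assumptions of this example): using the analytic scores and Hessians of the normal log-likelihood, Î_g and Ĵ_g disagree, and the robust variance for σ̂² is roughly four times the naive one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.exponential(1.0, size=n)   # true DGP is not normal

# QML estimates under a (misspecified) normal likelihood:
mu, s2 = y.mean(), y.var()
e = y - mu

# Per-observation scores of the normal log-likelihood at the QML estimate.
scores = np.column_stack([e / s2,
                          -0.5 / s2 + e**2 / (2 * s2**2)])
I_hat = scores.T @ scores / n      # outer-product-of-scores estimate

# J_hat = minus the average Hessian (analytic second derivatives).
J_hat = -np.array([[-1.0 / s2,           -e.mean() / s2**2],
                   [-e.mean() / s2**2,   0.5 / s2**2 - (e**2).mean() / s2**3]])

sandwich = np.linalg.inv(J_hat) @ I_hat @ np.linalg.inv(J_hat)
naive = np.linalg.inv(J_hat)       # correct only if I_g = J_g

# For Exp(1): asymptotic variance of sigma^2-hat is mu_4 - sigma^4 = 8,
# while the naive information-equality answer would be 2*sigma^4 = 2.
print(sandwich[1, 1], naive[1, 1])
```

The naive entry is close to 2 while the sandwich entry is close to 8, so inference based on I_g^{-1} alone would badly understate the uncertainty here.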


In many cases, convergence to the true parameter value is still possible, even when
the chosen density function is not the correct one.
Example: We suppose that Y_1, ..., Y_n is an iid sequence from N(µ, σ²). However,
the truth is that although the Y_i are iid, the distribution is not Normal. In addition,
the true mean and variance are given by µ_0 and σ_0², respectively.
Let θ̂ = (µ̂, σ̂²) and θ_0 = (µ_0, σ_0²). Now, in order to ensure θ̂ →p θ_0, we need to have

    E_g[ ∂ log L(θ, Y_i)/∂θ ]|_{θ=θ_0} = 0.

Let's check this. The log-likelihood function is given by

    log L(θ) = −(n/2) log 2π − (n/2) log σ² − (1/2) Σ_{i=1}^n ((y_i − µ)/σ)².


Then, we have

    E_g[ ∂ log L(θ, Y_i)/∂σ² ]|_{θ=θ_0} = −n/(2σ_0²) + (1/(2σ_0⁴)) E_g[ Σ_{i=1}^n (Y_i − µ_0)² ].

Now,

    E_g[(Y_i − µ_0)²] = σ_0²,

by definition, so

    E_g[ ∂ log L(θ, Y_i)/∂σ² ]|_{θ=θ_0} = −n/(2σ_0²) + (1/(2σ_0⁴)) n σ_0² = 0.

Moreover,

    E_g[ ∂ log L(θ, Y_i)/∂µ ]|_{θ=θ_0} = (1/σ_0²) E_g[ Σ_{i=1}^n (Y_i − µ_0) ] = 0.


This confirms that

    E_g[ ∂ log L(θ, Y_i)/∂θ ]|_{θ=θ_0} = 0.

Therefore, despite misspecification, the QML estimator is still consistent for
(µ_0, σ_0²)!
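A quick simulation sketch of this consistency result (the chi-square DGP and the specific constants are illustrative assumptions, not from the slides): the normal MLEs µ̂ = ȳ and σ̂² = (1/n) Σ (y_i − ȳ)² approach (µ_0, σ_0²) as n grows, even though the data are visibly non-normal.

```python
import numpy as np

rng = np.random.default_rng(42)
mu0, sig2_0 = 2.0, 3.0  # true mean and variance (illustrative values)

for n in (100, 10_000, 1_000_000):
    x = rng.chisquare(3, size=n)              # iid, skewed, clearly non-normal
    y = mu0 + np.sqrt(sig2_0 / 6) * (x - 3)   # chi2(3) has mean 3, variance 6
    mu_hat, sig2_hat = y.mean(), y.var()      # normal (Q)MLEs of mu and sigma^2
    print(n, round(mu_hat, 3), round(sig2_hat, 3))
```

At n = 1,000,000 the estimates sit within sampling noise of (2, 3), illustrating convergence to the true mean and variance under misspecification.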

Maximum Likelihood Estimation
MLE in Action: Financial Volatility Estimation

[Figure: two panels showing S&P500 daily returns (%) and S&P500 squared daily returns (%).]
Figure: Daily returns and squared returns on S&P500 index from 2 February 2001 to 16 January
2008.


Engle’s (1982) idea: there is no predictability in asset returns, but there is some
predictability in the asset return volatility.
High volatility periods are clustered together. Likewise, low volatility periods are
clustered together.
Here is the model. Let r_t be the return on some asset at time t, t = 1, ..., T. Let
F_{t−1} be the information set at time t−1 (e.g. all stock returns, volatilities etc. up
to and including period t−1).
Suppose for simplicity that E[r_t | F_{t−1}] = 0 for all t. This is not a crazy assumption
for stock returns. Indeed, they simply fluctuate around zero.


Engle (1982) proposed the Autoregressive Conditional Heteroskedasticity (ARCH)
model, given by

    r_t = E[r_t | F_{t−1}] + ε_t,
    ε_t | F_{t−1} ~ N(0, σ_t²),
    σ_t² = ω + α ε_{t−1}²,    ω > 0, α ≥ 0.

His doctoral student Tim Bollerslev came up with an extension in 1986, which
proved to be one of the most popular models in econometrics: the Generalised
ARCH (GARCH) model, given by

    σ_t² = ω + α ε_{t−1}² + β σ_{t−1}².

This paper is among the most cited papers published in the Journal of Econometrics.
This is not a small difference. If anything, GARCH has been more successful empirically.
In 2003, Robert Engle shared the Nobel prize with Sir Clive Granger.
The idea is simple: today's volatility is affected by (i) yesterday's shock (ε_{t−1}²) and
(ii) yesterday's volatility (σ_{t−1}²).
This model is usually estimated using the ML method.

Now, when E[r_t | F_{t−1}] = 0, we have

    r_t | F_{t−1} ~ N(0, σ_t²),
    σ_t² = ω + α ε_{t−1}² + β σ_{t−1}²,    ω > 0, α, β ≥ 0 and α + β < 1.

Then,

    L(ω, α, β; r_t | F_{t−1}) = [1/√(2πσ_t²)] exp[ −(1/2)(r_t/σ_t)² ],

and

    ∏_{t=2}^T L(ω, α, β; r_t | F_{t−1}) = [ (2π)^{(T−1)/2} ∏_{t=2}^T σ_t ]^{-1} exp[ −(1/2) Σ_{t=2}^T (r_t/σ_t)² ].

The first observation is dropped due to the conditioning: t = 1, ..., T, so we have to
start with the second observation as we will condition on the first observation.
The joint density is obtained by using the prediction decomposition argument, which
we mentioned at the beginning of this slide set.


Now, the log-likelihood function is given by

    ℓ(ω, α, β; r_1, ..., r_T) = −((T−1)/2) log 2π − (1/2) Σ_{t=2}^T log σ_t² − (1/2) Σ_{t=2}^T (r_t/σ_t)².

Bollerslev and Wooldridge (1992, Econometric Reviews) showed that under certain
assumptions, even if the likelihood function is misspecified, we will still have

    θ̂ →p θ_0.

So, it is fine to use the Quasi Maximum Likelihood method here.
Unfortunately, we cannot solve this model in closed form due to the recursive
structure of σ_t². You can try to see this for yourselves by trying to take the first-order
derivative with respect to, say, α.
So, estimation is conducted by computer software.
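To make this concrete, here is a sketch of what such software does (the simulated data, starting values, variance initialisation and optimiser choice are all assumptions of this example, not prescriptions from the slides): simulate a Gaussian GARCH(1,1) and maximise the conditional log-likelihood above numerically.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T = 10_000
omega0, alpha0, beta0 = 0.1, 0.1, 0.8   # illustrative true parameters

# Simulate r_t | F_{t-1} ~ N(0, sigma_t^2) with a GARCH(1,1) variance.
r = np.empty(T)
eps_prev, s2_prev = 0.0, omega0 / (1 - alpha0 - beta0)  # unconditional variance
for t in range(T):
    s2_t = omega0 + alpha0 * eps_prev**2 + beta0 * s2_prev
    r[t] = np.sqrt(s2_t) * rng.standard_normal()
    eps_prev, s2_prev = r[t], s2_t

def neg_loglik(theta):
    """Negative conditional log-likelihood, dropping the first observation."""
    omega, alpha, beta = theta
    s2 = r.var()                     # one common choice for the initial variance
    nll = 0.0
    for t in range(1, T):
        s2 = omega + alpha * r[t - 1] ** 2 + beta * s2
        nll += 0.5 * (np.log(2 * np.pi) + np.log(s2) + r[t] ** 2 / s2)
    return nll

# Constrained optimisation: keep omega > 0 and alpha, beta in [0, 1].
res = minimize(neg_loglik, x0=[0.05, 0.05, 0.9],
               bounds=[(1e-6, 1.0), (0.0, 1.0), (0.0, 1.0)],
               method="L-BFGS-B")
omega_hat, alpha_hat, beta_hat = res.x
print(res.x)  # close to the simulated (0.1, 0.1, 0.8)
```

This mirrors the workflow described below: code up the log-likelihood, declare the parameters and their admissible ranges, and let the optimiser search for the maximum.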


Examples of popular computer programmes/languages are Matlab, C++, R, SAS,
etc.
How we estimate this model using a computer programme is as follows.
1 We write the log-likelihood function as a code and feed this into the software as the
objective function to be maximised. Most importantly, we "tell" the software that our
parameters are ω, α, β.
2 The computer then uses an optimisation algorithm. We can do constrained or
unconstrained optimisation. In the case of constrained optimisation we should, for
example, define the parameter space. For the GARCH example, we do not want
Matlab to look for values of ω below zero, for example.
3 The computer simultaneously tries different values of (ω, α, β) until it has decided that
the maximum of the objective function (the log-likelihood function) has been achieved.
4 In some cases, if the software cannot find the maximum (because, for example, the
objective function is flat, or is not concave, or there are many local maxima or some
other reason), then we get an error message along with (hopefully) an explanation.
Actually, this model is so well established that it is very likely that your favourite
econometrics software will have an option to do GARCH estimation. So, all you
have to do is to upload the data and click on a few buttons.


Here are some examples of parameter estimates.

    Stock               Annual Variance    α        β
    Amazon              46.78%             .0133    .9840
    IBM                  9.47%             .0741    .9188
    JP Morgan           17.95%             .0719    .9261
    Procter & Gamble     6.66%             .0285    .9693
    Walt Disney         12.53%             .0808    .9090

Table: GARCH parameter estimates for a selection of stocks. The estimation period is from 4
Jan 2000 to 1 Dec 2008.


[Figure: fitted variances (%) using the GARCH model for Amazon, IBM and JP Morgan, 2001-2008.]
Appendix: Estimator Consistency

Before we finish this part, let's look at the basic consistency proof for maximum
likelihood (and related) estimators.
Now, as before, for some iid sequence Y_i, i = 1, ..., n, let the log-likelihood function
be given by ℓ(θ; y_i), where θ is a possibly vector-valued parameter.
Remember that the maximum likelihood estimator that we considered so far is given
by

    θ̂ = arg max_{θ∈Θ} (1/n) Σ_{i=1}^n ℓ(θ; y_i),    where Θ is the parameter space.

What is the basic proof for θ̂ →p θ_0?


The following is a slightly less difficult version of Theorem 2.1 (and of its proof)
from Newey and McFadden (1994, pp. 2121-2122).
Let Q̂_n(θ) = (1/n) Σ_{i=1}^n ℓ(θ; y_i) and Q_0(θ) = E[ℓ(θ; y_i)].
Theorem 2.1 (based on Newey and McFadden, 1994): If there is a function
Q_0(θ) such that (i) for each η > 0,

    Q_0(θ_0) − sup_{θ: ||θ−θ_0|| > η} Q_0(θ) > 0,

and (ii)

    sup_{θ∈Θ} | Q̂_n(θ) − Q_0(θ) | →p 0,

then

    θ̂ →p θ_0.

Another way of stating Assumption (i) is to say that θ_0 is uniquely identifiable.
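A small simulation sketch of what the theorem asserts (the N(θ_0, 1) model, the grid standing in for a compact Θ, and the sample sizes are all choices made for this illustration): as n grows, the uniform gap sup_θ |Q̂_n(θ) − Q_0(θ)| shrinks and the maximiser of Q̂_n approaches θ_0.

```python
import numpy as np

rng = np.random.default_rng(7)
theta0 = 1.5
grid = np.linspace(-1.0, 4.0, 501)  # a grid over a compact parameter space Theta

# Model: l(theta; y) = log N(y; theta, 1). For Y ~ N(theta0, 1),
# Q0(theta) = E[l(theta; Y)] = -0.5*log(2*pi) - 0.5*(1 + (theta - theta0)^2).
Q0 = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + (grid - theta0) ** 2)

for n in (50, 5_000, 500_000):
    y = rng.normal(theta0, 1.0, size=n)
    m1, m2 = y.mean(), (y ** 2).mean()
    # Qn(theta) = (1/n) sum_i l(theta; y_i), written in terms of sample moments.
    Qn = -0.5 * np.log(2 * np.pi) - 0.5 * (m2 - 2 * grid * m1 + grid ** 2)
    sup_gap = np.abs(Qn - Q0).max()
    theta_hat = grid[Qn.argmax()]
    print(n, round(sup_gap, 4), theta_hat)
```

The printed sup-gap falls towards zero with n while the grid maximiser settles at θ_0, matching conditions (ii) and the conclusion of the theorem.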


Proof: For any ε > 0, we can show that with probability approaching one (w.p.a.1),

    (i):   Q̂_n(θ̂) > Q̂_n(θ_0) − ε/3,
    (ii):  Q_0(θ̂) > Q̂_n(θ̂) − ε/3,
    (iii): Q̂_n(θ_0) > Q_0(θ_0) − ε/3.

How? (i) follows from the fact that θ̂ is the maximiser of Q̂_n(θ); (ii) and (iii) follow
from the uniform convergence result, which implies that w.p.a.1

    Q̂_n(θ) − Q_0(θ) < ε/3 and Q_0(θ) − Q̂_n(θ) < ε/3, for any θ ∈ Θ.

Then, w.p.a.1,

    Q̂_n(θ̂) − Q_0(θ̂) < ε/3 and Q_0(θ_0) − Q̂_n(θ_0) < ε/3.


Now, using (i), (ii) and (iii), we get, w.p.a.1,

    Q_0(θ̂) > Q̂_n(θ̂) − ε/3 > Q̂_n(θ_0) − 2ε/3 > Q_0(θ_0) − ε,

where the three inequalities follow from (ii), (i) and (iii), respectively.
Therefore, for any ε > 0, Q_0(θ̂) > Q_0(θ_0) − ε, w.p.a.1.
Which ε to choose? Actually, let our choice for ε be such that

    ε = Q_0(θ_0) − sup_{θ: ||θ−θ_0|| > η} Q_0(θ) > 0.


Then, w.p.a.1,

    Q_0(θ̂) > Q_0(θ_0) − [ Q_0(θ_0) − sup_{θ: ||θ−θ_0|| > η} Q_0(θ) ] = sup_{θ: ||θ−θ_0|| > η} Q_0(θ).

Therefore, w.p.a.1, θ̂ ∉ {θ : ||θ − θ_0|| > η} or, equivalently, θ̂ ∈ {θ : ||θ − θ_0|| ≤ η}.
Now, this is all valid for any ε > 0. ε can be arbitrarily close to zero, which requires
η to be arbitrarily close to zero, which in turn implies that θ̂ will be arbitrarily close
to θ_0, w.p.a.1. In other words,

    θ̂ →p θ_0.


In the original version of the Theorem, Assumption (i) is replaced by the following
assumptions:

1 Q0 (θ ) is uniquely maximised at θ 0 ,
2 Θ is compact,
3 Q0 (θ ) is continuous.

These three assumptions can also be used to obtain Assumption (i).
