
Slides 7

Bilkent

Introduction

In this part, we will talk about estimation. Our focus will almost exclusively be on

the maximum likelihood method.

We have worked with many distributions so far, calculating their expectations and variances, or deriving their moment generating functions, etc.

Importantly, the setting was such that we knew what distribution we were

considering AND we had full knowledge of the parameter values for these

distributions. Or, to put it more precisely, we never contemplated the possibility that

they might not be known.

Then, there are two implicit assumptions:

1 We know the distribution.

2 We know the parameters of the distribution.

In real life, this is rarely the case. We will first relax the second assumption and later on dispense with the first assumption.

The treatment will sacrifice formality and rather focus on ideas. References for more formal treatments will be provided at the end of this set of slides.

Introduction

Now, let's assume we have a random sample consisting of X1, X2, ..., Xn from the density $f_X(x \mid \theta_0)$.

We would like to determine the value of θ 0 , which is unknown.

We could use an estimator.

An estimator is some function of the data,
$$\hat{\theta}_n = W(X_1, \dots, X_n). \qquad (1)$$
The index n underlines the fact that the particular value of the estimate depends on the sample (and, so, on its size). Note that usually n is dropped and simply θ̂ is used instead.

Note the difference between the estimator and the estimate. The estimator is a concept, while the estimate is the value of the estimator for a given sample. So, if the estimator is W(X1, ..., Xn), then the estimate for a particular realisation of X1, ..., Xn is given by W(x1, ..., xn).
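To make this distinction concrete, here is a minimal Python sketch (the sample values are made up for illustration): the function W is the estimator, and the number it returns for a particular sample is the estimate.

```python
import numpy as np

def W(sample):
    """An estimator: a rule (function) mapping any sample to a number.
    Here the rule is the sample mean."""
    return np.mean(sample)

# A particular realisation x_1, ..., x_n (hypothetical data)
x = np.array([1.2, -0.4, 0.7, 2.1, 0.3])

estimate = W(x)   # the estimate: the value of the estimator for this sample
print(estimate)
```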

Introduction

Now, although the definition given in (1) implies that any function of the data could be a valid estimator, we usually look for those that have desirable properties.

In other words, an estimator is a statistic (so it cannot depend on θ or on any other unknown parameters), and we look for statistics that have desirable properties.

We have actually introduced one of these desirable properties: consistency. Others

are unbiasedness, minimum mean squared error, minimum variance etc.

Let Θ be the parameter space for θ. An estimator θ̂ of θ_0 is a minimum mean squared error estimator if, for every θ_0 ∈ Θ,
$$\hat{\theta} = \arg\min_{\theta \in \Theta} E\left[(\theta - \theta_0)^2\right].$$
An estimator θ̂ is unbiased if $E[\hat{\theta}] = \theta_0$.

You will learn more about these in your future econometrics courses.

Maximum Likelihood Estimation

Motivation and the Main Ideas

Let us first dissect the notation. Suppose we are dealing with some generic distribution such that
$$F_Y(y; \theta), \qquad \theta \in \Theta.$$

F is the cdf, Y is the random variable and y is a particular realisation of Y .

θ is a vector which contains the distribution parameters. This is generally known as

the parameter vector.

The parameter vector takes on values on a set, Θ, known as the parameter space.

For example, for a normal random variable, θ = (μ, σ²)′ and Θ = {(μ, σ²) : −∞ < μ < ∞, σ² > 0}.

Maximum Likelihood Estimation

Motivation and the Main Ideas

However, usually θ is unknown. How to find out the value of θ?

We have to distinguish between the population and the sample. The population contains the unknown values. The sample, on the other hand, can only provide an approximation.

For example,

θ : population,

θ̂ : sample.

The maximum likelihood method is a very popular and powerful method for estimating θ when the underlying distribution function, F_Y, is known (or when one believes that one actually knows the underlying distribution).

Maximum Likelihood Estimation

Motivation and the Main Ideas

The maximum likelihood method is built around the so-called likelihood function.

Where to find this "likelihood function?" It's actually pretty easy!

The likelihood function is the same as the probability density function:
$$f_Y(y; \theta) = L(\theta; y).$$

The difference is one of interpretation: when we consider a density function, we implicitly consider θ as fixed and y as random. When we consider a likelihood function, we assume that the data, y, are given and fixed. Instead, it is θ which is varied.

How to make sense of this? MLE is based on the idea that, if we know the

underlying distribution function, then we should choose θ such that the probability

of the data, y , being observed is maximised.

In other words, we are trying to find the values of θ which are most likely to have generated the observed data. This likelihood principle is due to R. A. Fisher.

Maximum Likelihood Estimation

Motivation and the Main Ideas

Figure: We have 10,000 iid observations from the true distribution N(0, σ²), shown as a histogram together with two candidate densities, μ = 0, σ² = 1 and μ = 0, σ² = 4. Which one is the correct one? Can we use the data to make a decision?
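A minimal sketch of the likelihood principle behind this figure (the simulated data and the true variance of 4 are illustrative assumptions): compute the Gaussian log-likelihood of the sample under each candidate σ² and compare.

```python
import numpy as np

# Hypothetical illustration: 10,000 draws from N(0, sigma0^2) with sigma0^2 = 4,
# then compare the Gaussian log-likelihood under the two candidate variances.
rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=2.0, size=10_000)

def gaussian_loglik(y, mu, sigma2):
    """Log-likelihood of an iid N(mu, sigma2) sample."""
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

for sigma2 in (1.0, 4.0):
    print(f"sigma^2 = {sigma2}: log-likelihood = {gaussian_loglik(y, 0.0, sigma2):.1f}")
# The candidate closest to the true variance attains the larger log-likelihood.
```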


Maximum Likelihood Estimation

Motivation and the Main Ideas

The dataset would preferably consist of many observations on the same random

variable. This ensures that we have sufficient information to estimate θ. Consider

some simple examples.

Example: Let Y_i, i = 1, ..., n, be an iid random sequence. Let also $Y_i \sim N(\mu, \sigma^2)$, where $\Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\ \sigma^2 > 0\}$ gives the parameter space. Then, thanks to the independence assumption, the joint likelihood function is given by
$$f_Y(y; \theta) = \prod_{i=1}^{n} f_{Y_i}(y_i; \theta),$$
where $y = (y_1, \dots, y_n)$.

Notice that the parameter vector, θ, is common for all variables.

Maximum Likelihood Estimation

Motivation and the Main Ideas

Example: Let $(Y_i, X_i)$, i = 1, ..., n, be an iid random sequence. Let also $Y_i \mid X_i = x_i \sim N(\beta' x_i, \sigma^2)$, where $x_i = (x_{i1}, \dots, x_{ik})'$ and $\beta = (\beta_1, \dots, \beta_k)'$.

Let, also, $y = (y_1, \dots, y_n)$ and $x = (x_1, \dots, x_n)$. Then,
$$f_{Y \mid X = x}(y; \theta) = \prod_{i=1}^{n} f_{Y_i \mid X_i = x_i}(y_i; \theta).$$
An example? A possible structure for $Y_i \mid X_i = x_i$ would be
$$y_i = 0.3\,x_{i1} + 0.4\,x_{i2} + u_i, \qquad u_i \overset{iid}{\sim} N(0, 1), \qquad x_i \sim t_6,$$
where $u_i$ is independent of $x_i$. Here,
$$\beta = (0.3, 0.4)', \qquad \sigma^2 = 1 \qquad \text{and} \qquad x_i = (x_{i1}, x_{i2})'.$$
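As a concrete illustration of this conditional model, here is a small simulation sketch of the hypothetical design above (β = (0.3, 0.4)′, σ² = 1, regressors drawn from a t₆ distribution); the conditional log-likelihood is then the sum of normal log densities evaluated at y_i − β′x_i.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

# Regressors: iid draws from a t-distribution with 6 degrees of freedom
x = rng.standard_t(df=6, size=(n, 2))

beta = np.array([0.3, 0.4])       # true coefficient vector
u = rng.normal(0.0, 1.0, size=n)  # errors, independent of x

y = x @ beta + u                  # y_i | x_i ~ N(beta' x_i, 1)
```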

Maximum Likelihood Estimation

A very popular example (especially in financial econometrics) is the autoregressive conditional heteroskedasticity (ARCH) model due to Engle (1982, Econometrica).

Example: Let Y_t be the daily return on some equity on day t, where t = 1, ..., T. The model is given by
$$Y_t \mid Y_{t-1} = y_{t-1} \sim N(0, \sigma_t^2), \qquad \text{where} \qquad \sigma_t^2 = \omega + \alpha\, y_{t-1}^2.$$

Maximum Likelihood Estimation

In such dependent-data settings the joint likelihood is built up by the prediction decomposition. Omitting the arguments of the likelihood/density function for conciseness, we obtain
$$f_{Y_1,\dots,Y_T} = f_{Y_2,\dots,Y_T \mid Y_1}\, f_{Y_1} = f_{Y_3,\dots,Y_T \mid Y_1, Y_2}\, f_{Y_2 \mid Y_1}\, f_{Y_1} = f_{Y_4,\dots,Y_T \mid Y_1, Y_2, Y_3}\, f_{Y_3 \mid Y_1, Y_2}\, f_{Y_2 \mid Y_1}\, f_{Y_1} = \cdots = \prod_{t=1}^{T} f_{Y_t \mid Y_{t-1}, \dots, Y_1},$$
with the convention that the t = 1 term is simply $f_{Y_1}$. In the ARCH example, the distribution of $Y_t$ depends on the past only through $Y_{t-1}$, so
$$f_{Y_1,\dots,Y_T} = f_{Y_1}\prod_{t=2}^{T} f_{Y_t \mid Y_{t-1}}.$$
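A minimal sketch of how the prediction decomposition translates into code for the ARCH(1) example above (the parameter values ω = 0.1, α = 0.3 and the data are purely illustrative): each term in the product is a conditional normal density, so the conditional log-likelihood is a simple sum.

```python
import numpy as np

def arch1_conditional_loglik(y, omega, alpha):
    """Log-likelihood of y_2, ..., y_T conditional on y_1 for an ARCH(1) model:
    Y_t | Y_{t-1} ~ N(0, sigma_t^2) with sigma_t^2 = omega + alpha * y_{t-1}^2."""
    sigma2 = omega + alpha * y[:-1] ** 2            # sigma_t^2 for t = 2, ..., T
    return np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + y[1:] ** 2 / sigma2))

# Illustrative data and parameter values
rng = np.random.default_rng(1)
y = rng.normal(size=500)
print(arch1_conditional_loglik(y, omega=0.1, alpha=0.3))
```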

Maximum Likelihood Estimation

Now that we know how to construct the joint likelihood function for a collection of

random variables Y1 , ..., Yn , we can start thinking about how to estimate parameters

by MLE.

Remember our discussion about the logic behind MLE. The idea is to find the values of the parameters that maximise the likelihood of obtaining the data that we observe in the sample.

Our notation is

L (θ; y ) = fY (y ; θ ) ,

where θ and y are the parameter and data matrices, respectively.

Usually, it is more convenient to use the log-likelihood, which is ℓ(θ; y) = log L(θ; y).

Notice that log is a monotone transformation. Hence, as will be obvious in a moment, for our purposes there is no difference between using ℓ(θ; y) and L(θ; y).

The maximum likelihood method is based on finding the parameter values which maximise the likelihood (or probability) of obtaining the particular sample we have:
$$\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta; y).$$

Maximum Likelihood Estimation

Hence, the likelihood function is the objective function. Consequently, there must be

a first-order condition.

Caution: never confuse estimator with estimate!

This first-order condition has a special name: the score. The score is a key concept and deeply influences the behaviour of the ML estimator.

Let θ be a (k × 1) vector. When the derivative exists, the score is given by
$$\frac{\partial \log L(\theta; y)}{\partial \theta} = \frac{\partial \ell(\theta; y)}{\partial \theta}.$$
Of course, this is a (k × 1) vector, as well.

Consequently, θ̂ is the value of θ which satisfies
$$\left.\frac{\partial \log L(\theta; y)}{\partial \theta}\right|_{\theta = \hat{\theta}} = 0,$$
together with the second-order condition
$$\left.\frac{\partial^2 \log L(\theta; y)}{\partial \theta\,\partial \theta'}\right|_{\theta = \hat{\theta}} < 0$$
(negative definite in the vector case).


Maximum Likelihood Estimation

Let, as before, $y = (y_1, \dots, y_n)$. The joint likelihood is given by
$$L(\theta; y) = \prod_{i=1}^{n} f_{Y_i}(y_i; \theta) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right\}.$$
Then,
$$\ell(\theta; y) = \log L(\theta; y) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2.$$
Obviously, θ = (μ, σ²)′. Let's find the ML estimators.

Maximum Likelihood Estimation

Now,
$$\left.\frac{\partial \ell(\theta; y)}{\partial \mu}\right|_{\mu=\hat\mu,\ \sigma^2=\hat\sigma^2} = \frac{1}{\hat\sigma^2}\sum_{i=1}^{n}(y_i - \hat\mu) = 0,$$
and
$$\left.\frac{\partial \ell(\theta; y)}{\partial \sigma^2}\right|_{\mu=\hat\mu,\ \sigma^2=\hat\sigma^2} = -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\sum_{i=1}^{n}(y_i - \hat\mu)^2 = 0.$$
Solving these two equations gives
$$\hat\mu = \bar y = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad \text{and} \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\mu)^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2.$$
Therefore, $\hat\theta = \left(\bar y,\ \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2\right)'$.
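As a quick numerical sanity check (a sketch, with simulated data), one can maximise the normal log-likelihood directly and compare the result with the closed-form expressions ȳ and (1/n)Σ(yᵢ − ȳ)².

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=1_000)   # simulated sample

def neg_loglik(theta, y):
    mu, sigma2 = theta
    n = y.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((y - mu) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 1.0], args=(y,),
               bounds=[(None, None), (1e-8, None)])   # keep sigma^2 > 0
print(res.x)                                # numerical ML estimates
print(y.mean(), y.var(ddof=0))              # closed-form ML estimates
```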

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

In the next few slides, we will cover some important common properties of likelihood

functions.

In this discussion, we will assume that the data generating process and the chosen underlying distribution are the same:
$$g_Y(y) = f_Y(y; \theta_0).$$
This is not necessarily true in general. In fact, thinking about what happens when the two differ is crucial.

We will do this later. For the time being, we will stick to the simpler case.

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Property 1 (zero expected score): at the true parameter value,
$$E_f\left[\left.\frac{\partial \log L(\theta; y)}{\partial \theta}\right|_{\theta=\theta_0}\right] = 0,$$
where the expectation is taken with respect to $f_Y(y; \theta_0)$.

Proof: Now,
$$\frac{\partial \log L(\theta; y)}{\partial \theta} = \frac{1}{L(\theta; y)}\frac{\partial L(\theta; y)}{\partial \theta}.$$
Then,
$$E_f\left[\frac{\partial \log L(\theta; y)}{\partial \theta}\right] = \int \frac{1}{L(\theta; y)}\frac{\partial L(\theta; y)}{\partial \theta}\, f(y; \theta_0)\,dy,$$
by definition. Observe that this is a function of both θ and θ_0.

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Now,
$$E_f\left[\left.\frac{\partial \log L(\theta; y)}{\partial \theta}\right|_{\theta=\theta_0}\right] = \int \left.\frac{\partial L(\theta; y)}{\partial \theta}\right|_{\theta=\theta_0}\frac{1}{L(\theta_0; y)}\, f(y; \theta_0)\,dy = \int \left.\frac{\partial L(\theta; y)}{\partial \theta}\right|_{\theta=\theta_0} dy = \left.\frac{\partial}{\partial \theta}\int f(y; \theta)\,dy\,\right|_{\theta=\theta_0} = \left.\frac{\partial}{\partial \theta}\,1\right|_{\theta=\theta_0} = 0,$$
where we implicitly assumed that the order of integration and differentiation can be exchanged. An aside: this requires that the range of y does not depend on θ.

Hence, the expectation of the first-order condition, evaluated at the true parameter value, is zero!


Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Property 2 (information equality):
$$\mathrm{Cov}_f\left(\left.\frac{\partial \log L(\theta; Y)}{\partial \theta}\right|_{\theta=\theta_0}\right) = -E_f\left[\left.\frac{\partial^2 \log L(\theta; Y)}{\partial \theta\,\partial \theta'}\right|_{\theta=\theta_0}\right],$$
where, as before, the expectation and the covariance are taken with respect to $f_Y(y; \theta_0)$.

Proof: Now, one can show that
$$\frac{\partial^2 \log L(\theta; y)}{\partial \theta\,\partial \theta'} = \frac{1}{L(\theta; y)}\frac{\partial^2 L(\theta; y)}{\partial \theta\,\partial \theta'} - \frac{1}{[L(\theta; y)]^2}\frac{\partial L(\theta; y)}{\partial \theta}\frac{\partial L(\theta; y)}{\partial \theta'}.$$
Then,
$$E_f\left[\frac{\partial^2 \log L(\theta; y)}{\partial \theta\,\partial \theta'}\right] = \int \frac{1}{L(\theta; y)}\frac{\partial^2 L(\theta; y)}{\partial \theta\,\partial \theta'}\, f(y; \theta_0)\,dy - \int \frac{1}{[L(\theta; y)]^2}\frac{\partial L(\theta; y)}{\partial \theta}\frac{\partial L(\theta; y)}{\partial \theta'}\, f(y; \theta_0)\,dy.$$

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Evaluating the first term at θ = θ_0,
$$\int \left.\frac{1}{L(\theta; y)}\frac{\partial^2 L(\theta; y)}{\partial \theta\,\partial \theta'}\right|_{\theta=\theta_0} f(y; \theta_0)\,dy = \int \left.\frac{\partial^2 L(\theta; y)}{\partial \theta\,\partial \theta'}\right|_{\theta=\theta_0}\frac{1}{L(\theta_0; y)}\, f(y; \theta_0)\,dy = \int \left.\frac{\partial^2}{\partial \theta\,\partial \theta'}L(\theta; y)\right|_{\theta=\theta_0} dy = \left.\frac{\partial^2}{\partial \theta\,\partial \theta'}\,1\right|_{\theta=\theta_0} = 0.$$
These hold if, of course, we can exchange the order of integration and differentiation, which is implicitly assumed here.

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Then,
$$-E_f\left[\left.\frac{\partial^2 \log L(\theta; y)}{\partial \theta\,\partial \theta'}\right|_{\theta=\theta_0}\right] = \int \left.\frac{1}{[L(\theta; y)]^2}\frac{\partial L(\theta; y)}{\partial \theta}\frac{\partial L(\theta; y)}{\partial \theta'}\right|_{\theta=\theta_0} f(y; \theta_0)\,dy = \int \left.\frac{\partial \log L(\theta; y)}{\partial \theta}\frac{\partial \log L(\theta; y)}{\partial \theta'}\right|_{\theta=\theta_0} f(y; \theta_0)\,dy = E_f\left[\left.\frac{\partial \log L(\theta; y)}{\partial \theta}\frac{\partial \log L(\theta; y)}{\partial \theta'}\right|_{\theta=\theta_0}\right] = \mathrm{Cov}_f\left(\left.\frac{\partial \log L(\theta; y)}{\partial \theta}\right|_{\theta=\theta_0}\right),$$
since the score evaluated at θ_0 is zero-mean (Property 1).

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Property 3 (Cramér-Rao Inequality): Let θ̃ be some estimator of θ_0 and assume that $E_f[\tilde\theta] = \theta_0$. Then,
$$\mathrm{Var}_f\left(\tilde\theta\right) - \left\{\mathrm{Var}_f\left(\left.\frac{\partial \log L(\theta; y)}{\partial \theta}\right|_{\theta=\theta_0}\right)\right\}^{-1} \geq 0,$$
in the sense that the difference between the two matrices is non-negative definite.

Proof: For simplicity of exposition we focus on the univariate case, where θ is a scalar. Since θ̃ is unbiased for θ_0, we have
$$E_f[\tilde\theta] = \int \tilde\theta\, f(y; \theta_0)\,dy = \theta_0.$$
Because unbiasedness holds at every parameter value, $\int \tilde\theta\, f(y; \theta)\,dy = \theta$, and differentiating both sides with respect to θ and evaluating at θ = θ_0 gives
$$1 = \left.\frac{\partial}{\partial \theta}\int \tilde\theta\, f(y; \theta)\,dy\right|_{\theta=\theta_0} = \int \tilde\theta \left.\frac{\partial f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0} dy = \int \tilde\theta\,\frac{\left.\partial f(y; \theta)/\partial\theta\right|_{\theta=\theta_0}}{f(y; \theta_0)}\, f(y; \theta_0)\,dy = \int \tilde\theta \left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0} f(y; \theta_0)\,dy = E_f\left[\tilde\theta \left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0}\right].$$

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

This last expression is closely related to the covariance between θ̃ and $\left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0}$.

Remember that
$$E_f\left[\left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0}\right] = 0. \qquad (2)$$
Since Cov(X, Y) = E[XY] − E[X]E[Y], and the score has zero mean by (2), here Cov(X, Y) = E[XY]. Hence,
$$1 = E_f\left[\tilde\theta \left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0}\right] = \mathrm{Cov}_f\left(\tilde\theta,\ \left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0}\right).$$

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Remember that by the Cauchy-Schwarz Inequality, for any two random variables X and Y,
$$\mathrm{Cov}_f(X, Y)^2 \leq \mathrm{Var}_f(X)\,\mathrm{Var}_f(Y).$$
Then,
$$\mathrm{Cov}_f\left(\tilde\theta,\ \left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0}\right)^2 = 1^2 \leq \mathrm{Var}_f\left(\tilde\theta\right)\mathrm{Var}_f\left(\left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0}\right),$$
which rearranges to the bound stated above.

We note two important results:

1 The only time the Cramér-Rao bound is achieved is when the estimator is the ML estimator. In many problems, no estimator would actually achieve this bound.

2 For regular problems, asymptotically the ML estimator achieves the Cramér-Rao bound. In other words, for large n, ML can achieve the Cramér-Rao bound; that is, ML is efficient in large samples.

In general, an estimator is said to be efficient if it achieves the lowest possible variance.

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Example: Suppose Y_1, ..., Y_n are iid N(θ_0, 1). We will first find the Cramér-Rao bound for unbiased estimators of θ_0 and then show that the ML estimator achieves this bound.

First, let's construct the log-likelihood function. Let $y = (y_1, \dots, y_n)'$. Then
$$L(\theta; y) = \frac{1}{\sqrt{2\pi\cdot 1^2}}\exp\left\{-\frac{1}{2}\frac{(y_1-\theta)^2}{1^2}\right\}\times\cdots\times\frac{1}{\sqrt{2\pi\cdot 1^2}}\exp\left\{-\frac{1}{2}\frac{(y_n-\theta)^2}{1^2}\right\} = \left(\frac{1}{\sqrt{2\pi}}\right)^{n}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(y_i-\theta)^2\right\}.$$
This gives
$$\ell(\theta; y) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}(y_i-\theta)^2.$$

Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

The first-order condition is
$$\left.\frac{\partial \ell(\theta; y)}{\partial \theta}\right|_{\theta=\hat\theta} = -\frac{1}{2}(-2)\sum_{i=1}^{n}\left(y_i - \hat\theta\right) = 0,$$
implying that
$$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n}\hat\theta = 0, \qquad \text{and, so,} \qquad \hat\theta = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
In addition,
$$\mathrm{Var}_f\left(\left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta_0}\right) = \mathrm{Var}_f\left(\sum_{i=1}^{n}(y_i - \theta_0)\right) = \sum_{i=1}^{n}\underbrace{\mathrm{Var}_f(y_i)}_{=1} = n.$$


Maximum Likelihood Estimation

Properties of the Maximum Likelihood Estimator

Therefore, as far as this problem is concerned, the Cramér-Rao bound for any unbiased estimator θ̃ is given by
$$\mathrm{Var}_f\left(\tilde\theta\right) \geq \frac{1}{n}.$$
Now, the variance of the ML estimator is very easy to find:
$$\mathrm{Var}(\hat\theta) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(y_i) = \frac{1}{n^2}\, n = \frac{1}{n}.$$
But this is the same as the Cramér-Rao bound. Hence, the ML estimator in this particular case is an efficient estimator.
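A small Monte Carlo sketch (with made-up simulation settings) illustrating that the variance of the sample mean in this N(θ₀, 1) model is indeed close to the Cramér-Rao bound 1/n:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n, n_reps = 2.0, 50, 20_000

# ML estimate (the sample mean) in each replication
theta_hat = rng.normal(loc=theta0, scale=1.0, size=(n_reps, n)).mean(axis=1)

print(theta_hat.var())   # Monte Carlo variance of the ML estimator
print(1 / n)             # Cramér-Rao bound for unbiased estimators
```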

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Remember that we restrict ourselves to the case where our random sequence Y_1, ..., Y_n is iid. Let $y = (y_1, \dots, y_n)$.

Let $f_{Y_i}(y_i; \theta) = L(\theta; y_i)$ be the pdf (or the likelihood function) for Y_i. Then, the joint pdf or the joint likelihood function is given by
$$L(\theta; y) = \prod_{i=1}^{n} L(\theta; y_i), \qquad \ell(\theta; y) = \log \prod_{i=1}^{n} L(\theta; y_i) = \sum_{i=1}^{n}\log L(\theta; y_i).$$
Now, suppose that $\log L(\theta; y_1), \log L(\theta; y_2), \dots$ is an iid sequence where $E[\log L(\theta; y_i)] < \infty$ for all i.

Then, by the strong Law of Large Numbers,
$$\frac{1}{n}\sum_{i=1}^{n}\log L(\theta; y_i) \xrightarrow{a.s.} E_{f(y\mid\theta_0)}\left[\log L(\theta; y_i)\right].$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Let θ̂ be the maximum likelihood estimator of the true parameter value θ_0. Can we use the previous convergence result to argue that we will also have $\hat\theta \xrightarrow{a.s.} \theta_0$?

For this, we need the convergence to be uniform over the parameter space:
$$\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^{n}\log L(\theta; y_i) - E_{f(y\mid\theta_0)}\left[\log L(\theta; y_i)\right]\right| \xrightarrow{a.s.} 0.$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Recall also that θ_0 maximises the limit, $\theta_0 = \arg\max_{\theta\in\Theta} E_{f(y\mid\theta_0)}[\log L(\theta; y)]$; in particular,
$$E_{f(y\mid\theta_0)}\left[\left.\frac{\partial \log L(\theta; y)}{\partial \theta}\right|_{\theta=\theta_0}\right] = 0.$$
Moreover, by definition,
$$\hat\theta = \arg\max_{\theta\in\Theta}\log L(\theta; y).$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Then, since $\frac{1}{n}\sum_{i=1}^{n}\log L(\theta; y_i) \xrightarrow{a.s.} E_{f(y\mid\theta_0)}[\log L(\theta; y_i)]$ uniformly for all θ, we can intuitively argue that the argument that maximises $\frac{1}{n}\sum_{i=1}^{n}\log L(\theta; y_i)$ will also converge to the argument that maximises $E_{f(y\mid\theta_0)}[\log L(\theta; y_i)]$. In other words,
$$\frac{1}{n}\sum_{i=1}^{n}\log L(\theta; y_i) \xrightarrow{a.s.} E_{f(y\mid\theta_0)}\left[\log L(\theta; y_i)\right] \text{ uniformly for all } \theta,$$
$$\text{so} \quad \hat\theta = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^{n}\log L(\theta; y_i) \xrightarrow{a.s.} \arg\max_{\theta\in\Theta} E_{f(y\mid\theta_0)}\left[\log L(\theta; y)\right] = \theta_0. \qquad (3)$$
Note that dividing by n is a monotone transformation, so both $\frac{1}{n}\log L(\theta; y)$ and $\log L(\theta; y)$ will yield the same estimates.

This argument is, of course, not formal. For a more formal proof, see Newey and

McFadden (1994, pp.2121-2122). We will cover this proof later on.

Asymptotic normality, on the other hand, can be proved by using the tools we have

already acquired. The main ingredient will, again, be Taylor’s expansion.

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Now, suppose that we have already proved that $\hat\theta \xrightarrow{a.s.} \theta_0$ (which implies that $\hat\theta \xrightarrow{p} \theta_0$). How do we show the asymptotic normality of θ̂?

We start with a first-order expansion of the score function about θ̂ = θ_0. Our attention will again be on the univariate case, where θ is a scalar.

For the sake of notational simplicity, whenever we write, for some generic value θ*,
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta^*; y_i)}{\partial \theta} \quad \text{etc.},$$
we will mean
$$\frac{1}{n}\sum_{i=1}^{n}\left.\frac{\partial \log L(\theta; y_i)}{\partial \theta}\right|_{\theta=\theta^*}.$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

We assume the following moment conditions:

1 $E\left[\left|\frac{\partial^2 \log L(\theta; y_i)}{\partial \theta^2}\right|\right] < \infty$,

2 $0 < E\left[\left(\frac{\partial \log L(\theta; y_i)}{\partial \theta}\right)^2\right] < \infty$.

Throughout, we also assume that the order of integration and differentiation can be interchanged and that $\{\partial \log L(\theta; y_i)/\partial \theta\}_{i=1,\dots,n}$ and $\{\partial^2 \log L(\theta; y_i)/\partial \theta^2\}_{i=1,\dots,n}$ are both iid sequences.

Maximum Likelihood Estimation

Asymptotics of ML Estimators

We will this time use a mean value expansion. This is almost the same as the Taylor expansion. The only difference is that, instead of including a remainder term, the final term in the expansion is evaluated at the so-called mean value.

Hence, a k-th order mean value expansion of some function f(x) about x = x_0 is given by
$$f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \dots + \frac{1}{(k-1)!}f^{(k-1)}(x_0)(x - x_0)^{k-1} + \frac{1}{k!}f^{(k)}(\tilde x)(x - x_0)^{k},$$
where $\tilde x \in [\min(x, x_0), \max(x, x_0)]$.

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Then, we have
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\hat\theta; y_i)}{\partial \theta} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta} + \left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta^2}\right]\left(\hat\theta - \theta_0\right),$$
where θ̃ lies between θ̂ and θ_0.

We could also have used a standard Taylor expansion, whose final terms would be
$$\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\theta_0; y_i)}{\partial \theta^2}\right]\left(\hat\theta - \theta_0\right) + R,$$
with R a remainder term, but the mean value form is easier for us to work with.

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Now, since $\hat\theta \xrightarrow{a.s.} \theta_0$ and since θ̃ is always between θ̂ and θ_0, we have $\tilde\theta \xrightarrow{a.s.} \theta_0$ as well.

Remember that, by definition (the first-order condition),
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\hat\theta; y_i)}{\partial \theta} = 0.$$
Then, we have
$$0 = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta} + \left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta^2}\right]\left(\hat\theta - \theta_0\right),$$
and rearranging,
$$\sqrt{n}\left(\hat\theta - \theta_0\right) = \underbrace{\left[\sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right]}_{\text{there is a CLT!}}\underbrace{\left[-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta^2}\right]^{-1}}_{\text{there is an LLN!}}.$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Let's try to prove that there indeed is a CLT for the highlighted term on the previous slide.

Remember what we need to have for the existence of a CLT: an iid sequence with finite mean and variance.

Firstly, by assumption, $\left\{\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right\}_{i=1,\dots,n}$ is an iid sequence.

Also, by the unbiasedness of the score and by a moment assumption we made previously,
$$E\left[\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right] = 0 < \infty \qquad \text{and} \qquad 0 < \mathrm{Var}\left[\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right] < \infty.$$
Note that the second result above follows from
$$E\left[\left(\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right)^2\right] = \mathrm{Var}\left[\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right].$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Define
$$I = \mathrm{Var}\left[\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right].$$
Then, we have the following Central Limit Theorem result:
$$\sqrt{n}\ \frac{\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta} - \overbrace{E\left[\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right]}^{=\,0}}{\sqrt{I}} \xrightarrow{d} N(0, 1).$$
Equivalently,
$$\sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta} \xrightarrow{d} N(0, I).$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Remember that, by assumption, the sequence $\left\{\frac{\partial^2 \log L(\theta; y_i)}{\partial \theta^2}\right\}_{i=1,\dots,n}$ is iid, while $E\left[\left|\frac{\partial^2 \log L(\theta; y_i)}{\partial \theta^2}\right|\right] < \infty$.

Then, under some other standard assumptions, one can show that
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta^2} \xrightarrow{a.s.} E\left[\frac{\partial^2 \log L(\theta_0; y_i)}{\partial \theta^2}\right].$$
Note that the main tricky point here is that the arguments of the two functions are different (θ̃ vs θ_0). Nevertheless, this type of result is pretty standard.

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Define
$$J = -E\left[\frac{\partial^2 \log L(\theta_0; y_i)}{\partial \theta^2}\right],$$
minus the expected Hessian. We can now use Slutsky's Theorem:
$$\sqrt{n}\left(\hat\theta - \theta_0\right) = \underbrace{\left[\sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right]}_{\xrightarrow{d}\ N(0,\, I)}\underbrace{\left[-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta^2}\right]^{-1}}_{\xrightarrow{a.s.}\ J^{-1}} \xrightarrow{d} J^{-1}N(0, I) \stackrel{d}{=} N\left(0,\ I J^{-2}\right),$$
where $\stackrel{d}{=}$ stands for "equal in distribution."

Maximum Likelihood Estimation

Asymptotics of ML Estimators

The term
$$I = \mathrm{Var}\left[\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right]$$
is generally referred to as the expected (Fisher) information, or simply the Fisher information.

Moreover,
$$\frac{\partial^2 \log L(\theta_0; y_i)}{\partial \theta^2}$$
is generally known as the Hessian, and the quantity J defined above is minus the associated expected Hessian,
$$J = -E\left[\frac{\partial^2 \log L(\theta_0; y_i)}{\partial \theta^2}\right].$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

The quantities I and J can easily be consistently estimated by using their sample counterparts; that is,
$$\hat I = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right)^2 \qquad \text{and} \qquad \hat J = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\theta_0; y_i)}{\partial \theta^2}.$$
Of course, when θ_0 is unknown (as is always the case), one uses the estimated parameter value, θ̂. Therefore,
$$\hat I = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial \log L(\hat\theta; y_i)}{\partial \theta}\right)^2 \qquad \text{and} \qquad \hat J = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\hat\theta; y_i)}{\partial \theta^2},$$
and one can show that
$$\hat I \xrightarrow{p} I \qquad \text{and} \qquad \hat J \xrightarrow{p} J.$$
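As a sketch, for the earlier N(θ₀, 1) example the per-observation score is (yᵢ − θ) and the per-observation Hessian is −1, so these sample counterparts are trivial to compute (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10_000)   # N(theta0, 1) with theta0 = 2

theta_hat = y.mean()                     # ML estimate
score = y - theta_hat                    # per-observation score at theta_hat
I_hat = np.mean(score ** 2)              # average squared score
J_hat = -np.mean(np.full_like(y, -1.0))  # minus the average Hessian (= 1 here)

print(I_hat, J_hat)   # both close to 1: I = J under correct specification
```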

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Under correct specification, the information equality (Property 2) gives
$$I = J,$$
implying that
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \xrightarrow{d} N\left(0,\ I^{-1}\right).$$
Remember that
$$I^{-1} = \left\{\mathrm{Var}\left[\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right]\right\}^{-1}$$
is the Cramér-Rao lower bound. Hence, this confirms that, in this given framework, the ML estimator's asymptotic variance has the minimum variance property.

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Before we move on, let's do the derivation of the asymptotic distribution for the case where θ is a (p × 1) vector rather than a scalar.

Again, we start with a mean value expansion, but note that this time we are dealing with vectors:
$$\underbrace{\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\hat\theta; y_i)}{\partial \theta}}_{(p\times 1)} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}}_{(p\times 1)} + \underbrace{\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta\,\partial \theta'}\right]}_{(p\times p)}\underbrace{\left(\hat\theta - \theta_0\right)}_{(p\times 1)}. \qquad (4)$$
Here,
$$\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta\,\partial \theta'} = \begin{bmatrix} \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_1\partial\theta_1} & \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_1\partial\theta_p} \\ \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_2\partial\theta_2} & \cdots & \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_2\partial\theta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_p\partial\theta_1} & \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_p\partial\theta_2} & \cdots & \frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial\theta_p\partial\theta_p} \end{bmatrix}.$$
A technical aside: the parameter θ̃ appearing in each entry of the above matrix is understood to (possibly) differ from entry to entry.

Maximum Likelihood Estimation

Asymptotics of ML Estimators

All that we had assumed for the scalar terms in the scalar case is now assumed to hold entry by entry for all matrices.

Now, rearranging (4) gives
$$0 = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta} + \left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta\,\partial \theta'}\right]\left(\hat\theta - \theta_0\right)$$
$$\Rightarrow\quad -\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta\,\partial \theta'}\right]\left(\hat\theta - \theta_0\right) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}$$
$$\Rightarrow\quad \hat\theta - \theta_0 = \left[-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta\,\partial \theta'}\right]^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right).$$
Always keep in mind that we are dealing with matrices now. For example, for two matrices A and B which are (q × q) and (q × 1), respectively, we do not necessarily have AB = BA!

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Define
$$\underset{(p\times p)}{I} = \mathrm{Var}\left[\frac{\partial \log L(\theta_0; Y_i)}{\partial \theta}\right] = E\left[\frac{\partial \log L(\theta_0; Y_i)}{\partial \theta}\,\frac{\partial \log L(\theta_0; Y_i)}{\partial \theta'}\right]$$
and
$$\underset{(p\times p)}{J} = -E\left[\frac{\partial^2 \log L(\theta_0; Y_i)}{\partial \theta\,\partial \theta'}\right].$$
Here,
$$\frac{\partial \log L(\theta_0; y_i)}{\partial \theta} = \begin{bmatrix}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta_1} \\ \frac{\partial \log L(\theta_0; y_i)}{\partial \theta_2} \\ \vdots \\ \frac{\partial \log L(\theta_0; y_i)}{\partial \theta_p}\end{bmatrix},$$
and $\frac{\partial \log L(\theta_0; y_i)}{\partial \theta'}$ is simply the transpose of this vector.

Maximum Likelihood Estimation

Asymptotics of ML Estimators

The sample counterparts are
$$\hat I = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\hat\theta; y_i)}{\partial \theta}\,\frac{\partial \log L(\hat\theta; y_i)}{\partial \theta'}$$
and
$$\hat J = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\hat\theta; y_i)}{\partial \theta\,\partial \theta'}.$$

Maximum Likelihood Estimation

Asymptotics of ML Estimators

Then,
$$\sqrt{n}\left(\hat\theta - \theta_0\right) = \underbrace{\left[-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\tilde\theta; y_i)}{\partial \theta\,\partial \theta'}\right]^{-1}}_{\xrightarrow{p}\ J^{-1}}\underbrace{\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial \log L(\theta_0; y_i)}{\partial \theta}\right)}_{\xrightarrow{d}\ N(0,\, I)},$$
so that
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \xrightarrow{d} J^{-1}N(0, I) \stackrel{d}{=} N\left(0,\ J^{-1} I (J^{-1})'\right) = N\left(0,\ I^{-1}\right),$$
where the last equality uses the information equality and the symmetry of J.

A consistent estimator for the asymptotic variance will be
$$\hat J^{-1}\,\hat I\,\hat J^{-1} \xrightarrow{p} J^{-1} I J^{-1}.$$
Also note that, this time, we have $J^{-1} I J^{-1}$ as the asymptotic covariance matrix, rather than $I J^{-2}$, which is the equivalent of $J^{-1} I J^{-1}$ in the scalar case.

Maximum Likelihood Estimation

Quasi Maximum Likelihood Estimation

Up to now, we have assumed that we know the true distribution that generates the observed data and tried to estimate the true parameter vector, θ_0. In reality, the data generating process would rarely (if at all) be known by the researcher.

To make this point clear, suppose that the data generating process is given by the cdf $G_Y(y)$ and by its associated density function, $g_Y(y)$.

Do we know what the data generating process is? We might have some idea about it, but usually the short answer is "no."

So what do we do then? When we model data, we believe (hope?) that the distribution function we have chosen is indeed the data generating process. Let's define this chosen distribution as $F_Y(y; \theta)$, with its associated density function, $f_Y(y; \theta)$.

Usually,
$$G_Y(y) \neq F_Y(y; \theta).$$
But this is not the end of the story.

Maximum Likelihood Estimation

Quasi Maximum Likelihood Estimation

When the chosen likelihood function is not the true data generating process, maximum likelihood estimation is known as quasi (or pseudo) maximum likelihood estimation.

We will only outline the main ideas without getting into formal details.

Under standard assumptions,
$$\frac{1}{n}\sum_{i=1}^{n}\log L(\theta; y_i) \xrightarrow{p} \int \log f(y; \theta)\, g(y)\,dy = E_g\left[\log f(y; \theta)\right],$$
and
$$\hat\theta \xrightarrow{p} \theta^*, \qquad \text{where } \theta^* \text{ satisfies } E_g\left[\left.\frac{\partial \log f(y; \theta)}{\partial \theta}\right|_{\theta=\theta^*}\right] = 0.$$
Now, θ* is called the pseudo-true value of θ. This value has a special statistical meaning.

Maximum Likelihood Estimation

Quasi Maximum Likelihood Estimation

In particular, θ* minimises the Kullback-Leibler discrepancy:
$$D(f_\theta, g) = \int \log\left[\frac{g(y)}{f(y; \theta)}\right] g(y)\,dy = \int \log[g(y)]\, g(y)\,dy - \int \log[f(y; \theta)]\, g(y)\,dy = E_g[\log g(y)] - E_g[\log f(y; \theta)].$$
This measures the discrepancy between the chosen density function, f(y; θ), and the true data generating process, g(y). It is always non-negative, provided that g(y) and f(y; θ) are continuous in y.

Therefore, the maximum likelihood estimator converges to the value of θ which minimises the difference between the density we use and the true density.

Put differently, the maximum likelihood estimator converges to the θ that minimises our mistake.

If g(y) = f(y; θ*), then the density we are using is correctly specified and θ* = θ_0.
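A sketch of the pseudo-true value idea (the distributional choices here are purely illustrative): if the true density g is a t-distribution with 6 degrees of freedom but we fit a N(0, σ²) family, the pseudo-true σ² minimises the Kullback-Leibler discrepancy, which we can approximate numerically on a grid.

```python
import numpy as np
from scipy import stats

# True DGP: t with 6 degrees of freedom (variance = 6/(6-2) = 1.5)
g = stats.t(df=6)
y_grid = np.linspace(-12, 12, 20_001)   # quadrature grid
dy = y_grid[1] - y_grid[0]
gy = g.pdf(y_grid)

def kl_to_normal(sigma2):
    """D(f_theta, g) = E_g[log g(y)] - E_g[log f(y; theta)], with f = N(0, sigma2)."""
    log_f = stats.norm(scale=np.sqrt(sigma2)).logpdf(y_grid)
    return np.sum((g.logpdf(y_grid) - log_f) * gy) * dy

sigma2_grid = np.linspace(0.5, 3.0, 251)
kl = np.array([kl_to_normal(s2) for s2 in sigma2_grid])
print(sigma2_grid[kl.argmin()])   # close to the true variance 1.5
```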

Maximum Likelihood Estimation

Quasi Maximum Likelihood Estimation

Moreover, under standard assumptions,
$$\sqrt{n}\left(\hat\theta - \theta^*\right) \xrightarrow{d} N\left(0,\ J_g^{-1} I_g J_g^{-1}\right),$$
where
$$I_g = E_g\left[\frac{\partial \log L(\theta^*; Y_i)}{\partial \theta}\,\frac{\partial \log L(\theta^*; Y_i)}{\partial \theta'}\right] \qquad \text{and} \qquad J_g = -E_g\left[\frac{\partial^2 \log L(\theta^*; Y_i)}{\partial \theta\,\partial \theta'}\right].$$
These can be consistently estimated by
$$\hat I_g = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log L(\hat\theta; y_i)}{\partial \theta}\,\frac{\partial \log L(\hat\theta; y_i)}{\partial \theta'} \qquad \text{and} \qquad \hat J_g = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log L(\hat\theta; y_i)}{\partial \theta\,\partial \theta'},$$
where
$$\hat J_g^{-1}\,\hat I_g\,\hat J_g^{-1} \xrightarrow{p} J_g^{-1} I_g J_g^{-1}.$$
All these results can be proved by using the same arguments as for the case where the likelihood function is identical to the data generating process. The only difference is that this time the expansion should be about θ̂ = θ*.

Clearly, the main ideas are the same as before. What changes is the interpretation of what θ̂ converges to.


Maximum Likelihood Estimation

Quasi Maximum Likelihood Estimation

It is crucial to underline that the three nice properties we have proved do not necessarily hold for quasi maximum likelihood estimation.

In particular, generally,
$$I_g \neq J_g.$$
Therefore, the asymptotic variance matrix does not simplify to $I_g^{-1}$ anymore.

The term $J_g^{-1} I_g J_g^{-1}$ is known as the sandwich matrix. Likewise, $\hat J_g^{-1}\,\hat I_g\,\hat J_g^{-1}$ is generally called the sandwich estimator.

Remember that all these results are for the iid case. As we relax the iid assumption, we may have to deal with further issues, especially in terms of consistent estimation of the sandwich matrix. We will not deal with these.
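A minimal sketch of the sandwich estimator (my own illustration, with hypothetical heavy-tailed data): fit the normal quasi-likelihood for (μ, σ²), compute the per-observation analytic score and Hessian at the estimates, and compare naive and robust standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
y = rng.standard_t(df=5, size=n)          # true DGP: heavy-tailed, not normal

# Normal QML estimates of (mu, sigma^2)
mu, s2 = y.mean(), y.var(ddof=0)
e = y - mu

# Per-observation score of the normal log density, evaluated at (mu, s2)
score = np.column_stack([e / s2, -0.5 / s2 + e**2 / (2 * s2**2)])
I_hat = score.T @ score / n                           # outer-product estimate

# Minus the average per-observation Hessian, evaluated at (mu, s2)
J_hat = -np.array([[-1 / s2,            -e.mean() / s2**2],
                   [-e.mean() / s2**2,   0.5 / s2**2 - (e**2).mean() / s2**3]])

naive    = np.linalg.inv(J_hat) / n                   # valid only if I = J
sandwich = np.linalg.inv(J_hat) @ I_hat @ np.linalg.inv(J_hat) / n

print(np.sqrt(np.diag(naive)))     # naive standard errors for (mu, sigma^2)
print(np.sqrt(np.diag(sandwich)))  # robust (sandwich) standard errors
```

Under misspecification the two sets of standard errors differ (here mainly for σ̂², because of the excess kurtosis of the t distribution), which is exactly why the sandwich matrix matters.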

Maximum Likelihood Estimation

Quasi Maximum Likelihood Estimation

In many cases, convergence to the true parameter value is still possible, even when the chosen density function is not the correct one.

Example: We suppose that Y_1, ..., Y_n is an iid sequence from N(μ, σ²). However, the truth is that although the Y_i are iid, the distribution is not Normal. In addition, the true mean and variance are given by μ_0 and σ_0², respectively.

Let θ̂ = (μ̂, σ̂²) and θ_0 = (μ_0, σ_0²). Now, in order to ensure $\hat\theta \xrightarrow{p} \theta_0$, we need to have
$$E_g\left[\left.\frac{\partial \log L(\theta, Y_i)}{\partial \theta}\right|_{\theta=\theta_0}\right] = 0.$$
The (quasi) log-likelihood is
$$\log L(\theta) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log \sigma^2 - \frac{1}{2}\sum_{i=1}^{n}\left(\frac{y_i - \mu}{\sigma}\right)^2.$$

Maximum Likelihood Estimation

Quasi Maximum Likelihood Estimation

Then, we have
$$E_g\left[\left.\frac{\partial \log L(\theta, Y_i)}{\partial \sigma^2}\right|_{\theta=\theta_0}\right] = -\frac{n}{2\sigma_0^2} + \frac{1}{2\sigma_0^4}\, E_g\left[\sum_{i=1}^{n}(Y_i - \mu_0)^2\right].$$
Now,
$$E_g\left[(Y_i - \mu_0)^2\right] = \sigma_0^2,$$
by definition, so
$$E_g\left[\left.\frac{\partial \log L(\theta, Y_i)}{\partial \sigma^2}\right|_{\theta=\theta_0}\right] = -\frac{n}{2\sigma_0^2} + \frac{1}{2\sigma_0^4}\, n\sigma_0^2 = 0.$$
Moreover,
$$E_g\left[\left.\frac{\partial \log L(\theta, Y_i)}{\partial \mu}\right|_{\theta=\theta_0}\right] = \frac{1}{\sigma_0^2}\, E_g\left[\sum_{i=1}^{n}(Y_i - \mu_0)\right] = 0.$$

Maximum Likelihood Estimation

Quasi Maximum Likelihood Estimation

Hence,
$$E_g\left[\left.\frac{\partial \log L(\theta, Y_i)}{\partial \theta}\right|_{\theta=\theta_0}\right] = 0.$$
Therefore, despite misspecification, the QML estimator is still consistent for (μ_0, σ_0²)!
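A quick simulation sketch of this result (the t₅ choice for the true distribution is purely illustrative): fitting a normal likelihood to non-normal iid data still recovers the true mean and variance as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
df = 5                                  # true DGP: t with 5 degrees of freedom
mu0, sigma20 = 0.0, df / (df - 2)       # true mean and variance (= 5/3)

for n in (100, 10_000, 1_000_000):
    y = rng.standard_t(df, size=n)
    # Normal QML estimates: sample mean and (1/n) * sum of squared deviations
    print(n, y.mean(), y.var(ddof=0))   # approaches (mu0, sigma20) as n grows
```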

Maximum Likelihood Estimation

MLE in Action: Financial Volatility Estimation

Figure: Daily returns (%) and squared daily returns (%) on the S&P500 index from 2 February 2001 to 16 January 2008.

Maximum Likelihood Estimation

MLE in Action: Financial Volatility Estimation

Engle's (1982) idea: there is no predictability in asset returns, but there is some predictability in the asset return volatility.

High volatility periods are clustered together. Likewise, low volatility periods are clustered together.

Here is the model. Let r_t be the return on some asset at time t, t = 1, ..., T. Let $\mathcal{F}_{t-1}$ be the information set at time t−1 (e.g. all stock returns, volatilities etc. up to and including period t−1).

Suppose for simplicity that $E[r_t \mid \mathcal{F}_{t-1}] = 0$ for all t. This is not a crazy assumption for stock returns. Indeed, they simply fluctuate around zero.

Maximum Likelihood Estimation

MLE in Action: Financial Volatility Estimation

Engle's ARCH model is then given by
$$r_t = E[r_t \mid \mathcal{F}_{t-1}] + \varepsilon_t, \qquad \varepsilon_t \mid \mathcal{F}_{t-1} \sim N(0, \sigma_t^2), \qquad \sigma_t^2 = \omega + \alpha\varepsilon_{t-1}^2, \qquad \omega > 0,\ \alpha \geq 0.$$
His doctoral student Tim Bollerslev came up with an extension in 1986, which proved to be one of the most popular models in econometrics: the Generalised ARCH (GARCH) model given by
$$\sigma_t^2 = \omega + \alpha\varepsilon_{t-1}^2 + \beta\sigma_{t-1}^2.$$
This paper is among the most cited papers published in the Journal of Econometrics.

This is not a small difference; if anything, the GARCH model has been more successful empirically.

In 2003, Robert Engle shared the Nobel prize with Sir Clive Granger.

The idea is simple: today's volatility is affected by (i) yesterday's shock (ε²_{t−1}) and (ii) yesterday's volatility (σ²_{t−1}).

This model is usually estimated using the ML method.


Maximum Likelihood Estimation

MLE in Action: Financial Volatility Estimation

So, the model is
$$r_t \mid \mathcal{F}_{t-1} \sim N(0, \sigma_t^2), \qquad \sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta\sigma_{t-1}^2, \qquad \omega > 0,\quad \alpha, \beta \geq 0 \ \text{ and } \ \alpha + \beta < 1,$$
where ε_t = r_t since the conditional mean is zero. Then,
$$L(\omega, \alpha, \beta; r_t \mid \mathcal{F}_{t-1}) = \frac{1}{\sqrt{2\pi\sigma_t^2}}\exp\left[-\frac{1}{2}\left(\frac{r_t}{\sigma_t}\right)^2\right],$$
and
$$\prod_{t=2}^{T} L(\omega, \alpha, \beta; r_t \mid \mathcal{F}_{t-1}) = \frac{1}{\sqrt{(2\pi)^{T-1}\prod_{t=2}^{T}\sigma_t^2}}\exp\left[-\frac{1}{2}\sum_{t=2}^{T}\left(\frac{r_t}{\sigma_t}\right)^2\right].$$
We start with the second observation as we will condition on the first observation.

The joint density is obtained by using the prediction decomposition argument, which we mentioned at the beginning of this slide set.

Maximum Likelihood Estimation

MLE in Action: Financial Volatility Estimation

The log-likelihood is then
$$\ell(\omega, \alpha, \beta; r_1, \dots, r_T) = -\frac{T-1}{2}\log 2\pi - \frac{1}{2}\sum_{t=2}^{T}\log \sigma_t^2 - \frac{1}{2}\sum_{t=2}^{T}\left(\frac{r_t}{\sigma_t}\right)^2.$$
Bollerslev and Wooldridge (1992, Econometric Reviews) showed that, under certain assumptions, even if the likelihood function is misspecified, we will still have $\hat\theta \xrightarrow{p} \theta_0$.

Unfortunately, we cannot solve this model in closed form due to the recursive structure of σ_t². You can try to see this for yourselves by taking the first-order derivative with respect to, say, α.

So, estimation is conducted by computer software.

Maximum Likelihood Estimation

MLE in Action: Financial Volatility Estimation


How we estimate this model using a computer programme is as follows.

1 We write the log-likelihood function as a code and feed this into the software as the objective function to be maximised. Most importantly, we "tell" the software that our parameters are ω, α, β. (A sketch of such a function is given after this list.)

2 The computer then uses an optimisation algorithm. We can do constrained or unconstrained optimisation. In the case of constrained optimisation we should, for example, define the parameter space. For the GARCH example, we do not want Matlab to look for values of ω below zero, for example.

3 The computer simultaneously tries different values of (ω, α, β) until it has decided that the maximum of the objective function (the log-likelihood function) has been achieved.

4 In some cases, if the software cannot find the maximum (because, for example, the objective function is flat, or is not concave, or there are many local maxima, or some other reason), then we get an error message along with (hopefully) an explanation.

Actually, this model is so well-established that it is very likely that your favourite

econometrics software will have an option to do GARCH estimation. So, all you

have to do is to upload the data and click on a few buttons.
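For concreteness, here is a minimal sketch (in Python rather than Matlab, with illustrative starting values and simulated data) of steps 1-3: code the GARCH(1,1) conditional log-likelihood and hand it to a numerical optimiser.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, r):
    """Negative conditional Gaussian log-likelihood of a GARCH(1,1) model."""
    omega, alpha, beta = params
    T = r.size
    sigma2 = np.empty(T)
    sigma2[0] = r.var()                      # initialise the recursion
    for t in range(1, T):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    # Condition on the first observation: sum the log densities for t = 2, ..., T
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2[1:]) + r[1:] ** 2 / sigma2[1:])

# Illustrative data: returns simulated from a GARCH(1,1) with known parameters
rng = np.random.default_rng(0)
omega0, alpha0, beta0, T = 0.05, 0.08, 0.90, 2_000
r = np.empty(T)
s2 = omega0 / (1 - alpha0 - beta0)           # start at the unconditional variance
for t in range(T):
    r[t] = rng.normal() * np.sqrt(s2)
    s2 = omega0 + alpha0 * r[t] ** 2 + beta0 * s2

res = minimize(neg_loglik, x0=[0.1, 0.1, 0.8], args=(r,),
               bounds=[(1e-8, None), (0.0, 1.0), (0.0, 1.0)])
print(res.x)   # estimates of (omega, alpha, beta)
```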

Maximum Likelihood Estimation

MLE in Action: Financial Volatility Estimation

Stock               (%)       α̂       β̂
Amazon            46.78%   0.0133   0.9840
IBM                9.47%   0.0741   0.9188
JP Morgan         17.95%   0.0719   0.9261
Procter & Gamble   6.66%   0.0285   0.9693
Walt Disney       12.53%   0.0808   0.9090

Table: GARCH parameter estimates for a selection of stocks. The estimation period is from 4 Jan 2000 to 1 Dec 2008.

Maximum Likelihood Estimation

MLE in Action: Financial Volatility Estimation

Figure: Variance (%) over 2001-2008 for Amazon, IBM and JP Morgan.

Appendix: Estimator Consistency

Before we finish this part, let's look at the basic consistency proof for maximum likelihood (and related) estimators.

Now, as before, let, for some iid sequence Y_i, i = 1, ..., n, the log-likelihood function be given by ℓ(θ; y_i), where θ is a possibly vector-valued parameter.

Remember that the maximum likelihood estimator that we considered so far is given by
$$\hat\theta = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^{n}\ell(\theta; y_i), \qquad \text{where Θ is the parameter space.}$$
What is the basic proof for $\hat\theta \xrightarrow{p} \theta_0$?

Appendix: Estimator Consistency

The following is a slightly less difficult version of Theorem 2.1 (and of its proof) from Newey and McFadden (1994, pp. 2121-2122).

Let $\hat Q_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(\theta; y_i)$ and $Q_0(\theta) = E[\ell(\theta; y_i)]$.

Theorem 2.1 (based on Newey and McFadden, 1994): If there is a function Q_0(θ) such that (i) for each η > 0,
$$Q_0(\theta_0) - \sup_{\{\theta:\ \|\theta - \theta_0\| > \eta\}} Q_0(\theta) > 0,$$
and (ii)
$$\sup_{\theta\in\Theta}\left|\hat Q_n(\theta) - Q_0(\theta)\right| \xrightarrow{p} 0,$$
then
$$\hat\theta \xrightarrow{p} \theta_0.$$
Another way of stating Assumption (i) is to say that θ_0 is uniquely identifiable.

Appendix: Estimator Consistency

Proof: For any ε > 0, we can show that with probability approaching one (w.p.a.1),
$$(i):\ \hat Q_n(\hat\theta) > \hat Q_n(\theta_0) - \varepsilon/3,$$
$$(ii):\ Q_0(\hat\theta) > \hat Q_n(\hat\theta) - \varepsilon/3,$$
$$(iii):\ \hat Q_n(\theta_0) > Q_0(\theta_0) - \varepsilon/3.$$
How? (i) follows from the fact that θ̂ is the maximiser of Q̂_n(θ); (ii) and (iii) follow from the uniform convergence result, which implies that w.p.a.1
$$\sup_{\theta\in\Theta}\left|\hat Q_n(\theta) - Q_0(\theta)\right| < \varepsilon/3.$$
Then, w.p.a.1,

Appendix: Estimator Consistency

$$Q_0(\hat\theta) \overset{(ii)}{>} \hat Q_n(\hat\theta) - \frac{\varepsilon}{3} \overset{(i)}{>} \hat Q_n(\theta_0) - \frac{2\varepsilon}{3} \overset{(iii)}{>} Q_0(\theta_0) - \varepsilon, \qquad \text{w.p.a.1.}$$
Therefore, for any ε > 0, $Q_0(\hat\theta) > Q_0(\theta_0) - \varepsilon$, w.p.a.1.

Which ε to choose? Actually, let our choice of ε be such that
$$\varepsilon = Q_0(\theta_0) - \sup_{\{\theta:\ \|\theta - \theta_0\| > \eta\}} Q_0(\theta) > 0.$$

Appendix: Estimator Consistency

Then, w.p.a.1,
$$Q_0(\hat\theta) > Q_0(\theta_0) - \left[Q_0(\theta_0) - \sup_{\{\theta:\ \|\theta - \theta_0\| > \eta\}} Q_0(\theta)\right] = \sup_{\{\theta:\ \|\theta - \theta_0\| > \eta\}} Q_0(\theta).$$
Therefore, w.p.a.1, $\hat\theta \notin \{\theta : \|\theta - \theta_0\| > \eta\}$ or, equivalently, $\hat\theta \in \{\theta : \|\theta - \theta_0\| \leq \eta\}$.

Now, this is all valid for any ε > 0. ε can be arbitrarily close to zero, which requires η to be arbitrarily close to zero, which in turn implies that θ̂ will be arbitrarily close to θ_0, w.p.a.1. In other words,
$$\hat\theta \xrightarrow{p} \theta_0.$$

Appendix: Estimator Consistency

In the original version of the Theorem, Assumption (i) is replaced by the following

assumptions:

1 Q0 (θ ) is uniquely maximised at θ 0 ,

2 Θ is compact,

3 Q0 (θ ) is continuous.
