
Math 541: Statistical Theory II

Fisher Information and Cramer-Rao Bound


Lecturer: Songfeng Zheng
In parameter estimation problems, we obtain information about the parameter from a
sample of data coming from the underlying probability distribution. A natural question is:
how much information can a sample of data provide about the unknown parameter? This
section introduces such a measure of information, and we will also see that this measure
can be used to find bounds on the variance of estimators, to approximate the sampling
distribution of an estimator obtained from a large sample, and, further, to obtain an
approximate confidence interval in the case of a large sample.
In this section, we consider a random variable $X$ for which the pdf or pmf is $f(x|\theta)$, where
$\theta$ is an unknown parameter and $\theta \in \Theta$, with $\Theta$ the parameter space.
1 Fisher Information
We define the Fisher information $I(\theta)$ in the random variable $X$ as

$$I(\theta) = E_\theta\left\{[l'(X|\theta)]^2\right\} = \int [l'(x|\theta)]^2 f(x|\theta)\,dx \qquad (1)$$

where $l(x|\theta) = \log f(x|\theta)$ is the log-likelihood function, and

$$l'(x|\theta) = \frac{\partial}{\partial\theta}\log f(x|\theta) = \frac{f'(x|\theta)}{f(x|\theta)},$$

where $f'(x|\theta)$ is the derivative of $f(x|\theta)$ with respect to $\theta$. Similarly, we denote the second
order derivative of $f(x|\theta)$ with respect to $\theta$ as $f''(x|\theta)$. We assume that we can exchange the
order of differentiation and integration; then

$$\int f'(x|\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x|\theta)\,dx = 0$$
Similarly,

$$\int f''(x|\theta)\,dx = \frac{\partial^2}{\partial\theta^2}\int f(x|\theta)\,dx = 0$$

It is easy to see that

$$E_\theta[l'(X|\theta)] = \int l'(x|\theta)f(x|\theta)\,dx = \int \frac{f'(x|\theta)}{f(x|\theta)}f(x|\theta)\,dx = \int f'(x|\theta)\,dx = 0$$
Therefore, the definition of Fisher information (1) can be rewritten as

$$I(\theta) = \mathrm{Var}_\theta[l'(X|\theta)] \qquad (2)$$

Also, notice that

$$l''(x|\theta) = \frac{\partial}{\partial\theta}\left[\frac{f'(x|\theta)}{f(x|\theta)}\right] = \frac{f''(x|\theta)f(x|\theta) - [f'(x|\theta)]^2}{[f(x|\theta)]^2} = \frac{f''(x|\theta)}{f(x|\theta)} - [l'(x|\theta)]^2$$

Therefore,

$$E_\theta[l''(X|\theta)] = \int\left\{\frac{f''(x|\theta)}{f(x|\theta)} - [l'(x|\theta)]^2\right\}f(x|\theta)\,dx = \int f''(x|\theta)\,dx - E_\theta\left\{[l'(X|\theta)]^2\right\} = -I(\theta)$$
Finally, we have another formula to calculate Fisher information:

$$I(\theta) = -E_\theta[l''(X|\theta)] = -\int\left[\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\right]f(x|\theta)\,dx \qquad (3)$$
To summarize, we have three methods to calculate Fisher information: equations (1), (2),
and (3). In many problems, using (3) is the most convenient choice.
Example 1: Suppose the random variable $X$ has a Bernoulli distribution for which the parameter $\theta$ is unknown ($0 < \theta < 1$). We shall determine the Fisher information $I(\theta)$ in $X$.
The probability mass function of $X$ is

$$f(x|\theta) = \theta^x(1-\theta)^{1-x} \qquad \text{for } x = 1 \text{ or } x = 0.$$

Therefore

$$l(x|\theta) = \log f(x|\theta) = x\log\theta + (1-x)\log(1-\theta)$$

and

$$l'(x|\theta) = \frac{x}{\theta} - \frac{1-x}{1-\theta} \qquad \text{and} \qquad l''(x|\theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}$$

Since $E(X) = \theta$, the Fisher information is

$$I(\theta) = -E[l''(X|\theta)] = \frac{E(X)}{\theta^2} + \frac{1-E(X)}{(1-\theta)^2} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}$$
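A quick numerical sanity check of this result is sketched below (an addition for illustration, not part of the original notes): it estimates the Fisher information of a single Bernoulli observation by Monte Carlo, using formula (1), the mean of the squared score, and formula (3), minus the mean second derivative, and compares both with $1/[\theta(1-\theta)]$. The function name and the chosen value of $\theta$ are arbitrary.

```python
import numpy as np

def bernoulli_fisher_mc(theta, n_draws=1_000_000, seed=0):
    """Monte Carlo estimates of I(theta) for one Bernoulli observation."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, theta, size=n_draws).astype(float)
    score = x / theta - (1.0 - x) / (1.0 - theta)            # l'(x|theta)
    second = -x / theta**2 - (1.0 - x) / (1.0 - theta)**2    # l''(x|theta)
    return score.mean(), np.mean(score**2), -np.mean(second)

theta = 0.3
mean_score, info_eq1, info_eq3 = bernoulli_fisher_mc(theta)
print("E[l'] (should be near 0):", mean_score)
print("E[(l')^2] (formula 1):   ", info_eq1)
print("-E[l'']   (formula 3):   ", info_eq3)
print("1/(theta(1-theta)):      ", 1.0 / (theta * (1.0 - theta)))
```

Both estimates should agree with the closed-form value up to Monte Carlo error.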
Example 2: Suppose that $X \sim N(\mu, \sigma^2)$, where $\mu$ is unknown but the value of $\sigma^2$ is given.
Find the Fisher information $I(\mu)$ in $X$.
For $-\infty < x < \infty$, we have

$$l(x|\mu) = \log f(x|\mu) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$

Hence,

$$l'(x|\mu) = \frac{x-\mu}{\sigma^2} \qquad \text{and} \qquad l''(x|\mu) = -\frac{1}{\sigma^2}$$

It follows that the Fisher information is

$$I(\mu) = -E[l''(X|\mu)] = \frac{1}{\sigma^2}$$
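The calculation in Example 2 can also be reproduced with a computer algebra system. The short sketch below (added for illustration, assuming SymPy is available) differentiates the normal log-density twice with respect to $\mu$; since the second derivative does not involve $x$, no expectation is actually needed.

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# Normal density and log-likelihood of a single observation
f = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sp.sqrt(2 * sp.pi) * sigma)
l = sp.log(f)

lpp = sp.simplify(sp.diff(l, mu, 2))   # second derivative l''(x|mu)
print(lpp)                             # expected: -1/sigma**2, free of x
print(sp.simplify(-lpp))               # Fisher information I(mu) = 1/sigma**2
```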
If we make a transformation of the parameter, we will have different expressions of the Fisher
information under different parameterizations. More specifically, let $X$ be a random variable
for which the pdf or pmf is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown but must
lie in a space $\Theta$. Let $I_0(\theta)$ denote the Fisher information in $X$. Suppose now that the parameter
$\theta$ is replaced by a new parameter $\mu$, where $\theta = \psi(\mu)$ and $\psi$ is a differentiable function. Let
$I_1(\mu)$ denote the Fisher information in $X$ when the parameter is regarded as $\mu$. We will have

$$I_1(\mu) = [\psi'(\mu)]^2\, I_0[\psi(\mu)].$$
Proof: Let $g(x|\mu)$ be the pdf or pmf of $X$ when $\mu$ is regarded as the parameter. Then
$g(x|\mu) = f[x|\psi(\mu)]$. Therefore,

$$\log g(x|\mu) = \log f[x|\psi(\mu)] = l[x|\psi(\mu)],$$

and

$$\frac{\partial}{\partial\mu}\log g(x|\mu) = l'[x|\psi(\mu)]\,\psi'(\mu).$$

It follows that

$$I_1(\mu) = E\left\{\left[\frac{\partial}{\partial\mu}\log g(X|\mu)\right]^2\right\} = [\psi'(\mu)]^2\, E\left\{\left(l'[X|\psi(\mu)]\right)^2\right\} = [\psi'(\mu)]^2\, I_0[\psi(\mu)]$$

This will be verified in exercise problems.
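As a quick illustration (an added example, not from the original notes), apply this rule to the Bernoulli model of Example 1 reparameterized by the log-odds $\mu$, so that $\theta = \psi(\mu) = e^\mu/(1+e^\mu)$:

$$\psi'(\mu) = \frac{e^\mu}{(1+e^\mu)^2} = \theta(1-\theta), \qquad
I_1(\mu) = [\psi'(\mu)]^2\, I_0[\psi(\mu)] = [\theta(1-\theta)]^2 \cdot \frac{1}{\theta(1-\theta)} = \theta(1-\theta).$$

So a single Bernoulli observation carries the most information about the log-odds when $\theta = 1/2$.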
Suppose that we have a random sample $X_1, \ldots, X_n$ coming from a distribution for which the
pdf or pmf is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown. Let us now calculate
the amount of information the random sample $X_1, \ldots, X_n$ provides for $\theta$.
Let us denote the joint pdf of $X_1, \ldots, X_n$ as

$$f_n(x|\theta) = \prod_{i=1}^n f(x_i|\theta)$$

then

$$l_n(x|\theta) = \log f_n(x|\theta) = \sum_{i=1}^n \log f(x_i|\theta) = \sum_{i=1}^n l(x_i|\theta),$$

and

$$l'_n(x|\theta) = \frac{f'_n(x|\theta)}{f_n(x|\theta)} \qquad (4)$$
We define the Fisher information $I_n(\theta)$ in the random sample $X_1, \ldots, X_n$ as

$$I_n(\theta) = E_\theta\left\{[l'_n(X|\theta)]^2\right\} = \int\cdots\int [l'_n(x|\theta)]^2 f_n(x|\theta)\,dx_1\cdots dx_n,$$

which is an $n$-dimensional integral. We further assume that we can exchange the order of
differentiation and integration; then we have

$$\int f'_n(x|\theta)\,dx = \frac{\partial}{\partial\theta}\int f_n(x|\theta)\,dx = 0$$

and

$$\int f''_n(x|\theta)\,dx = \frac{\partial^2}{\partial\theta^2}\int f_n(x|\theta)\,dx = 0$$

It is easy to see that

$$E_\theta[l'_n(X|\theta)] = \int l'_n(x|\theta)f_n(x|\theta)\,dx = \int \frac{f'_n(x|\theta)}{f_n(x|\theta)}f_n(x|\theta)\,dx = \int f'_n(x|\theta)\,dx = 0 \qquad (5)$$
Therefore, the definition of Fisher information for the sample $X_1, \ldots, X_n$ can be rewritten as

$$I_n(\theta) = \mathrm{Var}_\theta[l'_n(X|\theta)].$$

It is similar to prove that the Fisher information can also be calculated as

$$I_n(\theta) = -E_\theta[l''_n(X|\theta)].$$

From the definition of $l_n(x|\theta)$, it follows that

$$l''_n(x|\theta) = \sum_{i=1}^n l''(x_i|\theta).$$

Therefore, the Fisher information

$$I_n(\theta) = -E_\theta[l''_n(X|\theta)] = -E_\theta\left[\sum_{i=1}^n l''(X_i|\theta)\right] = -\sum_{i=1}^n E_\theta[l''(X_i|\theta)] = nI(\theta).$$
In other words, the Fisher information in a random sample of size n is simply n times the
Fisher information in a single observation.
Example 3: Suppose $X_1, \ldots, X_n$ form a random sample from a Bernoulli distribution for
which the parameter $\theta$ is unknown ($0 < \theta < 1$). Then the Fisher information $I_n(\theta)$ in this
sample is

$$I_n(\theta) = nI(\theta) = \frac{n}{\theta(1-\theta)}.$$

Example 4: Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$, where $\mu$ is unknown but the
value of $\sigma^2$ is given. Then the Fisher information $I_n(\mu)$ in this sample is

$$I_n(\mu) = nI(\mu) = \frac{n}{\sigma^2}.$$
2 Cramer-Rao Lower Bound and Asymptotic Distribution of Maximum Likelihood Estimators
Suppose that we have a random sample $X_1, \ldots, X_n$ coming from a distribution for which
the pdf or pmf is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown. We will show how
to use Fisher information to determine the lower bound for the variance of an estimator of
the parameter $\theta$.
Let $\hat\theta = r(X_1, \ldots, X_n) = r(X)$ be an arbitrary estimator of $\theta$. Assume $E_\theta(\hat\theta) = m(\theta)$, and that
the variance of $\hat\theta$ is finite. Let us consider the random variable $l'_n(X|\theta)$ defined in (4); it was
shown in (5) that $E_\theta[l'_n(X|\theta)] = 0$. Therefore, the covariance between $\hat\theta$ and $l'_n(X|\theta)$ is

$$\begin{aligned}
\mathrm{Cov}_\theta[\hat\theta, l'_n(X|\theta)] &= E_\theta\left\{[\hat\theta - E_\theta(\hat\theta)][l'_n(X|\theta) - E_\theta(l'_n(X|\theta))]\right\}
= E_\theta\left\{[r(X) - m(\theta)]\,l'_n(X|\theta)\right\} \\
&= E_\theta[r(X)\,l'_n(X|\theta)] - m(\theta)E_\theta[l'_n(X|\theta)] = E_\theta[r(X)\,l'_n(X|\theta)] \\
&= \int\cdots\int r(x)\,l'_n(x|\theta)f_n(x|\theta)\,dx_1\cdots dx_n \\
&= \int\cdots\int r(x)\,f'_n(x|\theta)\,dx_1\cdots dx_n \qquad \text{(use Equation 4)} \\
&= \frac{\partial}{\partial\theta}\int\cdots\int r(x)\,f_n(x|\theta)\,dx_1\cdots dx_n \\
&= \frac{\partial}{\partial\theta}E_\theta[\hat\theta] = m'(\theta) \qquad (6)
\end{aligned}$$
By the Cauchy-Schwarz inequality and the definition of $I_n(\theta)$,

$$\left\{\mathrm{Cov}_\theta[\hat\theta, l'_n(X|\theta)]\right\}^2 \le \mathrm{Var}_\theta[\hat\theta]\,\mathrm{Var}_\theta[l'_n(X|\theta)] = \mathrm{Var}_\theta[\hat\theta]\,I_n(\theta),$$

i.e.,

$$[m'(\theta)]^2 \le \mathrm{Var}_\theta[\hat\theta]\,I_n(\theta) = nI(\theta)\,\mathrm{Var}_\theta[\hat\theta].$$

Finally, we get the lower bound on the variance of an arbitrary estimator $\hat\theta$ as

$$\mathrm{Var}_\theta[\hat\theta] \ge \frac{[m'(\theta)]^2}{nI(\theta)} \qquad (7)$$
The inequality (7) is called the information inequality; it is also known as the Cramer-Rao
inequality in honor of the Swedish statistician H. Cramer and the Indian statistician C. R. Rao,
who independently developed it during the 1940s. The information inequality
shows that as $I(\theta)$ increases, the variance of the estimator decreases, and therefore the quality
of the estimator increases; that is why the quantity is called information.
If $\hat\theta$ is an unbiased estimator, then $m(\theta) = E_\theta(\hat\theta) = \theta$, so $m'(\theta) = 1$. Hence, by the information
inequality, for an unbiased estimator $\hat\theta$,

$$\mathrm{Var}_\theta[\hat\theta] \ge \frac{1}{nI(\theta)}.$$
The right hand side is called the Cramer-Rao lower bound (CRLB): under certain
conditions, no unbiased estimator of the parameter $\theta$ based on an i.i.d. sample of size
$n$ can have a variance smaller than the CRLB.
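For intuition, the following simulation sketch (added here, not part of the notes; the values of $\mu$, $\sigma$, and $n$ are arbitrary) checks the bound in the setting of Example 4: for $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, the CRLB for unbiased estimators of $\mu$ is $1/[nI(\mu)] = \sigma^2/n$, and the sample mean attains it.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 2.0, 1.5, 50, 20_000

# Empirical variance of the sample mean over many replicated samples
xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
crlb = sigma**2 / n          # 1 / (n * I(mu)) with I(mu) = 1/sigma^2

print("empirical Var(sample mean):", xbars.var(ddof=1))
print("CRLB sigma^2/n:            ", crlb)
```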
Example 5: Suppose we have a random sample $X_1, \ldots, X_n$ from a normal distribution $N(\mu, \theta)$,
with $\mu$ given and the variance $\theta$ unknown. Calculate the lower bound of variance for any
estimator, and compare it to the variance of the sample variance $S^2$.
Solution: We know

$$f(x|\theta) = \frac{1}{\sqrt{2\pi\theta}}\exp\left\{-\frac{(x-\mu)^2}{2\theta}\right\},$$

then

$$l(x|\theta) = -\frac{(x-\mu)^2}{2\theta} - \frac{1}{2}\log 2\pi - \frac{1}{2}\log\theta.$$

Hence

$$l'(x|\theta) = \frac{(x-\mu)^2}{2\theta^2} - \frac{1}{2\theta},$$

and

$$l''(x|\theta) = -\frac{(x-\mu)^2}{\theta^3} + \frac{1}{2\theta^2}.$$

Therefore, since $E[(X-\mu)^2] = \theta$,

$$I(\theta) = -E[l''(X|\theta)] = E\left[\frac{(X-\mu)^2}{\theta^3} - \frac{1}{2\theta^2}\right] = \frac{1}{\theta^2} - \frac{1}{2\theta^2} = \frac{1}{2\theta^2},$$

and

$$I_n(\theta) = nI(\theta) = \frac{n}{2\theta^2}.$$

Finally, we have the Cramer-Rao lower bound $\dfrac{2\theta^2}{n}$.
The sample variance is defined as

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2,$$

and it is known that

$$\frac{(n-1)S^2}{\theta} \sim \chi^2_{n-1},$$

then

$$\mathrm{Var}\left[\frac{n-1}{\theta}S^2\right] = \frac{(n-1)^2}{\theta^2}\mathrm{Var}\left[S^2\right] = 2(n-1).$$

Therefore,

$$\mathrm{Var}\left[S^2\right] = \frac{2\theta^2}{n-1} > \frac{2\theta^2}{n},$$

i.e., the variance of the estimator $S^2$ is bigger than the Cramer-Rao lower bound.
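The gap between $\mathrm{Var}(S^2)$ and the CRLB is easy to see numerically. The sketch below (an added illustration, with arbitrary choices of $\mu$, $\theta$, and $n$) replicates many samples and compares the empirical variance of $S^2$ with both $2\theta^2/(n-1)$ and the bound $2\theta^2/n$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, theta, n, reps = 0.0, 4.0, 20, 50_000   # theta is the true variance

samples = rng.normal(mu, np.sqrt(theta), size=(reps, n))
s2 = samples.var(axis=1, ddof=1)            # sample variance S^2 for each replicate

print("empirical Var(S^2):    ", s2.var(ddof=1))
print("theory 2*theta^2/(n-1):", 2 * theta**2 / (n - 1))
print("CRLB 2*theta^2/n:      ", 2 * theta**2 / n)
```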
Now let us consider the MLE $\hat\theta$ of $\theta$. To make the notation clear, let us assume the true value of
$\theta$ is $\theta_0$. We shall prove that when the sample size $n$ is very large, the distribution of the MLE
$\hat\theta$ is approximately normal with mean $\theta_0$ and variance $1/[nI(\theta_0)]$. Since this is merely a
limiting result, which holds as the sample size tends to infinity, we say that the MLE is
asymptotically unbiased and refer to the variance of the limiting normal distribution as
the asymptotic variance of the MLE. More specifically, we have the following theorem:
Theorem (The asymptotic distribution of the MLE): Let $X_1, \ldots, X_n$ be a sample of size
$n$ from a distribution for which the pdf or pmf is $f(x|\theta)$, with $\theta$ the unknown parameter.
Assume that the true value of $\theta$ is $\theta_0$, and that the MLE of $\theta$ is $\hat\theta$. Then the probability distribution
of $\sqrt{nI(\theta_0)}(\hat\theta - \theta_0)$ tends to a standard normal distribution. In other words, the asymptotic
distribution of $\hat\theta$ is

$$N\left(\theta_0, \frac{1}{nI(\theta_0)}\right)$$

Proof: We shall prove that

$$\sqrt{nI(\theta_0)}(\hat\theta - \theta_0) \to N(0, 1)$$

asymptotically. We will only give a sketch of the proof; the details of the argument are
beyond the scope of this course.
Recall that the log-likelihood function is

$$l(\theta) = \sum_{i=1}^n \log f(X_i|\theta)$$

and $\hat\theta$ is the solution to $l'(\theta) = 0$. We apply a Taylor expansion of $l'(\hat\theta)$ at the point $\theta_0$, yielding

$$0 = l'(\hat\theta) \approx l'(\theta_0) + (\hat\theta - \theta_0)\,l''(\theta_0)$$

Therefore,

$$\hat\theta - \theta_0 \approx \frac{-l'(\theta_0)}{l''(\theta_0)}$$

and

$$\sqrt{n}(\hat\theta - \theta_0) \approx \frac{n^{-1/2}\,l'(\theta_0)}{-n^{-1}\,l''(\theta_0)}$$
First, let us consider the numerator of the last expression above. Its expectation is

$$E[n^{-1/2}l'(\theta_0)] = n^{-1/2}\sum_{i=1}^n E\left[\frac{\partial}{\partial\theta}\log f(X_i|\theta_0)\right] = n^{-1/2}\sum_{i=1}^n E[l'(X_i|\theta_0)] = 0,$$

and its variance is

$$\mathrm{Var}[n^{-1/2}l'(\theta_0)] = \frac{1}{n}\sum_{i=1}^n E\left\{\left[\frac{\partial}{\partial\theta}\log f(X_i|\theta_0)\right]^2\right\} = \frac{1}{n}\sum_{i=1}^n E\left\{[l'(X_i|\theta_0)]^2\right\} = I(\theta_0).$$
Next, we consider the denominator:

$$-\frac{1}{n}l''(\theta_0) = -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta_0).$$

By the law of large numbers, this expression converges to

$$-E\left[\frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta_0)\right] = I(\theta_0).$$

We thus have

$$\sqrt{n}(\hat\theta - \theta_0) \approx \frac{n^{-1/2}\,l'(\theta_0)}{I(\theta_0)}.$$
Therefore,

$$E\left[\sqrt{n}(\hat\theta - \theta_0)\right] \approx \frac{E[n^{-1/2}l'(\theta_0)]}{I(\theta_0)} = 0,$$

and

$$\mathrm{Var}\left[\sqrt{n}(\hat\theta - \theta_0)\right] \approx \frac{\mathrm{Var}[n^{-1/2}l'(\theta_0)]}{I^2(\theta_0)} = \frac{I(\theta_0)}{I^2(\theta_0)} = \frac{1}{I(\theta_0)}.$$

As $n \to \infty$, applying the central limit theorem, we have

$$\sqrt{n}(\hat\theta - \theta_0) \to N\left(0, \frac{1}{I(\theta_0)}\right),$$

i.e.,

$$\sqrt{nI(\theta_0)}(\hat\theta - \theta_0) \to N(0, 1).$$

This completes the proof.
This theorem indicates the asymptotic optimality of the maximum likelihood estimator, since
the asymptotic variance of the MLE achieves the CRLB. For this reason, the MLE is frequently
used, especially with large samples.
Example 6: Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. random variables on the interval $[0, 1]$
with the density function

$$f(x|\theta) = \frac{\Gamma(2\theta)}{\Gamma(\theta)^2}[x(1-x)]^{\theta-1},$$

where $\theta > 0$ is a parameter to be estimated from the sample. It can be shown that

$$E(X) = \frac{1}{2}, \qquad \mathrm{Var}(X) = \frac{1}{4(2\theta+1)}.$$

What is the asymptotic variance of the MLE?
Solution: Let's calculate $I(\theta)$. Firstly,

$$\log f(x|\theta) = \log\Gamma(2\theta) - 2\log\Gamma(\theta) + (\theta-1)\log[x(1-x)]$$
Then,

$$\frac{\partial\log f(x|\theta)}{\partial\theta} = \frac{2\Gamma'(2\theta)}{\Gamma(2\theta)} - \frac{2\Gamma'(\theta)}{\Gamma(\theta)} + \log[x(1-x)],$$

and

$$\frac{\partial^2\log f(x|\theta)}{\partial\theta^2} = \frac{2\Gamma''(2\theta)\,2\Gamma(2\theta) - 2\Gamma'(2\theta)\,2\Gamma'(2\theta)}{\Gamma(2\theta)^2} - \frac{2\Gamma''(\theta)\Gamma(\theta) - 2\Gamma'(\theta)\,\Gamma'(\theta)}{\Gamma(\theta)^2}
= \frac{4\Gamma''(2\theta)\Gamma(2\theta) - (2\Gamma'(2\theta))^2}{\Gamma(2\theta)^2} - \frac{2\Gamma''(\theta)\Gamma(\theta) - 2(\Gamma'(\theta))^2}{\Gamma(\theta)^2}.$$

Therefore,

$$I(\theta) = -E\left[\frac{\partial^2\log f(x|\theta)}{\partial\theta^2}\right] = \frac{2\Gamma''(\theta)\Gamma(\theta) - 2(\Gamma'(\theta))^2}{\Gamma^2(\theta)} - \frac{4\Gamma''(2\theta)\Gamma(2\theta) - (2\Gamma'(2\theta))^2}{\Gamma^2(2\theta)}.$$
The asymptotic variance of the MLE is $\dfrac{1}{nI(\theta)}$.
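For a concrete number, the expression above can be evaluated through the trigamma function, since $\frac{d^2}{d\theta^2}\log\Gamma(\theta) = \psi_1(\theta)$ gives the equivalent form $I(\theta) = 2\psi_1(\theta) - 4\psi_1(2\theta)$. The sketch below (an added check, assuming SciPy is available, with an arbitrary $\theta$) also verifies this by Monte Carlo, using the fact that the density above is the Beta$(\theta,\theta)$ density, so by formula (2), $I(\theta) = \mathrm{Var}_\theta[l'(X|\theta)] = \mathrm{Var}\{\log[X(1-X)]\}$.

```python
import numpy as np
from scipy.special import polygamma

def fisher_info(theta):
    # I(theta) = 2*psi_1(theta) - 4*psi_1(2*theta), where psi_1 is the trigamma function
    return 2 * polygamma(1, theta) - 4 * polygamma(1, 2 * theta)

theta = 1.7                                    # arbitrary illustrative value
rng = np.random.default_rng(3)
x = rng.beta(theta, theta, size=1_000_000)     # f(x|theta) is the Beta(theta, theta) density
mc_info = np.var(np.log(x * (1 - x)), ddof=1)  # variance of the score = Fisher information

print("I(theta) via trigamma:   ", fisher_info(theta))
print("I(theta) via Monte Carlo:", mc_info)
print("asymptotic variance of the MLE for n = 100:", 1 / (100 * fisher_info(theta)))
```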
Example 7: The Pareto distribution has been used in economics as a model for a density
function with a slowly decaying tail:

$$f(x|x_0, \theta) = \theta x_0^\theta x^{-\theta-1}, \qquad x \ge x_0, \quad \theta > 1$$

Assume that $x_0 > 0$ is given and that $X_1, X_2, \ldots, X_n$ is an i.i.d. sample. Find the asymptotic
distribution of the MLE.
Solution: The asymptotic distribution of $\hat\theta_{\mathrm{MLE}}$ is $N\left(\theta, \frac{1}{nI(\theta)}\right)$. Let's calculate $I(\theta)$.
Firstly,

$$\log f(x|\theta) = \log\theta + \theta\log x_0 - (\theta+1)\log x$$

Then,

$$\frac{\partial\log f(x|\theta)}{\partial\theta} = \frac{1}{\theta} + \log x_0 - \log x,$$

and

$$\frac{\partial^2\log f(x|\theta)}{\partial\theta^2} = -\frac{1}{\theta^2}.$$

So,

$$I(\theta) = -E\left[\frac{\partial^2\log f(x|\theta)}{\partial\theta^2}\right] = \frac{1}{\theta^2}.$$

Therefore, the asymptotic distribution of the MLE is

$$N\left(\theta, \frac{1}{nI(\theta)}\right) = N\left(\theta, \frac{\theta^2}{n}\right)$$
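A simulation sketch of this asymptotic result follows (added for illustration; the explicit form of the MLE, $\hat\theta = n/\sum_i\log(X_i/x_0)$ when $x_0$ is known, is a standard fact not derived in these notes, and the parameter values are arbitrary). It draws Pareto samples by inverse-CDF sampling and compares the spread of $\sqrt{n}(\hat\theta - \theta_0)$ with the predicted standard deviation $\theta_0$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, x0, n, reps = 3.0, 1.0, 500, 20_000

# Inverse-CDF sampling: F(x) = 1 - (x0/x)^theta  =>  X = x0 * U^(-1/theta)
u = rng.uniform(size=(reps, n))
x = x0 * u ** (-1.0 / theta0)

theta_hat = n / np.log(x / x0).sum(axis=1)       # MLE for each replicated sample
z = np.sqrt(n) * (theta_hat - theta0)

print("mean of MLE (should be near theta0): ", theta_hat.mean())
print("std of sqrt(n)*(theta_hat - theta0): ", z.std(ddof=1))
print("predicted asymptotic std theta0:     ", theta0)
```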
3 Approximate Confidence Intervals
In previous lectures, we discussed exact confidence intervals. However, constructing an
exact confidence interval requires detailed knowledge of the sampling distribution as well as
some cleverness. An alternative method of constructing confidence intervals is based on the
large sample theory of the previous section.
According to the large sample theory result, the distribution of $\sqrt{nI(\theta_0)}(\hat\theta - \theta_0)$ is approximately
the standard normal distribution. Since the true value $\theta_0$ of $\theta$ is unknown, we
will use the estimated value $\hat\theta$ to estimate $I(\theta_0)$. It can be further argued that the distribution
of $\sqrt{nI(\hat\theta)}(\hat\theta - \theta_0)$ is also approximately standard normal. Since the standard normal
distribution is symmetric about 0,

$$P\left[-z(1-\alpha/2) \le \sqrt{nI(\hat\theta)}(\hat\theta - \theta_0) \le z(1-\alpha/2)\right] \approx 1 - \alpha.$$

Manipulation of the inequalities yields

$$\hat\theta - z(1-\alpha/2)\frac{1}{\sqrt{nI(\hat\theta)}} \le \theta_0 \le \hat\theta + z(1-\alpha/2)\frac{1}{\sqrt{nI(\hat\theta)}}$$

as an approximate $100(1-\alpha)\%$ confidence interval.
Example 8: Let $X_1, \ldots, X_n$ denote a random sample from a Poisson distribution that has
mean $\lambda > 0$.
It is easy to see that the MLE of $\lambda$ is $\hat\lambda = \bar X$. Since the sum of independent Poisson random
variables follows a Poisson distribution whose parameter is the sum of the parameters
of the individual summands, $n\hat\lambda = \sum_{i=1}^n X_i$ follows a Poisson distribution with mean $n\lambda$.
Therefore the sampling distribution of $\hat\lambda$ is known, and it depends on the true value of $\lambda$.
Exact confidence intervals for $\lambda$ may be obtained by using this fact, and special tables are
available.
For large samples, confidence intervals may be derived as follows. First, we need to calculate
$I(\lambda)$. The probability mass function of a Poisson random variable with parameter $\lambda$ is

$$f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!} \qquad \text{for } x = 0, 1, 2, \ldots,$$

then

$$\log f(x|\lambda) = -\lambda + x\log\lambda - \log x!$$

It is easy to verify that

$$-\frac{\partial^2}{\partial\lambda^2}\log f(x|\lambda) = \frac{x}{\lambda^2},$$

therefore

$$I(\lambda) = E\left[\frac{X}{\lambda^2}\right] = \frac{1}{\lambda}.$$

Thus, an approximate $100(1-\alpha)\%$ confidence interval for $\lambda$ is

$$\left[\bar X - z(1-\alpha/2)\sqrt{\frac{\bar X}{n}},\ \ \bar X + z(1-\alpha/2)\sqrt{\frac{\bar X}{n}}\right].$$
Note that in this case, the asymptotic variance is in fact the exact variance, as we can verify.
The confidence interval, however, is only approximate, since the sampling distribution is
only approximately normal.
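The interval is straightforward to compute. Below is a small sketch (an added illustration, assuming SciPy for the normal quantile; the data are simulated only for demonstration) that builds the approximate 95% interval for $\lambda$ from a Poisson sample.

```python
import numpy as np
from scipy.stats import norm

def poisson_approx_ci(x, alpha=0.05):
    """Approximate 100(1-alpha)% CI for the Poisson mean, based on n*I(lambda_hat)."""
    x = np.asarray(x, dtype=float)
    n, lam_hat = x.size, x.mean()              # MLE of lambda is the sample mean
    z = norm.ppf(1 - alpha / 2)                # z(1 - alpha/2)
    half_width = z * np.sqrt(lam_hat / n)      # z / sqrt(n * I(lam_hat)), with I(lam) = 1/lam
    return lam_hat - half_width, lam_hat + half_width

rng = np.random.default_rng(5)
sample = rng.poisson(lam=4.2, size=200)        # simulated data with true lambda = 4.2
print("lambda_hat =", sample.mean())
print("approximate 95% CI:", poisson_approx_ci(sample))
```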
4 Exercises
Problem 1: Suppose that a random variable $X$ has a Poisson distribution for which the
mean $\lambda$ is unknown ($\lambda > 0$). Find the Fisher information $I(\lambda)$ in $X$.
Problem 2: Suppose that a random variable $X$ has a normal distribution for which the
mean is 0 and the standard deviation $\sigma$ is unknown ($\sigma > 0$). Find the Fisher information
$I(\sigma)$ in $X$.
Problem 3: Suppose that a random variable $X$ has a normal distribution for which the
mean is 0 and the standard deviation $\sigma$ is unknown ($\sigma > 0$). Find the Fisher information
$I(\sigma^2)$ in $X$. Note that in this problem, the variance $\sigma^2$ is regarded as the parameter, whereas
in Problem 2 the standard deviation $\sigma$ is regarded as the parameter.
Problem 4: The Rayleigh distribution is defined as

$$f(x|\theta) = \frac{x}{\theta^2}e^{-x^2/(2\theta^2)}, \qquad x \ge 0, \quad \theta > 0$$

Assume that $X_1, X_2, \ldots, X_n$ is an i.i.d. sample from the Rayleigh distribution. Find the
asymptotic variance of the MLE.
Problem 5: Suppose that $X_1, \ldots, X_n$ form a random sample from a gamma distribution
for which the value of the parameter $\alpha$ is unknown and the value of the parameter $\beta$ is known.
Show that if $n$ is large, the distribution of the MLE of $\alpha$ will be approximately a normal
distribution with mean $\alpha$ and variance

$$\frac{[\Gamma(\alpha)]^2}{n\left\{\Gamma(\alpha)\Gamma''(\alpha) - [\Gamma'(\alpha)]^2\right\}}$$
Problem 6: Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from an exponential distribution with
the density function

$$f(x|\theta) = \frac{1}{\theta}e^{-x/\theta}, \qquad x \ge 0, \quad \theta > 0$$

a. Find the MLE of $\theta$.
b. What is the exact sampling distribution of the MLE?
c. Use the central limit theorem to find a normal approximation to the sampling distribution.
d. Show that the MLE is unbiased, and find its exact variance.
e. Is there any other unbiased estimate with smaller variance?
f. Using the large sample property of the MLE, find the asymptotic distribution of the MLE. Is
it the same as in c.?
g. Find the form of an approximate confidence interval for $\theta$.
h. Find the form of an exact confidence interval for $\theta$.