
Mathematical Statistics

Math 392 Course Notes


Spring, 2017

Albyn C. Jones
Department of Mathematics
Reed College
Contents

1 Review: Probability
  1.1 Distributions
    1.1.1 Discrete Distributions
    1.1.2 Continuous Distributions
  1.2 Location and Scale parameters
  1.3 Joint Distributions
  1.4 Marginal Distributions
  1.5 Conditional Distributions
  1.6 MGF's of sums of RV's
  1.7 Multivariate Moments
  1.8 The Multivariate Normal Distribution
  1.9 Marginal and Conditional Normal distributions
  1.10 MGF's and Characteristic Functions
  1.11 Transforms of random variables
  1.12 Exercises

2 The Method of Moments
  2.1 Exercises
  2.2 Comment

3 Least Squares
  3.1 The Modern Linear Model
  3.2 Parameter Estimation by Least Squares
  3.3 Projection Operators on Linear Spaces
  3.4 Distributional Results
    3.4.1 The distribution of β̂
    3.4.2 Fitted Values
    3.4.3 Predictions
    3.4.4 Residuals
  3.5 t and F statistics
    3.5.1 t statistics
    3.5.2 F statistics
    3.5.3 t² is an F statistic!
  3.6 The Multiple Correlation Coefficient: R²
  3.7 Exercises

4 Bayes, Laplace, and Inverse Probability
  4.1 Bayes
  4.2 Laplace
  4.3 Bayesian Inference
    4.3.1 Conjugate Priors
  4.4 Exercises

5 R. A. Fisher and Maximum Likelihood
  5.1 Background
  5.2 Fisher's 1912 paper
  5.3 A Computational Example
  5.4 Properties of MLE's
  5.5 Exercises

6 Sufficiency and Efficiency
  6.1 Background
  6.2 Sufficiency
  6.3 Examples
  6.4 The Exponential Family
  6.5 Information, Sufficiency and Unbiasedness
    6.5.1 The Information Inequality
    6.5.2 The role of sufficiency
    6.5.3 Notes on Unbiased estimation
  6.6 Exercises

7 Formal Inference
  7.1 Confidence Intervals
  7.2 Hypothesis Tests
    7.2.1 Power
  7.3 Likelihood
  7.4 Tests based on the Likelihood function
    7.4.1 Score Tests
    7.4.2 Wald Tests
  7.5 Confidence Intervals from Hypothesis Tests
  7.6 Exercises

8 Computational Statistics
  8.1 Newton and Quasi-Newton algorithms
    8.1.1 The EM algorithm
  8.2 Bootstrapping

9 Markov Chain Monte Carlo
  9.1 Gibbs sampling
  9.2 The Changepoint Problem
  9.3 The Metropolis-Hastings Algorithm
  9.4 Exercises
Preface

This set of notes on mathematical statistics has a somewhat unusual structure for a mathematics text. My original motivation for Yet Another Introduction to Mathematical Statistics was to present an historical overview of the development of modern mathematical statistics, inspired by the works of Stigler and Hald. A strictly historical development is unfortunately in conflict with a strictly logical development; I have chosen a compromise which is neither. Various logical threads that form the warp and woof of historical developments are presented as separate chapters. The original chronology cannot be faithfully reflected in the sequence of chapters in the text.
Historically, Bayes, Laplace, and Gauss wrote seminal works on ‘inverse
probability’ (now called Bayesian inference) and least squares long before the
modern development of the theory underlying hypothesis tests and confidence
intervals. Thus, for example, I have chosen to present the t-test and F-test
arising in the context of least squares before the general ideas of hypothesis
testing.
I see the modern era as beginning with the work of Karl Pearson on the Method of Moments; this was the first conscious attempt to create a theory of inference beyond that implied by the law of large numbers. One might argue, and it has been argued (by Stigler, for example), that many aspects of modern inference were at least foreshadowed by Laplace and Gauss, such as decision theory and minimum-variance unbiased estimates. Least squares was certainly well known and used in the 19th century, but limited in application by the difficulties of large scale matrix computations. Inverse probability was probably the dominant theoretical idea. It appears to have played a role in both the development of Pearson's method of moments and Fisher's maximum likelihood estimation, but remained a rarity in practice due to the difficulty of application to any non-trivial problem. The modern Bayesian revival began with purely theoretical development, and only gathered steam in the applied world with the advent of powerful modern computers.
I hope the reader will be tolerant of my peculiar project.
Chapter 1

Review: Probability

This chapter contains results from probability theory that I will assume you
know. You probably don’t know all the probability distributions discussed
below, so read this chapter. It also establishes some notation that will (I
hope!) be consistently used in the rest of the notes.

1.1 Distributions
Here are a few probability distributions that arise in applications and exercises in statistics books. It is useful to get in the habit of writing the probability distribution or density in the form f(x | θ), where x is the argument and θ is the parameter (or list of parameters). If the support of the density depends on a parameter, include the support in the definition of the density via the indicator function of the set on which the density is non-zero; for example, I(a < x < b) for the open interval (a, b).

1.1.1 Discrete Distributions

Binomial(n, p) The number of successes in n independent trials, each with probability p of success and probability of failure q = 1 − p.
$$ f(k \mid n, p) = \binom{n}{k} p^k q^{n-k} $$
for k = 0, 1, 2, . . . , n.

Poisson(λ) The Poisson distribution is a natural probability model for counts of 'rare' events.
$$ f(k \mid \lambda) = \lambda^k e^{-\lambda}/k! $$
for k = 0, 1, 2, . . . .

Geometric(p) The geometric distribution counts the number of failures before the first success for independent trials with probability p of success and probability of failure q = 1 − p.
$$ f(k \mid p) = p\,q^k $$
for k = 0, 1, 2, . . . .
Warning: the geometric distribution is sometimes defined as the number of trials up to and including the first success. For compatibility with R we will use the failures-before-the-first-success definition above.

Negative Binomial(r, p) The negative binomial distribution arises when counting the number of failures before the r-th success for independent trials with probability p of success and probability of failure q = 1 − p.
$$ f(k \mid r, p) = \binom{r+k-1}{k} p^r q^k $$
for k = 0, 1, 2, . . . .
Warning: the negative binomial distribution is sometimes defined as the number of trials up to and including the r-th success. For compatibility with R we will use the failures-before-the-r-th-success definition above.
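As a quick check of these conventions, R's dgeom and dnbinom are parametrized by the number of failures, matching the definitions above; a minimal sketch (the values of k, p, and r are arbitrary):

# R parametrizes the geometric and negative binomial by the number of
# failures before the first (or r-th) success, matching the definitions above.
k <- 0:5; p <- 0.3; r <- 4
dgeom(k, prob = p)                      # equals p * (1 - p)^k
p * (1 - p)^k
dnbinom(k, size = r, prob = p)          # equals choose(r + k - 1, k) p^r (1 - p)^k
choose(r + k - 1, k) * p^r * (1 - p)^k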

Hypergeometric(n, p, N) The hypergeometric counts the number of successes in n trials sampling without replacement from a finite population of size N, with proportion p labeled 'success', and q = 1 − p.
$$ f(k \mid n, p, N) = \frac{\binom{Np}{k}\binom{Nq}{n-k}}{\binom{N}{n}}. $$
Equivalently, we might define the distribution in terms of the sizes of the two sub-populations. If we have A in one group and B in the other, then N = A + B, and the hypergeometric distribution is given as
$$ f(k \mid n, A, B) = \frac{\binom{A}{k}\binom{B}{n-k}}{\binom{A+B}{n}}. $$

Multinomial(n, $\vec p$) The multinomial models counts by category for independent trials with probabilities $\vec p = (p_1, p_2, \ldots, p_r)$ for the r categories.
$$ f(k_1, k_2, \ldots, k_r \mid n, \vec p) = \frac{n!}{k_1!\,k_2!\cdots k_r!}\, p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r} $$
where $\sum_j p_j = 1$ and $\sum_j k_j = n$.

1.1.2 Continuous Distributions


Uniform(a, b) The uniform is also called the 'rectangular' distribution.
$$ f(x \mid a, b) = \frac{1}{b-a}\, I(a < x < b). $$
Note the use of the indicator function to delimit the support of the density to the interval (a, b); in other words, $I(x \in A)$ is 1 if $x \in A$ and 0 otherwise. Occasionally you will see the term 'random number' used when clearly the intention is a random value from the Uniform(0,1) distribution.

Normal(µ, σ²) The normal is also known as the Gaussian distribution, and even occasionally as the Laplacian.
$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. $$
The standard normal is N(0,1).


Gamma(α, λ) The gamma distribution is named for Euler's gamma integral:
$$ \Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx. $$
The general form is
$$ f(x \mid \alpha, \lambda) = \frac{\lambda^\alpha x^{\alpha-1}}{\Gamma(\alpha)}\, e^{-\lambda x}\, I(x > 0) $$
for α > 0 and λ > 0. Special cases include the Exponential(λ) distribution, which is a Gamma(1, λ), and the $\chi^2_n$, which is a Gamma(n/2, 1/2). λ is often called a rate parameter. An alternative form, using β as a scale parameter, is given by
$$ f(x \mid \alpha, \beta) = \frac{x^{\alpha-1}}{\beta^\alpha\,\Gamma(\alpha)}\, e^{-x/\beta}\, I(x > 0). $$

Beta(r, s) The beta distribution is named for the beta integral
$$ \mathrm{Beta}(r, s) = \int_0^1 x^{r-1}(1-x)^{s-1}\,dx = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)}. $$
Dividing the integrand by the value of the integral gives a density:
$$ f(x \mid r, s) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\, x^{r-1}(1-x)^{s-1} $$
for r > 0, s > 0, and 0 ≤ x ≤ 1.

Cauchy(µ, σ) The Cauchy distribution was invented by Cauchy as a source of counterexamples; for example, a Cauchy random variable does not have an expected value.
$$ f(x \mid \mu, \sigma) = \frac{1}{\pi\sigma}\,\frac{1}{1 + (x-\mu)^2/\sigma^2} $$
for σ > 0.
If X and Y are independent Normal(0,1) random variables, then X/Y has a Cauchy(0,1) distribution. The Cauchy(0,1) is also the t distribution with 1 degree of freedom.

Double Exponential(µ, σ) The double exponential is also known as the Laplace distribution:
$$ f(x \mid \mu, \sigma) = \frac{1}{2\sigma}\, e^{-|x-\mu|/\sigma} $$
for σ > 0.

Lognormal(µ, σ²) If X has a Lognormal(µ, σ²) distribution, then log X has a Normal(µ, σ²) distribution.
$$ f(x \mid \mu, \sigma^2) = \frac{1}{x\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}\, I(x > 0) $$
for σ > 0.

Logistic(µ, σ)
$$ f(x \mid \mu, \sigma) = \frac{1}{\sigma}\,\frac{e^{-(x-\mu)/\sigma}}{\bigl(1 + e^{-(x-\mu)/\sigma}\bigr)^2} $$
for σ > 0.

Pareto(α, β) Named for the Italian economist Vilfredo Pareto, who used it to model wealth distributions. It is also known as a power law distribution.
$$ f(x \mid \alpha, \beta) = \frac{\beta\,\alpha^\beta}{x^{\beta+1}}\, I(x > \alpha) $$
for α > 0, β > 0.

Weibull(α, β) The Weibull distribution is often used as a model for lifetimes.
$$ f(x \mid \alpha, \beta) = \frac{\alpha\, x^{\alpha-1}}{\beta^\alpha}\, e^{-(x/\beta)^\alpha}\, I(x > 0). $$

Student's t with n degrees of freedom The t distribution arises in sampling from a normal distribution.
$$ f_n(x) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)}\, \frac{1}{(1 + x^2/n)^{\frac{n+1}{2}}} $$
for n ≥ 1. The $t_1$ is the same as a Cauchy(0, 1). It is a calculus exercise to show that if Z is a standard normal random variable, and X is a $\chi^2_n$ random variable independent of Z, then
$$ \frac{Z}{\sqrt{X/n}} $$
has a $t_n$ distribution.
Fisher's $F_{n,m}$ distribution The F distribution also arises in the context of sampling from normal distributions and hypothesis testing in linear models.
$$ f_{n,m}(x) = \frac{(n/m)^{n/2}}{\mathrm{Beta}(n/2, m/2)}\, \frac{x^{(n-2)/2}}{(1 + (n/m)x)^{(n+m)/2}} $$
for n ≥ 1, m ≥ 1 and x ≥ 0.
The $F_{n,m}$ can be derived as the ratio
$$ \frac{U/n}{V/m} $$
where U has a $\chi^2_n$ distribution and V is an independent $\chi^2_m$ random variable. The square of a $t_n$ random variable has an $F_{1,n}$ distribution.

1.2 Location and Scale parameters


It is useful to develop a few elementary results about location and scale parameters. If we have a random variable X, then we know that
$$ E(\mu + \sigma X) = \mu + \sigma E(X) $$
since expectation is a linear operator. The parameter µ shifts the location of the distribution, while the parameter σ changes the scale.
If X has a density function f(x), then the cumulative distribution function (CDF) is
$$ P(X \le x) = F_X(x) = \int_{-\infty}^x f(t)\,dt $$
and
$$ \frac{\partial F(x)}{\partial x} = f(x). $$
Let Y = µ + σX. We can easily find the density function for Y by finding the CDF and differentiating:
$$ P(Y \le y) = P(\mu + \sigma X \le y) = P\!\left(X \le \frac{y-\mu}{\sigma}\right) = F_X\!\left(\frac{y-\mu}{\sigma}\right). $$
The chain rule gives us the density
$$ f_Y(y) = \frac{1}{\sigma}\, f_X\!\left(\frac{y-\mu}{\sigma}\right). $$

Example
If Z has a standard normal distribution, then
$$ f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}. $$
Hence the density for X = µ + σZ is
$$ f_X(x) = \frac{1}{\sigma}\, f_Z\!\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}. $$

Example
If Z has a standard Cauchy distribution:
$$ f_Z(z) = \frac{1}{\pi}\,\frac{1}{1+z^2}, $$
then the density for X = µ + σZ is
$$ f(x \mid \mu, \sigma) = \frac{1}{\pi\sigma}\,\frac{1}{1 + (x-\mu)^2/\sigma^2}. $$

If X has moment generating function $M_X(t)$, then the moment generating function of Y = µ + σX is similarly easy to describe in terms of $M_X(t)$:
$$ E e^{t(\mu + \sigma X)} = E e^{t\mu} e^{t\sigma X} = e^{t\mu} E e^{(t\sigma)X} = e^{t\mu} M_X(t\sigma). $$
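A quick numerical check of the location-scale density formula using R's dnorm and dcauchy (the values of µ and σ are arbitrary):

# Check that f_Y(y) = (1/sigma) f_X((y - mu)/sigma) for the normal and Cauchy families.
mu <- 2; sigma <- 3
y <- seq(-10, 10, by = 0.5)
max(abs(dnorm(y, mean = mu, sd = sigma) - dnorm((y - mu)/sigma)/sigma))
max(abs(dcauchy(y, location = mu, scale = sigma) - dcauchy((y - mu)/sigma)/sigma))
# Both differences are zero up to floating point error.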

1.3 Joint Distributions


The joint distribution function for a pair of random variables (X, Y), which is a random vector in $\mathbb{R}^2$, is defined to be
$$ F(x, y) = P(X \le x,\, Y \le y). $$
If F is differentiable, so that there is a density function f such that
$$ F(x, y) = \int_{-\infty}^x \int_{-\infty}^y f(s, t)\,dt\,ds, $$
then we say that the joint distribution is (absolutely) continuous.
If F can be written as a sum
$$ F(x, y) = \sum_{s \le x} \sum_{t \le y} P(X = s, Y = t), $$
then we will say that the joint distribution is discrete.


Obviously these definitions generalize to $\mathbb{R}^n$, and it is possible to have
random vectors which have some continuous components and some discrete
components.

1.4 Marginal Distributions


For continuous distributions, the marginal distribution of a component X is defined to be the univariate distribution of X, i.e.
$$ F_X(x) = P(X \le x,\, Y \le \infty) = \int_{-\infty}^x \int_{-\infty}^\infty f(s, t)\,dt\,ds. $$
It follows that X has a density function $f_X$, which must be
$$ f_X(x) = \frac{\partial}{\partial x} F_X(x) = \int_{-\infty}^\infty f(x, t)\,dt. $$
For the discrete case, the analogous result is
$$ P(X = x) = \sum_{y=-\infty}^{\infty} P(X = x, Y = y). $$

1.5 Conditional Distributions


In the continuous case, a conditional density function for X, given Y = y, should satisfy the constraint that when we compute the probability of an event, we should get the same answer if we compute it directly, or if we compute it by integrating the conditional density for X given Y = y over the range of Y:
$$ P((X, Y) \in A) = \iint_A f(s, t)\,ds\,dt = \iint_A f_{X|Y}(s \mid t)\, f_Y(t)\,ds\,dt. $$
This constraint will be satisfied if we define the conditional density to be
$$ f_{X|Y}(x \mid y) = f_{X,Y}(x, y)/f_Y(y) $$
when $f_Y(y) \ne 0$.
For the discrete case, we can base our definition directly on the definition of conditional probability:
$$ P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}, $$
again assuming that $P(Y = y) \ne 0$.

1.6 MGF’s of sums of RV’s


Often we want to derive the distribution of a sum of two or more random variables. One basic technique is to use the Moment Generating Function (MGF); if two RV's X and Y are independent, then the MGF of their sum is the product of their MGF's:
$$ E e^{t(X+Y)} = E e^{tX} e^{tY} = E e^{tX}\, E e^{tY}. $$
If we can then recognize the product of the MGF's as the MGF of a known distribution, we know the distribution of the sum. This generalizes immediately to the sum of n independent RV's.

Example
X and Y are independent Poisson($\lambda_x$) and Poisson($\lambda_y$) RV's.
$$ E e^{tX} = \sum_{k=0}^{\infty} e^{tk}\, \lambda_x^k e^{-\lambda_x}/k! = e^{-\lambda_x} \sum_{k=0}^{\infty} (\lambda_x e^t)^k/k!, $$
so
$$ M_X(t) = e^{\lambda_x(e^t - 1)}. $$
Thus the MGF of the sum is
$$ M_{X+Y}(t) = M_X(t)\,M_Y(t) = e^{(\lambda_x + \lambda_y)(e^t - 1)} $$
and the sum of the two independent Poisson RV's is Poisson, with parameter $\lambda_x + \lambda_y$.
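A quick numerical check, with arbitrary rates: convolving the two Poisson probability functions reproduces the Poisson probabilities with the summed rate.

# P(X + Y = n) by direct convolution of the two Poisson pmf's,
# compared with dpois(n, lx + ly). The rates and n are arbitrary.
lx <- 2; ly <- 3.5; n <- 4
sum(dpois(0:n, lx) * dpois(n - (0:n), ly))
dpois(n, lx + ly)                        # same value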

Example
X and Y are independent Gamma($r_x$, λ) and Gamma($r_y$, λ) RV's, respectively, where λ is the inverse of the scale parameter β.
$$ E e^{tX} = \int_0^\infty e^{tx}\, \frac{\lambda^{r_x} x^{r_x - 1}}{\Gamma(r_x)}\, e^{-\lambda x}\,dx $$
and
$$ M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^{r_x}. $$
Thus
$$ M_{X+Y}(t) = M_X(t)\,M_Y(t) = \left(\frac{\lambda}{\lambda - t}\right)^{r_x + r_y} $$
and we can immediately observe that the sum of two Gamma variates with common scale factor has a Gamma distribution with the same scale factor, while the shape parameter is the sum of the two shape parameters. In other words,
$$ \mathrm{Gamma}(r_x, \lambda) + \mathrm{Gamma}(r_y, \lambda) \sim \mathrm{Gamma}(r_x + r_y, \lambda). $$

1.7 Multivariate Moments


The expected value (mean) of a multivariate random vector is just the vector of expected values of the component random variables:
$$ E\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_n) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix}. $$
For two random variables X and Y defined on the same sample space, the covariance is defined to be
$$ \mathrm{Cov}(X, Y) = E\bigl((X - \mu_X)(Y - \mu_Y)\bigr) = E(XY) - \mu_X \mu_Y. $$
The (Pearson) correlation coefficient is just the covariance after rescaling to unit variance:
$$ \rho_{X,Y} = \mathrm{Cov}(X/\sigma_X, Y/\sigma_Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} $$
where $\sigma_X = \sqrt{\mathrm{Var}(X)}$ and $\sigma_Y = \sqrt{\mathrm{Var}(Y)}$, that is, the standard deviations of X and Y, respectively. Note that variance is just the special case where X = Y:
$$ \mathrm{Var}(X) = \mathrm{Cov}(X, X). $$
The natural generalization of variance to higher dimensions is the covariance matrix:
$$ \mathrm{Cov}(X) = E\bigl((X - \mu)(X - \mu)'\bigr) $$
where expectations are again taken componentwise. You should verify that the ij-th entry in this matrix is just $\mathrm{Cov}(X_i, X_j)$. Cov is a bilinear operator, so the covariance matrix of a linear transformation of a random vector has a simple representation:
$$ \mathrm{Cov}(AX) = E\bigl(A(X - \mu)(X - \mu)'A'\bigr) = A\,\mathrm{Cov}(X)\,A'. $$
In the special case where Cov(X) = I, we have
$$ \mathrm{Cov}(AX) = A\,\mathrm{Cov}(X)\,A' = AA'. $$

1.8 The Multivariate Normal Distribution


If we have a set of independent random variables X1 , X2 , X3 , . . ., Xn with
corresponding density functions f1 , f2 , f3 , . . ., fn then the random vector
(X1 , X2 , . . . Xn ) has a joint density function equal to the product of the fi ’s:

f (X1 , X2 , . . . Xn ) = f1 (X1 )f2 (X2 ) . . . fn (Xn ).

Applying this result to a collection of n independent standard normal random variables generates a multivariate normal distribution. The standard normal density is
$$ f(z) = \frac{1}{\sqrt{2\pi}}\, \exp\!\left(-\tfrac{1}{2} z^2\right), $$
so the resulting multivariate density for z is
$$ f(z) = \prod_{i=1}^n f(z_i) = \left(\frac{1}{2\pi}\right)^{n/2} \exp\!\left(-\frac{1}{2}\sum_i z_i^2\right) = \left(\frac{1}{2\pi}\right)^{n/2} \exp\!\left(-\frac{1}{2} z'z\right). $$
To get the general form, apply the change of variable formula from calculus. Let
$$ X = \mu + AZ $$
where A is an n × n matrix of full rank, and µ is an n-vector of constants. Using the notation of Jerry's 212 notes, the change of variable theorem states:
$$ \int_{\phi(K)} f(x)\,dx = \int_K f(\phi(x))\,|\det \phi'|\,dx $$
where K is a compact subset of $\mathbb{R}^n$, and here φ is the inverse transform
$$ \phi(x) = A^{-1}(x - \mu). $$
Thus the joint density function for X is
$$ f(x) = \left(\frac{1}{\sqrt{2\pi}}\right)^n |\det(A^{-1})|\, \exp\!\left(-\frac{1}{2}(x-\mu)'(AA')^{-1}(x-\mu)\right). $$
If we let the covariance matrix AA' = Σ, and note that
$$ \det \Sigma = \det(AA') = (\det A)^2, $$
then we have the usual representation for the N(µ, Σ) distribution:
$$ f(x) = \left(\frac{1}{\sqrt{2\pi}}\right)^n \frac{1}{\sqrt{|\det \Sigma|}}\, \exp\!\left(-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right). $$
The standard notation for stating that a random vector X has a multivariate normal distribution is
$$ X \sim N(\mu, \Sigma). $$
If X ∼ N(µ, Σ) and Y = AX, then Y ∼ N(Aµ, AΣA').
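A short simulation sketch of the construction X = µ + AZ with AA' = Σ, using the Cholesky factor of Σ as A (the particular µ and Σ are arbitrary):

# Generate draws from N(mu, Sigma) via X = mu + A Z, then check the sample moments.
set.seed(1)
mu    <- c(1, -2)
Sigma <- matrix(c(4, 1.5, 1.5, 2), nrow = 2)
A <- t(chol(Sigma))              # lower-triangular A with A %*% t(A) equal to Sigma
Z <- matrix(rnorm(2 * 10000), nrow = 2)
X <- mu + A %*% Z                # each column is one draw from N(mu, Sigma)
rowMeans(X)                      # close to mu
cov(t(X))                        # close to Sigma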


1.9 Marginal and Conditional Normal distributions

Marginal distributions are quite easy to specify for the multivariate normal. Let
$$ \begin{pmatrix} X \\ Y \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix},\; \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix} \right) $$
where X and Y are two subsets of the components of a random vector, with dimension k and n − k respectively, and the covariance matrix is specified in block form. Then the marginal distribution for X is
$$ X \sim N(\mu_X, \Sigma_{XX}). $$

If the univariate random variables X and Y have a joint normal distribution, then X and Y are independent if and only if Cov(X, Y) = 0. To find the conditional distribution of a random vector X given Y, we could work directly from the definition:
$$ f(x \mid Y = y) = \frac{f(x, y)}{f_Y(y)}, $$
that is, the joint distribution evaluated at Y = y divided by the marginal distribution for Y evaluated at Y = y. This is essentially a multivariate version of 'completing the square'.
Here is a nicer approach, which is analogous to the Gram-Schmidt orthogonalization process in linear algebra. I will outline the argument and leave the details to the reader. Find a linear transformation of the random vector
$$ \begin{pmatrix} X \\ Y \end{pmatrix} \mapsto \begin{pmatrix} X + AY \\ Y \end{pmatrix} $$
so that the resulting covariance matrix is block diagonal; i.e. the two components are uncorrelated, and thus independent.
Using the results about covariance matrices, we have a matrix equation in block form to solve for A:
$$ \begin{pmatrix} I & A \\ 0 & I \end{pmatrix} \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix} \begin{pmatrix} I & A \\ 0 & I \end{pmatrix}' = \begin{pmatrix} \Sigma_{X.Y} & 0 \\ 0 & \Sigma_{YY} \end{pmatrix}. $$
The solution is $A = -\Sigma_{XY}\Sigma_{YY}^{-1}$.
Since the new random vector has a block diagonal covariance matrix, we can write the joint density as the product of two factors, one the conditional density for X given Y, the other the marginal density for Y:
$$ f_{X,Y}(x, y) = f_{X|Y}(x \mid y)\, f_Y(y). $$
You may verify that the following is the conditional density by multiplying by the marginal density for Y and showing that it is equal to the original joint density:
$$ f(x \mid y) = \left(\frac{1}{\sqrt{2\pi}}\right)^k \frac{1}{\sqrt{|\det \Sigma_{X.Y}|}}\, \exp\!\left(-\frac{1}{2}(x - \mu)'\Sigma_{X.Y}^{-1}(x - \mu)\right) $$
where
$$ \mu = \mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y) $$
and
$$ \Sigma_{X.Y} = \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}. $$
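A small sketch computing the conditional mean and covariance from these block formulas (the 3-dimensional µ and Σ, the split into X and Y, and the observed y are all arbitrary choices):

# Conditional mean and covariance of X given Y = y for a multivariate normal.
mu    <- c(0, 1, 2)
Sigma <- matrix(c(2.0, 0.5, 0.3,
                  0.5, 1.0, 0.4,
                  0.3, 0.4, 1.5), nrow = 3, byrow = TRUE)
x.idx <- 1:2; y.idx <- 3                     # components 1-2 play X, component 3 plays Y
y     <- 2.8                                 # observed value of Y
Sxx <- Sigma[x.idx, x.idx]
Sxy <- Sigma[x.idx, y.idx, drop = FALSE]
Syx <- Sigma[y.idx, x.idx, drop = FALSE]
Syy <- Sigma[y.idx, y.idx, drop = FALSE]
cond.mean <- mu[x.idx] + Sxy %*% solve(Syy) %*% (y - mu[y.idx])
cond.cov  <- Sxx - Sxy %*% solve(Syy) %*% Syx
cond.mean
cond.cov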

1.10 MGF’s and Characteristic Functions


The Moment Generating Function (MGF) of a probability distribution is defined to be a function of the real argument t:
$$ M_X(t) = E e^{tX}. $$
For a continuous RV with density function f(x), this is
$$ E e^{tX} = \int e^{tx} f(x)\,dx, $$
while in the discrete case it is defined by a sum:
$$ E e^{tX} = \sum_i e^{t x_i} f(x_i). $$
The MGF gets its name from the fact that we can use it to compute moments for a probability distribution. To see why, expand $e^{tX}$ as a power series:
$$ E e^{tX} = E\!\left(1 + tX + \frac{1}{2!}(tX)^2 + \frac{1}{3!}(tX)^3 + \cdots\right) = 1 + tEX + \frac{1}{2!}t^2 EX^2 + \frac{1}{3!}t^3 EX^3 + \cdots. $$
Thus, if you differentiate k times and set t = 0, you get $EX^k$.
If the random variable X has MGF $M_X(t)$, then the random variable Y = a + bX, where a and b are constants, has MGF
$$ M_Y(t) = E e^{t(a+bX)} = e^{ta} E e^{tbX} = e^{ta} M_X(bt). $$
If two RV's X and Y are independent, then the MGF of their sum is the product of their MGF's:
$$ E e^{t(X+Y)} = E e^{tX} e^{tY} = E e^{tX}\, E e^{tY}. $$
Not every probability distribution has an MGF: the distribution must possess moments of all orders. Every distribution does have a characteristic function:
$$ \phi_X(t) = E e^{itX} $$
where $i = \sqrt{-1}$. Since
$$ |e^{itX}| = |\cos(tX) + i\sin(tX)| \le 1, $$
the integral always exists. You can compute moments from the characteristic function if they exist; I'll leave this as an exercise for the reader.
It is useful to compute these transforms for standard distributions.

distribution          density                                                            M(t)                                                    φ(t)

standard normal       $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$                                 $e^{t^2/2}$                                             $e^{-t^2/2}$

uniform(0, θ)         $\frac{1}{\theta} I(0 < x < \theta)$                               $\frac{e^{t\theta}-1}{t\theta}$                         $\frac{e^{it\theta}-1}{it\theta}$

triangular(0, 2θ)     $\frac{1}{\theta}\bigl(1 - \frac{|x-\theta|}{\theta}\bigr) I(0 < x < 2\theta)$    $\frac{2e^{t\theta}(\cosh(t\theta)-1)}{t^2\theta^2}$    $\frac{2e^{it\theta}(1-\cos(t\theta))}{t^2\theta^2}$

gamma(α, 1)           $\frac{x^{\alpha-1}}{\Gamma(\alpha)} e^{-x} I(0 < x)$              $\frac{1}{(1-t)^\alpha}$                                $\frac{1}{(1-it)^\alpha}$

double exponential    $\frac{1}{2} e^{-|x|}$                                             $\frac{1}{1-t^2}$                                       $\frac{1}{1+t^2}$

Cauchy                $\frac{1}{\pi(1+x^2)}$                                             does not exist                                          $e^{-|t|}$
There are some interesting relationships in this table. The normal is essentially its own characteristic function (up to a constant), and the Cauchy and double exponential transform to each other's density. This is a very useful example, as it illustrates some important facts about characteristic functions: if the distribution has no expectation, then the characteristic function is not differentiable at t = 0. The characteristic function of the triangular(0, 2θ) is the square of the characteristic function of the uniform(0, θ); hence the sum of two independent uniforms has a triangular distribution. Finally, if a density is symmetric about 0, then the characteristic function is strictly real.

Example
X and Y are independent Poisson($\lambda_x$) and Poisson($\lambda_y$) RV's.
$$ E e^{tX} = \sum_{k=0}^{\infty} e^{tk}\, \lambda_x^k e^{-\lambda_x}/k! = e^{-\lambda_x} \sum_{k=0}^{\infty} (\lambda_x e^t)^k/k!, $$
so
$$ M_X(t) = e^{\lambda_x(e^t - 1)}. $$
Thus the MGF of the sum is
$$ M_{X+Y}(t) = M_X(t)\,M_Y(t) = e^{(\lambda_x + \lambda_y)(e^t - 1)} $$
and the sum of the two independent Poisson RV's is Poisson, with parameter $\lambda_x + \lambda_y$.

Example
X and Y are independent Gamma($r_x$, λ) and Gamma($r_y$, λ) RV's, respectively, where λ is the inverse of the scale parameter β.
$$ E e^{tX} = \int_0^\infty e^{tx}\, \frac{\lambda^{r_x} x^{r_x - 1}}{\Gamma(r_x)}\, e^{-\lambda x}\,dx $$
and
$$ M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^{r_x}. $$
Thus
$$ M_{X+Y}(t) = M_X(t)\,M_Y(t) = \left(\frac{\lambda}{\lambda - t}\right)^{r_x + r_y} $$
and we can immediately observe that the sum of two Gamma variates with common scale factor has a Gamma distribution with the same scale factor, while the shape parameter is the sum of the two shape parameters. In other words,
$$ \mathrm{Gamma}(r_x, \lambda) + \mathrm{Gamma}(r_y, \lambda) \sim \mathrm{Gamma}(r_x + r_y, \lambda). $$

Example
X and Y are independent Cauchy random variables, and thus have characteristic functions $e^{-|t|}$.
$$ \phi_{(X+Y)/2}(t) = \phi_X(t/2)\,\phi_Y(t/2) = \bigl(e^{-|t|/2}\bigr)^2 = e^{-(|t|/2 + |t|/2)} = e^{-|t|}. $$
Thus the average of two independent Cauchy RV's has a Cauchy distribution! This should lead you to be wary of computing averages for populations that don't have a mean! In contrast, the average of two normal RV's with variance σ² has variance σ²/2; averaging decreases the variability.
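A simulation sketch of this warning: the running mean of Cauchy draws never settles down, while the running mean of normal draws does (the sample size is arbitrary):

# Compare running means of Cauchy and normal samples.
set.seed(2)
n <- 10000
cauchy.means <- cumsum(rcauchy(n)) / (1:n)
normal.means <- cumsum(rnorm(n))   / (1:n)
plot(normal.means, type = "l", ylim = range(cauchy.means),
     xlab = "n", ylab = "running mean")
lines(cauchy.means, col = "red")   # occasional huge jumps, no convergence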

1.11 Transforms of random variables


For discrete random variables it is generally straightforward to derive the distribution of a function of a random variable. In the continuous multivariate setting the change of variable theorem is a useful tool. I'll restate the theorem in the form in which it typically arises. Let $G : \mathbb{R}^n \to \mathbb{R}^n$ be one-to-one with continuous derivatives, and let X be a random variable on $\mathbb{R}^n$ with a joint density function given by $f_X(x)$. The density function for Y = G(X) is
$$ f_Y(y) = f_X(G^{-1}(y))\,|J_{G^{-1}}(y)| $$
where $G^{-1}(y)$ is the inverse transform and $J_{G^{-1}}$ is its Jacobian determinant.

Example
Let (X, Y) have a joint distribution on $\mathbb{R}^2$. We can find the distribution of X + Y as follows: let G(X, Y) = (X + Y, Y). Then $G^{-1}(S, T) = (S - T, T)$, and $|J_{G^{-1}}| = 1$, so we have
$$ f_{S,T}(s, t) = f_{X,Y}(s - t, t). $$
We then integrate over T to find the marginal distribution for S = X + Y. If X and Y are independent, then this reduces to the convolution theorem:
$$ f_S(s) = \int f_{X,Y}(s - t, t)\,dt = \int f_X(s - t)\,f_Y(t)\,dt. $$

Example
Let X and Y be independent random variables with Gamma($\alpha_x$, λ) and Gamma($\alpha_y$, λ) distributions, respectively. Let V = X/(X + Y). We want to find the marginal distribution for V. We could try to do this directly, but the change of variables theorem gives us a straightforward method. Pick G(X, Y) = (U, V) with a simple form such as U = X, U = Y, or perhaps U = X + Y. It may not be obvious which choice is best initially! It helps to have a $G^{-1}$ with a nice Jacobian. You should verify that the choice U = X gives X = U and Y = U(1 − V)/V, while U = Y yields X = UV/(1 − V), while U = X + Y gives X = UV and Y = U(1 − V). I'll work through the last option; it looks simpler.
$$ f_{U,V}(u, v) = f_{X,Y}(uv,\, u(1-v)) \left| \det \begin{pmatrix} v & u \\ 1-v & -u \end{pmatrix} \right|. $$
The determinant evaluates to $-vu - u(1-v) = -u$, yielding
$$ f_{U,V}(u, v) = f_{X,Y}(uv,\, u(1-v))\,|u|. $$
Since X and Y are non-negative, so is U = X + Y, so we may drop the absolute value. Since X and Y are independent, the joint density is the product of the marginal densities:
$$ f_{U,V}(u, v) = f_X(uv)\,f_Y(u(1-v))\,u = \frac{(uv)^{\alpha_x-1}\lambda^{\alpha_x}}{\Gamma(\alpha_x)}\, e^{-\lambda uv}\; \frac{(u(1-v))^{\alpha_y-1}\lambda^{\alpha_y}}{\Gamma(\alpha_y)}\, e^{-\lambda u(1-v)}\; u = u^{\alpha_x+\alpha_y-1}\lambda^{\alpha_x+\alpha_y}\, e^{-\lambda u}\; \frac{v^{\alpha_x-1}(1-v)^{\alpha_y-1}}{\Gamma(\alpha_x)\Gamma(\alpha_y)}. $$
Conveniently the u and v bits segregate themselves, and we can recognize the u terms as a Gamma integral. Thus it will be easy to integrate over u:
$$ f_V(v) = \int f_{U,V}(u, v)\,du = \frac{v^{\alpha_x-1}(1-v)^{\alpha_y-1}}{\Gamma(\alpha_x)\Gamma(\alpha_y)} \int u^{\alpha_x+\alpha_y-1}\lambda^{\alpha_x+\alpha_y}\, e^{-\lambda u}\,du = \Gamma(\alpha_x+\alpha_y)\, \frac{v^{\alpha_x-1}(1-v)^{\alpha_y-1}}{\Gamma(\alpha_x)\Gamma(\alpha_y)}. $$
We see that V has a Beta($\alpha_x$, $\alpha_y$) density. Note that for non-negative random variables X and Y, X/(X + Y) must be in (0, 1).
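A simulation sketch of this result: with arbitrary shapes and a common rate, the ratio X/(X + Y) matches the Beta($\alpha_x$, $\alpha_y$) density:

# X ~ Gamma(ax, rate), Y ~ Gamma(ay, rate) independent; X/(X+Y) should look Beta(ax, ay).
set.seed(3)
ax <- 2; ay <- 5; rate <- 3
gx <- rgamma(10000, shape = ax, rate = rate)
gy <- rgamma(10000, shape = ay, rate = rate)
v  <- gx / (gx + gy)
hist(v, breaks = 40, freq = FALSE, main = "X/(X+Y) vs Beta(2,5) density")
curve(dbeta(x, ax, ay), add = TRUE, col = "red")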

1.12 Exercises
1. Use the convolution theorem to show that if X and Y are independent Gamma($\alpha_x, \lambda$) and Gamma($\alpha_y, \lambda$) random variables, then X + Y has a Gamma($\alpha_x + \alpha_y, \lambda$) distribution.

2. Let X and Y be independent N(0,1) random variables. Find the density function for X/Y, and identify the distribution.

3. Let U and V be independent Uniform(0,1) random variables. Show that $X = \sqrt{-2\log U}\,\cos(2\pi V)$ and $Y = \sqrt{-2\log U}\,\sin(2\pi V)$ are independent N(0,1) random variables.

4. Let $X_1, X_2, \ldots, X_n$ be IID N(0, σ²) random variables. Find the distribution of
$$ \sum_{i=1}^n X_i^2. $$

5. Let Z be N(0,1) and Y be a $\chi^2_n$ random variable independent of Z. Show that
$$ t = \frac{Z}{\sqrt{Y/n}} $$
has a $t_n$ distribution.

6. Let $X_1, X_2, \ldots, X_n$ be IID N(µ, σ²) random variables. Let $\bar X = \sum X_i/n$ and $S^2 = \sum (X_i - \bar X)^2/(n-1)$. Show that
$$ t = \frac{\bar X - \mu}{S/\sqrt{n}} = \frac{\sqrt{n}(\bar X - \mu)}{S} $$
has a $t_{n-1}$ distribution.

7. Let X and Y be independent $\chi^2_a$ and $\chi^2_b$ random variables, respectively. Show that
$$ F = \frac{X/a}{Y/b} $$
has an $F_{a,b}$ distribution.
Chapter 2

The Method of Moments

In 1894 Karl Pearson published a paper titled Contributions to the Mathe-


matical Theory of Evolution. The paper presents analyses of the distribution
of the ratio of width of forehead to body length of crabs. The histogram
of the 1000 observations was clearly asymmetrical, and hence non-normal.
Perhaps because of the firm hold that the normal distribution had on the
imagination at the time, Pearson considered the possibility that the asym-
metry resulted from a mixture of two populations, each of which was normal.
He worked out a method for estimating the parameters of a mixture of two
normal distributions. The mixture density is
$$ f(x) = p\, n(x \mid \mu_1, \sigma_1^2) + (1-p)\, n(x \mid \mu_2, \sigma_2^2) $$
where $n(x \mid \mu, \sigma^2)$ is the density function for a normal with mean µ and variance σ². There are 5 parameters: p, the proportion of observations in population 1, and a mean and variance for each population. He discussed the problem of identifiability, that is, the question of whether there is more than one set of parameter values yielding the same distribution. For example, when the two normal distributions collapse to one, we can parametrize the mixture model in two ways: (1) $\mu_1 = \mu_2$, $\sigma_1^2 = \sigma_2^2$, with any choice of p, and (2) p = 1, $\mu_1 \ne \mu_2$, $\sigma_1^2 \ne \sigma_2^2$. The mathematical issue is whether or
not there is a bijection between the parameter space and the space of prob-
ability distributions. Pearson computed the first 5 moments of the mixture
distribution, and the first 5 sample moments from the data, which gives a
set of 5 equations in 5 unknowns. He then solved the equations, producing
estimates for the unknown parameters. There were complications, since the
equations had more than one solution, and he seemed to indicate that there
was no biological reason to suppose that there were really two subpopulations
represented in the data, and that it was more reasonable to believe that the
underlying population was simply a single component with a non-normal dis-
tribution. Nevertheless, the exercise seems to have convinced Pearson that
he had a good idea. One can estimate the moments of a distribution by its
sample moments. The law of large numbers guarantees that if your sample
is large enough, the sample moments will be close to the true values. If
you have the moments in terms of some parameters, this yields a system of
equations which you can solve for estimates of those parameters.
Pearson developed this idea in the following years, eventually publishing
a set of tables (Biometrika tables for statisticians and biometricians, 1914)
which offered a method for estimation, at least in the families of distributions
he had developed.
These families of distributions arose from solutions of differential equations of the form
$$ \frac{f'(x)}{f(x)} = \frac{x - a}{g(x)}. $$
To create a manageable catalogue of distributions, he considered expanding g(x) as a quadratic in the neighborhood of the point a where the numerator vanishes, yielding equations with a polynomial of at most degree 2 in the denominator:
$$ \frac{f'}{f} = \frac{x - a}{c_0 + c_1 x + c_2 x^2}. $$
Since f' = 0 at one point, x = a, the families are primarily unimodal distributions, though there are solutions that have 'J' or 'U' shaped density curves.
Pearson categorized the resulting distributions into 7 families or types based on different constraints on the coefficients a, $c_0$, $c_1$, and $c_2$. You will occasionally see references to a 'Pearson Type III' or a 'Pearson Type IV' distribution, especially in older texts and articles.
You can easily verify by differentiating the normal density function that $c_1 = c_2 = 0$ corresponds to a normal distribution.
It turns out to be more natural to work with the first four moments of the distributions, or scaled versions thereof. Let $\mu_k = E(X - \mu)^k$ be the k-th central moment. Pearson used the following versions of the first four moments:

name        symbol    definition
mean        µ         E(X)
variance    σ²        $\mu_2$
skewness    β₁        $\mu_3^2/\mu_2^3$
kurtosis    β₂        $\mu_4/\mu_2^2$

You will also see $\gamma_1 = \sqrt{\beta_1}$ used as a measure of skewness, and $\gamma_2 = \beta_2 - 3$ for kurtosis (for the normal distribution, β₂ = 3).
Pearson's idea was to estimate the parameters µ, σ², β₁, and β₂ using the corresponding sample moments. Let $\hat\mu = \bar X = \sum X_i/n$, where n is the number of observations, and
$$ \hat\mu_k = \sum (X_i - \bar X)^k / n. $$
Then
$$ \hat\beta_1 = \hat\mu_3^2/\hat\mu_2^3 $$
and
$$ \hat\beta_2 = \hat\mu_4/\hat\mu_2^2. $$

To illustrate the procedure, I have generated a pseudo-random sample of


standard normal data using R .

> x <- rnorm(50)


> m1 <- mean(x)
> m2 <- mean( (x-m1)^2)
> m3 <- mean( (x-m1)^3)
> m4 <- mean( (x-m1)^4)
> m1
[1] 0.001916651
> m2
[1] 1.111672
> m3
[1] -0.6378616
> m4
[1] 4.807191

> beta1 <- m3^2/m2^3


> beta1
[1] 0.2961575
> beta2 <- m4/m2^2
> beta2
[1] 3.889896

For the standard normal, µ = 0, σ² = 1, β₁ = 0 and β₂ = 3. Using $\hat\beta_1 = .29$ and $\hat\beta_2 = 3.89$, we land in the Type IV region. This is a bit inconvenient, as the Type IV family lacks a closed form representation for the moments. The density function is of the form
$$ f(x) = c \left(1 + \frac{x^2}{\alpha^2}\right)^{-\gamma} e^{-\delta \arctan(x/\alpha)}. $$
Rather than tackle that one now, I'll leave it as the proverbial exercise for the reader. Hint: for given values of α, γ, and δ, c is the normalizing constant that makes it a density (it must integrate to 1). If we knew those values, we could integrate numerically to get the corresponding values for µ, σ², β₁ and β₂.
Assuming we knew the data came from a normal distribution, we would simply use $\hat\mu = .0019$ and $\hat\mu_2 = 1.11$. We will see later that the standard estimator for $\mu_2 = \sigma^2$ in the normal distribution is $s^2 = \sum (X_i - \bar X)^2/(n-1)$.
The method of moments may be used as I just did, without resorting to Pearson's table, if we are willing to assume that we know the family of distributions. Here is a simple example: let $X_1, X_2, X_3, \ldots, X_n$ be an independent sample from a Beta(α, β) distribution. There are two unknown parameters, so we need to compute the first two moments:
$$ E(X) = \int_0^1 x\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)} = \frac{\alpha}{\alpha+\beta}. $$
We have made use of the fact that $\Gamma(x+1) = x\,\Gamma(x)$. Now we compute the second moment:
$$ E(X^2) = \int_0^1 x^2\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha+2)\Gamma(\beta)}{\Gamma(\alpha+\beta+2)} = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}. $$
We could stop here, and just use the sample moment $\sum X_i^2/n$, but it may be simpler to compute the variance:
$$ \mathrm{Var}(X) = E(X^2) - (E(X))^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. $$
This may not appear any simpler, but we will substitute $\bar X$ for $\mu = \alpha/(\alpha+\beta)$ and also in $1 - \mu = \beta/(\alpha+\beta)$. Your favorite symbolic algebra package would be useful here to solve for α and β in terms of $\bar X$ and $\hat\sigma^2$. Here are the equations:
$$ \bar X = \frac{\alpha}{\alpha+\beta} $$
and
$$ \hat\sigma^2 = \frac{\alpha}{\alpha+\beta}\,\frac{\beta}{\alpha+\beta}\,\frac{1}{\alpha+\beta+1}. $$
After a lot of algebra I get
$$ \hat\alpha = \frac{\bar X(1-\bar X) - \hat\sigma^2}{\hat\sigma^2}\, \bar X $$
and
$$ \hat\beta = \frac{\bar X(1-\bar X) - \hat\sigma^2}{\hat\sigma^2}\, (1-\bar X). $$
Here is a simulation using the Beta(1,1) or Uniform(0,1) distribution. Just
for fun I will repeat the full computation of Pearson.

> x <- runif(50)


> m1 <- mean(x)
> m2 <- mean( (x-m1)^2)
> m3 <- mean( (x-m1)^3)
> m4 <- mean( (x-m1)^4)
> m1
[1] 0.4495381
> m2
[1] 0.08206844
> m3
[1] 0.008863663
> m4
[1] 0.01355912
> m3^2/m2^3
[1] 0.1421340
> m4/m2^2
[1] 2.013164

These values for β₁ and β₂ put us in the Pearson Type I region, where we belong, but in the J1 subset rather than on the left boundary where we belong. To finish the calculation by solving for α and β:

> alpha <- (m1*(1-m1)-m2)/m2 * m1


> beta <- (m1*(1-m1)-m2)/m2 * (1-m1)
> alpha
[1] 0.9059138
> beta
[1] 1.109296

Not too bad!
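The same calculation can be packaged as a small reusable function; a minimal sketch (the function name mom.beta is ours, not part of the text):

# Method of moments estimator for Beta(alpha, beta), using the formulas above.
mom.beta <- function(x) {
  m1 <- mean(x)
  m2 <- mean((x - m1)^2)
  k  <- (m1 * (1 - m1) - m2) / m2
  c(alpha = k * m1, beta = k * (1 - m1))
}
mom.beta(runif(50))    # should give values near alpha = 1, beta = 1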



2.1 Exercises
For each of the following situations, find the method of moments estimator.

1. $X_1, X_2, \ldots, X_n$ are IID Uniform(0, θ).

2. $X_1, X_2, \ldots, X_n$ are IID Gamma(α, λ).

3. $X_1, X_2, \ldots, X_n$ are IID Pareto(α, β).

4. $X_1, X_2, \ldots, X_n$ are IID with density $f(x \mid \theta) = e^{\theta - x}$ for x ≥ θ and θ > 0.

5. $X_1, X_2, \ldots, X_n$ are IID with density $f(x \mid \mu) = \frac{1}{2} e^{-|x-\mu|}$.

6. $X_1, X_2, \ldots, X_n$ are IID with density $f(x \mid \mu) = \frac{1}{\pi}\,\frac{1}{1+(x-\mu)^2}$.

7. $X_1, X_2, \ldots, X_n$ are IID N(µ, τ), where τ = 1/σ² is the precision.
2.2 Comment
The real significance of Pearson’s work on the method of moments was that
it changed the focus of statistical theory. The tradition had been to compute
the mean and variance of a sample, and rely on the Central Limit Theorem to
allow inference about the mean of the unknown distribution. After Pearson,
people started to focus on estimation of the unknown distribution itself, not
just its mean.
Chapter 3

Least Squares

There were several attempts in the late 18th century to come to grips with
the problem of fitting equations to data. Hald’s wonderful text A History of
Mathematical Statistics from 1750 to 1930 mentions several investigations.
The basic problem arises when we have pairs of observations (Xi , Yi )
which appear to exhibit a linear pattern. The earliest methods included
ideas like picking out the ‘best’ observations, just enough to yield a consistent
system of equations, and summing or averaging subsets of the equations to
yield a consistent system.
Boscovitch in 1757 proposed to fit a line $Y_i = \beta_0 + \beta_1 X_i + e_i$ subject to the following conditions:
$$ \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) = 0 $$
(the sum of the positive deviations from the line should equal the sum of negative deviations from the line) and the sum of the absolute values of the deviations
$$ \sum_{i=1}^n |Y_i - \beta_0 - \beta_1 X_i| $$
should be a minimum. In modern terminology, let the i-th residual $r_i$ be
$$ r_i = Y_i - (\beta_0 + \beta_1 X_i); $$
then the two criteria are that the residuals sum to 0, $\sum r_i = 0$, and the sum of the absolute values of the residuals, $\sum |r_i|$, is a minimum. Boscovitch proposed

the first criterion under the assumption that the distribution of measurement
errors (ei ) or deviations from the line was symmetrical about 0. Hald reports
that there was historical precedent for making the second assumption going
back to Galileo. The first condition forces the fitted line to pass through the
point (X, Y ) which may be verified directly:
n
X
0 = (Yi 0 1 Xi )
i=1

n
X n
X
= Yi n 0 1 Xi
i=1 i=1

Division by n, and solving for Y yields


Y = 0 + 1 X.

That is, if 0 and 1 are coefficients that satisfy the criterion, then the point
(X, Y ) must be on the line Y = 0 + 1 X.
If we now translate the coordinate system to put the origin at (X, Y ), we
have a simpler minimization problem: choose 1 to minimize
n
X
F ( 1) = |(Yi Y) 1 (Xi X)|.
i=1

The function F ( 1 ) is piecewise linear in 1 , with the contribution from


each observation Yi decreasing linearly as 1 increases until the line hits the
point (Xi , Yi ), and then increasing linearly as the slope increases past that
point. Boscovitch showed how to solve this problem graphically given a
small dataset. For a larger dataset the problem becomes more challenging
computationally - it is a linear programming problem. A general algorithm
for the linear programming problem, known as the simplex algorithm, was
first discovered in 1947 by George Dantzig.
Another idea, due to Lambert, was to divide the data into two roughly
equal sized subsets, those with larger X values and those with smaller X
values. Compute the centroids of these two groups, call them $(\bar X_L, \bar Y_L)$ and $(\bar X_S, \bar Y_S)$, and use them to compute the slope:
$$ \hat\beta_1 = \frac{\bar Y_L - \bar Y_S}{\bar X_L - \bar X_S}. $$
Adding the requirement that the deviations sum to zero, forcing the line through $(\bar X, \bar Y)$ again, we now determine the intercept β₀ by solving
$$ \bar Y = \beta_0 + \hat\beta_1 \bar X. $$
Adrien-Marie Legendre was the first to publish the least squares criterion, in 1806. He stated the model in a very modern and general form:
$$ Y_i = \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_m X_{i,m} + \epsilon_i, \qquad i = 1, \ldots, n, \quad m \le n. $$
He then argued that minimizing the sum of the squares of the errors is both easier to do in general than earlier methods, and would tend to make the largest errors small.
For simplicity, we will consider the case of fitting a line again. Let $Q(\beta_0, \beta_1)$ be the quadratic function of the coefficients representing the sum of the squares of the residuals:
$$ Q(\beta_0, \beta_1) = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2. $$
Now, differentiate with respect to those arguments, and set equal to 0:
$$ \frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_0} = -2 \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) $$
$$ \frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_1} = -2 \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) X_i $$
This leads to a system of two linear equations in two unknowns. The solution may be written in various forms, including
$$ \hat\beta_1 = \frac{\sum (Y_i - \bar Y)(X_i - \bar X)}{\sum (X_i - \bar X)^2}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X. $$
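A quick numerical check of these formulas against R's lm() function, with simulated data (all values arbitrary):

# Closed-form slope and intercept versus lm().
set.seed(4)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)
b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))          # the same two numbers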

3.1 The Modern Linear Model


The linear model is often written in essentially the same form given by Legendre:
$$ Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_{p-1} X_{i,p-1} + \epsilon_i $$
where the β's are unknown constants, the $X_{i,j}$'s are known constants, and the ε's are independent N(0, σ²) random variables. The ε's must be independent not only of each other, but also must not be correlated with the X's:
$$ E(\epsilon_i \mid X_{i,1}, X_{i,2}, \ldots, X_{i,p-1}) = 0. $$
It then follows that
$$ E(Y_i \mid X_{i,1}, X_{i,2}, \ldots, X_{i,p-1}) = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_{p-1} X_{i,p-1} $$
and
$$ \mathrm{Var}(Y_i) = \sigma^2. $$
Notice that this rules out models with random parameters (β's) and $X_{i,j}$'s measured with error. The former are often called random coefficient models, but you may also hear the terms hierarchical linear models or random effects models. When the $X_{i,j}$'s are random variables rather than known constants we have the measurement error model.
It will greatly facilitate our discussion to translate the model into matrix notation as follows:
$$ \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_{1,1} & X_{1,2} & \cdots & X_{1,p-1} \\ 1 & X_{2,1} & X_{2,2} & \cdots & X_{2,p-1} \\ 1 & X_{3,1} & X_{3,2} & \cdots & X_{3,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & X_{n,2} & \cdots & X_{n,p-1} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{pmatrix} $$
or
$$ Y = X\beta + \epsilon $$
where Y is n × 1, X is n × p, β is p × 1, and ε is n × 1. The assumption that the ε's are IID N(0, σ²) can be restated in terms of the multivariate normal distribution as
$$ \epsilon \sim N(0, \sigma^2 I_n). $$
X is known as the design matrix, since in an experimental context it represents the chosen levels of the explanatory factors.

3.2 Parameter Estimation by Least Squares


In matrix notation, the least squares criterion (choose the coefficients to minimize the sum of squared deviations of the $Y_i$'s from the fitted values) translates to: minimize
$$ (Y - X\beta)'(Y - X\beta). $$
For a particular set of estimates β̂ we define the fitted values
$$ \hat Y = X\hat\beta $$
and the residuals
$$ r = Y - \hat Y = Y - X\hat\beta. $$
Thus the least squares criterion may be restated: choose β to minimize $r'r$ or $\sum r_i^2$, known as the residual sum of squares (RSS).
To compute the estimates, note that minimizing $\|Y - X\beta\|$ amounts to finding the closest vector in the span of the columns of X to the original vector Y, in other words the orthogonal projection of Y into span{1, $X_1$, ..., $X_{p-1}$}, where 1 is a column vector of 1's, and the $X_j$'s are the other columns of X. This is equivalent to choosing β so that
$$ X'(Y - X\beta) = 0 $$
where 0 is a length-p vector of zeros. Expanding, we have:
$$ X'Y - X'X\beta = 0. $$
This leads to the normal equations:
$$ X'Y = X'X\beta. $$
Assuming that X'X has full rank (p), that is, X has rank p, the solution is
$$ \hat\beta = (X'X)^{-1}X'Y. $$
The fitted values are
$$ \hat Y = X\hat\beta = X(X'X)^{-1}X'Y $$
and the residuals are
$$ r = Y - \hat Y = Y - X(X'X)^{-1}X'Y. $$
The residual standard error is the square root of the estimate of the residual variance σ²:
$$ s^2 = \frac{\mathrm{RSS}}{n-p} = \frac{\sum r_i^2}{n-p}. $$
We will show that $\mathrm{RSS} \sim \sigma^2\chi^2_{n-p}$, so this is an unbiased estimator for σ².
Finally, note that we can write the fitted values as a linear combination of the columns of X:
$$ \hat Y = X\hat\beta = \hat\beta_0 \mathbf{1} + \hat\beta_1 X_1 + \cdots + \hat\beta_{p-1} X_{p-1}. $$
In other words, we have chosen a set of basis vectors for a p-dimensional subspace, namely the columns of the design matrix X, and the β̂'s are the coordinates of Ŷ with respect to this basis.
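Here is a small sketch checking the matrix formulas numerically against lm(); the simulated data and dimensions are arbitrary:

# beta-hat, the hat matrix H, fitted values, and residual quantities via matrix algebra.
set.seed(5)
n <- 40
X <- cbind(1, runif(n), runif(n))                 # design matrix with intercept column
beta <- c(1, 2, -3)
y <- X %*% beta + rnorm(n)
XtX.inv  <- solve(t(X) %*% X)
beta.hat <- XtX.inv %*% t(X) %*% y
H <- X %*% XtX.inv %*% t(X)                       # hat (projection) matrix
max(abs(H %*% H - H))                             # ~0: H is idempotent
fit <- lm(y ~ X[, 2] + X[, 3])
cbind(beta.hat, coef(fit))                        # identical coefficients
max(abs(H %*% y - fitted(fit)))                   # ~0: same fitted values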

3.3 Projection Operators on Linear Spaces


Recall that we can decompose an inner product space $\mathcal{V}$ into orthogonal subspaces $\mathcal{U}$ and $\mathcal{W}$ where $\mathcal{V} = \mathcal{U} \oplus \mathcal{W}$ and $\mathcal{U}^\perp = \mathcal{W}$. This means that any vector $V \in \mathcal{V}$ may be written as a sum of two vectors $U + W$ for some $U \in \mathcal{U}$ and $W \in \mathcal{W}$. The fact that $\mathcal{U}$ and $\mathcal{W}$ are orthogonal subspaces means that $U'W = 0$ and thus
$$ V'V = (U+W)'(U+W) = U'U + W'W $$
or
$$ \|V\|^2 = \|U\|^2 + \|W\|^2, $$
which is just the Pythagorean rule for right triangles.
An orthogonal projection onto the subspace $\mathcal{U}$ is a linear operator $P_{\mathcal{U}}$ on $\mathcal{V}$ with the property that for any vector $V = U + W$, where $U \in \mathcal{U}$ and $W \in \mathcal{W}$:
$$ P_{\mathcal{U}}(V) = P_{\mathcal{U}}(U + W) = U. $$
It can be shown that $P_{\mathcal{U}}$ is a projection operator iff $P_{\mathcal{U}}^2 = P_{\mathcal{U}}$. In addition, if $P_{\mathcal{U}}^* = P_{\mathcal{U}}$, then $P_{\mathcal{U}}$ is an orthogonal projection.
Now, back to least squares:
$$ Y = \hat Y + r $$
where
$$ \hat Y = X\hat\beta = X(X'X)^{-1}X'Y $$
and
$$ r = Y - \hat Y = Y - X\hat\beta = Y - X(X'X)^{-1}X'Y. $$
Let $H = X(X'X)^{-1}X'$; then
$$ H^2 = \bigl(X(X'X)^{-1}X'\bigr)\bigl(X(X'X)^{-1}X'\bigr) = X(X'X)^{-1}(X'X)(X'X)^{-1}X' = X(X'X)^{-1}X' = H, $$
so H is a projection operator, projecting into the column span of X. I − H is also a projection, and they project into orthogonal subspaces:
$$ H(I - H) = H - H^2 = H - H = 0. $$

3.4 Distributional Results


Since $\hat\beta = (X'X)^{-1}X'Y$ is a linear function of Y, it is fairly simple to evaluate the distributions of β̂, the fitted values, and the residuals. The important facts about multivariate random vectors are that
$$ E(AY) = A\,E(Y), $$
i.e., expectation is a linear operator, and covariance is bilinear:
$$ \mathrm{Cov}(AY) = E\bigl(A(Y - \mu)(Y - \mu)'A'\bigr) = A\,\mathrm{Cov}(Y)\,A'. $$

3.4.1 The distribution of β̂

Recall that
$$ Y \sim N(X\beta, \sigma^2 I_n) $$
and
$$ \hat\beta = (X'X)^{-1}X'Y. $$
Thus the expected value of β̂ is
$$ E(\hat\beta) = E\bigl((X'X)^{-1}X'Y\bigr) = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta. $$
Using the fact that X'X is a symmetric matrix, and thus so is its inverse, the covariance matrix of β̂ is
$$ \mathrm{Cov}(\hat\beta) = \mathrm{Cov}\bigl((X'X)^{-1}X'Y\bigr) = (X'X)^{-1}X'\,\mathrm{Cov}(Y)\,\bigl((X'X)^{-1}X'\bigr)' = (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} = \sigma^2 (X'X)^{-1}(X'X)(X'X)^{-1} = \sigma^2 (X'X)^{-1}. $$
Thus the standard error of a coefficient $\hat\beta_j$ is the square root of the corresponding diagonal entry in the matrix $\sigma^2(X'X)^{-1}$:
$$ \mathrm{SE}(\hat\beta_j) = s\sqrt{(X'X)^{-1}_{j,j}} $$
where
$$ s^2 = \frac{\mathrm{RSS}}{n-p} = \frac{\sum r_i^2}{n-p} $$
is the unbiased estimator of σ².
Since β̂ is a linear function of normally distributed random variables, it also has a normal distribution:
$$ \hat\beta \sim N\bigl(\beta, \sigma^2(X'X)^{-1}\bigr). $$
Note that consistency of the estimates β̂ depends on the variances going to zero, which requires that the diagonal elements of $(X'X)^{-1}$ go to zero as n goes to infinity.
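A short sketch comparing these standard errors, computed directly from $s^2(X'X)^{-1}$, with those reported by summary(lm()); the simulated data are arbitrary:

# Manual standard errors versus summary(lm()).
set.seed(6)
n <- 40
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
s2  <- sum(resid(fit)^2) / (n - ncol(X))          # RSS / (n - p)
manual.se <- sqrt(s2 * diag(solve(t(X) %*% X)))
cbind(manual.se, summary(fit)$coefficients[, "Std. Error"])   # identical columns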

3.4.2 Fitted Values

Again, the fitted values are a linear function of the original observation vector:
$$ \hat Y = HY = X(X'X)^{-1}X'Y. $$
Note that H is not full rank: H is an n × n matrix; assuming that rank(X) is p, the dimension of the range of H is p, hence H has rank p. Thus HY has a multivariate N(µ, Σ) distribution where
$$ \mu = E(HY) = H\,E(Y) = HX\beta = X\beta $$
and
$$ \Sigma = \mathrm{Cov}(HY) = H(\sigma^2 I)H' = \sigma^2 HH' = \sigma^2 H^2 = \sigma^2 H. $$
Thus the variance of a fitted value $\hat Y_i$ is σ² multiplied by the i-th diagonal element of H, and the standard error of a fitted value is
$$ \mathrm{SE}(\hat Y_i) = s\sqrt{H_{ii}}. $$
For a fitted value at a specific location $x_0$, where $x_0' = (1, x_1, x_2, \ldots, x_{p-1})$ is in the form of a row of the design matrix, the variance is
$$ \mathrm{Var}(x_0'\hat\beta) = x_0'\,\mathrm{Cov}(\hat\beta)\,x_0 = \sigma^2\, x_0'(X'X)^{-1}x_0. $$

3.4.3 Predictions

In the previous section we computed the variance of a fitted value at the location $x_0$. If we are interested in the prediction error, i.e. the error in making a prediction based on the fitted value, then we must also take into account variation about the mean value. Since the variance about the mean value is σ², the total variance is (assuming the new observation is independent of the fitted values) the sum of the variance about the mean plus the variance of the estimate of the mean:
$$ \sigma^2 + \sigma^2 x_0'(X'X)^{-1}x_0 $$
or
$$ \sigma^2\bigl(1 + x_0'(X'X)^{-1}x_0\bigr). $$

3.4.4 Residuals

The residuals are also linear functions of the observations:
$$ r = Y - \hat Y = (I - H)Y, $$
and so have a joint normal distribution with
$$ E(r) = E(Y) - E(\hat Y) = X\beta - X\beta = 0 $$
and covariance matrix
$$ \mathrm{Cov}\bigl((I - H)Y\bigr) = \sigma^2(I - H). $$
Thus the standard error of a residual is determined by the corresponding diagonal element of I − H:
$$ \mathrm{SE}(r_i) = s\sqrt{1 - h_{ii}}. $$
The standardized residuals are
$$ \mathrm{sres}_i = \frac{r_i}{s\sqrt{1 - h_{ii}}}. $$
It is a common practice to plot the standardized residuals (which now have variance 1) against the fitted values as a check on model validity. The standardized residuals should appear to be random, with an approximately standard normal distribution, if the model is correct. I prefer to plot the raw residuals, but that is not a strong preference.
Since the null space of I − H is the column span of X, which has dimension p, the rank of I − H is n − p. Thus the distribution of the residual sum of squares is based on the distribution of the sum of n normal but not independent random variables. Furthermore, since H and I − H are mutually orthogonal,
$$ Y = \hat Y + r = HY + (I - H)Y $$
and orthogonality gives us:
$$ \|Y\|^2 = \|HY\|^2 + \|(I - H)Y\|^2. $$
Since (I − H) is orthogonal to the column span of X, it follows that
$$ (I - H)Y = (I - H)(X\beta + \epsilon) = (I - H)\epsilon. $$
So the RSS is just $\|(I - H)\epsilon\|^2$, the squared length of the projection of n independent N(0, σ²) random variables into an (n − p)-dimensional subspace.
Choose an orthonormal basis $\{v_i\}_{i=1,\ldots,n}$ for $\mathbb{R}^n$ so that $\{v_i\}_{i=1,\ldots,p}$ span the column space of X, and thus $\{v_i\}_{i=p+1,\ldots,n}$ span the orthogonal complement. Since the change of basis matrix is an orthogonal matrix $Q = (Q_1, Q_2)$, $Q_2'$ is norm preserving:
$$ \|(I - H)Y\|^2 = \|Q_2'(I - H)Y\|^2. $$
But since $Q_2$ is orthogonal to H, it follows that
$$ Q_2'(I - H) = Q_2' - Q_2'H = Q_2'. $$
Thus
$$ \|(I - H)Y\|^2 = \|Q_2'\epsilon\|^2. $$
In other words, the RSS has the same distribution as the sum of the squares of n − p independent N(0, σ²) random variables.
There are two important results embedded in this discussion:

• If Z has a standard normal distribution, then Z² has a $\chi^2_1$ distribution, or equivalently a Gamma(1/2, 1/2). Further, the sum of k independent $\chi^2_1$ random variables has a $\chi^2_k$ distribution.

• If $P_{\mathcal{U}}$ is a projection from $\mathbb{R}^n$ into the subspace $\mathcal{U}$, with dimension k, and Z is an n-vector of IID N(0, σ²) random variables, then for $P_{\mathcal{U}}Z = U$,
$$ \|P_{\mathcal{U}}Z\|^2 = \sum_{i=1}^n U_i^2 \sim \sigma^2\chi^2_k. $$
A related result is often called Cochran's Theorem: if $P_1, P_2, \ldots, P_k$ are mutually orthogonal projection operators on $\mathbb{R}^n$, and Z is a vector of IID N(0, σ²) random variables, then the quantities $\|P_j Z\|^2$ are independent $\sigma^2\chi^2$ random variables with degrees of freedom rank($P_j$).

3.5 t and F statistics


We will discuss the theory of hypothesis testing later in the course. For now, it will suffice to note that two standard tests arise frequently in the context of the linear model: the t-test for testing the hypothesis that a particular coefficient is equal to 0, and the F-test for testing more complicated hypotheses. For the moment, we will take the Fisherian view of testing: we will reject a null hypothesis if the value of our test statistic corresponds to a rare tail event (an 'outlier') under the distribution implied by the null hypothesis.

3.5.1 t statistics

We observed earlier that the parameter vector β̂ has a normal distribution:
$$ \hat\beta \sim N\bigl(\beta, \sigma^2(X'X)^{-1}\bigr). $$
Thus the marginal distribution of each coefficient is normal, and
$$ \mathrm{SE}(\hat\beta_j) = s\sqrt{(X'X)^{-1}_{j,j}}. $$
We saw in the last section that the residuals are independent of the parameter estimates and fitted values, so the numerator and denominator of the ratio
$$ \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)} $$
are independent of each other. Writing the ratio out,
$$ \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)} = \frac{\hat\beta_j - 0}{s\sqrt{(X'X)^{-1}_{j,j}}} = \frac{\hat\beta_j \Big/ \Bigl(\sigma\sqrt{(X'X)^{-1}_{j,j}}\Bigr)}{\sqrt{s^2/\sigma^2}}. $$
If $\beta_j = 0$, the numerator is a normally distributed random variable divided by its standard deviation, and thus has a standard normal distribution. The denominator is the square root of a $\chi^2_{n-p}$ random variable ($\sum r_i^2/\sigma^2$) divided by its degrees of freedom (n − p). Thus the ratio has a $t_{n-p}$ distribution. If $\beta_j \ne 0$, then the ratio will tend to be larger in absolute magnitude than a $t_{n-p}$ random variable. Hence we are inclined to disbelieve the hypothesis $\beta_j = 0$ if the ratio is too big.

3.5.2 F statistics

The t-statistic can be used to test hypotheses about single parameters. What if we wish to test the hypothesis that several parameters are all equal to zero?
The key idea is that we are really comparing two regression models to see if adding or deleting some explanatory variables or, more generally, imposing some constraints on parameters, leads to a 'surprisingly large' change in the residual sum of squares (RSS). If it does, then the smaller, or restricted, model is judged to fit worse than the larger, or unrestricted, model.
For example, in the regression model
$$ Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \epsilon_i, \qquad i = 1, \ldots, n, $$
we might wish to test the hypothesis that $\beta_2 = \beta_3 = 0$. This is equivalent to simply omitting the corresponding variables from the regression model. We thus have two models, a full model ($M_F$), in vector notation,
$$ Y = \beta_0 \mathbf{1} + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon, $$
and a restricted model ($M_R$),
$$ Y = \beta_0 \mathbf{1} + \beta_1 X_1 + \epsilon. $$
$M_F$ represents the projection of Y into the 4-dimensional subspace of $\mathbb{R}^n$ spanned by $\{\mathbf{1}, X_1, X_2, X_3\}$. $M_R$ represents the projection of Y into the 2-dimensional subspace spanned by $\{\mathbf{1}, X_1\}$, which is itself a subspace of the span of the first set. Think of it as a two-step projection: first project into the subspace defined by the full model, then project the first projection into the subspace defined by the restricted model.
In terms of projection operators, let $V_F$ and $V_R$ be two subspaces of $\mathbb{R}^n$ such that $V_R \subset V_F$. $V_F$ corresponds to $M_F$, and $V_R$ to $M_R$. Let $H_R$ be the projection into $V_R$, $H_F$ be the projection into $V_F$, and $H_{F\perp}$ be the projection into $V_F \cap V_R^\perp$. Then we can orthogonally decompose the vector Y into three components:
$$ Y = H_R Y + H_{F\perp} Y + (I - H_F)Y. $$
Note that $H_{F\perp}$ must equal $H_F - H_R$. It may not be obvious that $H_F - H_R$ is a projection, but it is not hard to demonstrate. Since $V_R \subset V_F$, it follows that $H_R H_F = H_R$, $H_F H_R = H_R$, and thus
$$ (H_F - H_R)^2 = H_F^2 - H_F H_R - H_R H_F + H_R^2 = H_F - H_R - H_R + H_R = H_F - H_R. $$
Furthermore, the difference of the squared lengths of $H_F Y$ and $H_R Y$ is the same as the difference between the squared lengths of $(I - H_R)Y$ and $(I - H_F)Y$, which is just $\mathrm{RSS}_R - \mathrm{RSS}_F$:
$$ \|Y\|^2 = \|H_F Y\|^2 + \|(I - H_F)Y\|^2 $$
and
$$ \|Y\|^2 = \|H_R Y\|^2 + \|(I - H_R)Y\|^2. $$
Subtracting the second equation from the first gives
$$ 0 = \bigl(\|H_F Y\|^2 - \|H_R Y\|^2\bigr) + \bigl(\|(I - H_F)Y\|^2 - \|(I - H_R)Y\|^2\bigr) $$
or
$$ \|H_F Y\|^2 - \|H_R Y\|^2 = \|(I - H_R)Y\|^2 - \|(I - H_F)Y\|^2. $$
3.5. T AND F STATISTICS 43

Finally, since $H_F^\perp$ and $H_R$ are orthogonal, and $H_F^\perp = H_F - H_R$, we have

$$\|H_F Y\|^2 = \|H_F^\perp Y\|^2 + \|H_R Y\|^2.$$

Thus

$$\|H_F^\perp Y\|^2 = \|(H_F - H_R)Y\|^2 = \|(I - H_R)Y\|^2 - \|(I - H_F)Y\|^2.$$

So, we have three mutually orthogonal projections, and in particular,
$\|H_F^\perp Y\|^2$ and $\|(I - H_F)Y\|^2$ are independent $\sigma^2\chi^2$ random variables with
degrees of freedom defined by their ranks, in this case 2 and $n - 4$. Thus

$$F = \frac{\|H_F^\perp Y\|^2/2}{\|(I - H_F)Y\|^2/(n-4)}
= \frac{\|(H_F - H_R)Y\|^2/2}{\|(I - H_F)Y\|^2/(n-4)}
= \frac{(RSS_R - RSS_F)/2}{RSS_F/(n-4)}$$

The F statistic is a ratio of two independent $\chi^2$ random variables, each
divided by its degrees of freedom, and thus has an $F_{2,n-4}$ distribution. If this
ratio is too large, then we should conclude that the restricted model does not
fit as well as the full model.
The general version of this result is completely analogous. Let MF corre-
spond to a projection of rank p, and MR correspond to a projection of rank
q < p, where the range of HR is a subspace of the range of HF , then

$$F = \frac{(RSS_R - RSS_F)/(p-q)}{RSS_F/(n-p)} \sim F_{p-q,\,n-p}.$$

In the case where the restricted model corresponds to setting one or more
coefficients to zero (i.e. omitting the corresponding variables), rejecting the
null hypothesis that those parameters are all zero implies that at least one is
non-zero (or we have been unlucky). It does not imply that the full model is
correct, simply that the restricted model does not fit the data as well as the
full model.
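Here is a minimal R sketch of that comparison on simulated data (the variable names, sample size, and coefficients are made up): the F statistic built directly from the two residual sums of squares should reproduce the anova() output.

# Sketch: the general F statistic computed from the residual sums of
# squares of a full and a restricted model (simulated data).
set.seed(2)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2*x1 + rnorm(n)          # x2, x3 truly have zero coefficients
full <- lm(y ~ x1 + x2 + x3)       # rank p = 4
rest <- lm(y ~ x1)                 # rank q = 2
RSSF <- sum(residuals(full)^2)
RSSR <- sum(residuals(rest)^2)
Fstat <- ((RSSR - RSSF) / (4 - 2)) / (RSSF / (n - 4))
pval  <- pf(Fstat, 2, n - 4, lower.tail = FALSE)
c(Fstat, pval)
anova(rest, full)                  # should match the hand computation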

3.5.3 t2 is an F statistic!
The t-test discussed earlier turns out to be a special case of the F-test,
corresponding to omitting a single variable. Let Z be a standard normal
random variable, and let $X$ be an independent $\chi^2_{n-p}$ random variable; then
$Z/\sqrt{X/(n-p)}$ has a $t_{n-p}$ distribution, and

$$t^2_{n-p} \sim \frac{Z^2/1}{\chi^2_{n-p}/(n-p)} \sim \frac{\chi^2_1/1}{\chi^2_{n-p}/(n-p)} \sim F_{1,\,n-p}.$$

The t-test for the null hypothesis $\beta_k = 0$ corresponds exactly to the F-test
arising from the comparison of the full model with all explanatory variables
to a restricted model omitting the single variable $X_k$. Thus it is conditional
on the model containing all variables included in the full model. In particular,
suppose we have two t-tests which fail to reject the null hypothesis, say for
$\beta_1 = 0$ and $\beta_2 = 0$. What can we conclude? Not much! We have tested $\beta_1$
assuming that $\beta_2$ is not zero, then tested $\beta_2$ assuming that $\beta_1$ is not zero. We
can not conclude that neither variable predicts the response! To test that
hypothesis, we need an F-test. If X1 and X2 are highly correlated with each
other, it is quite possible that the model will fit well if either one is included.
When both are included, neither one appears important viewed through the
lens of the t-test because the other is present to do the work.
Finally, it is important to remember that the F test is not a test of
‘goodness of fit’; it is a model comparison. Our test is conditional on the
model being correct - the relationships are linear, all the explanatory variables
that are related to Y have been included in the model, and the error term ✏
has the assumed properties. That’s a lot to assume!

3.6 The Multiple Correlation Coefficient: R2


In matrix notation, our fitted regression model yields

Y = Ŷ + r

where the terms on the right hand side are uncorrelated, since

$$r = (I - H)Y$$

is orthogonal to
$$\hat{Y} = HY = X\hat\beta.$$
Thus the Pythagorean theorem applies, and after subtracting out the pro-
jection on the constant vector $\mathbf{1}$, which is in the column span of $X$, we have
$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum r_i^2.$$

If we associate the left hand term with the total variance of Y ; the terms
on the right with variance explained by the regression line and unexplained
variance, respectively, then
$$\frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\sum r_i^2}{\sum (Y_i - \bar{Y})^2}.$$

In other words, the ratio is the proportion of variance of the original variable
Y which is ‘explained’ by the regression. Clearly it lies between 0 and 1. It
is possible to show that this is equal to the square of the correlation between
$Y$ and $\hat{Y}$.
$$\mathrm{Cor}(Y, \hat{Y}) = \frac{\sum (Y_i - \bar{Y})\hat{Y}_i}{\sqrt{\sum (Y_i - \bar{Y})^2 \sum (\hat{Y}_i - \bar{Y})^2}}.$$
Hence the name multiple correlation coefficient, or R2 . Since it is inter-
pretable as the proportion of variance explained, it is also sometimes called
the coefficient of determination, and interpreted as a measure of goodness of
fit. One must be careful, as R2 can be close to 1 even when the model is
demonstrably wrong.
Warning: If your regression model does not include the intercept term,
then the definition of $R^2$ is slightly different, since we are not subtracting out
the projection on the vector $\mathbf{1}$:
$$\frac{\sum \hat{Y}_i^2}{\sum Y_i^2} = 1 - \frac{\sum r_i^2}{\sum Y_i^2}$$

which as before must be in (0, 1). It is common to find software that computes
R2 incorrectly in this situation, and you will occasionally see people reporting
values of R2 that are not between 0 and 1. They are in error.
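A small R sketch of these identities, on simulated data (the variables and sample size are made up): the ANOVA-decomposition form, the residual form, and the squared correlation between $Y$ and $\hat{Y}$ all give the same number as summary().

# Sketch: R^2 three ways, compared with summary(fit)$r.squared.
set.seed(3)
x <- rnorm(40); y <- 2 + x + rnorm(40)
fit  <- lm(y ~ x)
yhat <- fitted(fit); r <- residuals(fit)
R2a <- sum((yhat - mean(y))^2) / sum((y - mean(y))^2)
R2b <- 1 - sum(r^2) / sum((y - mean(y))^2)
R2c <- cor(y, yhat)^2
c(R2a, R2b, R2c, summary(fit)$r.squared)   # all four should agree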

3.7 Exercises
1. The QR decomposition.
Let $X$ be an $n \times p$ matrix with rank $p$. Suppose we can find an $n \times p$
matrix $Q$ such that $Q^t Q$ is a $p \times p$ identity matrix $I$, and an upper
triangular $p \times p$ matrix $R$ of rank $p$ such that $X = QR$. For the linear
model
$$Y = X\beta + \epsilon$$

(a) find $\hat\beta$ in terms of $Q$ and $R$.

(b) show that the 'hat' matrix $H = QQ^t$.

(c) find $\hat{Y}$ and $r$ in terms of $Q$ and $R$.

2. Householder transformations.
Let $v$ be a non-zero vector in $R^n$. Show that the rank 1 modification
of the identity matrix defined by

$$T = I - 2\frac{vv^t}{v^t v}$$

is symmetric and orthogonal. Show that if we choose $v = x \pm \|x\|e_1$
then $Tx = \mp\|x\|e_1$. Draw a sketch in $R^2$ illustrating the two choices.

3. Generalized Least Squares


The standard linear model is

$$Y = X\beta + \epsilon$$

where $\epsilon$ has a multivariate $N(0, \sigma^2 I)$ distribution. If instead, we suppose
that $\epsilon$ is a multivariate $N(0, \sigma^2\Sigma)$, then the ordinary least squares esti-
mates (OLS) are no longer efficient. Assuming that $\Sigma$ is known, and
is a symmetric, positive definite $n \times n$ matrix, we can find an upper
triangular matrix $B$ such that $B^t B = \Sigma$, and thus

$$B^{-t} Y = B^{-t} X\beta + B^{-t}\epsilon.$$

Find the least squares estimates $\hat\beta_{GLS}$ for the transformed problem.
Compare the covariance matrix for $\hat\beta_{GLS}$ to the covariance matrix for
$\hat\beta_{OLS}$.

4. Weighted Least Squares


Suppose that we have for each case not the raw data, but an average
value for Yi , based on a sample size ni :

$$\bar{Y}_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \ldots + \beta_{p-1} X_{i,p-1} + \epsilon_i$$

where $\epsilon$ now has a multivariate normal distribution with diagonal co-
variance matrix $\Sigma_{i,i} = \sigma^2/n_i$. Using your solution to the previous
problem, find the least squares estimates for $\beta$.

5. Data analysis: t-tests and F-tests.


On the URL http://www.reed.edu/ jones/141/, find the ‘Highway Ac-
cidents’ dataset. Copy and Paste it into your R session.

The data come from a study of highway accident rates on 39 segments


of highway in 1972. Using the variables:
Rate: accidents per million vehicle miles,
ADT: average daily traffic count in thousands of vehicles,
TV: truck volume as a percent of the total,
LIM: speed limit in miles per hour
LW: lane width in feet, and
SIG: number of intersections with traffic signals per mile.

(a) Fit the full model

$M_F$: Rate ~ ADT + TV + LIM + LW + SIG

and the restricted model

$M_R$: Rate ~ LIM.

(b) Using the extraction function residuals(), compute the sum of the
squared residuals for each model. Verify the computations for
the F statistic given by the anova() function comparing the two
models.
(c) What null hypothesis is tested by the F test here?
(d) Which coefficients in $M_F$ are statistically significantly different
from 0?

(e) Are these results inconsistent? Explain!

6. Omitted variable bias.


In R , create the datasets X, Y, and Z as follows:

x = rnorm(100)
X = x + rnorm(100)
Z = x + rnorm(100)
Y = X + -3*Z + rnorm(100)

Now fit the two regression models Y ~ X + Z and Y ~ X. Inspect the


summary tables.
Now create new variables

x = rnorm(100)
X = x + rnorm(100)
Z = rnorm(100)+ rnorm(100)
Y = X -3*Z + rnorm(100)

and fit the same two models. Explain the changes in the coefficients
of X and the residual variance s2 under each scenario. It may help to
compute variances or correlations for the variables.

7. Errors in Variables.
Load the Birthweight dataset into your R session. The variable smoke
is a ‘dummy’ variable or indicator variable; it is 1 for smokers and 0
for non-smokers.

(a) Fit the model bwt ~ gestation + smoke, and explain why it rep-
resents two parallel lines. What does the coefficient for smoke
represent?
(b) Make a scatterplot of bwt vs. gestation. You should observe
some peculiar points: very low gestation values with normal birth-
weights. What is the probable explanation for these points?
(c) Suppose that we observe gestation with error. We can simulate
the effects of measurement error by creating a new variable which
is gestation plus some noise, say with standard deviations of 10,
20, and 30 days.

G1 = gestation + rnorm(length(gestation),0,10)
G2 = gestation + rnorm(length(gestation),0,20)
G3 = gestation + rnorm(length(gestation),0,30)
Refit the model substituting each of these for gestation. What
are the consequences of the different levels of measurement error
for the estimated coefficients and the residual standard error?

8. Fuel consumption and vehicle weight:


At the Math 141 URL http://www.reed.edu/ jones/141/ follow the link
to the 1993 Consumer Reports Auto data. Load the dataset into your
R session. The R command ‘attach(Cars93)’ adds the column names
of the dataset to the list of datasets that you can access directly by
name.
Plot EPA estimated Highway miles per gallon (Hmpg) against vehicle
weight (Wgt). Plot gallons per mile (1/Hmpg) against vehicle weight
(Wgt). Which relationship appears to be roughly linear with constant
variance? Fit the following regression models:

m1 = lm(Hmpg ~ Wgt, data = Cars93)


m2 = lm(Hmpg ~ Wgt+I(Wgt^2) , data = Cars93)
# I(Wgt^2) is the square of Wgt
W = Wgt- mean(Wgt)
W2 = W^2
m3 = lm(Hmpg ~ W+W2, data = Cars93)
m4 = lm(1/Hmpg ~ Wgt, data = Cars93)

What is the relationship between the models m2 and m3 and their esti-
mated coefficients? Plot the residuals against the fitted values for each
model. Which residual plot corresponds best to a linear relationship
with constant variance?
Chapter 4

Bayes, Laplace, and Inverse


Probability

4.1 Bayes
In elementary probability we learn a theorem due to the Rev. Thomas Bayes
(1701-1767), which gives a mechanism for reversing the direction of condi-
tioning.

Bayes' Theorem: Let the events $A_1, A_2, A_3, \ldots$ be a partition
of the sample space $\Omega$, and $B$ any other event in $\Omega$; then

$$P(A_j \mid B) = \frac{P(B \mid A_j)P(A_j)}{\sum_i P(B \mid A_i)P(A_i)}.$$

Bayes was a Nonconformist minister, and anti-trinitarian, as were many


other notable intellectuals and scientists of the day, such as Isaac Newton
and Joseph Priestley. Bayes’ paper containing a version of this result was
presented posthumously to the Royal Society by his friend Richard Price, who
is reported to have believed that the existence of God is the underlying cause
for statistical regularity (for more details and references, see [14]). Bayes did
not prove the result as stated above, which form was due to Laplace.


In his paper, Bayes deduced the theorem for the special case of inferring
the probability of some event from a sequence of independent trials. Let
X be Binomial(n,p), where p is unknown. He argued that, at least in the
context he was considering, complete uncertainty about the probability p, it
makes sense to represent our uncertainty about p by a uniform distribution
on the interval [0, 1]. Bayes wanted to find the conditional distribution of the
unknown parameter p, given observed data X. In modern notation this is
$$\pi(p \mid X = k) = \frac{P(X = k \mid p)\,\pi(p)}{\int_0^1 P(X = k \mid p)\,\pi(p)\,dp}
= \frac{\binom{n}{k} p^k (1-p)^{n-k}\,\pi(p)}{\int_0^1 \binom{n}{k} p^k (1-p)^{n-k}\,\pi(p)\,dp}
= \frac{\binom{n}{k} p^k (1-p)^{n-k}}{\int_0^1 \binom{n}{k} p^k (1-p)^{n-k}\,dp}
= \frac{p^k (1-p)^{n-k}}{\int_0^1 p^k (1-p)^{n-k}\,dp}$$

The density function of $p$, $\pi(p)$, is known as a prior distribution, since it
represents the state of our knowledge prior to the observation $X = k$, while
$\pi(p \mid X = k)$ represents the updated status of our knowledge about $p$, in
light of the observation. $\pi(p \mid X = k)$ is a density function for the posterior
distribution. Bayes’ statement of the result was in the form:
$$P(a < p < b \mid X = k) = \frac{\int_a^b \binom{n}{k} p^k (1-p)^{n-k}\,dp}{\int_0^1 \binom{n}{k} p^k (1-p)^{n-k}\,dp}.$$

Because of the reversal of the direction of conditioning in Bayes’ Theorem,


this type of calculation became known as ‘inverse probability’. The difficulty
of evaluating the posterior probability via integrating over a given interval
remained. The posterior density for $p$ is a Beta($k+1$, $n-k+1$) density,
and so the integral is an incomplete beta function, which can't be evaluated
analytically. Numerical integration, or quadrature as it was more likely to be
called at the time, would have allowed at least approximate evaluation of the
integral. Bayes, and later Price, did produce approximations for the integral,

which, according to Hald ([11]) can be interpreted as normal approximation


for the Beta distribution, though Hald sees no evidence that Bayes or Price
thought of it in this way.

4.2 Laplace
It is not clear whether Laplace knew Bayes’ result; he was certainly capable
of deducing it himself, and made no mention of Bayes in his own writings. In
any case, in 1774 Laplace published a paper on the same problem Bayes had
studied, and like Bayes, proceeded to assume a uniform prior for p. Laplace's
argument for the uniform distribution in this and similar cases has come to
be known as the Principle of Insufficient Reason: if there is no reason to
believe that one possibility is more likely than any other, assume they have
equal probability. Laplace reframed the problem in a clever way allowing an
exact solution, and then tackled the problem of showing that the posterior
probability for any interval centered on the ‘true’ p has in the limit (as the
sample size increases) probability 1. He produced a marvelous approximation
for the resulting integral, an idea (remarkably) known today as ‘Laplace
approximation’.
One of Laplace's inspirations was to ask a slightly different question:
suppose that we have observed $k$ successes in $n$ independent trials, what
is our best guess of the probability that the next trial will be a success?
We might view this as simply asking for an estimate of $p$, but the more
radical interpretation is that Laplace transformed the problem into a prediction
problem. He proposed $E(p \mid X = k)$ as the answer, and showed how to
compute it. Starting with the posterior density
$$\pi(p \mid X = k) = \frac{p^k (1-p)^{n-k}}{\int_0^1 p^k (1-p)^{n-k}\,dp}$$
the desired expectation is
$$\int_0^1 p\,\pi(p \mid X = k)\,dp
= \int_0^1 p\,\frac{p^k (1-p)^{n-k}}{\int_0^1 p^k (1-p)^{n-k}\,dp}\,dp
= \frac{\int_0^1 p^{k+1} (1-p)^{n-k}\,dp}{\int_0^1 p^k (1-p)^{n-k}\,dp}
= \frac{B(k+2,\,n-k+1)}{B(k+1,\,n-k+1)}
= \frac{k+1}{n+2}$$

The obvious estimator for p, p̂ = X/n, was known at the time. If we take
p̂1 = E(p | X = k) as an estimator for p, we have a very similar estimate, and
with large n the di↵erence will be trivial. Still, there are a couple of issues
worth noting. First, from a frequentist perspective, the standard estimator
X/n is unbiased: (E(X/n) = p).
$$E(\hat{p}_1) = E\left(\frac{X+1}{n+2}\right) = \frac{np+1}{n+2}$$

so p̂1 is not unbiased - it will be closer to 1/2 than X/n. Second, we can inter-
pret the estimator as incorporating pseudo-observations: we have added two
trials, one success and one failure. This expresses mild skepticism about the
certainty of prediction: even after 20 failures with no successes we would still
refuse to estimate the probability as 0, instead we would use 1/22. Finally,
we can rewrite $\hat{p}_1$ in terms of $\hat{p} = X/n$:

$$\hat{p}_1 = \frac{X+1}{n+2} = \hat{p} + \frac{1 - 2\hat{p}}{n+2}.$$

This makes it clear that the two estimators are asymptotically equivalent:
$\hat{p}_1 - \hat{p} \sim O(1/n)$.
Laplace then tackled the problem of showing that when n is large, the
probability that the posterior distribution assigns to any symmetric interval
around the true p goes to 1. In other words, letting p0 be the true value, for
any $\epsilon > 0$
$$\int_{p_0-\epsilon}^{p_0+\epsilon} \pi(p \mid X)\,dp \to 1 \quad\text{as } n \to \infty.$$

While this looks like a version of the Law of Large Numbers, as it implies
$\hat{p}_1 \stackrel{P}{\to} p_0$, embedded in his proof is essentially a central limit theorem, or
normal approximation for the beta distribution. An outline of his argument
follows.

1. Write the integral in exponential form:

$$\int_a^b \frac{p^k(1-p)^{n-k}}{B(k+1,\,n-k+1)}\,dp = c\int_a^b e^{k\log p + (n-k)\log(1-p)}\,dp$$

where $c$ is
$$\frac{1}{B(k+1,\,n-k+1)} = \frac{\Gamma(n+2)}{\Gamma(k+1)\Gamma(n-k+1)}.$$

2. Find the peak of the exponentiated function $f(p)$:

$$0 = \frac{\partial}{\partial p}\left(k\log p + (n-k)\log(1-p)\right) = \frac{k}{p} - \frac{n-k}{1-p}.$$

Solving for $p$, we get $p^* = k/n$.

3. Expand $f(p)$ in a Taylor series around $p^*$:
$$f(p) = f(p^*) + f'(p^*)(p - p^*) + \frac{1}{2}f''(p^*)(p - p^*)^2 + \text{etc.}$$
Show that we can disregard the higher order terms as $n \to \infty$. Since
$p^*$ was chosen to make $f'$ vanish, we get
$$f(p) \approx f(p^*) + \frac{1}{2}f''(p^*)(p - p^*)^2.$$
Evaluating the second derivative we have
$$f''(p) = -\frac{k}{p^2} - \frac{n-k}{(1-p)^2}$$
and thus
$$f''(k/n) = -\frac{k}{(k/n)^2} - \frac{n-k}{(1-k/n)^2} = -n\left(\frac{1}{p^*} + \frac{1}{1-p^*}\right) = -\frac{n}{p^*(1-p^*)}.$$

4. Substitute the approximation into the integral and evaluate.

$$c\int_a^b e^{k\log p + (n-k)\log(1-p)}\,dp
\approx c\int_a^b e^{k\log p^* + (n-k)\log(1-p^*) - \frac{1}{2}\frac{n(p-p^*)^2}{p^*(1-p^*)}}\,dp
= c\,(k/n)^k (1-k/n)^{n-k} \int_a^b e^{-\frac{1}{2}\frac{n(p-p^*)^2}{p^*(1-p^*)}}\,dp$$

For $a = p^* - \epsilon$ and $b = p^* + \epsilon$ we observe that the change of variables

$$z = \frac{p - p^*}{\sqrt{p^*(1-p^*)/n}}$$

results in the integral

$$\int_{p^*-\epsilon}^{p^*+\epsilon} e^{-\frac{1}{2}\frac{n(p-p^*)^2}{p^*(1-p^*)}}\,dp
= \sqrt{p^*(1-p^*)/n}\int_{-x}^{x} e^{-\frac{1}{2}z^2}\,dz$$

where
$$x = \frac{\sqrt{n}\,\epsilon}{\sqrt{p^*(1-p^*)}}.$$

Letting $n \to \infty$, the last integral becomes

$$\int_{-\infty}^{\infty} e^{-\frac{1}{2}z^2}\,dz.$$

Laplace was the first to evaluate this integral, and show that it equals
$\sqrt{2\pi}$.

5. Finally, substituting $k!$ for $\Gamma(k+1)$ we have

$$\frac{(n+1)!}{k!\,(n-k)!}\,(k/n)^k (1-k/n)^{n-k}\,\sqrt{p^*(1-p^*)/n}\,\sqrt{2\pi}.$$

Laplace used Stirling's approximation for the factorial function to show
that this quantity approaches 1 as $n \to \infty$.

Laplace did still more in his 1774 paper, but we will stop here.
One may derive Stirling's approximation for the gamma function using
Laplace's method:
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx = \int_0^\infty e^{(\alpha-1)\log x - x}\,dx.$$
As before, find the mode of the function $f(x) = (\alpha-1)\log x - x$, and expand
$f$ in a Taylor series around that point. I'll leave the rest as an exercise for
the reader.
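As a numerical aside, here is a small R sketch comparing the exact Beta($k+1$, $n-k+1$) posterior probability of an interval around $p^* = k/n$ with the normal approximation that falls out of Laplace's argument. The counts $n$ and $k$ and the interval width are made up.

# Sketch: Laplace's normal approximation to a Beta posterior interval.
n <- 100; k <- 37; eps <- 0.1
pstar <- k / n
exact  <- pbeta(pstar + eps, k + 1, n - k + 1) -
          pbeta(pstar - eps, k + 1, n - k + 1)
sdev   <- sqrt(pstar * (1 - pstar) / n)     # variance from -1/f''(p*)
approx <- pnorm(pstar + eps, pstar, sdev) - pnorm(pstar - eps, pstar, sdev)
c(exact = exact, laplace = approx)          # should be close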

4.3 Bayesian Inference


It is natural to consider generalizing the basic idea formulated by Bayes and
Laplace. Given a probability model for the data, $f(x \mid \theta)$, given one or more
parameters $\theta$, and a prior $\pi(\theta)$, we can restate Bayes formula for densities:
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}.$$
This statement is quite general: $x$ and/or $\theta$ might be a vector, so $f$ might
be a univariate or multivariate density, discrete or continuous, and there is a
corresponding statement for a discrete parameter space replacing the integral
with a sum.
Let us start with the case of independent random variables with a com-
mon distribution. Suppose we have $\tilde{x} = (X_1, X_2, X_3, \ldots, X_n)$ all with density
$f(x \mid \theta)$. The joint density function factors into the product of the marginal
densities:
$$f_n(\tilde{x} \mid \theta) = f(x_1 \mid \theta)f(x_2 \mid \theta)\cdots f(x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta).$$

Consider an independent sequence of Bernoulli($p$) trials. The joint prob-
ability for $X_1, X_2, X_3, \ldots, X_n$ may be written
$$\prod_{i=1}^n p^{X_i}(1-p)^{1-X_i} = p^{\sum X_i}(1-p)^{n - \sum X_i}.$$

Since each $X_i$ is either 0 or 1, we get a factor of $p$ whenever $X_i = 1$, and
a factor of $1-p$ whenever $X_i = 0$. When we used the uniform prior for
$p$, following Bayes and Laplace, we ended up with a beta distribution for
the posterior. What happens if we start with a beta distribution? Let $p \sim$
Beta($r, s$). Bayes' Theorem states
$$\pi(p \mid \tilde{x}) = \frac{f(\tilde{x} \mid p)\,\pi(p)}{\int_0^1 f(\tilde{x} \mid p)\,\pi(p)\,dp}
= \frac{p^{\sum X_i}(1-p)^{n-\sum X_i}\,\frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}p^{r-1}(1-p)^{s-1}}
{\int_0^1 p^{\sum X_i}(1-p)^{n-\sum X_i}\,\frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}p^{r-1}(1-p)^{s-1}\,dp}.$$
This looks pretty ugly, but we can simplify it considerably. Let $Y = \sum X_i$,
and cancel the constants in the numerator and denominator:
$$\pi(p \mid Y) = \frac{p^Y(1-p)^{n-Y}\,p^{r-1}(1-p)^{s-1}}{\int_0^1 p^Y(1-p)^{n-Y}\,p^{r-1}(1-p)^{s-1}\,dp}
= \frac{p^{Y+r-1}(1-p)^{n-Y+s-1}}{\int_0^1 p^{Y+r-1}(1-p)^{n-Y+s-1}\,dp}
= \frac{p^{Y+r-1}(1-p)^{n-Y+s-1}}{B(Y+r,\,n-Y+s)}
= \frac{\Gamma(n+r+s)}{\Gamma(Y+r)\Gamma(n-Y+s)}\,p^{Y+r-1}(1-p)^{n-Y+s-1}$$

Interesting! We get yet another beta distribution for the posterior. In fact,
the Bayes/Laplace computation is just the special case r = s = 1.
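A short R sketch of this conjugate updating, with made-up prior parameters and data: the Beta($r,s$) prior and the Beta($Y+r$, $n-Y+s$) posterior are plotted together, with the posterior mean marked.

# Sketch: Beta prior updated by Y successes in n Bernoulli trials.
r <- 2; s <- 2; n <- 20; Y <- 14           # illustrative values
curve(dbeta(x, r, s), 0, 1, ylim = c(0, 5), ylab = "density", lty = 2)
curve(dbeta(x, Y + r, n - Y + s), add = TRUE)     # posterior
abline(v = (Y + r) / (n + r + s), col = "gray")   # posterior mean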

4.3.1 Conjugate Priors


The combination of the Beta prior with the binomial distribution for the
data is an example of a class of mathematically convenient pairs of families
of probability distributions. The convenience comes from the fact that the
binomial probability expressed in terms of the parameter p has the same
functional form as the Beta prior for p. In general, when the density or
probability function $f(x \mid \theta)$ is considered a function of the parameter $\theta$
rather than $x$, it is called the likelihood function. Another way of stating
the condition is that the likelihood and prior (which are both functions of
the unknown parameter), must be members of a class of functions which
is closed under multiplication: the posterior will be in the same family as
the prior just in case the product of the likelihood and the prior has the
same functional form as the prior. In such cases, the prior is said to be a
conjugate prior for the probability distribution of the data.
Consider a single observation from a $N(\mu, \sigma^2)$ distribution, and assume
that $\sigma^2$ is known. The density function is
$$f(x \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}.$$

Note that, as a function of $\mu$, it has the same functional form as a Normal,
so we should try using a Normal prior for $\mu$, say Normal($\mu_0, \sigma_0^2$). Bayes'
Theorem states:
$$\pi(\mu \mid x)
= \frac{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,\frac{1}{\sqrt{2\pi\sigma_0^2}}e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}}
{\int \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,\frac{1}{\sqrt{2\pi\sigma_0^2}}e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}\,d\mu}
= \frac{e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}}
{\int e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}\,d\mu}
= \frac{e^{-\frac{1}{2}\left(\frac{(x-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2}\right)}}
{\int e^{-\frac{1}{2}\left(\frac{(x-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2}\right)}\,d\mu}$$

We want to express the argument of the exponential function as a quadratic
in $\mu$, and cancel any factors not involving $\mu$ with the corresponding factors
in the denominator.
$$\frac{(x-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2}
= \frac{x^2 - 2x\mu + \mu^2}{\sigma^2} + \frac{\mu^2 - 2\mu\mu_0 + \mu_0^2}{\sigma_0^2}
= \frac{\sigma_0^2(x^2 - 2x\mu + \mu^2) + \sigma^2(\mu^2 - 2\mu\mu_0 + \mu_0^2)}{\sigma_0^2\sigma^2}
= \frac{\mu^2(\sigma_0^2 + \sigma^2) - 2\mu(\sigma_0^2 x + \sigma^2\mu_0) + (\sigma_0^2 x^2 + \sigma^2\mu_0^2)}{\sigma_0^2\sigma^2}$$
The last term in the sum does not involve $\mu$, and thus will cancel the corre-
sponding factor in the denominator. We now complete the square in $\mu$, after
dividing numerator and denominator by $(\sigma_0^2 + \sigma^2)$:
$$\frac{\mu^2(\sigma_0^2 + \sigma^2) - 2\mu(\sigma_0^2 x + \sigma^2\mu_0)}{\sigma_0^2\sigma^2}
= \frac{\mu^2 - 2\mu\left(\frac{\sigma_0^2}{\sigma_0^2+\sigma^2}x + \frac{\sigma^2}{\sigma_0^2+\sigma^2}\mu_0\right)}{\sigma_0^2\sigma^2/(\sigma_0^2+\sigma^2)}.$$

We add and subtract
$$\frac{\left(\frac{\sigma_0^2}{\sigma_0^2+\sigma^2}x + \frac{\sigma^2}{\sigma_0^2+\sigma^2}\mu_0\right)^2}{\sigma_0^2\sigma^2/(\sigma_0^2+\sigma^2)}.$$
The positive term combines to complete the square, and the negative term,
which does not involve $\mu$, will again cancel with the corresponding factor
in the denominator. Finally, we have the quadratic form in $\mu$:
$$\frac{\left(\mu - \left(\frac{\sigma_0^2}{\sigma_0^2+\sigma^2}x + \frac{\sigma^2}{\sigma_0^2+\sigma^2}\mu_0\right)\right)^2}{\sigma_0^2\sigma^2/(\sigma_0^2+\sigma^2)}.$$

To simplify the expression, let
$$\mu_1 = \frac{\sigma_0^2}{\sigma_0^2+\sigma^2}x + \frac{\sigma^2}{\sigma_0^2+\sigma^2}\mu_0$$
and
$$\sigma_1^2 = \frac{\sigma_0^2\,\sigma^2}{\sigma_0^2+\sigma^2};$$
then we can write the posterior distribution as
$$\pi(\mu \mid x) = \frac{e^{-\frac{1}{2}\frac{(\mu-\mu_1)^2}{\sigma_1^2}}}{\int e^{-\frac{1}{2}\frac{(\mu-\mu_1)^2}{\sigma_1^2}}\,d\mu}.$$
The denominator must be the constant that makes the posterior a density for
$\mu$. Since the functional form is that of a normal distribution, we can infer that
the integral in the denominator is $\sqrt{2\pi\sigma_1^2}$. Thus the posterior distribution
is normal with mean $\mu_1$ and variance $\sigma_1^2$. Note that the posterior mean is a
weighted average of the observed value $x$ and the prior mean $\mu_0$, with weights
inversely proportional to the variances. When $\sigma^2 \gg \sigma_0^2$, the observation is
not very precise, relative to the prior, and the posterior mean is close to $\mu_0$.
When $\sigma^2 \ll \sigma_0^2$, the prior is not very precise, relative to the data, and the
posterior mean is close to $x$.
It is worth noting that the algebra is simplified a bit if we replace the
parameter $\sigma^2$ with $1/\tau$, where $\tau$ is called the precision. Since
$$\sigma_1^2 = \frac{\sigma_0^2\,\sigma^2}{\sigma_0^2+\sigma^2} = \left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}$$

we have a very simple expression for the posterior precision in terms of the
precision of the data, $\tau$, and the prior precision $\tau_0$:

$$\tau_1 = \tau + \tau_0.$$

Similarly, the posterior mean can be expressed in terms of the precisions
instead of the variances:
$$\mu_1 = \frac{\tau}{\tau+\tau_0}\,x + \frac{\tau_0}{\tau+\tau_0}\,\mu_0.$$

Now, suppose that we have an independent, identically distributed sample
of $n$ observations $X_i \sim N(\mu, \sigma^2)$, where $\sigma^2$ is known. The joint density for
an independent sample of size $n$ is the product of their marginal densities:
$$f_n(\tilde{x} \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2}}
= \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{1}{2}\sum_{i=1}^n \frac{(x_i-\mu)^2}{\sigma^2}}.$$

We expand and simplify the sum, while adding and subtracting the mean $\bar{x}$:
$$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2
= \sum_{i=1}^n (x_i - \bar{x})^2 + \sum_{i=1}^n (\bar{x} - \mu)^2 + 2\sum_{i=1}^n (x_i - \bar{x})(\bar{x} - \mu)$$
$$= \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 + 2(\bar{x} - \mu)\sum_{i=1}^n (x_i - \bar{x})
= \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$$

Substituting this expression into the density we have

$$f_n(\tilde{x} \mid \mu) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{1}{2}\frac{n(\bar{x}-\mu)^2 + \sum(x_i - \bar{x})^2}{\sigma^2}}.$$

Finally, invoking the good name of Bayes, and assuming a N($\mu_0, \sigma_0^2$) prior,

we write the posterior density as
$$\pi(\mu \mid \tilde{x})
= \frac{\left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{1}{2}\frac{n(\bar{x}-\mu)^2 + \sum(x_i-\bar{x})^2}{\sigma^2}}\,\frac{1}{\sqrt{2\pi\sigma_0^2}}e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}}
{\int \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{1}{2}\frac{n(\bar{x}-\mu)^2 + \sum(x_i-\bar{x})^2}{\sigma^2}}\,\frac{1}{\sqrt{2\pi\sigma_0^2}}e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}\,d\mu}
= \frac{e^{-\frac{1}{2}\frac{n(\bar{x}-\mu)^2}{\sigma^2}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}}
{\int e^{-\frac{1}{2}\frac{n(\bar{x}-\mu)^2}{\sigma^2}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}\,d\mu}
= \frac{e^{-\frac{1}{2}\frac{(\bar{x}-\mu)^2}{\sigma^2/n}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}}
{\int e^{-\frac{1}{2}\frac{(\bar{x}-\mu)^2}{\sigma^2/n}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}\,d\mu}$$

Observe that this is exactly the expression we evaluated for the case of a
single observation, except that we have $\bar{x}$ replacing $x$, and $\sigma^2/n$ replacing
$\sigma^2$. In fact, we have the same expression we would have if we had started by
recording only $\bar{x}$, and discarding the individual values $x_i$, since the distribu-
tion of the mean $\bar{x}$ of $n$ independent N($\mu, \sigma^2$) observations is N($\mu, \sigma^2/n$).
That may seem like a convenient accident at the moment, but
we will see later that this is an example of a much deeper concept. In any
case, we can simply quote our earlier result: $(\mu \mid \tilde{x}) \sim N(\mu_1, \sigma_1^2)$, where
$$\mu_1 = \frac{\sigma_0^2}{\sigma_0^2+\sigma^2/n}\,\bar{x} + \frac{\sigma^2/n}{\sigma_0^2+\sigma^2/n}\,\mu_0$$

and
$$\sigma_1^2 = \frac{\sigma_0^2\,\sigma^2/n}{\sigma_0^2+\sigma^2/n}.$$

4.4 Exercises
For each of the following distributions, try to find a conjugate or other con-
venient prior distribution. For that prior, produce an estimator such as the
posterior mean, and find the posterior variance.

1. $X_1, X_2, \ldots, X_n$ are IID Uniform(0, $\theta$).

2. $X_1, X_2, \ldots, X_n$ are IID Uniform(0, $\theta^2$), where $\theta$ is the parameter of in-
terest.

3. $X$ is Negative Binomial($r$, $p$).

4. $X_1, X_2, \ldots, X_n$ are IID Pareto($\alpha$, $\beta$) with $\alpha$ known.

5. $X_1, X_2, \ldots, X_n$ are IID Poisson($\lambda$).

6. Let $X_1, X_2, \ldots, X_n$ be IID N($\mu$, $\tau$), where $\tau = 1/\sigma^2$ is the precision.
Choose a N($\mu_0$, $\tau\tau_0$) prior for $\mu$ and a gamma prior for $\tau$. Find the joint
posterior distribution and each marginal.

7. Let $X_1, X_2, \ldots, X_n$ be IID N($\mu$, $\tau$), where $\tau = 1/\sigma^2$ is the precision.
Suppose that we take as our prior for $\mu$ the mixture distribution

$$\mu \sim p_1\, n(\mu \mid \alpha_1, \tau_1) + p_2\, n(\mu \mid \alpha_2, \tau_2)$$

where $n(\mu \mid \alpha_i, \tau_i)$ is the density function for a N($\alpha_i$, $\tau_i$) distribution. Find
the posterior distribution and the posterior mean and variance.

8. Using Laplace's approximation, derive Stirling's formula

$$\Gamma(n+1) = n! \approx \sqrt{2\pi n}\; n^n e^{-n}$$
Chapter 5

R A Fisher and Maximum


Likelihood

5.1 Background

The maximum likelihood criterion for choosing an estimate is a version of


‘choose the parameter that best fits’ the data, where best fit is taken to mean
‘maximize the probability of the observed data’. Hald [11] notes several early
uses of the maximum likelihood criterion, including Lambert for the multi-
nomial distribution, and Daniel Bernoulli for a semi-circular distribution.
Maximizing the likelihood is equivalent to maximizing the posterior mode
when the prior is uniform, as in Laplace’s treatment of the binomial. Karl
Pearson used a version of Laplace’s method to get approximate standard
errors for his method of moments estimators. Pearson had also proposed
the $\chi^2$ 'goodness of fit' test for density curves, which could have led to a
theory of estimation (choose the parameters to minimize the $\chi^2$ statistic),
but was rejected as a basis for a theory of estimation due to the difficulty of
the resulting computations. Thus when R. A. Fisher appeared on the scene,
the idea of maximizing the posterior (that is using the posterior mode as
an estimator) was known, and people were thinking about criteria for ‘curve
fitting’.


5.2 Fisher’s 1912 paper


Ronald Aylmer Fisher (1890-1962) was a prolific and creative scientist. While
still an undergraduate student at Cambridge, he published a paper titled On
an Absolute Criterion for Fitting Frequency Curves [6] which argued that
the method of moments and least squares were arbitrary and unsuited to
the problem of fitting a relative frequency curve (density function) to data.
Fisher argued

But we may solve the real problem directly.


If $f$ is an ordinate of the theoretical curve of unit area, then
$p = f\,\delta x$ is the chance of an observation falling within the range
$\delta x$; and if
$$\log P' = \sum_1^n \log p,$$
then $P'$ is proportional to the chance of a given set of observations
occurring. . . ., so the probability of any particular set of $\theta$'s is
proportional to $P$, where
$$\log P = \sum_1^n \log f.$$

The most probable set of values for the $\theta$'s will make $P$ a maxi-
mum.

Implicit in his argument is the fact that $P$ is a function of $\theta$ and that $P(\theta)$ is
maximized at the same point that $\log P(\theta)$ is maximized. You will note that
Fisher is using the language of ‘inverse probability’, as had been common
since Laplace, to describe the criterion, now called maximum likelihood.
Fisher didn’t invent the term likelihood until 1921, to distinguish his method
from inverse probability.
He went on to illustrate the method in the case of independent observa-
tions from a N($\mu, \sigma^2$) distribution. Let $x_1, x_2, x_3, \ldots, x_n$ be our sample, then
(in more modern notation)
$$\log f(x_1, x_2, x_3, \ldots, x_n \mid \mu, \sigma^2) = \sum_{i=1}^n \left(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2}\right).$$

We will differentiate with respect to the parameters $\mu$ and $\sigma^2$, and solve for
the maximum. Let the log likelihood function be denoted $\log L(\mu, \sigma^2)$; then
$$\frac{\partial}{\partial\mu}\log L(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right)$$

$$\frac{\partial}{\partial\sigma^2}\log L(\mu, \sigma^2) = -\frac{1}{2}\frac{n}{\sigma^2} + \frac{1}{2}\frac{\sum_{i=1}^n (x_i-\mu)^2}{(\sigma^2)^2}$$

Setting the derivatives equal to 0, we get the likelihood equations.

$$0 = \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right) \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$

and
$$0 = -\frac{1}{2}\frac{n}{\sigma^2} + \frac{1}{2}\frac{\sum_{i=1}^n(x_i-\mu)^2}{(\sigma^2)^2}
\;\Rightarrow\; \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$$

Substituting the solution $\hat\mu = \bar{x}$ into the last equation, we have:

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$

This same strategy works in most cases, though you must be careful to
consider the possibility that the maximum occurs at the boundary of the
parameter space.

Note that we differentiated with respect to $\sigma^2$, not $\sigma$. Would we get a
different answer if we took $\sigma$ as our parameter?
$$\frac{\partial}{\partial\sigma}\log L(\mu, \sigma)
= \frac{\partial}{\partial\sigma}\left(-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\sum_{i=1}^n\frac{(x_i-\mu)^2}{\sigma^2}\right)
= -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (x_i-\mu)^2}{\sigma^3}$$

$$\Rightarrow\; \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2
\;\Rightarrow\; \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2}$$

We get a consistent result. This turns out to be true for any one-to-one
transformation of the parameters: if $q(\theta)$ is one-to-one then $\hat{q} = q(\hat\theta)$. This
is a nice property, and does not hold for all estimation criteria.
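A numerical illustration of this invariance in R, with simulated data (the sample size, mean, and standard deviation are made up): maximizing the normal log likelihood over $(\mu, \sigma)$ and over $(\mu, \sigma^2)$ leads to the same fitted variance, which also matches the closed-form MLE.

# Sketch: MLE invariance under reparametrization (simulated data).
set.seed(7)
x <- rnorm(50, mean = 10, sd = 3)
nll1 <- function(p) -sum(dnorm(x, p[1], p[2], log = TRUE))        # (mu, sigma)
nll2 <- function(p) -sum(dnorm(x, p[1], sqrt(p[2]), log = TRUE))  # (mu, sigma^2)
f1 <- nlm(nll1, c(mean(x), sd(x)))
f2 <- nlm(nll2, c(mean(x), var(x)))
c(f1$estimate[2]^2, f2$estimate[2], mean((x - mean(x))^2))  # all agree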

5.3 A Computational Example


Let $X_1, X_2, \ldots, X_n$ be independent observations sampled from a Weibull($\alpha$,
$\beta$) distribution. The density function is

$$f(x \mid \alpha, \beta) = \frac{\alpha x^{\alpha-1}}{\beta^\alpha}\,e^{-(x/\beta)^\alpha}\,I(x > 0).$$
The likelihood equations are not analytically solvable. The example below
illustrates the use of the R nlm() function (non-linear minimization) for
numerical solution of the likelihood equations. The algorithm is a variant
of Newton’s method, applied to minimize the negative of the log likelihood
function.

> X <- rweibull(20,2,10)



> X
[1] 3.241291 4.714366 3.248232 4.607278 7.080301 3.946579 5.616528
[8] 11.164167 8.039719 5.578329 8.558086 23.018470 20.225088 5.486868
[15] 2.458238 14.914739 6.503793 7.425430 14.434551 8.903616

> LL <- function(p) -sum(dweibull(X,p[1],p[2],log=TRUE))


> mle <- nlm(LL,c(1,10),hessian=TRUE)
> mle
$minimum
[1] 58.9482

$estimate
[1] 1.675988 9.556432

$gradient
[1] 8.182324e-07 2.327227e-07

$hessian
          [,1]       [,2]
[1,] 14.878803 -1.0205698
[2,] -1.020570  0.6148619

$code
[1] 1

$iterations
[1] 9

> solve(mle$hessian) # compute H-inverse


[,1] [,2]
[1,] 0.07584477 0.1258899
[2,] 0.12588986 1.8353380

> M <- matrix(0,nrow=300,ncol=300)

> sqrt(.07)
[1] 0.2645751

> a <- seq(.05,5,len=300)

> sqrt(1.8)
[1] 1.341641

> b <- seq(3,18,len=300)

> for(i in 1:300){


+ for(j in 1:300){
+ M[i,j] <- LL(c(a[i],b[j]))
+ }}

> contour(a,b,M,levels=c(59,60,61,62,63,64,65,67,70))

> curve(dweibull(x,1.67,9.55),0,30,col="red")
> rug(X,col="blue")

[Figure: Likelihood contours for the Weibull example. Contour plot of the negative log likelihood, with alpha on the horizontal axis and beta on the vertical axis.]

The contour plot above displays the contours of the negative log likeli-
hood function. This serves as a diagnostic, since the shape of the likelihood
function is related to the precision of estimation. The following plot is the
Weibull density for the estimated parameters, with the observations marked
below the horizontal axis.

[Figure: Weibull density. Plot of the fitted density dweibull(x, 1.67, 9.55), with the observations marked by a rug along the horizontal axis.]

5.4 Properties of MLE’s


Going all the way back to Laplace’s method, we see a common thread woven
around the log of the likelihood function. To explore this thread a bit further,
we will focus on the random variable called the score function or 'efficient
score'
$$U(X) = \frac{\partial}{\partial\theta}\log f(X \mid \theta).$$
Since $\frac{\partial}{\partial\theta}\log f(X \mid \theta)$ is a function of the random variable $X$, it too is a
random variable. It is the derivative of the log likelihood with respect to the

parameter, evaluated at the random value $X$.

First, we show that as long as we can interchange the order of differentia-
tion and integration (or summation for a discrete random variable), $E(U) = 0$.
Probability distributions for which this is valid are called regular families.
$$E(U) = \int \frac{\partial}{\partial\theta}\log f(X \mid \theta)\, f(X \mid \theta)\,dX
= \int \frac{\frac{\partial}{\partial\theta} f(X \mid \theta)}{f(X \mid \theta)}\, f(X \mid \theta)\,dX
= \int \frac{\partial}{\partial\theta} f(X \mid \theta)\,dX
= \frac{\partial}{\partial\theta}\int f(X \mid \theta)\,dX
= \frac{\partial}{\partial\theta}(1)
= 0$$

We define the variance of $U$ to be $I(\theta)$. The $I$ stands for information, or
Fisher Information, for reasons that will be discussed later. For regular
families, where $E(U) = 0$, we have

$$\mathrm{Var}(U) = E(U^2) = I(\theta).$$

Furthermore, there is a connection to the shape of the log likelihood function.
Starting with the expression for $E(U)$ in the equation above, and differenti-
ating, we have
$$0 = \int \frac{\partial}{\partial\theta}\log f(X \mid \theta)\, f(X \mid \theta)\,dX$$
$$\frac{\partial}{\partial\theta}(0) = \frac{\partial}{\partial\theta}\left(\int \frac{\partial}{\partial\theta}\log f(X \mid \theta)\, f(X \mid \theta)\,dX\right)$$
$$0 = \int \left(\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\right) f(X \mid \theta)\,dX + \int \frac{\partial}{\partial\theta}\log f(X \mid \theta)\,\frac{\partial}{\partial\theta} f(X \mid \theta)\,dX$$
$$0 = \int \left(\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\right) f(X \mid \theta)\,dX + \int \frac{\partial}{\partial\theta}\log f(X \mid \theta)\,\frac{\frac{\partial}{\partial\theta} f(X \mid \theta)}{f(X \mid \theta)}\, f(X \mid \theta)\,dX$$
$$0 = E\left(\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\right) + E\left(\left(\frac{\partial}{\partial\theta}\log f(X \mid \theta)\right)^2\right)$$

Thus we have
$$E\left(\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\right) = -I(\theta).$$
Recall from Laplace's method that the resulting normal approximation has
as its variance the reciprocal of the negative second derivative of the log likelihood at
the posterior mode (the MLE!); we see a hint of things to come.
The random variable $U(X)$ is a function of a single random variable. If we
have a sample of independent, identically distributed random variables from a
regular family, then we can apply the central limit theorem. If $X_1, X_2, \ldots, X_n$
are an IID sample from a regular family, then $U(X_1), U(X_2), \ldots, U(X_n)$ are
IID with mean 0 and variance $I(\theta)$, and the central limit theorem states that
$$\frac{\frac{1}{n}\sum U_i - 0}{\sqrt{I(\theta)/n}} \stackrel{D}{\to} N(0, 1).$$

For IID sampling, the joint likelihood factors into the product of the
marginal likelihoods, and thus the joint log likelihood is the sum of the
marginal log likelihoods, so when we differentiate with respect to $\theta$ we get:
$$\mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(x_1, x_2, \ldots, x_n \mid \theta)\right)
= \mathrm{Var}\left(\frac{\partial}{\partial\theta}\sum \log f(x_i \mid \theta)\right)
= \mathrm{Var}\left(\sum \frac{\partial}{\partial\theta}\log f(x_i \mid \theta)\right)$$
$$= \sum \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(x_i \mid \theta)\right)
= nI(\theta)$$

Thus, for independent samples, the information in n observations is just n


times the information in a single observation. We can restate the central

limit theorem above in terms of the total information:

$$\frac{\sum U_i}{\sqrt{nI(\theta)}} \stackrel{D}{\to} N(0, 1).$$

We are ready to prove an important result about MLE’s.

Theorem 1 Let $X_1, X_2, \ldots, X_n$ be an independent sample from a reg-
ular family with parameter $\theta$. Let $\hat\theta_n$ be the solution to the likelihood
equation
$$\frac{\partial}{\partial\theta}\log f(x_1, x_2, \ldots, x_n \mid \theta) = 0.$$
Then
$$\sqrt{nI(\theta)}\,(\hat\theta_n - \theta) \stackrel{D}{\to} N(0, 1).$$

We will sketch the proof, omitting a few technical details. Consider the
function $U_i = \frac{\partial}{\partial\theta}\log f(x_i \mid \theta)$ as a function of both $x_i$ and $\theta$, and expand
$U_i(\hat\theta)$ around the true $\theta$. Summing the $U_i$, we have
$$\sum \frac{\partial}{\partial\theta}\log f(x_i \mid \hat\theta) = \sum \frac{\partial}{\partial\theta}\log f(x_i \mid \theta) + \left(\sum \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \theta)\right)(\hat\theta - \theta) + \text{etc.}$$

Since $\hat\theta_n$ is the MLE, the lefthand side of the equation equals zero. Disre-
garding the higher order terms, we have
$$0 = \sum U_i + \left(\sum \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \theta)\right)(\hat\theta - \theta).$$

After a little rearrangement, we get

$$\frac{\frac{1}{n}\sum U_i}{\sqrt{I(\theta)/n}} = \frac{-\frac{1}{n}\sum \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \theta)}{\sqrt{I(\theta)/n}}\,(\hat\theta - \theta).$$

Since the expected value of each term in the second derivative sum is $-I(\theta)$, the
LLN implies that
$$-\frac{1}{n}\sum \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \theta) \to I(\theta)$$

and finally by Slutsky's theorem, we have

$$\sqrt{nI(\theta)}\,(\hat\theta_n - \theta) \approx \frac{\frac{1}{n}\sum U_i}{\sqrt{I(\theta)/n}} \stackrel{D}{\to} N(0, 1).$$

Thus 'well-behaved' MLE's are asymptotically normally distributed, with
asymptotic variance $1/(nI(\theta))$.
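Here is a small R simulation illustrating the theorem for one familiar case, the Exponential rate parameter, where the MLE is $1/\bar{x}$ and $I(\theta) = 1/\theta^2$; the true rate, sample size, and number of replications are made up.

# Sketch: asymptotic normality of the exponential-rate MLE.
set.seed(8)
theta <- 2; n <- 200
mle <- replicate(5000, 1 / mean(rexp(n, rate = theta)))
z   <- sqrt(n / theta^2) * (mle - theta)   # sqrt(n I(theta)) (theta_hat - theta)
c(mean(z), sd(z))                          # roughly 0 and 1
hist(z, breaks = 40, freq = FALSE)
curve(dnorm(x), add = TRUE)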

5.5 Exercises
For each of the following probability distributions, find the maximum likeli-
hood estimates assuming you have an independent sample X1 , X2 , . . . Xn of
observations.

1. $X_i \sim$ Poisson($\theta$)

2. $X \sim$ Negative Binomial($r$, $p$), where $r$ is known.

3. $X_i \sim$ Exponential($\theta$)

4. $X_i \sim$ Gamma($\alpha$, $\beta$), first assuming $\alpha$ is known and $\beta$ unknown, and
then assuming both are unknown.

5. $X_i \sim$ Unif($\alpha$, $\beta$).

6. $X_i \sim$ Unif(0, $\theta^2$), where $\theta$ is the parameter of interest.

7. $X_i \sim$ Pareto($\alpha$, $\beta$), assuming that $\beta$ is known.

8. $X_i \sim$ Pareto($\alpha$, $\beta$), assuming that $\alpha$ is known.

9. $X_i \sim$ N($\mu$, $\tau$), where $\tau = 1/\sigma^2$ is the precision.
Chapter 6

Sufficiency and Efficiency

6.1 Background
R. A. Fisher’s seminal 1922 paper On the mathematical foundations of the-
oretical statistics [7] started with a philosophical discussion of the purpose
of statistical methods. Fisher noted that there was considerable confusion of
terminology, extending to failures to distinguish between parameter values
and their empirical estimates. He observed that the problems of application
of statistical methods for a given dataset revolve around the question ‘of what
population is this a random sample?’. Answering that question involves data
reduction, the process of extracting the information about the population
from the sample. Fisher laid out a list of problems which was to have
lasting influence:

The problems which arise in reduction of data may be conve-


niently divided into three types:

1. Problems of Specification. These arise in the choice of the


mathematical form of the population.
2. Problems of Estimation. These involve the choice of meth-
ods of calculating from a sample statistical derivates, or as
we shall call them statistics, which are designed to estimate
the values of the parameters of the hypothetical population.
3. Problems of Distribution. These include discussions of the


distributions of statistics derived from samples, or in general


any function of quantities whose distribution is known.

Fisher defined several criteria of estimation:

Consistency That when applied to the whole population the derived statistic
should equal the parameter. This is now known as Fisher consistency
to distinguish it from weak consistency, the requirement that the
estimator converge to the parameter value in probability, as in the
weak law of large numbers for the sample mean: $\bar{X} \stackrel{P}{\to} \mu$.

Efficiency That in large samples, when the distributions of the statistics tend
to normality, that statistic is to be chosen which has the least probable
error. In other words, the estimator with smaller variance is more
efficient.

Sufficiency That the statistic chosen should summarise the whole of the rele-
vant information supplied by the sample. Fisher explained the concept
as follows: suppose we have two estimators S(x̃) and T (x̃), such that
conditional on the value of $S(\tilde{x})$, the distribution of $T(\tilde{x})$ does not
involve the parameter $\theta$ of interest, then $S(\tilde{x})$ is sufficient, and $T(\tilde{x})$
provides no further information about $\theta$.

Fisher went on to give a heuristic argument that sufficiency would imply
efficiency. Suppose that $S(\tilde{x})$ and $T(\tilde{x})$ have in large samples a joint normal
distribution, each with expected value $\theta$, variances $\sigma_S^2$ and $\sigma_T^2$, and correlation
$\rho$. Then the conditional distribution of $(T \mid S)$ is normal with mean

$$\theta + \rho(\sigma_T/\sigma_S)(S - \theta)$$

and variance $(1 - \rho^2)\sigma_T^2$. Since the variance doesn't depend on $\theta$, the con-
ditional distribution will not depend on $\theta$ just in case the mean does not
depend on $\theta$. That condition is satisfied if $\rho\,\sigma_T = \sigma_S$; then:

$$E(T \mid S) = \theta + \rho(\sigma_T/\sigma_S)(S - \theta) = \theta + S - \theta = S.$$

If $\rho\,\sigma_T = \sigma_S$, then
$$\sigma_T^2 \geq \rho^2\sigma_T^2 = \sigma_S^2$$

since $0 \leq \rho^2 \leq 1$, and $S$ has smaller variance than $T$.



6.2 Sufficiency
Fisher’s definition of sufficiency has been generalized to the following:

Definition a statistic $S(\tilde{x})$ is sufficient for the parameter
$\theta$ if, conditional on $S(\tilde{x})$, the distribution of the data $\tilde{x}$
does not depend on $\theta$: $f(\tilde{x} \mid S(\tilde{x}))$ is functionally inde-
pendent of $\theta$.

Consider $X_1, X_2, \ldots, X_n$, a sample of independent Bernoulli($p$) trials. Let
$S(\tilde{X}) = \sum X_i$. Then
$$f(x_1, x_2, \ldots, x_n \mid S(\tilde{x}) = s)
= \frac{P(\{X_1 = x_1, \ldots, X_n = x_n\} \cap \{\sum X_i = s\})}{P(\sum X_i = s)}
= \frac{P(X_1 = x_1, \ldots, X_n = x_n)}{P(\sum X_i = s)}$$
$$= \frac{p^{x_1}(1-p)^{1-x_1}\,p^{x_2}(1-p)^{1-x_2}\cdots p^{x_n}(1-p)^{1-x_n}}{\binom{n}{s}p^s(1-p)^{n-s}}
= \frac{p^s(1-p)^{n-s}}{\binom{n}{s}p^s(1-p)^{n-s}}
= \frac{1}{\binom{n}{s}}.$$

The conditional distribution of the $X_i$'s, given their sum, does not depend on
the parameter $p$. $\binom{n}{s}$ is just the number of orderings of $n$ trials with $s$ successes.
This suggests another way to think about sufficiency. We could conceptu-
alize the experiment that produced the original sequence X1 , X2 , . . . , Xn as
involving a two step procedure: first select the sufficient statistic s, which
tells us how many of the n trials are successes, and then generate the Xi ’s as
a random permutation of the s successes and n s failures. The second step
just adds pure random noise to our sufficient statistic. All the information
about p is contained in the value of s, the rest is noise.

It is tedious to verify sufficiency from the definition every time. Looking
at Fisher's argument, we see a key idea. The conditional density doesn't
depend on the parameter $\theta$ just in case the joint density of the data factors
into two terms, one of which depends on the sufficient statistic and the other
doesn't involve $\theta$.

Theorem 2 (Factorization Theorem) The statistic $S(\tilde{x})$ is suffi-
cient for the parameter $\theta$ if and only if

$$f(\tilde{x} \mid \theta) = g(S(\tilde{x}), \theta)\,h(\tilde{x})$$

where $h$ does not depend on $\theta$.

We will give a proof for the discrete case. If $S(\tilde{x})$ is sufficient, then
$$P(S(\tilde{x}) = s)\,P(X_1 = x_1, \ldots, X_n = x_n \mid S(\tilde{x}) = s) = P(\{X_1 = x_1, \ldots, X_n = x_n\} \cap \{S(\tilde{x}) = s\}).$$
The right hand side is zero if $s \neq S(\tilde{x})$, otherwise it is just $f(\tilde{x} \mid \theta)$.
$$P(S(\tilde{x}) = s) = \sum_{\tilde{x}: \tilde{x} \mapsto s} f(\tilde{x} \mid \theta)$$
and so is a function of $s$ and $\theta$, while
$$P(X_1 = x_1, \ldots, X_n = x_n \mid S(\tilde{x}) = s)$$
is a function of $\tilde{x}$ but does not depend on $\theta$ by hypothesis. Thus we have
written $f(\tilde{x} \mid \theta)$ in the desired factored form.
If $f(\tilde{x} \mid \theta)$ factors as in the theorem, then
$$P(X_1 = x_1, \ldots, X_n = x_n \mid S(\tilde{x}) = s)
= \frac{P(\{X_1 = x_1, \ldots, X_n = x_n\} \cap \{S(\tilde{x}) = s\})}{P(S(\tilde{x}) = s)}
= \frac{f(\tilde{x} \mid \theta)}{\sum_{\tilde{x} \mapsto s} f(\tilde{x} \mid \theta)}$$
$$= \frac{g(S(\tilde{x}), \theta)h(\tilde{x})}{\sum_{\tilde{x} \mapsto s} g(S(\tilde{x}), \theta)h(\tilde{x})}
= \frac{h(\tilde{x})}{\sum_{\tilde{x} \mapsto s} h(\tilde{x})}$$

The right hand side does not depend on $\theta$, so $S(\tilde{x})$ is sufficient for $\theta$.

6.3 Examples
1. Let $X_1, X_2, \ldots, X_n$ be an independent sample from a Uniform(0, $\theta$)
distribution. Then
$$f(\tilde{x} \mid \theta) = \frac{1}{\theta^n}\,I(0 \leq x_1, \ldots, x_n \leq \theta)
= \frac{1}{\theta^n}\,I(0 \leq x_{(n)} \leq \theta)\cdot 1$$

Let $S(\tilde{x}) = \max x_i = x_{(n)}$. Then the density is factored into

$$g(S(\tilde{x}), \theta) = \frac{1}{\theta^n}\,I(0 \leq x_{(n)} \leq \theta)$$
and
$$h(\tilde{x}) = 1.$$

2. Let $X_1, X_2, \ldots, X_n$ be an independent sample from a Normal($\mu, \sigma^2$)
distribution, where $\sigma^2$ is known. Then
$$f(\tilde{x} \mid \mu) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum (x_i - \mu)^2\right)
= \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum (x_i - \bar{x} + \bar{x} - \mu)^2\right)$$
$$= \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\left(\sum (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\right)\right)
= \exp\left(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\right)\left(\left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum (x_i - \bar{x})^2\right)\right)$$

Let $S(\tilde{x}) = \bar{x}$. The first factor is our $g(S, \mu)$, the second is our $h(\tilde{x})$,
and does not involve $\mu$. Thus $\bar{x}$ is a sufficient statistic for the mean $\mu$.

3. Let $X_1, X_2, \ldots, X_n$ be an independent sample from a Cauchy distribu-
tion with location parameter $\theta$.
$$f(\tilde{x} \mid \theta) = \frac{1}{\pi^n}\prod \frac{1}{1 + (x_i - \theta)^2}.$$
The product in the denominator on the right is a polynomial in $n$
variables $x_1, x_2, \ldots, x_n$. It can not be factored in any way that does
not include terms in each of the individual variables $x_i$, so the sufficient
statistic is the set of order statistics

$$x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n)}.$$

This is an example where the sufficient statistic is not of the same
dimension as the parameter.

6.4 The Exponential Family


Gauss's [10] derivation of the normal distribution started from the question
'if $\bar{x}$ is the posterior mode, and $\mu$ is a location parameter, then what is the
distribution of $X$?' The argument was generalized by several authors (see
Hald [11] sec. 18.6). The basic result is that if $\bar{x}$ is the posterior mode, then
the likelihood must have the form

$$f(x \mid \mu) = c(x)\,e^{a(\theta)x + b(\theta)}.$$

Then the joint likelihood for a sample of $n$ independent observations has the
form
$$f(x_1, x_2, \ldots, x_n \mid \mu) = \left(\prod c(x_i)\right) e^{a(\theta)\sum x_i + nb(\theta)}.$$
It is easy to see that if the likelihood has this form, then the mode of the log
likelihood is the solution of the equation
$$a'(\theta)\sum x_i + nb'(\theta) = 0$$
or
$$-\frac{b'(\theta)}{a'(\theta)} = \bar{x}.$$
After Fisher introduced the concept of sufficiency, I suspect it was only a mat-
ter of time before someone saw the connection. The pieces came together

in Koopman [15], who found a general class of probability distributions for
which there exists a sufficient statistic of the same dimension as the param-
eter: the class of distributions for which the likelihood can be written in the
form
$$f(x \mid \theta) = h(x)\exp\left(a(\theta)S(x) - b(\theta)\right)$$
for the one dimensional case, and in the $k$ dimensional case (that is, $\tilde\theta$ has $k$
components)
$$f(x \mid \tilde\theta) = h(x)\exp\left(\sum_{j=1}^k a_j(\tilde\theta)S_j(x) - b(\tilde\theta)\right).$$

The exponential family includes the normal, binomial, Poisson, gamma,
beta, and exponential distributions. We will do the binomial as an example:
let $X$ have a Binomial($n, p$) distribution. Then
$$f(x \mid p) = \binom{n}{x}p^x(1-p)^{n-x}
= \binom{n}{x}\exp\left(x\log p + (n-x)\log(1-p)\right)
= \binom{n}{x}\exp\left(x\log\frac{p}{1-p} + n\log(1-p)\right)$$

So
$$h(x) = \binom{n}{x},\qquad S(x) = x,\qquad a(p) = \log\frac{p}{1-p},$$
and
$$b(p) = -n\log(1-p).$$
It is often useful to work with the likelihood in a canonical form, using what
are often called the natural parameters $\eta$. We reparametrize the likelihood
by
$$\eta = a(\theta).$$

Then the likelihood has the simpler form

$$f(x \mid \eta) = h(x)\exp\left(\eta S(x) - B(\eta)\right).$$

In the binomial, the natural parameter is the logit transform of the proba-
bility $p$:
$$\eta = \log\frac{p}{1-p}.$$
The logit is the inverse of the logistic function

$$p = \frac{e^\eta}{1 + e^\eta}$$

and the canonical representation of the binomial likelihood is

$$\binom{n}{x}\exp\left(\eta x - n\log(1 + e^\eta)\right).$$
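A short R check of this canonical form, with made-up values of $n$ and $p$: the canonical log probability $\log\binom{n}{x} + \eta x - n\log(1+e^\eta)$ reproduces the usual binomial log probability.

# Sketch: the canonical (natural-parameter) form of the binomial.
p   <- 0.3; n <- 10; x <- 0:10
eta <- log(p / (1 - p))                             # logit
canonical <- lchoose(n, x) + eta * x - n * log(1 + exp(eta))
max(abs(canonical - dbinom(x, n, p, log = TRUE)))   # essentially zero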

There is something for everyone in the exponential family. For the Bayesian,
if
$$f(x \mid \eta) = h(x)\exp\left(\eta S(x) - B(\eta)\right)$$

we can construct a conjugate family of priors for $\eta$ by examination, letting
$S$ be a parameter and $\eta$ the argument:

$$\pi(\eta) = c\exp\left(\eta S - B(\eta)\right).$$

6.5 Information, Sufficiency and Unbiasedness


I don’t wish to focus on unbiased estimation, as it is not central to modern
statistical theory and practice. However, there are some ideas of great and
lasting significance that arose in this context. The earliest versions of these
results appeared in C. R. Rao [18], which is reprinted with commentary by
P. K. Pathak in Johnson and Kotz [13]. In this same paper Rao was also
the first to observe that the information matrix defines a Riemannian metric
and associated geodesic distances between probability distributions.

6.5.1 The Information Inequality

An estimator $T(\tilde{X})$ is unbiased for the parameter $\theta$ if $E(T) = \theta$. More
generally, $T(\tilde{X})$ is unbiased for $g(\theta)$ if $E(T) = g(\theta)$. If $X$ has a regular
density $f(x \mid \theta)$ and $E(T) = g(\theta)$, then

$$\frac{\partial}{\partial\theta}g(\theta) = g'(\theta) = \frac{\partial}{\partial\theta}E(T)
= \frac{\partial}{\partial\theta}\int T(\tilde{X})f(\tilde{X} \mid \theta)\,dx
= \int T(\tilde{X})\frac{\partial}{\partial\theta} f(\tilde{X} \mid \theta)\,dx$$
$$= \int T(\tilde{X})\,\frac{\frac{\partial}{\partial\theta} f(\tilde{X} \mid \theta)}{f(\tilde{X} \mid \theta)}\,f(\tilde{X} \mid \theta)\,dx
= \int T(\tilde{X})\,\frac{\partial}{\partial\theta}\log f(\tilde{X} \mid \theta)\,f(\tilde{X} \mid \theta)\,dx
= \mathrm{Cov}\left(T(\tilde{X}),\, \frac{\partial}{\partial\theta}\log f(\tilde{X} \mid \theta)\right)$$

The last equality follows because $E\left(\frac{\partial}{\partial\theta}\log f(\tilde{X} \mid \theta)\right) = 0$, and for any random
variable $Y$ such that $E(Y) = 0$

$$E(TY) = E\left((T - E(T))(Y - E(Y))\right) = \mathrm{Cov}(T, Y).$$

On the $L^2$ space of random variables with finite variance, the covariance is
an inner product, and the Cauchy-Schwarz inequality states that
$$\left(\mathrm{Cov}\left(T(\tilde{X}),\, \frac{\partial}{\partial\theta}\log f(\tilde{X} \mid \theta)\right)\right)^2 \leq \mathrm{Var}(T(\tilde{X}))\,\mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(\tilde{X} \mid \theta)\right).$$

Since $\mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(\tilde{X} \mid \theta)\right) = I_n(\theta) = nI(\theta)$ we have, finally:

$$(g'(\theta))^2 \leq \mathrm{Var}(T(\tilde{X}))\; n\, I(\theta)$$



or, as a lower bound on $\mathrm{Var}(T)$:

$$\frac{(g'(\theta))^2}{n\,I(\theta)} \leq \mathrm{Var}(T(\tilde{X})).$$

This is known as the Information Inequality, or in the special case where
$g(\theta) = \theta$ and $T$ is unbiased for $\theta$, the Cramér-Rao bound:
$$\mathrm{Var}(T(\tilde{X})) \geq \frac{1}{n\,I(\theta)}.$$

Let $X$ be Binomial($n, p$). Then the log likelihood for $p$ is

$$\ell(p) = \log\binom{n}{x} + x\log p + (n-x)\log(1-p).$$

Differentiating twice we get
$$\frac{\partial\ell(p)}{\partial p} = \frac{x}{p} - \frac{n-x}{1-p}$$
and
$$\frac{\partial^2\ell(p)}{\partial p^2} = -\frac{x}{p^2} - \frac{n-x}{(1-p)^2}.$$
The information $I_n(p)$ is
$$-E(\ell''(p)) = \frac{E(X)}{p^2} + \frac{n - E(X)}{(1-p)^2}
= \frac{np}{p^2} + \frac{n - np}{(1-p)^2}
= \frac{n}{p} + \frac{n}{1-p}
= \frac{n}{p(1-p)}$$

The estimator $\hat{p} = X/n$ has expectation $p$, and variance

$$\frac{p(1-p)}{n} = \frac{1}{I_n(p)}.$$

Since the estimator is unbiased, and achieves the Cramér-Rao bound, it is a
minimum variance unbiased estimator.
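A quick R simulation of this bound, with made-up values of $n$ and $p$: the simulated variance of $\hat{p} = X/n$ should sit at the information bound $p(1-p)/n$.

# Sketch: Var(p_hat) vs the Cramer-Rao bound 1/I_n(p) = p(1-p)/n.
set.seed(10)
n <- 50; p <- 0.3
phat <- rbinom(10000, n, p) / n
c(simulated = var(phat), bound = p * (1 - p) / n)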

6.5.2 The role of sufficiency

If we have an unbiased estimator that hits the Cramér-Rao bound, we are
doing well. What if our unbiased estimator doesn't achieve the bound, can
we improve on it in some systematic way? This question is answered af-
firmatively by a result proved by C. R. Rao [18] and independently David
Blackwell [3].

Theorem 3 Rao-Blackwell Theorem Let $T(\tilde{X})$ be an unbiased estimator
for $\theta$ and $S(\tilde{X})$ be a sufficient statistic for $\theta$. Then $\hat\theta = E(T \mid S)$ satisfies

1. $\hat\theta$ is a function of $S(\tilde{X})$ not involving $\theta$.

2. $\hat\theta$ is unbiased for $\theta$.

3. $\mathrm{Var}(\hat\theta) \leq \mathrm{Var}(T)$

1. $\hat\theta$ is a function of $S(\tilde{X})$, and by the definition of sufficiency, does not
depend on the parameter $\theta$.

2. By the double expectation theorem,

$$E(\hat\theta) = E(E(T \mid S)) = E(T) = \theta.$$

3. By the double expectation theorem for variances,

$$\mathrm{Var}(T) = \mathrm{Var}(E(T \mid S)) + E(\mathrm{Var}(T \mid S))
= \mathrm{Var}(\hat\theta) + E(\mathrm{Var}(T \mid S))
\geq \mathrm{Var}(\hat\theta)$$

While the theorem only guarantees some improvement due to condition-


ing on a sufficient statistic, examples illustrate the fact that the improve-
ment can be dramatic. For example, consider a sequence of n independent
Bernoulli($p$) trials $X_1, X_2, \ldots, X_n$. $\hat{p}_1 = X_1$ is an unbiased estimator for $p$, but
a silly one: with probability $p$ we get $\hat{p}_1 = 1$, and with probability $1-p$ we get
$\hat{p}_1 = 0$. We know that $S(\tilde{X}) = \sum X_i$ is a sufficient statistic; let us condition

on it. First, we will compute the conditional probability distribution for $X_1$,
given $S$, then the conditional expectation.
$$\Pr(X_1 = 1 \mid S = s)
= \frac{P(\{X_1 = 1\} \cap \{\sum X_i = s\})}{P(\sum X_i = s)}
= \frac{P(\{X_1 = 1\} \cap \{\sum_{i=2}^n X_i = s - 1\})}{P(\sum X_i = s)}$$
$$= \frac{p\,\binom{n-1}{s-1}p^{s-1}(1-p)^{(n-1)-(s-1)}}{\binom{n}{s}p^s(1-p)^{n-s}}
= \frac{\binom{n-1}{s-1}}{\binom{n}{s}}
= \frac{s}{n}.$$

Thus $(X_1 \mid S = s)$ is a Bernoulli trial with success probability $s/n$, and has
expectation $s/n$. In other words,

$$\hat{p} = E(X_1 \mid S = s) = s/n.$$

As we now know, the sample proportion achieves the information bound, so
our new $\hat{p}$ is not just a little better than $\hat{p}_1$, it is a lot better!
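A small R simulation of this Rao-Blackwellization, with made-up values of $n$ and $p$: both estimators are unbiased, but conditioning on the sufficient statistic drops the variance from roughly $p(1-p)$ to $p(1-p)/n$.

# Sketch: Rao-Blackwellizing p_hat1 = X1 by conditioning on S = sum(X_i).
set.seed(11)
n <- 25; p <- 0.4
sims <- replicate(10000, {
  x <- rbinom(n, 1, p)
  c(p1 = x[1], rb = sum(x) / n)     # E(X1 | S) = S/n
})
rowMeans(sims)          # both close to p (unbiased)
apply(sims, 1, var)     # variance drops dramatically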

6.5.3 Notes on Unbiased estimation


In the exponential family, we can often find minimum variance unbiased
estimators. While the exponential family is more inclusive than it may appear
at first, in general we can’t expect to find unbiased estimators.
We can generate superfamilies and subfamilies of exponential families by
straightforward methods. For example, let $Y_i$ be Normal($\mu_i, \sigma^2$). The joint
density of an independent sample $Y_1, Y_2, \ldots, Y_n$ involves factors of the form
$$\exp\left(\frac{\mu_i}{\sigma^2}Y_i - \frac{1}{2\sigma^2}Y_i^2 - b(\mu, \sigma)\right).$$

When multiplied together they retain the exponential family form. It is often
useful to write the exponential family form in vector notation

$$\sum_{j=1}^k \theta_j t_j(X) = \tilde\theta'\,\tilde{T}.$$

Similarly, if $A$ is a $k$ by $p$ matrix of constants, then the transformed parametriza-
tion $A\tilde\phi = \tilde\theta$ is also in exponential family form

$$(A\tilde\phi)'\tilde{T} = \tilde\phi'(A'\tilde{T})$$

with sufficient statistics $A'\tilde{T}$.

6.6 Exercises
1. Let $X_1, X_2, \ldots, X_n$ be an IID sample from a Poisson($\mu$) distribution.
Show that the Poisson is a member of the exponential family, and that
$\sum X_i$ is a sufficient statistic for $\mu$. Show that $\bar{X}$ is unbiased for $\mu$; does
it achieve the Cramér-Rao bound?

2. Let $X_1, X_2, \ldots, X_n$ be an IID sample from a lognormal distribution with
density (for $x > 0$)

$$f(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}(\log x)^2}.$$

Is the lognormal distribution a member of the exponential family, and
are there sufficient statistics?

3. Let $X_1, X_2, \ldots, X_n$ be an IID sample from a Gamma($\alpha, \beta$) distribution.
Show that the Gamma is a member of the exponential family, and find
sufficient statistics.

4. Let $X_1, X_2, \ldots, X_n$ be an IID sample from a Weibull($\alpha, \beta$) distribution.
Is the Weibull distribution a member of the exponential family, and are
there sufficient statistics?

5. Let $X_1, X_2, \ldots, X_n$ be an IID sample from the 5 parameter mixture
distribution with density
$$f(x) = p\,n(x \mid \mu_1, \sigma_1^2) + (1-p)\,n(x \mid \mu_2, \sigma_2^2)$$

where $n(x \mid \mu, \sigma^2)$ is the Normal($\mu, \sigma^2$) density. Is the mixture distri-
bution a member of the exponential family?

6. Let $X_1, X_2, \ldots, X_n$ be an IID sample from a Poisson($\mu$) distribution.
Note that $P(X_i = 0) = e^{-\mu}$.

(a) What is the MLE for $\theta = e^{-\mu}$? What is its variance (exactly if
possible, otherwise approximately)?
(b) Define $Y_i = 1$ if $X_i = 0$, and 0 else. Show that $\bar{Y}$ is unbiased for
$\theta$. What is its variance?
(c) Evaluate $\hat\theta = E(Y_1 \mid \sum X_i)$ and find its variance.

7. Let X be a Negative Binomial(r, p) random variable. Recall that if


X = k then there were r + k independent Bernoulli(p) trials with r
successes and k failures.

(a) Show that the outcome of the first trial, call it Y1 , is an unbiased
estimator for p.
(b) Find a sufficient statistic for p, and use the Rao-Blackwell theorem
to find a better unbiased estimator.
(c) Find the MLE, and its asymptotic variance.
(d) Do a simulation study comparing the variance, bias, and mean
squared error of the unbiased estimator you found to that of the
MLE.

8. For each of the following situations compare the bias, variance and
mean squared errors for the method of moments estimator, the best
unbiased estimator, the maximum likelihood estimator, and if a conju-
gate or other convenient prior exists, a Bayesian alternative (the poste-
rior distribution, and an estimator such as the posterior mean). If you
can’t do it analytically, do a simulation.

(a) $X_1, X_2, \ldots, X_n$ are IID Uniform(0, $\theta$).



(b) $X_1, X_2, \ldots, X_n$ are IID Exponential($\lambda$).

(c) $X_1, X_2, \ldots, X_n$ are IID N($\mu, \sigma^2$). The parameter of interest is the
variance, $\sigma^2$. What if we are interested in $\sigma$, or in $\tau = 1/\sigma^2$?
Chapter 7

Formal Inference

We have been studying the problem known as point estimation, that is pro-
ducing a single number as an estimate for a parameter. Statisticians have
come to view that as an incomplete specification of the problem. We really
want more than just an estimate, we want an estimate of the precision or
accuracy of an estimate. This is known as a confidence interval, or if you
prefer, an interval estimate: a procedure for producing a range of values,
with some specified properties, for the unknown parameter. We will see that
the standard treatment of this problem is intimately related to the construc-
tion of hypothesis tests. A hypothesis test is, roughly speaking, a procedure
for answering a question of the form

Do the data appear to be inconsistent with the specified proba-


bility model?

This form of hypothesis test is often called a pure significance test and was
popularized by the work of R. A. Fisher in his influential texts Statistical
methods for research workers [8] and The design of experiments [9]. There
is another class of situations of great interest, in which we wish to compare
two probability models.
The earliest known instance of hypothesis testing is an example of a pure
significance test. John Arbuthnott, physician to Queen Anne, observed [1]
that if births are independent, and the probability that each birth is a male
is 1/2, then the probability that the total number of male births in the city
of London exceeds the total number of female births in a year is also 1/2. He


noted that over the years for which records then existed in London (1629 to
1710) male births exceeded female births in all 82 years.
He then computed the probability of this particular sequence of outcomes
under the hypothesis p = 1/2. If the results in different years are independent,
we have 82 trials, each with p = 1/2, resulting in (1/2)^82 as the probability for the
82 consecutive years of excess male births. Arbuthnott concluded that this
was too small to have occurred by chance. He went on to argue (apparently
seriously) that this was evidence of divine intervention:

‘From whence it follows that it is Art, not Chance, that governs’

Arbuthnott’s test was a pure significance test in the Fisherian sense.
He computed what we now would call a p-value, ignoring the other tail of
the distribution. He did not specify an alternative hypothesis, or give any
argument as to why this particular outcome was special. Note that any par-
ticular sequence of 82 outcomes also has probability (1/2)^82, and thus would
apparently be equally good evidence of divine intervention. He also equated
‘chance’ with equiprobable outcomes (p = 1/2), a common error of reasoning
about random events. Arbuthnott is perhaps better known as a close friend
of Jonathan Swift and the author of ‘The History of John Bull’, so there is
at least some question as to whether the argument was delivered in serious-
ness. I am inclined to suspect that it was, based on his choice of venue (the
Philosophical Transactions of the Royal Society).
The earliest confidence interval I know of is due to Laplace (see Hald [11],
p. 24). It is really an asymptotic interval for the region of highest posterior
density for the binomial parameter p, given p̂ = X/n. He showed that with a
uniform prior for p, the posterior distribution for p, given X, is approximately
Normal(p̂, p̂(1 − p̂)/n), and thus the interval with probability 1 − α is

   p̂ − z √( p̂(1 − p̂)/n ) ≤ p ≤ p̂ + z √( p̂(1 − p̂)/n )

where z is the 1 − α/2 quantile of the standard normal distribution, that is,

   ∫_{−z}^{z} (1/√(2π)) e^{−x²/2} dx = 1 − α.

We will start with a brief discussion of confidence intervals, and then
cover hypothesis tests in more detail. We shall see that there is a connection
between confidence intervals and hypothesis tests, so if we can construct ei-
ther one, we can derive the other from it. To preview coming attractions, a
1 − α confidence region is the set of values for the parameter(s) which would
not be rejected by a size α hypothesis test. While there is a logical equiv-
alence between the two, there is an important difference in the information
conveyed: a hypothesis test answers a single question of the form

   Is θ0 a plausible value for the parameter?

while the corresponding confidence interval answers all possible questions of
that form; that is, it conveys the entire set of plausible values, which is con-
siderably more information.

7.1 Confidence Intervals


A confidence region is a set C(X̃), which depends on the data but not on the
unknown parameter θ. We would like the region to have a given coverage
probability, that is,

   P(θ ∈ C(X̃)) = 1 − α.

Note that in the frequentist interpretation, θ is not a random variable, so the
validity of the probability statement derives from the fact that the region is
a function of the random variable X̃. Conditional on the observed value
of the random variable, the confidence region is fixed, so it is inappropriate
to claim that the probability that the confidence interval contains θ is 1 − α;
the probability is either 0 or 1. What we can say while in frequentist mode
is that as we go through life constructing confidence intervals of size 1 − α,
roughly 100(1 − α)% will contain the corresponding parameters.
Probably the single most common form of confidence interval is the one
based on the normal distribution. Suppose that X1 , X2 , . . . Xn are an inde-
pendent sample from a Normal(µ, σ²) population, where σ² is known. Let z_α
be the 1 − α/2 quantile of the standard normal distribution. Then

   P( |X̄ − µ| / (σ/√n) ≤ z_α ) = 1 − α

and

   |X̄ − µ| / (σ/√n) ≤ z_α  ⟺  −z_α ≤ (X̄ − µ)/(σ/√n) ≤ z_α

                            ⟺  −z_α σ/√n ≤ X̄ − µ ≤ z_α σ/√n

                            ⟺  −X̄ − z_α σ/√n ≤ −µ ≤ −X̄ + z_α σ/√n

                            ⟺  X̄ − z_α σ/√n ≤ µ ≤ X̄ + z_α σ/√n

Thus the random interval X̄ ± z_α σ/√n contains µ with probability 1 − α. The
same argument applies in any case where we have an estimator with an
asymptotically normal distribution, and an estimate of the asymptotic vari-
ance. For example, if an MLE θ̂ is asymptotically normal, then the following
are all approximate level 1 − α confidence intervals:

   θ̂ ± z_α / √( n I(θ) )

(assuming that I(θ) does not depend on θ),

   θ̂ ± z_α / √( n I(θ̂) )

and finally

   θ̂ ± z_α / √( −(∂²/∂θ²) log f(X̃ | θ̂) ).

In general, we must find a function of the data and the parameter with a
known distribution that does not depend on the parameter. Such a function
is called a pivotal quantity. In the case of the normal distribution,

   (X̄ − µ) / (σ/√n)

is a pivotal quantity. It has a standard normal distribution, which is inde-
pendent of µ.

Another example of a pivotal quantity arises in the estimation of the
variance for normally distributed data. Let X1 , X2 , . . . Xn be an indepen-
dent sample from a Normal(µ, σ²) population, where both parameters are
unknown. We know that

   ∑_{i=1}^{n} (Xi − X̄)² ∼ σ² χ²_{n−1}.

Thus

   ∑_{i=1}^{n} (Xi − X̄)² / σ²

is a pivotal quantity with a Chi-squared distribution on n − 1 degrees of
freedom. If q1 and q2 are respectively the α/2 and 1 − α/2 quantiles of the
χ²_{n−1} distribution, then

   P( q1 ≤ ∑ (Xi − X̄)² / σ² ≤ q2 ) = 1 − α

and

   q1 ≤ ∑ (Xi − X̄)² / σ² ≤ q2  ⟺  1/q2 ≤ σ² / ∑ (Xi − X̄)² ≤ 1/q1

                                ⟺  (1/q2) ∑_{i=1}^{n} (Xi − X̄)² ≤ σ² ≤ (1/q1) ∑_{i=1}^{n} (Xi − X̄)²

Our level 1 − α confidence interval for σ² is thus

   ( (1/q2) ∑_{i=1}^{n} (Xi − X̄)² ,  (1/q1) ∑_{i=1}^{n} (Xi − X̄)² ).
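As a quick illustration (not from the text), this interval is easy to compute in R for a simulated normal sample; the sample size and parameter values here are arbitrary:

x <- rnorm(30, mean = 5, sd = 2)                  # simulated data
ss <- sum((x - mean(x))^2)                        # sum of squared deviations
q <- qchisq(c(.975, .025), df = length(x) - 1)    # (q2, q1) in the notation above
ss / q                                            # 95% CI for sigma^2: (ss/q2, ss/q1)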

7.2 Hypothesis Tests


Suppose I tell you that I have tossed 10 different coins, and that all 10 tosses
resulted in ‘heads’. What are the possible explanations? They might include:

Chance: The coins are fair, the tosses were independent; we observed a rare,
but not impossible event. We get 10 heads in 10 tosses about one time
in 2^10 repetitions of the experiment.

Bias: The coins aren’t fair coins; i.e., the probability of heads on each toss is
not 1/2, but possibly close to 1. Maybe the coins are all double-headed?

Failure of Independence: If the first coin turning up heads makes it more


likely that the others turn up heads, then this might not be such a rare
event after all. In the extreme, imagine that all 10 coins are taped
together facing the same direction. Then each toss of the group of 10
likely yields either 10 heads or 10 tails.

What do we expect to see if the coin is fair and the tosses independent?
About 89% of the time we expect to see between 3 and 7 heads, and about
98% of the time between 2 and 8 heads. If we see 10 heads, we feel naturally
inclined to skepticism about the mechanism claimed to be generating the
data.
The standard formalization of this idea is the ‘Hypothesis Test’. The
focus is decision rather than estimation, in other words, the question:

Should we reject the claim that p = 1/2?

rather than

What is our best guess for p?

In essence, the idea of the hypothesis test is to specify a working hypothesis,


called the null hypothesis, which in this case represents the notion that the
coin is a fair coin. Then we formulate a probability model corresponding to
the null hypothesis, and finally we ask ourselves whether the observed data
seem to be consistent with this probability model. If the data do not seem
to be consistent with the model, we ‘reject the null hypothesis’. If the data
are not inconsistent with the model, we ‘fail to reject the null hypothesis’.
In general it is a bad idea to decide that the null hypothesis is true because
you have failed to reject it: it is also possible that your experiment didn’t
generate enough data to reject the null hypothesis for the actual size of the
difference.
Let’s break the process down into a sequence of steps using our coin
tossing problem as an example:

1. Formulate the substantive hypothesis: ‘Heads and Tails are equally


likely.’

2. Specify the probability model implied by the substantive hypothesis,
often called the ‘null hypothesis’ (H0 ): ‘10 independent coin tosses,
P{Heads} = 1/2.’

3. Specify the alternative hypothesis (H1 ) of interest: ‘10 independent
coin tosses, P{Heads} ≠ 1/2.’

4. Pick a rejection region, a set of outcomes you are willing to accept


as convincing evidence against H0 : ‘Reject H0 if the number of heads
is less than 3 or greater than 7’.

The rejection region is the set of outcomes that are taken to be evidence
against H0 and in favor of some alternative hypothesis. In this case, for
P{Heads} greater than 1/2, outcomes greater than 7 will occur more often
than under H0 , while for P{Heads} less than 1/2, outcomes less than 3 will
occur more often than under H0 .
The alternative hypothesis is usually not specific, but rather a class
of probability distributions. In the example above, that class is the set of
binomial(10, p) distributions with p ≠ 1/2.

[Figure: the binomial(10, 1/2) probability distribution, with probabilities
plotted against the outcomes 0 through 10.]

The corresponding probabilities:

X p
0 0.001 |
1 0.010 reject H_0
2 0.044 |
--------------------
3 0.117
4 0.205
5 0.246 do not reject H_0
6 0.205
7 0.117
--------------------
8 0.044 |
9 0.010 reject H_0
10 0.001 |

Notice that we can’t claim infallibility. In our example, we will incorrectly


reject the null hypothesis about 10.9% of the time when it is really true.

Why? If p = 1/2, then the probability of getting a result in the specified
rejection region is the sum of the binomial(10, 1/2) probabilities for 0, 1, 2, 8,
9, and 10; this is not zero!

The probability of rejecting the null hypothesis, given that it is
true, is called the significance level or α-level of the test.

We could have chosen a different significance level, the common choice


being a rejection region that yields a significance level of .05 or 5%. For the
binomial distribution, that is not always possible, since the distribution is
discrete, and ‘lumpy’ for small sample sizes. Equivalently, we could compute
the p-value.

The p-value is the probability of seeing a result at least as unusual


as the observed result, if H0 were true.

For example, if we did the experiment, and got 7 Heads, then the p-value
would be the probability of getting at least 7 Heads plus the probability of
getting no more than 3 Heads:

   P(X ≤ 3) + P(X ≥ 7) = pbinom(3, 10, 1/2) + (1 − pbinom(6, 10, 1/2))

or about 0.344.
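As a quick check (not from the text), this p-value can be computed directly in R:

# two-sided p-value for 7 heads in 10 tosses under H0: p = 1/2
pbinom(3, 10, 0.5) + (1 - pbinom(6, 10, 0.5))
# [1] 0.34375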
Choosing a rejection region with significance level of .05 corresponds ex-
actly to rejecting H0 if the p-value is .05 or smaller.
P-values are often considered to be measures of the strength of the ev-
idence against the null hypothesis: smaller p-values go with more extreme
results indicating greater evidence against H0 . This is unsound, as it ignores
the simultaneous dependence of the p-value on both the effect size (the dif-
ference between the parameter specified by H0 and the actual parameter value)
and the sample size, as well as the inherent plausibility of the null hypothesis.
Either way, notice that we are essentially comparing the observed outcome
to the distribution it would have if H0 were true, to see if it represents an
extreme or unusual value for that distribution, in the sense of falling too far
from the center of the distribution, in a region of low probability.
We can’t claim to be testing all possible alternatives (for example, lack
of independence of the tosses). We are restricting the alternative hypotheses
to a specific class: binomial distributions with p ≠ 1/2.
The fact that the alternative hypothesis is ‘binomial with p ≠ 1/2’ is im-
portant to the construction of the rejection region. For example, suppose
that p = 1/2, but the tosses are not independent. In the extreme case, if all
10 outcomes are identical, as they might be if the coins were taped together
all facing the same direction, then this hypothesis test will reject the null
hypothesis p = 1/2 with probability 1, even though p is 1/2. The failure of
independence leads to the rejection of the null hypothesis.
Consider the following scenario. Suppose that the trials are dependent
with the property that we always get either 6 successes and 4 failures, or 4
successes and 6 failures, each with probability 1/2.
[Figure: the binomial(10, 1/2) distribution under H0 and the two-point
distribution under H1 , which puts probability 1/2 on each of the outcomes
4 and 6.]

What should the rejection region be for the null hypothesis

H0 : binomial(10, 1/2)

in this situation? If these really are the two hypotheses of interest, then an
extreme result like 10 heads is not evidence for the alternative hypothesis! In
fact, 10 heads is conclusive evidence that the alternative hypothesis is false,
and H0 is true, since 10 heads is an impossible outcome under the alternative
hypothesis. Here the only reasonable rejection region is the set {4, 6}: the
outcomes that are more likely to occur under the alternative than under H0 .
This is the idea underlying the likelihood ratio test, coming soon.
Returning to the original example, when the alternative hypothesis is
p ≠ 1/2, that is, a binomial(10, p) distribution with some other probability
p of success, then the optimal rejection region is indeed the tails (extreme
outcomes like {0, 1, 2, 8, 9, 10}) of the binomial(10, 1/2) distribution.
Finally, failure to reject the null hypothesis is not proof that the null
hypothesis is correct. If nothing else, it is always possible that the sample
size was just too small to reliably detect the di↵erence!

7.2.1 Power

So far our discussion of hypothesis testing has focused on the null hypothesis.
For example, we choose a rejection region by selecting a significance level α,
then trying to find outcomes whose probabilities add up to about the desired
α value. Recall that α measures the probability of rejecting H0 , given that
H0 is in fact true. This is not the only type of error we could make. It is also
possible that we might fail to reject a false null hypothesis. The probability
of doing so depends on the true, and usually unknown, value of the param-
eter in question, and is usually labeled β. It is useful to display the various
possibilities in tabular form with the associated probabilities in parentheses:

                      H0 True                H0 False

Don't reject H0       No Error (1 − α)       Type II Error (β)

Reject H0             Type I Error (α)       No Error (1 − β)

1 − β is known as the power of the test; for a given significance level α we
would like the power to be as large as possible. In general we can’t compute
the true power of a test since we don’t know the actual probability distribu-
tion. However we can compute the power for hypothetical scenarios. Since
power is the probability of rejecting H0 under some specified alternative hy-
pothesis, it is just the sum of the probabilities of outcomes in the rejection
region for that distribution.
Let us return once more to the example of the 10 coin tosses. Let the
rejection region for H0 : p = 1/2 be the set

   RR = {0, 1, 2, 8, 9, 10}.

Consider the alternative hypothesis H1 : p = 3/4. If X is the number of heads,
the power is just

   P(RR | H1 ) = P(X ∈ {0, 1, 2, 8, 9, 10} | p = 3/4)
               = P(X = 0 | p = 3/4) + P(X = 1 | p = 3/4) + . . . + P(X = 10 | p = 3/4)
               = 0.0000 + 0.0000 + 0.0004 + 0.2816 + 0.1877 + 0.0563
               = 0.5260

Thus, if the true P{Heads} were 3/4, the probability of rejecting H0 : p = 1/2
would be .526. Note that most of the probability comes from the right tail of
the distribution (outcomes 8, 9, and 10). Outcomes in the left tail are much
less likely if p = 3/4.
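This power calculation is easy to reproduce in R (a quick check, not from the text):

# power of the rejection region {0,1,2,8,9,10} against H1: p = 3/4
rr <- c(0, 1, 2, 8, 9, 10)
sum(dbinom(rr, size = 10, prob = 3/4))
# approximately 0.526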

7.3 Likelihood
Examples like the correlated coin tosses in the previous section persuaded
Neyman and Pearson [17] that the crucial issue for choosing a rejection region
is the ratio of the probabilities assigned to outcomes under the competing
hypotheses. This is known as the likelihood ratio:
   LR = P(X | H0 ) / P(X | H1 ).
More generally, Neyman and Pearson showed that for a given choice of sig-
nificance level (α), the power is maximized by choosing a rejection region
based on values of the likelihood ratio.
A more general prescription, known as the likelihood principle (see,
for example, Berger and Wolpert [2]) states that all inferences should be
based on the likelihood function. The likelihood function for a particular
hypothesis is just the probability density implied by that hypothesis:

L(X, H0 ) = P(X|H0 ).

It is called the likelihood function in recognition of the fact that it is a func-


tion of the hypothesis with the observed data taken as given parameters,
rather than a function of the data given the hypothesized probability distri-
bution. In the case of the binomial distribution, the likelihood function is a
function of the parameter p:
   L(k, H0 ) = (n choose k) p^k (1 − p)^(n−k).

This looks exactly like the binomial probability for k successes in n trials,
but in the likelihood function k is considered fixed and p is the variable.
The likelihood ratio criterion tells us to choose a rejection region based
on looking at the likelihood ratio. Suppose we have a binomial outcome
with n = 10 and the null hypothesis p = 1/2. If we choose α = .05 then
for any alternative p > 1/2 the extreme outcomes in the upper tail of the
distribution will have higher probability under the alternative than under
H0 . The extreme outcomes in the lower tail will have higher likelihood under
any alternative p < 1/2. Thus the likelihood ratio criterion implies that we
should choose our rejection region to be those outcomes in the tails with
total probability no greater than .05 under H0 . This leads us to the rejection
region

   RR = {0, 1, 9, 10}.

If you accept the likelihood principle, it provides a fundamental justifica-


tion for rejection regions of the form we have already developed. Even if you
don’t accept it as a general principle, the Neyman-Pearson argument based
on the likelihood ratio leads to the same conclusion.
The likelihood principle has some implications which many frequentists
find unappealing. Consider the following two experiments:

E1 : We conduct 20 independent trials, and observe 15 successes.

E2 : We conduct independent trials, continuing until we observe the 15th


success. The 15th success occurs on the 20th trial.

The likelihood functions for the two experiments are respectively binomial
and negative binomial:

   L1 = (20 choose 15) p^15 (1 − p)^5

and

   L2 = (19 choose 14) p^15 (1 − p)^5.
These two likelihoods are proportional — the only difference is the constant
factor. Hence the likelihood principle implies that we should make the same
inferences about p from either experiment; in other words, it requires the
same estimate of p in either case. A frequentist using the criterion of un-
biasedness would use the estimates p̂ = 15/20 and p̂ = 14/19 respectively. The
criterion of unbiasedness takes account of the possibility of arbitrary num-
bers of failures before seeing the 15th success in the second experiment: when
we compute the expected value of the estimator we are averaging across two
different sample spaces. In the first experiment there are a fixed number of
trials, and we compute expectation with respect to that sample space and
the corresponding binomial distribution. The likelihood principle demands
that inferences depend on the data only through the likelihood function; un-
biasedness asks us to take into account data that might have been observed,
but wasn’t.
In any case, the classic result on the connection between likelihood ratios
and power is the Neyman-Pearson Lemma. I will present the result in terms
of density functions for continuous random variables, the argument for the
discrete case is parallel.

Theorem 4 (Neyman-Pearson)
Given H0 : X ∼ f0 (x) and H1 : X ∼ f1 (x), let the rejection region R* be
defined by

   R* = {x : f1 (x)/f0 (x) ≥ c}

and suppose that R* has size α. Let R be any other rejection region of size
α. Then the test defined by R* has higher power than the one defined by R.

The proof is straightforward. The power is defined to be the probability of
rejecting the null hypothesis, given that the alternative holds. This is just
the integral over the rejection region of the density f1 :

   ∫_{R*} f1 (x) − ∫_{R} f1 (x) = ∫_{R*\R} f1 (x) − ∫_{R\R*} f1 (x)

                               ≥ c ∫_{R*\R} f0 (x) − c ∫_{R\R*} f0 (x)

                               = c ( ∫_{R*} f0 (x) − ∫_{R} f0 (x) )

                               = 0

Since the log function is monotone,

   f1/f0 ≥ c  ⟺  log f1 − log f0 ≥ c′ = log c.

Example:
Let X1 , X2 , X3 , . . . Xn be an IID sample from a Normal(µ, σ²) population.
Find the most powerful test of H0 : µ = µ0 versus H1 : µ = µ1 > µ0 ,
assuming σ² is known.

   log( f1 (x̃)/f0 (x̃) ) ≥ c  ⟺  −∑_{i=1}^{n} (Xi − µ1)²/(2σ²) + ∑_{i=1}^{n} (Xi − µ0)²/(2σ²) ≥ c

      ⟺  ∑ (Xi − µ0)² − ∑ (Xi − µ1)² ≥ 2cσ²

      ⟺  ∑ Xi² − 2nX̄µ0 + nµ0² − ∑ Xi² + 2nX̄µ1 − nµ1² ≥ 2cσ²

      ⟺  −2nX̄µ0 + 2nX̄µ1 ≥ 2cσ² − nµ0² + nµ1²

      ⟺  X̄ (2nµ1 − 2nµ0) ≥ 2cσ² − nµ0² + nµ1²

      ⟺  X̄ ≥ (2cσ² − nµ0² + nµ1²) / (2nµ1 − 2nµ0)

      ⟺  X̄ ≥ c′.

Choosing the constant c′ is equivalent to choosing a significance level α. This
is true for each µ1 > µ0 . Thus the one-sided ‘Z test’, which rejects H0 : µ = µ0
if

   (X̄ − µ0) / (σ/√n) > z_{α/2}

is an optimal size α/2 test for each µ1 > µ0 . Similarly the test which rejects
H0 : µ = µ0 if

   (X̄ − µ0) / (σ/√n) < −z_{α/2}

is an optimal size α/2 test for each µ1 < µ0 . It follows that the standard ‘Z
test’, which rejects H0 : µ = µ0 if

   |X̄ − µ0| / (σ/√n) > z_{α/2}

is an optimal two-sided test for the alternative µ1 ≠ µ0 .

7.4 Tests based on the Likelihood function


The Neyman-Pearson lemma suggests that likelihood ratios play a crucial
role in designing rejection regions. When we have a one parameter family of
probability distributions we can often use the Neyman-Pearson lemma to find
optimal tests. In general, we can not do so. We can still use the likelihood
ratio, and evaluate its behavior asymptotically.

Theorem 5 The Likelihood Ratio Test

Let f (X̃ | θ) be a regular family, and the MLE θ̂ asymptotically normally
distributed. Let the null hypothesis be H0 : θ ∈ Θ0 , the alternative hypothesis
be H1 : θ ∈ Θ1 , where the parameter space Θ = Θ0 ∪ Θ1 and dim(Θ0 ) <
dim(Θ1 ). Let dim(Θ1 ) − dim(Θ0 ) = p. Define the likelihood ratio statistic to
be

   Λ(X̃) = sup{f (X̃ | θ) : θ ∈ Θ1 } / sup{f (X̃ | θ) : θ ∈ Θ0 }

where X̃ = {X1 , X2 , . . . , Xn } is an IID sample from the probability distribu-
tion parametrized by θ. Then if H0 is true,

   2 log Λ(X̃) = 2( ℓn (θ̂1 ) − ℓn (θ̂0 ) ) →_D χ²_p

where θ̂i is the MLE under Hi and

   ℓn (θ̂) = log( f (X̃ | θ̂) ).

I will sketch the proof in the case where dim(Θ) = 1 and Θ0 = {θ0 }. We
know that

   √( n I(θ0) ) (θ̂n − θ0 ) →_D N(0, 1).

Expand ℓn (θ0 ) in a Taylor series about θ̂n :

   ℓn (θ0 ) = ℓn (θ̂n ) + ℓ′n (θ̂n )(θ̂n − θ0 ) + (1/2) ℓ″n (θn*)(θ̂n − θ0 )²

for some θn* satisfying

   |θn* − θ̂n | ≤ |θ0 − θ̂n |.

Since ℓ′n (θ̂n ) = 0, we have

   2 log Λ(X̃) = 2( ℓn (θ̂n ) − ℓn (θ0 ) )

              = −ℓ″n (θn*)(θ̂n − θ0 )²

              = [ −ℓ″n (θn*) / (n I(θ0)) ] · n I(θ0) (θ̂n − θ0 )²

              →_D Z²

where Z ∼ N(0, 1). We make use here of the consistency of the MLE θ̂,
which guarantees that

   −ℓ″n (θn*) / (n I(θ0)) →_P 1,

and the asymptotic normality of θ̂. Since Z² ∼ χ²_1 , we are done.

Example:
Let X be a binomial(n, p) random variable. Let H0 specify p = p0 , and
H1 : p ≠ p0 . Let p̂ = X/n be the MLE. The likelihood ratio test statistic is

   2( ℓ(p̂) − ℓ(p0 ) ) = 2( log f (X | p̂) − log f (X | p0 ) )

                      = 2( x log p̂ + (n − x) log(1 − p̂)
                           − x log p0 − (n − x) log(1 − p0 ) )

                      = 2( x log[ (p̂/(1 − p̂)) / (p0/(1 − p0)) ] + n log[ (1 − p̂)/(1 − p0) ] ).

The properties of this test statistic may not be immediately obvious, but it
is interesting to note that the first term involves an odds ratio — the ratio
of the odds under the MLE to the odds under H0 .
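A small R function (an illustration, not from the text; the name binom.lrt is ours) computes this statistic and an approximate p-value from the chi-squared(1) reference distribution:

binom.lrt <- function(x, n, p0) {
  phat <- x / n
  # the binomial coefficients cancel in the difference of log likelihoods
  lrt <- 2 * (dbinom(x, n, phat, log = TRUE) - dbinom(x, n, p0, log = TRUE))
  c(statistic = lrt, p.value = 1 - pchisq(lrt, df = 1))
}
binom.lrt(7, 10, 0.5)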

7.4.1 Score Tests


The first derivative of the log likelihood, known as the score function

   Un (X̃) = (∂/∂θ) ℓ(X̃ | θ),

was shown earlier to be asymptotically normal with mean 0 and variance
n I(θ). Thus we can construct an approximate size α Z-test, called the
score test: reject H0 : θ = θ0 if

   |Un (X̃, θ0 )| / √( n I(θ0) ) > z_α .

A more general version, for a vector parameter θ̃ when
dim(Θ1 ) − dim(Θ0 ) = p, is given by the quadratic form

   Ũ (θ̂0 )′ In⁻¹(θ̂0 ) Ũ (θ̂0 ) ∼ χ²_p .

Example:
Let X be a binomial(n, p) random variable. Let H0 specify p = p0 , and
H1 : p ≠ p0 . Then

   U (p0 ) = (∂/∂p0) ℓ(X | p0 )
           = X/p0 − (n − X)/(1 − p0 )
           = (X − n p0) / (p0 (1 − p0 ))

and

   In (p0 ) = n / (p0 (1 − p0 )).

Thus, the test statistic is

   [ (X − n p0) / (p0 (1 − p0 )) ] / √( n / (p0 (1 − p0 )) ) = (X − n p0) / √( n p0 (1 − p0 ) ).
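In R, the score statistic for an observed binomial count is a one-liner (a quick check, not from the text; the name score.z is ours):

# score (Z) statistic for x successes in n trials under H0: p = p0
score.z <- function(x, n, p0) (x - n * p0) / sqrt(n * p0 * (1 - p0))
score.z(7, 10, 0.5)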

7.4.2 Wald Tests


Yet another large sample test derives from the asymptotic normality of the
MLE. If θ = θ0 , then

   (θ̂ − θ0 ) / √( 1/In (θ̂) ) ∼ N(0, 1).

An approximate size α Z-test for H0 : θ = θ0 rejects when

   |θ̂ − θ0 | / √( 1/In (θ̂) ) > z_α .

Example:
Let X be a binomial(n, p) random variable. Let H0 specify p = p0 , and
H1 : p ≠ p0 . Then

   (p̂ − p0 ) / √( 1/In (p̂) ) = (p̂ − p0 ) / √( p̂(1 − p̂)/n ).
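Again this is easy to compute in R (an illustration, not from the text; the name wald.z is ours):

# Wald (Z) statistic for x successes in n trials under H0: p = p0
wald.z <- function(x, n, p0) {
  phat <- x / n
  (phat - p0) / sqrt(phat * (1 - phat) / n)
}
wald.z(7, 10, 0.5)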

7.5 Confidence Intervals from Hypothesis Tests
Suppose that we have a family of probability distributions P, parametrized
by θ ∈ Θ. C(X) is a level 1 − α confidence region for θ if

   P_θ(θ ∈ C(X)) ≥ 1 − α

for all θ ∈ Θ. A level α test for H0 : θ = θ0 has a rejection region R_θ0 such
that

   P_θ0(X ∈ R_θ0) ≤ α

or equivalently

   P_θ0(X ∉ R_θ0) ≥ 1 − α.

Given a set of level α tests with rejection regions R_θ , we can define a set
function S(X) taking values in Θ by

   S(X) = {θ : X ∉ R_θ }.

Then for each θ

   P_θ(θ ∈ S(X)) = P_θ(X ∉ R_θ) ≥ 1 − α

and we have a level 1 − α confidence region for θ.
Conversely, if S(X) is a level 1 − α confidence region for θ, then we can
define rejection regions R_θ for each θ by

   X ∈ R_θ  ⟺  θ ∉ S(X)

and it follows that R_θ0 is a rejection region for H0 : θ = θ0 of size no bigger
than α.

Example:
Let X1 , X2 , . . . , Xn be IID Normal(µ, σ²) random variables, where σ² is
known. As we saw earlier, the interval

   C(X) = ( X̄ − z_α σ/√n , X̄ + z_α σ/√n )

contains µ with probability 1 − α. On the other hand, the size α Z-test
fails to reject H0 : µ = µ0 if and only if

   |X̄ − µ0| / (σ/√n) ≤ z_α

which is equivalent to the condition

   −z_α ≤ (X̄ − µ0)/(σ/√n) ≤ z_α

and thus

   −z_α σ/√n ≤ X̄ − µ0 ≤ z_α σ/√n.

This occurs just in case

   X̄ − z_α σ/√n ≤ µ0 ≤ X̄ + z_α σ/√n

and µ0 ∈ C(X).
Thus we can construct approximate confidence intervals using any
hypothesis test, including the likelihood ratio test, score test, and Wald test.

7.6 Exercises
1. Let X1 , X2 , . . . Xn be IID N(µ, σ²). Construct an approximate 95%
confidence interval for the coefficient of variation σ/µ. Hint: X̄ and
S² are independent, and there is a normal approximation for the χ²_m
as m → ∞.

2. Let X1 , X2 , . . . Xn be IID Uniform(0, θ). Construct an exact or
approximate 95% confidence interval for θ.

3. Let X1 , X2 , . . . Xn be IID Pareto(α, β). Supposing that α is known,
construct an exact or approximate confidence interval for β.

4. Let X ⇠ Binomial(n, p). Compare the exact 95% confidence interval


to those based on the normal approximation, the score test, Wald test
and the likelihood ratio test for X = 15, n = 20.

5. Let X1 , X2 , . . . Xn be IID Poisson(✓). Find 95% confidence/probability


intervals by the following methods for Bortkiewicz’s data on deaths
by horse kick for Prussian Cavalry units. Repeat for a random sample
of size 25, for a random sample of size 50, and for the full dataset
(n=200). Here is some R code to create the dataset and sample it:

Deaths = c(0,1,2,3,4,5)
N = c(109,65,22,3,1,0)
# expand the dataset into individual cases
D1 = rep(Deaths,N)
# a sample of size 25:
X25 = sample(D1,size=25,replace=F)

(a) An ‘exact’ confidence interval using the same idea as the
Clopper-Pearson interval for the binomial, i.e., solve the following
equations for lower and upper bounds, for Y = ∑ Xi ∼ Poisson(nθ):

   θ_L = inf{ θ : P(Y ≥ Y_obs | θ) ≥ .025 }

   θ_U = sup{ θ : P(Y ≤ Y_obs | θ) ≥ .025 }

(b) The asymptotic confidence interval based on the fact that X̄
is approximately Normal(θ, 1/In (θ)).
(c) The confidence interval based on the asymptotic normality of the
score function, that is, solve the following for (θ_L , θ_U ):

   Un (X̃) / √( In (θ) ) = ±1.96

(d) The confidence interval based on the likelihood ratio test
statistic:

   2( log(f (X̃ | θ̂)) − log(f (X̃ | θ)) ) ∼ χ²_1

In other words,

   { θ : log(f (X̃ | θ̂)) − log(f (X̃ | θ)) ≤ 3.84/2 }

(e) A Bayesian HPD region of posterior probability .95 using a
Gamma(2, 1/2) prior. Hint: the R function pgamma(x,a,b)
computes the probability that a Gamma(a,b) random variable is
less than x.
(f) An approximate Bayesian 95% probability interval using
Laplace’s approximation for the posterior (i.e., a normal
approximation for the posterior density; see Chapter 4 for the
binomial example).
Chapter 8

Computational Statistics

Modern statistics has been changed dramatically by the existence of serious
computational power. Large scale simulation and iterative solution of
systems of equations have become commonplace.

8.1 Newton and Quasi-Newton algorithms


Newton’s method for solving a system of non-linear equations is simple in
concept: start with an initial guess, approximate the system locally by a
linear system (the tangent line or plane), then solve the simpler linear
system for the next guess. For maximization, the system of equations we
wish to solve comes from setting the gradient of the objective function
equal to zero. Given a starting value x0 , set

   0 = ∇f (x0 ) + H(x − x0 )

where H is the Hessian matrix. This yields the linear system

   −∇f (x0 ) = H(x − x0 ).

Analytically, the solution is

   x = x0 − H⁻¹ ∇f (x0 ).

There are many issues in the implementation of such algorithms, such
as stopping rules, error checking, and so on. You can get some sense of the


richness of the problem by reading the help files in R for the nlm and optim
functions.
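As a bare-bones illustration (not from the text; the function name newton and the example objective are ours), here is a one-dimensional Newton iteration in R, given the first and second derivatives of the objective:

newton <- function(x0, grad, hess, tol = 1e-8, maxit = 100) {
  x <- x0
  for (i in 1:maxit) {
    step <- grad(x) / hess(x)    # H^{-1} grad(x) in one dimension
    x <- x - step
    if (abs(step) < tol) break
  }
  x
}

# example: minimize f(x) = x^4 - 3x^2 + x, starting near one of its minima
g <- function(x) 4 * x^3 - 6 * x + 1
h <- function(x) 12 * x^2 - 6
newton(1, g, h)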
Suppose that we have a random sample X1 , X2 , . . . Xn from a population
with two distinct subpopulations. If each subpopulation has a normal
distribution with mean µi and variance σi², then the distribution of a
random element of the full population has a mixture distribution, in this
case a mixture of two normal distributions. The density function is a
convex combination of the two densities:

   f (x | p, µ1 , µ2 , σ1², σ2²) = p n(x | µ1 , σ1²) + (1 − p) n(x | µ2 , σ2²)

where p is the proportion from the first subpopulation.


As an illustration, the following R code generates a random sample from a
two component normal mixture with parameters p = 3/10, µ1 = 6, σ1 = 2,
and µ2 = 0, σ2 = 4:

MX <- rnorm(50)                    # start from 50 standard normals
p <- .3                            # mixing proportion
n <- rbinom(50, 1, p)              # group indicators
MX <- ifelse(n, 2*MX + 6, 4*MX)    # group one: N(6, 4); group two: N(0, 16)
plot(density(MX))

[Figure: kernel density estimate of the simulated sample, titled ‘Mixture density’.]

Let’s try estimating the parameters by maximum likelihood, using the
nlm() function in R. First, we need to define the function to be
minimized, that is, the negative of the log likelihood. Note that the function
being optimized must be specified with a vector of parameters as the
argument; the dataset has to be ‘built in’ to the function by name (MX).

f <- function(P)
{
# P is the parameter vector (p,mu1,mu2,s1,s2)
# note the use of SDs instead of variances
p <- P[1]
m1 <- P[2]
m2 <- P[3]
s1 <- P[4]
s2 <- P[5]
-sum(log(p*dnorm(MX,m1,s1)+(1-p)*dnorm(MX,m2,s2)))
}

Now, let’s try running it through nlm() with the data we created:

mle <- nlm(f,c(.5,5,0,1,1),hessian=TRUE)

Warning messages:
1: NaNs produced in: log(x)
2: NA/Inf replaced by maximum positive value
3: NaNs produced in: log(x)
4: NA/Inf replaced by maximum positive value

Do all those warnings indicate a problem? We can’t tell without looking at


the output:

mle
$minimum
[1] 135.9917

$estimate
[1] 0.3358718 5.9991124 -1.1386181 0.7881826 3.7583494

$gradient
[1] -5.599077e-06 -8.300367e-06 3.245006e-07 2.154366e-05 9.074742e-07

$hessian
[,1] [,2] [,3] [,4] [,5]
[1,] 184.717845 6.54627873 4.1521531 -10.4477664 5.20012061
[2,] 6.546279 20.91935439 -0.4691034 6.5323202 -0.06922796
[3,] 4.152153 -0.46910339 1.9033706 0.8896393 -0.58319964
[4,] -10.447766 6.53232019 0.8896393 36.6926315 0.70978473
[5,] 5.200121 -0.06922796 -0.5831996 0.7097847 3.88659680

$code
[1] 1

$iterations
[1] 24

As you may verify by checking the nlm help file, termination codes 1 and 2
are indicators that the search converged to a solution, so the warning
messages may be safely ignored. The parameter estimates look ok. The
square root of the diagonal of the inverse of the hessian gives asymptotic
standard errors for the estimates:

sqrt(diag(solve(mle$hessian)))
[1] 0.08082643 0.23065559 0.79375142 0.17661707 0.54407076

Collecting the pieces, we have:

parameter   estimate    SE
p              0.336   0.081
µ1             5.999   0.231
µ2            -1.139   0.794
σ1             0.788   0.177
σ2             3.758   0.544

Things can easily go astray. Newton-like algorithms depend on having


pretty good guesses for initial values. If we pick a sufficiently poor starting
value, the algorithm may go wildly wrong:

> mle <- nlm(f,c(.5,0,1,1,1),hessian=TRUE)


There were 50 or more warnings (use warnings() to see the first 50)

It gave us lots of warnings, but again, we need to check the convergence


code and other output to see if it worked:

> mle
$minimum
[1] -250.9311

$estimate
[1] 7.463620e+03 7.809437e-01 4.630075e+01 1.913076e+01 8.201048e-09

$gradient
[1] -0.006699159 -0.065276225 0.000000000 2.462322988 0.000000000

$hessian
[,1] [,2] [,3] [,4] [,5]
[1,] 8.973960e-07 7.616066e-10 0 -5.971586e-11 0


[2,] 7.616066e-10 1.366317e-01 0 6.822424e-03 0
[3,] 0.000000e+00 0.000000e+00 0 0.000000e+00 0
[4,] -5.971586e-11 6.822424e-03 0 -1.128783e-01 0
[5,] 0.000000e+00 0.000000e+00 0 0.000000e+00 0

$code
[1] 2

$iterations
[1] 59

Notice that the termination code (2) suggests convergence. The nlm
function is minimizing the negative of the log likelihood, which has to be
positive, but the final minimum is negative. The parameter values are fishy:
most notably p̂ = 7463. That’s a bit big for a probability! Two of the
diagonal elements of the hessian matrix are 0 — not good for a matrix that
is supposed to be positive definite!

8.1.1 The EM algorithm


Various algorithms which can be interpreted as versions of the EM
algorithm have been around for a long time. The first modern presentation
was in Dempster, Laird and Rubin [5], where the authors presented it as an
algorithm for incomplete data. We would like to compute MLE’s for a
problem where there is either explicitly missing data, or there is a variable
which, if we knew its value, would greatly simplify the computation. I will
describe the EM algorithm for the mixture problem.
There are three main units: the set-up, the E-step, and the M-step. The
set-up step may involve a little creativity, known as data-augmentation.
This may involve inventing a new variable which possibly has not been
measured, and perhaps even could not be. Sometimes it is simply noting
the missing data.

Data Augmentation

We want to maximize the log likelihood for the normal mixture


problem. Each observation Xi is either from group one or group two.
Define a new variable Zi which is an indicator for group membership:
Zi = 1 if Xi was from group one, and Zi = 0 if Xi was from group
two. Since each Zi is a Bernoulli trial,

E(Zi ) = Pr(Zi = 1) = p.

If we knew all the Zi ’s, the problem would be simple — just use the
group one X’s to estimate the mean and variance of that component
of the mixture, and use the group two X’s to estimate the mean and
variance of their group.

The E step
In general this step is the computation of the expected value of the
Zi ’s, given the X’s and parameter values (here p, µ1 , µ2 , σ1², σ2²). We
can apply Bayes’ rule to compute qi , the posterior probability that
Zi = 1, which is equal to the posterior expectation.

   qi = E(Zi | Xi ) = p f (Xi | µ1 , σ1²) / [ p f (Xi | µ1 , σ1²) + (1 − p) f (Xi | µ2 , σ2²) ].

Knowing the qi ’s allows us to estimate p as

   p̂ = ∑ qi / n.

The M step
Now we don’t know the Zi ’s, but we have their expected values, given
the observed data. Thus the likelihood for each Xi can be written as

   f (xi | Zi , µ1 , µ2 , σ1², σ2²) = qi n(x | µ1 , σ1²) + (1 − qi ) n(x | µ2 , σ2²).

It is not hard to show that the MLE’s for the parameters are now just
weighted averages, with the qi ’s as weights. For example,

   µ̂1 = ∑_i qi Xi / ∑_i qi .
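Putting the E and M steps together gives a very short EM iteration in R. This is only a sketch under the set-up above (the function name em_mix and the starting values are ours; the variance updates use the same weighted-average idea):

em_mix <- function(x, p = 0.5, m1 = mean(x) + sd(x), m2 = mean(x) - sd(x),
                   s1 = sd(x), s2 = sd(x), iter = 200) {
  for (j in 1:iter) {
    # E step: posterior probability that each observation came from group one
    d1 <- p * dnorm(x, m1, s1)
    d2 <- (1 - p) * dnorm(x, m2, s2)
    q <- d1 / (d1 + d2)
    # M step: weighted averages with the q's as weights
    p  <- mean(q)
    m1 <- sum(q * x) / sum(q)
    m2 <- sum((1 - q) * x) / sum(1 - q)
    s1 <- sqrt(sum(q * (x - m1)^2) / sum(q))
    s2 <- sqrt(sum((1 - q) * (x - m2)^2) / sum(1 - q))
  }
  list(p = p, mean = c(m1, m2), sd = c(s1, s2))
}
em_mix(MX)   # MX is the simulated mixture sample from earlier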

8.2 Bootstrapping
Bootstrapping is a simulation method. Suppose that we wanted to study
the distribution of some function of an IID sample X1 , X2 , . . . , Xn . If we
knew the distribution of the population from which the data were sampled,
we could simply generate many pseudo-random samples of size n, compute
the value of our function for each sample, collect the values, and estimate
the parameter or parameters of the sampling distribution we would like to
know.

# the lognormal density, with kernel density estimates from several
# samples of size 50
dx <- seq(.01, 15, .01)
plot(dx, dlnorm(dx), type = "l")

X <- rlnorm(50)
lines(density(X), col = "red")

X <- rlnorm(50)
lines(density(X), col = "blue")

X <- rlnorm(50)
lines(density(X), col = "green")

# if we can sample from the population, we can simulate the sampling
# distribution of the mean directly: 1000 samples of size 50
x <- matrix(rlnorm(50000), ncol = 50)
mx <- apply(x, 1, mean)
plot(density(mx))
quantile(mx, c(.025, .975))

# the bootstrap: resample (with replacement) from the observed sample X
mx <- rep(0, 1000)
for (i in 1:1000)
{
  x <- sample(X, size = 50, replace = T)
  mx[i] <- mean(x)
}
lines(density(mx), col = "red")
quantile(mx, c(.025, .975))

mx <- rep(0, 1000)
for (i in 1:1000)
{
  x <- sample(X, size = 50, replace = T)
  mx[i] <- mean(x)
}
plot(density(mx))

# the same idea for another population: the sampling distribution of the
# mean of 30 Weibull(3, 2) observations
Mx <- rep(0, 100000)
for (i in 1:100000)
{
  Mx[i] <- mean(rweibull(30, 3, 2))
}
Chapter 9

Markov Chain Monte Carlo

Suppose that we wish to evaluate an integral

   ∫ h(x) f (x) dx

where f (x) is a density function. In other words, we wish to compute
E(h(X)). The law of large numbers ensures that if we sample
(X1 , X2 , . . . , Xn ) independently from the probability distribution with
density f (x), then

   (1/n) ∑ h(Xi ) → E(h(X)) = ∫ h(x) f (x) dx.

Further, the Central Limit Theorem allows us to evaluate the accuracy of
the estimate, at least approximately, via an asymptotic 95% confidence
interval:

   h̄ ± 1.96 s_h/√n

where s_h is the sample standard deviation of the h(Xi )’s.
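A tiny illustration in R (not from the text): plain Monte Carlo for E(X⁴) with X ∼ N(0, 1), whose true value is 3.

h <- rnorm(100000)^4
mean(h)                                               # Monte Carlo estimate
mean(h) + c(-1.96, 1.96) * sd(h) / sqrt(length(h))    # approximate 95% CI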
What if we can’t directly generate values from the desired probability
distribution? The idea of Markov Chain Monte Carlo is to construct a
Markov process whose stationary distribution is the desired probability
distribution. The theory of Markov processes guarantees that samples from
that Markov process also obey the Law of Large Numbers. More generally,
our sample from the stationary distribution may be used to estimate any
parameter of the distribution, or the density function itself.


9.1 Gibbs sampling


Suppose we would like to sample from a joint distribution specified by
density f (x, y), but for some reason we can’t easily generate pseudo random
numbers directly from the joint distribution. If we can generate from the
conditional distributions easily, we can construct a Markov Process with
the desired stationary distribution as follows. Start with any reasonable
value for Y0 .

Generate Xi+1
Generate an observation from the conditional distribution

Xi+1 ⇠ fx (x | Yi ).

Generate Yi+1
Generate an observation from the conditional distribution

Yi+1 ⇠ fy (y | Xi+1 ).

This is a Markov process: the conditional distribution of the next


observation depends only on the current state, not on the history of the
process. Assuming that the process is ergodic, after some time it will be in
the stationary distribution, so we are then sampling from the desired joint
distribution. For a more detailed discussion, see Casella and George [4].
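A toy example in R (ours, not from the text): Gibbs sampling from a standard bivariate normal with correlation rho, using the two conditional normals X | Y = y ∼ N(rho·y, 1 − rho²) and Y | X = x ∼ N(rho·x, 1 − rho²).

rho <- 0.8
N <- 5000
x <- y <- numeric(N)
y[1] <- 0                      # any reasonable starting value
for (i in 2:N) {
  x[i] <- rnorm(1, rho * y[i - 1], sqrt(1 - rho^2))
  y[i] <- rnorm(1, rho * x[i], sqrt(1 - rho^2))
}
cor(x, y)                      # should be close to rho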

9.2 The Changepoint Problem


Let Y1 , Y2 , . . . , Yn be Poisson random variables, with a changepoint at an
unknown time m. In other words,

   Y1 , Y2 , . . . , Ym ∼ Poisson(µ)

and

   Ym+1 , Ym+2 , . . . , Yn ∼ Poisson(λ).

Let the prior distributions for the parameters be µ ∼ Gamma(α, β),
λ ∼ Gamma(ν, δ), and m ∼ Uniform{1, 2, . . . , n}. Then the joint
distribution of the data and the parameters is


   f (Ỹ , µ, λ, m) = [ ∏_{i=1}^{m} e^{−µ} µ^{Yi} / Yi ! ] [ ∏_{i=m+1}^{n} e^{−λ} λ^{Yi} / Yi ! ]
                     · [ β^α µ^{α−1} e^{−βµ} / Γ(α) ] · [ δ^ν λ^{ν−1} e^{−δλ} / Γ(ν) ] · (1/n).

The joint posterior distribution is thus

   π(µ, λ, m | Ỹ ) = f (Ỹ , µ, λ, m) / f_Ỹ (Ỹ )

                   = c e^{−mµ} µ^{∑_1^m Yi} e^{−(n−m)λ} λ^{∑_{m+1}^n Yi} · µ^{α−1} e^{−βµ} · λ^{ν−1} e^{−δλ}

                   = c µ^{(α + ∑_1^m Yi) − 1} e^{−(β+m)µ} λ^{(ν + ∑_{m+1}^n Yi) − 1} e^{−(δ+n−m)λ}

From here, it is not hard to compute the conditional distributions for each
parameter, given the data Ỹ and the other parameters. The two Poisson
parameters µ and λ are conditionally independent given m and Ỹ .

   π1 (µ | Ỹ , m) ∼ Gamma(α + ∑_{i=1}^{m} Yi , β + m)

   π2 (λ | Ỹ , m) ∼ Gamma(ν + ∑_{i=m+1}^{n} Yi , δ + (n − m))

Finally, the conditional distribution for m, given Ỹ , µ and λ, is, for each m,

   π3 (m | Ỹ , µ, λ) = [ µ^{(α + ∑_1^m Yi) − 1} e^{−(β+m)µ} λ^{(ν + ∑_{m+1}^n Yi) − 1} e^{−(δ+n−m)λ} ]
                       / ∑_{k=1}^{n} µ^{(α + ∑_1^k Yi) − 1} e^{−(β+k)µ} λ^{(ν + ∑_{k+1}^n Yi) − 1} e^{−(δ+n−k)λ} .

It is easy to sample from each of these conditional distributions. In R ,


rgamma(1,a,b) will generate an observation from the Gamma(a, b)
distribution. Sampling from an arbitrary discrete distribution may be done
with the sample() function:

sample(1:n,size=1,prob=p)

will generate an observation from the integers 1, 2, . . . , n with probabilities


given by the vector p.
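A compact sketch of one Gibbs update for this model (ours, not from the text; it assumes the Gamma prior rate parameters called β and δ above, and the name gibbs_update is illustrative):

gibbs_update <- function(Y, mu, lambda, m, alpha, beta, nu, delta) {
  n <- length(Y)
  S <- cumsum(Y)                            # S[k] = Y_1 + ... + Y_k
  # draw mu and lambda from their Gamma conditionals
  mu <- rgamma(1, alpha + S[m], beta + m)
  lambda <- rgamma(1, nu + S[n] - S[m], delta + (n - m))
  # draw m from its discrete conditional, computed on the log scale
  k <- 1:n
  logp <- S[k] * log(mu) - k * mu + (S[n] - S[k]) * log(lambda) - (n - k) * lambda
  m <- sample(1:n, size = 1, prob = exp(logp - max(logp)))
  c(mu = mu, lambda = lambda, m = m)
}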

9.3 The Metropolis-Hastings Algorithm


Often it is not possible to generate from exact conditional distributions for
the parameters. Typically the difficulty arises in computing the
denominator for Bayes Theorem:

   π(θ | X) = f (X | θ) π(θ) / ∫ f (X | θ) π(θ) dθ.

The idea is to replace sampling from the exact conditional distributions for
components of θ with an accept-reject algorithm for which the sequence of
trials still converges to the stationary distribution.
Let’s start with a simplified version. Suppose we have a K parameter state
space, and we want to generate samples from a joint distribution f (y). We
can’t run the standard MCMC sampler, generating from the conditional
distributions f (yn | yn−1 ), perhaps because we don’t have the normalizing
constants. We create a proposal distribution g(z | yn−1 ), which may not have
any connection to the distribution f (y), but rather proposes a jump to a
new location in the state space. We start the chain at some arbitrary
location y0 , and generate proposals Y from the distribution g(Y | yn ). We
now accept or reject with probability

   α = min{ f (Y ) g(yn | Y ) / [ f (yn ) g(Y | yn ) ] , 1 }.

Then set yn+1 = Y with probability α, and yn+1 = yn with probability
1 − α. If the proposal distribution g is chosen so that we cover the whole
state space with positive probability, then the process still has stationary
distribution f (Metropolis et al. [16], Hastings [12]). If the proposal
distribution is symmetric, such as Y ∼ N(yn , σ²), then g(Y | yn ) = g(yn | Y ),
and the test criterion simplifies to

   α = min{ f (Y )/f (yn ) , 1 }.

In other words, if the proposal Y has greater probability than the current
yn , always accept it (always move to points of higher probability than the
current location). If the proposed Y is less probable, then accept with
positive probability α < 1. The idea is that you want the process to spend
more time in regions of higher probability, but you don’t want to get stuck
there; you want to explore the whole state space.
Often one wants to combine Gibbs sampling with Metropolis-Hastings.
Here is pseudo-code to implement MCMC for a vector parameter
θ̃ = (θ1 , θ2 , . . . , θk ). In essence we use the accept-reject criterion successively
for each component of θ̃:

{Initialize θ̃1}
for n in 1 to N do
  for i in 1 to k do
    {generate Yi ∼ g(Yi | θi )}
    Ỹ ← (θ1 , . . . , θi−1 , Yi , θi+1 , . . . , θk )
    α ← min{ f (X | Ỹ ) π(Ỹ ) g(θi | Yi ) / [ f (X | θ̃) π(θ̃) g(Yi | θi ) ] , 1 }
    {set θ̃ = Ỹ with probability α, else retain θ̃}
  end for
end for
That pseudo-code describes one iteration of the Markov process; it should
be repeated many times to ensure that the process converges to its
stationary distribution.
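A minimal random-walk Metropolis example in R (ours, not from the text), with symmetric normal proposals and a target density known only up to a constant; here the target is a standard normal, so the sample mean and variance should be near 0 and 1.

target <- function(x) exp(-x^2 / 2)       # unnormalized target density
N <- 10000
x <- numeric(N)
for (n in 2:N) {
  y <- rnorm(1, x[n - 1], 2)              # symmetric proposal around current state
  a <- min(target(y) / target(x[n - 1]), 1)
  x[n] <- if (runif(1) < a) y else x[n - 1]
}
c(mean(x), var(x))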

9.4 Exercises
1. Compute the integral

   ∫_{−∞}^{∞} log |x| (1/√(2π)) e^{−x²/2} dx,

that is, E(log |X|) for X ∼ N(0, 1), by the following methods. For the
Monte Carlo methods, estimate the integral and compute a standard
error for your estimate.

(a) The trapezoid rule.


(b) The R integrate() function.
(c) Monte Carlo Integration, sampling directly from the N (0, 1)
distribution.
(d) Importance sampling, using a t distribution with 3 df. In other
words, the average will be computed from the values of

log(abs(X)) * dnorm(X)/dt(X, 3).

2. Estimate E(log |X|) for X ∼ t3 (a t random variable with 3 df), using
the same methods, but for importance sampling, sample from the
standard normal distribution, i.e., reverse the roles of the normal and t
distributions in the previous exercise. For the Monte Carlo methods,
estimate the integral and compute a standard error for your estimate.

3. Bootstrapping doesn’t always work!


Let (X1 , X2 , . . . , Xn ) be IID U(0, θ) random variables. Let Mn be
X(n) , the maximum of the X’s.

(a) Compute the distribution of Mn directly, i.e. find the cumulative
distribution function F (x) via F (x) = Pr(Mn < x).
(b) Show that Mn /θ is a pivotal quantity, that is, it has a
distribution that does not depend on θ. Hint: Xi /θ has a
uniform distribution on what interval?
(c) Hence we can construct a confidence interval by finding a and b
such that Pr(a < Mn /θ < b) = 1 − α. Show that this CI is
(Mn /b, Mn /a).

(d) Generate a random sample in R using X <- runif(25,0,5),
find the largest (M = max(X)), and construct the exact 95% CI.
(e) Show that θ̂ = (26/25) M is unbiased for θ.
(f) Construct an approximate 95% CI for θ by bootstrapping θ̂,
using the .025 and .975 quantiles of the bootstrap distribution
for θ̂.
(g) How does the bootstrap CI compare to the exact 95% CI
constructed above? Explain why the bootstrap CI doesn’t work
in this case.

4. Retrieve the British coal mine disaster data from the Math 141 web
page at

http://www.reed.edu/~jones/141

(a) Fit the change point model by maximum likelihood. Plot the
profile log-likelihood vs year. The profile likelihood for year m is
log L(µ̂, λ̂, m); in other words, given m, maximize the
log-likelihood in the other parameters.
(b) Let the priors for µ and λ, respectively, be Gamma(4, 1) and
Gamma(2, 2). Use Markov chain Monte Carlo to estimate the
posterior marginal densities, means and variances for each
parameter. Plot the prior and posterior densities for each
parameter. Be sure to provide details about your MCMC
computations (R code, number of ‘burn-in’ iterations, total
iterations, etc.)
Bibliography

[1] John Arbuthnott. An argument for divine providence, taken from the
constant regularity observ’d in the births of both sexes. Phil. Trans. of
the Royal Society, 27:186–190, 1710.

[2] James O. Berger and Robert L. Wolpert. The Likelihood Principle.


Institute of Mathematical Statistics, 2 edition, 1988.

[3] David Blackwell. Conditional expectation and unbiased sequential


estimation. Annals of Mathematical Statistics, 18:105–110, 1947.

[4] G. Casella and E. George. Explaining the Gibbs sampler. The
American Statistician, 46(3):167–174, 1992.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the Royal
Statistical Society, Series B, 39(1):1–22, 1977.

[6] R. A. Fisher. On an absolute criterion for fitting frequency curves.


Messenger of Mathematics, 41:155–160, 1912. reprinted in Statistical
Science, Vol. 12, No. 1. (Feb., 1997), pages 39-41.

[7] R. A. Fisher. On the mathematical foundations of theoretical


statistics. Phil. Trans. of the Royal Society, 222:309–368, 1922.

[8] R. A. Fisher. Statistical methods for research workers. Oliver & Boyd,
Edinburgh, 1925.

[9] R. A. Fisher. The design of experiments. Oliver & Boyd, Edinburgh,


1935.


[10] Carl Friedrich Gauss. Theoria Motus Corporum Coelestium. Perthes et


Besser, Hamburg, 1809.

[11] Anders Hald. A history of mathematical statistics from 1750 to 1930.


Wiley, New York, 1998.

[12] W. Hastings. Monte Carlo sampling methods using Markov chains and
their applications. Biometrika, 57:97–109, 1970.

[13] Norman L. Johnson and Samuel Kotz, editors. Breakthroughs in


Statistics, volume 1. Springer-Verlag, New York, 1992.

[14] Norman L. Johnson and Samuel Kotz, editors. Leading personalities in


statistical sciences: from the 17th century to the present. Wiley, New
York, 1997.

[15] B. O. Koopman. On distributions admitting a sufficient statistic.


Transactions of the American Mathematical Society, 39(3):399–409,
1936.

[16] N. Metropolis et al. Equation of state calculations by fast computing
machines. J. Chem. Phys., 21(6):1087–1092, 1953.

[17] J. Neyman and E. S. Pearson. On the problem of the most efficient
tests of statistical hypotheses. Phil. Trans. of the Royal Society, 231,
1933.

[18] C. Radhakrishna Rao. Information and accuracy attainable in the


estimation of statistical parameters. Bulletin of the Calcutta
Mathematical Society, 37:81–91, 1945.
