Albyn C. Jones
Department of Mathematics
Reed College
Contents

1 Review: Probability
  1.1 Distributions
    1.1.1 Discrete Distributions
    1.1.2 Continuous Distributions
  1.2 Location and Scale parameters
  1.3 Joint Distributions
  1.4 Marginal Distributions
  1.5 Conditional Distributions
  1.6 MGF's of sums of RV's
  1.7 Multivariate Moments
  1.8 The Multivariate Normal Distribution
  1.9 Marginal and Conditional Normal distributions
  1.10 MGF's and Characteristic Functions
  1.11 Transforms of random variables
  1.12 Exercises

3 Least Squares
  3.1 The Modern Linear Model
  3.2 Parameter Estimation by Least Squares
  3.3 Projection Operators on Linear Spaces
  3.4 Distributional Results
    3.4.1 The distribution of $\hat\beta$
    3.4.2 Fitted Values
    3.4.3 Predictions
    3.4.4 Residuals
  3.5 t and F statistics
    3.5.1 t statistics
    3.5.2 F statistics
    3.5.3 $t^2$ is an F statistic!
  3.6 The Multiple Correlation Coefficient: $R^2$
  3.7 Exercises

7 Formal Inference
  7.1 Confidence Intervals
  7.2 Hypothesis Tests
    7.2.1 Power
  7.3 Likelihood
revival began with purely theoretical development, and only gathered steam
in the applied world with the advent of powerful modern computers.
I hope the reader will be tolerant of my peculiar project.
Chapter 1
Review: Probability
This chapter contains results from probability theory that I will assume you
know. You probably don’t know all the probability distributions discussed
below, so read this chapter. It also establishes some notation that will (I
hope!) be consistently used in the rest of the notes.
1.1 Distributions
Here are a few probability distributions that arise in applications and exercises in statistics books. It is useful to get in the habit of writing the probability distribution or density in the form $f(x \mid \theta)$, where x is the argument and $\theta$ is the parameter (or list of parameters). If the support of the density depends on a parameter, include the support in the definition of the density via the indicator function for the set on which the density is non-zero; for example, $I(a < x < b)$ indicates the open interval (a, b).
Binomial(n, p)
\[ f(k \mid p) = \binom{n}{k} p^k (1-p)^{n-k} \]
for $k = 0, 1, 2, \ldots, n$.

Geometric(p)
\[ f(k \mid p) = p\,q^k \]
for $k = 0, 1, 2, \ldots, \infty$, where $q = 1 - p$.

Warning: the geometric distribution is sometimes defined as the number of trials including the first success. For compatibility with R we will use this definition, the number of failures before the first success.
Gamma($\alpha$, $\lambda$)
\[ f(x \mid \alpha, \lambda) = \frac{\lambda^\alpha x^{\alpha-1}}{\Gamma(\alpha)}\, e^{-\lambda x}\, I(x > 0) \]
for $\alpha > 0$ and $\lambda > 0$. Special cases include the Exponential($\lambda$) distribution, which is a Gamma(1, $\lambda$), and the $\chi^2_n$, which is a Gamma(n/2, 1/2). $\lambda$ is often called a rate parameter. An alternative form, using $\beta$ as a scale parameter ($\beta = 1/\lambda$), is given by
\[ f(x \mid \alpha, \beta) = \frac{x^{\alpha-1}}{\beta^\alpha\,\Gamma(\alpha)}\, e^{-x/\beta}\, I(x > 0). \]
Beta(r, s)
\[ f(x \mid r, s) = \frac{\Gamma(r+s)}{\Gamma(r)\,\Gamma(s)}\, x^{r-1}(1-x)^{s-1} \]
for $0 < x < 1$.
Cauchy($\mu$, $\sigma$)
\[ f(x \mid \mu, \sigma) = \frac{1}{\pi\sigma}\,\frac{1}{1 + (x-\mu)^2/\sigma^2} \]
for $\sigma > 0$.
If X and Y are independent Normal(0,1) random variables, then X/Y
has a Cauchy(0,1) distribution. The Cauchy(0,1) is also the t distribu-
tion with 1 degree of freedom.
Double Exponential($\mu$, $\sigma$)
\[ f(x \mid \mu, \sigma) = \frac{1}{2\sigma}\, e^{-|x-\mu|/\sigma} \]
for $\sigma > 0$.
Lognormal($\mu$, $\sigma^2$)
\[ f(x \mid \mu, \sigma^2) = \frac{1}{x\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}\, I(x > 0) \]
for $\sigma > 0$.
Logistic($\mu$, $\sigma$)
\[ f(x \mid \mu, \sigma) = \frac{1}{\sigma}\,\frac{e^{-(x-\mu)/\sigma}}{(1 + e^{-(x-\mu)/\sigma})^2} \]
for $\sigma > 0$.
Pareto($\alpha$, $\beta$) Named for the Italian economist Vilfredo Pareto, who used it to model wealth distributions. It is also known as a power law distribution.
\[ f(x \mid \alpha, \beta) = \frac{\beta\,\alpha^\beta}{x^{\beta+1}}\, I(x > \alpha) \]
for $\alpha > 0$, $\beta > 0$.
The t distribution with n degrees of freedom has density
\[ f_n(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\,\frac{1}{(1 + x^2/n)^{\frac{n+1}{2}}}. \]
If $Z \sim N(0,1)$ and $X \sim \chi^2_n$ are independent, then
\[ \frac{Z}{\sqrt{X/n}} \]
has a $t_n$ distribution.
1.2 Location and Scale parameters

\[ E(\mu + X) = \mu + EX \]
If $Y = \mu + \sigma X$, then
\[ P(Y \le y) = P(\mu + \sigma X \le y) = P\left(X \le \frac{y-\mu}{\sigma}\right) = F_X\left(\frac{y-\mu}{\sigma}\right). \]
Example

If Z has a standard normal distribution, then
\[ f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}. \]
Hence the density for $X = \mu + \sigma Z$ is
\[ f_X(x) = \frac{1}{\sigma}\, f_Z\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}. \]
Example

If Z has a standard Cauchy distribution:
\[ f_Z(z) = \frac{1}{\pi}\,\frac{1}{1+z^2} \]
then the density for $X = \mu + \sigma Z$ is
\[ f(x \mid \mu, \sigma) = \frac{1}{\pi\sigma}\,\frac{1}{1 + \frac{(x-\mu)^2}{\sigma^2}}. \]
The MGF of $Y = \mu + \sigma X$ follows similarly:
\[ E e^{t(\mu + \sigma X)} = E e^{t\mu} e^{t\sigma X} = e^{t\mu}\, E e^{(t\sigma)X} = e^{t\mu}\, M_X(t\sigma). \]
1.5 Conditional Distributions

\[ f(x \mid y) = \frac{f(x, y)}{f_Y(y)} \]
when $f_Y(y) \ne 0$.

For the discrete case, we can base our definition directly on the definition of conditional probability:
\[ P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}. \]
If we can then recognize the product of the MGF's as the MGF of a known distribution, we know the distribution of the sum. This generalizes immediately to the sum of n independent RV's.
Example

X and Y are independent Poisson($\lambda_x$) and Poisson($\lambda_y$) RV's.
\[ E e^{tX} = \sum_{k=0}^{\infty} e^{tk}\,\lambda^k e^{-\lambda}/k! = e^{-\lambda}\sum_{k=0}^{\infty} (\lambda e^t)^k/k! \]
so
\[ M_X(t) = e^{\lambda(e^t - 1)}. \]
Thus the MGF of the sum is
\[ M_{X+Y}(t) = M_X(t)\,M_Y(t) = e^{(\lambda_x + \lambda_y)(e^t - 1)} \]
and the sum of the two independent Poisson RV's is Poisson, with parameter $\lambda_x + \lambda_y$.
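A quick simulation (my sketch, not part of the original notes) confirms the result; the rates 2 and 3 are arbitrary choices:

x = rpois(10000, lambda = 2)
y = rpois(10000, lambda = 3)
s = x + y
mean(s)    # close to 2 + 3 = 5
var(s)     # for a Poisson, the variance equals the mean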
Example

X and Y are independent Gamma($r_x$, $\lambda$) and Gamma($r_y$, $\lambda$) RV's, respectively, where $\lambda$ is the inverse of the scale parameter $\beta$.
\[ E e^{tX} = \int_0^\infty e^{tx}\,\frac{\lambda^r x^{r-1}}{\Gamma(r)}\,e^{-\lambda x}\,dx \]
and
\[ M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^r. \]
Thus
\[ M_{X+Y}(t) = M_X(t)\,M_Y(t) = \left(\frac{\lambda}{\lambda - t}\right)^{r_x + r_y} \]
and we can immediately observe that the sum of two Gamma variates with common scale factor has a Gamma distribution with the same scale factor, while the shape parameter is the sum of the two shape parameters. In other words
\[ \text{Gamma}(r_x, \lambda) + \text{Gamma}(r_y, \lambda) \sim \text{Gamma}(r_x + r_y, \lambda). \]
1.7 Multivariate Moments

For two random variables X and Y defined on the same sample space, the covariance is defined to be
\[ Cov(X, Y) = E\left[(X - EX)(Y - EY)\right]. \]
1.8 The Multivariate Normal Distribution

The standard normal density is
\[ f(z) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}z^2\right) \]
so the resulting multivariate density for z is
\[ f(z) = \prod_{i=1}^n f(z_i) = \left(\frac{1}{2\pi}\right)^{n/2}\exp\left(-\frac{1}{2}\sum z_i^2\right) = \left(\frac{1}{2\pi}\right)^{n/2}\exp\left(-\frac{1}{2}z'z\right). \]
To get the general form, apply the change of variable formula from calculus. Let
\[ X = \mu + AZ \]
where A is an $n \times n$ matrix of full rank, and $\mu$ is an n vector of constants. Using the notation of Jerry's 212 notes, the change of variable theorem states:
\[ \int_{\psi(K)} f(x)\,dx = \int_K f(\psi(x))\,|\det \psi'|\,dx \]
where here
\[ \psi(x) = A^{-1}(x - \mu). \]
1.9 Marginal and Conditional Normal distributions

The standard notation for stating that a random vector X has a multivariate normal distribution is
\[ X \sim N(\mu, \Sigma). \]
In particular, the marginal distribution of a subvector X is again multivariate normal:
\[ X \sim N(\mu_X, \Sigma_{XX}). \]
\[ f(x \mid Y = y) = \frac{f(x, y)}{f_Y(y)} \]
You may verify that the following is the conditional density by multiplying by the marginal density for Y and showing that it is equal to the original joint density:
\[ f(x \mid y) = \left(\frac{1}{\sqrt{2\pi}}\right)^k |\Sigma_{X.Y}|^{-1/2}\exp\left(-\frac{1}{2}(x-\mu)'\,\Sigma_{X.Y}^{-1}\,(x-\mu)\right) \]
where
\[ \mu = \mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y) \]
and
\[ \Sigma_{X.Y} = \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}. \]
1.10 MGF's and Characteristic Functions

The MGF gets its name from the fact that we can use it to compute moments for a probability distribution. To see why, expand $e^{tX}$ as a power series:
\[ E e^{tX} = E\left(1 + tX + \frac{1}{2!}(tX)^2 + \frac{1}{3!}(tX)^3 + \ldots\right) = 1 + t\,EX + \frac{1}{2!}t^2 EX^2 + \frac{1}{3!}t^3 EX^3 + \ldots. \]
Thus, if you differentiate k times, and set t = 0, you get $EX^k$.
If the random variable X has MGF $M_X(t)$, then the random variable $Y = a + bX$, where a and b are constants, has MGF
\[ M_Y(t) = e^{at}\,M_X(bt). \]
If two RV's X and Y are independent, then the MGF of their sum is the product of their MGF's:
\[ M_{X+Y}(t) = M_X(t)\,M_Y(t). \]
Not every probability distribution has a MGF: the distribution must possess moments of all orders. Every distribution does have a characteristic function:
\[ \phi_X(t) = E e^{itX} \]
where $i = \sqrt{-1}$. Since $|e^{itX}| = 1$, the integral always exists. You can compute moments from the characteristic function if they exist; I'll leave this as an exercise for the reader.
It is useful to compute these transforms for standard distributions.

distribution          density                                                MGF                  CF
normal(0,1)           $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$                      $e^{t^2/2}$          $e^{-t^2/2}$
gamma($\alpha$,1)     $\frac{x^{\alpha-1}}{\Gamma(\alpha)}e^{-x}I(0<x)$      $(1-t)^{-\alpha}$    $(1-it)^{-\alpha}$
double exponential    $\frac{1}{2}e^{-|x|}$                                  $\frac{1}{1-t^2}$    $\frac{1}{1+t^2}$
Cauchy                $\frac{1}{\pi}\frac{1}{1+x^2}$                         none                 $e^{-|t|}$
There are some interesting relationships in this table. The normal is essentially its own characteristic function (up to a constant), and the Cauchy and double exponential transform to each other's density. This is a very useful example, as it illustrates some important facts about characteristic functions: if the distribution has no expectation, then the characteristic function is not differentiable at t = 0. The characteristic function of the triangular(0, $2\theta$) is the square of the characteristic function of the uniform(0, $\theta$); hence the sum of two independent uniforms has a triangular distribution. Finally, if a density is symmetric about 0, then the characteristic function is strictly real.
Example

X and Y are independent Poisson($\lambda_x$) and Poisson($\lambda_y$) RV's.
\[ E e^{tX} = \sum_{k=0}^{\infty} e^{tk}\,\lambda^k e^{-\lambda}/k! = e^{-\lambda}\sum_{k=0}^{\infty} (\lambda e^t)^k/k! \]
so
\[ M_X(t) = e^{\lambda(e^t - 1)}. \]
Thus the MGF of the sum is
\[ M_{X+Y}(t) = M_X(t)\,M_Y(t) = e^{(\lambda_x + \lambda_y)(e^t - 1)} \]
and the sum of the two independent Poisson RV's is Poisson, with parameter $\lambda_x + \lambda_y$.
Example

X and Y are independent Gamma($r_x$, $\lambda$) and Gamma($r_y$, $\lambda$) RV's, respectively, where $\lambda$ is the inverse of the scale parameter $\beta$.
\[ E e^{tX} = \int_0^\infty e^{tx}\,\frac{\lambda^r x^{r-1}}{\Gamma(r)}\,e^{-\lambda x}\,dx \]
and
\[ M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^r. \]
Example

X and Y are independent Cauchy random variables and thus have characteristic functions $e^{-|t|}$. For the average $(X+Y)/2$,
\[ \phi_{\frac{X+Y}{2}}(t) = \phi_X(t/2)\,\phi_Y(t/2) = (e^{-|t|/2})^2 = e^{-(|t|/2 + |t|/2)} = e^{-|t|}. \]
Thus the average of two independent Cauchy RV's has a Cauchy distribution! This should lead you to be wary of computing averages for populations that don't have a mean! In contrast, the average of two normal RVs with variance $\sigma^2$ has variance $\sigma^2/2$; averaging decreases the variability.
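A simulation sketch (mine, not from the notes) makes the point vivid: the running mean of Cauchy draws never settles down, while the running mean of normal draws does.

x = rcauchy(10000)
plot(cumsum(x)/seq_along(x), type = "l", ylab = "running mean")
z = rnorm(10000)
lines(cumsum(z)/seq_along(z), col = "red")   # converges to 0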
Example

Let (X, Y) have a joint distribution on $R^2$. We can find the distribution of a function of X and Y by changing variables to a pair (U, V) that includes it, then integrating out the variable we do not need.
Example

Let X and Y be independent random variables with Gamma($\alpha_x$, $\lambda$) and Gamma($\alpha_y$, $\lambda$) distributions, respectively. Let $V = X/(X+Y)$. We want to find the marginal distribution for V. We could try to do this directly, but the change of variables theorem gives us a straightforward method. Pick $G(X, Y) = (U, V)$ with a simple form such as $U = X$, $U = Y$, or perhaps $U = X + Y$. It may not be obvious which choice is best initially! It helps to have a $G^{-1}$ with a nice Jacobian. You should verify that the choice $U = X$ gives $X = U$ and $Y = U(1-V)/V$, while $U = Y$ yields $X = UV/(1-V)$, while $U = X + Y$ gives $X = UV$ and $Y = U(1-V)$. I'll work through the last option; it looks simpler.
\[ f_{U,V}(u, v) = f_{X,Y}(uv,\, u(1-v))\,\left|\det\begin{pmatrix} v & u \\ 1-v & -u \end{pmatrix}\right|. \]
The determinant evaluates to $-vu - u(1-v) = -u$, yielding
\[ f_{U,V}(u, v) = f_{X,Y}(uv,\, u(1-v))\,|u|. \]
Since X and Y are non-negative, so is $U = X + Y$, so we may drop the absolute value. Since X and Y are independent, the joint density is the product of the marginal densities:
\[ f_{U,V}(u, v) = f_X(uv)\,f_Y(u(1-v))\,u \]
\[ = \frac{(uv)^{\alpha_x-1}\lambda^{\alpha_x}}{\Gamma(\alpha_x)}\,e^{-\lambda uv}\;\frac{(u(1-v))^{\alpha_y-1}\lambda^{\alpha_y}}{\Gamma(\alpha_y)}\,e^{-\lambda u(1-v)}\,u \]
\[ = u^{\alpha_x+\alpha_y-1}\,\lambda^{\alpha_x+\alpha_y}\,e^{-\lambda u}\;\frac{v^{\alpha_x-1}(1-v)^{\alpha_y-1}}{\Gamma(\alpha_x)\,\Gamma(\alpha_y)} \]
Integrating out u gives the marginal density for V:
\[ f_V(v) = \int_0^\infty u^{\alpha_x+\alpha_y-1}\,\lambda^{\alpha_x+\alpha_y}\,e^{-\lambda u}\,du\;\frac{v^{\alpha_x-1}(1-v)^{\alpha_y-1}}{\Gamma(\alpha_x)\,\Gamma(\alpha_y)} = \Gamma(\alpha_x+\alpha_y)\,\frac{v^{\alpha_x-1}(1-v)^{\alpha_y-1}}{\Gamma(\alpha_x)\,\Gamma(\alpha_y)}. \]
We see that V has a Beta($\alpha_x$, $\alpha_y$) density. Note that for non-negative random variables X and Y, $X/(X+Y)$ must be in (0, 1).
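The result is easy to check by simulation; a sketch (the shape values 2 and 5 are arbitrary):

x = rgamma(10000, shape = 2, rate = 1)
y = rgamma(10000, shape = 5, rate = 1)
v = x/(x + y)
hist(v, freq = FALSE, breaks = 40)
curve(dbeta(x, 2, 5), add = TRUE, col = "red")   # Beta(alpha_x, alpha_y)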
1.12 Exercises
1. Use the convolution theorem to show that if X and Y are independent Gamma($\alpha_x$, $\lambda$) and Gamma($\alpha_y$, $\lambda$) random variables, then X + Y has a Gamma($\alpha_x + \alpha_y$, $\lambda$) distribution.
Show that
\[ t = \frac{\bar X - \mu}{S/\sqrt n} = \frac{\sqrt n\,(\bar X - \mu)}{S} \]
has a $t_{n-1}$ distribution.
7. Let X and Y be independent $\chi^2_a$ and $\chi^2_b$ random variables, respectively. Show that
\[ F = \frac{X/a}{Y/b} \]
has an $F_{a,b}$ distribution.
Chapter 2

The Method of Moments
where $n(x \mid \mu, \sigma^2)$ is the density function for a normal with mean $\mu$ and variance $\sigma^2$. There are 5 parameters: p, the proportion of observations in population 1, and a mean and variance for each population. He discussed the problem of identifiability, that is, the question of whether there is more than one set of parameter values yielding the same distribution. For example, when the two normal distributions collapse to one, we can parametrize the mixture model in two ways: (1) $\mu_1 = \mu_2$, $\sigma_1^2 = \sigma_2^2$, with any choice of p, and (2) $p = 1$, $\mu_1 \ne \mu_2$, $\sigma_1^2 \ne \sigma_2^2$. The mathematical issue is whether or not there is a bijection between the parameter space and the space of probability distributions. Pearson computed the first 5 moments of the mixture distribution, and the first 5 sample moments from the data, which gives a set of 5 equations in 5 unknowns. He then solved the equations, producing estimates for the unknown parameters. There were complications, since the
equations had more than one solution, and he seemed to indicate that there
was no biological reason to suppose that there were really two subpopulations
represented in the data, and that it was more reasonable to believe that the
underlying population was simply a single component with a non-normal dis-
tribution. Nevertheless, the exercise seems to have convinced Pearson that
he had a good idea. One can estimate the moments of a distribution by its
sample moments. The law of large numbers guarantees that if your sample
is large enough, the sample moments will be close to the true values. If
you have the moments in terms of some parameters, this yields a system of
equations which you can solve for estimates of those parameters.
Pearson developed this idea in the following years, eventually publishing
a set of tables (Biometrika tables for statisticians and biometricians, 1914)
which offered a method for estimation, at least in the families of distributions
he had developed.
These families of distributions arose from solutions of differential equations of the form
\[ \frac{f'(x)}{f(x)} = -\frac{x - a}{g(x)}. \]
To create a manageable catalogue of distributions, he considered expanding g(x) as a quadratic in the neighborhood of the point a where the numerator vanishes, yielding equations with a polynomial of at most degree 2 in the denominator:
\[ \frac{f'}{f} = -\frac{x - a}{c_0 + c_1 x + c_2 x^2}. \]
Since $f' = 0$ at one point $x = a$, the families are primarily unimodal distributions, though there are solutions that have 'J' or 'U' shaped density curves.
Pearson categorized the resulting distributions into 7 families or types
based on different constraints on the coefficients a, $c_0$, $c_1$, and $c_2$. You will
occasionally see references to a ‘Pearson Type III’ or a ‘Pearson Type IV’
distribution, especially in older texts and articles.
You can easily verify by di↵erentiating the normal density function that
c1 = c2 = 0 corresponds to a normal distribution.
It turns out to be more natural to work with the first four moments of the distributions, or scaled versions thereof. Let $\mu_k = E(X-\mu)^k$ be the k-th central moment. Pearson used the following versions of the first four moments: the mean $\mu$, the variance $\mu_2$, the skewness $\beta_1 = \mu_3^2/\mu_2^3$, and the kurtosis $\beta_2 = \mu_4/\mu_2^2$. You will also see $\gamma_1 = \sqrt{\beta_1}$ used as a measure of skewness, and $\gamma_2 = \beta_2 - 3$ for kurtosis (for the normal distribution, $\beta_2 = 3$).
Pearson's idea was to estimate the parameters $\mu$, $\sigma^2$, $\beta_1$, and $\beta_2$ using the corresponding sample moments. Let $\hat\mu = \bar X = \sum X_i/n$, where n is the number of observations, and
\[ \hat\mu_k = \sum (X_i - \bar X)^k/n. \]
Then
\[ \hat\beta_1 = \hat\mu_3^2/\hat\mu_2^3 \]
and
\[ \hat\beta_2 = \hat\mu_4/\hat\mu_2^2. \]
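In R, the sample versions take only a few lines; this helper is my sketch (the function name pearson.moments is mine, not Pearson's):

pearson.moments = function(x) {
  n = length(x)
  m = mean(x)
  mu = function(k) sum((x - m)^k)/n    # k-th sample central moment
  c(beta1 = mu(3)^2/mu(2)^3, beta2 = mu(4)/mu(2)^2)
}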
Rather than tackle that one now, I'll leave it as the proverbial exercise for the reader. Hint: for given values of $\alpha$, $\beta$, and $\gamma$, c is the normalizing constant that makes it a density: it must integrate to 1. If we knew those values, we could integrate numerically to get the corresponding values for $\mu$, $\sigma^2$, $\beta_1$ and $\beta_2$.
Assuming we knew the data came from a normal distribution, we would simply use $\hat\mu = .0019$ and $\hat\mu_2 = 1.11$. We will see later that the standard estimator for $\mu_2 = \sigma^2$ in the normal distribution is $s^2 = \sum (X_i - \bar X)^2/(n-1)$.
The method of moments may be used as I just did, without resorting to Pearson's table, if we are willing to assume that we know the family of distributions. Here is a simple example: Let $X_1, X_2, X_3, \ldots, X_n$ be an independent sample from a Beta($\alpha$, $\beta$) distribution. There are two unknown parameters, so we need to compute the first two moments:
\[ E(X) = \int_0^1 x\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\frac{\Gamma(\alpha+1)\,\Gamma(\beta)}{\Gamma(\alpha+\beta+1)} = \frac{\alpha}{\alpha+\beta} \]
We have made use of the fact that $\Gamma(x+1) = x\,\Gamma(x)$. Now we compute the second moment:
\[ E(X^2) = \int_0^1 x^2\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\frac{\Gamma(\alpha+2)\,\Gamma(\beta)}{\Gamma(\alpha+\beta+2)} = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} \]
We could stop here, and just use the sample moment $\sum X_i^2/n$, but it may be simpler to compute the variance:
\[ Var(X) = E(X^2) - (E(X))^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \]
This may not appear any simpler, but we will substitute $\bar X$ for $\mu = \alpha/(\alpha+\beta)$ and also in $1 - \mu = \beta/(\alpha+\beta)$. Your favorite symbolic algebra package would be useful here to solve for $\alpha$ and $\beta$ in terms of $\bar X$ and $\hat\sigma^2$. Here are the equations:
\[ \bar X = \frac{\alpha}{\alpha+\beta} \]
and
\[ \hat\sigma^2 = \frac{\alpha}{(\alpha+\beta)}\,\frac{\beta}{(\alpha+\beta)}\,\frac{1}{(\alpha+\beta+1)}. \]
The solutions are
\[ \hat\alpha = \bar X\,\frac{\bar X(1-\bar X) - \hat\sigma^2}{\hat\sigma^2} \]
and
\[ \hat\beta = \frac{\bar X(1-\bar X) - \hat\sigma^2}{\hat\sigma^2}\,(1-\bar X). \]
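Translated into R, the estimators are a few lines; a sketch (the function name beta.mom is mine):

beta.mom = function(x) {
  xbar = mean(x)
  s2 = mean((x - xbar)^2)          # method of moments variance (n denominator)
  k = (xbar*(1 - xbar) - s2)/s2    # estimates alpha + beta
  c(alpha = xbar*k, beta = (1 - xbar)*k)
}
beta.mom(runif(1000))              # should be near alpha = 1, beta = 1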
Here is a simulation using the Beta(1,1) or Uniform(0,1) distribution. Just for fun I will repeat the full computation of Pearson. These values for $\hat\beta_1$ and $\hat\beta_2$ put us in the Pearson Type I region where we belong, but in the J1 subset rather than on the left boundary. To finish the calculation we solve for $\hat\alpha$ and $\hat\beta$ as above.
2.1 Exercises
For each of the following situations, find the method of moments estimator.
6. $X_1, X_2, \ldots X_n$ are IID with density $f(x \mid \mu) = \frac{1}{\pi}\,\frac{1}{1+(x-\mu)^2}$.

7. $X_1, X_2, \ldots X_n$ are IID N($\mu$, $\tau$), where $\tau = 1/\sigma^2$ is the precision.
2.2 Comment
The real significance of Pearson’s work on the method of moments was that
it changed the focus of statistical theory. The tradition had been to compute
the mean and variance of a sample, and rely on the Central Limit Theorem to
allow inference about the mean of the unknown distribution. After Pearson,
people started to focus on estimation of the unknown distribution itself, not
just its mean.
Chapter 3
Least Squares
There were several attempts in the late 18th century to come to grips with
the problem of fitting equations to data. Hald’s wonderful text A History of
Mathematical Statistics from 1750 to 1930 mentions several investigations.
The basic problem arises when we have pairs of observations (Xi , Yi )
which appear to exhibit a linear pattern. The earliest methods included
ideas like picking out the ‘best’ observations, just enough to yield a consistent
system of equations, and summing or averaging subsets of the equations to
yield a consistent system.
Boscovitch in 1757 proposed to fit a line $Y_i = \beta_0 + \beta_1 X_i + e_i$ subject to the following conditions:
\[ \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) = 0 \]
(the sum of the positive deviations from the line should equal the sum of negative deviations from the line) and the sum of the absolute values of the deviations
\[ \sum_{i=1}^n |Y_i - \beta_0 - \beta_1 X_i| \]
should be a minimum. In modern terminology, let the i-th residual $r_i$ be
\[ r_i = Y_i - (\beta_0 + \beta_1 X_i); \]
then the two criteria are that the residuals sum to 0, $\sum r_i = 0$, and the sum of the absolute values of the residuals, $\sum |r_i|$, is minimum. Boscovitch proposed
the first criterion under the assumption that the distribution of measurement errors ($e_i$) or deviations from the line was symmetrical about 0. Hald reports that there was historical precedent for making the second assumption going back to Galileo. The first condition forces the fitted line to pass through the point $(\bar X, \bar Y)$, which may be verified directly:
\[ 0 = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) = \sum_{i=1}^n Y_i - n\beta_0 - \beta_1\sum_{i=1}^n X_i \]
That is, if $\beta_0$ and $\beta_1$ are coefficients that satisfy the criterion, then the point $(\bar X, \bar Y)$ must be on the line $Y = \beta_0 + \beta_1 X$.
If we now translate the coordinate system to put the origin at $(\bar X, \bar Y)$, we have a simpler minimization problem: choose $\beta_1$ to minimize
\[ F(\beta_1) = \sum_{i=1}^n |(Y_i - \bar Y) - \beta_1(X_i - \bar X)|. \]
\[ \hat\beta_1 = \frac{Y_L - Y_S}{X_L - X_S}. \]
Adding the requirement that the deviations sum to zero, forcing the line through $(\bar X, \bar Y)$ again, we now determine the intercept $\beta_0$ by solving
\[ \bar Y = \hat\beta_0 + \hat\beta_1\bar X. \]
Adrien-Marie Legendre was the first to publish the least squares criterion,
in 1806. He stated the model in a very modern and general form:
He then argued that minimizing the sum of the squares of the errors is both easier to do in general than earlier methods, and would tend to make the largest errors small.
For simplicity, we will consider the case of fitting a line again. Let $Q(\beta_0, \beta_1)$ be the quadratic function of the coefficients representing the sum of the squares of the residuals:
\[ Q(\beta_0, \beta_1) = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2. \]
\[ \frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_1} = -2\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)X_i \]
This leads to a system of two linear equations in two unknowns. The solution may be written in various forms, including
\[ \hat\beta_1 = \frac{\sum (Y_i - \bar Y)(X_i - \bar X)}{\sum (X_i - \bar X)^2} \qquad \hat\beta_0 = \bar Y - \hat\beta_1\bar X \]
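The closed forms are easy to check against R's lm(); a quick sketch with simulated data (the coefficients 1 and 2 are arbitrary):

x = rnorm(50)
y = 1 + 2*x + rnorm(50)
b1 = sum((y - mean(y))*(x - mean(x)))/sum((x - mean(x))^2)
b0 = mean(y) - b1*mean(x)
c(b0, b1)
coef(lm(y ~ x))    # agrees with the closed form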
3.1 The Modern Linear Model

The modern linear model assumes that the observations satisfy
\[ E(Y_i) = \beta_0 + \beta_1 X_{i,1} + \cdots + \beta_{p-1} X_{i,p-1} \]
and
\[ Var(Y_i) = \sigma^2. \]
Notice that this rules out models with random parameters ($\beta$'s) and $X_{i,j}$'s measured with error. The former are often called random coefficient models, but you may also hear the terms hierarchical linear models or random effects models. When the $X_{i,j}$'s are random variables rather than known constants we have the measurement error model.
It will greatly facilitate our discussion to translate the model into matrix notation as follows:
\[ \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_{1,1} & X_{1,2} & \ldots & X_{1,p-1} \\ 1 & X_{2,1} & X_{2,2} & \ldots & X_{2,p-1} \\ 1 & X_{3,1} & X_{3,2} & \ldots & X_{3,p-1} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & X_{n,1} & X_{n,2} & \ldots & X_{n,p-1} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{pmatrix} \]
or
\[ Y = X\beta + \epsilon \]
where Y is $n \times 1$, X is $n \times p$, $\beta$ is $p \times 1$, and $\epsilon$ is $n \times 1$. The assumption that the $\epsilon$'s are IID N(0, $\sigma^2$) can be restated in terms of the multivariate normal distribution as
\[ \epsilon \sim N(0, \sigma^2 I_n). \]
X is known as the design matrix, since in an experimental context, it repre-
sents the chosen levels of the explanatory factors.
3.2 Parameter Estimation by Least Squares

The fitted values are $\hat Y = X\hat\beta$, where least squares chooses $\hat\beta$ to minimize $\|Y - X\beta\|^2$. Setting the derivative with respect to $\beta$ to zero yields the normal equations:
\[ X'(Y - X\beta) = 0 \]
\[ X'Y - X'X\beta = 0 \]
\[ X'Y = X'X\beta. \]
Assuming that $X'X$ has full rank (p), that is, X has rank p, then the solution is
\[ \hat\beta = (X'X)^{-1}X'Y. \]
The fitted values are
\[ \hat Y = X\hat\beta = X(X'X)^{-1}X'Y \]
and the residuals are
\[ r = Y - \hat Y = Y - X(X'X)^{-1}X'Y. \]
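The matrix formula can be evaluated directly in R; a sketch (in practice lm() uses a QR decomposition rather than forming $X'X$):

X = cbind(1, matrix(rnorm(200), 100, 2))   # design matrix with intercept column
beta = c(1, 2, -3)
Y = X %*% beta + rnorm(100)
betahat = solve(t(X) %*% X, t(X) %*% Y)    # solves the normal equations
cbind(betahat, coef(lm(Y ~ X - 1)))        # same estimates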
The residual standard error is the square root of the estimate of the residual variance $\sigma^2$:
\[ s^2 = \frac{RSS}{n-p} = \frac{\sum r_i^2}{n-p}. \]
We will show that $RSS \sim \sigma^2\chi^2_{n-p}$, so this is an unbiased estimator for $\sigma^2$.
Finally, note that we can write the fitted values as a linear combination of the columns of X:
\[ \hat Y = X\hat\beta = \hat\beta_0\mathbf{1} + \hat\beta_1 X_1 + \ldots + \hat\beta_{p-1} X_{p-1}. \]
\[ V = U + W \]
where $U \in \mathcal{U}$ and $W \in \mathcal{W}$:
\[ P_{\mathcal{U}}(V) = P_{\mathcal{U}}(U + W) = U. \]
\[ Y = \hat Y + r \]
where
\[ \hat Y = X\hat\beta = X(X'X)^{-1}X'Y \]
and
\[ r = Y - \hat Y = Y - X\hat\beta = Y - X(X'X)^{-1}X'Y. \]
Let $H = X(X'X)^{-1}X'$; then
\[ H^2 = (X(X'X)^{-1}X')(X(X'X)^{-1}X') = X(X'X)^{-1}X' = H \]
and
\[ H(I - H) = H - H^2 = H - H = 0. \]
\[ E(AY) = A\,E(Y) \]
\[ E(\hat\beta) = E((X'X)^{-1}X'Y) = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta. \]
Using the fact that $X'X$ is a symmetric matrix, and thus so is its inverse, the covariance matrix of $\hat\beta$ is
\[ Cov(\hat\beta) = Cov((X'X)^{-1}X'Y) = (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} = \sigma^2(X'X)^{-1}(X'X)(X'X)^{-1} = \sigma^2(X'X)^{-1}. \]
Thus the standard error of a coefficient $\hat\beta_j$ is the square root of the corresponding diagonal entry in the matrix $\sigma^2(X'X)^{-1}$:
\[ SE(\hat\beta_j) = s\sqrt{(X'X)^{-1}_{j,j}} \]
where
\[ s^2 = \frac{RSS}{n-p} = \frac{\sum r_i^2}{n-p} \]
is the unbiased estimator of $\sigma^2$.
\[ \hat Y = HY = X(X'X)^{-1}X'Y. \]
\[ \mu = E(HY) = H\,E(Y) = HX\beta = X\beta \]
and
\[ \Sigma = Cov(HY) = H(\sigma^2 I)H' = \sigma^2 HH' = \sigma^2 H^2 = \sigma^2 H. \]
Thus the variance of a fitted value $\hat Y_i$ is $\sigma^2$ multiplied by the i-th diagonal element of H, and the standard error of a fitted value is
\[ SE(\hat Y_i) = s\sqrt{H_{ii}}. \]
\[ Var(x'\hat\beta) = x'\,Cov(\hat\beta)\,x = \sigma^2 x'(X'X)^{-1}x. \]
3.4.3 Predictions

In the previous section we computed the variance of a fitted value at the location x. If we are interested in the prediction error, i.e. the error in making a prediction based on the fitted value, then we must also take into account variation about the value. Since the variance about the mean value is $\sigma^2$, the total variance is (assuming the new observation is independent of the fitted values) the sum of the variance about the mean, plus the variance of the estimate of the mean:
\[ \sigma^2 + \sigma^2 x'(X'X)^{-1}x \]
or
\[ \sigma^2(1 + x'(X'X)^{-1}x). \]
3.4.4 Residuals

The residuals are also linear functions of the observations:
\[ r = Y - \hat Y = (I - H)Y \]
\[ Y = \hat Y + r = HY + (I - H)Y \]
So the RSS is just $\|(I-H)\epsilon\|^2$, the squared length of the projection of n independent N(0, $\sigma^2$) random variables into an $(n-p)$-dimensional subspace. Choose an orthonormal basis $\{v_i\}_{i=1\ldots n}$ for $R^n$ so that $\{v_i\}_{i=1\ldots p}$ span the column space of X, and thus $\{v_i\}_{i=p+1\ldots n}$ span the orthogonal complement. Since the change of basis matrix $Q = (Q_1, Q_2)$ is an orthogonal matrix, $Q'$ is norm preserving:
\[ \|(I-H)Y\|^2 = \|Q_2'(I-H)Y\|^2. \]
But since $Q_2$ is orthogonal to H, it follows that
\[ Q_2'(I-H)Y = Q_2'(I-H)\epsilon = Q_2'\epsilon. \]
Thus
\[ \|(I-H)Y\|^2 = \|Q_2'\epsilon\|^2. \]
In other words, the RSS has the same distribution as the sum of squares of $n-p$ independent N(0, $\sigma^2$) random variables.
There are two important results imbedded in this discussion:
3.5.1 t statistics

We observed earlier that the parameter vector $\hat\beta$ has a normal distribution:
\[ \hat\beta \sim N(\beta, \sigma^2(X'X)^{-1}). \]
We saw in the last section that the residuals are independent of the parameter estimates and fitted values, so the numerator and denominator of the ratio
\[ \frac{\hat\beta_j - \beta_j^0}{SE(\hat\beta_j)} \]
are independent. Writing it out,
\[ \frac{\hat\beta_j - \beta_j^0}{SE(\hat\beta_j)} = \frac{\hat\beta_j - \beta_j^0}{s\sqrt{(X'X)^{-1}_{j,j}}} = \frac{(\hat\beta_j - \beta_j^0)\Big/\left(\sigma\sqrt{(X'X)^{-1}_{j,j}}\right)}{\sqrt{s^2/\sigma^2}} \]
Under the null hypothesis $\beta_j = \beta_j^0$ the numerator is a standard normal, and the denominator is the square root of an independent $\chi^2_{n-p}$ random variable divided by its degrees of freedom, so the ratio has a $t_{n-p}$ distribution.
3.5.2 F statistics

The t-statistic can be used to test hypotheses about single parameters. What if we wish to test the hypothesis that several parameters are all equal to zero?

The key idea is that we are really comparing two regression models to see if adding or deleting some explanatory variables or, more generally, imposing some constraints on parameters, leads to a 'surprisingly large' change in the residual sum of squares (RSS). If it does, then the smaller, or restricted, model is judged to fit worse than the larger, or unrestricted, model.

For example, in the full regression model
\[ Y = \beta_0\mathbf{1} + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon \]
we might test the hypothesis $\beta_2 = \beta_3 = 0$, corresponding to the restricted model
\[ Y = \beta_0\mathbf{1} + \beta_1 X_1 + \epsilon. \]
Let $H_F$ and $H_R$ be the projections for the full and restricted models. Since the restricted column space is contained in the full one, $H_F H_R = H_R H_F = H_R$, so
\[ (H_F - H_R)^2 = H_F^2 - H_F H_R - H_R H_F + H_R^2 = H_F - H_R - H_R + H_R = H_F - H_R \]
and $H_F - H_R$ is also a projection. Decomposing Y under each model,
\[ \|Y\|^2 = \|H_F Y\|^2 + \|(I - H_F)Y\|^2 \]
and
\[ \|Y\|^2 = \|H_R Y\|^2 + \|(I - H_R)Y\|^2. \]
Subtracting the second equation from the first gives
\[ \|H_F Y\|^2 - \|H_R Y\|^2 = \|(I - H_R)Y\|^2 - \|(I - H_F)Y\|^2. \]
Thus
\[ F = \frac{\|(H_F - H_R)Y\|^2/2}{\|(I - H_F)Y\|^2/(n-4)} \]
where 2 is the number of restrictions and $n-4$ is the residual degrees of freedom for the full model.
In the case where the restricted model corresponds to setting one or more
coefficients to zero (i.e. omitting the corresponding variables), rejecting the
null hypothesis that those parameters are all zero implies that at least one is
non-zero (or we have been unlucky). It does not imply that the full model is
correct, simply that the restricted model does not fit the data as well as the
full model.
3.5.3 $t^2$ is an F statistic!

The t-test discussed earlier turns out to be a special case of the F-test, corresponding to omitting a single variable. Let Z be a standard normal random variable, and let X be an independent $\chi^2_{n-p}$ random variable; then $Z/\sqrt{X/(n-p)}$ has a $t_{n-p}$ distribution, and
\[ t_{n-p}^2 \sim \frac{Z^2/1}{\chi^2_{n-p}/(n-p)} \sim \frac{\chi^2_1/1}{\chi^2_{n-p}/(n-p)} \sim F_{1,n-p}. \]
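The identity is easy to see in R output: the squared t statistic for a single coefficient equals the F statistic comparing the full model to the model omitting that variable. A sketch (the variable names are mine):

x1 = rnorm(50); x2 = rnorm(50)
y = 1 + x1 + x2 + rnorm(50)
full = lm(y ~ x1 + x2)
summary(full)$coefficients["x2", "t value"]^2
anova(lm(y ~ x1), full)    # the F statistic equals the squared t statistic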
The t-test for the null hypothesis $\beta_k = 0$ corresponds exactly to the F-test arising from the comparison of the full model with all explanatory variables to a restricted model omitting the single variable $X_k$. Thus it is conditional on the model containing all variables included in the full model. In particular, suppose we have two t-tests which fail to reject the null hypothesis, say for $\beta_1 = 0$ and $\beta_2 = 0$. What can we conclude? Not much! We have tested $\beta_1$ assuming that $\beta_2$ is not zero, then tested $\beta_2$ assuming that $\beta_1$ is not zero. We cannot conclude that neither variable predicts the response! To test that hypothesis, we need an F-test. If $X_1$ and $X_2$ are highly correlated with each other, it is quite possible that the model will fit well if either one is included. When both are included, neither one appears important viewed through the lens of the t-test, because the other is present to do the work.
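Here is a sketch of that situation in simulation: two nearly collinear predictors, individually unimpressive t statistics, but a decisive F test against the intercept-only model.

z = rnorm(100)
x1 = z + 0.1*rnorm(100)    # x1 and x2 are nearly identical
x2 = z + 0.1*rnorm(100)
y = x1 + x2 + rnorm(100)
m = lm(y ~ x1 + x2)
summary(m)                 # neither t-test need look significant
anova(lm(y ~ 1), m)        # the F test rejects beta1 = beta2 = 0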
Finally, it is important to remember that the F test is not a test of
‘goodness of fit’; it is a model comparison. Our test is conditional on the
model being correct - the relationships are linear, all the explanatory variables
that are related to Y have been included in the model, and the error term ✏
has the assumed properties. That’s a lot to assume!
3.6 The Multiple Correlation Coefficient: $R^2$

\[ Y = \hat Y + r \]
where the terms on the right hand side are uncorrelated, since
\[ r = (I - H)Y \]
is orthogonal to
\[ \hat Y = HY = X\hat\beta. \]
Thus the Pythagorean theorem applies, and after subtracting out the projection on the constant vector $\mathbf{1}$, which is in the column span of X, we have
\[ \sum (Y_i - \bar Y)^2 = \sum (\hat Y_i - \bar Y)^2 + \sum r_i^2. \]
If we associate the left hand term with the total variance of Y, and the terms on the right with variance explained by the regression line and unexplained variance, respectively, then
\[ \frac{\sum (\hat Y_i - \bar Y)^2}{\sum (Y_i - \bar Y)^2} = 1 - \frac{\sum r_i^2}{\sum (Y_i - \bar Y)^2}. \]
In other words, the ratio is the proportion of variance of the original variable Y which is 'explained' by the regression. Clearly it lies between 0 and 1. It is possible to show that this is equal to the square of the correlation between Y and $\hat Y$:
\[ Cor(Y, \hat Y) = \frac{\sum (Y_i - \bar Y)\hat Y_i}{\sqrt{\sum (Y_i - \bar Y)^2\,\sum (\hat Y_i - \bar Y)^2}}. \]
Hence the name multiple correlation coefficient, or $R^2$. Since it is interpretable as the proportion of variance explained, it is also sometimes called the coefficient of determination, and interpreted as a measure of goodness of fit. One must be careful, as $R^2$ can be close to 1 even when the model is demonstrably wrong.

Warning: If your regression model does not include the intercept term, then the definition of $R^2$ is slightly different, since we are not subtracting out the projection on the vector $\mathbf{1}$:
\[ \frac{\sum \hat Y_i^2}{\sum Y_i^2} = 1 - \frac{\sum r_i^2}{\sum Y_i^2} \]
which as before must be in (0, 1). It is common to find software that computes $R^2$ incorrectly in this situation, and you will occasionally see people reporting values of $R^2$ that are not between 0 and 1. They are in error.
3.7 Exercises

1. The QR decomposition.
Let X be an $n \times p$ matrix with rank p. Suppose we can find an $n \times p$ matrix Q such that $Q^tQ$ is a $p \times p$ identity matrix I, and an upper triangular $p \times p$ matrix R of rank p such that X = QR. For the linear model
\[ Y = X\beta + \epsilon \]
2. Householder transformations.
Let v be a non-zero vector in $R^n$. Show that the rank 1 modification of the identity matrix defined by
\[ T = I - 2\frac{vv^t}{v^tv} \]
is symmetric and orthogonal. Show that if we choose $v = x \pm \|x\|e_1$ then $Tx = \mp\|x\|e_1$. Draw a sketch in $R^2$ illustrating the two choices.
\[ Y = X\beta + \epsilon \]
\[ B^tY = B^tX\beta + B^t\epsilon. \]
Find the least squares estimates $\hat\beta_{GLS}$ for the transformed problem. Compare the covariance matrix for $\hat\beta_{GLS}$ to the covariance matrix for $\hat\beta_{OLS}$.
$M_R$: Rate ~ LIM.
(b) Using the extraction function residuals(), compute the sum of the
squared residuals for each model. Verify the computations for
the F statistic given by the anova() function comparing the two
models.
(c) What null hypothesis is tested by the F test here?
(d) Which coefficients in $M_F$ are statistically significantly different from 0?
x = rnorm(100)           # common signal shared by X and Z
X = x + rnorm(100)
Z = x + rnorm(100)       # Z is correlated with X through x
Y = X - 3*Z + rnorm(100)

x = rnorm(100)
X = x + rnorm(100)
Z = rnorm(100) + rnorm(100)   # Z is now independent of X
Y = X - 3*Z + rnorm(100)
and fit the same two models. Explain the changes in the coefficients of X and the residual variance $s^2$ under each scenario. It may help to compute variances or correlations for the variables.
7. Errors in Variables.
Load the Birthweight dataset into your R session. The variable smoke
is a ‘dummy’ variable or indicator variable; it is 1 for smokers and 0
for non-smokers.
(a) Fit the model bwt ⇠ gestation + smoke, and explain why it rep-
resents two parallel lines. What does the coefficient for smoke
represent?
(b) Make a scatterplot of bwt vs. gestation. You should observe
some peculiar points: very low gestation values with normal birth-
weights. What is the probable explanation for these points?
(c) Suppose that we observe gestation with error. We can simulate
the effects of measurement error by creating a new variable which
is gestation plus some noise, say with standard deviations of 10,
20, and 30 days.
G1 = gestation + rnorm(length(gestation),0,10)
G2 = gestation + rnorm(length(gestation),0,20)
G3 = gestation + rnorm(length(gestation),0,30)
Refit the model substituting each of these for gestation. What are the consequences of the different levels of measurement error for the estimated coefficients and the residual standard error?
What is the relationship between the models m2 and m3 and their esti-
mated coefficients? Plot the residuals against the fitted values for each
model. Which residual plot corresponds best to a linear relationship
with constant variance?
Chapter 4

Bayes, Laplace, and Inverse Probability
4.1 Bayes
In elementary probability we learn a theorem due to the Rev. Thomas Bayes
(1701-1767), which gives a mechanism for reversing the direction of condi-
tioning.
\[ P(A_j \mid B) = \frac{P(B \mid A_j)\,P(A_j)}{\sum_i P(B \mid A_i)\,P(A_i)}. \]
In his paper, Bayes deduced the theorem for the special case of inferring the probability of some event from a sequence of independent trials. Let X be Binomial(n, p), where p is unknown. He argued that, at least in the context he was considering (complete uncertainty about the probability p), it makes sense to represent our uncertainty about p by a uniform distribution on the interval [0, 1]. Bayes wanted to find the conditional distribution of the unknown parameter p, given observed data X. In modern notation this is
\[ \pi(p \mid X = k) = \frac{P(X = k \mid p)\,\pi(p)}{\int_0^1 P(X = k \mid p)\,\pi(p)\,dp} = \frac{\binom{n}{k}p^k(1-p)^{n-k}\,\pi(p)}{\int_0^1 \binom{n}{k}p^k(1-p)^{n-k}\,\pi(p)\,dp} = \frac{\binom{n}{k}p^k(1-p)^{n-k}}{\int_0^1 \binom{n}{k}p^k(1-p)^{n-k}\,dp} = \frac{p^k(1-p)^{n-k}}{\int_0^1 p^k(1-p)^{n-k}\,dp} \]
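With the uniform prior the posterior is a Beta(k+1, n-k+1) density, which R can draw directly; a sketch with arbitrary values n = 20 and k = 14:

n = 20; k = 14
curve(dbeta(p, k + 1, n - k + 1), 0, 1, xname = "p",
      ylab = "posterior density")    # Beta(k+1, n-k+1)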
4.2 Laplace
It is not clear whether Laplace knew Bayes’ result; he was certainly capable
of deducing it himself, and made no mention of Bayes in his own writings. In
any case, in 1774 Laplace published a paper on the same problem Bayes had
studied, and like Bayes, proceeded to assume a uniform prior for p. Laplace's
argument for the uniform distribution in this and similar cases has come to
be known as the Principle of Insufficient Reason: if there is no reason to
believe that one possibility is more likely than any other, assume they have
equal probability. Laplace reframed the problem in a clever way allowing an
exact solution, and then tackled the problem of showing that the posterior
probability for any interval centered on the ‘true’ p has in the limit (as the
sample size increases) probability 1. He produced a marvelous approximation
for the resulting integral, an idea (remarkably) known today as ‘Laplace
approximation’.
One of Laplace's inspirations was to ask a slightly different question: suppose that we have observed k successes in n independent trials; what is our best guess of the probability that the next trial will be a success? We might view this as simply asking for an estimate of p, but the more radical interpretation is that Laplace transformed the problem into a prediction problem. He proposed $E(p \mid X = k)$ as the answer, and showed how to compute it. Starting with the posterior density
\[ \pi(p \mid X = k) = \frac{p^k(1-p)^{n-k}}{\int_0^1 p^k(1-p)^{n-k}\,dp} \]
the desired expectation is
\[ \int_0^1 p\,\pi(p \mid X = k)\,dp = \int_0^1 p\,\frac{p^k(1-p)^{n-k}}{\int_0^1 p^k(1-p)^{n-k}\,dp}\,dp = \frac{\int_0^1 p^{k+1}(1-p)^{n-k}\,dp}{\int_0^1 p^k(1-p)^{n-k}\,dp} \]
\[ = \frac{B(k+2,\,n-k+1)}{B(k+1,\,n-k+1)} = \frac{k+1}{n+2}. \]
The obvious estimator for p, $\hat p = X/n$, was known at the time. If we take $\hat p_1 = E(p \mid X = k)$ as an estimator for p, we have a very similar estimate, and with large n the difference will be trivial. Still, there are a couple of issues worth noting. First, from a frequentist perspective, the standard estimator X/n is unbiased: $E(X/n) = p$. In contrast,
\[ E(\hat p_1) = E\left(\frac{X+1}{n+2}\right) = \frac{np+1}{n+2} \]
so $\hat p_1$ is not unbiased; it will be closer to 1/2 than X/n. Second, we can interpret the estimator as incorporating pseudo-observations: we have added two trials, one success and one failure. This expresses mild skepticism about the certainty of prediction: even after 20 failures with no successes we would still refuse to estimate the probability as 0; instead we would use 1/22. Finally, we can rewrite $\hat p_1$ in terms of $\hat p = X/n$:
\[ \hat p_1 = \frac{X+1}{n+2} = \hat p + \frac{1 - 2\hat p}{n+2}. \]
This makes it clear that the two estimators are asymptotically equivalent: $\hat p_1 - \hat p \sim O(1/n)$.
Laplace then tackled the problem of showing that when n is large, the probability that the posterior distribution assigns to any symmetric interval around the true p goes to 1. In other words, letting $p_0$ be the true value, for any $\epsilon > 0$
\[ \lim_{n\to\infty}\int_{p_0-\epsilon}^{p_0+\epsilon}\pi(p \mid X)\,dp = 1. \]
While this looks like a version of the Law of Large Numbers, as it implies $\hat p_1 \to_P p_0$, imbedded in his proof is essentially a central limit theorem, or normal approximation for the beta distribution. An outline of his argument follows.
where c is
\[ c = \frac{1}{B(k+1,\,n-k+1)} = \frac{\Gamma(n+2)}{\Gamma(k+1)\,\Gamma(n-k+1)}. \]
\[ z = \frac{p - p^*}{\sqrt{p^*(1-p^*)/n}} \]
where
\[ x = \frac{\sqrt n\,\epsilon}{\sqrt{p^*(1-p^*)}}. \]
Laplace was the first to evaluate this integral, and show that it equals $\sqrt{2\pi}$.
\[ \frac{(n+1)!}{k!\,(n-k)!}\,(k/n)^k(1-k/n)^{(n-k)}\,\sqrt{p^*(1-p^*)/n}\,\sqrt{2\pi}. \]
Laplace did still more in his 1774 paper, but we will stop here.
One may derive Stirling's approximation for the gamma function using Laplace's method:
\[ \Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx = \int_0^\infty e^{(\alpha-1)\log x - x}\,dx. \]
As before, find the mode of the function $f(x) = (\alpha-1)\log x - x$, and expand f in a Taylor series around that point. I'll leave the rest as an exercise for the reader.
\[ \pi(p \mid Y) = \frac{p^{Y+r-1}(1-p)^{n-Y+s-1}}{\int_0^1 p^{Y+r-1}(1-p)^{n-Y+s-1}\,dp} = \frac{p^{Y+r-1}(1-p)^{n-Y+s-1}}{B(Y+r,\,n-Y+s)} = \frac{\Gamma(n+r+s)}{\Gamma(Y+r)\,\Gamma(n-Y+s)}\,p^{Y+r-1}(1-p)^{n-Y+s-1} \]
Interesting! We get yet another beta distribution for the posterior. In fact, the Bayes/Laplace computation is just the special case r = s = 1.
\[ \pi(\mu \mid x) = \frac{e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}}{\int e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}\,d\mu} = \frac{e^{-\frac{1}{2}\left(\frac{(x-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2}\right)}}{\int e^{-\frac{1}{2}\left(\frac{(x-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2}\right)}\,d\mu} \]
Work on the exponent:
\[ \frac{(x-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\sigma_0^2} = \frac{\sigma_0^2(x^2 - 2x\mu + \mu^2) + \sigma^2(\mu^2 - 2\mu\mu_0 + \mu_0^2)}{\sigma_0^2\,\sigma^2} = \frac{\mu^2(\sigma_0^2 + \sigma^2) - 2\mu(\sigma_0^2 x + \sigma^2\mu_0) + (\sigma_0^2 x^2 + \sigma^2\mu_0^2)}{\sigma_0^2\,\sigma^2} \]
The last term in the sum does not involve $\mu$, and thus will cancel the corresponding factor in the denominator. We now complete the square in $\mu$, after dividing numerator and denominator by $(\sigma_0^2 + \sigma^2)$:
\[ \frac{\mu^2(\sigma_0^2 + \sigma^2) - 2\mu(\sigma_0^2 x + \sigma^2\mu_0)}{\sigma_0^2\,\sigma^2} = \frac{\mu^2 - 2\mu\left(\frac{\sigma_0^2}{\sigma_0^2+\sigma^2}x + \frac{\sigma^2}{\sigma_0^2+\sigma^2}\mu_0\right)}{\sigma_0^2\,\sigma^2/(\sigma_0^2+\sigma^2)}. \]
we have a very simple expression for the posterior precision in terms of the precision of the data, $\tau$, and the prior precision $\tau_0$:
\[ \tau_1 = \tau + \tau_0. \]
We expand and simplify the sum, while adding and subtracting the mean $\bar x$:
\[ \sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar x + \bar x - \mu)^2 = \sum_{i=1}^n (x_i - \bar x)^2 + \sum_{i=1}^n (\bar x - \mu)^2 + 2\sum_{i=1}^n (x_i - \bar x)(\bar x - \mu) \]
\[ = \sum_{i=1}^n (x_i - \bar x)^2 + n(\bar x - \mu)^2 + 2(\bar x - \mu)\sum_{i=1}^n (x_i - \bar x) = \sum_{i=1}^n (x_i - \bar x)^2 + n(\bar x - \mu)^2 \]
since $\sum (x_i - \bar x) = 0$.
Finally, invoking the good name of Bayes, and assuming a N($\mu_0$, $\sigma_0^2$) prior,
\[ \pi(\mu \mid \tilde x) = \frac{e^{-\frac{1}{2}\frac{n(\bar x-\mu)^2}{\sigma^2}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}}{\int e^{-\frac{1}{2}\frac{n(\bar x-\mu)^2}{\sigma^2}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}\,d\mu} = \frac{e^{-\frac{1}{2}\frac{(\bar x-\mu)^2}{\sigma^2/n}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}}{\int e^{-\frac{1}{2}\frac{(\bar x-\mu)^2}{\sigma^2/n}}\,e^{-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}}\,d\mu} \]
Observe that this is exactly the expression we evaluated for the case of a single observation, except that we have $\bar x$ replacing x, and $\sigma^2/n$ replacing $\sigma^2$. In fact, we have the same expression we would have if we had started by recording only $\bar x$, and discarding the individual values $x_i$, since the mean $\bar x$ of n independent N($\mu$, $\sigma^2$) observations has a N($\mu$, $\sigma^2/n$) distribution. That may seem like a convenient accident at the moment, but we will see later that this is an example of a much deeper concept. In any case, we can simply quote our earlier result: $(\mu \mid \tilde x) \sim N(\mu_1, \sigma_1^2)$, where
\[ \mu_1 = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2/n}\,\bar x + \frac{\sigma^2/n}{\sigma_0^2 + \sigma^2/n}\,\mu_0 \]
and
\[ \sigma_1^2 = \frac{\sigma_0^2\,\sigma^2/n}{\sigma_0^2 + \sigma^2/n}. \]
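The updating formulas are two lines of R; a sketch (the function name and the test values are mine):

normal.posterior = function(xbar, n, sigma2, mu0, sigma02) {
  w = sigma02/(sigma02 + sigma2/n)          # weight on the sample mean
  c(mean = w*xbar + (1 - w)*mu0,
    var = sigma02*(sigma2/n)/(sigma02 + sigma2/n))
}
normal.posterior(xbar = 1.2, n = 25, sigma2 = 4, mu0 = 0, sigma02 = 1)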
4.4 Exercises
For each of the following distributions, try to find a conjugate or other con-
venient prior distribution. For that prior, produce an estimator such as the
posterior mean, and find the posterior variance.
3. X is Negative Binomial(r,p).
\[ \mu \sim p_1\,n(\mu \mid \alpha_1, \tau_1) + p_2\,n(\mu \mid \alpha_2, \tau_2) \]
Chapter 5

R. A. Fisher and Maximum Likelihood

5.1 Background
The most probable set of values for the $\theta$'s will make P a maximum.
Implicit in his argument is the fact that P is a function of $\theta$ and that $P(\theta)$ is maximized at the same point that $\log P(\theta)$ is maximized. You will note that Fisher is using the language of 'inverse probability', as had been common since Laplace, to describe the criterion, now called maximum likelihood. Fisher didn't invent the term likelihood until 1921, to distinguish his method from inverse probability.
He went on to illustrate the method in the case of independent observations from a N($\mu$, $\sigma^2$) distribution. Let $x_1, x_2, x_3, \ldots x_n$ be our sample; then (in more modern notation)
\[ \log f(x_1, x_2, x_3, \ldots x_n \mid \mu, \sigma^2) = \sum_{i=1}^n \left(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2}\frac{(x_i-\mu)^2}{\sigma^2}\right). \]
We will differentiate with respect to the parameters $\mu$ and $\sigma^2$, and solve for the maximum. Let the log likelihood function be denoted $\log L(\mu, \sigma^2)$; then
\[ \frac{\partial}{\partial\mu}\log L(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right) \]
which is zero at $\hat\mu = \bar x$. Similarly
\[ \frac{\partial}{\partial\sigma^2}\log L(\mu, \sigma^2) = -\frac{1}{2}\frac{n}{\sigma^2} + \frac{1}{2}\frac{\sum_{i=1}^n (x_i-\mu)^2}{(\sigma^2)^2} \]
and
\[ 0 = -\frac{1}{2}\frac{n}{\sigma^2} + \frac{1}{2}\frac{\sum_{i=1}^n (x_i-\mu)^2}{(\sigma^2)^2} \implies \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2. \]
Substituting $\hat\mu = \bar x$ gives
\[ \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2. \]
This same strategy works in most cases, though you must be careful to consider the possibility that the maximum occurs at the boundary of the parameter space.
Note that we differentiated with respect to $\sigma^2$, not $\sigma$. Would we get a different answer if we took $\sigma$ as our parameter?
\[ \frac{\partial}{\partial\sigma}\log L(\mu, \sigma) = \frac{\partial}{\partial\sigma}\left(-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\sum_{i=1}^n \frac{(x_i-\mu)^2}{\sigma^2}\right) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (x_i-\mu)^2}{\sigma^3} \]
\[ \implies \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2 \implies \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2} \]
We get a consistent result. This turns out to be true for any one-to-one transformation of the parameters: if $q(\theta)$ is one-to-one then $\hat q = q(\hat\theta)$. This is a nice property, and does not hold for all estimation criteria.
The Weibull($\alpha$, $\beta$) distribution has density
\[ f(x \mid \alpha, \beta) = \frac{\alpha x^{\alpha-1}}{\beta^\alpha}\,e^{-(x/\beta)^\alpha}\,I(x > 0). \]
The likelihood equations are not analytically solvable. The example below illustrates the use of the R nlm() function (non-linear minimization) for numerical solution of the likelihood equations. The algorithm is a variant of Newton's method, applied to minimize the negative of the log likelihood function.
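The function minimized in the session below is not shown in these notes; a minimal reconstruction would look something like the following (the name nll and the starting values are my guesses):

nll = function(theta, x)    # negative log likelihood, theta = (alpha, beta)
  -sum(dweibull(x, shape = theta[1], scale = theta[2], log = TRUE))
fit = nlm(nll, p = c(1, 10), x = X, hessian = TRUE)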
> X
[1] 3.241291 4.714366 3.248232 4.607278 7.080301 3.946579 5.616528
[8] 11.164167 8.039719 5.578329 8.558086 23.018470 20.225088 5.486868
[15] 2.458238 14.914739 6.503793 7.425430 14.434551 8.903616
$estimate
[1] 1.675988 9.556432

$gradient
[1] 8.182324e-07 2.327227e-07

$hessian
          [,1]       [,2]
[1,] 14.878803 -1.0205698
[2,] -1.020570  0.6148619

$code
[1] 1

$iterations
[1] 9
> sqrt(.07)
[1] 0.2645751
> sqrt(1.8)
[1] 1.341641
> contour(a,b,M,levels=c(59,60,61,62,63,64,65,67,70))
> curve(dweibull(x,1.67,9.55),0,30,col="red")
> rug(X,col="blue")
[Figure: contour plot of the negative log likelihood; the horizontal axis is alpha.]
The contour plot above displays the contours of the negative log likeli-
hood function. This serves as a diagnostic, since the shape of the likelihood
function is related to the precision of estimation. The following plot is the
Weibull density for the estimated parameters, with the observations marked
below the horizontal axis.
[Figure: the Weibull density dweibull(x, 1.67, 9.55), plotted for x from 0 to 30.]
Thus we have
\[ -E\left(\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\right) = I(\theta). \]
Recalling from Laplace's method that the resulting normal approximation has as its variance the reciprocal of the negative second derivative of the log likelihood at the posterior mode (the MLE!), we see a hint of things to come.
The random variable U(X) is a function of a single random variable. If we have a sample of independent, identically distributed random variables from a regular family, then we can apply the central limit theorem. If $X_1, X_2, \ldots X_n$ are an IID sample from a regular family, then $U(X_1), U(X_2), \ldots U(X_n)$ are IID with mean 0 and variance $I(\theta)$, and the central limit theorem states that
\[ \frac{\frac{1}{n}\sum U_i - 0}{\sqrt{I(\theta)/n}} \xrightarrow{D} N(0, 1). \]
For IID sampling, the joint likelihood factors into the product of the marginal likelihoods, and thus the joint log likelihood is the sum of the marginal log likelihoods, so when we differentiate with respect to $\theta$ we get:
\[ Var\left(\frac{\partial}{\partial\theta}\log f(x_1, x_2, \ldots, x_n \mid \theta)\right) = Var\left(\frac{\partial}{\partial\theta}\sum \log f(x_i \mid \theta)\right) = Var\left(\sum \frac{\partial}{\partial\theta}\log f(x_i \mid \theta)\right) \]
\[ = \sum Var\left(\frac{\partial}{\partial\theta}\log f(x_i \mid \theta)\right) = nI(\theta) \]
We will sketch the proof, omitting a few technical details. Consider the function $U_i = \frac{\partial}{\partial\theta}\log f(x_i \mid \theta)$ as a function of both $x_i$ and $\theta$, and expand $U_i(\hat\theta)$ around the true $\theta$. Summing the $U_i$, we have
\[ \sum \frac{\partial}{\partial\theta}\log f(x_i \mid \hat\theta) = \sum \frac{\partial}{\partial\theta}\log f(x_i \mid \theta) + \left(\sum \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \theta)\right)(\hat\theta - \theta) + \text{etc.} \]
Since $\hat\theta_n$ is the MLE, the left hand side of the equation equals zero. Disregarding the higher order terms, we have
\[ 0 = \sum U_i + \left(\sum \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \theta)\right)(\hat\theta - \theta). \]
Since the expected value of the terms in the second derivative is $-I(\theta)$, the LLN implies that
\[ \frac{1}{n}\sum \frac{\partial^2}{\partial\theta^2}\log f(x_i \mid \theta) \to -I(\theta) \]
5.5 Exercises

For each of the following probability distributions, find the maximum likelihood estimates assuming you have an independent sample $X_1, X_2, \ldots X_n$ of observations.

1. $X_i \sim$ Poisson($\theta$)

3. $X_i \sim$ Exponential($\theta$)

5. $X_i \sim$ Unif($\alpha$, $\beta$).
Chapter 6

Sufficiency and Efficiency

6.1 Background
R. A. Fisher's seminal 1922 paper On the mathematical foundations of theoretical statistics [7] started with a philosophical discussion of the purpose of statistical methods. Fisher noted that there was considerable confusion of terminology, extending to failures to distinguish between parameter values and their empirical estimates. He observed that the problems of application of statistical methods for a given dataset revolve around the question 'of what population is this a random sample?'. Answering that question involves data reduction, the process of extracting the information about the population from the sample. Fisher laid out a list of problems which was to have lasting influence:
Consistency That when applied to the whole population the derived statistic should equal the parameter. This is now known as Fisher consistency to distinguish it from weak consistency, the requirement that the estimator converge to the parameter value in probability, as in the weak law of large numbers for the sample mean: $\bar X \xrightarrow{P} \mu$.

Efficiency That in large samples, when the distributions of the statistics tend to normality, that statistic is to be chosen which has the least probable error. In other words, the estimator with smaller variance is more efficient.

Sufficiency That the statistic chosen should summarise the whole of the relevant information supplied by the sample. Fisher explained the concept as follows: suppose we have two estimators $S(\tilde x)$ and $T(\tilde x)$, such that conditional on the value of $S(\tilde x)$, the distribution of $T(\tilde x)$ does not involve the parameter $\theta$ of interest; then $S(\tilde x)$ is sufficient, and $T(\tilde x)$ provides no further information about $\theta$.
\[ E(T \mid S) = \theta + \rho(\sigma_T/\sigma_S)(S - \theta) \]
and if $\rho\,\sigma_T = \sigma_S$ this reduces to
\[ E(T \mid S) = \theta + \rho(\sigma_T/\sigma_S)(S - \theta) = \theta + S - \theta = S. \]
If $\rho\,\sigma_T = \sigma_S$, then
\[ \sigma_T^2 - \rho^2\sigma_T^2 = \sigma_T^2 - \sigma_S^2. \]
6.2 Sufficiency

Fisher's definition of sufficiency has been generalized to the following: a statistic $S(\tilde x)$ is sufficient for $\theta$ if the conditional distribution of the sample, given $S(\tilde x) = s$, does not depend on $\theta$.
\[ P(X_1 = x_1, \ldots, X_n = x_n \mid \textstyle\sum X_i = s) = \frac{P(X_1 = x_1, \ldots, X_n = x_n)}{P(\sum X_i = s)} = \frac{p^{x_1}(1-p)^{1-x_1}\,p^{x_2}(1-p)^{1-x_2}\cdots p^{x_n}(1-p)^{1-x_n}}{\binom{n}{s}p^s(1-p)^{n-s}} \]
\[ = \frac{p^s(1-p)^{n-s}}{\binom{n}{s}p^s(1-p)^{n-s}} = \frac{1}{\binom{n}{s}}. \]
The conditional distribution of the $X_i$'s, given their sum, does not depend on the parameter p. $\binom{n}{s}$ is just the number of orders of n trials with s successes. This suggests another way to think about sufficiency. We could conceptualize the experiment that produced the original sequence $X_1, X_2, \ldots, X_n$ as involving a two step procedure: first select the sufficient statistic s, which tells us how many of the n trials are successes, and then generate the $X_i$'s as a random permutation of the s successes and $n-s$ failures. The second step just adds pure random noise to our sufficient statistic. All the information about p is contained in the value of s; the rest is noise.
We will give a proof for the discrete case. If $S(\tilde x)$ is sufficient, then
\[ P(S(\tilde x) = s)\,P(X_1 = x_1, \ldots, X_n = x_n \mid S(\tilde x) = s) = P(\{X_1 = x_1, \ldots, X_n = x_n\} \cap \{S(\tilde x) = s\}). \]
The right hand side is zero if $s \ne S(\tilde x)$; otherwise it is just $f(\tilde x \mid \theta)$.
\[ P(S(\tilde x) = s) = \sum_{\tilde x:\,\tilde x \mapsto s} f(\tilde x \mid \theta) \]
so the conditional probability is
\[ \frac{f(\tilde x \mid \theta)}{\sum_{\tilde x \mapsto s} f(\tilde x \mid \theta)} = \frac{g(S(\tilde x), \theta)\,h(\tilde x)}{\sum_{\tilde x \mapsto s} g(S(\tilde x), \theta)\,h(\tilde x)} = \frac{h(\tilde x)}{\sum_{\tilde x \mapsto s} h(\tilde x)}. \]
The right hand side does not depend on $\theta$, so $S(\tilde x)$ is sufficient for $\theta$.
6.3 Examples

1. Let $X_1, X_2, \ldots, X_n$ be an independent sample from a Uniform(0, $\theta$) distribution. Then
\[ f(\tilde x \mid \theta) = \frac{1}{\theta^n}\,I(0 \le x_1, \ldots, x_n \le \theta) = \frac{1}{\theta^n}\,I(0 \le x_{(n)} \le \theta)\cdot 1 \]
so the maximum $x_{(n)}$ is a sufficient statistic for $\theta$.

Let $S(\tilde x) = \bar x$. The first factor is our $g(S, \mu)$, the second is our $h(\tilde x)$, and does not involve $\mu$. Thus $\bar x$ is a sufficient statistic for the mean $\mu$.
6.4 The Exponential Family

A one-parameter exponential family has densities of the form
\[ f(x \mid \theta) = c(x)\,e^{a(\theta)x + b(\theta)}. \]
Then the joint likelihood for a sample of n independent observations has the form
\[ f(x_1, x_2, \ldots, x_n \mid \theta) = \left(\prod c(x_i)\right)e^{a(\theta)\sum x_i + nb(\theta)}. \]
It is easy to see that if the likelihood has this form, then the mode of the log likelihood is the solution of the equation
\[ a'(\theta)\sum x_i + nb'(\theta) = 0 \]
or
\[ -\frac{b'(\theta)}{a'(\theta)} = \bar x. \]
After Fisher introduced the concept of sufficiency, I suspect it was only a mat-
ter of time before someone saw the connection. The pieces came together
So
\[ h(x) = \binom{n}{x}, \qquad S(x) = x, \qquad a(p) = \log\frac{p}{1-p}, \qquad b(p) = n\log(1-p). \]
It is often useful to work with the likelihood in a canonical form, using what are often called the natural parameters $\eta$. We reparametrize the likelihood by
\[ \eta = a(\theta). \]
In the binomial, the natural parameter is the logit transform of the probability p:
\[ \eta = \log\frac{p}{1-p}. \]
The logit is the inverse of the logistic function
\[ p = \frac{e^\eta}{1 + e^\eta}. \]
There is something for everyone in the exponential family. For the Bayesian, if
\[ f(x \mid \eta) = h(x)\exp(\eta\,S(x) - B(\eta)) \]
\[ \frac{\partial}{\partial\theta}g(\theta) = g'(\theta) = \frac{\partial}{\partial\theta}E(T) = \frac{\partial}{\partial\theta}\int T(\tilde X)\,f(\tilde X \mid \theta)\,dx = \int T(\tilde X)\,\frac{\partial}{\partial\theta}f(\tilde X \mid \theta)\,dx \]
\[ = \int T(\tilde X)\,\frac{\frac{\partial}{\partial\theta}f(\tilde X \mid \theta)}{f(\tilde X \mid \theta)}\,f(\tilde X \mid \theta)\,dx = \int T(\tilde X)\,\frac{\partial}{\partial\theta}\log f(\tilde X \mid \theta)\,f(\tilde X \mid \theta)\,dx = Cov\left(T(\tilde X),\,\frac{\partial}{\partial\theta}\log f(\tilde X \mid \theta)\right) \]
The last equality follows because $E\left(\frac{\partial}{\partial\theta}\log f(\tilde X \mid \theta)\right) = 0$, and for any random variable Y such that $E(Y) = 0$, $Cov(T, Y) = E(TY)$.
2. $\hat\theta$ is unbiased for $\theta$.

3. $Var(\hat\theta) \le Var(T)$.
\[ P(X_1 = 1 \mid S = s) = \frac{\binom{n-1}{s-1}}{\binom{n}{s}} = \frac{s}{n}. \]
Thus $(X_1 \mid S = s)$ is a Bernoulli trial with success probability s/n, and has expectation s/n. In other words,
\[ \hat p = E(X_1 \mid S = s) = s/n. \]
When multiplied together they retain the exponential family form. It is often useful to write the exponential family form in vector notation:
\[ \sum_{j=1}^k \theta_j\,t_j(X) = \tilde\theta'\,\tilde T. \]
\[ (A\tilde\eta)'\,\tilde T = \tilde\eta'\,(A'\tilde T) \]
6.6 Exercises

1. Let $X_1, X_2, \ldots X_n$ be an IID sample from a Poisson($\mu$) distribution. Show that the Poisson is a member of the exponential family, and that $\sum X_i$ is a sufficient statistic for $\mu$. Show that $\bar X$ is unbiased for $\mu$; does it achieve the Cramér-Rao bound?
\[ f(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}(\log x)^2}. \]
(a) Show that the outcome of the first trial, call it Y1 , is an unbiased
estimator for p.
(b) Find a sufficient statistic for p, and use the Rao-Blackwell theorem
to find a better unbiased estimator.
(c) Find the MLE, and its asymptotic variance.
(d) Do a simulation study comparing the variance, bias, and mean
squared error of the unbiased estimator you found to that of the
MLE.
8. For each of the following situations compare the bias, variance and
mean squared errors for the method of moments estimator, the best
unbiased estimator, the maximum likelihood estimator, and if a conju-
gate or other convenient prior exists, a Bayesian alternative (the poste-
rior distribution, and an estimator such as the posterior mean). If you
can’t do it analytically, do a simulation.
Chapter 7

Formal Inference
We have been studying the problem known as point estimation, that is, producing a single number as an estimate for a parameter. Statisticians have come to view that as an incomplete specification of the problem. We really want more than just an estimate; we want an estimate of the precision or accuracy of an estimate. This is known as a confidence interval, or if you prefer, an interval estimate: a procedure for producing a range of values (with some specified properties) for the unknown parameter. We will see that the standard treatment of this problem is intimately related to the construction of hypothesis tests. A hypothesis test is, roughly speaking, a procedure for answering a question of the form

This form of hypothesis test is often called a pure significance test and was popularized by the work of R. A. Fisher in his influential texts Statistical methods for research workers [8] and The design of experiments [9]. There is another class of situations of great interest, in which we wish to compare two probability models.
The earliest known instance of hypothesis testing is an example of a pure significance test. John Arbuthnott, physician to Queen Anne, observed [1] that if births are independent, and the probability that each birth is a male is 1/2, then the probability that the total number of male births in the city of London exceeds the total number of female births in a year is also 1/2. He noted that over the years for which records then existed in London (1629 to 1710) male births exceeded female births in all 82 years.
He then computed the probability of this particular sequence of outcomes under the hypothesis $p = \frac12$. If the results in different years are independent, we have 82 trials, each with $p = \frac12$, resulting in $(\frac12)^{82}$ as the probability for the 82 consecutive years of excess male births. Arbuthnott concluded that this was too small to have occurred by chance. He went on to argue (apparently seriously) that this was evidence of divine intervention:
and
\[ -z_\alpha \le \frac{\bar X - \mu}{\sigma/\sqrt n} \le z_\alpha \iff -z_\alpha\frac{\sigma}{\sqrt n} \le \bar X - \mu \le z_\alpha\frac{\sigma}{\sqrt n} \]
\[ \iff -\bar X - z_\alpha\frac{\sigma}{\sqrt n} \le -\mu \le -\bar X + z_\alpha\frac{\sigma}{\sqrt n} \iff \bar X - z_\alpha\frac{\sigma}{\sqrt n} \le \mu \le \bar X + z_\alpha\frac{\sigma}{\sqrt n} \]
and finally
\[ \hat\theta \pm z_\alpha\,\frac{1}{\sqrt{-\frac{\partial^2}{\partial\theta^2}\log f(\tilde X \mid \hat\theta)}}. \]
In general, we must find a function of the data and the parameter with a known distribution that does not depend on the parameter. Such a function is called a pivotal quantity. In the case of the normal distribution,
\[ \frac{\bar X - \mu}{\sigma/\sqrt n} \]
is a pivotal quantity. It has a standard normal distribution, which is independent of $\mu$.
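In R, the resulting interval for $\mu$ with known $\sigma$ is one line; a sketch with simulated data:

x = rnorm(25, mean = 10, sd = 2)                 # pretend sigma = 2 is known
mean(x) + c(-1, 1) * qnorm(0.975) * 2/sqrt(25)   # 95% interval for mu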
Thus
\[ \frac{\sum_{i=1}^n (X_i - \bar X)^2}{\sigma^2} \sim \chi^2_{n-1}, \]
and choosing quantiles $q_1$ and $q_2$ of that distribution,
\[ P\left(q_1 \le \frac{\sum_{i=1}^n (X_i - \bar X)^2}{\sigma^2} \le q_2\right) = 1 - \alpha \]
and
\[ q_1 \le \frac{\sum_{i=1}^n (X_i - \bar X)^2}{\sigma^2} \le q_2 \iff \frac{1}{q_2} \le \frac{\sigma^2}{\sum (X_i - \bar X)^2} \le \frac{1}{q_1} \iff \frac{1}{q_2}\sum_{i=1}^n (X_i - \bar X)^2 \le \sigma^2 \le \frac{1}{q_1}\sum_{i=1}^n (X_i - \bar X)^2 \]
Our level $1-\alpha$ confidence interval for $\sigma^2$ is thus
\[ \left(\frac{1}{q_2}\sum_{i=1}^n (X_i - \bar X)^2,\;\;\frac{1}{q_1}\sum_{i=1}^n (X_i - \bar X)^2\right). \]
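A sketch of the computation in R, taking $q_1$ and $q_2$ to be the $\alpha/2$ and $1-\alpha/2$ quantiles of the $\chi^2_{n-1}$ distribution:

x = rnorm(30, sd = 2)
n = length(x)
ss = sum((x - mean(x))^2)
q = qchisq(c(0.975, 0.025), df = n - 1)   # (q2, q1)
ss/q                                      # 95% confidence interval for sigma^2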
Chance: The coins are fair, the tosses were independent; we observed a rare, but not impossible event. We get 10 heads in 10 tosses about one time in $2^{10}$ repetitions of the experiment.

Bias: The coins aren't fair coins; i.e., the probability of heads on each toss is not 1/2, but possibly close to 1. Maybe the coins are all double-headed?

What do we expect to see if the coin is fair and the tosses independent? About 89% of the time we expect to see between 3 and 7 heads, and about 98% of the time between 2 and 8 heads. If we see 10 heads, we feel naturally inclined to skepticism about the mechanism claimed to be generating the data.
The standard formalization of this idea is the ‘Hypothesis Test’. The
focus is decision rather than estimation, in other words, the question:
rather than
you have failed to reject it: it is also possible that your experiment didn’t
generate enough data to reject the null hypothesis for the actual size of the
di↵erence.
Let’s break the process down into a sequence of steps using our coin
tossing problem as an example:
The rejection region is the set of outcomes that are taken to be evidence
against H0 and in favor of some alternative hypothesis. In this case, for
P{Heads} greater than 12 , outcomes greater than 7 will occur more often
than under H0 , while for P{Heads} less than 12 , outcomes less than 3 will
occur more often than under H0 .
The alternative hypothesis is usually not specific, but rather a class
of probability distributions. In the example above, that class is the set of
binomial(10, p) distributions with p ≠ 1/2.
[Figure: bar plot of the binomial(10, 1/2) probability distribution.]
  X      p
  0    0.001   |
  1    0.010   |  reject H0
  2    0.044   |
 --------------------
  3    0.117
  4    0.205
  5    0.246      do not reject H0
  6    0.205
  7    0.117
 --------------------
  8    0.044   |
  9    0.010   |  reject H0
 10    0.001   |
For example, if we did the experiment, and got 7 Heads, then the p-value
would be the probability of getting at least 7 Heads plus the probability of
getting no more than 3 Heads:
$$P(X \le 3) + P(X \ge 7) = \text{pbinom}(3, 10, \tfrac{1}{2}) + (1 - \text{pbinom}(6, 10, \tfrac{1}{2}))$$
or about 0.344.
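In R:

# two-sided p-value for 7 heads in 10 tosses of a fair coin
pbinom(3, 10, .5) + (1 - pbinom(6, 10, .5))   # 0.34375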
Choosing a rejection region with significance level .05 corresponds
exactly to rejecting H0 if the p-value is .05 or smaller.
P-values are often considered to be measures of the strength of the
evidence against the null hypothesis: smaller p-values go with more extreme
results, indicating greater evidence against H0. This is unsound, as it ignores
the simultaneous dependence of the p-value on both the effect size (the
difference between the parameter specified by H0 and the actual parameter
value) and the sample size, as well as the inherent plausibility of the null
hypothesis.
Either way, notice that we are essentially comparing the observed outcome
to the distribution it would have if H0 were true, to see if it represents an
extreme or unusual value for that distribution, in the sense of falling too far
from the center of the distribution, in a region of low probability.
We can't claim to be testing all possible alternatives (for example, lack
of independence of the tosses). We are restricting the alternative hypotheses
to a specific class: binomial distributions with p ≠ 1/2.
The fact that the alternative hypothesis is 'binomial with p ≠ 1/2' is
important to the construction of the rejection region. For example, suppose
that p = 1/2, but the tosses are not independent. In the extreme case, if all
10 outcomes are identical, as they might be if the coins were taped together
all facing the same direction, then this hypothesis test will reject the null
hypothesis p = 1/2 with probability 1, even though p is 1/2. The failure of
independence leads to the rejection of the null hypothesis.
Consider the following scenario. Suppose that the trials are dependent
with the property that we always get either 6 successes and 4 failures, or 4
successes and 6 failures, each with probability 1/2.
[Figure: the probability distributions of the number of heads under H0
(binomial(10, 1/2)) and under the dependent alternative H1, which puts
probability 1/2 on each of the outcomes 4 and 6.]
What rejection region should we use in this situation? If these really are
the two hypotheses of interest, then an
extreme result like 10 heads is not evidence for the alternative hypothesis! In
fact, 10 heads is conclusive evidence that the alternative hypothesis is false,
and H0 is true, since 10 heads is an impossible outcome under the alternative
hypothesis. Here the only reasonable rejection region is the set {4, 6}: the
outcomes that are more likely to occur under the alternative than under H0 .
This is the idea underlying the likelihood ratio test, coming soon.
Returning to the original example, when the alternative hypothesis is
p ≠ 1/2, that is, a binomial(10, p) distribution with some other probability
p of success, then the optimal rejection region is indeed the tails (extreme
outcomes like {0, 1, 2, 8, 9, 10}) of the binomial(10, 1/2) distribution.
Finally, failure to reject the null hypothesis is not proof that the null
hypothesis is correct. If nothing else, it is always possible that the sample
size was just too small to reliably detect the difference!
7.2.1 Power
So far our discussion of hypothesis testing has focused on the null hypothesis.
For example, we choose a rejection region by selecting a significance level α,
then trying to find outcomes whose probabilities add up to about the desired
α value. Recall that α measures the probability of rejecting H0, given that
H0 is in fact true. This is not the only type of error we could make. It is also
possible that we might fail to reject a false null hypothesis. The probability
of doing so depends on the true, and usually unknown, value of the
parameter in question, and is usually labeled β. It is useful to display the
various possibilities in tabular form with the associated probabilities in
parentheses:
                    H0 True              H0 False
 Don't reject H0    No Error (1 − α)     Type II Error (β)
 Reject H0          Type I Error (α)     No Error (1 − β)
For example, if the true probability of heads is p = 3/4, the power of the
test with rejection region {0, 1, 2, 8, 9, 10} is
$$P(X \in \{0, 1, 2, 8, 9, 10\} \mid p = \tfrac{3}{4}) = 0.5260,$$
with essentially all of that probability coming from the right tail of
the distribution (outcomes 8, 9, and 10). Outcomes in the left tail are much
less likely if p = 3/4.
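Here is a small R check of that power computation, along with the significance level of the same rejection region under H0:

# power of the rejection region {0,1,2,8,9,10} when p = 3/4
rr <- c(0, 1, 2, 8, 9, 10)
sum(dbinom(rr, 10, .75))   # about 0.526
# significance level: the same sum computed under H0 (p = 1/2)
sum(dbinom(rr, 10, .5))    # about 0.109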
7.3 Likelihood
Examples like the correlated coin tosses in the previous section persuaded
Neyman and Pearson [17] that the crucial issue for choosing a rejection region
is the ratio of the probabilities assigned to outcomes under the competing
hypotheses. This is known as the likelihood ratio:
$$LR = \frac{P(X \mid H_0)}{P(X \mid H_1)}.$$
More generally, Neyman and Pearson showed that for a given choice of
significance level (α), the power is maximized by choosing a rejection region
based on values of the likelihood ratio.
A more general prescription, known as the likelihood principle (see,
for example, Berger and Wolpert [2]) states that all inferences should be
based on the likelihood function. The likelihood function for a particular
hypothesis is just the probability density implied by that hypothesis:
L(X, H0 ) = P(X|H0 ).
For any alternative p > 1/2, the extreme outcomes in the upper tail of
the binomial(10, 1/2) distribution will have higher probability under the
alternative than under H0. The extreme outcomes in the lower tail will have
higher likelihood under any alternative p < 1/2. Thus the likelihood ratio
criterion implies that we should choose our rejection region to be those
outcomes in the tails with total probability no greater than .05 under H0.
This leads us to the rejection region
$$RR = \{0, 1, 9, 10\}.$$
The likelihood functions for the two experiments are respectively binomial
and negative binomial:
$$L_1 = \binom{20}{15} p^{15} (1-p)^5$$
and
$$L_2 = \binom{19}{14} p^{15} (1-p)^5.$$
These two likelihoods are proportional; the only difference is the constant
factor. Hence the likelihood principle implies that we should make the same
inferences about p from either experiment; in other words, it requires the
same estimate of p in either case. A frequentist using the criterion of
unbiasedness would use the estimates p̂ = 15/20 and p̂ = 14/19 respectively. The
criterion of unbiasedness takes account of the possibility of arbitrary
numbers of failures before seeing the 15th success in the second experiment:
when we compute the expected value of the estimator we are averaging across
two different sample spaces. In the first experiment there are a fixed number
of trials, and we compute expectation with respect to that sample space and
the corresponding binomial distribution. The likelihood principle demands
that inferences depend on the data only through the likelihood function;
unbiasedness asks us to take into account data that might have been observed,
but wasn't.
In any case, the classic result on the connection between likelihood ratios
and power is the Neyman-Pearson Lemma. I will present the result in terms
of density functions for continuous random variables; the argument for the
discrete case is parallel.
Theorem 4 (Neyman-Pearson)
Given H0: X ∼ f0(x) and H1: X ∼ f1(x), let the rejection region R* be
defined by
$$R^* = \left\{ x : \frac{f_1(x)}{f_0(x)} \ge c \right\}$$
and suppose that R* has size α. Let R be any other rejection region of size
α. Then the test defined by R* has power at least as great as the one defined
by R.
Example:
Let X1, X2, X3, ..., Xn be an IID sample from a Normal(µ, σ²) population.
Find the most powerful test of H0: µ = µ0 versus H1: µ = µ1 > µ0,
assuming σ² is known.
$$\log\frac{f_1(\tilde x)}{f_0(\tilde x)} \ge c \iff -\frac{\sum_{i=1}^n (X_i - \mu_1)^2}{2\sigma^2} + \frac{\sum_{i=1}^n (X_i - \mu_0)^2}{2\sigma^2} \ge c$$
$$\iff \sum_{i=1}^n (X_i - \mu_0)^2 - \sum_{i=1}^n (X_i - \mu_1)^2 \ge 2c\sigma^2$$
$$\iff \sum_{i=1}^n X_i^2 - 2n\bar X \mu_0 + n\mu_0^2 - \sum_{i=1}^n X_i^2 + 2n\bar X \mu_1 - n\mu_1^2 \ge 2c\sigma^2$$
$$\iff -2n\bar X \mu_0 + 2n\bar X \mu_1 \ge 2c\sigma^2 - n\mu_0^2 + n\mu_1^2$$
$$\iff \bar X (2n\mu_1 - 2n\mu_0) \ge 2c\sigma^2 - n\mu_0^2 + n\mu_1^2$$
$$\iff \bar X \ge \frac{2c\sigma^2 - n\mu_0^2 + n\mu_1^2}{2n\mu_1 - 2n\mu_0}$$
$$\iff \bar X \ge c'.$$
Choosing the constant c′ is equivalent to choosing a significance level α,
and this is true for each µ1 > µ0. Thus the one-sided 'Z test', which rejects
H0: µ = µ0 if
$$\frac{\bar X - \mu_0}{\sigma/\sqrt{n}} > z_{\alpha/2},$$
is an optimal size α/2 test for each µ1 > µ0. Similarly, the test which rejects
H0: µ = µ0 if
$$\frac{\bar X - \mu_0}{\sigma/\sqrt{n}} < -z_{\alpha/2}$$
is an optimal size α/2 test for each µ1 < µ0. It follows that the standard 'Z
test', which rejects H0: µ = µ0 if
$$\left| \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} \right| > z_{\alpha/2},$$
is an optimal two-sided test for the alternative µ1 ≠ µ0.
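As a quick illustration, here is a small R sketch of the two-sided Z test; z.test is a hypothetical helper, not a base R function:

# two-sided Z test of H0: mu = mu0, with sigma known
z.test <- function(x, mu0, sigma, alpha = .05) {
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
  list(z = z, reject = abs(z) > qnorm(1 - alpha/2))
}
z.test(rnorm(25, mean = .5, sd = 1), mu0 = 0, sigma = 1)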
I will sketch the proof in the case where dim(Θ) = 1 and Θ0 = {θ0}. We
know that
$$\sqrt{nI(\theta_0)}\,(\hat\theta_n - \theta_0) \xrightarrow{D} N(0, 1).$$
Expand ℓn(θ0) in a Taylor series about θ̂n:
$$\ell_n(\theta_0) = \ell_n(\hat\theta_n) + \ell'_n(\hat\theta_n)(\hat\theta_n - \theta_0) + \frac{1}{2}\ell''_n(\theta_n^*)(\hat\theta_n - \theta_0)^2$$
for some θn* satisfying
$$|\theta_n^* - \hat\theta_n| \le |\theta_0 - \hat\theta_n|.$$
Since ℓ′n(θ̂n) = 0, the likelihood ratio statistic is
$$2\bigl(\ell_n(\hat\theta_n) - \ell_n(\theta_0)\bigr) = \frac{-\ell''_n(\theta_n^*)}{nI(\theta_0)}\, nI(\theta_0)\,(\hat\theta_n - \theta_0)^2 \xrightarrow{D} Z^2$$
where Z ∼ N(0, 1). We make use here of the consistency of the MLE θ̂,
which guarantees that
$$\frac{-\ell''_n(\theta_n^*)}{nI(\theta_0)} \xrightarrow{P} 1,$$
and the asymptotic normality of θ̂. Since Z² ∼ χ²(1), we are done.
Example:
Let X be a binomial(n, p) random variable. Let H0 specify p = p0, and
H1: p ≠ p0. Let p̂ = X/n be the MLE. The likelihood ratio test statistic is
$$2\log LR = 2\left[ X \log\frac{\hat p/(1-\hat p)}{p_0/(1-p_0)} + n \log\frac{1-\hat p}{1-p_0} \right].$$
The properties of this test statistic may not be immediately obvious, but it
is interesting to note that the first term involves an odds ratio: the ratio
of the odds under the MLE to the odds under H0.
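A sketch of this statistic in R, assuming 0 < X < n so the logs are finite (lrt.binom is a hypothetical name):

# likelihood ratio test statistic for H0: p = p0, X ~ binomial(n, p)
lrt.binom <- function(x, n, p0) {
  phat <- x/n
  2 * (x * log(phat/p0) + (n - x) * log((1 - phat)/(1 - p0)))
}
lrt.binom(7, 10, .5)   # about 1.65
qchisq(.95, df = 1)    # about 3.84: do not reject at level .05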
The score test rejects H0 when
$$\frac{U_n(\tilde X, \theta_0)}{\sqrt{nI(\theta_0)}} > z_\alpha.$$
Example:
Let X be a binomial(n, p) random variable. Let H0 specify p = p0, and
H1: p ≠ p0. Then
$$U(p_0) = \frac{\partial}{\partial p_0} \ell(X \mid p_0) = \frac{X}{p_0} - \frac{n - X}{1 - p_0} = \frac{X - np_0}{p_0(1 - p_0)}$$
and
$$I_n(p_0) = \frac{n}{p_0(1 - p_0)}.$$
Thus, the test statistic is
$$\frac{\dfrac{X - np_0}{p_0(1 - p_0)}}{\sqrt{\dfrac{n}{p_0(1 - p_0)}}} = \frac{X - np_0}{\sqrt{n\, p_0(1 - p_0)}}.$$
The Wald test rejects H0 when
$$\frac{\hat\theta - \theta_0}{\sqrt{1/I_n(\hat\theta)}} > z_\alpha.$$
Example:
Let X be a binomial(n, p) random variable. Let H0 specify p = p0, and
H1: p ≠ p0. Then
$$\frac{\hat p - p_0}{\sqrt{1/I_n(\hat p)}} = \frac{\hat p - p_0}{\sqrt{\hat p(1 - \hat p)/n}}.$$
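For comparison, a sketch of both statistics in R for the same hypothetical data:

# score and Wald statistics for H0: p = p0, with X successes in n trials
x <- 7; n <- 10; p0 <- .5
phat <- x/n
(x - n*p0) / sqrt(n * p0 * (1 - p0))        # score: information at p0
(phat - p0) / sqrt(phat * (1 - phat) / n)   # Wald: information at phat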
A level 1 − α confidence set C(X) satisfies
$$P_\theta(\theta \in C(X)) \ge 1 - \alpha$$
for all θ ∈ Θ. A level α test for H0: θ = θ0 has a rejection region R(θ0) such
that
$$P_{\theta_0}(X \in R_{\theta_0}) \le \alpha,$$
or equivalently
$$P_{\theta_0}(X \notin R_{\theta_0}) \ge 1 - \alpha.$$
Given a set of level α tests with rejection regions R(θ), we can define a set
function S(X) taking values in Θ by
$$S(X) = \{\theta : X \notin R_\theta\}.$$
Then
$$P_\theta(\theta \in S(X)) = P_\theta(X \notin R_\theta) \ge 1 - \alpha,$$
so S(X) is a level 1 − α confidence set; moreover,
$$X \in R_\theta \iff \theta \notin S(X).$$
Example:
Let X1, X2, ..., Xn be IID Normal(µ, σ²) random variables, where σ² is
known. As we saw earlier, the interval
$$C(X) = \left( \bar X - z_\alpha \frac{\sigma}{\sqrt{n}},\ \bar X + z_\alpha \frac{\sigma}{\sqrt{n}} \right)$$
contains µ with probability 1 − α. On the other hand, the size α Z-test
fails to reject H0: µ = µ0 if and only if
$$\left| \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} \right| \le z_\alpha,$$
which is equivalent to the condition
$$-z_\alpha \le \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} \le z_\alpha$$
and thus
$$-z_\alpha \sigma/\sqrt{n} \le \bar X - \mu_0 \le z_\alpha \sigma/\sqrt{n}.$$
This occurs just in case
$$\bar X - z_\alpha \sigma/\sqrt{n} \le \mu_0 \le \bar X + z_\alpha \sigma/\sqrt{n}$$
and µ0 ∈ C(X).
Thus we can construct approximate confidence intervals using any
hypothesis test, including the likelihood ratio test, score test, and Wald test.
7.6 Exercises
1. Let X1, X2, ..., Xn be IID N(µ, σ²). Construct an approximate 95%
confidence interval for the coefficient of variation σ/µ. Hint: X̄ and
S² are independent, and there is a normal approximation for the χ²(m)
as m → ∞.
Deaths = c(0,1,2,3,4,5)
N = c(109,65,22,3,1,0)
# expand the dataset into individual cases
D1 = rep(Deaths,N)
# a sample of size 25, drawn without replacement:
X25 = sample(D1,size=25,replace=FALSE)
(b) The asymptotic confidence interval based on the fact that X̄
is approximately Normal(θ, 1/In(θ)).
(c) The confidence interval based on the asymptotic normality of the
score function; that is, solve the following for (θL, θU):
$$\frac{U_n(\tilde X)}{\sqrt{I_n(\theta)}} = \pm 1.96.$$
Chapter 8
Computational Statistics
Setting the gradient of the quadratic approximation to zero gives
$$0 = \nabla f(x_0) + H(x - x_0),$$
so that
$$\nabla f(x_0) = -H(x - x_0)$$
and
$$x = x_0 - H^{-1} \nabla f(x_0).$$
There are many issues in the implementation of such algorithms, such
as stopping rules, error checking, and so on. You can get some sense of the
richness of the problem by reading the help files in R for the nlm and optim
functions.
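Here is a bare-bones sketch of the Newton iteration; newton is a hypothetical function, and nlm and optim are far more robust:

# Newton iteration x <- x - H^{-1} grad f(x), given functions for the
# gradient and hessian
newton <- function(grad, hess, x0, tol = 1e-8, maxit = 100) {
  x <- x0
  for (i in 1:maxit) {
    step <- solve(hess(x), grad(x))
    x <- x - step
    if (sum(step^2) < tol) break
  }
  x
}
# example: minimize f(x) = x1^2 + 2*x2^2, which is minimized at (0, 0)
newton(function(x) c(2*x[1], 4*x[2]), function(x) diag(c(2, 4)), c(3, -1))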
Suppose that we have a random sample X1, X2, ..., Xn from a population
with two distinct subpopulations. If each subpopulation has a normal
distribution with mean µi and variance σi², then the distribution of a
random element of the full population is a mixture distribution, in this
case a mixture of two normal distributions. The density function is a
convex combination of the two densities:
$$f(x \mid p, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = p\, n(x \mid \mu_1, \sigma_1^2) + (1 - p)\, n(x \mid \mu_2, \sigma_2^2)$$
# simulate 50 observations from a two-component normal mixture
MX <- rnorm(50)
p <- .2   # note: p is not actually used below; the mixing weight is .3
n <- rbinom(50,1,.3)   # component indicators
MX <- ifelse(n,2*MX+6,4*MX)   # N(6,4) with probability .3, else N(0,16)
plot(density(MX))
[Figure: kernel density estimate of the simulated mixture sample, titled 'Mixture density'.]
f <- function(P)
{
# P is the parameter vector (p,mu1,mu2,s1,s2)
# note the use of SDs instead of variances
p <- P[1]
m1 <- P[2]
m2 <- P[3]
s1 <- P[4]
s2 <- P[5]
-sum(log(p*dnorm(MX,m1,s1)+(1-p)*dnorm(MX,m2,s2)))
}
Now, let’s try running it through nlm() with the data we created:
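A call along the following lines produces the output below; the starting vector P0 here is my guess, not the author's:

P0 <- c(.3, 5, 0, 1, 3)   # hypothetical starting values (p, mu1, mu2, s1, s2)
mle <- nlm(f, P0, hessian = TRUE)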
Warning messages:
1: NaNs produced in: log(x)
2: NA/Inf replaced by maximum positive value
3: NaNs produced in: log(x)
4: NA/Inf replaced by maximum positive value
mle
$minimum
[1] 135.9917
$estimate
[1] 0.3358718 5.9991124 -1.1386181 0.7881826 3.7583494
$gradient
[1] -5.599077e-06 -8.300367e-06 3.245006e-07 2.154366e-05 9.074742e-07
$hessian
[,1] [,2] [,3] [,4] [,5]
[1,] 184.717845 6.54627873 4.1521531 -10.4477664 5.20012061
[2,] 6.546279 20.91935439 -0.4691034 6.5323202 -0.06922796
[3,] 4.152153 -0.46910339 1.9033706 0.8896393 -0.58319964
[4,] -10.447766 6.53232019 0.8896393 36.6926315 0.70978473
[5,] 5.200121 -0.06922796 -0.5831996 0.7097847 3.88659680
$code
[1] 1
$iterations
[1] 24
As you may verify by checking the nlm help file, termination codes 1 and 2
are indicators that the search converged to a solution, so the warning
messages may be safely ignored. The parameter estimates look ok. The
square root of the diagonal of the inverse of the hessian gives asymptotic
standard errors for the estimates:
sqrt(diag(solve(mle$hessian)))
[1] 0.08082643 0.23065559 0.79375142 0.17661707 0.54407076
parameter   estimate    SE
p            0.336     0.081
µ1           5.999     0.231
µ2          −1.139     0.794
σ1           0.788     0.177
σ2           3.758     0.544
A second attempt, started from different initial values, shows how badly
things can go wrong:
> mle
$minimum
[1] -250.9311
$estimate
[1] 7.463620e+03 7.809437e-01 4.630075e+01 1.913076e+01 8.201048e-09
$gradient
[1] -0.006699159 -0.065276225 0.000000000 2.462322988 0.000000000
$hessian
[,1] [,2] [,3] [,4] [,5]
$code
[1] 2
$iterations
[1] 59
Notice that the termination code (2) suggests convergence. The nlm
function is minimizing the negative of the log likelihood, which has to be
positive, but the final minimum is negative. The parameter values are fishy:
most notably p̂ = 7463. That's a bit big for a probability! Two of the
diagonal elements of the hessian matrix are 0: not good for a matrix that
is supposed to be positive definite!
Data Augmentation
Suppose that for each Xi we introduce an indicator Zi, with Zi = 1 if Xi
was drawn from the first component of the mixture and Zi = 0 otherwise,
so that
$$E(Z_i) = \Pr(Z_i = 1) = p.$$
If we knew all the Zi ’s, the problem would be simple — just use the
group one X’s to estimate the mean and variance of that component
of the mixture, and use the group two X’s to estimate the mean and
variance of their group.
The E step
In general this step is the computation of the expected value of the
Zi's, given the X's and parameter values (here p, µ1, µ2, σ1², σ2²). We
can apply Bayes' rule to compute qi, the posterior probability that
Zi = 1, which is equal to the posterior expectation:
$$q_i = E(Z_i \mid X_i) = \frac{p\, f(X_i \mid \mu_1, \sigma_1^2)}{p\, f(X_i \mid \mu_1, \sigma_1^2) + (1 - p)\, f(X_i \mid \mu_2, \sigma_2^2)}.$$
The M step
Now we don't know the Zi's, but we have their expected values, given
the observed data. Thus the likelihood for each Xi can be written as
$$f(x_i \mid Z_i, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = q_i\, n(x_i \mid \mu_1, \sigma_1^2) + (1 - q_i)\, n(x_i \mid \mu_2, \sigma_2^2).$$
It is not hard to show that the MLE's for the parameters are now just
weighted averages, with the qi's as weights. For example,
$$\hat\mu_1 = \frac{\sum_i q_i X_i}{\sum_i q_i}.$$
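Putting the two steps together gives the EM iteration. Here is a minimal R sketch for this mixture model; em.mix is a hypothetical name, and the starting values in the call are rough guesses:

em.mix <- function(x, p, m1, m2, s1, s2, iters = 100) {
  for (it in 1:iters) {
    # E step: posterior probability that each x came from component 1
    d1 <- p * dnorm(x, m1, s1)
    d2 <- (1 - p) * dnorm(x, m2, s2)
    q <- d1 / (d1 + d2)
    # M step: weighted averages with the q's as weights
    p  <- mean(q)
    m1 <- sum(q * x) / sum(q)
    m2 <- sum((1 - q) * x) / sum(1 - q)
    s1 <- sqrt(sum(q * (x - m1)^2) / sum(q))
    s2 <- sqrt(sum((1 - q) * (x - m2)^2) / sum(1 - q))
  }
  c(p = p, mu1 = m1, mu2 = m2, s1 = s1, s2 = s2)
}
em.mix(MX, p = .3, m1 = 5, m2 = 0, s1 = 1, s2 = 3)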
8.2 Bootstrapping
Bootstrapping is a simulation method. Suppose that we wanted to study
the distribution of some function of an IID sample X1, X2, ..., Xn. If we
knew the distribution of the population from which the data were sampled,
we could simply generate many pseudo-random samples of size n, compute
the value of our function for each sample, collect the values, and estimate
the parameter or parameters of the sampling distribution we would like to
know.
# the lognormal density, with kernel density estimates from three
# samples of size 50 overlaid
dx <- seq(.01,15,.01)
plot(dx,dlnorm(dx),type="l")
X <- rlnorm(50)
lines(density(X),col="red")
X <- rlnorm(50)
lines(density(X),col="blue")
X <- rlnorm(50)
lines(density(X),col="green")
# the 'true' sampling distribution of the mean of 50 lognormals,
# approximated by 1000 simulated samples
x <- matrix(rlnorm(50000),ncol=50)
mx <- apply(x,1,mean)
plot(density(mx))
quantile(mx,c(.025,.0975))   # .0975 is presumably a typo for .975
     2.5%     9.75%
 1.146058  1.286618
# bootstrap: resample from the observed sample X, with replacement
mx <- rep(0,1000)
for(i in 1:1000)
{
x <- sample(X,size=50,replace=T)
mx[i] <- mean(x)
}
lines(density(mx),col="red")
quantile(mx,c(.025,.0975))   # again, .0975 is presumably a typo for .975
     2.5%     9.75%
 0.949869  1.040169
# a second bootstrap run, this time plotting the bootstrap density estimate
mx <- rep(0,1000)
for(i in 1:1000)
{
x <- sample(X,size=50,replace=T)
mx[i] <- mean(x)
}
plot(density(mx))
# simulate the sampling distribution of the mean of 30
# Weibull(shape = 3, scale = 2) observations
Mx <- rep(0,100000)
for(i in 1:100000)
{
Mx[i] <- mean(rweibull(30,3,2))
}
Chapter 9
Markov Chain Monte Carlo
Generate Xi+1:
Generate an observation from the conditional distribution
$$X_{i+1} \sim f_x(x \mid Y_i).$$
Generate Yi+1:
Generate an observation from the conditional distribution
$$Y_{i+1} \sim f_y(y \mid X_{i+1}).$$
$$Y_1, Y_2, \ldots, Y_m \sim \text{Poisson}(\mu)$$
and
$$Y_{m+1}, Y_{m+2}, \ldots, Y_n \sim \text{Poisson}(\lambda).$$
Let the prior distributions for the parameters be µ ∼ Gamma(α, β),
λ ∼ Gamma(ν, δ), and m ∼ Uniform{1, 2, ..., n}. Then the joint posterior is
$$\pi(\mu, \lambda, m \mid \tilde Y) = \frac{f(\tilde Y, \mu, \lambda, m)}{f_{\tilde Y}(\tilde Y)}$$
$$= c\, e^{-m\mu} \mu^{\sum_1^m Y_i}\, e^{-(n-m)\lambda} \lambda^{\sum_{m+1}^n Y_i}\, \mu^{\alpha - 1} e^{-\beta\mu}\, \lambda^{\nu - 1} e^{-\delta\lambda}$$
$$= c\, \mu^{(\alpha + \sum_1^m Y_i) - 1} e^{-(\beta + m)\mu}\, \lambda^{(\nu + \sum_{m+1}^n Y_i) - 1} e^{-(\delta + n - m)\lambda}.$$
From here, it is not hard to compute the conditional distributions for each
parameter, given the data Ỹ and the other parameters. The two Poisson
parameters µ and λ are conditionally independent given m and Ỹ:
$$\pi_1(\mu \mid \tilde Y, m) \sim \text{Gamma}\Bigl(\alpha + \sum_{i=1}^m Y_i,\ \beta + m\Bigr)$$
$$\pi_2(\lambda \mid \tilde Y, m) \sim \text{Gamma}\Bigl(\nu + \sum_{i=m+1}^n Y_i,\ \delta + (n - m)\Bigr)$$
The conditional distribution of m is a discrete distribution on {1, 2, ..., n};
given its vector of probabilities p, a draw of m is
sample(1:n,size=1,prob=p)
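Here is a sketch of the full Gibbs sampler for this model; gibbs.cp is a hypothetical name, and I restrict m to {1, ..., n−1} so that both segments are non-empty:

gibbs.cp <- function(Y, alpha, beta, nu, delta, N = 5000) {
  n <- length(Y)
  out <- matrix(0, N, 3, dimnames = list(NULL, c("mu", "lambda", "m")))
  m <- n %/% 2                      # arbitrary starting value
  for (i in 1:N) {
    mu     <- rgamma(1, alpha + sum(Y[1:m]), beta + m)
    lambda <- rgamma(1, nu + sum(Y[-(1:m)]), delta + (n - m))
    # conditional probabilities for m, computed on the log scale
    logp <- sapply(1:(n - 1), function(k)
      sum(dpois(Y[1:k], mu, log = TRUE)) +
      sum(dpois(Y[-(1:k)], lambda, log = TRUE)))
    p <- exp(logp - max(logp))
    m <- sample(1:(n - 1), size = 1, prob = p)
    out[i, ] <- c(mu, lambda, m)
  }
  out
}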
$$\pi(\theta \mid X) = \frac{f(X \mid \theta)\, \pi(\theta)}{\int f(X \mid \theta)\, \pi(\theta)\, d\theta}.$$
The idea is to replace sampling from the exact conditional distributions for
components of θ with an accept-reject algorithm for which the sequence of
trials still converges to the stationary distribution.
Let's start with a simplified version. Suppose we have a K-parameter state
space, and we want to generate samples from a joint distribution f(y). We
can't run the standard MCMC sampler, generating from the conditional
distributions f(y_n | y_{n−1}), perhaps because we don't have the normalizing
constants. We create a proposal distribution g(z | y_{n−1}), which may not have
any connection to the distribution f(y), but rather proposes a jump to a
new location in the state space. We start the chain at some arbitrary
location y0, and generate proposals Y from the distribution g(Y | y_n). We
now accept or reject with probability
$$\alpha = \min\left\{ \frac{f(Y)\, g(y_n \mid Y)}{f(y_n)\, g(Y \mid y_n)},\ 1 \right\}.$$
In other words, if the proposal Y has greater probability than the current
yn , always accept it (always move to points of higher probability than the
current location). If the proposed Y is less probable, then accept with
positive probability ↵ < 1. The idea is that you want the process to spend
more time in regions of higher probability, but you don’t want to get stuck
there; you want to explore the whole state space.
Often one wants to combine Gibbs sampling with Metropolis-Hastings.
Here is pseudo-code to implement MCMC for a vector parameter
θ̃ = (θ1, θ2, ..., θk). In essence we use the accept-reject criterion successively
for each component of θ̃:
{initialize θ̃}
for n in 1 to N do
    for i in 1 to k do
        {generate Yi ∼ g(Yi | θi)}
        Ỹ ← (θ1, ..., θ(i−1), Yi, θ(i+1), ..., θk)
        α ← min{ f(X | Ỹ)π(Ỹ)g(θi | Yi) / [f(X | θ̃)π(θ̃)g(Yi | θi)], 1 }
        {set θ̃ = Ỹ with probability α, else retain θ̃}
    end for
end for
That pseudo-code describes the Markov process; each pass through the inner
loop is one iteration, and it should be repeated many times (N large) to
ensure that the process converges to its stationary distribution.
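Here is a minimal random-walk Metropolis sketch in R for a univariate target known up to a constant; with a symmetric proposal the g terms in the acceptance ratio cancel (metropolis is a hypothetical name):

metropolis <- function(f, y0, sd = 1, N = 10000) {
  y <- numeric(N)
  y[1] <- y0
  for (n in 2:N) {
    prop <- rnorm(1, y[n-1], sd)          # proposal from g(z | y)
    a <- min(f(prop) / f(y[n-1]), 1)      # acceptance probability
    y[n] <- if (runif(1) < a) prop else y[n-1]
  }
  y
}
# example: sample from a standard normal target
draws <- metropolis(function(y) exp(-y^2/2), y0 = 0)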
9.4 Exercises
1. Compute the integral
$$\int_{-\infty}^{\infty} \log|x|\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx,$$
that is, E(log |X|) for X ∼ N(0, 1), by the following methods. For the
Monte Carlo methods, estimate the integral and compute a standard
error for your estimate.
4. Retrieve the British coal mine disaster data from the Math 141 web
page at
http://www.reed.edu/~jones/141
(a) Fit the change point model by maximum likelihood. Plot the
profile log-likelihood vs year. The profile likelihood for year m is
log L(µ̂, λ̂, m); in other words, given m, maximize the
log-likelihood in the other parameters.
(b) Let the priors for µ and λ, respectively, be Gamma(4, 1) and
Gamma(2, 2). Use Markov chain Monte Carlo to estimate the
posterior marginal densities, means, and variances for each
parameter. Plot the prior and posterior densities for each
parameter. Be sure to provide details about your MCMC
computations (R code, number of 'burn-in' iterations, total
iterations, etc.).
Bibliography
[1] John Arbuthnott. An argument for divine providence, taken from the
constant regularity observ’d in the births of both sexes. Phil. Trans. of
the Royal Society, 27:186–190, 1710.
[8] R. A. Fisher. Statistical methods for research workers. Oliver & Boyd,
Edinburgh, 1925.
[12] W. Hastings. Monte Carlo sampling methods using Markov chains and
their applications. Biometrika, 57:97–109, 1970.