
Statistical Concepts 2013/14 Sheet 0 Probability Revision

For this course, it is crucial that you have excellent knowledge of the material presented
in the first-year Probability course. These exercises are strongly recommended. You may
wish to go over the lecture notes and summary handouts of the Probability course if you
struggle with some of these exercises. Some of these may be discussed in class, and full
solutions will be handed out.
1. I have ten coins in my pocket. Nine of them are ordinary coins with equal chances
of coming up head and tail when tossed, and one has two heads.
(a) If I take one of the coins at random from my pocket, what is the probability
that it is the coin with two heads?
(b) If I toss the coin and it comes up heads, what is the probability that it is the
coin with two heads?
(c) If I toss the coin a further n times and it comes up heads every time, what is
the probability that it is the coin with two heads?
(d) If I toss the coin one further time and it comes up tails, what is the probability
that it is one of the nine ordinary coins?
2. Let U be a random quantity which has probability density function

       f(u) = 1 for u ∈ [0, 1],   f(u) = 0 otherwise.

Calculate
(a) P[U ≥ 0.3]
(b) E[U]
(c) Var[U]
(d) E[log(1 + U)]
(e) If the random quantity Y has a uniform distribution on the interval [a, b],
express Y in terms of U above and hence find E[Y] and Var[Y].
3. Let Y be a random quantity which has a Poisson distribution. Suppose that E[Y] = 3.
(a) What is Var[Y]?
(b) Suppose that I take a large number of independent random quantities, each
with the same distribution as Y. Why should I suppose that a normal dis-
tribution would be a good approximation to the distribution of their average Ȳ?
(c) Use part (b) to calculate an interval which would have approximately a 98%
probability of containing Ȳ, based on 100 such random quantities.
4. An entomologist has a large colony of ants which he knows contains just two types, A
and B, of similar appearance. By spending some time examining each ant carefully
he can discover which type it is, but he decides, on grounds of cost, to use the much
quicker method of classifying an ant as type A if its length is less than 8 mm and
as type B otherwise.
He knows that lengths of each of the two types of ants are normally distributed:
type A with expectation 6.5 mm and standard deviation 0.8 mm, and type B with
expectation 9.4 mm and standard deviation 0.9 mm.
What proportion of (i) type A and (ii) type B ants would he misclassify by this
method?
If the colony consists of 70% of type A and 30% of type B, what proportion of all
ants would he misclassify?
It is thought that the number of ants misclassified may be reduced by choosing
a critical point other than 8 mm. Discuss!
5. The mad king has captured Anne, Betty and Charles. He would like to kill them all,
but, as he is a fan of probability puzzles, he offers them the following challenge. The
following morning, each of the three prisoners will be escorted to a cell. They will
each enter the cell simultaneously, but each through a different door. Immediately
before entering the cell, a hat will be placed on the head of each prisoner. The
colour of the hat will be either red or blue, and the choice for each prisoner will be
decided by the flip of a fair coin, independently for each prisoner. As they enter
the room, each prisoner will be able to see the colours of the hats of the other
two prisoners, but not the colour of their own hat. No communication of any kind
is allowed between the prisoners. At the moment that all of the prisoners enter
the cell, and observe the colours of their comrades' hats, each may choose either
to remain silent or, instantly, to guess the colour of the hat on their head. If at
least one prisoner guesses correctly the colour of their hat, and nobody guesses
incorrectly, all the prisoners will be set free. Otherwise, they will all be executed.
The prisoners are allowed a meeting beforehand to discuss their strategy. They
immediately realise that one possible strategy would be for Anne to guess that her
hat was red, and Betty and Charles to stay silent. This strategy gives a probability
of 1/2 that all will go free. Is there a better strategy?
Statistical Concepts 2013/14 Solutions 0 Probability Revision
1. (a) Each of the coins is equally likely to be chosen. The ten probabilities must sum to
one since exactly one coin must be chosen. Hence, P[coin with two heads] = 1/10.
(b) P[head | coin with two heads] = 1 and P[head | fair coin] = 1/2. Therefore, applying
Bayes' theorem,

    P[coin with two heads | head] = (1 × 1/10) / (1 × 1/10 + 1/2 × 9/10) = 2/11 ≈ 0.182

(c) Start from the beginning. A total of n + 1 heads have occurred. But

    P[n + 1 heads | coin with two heads] = 1   and   P[n + 1 heads | fair coin] = (1/2)^(n+1)

Therefore, applying Bayes' theorem as in the previous part,

    P[coin with two heads | n + 1 heads] = (1 × 1/10) / (1 × 1/10 + (1/2)^(n+1) × 9/10)
                                         = 2^(n+1) / (9 + 2^(n+1))
(d) This probability is 1: a tail is impossible with the two-headed coin, so the coin must be
one of the nine ordinary coins. Verify this using Bayes' theorem.
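The Bayes calculations in parts (b) and (c) are easy to check numerically; here is a short Python sketch using exact rational arithmetic (the function name is ours, not from the sheet):

```python
from fractions import Fraction

def posterior_two_headed(n_heads):
    """P[two-headed coin | n_heads consecutive heads observed],
    starting from the 1/10 prior on the two-headed coin."""
    prior = Fraction(1, 10)
    like_two_headed = Fraction(1)            # the two-headed coin always shows heads
    like_fair = Fraction(1, 2) ** n_heads    # a fair coin shows heads each toss w.p. 1/2
    num = like_two_headed * prior
    return num / (num + like_fair * (1 - prior))

# Part (b): one head observed
print(posterior_two_headed(1))   # 2/11
# Part (c): 5 heads, i.e. n + 1 = 5, gives 2^5 / (9 + 2^5)
print(posterior_two_headed(5))   # 32/41
```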
2. Using the basic rules for handling probability density functions, expectations and
variances, we have:

(a) P[U ≥ 0.3] = ∫_{0.3}^{1} f(u) du = ∫_{0.3}^{1} 1 du = 0.7.

(b) E[U] = ∫ u f(u) du = ∫_{0}^{1} u du = 1/2.

(c) E[U²] = ∫_{0}^{1} u² f(u) du = 1/3, so Var[U] = E[U²] − (E[U])² = 1/3 − 1/4 = 1/12.

(d) E[log(1 + U)] = ∫_{0}^{1} log(1 + u) f(u) du = [(1 + u)(log(1 + u) − 1)]_{0}^{1}
    = 2(log 2 − 1) − (−1) = 2 log 2 − 1.

(e) It is straightforward to show that Y = a + (b − a)U has the required uniform distri-
bution. Hence, E[Y] = a + (b − a)/2 = (a + b)/2 and Var[Y] = (b − a)²/12.
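The integrals in (a) to (e) can be checked numerically with a simple midpoint Riemann sum; a Python sketch (the step count N is an arbitrary choice):

```python
import math

# Midpoint Riemann-sum check of the Uniform[0, 1] moments above.
N = 100_000
xs = [(k + 0.5) / N for k in range(N)]          # midpoints of a grid on [0, 1]
mean   = sum(xs) / N                            # E[U]        -> 1/2
second = sum(x * x for x in xs) / N             # E[U^2]      -> 1/3
logex  = sum(math.log(1 + x) for x in xs) / N   # E[log(1+U)] -> 2 log 2 - 1
print(mean, second - mean ** 2, logex)          # ≈ 0.5, 1/12, 0.3863
```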
3. Recall that if Y has the Poisson distribution with parameter λ then E[Y] = λ and
Var[Y] = λ.

(a) Therefore, Var[Y] = 3.

(b) The central limit theorem says that when Y₁, ..., Yₙ are independent and identically
distributed with mean μ and variance σ², the distribution of √n(Ȳ − μ)/σ for large n
will be approximately N(0, 1).

(c) Applying the central limit theorem, the distribution of Z = 10(Ȳ − 3)/√3 should be
approximately N(0, 1), since here √n = 10 and σ = √3 ≈ 1.732. But from tables of the
normal distribution, if Z ∼ N(0, 1) then P[Z ≤ 2.33] = 0.99 and so P[|Z| ≤ 2.33] = 0.98.
Therefore, P[|Ȳ − 3| ≤ 0.40] ≈ 0.98; that is, P[Ȳ ∈ [2.60, 3.40]] ≈ 0.98. Thus,
[2.60, 3.40] is the required interval.


4. Let Y denote the length of an ant. Then,

    P[misclassify an ant | type A] = P[Y > 8 | A] = P[Z > (8 − 6.5)/0.8] = P[Z > 1.875] ≈ 0.0304

where Z ∼ N(0, 1). Similarly, P[misclassify an ant | type B] = P[Z < (8 − 9.4)/0.9] ≈ 0.0599.
The overall misclassification rate is given by

    P[misclassify an ant]
    = P[misclassify | type A] P[type A] + P[misclassify | type B] P[type B]
    = 0.0304 × 0.7 + 0.0599 × 0.3 ≈ 0.0392

Intuitively, the cutoff point should be made larger to reduce the overall error rate: this
reduces the error rate for the type A ants, the more numerous of the two populations.
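These normal tail probabilities can be reproduced with the standard library alone, using the identity Φ(z) = (1 + erf(z/√2))/2; a short Python sketch:

```python
import math

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

cut = 8.0
pA = 1 - Phi((cut - 6.5) / 0.8)   # type A called B: P[Z > 1.875]
pB = Phi((cut - 9.4) / 0.9)       # type B called A: P[Z < -1.556]
overall = 0.7 * pA + 0.3 * pB     # mix with the 70/30 colony proportions
print(pA, pB, overall)            # ≈ 0.0304, 0.0599, 0.0392
```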
[The following remarks go beyond the intuitive explanation suggested above. First, notice
that the original cutoff point of 8 mm is slightly larger than the average of the type A and
type B expectations (7.95 mm): this is to account for the difference in standard deviations.
In general, for two populations with densities f₁(y) and f₂(y) and associated proportions
p₁ and p₂, it can be shown that the classification regions are determined by the solution(s)
to the equation p₁f₁(y) = p₂f₂(y), and with this choice the overall misclassification rate is
minimised. We can see that this might be the case by noticing that pᵢfᵢ(y) is proportional
to the conditional probability that an ant of length y belongs to population i; and an
intuitive rule would be to assign such an ant to the population with the largest of these
conditional probabilities. This rule generalises to any number of populations, and it turns
out that the overall misclassification rate is minimised.]
5. Here is the best strategy. Each prisoner does the following. If the hat colours of the other
two prisoners are different, they say nothing. If the two colours are the same, they guess the
opposite colour. This method will be successful unless all of the hats are of the same
colour. The chance that all are of the same colour is 1/4. (There is a 1/8 chance that all
are red, and 1/8 chance that all are blue.) Therefore this strategy has probability 3/4 of
success.
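Since there are only 2³ = 8 equally likely hat assignments, the 3/4 success probability can be verified by exhaustive enumeration; a Python sketch of the strategy:

```python
from itertools import product

def strategy_wins(hats):
    """hats: tuple of three colours (0/1). Each prisoner guesses the
    opposite colour iff the other two hats match, else stays silent."""
    guesses = []
    for i in range(3):
        others = [hats[j] for j in range(3) if j != i]
        if others[0] == others[1]:
            guesses.append((i, 1 - others[0]))
    # win: at least one guess made, and no guess is wrong
    return bool(guesses) and all(g == hats[i] for i, g in guesses)

wins = sum(strategy_wins(h) for h in product((0, 1), repeat=3))
print(wins / 8)   # 0.75: the strategy fails only when all three hats match
```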
2H Statistical Concepts 2013/14 Sheet 1 Sampling
1. [Hand in to be marked] This question illustrates sampling distributions and related
ideas in the context of a very small population with known x-values. The population
comprises 5 individuals with x-values {1, 3, 3, 7, 9}. A sample of size two (resulting in
values Y₁ and Y₂) is drawn at random and without replacement from the population.
(a) What is the value of N? Compute the population mean and the population variance.
(b) What is the value of n? Write down the (sampling) distribution of Y₁. What is the
(sampling) distribution of Y₂?
(c) Derive the exact (sampling) distribution of Ȳ and, in this case, check directly the
formulae for E[Ȳ] and Var[Ȳ] given in lectures.
(d) Derive the exact (sampling) distribution for the range of the two sample values
(the largest minus the smallest) and show that in this case the sample range is not
an unbiased estimator of the population range (the largest minus smallest of the
population values). Under what general conditions on the population size and values
and the sample size will the sample range be an unbiased estimator for the population
range?
2. This is a simple numerical (no context) exercise on some basic things you should have
learned so far. A simple random sample (without replacement) of size 25 from a population
of size 2000 yielded the following values:

    104 109 111 109  87
     86  80 119  88 122
     91 103  99 108  96
    104  98  98  83 107
     79  87  94  92  97

For the above data, Σ_{j=1}^{25} Y_j = 2451 and Σ_{j=1}^{25} Y_j² = 243505.

(a) Calculate an unbiased estimate of the population mean and of the population total.
(b) Calculate unbiased estimates of the population variance and of Var[Ȳ].
(c) Compute (estimated) standard errors for the population mean and for the population
total.
3. Among three boys, Andy has 3 sweets, Bill has 4 sweets and Charles has 5 sweets. Among
three girls, Doreen has 4 sweets, Eve has 6 sweets and Florence has 8 sweets. One boy
is selected at random, with number of sweets B₁, and independently, one girl is selected
with number of sweets G₁. Let D₁ = G₁ − B₁.
(a) Find the sampling distribution of D₁ and thus find, directly, the expected value and
variance of D₁.
(b) Find the expected value and variance of D₁ by first finding the corresponding values
for G₁ and B₁, and check that you get the same answers.
(c) A second boy is selected at random from the remaining two boys, with number of
sweets B₂, and a second girl is selected with number of sweets G₂. Let D₂ = G₂ − B₂.
Find the sampling distribution of D̄ = (D₁ + D₂)/2 and thus find the expected value
and variance of D̄.
(d) Find the expected value and variance of D̄, using the formulae for E(D₁ + D₂) and
Var(D₁ + D₂), and check that you get the same answers.
4. Show that with simple random sampling without replacement from a finite population the
random quantity

    (s²/n)(1 − n/N)    (usually denoted by s²_Ȳ)

is an unbiased estimator of Var[Ȳ], where

    s² = (1/(n − 1)) Σ_{i=1}^{n} (Yᵢ − Ȳ)².

[Hint: First show that

    Σ_{i=1}^{n} (Yᵢ − Ȳ)² = Σ_{i=1}^{n} Yᵢ² − nȲ²

and then use the expression for Var[Ȳ] given in lectures in combination with the general
result that E[Z²] = Var[Z] + (E[Z])² for any random quantity Z.]
Statistical Concepts 2013/14 Solutions 1 Sampling

1. (a) N = 5. μ = 23/5 = 4.6, σ² = (149/5) − (4.6)² = 8.64.
(b) n = 2. P[Y_j = 1] = P[Y_j = 7] = P[Y_j = 9] = 0.2, P[Y_j = 3] = 0.4, for j = 1, 2.
(c) Possible samples and corresponding values for the sample mean and range are

    Sample (y₁, y₂)  (1,3) (1,3) (1,7) (1,9) (3,3) (3,7) (3,9) (3,7) (3,9) (7,9)
    Mean (ȳ)           2     2     4     5     3     5     6     5     6     8
    Range (r)          2     2     6     8     0     4     6     4     6     2

Hence, the sampling distribution of Ȳ is

    ȳ           2    3    4    5    6    8
    P[Ȳ = ȳ]  0.2  0.1  0.1  0.3  0.2  0.1

E[Ȳ] = 2 × 0.2 + ··· + 8 × 0.1 = 4.6 = μ, and E[Ȳ²] = 2² × 0.2 + ··· + 8² × 0.1 = 24.4. Thus,
Var[Ȳ] = 24.4 − (4.6)² = 3.24. This agrees with

    Var[Ȳ] = ((N − n)/(N − 1)) σ²/n = ((5 − 2)/(5 − 1)) × 8.64/2 = 3.24
(d) The sampling distribution of R is

    r          0    2    4    6    8
    P[R = r]  0.1  0.3  0.2  0.3  0.1

E[R] = 0 × 0.1 + ··· + 8 × 0.1 = 4 < population range = 9 − 1 = 8.

In general, the sample range can never be larger than the population range. Therefore it will
be biased if there is a positive probability that the sample range will be smaller than the
population range. Therefore there are two situations:
If the population range is zero (all values in the population are the same), the sample
range will always be zero and will be unbiased.
If the population range is positive: let a denote the maximum value and b the minimum
value in the population; the sample range can only be unbiased if every sample must
contain at least one a and at least one b, as otherwise there would be positive probability
of obtaining a sample with smaller range than the population. The only way to guarantee
that both a and b appear in the sample is if n > N − min(N_a, N_b), where N_x denotes the
number of times the value x appears in the population.
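Parts (a) to (d) can all be checked by enumerating the ten equally likely samples of size two; a Python sketch (indices are used so the duplicated value 3 is handled correctly):

```python
from itertools import combinations

pop = [1, 3, 3, 7, 9]
samples = list(combinations(range(5), 2))         # 10 equally likely index pairs
means  = [(pop[i] + pop[j]) / 2 for i, j in samples]
ranges = [abs(pop[i] - pop[j]) for i, j in samples]

E_mean   = sum(means) / 10                        # should equal mu = 4.6
Var_mean = sum(m * m for m in means) / 10 - E_mean ** 2   # should be 3.24
E_range  = sum(ranges) / 10                       # 4 < population range 8: biased
print(E_mean, Var_mean, E_range)
```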
2. n = 25, N = 2000, Σ_{j=1}^{n} Y_j = 2451, Σ_{j=1}^{n} Y_j² = 243505, so
Σ_{j=1}^{n} (Y_j − Ȳ)² = 243505 − (2451²/25) = 3208.96.
(a) Ȳ = 98.04 is an unbiased estimate of the population mean μ; and T = 2000 × 98.04 = 196080
is an unbiased estimate of the population total τ.

(b) ((2000 − 1)/2000) × 3208.96/(25 − 1) = 133.64 is an unbiased estimate of the population
variance σ².

    s²_Ȳ = (s²/n)(1 − n/N) = (3208.96/(25 × 24)) × (1 − 25/2000) ≈ 5.28,

an unbiased estimate of Var[Ȳ].

(c) The estimated SE of Ȳ as an estimate of the population mean is s_Ȳ = 2.298; and the estimated
SE of T as an estimate of the population total is 2000 times this; namely, s_T = 4596.
3. (a) The possible values of D₁ are −1, 0, 1, 1, 2, 3, 3, 4, 5, each with probability 1/9. Therefore
E(D₁) = (−1 + 0 + 1 + 1 + 2 + 3 + 3 + 4 + 5)/9 = 2, and
Var(D₁) = E(D₁ − 2)² = (9 + 4 + 1 + 1 + 0 + 1 + 1 + 4 + 9)/9 = 10/3.

(b) E(B₁) = (3 + 4 + 5)/3 = 4 and E(G₁) = (4 + 6 + 8)/3 = 6, so

    E(G₁ − B₁) = E(G₁) − E(B₁) = 6 − 4 = 2.

Var(B₁) = E(B₁ − 4)² = 2/3 and Var(G₁) = E(G₁ − 6)² = 8/3, so, as G₁ and B₁ are
independent,

    Var(G₁ − B₁) = Var(G₁) + Var(B₁) = 10/3.
(c) If we choose two boys and two girls, then we leave one boy and one girl behind. Call their values
B₃, G₃ with D₃ = G₃ − B₃. As D₁ + D₂ + D₃ = 6, we have D̄ = (6 − D₃)/2. D₃ has the same
distribution as D₁, so that the possible values of D̄ are 7/2, 6/2, 5/2, 5/2, 4/2, 3/2, 3/2, 2/2, 1/2,
each with probability 1/9. So, we can find E(D̄) and Var(D̄) directly from this distribution, or
from

    E(D̄) = (6 − E(D₃))/2 = (6 − 2)/2 = 2,
    Var(D̄) = Var((6 − D₃)/2) = Var(D₃)/4 = 10/12.
(d)
    E(D̄) = (1/2)(E(D₁) + E(D₂)) = 2
    Var(D̄) = (1/4)Var(D₁ + D₂) = (1/4)(Var(D₁) + Var(D₂) + 2Cov(D₁, D₂)).

We have Var(D₁) = Var(D₂) = 10/3 and

    Cov(D₁, D₂) = Cov(G₁ − B₁, G₂ − B₂) = Cov(G₁, G₂) + Cov(B₁, B₂),

as G₁, G₂ are independent of B₁, B₂, so Cov(B₁, G₂) = Cov(G₁, B₂) = 0. From results in
lectures, we have that the covariance between any two values sampled without replacement
from a population is minus the variance of a single sample, divided by one less than the
population size, so that

    Cov(G₁, G₂) = −Var(G₁)/(3 − 1) = −8/6,   Cov(B₁, B₂) = −Var(B₁)/(3 − 1) = −2/6,
    Cov(D₁, D₂) = −8/6 − 2/6 = −10/6,

so

    Var(D̄) = (1/4)(10/3 + 10/3 − 20/6) = 10/12.
4. We have

    (n − 1)s² = Σ_{j=1}^{n} (Y_j − Ȳ)²
              = Σ_{j=1}^{n} (Y_j² − 2ȲY_j + Ȳ²)
              = Σ_{j=1}^{n} Y_j² − 2Ȳ Σ_{j=1}^{n} Y_j + Σ_{j=1}^{n} Ȳ²
              = Σ_{j=1}^{n} Y_j² − 2Ȳ·nȲ + nȲ²
              = Σ_{j=1}^{n} Y_j² − nȲ².
In what follows use (i) E[Y²] = Var[Y] + (E[Y])² for any random quantity Y, and
(ii) Var[Ȳ] = ((N − n)/(N − 1)) σ²/n. Then

    (n − 1)E[s²] = Σ_{j=1}^{n} E[Y_j²] − nE[Ȳ²]
                 = Σ_{j=1}^{n} (σ² + μ²) − n(μ² + ((N − n)/(N − 1)) σ²/n)
                 = nσ² (1 − (N − n)/(n(N − 1)))
                 = ((n − 1)N/(N − 1)) σ²

Therefore, E[s²] = (N/(N − 1)) σ². Hence

    E[s²_Ȳ] = (1/n)(1 − n/N) E[s²] = (1/n)(1 − n/N)(N/(N − 1)) σ²
            = ((N − n)/(N − 1)) σ²/n = Var[Ȳ]

Hence, s²_Ȳ is an unbiased estimator of Var[Ȳ].
2H Statistical Concepts 2013/14 Sheet 2
Estimators and Confidence Intervals
1. [Hand in to be marked] At the time of a historic potential challenge for the
leadership of the Conservative party (the "stalking horse" affair where Sir Anthony
Meyer challenged Mrs Thatcher for the leadership of the Conservative party), the
Independent newspaper performed an opinion poll to assess the level of support for
Mrs Thatcher. They asked 150 of the 377 Conservative MPs whether or not they
felt it was time for a change of leader and used the results to draw conclusions about
the level of support for Mrs Thatcher in the whole of the parliamentary party.
Supposing the actual level of support to be 40% among the 377 Conservative MPs,
(i) calculate the standard deviation of the proportion of the sample supporting
Mrs Thatcher, assuming simple random sampling; and (ii) using the Central Limit
Theorem, estimate the chance that the level of support in a sample of size 150 will
be in error by more than 1%.
Suppose that 50 in the sample of 150 said they supported Mrs Thatcher. Compute
an approximate 95% confidence interval for the percentage support (without assuming
that the actual level of support is 40%). Discuss whether or not this interval is
consistent with an actual level of support of 40%.
2. In a private library the books are kept on 130 shelves of similar size. The numbers
of books on 15 shelves selected at random were found to have a sum of 381 and
a sum of squares of 9947. Estimate the total number of books in the library and
provide an estimated standard error for your estimate. Give an approximate 95%
confidence interval for the total number of books in the library, and comment on
the reliability of the approximation in this instance.
3. In auditing, the following sampling method is sometimes used to estimate the total
unknown value τ = a₁ + a₂ + ··· + a_N of an inventory of N items, where aᵢ is the
(as yet unknown) audit value of item i and, as is often the case, a book value bᵢ
of each item i = 1, ..., N is readily available. [Think of second-hand cars with their
published "blue book" values, or stamps with their catalogue values.] A simple
random sample without replacement of size n is taken from the inventory, and for
each item j in the sample the difference D_j = A_j − B_j between the audited value
A_j and the book value B_j is recorded and the sample average D̄ = Ā − B̄ is formed.
The total inventory value is estimated as V = ND̄ + β, where β = b₁ + b₂ + ··· + b_N
is the known sum of the book values of the inventory.
(a) Show that V is an unbiased estimator of the total value τ of the inventory.
(b) Find an expression for the variance of the estimator V in terms of the population
variances σ²_a and σ²_b of the inventory values and book values and their covariance
σ_ab, where you may assume

    Cov(Ā, B̄) = ((N − n)/(N − 1)) σ_ab/n

[Note: σ_ab is defined to be Σ_{i=1}^{N} (aᵢ − μ_a)(bᵢ − μ_b)/N, where μ_a = τ/N and
μ_b = β/N are the inventory and book value population means; and when a = b we get the
usual variance formula.]
(c) Under what conditions on the variances and covariances of the inventory value
and book value populations will the variance of V be smaller than that of the usual
estimator NĀ of the total inventory τ?
(d) Under what circumstances will the answer to (c) be useful in practice?
Statistical Concepts 2012/13 Solutions 2
Estimators and Confidence Intervals
1. (i) Assuming true population proportion p = 0.4, the standard deviation of the estimator p̂ is

    σ_p̂ = √( (p(1 − p)/n) × (N − n)/(N − 1) )
        = √( (0.4(1 − 0.4)/150) × (377 − 150)/(377 − 1) ) = 0.031 (about 3%)

(ii)
    P[|p̂ − 0.4| > 0.01] = 2(1 − Φ(0.01/0.031)) = 0.748 (about a 75% chance)

(iii) p̂ = 50/150 = 0.333. The SE of p̂ is

    s_p̂ = √( (p̂(1 − p̂)/(n − 1)) × (1 − n/N) ) = 0.03

Hence, approximate 95% limits are 0.333 ± 1.96 × 0.03, leading to the interval [27.5%, 39.2%].
Since 40% is outside this interval, the data are not consistent with this level of support.
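A quick Python check of the standard error and the 95% limits in (iii):

```python
import math

# SE and approximate 95% limits for observed support of 50/150,
# with the finite-population correction for N = 377 MPs.
n, N = 150, 377
p_hat = 50 / n
se = math.sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(se, 3), round(lo, 3), round(hi, 3))   # ≈ 0.03, 0.275, 0.392
```

Note that 0.40 lies just above the upper limit, matching the conclusion drawn in the solution.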
2. n = 15, N = 130, Σ_{j=1}^{n} Y_j = 381, Σ_{j=1}^{n} Y_j² = 9947.

    s² = (9947 − 381²/15)/(15 − 1) = 19.257.

T = NȲ = 3302, an unbiased estimate of the total number of books in the library.

    s²_Ȳ = (s²/n)(1 − n/N) = 1.136.

s_T = N s_Ȳ = 138.54, the SE of T.

z_{0.025} = 1.96. Hence, an approximate 95% CI for the total number of books has limits
T ± z_{0.025} s_T, which evaluates to [3030, 3574] (nearest integer).

As n is not large, the accuracy of the CLT-based CI cannot be guaranteed.
3. (a) E[V] = NE[D̄] + β = N(μ_a − μ_b) + β = (τ − β) + β = τ.

(b) Var[V] = N² Var[D̄] = N² Var[Ā − B̄]
           = N² (Var[Ā] + Var[B̄] − 2Cov[Ā, B̄])
           = N² ( (σ²_a/n)((N − n)/(N − 1)) + (σ²_b/n)((N − n)/(N − 1)) − 2(σ_ab/n)((N − n)/(N − 1)) )
           = (N²(N − n)/(n(N − 1))) (σ²_a + σ²_b − 2σ_ab)

(c) Var[NĀ] = N² ((N − n)/(N − 1)) σ²_a/n > Var[V] when σ²_b < 2σ_ab.

(d) Useful, provided we have knowledge about the relative magnitudes of σ²_b and σ_ab. We
know the value of σ²_b but not σ_ab. The closer the audit and book values are related,
as measured by σ_ab, the more useful V would be.
Statistical Concepts 2013/14 Sheet 3
Probability Models and Goodness of Fit

1. [Hand in to be marked] The Poisson distribution has been used by traffic engineers as a model
for light traffic, based on the rationale that if the rate is approximately constant and the traffic is
light, so that cars move independently of each other, the distribution of counts of cars in a given
time interval should be nearly Poisson. The following table shows the numbers of right turns
during 300 three-minute intervals at a specific road intersection over various hours of the day and
various days of the week.

    # right turns   0   1   2   3   4   5   6   7   8   9  10  11  12  13+
    count          14  30  36  68  43  43  30  14  10   6   4   1   1    0

Estimate the rate parameter in the Poisson distribution. After pooling the last five cells
(explain why we do this), assess the fit of the Poisson distribution using Pearson's chi-square
statistic. Carefully explain any lack of fit.
2. Are birthrates constant throughout the year? Here are all the births in Sweden, in 1935, grouped
by season.
Spring (Apr-June) 23,385
Summer (Jul-Aug) 14,978
Autumn (Sep-Oct) 14,106
Winter (Nov-Mar) 35,804
(a) Carry out a chi-square test of the constant birthrate hypothesis for these data. Comment
on any divergences from constant birth rate.
(b) From the given data, construct a 95% confidence interval for the proportion of spring births.
Comment on the relationship of this analysis with part (a).
3. Capture-recapture. How does one estimate the number of fish in a lake? The following technique
is actually used for this and other problems concerning sizes of wildlife populations. A net is set
up to catch some fish. These are marked and returned to the lake. At a later date another batch
of fish are caught. The size of the fish population can then be estimated from seeing how many
of the marked fish have been caught in the second sample.
Argue that if M marked fish from the first stage are returned to the lake, then the probability
distribution P[Y = y | M, n, N] of the number Y of marked fish caught in a second catch of n
fish, when the total number of fish in the lake is N, is given by

    P[Y = y | M, n, N] = C(M, y) C(N − M, n − y) / C(N, n)

where C(a, b) denotes the binomial coefficient "a choose b".
Find E[Y | M, n, N] and hence suggest an estimator for N. Note that you can evaluate the
expectation without using P[Y = y | M, n, N]; explain how. Evaluate your estimator for
the possible values of y in the case when 6 fish are marked in the first stage and 10 fish are to be
caught in the second stage.
For an observed value y of Y, discuss how you might use P[Y = y | M, n, N] (when considered
as a function of N) as providing an alternative route to estimating N. [To clarify this, you
might consider by way of an example plotting P[Y = 3 | M = 6, n = 10, N] as a function of N,
corresponding to the specific situation described above in which 3 marked fish are caught at the
second stage.]
The probability model in this question is known as the hypergeometric distribution; it appears in
many contexts.
Statistical Concepts 2013/14 Solutions 3
Probability Models and Goodness of Fit

1. Total = 1168, n = 300, λ̂ = 1168/300 = 3.893. Pool cells to ensure E ≥ 5 in each cell.

          0     1     2     3     4     5     6     7    8    9+   Total
    O    14    30    36    68    43    43    30    14   10    12     300
    E   6.1  23.8  46.3  60.1  58.5  45.6  29.6  16.4  8.0   5.5     300
    X² 10.2   1.6   2.3   1.0   4.1   0.1  0.01   0.4  0.5   7.8   27.93

Pearson's chi-square statistic, X² = 27.93. Degrees of freedom, ν = (10 − 1) − 1 = 8.
Significance probability, P[X² > 27.93 | Poisson model] = 0.0005, which is very strong
evidence against the Poisson model. There are too many zero counts and too many counts
of 9 or more. The Poisson model is unlikely to be reasonable, as the traffic rate will tend
to vary over different times of the day and on different days of the week. A Poisson model
would more likely hold at a specific place during a specific time period on the same day of
the week, such as 7:00 am to 8:00 am on a Sunday, when traffic density is more likely to
be light and nearly constant.
2. (a) Under the simple model where births are independent and occur with the same prob-
ability every day (assume 365 days in a year), where we have seen a random sample
of n = 88,273 births from a hypothetical infinite population of such births, here is
the layout of the chi-square calculation.

    Season             Obs. freq O   Probability p   Exp. freq E = np   (O − E)   (O − E)²/E
    Spring (Apr-June)       23,385         0.24932             22,008     1,377        86.16
    Summer (Jul-Aug)        14,978         0.16986             14,994       −16         0.02
    Autumn (Sep-Oct)        14,106         0.16712             14,752      −646        28.29
    Winter (Nov-Mar)        35,804         0.41370             36,519      −715        14.00

From the table, the χ² statistic is 128.47, which should be compared to the null
distribution of χ² with 3 degrees of freedom. The p-value for this statistic is very small
(much smaller than the smallest value in your χ² tables), so that we can reject the
null hypothesis of equal birthrates on all days with a very low significance probability.
Note that with very large data sets we can be sensitive to very small discrepancies
from the null distribution. Looking at the table, we see more births than we would
expect in spring, compensated by fewer in autumn and winter.

(b) To see how far off equal birthrates the data are, we may assess the probability of birth
in each season. For spring, the estimate is p̂ = 23385/88273 = 0.265. Our confidence
interval is therefore 0.265 ± 1.96√(0.265 × 0.735/88273) = 0.265 ± 0.0029. If birth
rates were constant over days, then this probability would be 0.249. Note that this
value does not lie within the confidence interval. The observed value is about 6%
above the equal rate value, and our confidence interval suggests that the true ratio
is within 5% and 7%. You can check that the confidence intervals for autumn and
winter similarly fail to include the equal rate values.
3. Assuming no mixing, no deaths, etc., the expression for the quantity P[Y = y | M, n, N]
follows from 1H Probability; compare for example to counting the number of aces in a
poker hand.

Y = Y₁ + ··· + Yₙ, where Yᵢ = 1 if the i-th fish caught is marked and Yᵢ = 0 otherwise. Then

    E[Y | M, n, N] = nE[Y₁ | M, n, N] = nM/N

Equate the number of marked fish y (in the second sample) to nM/N to obtain the estimator
N* = nM/y. For M = 6, n = 10, we obtain

    y    0   1   2   3   4   5   6
    N*   ∞  60  30  20  15  12  10

For observed Y = y, we can use P[Y = y | M, n, N] as a likelihood function for N, and
find the value of N, N̂, which maximises this function. This is the maximum likelihood
estimator. For the above example with say y = 3, we have

    P[Y = 3 | M = 6, n = 10, N] = C(6, 3) C(N − 6, 7) / C(N, 10) ∝ C(N − 6, 7) / C(N, 10) = l(N)

l(N) = 0 for N < 13. Plot l(N) for N ≥ 13. Note that

    l(N + 1)/l(N) = (N − 5)(N − 9) / ((N − 12)(N + 1)) ≥ 1 for N ≤ 19

Therefore, the m.l.e. of N is not unique: N̂ = 19 or 20 (cf. N* = 20 when y = 3).
Statistical Concepts 2013/14 Sheet 4 Probability Models

1. [Hand in to be marked] A r.q. Y has a gamma distribution with parameters α, λ > 0
if its p.d.f. is

    f(y | α, λ) = (λ^α / Γ(α)) y^(α−1) e^(−λy)   for y ≥ 0,

and zero otherwise.
(a) Show that Var[Y | α, λ] = α/λ².
(b) Show that the moment generating function of Y is given by

    M_Y(t) = (λ/(λ − t))^α   for t < λ

and explain the condition on t.
(c) Deduce that if Y₁, ..., Yₙ is an independent random sample from such a gamma
distribution then their sum also has a gamma distribution, and give the parameters.

2. Overfield and Klauber (1980) published data on the incidence of tuberculosis in relation
to blood groups in a sample of Eskimos. For the data below, is there any association of
the disease and blood group within the ABO system?

    Severity                 O   A  AB   B
    Moderate-to-advanced     7   5   3  13
    Minimal                 27  32   8  18
    Not present             55  50   7  24

3. A family contains just two boys. Write down, evaluate, and sketch the likelihood func-
tion for the family size n, assuming the simple Bernoulli model where successive births
are independent and where the probability of a boy is 1/2 and the same for a girl. What
is the m.l.e. n̂ of n?
Statistical Concepts 2013/14 Solutions 4 Probability Models

1. (a) E[Y | α, λ] = α/λ (from lectures).

    E[Y² | α, λ] = ∫_{0}^{∞} y² (λ^α/Γ(α)) y^(α−1) e^(−λy) dy
                 = (Γ(α + 2)/(λ²Γ(α))) ∫_{0}^{∞} (λ^(α+2)/Γ(α + 2)) y^((α+2)−1) e^(−λy) dy
                 = Γ(α + 2)/(λ²Γ(α))
                 = (α + 1)αΓ(α)/(λ²Γ(α))
                 = α(α + 1)/λ²

Therefore

    Var[Y | α, λ] = α(α + 1)/λ² − (α/λ)² = α/λ²
(b) We can calculate the moment generating function similarly:

    M_Y(t) = E[e^(tY) | α, λ] = ∫_{0}^{∞} e^(ty) (λ^α/Γ(α)) y^(α−1) e^(−λy) dy
           = (λ/(λ − t))^α ∫_{0}^{∞} ((λ − t)^α/Γ(α)) y^(α−1) e^(−(λ−t)y) dy
           = (λ/(λ − t))^α   for t < λ

The condition on t is required as the integral does not converge/exist for other
values of t. As the moment generating function is defined for t in an open interval
including 0, it can be used to find all moments E(Y^r), and hence uniquely specifies
the distribution of Y (see your 1H Probability notes, or the textbook by Rice, for
more details on mgfs if needed to refresh your knowledge).

(c) Put Y = Y₁ + ··· + Yₙ. Then (1H Probability course), because the Yᵢ are independent,

    M_Y(t) = M_{Y₁}(t) M_{Y₂}(t) ··· M_{Yₙ}(t) = (λ/(λ − t))^(nα)

Therefore, Y ∼ Gamma(nα, λ).
2. Observed values are

    Severity                 O   A  AB   B
    Moderate-to-advanced     7   5   3  13
    Minimal                 27  32   8  18
    Not present             55  50   7  24

Expected values are

    Severity                  O      A    AB      B
    Moderate-to-advanced  10.01   9.78  2.02   6.18
    Minimal               30.38  29.70  6.14  18.78
    Not present           48.61  47.52  9.83  30.04

The (O − E)²/E entries are

    Severity                  O      A     AB      B
    Moderate-to-advanced  0.904  2.339  0.471  7.510
    Minimal               0.376  0.178  0.560  0.032
    Not present           0.840  0.130  0.815  1.214

giving a total of 15.36957. Under the hypothesis of no association between disease and
blood group, the null distribution is a chi-square distribution with (r − 1)(c − 1) = 3 × 2 = 6
degrees of freedom. The observed value corresponds to a significance level of 0.0176. (In
R this can be computed as 1-pchisq(15.36957,6).)
[Note that one of the cells, [AB, Moderate-to-advanced], has a small expected value. If we had a
large observed value in this cell, then our chi-square distribution approximation would not be
reliable.]
Thus, there is some evidence of association of disease and blood group within the ABO
system, which is mainly confined to A and B in Moderate-to-advanced, especially B.
3. The likelihood l(n) for n is given by

    l(n) = P[2 boys | n] = C(n, 2) (1/2)^n = n(n − 1)/2^(n+1)   for n = 2, 3, 4, 5, ...

and takes values 1/4, 3/8, 3/8, 5/16, ... Thus, the m.l.e. n̂ is not unique, as both n = 3 and
n = 4 maximise l(n). [Obviously, l(0) = l(1) = 0.]
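A quick exact check of these likelihood values in Python:

```python
from fractions import Fraction

def l(n):
    """Likelihood of family size n given exactly two boys, P(boy) = 1/2."""
    return Fraction(n * (n - 1), 2 ** (n + 1))

vals = {n: l(n) for n in range(2, 7)}
print(vals)   # values 1/4, 3/8, 3/8, 5/16, 15/64: maximised at both n = 3 and n = 4
```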
4. [Hand in to be marked]

(a) Likelihood, l(θ) = Π_{i=1}^{n} θ^(−1) exp(−yᵢ/θ) = θ^(−n) exp(−nȳ/θ) for θ > 0. Hence, it is
sufficient to know the values of (n, ȳ) to compute l(θ).

(b) L(θ) = −n log θ − nȳ/θ. Hence,

    L′(θ) = −n/θ + nȳ/θ² = 0  ⇒  θ̂ = ȳ    (and L″(θ̂) = −n/ȳ² < 0)

(c) E[Ȳ | θ] = E[Y | θ], where Y ∼ Gamma(α, λ) with α = 1, λ = 1/θ. Thus,
E[θ̂ | θ] = α/λ = θ (unbiased).
5. The likelihood is l(p) = Π_{i=1}^{n} (1 − p)p^(yᵢ−1) = (1 − p)^n p^(n(ȳ−1)). Put L(p) = log l(p). Then

    L′(p) = −n/(1 − p) + n(ȳ − 1)/p = 0  ⇒  p̂ = (ȳ − 1)/ȳ

For the data, Σ yᵢ = 1 × 48 + 2 × 31 + ··· + 12 × 1 = 363 and n = 48 + 31 + ··· + 1 = 130. Hence,
p̂ = (363 − 130)/363 = 0.642. Noting that E_{k+1} = p̂E_k and E₁ = 130(1 − p̂) = 46.6 (1 dp),
we obtain the following table:

    Hops   1     2     3     4     5     6    7    8    9   10   11   12   Total
    O     48    31    20     9     6     5    4    2    1    1    2    1     130
    E    46.6  29.9  19.2  12.3   7.9   5.1  3.3  2.1  1.3  0.9  0.6  0.4    130
    X²   .045  .042  .035  .891  .458  .001  (cells 7+ pooled: .397)       1.868

Pooled cells (7 and above) have expectation 9.09 = 130 × P[Y ≥ 7 | p̂ = 0.642] and make a
contribution of 0.397 to X². Degrees of freedom = 7 − 1 − 1 = 5, X² = 1.868 and

    P[X²₅ > 1.868 | geometric model] = 0.8672.

This large significance probability suggests that the geometric model fits well, perhaps
too well. However, why should the geometric distribution model the number of hops?
Statistical Concepts 2013/14 Sheet 5 Likelihood
1. [Hand in to be marked] Let y_1, ..., y_n be a random sample from a geometric distribution

   P[Y = y | p] = (1−p) p^{y−1},   y = 1, 2, ...

where p ∈ [0, 1]. [Remember from 1H Probability that this is the distribution for
the number of trials to the first failure in a sequence of independent trials with success
probability p.] Show that, if the model is correct, it is sufficient to know the sample
size n and the sample mean ȳ to evaluate the likelihood function. Find the maximum
likelihood estimator (m.l.e.) of p in terms of ȳ.
In an ecological study of the feeding behaviour of birds, the number of hops between
flights was counted for several birds. For the following data, fit a geometric distribution
to these data using the m.l.e. of p, and test for goodness of fit using Pearson's chi-square
statistic, remembering the rule of thumb to pool counts in adjacent cells so that all
resulting cell expectations are at least 5.

   Number of hops   1   2   3   4   5   6   7   8   9  10  11  12
   Count           48  31  20   9   6   5   4   2   1   1   2   1
2. Let y_1, ..., y_n be a random sample from an exponential distribution with p.d.f.

   f(y | θ) = (1/θ) e^{−y/θ},   0 ≤ y < ∞

and zero otherwise, where θ > 0.
(a) Show that if the model is correct it is sufficient to know the sample size n and the
sample mean ȳ to evaluate the likelihood function for any value of θ.
(b) Find the maximum likelihood estimator θ̂ of θ in terms of ȳ.
(c) Show that θ̂ is an unbiased estimator of θ; that is, E[θ̂ | θ] = θ for all θ > 0.
3. Suppose that a parameter θ can assume one of three possible values θ_1 = 1, θ_2 = 10
and θ_3 = 20. The distribution of a discrete random quantity Y, with possible values
y_1, y_2, y_3, y_4, depends on θ as follows:

         θ_1  θ_2  θ_3
   y_1    .1   .2   .4
   y_2    .1   .2   .3
   y_3    .2   .3   .1
   y_4    .6   .3   .2

Thus, each column gives the distribution of Y given the value of θ at the head of the
column.
(a) Write down the parameter space Θ.
(b) A single observation of Y is made. Sketch the likelihood function and evaluate the
m.l.e. θ̂ of θ for each of the possible values of Y.
(c) Evaluate the sampling distribution of θ̂; that is, for each θ compute the probability
distribution of θ̂, based on a single observation of Y. Display your answer in tabular
form.
(d) Is θ̂ an unbiased estimator of θ? Prove your result!
4. The random quantity Y has a geometric distribution with probability function

   P[Y = y | p] = (1−p) p^{y−1},   y = 1, 2, ...,   p ∈ [0, 1]

Show that P[Y > y | p] = p^y. Recall from 1H that Y counts the number of trials to the
first failure in a sequence of Bernoulli trials, each with success probability p.
As part of a quality control procedure for a certain mass production process, batches
containing very large numbers of components from the production are inspected for
defectives. We will assume the process is in equilibrium and denote by q the overall
proportion of defective components produced.
The inspection procedure is as follows. During each shift n batches are selected from
the production and for each such batch components are inspected until a defective one
is found, and the number of inspected components is recorded. At the end of the shift,
there may be some inspected batches which have not yet yielded a defective component;
and for such batches the number of inspected components is recorded.
Suppose at the end of one such inspection shift, a defective component was detected in
each of r of the batches, the recorded numbers of inspected components being y_1, ..., y_r.
Inspection of the remaining s = n − r batches was incomplete, the recorded numbers of
inspected components being c_1, ..., c_s.
(a) Show that the likelihood function for q based on these data is

   l(q) = q^r (1−q)^{y+c−r},   q ∈ [0, 1]

where y = y_1 + ... + y_r and c = c_1 + ... + c_s.
(b) Therefore, show that the maximum likelihood estimate of q is q̂ = 1/ā, where
ā = (y + c)/r.
Statistical Concepts 2013/14 Solutions 5 Likelihood
1. The likelihood is l(p) = Π_{i=1}^n (1−p) p^{y_i−1} = (1−p)^n p^{n(ȳ−1)}. Therefore, by the factorisation
criterion, ȳ is sufficient for p, if we know n.
Put L(p) = log l(p). Then

   L'(p) = −n/(1−p) + n(ȳ−1)/p = 0  ⟹  p̂ = (ȳ−1)/ȳ

(This is the maximum as L''(p) is negative for p ∈ [0, 1].)
For the data, Σ y_i = 1×48 + 2×31 + ... + 12×1 = 363, n = 48 + 31 + ... + 1 = 130. Hence,
p̂ = (363 − 130)/363 = 0.642. Noting that E_{k+1} = p̂ E_k and E_1 = 130(1−p̂) = 46.6 (1 dp)
we obtain the following table:

   Hops    1     2     3     4    5    6    7    8    9   10   11  12+  Total
   O      48    31    20     9    6    5    4    2    1    1    2    1   130
   E    46.6  29.9  19.2  12.3  7.9  5.1  3.3  2.1  1.3  0.9  0.6  1.0   130
   X²   .045  .042  .035  .891 .458 .001  [------- pooled: .397 ------]  1.868

Pooled cells (7 to 12+) have expectation 9.092 = 130 P[Y ≥ 7 | p = 0.642] and a contribution of
0.3967 to X². Degrees of freedom = 7 − 1 − 1 = 5, X² = 1.868 and

   P[X²_5 > 1.868 | geometric model] = 0.8672

This large significance probability suggests that the geometric model fits well, perhaps
too well. However, why should the geometric distribution model the number of hops?
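As a check on the arithmetic, the fitted model and the pooled chi-square statistic can be reproduced in a few lines. The sketch below is in Python rather than the course's R, and the variable names are our own; it keeps p̂ at full precision, so the statistic agrees with the 1.868 above only to rounding error.

```python
# Observed hop counts for 1..12 hops (the final cell is treated as "12+").
obs = [48, 31, 20, 9, 6, 5, 4, 2, 1, 1, 2, 1]
n = sum(obs)                                                  # 130 birds
total_hops = sum(k * o for k, o in enumerate(obs, start=1))   # 363
p_hat = (total_hops - n) / total_hops                         # m.l.e. (ybar-1)/ybar

# Expected counts under the fitted geometric model, pooling cells 7..12+
# so that the pooled expectation exceeds 5 (the rule of thumb).
exp_cells = [n * (1 - p_hat) * p_hat ** (k - 1) for k in range(1, 7)]
exp_cells.append(n * p_hat ** 6)                              # tail P[Y >= 7]
obs_cells = obs[:6] + [sum(obs[6:])]

X2 = sum((o - e) ** 2 / e for o, e in zip(obs_cells, exp_cells))
# 7 cells, 1 fitted parameter: 7 - 1 - 1 = 5 degrees of freedom.
```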
2. (a) Likelihood, l(θ) = Π_{i=1}^n θ^{−1} exp(−y_i/θ) = θ^{−n} exp(−n ȳ/θ) for θ > 0. Hence, it is
sufficient to know the values of (n, ȳ) to compute l(θ) (by the factorisation criterion).
(b) L(θ) = −n log θ − n ȳ/θ. Hence, the m.l.e. satisfies

   L'(θ) = −n/θ + n ȳ/θ² = 0  ⟹  θ̂ = ȳ   [as L''(θ̂) = −n/ȳ² < 0]

(c) E[θ̂ | θ] = E[Ȳ | θ] = E[Y | θ], where Y ∼ Gamma(α, β) with α = 1, β = 1/θ. Thus,
E[θ̂ | θ] = α/β = θ (unbiased).
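The claim in (b) that L(θ) peaks at θ̂ = ȳ is easy to spot-check numerically. A minimal Python sketch, with illustrative data that are not part of the sheet:

```python
import math

y = [0.7, 2.1, 0.4, 1.3, 3.2]            # made-up sample, for illustration only
n, ybar = len(y), sum(y) / len(y)

def loglik(theta):
    # L(theta) = -n log(theta) - n*ybar/theta for the exponential model
    return -n * math.log(theta) - n * ybar / theta

# The grid maximiser should sit at (or next to) theta = ybar.
grid = [0.01 * k for k in range(1, 1001)]
theta_best = max(grid, key=loglik)
```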
3. (a) Θ = {1, 10, 20}.
(b) θ̂(y_1) = 20, θ̂(y_2) = 20, θ̂(y_3) = 10, θ̂(y_4) = 1.
(c) There are three different (sampling) distributions (displayed as columns in the table
below) for θ̂, one for each θ.

   θ̂ \ θ    1   10   20
   20       .2   .4   .7
   10       .2   .3   .1
    1       .6   .3   .2

E.g., P[θ̂ = 20 | θ = 1] = P[Y = y_1 | θ = 1] + P[Y = y_2 | θ = 1] = 0.1 + 0.1 = 0.2.
(d) E[θ̂ | θ = 1] = 20×0.2 + 10×0.2 + 1×0.6 = 6.6 ≠ 1. Therefore, E[θ̂ | θ] ≠ θ for
at least one value of θ; so θ̂ is not an unbiased estimator of θ.
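The table in (c) and the bias check in (d) can be produced mechanically. A Python sketch (variable names are our own):

```python
# Columns of the question's table: for each theta, P[Y = y1..y4 | theta].
probs = {1: [.1, .1, .2, .6], 10: [.2, .2, .3, .3], 20: [.4, .3, .1, .2]}
mle = [20, 20, 10, 1]                      # theta-hat for y1, y2, y3, y4

# Sampling distribution of theta-hat for each theta.
samp = {}
for theta, p in probs.items():
    d = {}
    for pj, est in zip(p, mle):
        d[est] = d.get(est, 0) + pj        # accumulate P[theta-hat = est | theta]
    samp[theta] = d

# E[theta-hat | theta = 1]: should be 6.6, not 1, so theta-hat is biased.
mean_at_1 = sum(est * pr for est, pr in samp[1].items())
```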
4. (a) P[Y > y] = 1 − P[Y ≤ y] = 1 − Σ_{r=1}^{y} (1−p) p^{r−1} = p^y. The likelihood, i.e. the
probability of the observed data for a given p, is

   P[Y_1 = y_1, ..., Y_r = y_r, Y_{r+1} > c_1, ..., Y_n > c_s | p]
      = Π_{i=1}^r (1−p) p^{y_i−1} × Π_{j=1}^s p^{c_j} = (1−p)^r p^{y−r} p^c

(b) Therefore the log-likelihood for q = 1 − p is

   L(q) = r log q + (y + c − r) log(1 − q)

and

   L'(q) = r/q − (y + c − r)/(1 − q) = 0  ⟹  q̂ = 1/ā

where ā = (y + c)/r.
Statistical Concepts 2013/14 Sheet 6 Likelihood
1. [Hand in to be marked] An independent sample x = (x_1, ..., x_n) of size n is drawn from a
Rayleigh distribution with pdf

   f(x | θ) = (x/θ) e^{−x²/2θ},   x > 0
            = 0,                  x ≤ 0

with unknown parameter θ > 0.
(a) Show that the maximum likelihood estimator for θ is θ̂ = Σ_{i=1}^n x_i² / 2n.
(b) If X has a Rayleigh distribution, parameter θ, show that E(X²) = 2θ.
Therefore, show that Fisher's information for a sample of size one is 1/θ². Therefore,
write down the information in a sample of size n.
(c) Calculate an approximate 95% confidence interval for θ if n is large.
2. An offspring in a breeding experiment can be of three types with probabilities, independently
of other offspring,

   (1/4)(2 + p),   (1/2)(1 − p),   (1/4)p

(a) Show that for n offspring the probability that there are a, b and c of the three types,
respectively, is of the form

   K (2 + p)^a (1 − p)^b p^c

where K does not depend on p.
(b) Show that the maximum likelihood estimate p̂ of p is a root of np² + (2b + c − a)p − 2c = 0.
(c) Suppose that an experiment gives a = 58, b = 33 and c = 9. Find the m.l.e. p̂.
(d) Find Fisher's information, and give an approximate 95% confidence interval for p.
(e) Use p̂ to calculate expected frequencies of the three types of offspring, and test the
adequacy of the genetic model using Pearson's chi-square statistic.
3. A random quantity Y has a uniform distribution on the interval (0, θ) so that its p.d.f. is given
by

   f(y | θ) = 1/θ,   0 < y ≤ θ

and zero elsewhere. Why must θ be positive?
A random sample Y_1, ..., Y_n is drawn from this distribution in order to learn about the value
of θ. Show that the joint p.d.f. is

   f(y_1, ..., y_n | θ) = 1/θ^n,   0 < m ≤ θ

and zero elsewhere, where m = max{y_1, ..., y_n}.
Sketch the likelihood function of θ. What is the maximum likelihood estimate θ̂ of θ?
Derive the exact sampling distribution of M = max{Y_1, ..., Y_n}, and hence show that

   E[θ̂ | θ] = (1 − 1/(n+1)) θ

Is it surprising that θ̂ under-estimates θ? Provide an unbiased estimator of θ.
[Hint: To calculate the sampling distribution of M, first calculate its c.d.f. by noting that
M ≤ m if and only if all the Y_i are less than or equal to m; then find its p.d.f.]
4. Suppose that each individual in a very large population must fall into exactly one of k mutually
exclusive classes C_1, ..., C_k with probabilities p_1, ..., p_k. In a random sample of n such indi-
viduals from the population, let Y_1, ..., Y_k be the numbers falling in C_1, ..., C_k, respectively.
(a) Reason that

   P[Y_1 = y_1, Y_2 = y_2, ..., Y_k = y_k | p_1, p_2, ..., p_k] = K p_1^{y_1} p_2^{y_2} ... p_k^{y_k}

where K depends on y_1, ..., y_k but not on p_1, ..., p_k. For the remainder of this question
you will not need to know the expression for K.
(b) What is the distribution of the sum of any (proper) subset of Y_1, ..., Y_k? For example,
what is the distribution of Y_2, or of Y_2 + Y_4?
(c) Suppose that in the population of twins, males (M) and females (F) are equally likely to
occur and that the probability that the twins are identical is θ. If twins are not identical
their genders are independent. Show that P[MM] = P[FF] = (1 + θ)/4 and P[MF] =
(1 − θ)/2.
(d) Suppose that n twins are sampled and it is found that y_1 are MM, y_2 are FF and y_3 are
MF, but it is not known which twins are identical. Find the m.l.e. θ̂ of θ in terms of n
and y_1, y_2, y_3.
(e) Is θ̂ an unbiased estimator of θ?
(f) Compute the variance of θ̂ exactly.
(g) Find Fisher's information for this sampling experiment, and therefore find the large
sample approximation to the variance of θ̂. Compare your answer to the result of part
(f).
Statistical Concepts 2013/14 Solutions 6 Likelihood
1. (a) The likelihood of x is

   l(θ) = Π_{i=1}^n f(x_i | θ) = Π_{i=1}^n (x_i/θ) exp(−x_i²/2θ)
        = (Π_{i=1}^n x_i / θ^n) exp(−Σ_{i=1}^n x_i² / 2θ)

Therefore

   L(θ) = log l(θ) = log(Π_{i=1}^n x_i) − n log θ − T(x)/2θ

where T(x) = Σ_{i=1}^n x_i², so that

   dL/dθ = −n/θ + T(x)/2θ² = 0

when θ = T(x)/2n. To check that this is a maximum, we evaluate

   d²L/dθ² = n/θ² − T(x)/θ³ = (1/θ²)[n − T(x)/θ]

so

   d²L/dθ² (θ̂) = (1/θ̂²)[n − 2n] = −4n³/T(x)² < 0,

so θ̂ is the maximum.
(b) Integrating by parts, we have

   E(X²) = ∫_0^∞ x² f(x|θ) dx = ∫_0^∞ x² (x/θ) e^{−x²/2θ} dx
         = [−x² e^{−x²/2θ}]_0^∞ + 2 ∫_0^∞ x e^{−x²/2θ} dx = 2θ

Therefore, Fisher's information, for a sample of size one, is

   I(θ) = −E(d²/dθ² log f(X|θ)) = E((1/θ²)(X²/θ − 1)) = (1/θ²)(2θ/θ − 1) = 1/θ²

so that the information in a sample of size n is n/θ².
(c) For large n, the probability distribution of θ̂ is approximately normal, mean θ, variance 1/nI(θ). Therefore
an approximate 95% confidence interval for θ is θ̂ ± 1.96 √(1/nI(θ)). We estimate θ by θ̂, so the approximate
95% confidence interval is θ̂ ± 1.96 √(T²(x)/4n³).
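A quick simulation check of the Rayleigh m.l.e. and its large-sample standard error; the seed, sample size and true θ below are arbitrary choices, not part of the sheet. The sampler uses the inverse-transform method: if U ∼ Unif(0,1), then √(−2θ log U) has the pdf above, since P(X > x) = e^{−x²/2θ}.

```python
import math
import random

random.seed(42)
theta_true = 2.0
# Inverse-transform sampling from the Rayleigh distribution with parameter theta.
xs = [math.sqrt(-2 * theta_true * math.log(random.random())) for _ in range(20000)]

n = len(xs)
theta_mle = sum(x * x for x in xs) / (2 * n)     # m.l.e. from part (a)
se_approx = theta_mle / math.sqrt(n)             # sqrt(1/(n I(theta))), I(theta) = 1/theta^2
ci = (theta_mle - 1.96 * se_approx, theta_mle + 1.96 * se_approx)
```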
2. (a)

   P[A = a, B = b, C = c | p, n] = (n!/(a! b! c!)) [(1/4)(2+p)]^a [(1/2)(1−p)]^b [(1/4)p]^c
                                 = K (2+p)^a (1−p)^b p^c

(b)

   L(p) = const + a log(2+p) + b log(1−p) + c log p
   L'(p) = a/(2+p) − b/(1−p) + c/p = 0  ⟹  np² + (2b + c − a)p − 2c = 0

(c) a = 58, b = 33, c = 9, n = 100 ⟹ 100p² + 17p − 18 = 0 ⟹ p̂ = (−17 + √7489)/200 = 0.3477.
This value is a maximum as

   L''(p) = −( a/(2+p)² + b/(1−p)² + c/p² ) < 0

(d) For a sample of size 1, given p,

   E(a) = (2+p)/4,   E(b) = (1−p)/2,   E(c) = p/4.

So, Fisher's information is

   I(p) = −E(L''(p)) = E(a)/(2+p)² + E(b)/(1−p)² + E(c)/p²
        = 1/(4(2+p)) + 1/(2(1−p)) + 1/(4p)

which we approximate as

   I(p̂) = 1.59202

The large sample approximation to the variance of p̂, with n = 100, is therefore

   Var(p̂) ≈ 1/(n I(p̂)) = 1/159.202

Hence, the large sample SE of p̂ is s_p̂ = 1/√159.202 = 0.07925, leading to [0.1924, 0.5030] as an approximate
95% CI for p.
(e) Computing expected values (E) in the usual way we obtain:

   O           58     33      9    100
   E         58.69  32.62   8.69   100
   (O−E)²/E  .0082  .0045  .0109  .0236

Degrees of freedom = 3 − 1 − 1 = 1, X² = 0.0236 and

   P[χ²_1 > 0.0236 | genetic model is correct] = 0.8779

This large significance probability suggests that the genetic model fits well, perhaps too well.
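The numbers in (c)–(e) can all be verified directly; a Python sketch (our own variable names):

```python
import math

a, b, c = 58, 33, 9
n = a + b + c                                    # 100 offspring

# (c) m.l.e.: positive root of n p^2 + (2b + c - a) p - 2c = 0
A, B, C = n, 2 * b + c - a, -2 * c
p_hat = (-B + math.sqrt(B * B - 4 * A * C)) / (2 * A)

# (d) Fisher information per observation and approximate 95% CI
info = 1 / (4 * (2 + p_hat)) + 1 / (2 * (1 - p_hat)) + 1 / (4 * p_hat)
se = 1 / math.sqrt(n * info)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# (e) Pearson chi-square against the fitted expected frequencies
expected = [n * (2 + p_hat) / 4, n * (1 - p_hat) / 2, n * p_hat / 4]
X2 = sum((o - e) ** 2 / e for o, e in zip([a, b, c], expected))
```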
3. We must have θ > 0 for f(y|θ) to be non-negative.
The joint pdf is

   f(y_1, ..., y_n | θ) = θ^{−n}  for 0 < y_i ≤ θ, i = 1, ..., n
                        = 0       otherwise

But {0 < y_i ≤ θ, i = 1, ..., n} ⟺ {0 < m ≤ θ}, where m = max{y_1, ..., y_n}, and the result follows.
The likelihood function is

   l(θ) = 0      if θ < m
        = θ^{−n} if θ ≥ m

so that θ̂ = m.

   P[M ≤ m | θ] = P[Y_1 ≤ m, ..., Y_n ≤ m | θ] = 0        if m ≤ 0
                                               = (m/θ)^n  if 0 < m ≤ θ
                                               = 1        if m > θ

Hence

   E[θ̂ | θ] = ∫_0^θ m (n m^{n−1}/θ^n) dm = (1 − 1/(n+1)) θ

We would expect the true value of θ to be larger than the largest observation, so the result is not surprising.
An unbiased estimator of θ is [(n+1)/n] M, which is bigger than M = θ̂.
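A seeded simulation makes the downward bias of M, and the effect of the (n+1)/n correction, visible; the values of θ, n and the seed below are arbitrary illustrations:

```python
import random

random.seed(0)
theta, n, reps = 5.0, 10, 20000

# Sampling distribution of M = max of n Uniform(0, theta) draws.
ms = [max(random.uniform(0, theta) for _ in range(n)) for _ in range(reps)]
mean_m = sum(ms) / reps                          # theory: n*theta/(n+1) = 4.545...
mean_unbiased = (n + 1) / n * mean_m             # corrected estimator, near theta
```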
4. (a) One way of observing y_1, ..., y_k is

   C_1 ... C_1 (y_1 times), C_2 ... C_2 (y_2 times), ..., C_k ... C_k (y_k times)

with probability

   p_1 ... p_1 × p_2 ... p_2 × ... × p_k ... p_k = p_1^{y_1} p_2^{y_2} ... p_k^{y_k}

Thus

   P[Y_1 = y_1, Y_2 = y_2, ..., Y_k = y_k | p_1, p_2, ..., p_k] = K p_1^{y_1} p_2^{y_2} ... p_k^{y_k}

where K is the number of different ways this event can occur.
(b) Let Y = the sum of a proper subset of Y_1, Y_2, ..., Y_k and p the sum of the corresponding subset of p_1, p_2, ..., p_k.
Then Y is the number of observations that fall into the corresponding disjoint union of classes from C_1, C_2, ..., C_k.
Thus, Y ∼ Binomial(n, p).
(c)

   P[MM] = P[MM | I] P[I] + P[MM | I^c] P[I^c] = (1/2)θ + (1/2)(1/2)(1−θ) = (1+θ)/4 = P[FF]
   P[MF] = 1 − P[FF] − P[MM] = (1−θ)/2

(d) The likelihood is

   l(θ) = C [(1+θ)/4]^{y_1} [(1+θ)/4]^{y_2} [(1−θ)/2]^{y_3}

Therefore, the log-likelihood is

   L(θ) = constant + (y_1 + y_2) log(1+θ) + y_3 log(1−θ)

Differentiating wrt θ we obtain

   L'(θ) = (y_1 + y_2)/(1+θ) − y_3/(1−θ) = 0

which has solution θ̃ = 1 − 2y_3/n. This θ̃ will be the maximum likelihood estimator provided that θ̃ ≥ 0, so
then θ̂ = θ̃ = 1 − 2y_3/n, otherwise θ̂ = 0. (Check that θ̃ is the maximum by checking L''(θ̃) < 0 as in part (g)
below.)
(e) Observe that the expectation of θ̃ is found as

   E[θ̃ | θ] = 1 − (2/n) E[Y_3 | θ] = 1 − (2/n) n (1−θ)/2 = θ

so θ̃ is unbiased. We know that θ̂ ≥ θ̃, but actually there is a positive probability that θ̃ < 0, in which case
θ̂ = 0 > θ̃; hence E[θ̂ | θ] > θ, so the estimator θ̂ is biased.
(f) For large n, we have θ̂ ≈ θ̃. So, we may find the variance of θ̂ approximately as

   Var[θ̂ | θ] ≈ Var[θ̃ | θ] = (2/n)² Var[Y_3 | θ] = (2/n)² n ((1−θ)/2)((1+θ)/2) = (1−θ²)/n

(g)

   L''(θ) = −(y_1+y_2)/(1+θ)² − y_3/(1−θ)²

For a sample of size 1, given θ,

   E(y_1) = E(y_2) = (1+θ)/4,   E(y_3) = (1−θ)/2.

So, Fisher's information is

   I(θ) = −E(L''(θ)) = [E(y_1)+E(y_2)]/(1+θ)² + E(y_3)/(1−θ)²
        = ((1+θ)/2)/(1+θ)² + ((1−θ)/2)/(1−θ)² = 1/(1−θ²)

The large sample approximation to the variance of θ̂ is therefore

   Var(θ̂) ≈ 1/(n I(θ)) = (1−θ²)/n

which, in this case, is the same value as found in (f).
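The m.l.e. θ̂ = max(1 − 2y₃/n, 0) and the variance approximation (1−θ²)/n are easy to check by seeded simulation; the true θ, sample size and seed below are arbitrary choices:

```python
import random

random.seed(7)
theta, n, reps = 0.4, 500, 2000
p_mf = (1 - theta) / 2                           # P[MF] from part (c)

est = []
for _ in range(reps):
    y3 = sum(random.random() < p_mf for _ in range(n))   # number of MF pairs
    est.append(max(1 - 2 * y3 / n, 0))           # m.l.e. from part (d)

mean_est = sum(est) / reps
var_est = sum((e - mean_est) ** 2 for e in est) / reps
# Large-sample theory predicts variance (1 - theta^2)/n = 0.00168.
```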
Statistical Concepts 2013/14 Sheet 7 Sample information
1. [Hand in to be marked] A thousand individuals were classified according to gender and
whether or not they were colourblind:

               Male  Female
   Normal       442     514
   Colourblind   38       6

According to genetic theory, each individual, independently of other individuals, has the
following probabilities of belonging to the above categories:

               Male     Female
   Normal      (1/2)p   pq + (1/2)p²
   Colourblind (1/2)q   (1/2)q²

where q = 1 − p.
(a) Show that the maximum likelihood estimate q̂ of q is 0.0871, to four decimal places.
(b) Compute the large sample estimated standard error for the maximum likelihood esti-
mate, using the observed information. Hence, find an approximate 99% confidence
interval for q.
2. Evaluate and compare
(i) the estimated sample information, n I(θ̂), and
(ii) the observed information, −L''(θ̂), for the given sample,
for each of the following situations.
(a) A sample X from a binomial distribution, parameters n (known) and p (unknown).
(b) An iid sample Y_1, ..., Y_n of size n, from a Poisson distribution, parameter λ.
Statistical Concepts 2013/14 Solutions 7 Sample information
1. (a) Putting a = 442, b = 514, c = 38, d = 6, n = 1000, the likelihood is

   l(q) = constant × p^a (pq + (1/2)p²)^b q^c (q²)^d ∝ (1−q)^{a+b} (1+q)^b q^{c+2d}
   L(q) = log l(q) = constant + (a+b) log(1−q) + b log(1+q) + (c+2d) log q
   L'(q) = −(a+b)/(1−q) + b/(1+q) + (c+2d)/q = 0
        ⟹ (a + 2b + c + 2d) q² + a q − (c + 2d) = 0
        ⟹ 760 q² + 221 q − 25 = 0 ⟹ q̂ = 0.0871

(Check q̂ is a maximum by checking L''(q̂) < 0 as below.)
(b) Differentiating again, we have

   L''(q̂) = −[ (a+b)/(1−q̂)² + b/(1+q̂)² + (c+2d)/q̂² ]

Substituting the observed values of a, b, c, d and q̂, the observed information is

   −L''(q̂) = 1147.127 + 434.935 + 6590.733 = 8172.79

Hence, the estimated standard error of q̂ is s_q̂ = √(1/(−L''(q̂))) = 0.011. As z_{0.005} = 2.5758, an approximate
99% confidence interval for q has limits

   0.0871 ± 2.5758 × 0.011 ⟹ [0.059, 0.116]
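The whole of question 1 can be reproduced numerically; a Python sketch (variable names ours):

```python
import math

a, b, c, d = 442, 514, 38, 6

# (a) m.l.e.: positive root of (a + 2b + c + 2d) q^2 + a q - (c + 2d) = 0
A, B, C = a + 2 * b + c + 2 * d, a, -(c + 2 * d)
q_hat = (-B + math.sqrt(B * B - 4 * A * C)) / (2 * A)

# (b) Observed information -L''(q_hat), standard error and 99% CI
obs_info = (a + b) / (1 - q_hat) ** 2 + b / (1 + q_hat) ** 2 + (c + 2 * d) / q_hat ** 2
se = math.sqrt(1 / obs_info)
ci = (q_hat - 2.5758 * se, q_hat + 2.5758 * se)   # z_{0.005} = 2.5758
```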
2. (a) (i) First find Fisher's information for a sample, z say, of size 1. As in lectures, the likelihood is

   l(p) = f(z|p) = p^z (1−p)^{1−z}

So, if L(p) = ln(l(p)), then

   L''(p) = −z/p² − (1−z)/(1−p)²

so that

   I(p) = −E(L''(p)) = p/p² + (1−p)/(1−p)² = 1/(p(1−p))

From lectures, the maximum likelihood estimate for p given the binomial sample X = x is p̂ = x/n.
Therefore, we estimate the sample information as

   n I(p̂) = n/(p̂(1−p̂)) = n/[(x/n)(1 − x/n)] = n³/(x(n−x))

(ii) Alternately, writing out the likelihood for the sample X = x of size n, we have

   l(p) = (n!/(x!(n−x)!)) p^x (1−p)^{n−x}

so that

   L''(p) = −x/p² − (n−x)/(1−p)²

Therefore the observed information is

   −L''(p̂) = x/p̂² + (n−x)/(1−p̂)² = x/(x/n)² + (n−x)/(1−x/n)² = n³/(x(n−x))

Observe that (i) and (ii) are the same in this case.
(b) (i) First, find Fisher's information for a sample, y, of size 1. As in lectures, the likelihood is

   f(y|λ) = e^{−λ} λ^y / y!

Therefore

   L''(λ) = −y/λ²

so that

   I(λ) = −E(L''(λ)) = 1/λ

From lectures, the maximum likelihood estimate for λ given sample values y_1, ..., y_n is λ̂ = ȳ, the mean of
the n observations. Therefore, we estimate the sample information as

   n I(λ̂) = n/λ̂ = n/ȳ

(ii) Alternately, writing out the likelihood for the sample we have

   l(λ) = Π_{i=1}^n e^{−λ} λ^{y_i} / y_i!

so that

   L''(λ) = −Σ_{i=1}^n y_i / λ²

Therefore the observed information is

   −L''(λ̂) = Σ_{i=1}^n y_i / λ̂² = n ȳ / ȳ² = n/ȳ

Observe that (i) and (ii) are the same in this case.
(They are not always the same!)
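For the binomial case the agreement n I(p̂) = −L''(p̂) = n³/(x(n−x)) can be spot-checked numerically; the values of n and x below are arbitrary:

```python
n, x = 40, 13
p_hat = x / n

estimated_info = n / (p_hat * (1 - p_hat))                      # n * I(p-hat)
observed_info = x / p_hat ** 2 + (n - x) / (1 - p_hat) ** 2     # -L''(p-hat)
closed_form = n ** 3 / (x * (n - x))                            # algebraic identity
```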
Statistical Concepts 2013/14 Sheet 8 LR Tests
1. [Hand in to be marked] An independent, identically distributed sample, x = (x_1, ..., x_n),
of size n, is drawn from a Poisson distribution with parameter λ. We want to test the null
hypothesis H_0: λ = λ_1 against the alternative hypothesis H_1: λ = λ_2, where λ_1 < λ_2.
(a) Write down the likelihood ratio for the data, and show that all likelihood ratio tests
of H_0 against H_1 are of the form: Reject H_0 if Σ_{i=1}^n x_i > c, for some c.
(b) Suppose that n = 50, λ_1 = 2, λ_2 = 3. By using the central limit theorem, find,
approximately,
(i) the value of c for which the significance level of the test is 0.01.
(ii) the power of the test for this choice of c.
2. We want to construct a test of hypothesis H_0 against H_1, based on observation of a random
quantity Y, which takes possible values 1, 2, 3, 4, 5, with probabilities, given H_0 and H_1, as
follows.

          1    2    3    4    5
   H_0   .4   .2   .2   .1   .1
   H_1   .1   .2   .2   .2   .3

(a) Suppose that α_0(δ) is the probability that the test δ accepts H_1, if H_0 is true, and
α_1(δ) is the probability that δ accepts H_0, if H_1 is true. Suppose that we are a
little more concerned to avoid making the first type of error than we are to avoid
making the second type of error. Therefore, we decide to construct the test δ* which
minimises the quantity W(δ) = 1.5 α_0(δ) + α_1(δ). Find the test δ*, and find the values
of α_0(δ*), α_1(δ*).
(b) In the above example, suppose that we replace W(δ) = 1.5 α_0(δ) + α_1(δ) by W(δ, c) =
c α_0(δ) + α_1(δ). Find the corresponding optimal test δ_c, and find the corresponding
values α_0(δ_c), α_1(δ_c) for each value of c > 0.
3. If gene frequencies AA, Aa, aa are in Hardy-Weinberg equilibrium, then the gene frequen-
cies are (1−θ)², 2θ(1−θ), θ², for some value of θ.
Suppose that we wish to test the null hypothesis H_0: θ = 1/3, against the alternative
H_1: θ = 2/3, based on the numbers of individuals, x_1, x_2, x_3, with the given genotypes in a
sample of n individuals.
(a) Find the general form of the likelihood ratio test.
(b) If n = 36, find, approximately, the test with significance level 0.01, and find the power
of this test.
[Hint: You will need to find the mean and variance of (x_3 − x_1). First find these for
a sample of size n = 1.]
Comment on possible improvements to this choice of test procedure.
Statistical Concepts 2013/14 Solutions 8 LR tests
1. (a) The Poisson distribution, parameter λ, has frequency function

   f(x|λ) = e^{−λ} λ^x / x!,   x = 0, 1, ...

Therefore the likelihood, for the data x = (x_1, ..., x_n), given λ, is

   l(λ) = Π_{i=1}^n e^{−λ} λ^{x_i} / x_i! = e^{−nλ} λ^{Σ_{i=1}^n x_i} / Π_{i=1}^n x_i!

Therefore the likelihood ratio for the data is

   LR(x) = l(λ_2)/l(λ_1) = e^{−n(λ_2−λ_1)} (λ_2/λ_1)^{Σ_{i=1}^n x_i}

Each likelihood ratio test is of form: Reject H_0 if LR(x) > k for some k.
As LR(x) is a monotone function of Σ_{i=1}^n x_i, this is equivalent to the test:
Reject H_0 if Σ_{i=1}^n x_i > c for some c.
(b) As each X_i has a Poisson distribution, parameter λ, X_i has mean and variance equal to the value of
λ. Therefore, T = Σ_{i=1}^n X_i has mean and variance equal to nλ. If n = 50, then approximately
T has a normal distribution, by the central limit theorem, so that approximately T is distributed
as N(nλ, nλ).
(i) We want to choose c so that, if n = 50 and λ = 2, then P(T > c) = 0.01. In this case, approximately,
T is N(100, 100). Therefore we want

   0.01 = P(T > c) = 1 − P((T − 100)/10 ≤ (c − 100)/10) ≈ 1 − Φ((c − 100)/10)

Therefore, from tables, (c − 100)/10 = 2.33 so that c = 123.3.
(ii) The power of the test is the probability of rejecting H_0 if H_1 is true, i.e. we want to calculate
P(T > 123.3) if n = 50 and λ = 3. In this case, approximately, T is N(150, 150). Therefore, the power of
the test is

   P(T > 123.3) = 1 − P((T − 150)/√150 ≤ (123.3 − 150)/√150) ≈ 1 − Φ((123.3 − 150)/√150)
                = 1 − Φ(−2.18) = 0.985
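The critical value and power computed above follow from two normal-tail calculations; a Python sketch using the error function for Φ (the z-value 2.33 is taken from tables, as in the solution):

```python
import math

def Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, lam0, lam1 = 50, 2, 3
# (i) critical value: T approx N(100, 100) under H0, upper 1% point z = 2.33
c = n * lam0 + 2.33 * math.sqrt(n * lam0)        # 123.3
# (ii) power: T approx N(150, 150) under H1
power = 1 - Phi((c - n * lam1) / math.sqrt(n * lam1))
```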
2. (a) As shown in the lectures, the test which minimises a α_0(δ) + b α_1(δ) is to accept H_0 if LR(y) < a/b,
to accept H_1 if LR(y) > a/b, and to accept either if LR(y) = a/b, where LR(y) = f_1(y)/f_0(y). The likelihood
ratio values are as follows.

           1    2    3    4    5
   H_0    .4   .2   .2   .1   .1
   H_1    .1   .2   .2   .2   .3
   LR(y) .25    1    1    2    3

As a = 1.5, b = 1, the optimal test δ* accepts H_1 if Y is 4 or 5, and accepts H_0 if Y is 1, 2 or 3.
Therefore α_0(δ*), the probability that δ* accepts H_1, if H_0 is true, equals the probability of observing Y
to be 4 or 5, given H_0, which is 0.2. Similarly, α_1(δ*) equals the probability of observing Y to be 1, 2 or
3, given H_1, which is 0.5.
(b) The acceptance set for H_0 is empty if c < 0.25, with (α_0, α_1) = (1, 0). For 0.25 < c < 1, add the value
y = 1, with (α_0, α_1) = (0.6, 0.1). For 1 < c < 2, also add the values 2 and 3, with (α_0, α_1) = (0.2, 0.5).
For 2 < c < 3, also add 4, with (α_0, α_1) = (0.1, 0.7). For c > 3, add y = 5, with (α_0, α_1) = (0, 1).
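The error trade-off in (b) can be enumerated by accepting H_1 exactly when LR(y) > c; a Python sketch:

```python
f0 = {1: .4, 2: .2, 3: .2, 4: .1, 5: .1}
f1 = {1: .1, 2: .2, 3: .2, 4: .2, 5: .3}
lr = {y: f1[y] / f0[y] for y in f0}              # .25, 1, 1, 2, 3

def errors(c):
    # Accept H1 when LR(y) > c; return (alpha0, alpha1) for that test.
    reject = [y for y in lr if lr[y] > c]
    alpha0 = sum(f0[y] for y in reject)          # P(accept H1 | H0)
    alpha1 = sum(f1[y] for y in f1 if y not in reject)  # P(accept H0 | H1)
    return alpha0, alpha1

a0_star, a1_star = errors(1.5)                   # the weighted test of part (a)
```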
3. (a) The likelihood, for general θ, is

   l(θ) = f(x_1, x_2, x_3 | θ) = (n!/(x_1! x_2! x_3!)) [(1−θ)²]^{x_1} [2θ(1−θ)]^{x_2} [θ²]^{x_3}

Therefore, the likelihood ratio is

   LR(x_1, x_2, x_3) = l(2/3)/l(1/3) = 4^{(x_3 − x_1)}

The likelihood ratio is monotonic in x_3 − x_1. Therefore, the general form of the LR test is:
Reject H_0 if [x_3 − x_1] > c.
(b) As the sample size is reasonably large, approximately, by the central limit theorem, X = X_3 − X_1 has a
normal distribution.
To find the mean and variance of this distribution, consider a sample of n = 1. For general θ, the possible
values of X if n = 1 are −1, 0, +1, with probabilities (1−θ)², 2θ(1−θ), θ² respectively. Therefore, if
θ = 1/3, X takes values −1, 0, 1 with probabilities 4/9, 4/9, 1/9, so that E(X) = −1/3, Var(X) = 4/9.
Therefore, the distribution of X, when n = 36, is approximately normal, with mean
μ = −36/3 = −12, and variance σ² = 36 × (4/9) = 4².
We want to choose a value for c so that P(X > c) = 0.01, when X ∼ N(−12, 4²). From normal tables,
the upper 1% point of the standard normal is 2.33. Therefore c = −12 + 2.33 × 4 = −2.68 gives the
critical value for a test at significance level 0.01.
From the symmetry of the specification, the distribution of X under H_1 is, approximately, X ∼ N(12, 4²).
So, the power of the test, namely 1 − P(X < −2.68), given H_1, is, approximately, 1 − Φ((−2.68 − 12)/4) =
1 − Φ(−3.67) = 0.9999.
Note that when a test has better power than significance level, we may often be able to change the critical
value to reduce the significance level at small cost to the power. For example, choosing c = 0 gives
significance level 0.0013, and power 0.9987.
Statistical Concepts 2013/14 Sheet 9 LR Tests
1. [Hand in to be marked] We observe a series of n counts, x_1, ..., x_n. Our null hypothesis
H_0 is that each count x_i is Poisson, with a common parameter λ, while our alternative
hypothesis, H_1, is that each x_i is Poisson, but with different parameters λ_1, ..., λ_n.
(a) Given H_0, what is the maximum likelihood estimate for λ? Given H_1, what is the
maximum likelihood estimate for each λ_i? Show that, if Λ is the generalised likelihood
ratio, then the corresponding test statistic is

   −2 log(Λ) = 2 Σ_{i=1}^n x_i log(x_i/x̄)

where x̄ is the sample mean. How many degrees of freedom does the null distribution
have?
(b) In a study done at the National Institute of Science and Technology, 1980, asbestos
fibres on filters were counted as part of a project to develop measurement standards for
asbestos concentration. Assessment of the numbers of fibres in each of 23 grid squares
gave the following counts:

   31 34 26 28 29 27 27 24 19 34 27 21 18 30 18 17 31 16 24 24 28 18 22

Carry out the above test as to whether the counts have the same Poisson distribution and
report your conclusions.
2. Let y_1, ..., y_n be a random sample from N(μ, σ²), where the value of σ² is known. Show
that the likelihood ratio test of the hypothesis μ = μ_0 for some specified value of μ_0 is
equivalent to rejecting the hypothesis when the ratio

   |ȳ − μ_0| / (σ/√n)

is large, where ȳ is the sample average. What is the exact significance level when
μ_0 = 0, σ = 1, n = 9, ȳ = 1?
Statistical Concepts 2013/14 Solutions 9 LR tests
1. Under H_0, x_1, ..., x_n are an independent sample from a Poisson distribution, parameter λ. As shown in
lectures, the maximum likelihood estimate for λ is therefore λ̂ = x̄, the sample mean. Under H_1, each x_i
individually is Poisson, parameter λ_i, so the maximum likelihood estimator for each λ_i is λ̂_i = x_i. The
likelihood ratio is therefore

   Λ = Π_{i=1}^n f(x_i | λ̂) / Π_{i=1}^n f(x_i | λ̂_i)
     = Π_{i=1}^n [λ̂^{x_i} e^{−λ̂} / x_i!] / Π_{i=1}^n [λ̂_i^{x_i} e^{−λ̂_i} / x_i!]
     = Π_{i=1}^n (x̄/x_i)^{x_i} e^{x_i − x̄}

The likelihood ratio test statistic is therefore

   −2 ln(Λ) = 2 Σ_{i=1}^n [x_i ln(x_i/x̄) + (x̄ − x_i)] = 2 Σ_{i=1}^n x_i ln(x_i/x̄)

since Σ_{i=1}^n (x_i − x̄) = 0. Under H_1 there are n independent parameters, while under H_0 there is only one
parameter λ. Therefore, asymptotically, −2 ln(Λ) has a chi-square distribution with n − 1 degrees of freedom.
(b) For the given data, 2 Σ_{i=1}^n x_i ln(x_i/x̄) = 27.11. With 23 observations, we have 22 degrees of freedom.
From the tables of the chi-square distribution, we see that the p-value (i.e. the probability of exceeding this value,
given the null distribution) is around 0.2. This would only provide very weak evidence against the null
hypothesis of a common value of λ. On the other hand, the sample size is fairly small, so the asymptotic
approximation is not fully reliable, and we might expect the test to have quite low power.
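The statistic 27.11 for the asbestos counts can be reproduced in a couple of lines; a Python sketch:

```python
import math

counts = [31, 34, 26, 28, 29, 27, 27, 24, 19, 34, 27, 21, 18,
          30, 18, 17, 31, 16, 24, 24, 28, 18, 22]
n = len(counts)                                  # 23 grid squares
xbar = sum(counts) / n

# Generalised LR statistic -2 log(Lambda); chi-square with n-1 = 22 df under H0.
stat = 2 * sum(x * math.log(x / xbar) for x in counts)
```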
2. The likelihood function is

   l(μ) = Π_{i=1}^n f(y_i | μ) = Π_{i=1}^n (1/(σ√(2π))) e^{−(y_i − μ)²/2σ²}

As

   Σ_{i=1}^n (y_i − μ)² = Σ_{i=1}^n (y_i − ȳ)² + n(ȳ − μ)²

the log-likelihood can be written:

   L(μ) = constant − (n/2σ²)(ȳ − μ)²

The unrestricted m.l.e. of μ is μ̂ = ȳ and the restricted m.l.e. is μ = μ_0. Hence,

   2[L(μ̂) − L(μ_0)] = (n/σ²)(ȳ − μ_0)²

with dim(Θ) − dim(Θ_0) = 1 − 0 = 1 degree of freedom. Hence, rejecting when 2[L(μ̂) − L(μ_0)] is large is
equivalent to rejecting when

   |ȳ − μ_0| / (σ/√n)

is large. When the null hypothesis (μ = μ_0) is true,

   Z = (Ȳ − μ_0)/(σ/√n)

has a N(0, 1) distribution, and its value is 3 when μ_0 = 0, σ = 1, n = 9, ȳ = 1. Thus the p-value of
this test (which can also be called the exact significance level) is P[|Z| ≥ 3] = 0.002699796 (computed as
2*(1-pnorm(3)) in R), which is strong evidence against the null hypothesis. Note that in this example,
2[L(μ̂) − L(μ_0)] = Z² ∼ χ²_1, exactly.
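The same p-value can be obtained without R; in Python the standard normal tail is available through the complementary error function:

```python
import math

mu0, sigma, n, ybar = 0, 1, 9, 1
z = (ybar - mu0) / (sigma / math.sqrt(n))        # z = 3
# Two-sided p-value P[|Z| >= z]; erfc gives 2*(1 - Phi(z)) directly.
p_value = math.erfc(z / math.sqrt(2))
```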
Statistical Concepts 2013/14 Sheet 10
Small sample statistics and distribution theory
1. [Hand in to be marked] Suppose that a pharmaceutical company must estimate
the average increase in blood pressure of patients who take a certain new drug. Sup-
pose that only six patients (randomly selected from the population of all patients)
can be used in the initial phase of human testing. Assume that the probability dis-
tribution of changes in blood pressure from which our sample was selected is normal,
with unknown mean, μ, and unknown variance, σ².
(a) Suppose that we use the sample variance s² to estimate the population variance
σ². Find the probability that s² will overestimate σ² by at least 50%.
(b) Suppose that the increase in blood pressure, in points, for each of the sample of
six patients is as follows:

   1.7, 3.0, 0.8, 3.4, 2.7, 2.1

Evaluate a 95% confidence interval for μ from these data. Compare with the interval
that you would obtain using large sample approximations.
(c) Evaluate a 95% confidence interval for σ² from these data.
2. If Z has a normal probability distribution, mean μ, variance σ², and Y = e^Z, then
find the probability density function of Y.
[Y is said to have a lognormal density as log(Y) is normally distributed.]
3. Let X and Y have the joint density

   f(x, y) = (6/7)(x + y)²,   0 ≤ x ≤ 1, 0 ≤ y ≤ 1

(a) By integrating over appropriate regions, find
(i) P(X > Y)
(ii) P(X + Y ≤ 1)
(iii) P(X ≤ 1/2)
(b) Find the marginal density of X.
(c) Write down the marginal density of Y.
(d) Find the conditional density of X given Y.
(e) Write down the conditional density of Y given X.
(f) Are X and Y independent? Explain!
Statistical Concepts 2013/14 Solutions 10
Small sample statistics and distribution theory
1. (a) As each X_i ∼ N(μ, σ²), (n−1)s²/σ² has a chi-square distribution,
with n − 1 degrees of freedom, where s² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)², the sample
variance, and, in this example, n = 6. Therefore if χ²_α is the upper α% point of the
chi-square distribution with 5 df, then

   α = P(Σ_{i=1}^n (X_i − X̄)²/σ² ≥ χ²_α) = P(s² ≥ (χ²_α/5) σ²)

Therefore, P(s² ≥ 1.5σ²) corresponds to the value of α for which χ²_α/5 = 1.5, which,
from detailed tables, or from using R, is 0.186.
[The version of the tables distributed in class gives χ²_{0.2} = 7.29, χ²_{0.15} = 8.12, identi-
fying the probability as being a bit lower than 0.2.]
(b) From the given data, x̄ = 2.283, s = 0.950. The appropriate 95% interval is

   x̄ ± t_{0.025} s/√n

where t_{0.025} is the upper 0.025 point of the t-distribution with (6−1) = 5 degrees of
freedom, which is 2.571 from the tables. Therefore the interval is

   2.283 ± (2.571)(0.950/√6) = 2.283 ± 0.997

The large sample approximation in this problem would be to suppose that s² was a
very accurate estimate for σ² (which we saw above is rather optimistic), and therefore
to use the interval

   x̄ ± z_{0.025} s/√n

where z_{0.025} is the upper 0.025 point of the normal distribution, namely 1.96, replacing
the value 2.571 in the above interval (and so giving a narrower interval, 2.283 ± 0.760,
based on ignoring the substantial uncertainty arising from estimating the variance
from a small sample).
(c) The 95% confidence interval for σ² based on a normal sample of size n is

   ( Σ_{i=1}^n (X_i − X̄)² / χ²_{(n−1)}(0.025),  Σ_{i=1}^n (X_i − X̄)² / χ²_{(n−1)}(0.975) )

From the given data, n = 6 and Σ_{i=1}^6 (X_i − X̄)² = 4.51. The upper 0.025 and 0.975 points
for the chi-square distribution with 5 df are 12.83 and 0.831 respectively, so the 95% interval for
σ² is (0.35, 5.43).
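The intervals in (b) and (c) follow directly from the data; a Python sketch with the t and chi-square table values hard-coded (2.571, 12.83 and 0.831, as quoted in the solution):

```python
import math

data = [1.7, 3.0, 0.8, 3.4, 2.7, 2.1]
n = len(data)
xbar = sum(data) / n
ss = sum((x - xbar) ** 2 for x in data)          # sum of squared deviations, 4.51
s = math.sqrt(ss / (n - 1))

# Table values: t_{0.025} with 5 df, and chi-square 0.025/0.975 upper points with 5 df.
t025, chi2_lo, chi2_hi = 2.571, 0.831, 12.83

mu_ci = (xbar - t025 * s / math.sqrt(n), xbar + t025 * s / math.sqrt(n))
var_ci = (ss / chi2_hi, ss / chi2_lo)            # 95% interval for sigma^2
```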
2.

   f_Z(z) = (1/(σ√(2π))) e^{−(z−μ)²/2σ²}

With z = s(y) = log y, ds(y)/dy = 1/y, so

   f_Y(y) = f_Z(s(y)) |ds(y)/dy| = (1/(σ√(2π) y)) e^{−(log y − μ)²/2σ²}

for y > 0 and zero otherwise.
3. (a) (i) P[X > Y] = 1/2 by symmetry, or ∫₀¹ ( ∫₀ˣ (6/7)(x + y)² dy ) dx.
(ii) P[X + Y ≤ 1] = ∫₀¹ ( ∫₀^{1−x} (6/7)(x + y)² dy ) dx = 3/14.
(iii) P[X ≤ 1/2] = ∫₀¹ ( ∫₀^{1/2} (6/7)(x + y)² dx ) dy = 2/7.
(b) f_X(x) = ∫₀¹ (6/7)(x + y)² dy = (2/7)(3x² + 3x + 1) for x ∈ [0, 1] and zero o/w.
(c) Similarly, by symmetry, f_Y(y) = (2/7)(3y² + 3y + 1) for y ∈ [0, 1] and zero o/w.
(d)

    f(x|y) = f(x, y)/f_Y(y) = { 3(x + y)²/(3y² + 3y + 1)   for x ∈ [0, 1]
                              { 0                           otherwise

(e) Similarly, by symmetry,

    f(y|x) = f(x, y)/f_X(x) = { 3(x + y)²/(3x² + 3x + 1)   for y ∈ [0, 1]
                              { 0                           otherwise

(f) X and Y are not independent because their joint pdf is not the product of the
two marginal densities for all x, y. Equivalently, the conditional densities are not
equal to their corresponding marginal densities.
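The probabilities in parts (a)(ii) and (iii) can be confirmed numerically with a crude midpoint-rule double sum over the unit square (a sanity check only, not part of the solution):

```python
# Numerical check that f(x,y) = (6/7)(x+y)^2 on the unit square gives
# P[X+Y <= 1] = 3/14 and P[X <= 1/2] = 2/7, via a midpoint-rule sum.
def f(x, y):
    return 6.0 / 7.0 * (x + y) ** 2

N = 1000
h = 1.0 / N
p_sum, p_half = 0.0, 0.0
for i in range(N):
    x = (i + 0.5) * h
    for j in range(N):
        y = (j + 0.5) * h
        w = f(x, y) * h * h
        if x + y <= 1.0:
            p_sum += w
        if x <= 0.5:
            p_half += w

print(abs(p_sum - 3.0 / 14.0) < 2e-3)   # True
print(abs(p_half - 2.0 / 7.0) < 2e-3)   # True
```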
Statistical Concepts 2013/14 Sheet 11 Distribution theory
1. [Hand in to be marked] Suppose that X and Y are independent random
quantities, each with exponential pdf

    f(z) = λe^{−λz}   for z > 0 and 0 otherwise.

Let

    U = X + Y and V = X/Y

(a) Find the joint pdf of U and V.
(b) Find the marginal pdfs of U and V.
(c) Are U and V independent? Justify your answer.
2. Suppose that Y and Z are independent random quantities, where Y has a
chi-square distribution with n df, and Z has a standard normal distribution.
Let

    X = Z/√(Y/n) and W = Y

(i) Find the joint pdf of W and X.
(ii) Deduce that the pdf of X is

    f_X(x) = Γ[(n + 1)/2] / (√(nπ) Γ(n/2)) × (1 + x²/n)^{−(n+1)/2}

[This is the pdf of the t-distribution with n df.]
Statistical Concepts 2013/14 Solutions 11 Distribution theory
1. (a) The inverse function to

    u = r₁(x, y) = x + y,  v = r₂(x, y) = x/y

is

    x = s₁(u, v) = uv/(1 + v),  y = s₂(u, v) = u/(1 + v)

over u > 0, v > 0.
The Jacobian, J, namely the determinant

    | ∂s₁/∂u  ∂s₁/∂v |
    | ∂s₂/∂u  ∂s₂/∂v |

has absolute value |J| = u/(1 + v)². Hence, since X and Y are independent,

    f_{U,V}(u, v) = f_{X,Y}(x, y)|J| = f_X(x)f_Y(y)|J|
                  = λe^{−λx} λe^{−λy} |J| = λe^{−λuv/(1+v)} λe^{−λu/(1+v)} u/(1 + v)²
                  = λ²e^{−λu} u/(1 + v)²

for positive u and v and zero otherwise.
(b) The marginal pdf of U is

    f_U(u) = ∫ f_{U,V}(u, v) dv = ∫₀^∞ λ²e^{−λu} u/(1 + v)² dv = λ²ue^{−λu},  u > 0

and similarly,

    f_V(v) = 1/(1 + v)²,  v > 0

(c) As

    f_{U,V}(u, v) = f_U(u)f_V(v)

U, V are independent.
2. (a) The inverse function to

    w = r₁(y, z) = y,  x = r₂(y, z) = z/√(y/n)

is

    y = s₁(x, w) = w,  z = s₂(x, w) = x√(w/n)

over w > 0, −∞ < x < +∞.
The Jacobian, J, namely the determinant

    | ∂s₁/∂x  ∂s₁/∂w |
    | ∂s₂/∂x  ∂s₂/∂w |

has absolute value |J| = √(w/n). Hence, since Y and Z are independent with pdfs

    f_Y(y) = (1/(2^{n/2} Γ(n/2))) y^{(n/2)−1} e^{−y/2},
    f_Z(z) = (1/√(2π)) e^{−z²/2}

we have

    f_{W,X}(w, x) = f_{Y,Z}(y, z)|J| = f_Y(y)f_Z(z)|J|
                  = f_Y(w) f_Z(x√(w/n)) √(w/n)
                  = c w^{((n+1)/2)−1} e^{−(1/2)(1 + x²/n)w}

where

    c = 1/(2^{(n+1)/2} √(πn) Γ(n/2))

(b) The marginal pdf of X is therefore

    f_X(x) = ∫ f_{W,X}(w, x) dw = c ∫₀^∞ w^{((n+1)/2)−1} e^{−wh(x)} dw

where h(x) = (1/2)(1 + x²/n).
Recalling the gamma integral

    ∫₀^∞ x^{a−1} e^{−bx} dx = Γ(a)/bᵃ

(as the gamma pdf integrates to 1), we have

    f_X(x) = c Γ((n + 1)/2)/h(x)^{(n+1)/2}
           = Γ[(n + 1)/2] / (√(nπ) Γ(n/2)) × (1 + x²/n)^{−(n+1)/2}
Statistical Concepts 2013/14 Sheet 12 Bayesian statistics
1. [Hand in to be marked] Joe is a trainee manager, and his boss decides he should prepare
a report on the durability of the light bulbs used in the corporation's offices. His boss wants
to know what proportion last longer than the 900 hours claimed by the manufacturer as
the time that at least 90% should survive. What should Joe do?
Fred, who looks after light bulb replacement, tells Joe that he has been sceptical about
the manufacturer's claims for years, and he reckons it is more like 80%. Joe is a careful
type and decides to pin Fred down a bit, offering him a choice between there being 75%,
80%, 85% and 90% which survive beyond 900 hours and getting him to say how relatively
likely he thinks those percentages are. Fred says he reckons that 80% is about 4 times
more likely than 75%, and about twice as likely as 85%, and that 85% is about 4 times as
likely as 90%.
Joe knows that since his boss is an ex-engineer he is going to demand some facts to back
up the speculation. Joe decides to monitor the lifetimes of the next 30 bulbs installed in
offices. Fortunately, since the lights are left permanently on (to show passersby how well
the corporation is doing financially), he simply has to record the time of installation and
wait for 900 hours.
At the end of the study, Joe is able to write up his report. Of his 30 bulbs, 4 have failed.
Assuming that Joe accepts Fred's opinions as the honest opinions of an expert, what should
he conclude about the proportion of bulbs which last beyond 900 hours?
2. Suppose that you have a blood test for a rare disease. The proportion of people who
currently have this disease is .001. The blood test comes back with two possible results:
positive, which is some indication that you may have the disease, or negative. Suppose
that the test may give the wrong result: if you have the disease, it will give a negative
reading with probability .05; likewise, a false positive result will happen with probability
.05.
You have three blood tests and they are all positive. What is the probability of you having
the disease, assuming blood test results are conditionally independent given disease state?
3. An automatic machine in a small factory produces metal parts. Most of the time (90%
from long records) it produces 95% good parts and the remaining have to be scrapped.
Other times, the machine slips into a less productive mode and only produces 70% good
parts. The foreman observes the quality of parts that are produced by the machine and
wants to stop and adjust the machine when she believes that the machine is not working
well. Suppose that the first dozen parts produced are given by the sequence
s, u, s, s, s, s, s, s, s, u, s, u
where s = satisfactory and u = unsatisfactory. After observing this sequence, what is the
probability that the machine is in its good state, assuming outcomes are conditionally
independent given the state of the machine? If the foreman wishes to stop the machine when the
probability of the good state is under .7, when should she stop it?
After observing the above sequence, what is the probability that the next two parts pro-
duced are unsatisfactory?
4. Suppose that a parameter θ can assume one of three possible values θ₁ = 1, θ₂ = 10
and θ₃ = 20. The distribution of a discrete random quantity Y, with possible values
y₁, y₂, y₃, y₄, depends on θ as follows:

         θ₁   θ₂   θ₃
    y₁   .1   .2   .4
    y₂   .1   .2   .3
    y₃   .2   .3   .1
    y₄   .6   .3   .2

Thus, each column gives the distribution of Y given the value of θ at the head of the
column.
Suppose that the parameter assumes its possible values 1, 10 and 20, with prior probabil-
ities 0.5, 0.25 and 0.25 respectively. In what follows, assume observations are conditionally
independent given θ.
(a) Suppose y₂ is observed. What is the posterior distribution of θ? What is the mode
of this distribution? Compare it with the mle of θ based on y₂.
(b) Suppose a second observation is made and y₁ is observed. What does the posterior
distribution for θ become?
(c) Suppose a third observation were contemplated. Find the conditional probability
distribution of this future observation given that y₂ and y₁ have been observed.
How might this conditional distribution help in predicting the outcome of the third
observation?
Statistical Concepts 2013/14 Solutions 12 Bayesian Statistics
1. Data D: 4 failures out of 30.

    Model (M)   %    P[M]    P[D | M]                       P[M | D]
    M₁          75   2/15    C(30,4)(0.75)²⁶(0.25)⁴         0.055704
    M₂          80   8/15    C(30,4)(0.80)²⁶(0.20)⁴         0.488715
    M₃          85   4/15    C(30,4)(0.85)²⁶(0.15)⁴         0.373958
    M₄          90   1/15    C(30,4)(0.90)²⁶(0.10)⁴         0.081623

The data essentially confirm Fred's belief that the rate is 80% (or at least between 80% and
85%).
2. Let D mean "has disease" and D̄ mean "does not have disease".
Prior: P[D] = 0.001, P[D̄] = 0.999.
Likelihood: P[+ + + | D] = (0.95)³, P[+ + + | D̄] = (0.05)³.
Posterior: P[D | + + +] ∝ (0.95)³ × 0.001, P[D̄ | + + +] ∝ (0.05)³ × 0.999.
Hence, P[D | + + +] = 0.872868, P[D̄ | + + +] = 0.127132.
3.

    P[G] = 0.90,  P[B] = 0.10
    P[S | G] = 0.95,  P[U | G] = 0.05
    P[S | B] = 0.70,  P[U | B] = 0.30

    P[G | sequence] ∝ P[G] P[sequence | G] = 0.9 × (0.95)⁹(0.05)³, giving 0.394217
    P[B | sequence] ∝ P[B] P[sequence | B] = 0.1 × (0.70)⁹(0.30)³, giving 0.605783

    P[G | SU] ∝ P[G] P[SU | G] = 0.90 × 0.95 × 0.05, giving 0.670588
    P[B | SU] ∝ P[B] P[SU | B] = 0.10 × 0.70 × 0.30, giving 0.329412

As P[G | SU] < 0.70, she will stop after the second item, which is unsatisfactory.

    P[UU | sequence] = P[UU | G] P[G | sequence] + P[UU | B] P[B | sequence]
                     = (0.05)² × 0.394217 + (0.30)² × 0.605783 = 0.055506
4. θ₁ = 1, θ₂ = 10, θ₃ = 20 with prior probabilities 0.50, 0.25, 0.25.
(a)

    f(θ₁|y₂) ∝ f(θ₁)f(y₂|θ₁) = 0.50 × 0.1, giving f(θ₁|y₂) = 2/7
    f(θ₂|y₂) ∝ f(θ₂)f(y₂|θ₂) = 0.25 × 0.2, giving f(θ₂|y₂) = 2/7
    f(θ₃|y₂) ∝ f(θ₃)f(y₂|θ₃) = 0.25 × 0.3, giving f(θ₃|y₂) = 3/7

The mode is θ₃ = 20, which agrees with the mle θ̂(y₂) = 20.
(b) f(θ|y₁, y₂) ∝ f(y₁|θ)f(θ|y₂), with the weights 2, 2, 3 proportional to the part (a)
posterior:

    f(θ₁|y₂, y₁) ∝ 2 × 0.1 = 0.2, giving f(θ₁|y₂, y₁) = 1/9
    f(θ₂|y₂, y₁) ∝ 2 × 0.2 = 0.4, giving f(θ₂|y₂, y₁) = 2/9
    f(θ₃|y₂, y₁) ∝ 3 × 0.4 = 1.2, giving f(θ₃|y₂, y₁) = 6/9

(c) f(y|y₁, y₂) = Σᵢ₌₁³ f(y|θᵢ)f(θᵢ|y₁, y₂)

    f(y₁|y₁, y₂) = 0.1 × 1/9 + 0.2 × 2/9 + 0.4 × 6/9 = 29/90
    f(y₂|y₁, y₂) = 0.1 × 1/9 + 0.2 × 2/9 + 0.3 × 6/9 = 23/90
    f(y₃|y₁, y₂) = 0.2 × 1/9 + 0.3 × 2/9 + 0.1 × 6/9 = 14/90
    f(y₄|y₁, y₂) = 0.6 × 1/9 + 0.3 × 2/9 + 0.2 × 6/9 = 24/90

which, apart from y₃, is a fairly flat distribution.
Statistical Concepts 2013/14 Sheet 13 Bayesian Statistics
1. [Hand in to be marked] Show that if X ~ Beta(a, b) then

    E[X | a, b] = a/(a + b)
    Var[X | a, b] = ab/((a + b)²(a + b + 1))

In question 1, problem sheet 12, Joe elicits Fred's prior beliefs in the form of
a discrete distribution. Suppose instead that Joe had managed to elicit from
Fred that his mean and standard deviation for the percentage of lightbulbs
lasting more than 900 hours are 82% and 4%, respectively.
Use a Beta distribution to capture Fred's prior beliefs and calculate the
posterior mean and posterior standard deviation for the percentage of light-
bulbs lasting more than 900 hours, given that 4 out of the 30 lightbulbs had
failed by 900 hours.
2. Independent observations y₁, . . . , yₙ are such that yᵢ (i = 1, . . . , n) is a
realisation from a Poisson distribution with mean λtᵢ where t₁, . . . , tₙ are
known positive constants and λ is an unknown positive parameter. [It may
be helpful to regard yᵢ as the number of events occurring in an interval of
length tᵢ in a Poisson process of constant rate λ, and where the n intervals
are non-overlapping.]
Prior beliefs about λ are represented by a Gamma(a, b) distribution, for
specified constants a and b. Show that the posterior distribution for λ is
Gamma(a + y, b + t), where y = y₁ + . . . + yₙ and t = t₁ + . . . + tₙ.
In all of what follows, put a = b = 0 in the posterior distribution for λ,
corresponding to a limiting form of vague prior beliefs.
A new extrusion process for the manufacture of artificial fibre is under
investigation. It is assumed that incidence of flaws along the length of the
fibre follows a Poisson process with a constant mean number λ of flaws per
metre. The numbers of flaws in five fibres of lengths 10, 15, 25, 30, and 40
metres were found to be 3, 2, 7, 6 and 10, respectively.
Find the posterior distribution for the mean number of flaws per metre of
fibre, and compute the posterior mean and variance of the mean number of
flaws per metre.
Show that the probability that a new fibre of length 5 metres will not
contain any flaws is exactly (24/25)²⁸. [Hint: average the probability of this
event for any λ with respect to the posterior distribution of λ.]
Statistical Concepts 2013/14 Solutions 13 Bayesian Statistics
1.

    E[X] = (Γ(a + b)/(Γ(a)Γ(b))) ∫₀¹ x^{(a+1)−1}(1 − x)^{b−1} dx
         = (Γ(a + b)/(Γ(a)Γ(b))) × Γ(a + 1)Γ(b)/Γ(a + b + 1) = a/(a + b)

Similarly

    E[X²] = (Γ(a + b)/(Γ(a)Γ(b))) × Γ(a + 2)Γ(b)/Γ(a + b + 2)
          = a(a + 1)/((a + b)(a + b + 1))

Therefore, Var[X] = E[X²] − (E[X])² = ab/((a + b)²(a + b + 1)).
Equating mean and variance, a/(a + b) = 0.82 and ab/((a + b)²(a + b + 1)) = (0.04)², gives
a = 74.825 and b = 16.425.
With s = 26, f = 4, posterior beliefs for p are Beta(a + s, b + f) = Beta(100.825, 20.425).
Therefore, E[p | s = 26, f = 4] = 100.825/(100.825 + 20.425) = 100.825/121.25 = 0.831546,
and Var[p | s = 26, f = 4] = 100.825 × 20.425/(121.25² × 122.25) = 0.001146 = (0.03385)².
Hence, the posterior expectation for the percentage of lightbulbs lasting more than 900
hours is approximately 83.2% and the posterior standard deviation is approximately 3.4%.
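The moment-matching and updating steps can be sketched numerically (an illustration only; variable names are my own):

```python
import math

# Solve a/(a+b) = 0.82 and ab/((a+b)^2 (a+b+1)) = 0.04^2 for (a, b),
# then update the Beta prior with s = 26 successes, f = 4 failures.
m, v = 0.82, 0.04 ** 2
total = m * (1 - m) / v - 1        # a + b from the variance equation
a, b = m * total, (1 - m) * total
post_a, post_b = a + 26, b + 4
post_mean = post_a / (post_a + post_b)
post_var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
print(round(a, 3), round(b, 3))        # 74.825 16.425
print(round(post_mean, 6))             # 0.831546
print(round(math.sqrt(post_var), 3))   # 0.034, i.e. about 3.4%
```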
2.

    likelihood ∝ Πᵢ₌₁ⁿ (λtᵢ)^{yᵢ} e^{−λtᵢ} ∝ λ^y e^{−λt}
    prior ∝ λ^{a−1} e^{−bλ}
    posterior ∝ λ^{a−1} e^{−bλ} λ^y e^{−λt} = λ^{a+y−1} e^{−(b+t)λ}

Hence, the posterior distribution is Gamma(a + y, b + t). When a = b = 0,

    f(λ|y, t) = (tʸ/Γ(y)) λ^{y−1} e^{−tλ}

In the example, y = 3 + 2 + 7 + 6 + 10 = 28 and t = 10 + 15 + 25 + 30 + 40 = 120. Hence,

    f(λ|y, t) = (120²⁸/Γ(28)) λ²⁷ e^{−120λ}

E[λ | y = 28, t = 120] = 28/120 = 0.233 and Var[λ | y = 28, t = 120] = 28/120² =
0.00194.
Let Y be the number of flaws in a new fibre of length T = 5. We want

    P[Y = 0 | y = 28, t = 120, T = 5] = ∫₀^∞ e^{−5λ} f(λ|y = 28, t = 120) dλ
      = ∫₀^∞ (120²⁸/Γ(28)) λ²⁷ e^{−(120+5)λ} dλ
      = (120/125)²⁸ ∫₀^∞ (125²⁸/Γ(28)) λ²⁷ e^{−125λ} dλ
      = (24/25)²⁸ = 0.318856
Statistical Concepts 2013/14 Sheet 14 Bayesian Statistics
1. [Hand in to be marked] Suppose that the heights of individuals in a certain population
have a normal distribution for which the value μ of the mean height is unknown but the
standard deviation is known to be 2 inches. Suppose also that prior beliefs about μ can be
adequately represented by a normal distribution with a mean of 68 inches and a standard
deviation of 1 inch. Suppose 10 people are selected at random from the population and
their average height is found to be 69.5 inches.
(a) What is the posterior distribution of μ?
(b) (i) Which interval, 1 inch long, had the highest prior probability of containing μ?
What is the value of this probability?
(ii) Which interval, 1 inch long, has the highest posterior probability of containing μ?
What is the value of this probability?
(c) What is the posterior probability that the next person selected at random from the
population will have height greater than 70 inches?
(d) What happens to the posterior distribution in this problem when the number of people
n whose heights we measure becomes very large? Investigate this by (i) seeing what
happens when n becomes very large in the formulae you used for part (a); (ii) using
the general theoretical result on limiting posterior distributions. Check that (i) and
(ii) give the same answer in this case.
2. Albert, a geologist, is examining the amount of radiation being emitted by a geological
formation in order to assess the risk to health of people whose homes are built on it. He
would like to learn about λ, the average amount of radiation being absorbed per minute by
individual residents. His mean and standard deviation for λ are 100 particles/minute and
10 particles/minute, and he is willing to use a gamma distribution to represent his beliefs
about λ. Albert would like to have more precise knowledge about λ. He has an instrument
which measures the exposure which would have been received by a human standing at the
same location as the instrument for one minute. Since he is dealing with radioactivity, he
believes his machine measurements follow a Poisson distribution with mean λ. However,
he does not know how many measurements he needs to make to sufficiently reduce his
uncertainty about λ. How many measurements would you advise him to make if he wishes
his expected posterior variance for λ to be 4 or less? [HINT: first find what his posterior
distribution for λ would be for n observations and then use the first-year probability result
E[X] = E[E[X | Y]] to help you compute the expectation of his posterior variance for λ.]
3. When gene frequencies are in equilibrium, the genotypes Hp1-1, Hp1-2 and Hp2-2 of
Haptoglobin occur with probabilities (1 − θ)², 2θ(1 − θ) and θ², respectively. In a study of
190 people the corresponding sample numbers were 10, 68 and 112. Assuming a uniform
prior distribution for θ over the interval (0, 1), compute the posterior distribution for θ.
Compute the posterior expectation and variance for θ.
Find a large sample 99% Bayesian confidence interval for θ, based on these data.
4. Suppose that y₁, . . . , yₙ is a random sample from a uniform distribution on (0, θ) where
θ is an unknown positive parameter. Show that the likelihood function l(θ) is given by

    l(θ) = { 1/θⁿ        m < θ < ∞
           { 0           otherwise        (1)

where m = max{y₁, . . . , yₙ}. Suppose that the prior distribution for θ is a Pareto distri-
bution

    f(θ) = { abᵃ/θ^{a+1}   b < θ < ∞
           { 0             otherwise      (2)

where a and b are specified positive constants. Show that the posterior distribution for θ
is also Pareto, with constants a + n and max{b, m}.
Now put a = b = 0 in the posterior distribution (corresponding to vague prior beliefs),
and with this specification show that (m, mα^{−1/n}) is the 100(1 − α)% highest posterior
density (HPD) credibility interval for θ; that is, the posterior density at any point inside the
interval is greater than that of any point outside the interval. Is this interval a 100(1 − α)%
confidence interval in the frequentist sense? If so, show this to be the case.
Statistical Concepts 2013/14 Solutions 14 Bayesian Statistics
1. (a) Height, Y ~ N(μ, 2²), so σ² = 4. μ ~ N(68, 1²), so μ₀ = 68, σ₀² = 1. n = 10 and
ȳ = 69.5. Therefore, μ|data ~ N(μₙ, σₙ²) where

    μₙ = (μ₀/σ₀² + nȳ/σ²) / (1/σ₀² + n/σ²)
       = (68/1² + 10 × 69.5/2²) / (1/1² + 10/2²) = 69.07 inches

and

    1/σₙ² = 1/σ₀² + n/σ² = 1/1² + 10/2² = 3.5

so that σₙ = 0.53 inches. Thus, μ|data ~ N(69.07, 0.53²).
(b) (i) 68 ± 0.5, i.e. (67.5, 68.5). The probability is P[|Z| ≤ 0.5] = 0.3829.
(ii) 69.07 ± 0.5, i.e. (68.57, 69.57). The probability is P[|Z| ≤ 0.5√3.5] = 0.6505.
(c) If X|w ~ N(w, σ²) and w ~ N(μ, v²), then X ~ N(μ, σ² + v²). In this problem, the
posterior predictive distribution for Y, the height of a further person selected from
the population, is therefore N(69.07, 4.281). Therefore

    P(Y > 70) = 1 − P[ (Y − 69.07)/√4.281 ≤ (70 − 69.07)/√4.281 ]
              = 1 − Φ(0.45) = 0.326

(d) (i) As n → ∞,

    μₙ = (μ₀/σ₀² + nȳ/σ²) / (1/σ₀² + n/σ²) → ȳ

and

    1/σₙ² = 1/σ₀² + n/σ², so σₙ² ≈ σ²/n → 0.

(ii) The general large sample result is that the posterior distribution of μ tends to a
normal distribution N(μ̂, −1/L''(μ̂)) as n → ∞, where μ̂ is the maximum likelihood
estimator for μ and L is the log likelihood (or equivalently N(μ̂, 1/(nI(μ̂))), where I is
Fisher's information, i.e. minus the expected value of L''). In this case, we have posterior
normality for all sample sizes. The m.l.e. is the sample mean ȳ as previously found, where
the log likelihood was shown to be

    L(μ) = constant − n(ȳ − μ)²/(2σ²)

so that

    L''(μ) = −n/σ²

so that the large sample limit for the posterior variance is σ²/n, agreeing with the values found
directly in (i).
2. For a Gamma(a, b) prior, E[λ] = a/b = 100 and Var[λ] = a/b² = 100 gives a = 100
and b = 1. For general a and b and with a Poisson likelihood the posterior for λ is
Gamma(a + nȳ, b + n). Thus

    Var[λ | ȳ] = (a + nȳ)/(b + n)²

Therefore

    E[Var[λ | ȳ]] = (a + nE[Ȳ])/(b + n)² = a/(b(b + n))

because E[Ȳ] = E[E[Ȳ | λ]] = E[λ] = a/b. With a = 100 and b = 1, we require

    100/(1 + n) ≤ 4

which gives n = 24.
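The final inequality can be solved by a direct search (a trivial check rather than anything the solution needs):

```python
# Expected posterior variance a/(b(b+n)) with a = 100, b = 1:
# find the smallest n bringing it down to 4 or less.
a, b = 100, 1
n = 0
while a / (b * (b + n)) > 4:
    n += 1
print(n)  # 24
```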
3. For sample numbers A = 10, B = 68, C = 112, the likelihood is proportional to

    [(1 − θ)²]^A [2θ(1 − θ)]^B [θ²]^C ∝ θ^{2C+B} (1 − θ)^{2A+B}

and with a uniform prior for θ on (0, 1) the posterior is proportional to the likelihood, and
we recognise that this is a Beta(2C + B + 1, 2A + B + 1) distribution with 2C + B + 1 = 293
and 2A + B + 1 = 89.
Hence, E[θ | data] = 293/(89 + 293) = 0.767 and Var[θ | data] = 293 × 89/(382² × 383) =
0.0004666; the posterior SD is 0.0216.
A 99% confidence interval can be based on these values, or, almost exactly the same, based on
the result that for large samples or vague prior information, the posterior distribution
for θ is approximately normal with mean θ̂ = (2C + B)/(2n) = 292/380 = 0.7684 and
variance −1/L''(θ̂) = θ̂(1 − θ̂)/(2n) = 0.0004683. In either case the confidence interval is
approximately 0.768 ± 0.02164 × 2.575 = (0.713, 0.824), where z_{0.005} = 2.575.
4. The joint pdf is

    f(y₁, . . . , yₙ|θ) = { 1/θⁿ   0 < yᵢ < θ for i = 1, . . . , n
                         { 0       otherwise

But yᵢ < θ for i = 1, . . . , n is equivalent to m = max{y₁, . . . , yₙ} < θ < ∞. When
f(y|θ) is considered as a function of θ (for given data y₁, . . . , yₙ) the likelihood is

    l(θ) = { 1/θⁿ   m < θ < ∞
           { 0      otherwise

The prior for θ is

    f(θ) = { abᵃ/θ^{a+1}   b < θ < ∞
           { 0             otherwise

The posterior for θ is

    f(θ|y) ∝ { 1/θ^{a+1+n}   b < θ < ∞ and m < θ < ∞
             { 0             otherwise

But b < θ < ∞ and m < θ < ∞ is equivalent to max{b, m} < θ < ∞. Hence, the posterior
distribution for θ is Pareto with parameters a + n and max{b, m}.
When a = b = 0, the posterior is

    f(θ|y) = { nmⁿ/θ^{n+1}   m < θ < ∞
             { 0             otherwise

Then

    ∫ from m to mα^{−1/n} of f(θ|y) dθ = ∫ from m to mα^{−1/n} of nmⁿ/θ^{n+1} dθ
      = [−(m/θ)ⁿ] from m to mα^{−1/n} = 1 − α

Hence, the interval is a 100(1 − α)% credibility interval for θ. Inspection of the graph
of f(θ|y) and the placement of the interval shows it to be a highest posterior density
100(1 − α)% credibility interval.
Let M denote the random quantity M = max{Y₁, . . . , Yₙ}. Then because [M ≤ m] =
[Y₁ ≤ m, . . . , Yₙ ≤ m] for any m, it follows that the distribution function P[M ≤ m | θ] of
M for any fixed θ is given by

    P[M ≤ m | θ] = { 0        m ≤ 0
                   { (m/θ)ⁿ   0 < m < θ
                   { 1        m ≥ θ

Therefore, as required,

    P[M < θ < Mα^{−1/n} | θ] = P[θα^{1/n} < M < θ | θ]
                             = 1 − (θα^{1/n}/θ)ⁿ = 1 − α
Statistical Concepts 2013/14 Sheet 15 Regression
1. [Hand in to be marked] The following data give the diameter d in mm and the
height h in metres for 20 Norway spruce trees situated in a very small part of a
Swedish forest.

    d 140  134  180  177  178  114  221  122  237  82
    h 10.0 10.0 12.0 12.0 15.0 10.5 17.0 12.0 15.0 7.0
    d 152  166  86   157  74   173  172  153  190  196
    h 11.5 13.0 7.0  11.5 7.0  14.0 11.5 11.0 14.5 16.0

Diameter is very easily measured, but height is much more difficult to estimate in a
dense forest. It is therefore of interest to attempt to predict height from diameter.
Make a scatter plot of diameter versus height (by hand or using R), and comment
briefly on the apparent strength of relationship between the two variables.
In what follows you should use the following information:

    Σdᵢ = 3104, Σhᵢ = 237.5, Σdᵢ² = 518202, Σdᵢhᵢ = 39046.5, Σhᵢ² = 2977.25

Fit a straight line using the method of least squares, and draw it on your scatter plot.
Does it seem a sensible line to fit to these data? Evaluate the estimate s of σ, where
s² is the usual unbiased estimate of the assumed common error variance σ². Give a
95% confidence interval for the slope of the line: is it statistically significantly different
from zero?
2. Consider the simple linear regression model

    y = β₀ + β₁x + ε

with data (x₁, y₁), ..., (xₙ, yₙ).
(a) Derive the equations of the usual least squares estimators β̂₀, β̂₁.
(b) Show that if the errors ε₁, ..., εₙ are independent normal random quantities, mean
0, variance σ², then the usual least squares estimators are also the maximum
likelihood estimators of β₀, β₁.
(c) Show that

    Cov(β̂₀, β̂₁) = −σ²x̄ / Σᵢ(xᵢ − x̄)²

[Hint: Evaluate Var(ȳ) both directly and also using the relation ȳ = β̂₀ + x̄β̂₁.]
3. Zebedee is trying to measure the elasticity of a spring. The elasticity is the increase
in the length of the spring if a 1g mass is suspended from the end. The length of
the spring can only be measured when a mass is attached to the end, otherwise it
curls up in a tangled ball. Zebedee has available two masses, one of 10g and one of
20g. He is impatient to get home for dinner and so he only has time to perform 100
measurements. How many of the 100 measurements should be made with the 10g mass
if he wants to estimate the elasticity as accurately as possible? Prove your answer is
correct. You may assume that the expected increase in the length of the spring is
exactly proportional to the mass attached, that the measurements are independent
and that the variation in repeated measurements is the same for both masses.
[Hint: Set the problem up as a simple linear regression problem, where you want to
choose the number of 10g readings and the number of 20g readings to minimise the
variance of the estimate for the slope coefficient.]
Statistical Concepts 2013/14 Solutions 15 Regression
1. n = 20, d̄ = 155.2, h̄ = 11.875, S_dd = 36461.2, S_dh = 2186.5, S_hh = 156.9375
The plot indicates a fairly strong approximate linear positive relationship between height and
diameter.
Figure 1: Scatter plot of Norway spruce data.
For hᵢ = β₀ + β₁dᵢ + zᵢ we have

    β̂₁ = S_dh/S_dd = 2186.5/36461.2 = 0.06 (2 dp)
    β̂₀ = h̄ − β̂₁d̄ = 2.57 (2 dp)

Estimate of assumed common error variance σ²:

    s² = (1/(n − 2)) ( S_hh − S_dh²/S_dd ) = 1.4343

Hence, s = 1.1976.
The confidence interval for the slope β₁ has limits

    β̂₁ ± (s/√S_dd) t_{18,0.025}

where t_{18,0.025} = 2.101. The confidence interval [0.047, 0.073] does not contain 0, offering
further empirical evidence that variation in height can be explained in part by variation
in diameter.
[Checking whether zero is in this confidence interval is exactly equivalent to testing the
null hypothesis that β₁ is zero at the 0.05 level.]
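The fit can be reproduced from the summary sums given in the question (a sketch in Python, though the course suggests R; t_{18,0.025} = 2.101 is the tabled value above):

```python
import math

# Least squares fit of height on diameter from the summary statistics.
n = 20
sum_d, sum_h = 3104.0, 237.5
sum_dd, sum_dh, sum_hh = 518202.0, 39046.5, 2977.25
dbar, hbar = sum_d / n, sum_h / n
S_dd = sum_dd - n * dbar ** 2
S_dh = sum_dh - n * dbar * hbar
S_hh = sum_hh - n * hbar ** 2
b1 = S_dh / S_dd
b0 = hbar - b1 * dbar
s2 = (S_hh - S_dh ** 2 / S_dd) / (n - 2)       # unbiased error variance
half = 2.101 * math.sqrt(s2 / S_dd)            # half-width of slope CI
print(round(b1, 2), round(b0, 2))              # 0.06 2.57
print(round(b1 - half, 3), round(b1 + half, 3))  # 0.047 0.073
```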
2. (a) Let the sum of squares be

    Q = Σᵢ (yᵢ − (β₀ + β₁xᵢ))²

We want to choose β₀, β₁ to minimise Q.

    ∂Q/∂β₀ = −2 Σᵢ (yᵢ − (β₀ + β₁xᵢ))

and

    ∂Q/∂β₁ = −2 Σᵢ (yᵢ − (β₀ + β₁xᵢ))xᵢ

Setting these equations to zero gives the following pair of equations for β̂₀, β̂₁:

    β̂₀ n + β̂₁ Σᵢ xᵢ = Σᵢ yᵢ
    β̂₀ Σᵢ xᵢ + β̂₁ Σᵢ xᵢ² = Σᵢ xᵢyᵢ

The first equation gives

    β̂₀ = ȳ − β̂₁x̄

which, substituting into the second, gives

    β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
(b) The likelihood of observation yᵢ for given values xᵢ, β₀, β₁, σ² is

    f(yᵢ|xᵢ, β₀, β₁, σ²) = (1/(σ√(2π))) e^{−(yᵢ − (β₀ + β₁xᵢ))²/(2σ²)}

Therefore the likelihood of the sample is

    f(y|x, β₀, β₁, σ²) = (1/(σⁿ(2π)^{n/2})) e^{−(1/(2σ²)) Σᵢ (yᵢ − (β₀ + β₁xᵢ))²}

Therefore, we maximise the likelihood by minimising the exponent, namely Σᵢ (yᵢ −
(β₀ + β₁xᵢ))², and this is exactly the least squares solution obtained above.
(c) Since ȳ = β̂₀ + β̂₁x̄, we have

    Var(ȳ) = Var(β̂₀) + x̄² Var(β̂₁) + 2x̄ Cov(β̂₀, β̂₁).

Therefore, for x̄ ≠ 0, as y₁, ..., yₙ are independent, each with variance σ², and using
the formulae for Var(β̂₀), Var(β̂₁) derived in lectures, we have

    Cov(β̂₀, β̂₁) = (1/(2x̄)) ( Var(ȳ) − Var(β̂₀) − x̄² Var(β̂₁) )
      = (1/(2x̄)) ( σ²/n − σ² Σᵢ xᵢ²/(n Σᵢ (xᵢ − x̄)²) − σ²x̄²/Σᵢ (xᵢ − x̄)² )
      = (σ²/(2x̄)) ( −2nx̄²/(n Σᵢ (xᵢ − x̄)²) ) = −σ²x̄/Σᵢ (xᵢ − x̄)²

[If x̄ = 0, then β̂₀ = ȳ, and you can check directly that Cov(β̂₀, β̂₁) = 0.]
3. Take m observations at 10g and 100 − m at 20g.
The model for length y is

    y = β₀ + β₁x + z

where x = weight attached to the spring, β₀ = length of the spring, β₁ = elasticity of the spring,
and Var[Z] = σ².
As Var[β̂₁] = σ²/Σᵢ (xᵢ − x̄)², it is straightforward to show that σ²/Var[β̂₁] = m(100 −
m), which is maximised at m = 50.
Statistical Concepts 2013/14 Sheet 16 Regression
1. [Hand in to be marked]
In a study of how wheat yield depends on fertilizer, suppose that funds are available
for only seven experimental observations. Therefore X (fertilizer in lb/acre) is set to
seven different levels, with one observation Y (yield in bushels/acre) for each value of
X. The data are as follows.

    X 100 200 300 400 500 600 700
    Y 40  50  50  70  65  65  80

(a) Fit the least squares line Y = β̂₀ + β̂₁X.
(b) Find and interpret the multiple R² value for the above data.
(c) Suppose that we intend to use 550 pounds of fertiliser. Find a 95% confidence
interval for the expected yield, assuming the usual simple linear regression model
with normal errors.
(d) Suppose that we intend to apply 550 pounds of fertiliser for a single plot, giving
yield Y*. Explain how you would modify the above confidence interval to give a
95% prediction interval for Y*.
2. (a) In the usual simple linear regression model, where ŷᵢ = β̂₀ + β̂₁xᵢ and rᵢ = yᵢ − ŷᵢ,
show that

    i.  Σ rᵢ = 0
    ii. Σ xᵢrᵢ = 0
    iii. Σ ŷᵢrᵢ = 0

Hence show that the sample correlation between the values of the independent
variable, xᵢ, and the residuals, rᵢ, is zero, and that the sample correlation between the
fitted values, ŷᵢ, and the residuals is zero. What are the implications for graphical
diagnostics of the underlying assumptions of the simple linear regression model?
(b) If the error variance in the regression model is σ², show that

    Var(rᵢ) = σ² ( 1 − 1/n − (xᵢ − x̄)²/SS_xx )

Comment on the implications of this result for the least squares fit of a line to a
collection of points for which a single x value is very much larger than all of the
remaining x values.
Statistical Concepts 2013/14 Solutions 16 Regression
1. (a) For the given data

    X̄ = 400, Ȳ = 60, SS_xy = 16,500, SS_xx = 280,000, SS_yy = 1150

Therefore,

    β̂₁ = SS_xy/SS_xx = 0.059,  β̂₀ = Ȳ − β̂₁X̄ = 36.4

(b) R² is the square of the sample correlation coefficient, so

    R² = SS_xy² / (SS_xx SS_yy) = 0.845

R² may be interpreted as the proportion of the variance in y explained by the re-
gression, i.e. the ratio Σᵢ (ŷᵢ − ȳ)² / Σᵢ (yᵢ − ȳ)². (Here R² is reasonably large, as the
regression explains most of the variation.)
(c) The residual sum of squares is RSS = Σᵢ (yᵢ − ŷᵢ)² = 177.68. Therefore the estimate
for the error variance σ² is s² = RSS/(7 − 2) = 35.5.
The estimate for expected yield when x = 550 is

    ŷ = β̂₀ + β̂₁ × 550 = 69

The standard error of this estimate is

    s √(1/7 + (550 − x̄)²/SS_xx) = 5.96 √0.223

Therefore a 95% confidence interval for expected yield, assuming normal errors, is

    69 ± t_{.025,5} × 5.96 √0.223 = 69 ± 7

(d) For the prediction interval for a single yield, the estimate is the same but the standard
error becomes

    s √(1 + 1/7 + (550 − x̄)²/SS_xx)

so the prediction interval becomes 69 ± 17.
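All the parts can be reproduced from the raw data (an illustrative sketch; t_{.025,5} = 2.571 is the tabled value used above):

```python
import math

# Wheat-yield regression: fit, R^2, and the 95% confidence and
# prediction half-widths at x = 550.
xs = [100, 200, 300, 400, 500, 600, 700]
ys = [40, 50, 50, 70, 65, 65, 80]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
ssxx = sum((x - xbar) ** 2 for x in xs)
ssxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
ssyy = sum((y - ybar) ** 2 for y in ys)
b1 = ssxy / ssxx
b0 = ybar - b1 * xbar
r2 = ssxy ** 2 / (ssxx * ssyy)
rss = ssyy - ssxy ** 2 / ssxx
s = math.sqrt(rss / (n - 2))
yhat = b0 + b1 * 550
se_mean = s * math.sqrt(1 / n + (550 - xbar) ** 2 / ssxx)
se_pred = s * math.sqrt(1 + 1 / n + (550 - xbar) ** 2 / ssxx)
print(round(b0, 1), round(r2, 3))   # 36.4 0.845
print(round(yhat), round(2.571 * se_mean), round(2.571 * se_pred))  # 69 7 17
```

The prediction half-width is much wider than the confidence half-width because it must also absorb the error variance of a single new plot, not just the uncertainty in the fitted mean.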
2. (a) i.

    Σᵢ rᵢ = Σᵢ (yᵢ − β̂₀ − xᵢβ̂₁) = n(ȳ − β̂₀ − x̄β̂₁) = 0

ii.

    Σᵢ xᵢrᵢ = Σᵢ xᵢ(yᵢ − β̂₀ − xᵢβ̂₁) = Σᵢ xᵢ(yᵢ − (ȳ − x̄β̂₁) − xᵢβ̂₁)
            = Σᵢ [ xᵢ(yᵢ − ȳ) − β̂₁ xᵢ(xᵢ − x̄) ]
            = SS_xy − β̂₁ SS_xx = SS_xy − (SS_xy/SS_xx) SS_xx = 0

iii.

    Σᵢ ŷᵢrᵢ = Σᵢ (β̂₀ + xᵢβ̂₁)rᵢ = β̂₀ Σᵢ rᵢ + β̂₁ Σᵢ xᵢrᵢ = 0

The sample correlation between the residuals and the x values will be zero if the
sample covariance is zero, which, as the sum of the residuals is zero, is equivalent to
the condition that Σ xᵢrᵢ = 0, and similarly for the sample correlation between the
fitted values, ŷᵢ, and the residuals.
What this implies is that in plots of residuals versus x values or against fitted values
there cannot be a non-zero linear fit. We would expect there to be a random scatter,
and any pattern in the plot may suggest a problem with the regression model.
(b)

    rᵢ = yᵢ − (ȳ − x̄β̂₁) − β̂₁xᵢ = yᵢ − ȳ + (x̄ − xᵢ) Σⱼ (xⱼ − x̄)yⱼ / SS_xx
       = yᵢ [1 − 1/n − (xᵢ − x̄)²/SS_xx] − Σ_{j≠i} yⱼ [1/n + (xᵢ − x̄)(xⱼ − x̄)/SS_xx]

As y₁, ..., yₙ are independent, and each has variance σ², letting Qᵢ = 1/n + (xᵢ − x̄)²/SS_xx, we
have

    Var(rᵢ) = σ²[1 − Qᵢ]² + σ² Σ_{j≠i} [1/n + (xᵢ − x̄)(xⱼ − x̄)/SS_xx]²
            = σ²[1 − Qᵢ]² + σ² Σⱼ [1/n + (xᵢ − x̄)(xⱼ − x̄)/SS_xx]² − σ²Qᵢ²
            = σ²[(1 − Qᵢ)² + Qᵢ − Qᵢ²] = σ²[1 − Qᵢ]

Observe that Var(rᵢ) is a decreasing function of (xᵢ − x̄)². Therefore, if a single
x value, xⱼ say, is very much larger than all of the remaining x values, then the
variance of the jth residual will be much smaller than the variance of all the other
residuals. (You can check that moving xⱼ to infinity reduces the variance of rⱼ to
zero.) This means that the least squares line will have a very small residual for this
point. Therefore, if a single value xⱼ is very different from the rest of the values then
the least squares line will essentially be the line that goes through the sample mean
(x̄, ȳ) of the remaining points and also goes through (xⱼ, yⱼ). In some circumstances
it may be of concern that a single unusual point has been allowed to determine the
slope of the line.
Statistical Concepts 2013/14 Sheet 17 Unpaired & Paired
Comparisons
1. [Hand in to be marked] Aerobic capacity, the peak oxygen intake per unit of body
weight of an individual performing a strenuous activity is a measure of work capacity. In
a comparative study, measurements of aerobic capacities were recorded for a group of 20
Peruvian highland natives and for a group of 10 USA lowlanders acclimatised as adults in
high altitudes. The following summary statistics were obtained from the data
Peruvians U.S. subjects
y 46.3 38.5
s 5.0 5.8
(i) Estimate the dierence in mean aerobic capacities between the two populations and
give the standard error of your estimate, stating any assumptions you make. (ii) Test
at the 2% signicance level the hypothesis that there is no dierence between the mean
aerobic capacities in the two populations. (iii) Construct a 98% condence interval for the
dierence in mean aerobic capacities between the two populations.
State any assumptions you have made and suggest ways in which you might informally
validate them if you had access to the original data.
2. A study was carried out to compare the starting salaries of male and female college grad-
uates who found jobs after leaving the same program from an (American) institution.
Matched pairs were formed of one male and one female graduate with the same subject
mix and grade point average. The starting salaries (in dollars) were as follows.

Pair       1     2     3     4     5     6     7     8     9    10
Male   29300 41500 40400 38500 43500 37800 69500 41200 38400 59200
Female 28800 41600 39800 38500 42600 38000 69200 40100 38200 58500

(a) Test the hypothesis that there is no difference between average starting salaries be-
    tween sexes, at significance level 5%. State your assumptions.
(b) Construct a 95% confidence interval for the difference between the mean salaries for
    males and females.
(c) Explain why the test procedure
    "Reject the hypothesis that the mean difference is zero if the 95% confidence interval
    for the mean difference does not contain the value zero"
    is a valid significance test at level 5%.
(d) Suppose that we forget that the above individuals were selected as matched pairs, and
    treat the 10 males as a random sample from male graduating students and the ten
    females as a random sample of female graduating students. (The standard deviation
    for the male values is 11665 and for females is 11617.) Find a 95% confidence interval
    for the difference between the means of the two groups.
    Compare your answer with the matched pair analysis and explain why you have
    reached different conclusions.
    [This comparison is only to provide some order of magnitude comparison of the differ-
    ences we might expect from a matched pairs versus a simple two sample experiment.
    If the design really is a matched pairs design, we do not have independent samples
    from the two populations, although if the ten individuals chosen are fairly representa-
    tive of the general population then the order of magnitude comparison may be about
    right.]
(e) Under what circumstances might it be better not to use matched pairs in this type
    of study?
Statistical Concepts 2013/14 Solutions 17 Unpaired & Paired Comparisons
1. I will assume that the samples were drawn at random from both populations. As sample sizes are
small, I will assume normality for both populations, which could be validated informally by looking
at histograms or normal quantile plots if the data were available. Also, I will assume that the
population variances are equal; the sample variances lend some support for this assumption.
(i) Pooled estimate of the assumed common variance:

    s_p² = (19 × 5.0² + 9 × 5.8²) / (19 + 9) = 27.78

The estimated difference in population means is D = 46.3 − 38.5 = 7.8, with estimated standard error

    s_D = √(27.78 × (1/20 + 1/10)) = 2.04

(ii) The test statistic is

    t = D / s_D = 7.8 / 2.04 = 3.82

If there is no difference between the means then the distribution of t will be a t-distribution with
10 + 20 − 2 = 28 degrees of freedom, and t_{28,0.01} = 2.467. Our test is to reject the hypothesis
of equal means with significance level 0.02 if |t| > 2.467, and so, in this case, we can reject the
hypothesis at this level.
(iii) Similarly, 98% confidence limits for the difference in population mean aerobic capacities are
7.8 ± 2.467 × 2.04, i.e. [2.8, 12.8]. (Note that zero is not in this confidence interval.)
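These calculations are easy to reproduce numerically. A minimal Python sketch from the summary statistics (the critical value t_{28,0.01} = 2.467 is taken from tables, as in the solution above):

```python
import math

# Summary statistics from the sheet: Peruvian highlanders vs US lowlanders.
n1, ybar1, s1 = 20, 46.3, 5.0
n2, ybar2, s2 = 10, 38.5, 5.8

# Pooled variance estimate (assumes equal population variances).
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # ~27.78
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))                        # ~2.04
t = (ybar1 - ybar2) / se                                       # ~3.82, 28 d.f.

t_crit = 2.467   # t_{28, 0.01}, from tables
ci = ((ybar1 - ybar2) - t_crit * se, (ybar1 - ybar2) + t_crit * se)
print(f"s_p^2 = {sp2:.2f}, se = {se:.2f}, t = {t:.2f}, "
      f"98% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

Since |t| exceeds the critical value, and equivalently the interval excludes zero, the code agrees with the rejection of the hypothesis at the 2% level.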
2. (a) Assuming the pair differences D_i = X_i − Y_i are iid N(μ, σ²) random quantities, we want to
test the hypothesis that μ = 0. The pairwise data differences (male − female) are

    500, −100, 600, 0, 900, −200, 300, 1100, 200, 700

The sample mean of these numbers is D̄ = 400 and the sample standard deviation is s_D = 434.61.
The test statistic is

    t = D̄ / (s_D / √n) = 400 / (434.61 / √10) = 2.91

If the mean difference is zero, then t has a t-distribution with n − 1 = 9 degrees of freedom.
As t_{0.025,9} = 2.26, the observed value of |t| is larger than the critical value and the test rejects
the hypothesis of equality of means at 5%.
(b) Similarly, a 95% confidence interval for the mean difference is

    D̄ ± t_{0.025,9} × s_D × √(1/n) = 400 ± 2.26 × 434.6 × √(1/10) = 400 ± 310.6

(Note that this interval does not contain zero.)
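The paired analysis in (a) and (b) can be checked with a few lines of Python, using the pairwise differences computed from the sheet's salary table (t_{0.025,9} = 2.26 is taken from tables, as in the solution):

```python
import math
import statistics

# Pairwise salary differences (male - female) from the sheet's data.
d = [500, -100, 600, 0, 900, -200, 300, 1100, 200, 700]
n = len(d)

dbar = statistics.mean(d)            # 400
sd = statistics.stdev(d)             # sample standard deviation, ~434.61
t = dbar / (sd / math.sqrt(n))       # paired t statistic, n - 1 = 9 d.f.

t_crit = 2.26                        # t_{0.025, 9} from tables
half_width = t_crit * sd / math.sqrt(n)
print(f"dbar = {dbar}, s_D = {sd:.2f}, t = {t:.2f}, "
      f"95% CI = {dbar} +/- {half_width:.1f}")
```

The computed t of about 2.91 exceeds 2.26, matching the rejection of the hypothesis at the 5% level, and the interval 400 ± 310.6 excludes zero.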
(c) If the mean difference is zero, then the probability that we will obtain a sample for which
    the 95% confidence interval for the mean difference contains the value zero is 0.95. (This is
    by the definition of the confidence interval.)
    Therefore, the probability that the corresponding test, which rejects the hypothesis that the
    mean difference is zero when the 95% confidence interval does not contain the value zero,
    will reject the hypothesis if it is true is 0.05, so that the test is a valid significance test at
    level 5%.
    [Of course, there is nothing special about this example, and the simple argument above shows
    that, if we can construct a confidence interval for a quantity, then we can always construct
    the corresponding significance test for the value of the quantity.]
(d) Ignoring the matching of pairs, our confidence interval is

    X̄ − Ȳ ± t_{0.025,18} × s_p × √(1/m + 1/n)

where X̄ − Ȳ = 400, m = n = 10 and s_p² is the pooled variance estimate given by

    s_p = √((9 s_x² + 9 s_y²) / 18) = 11641

so the confidence interval is 400 ± 10985. Notice that this interval is much wider than
the previous interval and does contain zero, near the centre. This is because we have not
controlled for the substantial variability in the sample means which is due to variability in
the values of individual male and female scores.
(e) We should not use matched pairs if the criterion that we use for matching has little to do
    with the response we are measuring. In this case we will not eliminate variability by
    blocking, but we will halve the number of degrees of freedom in our t-statistic, corresponding
    to the loss of information in our assessment of the underlying variability, as we will only learn
    about population variability by considering differences within the individual pairs.
    [Of course, there may also be practical reasons why a matched pair experiment might be
    more difficult or expensive.]
Statistical Concepts 2013/14 Sheet 18 Nonparametric methods
1. A study was done to compare the performances of engine bearings made of different compounds.
Ten bearings of each type were tested. The following table gives the times until failure (in units
of millions of cycles):

Type I  15.21  3.03 12.95 12.51 16.04 16.84  9.92  9.30  5.53  5.60
Type II 12.75  4.67 12.78  6.79  9.37  4.26  4.53  4.69  3.19  4.47

(a) Assuming normal distributions are good models for lifetimes of the two types of bearing,
    test the hypothesis that there is no difference in lifetime between the two types of bearing.
(b) Test the same hypothesis using the nonparametric Wilcoxon rank sum method. Try com-
    puting it using the normal approximation and using tables of the statistic.
(c) Which of the two methods, (a) or (b), do you think gives a better test of the hypothesis?
(d) Suppose, instead, all that is recorded for each bearing above is whether the time to failure
    was short, medium or long, where short is 6 or less, medium is 6 to 14, and long is greater
    than 14. Use the Wilcoxon rank sum statistic to test the hypothesis of no difference.
2. An experiment was performed to compare microbiological and hydroxylamine methods for analysis
of ampicillin dosages. Pairs of tablets were analysed by the two methods. The following data
give the percentages of ampicillin measured to be in each pair of tablets using these methods (so
relative to the real amount).
Pair Microbiological Hydroxylamine
1 97.2 97.2
2 105.8 97.8
3 99.5 96.2
4 100.0 101.8
5 93.8 88.0
6 79.2 74.0
7 72.0 75.0
8 72.0 67.5
9 69.5 65.8
10 20.5 21.2
11 95.2 94.8
12 90.8 95.8
13 96.2 98.0
14 96.2 99.0
15 91.0 100.2
Perform an appropriate nonparametric test of the hypothesis that there is no systematic difference
between the two methods. Do your calculations two ways:
(a) Using a normal approximation to the distribution of the statistic.
(b) Using tables which give exact critical values.
Do your answers suggest that there is a systematic difference between the methods?
Statistical Concepts 2013/14 Solutions 18 Nonparametric methods
1. (a) We do a pooled variance t-test. We have x̄ = 10.69, ȳ = 6.75, s_x = 4.82 and s_y = 3.60.
Hence the pooled variance is

    s² = (9 s_x² + 9 s_y²) / 18 = 18.1

Hence the test statistic is

    (10.69 − 6.75) / √(18.1 × (1/10 + 1/10)) = 2.065

which is just on the verge of being significant at the 5% level, since t_{0.025} for 18 d.f. is 2.101.
(b) We combine the two samples together, find the ranks, and get

    Type I  18  1 17 14 19 20 13 11  8  9
    Type II 15  6 16 10 12  3  5  7  2  4

from which we find that the sum of the Type I ranks is T = 130.
    i. According to the normal approximation, if the hypothesis is true, this should come from a
       normal distribution with mean (1/2) × 10 × 21 = 105 and variance 10 × 10 × 21 / 12 = 175.
       But (130 − 105)/√175 = 1.89, which again is on the verge of significance, since the critical
       value for 5% significance is 1.96.
    ii. From the table which you were given, we reject at level α = 0.05 if T is less than T_L = 79
       or greater than T_U = 10(10 + 10 + 1) − 79 = 131. So we just fail to reject the hypothesis,
       but we would reject at the 10% level.
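The ranking and the normal approximation in (b) can be reproduced directly from the bearing data (there are no ties here, so plain ranks suffice):

```python
import math

type1 = [15.21, 3.03, 12.95, 12.51, 16.04, 16.84, 9.92, 9.30, 5.53, 5.60]
type2 = [12.75, 4.67, 12.78, 6.79, 9.37, 4.26, 4.53, 4.69, 3.19, 4.47]

# Rank the pooled sample; all 20 values are distinct.
pooled = sorted(type1 + type2)
rank = {v: i + 1 for i, v in enumerate(pooled)}
T = sum(rank[v] for v in type1)       # rank sum for Type I, 130

m = n = 10
mean_T = m * (m + n + 1) / 2          # 105
var_T = m * n * (m + n + 1) / 12      # 175
z = (T - mean_T) / math.sqrt(var_T)   # ~1.89
print(f"T = {T}, z = {z:.2f}")
```

The standardised value of about 1.89 falls just short of the 5% critical value 1.96, in line with the solution.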
(c) Neither of these methods suggests rejecting the null hypothesis very strongly, nor do they
    disagree very much. Nevertheless, the Wilcoxon method is better for these data, as there is no
    reason to suppose normality of the population, and the data don't look very normal either.
(d) The reduced data table is
Short Medium Long
Type I 3 4 3
Type II 6 4 0
In the above table, 3 observations tie at the largest value, 8 observations tie at the next largest
value, and 9 observations tie at the smallest value. Therefore the midrank of the smallest 9 is
(1 + 9)/2 = 5, the midrank of the next 8 is 9 + (8 + 1)/2 = 13.5, and the midrank of the largest 3
is 17 + (3 + 1)/2 = 19. Therefore, the Wilcoxon statistic is W_S = 3 × 5 + 4 × 13.5 + 3 × 19 = 126.
The expectation of W_S is (1/2) × 10 × 21 = 105, as before.
The variance of W_S, with the correction for ties, is

    10 × 10 × 21 / 12 − 10 × 10 × ((9³ − 9) + (8³ − 8) + (3³ − 3)) / (12 × 20 × 19) = 147.6

Therefore, under the hypothesis of no treatment difference, W_S has approximately a normal dis-
tribution with mean 105 and variance 147.6, so

    P(W_S > 126) = P((W_S − 105)/12.15 > (126 − 105)/12.15) = 1 − Φ(1.73) = 0.042

so that the two-sided test of no difference (p ≈ 2 × 0.042 = 0.084) would reject only at around the
10% level, which again is fairly weak evidence of a difference between the two types.
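The midranks and the tie-corrected variance can be computed mechanically, which is a useful check when the tie groups are larger. A sketch following the solution's formula:

```python
import math

# Pooled tie group sizes, smallest to largest: 9 short, 8 medium, 3 long.
groups = [9, 8, 3]
midranks = []
start = 0
for g in groups:
    midranks.append(start + (1 + g) / 2)  # midrank of each tie group
    start += g
# midranks -> [5.0, 13.5, 19.0]

type1_counts = [3, 4, 3]                  # Type I: short, medium, long
W = sum(c * r for c, r in zip(type1_counts, midranks))   # 126

m = n = 10
N = m + n
mean_W = m * (N + 1) / 2                                 # 105
tie_term = sum(g**3 - g for g in groups)                 # 1248
var_W = m * n * (N + 1) / 12 - m * n * tie_term / (12 * N * (N - 1))
z = (W - mean_W) / math.sqrt(var_W)
print(f"W = {W}, var = {var_W:.1f}, z = {z:.2f}")
```

This reproduces W_S = 126, variance 147.6 and z ≈ 1.73, matching the solution's tail probability of 0.042.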
2. The paired structure means that the Wilcoxon signed rank test is appropriate. The differences
of the data and the ranks of those which are non-zero are

    Pair  Difference  Rank of |Difference|
      1      0.0
      2      8.0        13.0
      3      3.3         7.0
      4     -1.8         3.5
      5      5.8        12.0
      6      5.2        11.0
      7     -3.0         6.0
      8      4.5         9.0
      9      3.7         8.0
     10     -0.7         2.0
     11      0.4         1.0
     12     -5.0        10.0
     13     -1.8         3.5
     14     -2.8         5.0
     15     -9.2        14.0
The sum of those ranks with positive associated differences is 61. We calculate significance levels
using n = 14, since one of the differences is 0.
(a) The expectation of the signed rank sum is n(n + 1)/4 = 52.5. The variance is n(n + 1)(2n +
    1)/24 = 253.75, and so the standard deviation is 15.9. The standardised observed value is
    (61 − 52.5)/15.9 = 0.535, which is not significant at any conventional level of α. We do not
    reject the null hypothesis.
(b) The table which you were given is for the minimum of W+ (the sum of the positive signed
    ranks) and W− (the sum of the negative signed ranks). The sum of the ranks with negative
    associated sign is 44. For n = 14 and α = 0.05, the critical value is 22. Our value is larger,
    so we do not reject the null hypothesis. Note that we would not even reject the hypothesis
    at the 10% level.
We do not find sufficient evidence to conclude that there is a systematic difference.
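The differencing, midranking and normal approximation for the signed rank test can be reproduced from the ampicillin data; a sketch in plain Python:

```python
import math

micro = [97.2, 105.8, 99.5, 100.0, 93.8, 79.2, 72.0, 72.0,
         69.5, 20.5, 95.2, 90.8, 96.2, 96.2, 91.0]
hydro = [97.2, 97.8, 96.2, 101.8, 88.0, 74.0, 75.0, 67.5,
         65.8, 21.2, 94.8, 95.8, 98.0, 99.0, 100.2]

# Drop the zero difference (pair 1), then rank |difference| with midranks for ties.
diffs = [round(a - b, 1) for a, b in zip(micro, hydro) if a != b]
n = len(diffs)                                   # 14
ordered = sorted(abs(d) for d in diffs)

def midrank(v):
    """Average rank of value v among the ordered absolute differences."""
    idxs = [i + 1 for i, x in enumerate(ordered) if x == v]
    return sum(idxs) / len(idxs)

w_plus = sum(midrank(abs(d)) for d in diffs if d > 0)   # 61.0

mean_w = n * (n + 1) / 4                         # 52.5
var_w = n * (n + 1) * (2 * n + 1) / 24           # 253.75
z = (w_plus - mean_w) / math.sqrt(var_w)         # ~0.53
print(f"W+ = {w_plus}, z = {z:.3f}")
```

The standardised value of about 0.53 is far from any usual critical value, agreeing with the conclusion that there is no evidence of a systematic difference between the methods.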