371 Past exams

page 1

2005
1. Consider two linear models for the same data set:
Model 1: $y = X_1\beta_1 + e$
Model 2: $y = X_2\beta_2 + e$

(a) What are the most general circumstances under which models 1 and 2 are
equivalent ($X_1 = X_2$ is only one special case), and which of the following are
the same for equivalent models: $\hat\beta$, $X\hat\beta$, rank$(X)$, Deviance$(\hat\beta)$, $\hat\sigma^2$?
(b) Under what circumstances can models 1 and 2 be compared using an F test?
(c) Under what condition on the design matrix $X_1$ is $\hat\beta_1$ unique?
(d) What happens when the condition in (c) is not satisfied?
(e) Explain how an estimate of $\beta_1$ can be found in practice when the condition in
(c) is not satisfied.
(f) Under what circumstances is the linear combination of the parameters $P'\beta_1$
said to be estimable?
[4 + 1 + 1 + 1 + 2 + 1 = 10 marks]

Questions 2 to 4 refer to the following experiment.


The effects of four drugs (A, B, C and D) in delaying atrophy of denervated muscles were investigated. A certain leg muscle in each of 48 rats was deprived of its nerve supply by surgical severing
of the appropriate nerves. The rats were then randomly allocated to four groups, and each group
was treated with one of the drugs. After 4, 8 and 12 days, 16 of the rats, 4 from each drug group,
were selected, at random, and the weight (in grams) of the denervated muscle was measured.
Theoretically, atrophy should be measured as the loss in weight of the muscle, but the initial
weight of the muscle could not be obtained without killing the rat. Consequently, the initial total
body weight (in grams) of the rat was measured. It was assumed that this figure is closely related
to the initial weight of the muscle. Drugs B and C were small and large doses, respectively, of
atropine sulfate, drug D was quinidine sulfate while drug A acted as a control; it was simply a
saline solution. In the table below, x is (1000 times) the log of the initial body weight and y is
(1000 times) the log of the weight of the denervated muscle. Due to an accident in the lab, muscle
weights were not obtained for two of the rats (denoted by ** in the table).

          Drug A      Drug B      Drug C      Drug D
Days      x     y     x     y     x     y     x     y
 4      336    37   297    76   422    86   258     4
        391    64   394    61   301    46   425   179
        408   100   255    66   322     0   438   190
        301    71   338    83   283     0   255     9

 8      423    41   270    60   250   174   288    13
        394   137   342    17   274   143   438    29
        255   187   299    56   398    33   346    64
        377    **   380    17   290   125   438    17

12      297   469   398    60   310   244   270    92
        243   367   461    41   369    97   456     4
        299   387   407    60   324   161   389    13
        350   319   367    **   330    76   332    60


2. In this question we will consider just the responses (the y-values) for Drug A, the
saline solution.
The following Splus output was obtained with the contrasts for factor set at contr.treatment.
Days.f is days treated as a factor whereas Days is days treated as a variable (taking
values 4, 8 and 12).
> rats.1 <- lm(y~Days.f,data=rats.dat[Drug=="A",])
> anova(rats.1)
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq F Value        Pr(F)
Days.f     2  328160.5 164080.3 30.9063 0.0001724347
Residuals  8   42471.7   5309.0
> summary(rats.1)$coef
                Value Std. Error    t value      Pr(>|t|)
(Intercept)   14.0000   36.43130  0.3842849 0.71078042813
Days.f2     -135.6667   55.64973 -2.4378675 0.04069966398
Days.f3     -399.5000   51.52164 -7.7540237 0.00005463732
> disp.v(rats.1)
[1] "Estimated covariance matrix of parameter vector"
            (Intercept)   Days.f2   Days.f3
(Intercept)     1327.24 -1327.240 -1327.240
Days.f2        -1327.24  3096.892  1327.240
Days.f3        -1327.24  1327.240  2654.479

> rats.2 <- lm(y~Days,data=rats.dat[Drug=="A",])
> anova(rats.2)
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
Days       1  319200.5 319200.5 55.85671 0.00003796058
Residuals  9   51431.7   5714.6

(a) What would the parameter estimates be (under summary) for the (ANOVA)
model:
lm(y~Days.f - 1,data=rats.dat[Drug=="A",]) ?
(b) It has been suggested that it would have been just as good, if not better, to
treat days as a variable (with just the linear term) rather than as a factor.
i. Give one possible advantage for treating days as a variable rather than as
a factor.
ii. Using the Splus output provided, it is possible (here) to test the hypothesis
that it is OK to treat days as a variable using either an F test or a t-test.
Carry out ONE (only) of the tests and state your conclusion.
[3 + 5 = 8 marks]
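
For the F test route in (b)ii, the statistic compares the residual sums of squares of the two nested fits. The following is a minimal sketch in Python rather than Splus; the function name `nested_f` is our own, and the inputs are the Residuals lines of anova(rats.2) and anova(rats.1) above. The 5% critical value still has to be looked up in F tables on 1 and 8 degrees of freedom.

```python
def nested_f(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic for testing a reduced model nested inside a fuller model."""
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    denominator = rss_full / df_full
    return numerator / denominator

# Reduced model: Days as a variable (rats.2); full model: Days.f as a factor (rats.1)
f_stat = nested_f(rss_reduced=51431.7, df_reduced=9, rss_full=42471.7, df_full=8)
print(f"F = {f_stat:.3f} on 1 and 8 degrees of freedom")
```

The same helper applies to any pair of nested models for which a residual sum of squares (deviance) and residual degrees of freedom are available.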


3. Here we consider the response (y) and all levels of both factors (Days and Drug),
but ignore the covariate (x).
> rats.3 <- lm(y~Days.f*Drug, data=rats.dat)
> anova(rats.3)
Terms added sequentially (first to last)
            Df Sum of Sq  Mean Sq  F Value       Pr(F)
Days.f       2  312180.1 156090.0 33.44240 0.000000009
Drug         3  235044.3  78348.1 16.78613 0.000000731
Days.f:Drug  6  126347.0  21057.8  4.51166 0.001830740
Residuals   34  158692.6   4667.4

Table of means
                        Drug
Days         A        B        C        D      All
4        14.00    38.50    10.00    89.00    37.88
8      -121.67   -29.00  -102.25    24.25   -52.87
12     -385.50   -53.67  -144.50   -40.25  -162.80
All    -168.27   -11.18   -78.92    24.33   -57.15

(a) Produce a suitable display of the data (a rough sketch will suffice), and use it
to help describe the findings of the ANOVA given above.
(b) The main interest in this study is the comparison between drugs. Complete an
appropriate follow-up analysis for the factor Drug, and state your conclusions.
(c) The suggestion is made that the interaction between Days and Drug is due to
the results for Day 12 on Drug A. In order to test this suggestion the model:
lm(y ~ Days.f + Drug + Day12.DrugA, data = rats.dat)
was fitted, where Day12.DrugA is a factor with 2 levels, one for Day 12 on
Drug A, and another for all other combinations. The deviance for this model is
167832.2 with 39 degrees of freedom. Carry out a test of whether the suggestion
is a good one, and state your conclusion.
(d) Briefly describe an alternative way the suggestion in (c) could have been tested.
(e) Yet another approach is to use the numerical nature of Days. Doing this the
following output was obtained.
> anova(lm(y ~ Drug*Days, data=rats.dat))
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
Drug       3  244434.5  81478.2 17.99283 0.0000001985
Days       1  302778.0 302778.0 66.86247 0.0000000007
Drug:Days  3  112973.3  37657.8  8.31597 0.0002239914
Residuals 38  172078.1   4528.4

> summary(lm(y ~ Drug*Days, data=rats.dat))$coef
                 Value Std. Error   t value      Pr(>|t|)
(Intercept)  231.22727  51.728667  4.470003 6.843536e-005
DrugB       -152.27990  74.001989 -2.057781 4.651884e-002
DrugC       -155.64394  72.920533 -2.134432 3.931887e-002
DrugD        -77.64394  72.920533 -1.064775 2.936971e-001
Days         -49.93750   5.947932 -8.395776 3.472980e-010
DrugBDays     38.13487   8.737377  4.364567 9.439271e-005
DrugCDays     30.62500   8.411646  3.640786 8.066059e-004
DrugDDays     33.78125   8.411646  4.016010 2.692819e-004

i. Show that this model is not significantly worse than the one with Days
treated as a factor.
ii. This model fits a different (straight) line for each drug. Find the equations
of the lines for drugs A and B.
iii. The equations of the lines for drugs B, C and D seem to be quite similar.
(There is no need to find the equations of the lines for drugs C and D.)
Give details of how you could test if there was no (significant) difference
between them.
[4 + 6 + 4 + 2 + 7 = 23 marks]

4. Here we consider the covariate (x) in addition to the response (y). To simplify
matters only the data from the three active drugs, B, C and D have been used here.
Numerous models were fitted as indicated in the table below.

    Model                              Deviance  Residual df
 1  Days.f*Drug*x                         23891           17
 2  Days.f*Drug + Days.f*x + Drug*x       35799           21
 3  Days.f*Drug + Days.f*x                41425           23
 4  Days.f*Drug + Drug*x                  39025           23
 5  Days.f*x + Drug*x                     36306           25
 6  Days.f*Drug + x                       44150           25
 7  Days.f + Drug*x                       40139           27
 8  Drug + Days.f*x                       42774           27
 9  Days.f + Drug + x                     46611           29
10  Days.f + Drug                        121449           30
11  Drug + x                             176954           31
12  Days.f + x                            77501           31

(a) Not all of the models can be compared, formally, using F tests. Explain why.
(b) Models 3, 4 and 5 cannot be formally compared, nor can models 6, 7 and 8.
However, models 3, 4 and 5 can be compared with some, but not all, of the
models 6, 7 and 8. Of the nine possible pairings (3 with 6), (3 with 7) . . . (5
with 8), list those pairs that CANNOT be formally compared.
(c) Determine which of the 12 models seems to be most appropriate here. Give
details of any tests you use, and the outcome of each test.


(d) In the light of your findings, comment on whether there was any value in using
the covariate.
(e) Suggest an alternative way the study could have been designed that would have
been similar to using a covariate.
[2 + 3 + 7 + 2 + 2 = 16 marks]
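
Parts (a) and (b) of question 4 turn on nesting: two models can be formally compared with an F test only if every term of one also appears in the other. The sketch below, in Python, checks this mechanically for models 3 to 8. The helper `expand` is our own and handles only the `+` and `*` operators used in the table; it is an illustration of the nesting idea, not a general formula parser.

```python
from itertools import combinations

def expand(formula):
    """Expand a formula such as 'a*b + c' into the set of terms it contains."""
    terms = set()
    for part in formula.split("+"):
        factors = sorted(f.strip() for f in part.split("*"))
        # a*b stands for the main effects a, b and the interaction a:b
        for k in range(1, len(factors) + 1):
            for combo in combinations(factors, k):
                terms.add(":".join(combo))
    return terms

models = {
    3: "Days.f*Drug + Days.f*x",
    4: "Days.f*Drug + Drug*x",
    5: "Days.f*x + Drug*x",
    6: "Days.f*Drug + x",
    7: "Days.f + Drug*x",
    8: "Drug + Days.f*x",
}

# A smaller model is nested in a larger one when its term set is a subset.
not_comparable = [
    (big, small)
    for big in (3, 4, 5)
    for small in (6, 7, 8)
    if not expand(models[small]) <= expand(models[big])
]
print(not_comparable)
```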
5. In a study of daily soil evaporation, the following predictor variables were identified:
maxat = Maximum daily air temperature
minat = Minimum daily air temperature
avat  = Integrated area under the daily air temperature curve
        (ie, a measure of average air temperature)
maxst = Maximum daily soil temperature
minst = Minimum daily soil temperature
avst  = Integrated area under the soil temperature curve
maxh  = Maximum daily relative humidity
minh  = Minimum daily relative humidity
avh   = Integrated area under the daily humidity curve
wind  = Total wind, measured in miles per day

(a) After some preliminary analyses, it was decided to (log) transform the response
variable. The decision was made primarily on the basis of one particular plot.
What was the plot and what, most likely, was observed?
(b) The following diagnostic plots were obtained for the model:
> evap.2 <- lm(log(evap) ~ maxst + minst + avst + maxat + minat +
+ avat + maxh + minh + avh + wind, data = evap.dat)

[Figure: two diagnostic plots for evap.2: r.std(evap.2) against fitted(evap.2), and Cook's Distance for each observation. Observations 31 and 41 are labelled in both plots.]


i. Explain, briefly, what is meant by the leverage of an observation (or $H_{ii}$).


ii. Which one of the following would change the leverage of observations?
There is no need to justify your answer.
A. change of scale of the response variable (eg inches to cm)
B. change of scale of one or more of the explanatory variables (eg °F to °C)
C. transform the response variable
D. transform one or more of the explanatory variables
iii. Which one of the four observations, with ID numbers 2, 8, 31 and 41, has
the largest leverage? There is no need to justify your answer.
iv. A decision was made to remove observations 31 and 41. What information, in addition to that given here, was needed to justify this decision?
(c) Use output A to answer the following questions:
i. The computer output gives F = 22.95 on 10 and 33 degrees of freedom,
what is the hypothesis being tested?
ii. If you were to remove one predictor variable from the model, which would
it be?
iii. From the first table in Output A we see that only two variables reach
statistical significance, maxat and avh while avst was close to being significant (at the 5% level). It was therefore decided to fit a model with just
these three variables. Explain why this is not a reasonable thing to do, in
general.
iv. Output A gives 2 sets of P-values (Pr(>|t|) and Pr(F) [top of page 8])
for each of the 10 explanatory variables. Only one of these variables, wind,
has the same P-value in both sets. Why?
Output A:
> evap.3 <- lm(log(evap) ~ maxst + minst + avst + maxat + minat +
+ avat + maxh + minh + avh + wind, data = evap.dat,subset=-c(31,41))
> summary(evap.3)$coef
              Value Std. Error t value Pr(>|t|)
(Intercept)  2.5615     4.9902  0.5133   0.6112
maxst        0.0322     0.0371  0.8682   0.3915
minst       -0.0025     0.0427 -0.0587   0.9535
avst        -0.0234     0.0130 -1.7956   0.0817
maxat        0.1115     0.0378  2.9525   0.0058
minat       -0.0007     0.0295 -0.0226   0.9821
avat        -0.0042     0.0107 -0.3911   0.6983
maxh        -0.0010     0.0430 -0.0225   0.9822
minh         0.0274     0.0223  1.2284   0.2280
avh         -0.0210     0.0073 -2.8764   0.0070
wind         0.0004     0.0003  1.1110   0.2746

Residual standard error: 0.243 on 33 degrees of freedom
Multiple R-Squared: 0.8743
F-statistic: 22.95 on 10 and 33 degrees of freedom, the p-value is 4.483e-012

> anova(evap.3)
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
maxst      1  9.727379 9.727379 164.7825 0.0000000
minst      1  0.634737 0.634737  10.7525 0.0024592
avst       1  0.008646 0.008646   0.1465 0.7043988
maxat      1  1.071469 1.071469  18.1508 0.0001598
minat      1  0.000033 0.000033   0.0006 0.9811877
avat       1  0.169509 0.169509   2.8715 0.0995801
maxh       1  0.532913 0.532913   9.0276 0.0050472
minh       1  0.789705 0.789705  13.3777 0.0008792
avh        1  0.542839 0.542839   9.1957 0.0046976
wind       1  0.072863 0.072863   1.2343 0.2746021
Residuals 33  1.948044 0.059032
(d) Use output A and B to answer the following questions.
i. Carry out a test of whether the model with just maxat, avst and avh is
significantly worse than the model with all 10 predictors, and state your
conclusion.
ii. Using the model with just maxat, avst and avh, predict, and find a 95%
prediction interval for the value of log(evap) when maxat = 85, avst =
150 and avh = 400; the se.fit here is 0.0674. Hence find the 95% prediction
interval for evap.
Output B:
> evap.4 <- lm(log(evap) ~ maxat + avst + avh, subset=-c(31,41),
+ data = evap.dat)
> anova(evap.4)
Analysis of Variance Table
Response: log(evap)
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
maxat      1  10.33466 10.33466 171.0319 0.0000000
avst       1   0.16905  0.16905   2.7976 0.1022108
avh        1   2.57742  2.57742  42.6546 0.0000001
Residuals 40   2.41701  0.06043
> summary(evap.4)$coef
                  Value  Std. Error    t value      Pr(>|t|)
(Intercept)  1.42230216 1.605370236  0.8859652 3.809318e-001
maxat        0.10692312 0.019805914  5.3985454 3.300966e-006
avst        -0.01617516 0.005034902 -3.2126065 2.598735e-003
avh         -0.01228371 0.001880818 -6.5310485 8.481575e-008
[2 + 6 + 5 + 7 = 20 marks]
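
The prediction interval asked for in (d)ii combines the uncertainty in the fitted value (se.fit) with the residual variance, and is then back-transformed from the log scale. A minimal Python sketch of that arithmetic follows, using the evap.4 coefficients and residual mean square from Output B. The value `t_crit = 2.021` is an assumption taken from t tables (upper 2.5% point on 40 degrees of freedom), not something printed in the output.

```python
import math

# Coefficients of evap.4 as printed in Output B
coef = {"(Intercept)": 1.42230216, "maxat": 0.10692312,
        "avst": -0.01617516, "avh": -0.01228371}
new = {"maxat": 85, "avst": 150, "avh": 400}

fit = coef["(Intercept)"] + sum(coef[name] * value for name, value in new.items())

se_fit = 0.0674   # given in the question
mse = 0.06043     # residual mean square of evap.4, on 40 df
t_crit = 2.021    # assumed table value: t(0.975) on 40 df

# A new observation carries its own error variance on top of the fit's uncertainty
se_pred = math.sqrt(mse + se_fit ** 2)
lower, upper = fit - t_crit * se_pred, fit + t_crit * se_pred

print(f"log(evap): {fit:.3f}, interval ({lower:.3f}, {upper:.3f})")
print(f"evap: interval ({math.exp(lower):.2f}, {math.exp(upper):.2f})")
```

Exponentiating the end points gives an interval for evap itself because the log transformation is monotone.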


6. The data below resulted from a study of the effects of antibiotics and diet on the
milk yield of a particular breed of cow.

33.0

Antibiotic
B
32.9 30.8

34.4

35.8

33.4

35.9

35.5

26.2

31.8

32.4

30.7

35.7

38.4

34.6

35.5

33.3

32.4

24.3

32.4

31.4

31.2

34.1

37.4

34.3

37.9

32.8

34.1

27.4

34.0

35.4

32.3

38.0

37.9

35.8

37.7

36.7

Diet
1

A
28.6

30.5

30.9

24.1

29.8

31.6

32.2

27.7

31.1

30.4

31.4

The study was conducted using 12 cows, four different cows for each of the three
antibiotics. The 12 columns in the table give the data for the different cows. Each
cow was inoculated with one of the antibiotics and then placed on each of the four
diets, in randomized order, for two weeks, with a suitable washout period between
each diet. [It is assumed that there are no (systematic) differences between the
periods during which the diets were used.]
Treating the design as a split-plot design, the following (incomplete) Splus output
was obtained.
> summary(aov(milk.y~anti.f*diet.f+Error(cow.f%in%anti.f),data=cows.dat))
Error: cow.f %in% anti.f
          Df Sum of Sq Mean Sq F Value Pr(F)
anti.f     *  316.2217
Residuals  *  173.1306
Error: Within
              Df Sum of Sq Mean Sq F Value Pr(F)
diet.f         *  34.32396
anti.f:diet.f  *   4.41167
Residuals      *  22.82188

(a) i. Complete the analysis and state your conclusions. (A P-value > or < 0.05
will suffice.)
ii. Find estimates of the sub-plot error variance and of the between cows (or
main-plot) variance.
iii. Find standard errors appropriate for comparing (2) levels of any significant
factors.

(b) Suppose now that six of the cows were three years and six were five years old.
i. Describe how the design of the study should have been modified to take
this into account (each cow to still receive one antibiotic and each of the
four diets).
ii. Give the form of the ANOVA table (Sources of variation and degrees of
freedom) that would have been appropriate for the modified design.
[12 + 6 = 18 marks]


7. (a) For $2^n$ factorial experiments there are three approaches that are used to
determine which effects (main effects and/or interactions) are significant. Describe
the three methods and specify the circumstances under which each of them
might be used.
(b) A design is wanted for three replicates of a $2^4$ experiment in 12 blocks of 4
plots.
i. List the effects that could be confounded in each of the three replicates
(possibly different effects in the different replicates) so that all main effects
can be estimated with maximum precision, and each of the 2-factor
interactions is confounded in at most one of the three replicates.
ii. Give the allocation of treatments to blocks for one (only) of the three
replicates described in (i).
iii. Give the form of the ANOVA table (sources of variation and degrees of
freedom) for your design.
[6 + 9 = 15 marks]


2004
1. The Splus output given below was obtained when the model
$y_{ij} = \mu + \alpha_i + e_{ij}$   (1)
was fitted to the following data.

   Levels of factor A
   1     2     3     4
 3.9   7.8   6.9   6.1
 5.6   5.1   7.8   5.9
 4.5   6.5   6.5   6.1
 3.8   4.8   7.7   7.0

> lm.1 <- lm(y~A.f)
> summary(lm(y~A.f))$coef
            Value Std. Error  t value      Pr(>|t|)
(Intercept) 4.450  0.4494209 9.901631 3.984615e-007
A.f2        1.600  0.6355772 2.517397 2.703814e-002
A.f3        2.775  0.6355772 4.366110 9.184445e-004
A.f4        1.825  0.6355772 2.871406 1.405307e-002

(a) Give the parameter estimates that Splus would have produced if the model
lm(y ~ A.f - 1) had been fitted.
(b) Having found that $H_0$: the $\alpha_i$ are all equal is rejected, a follow-up analysis using, for
example, Fisher's or Tukey's method, is usually performed. Explain why this
type of additional analysis is (usually) needed and state, with reasons, which
of Fisher's or Tukey's method is more likely to produce significant results.
(c) Consider a model of the form:
$y_{ij} = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_m x_i^m$   (2)
where $x_i = i$, $i = 1, \dots, 4$.
What is the (smallest) value of m for which models (1) and (2) are equivalent?
For the value of m for which the two models are equivalent, state whether or
not the following are the same for models (1) and (2). There is no need to
justify your answers.
i. the design matrix X (i.e. is the X matrix for model (1) the same as the
X matrix for model (2)?)
ii. k, the number of columns of X
iii. r, the rank of X
iv. M(X)
v. $\hat\beta$ (i.e. is $\hat\beta$ for model (1) the same as $\hat\beta$ for model (2)?)
vi. $\hat y$
vii. deviance($\hat\beta$)
viii. $\hat\sigma^2$

Explain the circumstances under which you would consider fitting model (2),
and explain what follow up analysis you would consider.
(2 + 3 + 10 = 15 marks)
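
As background to part (a) of question 1: with contr.treatment, the intercept is the mean of level 1 and each remaining coefficient is the difference between a level mean and the level 1 mean, whereas removing the intercept makes the coefficients the level means themselves. The Python sketch below checks this numerically. It assumes the data table is read with one column per level of factor A, four observations per level; the variable names are ours.

```python
# Observations at each level of factor A (one column of the data table per level)
data = {
    1: [3.9, 5.6, 4.5, 3.8],
    2: [7.8, 5.1, 6.5, 4.8],
    3: [6.9, 7.8, 6.5, 7.7],
    4: [6.1, 5.9, 6.1, 7.0],
}
means = {level: sum(ys) / len(ys) for level, ys in data.items()}

# Treatment-contrast coefficients printed by summary(lm(y~A.f))$coef
intercept = 4.450
effects = {2: 1.600, 3: 2.775, 4: 1.825}

# Level means implied by the coefficients: intercept, or intercept + effect
from_coef = {1: intercept}
from_coef.update({level: intercept + e for level, e in effects.items()})

for level in data:
    print(level, round(means[level], 3), round(from_coef[level], 3))
```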
2. This question is based on a subset of haemoglobin (Hb, g L$^{-1}$) values recorded as part
of an international study of elite athletes conducted in 1999/2000. The data considered here come from 406 athletes (287 males, 119 females) from six countries. The
average Hb values for males and females from the six countries are given, together
with number of athletes in brackets. [For example, for country 1, the average Hb
was 164.3 from 12 males, and 139.0 from one female.] A test of interaction between
sex and country gave a P-value of 0.113, and interaction has been omitted. Splus
output is given for the strictly additive model
$y_{ijk} = \alpha_i + \beta_j + e_{ijk}$
where $\alpha_1$ and $\alpha_2$ refer to males and females, respectively, while $\beta_1$ to $\beta_6$ refer to
the six countries.

Country   males        females      mean
1         164.3 (12)   139.0 (1)    162.4
2         148.5 (60)   134.5 (17)   145.4
3         167.9 (48)   146.7 (26)   160.4
4         147.2 (63)   131.4 (30)   142.1
5         161.9 (29)   142.6 (18)   154.5
6         149.3 (75)   128.2 (27)   143.7
mean      153.7        136.2

> Hb.2 <- lm(Hb~sex+country,data=Hb.dat)


> anova(Hb.2)
Terms added sequentially (first to last)
           Df Sum of Sq  Mean Sq F Value Pr(F)
sex         1  25661.59 25661.59  311.96     0
country     5  23995.59  4799.12   58.34     0
Residuals 399  32821.02    82.26
> summary(Hb.2)
Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept) 163.8053     2.5166   65.09   0.0000
sex         -18.4690     0.9996  -18.48   0.0000
country2    -14.2992     2.7233   -5.25   0.0000
country3      3.0892     2.7412    1.13   0.2604
country4    -15.7615     2.6967   -5.84   0.0000
country5     -2.2214     2.8586   -0.78   0.4376
country6    -15.2008     2.6775   -5.68   0.0000

Residual standard error: 9.07 on 399 degrees of freedom
Multiple R-Squared: 0.6021
F-statistic: 100.6 on 6 and 399 degrees of freedom, the p-value is 0


(a) Produce a (rough) graphical display of the data provided, and comment on
what it appears to show.
(b) Explain what is being tested, in terms of the $\alpha$'s and $\beta$'s if possible and more
generally otherwise, by each of the following:
i. the F-test for sex in the ANOVA table (F = 311.96);
ii. the F-test for country in the ANOVA table (F = 58.34);
iii. the t-test for sex in the summary table (t = -18.48);
iv. the t-test for country3 in the summary table (t = 1.13);
v. the F-test at the bottom of the summary table (F = 100.6).

(c) Within each country, all of the athletes were located in the one city. Of the six
cities, three were at altitude (above 1700 metres), while the others were (close
to) sea-level. The cities at altitude were those in countries 1, 3 and 5. It is
suspected that any differences between countries (cities) could be due solely to
altitude.
i. Without any calculations, explain how the output given above supports
this suspicion.
ii. Explain how the suspicion could be formally tested.
iii. When the model needed to carry out the formal test (in (ii)) was fitted,
a deviance of 33734.10 was obtained. Complete the test and state your
conclusion.
(5 + 5 + 5 = 15 marks)

3. O'Neill et al. (1983) American Review of Respiratory Disorders, 128:1051-1054,


reported the results of a study of 25 patients with cystic fibrosis. The response
variable of interest is PEmax, a measure of malnutrition in these patients, while the
explanatory variables relate largely to body size or lung function.
Sub  Age  Sex  Height  Weight  BMP  FEV   RV  FRC  TLC  PEmax
  1    7    0     109    13.1   68   32  258  183  137     95
  2    7    1     112    12.9   65   19  449  245  134     85
  3    8    0     124    14.1   64   22  441  268  147    100
  ...
 25   23          179    71.5   95   52  225  127  101    195

Sub    = Subject number
Sex    = 1 = male, 2 = female
BMP    = Body mass (Weight/Height^2) as a percentage of the age-specific
         median in normal individuals
FEV    = Forced expiratory volume in 1 second
RV     = Residual volume
FRC    = Functional residual capacity
TLC    = Total lung capacity
PEmax  = Maximal static expiratory pressure (cm H2O)

Splus output
Output A
> cf.1 <- lm(PEmax~age+sex+height+weight+BMP+FEV+RV+FRC+TLC,data=Cystic.dat)
> anova(cf.1)
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
age        1  10098.48 10098.48 15.69696 0.0012527
sex        1    955.43   955.43  1.48511 0.2417972
height     1    154.97   154.97  0.24088 0.6306711
weight     1    647.32   647.32  1.00618 0.3317276
BMP        1   2857.85  2857.85  4.44222 0.0522960
FEV        1   1529.04  1529.04  2.37672 0.1439858
RV         1    600.61   600.61  0.93357 0.3492569
FRC        1    226.36   226.36  0.35186 0.5619016
TLC        1    112.50   112.50  0.17486 0.6817509
Residuals 15   9650.09   643.34
> summary(cf.1)$coef
                  Value  Std. Error    t value  Pr(>|t|)
(Intercept) 181.6249734 231.3086181  0.7852063 0.4445586
age          -2.5628964   4.7680886 -0.5375102 0.5987996
sex          -3.8711852  15.3993341 -0.2513865 0.8049267
height       -0.4778411   0.9075548 -0.5265149 0.6062292
weight        3.0278616   1.9996911  1.5141647 0.1507688
BMP          -1.7182736   1.1297550 -1.5209259 0.1490764
FEV           1.0553768   1.0807216  0.9765483 0.3442803
RV            0.2080534   0.1963097  1.0598224 0.3059936
FRC          -0.3350868   0.4936715 -0.6787647 0.5076251
TLC           0.2073738   0.4959126  0.4181659 0.6817509

Output B
> cf.0 <- lm(PEmax~1,data=Cystic.dat)
> step(cf.0,~+age+sex+height+weight+BMP+FEV+RV+FRC+TLC,trace=F)
Call:
lm(formula = PEmax ~ weight, data = Cystic.dat)
Coefficients:
 (Intercept)   weight
    63.44932 1.187979
Degrees of freedom: 25 total; 23 residual
Residual standard error (on weighted scale): 26.34972
> step(cf.1,~.,trace=F)
Call:
lm(formula = PEmax ~ weight + BMP + FEV, data = Cystic.dat)
Coefficients:
 (Intercept)   weight      BMP      FEV
    125.7083 1.531035 -1.45341 1.103848
Degrees of freedom: 25 total; 21 residual
Residual standard error (on weighted scale): 23.45087

> lm.step.0 <- lm(PEmax~weight,data=Cystic.dat)
> disp.v(lm.step.0)
[1] "Estimated covariance matrix of parameter vector"
            (Intercept)      weight
(Intercept)  161.079929 -3.46757915
weight        -3.467579  0.09019819
> lm.step.1 <- lm(PEmax~weight+BMP+FEV,data=Cystic.dat)
> disp.v(lm.step.1)
[1] "Estimated covariance matrix of parameter vector"
             (Intercept)      weight         BMP         FEV
(Intercept) 1202.5842832  5.13497458 -17.7596405  0.35222208
weight         5.1349746  0.13213087  -0.1193368 -0.02514198
BMP          -17.7596405 -0.11933679   0.3341278 -0.10968035
FEV            0.3522221 -0.02514198  -0.1096804  0.26498025

(a) i. The two components of output A, anova(cf.1) and summary(cf.1), give
quite different P-values for most of the (explanatory) variables. Explain
why the two sets of P-values are NOT inconsistent.
ii. Based on just the output for anova(cf.1), what can be said about the
possibility of finding at least one more reasonable model for these data
(i.e. a model all of whose terms are significant (at the 5% level))? Justify
your answer.
iii. Based on just the output for summary(cf.1)$coef, what can be said about
the possibility of finding at least one reasonable model for these data (i.e.
a model all of whose terms are significant (at the 5% level))? Justify your
answer.
iv. Consider applying the backward elimination method starting with the model
cf.1 using $F_{\text{remove}} = 4$. Would any of the variables be removed and, if so,
which variable would be removed first?
(b) Output B gives the results obtained when the Splus stepwise procedure is applied
starting with the (null) model, cf.0, and also when starting with cf.1. The
final model is different in the two cases.
i. Explain the step method used by Splus.
ii. There are, in general, two reasons why the outcome of the step procedure
starting with the null model might differ from the outcome when starting
with the full model. Briefly describe one of them.
(c) Using the model lm.step.0, find a 95% confidence interval for the expected value
of PEmax for individuals of weight 50 kg.
(d) Using the model lm.step.1, find a 95% prediction interval for the PEmax of an
individual of weight 50 kg, with BMP = 70 and FEV = 30. The se.fit for such
an individual is 7.678.
(e) In all of the analyses sex has been treated as a regression variable rather than
as a factor. Explain why it is valid to do this.
(f) Output C gives the output for plot(lm.step.1). Comment on whether there
appear to be any problems with assumptions for this model. There is no need
to comment on each of the six plots, just make an overall statement of the form:
"everything seems OK because ......", or "it would be appropriate to try .......
because .........".

Output C
[Figure: output of plot(lm.step.1), six diagnostic plots: Residuals against Fitted : weight + BMP + FEV; sqrt(abs(Residuals)) against fits; PEmax against Fitted : weight + BMP + FEV; Residuals against Quantiles of Standard Normal; Fitted Values and Residuals against f-value; Cook's Distance. Observations 21, 24 and 25 are labelled.]

(8 + 5 + 4 + 3 + 2 + 3 = 25 marks)
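
For part (c) of question 3, the standard error of a fitted mean comes from the quadratic form $x_0' V x_0$, where $V$ is the estimated covariance matrix printed by disp.v(lm.step.0) and $x_0 = (1, 50)'$. The Python sketch below carries out that calculation. The value `t_crit = 2.069` is an assumption taken from t tables (upper 2.5% point on 23 degrees of freedom); everything else is copied from Output B.

```python
import math

# lm.step.0: PEmax ~ weight, coefficients as printed in Output B
b0, b1 = 63.44932, 1.187979

# Estimated covariance matrix of (Intercept, weight) from disp.v(lm.step.0)
V = [[161.079929, -3.467579],
     [-3.467579, 0.09019819]]

x0 = [1.0, 50.0]   # intercept term and weight = 50 kg
t_crit = 2.069     # assumed table value: t(0.975) on 23 df

fit = b0 * x0[0] + b1 * x0[1]
# Variance of the fitted mean is the quadratic form x0' V x0
var_fit = sum(x0[i] * V[i][j] * x0[j] for i in range(2) for j in range(2))
se_fit = math.sqrt(var_fit)
lower, upper = fit - t_crit * se_fit, fit + t_crit * se_fit

print(f"fit = {fit:.2f}, se = {se_fit:.3f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

For the prediction interval in part (d) the residual variance would be added to the squared se.fit before taking the square root.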
4. An experiment was conducted to evaluate the effects of three treatments on the
germination of seeds from five varieties of guayule (a rubber producing shrub). The
variable of interest is the number of plants that germinate in a plot. Each of the 15
combinations of treatment × variety were used in three plots (giving a total of 45
observations). The analyses were carried out using the square-root of the number
of plants that germinated in a plot as the response variable, and the following table
gives means of the response variable.

                      variety
treatment     1     2     3     4     5   all
1          3.41  4.22  3.52  3.80  4.40  3.88
2          3.51  3.58  5.11  4.67  5.59  4.49
3          2.85  3.25  3.36  3.12  3.45  3.20
all        3.26  3.68  4.00  3.87  4.48  3.86

(a) Explain how and why a square-root transformation of the number of plants
(that germinated) may have been chosen.
(b) Assume here that the following analysis is appropriate (note that everything is
balanced).

> guay.lm <- lm(sqrt(plants)~blocks+variety*treatment,data=guayule.dat)
> anova(guay.lm)
Analysis of Variance Table
Response: sqrt(plants)
Terms added sequentially (first to last)
                  Df Sum of Sq  Mean Sq  F Value     Pr(F)
blocks             2   0.57869 0.289347  0.60565 0.5527118
variety            4   7.17266 1.793165  3.75341 0.0144409
treatment          2  12.45262 6.226311 13.03275 0.0000998
variety:treatment  8   5.92663 0.740828  1.55068 0.1849378
Residuals         28  13.37681 0.477743

i. Describe how the study should have been designed (the nature of any grouping of the (45) study units and how the randomization should have been
done) in order to justify this form of analysis.
ii. A. Re-do the ANOVA table omitting the non-significant interaction.
B. What do you conclude about the three treatments? (Use Tukey's
method to decide which, if any, of the treatments differ significantly.)
C. Comment on the effectiveness of the blocking.
(c) Assume here that the following analysis is appropriate (and that everything is
balanced). The flats referred to in the analysis are shallow boxes for starting
seedlings.
> guay.aov <- aov(sqrt(plants)~variety*treatment+Error(flats),
+ data=guayule.dat)
> summary(guay.aov)
Error: flats
          Df Sum of Sq  Mean Sq  F Value    Pr(F)
variety    4   7.17266 1.793165 1.779811 0.209426
Residuals 10  10.07503 1.007503
Error: Within
                  Df Sum of Sq  Mean Sq  F Value       Pr(F)
treatment          2  12.45262 6.226311 32.09045 0.000000573
variety:treatment  8   5.92663 0.740828  3.81823 0.007119376
Residuals         20   3.88048 0.194024

i. Describe how the study should have been designed (the nature of the grouping of the (45) study units into flats and how the randomization should
have been done) in order to justify this form of analysis.
ii. What conclusions can be reached from this analysis (without further calculations).
iii. The between flats variance is estimated to be 0.271. Show how this estimate
can be obtained from the output given above.
iv. Find the value of the standard error for comparing:
A. two treatments applied to the same variety [e.g. (from the table of
means of the response variable) the standard error for (3.41 - 3.51)];
B. two varieties with the same treatment [e.g. the standard error for
(3.41 - 4.22)].
(3 + 10 + 12 = 25 marks)
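
The variance components in part (c) of question 4 follow from the expected mean squares of the two error strata. The Python sketch below shows the arithmetic under a stated assumption about the layout, inferred from the degrees of freedom rather than given explicitly: 15 flats, three per variety, each flat holding three plots, one per treatment. The variable names are ours.

```python
import math

ms_flats = 1.007503     # Residuals mean square in the "Error: flats" stratum
ms_within = 0.194024    # Residuals mean square in the "Error: Within" stratum
plots_per_flat = 3      # assumed: one plot per treatment in each flat
flats_per_variety = 3   # assumed: 15 flats, three for each of the five varieties

# E(ms_flats) = sigma2_within + plots_per_flat * sigma2_flats
var_flats = (ms_flats - ms_within) / plots_per_flat

# Same variety: both means come from the same flats, so flat effects cancel
se_same_variety = math.sqrt(2 * ms_within / flats_per_variety)

# Same treatment, different varieties: different flats, so both variances enter
se_same_treatment = math.sqrt(2 * (var_flats + ms_within) / flats_per_variety)

print(f"between-flats variance = {var_flats:.3f}")
print(f"SE, same variety = {se_same_variety:.3f}")
print(f"SE, same treatment = {se_same_treatment:.3f}")
```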


5. This question refers to a study of young girls suffering from anorexia. A number of
such patients were randomly assigned to receive one of three treatments:
(Treatment 1): Cognitive behaviour treatment.
(Treatment 2): Standard treatment.
(Treatment 3): Family therapy.
The weight (in lbs) of each patient was recorded both before treatment began and
after a fixed period of time on the treatment. Seventy-two patients participated
in the study: 29 received treatment 1, 26 received treatment 2 and 17 received
treatment 3. The scatterplot uses numbers to indicate treatment.

[Figure: After versus before weights for the anorexia study. Scatterplot of after against before, with each patient plotted as her treatment number (1, 2 or 3).]

(a) The following table gives the deviance and (residual) degrees of freedom for six
models that were fitted to the data.
   Model               deviance  df
1  treatment*before        2845  66
2  treatment:before        3245  68
3  treatment + before      3311  68
4  treatment               3665  69
5  before                  4078  70
6  before - 1              4588  71

Use F-tests to determine which of the above models is most appropriate for
these data. Show full details of any tests that you use.


(b) For the six models in the above table, it is not possible to compare all pairs of
models using an F-test. List three (3) such pairs; there is no need to justify
your answer.
(c) The output below was obtained for model (1) (which may or may not be the
best model), and for the equivalent model with Splus specification anorexia.1a.
i. Find a 95% confidence interval for the slope of the line for treatment 2, and
give an interpretation of this confidence interval in terms of the nature of
the relationship between the before and after weights of patients assigned
to the standard treatment.
ii. Find a 95% confidence interval for the difference in the expected after
weights of patients with a before weight of 80 lbs. given treatments 1
and 3.
iii. Describe the meaning of model (1) in non-technical terms. A rough sketch
may help here.
> anorexia.1 <- lm(after~treatment*before,data=anorexia.dat)
> summary(anorexia.1)
                   Value Std. Error t value Pr(>|t|)
(Intercept)      15.5772    21.2083  0.7345   0.4653
treatment2       76.4742    28.3470  2.6978   0.0089
treatment3       -0.7575    34.5516 -0.0219   0.9826
before            0.8480     0.2561  3.3117   0.0015
treatment2before -0.9822     0.3442 -2.8532   0.0058
treatment3before  0.0612     0.4155  0.1474   0.8833

> disp.v(anorexia.1)
[1] "Estimated covariance matrix of parameter vector"
                 (Intercept)  treatment2  treatment3      before
(Intercept)       449.791232 -449.791232 -449.791232 -5.42153583
treatment2       -449.791232  803.552512  449.791232  5.42153583
treatment3       -449.791232  449.791232 1193.814599  5.42153583
before             -5.421536    5.421536    5.421536  0.06556486
treatment2before    5.421536   -9.738768   -5.421536 -0.06556486
treatment3before    5.421536   -5.421536  -14.330501 -0.06556486

                 treatment2before treatment3before
(Intercept)            5.42153583       5.42153583
treatment2            -9.73876785      -5.42153583
treatment3            -5.42153583     -14.33050074
before                -0.06556486      -0.06556486
treatment2before       0.11849956       0.06556486
treatment3before       0.06556486       0.17260593

> anorexia.1a <- lm(after~treatment+treatment:before-1,data=anorexia.dat)
> summary(anorexia.1a)
                   Value Std. Error t value Pr(>|t|)
treatment1       15.5772    21.2083  0.7345   0.4653
treatment2       92.0515    18.8085  4.8941   0.0000
treatment3       14.8198    27.2768  0.5433   0.5887
treatment1before  0.8480     0.2561  3.3117   0.0015
treatment2before -0.1342     0.2301 -0.5832   0.5617
treatment3before  0.9092     0.3272  2.7791   0.0071

> disp.v(anorexia.1a)
[1] "Estimated covariance matrix of parameter vector"
                     treatment1      treatment2      treatment3 treatment1before
treatment1        4.497912e+002 -5.236354e-014 -4.685106e-014   -5.421536e+000
treatment2       -5.236354e-014  3.537613e+002  5.649404e-030    6.332538e-016
treatment3       -4.685106e-014  5.649404e-030  7.440234e+002    5.665891e-016
treatment1before -5.421536e+000  6.332538e-016  5.665891e-016    6.556486e-002
treatment2before  6.420429e-016 -4.317232e+000 -6.926881e-032   -7.764489e-018
treatment3before  5.629147e-016 -6.787750e-032 -8.908965e+000   -6.807559e-018

                 treatment2before treatment3before
treatment1          6.420429e-016    5.629147e-016
treatment2         -4.317232e+000   -6.787750e-032
treatment3         -6.926881e-032   -8.908965e+000
treatment1before   -7.764489e-018   -6.807559e-018
treatment2before    5.293470e-002    8.322636e-034
treatment3before    8.322636e-034    1.070411e-001

(d) This part of the question refers to the design of the study.
i. There is one aspect of the study (as described in the introduction to the
question) which seems somewhat unusual for a designed experiment. What
is it and why is it unusual?
ii. The following output was obtained for a one-way ANOVA of the before
values.
> anova(lm(before~treatment,data=anorexia.dat))
Analysis of Variance Table
Response: before
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq   F Value     Pr(F)
treatment  2    32.569 16.28467 0.5994852 0.5519287
Residuals 69  1874.346 27.16443
What is likely to be the purpose of such an analysis, and what do you
conclude?
iii. Give details of an alternative way in which the before weights could have
been used in the design of the study (i.e. not as a covariate). What form
of analysis would be appropriate for this alternative design?
(4 + 3 + 7 + 6 = 20 marks)
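
The arithmetic behind parts (a) and (c) of question 5 can be sketched in a few lines. This is a minimal illustration and not a model answer: the helper names `f_statistic` and `linear_combination` are hypothetical, the deviances and degrees of freedom come from the table in (a), and the coefficients and covariances are copied from the anorexia.1 output. Critical values still have to be looked up in F and t tables.

```python
import math

def f_statistic(dev_small, df_small, dev_big, df_big):
    """F statistic for a smaller model nested inside a bigger one."""
    numerator = (dev_small - dev_big) / (df_small - df_big)
    denominator = dev_big / df_big
    return numerator / denominator

def linear_combination(c, beta, cov):
    """Estimate and standard error of c'beta, given the covariance matrix of beta."""
    estimate = sum(ci * bi for ci, bi in zip(c, beta))
    variance = sum(c[i] * cov[i][j] * c[j]
                   for i in range(len(c)) for j in range(len(c)))
    return estimate, math.sqrt(variance)

# Part (a): treatment + before (model 3) against treatment*before (model 1).
f_3_vs_1 = f_statistic(3311, 68, 2845, 66)

# Part (c)(i): slope for treatment 2 = before + treatment2before.
# Only the 'before' and 'treatment2before' entries of anorexia.1 are needed.
beta = [0.8480, -0.9822]
cov = [[0.06556486, -0.06556486],
       [-0.06556486, 0.11849956]]
slope2, se_slope2 = linear_combination([1, 1], beta, cov)

print(round(f_3_vs_1, 2), round(slope2, 4), round(se_slope2, 4))
```

The slope and standard error obtained this way can be checked against the treatment2before row of the anorexia.1a summary, since the two fits are equivalent parameterizations of the same model.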

6. (a) Use examples to help explain the difference between:
i. crossed and nested factors
ii. fixed and random effects
(b) Discuss briefly (no more than 2-3 sentences) the role of confounding in 2^n
factorial experiments. (When and why is it used?)


(c) Consider a 2^5 factorial experiment (with factors A, B, C, D and E) to be carried
out in 4 blocks of 8 plots.
i. Find the allocation of treatments to the principal block such that ABC
and ADE are confounded with blocks.
ii. Give the form of the ANOVA table (the source and df columns) for the
design in (i) if the assumption is made that all 3-factor interactions are
negligible.
iii. Describe (briefly) how you would analyse the data if you were unwilling to
assume that any (particular) interactions are negligible.
iv. Explain why it probably would not be desirable to include the 5-factor
interaction (ABCDE) among the effects confounded with blocks.
(4 + 3 + 8 = 15 marks)
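
The even/odd rule for building a principal block in part (c)(i) can be sketched as follows. This is an illustrative enumeration under the usual convention that a treatment combination is written as the lowercase letters of the factors at their high level; the function names are hypothetical.

```python
from itertools import combinations

FACTORS = "abcde"
CONFOUNDED = ["abc", "ade"]  # defining contrasts from part (c)(i)

def all_treatments(factors):
    """Every treatment combination, written as the letters at the high level."""
    combos = []
    for size in range(len(factors) + 1):
        for letters in combinations(factors, size):
            combos.append("".join(letters))
    return combos

def in_principal_block(treatment, contrasts):
    """True when the treatment shares an even number of letters with each contrast."""
    return all(len(set(treatment) & set(c)) % 2 == 0 for c in contrasts)

def generalized_interaction(x, y):
    """Product of two effects with squared letters cancelled (symmetric difference)."""
    return "".join(sorted(set(x) ^ set(y)))

# The empty combination (all factors low) is conventionally written (1).
principal = [t if t else "(1)" for t in all_treatments(FACTORS)
             if in_principal_block(t, CONFOUNDED)]
third = generalized_interaction(*CONFOUNDED)

print(principal)
print(third.upper())
```

With four blocks, three effects are confounded in total: the two chosen contrasts and their generalized interaction, which `generalized_interaction` returns.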


2003
1. The following measurements were obtained on the weights of various combinations
of three objects O1, O2 and O3.
Objects        Weight
O1, O2          5.0
O1, O3          5.7
O2, O3          6.7
O1, O2, O3      9.0
O1              2.0
O2              2.5
O3              3.5

It is possible that the balance used may be biased in the sense of constantly giving
a higher (or lower) weight.
(a) Assuming a model of the form $y = X\beta + e$, write down the vector y and the
design matrix X.
(b) What is the rank of X?
(c) To test the hypothesis (H0 ) that the weights of the three objects are in the ratio
2 : 3 : 4, two models were fitted to the data; the model referred to in (a), for
which the deviance was 0.065, and the model under H0 , for which the deviance
was 0.295. Write down the design matrix for the model under H0 , carry out a
test of the hypothesis, and state your conclusions (accept or reject H0 at the
5% level).
[2 + 1 + 5 = 8 marks]
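
A sketch of how the two design matrices in question 1 could be laid out, and how the F statistic follows from the quoted deviances. It assumes the model in (a) has a bias (intercept) column plus one column per object, which is one reading of the question; the variable names are illustrative only.

```python
# Rows follow the table: which objects were on the balance for each weighing.
weighings = [
    (1, 1, 0),  # O1, O2
    (1, 0, 1),  # O1, O3
    (0, 1, 1),  # O2, O3
    (1, 1, 1),  # O1, O2, O3
    (1, 0, 0),  # O1
    (0, 1, 0),  # O2
    (0, 0, 1),  # O3
]
y = [5.0, 5.7, 6.7, 9.0, 2.0, 2.5, 3.5]

# Assumed full model: a bias (intercept) column plus one column per object.
X_full = [[1, o1, o2, o3] for (o1, o2, o3) in weighings]

# Under H0 the weights are 2k, 3k, 4k, so the three object columns
# collapse into the single column 2*O1 + 3*O2 + 4*O3.
X_null = [[1, 2 * o1 + 3 * o2 + 4 * o3] for (o1, o2, o3) in weighings]

n = len(y)
df_full = n - len(X_full[0])  # residual df, both matrices have full column rank
df_null = n - len(X_null[0])

dev_full, dev_null = 0.065, 0.295  # deviances quoted in the question
f_stat = ((dev_null - dev_full) / (df_null - df_full)) / (dev_full / df_full)
print(X_null, df_full, df_null, round(f_stat, 2))
```

The statistic is then compared with the upper 5% point of the F distribution on (df_null - df_full, df_full) degrees of freedom.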
2. Small loaves of bread were prepared with flour that was fortified with a fixed amount
of vitamins. One day after baking, the vitamin C content of two loaves was measured.
Another two loaves baked at the same time were stored for three days and then the
vitamin C content was measured. In a similar manner two loaves were stored for five
days and another two for seven days before measurements were taken. The units are
milligrams per hundred grams of bread (mg/100 g).
Condition                   Vitamin C (mg/100 g)
One day after baking          40.25    43.46
Three days after baking       21.25    22.34
Five days after baking        13.18    11.65
Seven days after baking        8.51     8.13

Using condit.f to denote the condition treated as a factor with four levels, the following Splus output was obtained when the simple one-way ANOVA model
$y_{ij} = \mu + \tau_i + e_{ij}$
was fitted using contrasts = contr.treatment and contrasts = contr.sum.


> options(contrasts=c("contr.treatment", "contr.poly"))
> lm.1 <- lm(vitc ~ condit.f)
> summary(lm.1)
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  41.955      0.892  47.029    0.000
condit.f2   -20.160      1.262 -15.979    0.000
condit.f3   -29.540      1.262 -23.414    0.000
condit.f4   -33.635      1.262 -26.660    0.000
> disp.v(lm.1)
[1] "Estimated covariance matrix of parameter vector"
            (Intercept) condit.f2 condit.f3 condit.f4
(Intercept)     0.79584  -0.79584  -0.79584  -0.79584
condit.f2      -0.79584   1.59169   0.79584   0.79584
condit.f3      -0.79584   0.79584   1.59169   0.79584
condit.f4      -0.79584   0.79584   0.79584   1.59169

> options(contrasts=c("contr.sum", "contr.poly"))
> lm.2 <- lm(vitc ~ condit.f)
> summary(lm.2)
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  21.121      0.446  47.352    0.000
condit.f1    20.834      0.773  26.966    0.000
condit.f2     0.674      0.773   0.872    0.432
condit.f3    -8.706      0.773 -11.269    0.000
> disp.v(lm.2)
[1] "Estimated covariance matrix of parameter vector"
            (Intercept) condit.f1 condit.f2 condit.f3
(Intercept)     0.19896   0.00000   0.00000   0.00000
condit.f1       0.00000   0.59688  -0.19896  -0.19896
condit.f2       0.00000  -0.19896   0.59688  -0.19896
condit.f3       0.00000  -0.19896  -0.19896   0.59688
(a) Explain, in terms of the expected vitamin C content of the loaves, what is being
tested by the t-test for condit.f2 in the summary output for:
i. the model lm.1 (t = -15.979);
ii. the model lm.2 (t = 0.872).
(b) Find a 95% confidence interval for $\tau_1 - 2\tau_2 + \tau_3$, and give an interpretation of
your confidence interval in terms of how the vitamin C level changes over days
one to five.
(c) Specify a model that uses the actual number of days rather than condit.f,
which is equivalent to the ones fitted above. What might be the advantage
of considering such a model?
[4 + 6 + 2 = 12 marks]
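
To see why lm.1 and lm.2 in question 2 are the same model in different parameterizations, the fitted mean for each storage condition can be rebuilt from either set of coefficients. The sketch below assumes the standard definitions of treatment contrasts (level 1 as baseline) and sum contrasts (effects summing to zero); the function names are hypothetical.

```python
# Coefficients copied from the two summaries in the question.
treatment_coef = [41.955, -20.160, -29.540, -33.635]  # lm.1, contr.treatment
sum_coef = [21.121, 20.834, 0.674, -8.706]            # lm.2, contr.sum

def means_from_treatment(coef):
    """Level 1 is the baseline; other levels add their own coefficient."""
    intercept, *effects = coef
    return [intercept] + [intercept + e for e in effects]

def means_from_sum(coef):
    """Effects sum to zero, so the last level's effect is minus the others' total."""
    intercept, *effects = coef
    effects = effects + [-sum(effects)]
    return [intercept + e for e in effects]

means_1 = means_from_treatment(treatment_coef)
means_2 = means_from_sum(sum_coef)
for level, (m1, m2) in enumerate(zip(means_1, means_2), start=1):
    print(level, round(m1, 3), round(m2, 3))
```

The two columns agree up to the rounding of the printed coefficients, which is why any estimable contrast of the level means has the same estimate and standard error under both fits.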


3. Haemoglobin [Hb] is an indicator of oxygen carrying capacity of blood, high values
of which are considered to be beneficial for athletes in endurance sports. Hb data
from 153 AIS (Australian Institute of Sport) athletes, 81 males and 72 females from
seven sports, as indicated, resulted in the Splus output given below. The contrasts
were set at contr.treatment with gender = 1 for males and 2 for females.
Sample sizes
Sport                              Males  Females  Total
Basket Ball [BBall]                  12      13      25
Field Sports [Field]                 11       7      18
Rowing [Row]                         15      22      37
Swimming [Swim]                      13       9      22
Track Sports 400 metres [T400m]      18      10      28
Tennis [Tennis]                       3       7      10
Track Sports sprint [Tsprint]         9       4      13
Total                                81      72     153

> anova(lm(Hb~gender*Sport, data = Hb.dat))
Terms added sequentially (first to last)
              Df Sum of Sq Mean Sq F Value   Pr(F)
gender         1    105.00  105.00  178.26 0.00000
Sport          6     14.93    2.49    4.23 0.00061
gender:Sport   6      3.10    0.52    0.88 0.51353
Residuals    139     81.87    0.59

> summary(lm(Hb~gender + Sport, data = Hb.dat))
Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept)   14.938      0.167  89.531    0.000
gender        -1.618      0.127 -12.694    0.000
SportField     1.064      0.237   4.485    0.000
SportRow       0.606      0.198   3.053    0.003
SportSwim      0.438      0.224   1.954    0.053
SportT400m     0.323      0.212   1.524    0.130
SportTennis    0.155      0.287   0.540    0.590
SportTsprint   0.722      0.263   2.743    0.007
(a) Describe what you would have done by way of follow-up analysis had the
interaction between gender and sport been significant.
(b) Given that the interaction is not significant, find values of appropriate test
statistics for testing the (main) effects of gender and sport (on Hb).
(c) Find a 95% confidence interval for the (mean) difference in Hb between males
and females. Would you expect Tukey's multiple comparisons method to give
a 95% interval that was narrower, wider or of the same width as your interval?
(d) Determine whether or not Hb levels for Basket Ball [BBall] and Track (sprint)
[Tsprint] athletes differ significantly at the 5% level according to Tukey's multiple
comparisons procedure. Explain why it is not possible to obtain Tukey's
LSDQ values for all pairs of sports using just the Splus output provided.
[3 + 4 + 3 + 5 = 15 marks]
4. This question refers to a field study of factors affecting the weight of lambs carried
out by Agriculture students in 1991. The data are the lamb's weight on April 30,
1991 (W), the lamb's sex (S) and the age of the lamb, in days, on April 30, 1991
(A). A graph for the 26 lambs studied is shown below.

[Figure: Weight plotted against Age (in days) for the 26 lambs, with separate plotting symbols for sex (F, M).]

Various models can be fitted to these data to predict the lamb's weight (W) on
April 30, 1991. A model can be written in parametric form where i is 1 for Males
and 2 for Females, $W_{ij}$ is the weight on April 30, $A_{ij}$ is the lamb's age on that
day, and j represents the individual lambs. (For example, a model might look like
$W_{ij} = \mu + \beta_i A_{ij} + e_{ij}$.) An Splus model fitting statement can also be given (for
example, W ~ S + A or W ~ S * A, and so on).
(a) Give a parametric form of the model and the associated Splus specification for
each of the following situations.
i. Age is important in determining weight, but there is no difference between
males and females.
ii. Males and females have different weights, but the same non-zero growth
rate.
iii. Neither age nor sex has any relation to lamb weight.
iv. There is a different growth rate for males and females, but males and
females have the same birthweights.


(b) The table below describes seven models that were fitted to the data.
i. Give the residual degrees of freedom for each model.
ii. It is not possible to formally compare (by F-test) all 21 pairs of models
given in the table below. List four of the pairs that cannot be formally
compared, no justification is required.
iii. Using F-tests derived from the deviances given below, decide on the most
appropriate model for the data. Give full details of the tests you use.
(Assume that the usual assumptions are satisfied.)
iv. Describe (in words, like in part (a)) what is implied by your chosen model.
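Model   Parametric form                               Deviance
1       $W_{ij} = \mu_i + \beta_i A_{ij} + e_{ij}$       48.72
2       $W_{ij} = \mu + \beta_i A_{ij} + e_{ij}$         52.86
3       $W_{ij} = \mu_i + \beta A_{ij} + e_{ij}$         50.24
4       $W_{ij} = \mu + \beta A_{ij} + e_{ij}$           54.18
5       $W_{ij} = \mu_i + e_{ij}$                        63.51
6       $W_{ij} = \beta A_{ij} + e_{ij}$                224.21
7       $W_{ij} = \mu + e_{ij}$                          83.78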
[8 + 12 = 20 marks]

5. How well can house prices be predicted by variables such as age, size, number of
bedrooms etc? Data were available on the sale price (in $1,000) of 51 houses in a
particular location, together with values of the following variables:
age    age of house in years
bed    number of bedrooms
bath   number of bathrooms
size   size of the house in 1000 ft^2
lot    size of the lot (or block of land) in 1000 ft^2
The following table gives a partial listing of the data.
Obsn  age  bed  bath   size    lot  price
   1   21    3     3  0.951   64.9   30.0
   2   21    3     2  1.036  217.8   39.9
   3    7    1     1  0.676   54.5   46.5
   4    6    3     2  1.456   51.8   48.6
 ...  ...  ...   ...    ...    ...    ...
  46   27    3     2  1.920  226.5  167.5
  47    5    3     3  2.949   12.0  169.9
  48   32    4     4  3.310   10.5  175.0
  49   29    3     3  2.805   16.5  179.0
  50    1    3     3  2.553    8.6  179.9
  51   33    3     4  3.627   17.8  199.0

A multiple regression model was fitted using all five explanatory variables and, based
on the diagnostic plots in Figure 1, it was decided to omit observation 46.


Figure 1: Diagnostics for the full model fitted to all 51 observations.
[Diagnostic plots for price ~ age + bed + bath + size + lot. Panels: Residuals against Fitted; sqrt(abs(Residuals)) against fits; price against Fitted; Residuals against Quantiles of Standard Normal; Fitted Values and Residuals against f-value; Cook's Distance. Point labels 46, 31 and 4 appear in the panels.]

Refitting the full model resulted in the following ANOVA table.


> anova(lm(price~bath+lot+age+bed+size,subset=-46,data=Hprice.dat))
Terms added sequentially (first to last)
          Df Sum of Sq Mean Sq F Value   Pr(F)
bath       1     31019   31019  125.71 0.00000
lot        1      8678    8678   35.17 0.00000
age        1       161     161    0.65 0.42290
bed        1        65      65    0.26 0.61054
size       1     25250   25250  102.33 0.00000
Residuals 44     10857     247
After further analysis it was decided to use a model with just age, bed and size.
When this model was fitted to the full data set (all 51 observations), the diagnostics
plot given in Figure 2 was obtained.


Figure 2: Diagnostics for the reduced model fitted to all 51 observations.
[Diagnostic plots for price ~ age + bed + size. Panels: Residuals against Fitted; sqrt(abs(Residuals)) against fits; price against Fitted; Residuals against Quantiles of Standard Normal; Fitted Values and Residuals against f-value; Cook's Distance. Point labels 46, 31 and 51 appear in the panels.]

It was again decided to omit observation 46. The following Splus output was obtained
for the reduced model.
> anova(lm(price~age+bed+size,subset=-46,data=Hprice.dat))
Terms added sequentially (first to last)
          Df Sum of Sq Mean Sq F Value    Pr(F)
age        1       747     747    3.11 0.084393
bed        1     11181   11181   46.56 0.000000
size       1     53055   53055  220.93 0.000000
Residuals 46     11047     240
> summary(lm(price~age+bed+size,subset=-46,data=Hprice.dat))
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  35.485     11.069   3.206    0.002
age          -0.366      0.174  -2.098    0.041
bed         -11.189      3.901  -2.868    0.006
size         61.296      4.124  14.864    0.000
(a) What can you say about observation 46 based on Figures 1 and 2? Explain
how you arrived at your answer.
(b) Explain why the first ANOVA table (the one for the full model) is of little use
in determining the final model.


(c) Describe an approach that might have been used to arrive at the (final) reduced
model.
(d) Carry out a formal test to demonstrate that the reduced model is "better"
in some sense than the full model (both models without observation 46), and
explain how you can tell that an even simpler model is unlikely to be better.
(e) Comment on the coefficient of bed in the reduced model. Is the sign of the
coefficient what you would expect and, if not, how can you explain it?
(f) Using the reduced model, predict the sale price of a 5 year-old house with 4
bedrooms and size 2.1, and write down the formula, in terms of appropriate
variances and covariances, for a 95% prediction interval.
[3 + 2 + 2 + 5 + 3 + 5 = 20 marks]
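
The point prediction in part (f) and the F statistic in part (d) of question 5 come straight from the printed output. The following is a minimal sketch, not a model answer: `predict_price` is a hypothetical helper, and the prediction interval itself is left out because the required variances and covariances are not given in the question.

```python
# Coefficients of the reduced model (observation 46 omitted), from the summary.
coef = {"(Intercept)": 35.485, "age": -0.366, "bed": -11.189, "size": 61.296}

def predict_price(age, bed, size):
    """Point prediction of sale price, in $1,000."""
    return (coef["(Intercept)"] + coef["age"] * age
            + coef["bed"] * bed + coef["size"] * size)

# Residual sums of squares and df read from the two ANOVA tables.
rss_full, df_full = 10857, 44
rss_reduced, df_reduced = 11047, 46

# Extra sum of squares F statistic for dropping bath and lot from the full model.
f_stat = ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
price = predict_price(age=5, bed=4, size=2.1)
print(round(price, 2), round(f_stat, 3))
```

A small F statistic here means the two dropped terms explain little once age, bed and size are in the model.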

6. Consider a 2^4 factorial experiment with factors A, B, C and D. The investigators
are interested in the main effects of the four factors and as many of the interactions
as can be estimated. The two-factor interactions are of particular interest though
two-factor interactions involving factor D are of less interest than the others.
For each of the situations specified below:
• suggest a suitable design (list effects to be confounded with blocks);
• for each replicate of the design give the allocation of treatments to blocks for
the principal block(s) only;
• give the form of the ANOVA table (sources of variation and degrees of freedom);
• explain how you would determine which effects are significant.
(a) two blocks of eight plots are available;
(b) four blocks of four plots are available;
(c) eight blocks of four plots are available.
[4 + 6 + 5 = 15 marks]

7. An experiment to measure the effects of different levels of moisture and fertilizer on
wheat plants was conducted using 48 peat pots on 12 plastic trays. Four pots were
placed on each tray and the moisture treatments consisted of adding 10, 20, 30 or 40
ml of water per pot per day to the tray; the water being absorbed by the peat pots.
The trays were assigned to the moisture levels at random; three trays per level. The
levels of fertilizer were 2, 4, 6 or 8 mg per pot. The four pots on each tray were
assigned to the four levels of fertilizer at random so that each level of fertilizer was
used once on each tray. Thirty days after planting, the weight (in grams) of the dry
matter in each pot was measured, giving a total of 48 observations. The data are
given in the table below, while the Splus output provided was obtained when the
data were analysed as a split plot design.


                         Fertilizer
Moisture  Tray      2      4      6      8
   10       1    3.35   4.32   4.56   5.88
            2    4.04   4.14   6.52   7.38
            3    1.98   3.84   4.47   5.12
   20       4    5.05   7.94  10.77  13.52
            5    5.19   8.51  10.39  13.52
            6    6.95   7.02  10.93  15.28
   30       7    6.57  10.73  12.26  15.71
            8    8.30   8.91  13.44  14.96
            9    5.28   8.67  11.14  15.63
   40      10    6.84   9.08  10.37  12.51
           11    6.50   6.07  10.75  12.50
           12    4.05   3.84   9.44  10.28

> wheat.1 <- aov(drywt ~ moist.f * fert.f + Error(tray.f), data = wheat.dat)
> summary(wheat.1)
Error: tray.f
          Df Sum of Sq Mean Sq F Value      Pr(F)
moist.f    3    269.19  89.730  26.341 0.00016916
Residuals  8     27.25   3.406
Error: Within
               Df Sum of Sq Mean Sq F Value      Pr(F)
fert.f          3    297.05  99.018  131.65 0.00000000
moist.f:fert.f  9     38.06   4.228    5.62 0.00033534
Residuals      24     18.05   0.752
(a) Describe a situation, in terms of number of trays and number of pots per tray,
that would have led to a simple two-way ANOVA being appropriate.
(b) Give estimates of the whole-plot (between tray) and subplot (within tray) error
variances.
(c) Find the standard error for:
i. the difference between the means of two levels of fertilizer for a given level
of moisture;
ii. the difference between the means of two levels of moisture for a given level
of fertilizer.
Explain why these standard errors are of particular interest for these data.
[2 + 3 + 5 = 10 marks]
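
The standard errors in part (c) of question 7 depend on which error stratum each comparison lives in. The sketch below works under the usual split-plot variance-components reading, in which the tray-stratum residual mean square estimates the subplot variance plus four times the between-tray variance; that reading and the variable names are assumptions, not taken from the question.

```python
import math

# Mean squares read from the split-plot ANOVA in the question.
ms_whole = 3.406     # residual mean square, Error: tray.f (8 df)
ms_sub = 0.752       # residual mean square, Error: Within (24 df)
pots_per_tray = 4    # subplots per whole plot (levels of fertilizer)
trays_per_level = 3  # replicate trays per moisture level

# E[ms_whole] = sigma2_sub + pots_per_tray * sigma2_tray, E[ms_sub] = sigma2_sub.
sigma2_sub = ms_sub
sigma2_tray = (ms_whole - ms_sub) / pots_per_tray

# (i) Two fertilizer levels at the same moisture: the same trays are involved,
# so tray effects cancel and only the subplot variance remains.
se_fert = math.sqrt(2 * ms_sub / trays_per_level)

# (ii) Two moisture levels at the same fertilizer: different trays,
# so both variance components contribute.
se_moist = math.sqrt(2 * (sigma2_sub + sigma2_tray) / trays_per_level)

print(round(sigma2_tray, 4), round(se_fert, 3), round(se_moist, 3))
```

The second standard error mixes two mean squares, so it has no exact degrees of freedom; an approximation such as Satterthwaite's is normally used when a t value is needed.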
