2005
1. Consider two linear models for the same data set:
Model 1: $y = X_1 \beta_1 + e$
Model 2: $y = X_2 \beta_2 + e$
(a) What are the most general circumstances under which models 1 and 2 are
equivalent ($X_1 = X_2$ is only one special case), and which of the following are
the same for equivalent models: $\hat{\beta}$, $X\hat{\beta}$, rank($X$), Deviance($\hat{\beta}$),
$\hat{\sigma}^2$.
(b) Under what circumstances can models 1 and 2 be compared using an F test?
(c) Under what condition on the design matrix $X_1$ is $\hat{\beta}_1$ unique?
Days
4
Drug
x
336
391
408
301
A
y
37
64
100
71
Drug B
x
y
297
76
394
61
255 66
338
83
Drug C
x
y
422
86
301
46
322
0
283
0
Drug D
x
y
258
4
425 179
438 190
255
9
423
394
255
377
41
137
187
**
270
342
299
380
60
17
56
17
250
274
398
290
174
143
33
125
288
438
346
438
13
29
64
17
12
297
243
299
350
469
367
387
319
398
461
407
367
60
41
60
**
310
369
324
330
244
97
161
76
270
456
389
332
92
4
13
60
2. In this question we will consider just the responses (the y-values) for Drug A, the
saline solution.
The following Splus output was obtained with the contrasts for factors set at contr.treatment.
Days.f is days treated as a factor whereas Days is days treated as a variable (taking
values 4, 8 and 12).
> rats.1 <- lm(y~Days.f,data=rats.dat[Drug=="A",])
> anova(rats.1)
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
Days.f     2  328160.5 164080.3  30.9063 0.0001724347
Residuals  8   42471.7   5309.0
> summary(rats.1)$coef
                Value Std. Error    t value      Pr(>|t|)
(Intercept)   14.0000   36.43130  0.3842849 0.71078042813
Days.f2     -135.6667   55.64973 -2.4378675 0.04069966398
Days.f3     -399.5000   51.52164 -7.7540237 0.00005463732
> disp.v(rats.1)
[1] "Estimated covariance matrix of parameter vector"
            (Intercept)   Days.f2   Days.f3
(Intercept)     1327.24 -1327.240 -1327.240
Days.f2        -1327.24  3096.892  1327.240
Days.f3        -1327.24  1327.240  2654.479
(a) What would the parameter estimates be (under summary) for the (ANOVA)
model:
lm(y~Days.f - 1,data=rats.dat[Drug=="A",]) ?
(b) It has been suggested that it would have been just as good, if not better, to
treat days as a variable (with just the linear term) rather than as a factor.
i. Give one possible advantage for treating days as a variable rather than as
a factor.
ii. Using the Splus output provided, it is possible (here) to test the hypothesis
that it is OK to treat days as a variable using either an F test or a t test.
Carry out ONE (only) of the tests and state your conclusion.
[3 + 5 = 8 marks]
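The link between the treatment-contrast coefficients printed above and the level means that a no-intercept fit reports can be sketched as follows. This is a minimal illustration in Python rather than Splus; the function name `cell_means_from_treatment` is hypothetical, and the numbers are simply copied from the `summary(rats.1)$coef` output.

```python
def cell_means_from_treatment(intercept, effects):
    """Convert contr.treatment coefficients into one mean per factor level.

    Under treatment contrasts the intercept is the mean of the baseline
    level and every other coefficient is a difference from that baseline.
    """
    return [intercept] + [intercept + effect for effect in effects]


# Coefficients copied from summary(rats.1)$coef (Drug A only).
means = cell_means_from_treatment(14.0000, [-135.6667, -399.5000])

for day, mean in zip((4, 8, 12), means):
    print(f"Days {day}: mean response {mean:.4f}")
```

The same arithmetic applies to any one-way layout fitted with `contr.treatment`; it says nothing about the standard errors, which need the covariance matrix as well.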
3. Here we consider the response (y) and all levels of both factors (Days and Drug),
but ignore the covariate (x).
> rats.3 <- lm(y~Days.f*Drug, data=rats.dat)
> anova(rats.3)
Terms added sequentially (first to last)
            Df Sum of Sq  Mean Sq  F Value       Pr(F)
Days.f       2  312180.1 156090.0 33.44240 0.000000009
Drug         3  235044.3  78348.1 16.78613 0.000000731
Days.f:Drug  6  126347.0  21057.8  4.51166 0.001830740
Residuals   34  158692.6   4667.4
Table of means
                       Drug
Days        A        B        C        D      All
4       14.00    38.50    10.00    89.00    37.88
8     -121.67   -29.00  -102.25    24.25   -52.87
12    -385.50   -53.67  -144.50   -40.25  -162.80
All   -168.27   -11.18   -78.92    24.33   -57.15
(a) Produce a suitable display of the data (a rough sketch will suffice), and use it
to help describe the findings of the ANOVA given above.
(b) The main interest in this study is the comparison between drugs. Complete an
appropriate follow-up analysis for the factor Drug, and state your conclusions.
(c) The suggestion is made that the interaction between Days and Drug is due to
the results for Day 12 on Drug A. In order to test this suggestion the model:
lm(y ~ Days.f + Drug + Day12.DrugA, data = rats.dat)
was fitted, where Day12.DrugA is a factor with 2 levels, one for Day 12 on
Drug A, and another for all other combinations. The deviance for this model is
167832.2 with 39 degrees of freedom. Carry out a test of whether the suggestion
is a good one, and state your conclusion.
(d) Briefly describe an alternative way the suggestion in (c) could have been tested.
(e) Yet another approach is to use the numerical nature of Days. Doing this the
following output was obtained.
> anova(lm(y ~ Drug*Days, data=rats.dat))
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
Drug       3  244434.5  81478.2 17.99283 0.0000001985
Days       1  302778.0 302778.0 66.86247 0.0000000007
Drug:Days  3  112973.3  37657.8  8.31597 0.0002239914
Residuals 38  172078.1   4528.4
i. Show that this model is not significantly worse than the one with Days
treated as a factor.
ii. This model fits a different (straight) line for each drug. Find the equations
of the lines for drugs A and B.
iii. The equations of the lines for drugs B, C and D seem to be quite similar.
(There is no need to find the equations of the lines for drugs C and D.)
Give details of how you could test if there was no (significant) difference
between them.
[4 + 6 + 4 + 2 + 7 = 23 marks]
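Several parts of this question compare a reduced model with a larger model through their deviances (residual sums of squares). A minimal sketch of that arithmetic is given below, assuming the reduced model is nested in the full one; the function name `f_statistic` is hypothetical, and the figures are the deviance quoted in part (c) and the residual line of `anova(rats.3)`. The p-value still has to be read from F tables, since the Python standard library has no F distribution.

```python
def f_statistic(dev_reduced, df_reduced, dev_full, df_full):
    """Return (F, extra_df) for a reduced model nested in a full model.

    F = [(D_reduced - D_full) / (df_reduced - df_full)] / (D_full / df_full)
    """
    extra_df = df_reduced - df_full
    numerator = (dev_reduced - dev_full) / extra_df
    denominator = dev_full / df_full
    return numerator / denominator, extra_df


# Part (c): Days.f + Drug + Day12.DrugA versus the full Days.f*Drug model.
f_value, extra_df = f_statistic(167832.2, 39, 158692.6, 34)
print(f"F = {f_value:.3f} on ({extra_df}, 34) degrees of freedom")
```

The same function can be reused for part (e)(i) by passing the residual line of the `Drug*Days` fit as the reduced model.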
4. Here we consider the covariate (x) in addition to the response (y). To simplify
matters only the data from the three active drugs, B, C and D have been used here.
Numerous models were fitted as indicated in the table below.
Model  Terms                             Deviance  Residual df
1      Days.f*Drug*x                        23891           17
2      Days.f*Drug + Days.f*x + Drug*x      35799           21
3      Days.f*Drug + Days.f*x               41425           23
4      Days.f*Drug + Drug*x                 39025           23
5      Days.f*x + Drug*x                    36306           25
6      Days.f*Drug + x                      44150           25
7      Days.f + Drug*x                      40139           27
8      Drug + Days.f*x                      42774           27
9      Days.f + Drug + x                    46611           29
10     Days.f + Drug                       121449           30
11     Drug + x                            176954           31
12     Days.f + x                           77501           31
(a) Not all of the models can be compared, formally, using F tests. Explain why.
(b) Models 3, 4 and 5 cannot be formally compared, nor can models 6, 7 and 8.
However, models 3, 4 and 5 can be compared with some, but not all, of the
models 6, 7 and 8. Of the nine possible pairings (3 with 6), (3 with 7) . . . (5
with 8), list those pairs that CANNOT be formally compared.
(c) Determine which of the 12 models seems to be most appropriate here. Give
details of any tests you use, and the outcome of each test.
(d) In the light of your findings, comment on whether there was any value in using
the covariate.
(e) Suggest an alternative way the study could have been designed that would have
been similar to using a covariate.
[2 + 3 + 7 + 2 + 2 = 16 marks]
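Whether two of the tabulated models can be compared by an F test depends on one being nested in the other. One rough way to check this is to expand each model formula into its set of terms and test for set inclusion, as sketched below. The helper names `expand_terms` and `is_nested` are hypothetical, only the `+` and `*` operators are handled, and term inclusion is used here as a stand-in for nesting of the column spaces.

```python
from itertools import combinations


def expand_terms(formula):
    """Expand a right-hand side such as 'Days.f*Drug + x' into a set of terms."""
    terms = set()
    for piece in formula.split("+"):
        factors = sorted(f.strip() for f in piece.split("*"))
        # a*b*c stands for every main effect and interaction among a, b and c.
        for size in range(1, len(factors) + 1):
            for combo in combinations(factors, size):
                terms.add(":".join(combo))
    return terms


def is_nested(smaller, larger):
    """True when every term of `smaller` also appears in `larger`."""
    return expand_terms(smaller) <= expand_terms(larger)


# Model 6 against model 3, then model 7 against model 3.
print(is_nested("Days.f*Drug + x", "Days.f*Drug + Days.f*x"))
print(is_nested("Days.f + Drug*x", "Days.f*Drug + Days.f*x"))
```

Looping `is_nested` over all pairs of formulas in the table gives a quick list of the comparisons for which an F test is available.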
5. In a study of daily soil evaporation, the following predictor variables were identified:
maxat =
minat =
avat =
maxst =
minst =
avst =
maxh =
minh =
avh =
wind =
(a) After some preliminary analyses, it was decided to (log) transform the response
variable. The decision was made primarily on the basis of one particular plot.
What was the plot and what, most likely, was observed?
(b) The following diagnostic plots were obtained for the model:
> evap.2 <- lm(log(evap) ~ maxst + minst + avst + maxat + minat +
+ avat + maxh + minh + avh + wind, data = evap.dat)
[Figure: diagnostic plots for evap.2: standardized residuals, r.std(evap.2), against fitted(evap.2), and Cook's distance against observation number; observations 31 and 41 are labelled.]
            t value Pr(>|t|)
(Intercept)  0.5133   0.6112
maxst        0.8682   0.3915
minst       -0.0587   0.9535
avst        -1.7956   0.0817
maxat        2.9525   0.0058
minat       -0.0226   0.9821
avat        -0.3911   0.6983
maxh        -0.0225   0.9822
minh         1.2284   0.2280
avh         -2.8764   0.0070
wind         1.1110   0.2746
> anova(evap.3)
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
maxst      1  9.727379 9.727379 164.7825 0.0000000
minst      1  0.634737 0.634737  10.7525 0.0024592
avst       1  0.008646 0.008646   0.1465 0.7043988
maxat      1  1.071469 1.071469  18.1508 0.0001598
minat      1  0.000033 0.000033   0.0006 0.9811877
avat       1  0.169509 0.169509   2.8715 0.0995801
maxh       1  0.532913 0.532913   9.0276 0.0050472
minh       1  0.789705 0.789705  13.3777 0.0008792
avh        1  0.542839 0.542839   9.1957 0.0046976
wind       1  0.072863 0.072863   1.2343 0.2746021
Residuals 33  1.948044 0.059032
(d) Use output A and B to answer the following questions.
i. Carry out a test of whether the model with just maxat, avst and avh is
significantly worse than the model with all 10 predictors, and state your
conclusion.
ii. Using the model with just maxat, avst and avh, predict, and find a 95%
prediction interval for the value of log(evap) when maxat = 85, avst =
150 and avh = 400; the se.fit here is 0.0674. Hence find the 95% prediction
interval for evap.
Output B:
 Std. Error    t value      Pr(>|t|)
1.605370236  0.8859652 3.809318e-001
0.019805914  5.3985454 3.300966e-006
0.005034902 -3.2126065 2.598735e-003
0.001880818 -6.5310485 8.481575e-008
[2 + 6 + 5 + 7 = 20 marks]
6. The data below resulted from a study of the effects of antibiotics and diet on the
milk yield of a particular breed of cow.
33.0
Antibiotic
B
32.9 30.8
34.4
35.8
33.4
35.9
35.5
26.2
31.8
32.4
30.7
35.7
38.4
34.6
35.5
33.3
32.4
24.3
32.4
31.4
31.2
34.1
37.4
34.3
37.9
32.8
34.1
27.4
34.0
35.4
32.3
38.0
37.9
35.8
37.7
36.7
Diet
1
A
28.6
30.5
30.9
24.1
29.8
31.6
32.2
27.7
31.1
30.4
31.4
The study was conducted using 12 cows, four different cows for each of the three
antibiotics. The 12 columns in the table give the data for the different cows. Each
cow was inoculated with one of the antibiotics and then placed on each of the four
diets, in randomized order, for two weeks, with a suitable washout period between
each diet. [It is assumed that there are no (systematic) differences between the
periods during which the diets were used.]
Treating the design as a split-plot design, the following (incomplete) Splus output
was obtained.
> summary(aov(milk.y~anti.f*diet.f+Error(cow.f%in%anti.f),data=cows.dat))
Error: cow.f %in% anti.f
          Df Sum of Sq Mean Sq F Value Pr(F)
anti.f     *  316.2217
Residuals  *  173.1306
Error: Within
              Df Sum of Sq Mean Sq F Value Pr(F)
diet.f         *  34.32396
anti.f:diet.f  *   4.41167
Residuals      *  22.82188
(a)
i. Complete the analysis and state your conclusions. (A P value > or < 0.05
will suffice.)
ii. Find estimates of the sub-plot error variance and of the between cows (or
main-plot) variance.
iii. Find standard errors appropriate for comparing (2) levels of any significant
factors.
(b) Suppose now that six of the cows were three years and six were five years old.
i. Describe how the design of the study should have been modified to take
this into account (each cow to still receive one antibiotic and each of the
four diets).
ii. Give the form of the ANOVA table (Sources of variation and degrees of
freedom) that would have been appropriate for the modified design.
[12 + 6 = 18 marks]
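The degrees of freedom marked `*` in the incomplete output follow from the layout of a balanced split-plot design. The sketch below works them out from the counts given in the question (3 antibiotics, 4 cows per antibiotic, 4 diets per cow) and then forms the two main F ratios from the printed sums of squares. The function name `split_plot_df` and the dictionary keys are hypothetical labels, not Splus output.

```python
def split_plot_df(n_main_levels, plots_per_level, n_sub_levels):
    """Degrees of freedom for a balanced split-plot design."""
    n_plots = n_main_levels * plots_per_level          # main plots (cows)
    total = n_plots * n_sub_levels - 1                 # all observations minus 1
    main = n_main_levels - 1
    main_resid = (n_plots - 1) - main                  # plots within main levels
    sub = n_sub_levels - 1
    interaction = main * sub
    sub_resid = total - (n_plots - 1) - sub - interaction
    return {"main": main, "main_resid": main_resid, "sub": sub,
            "interaction": interaction, "sub_resid": sub_resid, "total": total}


df = split_plot_df(n_main_levels=3, plots_per_level=4, n_sub_levels=4)

# Each factor is tested against the residual of its own stratum.
f_anti = (316.2217 / df["main"]) / (173.1306 / df["main_resid"])
f_diet = (34.32396 / df["sub"]) / (22.82188 / df["sub_resid"])
print(df)
print(f"F(anti.f) = {f_anti:.2f}, F(diet.f) = {f_diet:.2f}")
```

Testing the antibiotic effect against the between-cow residual, rather than the within-cow residual, is the point of treating cows as main plots.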
7. (a) For $2^n$ factorial experiments there are three approaches that are used to determine which effects (main effects and/or interactions) are significant. Describe
the three methods and specify the circumstances under which each of them
might be used.
(b) A design is wanted for three replicates of a $2^4$ experiment in 12 blocks of 4
plots.
i. List the effects that could be confounded in each of the three replicates
(possibly different effects in the different replicates) so that all main effects can be estimated with maximum precision, and each of the 2-factor
interactions is confounded in at most one of the three replicates.
ii. Give the allocation of treatments to blocks for one (only) of the three
replicates described in (i).
iii. Give the form of the ANOVA table (sources of variation and degrees of
freedom) for your design.
[6 + 9 = 15 marks]
2004
1. The Splus output given below was obtained when the model
$y_{ij} = \mu + \alpha_i + e_{ij}$    (1)
factor A
3
4
6.9 6.1
7.8 5.9
6.5 6.1
7.7 7.0
t value
9.901631
2.517397
4.366110
2.871406
Pr(>|t|)
3.984615e-007
2.703814e-002
9.184445e-004
1.405307e-002
(a) Give the parameter estimates that Splus would have produced if the model
lm(y ~ A.f - 1) had been fitted.
(b) Having found that $H_0$: the $\alpha_i$'s are all equal is rejected, a follow-up analysis using, for
example, Fisher's or Tukey's method, is usually performed. Explain why this
type of additional analysis is (usually) needed and state, with reasons, which
of Fisher's or Tukey's method is more likely to produce significant results.
(c) Consider a model of the form:
$y_{ij} = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_m x_i^m$    (2)
where $x_i = i$, $i = 1, \dots, 4$.
What is the (smallest) value of $m$ for which models (1) and (2) are equivalent?
For the value of m for which the two models are equivalent, state whether or
not the following are the same for models (1) and (2). There is no need to
justify your answers.
i. the design matrix X (i.e. is the X matrix for model (1) the same as the
X matrix for model (2)?)
ii. k, the number of columns of X
iii. r, the rank of X
iv. M(X)
v. $\hat{\beta}$ (i.e. is $\hat{\beta}$ for model (1) the same as $\hat{\beta}$ for model (2)?)
vi. $\hat{y}$
vii. deviance($\hat{\beta}$)
viii.
Explain the circumstances under which you would consider fitting model (2),
and explain what follow up analysis you would consider.
(2 + 3 + 10 = 15 marks)
2. This question is based on a subset of haemoglobin (Hb, g L$^{-1}$) values recorded as part
of an international study of elite athletes conducted in 1999/2000. The data considered here come from 406 athletes (287 males, 119 females) from six countries. The
average Hb values for males and females from the six countries are given, together
with the number of athletes in brackets. [For example, for country 1, the average Hb
was 164.3 from 12 males, and 139.0 from one female.] A test of interaction between
sex and country gave a P-value of 0.113, and interaction has been omitted. Splus
output is given for the strictly additive model
$y_{ijk} = \alpha_i + \beta_j + e_{ijk}$
where $\alpha_1$ and $\alpha_2$ refer to males and females, respectively, while $\beta_1$ to $\beta_6$ refer to
the six countries.
Country   males        females      mean
1         164.3 (12)   139.0 (1)    162.4
2         148.5 (60)   134.5 (17)   145.4
3         167.9 (48)   146.7 (26)   160.4
4         147.2 (63)   131.4 (30)   142.1
5         161.9 (29)   142.6 (18)   154.5
6         149.3 (75)   128.2 (27)   143.7
mean      153.7        136.2
t value  Pr(>|t|)
  65.09    0.0000
 -18.48    0.0000
  -5.25    0.0000
   1.13    0.2604
  -5.84    0.0000
  -0.78    0.4376
  -5.68    0.0000
(a) Produce a (rough) graphical display of the data provided, and comment on
what it appears to show.
(b) Explain what is being tested, in terms of the $\alpha$'s and $\beta$'s if possible and more
generally otherwise, by each of the following:
i.
ii.
iii.
iv.
v.
the
the
the
the
the
(c) Within each country, all of the athletes were located in the one city. Of the six
cities, three were at altitude (above 1700 metres), while the others were (close
to) sea-level. The cities at altitude were those in countries 1, 3 and 5. It is
suspected that any differences between countries (cities) could be due solely to
altitude.
i. Without any calculations, explain how the output given above supports
this suspicion.
ii. Explain how the suspicion could be formally tested.
iii. When the model needed to carry out the formal test (in (ii)) was fitted,
a deviance of 33734.10 was obtained. Complete the test and state your
conclusion.
(5 + 5 + 5 = 15 marks)
Age
7
7
8
Sex
0
1
0
Height
109
112
124
Weight
13.1
12.9
14.1
25
23
179
71.5
Sub
Sex
BMP
FEV
RV
FRC
TLC
PEmax
BMP
68
65
64
..
.
95
FEV
32
19
22
RV
258
449
441
FRC
183
245
268
TLC
137
134
147
PEmax
95
85
100
52
225
127
101
195
Sub: Subject number
Sex: 1 = male, 2 = female
BMP: Body mass (Weight/Height^2) as a percentage of the age-specific
median in normal individuals
FEV: Forced expiratory volume in 1 second
RV: Residual volume
FRC: Functional residual capacity
TLC: Total lung capacity
PEmax: Maximal static expiratory pressure (cm H2O)
Splus output
Output A
> cf.1 <- lm(PEmax~age+sex+height+weight+BMP+FEV+RV+FRC+TLC,data=Cystic.dat)
> anova(cf.1)
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
age        1  10098.48 10098.48 15.69696 0.0012527
sex        1    955.43   955.43  1.48511 0.2417972
height     1    154.97   154.97  0.24088 0.6306711
weight     1    647.32   647.32  1.00618 0.3317276
BMP        1   2857.85  2857.85  4.44222 0.0522960
FEV        1   1529.04  1529.04  2.37672 0.1439858
RV         1    600.61   600.61  0.93357 0.3492569
FRC        1    226.36   226.36  0.35186 0.5619016
TLC        1    112.50   112.50  0.17486 0.6817509
Residuals 15   9650.09   643.34
> summary(cf.1)$coef
                  Value  Std. Error    t value  Pr(>|t|)
(Intercept) 181.6249734 231.3086181  0.7852063 0.4445586
age          -2.5628964   4.7680886 -0.5375102 0.5987996
sex          -3.8711852  15.3993341 -0.2513865 0.8049267
height       -0.4778411   0.9075548 -0.5265149 0.6062292
weight        3.0278616   1.9996911  1.5141647 0.1507688
BMP          -1.7182736   1.1297550 -1.5209259 0.1490764
FEV           1.0553768   1.0807216  0.9765483 0.3442803
RV            0.2080534   0.1963097  1.0598224 0.3059936
FRC          -0.3350868   0.4936715 -0.6787647 0.5076251
TLC           0.2073738   0.4959126  0.4181659 0.6817509
Output B
> cf.0 <- lm(PEmax~1,data=Cystic.dat)
> step(cf.0,~+age+sex+height+weight+BMP+FEV+RV+FRC+TLC,trace=F)
Call:
lm(formula = PEmax ~ weight, data = Cystic.dat)
Coefficients:
 (Intercept)   weight
    63.44932 1.187979
Degrees of freedom: 25 total; 23 residual
Residual standard error (on weighted scale): 26.34972
> step(cf.1,~.,trace=F)
Call:
lm(formula = PEmax ~ weight + BMP + FEV, data = Cystic.dat)
Coefficients:
 (Intercept)   weight      BMP      FEV
    125.7083 1.531035 -1.45341 1.103848
Degrees of freedom: 25 total; 21 residual
Residual standard error (on weighted scale): 23.45087
(a)
(b)
(c)
(d)
(e)
(f)
Output C
[Figure: diagnostic plots for the fitted model; axis labels include Residuals, sqrt(abs(Residuals)), PEmax, Fitted Values, fits, f-value and Cook's Distance; observations 21, 24 and 25 are labelled.]
(8 + 5 + 4 + 3 + 2 + 3 = 25 marks)
4. An experiment was conducted to evaluate the effects of three treatments on the
germination of seeds from five varieties of guayule (a rubber producing shrub). The
variable of interest is the number of plants that germinate in a plot. Each of the 15
combinations of treatment × variety were used in three plots (giving a total of 45
observations). The analyses were carried out using the square-root of the number
of plants that germinated in a plot as the response variable, and the following table
gives means of the response variable.
                        variety
treatment     1     2     3     4     5    all
1          3.41  4.22  3.52  3.80  4.40   3.88
2          3.51  3.58  5.11  4.67  5.59   4.49
3          2.85  3.25  3.36  3.12  3.45   3.20
all        3.26  3.68  4.00  3.87  4.48   3.86
(a) Explain how and why a square-root transformation of the number of plants
(that germinated) may have been chosen.
(b) Assume here that the following analysis is appropriate (note that everything is
balanced).
Pr(F)
0.5527118
0.0144409
0.0000998
0.1849378
i. Describe how the study should have been designed (the nature of any grouping of the (45) study units and how the randomization should have been
done) in order to justify this form of analysis.
ii. A. Re-do the ANOVA table omitting the non-significant interaction.
B. What do you conclude about the three treatments? (Use Tukey's
method to decide which, if any, of the treatments differ significantly.)
C. Comment on the effectiveness of the blocking.
(c) Assume here that the following analysis is appropriate (and that everything is
balanced). The flats referred to in the analysis are shallow boxes for starting
seedlings.
> guay.aov <- aov(sqrt(plants)~variety*treatment+Error(flats),
+ data=guayule.dat)
> summary(guay.aov)
Error: flats
          Df Sum of Sq  Mean Sq  F Value    Pr(F)
variety    4   7.17266 1.793165 1.779811 0.209426
Residuals 10  10.07503 1.007503
Error: Within
                  Df Sum of Sq  Mean Sq  F Value       Pr(F)
treatment          2  12.45262 6.226311 32.09045 0.000000573
variety:treatment  8   5.92663 0.740828  3.81823 0.007119376
Residuals         20   3.88048 0.194024
i. Describe how the study should have been designed (the nature of the grouping of the (45) study units into flats and how the randomization should
have been done) in order to justify this form of analysis.
ii. What conclusions can be reached from this analysis (without further calculations).
iii. The between flats variance is estimated to be 0.271. Show how this estimate
can be obtained from the output given above.
iv. Find the value of the standard error for comparing:
A. two treatments applied to the same variety [e.g. (from the table of
means of the response variable) the standard error for (3.41 - 3.51)];
B. two varieties with the same treatment [e.g. the standard error for
(3.41 - 4.22)].
(3 + 10 + 12 = 25 marks)
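The between-flats variance quoted in part (c)(iii) comes from equating the two residual mean squares to their expectations. A minimal sketch is given below, assuming each flat holds three plots (one per treatment), so that the flats residual mean square estimates the within-flat variance plus three times the between-flats variance. The function name `between_plot_variance` is hypothetical.

```python
def between_plot_variance(ms_main_residual, ms_sub_residual, subplots_per_plot):
    """Method-of-moments estimate of the main-plot variance component.

    Uses E[MS_main_residual] = sigma2_sub + subplots_per_plot * sigma2_main.
    """
    return (ms_main_residual - ms_sub_residual) / subplots_per_plot


# Mean squares copied from the two Residuals lines of summary(guay.aov).
var_flats = between_plot_variance(1.007503, 0.194024, 3)
print(f"between-flats variance estimate: {var_flats:.3f}")
```

The sub-plot (within-flat) variance estimate is the Within residual mean square itself, 0.194024.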
5. This question refers to a study of young girls suffering from anorexia. A number of
such patients were randomly assigned to receive one of three treatments:
(Treatment 1): Cognitive behaviour treatment.
(Treatment 2): Standard treatment.
(Treatment 3): Family therapy.
The weight (in lbs) of each patient was recorded both before treatment began and
after a fixed period of time on the treatment. Seventy-two patients participated
in the study: 29 received treatment 1, 26 received treatment 2 and 17 received
treatment 3. The scatterplot uses numbers to indicate treatment.
[Figure: scatterplot of after weight against before weight (axes marked from 70 to 95 lbs), with the plotting symbols 1, 2 and 3 indicating treatment.]
(a) The following table gives the deviance and (residual) degrees of freedom for six
models that were fitted to the data.
Model  Terms                 deviance  df
1      treatment*before          2845  66
2      treatment:before          3245  68
3      treatment + before        3311  68
4      treatment                 3665  69
5      before                    4078  70
6      before - 1                4588  71
Use F-tests to determine which of the above models is most appropriate for
these data. Show full details of any tests that you use.
(b) For the six models in the above table, it is not possible to compare all pairs of
models using an F-test. List three (3) such pairs; there is no need to justify
your answer.
(c) The output below was obtained for model (1) (which may or may not be the
best model), and for the equivalent model with Splus specification anorexia.1a.
i. Find a 95% confidence interval for the slope of the line for treatment 2, and
give an interpretation of this confidence interval in terms of the nature of
the relationship between the before and after weights of patients assigned
to the standard treatment.
ii. Find a 95% confidence interval for the difference in the expected after
weights of patients with a before weight of 80 lbs. given treatments 1
and 3.
iii. Describe the meaning of model (1) in non-technical terms. A rough sketch
may help here.
> anorexia.1 <- lm(after~treatment*before,data=anorexia.dat)
> summary(anorexia.1)
                   Value Std. Error t value Pr(>|t|)
(Intercept)      15.5772    21.2083  0.7345   0.4653
treatment2       76.4742    28.3470  2.6978   0.0089
treatment3       -0.7575    34.5516 -0.0219   0.9826
before            0.8480     0.2561  3.3117   0.0015
treatment2before -0.9822     0.3442 -2.8532   0.0058
treatment3before  0.0612     0.4155  0.1474   0.8833
> disp.v(anorexia.1)
[1] "Estimated covariance matrix of parameter vector"
                 (Intercept)  treatment2  treatment3      before
(Intercept)       449.791232 -449.791232 -449.791232 -5.42153583
treatment2       -449.791232  803.552512  449.791232  5.42153583
treatment3       -449.791232  449.791232 1193.814599  5.42153583
before             -5.421536    5.421536    5.421536  0.06556486
treatment2before    5.421536   -9.738768   -5.421536 -0.06556486
treatment3before    5.421536   -5.421536  -14.330501 -0.06556486
                 treatment2before treatment3before
(Intercept)            5.42153583       5.42153583
treatment2            -9.73876785      -5.42153583
treatment3            -5.42153583     -14.33050074
before                -0.06556486      -0.06556486
treatment2before       0.11849956       0.06556486
treatment3before       0.06556486       0.17260593
                   Value Std. Error t value Pr(>|t|)
treatment1                           0.7345   0.4653
treatment2                           4.8941   0.0000
treatment3                           0.5433   0.5887
treatment1before                     3.3117   0.0015
treatment2before -0.1342     0.2301 -0.5832   0.5617
treatment3before  0.9092     0.3272  2.7791   0.0071
> disp.v(anorexia.1a)
[1] "Estimated covariance matrix of parameter vector"
                     treatment1     treatment2     treatment3 treatment1before
treatment1        4.497912e+002 -5.236354e-014 -4.685106e-014   -5.421536e+000
treatment2       -5.236354e-014  3.537613e+002  5.649404e-030    6.332538e-016
treatment3       -4.685106e-014  5.649404e-030  7.440234e+002    5.665891e-016
treatment1before -5.421536e+000  6.332538e-016  5.665891e-016    6.556486e-002
treatment2before  6.420429e-016 -4.317232e+000 -6.926881e-032   -7.764489e-018
treatment3before  5.629147e-016 -6.787750e-032 -8.908965e+000   -6.807559e-018
                 treatment2before treatment3before
treatment1          6.420429e-016    5.629147e-016
treatment2         -4.317232e+000   -6.787750e-032
treatment3         -6.926881e-032   -8.908965e+000
treatment1before   -7.764489e-018   -6.807559e-018
treatment2before    5.293470e-002    8.322636e-034
treatment3before    8.322636e-034    1.070411e-001
(d) This part of the question refers to the design of the study.
i. There is one aspect of the study (as described in the introduction to the
question) which seems somewhat unusual for a designed experiment. What
is it and why is it unusual?
ii. The following output was obtained for a one-way ANOVA of the before
values.
> anova(lm(before~treatment,data=anorexia.dat))
Analysis of Variance Table
Response: before
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq   F Value     Pr(F)
treatment  2    32.569 16.28467 0.5994852 0.5519287
Residuals 69  1874.346 27.16443
What is likely to be the purpose of such an analysis, and what do you
conclude?
iii. Give details of an alternative way in which the before weights could have
been used in the design of the study (i.e. not as a covariate). What form
of analysis would be appropriate for this alternative design?
(4 + 3 + 7 + 6 = 20 marks)
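Part (c)(i) needs the slope for treatment 2, which under treatment contrasts is a sum of two coefficients, so its variance involves a covariance term. The sketch below shows that calculation using values copied from `summary(anorexia.1)` and `disp.v(anorexia.1)`; the variable names are illustrative only. The confidence interval itself additionally needs a t quantile on 66 degrees of freedom, which is not computed here.

```python
import math

# Coefficients from summary(anorexia.1).
b_before = 0.8480
b_t2_before = -0.9822

# Entries from disp.v(anorexia.1).
var_before = 0.06556486
var_t2_before = 0.11849956
cov_before_t2_before = -0.06556486

# Slope for treatment 2 = before + treatment2before.
slope_t2 = b_before + b_t2_before

# Var(a + b) = Var(a) + Var(b) + 2 Cov(a, b).
var_slope_t2 = var_before + var_t2_before + 2 * cov_before_t2_before
se_slope_t2 = math.sqrt(var_slope_t2)

print(f"slope = {slope_t2:.4f}, variance = {var_slope_t2:.7f}, s.e. = {se_slope_t2:.4f}")
```

These figures can be checked against the treatment2before row and diagonal entry reported for the equivalent model anorexia.1a.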
2003
1. The following measurements were obtained on the weights of various combinations
of three objects O1, O2 and O3.
Objects       Weight
O1, O2           5.0
O1, O3           5.7
O2, O3           6.7
O1, O2, O3       9.0
O1               2.0
O2               2.5
O3               3.5
It is possible that the balance used may be biased in the sense of constantly giving
a higher (or lower) weight.
(a) Assuming a model of the form $y = X\beta + e$, write down the vector $y$ and the
design matrix $X$.
(b) What is the rank of X?
(c) To test the hypothesis (H0 ) that the weights of the three objects are in the ratio
2 : 3 : 4, two models were fitted to the data; the model referred to in (a), for
which the deviance was 0.065, and the model under H0 , for which the deviance
was 0.295. Write down the design matrix for the model under H0 , carry out a
test of the hypothesis, and state your conclusions (accept or reject H0 at the
5% level).
[2 + 1 + 5 = 8 marks]
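One way to set up the matrices for this question is sketched below. It assumes the possible bias of the balance is modelled as a constant column, followed by one indicator column per object, and that under $H_0$ the object weights are $2w$, $3w$ and $4w$ for a single unknown $w$; the variable names are illustrative. The residual degrees of freedom are taken as the number of observations minus the number of columns, which assumes both matrices have full column rank.

```python
# Which of O1, O2, O3 is on the balance for each of the seven weighings.
objects_weighed = [
    (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1),
    (1, 0, 0), (0, 1, 0), (0, 0, 1),
]
y = [5.0, 5.7, 6.7, 9.0, 2.0, 2.5, 3.5]

# Full model: bias column plus one indicator column per object.
X = [[1, *row] for row in objects_weighed]

# Model under H0: bias column plus the combined multiple of w on the balance.
ratio = (2, 3, 4)
X0 = [[1, sum(r * on for r, on in zip(ratio, row))] for row in objects_weighed]

# F statistic from the two deviances quoted in part (c).
df_full = len(y) - len(X[0])
df_h0 = len(y) - len(X0[0])
f_value = ((0.295 - 0.065) / (df_h0 - df_full)) / (0.065 / df_full)
print(f"F = {f_value:.3f} on ({df_h0 - df_full}, {df_full}) degrees of freedom")
```

If the bias term were dropped from the model, the first column of both matrices would be removed and the degrees of freedom would change accordingly.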
2. Small loaves of bread were prepared with flour that was fortified with a fixed amount
of vitamins. One day after baking, the vitamin C content of two loaves was measured.
Another two loaves baked at the same time were stored for three days and then the
vitamin C content was measured. In a similar manner two loaves were stored for five
days and another two for seven days before measurements were taken. The units are
milligrams per hundred grams of bread (mg/100 g).
Condition                  Vitamin C (mg/100 g)
One day after baking       40.25   43.46
Three days after baking    21.25   22.34
Five days after baking     13.18   11.65
Seven days after baking     8.51    8.13
Using condit.f to denote the condition treated as a factor with four levels, the following Splus output was obtained when the simple one-way ANOVA model
$y_{ij} = \mu + \alpha_i + e_{ij}$
was fitted using contrasts = contr.treatment and contrasts = contr.sum
Females
13
7
22
9
10
7
4
72
Total
25
18
37
22
28
10
13
153
[Figure: scatterplot of Weight against Age, with the plotting symbols F and M indicating sex.]
Various models can be fitted to these data to predict the lamb's weight (W) on
April 30, 1991. A model can be written in parametric form where i is 1 for Males
and 2 for Females, $W_{ij}$ is the weight on April 30, $A_{ij}$ is the lamb's age on that
day, and j represents the individual lambs. (For example, a model might look like
$W_{ij} = \alpha + \beta_i A_{ij} + e_{ij}$.) An Splus model fitting statement can also be given (for
example, W ~ S + A or W ~ S*A, and so on).
(a) Give a parametric form of the model and the associated Splus specification for
each of the following situations.
i. Age is important in determining weight, but there is no difference between
males and females.
ii. Males and females have different weights, but the same non-zero growth
rate.
iii. Neither age nor sex has any relation to lamb weight.
iv. There is a different growth rate for males and females, but males and
females have the same birthweights.
(b) The table below describes seven models that were fitted to the data.
i. Give the residual degrees of freedom for each model.
ii. It is not possible to formally compare (by F-test) all 21 pairs of models
given in the table below. List four of the pairs that cannot be formally
compared, no justification is required.
iii. Using F-tests derived from the deviances given below, decide on the most
appropriate model for the data. Give full details of the tests you use.
(Assume that the usual assumptions are satisfied.)
iv. Describe (in words, like in part (a)) what is implied by your chosen model.
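Model  Parametric form                                    Deviance
1      $W_{ij} = \alpha_i + \beta_i A_{ij} + e_{ij}$         48.72
2      $W_{ij} = \alpha + \beta_i A_{ij} + e_{ij}$           52.86
3      $W_{ij} = \alpha_i + \beta A_{ij} + e_{ij}$           50.24
4      $W_{ij} = \alpha + \beta A_{ij} + e_{ij}$             54.18
5      $W_{ij} = \alpha_i + e_{ij}$                          63.51
6      $W_{ij} = \beta A_{ij} + e_{ij}$                     224.21
7      $W_{ij} = \alpha + e_{ij}$                            83.78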
[8 + 12 = 20 marks]
5. How well can house prices be predicted by variables such as age, size, number of
bedrooms etc? Data were available on the sale price (in $1,000) of 51 houses in a
particular location, together with values of the following variables:
age   age of house in years
bed   number of bedrooms
bath  number of bathrooms
size  size of the house in 1000 ft^2
lot   size of the lot (or block of land) in 1000 ft^2
The following table gives a partial listing of the data.
Obsn  age  bed  bath   size    lot  price
1      21    3     3  0.951   64.9   30.0
2      21    3     2  1.036  217.8   39.9
3       7    1     1  0.676   54.5   46.5
4       6    3     2  1.456   51.8   48.6
...
46     27    3     2  1.920  226.5  167.5
47      5    3     3  2.949   12.0  169.9
48     32    4     4  3.310   10.5  175.0
49     29    3     3  2.805   16.5  179.0
50      1    3     3  2.553    8.6  179.9
51     33    3     4  3.627   17.8  199.0
A multiple regression model was fitted using all five explanatory variables and, based
on the diagnostic plots in Figure 1, it was decided to omit observation 46.
[Figure 1: diagnostic plots; axis labels include Residuals, sqrt(abs(Residuals)), price, Fitted Values, fits, f-value and Cook's Distance; observations 31 and 46 are labelled.]
[Figure 2: diagnostic plots; axis labels include Residuals, sqrt(abs(Residuals)), price, Fitted Values, fits, f-value and Cook's Distance; labelled observations include 31, 46 and 51.]
It was again decided to omit observation 46. The following Splus output was obtained
for the reduced model.
> anova(lm(price~age+bed+size,subset=-46,data=Hprice.dat))
Terms added sequentially (first to last)
          Df Sum of Sq Mean Sq F Value    Pr(F)
age        1       747     747    3.11 0.084393
bed        1     11181   11181   46.56 0.000000
size       1     53055   53055  220.93 0.000000
Residuals 46     11047     240
> summary(lm(price~age+bed+size,subset=-46,data=Hprice.dat))
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  35.485     11.069   3.206    0.002
age          -0.366      0.174  -2.098    0.041
bed         -11.189      3.901  -2.868    0.006
size         61.296      4.124  14.864    0.000
(a) What can you say about observation 46 based on Figures 1 and 2? Explain
how you arrived at your answer.
(b) Explain why the first ANOVA table (the one for the full model) is of little use
in determining the final model.
(c) Describe an approach that might have been used to arrive at the (final) reduced
model.
(d) Carry out a formal test to demonstrate that the reduced model is better
in some sense than the full model (both models without observation 46), and
explain how you can tell that an even simpler model is unlikely to be better.
(e) Comment on the coefficient of bed in the reduced model. Is the sign of the
coefficient what you would expect and, if not, how can you explain it?
(f) Using the reduced model, predict the sale price of a 5 year-old house with 4
bedrooms and size 2.1, and write down the formula, in terms of appropriate
variances and covariances, for a 95% prediction interval.
[3 + 2 + 2 + 5 + 3 + 5 = 20 marks]
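The point prediction asked for in part (f) is a linear combination of the reduced-model coefficients. A minimal sketch follows, with the coefficient values copied from the `summary` output for the reduced model and the house described in the question; the dictionary names are illustrative. The prediction interval is not computed, because it needs the covariance matrix of the coefficients, which is not printed.

```python
# Reduced-model coefficients (price in $1,000), copied from the summary output.
coef = {"(Intercept)": 35.485, "age": -0.366, "bed": -11.189, "size": 61.296}

# House to predict: 5 years old, 4 bedrooms, size 2.1 (in 1000 ft^2).
new_house = {"age": 5, "bed": 4, "size": 2.1}

predicted = coef["(Intercept)"] + sum(coef[name] * value
                                      for name, value in new_house.items())
print(f"predicted price: {predicted:.2f} thousand dollars")
```

Because `bed` has a negative coefficient once `size` is in the model, adding a bedroom at fixed size lowers the predicted price, which is the feature part (e) asks about.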
                        Fertilizer
Moisture  Tray      2      4      6      8
10           1   3.35   4.32   4.56   5.88
10           2   4.04   4.14   6.52   7.38
10           3   1.98   3.84   4.47   5.12
20           4   5.05   7.94  10.77  13.52
20           5   5.19   8.51  10.39  13.52
20           6   6.95   7.02  10.93  15.28
30           7   6.57  10.73  12.26  15.71
30           8   8.30   8.91  13.44  14.96
30           9   5.28   8.67  11.14  15.63
40          10   6.84   9.08  10.37  12.51
40          11   6.50   6.07  10.75  12.50
40          12   4.05   3.84   9.44  10.28