
VERSION 1 STATS 330

THE UNIVERSITY OF AUCKLAND

SECOND SEMESTER, 2014


Campus: City

STATISTICS

Statistical Modeling

(Time allowed: THREE hours)

INSTRUCTIONS

SECTION A: Multiple Choice (60 marks)

Answer ALL 25 questions on the answer sheet provided.

All questions have a single correct answer and carry the same mark value.

If you give more than one answer to any question you will receive zero marks for that
question.

Incorrect answers are not penalized.

SECTION B (40 marks)

Answer 2 out of 3 questions. Each is worth 20 marks.

Total for both parts: 100 marks


SECTION A

1. Suppose we have a data set consisting of a continuous response variable and three
continuous explanatory variables. Which of the following plots would NOT be useful
in checking if some transformation of the explanatory variables is required?

(1) A set of gam plots, one for each explanatory variable.
(2) A trellis plot, conditioning on two of the continuous variables.
(3) A plot of residuals against the individual explanatory variables.
(4) A plot of residuals versus fitted values.
(5) A normal plot of residuals.


2. The data for this question consist of salaries for 397 Canadian professors in 2008-2009.
The other variables in the data set are sex, rank (Assistant Professor, Associate
Professor, or Professor), and discipline, a factor with levels A (theoretical
departments) or B (applied departments). A Trellis graph is shown in Figure 1.

[Plot not reproduced: four panels (Male/A, Male/B, Female/A, Female/B), each showing
salary (50000 to 200000) against rank (AsstProf, AssocProf, Prof).]

Figure 1: Trellis plot for Question 2.

Which of the following is FALSE?

(1) Female associate professors tend to earn less than their male counterparts.
(2) For females, there is not much difference between assistant and associate
professors.
(3) The applied disciplines seem better paid.
(4) The salaries of professors are more variable than assistant professors.
(5) There is no evidence of interaction between sex and rank.


3. The next few questions involve some data on earthquakes. There are 54 earthquakes
in the data set. The variables are

Displacement : the average displacement of the fault line (meters)
Length : the length of the fault (km)
SeisMmt : the seismic moment, a measure of earthquake size, related to Displacement
Magnitude : the magnitude of the earthquake, on the Richter scale

A pairs plot of the data is shown in Figure 2. Which of the following is FALSE?

[Plot not reproduced: pairs plot (scatterplot matrix) of Magnitude, Displacement,
Length and SeisMmt.]

Figure 2: Pairs plot for Question 3.

(1) The variability of Magnitude seems to be increasing with Displacement.
(2) There seem to be outliers in the data in the variables Displacement and SeisMmt.
(3) There is a non-linear relationship between Length and Magnitude.
(4) There seems to be an outlier in the variable Magnitude.
(5) Length and Displacement seem correlated.


4. A linear model was fitted to the data with Magnitude as a response after logging the
other variables. The following output was obtained.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.728505 2.762443 -0.626 0.53435
log(SeisMmt) 0.141451 0.051666 2.738 0.00854 **
log(Displacement) 0.248094 0.071172 3.486 0.00103 **
log(Length) 0.004918 0.113168 0.043 0.96551
---
Residual standard error: 0.2427 on 50 degrees of freedom
Multiple R-squared: 0.8777, Adjusted R-squared: 0.8703
F-statistic: 119.6 on 3 and 50 DF, p-value: < 2.2e-16

Which of the following is FALSE?

(1) The residual sum of squares is about 2.945 to 3 decimal places.
(2) It seems that at least some of the variables are related to the response.
(3) On the basis of the R² value, the fit appears good.
(4) Earthquakes with longer faults have higher magnitudes.
(5) Earthquakes with larger displacements have higher magnitudes.


5. A coplot of these data is shown in Figure 3. Which of the following is NOT a correct
interpretation of the plot?

[Plot not reproduced: coplot of Magnitude against log(SeisMmt), given Displacement
and Length.]

Figure 3: Coplot for Question 5.

(1) The slope of the relationship between Magnitude and log(SeisMmt) decreases
slightly with increasing length.
(2) The coplot indicates a slight curvature in the regression surface.
(3) The relationship between Length and Displacement does not change with
Magnitude.
(4) The variable Magnitude seems to be increasing with Displacement.
(5) The variable SeisMmt seems to be increasing with Length.

6. Which of the following is the CORRECT interpretation of Figure 3?

(1) The plot sheds light on the relationship between Magnitude and log(Length),
conditional on the other variables.
(2) The plot sheds light on the relationship between Magnitude and log(SeisMmt),
conditional on the other variables.
(3) The plot explains the joint relationship of all four variables.
(4) The plot explains the joint relationship of log(Displacement) and log(Length).
(5) The plot sheds light on the relationship between log(Displacement) and
log(Length), conditional on the other variables.


7. Some graphical output from the earthquake analysis is shown in Figure 4. What is
NOT a correct interpretation of these plots? (Note: the label for point 18 is obscured
by the text "Cook's distance" in the bottom right plot.)

[Plots not reproduced: four diagnostic panels (Residuals vs Fitted, Normal Q-Q,
Scale-Location, Residuals vs Leverage with Cook's distance contours); labelled points
include 14, 18 and 20.]

Figure 4: Graphical output for Question 7.

(1) Point 18 has very high leverage.
(2) Point 18 is an outlier.
(3) The variance doesn't seem to be increasing with the mean.
(4) Point 18 would seem to be having an effect on some of the regression coefficients.
(5) Apart from an outlier, there is no evidence of departures from normality.

8. Some influence statistics are shown below:

     dfb.1_  dfb.l.SM  dfb.l.D.  dfb.l.L.    dffit cov.r   cook.d    hat inf
16  0.12377 -0.130879  0.082886  0.170159  0.21484 1.334 1.17e-02 0.1994 *
18 -0.22209  0.247894 -0.812605 -0.382511 -1.76844 0.526 6.31e-01 0.1944 *
37 -0.01421  0.017251  0.112266 -0.046653 -0.20371 1.618 1.06e-02 0.3345 *

Which of the following is TRUE?

(1) Point 18 is not having an effect on the standard errors.
(2) Point 37 has the most leverage of the three points shown.
(3) Point 37 is not having an effect on the standard errors.
(4) Point 18 does not have a large DFFITS value (the cutoff is 3√(p/(n − p))).
(5) Point 16 is having an effect on the Length coefficient.


9. The following data were gathered in a survey on attitudes to debt. The variables are:

agegp : age group (1=youngest, 4=oldest)
ccarduse : how often did s/he use credit cards (1=never ... 3=regularly)
locintrn : a continuous score on a locus of control scale (high values=internal, low
values=external)
prodebt : a continuous score on a scale of attitudes to debt (high values=favourable
to debt)

A model was fitted with the following results:

Call:
lm(formula = prodebt ~ factor(agegp) + factor(ccarduse) + locintrn,
data = debt)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95569 0.21854 18.101 < 2e-16 ***
factor(agegp)2 -0.20001 0.11879 -1.684 0.093288 .
factor(agegp)3 -0.45192 0.13179 -3.429 0.000691 ***
factor(agegp)4 -0.44842 0.13636 -3.288 0.001129 **
factor(ccarduse)2 0.27514 0.09613 2.862 0.004506 **
factor(ccarduse)3 0.50761 0.09920 5.117 5.58e-07 ***
locintrn -0.15023 0.04396 -3.417 0.000721 ***

Which of the following is NOT a correct interpretation?

(1) Older people tend to be less favourable to debt.
(2) Frequent credit card users tend to be more favourable to debt.
(3) The last two age groups seem similar in their attitude to debt.
(4) The first two age groups seem similar in their attitude to debt.
(5) People having a strong internal locus of control tend to be more favourable to
debt.


10. This question uses the salaries data introduced in Question 2. The model
salary ~ rank * discipline was fitted, with the output below. Which of the following is
FALSE?

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 73936 4630 15.968 <2e-16 ***
rankAssocProf 9126 6421 1.421 0.1560
rankProf 46013 5036 9.136 <2e-16 ***
disciplineB 10658 5780 1.844 0.0659 .
rankAssocProf:disciplineB 7557 8169 0.925 0.3555
rankProf:disciplineB 2787 6414 0.435 0.6642

(1) The mean salary of assistant professors in Discipline A is estimated as $73,936.
(2) The mean salary of professors in Discipline B is estimated as $130,607.
(3) The mean salary of associate professors in Discipline B is estimated as $101,277.
(4) The mean salary of assistant professors in Discipline B is estimated as $84,594.
(5) The mean salary of professors in Discipline A is estimated as $119,949.

11. An analysis of variance table for the model fitted in Question 10 is shown below.

Analysis of Variance Table

Response: salary
Df Sum Sq Mean Sq F value Pr(>F)
rank 2 1.4323e+11 7.1616e+10 139.1894 < 2.2e-16 ***
discipline 1 1.8430e+10 1.8430e+10 35.8196 4.899e-09 ***
rank:discipline 2 4.6118e+08 2.3059e+08 0.4482 0.6391
Residuals 391 2.0118e+11 5.1452e+08

Which of the following is FALSE?

(1) The p-value 0.6391 is comparing the fit of the full model to that of the null model.
(2) Both variables rank and discipline are required in the model.
(3) There is no evidence that the variables rank and discipline interact.
(4) If we fitted the model salary ~ discipline*rank the analysis of variance table
would have some different p-values.
(5) The estimate of error variance for this model is 5.1452e+08.


12. Using the output in Question 10, which of the following is TRUE?

(1) The difference in means between salaries for associate professors and assistant
professors in Discipline A is $9,126.
(2) The difference in means between salaries for associate professors and full professors
in Discipline A is $46,013.
(3) The difference in means between salaries for associate professors and assistant
professors in Discipline A is $55,139.
(4) The difference in means between salaries for associate professors and assistant
professors in Discipline B is $10,658.
(5) The difference in means between salaries for associate professors and full professors
in Discipline B is $9,126.


13. This question refers to the data in Question 3. Suppose we have an earthquake with
Length =10, Displacement = 0.3, and SeisMmt = 8.3e+24. On the basis of the
following output, which of the following is TRUE?

eq.lm = lm(Magnitude~log(SeisMmt) + log(Displacement) + log(Length),
           data=earthquakes.df)
newdata=data.frame(Length =10, Displacement = 0.3, SeisMmt = 8.3e+24)
predict(eq.lm, newdata, se=TRUE)
$fit
1
6.100346

$se.fit
[1] 0.07524653

$df
[1] 50

$residual.scale
[1] 0.2426841

>
> qt(0.975,50)
[1] 2.008559
> qt(0.950,50)
[1] 1.675905

(1) A 95% confidence interval for the mean magnitude of earthquakes having these
values of the covariates is (5.974, 6.226) to 3 decimal places.
(2) A 95% prediction interval for the magnitude of this earthquake is (5.974, 6.226)
to 3 decimal places.
(3) A 95% prediction interval for the magnitude of this earthquake is (5.590, 6.611)
to 3 decimal places.
(4) A 95% prediction interval for the magnitude of this earthquake is (5.949, 6.251)
to 3 decimal places.
(5) A 95% prediction interval for the magnitude of this earthquake is (5.613, 6.588)
to 3 decimal places.


14. Suppose we have a binary response Y (with values 0 and 1) and a continuous
explanatory variable x. Which of the following is TRUE?

(1) The correct R code to fit the model is lm(x ~ Y).
(2) The correct R code to fit the model is glm(Y ~ x, family=poisson).
(3) The correct R code to fit the model is glm(Y ~ x).
(4) The correct R code to fit the model is glm(Y ~ x, family=binomial).
(5) The correct R code to fit the model is lm(Y ~ x, family=binomial).

15. The data for the next few questions come from a STATS 20x summer school session
a few years ago. The binary outcome Pass with values No (baseline) and Yes records
whether or not the student passed the course. The covariates are

Stage1 : The grade in STATS 10x, one of A, B, C.
Repeat : Is the student repeating the course? (Yes/No), baseline No.
Attend : Was the student a regular attender (Yes/No), baseline No.
Years.Since : Number of years since STATS 10x was passed.
Colour : The colour of the answer sheet (Blue, Green, Pink, Yellow).

The summary of the fit of an additive model is shown below, as is an anova table.

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.53892 0.82406 0.654 0.513125
Stage1B -1.09575 0.66543 -1.647 0.099625 .
Stage1C -2.27889 0.67480 -3.377 0.000733 ***
RepeatYes 0.46683 0.50873 0.918 0.358816
AttendYes 1.67607 0.45458 3.687 0.000227 ***
Years.Since 0.02639 0.21815 0.121 0.903718
ColourGreen 0.16964 0.57905 0.293 0.769545
ColourPink 0.51877 0.61261 0.847 0.397102
ColourYellow 1.17461 0.66620 1.763 0.077877 .

Null deviance: 178.71 on 145 degrees of freedom
Residual deviance: 132.24 on 137 degrees of freedom
> anova(course.glm, test="Chisq")
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 145 178.71
Stage1 2 27.9868 143 150.72 8.37e-07 ***
Repeat 1 0.6901 142 150.03 0.4061323
Attend 1 13.9689 141 136.06 0.0001859 ***
Years.Since 1 0.0400 140 136.02 0.8415543
Colour 3 3.7806 137 132.24 0.2861503


Which of the following is FALSE?

(1) Attending class is associated with a higher probability of passing.
(2) The time interval since doing stage 1 doesn't seem to affect the probability of
passing.
(3) Students having a blue answer sheet tended to have a lower pass rate.
(4) Good Stage 1 performance is associated with a higher probability of passing.
(5) Repeating the course doesn't seem to affect the probability of passing.

16. Using the data in Question 15, which of the following is TRUE?

(1) The model estimates that having a B grade rather than a C increases the log-odds
of passing by 1.18314, other things being equal.
(2) The model estimates that having a B grade rather than a C adds 3.264596 to the
odds of passing, other things being equal.
(3) The model estimates that attending class adds 1.67607 to the odds of passing,
other things being equal.
(4) The model estimates that having an A grade rather than a B adds 1.09575 to the
odds of passing, other things being equal.
(5) The model estimates that repeating the class increases the probability of passing
by 0.46683, other things being equal.

17. The data below (in a data frame foodstamps.df) relate to a food stamp program.
The response is either Yes (receives foodstamps) or No (doesn't receive foodstamps).
The covariates are TEN (tenancy, with values Yes or No) and SUP (supplemental income,
also with values Yes or No). The response data are in the form r, n, indicating the
number of respondents receiving foodstamps (r) and the total number of respondents
having that covariate pattern (n).

> foodstamps.df
TEN SUP r n
1 No No 9 31
2 Yes No 3 78
3 No Yes 9 20
4 Yes Yes 3 21
> foodstamps.glm = glm(cbind(r, n-r)~ TEN*SUP, family=binomial,
data=foodstamps.df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8938 0.3957 -2.259 0.02389 *
TENYes -2.3251 0.7094 -3.278 0.00105 **
SUPYes 0.6931 0.5988 1.158 0.24706
TENYes:SUPYes 0.7340 1.0460 0.702 0.48288


Null deviance: 2.4368e+01 on 3 degrees of freedom
Residual deviance: 1.1990e-14 on 0 degrees of freedom
AIC: 20.967

Which of the following is TRUE?

(1) The null model appears to fit well.
(2) There is no evidence that tenancy is related to receiving foodstamps.
(3) There is evidence that the factors have an additive effect on the odds.
(4) Something is wrong: the deviance should not be zero.
(5) Under the maximal (saturated) model, the estimated probability of receiving
foodstamps for a respondent with no tenancy and no supplemental income is 0.290 to
3 decimal places.

18. Use the output below to select the CORRECT answer.

> predict(foodstamps.glm, se=TRUE, type="response")
$fit
         1          2          3          4
0.29032258 0.03846154 0.45000000 0.14285714
$se.fit
         1          2          3          4
0.08152486 0.02177456 0.11124298 0.07636035

> predict(foodstamps.glm, se=TRUE)
$fit
         1          2          3          4
-0.8938179 -3.2188758 -0.2006707 -1.7917595
$se.fit
        1         2         3         4
0.3956838 0.5887841 0.4494666 0.6236096

(1) A 95% confidence interval for the odds of receiving foodstamps for a respondent
with a tenancy and no supplemental income is (1.139, 1.568).
(2) A 95% confidence interval for the log odds of receiving foodstamps for a
respondent with no tenancy but supplemental income is (-4.373, -2.065).
(3) A 95% confidence interval for the probability of receiving foodstamps for a
respondent with no tenancy and no supplemental income is (0.131, 0.450).
(4) A 95% confidence interval for the log odds of receiving foodstamps for a
respondent with a tenancy but no supplemental income is (0.232, 0.668).
(5) A 95% confidence interval for the probability of receiving foodstamps for a
respondent with a tenancy and supplemental income is (-1.082, 0.680).


19. In a logistic regression with ungrouped data, with two continuous explanatory
variables, the residual deviance was 131.67 on 97 degrees of freedom, and the null
deviance was 135.37 on 99 degrees of freedom. The following output was obtained:

> 1-pchisq(135.37,99)
[1] 0.008929477
> 1-pchisq(131.67,97)
[1] 0.01104207
> 1-pchisq(3.70,2)
[1] 0.1572372

Which of the following is TRUE? (Note that for ungrouped data, we can use the
difference in deviances to compare models.)
(1) The null deviance indicates that the model fits well.
(2) The residual deviance indicates that the model fits well.
(3) At least one explanatory variable should be retained.
(4) There is no evidence that either of the variables helps explain the response.
(5) The residual deviance can be used to judge the goodness of fit.
20. The data in Table 1 are taken from a classic British study on smoking and mortality.

Table 1. Data for Question 20.

            Person Years          Coronary Deaths
Age     Non-smokers  Smokers   Non-smokers  Smokers
35-44         18793    52407             2       32
45-54         10673    43248            12      104
55-64          5710    28612            28      206
65-74          2585    12663            28      186
75-84          1462     5317            31      102

The study investigated the effect of smoking on coronary death rates (measured as
deaths per 100,000 person-years). Assume the data have been arranged in a data
frame with variables person.years, deaths, age.group and smokers.

   person.years deaths     smokers age.group
1         18793      2 Non-smokers     35-44
2         52407     32     Smokers     35-44
3         10673     12 Non-smokers     45-54
4         43248    104     Smokers     45-54
5          5710     28 Non-smokers     55-64
6         28612    206     Smokers     55-64
7          2585     28 Non-smokers     65-74
8         12663    186     Smokers     65-74
9          1462     31 Non-smokers     75-84
10         5317    102     Smokers     75-84


Which of the following is TRUE?

(1) To analyse the rates, we should take person.years as the response and deaths
as an explanatory variable.
(2) To analyse the rates, we should divide the person years by the deaths and use
this as the response.
(3) To analyse the rates, we should take deaths as the response and person.years
as an explanatory variable.
(4) To analyse the rates, we should include an offset of log(person.years).
(5) To analyse the rates, we should include an offset of log(person.years/100000).

21. The correct model was fitted and the following output obtained.

Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3648 0.7071 3.344 0.000825 ***
smokersSmokers 1.7470 0.7289 2.397 0.016534 *
age.group45-54 2.3575 0.7638 3.087 0.002024 **
age.group55-64 3.8303 0.7319 5.233 1.67e-07 ***
age.group65-74 4.6228 0.7319 6.316 2.68e-10 ***
age.group75-84 5.2945 0.7296 7.257 3.95e-13 ***
smokersSmokers:age.group45-54 -0.9868 0.7901 -1.249 0.211667
smokersSmokers:age.group55-64 -1.3630 0.7562 -1.802 0.071479 .
smokersSmokers:age.group65-74 -1.4424 0.7565 -1.907 0.056564 .
smokersSmokers:age.group75-84 -1.8472 0.7572 -2.440 0.014706 *

Which of the following is FALSE?

(1) For age group 75-84, the death rate for smokers is about 1918 per 100,000
person-years.
(2) The death rate for non-smokers aged 35-44 is almost 11 per 100,000 person-years.
(3) For all age groups except for 75-84, the death rate for smokers is higher than for
non-smokers.
(4) The death rate for non-smokers increases with age more slowly than the death
rate for smokers.
(5) For age group 75-84, the death rate for non-smokers is about 2120 per 100,000
person-years.


22. The following table gives the number of traffic accidents in Denmark involving
pedestrians, classified by day of the week.

Day of week  Mon  Tue  Wed  Thur  Fri  Sat  Sun
Frequency    279  256  230   304  330  210  130

The following analysis was performed:

> y = c(279,256,230,304,330,210,130)
> n = sum(y)
> L1 = sum(y*log(y/n))
> L2 = sum(y*log(1/7))
> D = 2*(L1-L2)
> 1-pchisq(D,6)
[1] 0

Which of the following is FALSE?

(1) L2 is computing the log-likelihood of the null model.
(2) D is the null deviance.
(3) The residual deviance of the maximal model is zero.
(4) The small p-value shows that we can't reject the hypothesis that accidents are
equally likely to occur on any day of the week.
(5) L1 is computing the log-likelihood of the maximal model.


23. The following data come from the Danish Welfare Study. There were 4775 respondents,
classified according to the variables Social.group (one of I-II, III, IV, V), Ownership
(either Owner or Renter) and Freezer (whether or not they own a freezer). In this
question we ignore Freezer, and analyse a 2-dimensional contingency table. Some
output is shown below:

> DWS.glm = glm(y~Social.group*Ownership, family=poisson, data=DWS.df)
> summary(DWS.glm)
........
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.14166 0.05407 95.086 < 2e-16 ***
Social.groupIII 0.78659 0.06523 12.058 < 2e-16 ***
Social.groupIV 1.05986 0.06275 16.891 < 2e-16 ***
Social.groupV 0.85479 0.06456 13.241 < 2e-16 ***
OwnershipRenter -0.78495 0.09661 -8.125 4.49e-16 ***
Social.groupIII:OwnershipRenter -0.17697 0.11895 -1.488 0.137
Social.groupIV:OwnershipRenter 0.46675 0.10835 4.308 1.65e-05 ***
Social.groupV:OwnershipRenter 0.68840 0.10931 6.298 3.02e-10 ***
.........
> anova(DWS.glm, test="Chisq")
Analysis of Deviance Table
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 15 3282.3
Social.group 3 823.95 12 2458.4 < 2.2e-16 ***
Ownership 1 208.02 11 2250.4 < 2.2e-16 ***
Social.group:Ownership 3 124.30 8 2126.1 < 2.2e-16 ***

Which of the following is TRUE?

(1) The output indicates that Social.group and Ownership are not independent.
(2) The odds ratio corresponding to social group III and Owners is 0.78659.
(3) The log odds ratio corresponding to social group V and Renters is 5.14166 +
0.85479 -0.78495 + 0.68840 = 5.8999.
(4) The odds ratio corresponding to social group III and Renters is -0.78495.
(5) The p-value of 0.137 indicates that the odds ratio corresponding to social group
III and Renters is not significantly different from zero.


24. This question uses the same data as Question 23 but now takes the third factor into
account. The anova table is

> anova(glm(y~Social.group*Freezer*Ownership, family=poisson,
            data=DWS.df), test="Chisq")
Analysis of Deviance Table
                               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                                              15     3282.3
Social.group                    3   823.95        12     2458.4  < 2e-16 ***
Freezer                         1  1477.33        11      981.1  < 2e-16 ***
Ownership                       1   208.02        10      773.0  < 2e-16 ***
Social.group:Freezer            3    15.68         7      757.4  0.00132 **
Social.group:Ownership          3   124.30         4      633.1  < 2e-16 ***
Freezer:Ownership               1   628.35         3        4.7  < 2e-16 ***
Social.group:Freezer:Ownership  3     4.72         0        0.0  0.19324

Which of the following is FALSE?

(1) The estimates of the cell probabilities for the fitted model are just the table
relative frequencies.
(2) The odds ratios in the two marginal Social.group by Ownership tables are not
significantly different.
(3) The odds ratios in the four marginal Freezer by Ownership tables are
significantly different.
(4) The odds ratios in the two marginal Social.group by Freezer tables are not
significantly different.
(5) The homogeneous association model seems to be appropriate for these data.


25. Suppose we have a contingency table with three factors A, B and C, and counts y. We
fit the model y ~ A*B + B*C, using Poisson regression. The residual deviance of the
model has a p-value of 0.5523. Which of the following is TRUE?

(1) The p-value indicates that the maximal (saturated) model is plausible.
(2) The p-value indicates that the model where A is independent of B and C is
plausible.
(3) The p-value indicates that the model where A and B are independent given C is
plausible.
(4) The p-value indicates that the model where A, B and C are mutually independent
is plausible.
(5) The p-value indicates that the model where A and C are independent given B is
plausible.


SECTION B
26. (a) What is the optimism of a prediction? Describe two ways of estimating the
optimism. [5 marks]
(b) What do we mean by overfitting a statistical model? What is the consequence
of overfitting? [4 marks]
(c) Describe the measures we use to assess the effect individual observations are
having on a regression. You should make clear what aspect of the regression is
being captured by each measure. [6 marks]
(d) The output below uses physiological measurements for 202 athletes. For each
athlete the variables measured were
Bfat: % bodyfat of athlete (our response).
Hg: Haemoglobin.
Ferr: Plasma ferritin concentration.
BMI: Body mass index = weight/height².
LBM: Lean body mass.
SSF: Sum of skin folds.
Ht: Height in cm.
Wt: Weight in kg.
It is desired to build a model using Bfat as the response, and some of the other
variables as explanatory variables. Part of some APR output is shown below.
Which models are indicated by this output? Give a reason. [5 marks]
      AIC     BIC Hg Ferr BMI SSF LBM Ht Wt
1 891.516 898.132  0    0   0   1   0  0  0
2 243.406 253.331  0    0   0   0   1  0  1
3 218.512 231.745  0    0   0   1   1  0  1
4 209.952 226.493  1    0   0   1   1  0  1
5 207.995 227.845  1    0   0   1   1  1  1
6 208.875 232.033  1    1   0   1   1  1  1
7 210.000 236.466  1    1   1   1   1  1  1


27. (a) In the course we discussed two types of residual for logistic regression. Define
them. [4 marks]
(b) Suppose we have a data set in which no covariate patterns are repeated, and we
want to model a binary response y using logistic regression. Describe the form of
the residuals in this case and state what the residual versus fitted value plot will
look like. [4 marks]
(c) The data for this part came from a survey of 1246 employees of a Munich factory.
The study explored the relationship between the prevalence of bronchitis and
some environmental variables. The variables are
bronch : chronic bronchial reaction, no = 0, yes = 1,
dust : dust concentration (mg/cm³) at work place,
smoke : employee smoker? no = 1, yes = 2,
years : years of dust exposure.
Two models were fitted, bronch ~ dust*smoke + years*smoke and
bronch ~ dust + smoke + years. Some output is shown:
> anova(dust1.glm,dust.glm, test="Chisq")
Analysis of Deviance Table

Model 1: bronch ~ dust + smoke + years
Model 2: bronch ~ dust * smoke + years * smoke
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1242 1278.3
2 1240 1274.3 2 3.9868 0.1362

Summary for model 1:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.047872 0.248570 -12.262 < 2e-16 ***
dust 0.091888 0.023243 3.953 7.71e-05 ***
smoke 0.676844 0.174380 3.881 0.000104 ***
years 0.040155 0.006206 6.470 9.78e-11 ***

Summary for model 2:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.157042 0.441460 -7.151 8.59e-13 ***
dust 0.005321 0.056387 0.094 0.9248
smoke 0.822706 0.494313 1.664 0.0960 .
years 0.053162 0.013157 4.041 5.33e-05 ***
dust:smoke 0.105350 0.062047 1.698 0.0895 .
smoke:years -0.016539 0.014935 -1.107 0.2681


i. Does Model 1 give an adequate description of these data? Give reasons.
[4 marks]
ii. Interpret the coefficients of the better of these two models. Which of the
covariates are associated with the response? What is the direction of the
association? [4 marks]
iii. Some diagnostic plots are shown in Figure 5. Discuss the likely effect of the
high leverage points 704 and 1246 on the regression. [4 marks]

[Plots not reproduced: four panels, each plotted against observation number: index plot
of deviance residuals, leverage plot, Cook's distance plot, and deviance changes plot.]

Figure 5: Diagnostic plots for Question 27(c).


28. (a) Suppose in a three dimensional contingency table with factors A, B and C and
counts y, we first fit the saturated model y ~ A*B*C and then some submodel.
Describe two ways of testing if the submodel is adequate. Do we need to fit the
saturated model in both ways? [4 marks]
(b) Define the odds ratios in connection with 2-way tables. Describe how we could use
the information in the model summary (i.e. the output of the summary function)
to construct a confidence interval for the odds ratios. [4 marks]
(c) The data for this part come from a survey of 1430 maths graduates in the US.
The data are in a data frame AMSsurvey with variables
type: A factor with levels I(Pu) for group I public universities, I(Pr) for group
I private universities, II and III for groups II and III, IV for statistics and
biostatistics programs, and Va for applied mathematics programs.
sex : A factor with levels Female, Male.
citizen : A factor with levels Non-US, US giving citizenship status.
count : The number of individuals in each category combination.
The following output was obtained:
> AMS.glm = glm(count ~ type*sex*citizen, family=poisson, data=AMSsurvey)
> AMS1.glm = glm(count ~ type*sex+ type*citizen, family=poisson, data=AMSsurvey)
> anova(AMS1.glm, AMS.glm, test="Chisq")
Analysis of Deviance Table
Model 1: count ~ type * sex + type * citizen
Model 2: count ~ type * sex * citizen
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 6 1.9568
2 0 0.0000 6 1.9568 0.9236
> AMS2.glm = glm(count ~ type*sex+ type*citizen+ sex*citizen,
family=poisson, data=AMSsurvey)
> summary(AMS2.glm)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.13300 0.16981 18.450 < 2e-16 ***
typeI(Pu) 0.34277 0.21311 1.608 0.107745
typeII 0.76254 0.20147 3.785 0.000154 ***
typeIII 0.53281 0.21407 2.489 0.012812 *
typeIV 1.51384 0.18725 8.084 6.24e-16 ***
typeVa -0.63074 0.27954 -2.256 0.024048 *
sexMale 1.26216 0.17787 7.096 1.28e-12 ***
citizenUS -0.03935 0.16589 -0.237 0.812501
typeI(Pu):sexMale 0.10372 0.21841 0.475 0.634859
typeII:sexMale -0.65992 0.20971 -3.147 0.001650 **
typeIII:sexMale -0.95932 0.22884 -4.192 2.76e-05 ***
typeIV:sexMale -1.09889 0.20002 -5.494 3.93e-08 ***
typeVa:sexMale -0.43975 0.28791 -1.527 0.126664
typeI(Pu):citizenUS 0.01920 0.17678 0.109 0.913529
typeII:citizenUS 0.01120 0.18275 0.061 0.951133
typeIII:citizenUS -0.16345 0.20751 -0.788 0.430875
typeIV:citizenUS -0.60480 0.17923 -3.374 0.000740 ***
typeVa:citizenUS 0.16103 0.25478 0.632 0.527376
sexMale:citizenUS 0.08617 0.11755 0.733 0.463551


i. What log-linear model is indicated by this output? Describe the model in
words in terms of independence. [3 marks]
ii. Use the output from the homogeneous association model to calculate a
confidence interval for the conditional odds ratio between citizen and sex, given
type. [4 marks]
iii. Does this contradict or confirm the anova output? [3 marks]


ANSWER SHEET FOLLOWS


ANSWER SHEET 27 STATS 330

Surname: First Names: ID No:
