Stat 452 Practice Final Exam Solutions

2018-12-01

Questions 1 - 4 use the Boston data set, comprised of 14 variables measured on 506 residential areas of
Boston. A description of the variables in this data set is as follows.
crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres per town
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox nitric oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per $10,000
ptratio pupil-teacher ratio by town
black 1000(Bk - 0.63)^2 where Bk is the proportion of black Americans by town
lstat % lower status of the population
medv Median value of owner-occupied homes in $1000's

In questions 2-4, medv is the response variable and all others are used as predictors.

Question 1 (4 marks)

a. (1 mark) Principal components analysis (PCA) is used to explore structure in the predictors. Before
applying PCA the variables are all centred and scaled by their SD. Briefly explain why one would
perform this centring and scaling.
Solution: One mark for mentioning that if we don’t scale, then variables with high variance have too much
influence on the first few PCs.
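As an illustration only (not part of the exam), a minimal R sketch of this preprocessing, assuming the Boston data from the MASS package:

library(MASS)   # provides the Boston data set
# Centre each predictor and scale by its SD before PCA; without scaling,
# variables on large scales (e.g. tax) would dominate the first few PCs.
pc <- prcomp(Boston[, -14], center = TRUE, scale. = TRUE)
summary(pc)     # proportion of variance explained by each PC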
b. (1 mark) Let X1 , . . . , Xp denote the predictor variables and φ11 , . . . , φ1p denote the loadings for the
first PC. Write an expression for the first PC, Z1 .
Solution: Z1 = φ11 X1 + . . . + φ1p Xp .
c. (1 mark) What do we obtain if we replace the loadings in (b) with −φ11 , . . . , −φ1p ?
Solution: This would be another choice for the first PC (PCs are only determined up to a sign change.)
d. (1 mark) The following plot shows the proportion of the total variance explained by each PC. What
is a reasonable number of PCs to use to represent these data?

[Figure: scree plot of the proportion of variance explained (roughly 0.0 to 0.4) versus the number of PCs (1 to 13).]

Solution: Any number between 2 and 4 (I would use 2).

Question 2 (11 marks)

The Boston data set was split into a test set of size 200 and a training set of size 306. All-subsets regression,
the lasso, and principal components regression were used to fit a model for mean medv.
a. (1 mark) The first method applied to the data was all-subsets regression. The full model is a linear
model in the p = 13 predictors. Write out the model for this linear predictor, f (). You may use the
notation Xj for predictor j without definition.
Solution: f (X) = β0 + β1 X1 + . . . + βp Xp , where p = 13.
b. (2 marks) BIC was used to select the number of model terms. Give the formula for the BIC in terms
of the sample size, n, and the number of predictors, p. What does BIC estimate?
Solution: BIC = (1/(n σ̂²)) (RSS + log(n) p σ̂²), where RSS is the residual sum of squares and σ̂ is the
estimated error SD (1 mark). BIC is an estimate of the optimism (1 mark), which is roughly the difference
between test and training error.
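For reference, a hedged sketch of how all-subsets regression with BIC might be run in R using the leaps package (the exam does not show this code):

library(leaps)
fit.all <- regsubsets(medv ~ ., data = Boston, nvmax = 13)  # all subsets up to 13 terms
fit.sum <- summary(fit.all)
which.min(fit.sum$bic)   # number of model terms that minimizes BIC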
c. (1 mark) Based on the following plot of BIC, what number of model terms would you choose to
predict medv?
[Figure: BIC (about −240 to −180) versus the number of model terms (1 to 13).]

Solution: 5
d. (2 marks) Next, the lasso was applied. Recall that the lasso minimizes an objective function of the
form loss + penalty. Write out the loss function in terms of the estimated linear function, fˆ(), and
the penalty function in terms of the estimated coefficients β̂j .

Solution: The loss is the residual sum of squares, Σᵢ₌₁ⁿ (yi − f̂(Xi))² (1 mark), and the penalty is λ Σⱼ₌₁ᵖ |β̂j| (1 mark).
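A minimal sketch of a lasso fit with the glmnet package; this kind of call produces the coefficient-path plot in part (e) below (assumed setup, not the exam's code):

library(glmnet)
x <- model.matrix(medv ~ ., Boston)[, -1]   # predictor matrix, intercept column dropped
y <- Boston$medv
fit.lasso <- glmnet(x, y, alpha = 1)        # alpha = 1 selects the lasso penalty
plot(fit.lasso, xvar = "lambda")            # coefficient paths versus log(lambda)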

e. (1 mark) Describe the following plot of the fitted lasso object. What do the individual lines represent?
[Figure: plot of the fitted lasso object; coefficient paths (Coefficients, −0.4 to 0.4) versus Log Lambda (−6 to 2), with counts of nonzero coefficients along the top axis.]

Solution: The plot shows the estimated coefficients versus the log of the tuning parameter λ. Each
individual line traces the coefficient path of one predictor as λ varies.
f. (1 mark) Cross-validation was used to select the tuning parameter. According to the following
plot, what is the approximate value of the tuning parameter we should choose if we use the
“one-standard-error” rule?
[Figure: cross-validated Mean-Squared Error (about 0.4 to 0.8) versus log(Lambda) (−6 to 2), with counts of nonzero coefficients along the top axis.]
Solution: We should use log λ of about −2.2 or −2.3, or λ value of about 0.11 or 0.10.
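Continuing the glmnet sketch from part (d), the one-standard-error choice can be read off a cv.glmnet object (illustrative only; the exact value depends on the CV folds):

set.seed(1)                # CV folds are random
cv.lasso <- cv.glmnet(x, y, alpha = 1)
cv.lasso$lambda.1se        # largest lambda within one SE of the minimum CV error
log(cv.lasso$lambda.1se)   # compare to the plot's log(Lambda) axis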
g. (1 mark) The next method is principal components regression (PCR) with the number of components
selected by cross-validation. Describe the following plot. With the principle of parsimony in mind,
how many PCs would we choose for the PCR?

[Figure: cross-validated RMSEP for medv (about 0.6 to 1.0) versus the number of components (0 to 13).]

Solution: The plot is of the square root of the MSE obtained by cross-validation versus the number of PCs
(1 mark). We might choose any number between 5 and 13 PCs based on the RMSEP, but with parsimony
in mind we would choose 5 PCs (1 mark).
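A sketch of the PCR fit with cross-validation using the pls package (assumed code; the seed and fold assignments will change the exact curve):

library(pls)
set.seed(1)
fit.pcr <- pcr(medv ~ ., data = Boston, scale = TRUE, validation = "CV")
validationplot(fit.pcr, val.type = "RMSEP")   # RMSEP versus number of components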
h. (2 marks) The test MSE for the hold-out test data is 0.247 for all-subsets regression, 0.293 for the
lasso, and 0.277 for PCR. Which method gives the best test error? Will the order of test errors
necessarily be the same for another test data set?
Solution: All-subsets regression gives the best test MSE for this test data set (1 mark). However, the order
could be different on another test data set (1 mark).

Question 3 (4 marks)

We use the Boston data again. In this question we model the relationship between the response medv and
the predictor lstat using regression and smoothing splines.
a. (1 mark) If we specify a natural cubic spline in lstat to predict medv with 3 interior knots, how
many spline basis functions will be used?
Solution: 4
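This count can be checked in R with the splines package: ns() with df = 4 places 3 interior knots (at quantiles of lstat) and returns a basis matrix with 4 columns (a sketch, not exam code):

library(splines)
basis <- ns(Boston$lstat, df = 4)   # natural cubic spline basis with 3 interior knots
dim(basis)                          # 506 x 4: one column per basis function
attr(basis, "knots")                # locations of the 3 interior knots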
b. (1 mark) The following plot shows the results of cross-validation to choose the degrees of freedom
(df) of a natural cubic spline. What df minimizes the estimated test set error?

[Figure: cross-validated mean error (meanErr, about 0.3 to 0.5) versus df (2.5 to 10).]
Solution: 5
c. (1 mark) What is the analog of degrees of freedom for a smoothing spline? Define this quantity in
terms of the smoother matrix Sλ .
Solution: The effective degrees of freedom for a smoothing spline is the sum of the diagonal elements
(trace) of Sλ .
d. (1 mark) Let ĝλ(x) be the smoothing spline estimate of the curve in lstat that predicts medv. Write
out the formula for the leave-one-out CV estimate of test error in terms of ĝλ(x) and Sλ.
Solution:

CV(λ) = (1/n) Σᵢ₌₁ⁿ [ (yi − ĝλ(xi)) / (1 − {Sλ}ii) ]²
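In R, smooth.spline() reports exactly this criterion when leave-one-out CV is requested with cv = TRUE (a sketch; because lstat has tied values, R will warn that LOO-CV with non-unique x values is doubtful):

fit.ss <- smooth.spline(Boston$lstat, Boston$medv, cv = TRUE)  # LOO-CV rather than GCV
fit.ss$df        # effective degrees of freedom, trace of the smoother matrix
fit.ss$cv.crit   # the leave-one-out CV criterion given above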

Question 4 (11 marks)

In this question we will use generalized additive models (GAM) and boosting to predict medv using the 13
predictors in the Boston data set. The models are fit to the training set from Question 2 and evaluated on
the test set from Question 2.
a. (1 mark) The following is a summary of a model that includes additive terms for all predictors, with
smoothing splines for all quantitative predictors. Based on this output, what model reduction do
you recommend?
Anova for Nonparametric Effects
Npar Df Npar F Pr(F)
(Intercept)
s(crim) 3 5.3804 0.001317
s(zn) 3 0.4816 0.695356
s(indus) 3 0.6556 0.580072
s(nox) 3 3.6581 0.013058
s(rm) 3 19.8923 1.273e-11
s(age) 3 0.9285 0.427482
s(dis) 3 2.9111 0.035043
s(rad) 3 1.2746 0.283557
s(tax) 3 1.0097 0.388991
s(ptratio) 3 1.3618 0.254940
s(black) 3 0.8814 0.451277
s(lstat) 3 24.1854 8.216e-14
Solution: Remove the smooth in zn; or, replace the smooth in zn with a linear term.
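A hedged sketch of the full additive fit and the suggested reduction, assuming the gam package (the smooth terms match the ANOVA output above; chas, a dummy, enters linearly):

library(gam)
fit.gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) +
                 s(age) + s(dis) + s(rad) + s(tax) + s(ptratio) +
                 s(black) + s(lstat), data = Boston)
summary(fit.gam)                                 # gives the Anova for Nonparametric Effects
fit.gam2 <- update(fit.gam, . ~ . - s(zn) + zn)  # replace the smooth in zn by a linear term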
b. (1 mark) What is the name for the model selection procedure where we start with the largest model
of interest and remove the least predictive term one at a time?
Solution: backward selection
c. (1 mark) After a series of model reductions, we end up with a model that includes terms for 12 of
the 13 predictors. The model term for lstat is a smoothing spline. The fitted lstat curve obtained
by plotting the fitted gam object is shown below. What is the interpretation of this curve?
[Figure: fitted curve s(lstat) (about −0.5 to 0.5) versus lstat, from plotting the fitted gam object.]
Solution: Something that mentions the effect of lstat on average medv adjusted for or holding fixed
the other variables.
d. (1 mark) Suggest a GAM model term if we wanted to model interaction between lstat and chas
nonparametrically. You may specify the term using words or R code.
Solution: A local regression function of lstat and chas; i.e., lo(lstat,chas)
e. (2 marks) We now fit a gradient boosted tree model to the training data with medv as a quantitative
response and all other variables as explanatory variables. We use (i) squared error loss, (ii) 500 trees,
(iii) the default shrinkage value of 0.001 and (iv) an interaction depth of 4. Is this model additive in
the sense that the basis functions each involve only one variable? If so, why? If not, why not?
Solution: No (1 mark), because with an interaction depth of four the trees that are the basis functions
have four splits and could involve up to four variables (1 mark).
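A sketch of this boosted fit with the gbm package; squared error loss corresponds to distribution = "gaussian", and the other settings are as stated in the question:

library(gbm)
set.seed(1)
fit.boost <- gbm(medv ~ ., data = Boston, distribution = "gaussian",
                 n.trees = 500, shrinkage = 0.001, interaction.depth = 4)
summary(fit.boost)   # relative influence (rel.inf) of each variable, as in part (h)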
f. (1 mark) What is the name for the algorithm used to find the mth basis function?
Solution: Forward stagewise selection (forward and stagewise are the key terms).
g. (1 mark) What parameters in the gradient boosting algorithm can be used to control the flexibility
of the model?
Solution: The number of basis functions (number of trees), the shrinkage parameter and the interaction
depth.
h. (1 mark) The following summary of the gradient boosted model has been truncated at the first four
variables, ordered by the value of the entry in the rel.inf column. Interpret this output.
## var rel.inf
## lstat lstat 38.017180
## rm rm 30.676493
## dis dis 8.559256
## crim crim 5.119594
Solution: Something that mentions that the variables lstat and rm are by far the most important for
reducing the loss function and that the others are not very important.
i. (1 mark) The test MSE is 0.15 for the fitted GAM model and 0.49 for the gradient boosted model.
Which method gives the best test error?
Solution: The GAM model.
j. (1 mark) If there were more residential areas and more variables measured on each area, would
you expect the performance of the boosted tree model to improve relative to the GAM model or
not? (A simple improve, not improve will suffice.)
Solution: improve

Question 5 (11 marks)

In this question we use random forests and the support vector machine to classify patients’ atherosclerotic
heart disease (AHD) status using patient characteristics such as blood pressure and chest pain symptoms.
(This is the Heart data set we used several times in class.) We train the models on 2/3 of the data and
test on 1/3 of the data.
a. (4 marks) The first model is a random forest. Briefly,
(i) What is a bootstrap sample?
(ii) For a given bootstrap sample, what are the out-of-bag (OOB) observations?
(iii) When the outcome is binary, how do we obtain the OOB prediction for the ith observation?
(iv) What differentiates random forests from bagging decision trees?
Solutions: (i) A random sample, drawn with replacement, from the original sample; (ii) the observations
not included in the bootstrap sample; (iii) the "majority vote", or consensus prediction, over the trees
whose bootstrap samples did not include the ith observation; (iv) in random forests we use a random
subset of the predictors as split candidates at each split.
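A minimal sketch with the randomForest package; Heart.train is a hypothetical name for the 2/3 training split, with AHD as a factor response:

library(randomForest)
set.seed(1)
# For classification the default is mtry = sqrt(p) candidate predictors per split;
# this random subsetting is what distinguishes a random forest from bagging (mtry = p).
fit.rf <- randomForest(AHD ~ ., data = Heart.train, importance = TRUE)
head(fit.rf$predicted)   # OOB predictions: majority vote over trees where the case is OOB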
b. (2 marks) A random forest is fit to the Heart training data set. In the summary of the fitted model,
what does MeanDecreaseGini mean? Interpret the following output, which has been truncated after
the 6 variables with the top MeanDecreaseGini values.
## MeanDecreaseGini
## MaxHR 14.190996
## ChestPain 13.552763
## Thal 12.957842
## Ca 11.824852
## Oldpeak 9.498046
## Age 7.788180
Solution: The MeanDecreaseGini is the total decrease in node impurity (as measured by the Gini index)
from splits on the variable, averaged over all trees (1 mark). Some flexibility here in acceptable
interpretations. They should mention that MaxHR, ChestPain and Thal are all quite important. Other
than that, it is not clear what to make of this output.
c. (1 mark) Based on the following table of predictions versus true states of AHD, what is the misclassi-
fication rate of the random forest?
##
## pred.rf No Yes
## No 46 12
## Yes 4 37
Solution: (12 + 4)/(46 + 12 + 4 + 37) = 16/99 ≈ 0.162
d. (1 mark) Next we fit a support vector machine to the Heart data to predict AHD using a radial kernel.
Briefly, what is the advantage of a radial kernel over a linear kernel?
Solution: The key point is that the radial kernel allows for non-linear decision boundaries.
e. (1 mark) Let p denote the number of predictors. Write out the formula for the radial kernel, K(xi , xi0 ),
for measuring the similarity between vectors xi and xi0 .
Solution: K(xi, xi′) = exp(−γ Σⱼ₌₁ᵖ (xij − xi′j)²).

f. (1 mark) What are the tuning parameters for the SVM that control the flexibility of the classifier?
Solution: The cost, or budget of margin violations, and the parameter γ in the radial kernel.
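Both parameters are typically chosen by cross-validation; a sketch with the e1071 package, again using the hypothetical Heart.train:

library(e1071)
set.seed(1)
tune.out <- tune(svm, AHD ~ ., data = Heart.train, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100), gamma = c(0.5, 1, 2)))
tune.out$best.parameters   # CV-selected cost and gamma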
g. (1 mark) Based on the following table of predictions versus true states of AHD, what is the misclassi-
fication rate of the support vector machine?
##
## pp No Yes
## No 45 23
## Yes 5 26
Solution: (23 + 5)/(45 + 23 + 5 + 26) = 28/99 ≈ 0.283
