a) Fit a regression model with only main effects. Which predictors have significant explanatory
power? Is there any evidence of multicollinearity among the predictors? (5 points)
Multiple regression analysis to predict Systol from 10 predictors
The prediction equation is:
Systol = 141.93 - 1.049 Age - 0.23 Calf - 0.84 Chin + 0.10 Diastol - 1.09 Forearm - 108.37 Fraction - 0.04 Height + 0.12 Pulse + 1.35 Weight + 2.28 Year
R squared: 0.67
Standard error of estimate: 8.74
F statistic: 5.75
p value: 0.000116332
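The overall F statistic can be cross-checked from R squared alone. The sketch below is only a sanity check, not part of the original output: it assumes n = 39 observations (the ANOVA tables report 38 total degrees of freedom) and uses the 67.3% value of R squared quoted in the interpretation. The function name is illustrative.

```python
def overall_f(r_squared, n_obs, n_predictors):
    """Overall F statistic of a linear regression computed from R squared."""
    df_model = n_predictors
    df_error = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1.0 - r_squared) / df_error)

# Assumed inputs: R squared = 0.673, n = 39 observations, 10 predictors.
f_full = overall_f(0.673, 39, 10)
print(round(f_full, 2))
```

The result should agree with the reported F statistic up to the rounding of R squared.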
Term      Coeff    95% Lower CI  95% Upper CI  StdErr  t      p     Significant?
Constant  141.93   39.55         244.31        49.98   2.84   0.01  Yes (p<0.01)
Age       -1.05    -1.76         -0.34         0.35    -3.01  0.01  Yes (p<0.01)
Calf      -0.23    -1.36         0.90          0.55    -0.42  0.68  No (p>0.05)
Chin      -0.84    -2.41         0.72          0.76    -1.10  0.28  No (p>0.05)
Diastol   0.10     -0.21         0.41          0.15    0.67   0.51  No (p>0.05)
Forearm   -1.09    -3.57         1.39          1.21    -0.90  0.38  No (p>0.05)
Fraction  -108.36  -174.27       -42.46        32.17   -3.37  0.00  Yes (p<0.01)
Height    -0.04    -0.11         0.04          0.04    -0.97  0.34  No (p>0.05)
Pulse     0.12     -0.24         0.47          0.17    0.67   0.51  No (p>0.05)
Weight    1.35     0.44          2.26          0.44    3.05   0.00  Yes (p<0.01)
Year      2.28     0.51          4.05          0.86    2.63   0.01  Yes (p<0.05)
The R-squared value, 67.3%, indicates the proportion of the variance of Systol that is explained
by the regression model. Thus the predictors together explain a very highly significant proportion
of the variation in Systol, based on the F test (p<0.001).
The standard error of estimate, 8.74, indicates the typical size of errors made in predicting Systol
using the regression model.
Holding the other X variables constant, we estimate that:
-1.05 is the increase in Systol associated with an increase in Age of 1 unit. This is highly
significant (p<0.01).
-0.23 is the increase in Systol associated with an increase in Calf of 1 unit. This is not
significant (p>0.05).
-0.84 is the increase in Systol associated with an increase in Chin of 1 unit. This is not significant
(p>0.05).
0.10 is the increase in Systol associated with an increase in Diastol of 1 unit. This is not
significant (p>0.05).
-1.09 is the increase in Systol associated with an increase in Forearm of 1 unit. This is not
significant (p>0.05).
-108.37 is the increase in Systol associated with an increase in Fraction of 1 unit. This is highly
significant (p<0.01).
-0.04 is the increase in Systol associated with an increase in Height of 1 unit. This is not
significant (p>0.05).
0.12 is the increase in Systol associated with an increase in Pulse of 1 unit. This is not significant
(p>0.05).
1.35 is the increase in Systol associated with an increase in Weight of 1 unit. This is highly
significant (p<0.01).
2.28 is the increase in Systol associated with an increase in Year of 1 unit. This is significant
(p<0.05).
From the test, Age, Fraction, Weight and Year have significant explanatory powers.
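Each significance statement rests on t = coefficient / standard error, and a 95% confidence interval that excludes zero corresponds to significance at the 5% level. The sketch below recomputes both for Age and Weight from the rounded table values. It assumes 28 error degrees of freedom (39 observations minus 11 estimated coefficients), for which the two-sided 5% critical value of Student's t is 2.048; the function name is illustrative.

```python
def t_and_interval(coef, std_err, t_crit):
    """Return the t statistic and the interval coef +/- t_crit * std_err."""
    t_stat = coef / std_err
    half_width = t_crit * std_err
    return t_stat, (coef - half_width, coef + half_width)

# Two-sided 5% critical value of Student's t with 28 degrees of freedom
# (assumes n = 39 observations and 11 estimated coefficients).
T_CRIT_28 = 2.048

t_age, ci_age = t_and_interval(-1.05, 0.35, T_CRIT_28)
t_weight, ci_weight = t_and_interval(1.35, 0.44, T_CRIT_28)
print(round(t_age, 2), [round(x, 2) for x in ci_age])
print(round(t_weight, 2), [round(x, 2) for x in ci_weight])
```

Because the inputs are rounded to two decimals, the recomputed limits can differ from the tabulated ones in the last digit.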
Term       Coefficient  Standard Error  t Stat  P-value  VIF
Intercept  141.93       49.98           2.84    0.01
Age        -1.05        0.35            -3.01   0.01     3.56
Years      2.28         0.86            2.63    0.01     37.88
Weight     1.35         0.44            3.05    0.00     4.95
Height     -0.04        0.04            -0.97   0.34     1.92
Chin       -0.84        0.76            -1.10   0.28     2.15
Forearm    -1.09        1.21            -0.90   0.38     3.84
Calf       -0.23        0.55            -0.42   0.68     2.51
Pulse      0.12         0.17            0.67    0.51     1.33
Diastol    0.10         0.15            0.67    0.51     1.53
Fraction   -108.36      32.17           -3.37   0.00     27.21
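To read the VIF column against the multicollinearity question, a common rule of thumb flags predictors whose VIF exceeds 10. That threshold is an assumption of this sketch, not a value stated in the assignment. Since VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others, each VIF also implies how much of that predictor is explained by the rest.

```python
# VIF values copied from the table above.
vif = {
    "Age": 3.56, "Years": 37.88, "Weight": 4.95, "Height": 1.92,
    "Chin": 2.15, "Forearm": 3.84, "Calf": 2.51, "Pulse": 1.33,
    "Diastol": 1.53, "Fraction": 27.21,
}

VIF_THRESHOLD = 10.0  # rule-of-thumb cut-off (assumption, not from the assignment)

def implied_r_squared(v):
    """R squared of one predictor regressed on the others, from VIF = 1 / (1 - R^2)."""
    return 1.0 - 1.0 / v

flagged = [name for name, v in vif.items() if v > VIF_THRESHOLD]
for name in flagged:
    print(name, vif[name], round(implied_r_squared(vif[name]), 3))
```

Under this rule only Years and Fraction are flagged, which is consistent with Fraction being defined from Years and Age in this kind of data set, although that definition is not stated in the output itself.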
b) Obtain a reduced model that has sufficient explanatory power. Use both stepwise regression
procedure and best subset regression. Is the model selected via stepwise regression the same as the
one selected by best subset regression? Use the C(p) criterion for model selection. (5 points)
Best Subset Regression
Response is Systol

Vars  R-Sq  R-Sq(adj)  R-Sq(pred)  Mallows Cp  S
1     27.2  25.2       20.7        27.3        11.338
1     22.6  20.5       17.1        31.2        11.690
2     47.3  44.4       37.6        12.1        9.7772
2     42.1  38.9       30.3        16.5        10.251
3     51.8  47.7       39.8        10.2        9.4833
3     50.3  46.1       38.6        11.5        9.6273
4     59.7  55.0       44.8        5.4         8.7946
4     53.7  48.2       39.8        10.6        9.4348
5     63.9  58.4       45.6        3.9         8.4571
5     63.1  57.6       44.2        4.5         8.5417
6     64.9  58.3       43.3        5.0         8.4663
6     64.9  58.3       46.2        5.0         8.4681
7     66.1  58.4       42.6        6.0         8.4556
7     65.5  57.7       41.3        6.5         8.5220
8     66.6  57.7       39.9        7.5         8.5228
8     66.5  57.5       39.3        7.7         8.5464
9     67.1  56.8       36.0        9.2         8.6139
9     66.7  56.4       34.4        9.4         8.6554
10    67.3  55.6       29.8        11.0        8.7391

Predictor columns in the original output: Age, Years, Weight, Height, Chin, Forearm, Calf, Pulse, Diastol, Fraction.
[The X marks showing which predictors enter each model are not recoverable.]
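The Mallows Cp column can be reproduced from the S column, which makes the selection rule explicit. The sketch below uses Cp = SSE_p / MSE_full - n + 2p, where p counts the intercept, and picks the subset size with the smallest Cp. It assumes n = 39 observations (from the 38 total degrees of freedom); the function and variable names are illustrative.

```python
def mallows_cp(s_subset, n_predictors, s_full, n_obs):
    """Mallows' Cp = SSE_p / MSE_full - n + 2p, with p counting the intercept."""
    p = n_predictors + 1
    sse_subset = s_subset ** 2 * (n_obs - p)
    mse_full = s_full ** 2
    return sse_subset / mse_full - n_obs + 2 * p

N_OBS = 39        # assumed from the 38 total degrees of freedom
S_FULL = 8.7391   # S of the 10-predictor model

# (number of predictors, S) for the best subset of each size, from the output.
best_by_size = [
    (1, 11.338), (2, 9.7772), (3, 9.4833), (4, 8.7946), (5, 8.4571),
    (6, 8.4663), (7, 8.4556), (8, 8.5228), (9, 8.6139), (10, 8.7391),
]

cp_by_size = {k: mallows_cp(s, k, S_FULL, N_OBS) for k, s in best_by_size}
best_size = min(cp_by_size, key=cp_by_size.get)
print(best_size, round(cp_by_size[best_size], 1))
```

Choosing the minimum Cp is one common convention; another is to prefer the smallest model whose Cp is close to p.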
            ----Step 1----    ----Step 2----    ----Step 3----
            Coef     P        Coef     P        Coef     P
Constant    66.6              60.9              52.1
Weight      0.963    0.001    1.217    0.000    1.022    0.000
Fraction                      -26.77   0.001    -24.32   0.002
Diastol                                         0.264    0.079

S           11.3376           9.77719           9.48331
R-sq        27.18%            47.31%            51.81%
R-sq(adj)   25.21%            44.38%            47.68%
R-sq(pred)  20.66%            37.55%            39.76%
Mallows Cp  27.27             12.06             10.21

Analysis of Variance
Source      DF  Adj SS  Adj MS   F-Value  P-Value
Regression  3   3383.8  1127.93  12.54    0.000
Weight      1   1493.4  1493.40  16.61    0.000
Diastol     1   293.7   293.71   3.27     0.079
Fraction    1   1046.5  1046.52  11.64    0.002
Error       35  3147.7  89.93
Total       38  6531.4
Model Summary
S        R-sq    R-sq(adj)  R-sq(pred)
9.48331  51.81%  47.68%     39.76%

Coefficients
Term      Coef    SE Coef  T-Value  P-Value  VIF
Constant  52.1    14.7     3.55     0.001
Weight    1.022   0.251    4.08     0.000    1.34
Diastol   0.264   0.146    1.81     0.079    1.23
Fraction  -24.32  7.13     -3.41    0.002    1.13
Regression Equation
Systol = 52.1 + 1.022 Weight + 0.264 Diastol - 24.32 Fraction
Fits and Diagnostics for Unusual Observations
Obs  Systol  Fit     Resid  Std Resid
1    170.00  143.62  26.38  3.05       R
4    148.00  145.14  2.86   0.43       X
39   152.00  146.28  5.72   0.74       X
R  Large residual
X  Unusual X
The model selected by stepwise regression is not the same as the one selected by best subset
regression.
Stepwise regression: Systol = 52.1 + 1.022 Weight + 0.264 Diastol - 24.32 Fraction
Best subset regression (lowest Mallows Cp, 3.9): Systol = 109.36 - 1.01 Age - 1.19 Chin - 110.81 Fraction + 1.10 Weight + 2.41 Year
The stepwise model has 3 predictors, while the best subset model has 5 predictors.
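Because the two selected models differ in size, adjusted R squared is a fairer basis for comparison than raw R squared. The sketch below recomputes it for both models from the reported R squared values, assuming n = 39 observations (from the 38 total degrees of freedom); the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

N_OBS = 39  # assumed from the 38 total degrees of freedom

# Stepwise model: Weight, Diastol, Fraction (R squared 51.81%).
adj_stepwise = adjusted_r_squared(0.5181, N_OBS, 3)
# Best subset model: Age, Chin, Fraction, Weight, Year (R squared 63.9%).
adj_best_subset = adjusted_r_squared(0.639, N_OBS, 5)

print(round(adj_stepwise, 4), round(adj_best_subset, 4))
```

Even after the penalty for two extra predictors, the five-predictor model keeps the higher adjusted R squared.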
c) Assess if the reduced model (obtained via best subset regression) satisfies all the assumptions.
Particularly test the hypothesis of homogeneous variance and normality of the random errors. (5
points)
Multiple R: 0.799
R Squared: 0.639
Adjusted R Square: 0.584
Standard Error: 8.457

Term       Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  109.3588779   21.48427        5.090184  1.41E-05  65.6488    153.069
Age        -1.012029491  0.305936        -3.30798  0.002277  -1.63446   -0.3896
Years      2.40674643    0.742638        3.240808  0.002723  0.895839   3.917654
Weight     1.097639545   0.298013        3.683192  0.000819  0.491327   1.703952
Chin       -1.191791279  0.613983        -1.94108  0.060829  -2.44095   0.057367
Fraction   -110.8110438  27.27949        -4.06206  0.000282  -166.312   -55.3105
The R-squared value, 63.9%, indicates the proportion of the variance of Systol that is explained
by the regression model.
Thus Age, Chin, Fraction, Weight and Year together explain a very highly significant proportion
of the variation in Systol, based on the F test (p<0.001).
The standard error of estimate, 8.457, indicates the typical size of errors made in predicting
Systol using the regression model.
Holding the other X variables constant, we estimate that:
-1.012 is the increase in Systol associated with an increase in Age of 1 unit. This is highly
significant (p<0.01).
-1.192 is the increase in Systol associated with an increase in Chin of 1 unit. This is not
significant (p>0.05).
-110.811 is the increase in Systol associated with an increase in Fraction of 1 unit. This is very
highly significant (p<0.001).
1.098 is the increase in Systol associated with an increase in Weight of 1 unit. This is very highly
significant (p<0.001).
2.407 is the increase in Systol associated with an increase in Year of 1 unit. This is highly
significant (p<0.01).
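Part (c) asks for formal tests of homogeneous variance and normality of the errors. The sketch below shows one way such checks could be computed from a model's residuals and fitted values. It substitutes two standard large-sample statistics: the Jarque-Bera statistic for normality, and a studentized Breusch-Pagan-type statistic (n times R squared from regressing squared residuals on the fitted values) for constant variance. The residuals and fitted values in the example are made up for illustration and are not the assignment data; with only 39 observations the chi-square reference values are approximate.

```python
def jarque_bera(residuals):
    """Jarque-Bera statistic n/6 * (skew^2 + (kurt - 3)^2 / 4)."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

def studentized_bp(residuals, fitted):
    """n * R^2 from a simple regression of squared residuals on fitted values."""
    n = len(residuals)
    sq = [e * e for e in residuals]
    mean_sq = sum(sq) / n
    mean_fit = sum(fitted) / n
    sxy = sum((f - mean_fit) * (s - mean_sq) for f, s in zip(fitted, sq))
    sxx = sum((f - mean_fit) ** 2 for f in fitted)
    syy = sum((s - mean_sq) ** 2 for s in sq)
    return n * sxy ** 2 / (sxx * syy)

CHI2_95_DF2 = 5.991  # 5% critical value, chi-square with 2 df (Jarque-Bera)
CHI2_95_DF1 = 3.841  # 5% critical value, chi-square with 1 df (one regressor)

# Made-up residuals and fitted values, for illustration only.
demo_resid = [-6.0, 2.5, 4.0, -1.5, 7.0, -3.0, 0.5, -4.5, 3.5, -2.5]
demo_fit = [112.0, 118.0, 121.0, 125.0, 127.0, 130.0, 133.0, 138.0, 141.0, 146.0]

jb = jarque_bera(demo_resid)
bp = studentized_bp(demo_resid, demo_fit)
print("normality rejected:", jb > CHI2_95_DF2)
print("constant variance rejected:", bp > CHI2_95_DF1)
```

A statistic below its critical value means the corresponding assumption is not rejected at the 5% level; residual plots remain a useful complement to either test.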
[Figures: scatter plots of Systol against Age, Fraction, Year, Chin and Weight]
Analysis of Variance
Source       DF  Adj SS   Adj MS   F-Value  P-Value
Regression   11  4401.81  400.165  5.07     0.000
Age          1   702.30   702.298  8.90     0.006
Years        1   521.50   521.503  6.61     0.016
Weight       1   675.19   675.186  8.56     0.007
Height       1   80.17    80.166   1.02     0.322
Chin         1   98.18    98.176   1.24     0.274
Forearm      1   42.93    42.932   0.54     0.467
Calf         1   19.70    19.695   0.25     0.621
Pulse        1   38.35    38.353   0.49     0.492
Diastol      1   35.04    35.036   0.44     0.511
Fraction     1   850.99   850.985  10.79    0.003
Interaction  1   8.79     8.792    0.11     0.741
Error        27  2129.62  78.875
Total        38  6531.44

Model Summary
S        R-sq    R-sq(adj)  R-sq(pred)
8.88115  67.39%  54.11%     0.00%

Coefficients
Term         Coef       SE Coef   T-Value  P-Value  VIF
Constant     146.0      52.2      2.80     0.009
Age          -1.060     0.355     -2.98    0.006    3.60
Years        2.371      0.922     2.57     0.016    41.74
Weight       1.411      0.482     2.93     0.007    5.65
Height       -0.0409    0.0406    -1.01    0.322    2.20
Chin         -0.872     0.782     -1.12    0.274    2.18
Forearm      -0.95      1.29      -0.74    0.467    4.25
Calf         -0.298     0.596     -0.50    0.621    2.83
Pulse        0.123      0.176     0.70     0.492    1.35
Diastol      0.102      0.152     0.67     0.511    1.53
Fraction     -111.4     33.9      -3.28    0.003    29.25
Interaction  -0.000000  0.000000  -0.33    0.741    3.18
Regression Equation
Systol = 146.0 - 1.060 Age + 2.371 Years + 1.411 Weight - 0.0409 Height - 0.872 Chin
- 0.95 Forearm - 0.298 Calf + 0.123 Pulse + 0.102 Diastol - 111.4 Fraction
- 0.000000 Interaction
Fits and Diagnostics for Unusual Observations
Obs  Systol  Fit     Resid  Std Resid
1    170.00  155.37  14.63  2.38       R
8    108.00  95.53   12.47  2.13       R
39   152.00  151.24  0.76   0.86       X
R  Large residual
X  Unusual X
The interaction effect is not significant (p = 0.741).
e) Split the dataset into 2 parts. The training set contains (approx.) 80% of samples. The test set
contains remaining 20% samples. Fit the full main effects model on the training set and predict
the values of Systol for test set. Obtain the mean square prediction error.
Now obtain a reduced model using stepwise regression on the training set. Use these parameter
estimates to predict the value of Systol in the test set. Obtain the mean square prediction error
for this reduced model. Compare the predictive performance of the full model and the reduced
model. Which one shows better predictive performance? Why? (5 points)
Multiple regression analysis to predict Systol from training set predictors (Age, Years, Weight,
Height, Chin, Forearm, Calf, Pulse)
The prediction equation is:
Systol = 127.405 - 0.275 Age + 0.245 Calf - 1.338 Chin - 1.060 Forearm - 0.067 Height + 0.034
Pulse + 2.114 Weight - 0.574 Year
R squared: 0.4998
Standard error of estimate: 10.4351
F statistic: 3.7477
p value: 0.0038
Term      Coef      95% Lower CI  95% Upper CI  StdErr   t        p       Significant?
Constant  127.4051  7.4804        247.3299      58.7212  2.1697   0.0381  Yes (p<0.05)
Age       -0.2749   -0.8686       0.3187        0.2907   -0.9458  0.3518  No (p>0.05)
Calf      0.2447    -1.0521       1.5415        0.6350   0.3853   0.7027  No (p>0.05)
Chin      -1.3384   -3.1450       0.4682        0.8846   -1.5130  0.1407  No (p>0.05)
Forearm   -1.0604   -3.9976       1.8768        1.4382   -0.7373  0.4667  No (p>0.05)
Height    -0.0673   -0.1556       0.0210        0.0432   -1.5560  0.1302  No (p>0.05)
Pulse     0.0336    -0.3828       0.4499        0.2039   0.1647   0.8703  No (p>0.05)
Weight    2.1145    1.1541        3.0748        0.4702   4.4966   0.0001  Yes (p<0.001)
Year      -0.5736   -1.0364       -0.1108       0.2266   -2.5311  0.0168  Yes (p<0.05)
The R-squared value, 50.0%, indicates the proportion of the variance of Systol that is explained
by the regression model.
Thus training set predictors together explain a highly significant proportion of the variation in
Systol, based on the F test (p<0.01).
The standard error of estimate, 10.4351, indicates the typical size of errors made in predicting
Systol using the regression model.
Holding the other X variables constant, we estimate that:
-0.2749 is the increase in Systol associated with an increase in Age of 1 unit. This is not
significant (p>0.05).
0.2447 is the increase in Systol associated with an increase in Calf of 1 unit. This is not
significant (p>0.05).
-1.3384 is the increase in Systol associated with an increase in Chin of 1 unit. This is not
significant (p>0.05).
-1.0604 is the increase in Systol associated with an increase in Forearm of 1 unit. This is not
significant (p>0.05).
-0.0673 is the increase in Systol associated with an increase in Height of 1 unit. This is not
significant (p>0.05).
0.0336 is the increase in Systol associated with an increase in Pulse of 1 unit. This is not
significant (p>0.05).
2.1145 is the increase in Systol associated with an increase in Weight of 1 unit. This is very
highly significant (p<0.001).
-0.5736 is the increase in Systol associated with an increase in Year of 1 unit. This is significant
(p<0.05).
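The F statistic and the standard error of estimate for this model both follow from its analysis-of-variance figures (regression SS 3264.72 on 8 df, residual SS 3266.716 on 30 df). The sketch below is a small arithmetic check of that relationship; the function name is illustrative.

```python
def anova_f(ss_regression, df_regression, ss_residual, df_residual):
    """Return (MS regression, MS residual, F) from sums of squares and df."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression, ms_residual, ms_regression / ms_residual

ms_reg, ms_res, f_stat = anova_f(3264.72, 8, 3266.716, 30)

# The standard error of estimate is the square root of the residual mean square.
std_error_of_estimate = ms_res ** 0.5

print(round(ms_reg, 2), round(ms_res, 4), round(f_stat, 4))
print(round(std_error_of_estimate, 4))
```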
            df  SS        MS        F         Significance F
Regression  8   3264.72   408.09    3.747709  0.003783
Residual    30  3266.716  108.8905
Total       38  6531.436

Stepwise Regression
Stepwise Selection of Terms
Candidate terms: Age, Years, Weight, Height, Chin, Forearm, Calf, Pulse

            ----Step 1----    ----Step 2----
            Coef     P        Coef     P
Constant    66.6              50.3
Weight      0.963    0.001    1.354    0.000
Years                         -0.572   0.004

S           11.3376           10.2512
R-sq        27.18%            42.08%
R-sq(adj)   25.21%            38.86%
R-sq(pred)  20.66%            30.35%
Mallows Cp  8.68              1.74
Analysis of Variance
Source      DF  Adj SS  Adj MS  F-Value  P-Value
Regression  2   2748.3  1374.1  13.08    0.000
Years       1   972.9   972.9   9.26     0.004
Weight      1   2698.3  2698.3  25.68    0.000
Error       36  3783.2  105.1
Total       38  6531.4

Model Summary
S        R-sq    R-sq(adj)  R-sq(pred)
10.2512  42.08%  38.86%     30.35%

Coefficients
Term      Coef    SE Coef  T-Value  P-Value  VIF
Constant  50.3    15.8     3.18     0.003
Years     -0.572  0.188    -3.04    0.004    1.30
Weight    1.354   0.267    5.07     0.000    1.30
Regression Equation
Systol = 50.3 - 0.572 Years + 1.354 Weight
Fits and Diagnostics for Unusual Observations
Obs  Systol  Fit     Resid  Std Resid
1    170.00  145.89  24.11  2.60       R
38   132.00  120.52  11.48  1.28       X
39   152.00  145.25  6.75   0.82       X
R  Large residual
X  Unusual X
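Regression Equation:
Systol = 50.3 - 0.572 Years + 1.354 Weight
Using Years = 1 and Weight = 71:
Systol = 50.3 - 0.572*1 + 1.354*71 = 145.862
The residual standard error of the reduced model is S = 10.2512, so its error mean square is about 105.1. The mean square prediction error itself is the average squared difference between the observed and predicted Systol values in the test set.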
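As a sketch of the calculation requested in part (e), the code below evaluates the reduced-model equation for Years = 1 and Weight = 71 and shows how a mean square prediction error would be computed over held-out rows. The three test rows are hypothetical values invented for illustration, not observations from the assignment data, and the function names are illustrative.

```python
def predict_systol(years, weight):
    """Prediction from the reduced model Systol = 50.3 - 0.572 Years + 1.354 Weight."""
    return 50.3 - 0.572 * years + 1.354 * weight

def mean_square_prediction_error(observed, predicted):
    """Average squared difference between observed and predicted responses."""
    return sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted)) / len(observed)

example_prediction = predict_systol(1, 71)
print(round(example_prediction, 3))

# Hypothetical test-set rows (Years, Weight, observed Systol), illustration only.
test_rows = [(1, 71.0, 150.0), (10, 65.0, 128.0), (25, 60.0, 116.0)]
predicted = [predict_systol(yrs, wt) for yrs, wt, _ in test_rows]
observed = [y for _, _, y in test_rows]
mspe = mean_square_prediction_error(observed, predicted)
print(round(mspe, 2))
```

Running the same two functions with the full-model coefficients on the same held-out rows gives the second MSPE needed for the comparison; the model with the smaller value predicts better out of sample.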