
The attached dataset contains one response (Systol) and 10 predictors.

a) Fit a regression model with only main effects. Which predictors have significant explanatory
power? Is there any evidence of multicollinearity among the predictors? (5 points)
Multiple regression analysis to predict Systol from 10 predictors
The prediction equation is:
Systol = 141.93 - 1.049 Age - 0.23 Calf - 0.84 Chin + 0.10 Diastol - 1.09 Forearm
         - 108.37 Fraction - 0.04 Height + 0.12 Pulse + 1.35 Weight + 2.28 Year
R-squared: 0.67
Standard error of estimate: 8.74
F statistic: 5.75
p value: 0.000116332

Term       Coeff     95% Lower CI   95% Upper CI   StdErr   t       p      Significant?
Constant   141.93    39.55          244.31         49.98    2.84    0.01   Yes (p<0.01)
Age        -1.05     -1.76          -0.34          0.35     -3.01   0.01   Yes (p<0.01)
Calf       -0.23     -1.36          0.90           0.55     -0.42   0.68   No (p>0.05)
Chin       -0.84     -2.41          0.72           0.76     -1.10   0.28   No (p>0.05)
Diastol    0.10      -0.21          0.41           0.15     0.67    0.51   No (p>0.05)
Forearm    -1.09     -3.57          1.39           1.21     -0.90   0.38   No (p>0.05)
Fraction   -108.36   -174.27        -42.46         32.17    -3.37   0.00   Yes (p<0.01)
Height     -0.04     -0.11          0.04           0.04     -0.97   0.34   No (p>0.05)
Pulse      0.12      -0.24          0.47           0.17     0.67    0.51   No (p>0.05)
Weight     1.35      0.44           2.26           0.44     3.05    0.00   Yes (p<0.01)
Year       2.28      0.51           4.05           0.86     2.63    0.01   Yes (p<0.05)

The R-squared value, 67.3%, indicates the proportion of the variance of Systol that is explained
by the regression model. Thus the predictors together explain a very highly significant proportion
of the variation in Systol, based on the F test (p<0.001).
The standard error of estimate, 8.74, indicates the typical size of errors made in predicting Systol
using the regression model.
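
The overall F statistic can be cross-checked from R-squared alone. The sketch below is only an illustration: it assumes n = 39 observations (taken from the Total DF of 38 in the ANOVA tables) and k = 10 predictors, and the function name is made up for this example.

```python
def f_from_r2(r2: float, n: int, k: int) -> float:
    """Overall F statistic for a model with k predictors fitted to n observations."""
    explained_per_df = r2 / k                      # R^2 per regression degree of freedom
    unexplained_per_df = (1.0 - r2) / (n - k - 1)  # (1 - R^2) per error degree of freedom
    return explained_per_df / unexplained_per_df


# R-squared = 0.673; n = 39 is an assumption based on Total DF = 38; k = 10 predictors
f_full = f_from_r2(0.673, n=39, k=10)
print(round(f_full, 2))
```

The result lands close to the reported F of 5.75; the small gap comes from using R-squared rounded to three decimals.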
Holding the other X variables constant, we estimate the change in Systol associated with a
1-unit increase in each predictor:
Age: -1.05. This is highly significant (p<0.01).
Calf: -0.23. This is not significant (p>0.05).
Chin: -0.84. This is not significant (p>0.05).
Diastol: 0.10. This is not significant (p>0.05).
Forearm: -1.09. This is not significant (p>0.05).
Fraction: -108.37. This is highly significant (p<0.01).
Height: -0.04. This is not significant (p>0.05).
Pulse: 0.12. This is not significant (p>0.05).
Weight: 1.35. This is highly significant (p<0.01).
Year: 2.28. This is significant (p<0.05).
From the t tests, Age, Fraction, Weight and Year have significant explanatory power.

There is evidence of multicollinearity among the predictors: the VIFs for Years (37.88) and
Fraction (27.21) are greater than 10.


            Coefficients   Standard Error   t Stat   P-value   VIF
Intercept   141.93         49.98            2.84     0.01
Age         -1.05          0.35             -3.01    0.01      3.56
Years       2.28           0.86             2.63     0.01      37.88
Weight      1.35           0.44             3.05     0.00      4.95
Height      -0.04          0.04             -0.97    0.34      1.92
Chin        -0.84          0.76             -1.10    0.28      2.15
Forearm     -1.09          1.21             -0.90    0.38      3.84
Calf        -0.23          0.55             -0.42    0.68      2.51
Pulse       0.12           0.17             0.67     0.51      1.33
Diastol     0.10           0.15             0.67     0.51      1.53
Fraction    -108.36        32.17            -3.37    0.00      27.21
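
A VIF equals 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors. As a rough illustration, the sketch below inverts that formula for three of the VIF values reported in the table; the helper names are hypothetical and the cutoff of 10 is the one used in this answer.

```python
def vif_from_r2(r2_j: float) -> float:
    """VIF of predictor j given the R^2 of regressing it on the other predictors."""
    return 1.0 / (1.0 - r2_j)


def r2_from_vif(vif: float) -> float:
    """Invert the VIF formula: share of predictor j explained by the other predictors."""
    return 1.0 - 1.0 / vif


# VIF values copied from the table; only a subset is shown
reported_vif = {"Years": 37.88, "Fraction": 27.21, "Weight": 4.95}
for name, vif in reported_vif.items():
    flag = "VIF > 10" if vif > 10 else "ok"
    print(f"{name}: VIF={vif:.2f}, implied R_j^2={r2_from_vif(vif):.3f} ({flag})")
```

Read this way, a VIF of 37.88 means that roughly 97% of the variation in Years is explained by the other predictors.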

b) Obtain a reduced model that has sufficient explanatory power. Use both stepwise regression
procedure and best subset regression. Is the model selected via stepwise regression the same as
the one selected by best subset regression? Use the C(p) criterion for model selection. (5 points)
Best Subset Regression
Response is Systol

Vars   R-Sq   R-Sq(adj)   R-Sq(pred)   Mallows Cp   S
1      27.2   25.2        20.7         27.3         11.338
1      22.6   20.5        17.1         31.2         11.690
2      47.3   44.4        37.6         12.1         9.7772
2      42.1   38.9        30.3         16.5         10.251
3      51.8   47.7        39.8         10.2         9.4833
3      50.3   46.1        38.6         11.5         9.6273
4      59.7   55.0        44.8         5.4          8.7946
4      53.7   48.2        39.8         10.6         9.4348
5      63.9   58.4        45.6         3.9          8.4571
5      63.1   57.6        44.2         4.5          8.5417
6      64.9   58.3        43.3         5.0          8.4663
6      64.9   58.3        46.2         5.0          8.4681
7      66.1   58.4        42.6         6.0          8.4556
7      65.5   57.7        41.3         6.5          8.5220
8      66.6   57.7        39.9         7.5          8.5228
8      66.5   57.5        39.3         7.7          8.5464
9      67.1   56.8        36.0         9.2          8.6139
9      66.7   56.4        34.4         9.4          8.6554
10     67.3   55.6        29.8         11.0         8.7391

[The X marks showing which of Age, Years, Weight, Height, Chin, Forearm, Calf, Pulse, Diastol
and Fraction enter each model are not legible in this copy.]


I chose the first 5-variable model: it has the lowest Mallows Cp (3.9) and reaches the best
R-Sq(adj), 58.4%, with the smallest number of variables. The predictors are Age, Years, Weight,
Chin and Fraction.
The prediction equation is:
Systol = 109.36 - 1.01 Age - 1.19 Chin - 110.81 Fraction + 1.10 Weight + 2.41 Year
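
The Mallows Cp values in the best subset output can be reproduced from the S column. The sketch below is an illustration under two assumptions: n = 39 (Total DF of 38 plus one) and the 10-variable model as the full model. The function name is invented for this example.

```python
def mallows_cp(s_sub: float, s_full: float, n: int, n_vars: int) -> float:
    """Cp = SSE_sub / MSE_full - n + 2p, with p counting the intercept."""
    p = n_vars + 1                      # parameters in the candidate model
    sse_sub = s_sub ** 2 * (n - p)      # S^2 is SSE / (n - p), so undo the division
    mse_full = s_full ** 2              # error mean square of the full model
    return sse_sub / mse_full - n + 2 * p


N = 39            # assumed sample size
S_FULL = 8.7391   # S of the 10-variable model

cp_five = mallows_cp(8.4571, S_FULL, N, n_vars=5)    # chosen best subset model
cp_three = mallows_cp(9.4833, S_FULL, N, n_vars=3)   # 3-variable model with S = 9.4833
print(round(cp_five, 1), round(cp_three, 1))
```

Both values agree with the table (3.9 and 10.2), which supports reading Cp as a comparison of each subset's error against the full model's error variance.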
Stepwise Regression
Stepwise Selection of Terms
Candidate terms: Age, Years, Weight, Height, Chin, Forearm, Calf, Pulse, Diastol, Fraction

              ----Step 1----     ----Step 2----     ----Step 3----
              Coef      P        Coef      P        Coef      P
Constant      66.6               60.9               52.1
Weight        0.963     0.001    1.217     0.000    1.022     0.000
Fraction                         -26.77    0.001    -24.32    0.002
Diastol                                             0.264     0.079

S             11.3376            9.77719            9.48331
R-sq          27.18%             47.31%             51.81%
R-sq(adj)     25.21%             44.38%             47.68%
R-sq(pred)    20.66%             37.55%             39.76%
Mallows Cp    27.27              12.06              10.21

Alpha to enter = 0.15, alpha to remove = 0.15

Analysis of Variance
Source       DF   Adj SS   Adj MS    F-Value   P-Value
Regression   3    3383.8   1127.93   12.54     0.000
Weight       1    1493.4   1493.40   16.61     0.000
Diastol      1    293.7    293.71    3.27      0.079
Fraction     1    1046.5   1046.52   11.64     0.002
Error        35   3147.7   89.93
Total        38   6531.4


Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
9.48331   51.81%   47.68%      39.76%

Coefficients
Term       Coef     SE Coef   T-Value   P-Value   VIF
Constant   52.1     14.7      3.55      0.001
Weight     1.022    0.251     4.08      0.000     1.34
Diastol    0.264    0.146     1.81      0.079     1.23
Fraction   -24.32   7.13      -3.41     0.002     1.13

Regression Equation
Systol = 52.1 + 1.022 Weight + 0.264 Diastol - 24.32 Fraction
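
The R-sq(adj) figures follow from R-sq, the sample size and the number of predictors. A minimal sketch, assuming n = 39 (Total DF of 38 plus one) and using an illustrative function name:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors (plus intercept) and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


adj_stepwise = adjusted_r2(0.5181, n=39, k=3)    # stepwise model: Weight, Diastol, Fraction
adj_best_subset = adjusted_r2(0.639, n=39, k=5)  # 5-variable best subset model
print(f"{adj_stepwise:.4f} {adj_best_subset:.4f}")
```

The two results round to 47.68% and 58.4%, the values shown in the output, and make it visible that the penalty for extra predictors is noticeable with only 39 observations.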
Fits and Diagnostics for Unusual Observations
Obs   Systol   Fit      Resid   Std Resid
1     170.00   143.62   26.38   3.05        R
4     148.00   145.14   2.86    0.43        X
39    152.00   146.28   5.72    0.74        X

R  Large residual
X  Unusual X

The model selected by stepwise regression is not the same as the one selected by best subset
regression.
Stepwise regression: Systol = 52.1 + 1.022 Weight + 0.264 Diastol - 24.32 Fraction
Best subset regression: Systol = 109.36 - 1.01 Age - 1.19 Chin - 110.81 Fraction + 1.10 Weight
+ 2.41 Year
The stepwise model has 3 predictors, while the best subset model has 5.
c) Assess if the reduced model (obtained via best subset regression) satisfies all the assumptions.
Particularly test the hypothesis of homogeneous variance and normality of the random errors. (5
points)
Multiple R: 0.799
R Squared: 0.639
Adjusted R Square: 0.584
Standard Error: 8.457

            Coefficients    Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   109.3588779     21.48427         5.090184   1.41E-05   65.6488     153.069
Age         -1.012029491    0.305936         -3.30798   0.002277   -1.63446    -0.3896
Years       2.40674643      0.742638         3.240808   0.002723   0.895839    3.917654
Weight      1.097639545     0.298013         3.683192   0.000819   0.491327    1.703952
Chin        -1.191791279    0.613983         -1.94108   0.060829   -2.44095    0.057367
Fraction    -110.8110438    27.27949         -4.06206   0.000282   -166.312    -55.3105
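
Each t statistic and 95% interval in the coefficient table is derived from the coefficient and its standard error. The sketch below works through the Age row; the critical value 2.0345 for 33 error degrees of freedom (39 observations minus 6 parameters) is an assumption taken from a standard t table, and the function name is illustrative.

```python
T_CRIT_33 = 2.0345  # assumed two-sided 95% critical value of t with 33 df


def t_and_ci(coef: float, se: float, t_crit: float):
    """Return the t statistic and the confidence limits coef +/- t_crit * se."""
    t_stat = coef / se
    half_width = t_crit * se
    return t_stat, coef - half_width, coef + half_width


# Age row: coefficient -1.012029491, standard error 0.305936
t_age, lo_age, hi_age = t_and_ci(-1.012029491, 0.305936, T_CRIT_33)
print(f"t = {t_age:.3f}, 95% CI = ({lo_age:.4f}, {hi_age:.4f})")
```

The interval excludes zero, which matches the Age p-value below 0.01. Repeating the calculation for Chin gives an interval that includes zero, in line with its p-value of about 0.06.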

The R-squared value, 63.9%, indicates the proportion of the variance of Systol that is explained
by the regression model.
Thus Age, Chin, Fraction, Weight and Year together explain a very highly significant proportion
of the variation in Systol, based on the F test (p<0.001).
The standard error of estimate, 8.457, indicates the typical size of errors made in predicting
Systol using the regression model.
Holding the other X variables constant, we estimate the change in Systol associated with a
1-unit increase in each predictor:
Age: -1.012. This is highly significant (p<0.01).
Chin: -1.192. This is not significant (p>0.05).
Fraction: -110.811. This is very highly significant (p<0.001).
Weight: 1.098. This is very highly significant (p<0.001).
Year: 2.407. This is highly significant (p<0.01).

[Figure: scatter plots of Systol against Age, Fraction, Year, Chin and Weight]

The model does not satisfy all assumptions.
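
The two hypotheses named in the question can be checked numerically from the residuals and fitted values of the reduced model. The sketch below is one possible approach, not the method used in this answer: a Jarque-Bera statistic for normality and a Koenker-style (studentized) Breusch-Pagan statistic that regresses squared residuals on the fitted values. The toy numbers are placeholders, not values from the dataset, and the chi-square cutoffs are standard 5% table values.

```python
def jarque_bera(residuals):
    """JB = n/6 * (skew^2 + excess_kurtosis^2 / 4); approx. chi-square with 2 df."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)


def breusch_pagan_fitted(residuals, fitted):
    """LM = n * R^2 from regressing squared residuals on fitted values (1 df)."""
    n = len(residuals)
    squared = [r * r for r in residuals]
    mean_x = sum(fitted) / n
    mean_y = sum(squared) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fitted, squared))
    sxx = sum((x - mean_x) ** 2 for x in fitted)
    syy = sum((y - mean_y) ** 2 for y in squared)
    r2 = sxy ** 2 / (sxx * syy)   # R^2 of a simple linear regression
    return n * r2


CHI2_CRIT_1DF = 3.841  # 5% critical value, chi-square with 1 df
CHI2_CRIT_2DF = 5.991  # 5% critical value, chi-square with 2 df

# Placeholder residuals and fitted values, for illustration only
toy_fitted = [1.0, 2.0, 3.0, 4.0]
toy_resid = [1.0, -1.0, 2.0, -2.0]

jb = jarque_bera(toy_resid)
bp = breusch_pagan_fitted(toy_resid, toy_fitted)
print("reject normality:", jb > CHI2_CRIT_2DF)
print("reject constant variance:", bp > CHI2_CRIT_1DF)
```

With only 39 observations the chi-square approximation for the Jarque-Bera statistic is rough, so a normal probability plot of the residuals is a sensible companion to the test.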


d) Fit a model with all first order interactions and comment on the significance of the interaction
effects. (5 points)
Analysis of Variance
Source        DF   Adj SS    Adj MS    F-Value   P-Value
Regression    11   4401.81   400.165   5.07      0.000
Age           1    702.30    702.298   8.90      0.006
Years         1    521.50    521.503   6.61      0.016
Weight        1    675.19    675.186   8.56      0.007
Height        1    80.17     80.166    1.02      0.322
Chin          1    98.18     98.176    1.24      0.274
Forearm       1    42.93     42.932    0.54      0.467
Calf          1    19.70     19.695    0.25      0.621
Pulse         1    38.35     38.353    0.49      0.492
Diastol       1    35.04     35.036    0.44      0.511
Fraction      1    850.99    850.985   10.79     0.003
Interaction   1    8.79      8.792     0.11      0.741
Error         27   2129.62   78.875
Total         38   6531.44

Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
8.88115   67.39%   54.11%      0.00%

Coefficients
Term          Coef        SE Coef    T-Value   P-Value   VIF
Constant      146.0       52.2       2.80      0.009
Age           -1.060      0.355      -2.98     0.006     3.60
Years         2.371       0.922      2.57      0.016     41.74
Weight        1.411       0.482      2.93      0.007     5.65
Height        -0.0409     0.0406     -1.01     0.322     2.20
Chin          -0.872      0.782      -1.12     0.274     2.18
Forearm       -0.95       1.29       -0.74     0.467     4.25
Calf          -0.298      0.596      -0.50     0.621     2.83
Pulse         0.123       0.176      0.70      0.492     1.35
Diastol       0.102       0.152      0.67      0.511     1.53
Fraction      -111.4      33.9       -3.28     0.003     29.25
Interaction   -0.000000   0.000000   -0.33     0.741     3.18

Regression Equation
Systol = 146.0 - 1.060 Age + 2.371 Years + 1.411 Weight - 0.0409 Height - 0.872 Chin
- 0.95 Forearm - 0.298 Calf + 0.123 Pulse + 0.102 Diastol - 111.4 Fraction
- 0.000000 Interaction
Fits and Diagnostics for Unusual Observations
Obs   Systol   Fit      Resid   Std Resid
1     170.00   155.37   14.63   2.38        R
8     108.00   95.53    12.47   2.13        R
39    152.00   151.24   0.76    0.86        X

R  Large residual
X  Unusual X
The interaction effect is not significant (p = 0.741).
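
For reference, the full set of first order (two-way) interactions among the 10 predictors can be enumerated as pairwise products. The sketch below only builds the terms; the helper names are hypothetical and the predictor names are taken from the output above.

```python
from itertools import combinations

PREDICTORS = ["Age", "Years", "Weight", "Height", "Chin",
              "Forearm", "Calf", "Pulse", "Diastol", "Fraction"]


def interaction_terms(names):
    """Labels of all two-way interaction terms, one per unordered pair."""
    return [f"{a}*{b}" for a, b in combinations(names, 2)]


def add_interactions(row, names):
    """Copy one observation (a dict of predictor values) and append pairwise products."""
    out = dict(row)
    for a, b in combinations(names, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out


terms = interaction_terms(PREDICTORS)
print(len(terms))  # 45 pairwise terms for 10 predictors
```

With 10 main effects, 45 interactions and an intercept there would be 56 parameters, more than the 39 observations implied by Total DF = 38, so the complete set could not be estimated in a single model; interactions would have to be screened in smaller groups.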

e) Split the dataset into 2 parts. The training set contains (approx.) 80% of samples. The test set
contains remaining 20% samples. Fit the full main effects model on the training set and predict
the values of Systol for test set. Obtain the mean square prediction error.
Now obtain a reduced model using stepwise regression on the training set. Use these parameter
estimates to predict the value of Systol in the test set. Obtain the mean square prediction error
for this reduced model. Compare the predictive performance of the full model and the reduced
model. Which one shows better predictive performance? Why? (5 points)
Multiple regression analysis to predict Systol from training set predictors (Age, Years, Weight,
Height, Chin, Forearm, Calf, Pulse)
The prediction equation is:
Systol = 127.405 - 0.275 Age + 0.245 Calf - 1.338 Chin - 1.060 Forearm - 0.067 Height + 0.034
Pulse + 2.114 Weight - 0.574 Year
R squared: 0.4998
Standard error of estimate: 10.4351
F statistic: 3.7477
p value: 0.0038

Term       Coef       95% Lower CI   95% Upper CI   StdErr    t         p        Significant?
Constant   127.4051   7.4804         247.3299       58.7212   2.1697    0.0381   Yes (p<0.05)
Age        -0.2749    -0.8686        0.3187         0.2907    -0.9458   0.3518   No (p>0.05)
Calf       0.2447     -1.0521        1.5415         0.6350    0.3853    0.7027   No (p>0.05)
Chin       -1.3384    -3.1450        0.4682         0.8846    -1.5130   0.1407   No (p>0.05)
Forearm    -1.0604    -3.9976        1.8768         1.4382    -0.7373   0.4667   No (p>0.05)
Height     -0.0673    -0.1556        0.0210         0.0432    -1.5560   0.1302   No (p>0.05)
Pulse      0.0336     -0.3828        0.4499         0.2039    0.1647    0.8703   No (p>0.05)
Weight     2.1145     1.1541         3.0748         0.4702    4.4966    0.0001   Yes (p<0.001)
Year       -0.5736    -1.0364        -0.1108        0.2266    -2.5311   0.0168   Yes (p<0.05)

The R-squared value, 50.0%, indicates the proportion of the variance of Systol that is explained
by the regression model.
Thus training set predictors together explain a highly significant proportion of the variation in
Systol, based on the F test (p<0.01).
The standard error of estimate, 10.4351, indicates the typical size of errors made in predicting
Systol using the regression model.
Holding the other X variables constant, we estimate the change in Systol associated with a
1-unit increase in each predictor:
Age: -0.2749. This is not significant (p>0.05).
Calf: 0.2447. This is not significant (p>0.05).
Chin: -1.3384. This is not significant (p>0.05).
Forearm: -1.0604. This is not significant (p>0.05).
Height: -0.0673. This is not significant (p>0.05).
Pulse: 0.0336. This is not significant (p>0.05).
Weight: 2.1145. This is very highly significant (p<0.001).
Year: -0.5736. This is significant (p<0.05).
             df   SS         MS         F          Significance F
Regression   8    3264.72    408.09     3.747709   0.003783
Residual     30   3266.716   108.8905
Total        38   6531.436

Stepwise Regression
Stepwise Selection of Terms
Candidate terms: Age, Years, Weight, Height, Chin, Forearm, Calf, Pulse

              ----Step 1----     ----Step 2----
              Coef      P        Coef      P
Constant      66.6               50.3
Weight        0.963     0.001    1.354     0.000
Years                            -0.572    0.004

S             11.3376            10.2512
R-sq          27.18%             42.08%
R-sq(adj)     25.21%             38.86%
R-sq(pred)    20.66%             30.35%
Mallows Cp    8.68               1.74

Alpha to enter = 0.15, alpha to remove = 0.15


Analysis of Variance
Source       DF   Adj SS   Adj MS   F-Value   P-Value
Regression   2    2748.3   1374.1   13.08     0.000
Years        1    972.9    972.9    9.26      0.004
Weight       1    2698.3   2698.3   25.68     0.000
Error        36   3783.2   105.1
Total        38   6531.4

Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
10.2512   42.08%   38.86%      30.35%

Coefficients
Term       Coef     SE Coef   T-Value   P-Value   VIF
Constant   50.3     15.8      3.18      0.003
Years      -0.572   0.188     -3.04     0.004     1.30
Weight     1.354    0.267     5.07      0.000     1.30

Regression Equation
Systol = 50.3 - 0.572 Years + 1.354 Weight
Fits and Diagnostics for Unusual Observations
Obs   Systol   Fit      Resid   Std Resid
1     170.00   145.89   24.11   2.60        R
38    132.00   120.52   11.48   1.28        X
39    152.00   145.25   6.75    0.82        X

R  Large residual
X  Unusual X

Regression Equation:
Systol = 50.3 - 0.572 Years + 1.354 Weight
Using Years = 1 and Weight = 71:
Systol = 50.3 - 0.572 * 1 + 1.354 * 71 = 145.862
Mean square prediction error for reduced model is 10.2512

Mean square prediction error for full model is 10.4351


The reduced model shows better predictive performance: its error (10.2512) is smaller than the
full model's (10.4351), so its predicted values are closer to the observed values.
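
The mean square prediction error is the average squared difference between observed and predicted Systol over the held-out test rows. The sketch below shows one way to set this up, assuming 39 observations and a random 80/20 split; the function names and the seed are illustrative, and the coefficients are those of the reduced regression equation above.

```python
import random


def train_test_split(n, train_frac=0.8, seed=0):
    """Shuffle row indices 0..n-1 and split them into training and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(train_frac * n)
    return sorted(idx[:n_train]), sorted(idx[n_train:])


def predict_reduced(years, weight):
    """Reduced model from the stepwise output: Systol = 50.3 - 0.572 Years + 1.354 Weight."""
    return 50.3 - 0.572 * years + 1.354 * weight


def mspe(observed, predicted):
    """Mean square prediction error over paired observed and predicted values."""
    return sum((y - f) ** 2 for y, f in zip(observed, predicted)) / len(observed)


train_idx, test_idx = train_test_split(39)
print(len(train_idx), len(test_idx))  # 31 training rows, 8 test rows

fit_example = predict_reduced(years=1, weight=71)
print(round(fit_example, 3))
```

Computing `mspe` on the test rows for both the full and the reduced model, with coefficients estimated on the training rows only, gives the out-of-sample comparison the question asks for; it is in squared units of Systol, unlike the standard error S.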
