https://onlinecourses.science.psu.edu/stat501/print/book/export/html/343

Published on STAT 501 (https://onlinecourses.science.psu.edu/stat501)



Lesson 12: Multicollinearity & Other Regression Pitfalls
Introduction

So far, in our study of multiple regression models, we have ignored something that we
probably shouldn't have — and that's what is called multicollinearity. We're going to correct
our blissful ignorance in this lesson.

Multicollinearity exists when two or more of the predictors in a regression model are
moderately or highly correlated. Unfortunately, when it exists, it can wreak havoc on our
analysis and thereby limit the research conclusions we can draw. As we will soon learn, when
multicollinearity exists, any of the following pitfalls can be exacerbated:

the estimated regression coefficient of any one variable depends on which other
predictors are included in the model
the precision of the estimated regression coefficients decreases as more predictors are
added to the model
the marginal contribution of any one predictor variable in reducing the error sum of
squares depends on which other predictors are already in the model
hypothesis tests for βk = 0 may yield different conclusions depending on which
predictors are in the model

In this lesson, we'll take a look at an example or two that illustrates each of the above
outcomes. Then, we'll spend some time learning how not only to detect multicollinearity, but
also how to reduce it once we've found it.

We'll also consider other regression pitfalls, including extrapolation, nonconstant variance,
autocorrelation, overfitting, excluding important predictor variables, missing data, and power
and sample size.

Learning objectives and outcomes

Upon completing this lesson, you should be able to do the following:

Distinguish between structural multicollinearity and data-based multicollinearity.


Know what multicollinearity means.
Understand the effects of multicollinearity on various aspects of regression analyses.
Understand the effects of uncorrelated predictors on various aspects of regression
analyses.
Understand variance inflation factors, and how to use them to help detect
multicollinearity.
Know the two ways of reducing data-based multicollinearity.
Understand how centering the predictors in a polynomial regression model helps to reduce structural multicollinearity.


Know the main issues surrounding other regression pitfalls, including extrapolation,
nonconstant variance, autocorrelation, overfitting, excluding important predictor
variables, missing data, and power and sample size.

12.1 - What is Multicollinearity?


As stated in the lesson overview, multicollinearity exists whenever two or more of the
predictors in a regression model are moderately or highly correlated. Now, you might be
wondering why a researcher can't just collect data in such a way as to ensure that the predictors aren't highly correlated. Then, multicollinearity wouldn't be a problem, and we wouldn't have to bother with this silly lesson.

Unfortunately, researchers often can't control the predictors. Obvious examples include a
person's gender, race, grade point average, math SAT score, IQ, and starting salary. For each
of these predictor examples, the researcher just observes the values as they occur for the
people in her random sample.

Multicollinearity happens more often than not in such observational studies. And,
unfortunately, regression analyses most often take place on data obtained from observational
studies. If you aren't convinced, consider the example data sets for this course. Most of the
data sets were obtained from observational studies, not experiments. It is for this reason that
we need to fully understand the impact of multicollinearity on our regression analyses.

Types of multicollinearity

There are two types of multicollinearity:

Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors — such as creating the predictor x² from the predictor x.
Data-based multicollinearity, on the other hand, is a result of a poorly designed
experiment, reliance on purely observational data, or the inability to manipulate the
system on which the data are collected.

In the case of structural multicollinearity, the multicollinearity is induced by what you have
done. Data-based multicollinearity is the more troublesome of the two types of
multicollinearity. Unfortunately it is the type we encounter most often!

Example

Let's take a quick look at an example in which data-based multicollinearity exists. Some researchers observed — notice the choice of word! — the following data (bloodpress.txt [1]) on 20 individuals with high blood pressure:

blood pressure (y = BP, in mm Hg)


age (x1 = Age, in years)
weight (x2 = Weight, in kg)
body surface area (x3 = BSA, in sq m)
duration of hypertension (x4 = Dur, in years)


basal pulse (x5 = Pulse, in beats per minute)
stress index (x6 = Stress)

The researchers were interested in determining if a relationship exists between blood pressure and age, weight, body surface area, duration, pulse rate, and/or stress level.

The matrix plot of BP, Age, Weight, and BSA:

and the matrix plot of BP, Dur, Pulse, and Stress:

allow us to investigate the various marginal relationships between the response BP and the
predictors. Blood pressure appears to be related fairly strongly to Weight and BSA, and hardly
related at all to Stress level.

The matrix plots also allow us to investigate whether or not relationships exist among the
predictors. For example, Weight and BSA appear to be strongly related, while Stress and BSA
appear to be hardly related at all.

The following correlation matrix:

provides further evidence of the above claims. Blood pressure appears to be related fairly
strongly to Weight (r = 0.950) and BSA (r = 0.866), and hardly related at all to Stress level (r =
0.164). And, Weight and BSA appear to be strongly related (r = 0.875), while Stress and BSA
appear to be hardly related at all (r = 0.018). The high correlation among some of the
predictors suggests that data-based multicollinearity exists.

Now, what we need to learn is the impact of the multicollinearity on regression analysis. Let's
go do it!
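The lesson works in Minitab; as an illustrative aside, here is a minimal Python/pandas sketch of the same marginal summaries, assuming bloodpress.txt is a whitespace-delimited file whose columns are named BP, Age, Weight, BSA, Dur, Pulse, and Stress:

import pandas as pd
import matplotlib.pyplot as plt

bp = pd.read_csv("bloodpress.txt", sep=r"\s+")  # assumed file layout

# Pairwise correlations among the response and the six predictors,
# analogous to the correlation matrix cited above (e.g., r = 0.950 for BP and Weight).
print(bp[["BP", "Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]].corr().round(3))

# A scatter-plot matrix of BP, Age, Weight, and BSA, analogous to the first matrix plot.
pd.plotting.scatter_matrix(bp[["BP", "Age", "Weight", "BSA"]])
plt.show()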

12.2 - Uncorrelated Predictors


In order to get a handle on this multicollinearity thing, let's first investigate the effects that
uncorrelated predictors have on regression analyses. To do so, we'll investigate a "contrived"
data set, in which the predictors are perfectly uncorrelated. Then, we'll investigate a second
example of a "real" data set, in which the predictors are nearly uncorrelated. Our two
investigations will allow us to summarize the effects that uncorrelated predictors have on
regression analyses.

Then, on the next page, we'll investigate the effects that highly correlated predictors have on
regression analyses. In doing so, we'll learn — and therefore be able to summarize — the
various effects multicollinearity has on regression analyses.

What is the effect on regression analyses if the predictors are perfectly uncorrelated?

Consider the following matrix plot of the response y and two predictors x1 and x2, of a
contrived data set (uncorrpreds.txt [2]), in which the predictors are perfectly uncorrelated:


As you can see, there is no apparent relationship at all between the predictors x1 and x2. That
is, the correlation between x1 and x2 is zero:

suggesting the two predictors are perfectly uncorrelated.

Now, let's just proceed quickly through the output of a series of regression analyses collecting
various pieces of information along the way. When we're done, we'll review what we learned
by collating the various items in a summary table.

The regression of the response y on the predictor x1:


yields the estimated coefficient b1 = -1.00, the standard error se(b1) = 1.47, and the
regression sum of squares SSR(x1) = 8.000.

The regression of the response y on the predictor x2:

yields the estimated coefficient b2 = -1.75, the standard error se(b2) = 1.35, and the
regression sum of squares SSR(x2) = 24.50.

The regression of the response y on the predictors x1 and x2 (in that order):

yields the estimated coefficients b1 = -1.00 and b2 = -1.75, the standard errors se(b1) = 1.41
and se(b2) = 1.41, and the sequential sum of squares SSR(x2|x1) = 24.500.

The regression of the response y on the predictors x2 and x1 (in that order):


yields the estimated coefficients b1 = -1.00 and b2 = -1.75, the standard errors se(b1) = 1.41
and se(b2) = 1.41, and the sequential sum of squares SSR(x1|x2) = 8.000.

Okay — as promised — compiling the results in a summary table, we obtain:

Model                b1      se(b1)   b2      se(b2)   Seq SS
x1 only              -1.00   1.47     ---     ---      SSR(x1) = 8.000
x2 only              ---     ---      -1.75   1.35     SSR(x2) = 24.50
x1, x2 (in order)    -1.00   1.41     -1.75   1.41     SSR(x2|x1) = 24.500
x2, x1 (in order)    -1.00   1.41     -1.75   1.41     SSR(x1|x2) = 8.000

What do we observe?

The estimated slope coefficients b1 and b2 are the same regardless of the model used.
The standard errors se(b1) and se(b2) don't change much at all from model to model.
The sum of squares SSR(x1) is the same as the sequential sum of squares SSR(x1|x2).
The sum of squares SSR(x2) is the same as the sequential sum of squares SSR(x2|x1).

These all seem to be good things! Because the slope estimates stay the same, the effect on
the response ascribed to a predictor doesn't depend on the other predictors in the model.
Because SSR(x1) = SSR(x1|x2), the marginal contribution that x1 has in reducing the
variability in the response y doesn't depend on the predictor x2. Similarly, because SSR(x2) =
SSR(x2|x1), the marginal contribution that x2 has in reducing the variability in the response y
doesn't depend on the predictor x1.
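If you would like to verify these equalities outside Minitab, a rough Python/statsmodels sketch follows, assuming uncorrpreds.txt is whitespace-delimited with columns named y, x1, and x2:

import pandas as pd
import statsmodels.formula.api as smf

d = pd.read_csv("uncorrpreds.txt", sep=r"\s+")  # assumed file layout

fit1 = smf.ols("y ~ x1", data=d).fit()        # y on x1 only
fit2 = smf.ols("y ~ x2", data=d).fit()        # y on x2 only
fit12 = smf.ols("y ~ x1 + x2", data=d).fit()  # y on both predictors

print(fit1.params["x1"], fit12.params["x1"])  # b1 is the same in both models
print(fit2.params["x2"], fit12.params["x2"])  # so is b2

# Sequential sums of squares as reductions in the error sum of squares.
# Note: in statsmodels, .ess is the regression (explained) SS and .ssr is the residual SS.
print("SSR(x1)    =", fit1.ess)
print("SSR(x1|x2) =", fit2.ssr - fit12.ssr)
print("SSR(x2)    =", fit2.ess)
print("SSR(x2|x1) =", fit1.ssr - fit12.ssr)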

These are the things we can hope for in a regression analysis — but then reality sets in! Recall
that we obtained the above results for a contrived data set, in which the predictors are
perfectly uncorrelated. Do we get similar results for real data with only nearly uncorrelated
predictors? Let's see!

What is the effect on regression analyses if the predictors are nearly uncorrelated?

To investigate this question, let's go back and take a look at the blood pressure data set
(bloodpress.txt [1]). In particular, let's focus on the relationships among the response y = BP
and the predictors x3 = BSA and x6 = Stress:

As the above matrix plot and the following correlation matrix suggest:

there appears to be a strong relationship between y = BP and the predictor x3 = BSA (r = 0.866), a weak relationship between y = BP and x6 = Stress (r = 0.164), and an almost nonexistent relationship between x3 = BSA and x6 = Stress (r = 0.018). That is, the two predictors are nearly perfectly uncorrelated.

What effect do these nearly perfectly uncorrelated predictors have on regression analyses?
Let's proceed similarly through the output of a series of regression analyses collecting various
pieces of information along the way. When we're done, we'll review what we learned by
collating the various items in a summary table.

The regression of the response y = BP on the predictor x6= Stress:

yields the estimated coefficient b6 = 0.0240, the standard error se(b6) = 0.0340, and the
regression sum of squares SSR(x6) = 15.04.

The regression of the response y = BP on the predictor x3 = BSA:

yields the estimated coefficient b3 = 34.44, the standard error se(b3) = 4.69, and the
regression sum of squares SSR(x3) = 419.858.

The regression of the response y = BP on the predictors x6= Stress and x3 = BSA (in that
order):

yields the estimated coefficients b6 = 0.0217 and b3 = 34.33, the standard errors se(b6) = 0.0170 and se(b3) = 4.61, and the sequential sum of squares SSR(x3|x6) = 417.07.

Finally, the regression of the response y = BP on the predictors x3 = BSA and x6= Stress
(in that order):

yields the estimated coefficients b6 = 0.0217 and b3 = 34.33, the standard errors se(b6) = 0.0170 and se(b3) = 4.61, and the sequential sum of squares SSR(x6|x3) = 12.26.

Again — as promised — compiling the results in a summary table, we obtain:

Model                b6       se(b6)   b3      se(b3)   Seq SS
x6 only              0.0240   0.0340   ---     ---      SSR(x6) = 15.04
x3 only              ---      ---      34.44   4.69     SSR(x3) = 419.858
x6, x3 (in order)    0.0217   0.0170   34.33   4.61     SSR(x3|x6) = 417.07
x3, x6 (in order)    0.0217   0.0170   34.33   4.61     SSR(x6|x3) = 12.26

What do we observe? If the predictors are nearly perfectly uncorrelated:

The slope estimates b3 and b6 are not identical, but they are very similar, regardless of which predictors are in the model.
The sum of squares SSR(x3) is not identical to, but is very similar to, the sequential sum of squares SSR(x3|x6).
The sum of squares SSR(x6) is not identical to, but is very similar to, the sequential sum of squares SSR(x6|x3).

Again, these are all good things! In short, the effect on the response ascribed to a predictor is
similar regardless of the other predictors in the model. And, the marginal contribution of a
predictor doesn't appear to depend much on the other predictors in the model.

PRACTICE PROBLEMS: Uncorrelated predictors

Effect of perfectly uncorrelated predictor variables.

This exercise reviews the benefits of having perfectly uncorrelated predictor variables.
The results of this exercise demonstrate a strong argument for conducting "designed
experiments" in which the researcher sets the levels of the predictor variables in
advance, as opposed to conducting an "observational study" in which the researcher
merely observes the levels of the predictor variables as they happen. Unfortunately,
many regression analyses are conducted on observational data rather than
experimental data, limiting the strength of the conclusions that can be drawn from the
data. As this exercise demonstrates, you should conduct an experiment, whenever
possible, not an observational study. Use the (contrived) data stored in uncorrelated.txt
[3] to complete this lab exercise.

1. Using the Stat >> Basic Statistics >> Correlation... command in Minitab, calculate the
correlation coefficient between X1 and X2. Are the two variables perfectly uncorrelated?

(CHECK YOUR ANSWER)

2. Fit the simple linear regression model with y as the response and x1 as the single
predictor:

What is the value of the estimated slope coefficient b1?


What is the regression sum of squares, SSR (X1), when x1 is the only predictor in
the model?

(CHECK YOUR ANSWER)

3. Now, fit the simple linear regression model with y as the response and x2 as the
single predictor:

What is the value of the estimated slope coefficient b2?


What is the regression sum of squares, SSR (X2), when x2 is the only predictor in
the model?

(CHECK YOUR ANSWER)

4. Now, fit the multiple linear regression model with y as the response and x1 as the first
predictor and x2 as the second predictor:

What is the value of the estimated slope coefficient b1? Is the estimate b1 different
than that obtained when x1 was the only predictor in the model?
What is the value of the estimated slope coefficient b2? Is the estimate b2 different
than that obtained when x2 was the only predictor in the model?
What is the sequential sum of squares, SSR (X2|X1)? Does the reduction in the
error sum of squares when x2 is added to the model depend on whether x1 is
already in the model?

(CHECK YOUR ANSWER)

5. Now, fit the multiple linear regression model with y as the response and x2 as the first
predictor and x1 as the second predictor:

What is the sequential sum of squares, SSR (X1|X2)? Does the reduction in the
error sum of squares when x1 is added to the model depend on whether x2 is
already in the model?

(CHECK YOUR ANSWER)

6. When the predictor variables are perfectly uncorrelated, is it possible to quantify the
effect a predictor has on the response without regard to the other predictors?

(CHECK YOUR ANSWER)

7. In what way does this exercise demonstrate the benefits of conducting a designed
experiment rather than an observational study?

(CHECK YOUR ANSWER)

12.3 - Highly Correlated Predictors


Okay, so we've learned about all of the good things that can happen when predictors are
perfectly or nearly perfectly uncorrelated. Now, let's discover the bad things that can happen
when predictors are highly correlated.

What happens if the predictor variables are highly correlated?



Let's return again to the blood pressure data set (bloodpress.txt [1]). This time, let's focus,
however, on the relationships among the response y = BP and the predictors x2 = Weight and
x3 = BSA:

As the matrix plot and the following correlation matrix suggest:

there appears to be not only a strong relationship between y = BP and x2 = Weight (r = 0.950)
and a strong relationship between y = BP and the predictor x3 = BSA (r = 0.866), but also a
strong relationship between the two predictors x2 = Weight and x3 = BSA (r = 0.875).
Incidentally, it shouldn't be too surprising that a person's weight and body surface area are
highly correlated.

What impact does the strong correlation between the two predictors have on the regression
analysis and the subsequent conclusions we can draw? Let's proceed as before by reviewing
the output of a series of regression analyses and collecting various pieces of information
along the way. When we're done, we'll review what we learned by collating the various items
in a summary table.

The regression of the response y = BP on the predictor x2= Weight:


yields the estimated coefficient b2 = 1.2009, the standard error se(b2) = 0.0930, and the
regression sum of squares SSR(x2) = 505.472.

The regression of the response y = BP on the predictor x3= BSA:

yields the estimated coefficient b3 = 34.44, the standard error se(b3) = 4.69, and the
regression sum of squares SSR(x3) = 419.858.

The regression of the response y = BP on the predictors x2= Weight and x3 = BSA (in that
order):

yields the estimated coefficients b2 = 1.039 and b3 = 5.83, the standard errors se(b2) = 0.193
and se(b3) = 6.06, and the sequential sum of squares SSR(x3|x2) = 2.814.

And finally, the regression of the response y = BP on the predictors x3= BSA and x2= Weight
(in that order):


yields the estimated coefficients b2 = 1.039 and b3 = 5.83, the standard errors se(b2) = 0.193
and se(b3) = 6.06, and the sequential sum of squares SSR(x2|x3) = 88.43.

Compiling the results in a summary table, we obtain:

Model                b2       se(b2)   b3      se(b3)   Seq SS
x2 only              1.2009   0.0930   ---     ---      SSR(x2) = 505.472
x3 only              ---      ---      34.44   4.69     SSR(x3) = 419.858
x2, x3 (in order)    1.039    0.193    5.83    6.06     SSR(x3|x2) = 2.814
x3, x2 (in order)    1.039    0.193    5.83    6.06     SSR(x2|x3) = 88.43

Geez — things look a little different than before. It appears as if, when predictors are highly
correlated, the answers you get depend on the predictors in the model. That's not good! Let's
proceed through the table and in so doing carefully summarize the effects of multicollinearity
on the regression analyses.

Effect #1

When predictor variables are correlated, the estimated regression coefficient of any one
variable depends on which other predictor variables are included in the model.

Here's the relevant portion of the table:

Variables in model   b2     b3
x2                   1.20   ---
x3                   ---    34.4
x2, x3               1.04   5.83

Note that, depending on which predictors we include in the model, we obtain wildly different
estimates of the slope parameter for x3 = BSA!

If x3 = BSA is the only predictor included in our model, we claim that for every additional
one square meter increase in body surface area (BSA), blood pressure (BP) increases
by 34.4 mm Hg.
On the other hand, if x2 = Weight and x3 = BSA are both included in our model, we
claim that for every additional one square meter increase in body surface area (BSA),
holding weight constant, blood pressure (BP) increases by only 5.83 mm Hg.

This is a huge difference! Our hope would be, of course, that two regression analyses
wouldn't lead us to such seemingly different scientific conclusions. The high correlation
among the two predictors is what causes the large discrepancy. When interpreting b3 = 34.4
in the model that excludes x2 = Weight, keep in mind that when we increase x3 = BSA then x2
= Weight also increases and both factors are associated with increased blood pressure.
However, when interpreting b3 = 5.83 in the model that includes x2 = Weight, we keep x2 =
Weight fixed, so the resulting increase in blood pressure is much smaller.
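To make this first effect concrete in code, a minimal Python/statsmodels sketch along these lines (again assuming bloodpress.txt is whitespace-delimited with columns named BP, Weight, and BSA) reproduces the two very different BSA slope estimates:

import pandas as pd
import statsmodels.formula.api as smf

bp = pd.read_csv("bloodpress.txt", sep=r"\s+")  # assumed file layout

b3_alone = smf.ols("BP ~ BSA", data=bp).fit().params["BSA"]                  # about 34.4
b3_with_weight = smf.ols("BP ~ Weight + BSA", data=bp).fit().params["BSA"]   # about 5.83

print(b3_alone, b3_with_weight)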

The amazing thing is that even predictors that are not included in the model, but are highly
correlated with the predictors in our model, can have an impact! For example, consider a
pharmaceutical company's regression of territory sales on territory population and per capita
income. One would, of course, expect that as the population of the territory increases, so
would the sales in the territory. But, contrary to this expectation, the pharmaceutical
company's regression analysis deemed the estimated coefficient of territory population to be
negative. That is, as the population of territory increases, the territory sales were predicted to
decrease. After further investigation, the pharmaceutical company determined that the larger
the territory, the larger too the competitor's market penetration. That is, the competitor kept
the sales down in territories with large populations.

In summary, the competitor's market penetration was not included in the original model. Yet, it
was later deemed to be strongly positively correlated with territory population. Even though
the competitor's market penetration was not included in the original model, its strong
correlation with one of the predictors in the model, greatly affected the conclusions arising
from the regression analysis.

The moral of the story is that if you get estimated coefficients that just don't make sense,
there is probably a very good explanation. Rather than stopping your research and running off
to report your unusual results, think long and hard about what might have caused the results.
That is, think about the system you are studying and all of the extraneous variables that could
influence the system.

Effect #2

When predictor variables are correlated, the precision of the estimated regression coefficients
decreases as more predictor variables are added to the model.

Here's the relevant portion of the table:

Variables in model   se(b2)   se(b3)
x2                   0.093    ---
x3                   ---      4.69
x2, x3               0.193    6.06

The standard error for the estimated slope b2 obtained from the model including both x2 =
Weight and x3 = BSA is about double the standard error for the estimated slope b2 obtained
from the model including only x2 = Weight. And, the standard error for the estimated slope b3
obtained from the model including both x2 = Weight and x3 = BSA is about 30% larger than
the standard error for the estimated slope b3 obtained from the model including only x3 =
BSA.

What is the major implication of these increased standard errors? Recall that the standard
errors are used in the calculation of the confidence intervals for the slope parameters. That is,
increased standard errors of the estimated slopes lead to wider confidence intervals, and
hence less precise estimates of the slope parameters.
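As a rough illustration of this effect in code (same assumed bloodpress.txt layout as in the earlier sketches), you could compare the standard errors and 95% confidence intervals for the Weight slope with and without BSA in the model:

import pandas as pd
import statsmodels.formula.api as smf

bp = pd.read_csv("bloodpress.txt", sep=r"\s+")  # assumed file layout

weight_only = smf.ols("BP ~ Weight", data=bp).fit()
weight_bsa = smf.ols("BP ~ Weight + BSA", data=bp).fit()

# Standard errors for the Weight slope: roughly 0.093 vs. 0.193.
print(weight_only.bse["Weight"], weight_bsa.bse["Weight"])

# 95% confidence intervals for the Weight slope: the second is noticeably wider.
print(weight_only.conf_int().loc["Weight"])
print(weight_bsa.conf_int().loc["Weight"])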

Here are three plots to help clarify this second effect. Recall that the first data set (uncorrpreds.txt [2]) that we investigated in this lesson contained perfectly uncorrelated predictor variables (r = 0). Upon regressing the response y on the uncorrelated predictors x1 and x2, Minitab (or any other statistical software, for that matter) will find the "best fitting" plane through the data points:

Click on the Best Fitting Plane button in order to see the best fitting plane for this particular
set of responses. Now, here's where you have to turn on your imagination. The primary
characteristic of the data — because the predictors are perfectly uncorrelated — is that the
predictor values are spread out and anchored in each of four corners, providing a solid base
over which to draw the response plane. Now, even if the responses (y) varied somewhat from
sample to sample, the plane couldn't change all that much because of the solid base. That is,
the estimated coefficients, b1 and b2, couldn't change that much, and hence the standard
errors of the estimated coefficients, se(b1) and se(b2), will necessarily be small.

Now, let's take a look at the second example (bloodpress.txt [1]) that we investigated in this
lesson, in which the predictors x3 = BSA and x6 = Stress were nearly perfectly uncorrelated (r
= 0.018). Upon regressing the response y = BP on the nearly uncorrelated predictors x3 =
BSA and x6 = Stress, Minitab will again find the "best fitting" plane through the data points:

Click on the Best Fitting Plane button in order to see the best fitting plane for this particular
set of responses. Again, the primary characteristic of the data — because the predictors are
nearly perfectly uncorrelated — is that the predictor values are spread out and just about
anchored in each of four corners, providing a solid base over which to draw the response
plane. Again, even if the responses (y) varied somewhat from sample to sample, the plane
couldn't change all that much because of the solid base. That is, the estimated coefficients, b3
and b6, couldn't change all that much. The standard errors of the estimated coefficients,
se(b3) and se(b6), again will necessarily be small.

Now, let's see what happens when the predictors are highly correlated. Let's return to our
most recent example (bloodpress.txt [1]), in which the predictors x2 = Weight and x3 = BSA are
very highly correlated (r = 0.875). Upon regressing the response y = BP on the predictors x2 =
Weight and x3 = BSA, Minitab will again find the "best fitting" plane through the data points.


Do you see the difficulty in finding the best fitting plane in this situation? The primary
characteristic of the data — because the predictors are so highly correlated — is that the
predictor values tend to fall in a straight line. That is, there are no anchors in two of the four
corners. Therefore, the base over which the response plane is drawn is not very solid.

Let's put it this way — would you rather sit on a chair with four legs or one with just two legs?
If the responses (y) varied somewhat from sample to sample, the position of the plane could
change significantly. That is, the estimated coefficients, b2 and b3, could change substantially.
The standard errors of the estimated coefficients, se(b2) and se(b3), will then be necessarily
larger. Below is an animated view (no sound) of the problem that highly correlated predictors
can cause with finding the best fitting plane.

Effect #3

When predictor variables are correlated, the marginal contribution of any one predictor
variable in reducing the error sum of squares varies depending on which other variables are
already in the model.

For example, regressing the response y = BP on the predictor x2 = Weight, we obtain


SSR(x2) = 505.472. But, regressing the response y = BP on the two predictors x3 = BSA and
x2 = Weight (in that order), we obtain SSR(x2|x3) = 88.43. The first model suggests that
weight reduces the error sum of squares substantially (by 505.472), but the second model
suggests that weight doesn't reduce the error sum of squares all that much (by 88.43) once a
person's body surface area is taken into account.

This should make intuitive sense. In essence, weight appears to explain some of the variation
in blood pressure. However, because weight and body surface area are highly correlated,
most of the variation in blood pressure explained by weight could just have easily been
explained by body surface area. Therefore, once you take into account a person's body
surface area, there's not much variation left in the blood pressure for weight to explain.

Incidentally, we see a similar phenomenon when we enter the predictors into the model in the
reverse order. That is, regressing the response y = BP on the predictor x3 = BSA, we obtain
SSR(x3) = 419.858. But, regressing the response y = BP on the two predictors x2 = Weight
and x3 = BSA (in that order), we obtain SSR(x3|x2) = 2.814. The first model suggests that
body surface area reduces the error sum of squares substantially (by 419.858), and the
second model suggests that body surface area doesn't reduce the error sum of squares all
that much (by only 2.814) once a person's weight is taken into account.

Effect #4

When predictor variables are correlated, hypothesis tests for βk = 0 may yield different
conclusions depending on which predictor variables are in the model. (This effect is a direct
consequence of the three previous effects.)

To illustrate this effect, let's once again quickly proceed through the output of a series of
regression analyses, focusing primarily on the outcome of the t-tests for testing H0 : βBSA = 0
and H0 : βWeight = 0.

The regression of the response y = BP on the predictor x3 = BSA:

indicates that the P-value associated with the t-test for testing H0 : βBSA = 0 is 0.000... < 0.01.
There is sufficient evidence at the 0.05 level to conclude that blood pressure is significantly
related to body surface area.

The regression of the response y = BP on the predictor x2 = Weight:



indicates that the P-value associated with the t-test for testing H0 : βWeight = 0 is 0.000... <
0.01. There is sufficient evidence at the 0.05 level to conclude that blood pressure is
significantly related to weight.

And, the regression of the response y = BP on the predictors x2 = Weight and x3 = BSA:

indicates that the P-value associated with the t-test for testing H0 : βWeight = 0 is 0.000... <
0.01. There is sufficient evidence at the 0.05 level to conclude that, after taking into account
body surface area, blood pressure is significantly related to weight.

The regression also indicates that the P-value associated with the t-test for testing H0 : βBSA
= 0 is 0.350. There is insufficient evidence at the 0.05 level to conclude that blood pressure is
significantly related to body surface area after taking into account weight. This might sound
contradictory to what we claimed earlier, namely that blood pressure is indeed significantly
related to body surface area. Again, what is going on here is that, once you take into account a person's weight, body surface area doesn't explain much of the remaining variability in blood pressure readings.
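A small companion sketch (same assumptions about bloodpress.txt as above) for pulling out these two t-test P-values with statsmodels:

import pandas as pd
import statsmodels.formula.api as smf

bp = pd.read_csv("bloodpress.txt", sep=r"\s+")  # assumed file layout

# P-value for the BSA slope when BSA is the only predictor (essentially zero)
print(smf.ols("BP ~ BSA", data=bp).fit().pvalues["BSA"])

# P-value for the BSA slope once Weight is also in the model (about 0.35)
print(smf.ols("BP ~ Weight + BSA", data=bp).fit().pvalues["BSA"])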

Effect #5

High multicollinearity among predictor variables does not prevent good, precise predictions of
the response within the scope of the model.

Well, okay, it's not an effect, and it's not bad news either! It is good news! If the primary
purpose of your regression analysis is to estimate a mean response μY or to predict a new
response y, you don't have to worry much about multicollinearity.

For example, suppose you are interested in predicting the blood pressure (y = BP) of an
individual whose weight is 92 kg and whose body surface area is 2 square meters:


Because the point (2, 92) falls within the scope of the model, you'll still get good, reliable
predictions of the response y, regardless of the correlation that exists among the two
predictors BSA and Weight. Geometrically, what is happening here is that the best fitting
plane through the responses may tilt from side to side from sample to sample (because of the
correlation), but the center of the plane (in the scope of the model) won't change all that
much.

The following output illustrates how the predictions don't change all that much from model to
model:

The first output yields a predicted blood pressure of 112.7 mm Hg for a person whose weight
is 92 kg based on the regression of blood pressure on weight. The second output yields a
predicted blood pressure of 114.1 mm Hg for a person whose body surface area is 2 square
meters based on the regression of blood pressure on body surface area. And the last output
yields a predicted blood pressure of 112.8 mm Hg for a person whose body surface area is 2
square meters and whose weight is 92 kg based on the regression of blood pressure on body
surface area and weight. Reviewing the confidence intervals and prediction intervals, you can
see that they too yield similar results regardless of the model.
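Here is a rough sketch (same assumed bloodpress.txt layout) of how these predictions and their intervals could be reproduced outside Minitab; get_prediction returns both the confidence interval for the mean response and the prediction interval for a new observation:

import pandas as pd
import statsmodels.formula.api as smf

bp = pd.read_csv("bloodpress.txt", sep=r"\s+")  # assumed file layout
new = pd.DataFrame({"Weight": [92.0], "BSA": [2.0]})

for formula in ["BP ~ Weight", "BP ~ BSA", "BP ~ BSA + Weight"]:
    fit = smf.ols(formula, data=bp).fit()
    # summary_frame() includes the fitted mean, its confidence interval,
    # and the prediction interval for a new observation.
    print(formula)
    print(fit.get_prediction(new).summary_frame(alpha=0.05))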

The bottom line

Now, in short, what are the major effects that multicollinearity has on our use of a regression
model to answer our research questions? In the presence of multicollinearity:

It is okay to use an estimated regression model to predict y or estimate μY as long as you do so within the scope of the model.
We can no longer make much sense of the usual interpretation of a slope coefficient as
the change in the mean response for each additional unit increase in the predictor xk,
when all the other predictors are held constant.

The first point is, of course, addressed above. The second point is a direct consequence of the
correlation among the predictors. It wouldn't make sense to talk about holding the values of
correlated predictors constant, since changing one predictor necessarily would change the
values of the others.

PRACTICE PROBLEMS: Correlated predictors

Effects of correlated predictor variables. This exercise reviews the effects of multicollinearity on various aspects of regression analyses. The Allen Cognitive Level
(ACL) test is designed to quantify one's cognitive abilities. David and Riley (1990)
investigated the relationship of the ACL test to level of psychopathology in a set of 69
patients from a general hospital psychiatry unit. The data set allentest.txt [4] contains the
response y = ACL and three potential predictors:

x1 = Vocab, scores on the vocabulary component of the Shipley Institute of Living Scale
x2 = Abstract, scores on the abstraction component of the Shipley Institute of Living Scale
x3 = SDMT, scores on the Symbol-Digit Modalities Test

1. Determine the pairwise correlations among the predictor variables in order to get an
idea of the extent to which the predictor variables are (pairwise) correlated. (See
Minitab Help: Creating a correlation matrix [5]). Also, create a matrix plot of the data in
order to get a visual portrayal of the relationship among the response and predictor
variables. (See Minitab Help: Creating a simple matrix of scatter plots [6]).

(CHECK YOUR ANSWER)

2. Fit the simple linear regression model with y = ACL as the response and x1 = Vocab
as the predictor. After fitting your model, request that Minitab predict the response y =
ACL when x1 = 25. (See Minitab Help: Performing a multiple regression analysis — with
options [7]).

What is the value of the estimated slope coefficient b1?


What is the value of the standard error of b1?
What is the regression sum of squares, SSR (x1), when x1 is the only predictor in
the model?
What is the predicted response of y = ACL when x1 = 25?

(CHECK YOUR ANSWER)

3. Now, fit the simple linear regression model with y = ACL as the response and x3 =
SDMT as the predictor. After fitting your model, request that Minitab predict the
response y = ACL when x3 = 40. (See Minitab Help: Performing a multiple regression analysis — with options [7]).

What is the value of the estimated slope coefficient b3?


What is the value of the standard error of b3?
What is the regression sum of squares, SSR (x3), when x3 is the only predictor in
the model?
What is the predicted response of y = ACL when x3 = 40?

(CHECK YOUR ANSWER)

4. Fit the multiple linear regression model with y = ACL as the response and x3 = SDMT
as the first predictor and x1 = Vocab as the second predictor. After fitting your model,
request that Minitab predict the response y = ACL when x1 = 25 and x3 = 40. (See
Minitab Help: Performing a multiple regression analysis — with options [7]).

Now, what is the value of the estimated slope coefficient b1? and b3?
Now, what is the value of the standard error of b1? and b3?
What is the sequential sum of squares, SSR (X1|X3)?
What is the predicted response of y = ACL when x1 = 25 and x3 = 40?

(CHECK YOUR ANSWER)

5. Summarize the effects of multicollinearity on various aspects of the regression analyses.

(CHECK YOUR ANSWER)

12.4 - Detecting Multicollinearity Using Variance Inflation Factors
Okay, now that we know the effects that multicollinearity can have on our regression analyses
and subsequent conclusions, how do we tell when it exists? That is, how can we tell if
multicollinearity is present in our data?

Some of the common methods used for detecting multicollinearity include:

The analysis exhibits the signs of multicollinearity — such as, estimates of the
coefficients vary from model to model.
The t-tests for each of the individual slopes are non-significant (P > 0.05), but the overall F-test for testing that all of the slopes are simultaneously 0 is significant (P < 0.05).
The correlations among pairs of predictor variables are large.

Looking at correlations only among pairs of predictors, however, is limiting. It is possible that
the pairwise correlations are small, and yet a linear dependence exists among three or even
more variables, for example, if X3 = 2X1 + 5X2 + error, say. That's why many regression
analysts often rely on what are called variance inflation factors (VIF) to help detect
multicollinearity.
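To see the flavor of this limitation in a small simulation (an illustrative sketch, not the lesson's data): build a predictor that is nearly a linear combination of several others, so that no single pairwise correlation looks alarming even though the linear dependence, and hence the VIF, is severe.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(12345)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
# x5 is almost an exact linear combination of x1..x4, plus a little noise.
X["x5"] = X[["x1", "x2", "x3", "x4"]].sum(axis=1) + rng.normal(scale=0.2, size=n)

print(X.corr().round(2))  # no pairwise correlation here looks extreme (about 0.5 at most)

# Yet regressing x5 on the other predictors gives an R-squared near 1, so its VIF is huge.
r2 = sm.OLS(X["x5"], sm.add_constant(X[["x1", "x2", "x3", "x4"]])).fit().rsquared
print("VIF for x5:", 1 / (1 - r2))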

What is a Variance Inflation Factor?

As the name suggests, a variance inflation factor (VIF) quantifies how much the variance is
inflated. But what variance? Recall that we learned previously that the standard errors — and
hence the variances — of the estimated coefficients are inflated when multicollinearity exists.
So, the variance inflation factor for the estimated coefficient bk —denoted VIFk —is just the
factor by which the variance is inflated.

Let's be a little more concrete. For the model in which xk is the only predictor:

\[y_i=\beta_0+\beta_k x_{ik}+\epsilon_i\]

it can be shown that the variance of the estimated coefficient bk is:

\[Var(b_k)_{min}=\dfrac{\sigma^2}{\sum_{i=1}^{n}(x_{ik}-\bar{x}_k)^2}\]

Note that we add the subscript "min" in order to denote that it is the smallest the variance can be. Don't worry about how this variance is derived — we just need to keep track of this baseline variance, so we can see how much the variance of bk is inflated when we add correlated predictors to our regression model.

Let's consider such a model with correlated predictors:

\[y_i=\beta_0+\beta_1 x_{i1}+\cdots+\beta_k x_{ik}+\cdots+\beta_{p-1}x_{i,p-1}+\epsilon_i\]

Now, again, if some of the predictors are correlated with the predictor xk, then the variance of bk is inflated. It can be shown that the variance of bk is:

\[Var(b_k)=\dfrac{\sigma^2}{\sum_{i=1}^{n}(x_{ik}-\bar{x}_k)^2}\times\dfrac{1}{1-R_k^2}\]

where \(R_k^2\) is the R2-value obtained by regressing the kth predictor on the remaining predictors. Of course, the greater the linear dependence between the predictor xk and the other predictors, the larger the \(R_k^2\) value. And, as the above formula suggests, the larger the \(R_k^2\) value, the larger the variance of bk.

How much larger? To answer this question, all we need to do is take the ratio of the two variances. Doing so, we obtain:

\[\dfrac{Var(b_k)}{Var(b_k)_{min}}=\dfrac{1}{1-R_k^2}\]

The above quantity is what is deemed the variance inflation factor for the kth predictor. That is:

\[VIF_k=\dfrac{1}{1-R_k^2}\]

where \(R_k^2\) is the R2-value obtained by regressing the kth predictor on the remaining predictors. Note that a variance inflation factor exists for each of the k predictors in a multiple regression model.

How do we interpret the variance inflation factors for a regression model? Again, it is a
measure of how much the variance of the estimated regression coefficient bk is "inflated" by
the existence of correlation among the predictor variables in the model. A VIF of 1 means that
there is no correlation among the kth predictor and the remaining predictor variables, and
hence the variance of bk is not inflated at all. The general rule of thumb is that VIFs exceeding
4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity
requiring correction.

An Example

Let's return to the blood pressure data (bloodpress.txt [1]) in which researchers observed the
following data on 20 individuals with high blood pressure:

blood pressure (y = BP, in mm Hg)


age (x1 = Age, in years)
weight (x2 = Weight, in kg)
body surface area (x3 = BSA, in sq m)
duration of hypertension (x4 = Dur, in years)
basal pulse (x5 = Pulse, in beats per minute)
stress index (x6 = Stress)

As you may recall, the matrix plot of BP, Age, Weight, and BSA:

the matrix plot of BP, Dur, Pulse, and Stress:


and the correlation matrix:

suggest that some of the predictors are at least moderately marginally correlated. For
example, body surface area (BSA) and weight are strongly correlated (r = 0.875), and weight
and pulse are fairly strongly correlated (r = 0.659). On the other hand, none of the pairwise
correlations among age, weight, duration and stress are particularly strong (r < 0.40 in each
case).

Regressing y = BP on all six of the predictors, we obtain:


[Minitab v17 reports the variance inflation factors by default; for v16 you have to select this
under Options.] As you can see, three of the variance inflation factors —8.42, 5.33, and 4.41
—are fairly large. The VIF for the predictor Weight, for example, tells us that the variance of
the estimated coefficient of Weight is inflated by a factor of 8.42 because Weight is highly
correlated with at least one of the other predictors in the model.

For the sake of understanding, let's verify the calculation of the VIF for the predictor Weight.
Regressing the predictor x2 = Weight on the remaining five predictors:


Minitab reports that \(R^2_{Weight}\) is 88.1% or, in decimal form, 0.881. Therefore, the variance inflation factor for the estimated Weight coefficient is by definition:

\[VIF_{Weight}=\dfrac{1}{1-R^2_{Weight}}=\dfrac{1}{1-0.881}=8.4\]

Again, this variance inflation factor tells us that the variance of the weight coefficient is inflated
by a factor of 8.4 because Weight is highly correlated with at least one of the other predictors
in the model.
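If you are working in Python rather than Minitab, statsmodels also ships a helper, variance_inflation_factor, that reproduces these values; a rough sketch with the same assumed file layout:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

bp = pd.read_csv("bloodpress.txt", sep=r"\s+")  # assumed file layout
X = sm.add_constant(bp[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])

# Skip column 0 (the constant) and report a VIF for each predictor.
for i, name in enumerate(X.columns[1:], start=1):
    print(name, round(variance_inflation_factor(X.values, i), 2))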

So, what to do? One solution to dealing with multicollinearity is to remove some of the
violating predictors from the model. If we review the pairwise correlations again:

we see that the predictors Weight and BSA are highly correlated (r = 0.875). We can choose
to remove either predictor from the model. The decision of which one to remove is often a
scientific or practical one. For example, if the researchers here are interested in using their
final model to predict the blood pressure of future individuals, their choice should be clear.
Which of the two measurements — body surface area or weight — do you think would be
easier to obtain?! If indeed weight is an easier measurement to obtain than body surface
area, then the researchers would be well-advised to remove BSA from the model and leave
Weight in the model.

Reviewing again the above pairwise correlations, we see that the predictor Pulse also
appears to exhibit fairly strong marginal correlations with several of the predictors, including
Age (r = 0.619), Weight (r = 0.659) and Stress (r = 0.506). Therefore, the researchers could
also consider removing the predictor Pulse from the model.

Let's see how the researchers would do. Regressing the response y = BP on the four
remaining predictors age, weight, duration and stress, we obtain:

Aha — the remaining variance inflation factors are quite satisfactory! That is, it appears as if
hardly any variance inflation remains. Incidentally, in terms of the adjusted R2-value, we did
not seem to lose much by dropping the two predictors BSA and Pulse from our model. The
adjusted R2-value decreased to only 98.97% from the original adjusted R2-value of 99.44%.
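A short sketch of this final check under the same assumptions: drop BSA and Pulse, refit, and inspect the remaining VIFs and the adjusted R2-value.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

bp = pd.read_csv("bloodpress.txt", sep=r"\s+")  # assumed file layout

reduced = smf.ols("BP ~ Age + Weight + Dur + Stress", data=bp).fit()
print(round(reduced.rsquared_adj, 4))  # adjusted R-squared of the reduced model

X = sm.add_constant(bp[["Age", "Weight", "Dur", "Stress"]])
for i, name in enumerate(X.columns[1:], start=1):
    print(name, round(variance_inflation_factor(X.values, i), 2))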

PRACTICE PROBLEMS: Variance inflation factors

Detecting multicollinearity using VIFk.

We’ll use the cement.txt [8] data set to explore variance inflation factors. The response y
measures the heat evolved in calories during the hardening of cement on a per gram
basis. The four predictors are the percentages of four ingredients: tricalcium aluminate
(x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).
It’s not hard to imagine that such predictors would be correlated in some way.

1. Use the Stat >> Basic Statistics >> Correlation ... command in Minitab to get an idea
of the extent to which the predictor variables are (pairwise) correlated. Also, use the
Graph >> Matrix Plot ... command in Minitab to get a visual portrayal of the (pairwise)
relationships among the response and predictor variables.

(CHECK YOUR ANSWER)

2. Regress the fourth predictor, x4, on the remaining three predictors, x1, x2, and x3. That is, fit the linear regression model treating x4 as the response and x1, x2, and x3 as the predictors. What is the \(R_4^2\) value? (Note that Minitab rounds the R2 value it reports to three decimal places. For the purposes of the next question, you'll want a more accurate R2 value. Calculate the R2 value using its definition, \(R^2 = SSR/SSTO\). Use your calculated value, carried out to 5 decimal places, in answering the next question.)

(CHECK YOUR ANSWER)

3. Using your calculated R2 value carried out to 5 decimal places, determine by what
factor the variance of b4 is inflated. That is, what is VIF4?

(CHECK YOUR ANSWER)

4. Minitab will actually calculate the variance inflation factors for you. Fit the multiple
linear regression model with y as the response and x1,x2,x3 and x4 as the predictors.
The VIFk will be reported as a column of the estimated coefficients table. Is the VIF4
that you calculated consistent with what Minitab reports?

(CHECK YOUR ANSWER)

5. Note that all of the VIFk are larger than 10, suggesting that a high degree of
multicollinearity is present. (It should seem logical that multicollinearity is present here,
given that the predictors are measuring the percentage of ingredients in the cement.)
Do you notice anything odd about the results of the t-tests for testing the individual H0 :
βi = 0 and the result of the overall F-test for testing H0 : β1 = β2 = β3 = β4 = 0? Why
does this happen?

(CHECK YOUR ANSWER)

6. We learned that one way of reducing data-based multicollinearity is to remove some of the violating predictors from the model. Fit the linear regression model with y as the
response and X1 and X2 as the only predictors. Are the variance inflation factors for this
model acceptable?

(CHECK YOUR ANSWER)

12.5 - Reducing Data-based Multicollinearity
Recall that data-based multicollinearity is multicollinearity that results from a poorly designed
experiment, reliance on purely observational data, or the inability to manipulate the system on
which the data are collected. We now know all the bad things that can happen in the presence
of multicollinearity. And, we've learned how to detect multicollinearity. Now, let's learn how to
reduce multicollinearity once we've discovered that it exists.

As the example in the previous section illustrated, one way of reducing data-based
multicollinearity is to remove one or more of the violating predictors from the regression
model. Another way is to collect additional data under different experimental or observational
conditions. We'll investigate this alternative method in this section.

Before we do, let's quickly remind ourselves why we should care about reducing
multicollinearity. It all comes down to drawing conclusions about the population slope
parameters. If the variances of the estimated coefficients are inflated by multicollinearity, then
our confidence intervals for the slope parameters are wider and therefore less useful.
Eliminating or even reducing the multicollinearity therefore yields narrower, more useful
confidence intervals for the slopes.

An Example

Researchers running the Allen Cognitive Level (ACL) Study were interested in the relationship
of ACL test scores to the level of psychopathology. They therefore collected the following data
on a set of 69 patients in a hospital psychiatry unit:

Response y = ACL test score


Predictor x1 = vocabulary (Vocab) score on the Shipley Institute of Living Scale
Predictor x2 = abstraction (Abstract) score on the Shipley Institute of Living Scale
Predictor x3 = score on the Symbol-Digit Modalities Test (SDMT)

For the sake of this example, I sampled 23 patients from the original data set in such a way as to
ensure that a very high correlation exists between the two predictors Vocab and Abstract. A
matrix plot of the resulting data set (allentestn23.txt [9]):

suggests that, indeed, a strong correlation exists between Vocab and Abstract. The
correlations among the remaining pairs of predictors do not appear to be particularly strong.

Focusing only on the relationship between the two predictors Vocab and Abstract:


we do indeed see that a very strong relationship (r = 0.99) exists among the two predictors.

Let's see what havoc this high correlation wreaks on our regression analysis! Regressing the
response y = ACL on the predictors SDMT, Vocab, and Abstract, we obtain:

Yikes — the variance inflation factors for Vocab and Abstract are very large — 49.3 and 50.6,
respectively!

What should we do about this? We could opt to remove one of the two predictors from the
model. Alternatively, if we have a good scientific reason for needing both of the predictors to
remain in the model, we could go out and collect more data. Let's try this second approach
here.

For the sake of this example, let's imagine that we went out and collected more data, and in
so doing, obtained the actual data collected on all 69 patients enrolled in the Allen Cognitive
Level (ACL) Study. A matrix plot of the resulting data set (allentest.txt [4]):

suggests that a correlation still exists between Vocab and Abstract — it is just a weaker
correlation now.

Again, focusing only on the relationship between the two predictors Vocab and Abstract:

we do indeed see that the relationship between Abstract and Vocab is now much weaker (r =
0.698) than before. The round data points in blue represent the 23 data points in the original
data set, while the square red data points represent the 46 newly collected data points. As
you can see from the plot, collecting the additional data has expanded the "base" over which
the "best fitting plane" will sit. The existence of this larger base allows less room for the plane
to tilt from sample to sample, and thereby reduces the variance of the estimated slope
coefficients.

Let's see if the addition of the new data helps to reduce the multicollinearity here. Regressing
the response y = ACL on the predictors SDMT, Vocab, and Abstract:

we find that the variance inflation factors are reduced significantly and satisfactorily. The
researchers could now feel comfortable proceeding with drawing conclusions about the
effects of the vocabulary and abstraction scores on the level of psychopathology.
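If you wanted to check this comparison yourself, a rough Python sketch follows, assuming allentestn23.txt and allentest.txt are whitespace-delimited files with columns named ACL, Vocab, Abstract, and SDMT:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def report_vifs(path):
    d = pd.read_csv(path, sep=r"\s+")                      # assumed file layout
    X = sm.add_constant(d[["SDMT", "Vocab", "Abstract"]])  # design matrix with intercept
    for i, name in enumerate(X.columns[1:], start=1):
        print(path, name, round(variance_inflation_factor(X.values, i), 1))

report_vifs("allentestn23.txt")  # n = 23 subsample: Vocab and Abstract VIFs are huge
report_vifs("allentest.txt")     # full n = 69 data: the VIFs drop to acceptable levels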

One thing to keep in mind: in order to reduce the multicollinearity that exists, it is not sufficient to go out and just collect any ol' data. The data have to be collected in such a way as to ensure that the correlations among the violating predictors are actually reduced. That is, collecting more of the same kind of data won't help to reduce the multicollinearity. The data have to be collected to ensure that the "base" is sufficiently enlarged. Doing so, of course, changes the characteristics of the studied population, and therefore should be reported accordingly.

12.6 - Reducing Structural Multicollinearity


Recall that structural multicollinearity is multicollinearity that is a mathematical artifact caused
by creating new predictors from other predictors, such as creating the predictor x² from the
predictor x. Because of this, at the same time that we learn here about reducing structural
multicollinearity, we learn more about polynomial regression models.

An Example

What is the impact of exercise on the human immune system? In order to answer this very
global and general research question, one has to first quantify what "exercise" means and
what "immunity" means. Of course, there are several ways of doing so. For example, we
might quantify one's level of exercise by measuring his or her "maximal oxygen uptake." And,
we might quantify the quality of one's immune system by measuring the amount of
"immunoglobin in his or her blood." In doing so, the general research question is translated
into the much more specific research question: "How is the amount of immunoglobin in blood
(y) related to maximal oxygen uptake (x)?"

Because some researchers were interested in answering the above research question, they
collected the following data (exerimmun.txt [10]) on a sample of 30 individuals:

yi = amount of immunoglobin in blood (mg) of individual i


xi = maximal oxygen uptake (ml/kg) of individual i

The scatter plot of the resulting data:

suggests that there might be some curvature to the trend in the data. In order to allow for the
apparent curvature — rather than formulating a linear regression function — the researchers
formulated the following quadratic polynomial regression function:

yi = β0 + β1xi + β11xi² + εi

where:

yi = amount of immunoglobin in blood (mg) of individual i


xi = maximal oxygen uptake (ml/kg) of individual i

As usual, the error terms εi are assumed to be independent, normally distributed and have
equal variance σ².

As the following plot of the estimated quadratic function suggests:

the formulated regression function appears to describe the trend in the data well. The
adjusted R²-value is 93.3%.

But, now what do the estimated coefficients tell us? The interpretation of the regression
coefficients is mostly geometric in nature. That is, the coefficients tell us a little bit about what
the picture looks like:

If 0 is a possible x value, then b0 is the predicted response when x = 0. Otherwise, the
interpretation of b0 is meaningless.
The estimated coefficient b1 is the estimated slope of the tangent line at x = 0.
The estimated coefficient b11 indicates the up/down direction of the curve. That is:
if b11 < 0, then the curve is concave down
if b11 > 0, then the curve is concave up

So far, we have kept our heads a little bit in the sand! If we look at the output we obtain upon
regressing the response y = igg on the predictors oxygen and oxygensq:

we quickly see that the variance inflation factors for both predictors — oxygen and
oxygensq — are very large (99.9 in each case). Is this surprising to you? If you think about it, we've
created a correlation by taking the predictor oxygen and squaring it to obtain oxygensq. That
is, just by the nature of our model, we have created a "structural multicollinearity."

The scatter plot of oxygensq versus oxygen:

illustrates the intense strength of the correlation that we induced. After all, we just can't get
much more correlated than a correlation of r = 0.995!

The neat thing here is that we can reduce the multicollinearity in our data by doing what is
known as "centering the predictors." Centering a predictor merely entails subtracting the
mean of the predictor values in the data set from each predictor value. For example, Minitab
reports that the mean of the oxygen values in our data set is 50.64:

Therefore, in order to center the predictor oxygen, we merely subtract 50.64 from each
oxygen value in our data set. Doing so, we obtain the centered predictor, oxcent, say:

oxygen oxcent oxcentsq


34.6 -16.04 257.282
45.0 -5.64 31.810
62.3 11.66 135.956
58.9 8.26 68.228
42.5 -8.14 66.260
44.3 -6.34 40.196
67.9 17.26 297.908
58.5 7.86 61.780
35.6 -15.04 226.202
49.6 -1.04 1.082
33.0 -17.64 311.170

For example, 34.6 minus 50.64 is -16.04, and 45.0 minus 50.64 is -5.64, and so on. Now, in
order to include the squared oxygen term in our regression model—to allow for curvature in
the trend—we square the centered predictor oxcent to obtain oxcentsq. That is, (-16.04)² =
257.282 and (-5.64)² = 31.810, and so on.

Wow! It really works! The scatter plot of oxcentsq versus oxcent:

illustrates—by centering the predictors—just how much we've reduced the correlation
between the predictor and its square. The correlation has gone from a whopping r = 0.995 to
a rather low r = 0.219!
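
The same check is easy to carry out in code. The following Python sketch assumes
exerimmun.txt has columns named igg and oxygen; it centers oxygen at its sample mean and
compares the two correlations reported above.

import numpy as np
import pandas as pd

# Assumed file name, delimiter, and column names for the exercise/immunity data.
exer = pd.read_csv("exerimmun.txt", sep=r"\s+")

oxygen = exer["oxygen"]
oxygensq = oxygen ** 2

oxcent = oxygen - oxygen.mean()      # subtract the sample mean (about 50.64)
oxcentsq = oxcent ** 2

print("corr(oxygen, oxygensq):", round(np.corrcoef(oxygen, oxygensq)[0, 1], 3))   # about 0.995
print("corr(oxcent, oxcentsq):", round(np.corrcoef(oxcent, oxcentsq)[0, 1], 3))   # about 0.22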

Having centered the predictor oxygen, we must reformulate our quadratic polynomial
regression model accordingly. That is, we now formulate our model as:

yi = β0* + β1*(xi - x̄) + β11*(xi - x̄)² + εi

or alternatively:

yi = β0* + β1*(xi*) + β11*(xi*)² + εi

where:

yi = amount of immunoglobin in blood (mg), and

xi* = xi - x̄ denotes the centered predictor (x̄ = the mean of the oxygen values, 50.64),

and the error terms εi are independent, normally distributed and have equal variance σ². Note
that we add asterisks to each of the parameters in order to make it clear that the parameters
differ from the parameters in the original model we formulated.

Let's see how we did by centering the predictors and reformulating our model. Recall that
— based on our original model — the variance inflation factors for oxygen and oxygensq were
99.9. Now, regressing y = igg on the centered predictors oxcent and oxcentsq:

we see that the variance inflation factors have dropped significantly—now they are 1.05 in
each case.

Because we reformulated our model based on the centered predictors, the meaning of the
parameters must be changed accordingly. Now, the estimated coefficients tell us:

The estimated coefficient b0 is the predicted response y when the predictor x equals the
sample mean of the predictor values.
The estimated coefficient b1 is the estimated slope of the tangent line at the predictor
mean — and, often, it is similar to the estimated slope in the simple linear regression
model.
The estimated coefficient b11 indicates the up/down direction of the curve. That is:
if b11 < 0, then the curve is concave down
if b11 > 0, then the curve is concave up

So, here, in this example, the estimated coefficient b0 = 1632.3 tells us that an individual whose
maximal oxygen uptake is 50.64 ml/kg is predicted to have 1632.3 mg of immunoglobin in his
or her blood. And, the estimated coefficient b1 = 34.00 tells us that when an individual's maximal
oxygen uptake is near 50.64 ml/kg, we can expect the individual's immunoglobin to increase
by 34.00 mg for every 1 ml/kg increase in maximal oxygen uptake.

As the following plot of the estimated quadratic function suggests:

the reformulated regression function appears to describe the trend in the data well. The
adjusted R²-value is still 93.3%.

We shouldn't be surprised to see that the estimates of the coefficients in our reformulated
polynomial regression model are quite similar to the estimates of the coefficients for the
simple linear regression model:

As you can see, the estimated coefficient b1 = 34.00 for the polynomial regression model and
b1 = 32.74 for the simple linear regression model. And, the estimated coefficient b0 = 1632 for
the polynomial regression model and b0 = 1558 for the simple linear regression model. The
similarities in the estimates, of course, arise from the fact that the predictors are nearly
uncorrelated and therefore the estimates of the coefficients don't change all that much from
model to model.

Now, you might be getting the sense that we're "mucking around with the data" in order to get
an answer to our research questions. One way to convince you that we're not is to show you
that the two estimated models are algebraically equivalent. That is, if given one form of the
estimated model, say the estimated model with the centered predictors:

ŷ = b0* + (b1*)(x*) + (b11*)(x*)²

then, the other form of the estimated model, say the estimated model with the original
predictors:

ŷ = b0 + b1x + b11x²

can be easily obtained. In fact, it can be shown algebraically that the estimated coefficients of
the original model equal:

b0 = b0* - (b1*)(x̄) + (b11*)(x̄)²
b1 = b1* - 2(b11*)(x̄)
b11 = b11*

For example, the estimated regression function for our reformulated model with centered
predictors is:

ŷ = 1632.3 + 34.00(oxcent) - 0.536(oxcentsq)

Then, since the mean of the oxygen values in the data set is 50.64,
it can be shown algebraically that the estimated coefficients for the model with the original
(uncentered) predictors are:

b0 = 1632.3 - 34.00(50.64) - 0.536(50.64)² = -1464

b1 = 34.00 - 2(-0.536)(50.64) = 88.3

b11 = -0.536

That is, the estimated regression function for our quadratic polynomial model with the original
(uncentered) predictors is:

ŷ = -1464 + 88.3(oxygen) - 0.536(oxygen)²
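
As a quick check of this algebra, the following Python sketch simply plugs the rounded
estimates reported above into the conversion formulas; it recovers the uncentered coefficients
to within rounding.

# Back-converting the centered-model estimates to the uncentered scale.
xbar = 50.64                                     # sample mean of the oxygen values
b0_star, b1_star, b11_star = 1632.3, 34.00, -0.536

b11 = b11_star
b1 = b1_star - 2 * b11_star * xbar
b0 = b0_star - b1_star * xbar + b11_star * xbar ** 2

print(round(b0), round(b1, 1), round(b11, 3))    # roughly -1464, 88.3, -0.536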

Given the equivalence of the two estimated models, you might ask why we bother to center
the predictors. The main reason for centering to correct structural multicollinearity is
computational: keeping the level of multicollinearity low helps avoid numerical inaccuracies.
Specifically, a near-zero determinant of XᵀX is a potential source of serious roundoff errors in
the calculations of the normal equations. Severe multicollinearity drives this determinant
toward zero, and thus, under severe multicollinearity, the regression coefficients may be
subject to large roundoff errors.

Let's use our model to predict the immunoglobin level in the blood of a person whose maximal
oxygen uptake is 90 ml/kg. Of course, before we use our model to answer a research
question, we should always evaluate it first to make sure it meets all of the necessary
conditions. The residuals versus fits plot:

shows a nice horizontal band around the residual = 0 line, suggesting the model fits the data
well. It also suggests that the variances of the error terms are equal. And, the normal
probability plot:

suggests that the error terms are normally distributed. Okay, we're good to go — let's use the
model to answer our research question: "What is one's predicted IgG if the maximal oxygen
uptake is 90 ml/kg?"

When asking Minitab to make this prediction, you have to remember that we have centered
the predictors. That is, if oxygen = 90, then oxcent = 90-50.64 = 39.36. And, if oxcent = 39.36,
then oxcentsq = 1549.210. Asking Minitab to predict the igg of an individual whose oxcent =
39.4 and oxcentsq = 1549, we obtain the following output:

Why does Minitab report that "XX denotes a row with very extreme X values?"
Recall that the levels of maximal oxygen uptake in the data set ranged from 30 to 70 ml/kg.
Therefore, a maximal oxygen uptake of 90 is way outside the scope of the model, and Minitab
provides such a warning.
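
Outside of Minitab, the same prediction can be pieced together by hand from the centered fit,
as in the hedged sketch below. It uses the rounded estimates quoted above, so the result is
approximate, and it is still an extrapolation well beyond the observed oxygen range.

xbar = 50.64
b0_star, b1_star, b11_star = 1632.3, 34.00, -0.536   # centered-model estimates (rounded)

oxygen_new = 90
oxcent_new = oxygen_new - xbar                       # 39.36
pred_igg = b0_star + b1_star * oxcent_new + b11_star * oxcent_new ** 2

print(round(pred_igg, 1))   # roughly 2140 mg, but 90 ml/kg is outside the data's 30-70 range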

A word of warning. Be careful — because of changes in direction of the curve, there is an
even greater danger in extrapolation when modeling data with a polynomial function.

The hierarchical approach to model fitting

Just one closing comment, since we've been discussing polynomial regression, to remind you
about the "hierarchical approach to model fitting." The widely accepted approach to fitting
polynomial regression functions is to fit a higher-order model and then explore whether a
lower-order (simpler) model is adequate. For example, suppose we formulate the following
cubic polynomial regression function:

yi = β0 + β1xi + β11xi² + β111xi³ + εi

Then, to see if the simpler first order model (a "line") is adequate in describing the trend in the
data, we could test the null hypothesis:

H0 : β11 = β111 = 0

But then … if a polynomial term of a given order is retained, then all related lower-order terms
are also retained. That is, if a quadratic term is deemed significant, then it is standard practice
to use this regression function:

yi = β0 + β1xi + β11xi² + εi

and not this one:

yi = β0 + β11xi² + εi

That is, we always fit the terms of a polynomial model in a hierarchical manner.
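
A sketch of how this hierarchical comparison might be carried out in Python with statsmodels
is given below. The column names igg and oxygen are assumptions about exerimmun.txt, and
the partial F-test compares the reduced (straight-line) model with the full cubic model.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

exer = pd.read_csv("exerimmun.txt", sep=r"\s+")   # assumed columns: igg, oxygen

linear = smf.ols("igg ~ oxygen", data=exer).fit()
cubic = smf.ols("igg ~ oxygen + I(oxygen**2) + I(oxygen**3)", data=exer).fit()

# Partial F-test of H0: the quadratic and cubic coefficients are both zero.
print(anova_lm(linear, cubic))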

12.7 - Further Example


Example - Poverty and Teen Birth Rate Data

(Data source: The U.S. Census Bureau and Mind On Statistics (3rd edition), Utts and
Heckard). In this example, the observations are the 50 states of the United States (poverty.txt
[12] - Note: remove data from the District of Columbia). The variables are y = percentage of
each state’s population living in households with income below the federally defined poverty
level in the year 2002, x1 = birth rate for females 15 to 17 years old in 2002, calculated as
births per 1000 persons in the age group, and x2 = birth rate for females 18 to 19 years old in
2002, calculated as births per 1000 persons in the age group.

The two x-variables are correlated (so we have multicollinearity). The correlation is about
0.95. A plot of the two x-variables is given below.

The figure below shows plots of y = poverty percentage versus each x-variable separately.
Both x-variables are linear predictors of the poverty percentage.

Minitab results for the two possible simple regressions and the multiple regression are given
below.

We note the following:

1. The value of the sample coefficient that multiplies a particular x-variable is not the same in
the multiple regression as it is in the relevant simple regression.
2. The R² for the multiple regression is not the sum of the R² values for the simple
regressions. An x-variable (either one) does not make an independent “add-on”
contribution in the multiple regression.
3. The 18 to 19 year-old birth rate variable is significant in the simple regression, but is not
significant in the multiple regression. This discrepancy is caused by the correlation between
the two x-variables. The 15 to 17 year-old birth rate is the stronger of the two x-variables,
and given its presence in the equation, the 18 to 19 year-old rate does not improve R²
enough to be significant. More specifically, the correlation between the two x-variables
has increased the standard errors of the coefficients, so we have less precise estimates
of the individual slopes. (A rough code sketch of these three fits follows this list.)
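
Here is the promised rough sketch of the three fits in Python. The column names (PovPct,
Brth15to17, Brth18to19) and the tab delimiter are assumptions about poverty.txt, so adjust
them to the actual file before running.

import pandas as pd
import statsmodels.formula.api as smf

pov = pd.read_csv("poverty.txt", sep="\t")   # assumed tab-delimited with the columns below

m15 = smf.ols("PovPct ~ Brth15to17", data=pov).fit()
m18 = smf.ols("PovPct ~ Brth18to19", data=pov).fit()
both = smf.ols("PovPct ~ Brth15to17 + Brth18to19", data=pov).fit()

# Compare coefficients and R-squared values across the two simple fits and the multiple fit.
for label, m in [("15-17 only", m15), ("18-19 only", m18), ("both", both)]:
    print(label, m.params.round(3).to_dict(), "R-squared:", round(m.rsquared, 3))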

12.8 - Extrapolation
"Extrapolation" beyond the "scope of the model" occurs when one uses an estimated
regression equation to estimate a mean µY or to predict a new response ynew for x values not
in the range of the sample data used to determine the estimated regression equation. In
general, it is dangerous to extrapolate beyond the scope of the model. The following example
illustrates why this is not a good thing to do.

Researchers measured the number of colonies of bacteria grown for various concentrations of
urine (ml/plate). The scope of the model — that is, the range of the x values — was 0 to
5.80 ml/plate. The researchers obtained the following estimated regression equation:

ŷ = 16.0667 + 1.61576x

Using the estimated regression equation, the researchers predicted the number of colonies at
11.60 ml/plate to be 16.0667 + 1.61576(11.60) or 34.8 colonies. But when the researchers
conducted the experiment at 11.60 ml/plate, they observed that the number of colonies
decreased dramatically to about 15.1 colonies:

The moral of the story is that the trend in the data as summarized by the estimated regression
equation does not necessarily hold outside the scope of the model.

12.9 - Other Regression Pitfalls


Nonconstant Variance
Excessive nonconstant variance can create technical difficulties with a multiple linear
regression model. For example, if the residual variance increases with the fitted values, then
prediction intervals will tend to be wider than they should be at low fitted values and narrower
than they should be at high fitted values. Some remedies for refining a model exhibiting
excessive nonconstant variance include the following:

Apply a variance-stabilizing transformation to the response variable, for example a
logarithmic transformation (or a square root transformation if a logarithmic
transformation is "too strong" or a reciprocal transformation if a logarithmic
transformation is "too weak"). We explored this in more detail in Lesson 9. (A brief code
sketch of this remedy and of weighted least squares follows this list.)
Weight the variances so that they can be different for each set of predictor values. This
leads to weighted least squares, in which the data observations are given different
weights when estimating the model. We'll cover this in Lesson 13.
A generalization of weighted least squares is to allow the regression errors to be
correlated with one another in addition to having different variances. This leads to
generalized least squares, in which various forms of nonconstant variance can be
modeled.
For some applications we can explicitly model the variance as a function of the mean,
E(Y). This approach uses the framework of generalized linear models, which we
discuss in Lesson 15.
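
Here is the brief sketch promised above, using simulated data whose error standard deviation
grows with x. It only illustrates the first two ideas (a log transformation of the response and a
weighted least squares fit); it is not a recipe for any particular data set.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 100)
y = 2 + 3 * x + rng.normal(scale=0.5 * x)          # error SD increases with x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                       # ignores the nonconstant variance
log_fit = sm.OLS(np.log(y), X).fit()               # variance-stabilizing transformation of y
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()   # weights inversely proportional to Var(y|x)

print("OLS slope:", round(ols_fit.params[1], 2), "WLS slope:", round(wls_fit.params[1], 2))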

Autocorrelation
One common way for the "independence" condition in a multiple linear regression model to
fail is when the sample data have been collected over time and the regression model fails to
effectively capture any time trends. In such a circumstance, the random errors in the model
are often positively correlated over time, so that each random error is more likely to be similar
to the previous random error than it would be if the random errors were independent of one
another. This phenomenon is known as autocorrelation (or serial correlation) and can
sometimes be detected by plotting the model residuals versus time. We'll explore this further
in Lesson 14.
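
As a rough illustration, the sketch below simulates a regression with positively autocorrelated
(AR(1)) errors and computes the Durbin-Watson statistic from the residuals; values well below
2 are a common informal signal of positive serial correlation. This is a generic demonstration,
not part of the course data.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 100
t = np.arange(n)

e = np.zeros(n)
for i in range(1, n):                  # AR(1) errors: each error resembles the previous one
    e[i] = 0.8 * e[i - 1] + rng.normal()

y = 1.0 + 0.05 * t + e
fit = sm.OLS(y, sm.add_constant(t)).fit()

print("Durbin-Watson:", round(durbin_watson(fit.resid), 2))   # well below 2 for these errors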

Overfitting
When building a regression model, we don't want to include unimportant or irrelevant
predictors whose presence can overcomplicate the model and increase our uncertainty about
the magnitudes of the effects for the important predictors (particularly if some of those
predictors are highly collinear). Such "overfitting" can occur the more complicated a model
becomes and the more predictor variables, transformations, and interactions are added to a
model. It is always prudent to apply a sanity check to any model being used to make
decisions. Models should always make sense, preferably grounded in some kind of
background theory or sensible expectation about the types of associations allowed between
variables. Predictions from the model should also be reasonable (over-complicated models
can give quirky results that may not reflect reality).

Excluding Important Predictor Variables


However, there is potentially greater risk from excluding important predictors than from
including unimportant ones. The linear association between two variables ignoring other
relevant variables can differ both in magnitude and direction from the association that controls
for other relevant variables. Whereas the potential cost of including unimportant predictors
might be increased difficulty with interpretation and reduced prediction accuracy, the potential
cost of excluding important predictors can be a completely meaningless model containing
misleading associations. Results can vary considerably depending on whether such
predictors are (inappropriately) excluded or (appropriately) included. These predictors are
sometimes called confounding or lurking variables, and their omission from a model can lead
to incorrect conclusions and poor decision-making.

Missing Data
Real-world datasets frequently contain missing values, so that we do not know the values of
particular variables for some of the sample observations. For example, such values may be
missing because they were impossible to obtain during data collection. Dealing with missing
data is a challenging task. Missing data have the potential to adversely affect a regression
analysis by reducing the total usable sample size. The best solution to this problem is to try
extremely hard to avoid having missing data in the first place. When there are missing values
that are impossible or too costly to avoid, one approach is to replace the missing values with
plausible estimates, known as imputation. Another (easier) approach is to consider only
models that contain predictors with no (or few) missing values. This may be unsatisfactory,
however, because even a predictor variable with a large number of missing values can
contain useful information.
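
As a tiny illustration of the imputation idea, here is a sketch using made-up numbers and
simple mean imputation, which is only one of many possible strategies (and often not the best
one).

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y":  [10.0, 12.5, 9.8, 11.1, 13.0],
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0],    # one missing predictor value
    "x2": [3.1, 2.9, 3.5, np.nan, 3.0],
})

# Replace each missing value with its column mean so that no rows are dropped from the fit.
imputed = df.fillna(df.mean(numeric_only=True))
print(imputed)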

Power and Sample Size


In small datasets, a lack of observations can lead to poorly estimated models with large
standard errors. Such models are said to lack statistical power because there is insufficient
data to be able to detect significant associations between the response and predictors. So,
how much data do we need to conduct a successful regression analysis? A common rule of
thumb is that 10 data observations per predictor variable is a pragmatic lower bound for
sample size. However, it is not so much the number of data observations that determines
whether a regression model is going to be useful, but rather whether the resulting model
satisfies the LINE conditions.

In some circumstances, a model applied to fewer than 10 data observations per predictor
variable might be perfectly fine (if, say, the model fits the data really well and the LINE
conditions seem fine), while in other circumstances a model applied to a few hundred data
points per predictor variable might be pretty poor (if, say, the model fits the data badly and
one or more conditions are seriously violated). For another example, in general we’d need
more data to model an interaction than we would for a similar model without the interaction.
However, it is difficult to say exactly how much data would be needed. It is possible that we
could adequately model interaction with a relatively small number of observations if the
interaction effect was pronounced and there was little statistical error. Conversely, in datasets
with only weak interaction effects and relatively large statistical error, it might take a much
larger number of observations to have a satisfactory model. In practice, we have methods for
assessing the LINE conditions, so it is possible to consider whether an interaction model
approximately satisfies the assumptions on a case-by-case basis.

In conclusion, there is not really a good standard for determining sample size given the
number of predictors, since the only truthful answer is, “It depends.” In many cases, it soon
becomes pretty clear when working on a particular dataset if we are trying to fit a model with
too many predictor terms for the number of sample observations (results can start to get a
little odd and standard errors greatly increase). From a different perspective, if we are
designing a study and need to know how much data to collect, then we need to get into
sample size and power calculations, which rapidly become quite complex. Some statistical
software packages will do sample size and power calculations, and there is even some
software specifically designed to do just that. When designing a large, expensive study, it is
recommended that such software be used or to get advice from a statistician with sample size
expertise.
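
One generic way to get a feel for power without specialized software is simulation: pick a
plausible slope and error standard deviation, generate many datasets of a given size, and
record how often the slope's t-test rejects. The sketch below does this for a single hypothetical
predictor; the chosen slope, error SD, and x-range are arbitrary assumptions, not
recommendations.

import numpy as np
import statsmodels.api as sm

def simulated_power(n, slope=0.1, sigma=1.0, reps=1000, seed=0):
    """Fraction of simulations in which the slope's t-test rejects at the 5% level."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        x = rng.uniform(0, 10, n)
        y = 1.0 + slope * x + rng.normal(scale=sigma, size=n)
        fit = sm.OLS(y, sm.add_constant(x)).fit()
        if fit.pvalues[1] < 0.05:      # p-value for the slope coefficient
            rejections += 1
    return rejections / reps

for n in (10, 30, 100):
    print(n, simulated_power(n))       # estimated power increases with the sample size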

Source URL: https://onlinecourses.science.psu.edu/stat501/node/343

Links:
[1] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bloodpress.txt
[2] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/uncorrpreds.txt
[3] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/uncorrelated.txt
[4] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/allentest.txt
[5] https://onlinecourses.science.psu.edu/stat501/node/240
[6] https://onlinecourses.science.psu.edu/stat501/node/247
[7] https://onlinecourses.science.psu.edu/stat501/node/244
[8] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/cement.txt
[9] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/allentestn23.txt
[10] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/exerimmun.txt
[11] https://onlinecourses.science.psu.edu/stat501/../data/lifeexpect.txt
[12] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/examples/poverty.txt
