So far, in our study of multiple regression models, we have ignored something that we
probably shouldn't have — and that's what is called multicollinearity. We're going to correct
our blissful ignorance in this lesson.
Multicollinearity exists when two or more of the predictors in a regression model are
moderately or highly correlated. Unfortunately, when it exists, it can wreak havoc on our
analysis and thereby limit the research conclusions we can draw. As we will soon learn, when
multicollinearity exists, any of the following pitfalls can be exacerbated:
the estimated regression coefficient of any one variable depends on which other
predictors are included in the model
the precision of the estimated regression coefficients decreases as more predictors are
added to the model
the marginal contribution of any one predictor variable in reducing the error sum of
squares depends on which other predictors are already in the model
hypothesis tests for βk = 0 may yield different conclusions depending on which
predictors are in the model
In this lesson, we'll take a look at an example or two that illustrates each of the above
outcomes. Then, we'll spend some time learning how not only to detect multicollinearity, but
also how to reduce it once we've found it.
We'll also consider other regression pitfalls, including extrapolation, nonconstant variance,
autocorrelation, overfitting, excluding important predictor variables, missing data, and power
and sample size.
1 of 52 11-02-2018, 02:53
https://onlinecourses.science.psu.edu/stat501/print/book/export/html/343
Unfortunately, researchers often can't control the predictors. Obvious examples include a
person's gender, race, grade point average, math SAT score, IQ, and starting salary. For each
of these predictor examples, the researcher just observes the values as they occur for the
people in her random sample.
Multicollinearity happens more often than not in such observational studies. And,
unfortunately, regression analyses most often take place on data obtained from observational
studies. If you aren't convinced, consider the example data sets for this course. Most of the
data sets were obtained from observational studies, not experiments. It is for this reason that
we need to fully understand the impact of multicollinearity on our regression analyses.
Types of multicollinearity
In the case of structural multicollinearity, the multicollinearity is induced by what you have done, for example, by creating the predictor x² from the predictor x. Data-based multicollinearity, by contrast, arises from the data themselves, and it is the more troublesome of the two types. Unfortunately, it is also the type we encounter most often!
Example
The matrix plot of the blood pressure data set (bloodpress.txt [1]) allows us to investigate the various marginal relationships between the response BP and the predictors. Blood pressure appears to be related fairly strongly to Weight and BSA, and hardly related at all to Stress level.
The matrix plots also allow us to investigate whether or not relationships exist among the
predictors. For example, Weight and BSA appear to be strongly related, while Stress and BSA
appear to be hardly related at all.
The correlation matrix provides further evidence of the above claims. Blood pressure appears to be related fairly
strongly to Weight (r = 0.950) and BSA (r = 0.866), and hardly related at all to Stress level (r =
0.164). And, Weight and BSA appear to be strongly related (r = 0.875), while Stress and BSA
appear to be hardly related at all (r = 0.018). The high correlation among some of the
predictors suggests that data-based multicollinearity exists.
Now, what we need to learn is the impact of the multicollinearity on regression analysis. Let's
go do it!
Then, on the next page, we'll investigate the effects that highly correlated predictors have on
regression analyses. In doing so, we'll learn — and therefore be able to summarize — the
various effects multicollinearity has on regression analyses.
Consider the following matrix plot of the response y and two predictors x1 and x2, of a
contrived data set (uncorrpreds.txt [2]), in which the predictors are perfectly uncorrelated:
As you can see, there is no apparent relationship at all between the predictors x1 and x2. That
is, the correlation between x1 and x2 is zero:
Now, let's just proceed quickly through the output of a series of regression analyses collecting
various pieces of information along the way. When we're done, we'll review what we learned
by collating the various items in a summary table.
The regression of the response y on the single predictor x1 yields the estimated coefficient b1 = -1.00, the standard error se(b1) = 1.47, and the regression sum of squares SSR(x1) = 8.000.
The regression of the response y on the single predictor x2 yields the estimated coefficient b2 = -1.75, the standard error se(b2) = 1.35, and the regression sum of squares SSR(x2) = 24.50.
The regression of the response y on the predictors x1 and x2 (in that order):
yields the estimated coefficients b1 = -1.00 and b2 = -1.75, the standard errors se(b1) = 1.41
and se(b2) = 1.41, and the sequential sum of squares SSR(x2|x1) = 24.500.
The regression of the response y on the predictors x2 and x1 (in that order):
yields the estimated coefficients b1 = -1.00 and b2 = -1.75, the standard errors se(b1) = 1.41
and se(b2) = 1.41, and the sequential sum of squares SSR(x1|x2) = 8.000.
Model               b1      se(b1)   b2      se(b2)   Sequential SS
x1 only            -1.00    1.47     ---     ---      SSR(x1) = 8.000
x2 only             ---     ---     -1.75    1.35     SSR(x2) = 24.50
x1, x2 (in order)  -1.00    1.41    -1.75    1.41     SSR(x2|x1) = 24.500
x2, x1 (in order)  -1.00    1.41    -1.75    1.41     SSR(x1|x2) = 8.000
What do we observe?
The estimated slope coefficients b1 and b2 are the same regardless of the model used.
The standard errors se(b1) and se(b2) don't change much at all from model to model.
The sum of squares SSR(x1) is the same as the sequential sum of squares SSR(x1|x2).
The sum of squares SSR(x2) is the same as the sequential sum of squares SSR(x2|x1).
These all seem to be good things! Because the slope estimates stay the same, the effect on
the response ascribed to a predictor doesn't depend on the other predictors in the model.
Because SSR(x1) = SSR(x1|x2), the marginal contribution that x1 has in reducing the
variability in the response y doesn't depend on the predictor x2. Similarly, because SSR(x2) =
SSR(x2|x1), the marginal contribution that x2 has in reducing the variability in the response y
doesn't depend on the predictor x1.
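These facts can be checked numerically. Below is a minimal Python/NumPy sketch, using a small contrived data set of my own rather than the lesson's uncorrpreds.txt, showing that when two predictors are perfectly uncorrelated, each slope estimate is identical whether or not the other predictor is in the model:

```python
import numpy as np

# Contrived data: a 2x2 factorial layout makes x1 and x2 exactly
# uncorrelated (r = 0). The y values are arbitrary.
x1 = np.array([-1., -1., 1., 1., -1., -1., 1., 1.])
x2 = np.array([-1., 1., -1., 1., -1., 1., -1., 1.])
y = np.array([9., 6., 8., 2., 10., 5., 7., 3.])
assert np.corrcoef(x1, x2)[0, 1] == 0.0

def slopes(y, *predictors):
    """Least-squares slope estimates (intercept dropped from the result)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

b1_alone = slopes(y, x1)[0]        # y regressed on x1 only
b2_alone = slopes(y, x2)[0]        # y regressed on x2 only
b1_both, b2_both = slopes(y, x1, x2)

# The slope estimates are unchanged by the presence of the other predictor.
assert np.isclose(b1_alone, b1_both)
assert np.isclose(b2_alone, b2_both)
```

The same check on the sequential sums of squares would show SSR(x1) = SSR(x1|x2) exactly, for the same reason: with an orthogonal design, each predictor's contribution is unaffected by the other.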
These are the things we can hope for in a regression analysis — but then reality sets in! Recall
that we obtained the above results for a contrived data set, in which the predictors are
perfectly uncorrelated. Do we get similar results for real data with only nearly uncorrelated
predictors? Let's see!
To investigate this question, let's go back and take a look at the blood pressure data set
(bloodpress.txt [1]). In particular, let's focus on the relationships among the response y = BP
and the predictors x3 = BSA and x6 = Stress:
As the above matrix plot and the following correlation matrix suggest:
there is a nearly non-existent relationship between x3 = BSA and x6 = Stress (r = 0.018). That is, the two predictors are nearly perfectly uncorrelated.
What effect do these nearly perfectly uncorrelated predictors have on regression analyses?
Let's proceed similarly through the output of a series of regression analyses collecting various
pieces of information along the way. When we're done, we'll review what we learned by
collating the various items in a summary table.
The regression of the response y = BP on the single predictor x6 = Stress yields the estimated coefficient b6 = 0.0240, the standard error se(b6) = 0.0340, and the regression sum of squares SSR(x6) = 15.04.
The regression of the response y = BP on the single predictor x3 = BSA yields the estimated coefficient b3 = 34.44, the standard error se(b3) = 4.69, and the regression sum of squares SSR(x3) = 419.858.
The regression of the response y = BP on the predictors x6= Stress and x3 = BSA (in that
order):
yields the estimated coefficients b6 = 0.0217 and b3 = 34.33, the standard errors se(b6) = 0.0170 and se(b3) = 4.61, and the sequential sum of squares SSR(x3|x6) = 417.07.
Finally, the regression of the response y = BP on the predictors x3 = BSA and x6= Stress
(in that order):
yields the estimated coefficients b6 = 0.0217 and b3 = 34.33, the standard errors se(b6) = 0.0170 and se(b3) = 4.61, and the sequential sum of squares SSR(x6|x3) = 12.26.
Model               b6       se(b6)   b3      se(b3)   Sequential SS
x6 only             0.0240   0.0340   ---     ---      SSR(x6) = 15.04
x3 only             ---      ---      34.44   4.69     SSR(x3) = 419.858
x6, x3 (in order)   0.0217   0.0170   34.33   4.61     SSR(x3|x6) = 417.07
x3, x6 (in order)   0.0217   0.0170   34.33   4.61     SSR(x6|x3) = 12.26
The slope estimates b3 and b6 are not identical, but very similar, regardless of the
predictors in the model.
The sum of squares SSR(x3) is not the same, but very similar to the sequential sum of
squares SSR(x3|x6).
The sum of squares SSR(x6) is not the same, but very similar to the sequential sum of
squares SSR(x6|x3).
Again, these are all good things! In short, the effect on the response ascribed to a predictor is
similar regardless of the other predictors in the model. And, the marginal contribution of a
predictor doesn't appear to depend much on the other predictors in the model.
This exercise reviews the benefits of having perfectly uncorrelated predictor variables.
The results of this exercise demonstrate a strong argument for conducting "designed
experiments" in which the researcher sets the levels of the predictor variables in
advance, as opposed to conducting an "observational study" in which the researcher
merely observes the levels of the predictor variables as they happen. Unfortunately,
many regression analyses are conducted on observational data rather than
experimental data, limiting the strength of the conclusions that can be drawn from the
data. As this exercise demonstrates, you should conduct an experiment, whenever
possible, not an observational study. Use the (contrived) data stored in uncorrelated.txt
[3] to complete this lab exercise.
1. Using the Stat >> Basic Statistics >> Correlation... command in Minitab, calculate the
correlation coefficient between X1 and X2. Are the two variables perfectly uncorrelated?
2. Fit the simple linear regression model with y as the response and x1 as the single
predictor:
3. Now, fit the simple linear regression model with y as the response and x2 as the
single predictor:
4. Now, fit the multiple linear regression model with y as the response and x1 as the first
predictor and x2 as the second predictor:
What is the value of the estimated slope coefficient b1? Is the estimate b1 different
than that obtained when x1 was the only predictor in the model?
What is the value of the estimated slope coefficient b2? Is the estimate b2 different
than that obtained when x2 was the only predictor in the model?
What is the sequential sum of squares, SSR (X2|X1)? Does the reduction in the
error sum of squares when x2 is added to the model depend on whether x1 is
already in the model?
5. Now, fit the multiple linear regression model with y as the response and x2 as the first
predictor and x1 as the second predictor:
What is the sequential sum of squares, SSR (X1|X2)? Does the reduction in the
error sum of squares when x1 is added to the model depend on whether x2 is
already in the model?
6. When the predictor variables are perfectly uncorrelated, is it possible to quantify the
effect a predictor has on the response without regard to the other predictors?
7. In what way does this exercise demonstrate the benefits of conducting a designed
experiment rather than an observational study?
Let's return again to the blood pressure data set (bloodpress.txt [1]). This time, let's focus,
however, on the relationships among the response y = BP and the predictors x2 = Weight and
x3 = BSA:
there appears to be not only a strong relationship between y = BP and x2 = Weight (r = 0.950)
and a strong relationship between y = BP and the predictor x3 = BSA (r = 0.866), but also a
strong relationship between the two predictors x2 = Weight and x3 = BSA (r = 0.875).
Incidentally, it shouldn't be too surprising that a person's weight and body surface area are
highly correlated.
What impact does the strong correlation between the two predictors have on the regression
analysis and the subsequent conclusions we can draw? Let's proceed as before by reviewing
the output of a series of regression analyses and collecting various pieces of information
along the way. When we're done, we'll review what we learned by collating the various items
in a summary table.
The regression of the response y = BP on the single predictor x2 = Weight yields the estimated coefficient b2 = 1.2009, the standard error se(b2) = 0.0930, and the regression sum of squares SSR(x2) = 505.472.
The regression of the response y = BP on the single predictor x3 = BSA yields the estimated coefficient b3 = 34.44, the standard error se(b3) = 4.69, and the regression sum of squares SSR(x3) = 419.858.
The regression of the response y = BP on the predictors x2= Weight and x3 = BSA (in that
order):
yields the estimated coefficients b2 = 1.039 and b3 = 5.83, the standard errors se(b2) = 0.193
and se(b3) = 6.06, and the sequential sum of squares SSR(x3|x2) = 2.814.
And finally, the regression of the response y = BP on the predictors x3= BSA and x2= Weight
(in that order):
yields the estimated coefficients b2 = 1.039 and b3 = 5.83, the standard errors se(b2) = 0.193
and se(b3) = 6.06, and the sequential sum of squares SSR(x2|x3) = 88.43.
Model               b2       se(b2)   b3      se(b3)   Sequential SS
x2 only             1.2009   0.0930   ---     ---      SSR(x2) = 505.472
x3 only             ---      ---      34.44   4.69     SSR(x3) = 419.858
x2, x3 (in order)   1.039    0.193    5.83    6.06     SSR(x3|x2) = 2.814
x3, x2 (in order)   1.039    0.193    5.83    6.06     SSR(x2|x3) = 88.43
Geez — things look a little different than before. It appears as if, when predictors are highly
correlated, the answers you get depend on the predictors in the model. That's not good! Let's
proceed through the table and in so doing carefully summarize the effects of multicollinearity
on the regression analyses.
Effect #1
When predictor variables are correlated, the estimated regression coefficient of any one
variable depends on which other predictor variables are included in the model.
Variables in model   b2     b3
x2                   1.20   ---
x3                   ---    34.4
x2, x3               1.04   5.83
Note that, depending on which predictors we include in the model, we obtain wildly different
estimates of the slope parameter for x3 = BSA!
If x3 = BSA is the only predictor included in our model, we claim that for every additional
one square meter increase in body surface area (BSA), blood pressure (BP) increases
by 34.4 mm Hg.
On the other hand, if x2 = Weight and x3 = BSA are both included in our model, we
claim that for every additional one square meter increase in body surface area (BSA),
holding weight constant, blood pressure (BP) increases by only 5.83 mm Hg.
This is a huge difference! Our hope would be, of course, that two regression analyses
wouldn't lead us to such seemingly different scientific conclusions. The high correlation
among the two predictors is what causes the large discrepancy. When interpreting b3 = 34.4
in the model that excludes x2 = Weight, keep in mind that when we increase x3 = BSA then x2
= Weight also increases and both factors are associated with increased blood pressure.
However, when interpreting b3 = 5.83 in the model that includes x2 = Weight, we keep x2 =
Weight fixed, so the resulting increase in blood pressure is much smaller.
The amazing thing is that even predictors that are not included in the model, but are highly
correlated with the predictors in our model, can have an impact! For example, consider a
pharmaceutical company's regression of territory sales on territory population and per capita
income. One would, of course, expect that as the population of the territory increases, so
would the sales in the territory. But, contrary to this expectation, the pharmaceutical
company's regression analysis deemed the estimated coefficient of territory population to be
negative. That is, as the population of the territory increases, the territory sales were predicted to
decrease. After further investigation, the pharmaceutical company determined that the larger
the territory, the larger too the competitor's market penetration. That is, the competitor kept
the sales down in territories with large populations.
In summary, the competitor's market penetration was not included in the original model. Yet, it
was later deemed to be strongly positively correlated with territory population. Even though
the competitor's market penetration was not included in the original model, its strong
correlation with one of the predictors in the model, greatly affected the conclusions arising
from the regression analysis.
The moral of the story is that if you get estimated coefficients that just don't make sense,
there is probably a very good explanation. Rather than stopping your research and running off
to report your unusual results, think long and hard about what might have caused the results.
That is, think about the system you are studying and all of the extraneous variables that could
influence the system.
Effect #2
When predictor variables are correlated, the precision of the estimated regression coefficients
decreases as more predictor variables are added to the model.
Variables in model   se(b2)   se(b3)
x2                   0.093    ---
x3                   ---      4.69
x2, x3               0.193    6.06
The standard error for the estimated slope b2 obtained from the model including both x2 =
Weight and x3 = BSA is about double the standard error for the estimated slope b2 obtained
from the model including only x2 = Weight. And, the standard error for the estimated slope b3
obtained from the model including both x2 = Weight and x3 = BSA is about 30% larger than
the standard error for the estimated slope b3 obtained from the model including only x3 =
BSA.
What is the major implication of these increased standard errors? Recall that the standard
errors are used in the calculation of the confidence intervals for the slope parameters. That is,
increased standard errors of the estimated slopes lead to wider confidence intervals, and
hence less precise estimates of the slope parameters.
Three plots help clarify this second effect. Recall that the first data set (uncorrpreds.txt [2]) that we investigated in this lesson contained perfectly uncorrelated predictor variables (r = 0). Upon regressing the response y on the uncorrelated predictors x1 and x2, Minitab (or any other statistical software, for that matter) will find the "best fitting" plane through the data points:
Click on the Best Fitting Plane button in order to see the best fitting plane for this particular
set of responses. Now, here's where you have to turn on your imagination. The primary
characteristic of the data — because the predictors are perfectly uncorrelated — is that the
predictor values are spread out and anchored in each of four corners, providing a solid base
over which to draw the response plane. Now, even if the responses (y) varied somewhat from
sample to sample, the plane couldn't change all that much because of the solid base. That is,
the estimated coefficients, b1 and b2, couldn't change that much, and hence the standard
errors of the estimated coefficients, se(b1) and se(b2), will necessarily be small.
Now, let's take a look at the second example (bloodpress.txt [1]) that we investigated in this
lesson, in which the predictors x3 = BSA and x6 = Stress were nearly perfectly uncorrelated (r
= 0.018). Upon regressing the response y = BP on the nearly uncorrelated predictors x3 =
BSA and x6 = Stress, Minitab will again find the "best fitting" plane through the data points:
Click on the Best Fitting Plane button in order to see the best fitting plane for this particular
set of responses. Again, the primary characteristic of the data — because the predictors are
nearly perfectly uncorrelated — is that the predictor values are spread out and just about
anchored in each of four corners, providing a solid base over which to draw the response
plane. Again, even if the responses (y) varied somewhat from sample to sample, the plane
couldn't change all that much because of the solid base. That is, the estimated coefficients, b3
and b6, couldn't change all that much. The standard errors of the estimated coefficients,
se(b3) and se(b6), again will necessarily be small.
Now, let's see what happens when the predictors are highly correlated. Let's return to our
most recent example (bloodpress.txt [1]), in which the predictors x2 = Weight and x3 = BSA are
very highly correlated (r = 0.875). Upon regressing the response y = BP on the predictors x2 =
Weight and x3 = BSA, Minitab will again find the "best fitting" plane through the data points.
Do you see the difficulty in finding the best fitting plane in this situation? The primary
characteristic of the data — because the predictors are so highly correlated — is that the
predictor values tend to fall in a straight line. That is, there are no anchors in two of the four
corners. Therefore, the base over which the response plane is drawn is not very solid.
Let's put it this way — would you rather sit on a chair with four legs or one with just two legs?
If the responses (y) varied somewhat from sample to sample, the position of the plane could
change significantly. That is, the estimated coefficients, b2 and b3, could change substantially.
The standard errors of the estimated coefficients, se(b2) and se(b3), will then be necessarily
larger. Below is an animated view (no sound) of the problem that highly correlated predictors can cause in finding the best fitting plane.
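The chair analogy can also be checked by simulation. The sketch below (Python/NumPy, with made-up predictors rather than the lesson's data sets) refits the plane many times while only the responses vary, and compares the sampling spread of b1 when x2 is uncorrelated with x1 versus nearly collinear with it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

def slope_spread(x1, x2, reps=2000):
    """Std. dev. of the fitted slope b1 over repeated noisy responses."""
    X = np.column_stack([np.ones(n), x1, x2])
    b1s = []
    for _ in range(reps):
        y = 2 + 1.0 * x1 + 1.0 * x2 + rng.normal(0, 1, n)
        b1s.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return float(np.std(b1s))

x1 = rng.normal(0, 1, n)
x2_solid = rng.normal(0, 1, n)          # uncorrelated: a four-cornered base
x2_shaky = x1 + rng.normal(0, 0.1, n)   # nearly collinear: a two-legged chair

sd_solid = slope_spread(x1, x2_solid)
sd_shaky = slope_spread(x1, x2_shaky)
assert sd_shaky > 3 * sd_solid   # the slope is far less stable
```

With the solid base, the fitted slope barely moves from sample to sample; with the nearly collinear base, its sampling standard deviation is several times larger, which is exactly the standard-error inflation seen in the table above.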
Effect #3
When predictor variables are correlated, the marginal contribution of any one predictor
variable in reducing the error sum of squares varies depending on which other variables are
already in the model.
This should make intuitive sense. In essence, weight appears to explain some of the variation
in blood pressure. However, because weight and body surface area are highly correlated,
most of the variation in blood pressure explained by weight could just as easily have been
explained by body surface area. Therefore, once you take into account a person's body
surface area, there's not much variation left in the blood pressure for weight to explain.
Incidentally, we see a similar phenomenon when we enter the predictors into the model in the
reverse order. That is, regressing the response y = BP on the predictor x3 = BSA, we obtain
SSR(x3) = 419.858. But, regressing the response y = BP on the two predictors x2 = Weight
and x3 = BSA (in that order), we obtain SSR(x3|x2) = 2.814. The first model suggests that
body surface area reduces the error sum of squares substantially (by 419.858), and the
second model suggests that body surface area doesn't reduce the error sum of squares all
that much (by only 2.814) once a person's weight is taken into account.
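The same pattern is easy to reproduce numerically. Here is a small Python/NumPy sketch, with invented stand-ins for Weight, BSA, and BP rather than the actual bloodpress.txt values, computing a predictor's regression sum of squares alone versus its sequential sum of squares after a correlated predictor:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x2 = rng.normal(85, 10, n)                  # stand-in for Weight
x3 = 0.02 * x2 + rng.normal(0, 0.05, n)     # stand-in for BSA (tied to Weight)
y = 1.2 * x2 + rng.normal(0, 2, n)          # stand-in for BP

def sse(y, *predictors):
    """Error sum of squares after regressing y on the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum(resid ** 2))

ssr_x3 = sse(y) - sse(y, x3)                    # SSR(x3): x3 on its own
ssr_x3_given_x2 = sse(y, x2) - sse(y, x2, x3)   # SSR(x3|x2): x3 after x2

# Alone, x3 explains a lot; after x2, it has almost nothing left to add.
assert ssr_x3 > 10 * ssr_x3_given_x2
```

Each sum of squares is computed as a reduction in the error sum of squares, which is exactly how the sequential sums of squares in the tables above are defined.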
Effect #4
When predictor variables are correlated, hypothesis tests for βk = 0 may yield different
conclusions depending on which predictor variables are in the model. (This effect is a direct
consequence of the three previous effects.)
To illustrate this effect, let's once again quickly proceed through the output of a series of
regression analyses, focusing primarily on the outcome of the t-tests for testing H0 : βBSA = 0
and H0 : βWeight = 0.
The regression of the response y = BP on the single predictor x3 = BSA indicates that the P-value associated with the t-test for testing H0 : βBSA = 0 is 0.000... < 0.01. There is sufficient evidence at the 0.05 level to conclude that blood pressure is significantly related to body surface area.
The regression of the response y = BP on the single predictor x2 = Weight indicates that the P-value associated with the t-test for testing H0 : βWeight = 0 is 0.000... < 0.01. There is sufficient evidence at the 0.05 level to conclude that blood pressure is significantly related to weight.
And, the regression of the response y = BP on the predictors x2 = Weight and x3 = BSA:
indicates that the P-value associated with the t-test for testing H0 : βWeight = 0 is 0.000... <
0.01. There is sufficient evidence at the 0.05 level to conclude that, after taking into account
body surface area, blood pressure is significantly related to weight.
The regression also indicates that the P-value associated with the t-test for testing H0 : βBSA
= 0 is 0.350. There is insufficient evidence at the 0.05 level to conclude that blood pressure is
significantly related to body surface area after taking into account weight. This might sound
contradictory to what we claimed earlier, namely that blood pressure is indeed significantly
related to body surface area. Again, what is going on here is that, once you take into account a person's weight, body surface area doesn't explain much of the remaining variability in blood pressure readings.
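This flip in significance can be reproduced with a short simulation. The sketch below (Python/NumPy, hypothetical data mimicking the Weight/BSA situation, not the actual bloodpress.txt values) computes the t-statistic for the "BSA-like" predictor alone and then alongside its correlated companion:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20
x2 = rng.normal(85, 10, n)                  # "Weight"-like predictor
x3 = 0.02 * x2 + rng.normal(0, 0.05, n)     # "BSA"-like, highly correlated
y = 1.2 * x2 + rng.normal(0, 2, n)          # "BP"-like response

def t_stats(y, *predictors):
    """t-statistics (b / se) for each slope in an OLS fit."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = np.sum(resid ** 2) / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return (b / se)[1:]

t_alone = t_stats(y, x3)[0]        # x3 by itself: strongly significant
t_joint = t_stats(y, x2, x3)[1]    # x3 after x2: much weaker
print("t for x3 alone:", round(float(t_alone), 1))
print("t for x3 given x2:", round(float(t_joint), 1))
```

The t-statistic for the correlated predictor is large on its own but collapses once its companion enters the model, mirroring the 0.000 versus 0.350 P-values above.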
Effect #5
High multicollinearity among predictor variables does not prevent good, precise predictions of
the response within the scope of the model.
Well, okay, it's not an effect, and it's not bad news either! It is good news! If the primary
purpose of your regression analysis is to estimate a mean response μY or to predict a new
response y, you don't have to worry much about multicollinearity.
For example, suppose you are interested in predicting the blood pressure (y = BP) of an
individual whose weight is 92 kg and whose body surface area is 2 square meters:
Because the point (2, 92) falls within the scope of the model, you'll still get good, reliable
predictions of the response y, regardless of the correlation that exists among the two
predictors BSA and Weight. Geometrically, what is happening here is that the best fitting
plane through the responses may tilt from side to side from sample to sample (because of the
correlation), but the center of the plane (in the scope of the model) won't change all that
much.
The following output illustrates how the predictions don't change all that much from model to
model:
The first output yields a predicted blood pressure of 112.7 mm Hg for a person whose weight
is 92 kg based on the regression of blood pressure on weight. The second output yields a
predicted blood pressure of 114.1 mm Hg for a person whose body surface area is 2 square
meters based on the regression of blood pressure on body surface area. And the last output
yields a predicted blood pressure of 112.8 mm Hg for a person whose body surface area is 2
square meters and whose weight is 92 kg based on the regression of blood pressure on body
surface area and weight. Reviewing the confidence intervals and prediction intervals, you can
see that they too yield similar results regardless of the model.
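A quick numerical sketch of this stability (Python/NumPy; the data are invented, not the bloodpress.txt output quoted above): three models built from strongly correlated predictors give nearly the same fitted value at a point inside the scope of the model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
w = rng.normal(92, 8, n)                     # "Weight"-like predictor
bsa = 0.021 * w + rng.normal(0, 0.05, n)     # "BSA"-like, tied to w
bp = 50 + 0.7 * w + rng.normal(0, 1, n)      # "BP"-like response

def predict(xnew, y, *predictors):
    """Fitted value at xnew from an OLS fit of y on the predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.concatenate([[1.0], xnew]) @ b)

p_w = predict([92.0], bp, w)                 # model with w only
p_bsa = predict([1.93], bp, bsa)             # model with bsa only
p_both = predict([92.0, 1.93], bp, w, bsa)   # model with both

# All three models land on nearly the same prediction.
print(round(p_w, 1), round(p_bsa, 1), round(p_both, 1))
```

Even though the individual slope estimates would differ sharply between these models, the fitted values inside the scope of the data agree closely, just as the 112.7, 114.1, and 112.8 mm Hg predictions above do.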
Now, in short, what are the major effects that multicollinearity has on our use of a regression model to answer our research questions? In the presence of multicollinearity:
It is okay to use an estimated regression model to predict the response or to estimate the mean response, as long as we stay within the scope of the model.
It no longer makes sense to interpret an estimated slope coefficient as the change in the mean response for each additional unit increase in that predictor, with all the other predictors held constant.
The first point is, of course, addressed above. The second point is a direct consequence of the correlation among the predictors. It wouldn't make sense to talk about holding the values of correlated predictors constant, since changing one predictor necessarily would change the values of the others.
1. Determine the pairwise correlations among the predictor variables in order to get an
idea of the extent to which the predictor variables are (pairwise) correlated. (See
Minitab Help: Creating a correlation matrix [5]). Also, create a matrix plot of the data in
order to get a visual portrayal of the relationship among the response and predictor
variables. (See Minitab Help: Creating a simple matrix of scatter plots [6]).
2. Fit the simple linear regression model with y = ACL as the response and x1 = Vocab
as the predictor. After fitting your model, request that Minitab predict the response y =
ACL when x1 = 25. (See Minitab Help: Performing a multiple regression analysis — with
options [7]).
3. Now, fit the simple linear regression model with y = ACL as the response and x3 =
SDMT as the predictor. After fitting your model, request that Minitab predict the
response y = ACL when x3 = 40. (See Minitab Help: Performing a multiple regression
analysis — with options [7]).
4. Fit the multiple linear regression model with y = ACL as the response and x3 = SDMT
as the first predictor and x1 = Vocab as the second predictor. After fitting your model,
request that Minitab predict the response y = ACL when x1 = 25 and x3 = 40. (See
Minitab Help: Performing a multiple regression analysis — with options [7]).
Now, what is the value of the estimated slope coefficient b1? and b3?
Now, what is the value of the standard error of b1? and b3?
What is the sequential sum of squares, SSR (X1|X3)?
What is the predicted response of y = ACL when x1 = 25 and x3 = 40?
The analysis exhibits the signs of multicollinearity, such as:
the estimates of the coefficients vary from model to model
the t-tests for each of the individual slopes are non-significant (P > 0.05), but the overall F-test for testing that all of the slopes are simultaneously 0 is significant (P < 0.05)
the correlations among pairs of predictor variables are large
Looking at correlations only among pairs of predictors, however, is limiting. It is possible that
the pairwise correlations are small, and yet a linear dependence exists among three or even
more variables, for example, if X3 = 2X1 + 5X2 + error, say. That's why many regression
analysts often rely on what are called variance inflation factors (VIF) to help detect
multicollinearity.
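The point can be illustrated with a simulation (Python/NumPy, invented data): one predictor is built as a noisy sum of nine others, so every pairwise correlation looks modest, yet the variance inflation factor exposes the joint dependence.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 9
X = rng.normal(size=(n, p))
# This predictor depends on ALL nine others at once, plus a little noise.
x_dep = X.sum(axis=1) + rng.normal(0, 0.5, n)

# Each pairwise correlation with x_dep is only moderate (about 0.33)...
pairwise = [abs(np.corrcoef(X[:, j], x_dep)[0, 1]) for j in range(p)]
print("largest pairwise r:", round(max(pairwise), 2))

# ...but regressing x_dep on the others gives an R^2 near 1, so VIF is huge.
A = np.column_stack([np.ones(n), X])
resid = x_dep - A @ np.linalg.lstsq(A, x_dep, rcond=None)[0]
r2 = 1 - np.sum(resid ** 2) / np.sum((x_dep - x_dep.mean()) ** 2)
print("VIF for the dependent predictor:", round(1 / (1 - r2), 1))
```

No single pairwise correlation here would raise an alarm, which is precisely why VIFs, and not just the correlation matrix, are needed to detect this kind of multicollinearity.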
As the name suggests, a variance inflation factor (VIF) quantifies how much the variance is
inflated. But what variance? Recall that we learned previously that the standard errors — and
hence the variances — of the estimated coefficients are inflated when multicollinearity exists.
So, the variance inflation factor for the estimated coefficient bk —denoted VIFk —is just the
factor by which the variance is inflated.
Let's be a little more concrete. For the model in which xk is the only predictor, the variance of
the estimated coefficient bk is:

Var(bk)min = σ² / Σ(xik − x̄k)²

Note that we add the subscript "min" in order to denote that it is the smallest the variance can
be. Don't worry about how this variance is derived; we just need to keep track of this
baseline variance, so we can see how much the variance of bk is inflated when we add
correlated predictors to our regression model.
Now, again, if some of the predictors are correlated with the predictor xk, then the variance of
bk is inflated. It can be shown that the variance of bk is:

Var(bk) = σ² / [Σ(xik − x̄k)² (1 − Rk²)]

where Rk² is the R2-value obtained by regressing the kth predictor on the remaining
predictors. Of course, the greater the linear dependence between the predictor xk and the
other predictors, the larger the Rk² value. And, as the above formula suggests, the larger the
Rk² value, the larger the variance of bk.
How much larger? To answer this question, all we need to do is take the ratio of the two
variances. Doing so, we obtain:

Var(bk) / Var(bk)min = 1 / (1 − Rk²)

The above quantity is what is deemed the variance inflation factor for the kth predictor. That is:

VIFk = 1 / (1 − Rk²)

where Rk² is the R2-value obtained by regressing the kth predictor on the remaining
predictors. Note that a variance inflation factor exists for each of the k predictors in a multiple
regression model.
How do we interpret the variance inflation factors for a regression model? Again, it is a
measure of how much the variance of the estimated regression coefficient bk is "inflated" by
the existence of correlation among the predictor variables in the model. A VIF of 1 means that
there is no correlation between the kth predictor and the remaining predictor variables, and
hence the variance of bk is not inflated at all. The general rule of thumb is that VIFs exceeding
4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity
requiring correction.
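That rule of thumb is easy to mechanize. A minimal Python sketch (the auxiliary R² values below are hypothetical, chosen only to land in each band of the rule):

```python
def vif(r2_k):
    """Variance inflation factor implied by the auxiliary R-squared for predictor k."""
    return 1.0 / (1.0 - r2_k)

def flag(v):
    """The usual rule of thumb for interpreting a VIF."""
    if v > 10:
        return "serious multicollinearity"
    if v > 4:
        return "warrants further investigation"
    return "ok"

# Hypothetical auxiliary R-squared values, one in each band of the rule:
for r2 in (0.0, 0.80, 0.95):
    v = vif(r2)
    print(round(v, 2), flag(v))
```

For instance, an auxiliary R² of 0.80 already gives a VIF of 5, which warrants a closer look.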
An Example
Let's return to the blood pressure data (bloodpress.txt [1]) in which researchers observed the
following data on 20 individuals with high blood pressure:
As you may recall, the matrix plot of BP, Age, Weight, and BSA
suggests that some of the predictors are at least moderately marginally correlated. For
example, body surface area (BSA) and weight are strongly correlated (r = 0.875), and weight
and pulse are fairly strongly correlated (r = 0.659). On the other hand, none of the pairwise
correlations among age, weight, duration and stress are particularly strong (r < 0.40 in each
case).
[Minitab v17 reports the variance inflation factors by default; for v16 you have to select this
under Options.] As you can see, three of the variance inflation factors —8.42, 5.33, and 4.41
—are fairly large. The VIF for the predictor Weight, for example, tells us that the variance of
the estimated coefficient of Weight is inflated by a factor of 8.42 because Weight is highly
correlated with at least one of the other predictors in the model.
For the sake of understanding, let's verify the calculation of the VIF for the predictor Weight.
Regressing the predictor x2 = Weight on the remaining five predictors:
Minitab reports that the R2-value from this regression is 88.1% or, in decimal form, 0.881.
Therefore, the variance inflation factor for the estimated coefficient of Weight is by definition:

VIF = 1/(1 − 0.881) = 8.4
Again, this variance inflation factor tells us that the variance of the weight coefficient is inflated
by a factor of 8.4 because Weight is highly correlated with at least one of the other predictors
in the model.
So, what to do? One solution to dealing with multicollinearity is to remove some of the
violating predictors from the model. If we review the pairwise correlations again:
we see that the predictors Weight and BSA are highly correlated (r = 0.875). We can choose
to remove either predictor from the model. The decision of which one to remove is often a
scientific or practical one. For example, if the researchers here are interested in using their
final model to predict the blood pressure of future individuals, their choice should be clear.
Which of the two measurements — body surface area or weight — do you think would be
easier to obtain?! If indeed weight is an easier measurement to obtain than body surface
area, then the researchers would be well-advised to remove BSA from the model and leave
the predictor Weight in the model.
Reviewing again the above pairwise correlations, we see that the predictor Pulse also
appears to exhibit fairly strong marginal correlations with several of the predictors, including
Age (r = 0.619), Weight (r = 0.659) and Stress (r = 0.506). Therefore, the researchers could
also consider removing the predictor Pulse from the model.
Let's see how the researchers would do. Regressing the response y = BP on the four
remaining predictors age, weight, duration and stress, we obtain:
Aha — the remaining variance inflation factors are quite satisfactory! That is, it appears as if
hardly any variance inflation remains. Incidentally, in terms of the adjusted R2-value, we did
not seem to lose much by dropping the two predictors BSA and Pulse from our model. The
adjusted R2-value decreased to only 98.97% from the original adjusted R2-value of 99.44%.
We’ll use the cement.txt [8] data set to explore variance inflation factors. The response y
measures the heat evolved in calories during the hardening of cement on a per gram
basis. The four predictors are the percentages of four ingredients: tricalcium aluminate
(x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).
It’s not hard to imagine that such predictors would be correlated in some way.
1. Use the Stat >> Basic Statistics >> Correlation ... command in Minitab to get an idea
of the extent to which the predictor variables are (pairwise) correlated. Also, use the
Graph >> Matrix Plot ... command in Minitab to get a visual portrayal of the (pairwise)
relationships among the response and predictor variables.
2. Regress the fourth predictor, x4, on the remaining three predictors, x1, x2, and x3.
That is, fit the linear regression model treating x4 as the response and x1, x2, and x3 as
the predictors. What is the R4² value? (Note that Minitab rounds the R2 value it reports
to three decimal places. For the purposes of the next question, you’ll want a more
accurate R2 value. Calculate the R2 value using its definition, R² = SSR/SSTO. Use your
calculated value, carried out to 5 decimal places, in answering the next question.)
3. Using your calculated R2 value carried out to 5 decimal places, determine by what
factor the variance of b4 is inflated. That is, what is VIF4?
4. Minitab will actually calculate the variance inflation factors for you. Fit the multiple
linear regression model with y as the response and x1,x2,x3 and x4 as the predictors.
The VIFk will be reported as a column of the estimated coefficients table. Is the VIF4
that you calculated consistent with what Minitab reports?
5. Note that all of the VIFk are larger than 10, suggesting that a high degree of
multicollinearity is present. (It should seem logical that multicollinearity is present here,
given that the predictors are measuring the percentage of ingredients in the cement.)
Do you notice anything odd about the results of the t-tests for testing the individual H0 :
βi = 0 and the result of the overall F-test for testing H0 : β1 = β2 = β3 = β4 = 0? Why
does this happen?
As the example in the previous section illustrated, one way of reducing data-based
multicollinearity is to remove one or more of the violating predictors from the regression
model. Another way is to collect additional data under different experimental or observational
conditions. We'll investigate this alternative method in this section.
Before we do, let's quickly remind ourselves why we should care about reducing
multicollinearity. It all comes down to drawing conclusions about the population slope
parameters. If the variances of the estimated coefficients are inflated by multicollinearity, then
our confidence intervals for the slope parameters are wider and therefore less useful.
Eliminating or even reducing the multicollinearity therefore yields narrower, more useful
confidence intervals for the slopes.
An Example
Researchers running the Allen Cognitive Level (ACL) Study were interested in the relationship
of ACL test scores to the level of psychopathology. They therefore collected the following data
on a set of 69 patients in a hospital psychiatry unit:
For the sake of this example, I sampled 23 patients from the original data set in such a way to
ensure that a very high correlation exists between the two predictors Vocab and Abstract. A
matrix plot of the resulting data set (allentestn23.txt [9])
suggests that, indeed, a strong correlation exists between Vocab and Abstract. The
correlations among the remaining pairs of predictors do not appear to be particularly strong.
Focusing only on the relationship between the two predictors Vocab and Abstract:
we do indeed see that a very strong relationship (r = 0.99) exists between the two predictors.
Let's see what havoc this high correlation wreaks on our regression analysis! Regressing the
response y = ACL on the predictors SDMT, Vocab, and Abstract, we obtain:
Yikes — the variance inflation factors for Vocab and Abstract are very large — 49.3 and 50.6,
respectively!
What should we do about this? We could opt to remove one of the two predictors from the
model. Alternatively, if we have a good scientific reason for needing both of the predictors to
remain in the model, we could go out and collect more data. Let's try this second approach
here.
For the sake of this example, let's imagine that we went out and collected more data, and in
so doing, obtained the actual data collected on all 69 patients enrolled in the Allen Cognitive
Level (ACL) Study. A matrix plot of the resulting data set (allentest.txt [4])
suggests that a correlation still exists between Vocab and Abstract; it is just a weaker
correlation now.
Again, focusing only on the relationship between the two predictors Vocab and Abstract:
we do indeed see that the relationship between Abstract and Vocab is now much weaker (r =
0.698) than before. The round data points in blue represent the 23 data points in the original
data set, while the square red data points represent the 46 newly collected data points. As
you can see from the plot, collecting the additional data has expanded the "base" over which
the "best fitting plane" will sit. The existence of this larger base allows less room for the plane
to tilt from sample to sample, and thereby reduces the variance of the estimated slope
coefficients.
Let's see if the addition of the new data helps to reduce the multicollinearity here. Regressing
the response y = ACL on the predictors SDMT, Vocab, and Abstract:
we find that the variance inflation factors are reduced significantly and satisfactorily. The
researchers could now feel comfortable proceeding with drawing conclusions about the
effects of the vocabulary and abstraction scores on the level of psychopathology.
One thing to keep in mind. In order to reduce the multicollinearity that exists, it is not
sufficient to go out and just collect any ol' data. The data have to be collected in such a way to
ensure that the correlations among the violating predictors are actually reduced. That is,
collecting more of the same kind of data won't help to reduce the multicollinearity. The data
have to be collected to ensure that the "base" is sufficiently enlarged. Doing so, of course,
changes the characteristics of the studied population, and therefore should be reported
accordingly.
An Example
What is the impact of exercise on the human immune system? In order to answer this very
global and general research question, one has to first quantify what "exercise" means and
what "immunity" means. Of course, there are several ways of doing so. For example, we
might quantify one's level of exercise by measuring his or her "maximal oxygen uptake." And,
we might quantify the quality of one's immune system by measuring the amount of
"immunoglobin in his or her blood." In doing so, the general research question is translated
into the much more specific research question: "How is the amount of immunoglobin in blood
(y) related to maximal oxygen uptake (x)?"
Because some researchers were interested in answering the above research question, they
collected the following data (exerimmun.txt [10]) on a sample of 30 individuals. A scatterplot
of these data suggests that there might be some curvature to the trend. In order to allow for
the apparent curvature, rather than formulating a linear regression function, the researchers
formulated the following quadratic polynomial regression function:

yi = β0 + β1xi + β2xi² + εi

where yi is the immunoglobin level (igg) of individual i and xi is his or her maximal oxygen
uptake. As usual, the error terms εi are assumed to be independent, normally distributed and
have equal variance σ².
The fitted line plot shows that the formulated regression function appears to describe the
trend in the data well. The
adjusted R2-value is 93.3%.
But, now what do the estimated coefficients tell us? The interpretation of the regression
coefficients is mostly geometric in nature. That is, the coefficients tell us a little bit about what
the picture looks like.
So far, we have kept our heads a little bit in the sand! If we look at the output we obtain upon
regressing the response y = igg on the predictors oxygen and oxygensq:
we quickly see that the variance inflation factors for both predictors —oxygen and oxygensq
—are very large (99.9 in each case). Is this surprising to you? If you think about it, we've
created a correlation by taking the predictor oxygen and squaring it to obtain oxygensq. That
is, just by the nature of our model, we have created a "structural multicollinearity."
The scatterplot of oxygensq versus oxygen illustrates the intense strength of the correlation
that we induced. After all, we just can't get much more correlated than r = 0.995!
The neat thing here is that we can reduce the multicollinearity in our data by doing what is
known as "centering the predictors." Centering a predictor merely entails subtracting the
mean of the predictor values in the data set from each predictor value. For example, Minitab
reports that the mean of the oxygen values in our data set is 50.64:
Therefore, in order to center the predictor oxygen, we merely subtract 50.64 from each
oxygen value in our data set. Doing so, we obtain the centered predictor, oxcent, say:
For example, 34.6 minus 50.64 is −16.04, and 45.0 minus 50.64 is −5.64, and so on. Now, in
order to include the squared oxygen term in our regression model, to allow for curvature in
the trend, we square the centered predictor oxcent to obtain oxcentsq. That is, (−16.04)² =
257.282 and (−5.64)² = 31.810, and so on.
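In code, centering is a one-liner. A Python sketch using just the two oxygen values quoted above (34.6 and 45.0) and the mean of 50.64 that Minitab reports for the full data set:

```python
# Mean of the oxygen values, as reported by Minitab for the full data set:
oxygen_mean = 50.64

# The two oxygen values quoted in the text:
oxygen = [34.6, 45.0]

# Center by subtracting the mean, then square the centered values:
oxcent = [x - oxygen_mean for x in oxygen]
oxcentsq = [c ** 2 for c in oxcent]

print([round(c, 2) for c in oxcent])    # → [-16.04, -5.64]
print([round(s, 4) for s in oxcentsq])  # → [257.2816, 31.8096]
```

The same subtraction is applied to every oxygen value in the data set before squaring.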
The corresponding scatterplot illustrates just how much centering the predictors has reduced
the correlation between the predictor and its square. The correlation has gone from a
whopping r = 0.995 to a rather low r = 0.219!
Having centered the predictor oxygen, we must reformulate our quadratic polynomial
regression model accordingly. That is, we now formulate our model as:

yi = β0* + β1*(xi − x̄) + β2*(xi − x̄)² + εi

where x̄ = 50.64 is the sample mean of the oxygen values, and the error terms εi are
independent, normally distributed and have equal variance σ². Note that we add asterisks to
each of the parameters in order to make it clear that the parameters differ from the
parameters in the original model we formulated.
Let's see how we did by centering the predictors and reformulating our model. Recall that
—based on our original model —the variance inflation factors for oxygen and oxygensq were
99.9. Now, regressing y = igg on the centered predictors oxcent and oxcentsq:
we see that the variance inflation factors have dropped significantly—now they are 1.05 in
each case.
Because we reformulated our model based on the centered predictors, the meaning of the
parameters must be changed accordingly. Now, the estimated coefficients tell us:
The estimated coefficient b0 is the predicted response y when the predictor x equals the
sample mean of the predictor values.
The estimated coefficient b1 is the estimated slope of the tangent line at the predictor
mean — and, often, it is similar to the estimated slope in the simple linear regression
model.
The estimated coefficient b2 indicates the up/down direction of curve. That is:
if b2 < 0, then the curve is concave down
if b2 > 0, then the curve is concave up
So, here, in this example, the estimated coefficient b0 = 1632.3 tells us that a male whose
maximal oxygen uptake is 50.64 ml/kg is predicted to have 1632.3 mg of immunoglobin in his
blood. And, the estimated coefficient b1 = 34.00 tells us that when an individual's maximal
oxygen uptake is near 50.64 ml/kg, we can expect the individual's immunoglobin to increase
by 34.00 mg for every 1 ml/kg increase in maximal oxygen uptake.
As before, the reformulated regression function appears to describe the trend in the data well. The
adjusted R2-value is still 93.3%.
We shouldn't be surprised to see that the estimates of the coefficients in our reformulated
polynomial regression model are quite similar to the estimates of the coefficients for the
simple linear regression model:
As you can see, the estimated coefficient b1 = 34.00 for the polynomial regression model and
b1 = 32.74 for the simple linear regression model. And, the estimated coefficient b0 = 1632 for
the polynomial regression model and b0 = 1558 for the simple linear regression model. The
similarities in the estimates, of course, arise from the fact that the predictors are nearly
uncorrelated and therefore the estimates of the coefficients don't change all that much from
model to model.
Now, you might be getting this sense that we're "mucking around with the data" in order to get
an answer to our research questions. One way to convince you that we're not is to show you
that the two estimated models are algebraically equivalent. That is, if given one form of the
estimated model, say the estimated model with the centered predictors:

ŷ = b0* + b1*(x − x̄) + b2*(x − x̄)²

then the other form of the estimated model, say the estimated model with the original
predictors:

ŷ = b0 + b1x + b2x²

can be easily obtained. In fact, expanding the centered form and collecting powers of x shows
that the estimated coefficients of the original model equal:

b0 = b0* − b1*x̄ + b2*x̄²
b1 = b1* − 2b2*x̄
b2 = b2*
For example, the estimated regression function for our reformulated model with centered
predictors is:

ŷ = 1632.3 + 34.00 oxcent − 0.536 oxcentsq

Then, since the mean of the oxygen values in the data set is 50.64,
it can be shown algebraically that the estimated coefficients for the model with the original
(uncentered) predictors are:

b0 = 1632.3 − 34.00(50.64) + (−0.536)(50.64²) = −1464.0
b1 = 34.00 − 2(−0.536)(50.64) = 88.3
b2 = −0.536

That is, the estimated regression function for our quadratic polynomial model with the original
(uncentered) predictors is:

ŷ = −1464.0 + 88.3 oxygen − 0.536 oxygensq
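These back-transformed coefficients are simple arithmetic, so we can check them numerically. A Python sketch, assuming the centered-model estimates b0* = 1632.3, b1* = 34.00 and b2* = −0.536 reported for this example, and the predictor mean 50.64:

```python
# Centered-model estimates for this example, and the predictor mean:
b0_star, b1_star, b2_star = 1632.3, 34.00, -0.536
xbar = 50.64

# Expand b0* + b1*(x - xbar) + b2*(x - xbar)^2 and collect powers of x:
b0 = b0_star - b1_star * xbar + b2_star * xbar ** 2  # constant term
b1 = b1_star - 2 * b2_star * xbar                    # coefficient of x
b2 = b2_star                                         # coefficient of x^2

print(round(b0, 1), round(b1, 2), b2)  # → -1464.0 88.29 -0.536
```

Plugging any oxygen value into either form of the estimated function gives the same fitted value, which is exactly the claimed equivalence.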
Given the equivalence of the two estimated models, you might ask why we bother to center
the predictors. The main reason for centering to correct structural multicollinearity is that
keeping the multicollinearity low helps avoid computational inaccuracies. Specifically, a
near-zero determinant of XTX is a potential source of serious roundoff errors in the
calculations of the normal equations. Severe multicollinearity has the effect of making this
determinant come close to zero. Thus, under severe multicollinearity, the regression
coefficients may be subject to large roundoff errors.
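The determinant effect is easy to see for two standardized predictors, where XTX reduces to the 2×2 correlation matrix with determinant 1 − r². A quick Python sketch using the two correlations from this example (0.995 before centering, 0.219 after):

```python
def det2(r):
    """Determinant of the 2x2 correlation matrix [[1, r], [r, 1]]."""
    return 1.0 - r * r

# Correlation between the predictor and its square, before and after centering
# (the r values reported in this section):
print(round(det2(0.995), 6))  # before centering: nearly singular
print(round(det2(0.219), 6))  # after centering: well-conditioned
```

Centering moves the determinant from about 0.01 to about 0.95, far from the near-singular region where roundoff error bites.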
Let's use our model to predict the immunoglobin level in the blood of a person whose maximal
oxygen uptake is 90 ml/kg. Of course, before we use our model to answer a research
question, we should always evaluate it first to make sure it meets all of the necessary
conditions. The residuals versus fits plot:
shows a nice horizontal band around the residual = 0 line, suggesting the model fits the data
well. It also suggests that the variances of the error terms are equal. And, the normal
probability plot:
suggests that the error terms are normally distributed. Okay, we're good to go —let's use the
model to answer our research question: "What is one's predicted IgG if the maximal oxygen
uptake is 90 ml/kg?"
When asking Minitab to make this prediction, you have to remember that we have centered
the predictors. That is, if oxygen = 90, then oxcent = 90-50.64 = 39.36. And, if oxcent = 39.36,
then oxcentsq = 1549.210. Asking Minitab to predict the igg of an individual whose oxcent =
39.4 and oxcentsq = 1549, we obtain the following output:
Why does Minitab report that "XX denotes a row with very extreme X values"?
Recall that the levels of maximal oxygen uptake in the data set ranged from 30 to 70 ml/kg.
Therefore, a maximal oxygen uptake of 90 is way outside the scope of the model, and Minitab
provides such a warning.
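The centering bookkeeping for this prediction is easy to reproduce. A Python sketch, assuming the centered-model estimates reported earlier for this example (b0 = 1632.3, b1 = 34.00, b2 = −0.536); because these estimates are rounded, the fitted value may differ slightly from Minitab's:

```python
# Estimated regression function with centered predictors:
#   igg-hat = 1632.3 + 34.00 * oxcent - 0.536 * oxcentsq
b0, b1, b2 = 1632.3, 34.00, -0.536
oxygen_mean = 50.64

oxygen_new = 90.0
oxcent = oxygen_new - oxygen_mean  # 90 - 50.64 = 39.36
oxcentsq = oxcent ** 2             # 39.36^2 = 1549.2096

igg_hat = b0 + b1 * oxcent + b2 * oxcentsq
print(round(igg_hat, 1))  # predicted igg at a maximal oxygen uptake of 90 ml/kg
```

Remember, though, that 90 ml/kg lies well outside the 30 to 70 ml/kg range of the data, so this number is an extrapolation and carries the warning discussed above.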
Just one closing comment, since we've been discussing polynomial regression, to remind you
about the "hierarchical approach to model fitting." The widely accepted approach to fitting
polynomial regression functions is to fit a higher-order model and then explore whether a
lower-order (simpler) model is adequate. For example, suppose we formulate the following
cubic polynomial regression function:

yi = β0 + β1xi + β2xi² + β3xi³ + εi

Then, to see if the simpler first-order model (a "line") is adequate in describing the trend in the
data, we could test the null hypothesis:

H0 : β2 = β3 = 0

But then, if a polynomial term of a given order is retained, all related lower-order terms
are also retained. That is, if the quadratic term is deemed significant, it is standard practice
to use the regression function:

yi = β0 + β1xi + β2xi² + εi

That is, we always fit the terms of a polynomial model in a hierarchical manner.
(Data source: The U.S. Census Bureau and Mind On Statistics, (3rd edition), Utts and
Heckard). In this example, the observations are the 50 states of the United States (poverty.txt
[12] - Note: remove data from the District of Columbia). The variables are y = percentage of
each state’s population living in households with income below the federally defined poverty
level in the year 2002, x1 = birth rate for females 15 to 17 years old in 2002, calculated as
births per 1000 persons in the age group, and x2 = birth rate for females 18 to 19 years old in
2002, calculated as births per 1000 persons in the age group.
The two x-variables are correlated (so we have multicollinearity). The correlation is about
0.95. A plot of the two x-variables is given below.
The figure below shows plots of y = poverty percentage versus each x-variable separately.
Both x-variables are linear predictors of the poverty percentage.
Minitab results for the two possible simple regressions and the multiple regression are given
below.
1. The value of the sample coefficient that multiplies a particular x-variable is not the same in
the multiple regression as it is in the relevant simple regression.
2. The R2 for the multiple regression is not the sum of the R2 values for the simple
regressions. Neither x-variable makes an independent "add-on" contribution in the
multiple regression.
3. The 18 to 19 year-old birth rate variable is significant in the simple regression, but is not
in the multiple regression. This discrepancy is caused by the correlation between the
two x-variables. The 15 to 17 year-old birth rate is the stronger of the two x-variables,
and given its presence in the equation, the 18 to 19 year-old rate does not improve R2
enough to be significant. More specifically, the correlation between the two x-variables
has increased the standard errors of the coefficients, so we have less precise estimates
of the individual slopes.
12.8 - Extrapolation
"Extrapolation" beyond the "scope of the model" occurs when one uses an estimated
regression equation to estimate a mean µY or to predict a new response ynew for x values not
in the range of the sample data used to determine the estimated regression equation. In
general, it is dangerous to extrapolate beyond the scope of model. The following example
illustrates why this is not a good thing to do.
Researchers measured the number of colonies of grown bacteria for various concentrations
of urine (ml/plate). The scope of the model — that is, the range of the x values — was 0 to
5.80 ml/plate. The researchers obtained the following estimated regression equation:
Using the estimated regression equation, the researchers predicted the number of colonies at
11.60 ml/plate to be 16.0667 + 1.61576(11.60), or 34.8 colonies. But when the researchers
conducted the experiment at 11.60 ml/plate, they observed that the number of colonies
decreased dramatically, to only about 15.1 colonies:
The moral of the story is that the trend in the data as summarized by the estimated regression
equation does not necessarily hold outside the scope of the model.
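The arithmetic of that extrapolation can be sketched in a few lines of Python (coefficients taken from the estimated regression equation above):

```python
# Estimated regression equation: colonies = 16.0667 + 1.61576 * concentration,
# fit over urine concentrations of 0 to 5.80 ml/plate.
b0, b1 = 16.0667, 1.61576
scope = (0.0, 5.80)

def predict(concentration):
    return b0 + b1 * concentration

x_new = 11.60
print(scope[0] <= x_new <= scope[1])  # → False: 11.60 is outside the model's scope
print(round(predict(x_new), 1))       # → 34.8 extrapolated colonies

# The observed count at 11.60 ml/plate was only about 15.1 colonies,
# less than half the extrapolated prediction.
```

A simple range check like the one above is a cheap guard against silently extrapolating beyond the scope of the model.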
prediction intervals will tend to be wider than they should be at low fitted values and narrower
than they should be at high fitted values. Standard remedies for a model exhibiting excessive
nonconstant variance include transforming the response and using weighted least squares.
Autocorrelation
One common way for the "independence" condition in a multiple linear regression model to
fail is when the sample data have been collected over time and the regression model fails to
effectively capture any time trends. In such a circumstance, the random errors in the model
are often positively correlated over time, so that each random error is more likely to be similar
to the previous random error than it would be if the random errors were independent of one
another. This phenomenon is known as autocorrelation (or serial correlation) and can
sometimes be detected by plotting the model residuals versus time. We'll explore this further
in Lesson 14.
Overfitting
When building a regression model, we don't want to include unimportant or irrelevant
predictors whose presence can overcomplicate the model and increase our uncertainty about
the magnitudes of the effects for the important predictors (particularly if some of those
predictors are highly collinear). Such "overfitting" can occur the more complicated a model
becomes and the more predictor variables, transformations, and interactions are added to a
model. It is always prudent to apply a sanity check to any model being used to make
decisions. Models should always make sense, preferably grounded in some kind of
background theory or sensible expectation about the types of associations allowed between
variables. Predictions from the model should also be reasonable (over-complicated models
can give quirky results that may not reflect reality).
Excluding Important Predictor Variables
Important predictor variables that are left out of a regression model are sometimes called
confounding or lurking variables, and their absence from a model can lead to incorrect
conclusions and poor decision-making.
Missing Data
Real-world datasets frequently contain missing values, so that we do not know the values of
particular variables for some of the sample observations. For example, such values may be
missing because they were impossible to obtain during data collection. Dealing with missing
data is a challenging task. Missing data has the potential to adversely affect a regression
analysis by reducing the total usable sample size. The best solution to this problem is to try
extremely hard to avoid having missing data in the first place. When there are missing values
that are impossible or too costly to avoid, one approach is to replace the missing values with
plausible estimates, known as imputation. Another (easier) approach is to consider only
models that contain predictors with no (or few) missing values. This may be unsatisfactory,
however, because even a predictor variable with a large number of missing values can
contain useful information.
Links:
[1] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bloodpress.txt
[2] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/uncorrpreds.txt
[3] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/uncorrelated.txt
[4] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/allentest.txt
[5] https://onlinecourses.science.psu.edu/stat501/node/240
[6] https://onlinecourses.science.psu.edu/stat501/node/247
[7] https://onlinecourses.science.psu.edu/stat501/node/244
[8] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/cement.txt
[9] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/allentestn23.txt
[10] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/exerimmun.txt
[11] https://onlinecourses.science.psu.edu/stat501/../data/lifeexpect.txt
[12] https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/examples/poverty.txt