DOI 10.1007/s11135-006-9055-1
Springer 2006
Abstract. In practice, we often face the difficult problem of
multicollinearity in a fitted regression model. Multicollinearity arises when there are
approximate linear relationships between two or more independent variables. It may cause
serious problems in the validation, interpretation, and analysis of the model, such as
unstable estimates, unreasonable signs, and high standard errors. Although there are
methods to solve or avoid this problem, this paper proposes another alternative from
a practical point of view, called the nested estimate procedure. The first half of the
paper explains the concept and process of this procedure, and the second half provides two
examples to illustrate the procedure's suitability and reliability.
Key words: multicollinearity, nested estimate procedure, variance inflation factors, tolerance
1. Introduction
In the process of fitting a regression model, when one independent variable is
nearly a linear combination of the other independent variables, the parameter estimates are affected. This problem is called multicollinearity. Multicollinearity is not a violation of the assumptions of regression, but it may
cause serious difficulties (Neter et al., 1989): (1) variances of parameter estimates may be unreasonably large, (2) parameter estimates may not be significant, (3) a parameter estimate may have a sign different from what is
expected, and so on.
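Difficulty (1) can be made concrete with a standard OLS identity, added here as an illustration rather than taken from the paper. The symbols $\sigma^2$ (error variance), $S_{ii}$ (corrected sum of squares of $x_i$), and $R_i^2$ (coefficient of determination from regressing $x_i$ on the other independent variables) are notation assumed for this sketch:

```latex
\operatorname{Var}(\hat{\beta}_i) = \frac{\sigma^2}{S_{ii}\,(1 - R_i^2)},
\qquad S_{ii} = \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2 .
```

As $R_i^2$ approaches 1, the variance grows without bound. For example, $R_i^2 = 0.9$ multiplies the variance by 10 relative to the case of uncorrelated independent variables.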
To solve or alleviate this problem in a given regression model, the
usual best approach is to drop redundant variables from the model directly,
that is, to avoid the problem by not including redundant variables in the regression model (Bowerman et al., 1993). But sometimes it is hard to decide
which variables are redundant. An alternative to deleting variables is
to perform a principal component analysis (Maddala, 1977). With principal component regression, we create a set of artificial uncorrelated variables that can then be used in the regression model. However, when some principal
component variables are dropped from the model and the model is transformed back to the original variables, other biases are introduced as well (Draper and Smith,
1981).
In this paper, from a practical point of view, we present another
idea for avoiding the multicollinearity problem in the fitting process. It is based on the
Ordinary Least Squares (OLS) method and estimates the parameters of the independent variables individually and sequentially. This idea is
called the nested estimate procedure, and the constructed model is called the nested
regression model. To explain the feasibility of this method, the first half
of this paper describes the concept and process of the nested estimate procedure, and the second half provides examples to demonstrate its suitability and reliability.
2. Procedure Concept and Executing Flow
The nested estimate procedure is easy to execute. Suppose there are $k$ independent variables $x_1, x_2, \ldots, x_k$ and one dependent variable $y$. If we
want to use them to construct a multiple regression model, then the real
full model will be
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon = \beta_0 + \sum_{m=1}^{k} \beta_m x_m + \varepsilon$$
m xm ,
= 0 +
m=1
(3) Next, we find the independent variable ($x_i$), other than $x_{(1)}$, that has
the maximum correlation with the residual variable ($\varepsilon_1$), where
this correlation has the same sign as the variable's correlation with $y$. We let $x_{(2)}$ be the
variable among the $x_i$ that satisfies these conditions.
(4) After that, we fit a simple regression model of $\varepsilon_1$ on $x_{(2)}$ by OLS.
The estimated model will be
$$\hat{\varepsilon}_1 = \hat{\beta}_{20} + \hat{\beta}_{21} x_{(2)}.$$
At the same time, we test the significance of the parameter $\hat{\beta}_{21}$
under the null hypothesis that the parameter is zero. If the test fails to reject the null hypothesis, we stop the fitting process. The final estimated model is then only the simple regression model
from step (2), and there is no collinearity problem.
Otherwise, the estimated model becomes
$$y = \hat{\beta}_{10} + \hat{\beta}_{11} x_{(1)} + \varepsilon_1 = (\hat{\beta}_{10} + \hat{\beta}_{20}) + \hat{\beta}_{11} x_{(1)} + \hat{\beta}_{21} x_{(2)} + \varepsilon_2.$$
Also, from the real values $y_j$ and the forecast values $\hat{y}_j$, we can obtain the
residuals $\varepsilon_{2j}$. That is, we calculate the variable $\varepsilon_2 = y - \hat{y}$.
(5) Steps (3) and (4) are then repeated. In the $r$th iteration ($r \le k$), we find the independent
variable ($x_i$), other than those already retained in the model, that has the maximum correlation with the new residual
variable ($\varepsilon_{r-1}$), and let $x_{(r)}$ be the variable among the $x_i$ that satisfies this condition.
We then fit a simple regression model of $\varepsilon_{r-1}$ on $x_{(r)}$ by OLS, so
the estimated model of the $r$th step will be
$$\hat{\varepsilon}_{r-1} = \hat{\beta}_{r0} + \hat{\beta}_{r1} x_{(r)}.$$
After that, we test the significance of the parameter $\hat{\beta}_{r1}$
under the null hypothesis that the parameter is zero. If the test
fails to reject the null hypothesis, we stop the fitting process,
and the final model is the estimated model of the previous iteration ($r - 1$).
That is,
$$y = \sum_{m=1}^{r-1} \hat{\beta}_{m0} + \sum_{m=1}^{r-1} \hat{\beta}_{m1} x_{(m)} + \varepsilon_{r-1}.$$
Otherwise, the estimated model becomes
$$y = \sum_{m=1}^{r} \hat{\beta}_{m0} + \sum_{m=1}^{r} \hat{\beta}_{m1} x_{(m)} + \varepsilon_r.$$
Also, from the real values $y_j$ and the forecast values $\hat{y}_j$, we can obtain the
residuals $\varepsilon_{rj}$. That is, we calculate the variable $\varepsilon_r = y - \hat{y}$.
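[Figure: flowchart of the nested estimate procedure. Boxes recoverable from the original: increment the iteration counter (k = k + 1); calculate the residual $\varepsilon_k = y - \hat{y}$; find the $x_i$ with the maximum correlation with $\varepsilon_k$; check whether the parameter is significant (yes or no); if not, the model from the last iteration is the final model.]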
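The iteration described in steps (3) to (5) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the first variable is chosen by the same rule as later ones (with $y$ itself as the initial "residual"), "maximum correlation" is read as maximum absolute correlation, the sign rule is read as "same sign as the variable's correlation with $y$", and the significance test compares the slope's t statistic with a caller-supplied critical value. The names `nested_estimate`, `simple_ols`, `corr`, and `t_crit`, as well as the data, are hypothetical.

```python
import math

def corr(a, b):
    """Pearson correlation; returns 0.0 if either input has no variation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb) if saa > 0 and sbb > 0 else 0.0

def simple_ols(x, y):
    """Fit y = b0 + b1*x by OLS; return (b0, b1, t statistic of b1, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((p - mx) ** 2 for p in x)
    b1 = sum((p - mx) * (q - my) for p, q in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [q - (b0 + b1 * p) for p, q in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)   # residual variance
    se_b1 = math.sqrt(s2 / sxx)                # standard error of the slope
    t = b1 / se_b1 if se_b1 > 0 else math.copysign(math.inf, b1)
    return b0, b1, t, resid

def nested_estimate(columns, y, t_crit=2.0):
    """Sequentially regress the current residual on one new variable at a time.

    columns: dict mapping variable name -> list of values.
    t_crit:  assumed critical value for |t|; take it from a t table in practice.
    Returns (summed intercept, {name: slope}, final residuals).
    """
    remaining = dict(columns)
    resid = list(y)            # assumption: before the first fit the residual is y
    intercept, slopes = 0.0, {}
    while remaining:
        candidates = []
        for name, x in remaining.items():
            r_resid = corr(x, resid)
            if r_resid * corr(x, y) > 0:       # assumed reading of the sign rule
                candidates.append((abs(r_resid), name))
        if not candidates:
            break
        _, name = max(candidates)
        b0, b1, t, new_resid = simple_ols(remaining.pop(name), resid)
        if abs(t) < t_crit:                    # fail to reject H0: stop fitting
            break
        intercept += b0                        # intercepts add up across steps
        slopes[name] = b1
        resid = new_resid
    return intercept, slopes, resid

# Hypothetical data: x3 is nearly a copy of x1, x2 is unrelated to both.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, -1, 1, 1, -1, -1, 1]
x3 = [1.1, 2.0, 2.9, 4.0, 5.1, 6.0, 6.9, 8.0]
noise = [0.1, -0.1, 0.05, -0.05, -0.1, 0.1, -0.05, 0.05]
y = [2 + 3 * a + 5 * b + e for a, b, e in zip(x1, x2, noise)]
cols = {"x1": x1, "x2": x2, "x3": x3}
intercept, slopes, resid = nested_estimate(cols, y)
print(intercept, slopes)
```

In this toy data set `x1` and `x3` are nearly identical, and the sketch retains only one of them together with `x2`. Passing the critical value in as `t_crit` keeps the example free of any statistics library; it is a design shortcut, not part of the procedure.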
3. Two Examples
To validate the feasibility of the nested estimate procedure, we use two examples whose independent variables
are correlated to different degrees. To examine whether multicollinearity is present
in a model, we use two criteria:
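(1) Variance inflation factors (VIF)
The VIF measures how much the variance of a parameter estimate is inflated
above what would be expected if there were no correlation among the independent
variables. We can calculate the VIF for an independent variable $x_i$ as
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2},$$
where $R_i^2$ is the coefficient of determination from regressing $x_i$ on the other independent variables.
(2) Tolerance (TOL)
The tolerance is the reciprocal of the VIF:
$$\mathrm{TOL}_i = \frac{1}{\mathrm{VIF}_i} = 1 - R_i^2.$$
Therefore, a tolerance close to 0 indicates possible problems with multicollinearity. A rule of thumb is that a tolerance value less than 0.1 may
indicate the presence of multicollinearity.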
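The two criteria above can be computed directly from their definitions. The following is a minimal pure-Python sketch, assuming each $R_i^2$ is obtained by regressing $x_i$ on the other independent variables with an intercept, solved here through the normal equations. The helper names (`solve`, `r_squared`, `vif_and_tolerance`) and the data are hypothetical and chosen only for illustration.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def r_squared(target, predictors):
    """R^2 from an OLS regression of target on predictors, with an intercept."""
    n = len(target)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, target)) for ci in cols]
    coef = solve(xtx, xty)
    fitted = [sum(c * col[j] for c, col in zip(coef, cols)) for j in range(n)]
    mean_t = sum(target) / n
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot

def vif_and_tolerance(columns):
    """Return {name: (VIF, TOL)} with TOL = 1 - R_i^2 and VIF = 1 / TOL."""
    out = {}
    for name, x in columns.items():
        others = [v for k, v in columns.items() if k != name]
        tol = 1.0 - r_squared(x, others)
        out[name] = (1.0 / tol, tol)
    return out

# Hypothetical data: x3 is nearly a copy of x1.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, -1, 1, 1, -1, -1, 1],
    "x3": [1.1, 2.0, 2.9, 4.0, 5.1, 6.0, 6.9, 8.0],
}
stats = vif_and_tolerance(data)
for name, (vif, tol) in stats.items():
    flag = "possible multicollinearity" if tol < 0.1 else "ok"
    print(f"{name}: VIF={vif:.2f} TOL={tol:.5f} ({flag})")
```

Solving the normal equations is adequate for a small illustration, but it is numerically fragile when the variables are very strongly collinear; a QR-based least-squares routine from a numerical library would be a more robust choice.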
Example 1: Air Pollution in US Cities (Sokal and Rohlf, 1981). A climatologist is interested in predicting air quality in 41 US cities. The mean
concentration of sulfur dioxide in the air and information on
seven explanatory variables were gathered over a 3-year period, as follows:
Dependent variable:
SO2       Average SO2 content of the air in micrograms per cubic meter
Independent variables:
FACTORY   Number of manufacturing enterprises employing 20 or more workers
POP       1979 population in thousands
TEMP      Average annual temperature in degrees Fahrenheit
WIND      Average annual wind speed
PRECIP    Average annual precipitation in inches
DAYRAIN   Average number of days with precipitation per year
DUST      Average concentration of dust particles in ppm
Table I. Relative statistics using all independent variables for the pollution data

Variable    Parameter estimate   t-statistic   VIF        TOL
INTERCEPT   112.15865            2.338          0.00000
POP           0.03932            2.564         14.34186   0.06973
FACTORY       0.06436            4.008         14.88308   0.06719
TEMP          1.28230            2.032          3.78325   0.26432
WIND          3.22214            1.747          1.26159   0.79265
PRECIP        0.49681            1.340          3.46483   0.28861
DAYRAIN       0.04807            0.292          3.46361   0.28872
DUST          0.23317            0.319          1.27935   0.78165
Symbol
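As a quick arithmetic check, the TOL column of Table I can be recomputed from the VIF column using TOL = 1/VIF, and the 0.1 rule of thumb applied. The values below are copied from Table I (the intercept is omitted); the variable names `table_i` and `flagged` are hypothetical.

```python
# (VIF, reported TOL) pairs as listed in Table I.
table_i = {
    "POP": (14.34186, 0.06973),
    "FACTORY": (14.88308, 0.06719),
    "TEMP": (3.78325, 0.26432),
    "WIND": (1.26159, 0.79265),
    "PRECIP": (3.46483, 0.28861),
    "DAYRAIN": (3.46361, 0.28872),
    "DUST": (1.27935, 0.78165),
}

flagged = []
for name, (vif, tol_reported) in table_i.items():
    tol = 1.0 / vif                      # tolerance is the reciprocal of VIF
    print(f"{name}: 1/VIF = {tol:.5f}, reported TOL = {tol_reported:.5f}")
    if tol < 0.1:                        # rule of thumb from the text
        flagged.append(name)

print(flagged)  # expected: ['POP', 'FACTORY']
```

Only POP and FACTORY fall below the 0.1 threshold, which agrees with the reading of the full model given in the text.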
Table II. Relative statistics using the stepwise selection method for the pollution data

Variable    Parameter estimate   t-statistic   VIF        TOL
INTERCEPT   26.32508             6.855          0.00000
FACTORY      0.08243             5.609         11.43374   0.08746
POP          0.05661             3.959         11.43374   0.08746
Symbol
From the full model, summarizing the nature of the multicollinearity among all independent variables, we might conclude that POP and
FACTORY are probably not both needed in the same model. Unfortunately,
with the stepwise selection method, we find that FACTORY and POP both remain
in the model, and their VIF and TOL values indicate that multicollinearity is still present.
If we instead fit a regression model to the pollution data
using the nested estimate procedure, retaining only the significant variables in the
model, we obtain the relative statistics for the significant variables shown in
Table III.
When we use the nested estimate procedure, the variables in
the model no longer show a multicollinearity problem. Even though
TEMP enters the model in place of POP, it does
not cause multicollinearity either. Because TEMP is used to fit a separate simple
regression to the residual, not to SO2 itself, FACTORY
and TEMP make separate contributions to the variation explained by the fitted model.
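One way to read the "separate contribution" argument is through a basic property of OLS: the residual of a simple regression is uncorrelated with the variable it was fitted on, so a second-stage fit on that residual can only pick up variation the first variable left unexplained. The sketch below checks this property numerically. The data and the names `fit_simple` and `corr` are hypothetical, and the check illustrates the OLS property only, not the paper's results.

```python
import math

def fit_simple(x, y):
    """OLS fit of y = b0 + b1*x; returns (b0, b1, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((p - mx) ** 2 for p in x)
    b1 = sum((p - mx) * (q - my) for p, q in zip(x, y)) / sxx
    b0 = my - b1 * mx
    return b0, b1, [q - (b0 + b1 * p) for p, q in zip(x, y)]

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

# Hypothetical first-stage variable and response.
x_first = [1, 2, 3, 4, 5, 6, 7, 8]
y_obs = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

b0, b1, resid = fit_simple(x_first, y_obs)
r = corr(x_first, resid)
# The residual carries no linear information about the first-stage variable.
print(abs(r) < 1e-9)
```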
Example 2: Labor Needs in US Navy Hospitals (Bowerman et al., 1993)
We present the case concerning the need of labor hours for 17 US Navy
Table III. Relative statistics using the nested estimate procedure for the pollution data

Variable    Parameter estimate    t-statistic   VIF       TOL
INTERCEPT   17.61057 + 56.33228                 0.00000
FACTORY      0.02686              5.268         1.03747   0.96388
TEMP         1.01020              2.782         1.03747   0.96388
Symbol
Independent variables: X1, X2, X3, X4, X5
Variable    Parameter estimate   t-statistic   VIF       TOL
INTERCEPT   1962.948156          1.832            0.00
X1            15.851675          0.162         9597.57   0.00010
X2             0.055930          2.631            7.94   0.12593
X3             1.589624          0.514         8933.09   0.00011
X4             4.218668          0.588           23.29   0.04293
X5           394.314117          1.881            4.28   0.23365
Symbol
Variable    Parameter estimate   t-statistic   VIF        TOL
INTERCEPT   68.3139590           0.299         0.000000
X2           0.0748659           3.913         5.647157   0.17708
X3           0.8228746           9.919         5.647157   0.17708
Symbol
Table VI. Relative statistics using the nested estimate procedure for the labor needs data

Variable    Parameter estimate    t-statistic   VIF       TOL
INTERCEPT   492.183+(3235.971)                  0.00000
X2            0.2470              11.209*       1.24921   0.80050
X5          549.0719               2.113*       1.24921   0.80050
Symbol
Author's Biography
Feng-Jenq Lin is a Doctor of Management Sciences. He is interested in forecast modelling.
His papers have appeared in the Asia-Pacific Journal of Operational Research, the Yugoslav Journal of Operations Research, the Journal of Information & Optimization Sciences, and
various other journals.