Topics Outline
Multiple Regression Model
Inferences about Regression Coefficients
F Test for the Overall Fit
Residual Analysis
Collinearity
Multiple Regression Model
Multiple regression models use two or more explanatory (independent) variables to predict the
value of a response (dependent) variable. With k explanatory variables, the multiple regression
model is expressed as follows:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
Here $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are the parameters and the error term $\varepsilon$ is a random variable which
accounts for the variability in y that cannot be explained by the linear effect of the k explanatory
variables. The assumptions about the error term in the multiple regression model parallel those
for the simple regression model.
Regression Assumptions
1. Linearity
The error term $\varepsilon$ is a random variable with mean 0.
Implication: For given values of $x_1, x_2, \ldots, x_k$, the expected, or average, value of y is given by
$$E(y) = \mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
(The relationship is linear, because each term on the right-hand side of the equation is additive, and the regression parameters do not enter the equation in a nonlinear manner, such as $\beta_i^2 x_i$. The graph of the relationship is no longer a line, however, because there are more than two variables involved.)
2. Independence
The values of $\varepsilon$ are statistically independent.
Implication: The value of y for a particular set of values for the explanatory variables is not
related to the value of y for any other set of values.
3. Normality
The error term $\varepsilon$ is a normally distributed random variable (with mean 0 and standard deviation $\sigma$).
Implication: Because $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are constants for the given values of $x_1, x_2, \ldots, x_k$, the response variable y is also a normally distributed random variable (with mean $\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$ and standard deviation $\sigma$).
4. Equal spread
The standard deviation $\sigma$ of $\varepsilon$ is the same for all values of the explanatory variables $x_1, x_2, \ldots, x_k$.
Implication: The standard deviation of y about the regression surface equals $\sigma$ and is the same for all values of $x_1, x_2, \ldots, x_k$.
[Diagram: the estimation process for multiple regression. The population model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$, where $\sigma$ is the standard deviation of $\varepsilon$, involves the unknown regression parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k, \sigma$. Sample data on $x_1, x_2, \ldots, x_k$ and y are used to compute sample statistics.]
The values of $a, b_1, b_2, \ldots, b_k, s_e$ provide the estimates of $\beta_0, \beta_1, \beta_2, \ldots, \beta_k, \sigma$.
The sample statistics $a, b_1, \ldots, b_k$ provide the following estimated multiple regression equation:
$$\hat{y} = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
where a is again the y-intercept, and $b_1$ through $b_k$ are the slopes. This is the equation of the fitted surface, also known as the least squares surface (a plane when k = 2, a line when k = 1).
Graphically, you are no longer fitting a line to a set of points. If there are exactly two explanatory
variables, you are fitting a plane to the data in three-dimensional space. There is one dimension
for the response variable and one for each of the two explanatory variables.
If there are more than two explanatory variables, then you can only imagine the regression surface;
drawing in four or more dimensions is impossible.
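To make the fitting concrete, here is a minimal sketch of least squares estimation with numpy; the x1, x2, and y arrays are made-up illustrative values (not the OmniPower data that follows), and the code is a sketch rather than a definitive implementation.

```python
import numpy as np

# Hypothetical illustrative data: two explanatory variables and a response.
x1 = np.array([59, 59, 79, 79, 99, 99], dtype=float)
x2 = np.array([200, 400, 200, 600, 400, 600], dtype=float)
y = np.array([4100, 4700, 3200, 4500, 2300, 3100], dtype=float)

# Design matrix: a leading column of 1s for the intercept a, then x1 and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares coefficients (a, b1, b2) minimize the sum of squared errors.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef

y_hat = X @ coef  # fitted values on the least squares plane
print(f"y-hat = {a:.2f} + ({b1:.4f}) x1 + ({b2:.4f}) x2")
```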
Example 1
OmniFoods
OmniFoods is a large food products company. The company is planning a nationwide introduction
of OmniPower, a new high-energy bar. Originally marketed to runners, mountain climbers, and
other athletes, high-energy bars are now popular with the general public. OmniFoods is anxious to
capture a share of this thriving market. The business objective facing the marketing manager at
OmniFoods is to develop a model to predict monthly sales volume per store of OmniPower bars
and to determine what variables influence sales. Two explanatory variables are considered here:
1. the price of an OmniPower bar, in cents ($x_1$), and
2. the monthly budget for in-store promotional expenditures, in dollars ($x_2$).
Data were collected from a sample of 34 stores:

Store   Number of Bars   Price (cents)   Promotion ($)
1       4141             59              200
2       3842             59              200
…       …                …               …
33      3354             99              600
34      2927             99              600
The regression output for these data is as follows.

ANOVA
Source        df    SS              MS              F       Significance F
Regression     2    39,472,730.77   19,736,365.39   48.48   0.0000
Residual      31    12,620,946.67   407,127.31
Total         33    52,093,677.44

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     5837.5208      628.1502          9.2932   0.0000    4556.3999   7118.6416
Price         -53.2173         6.8522         -7.7664   0.0000    -67.1925    -39.2421
Promotion       3.6131         0.6852          5.2728   0.0000      2.2155      5.0106
The computed values of the regression coefficients are a = 5,837.5208, $b_1$ = −53.2173, $b_2$ = 3.6131. Therefore, the multiple regression equation (representing the fitted regression plane) is
$$\hat{y} = 5837.5208 - 53.2173\,x_1 + 3.6131\,x_2$$
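As a quick illustration of using the fitted plane for prediction, here is a short sketch; the 79-cent price and $400 promotion budget are hypothetical inputs, not values taken from the example.

```python
# Fitted OmniPower equation: y-hat = 5837.5208 - 53.2173*x1 + 3.6131*x2
def predict_sales(price_cents, promotion_dollars):
    return 5837.5208 - 53.2173 * price_cents + 3.6131 * promotion_dollars

# Hypothetical store: price = 79 cents, promotion budget = $400.
print(predict_sales(79, 400))  # about 3078.6 bars per store per month
```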
The standard error of estimate is
$$s_e = \sqrt{\frac{\sum_i e_i^2}{n - k - 1}}$$
where n is the number of observations and k is the number of explanatory variables in the equation.
Fortunately, you can interpret $s_e$ exactly as before. It is a measure of the typical prediction error when the multiple regression equation is used to predict the response variable.
The coefficient of determination $r^2$ is again the proportion of variation in the response variable y explained by the combined set of explanatory variables $x_1, x_2, \ldots, x_k$. In fact, it even has the same formula as before:
$$r^2 = \frac{SSR}{SST} = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}}$$
From the ANOVA table, SSR = 39,472,730.77 and SST = 52,093,677.44. Thus,
$$r^2 = \frac{SSR}{SST} = \frac{39{,}472{,}730.77}{52{,}093{,}677.44} = 0.7577$$
The coefficient of determination indicates that 75.77% (about 76%) of the variation in sales is explained by the variation in the price and in the promotional expenditures.
The square root of $r^2$ is the correlation r between the fitted values $\hat{y}$ and the observed values y of the response variable, in both simple and multiple regression.
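A minimal sketch of these summary measures, assuming arrays y and y_hat from a fitted model (such as the numpy sketch earlier) and k explanatory variables:

```python
import numpy as np

def fit_summary(y, y_hat, k):
    """Standard error of estimate, r^2, and the correlation of y_hat with y."""
    e = y - y_hat                      # residuals
    n = len(y)
    sse = np.sum(e**2)                 # error sum of squares
    sst = np.sum((y - y.mean())**2)    # total sum of squares
    s_e = np.sqrt(sse / (n - k - 1))   # typical prediction error
    r2 = 1 - sse / sst                 # equals SSR/SST for a model with an intercept
    r = np.corrcoef(y_hat, y)[0, 1]    # equals sqrt(r2)
    return s_e, r2, r
```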
A graphical indication of the correlation can be seen in the plot of fitted (predicted) $\hat{y}$ values versus observed y values. If the regression equation gave perfect predictions, all of the points in this plot would lie on a 45° line: each fitted value would equal the corresponding observed value. Although a perfect fit virtually never occurs, the closer the points are to a 45° line, the better the fit is.
The correlation in the OmniPower example is $r = \sqrt{0.7577} \approx 0.87$, indicating a strong relationship between the two explanatory variables and the response variable. This is confirmed by the scatterplot of $\hat{y}$ values versus y values:
[Scatterplot: Predicted Bars versus Observed Bars]
Inferences about Regression Coefficients
To test whether an individual slope differs from zero, the hypotheses are
$$H_0: \beta_j = 0 \qquad H_a: \beta_j \neq 0$$
and the test statistic is $t = b_j / SE_{b_j}$. For the slope of sales with promotional expenditures,
$$t = \frac{b_2}{SE_{b_2}} = \frac{3.6131}{0.6852} = 5.2728 \quad \text{with } df = n - k - 1 = 34 - 2 - 1 = 31$$
The P-value is extremely small. Therefore, we reject the null hypothesis of no relationship between $x_2$ (promotional expenditures) and y (sales) and conclude that there is a statistically significant relationship between promotional expenditures and sales, taking into account the price $x_1$.
For the slope of sales with price, the respective test statistic and P-value are t = −7.7664 and P-value ≈ 0.0000. Thus, there is a significant relationship between price $x_1$ and sales, taking into account the promotional expenditures $x_2$.
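The same t test can be sketched in a few lines with scipy, using the slope estimate, its standard error, and the sample sizes from the output above:

```python
from scipy import stats

def slope_t_test(b, se_b, n, k):
    """Two-sided t test of H0: beta_j = 0 for a single regression slope."""
    t = b / se_b
    df = n - k - 1
    p_value = 2 * stats.t.sf(abs(t), df)  # two-sided P-value
    return t, df, p_value

# Promotion slope from the OmniPower output: b2 = 3.6131, SE = 0.6852, n = 34, k = 2.
print(slope_t_test(3.6131, 0.6852, 34, 2))  # t = 5.2728, df = 31, P ≈ 0.00001
```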
If we fail to reject the null hypothesis for a multiple regression coefficient, it does not mean that
the corresponding explanatory variable has no linear relationship to y. It means that the
corresponding explanatory variable contributes nothing to modeling y after allowing for all the
other explanatory variables.
The parameter $\beta_j$ in a multiple regression model can be quite different from zero even when there is no simple linear relationship between $x_j$ and y. The coefficient of $x_j$ in a multiple regression depends as much on the other explanatory variables as it does on $x_j$. It is even possible that the multiple regression slope changes sign when a new variable enters the regression model.
Confidence Intervals
To estimate the value of a population slope $\beta_j$, you can use the confidence interval $b_j \pm t^* SE_{b_j}$. The 95% confidence interval for the slope of sales with price is
$$b_1 \pm t^* SE_{b_1} = -53.2173 \pm 2.0395(6.8522) = -53.2173 \pm 13.9751$$
that is, approximately $(-67.19, -39.24)$.
Taking into account the effect of promotional expenditures, the estimated effect of a 1-cent
increase in price is to reduce mean sales by approximately 39.2 to 67.2 bars. You have 95%
confidence that this interval correctly estimates the relationship between these variables.
From a hypothesis-testing viewpoint, because this confidence interval does not include 0, you conclude that the regression coefficient $\beta_1$ has a significant effect.
The 95% confidence interval for the slope of sales with promotional expenditures is
$$b_2 \pm t^* SE_{b_2} = 3.6131 \pm 2.0395(0.6852) = 3.6131 \pm 1.3975$$
that is, approximately $(2.22, 5.01)$.
Thus, taking into account the effect of price, the estimated effect of each additional dollar of
promotional expenditures is to increase mean sales by approximately 2.22 to 5.01 bars. You have
95% confidence that this interval correctly estimates the relationship between these variables.
From a hypothesis-testing viewpoint, because this confidence interval does not include 0,
you can conclude that the regression coefficient $\beta_2$ has a significant effect.
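Both intervals can be reproduced with a short sketch, assuming the slope estimates and standard errors from the regression output:

```python
from scipy import stats

def slope_ci(b, se_b, n, k, level=0.95):
    """Confidence interval b_j ± t* SE_bj for a regression slope."""
    df = n - k - 1
    t_star = stats.t.ppf(1 - (1 - level) / 2, df)  # t* = 2.0395 for df = 31
    return b - t_star * se_b, b + t_star * se_b

print(slope_ci(-53.2173, 6.8522, 34, 2))  # price: about (-67.19, -39.24)
print(slope_ci(3.6131, 0.6852, 34, 2))    # promotion: about (2.22, 5.01)
```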
F Test for the Overall Fit
The F test is used to assess whether the model as a whole is useful. The hypotheses are
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
$$H_a: \text{At least one } \beta_j \neq 0$$
Failing to reject the null hypothesis implies that the explanatory variables are of little or no use in
explaining the variation in the response variable; that is, the regression model predicts no better
than just using the mean. Rejection of the null hypothesis implies that at least one of the
explanatory variables helps explain the variation in y and therefore, the regression model is useful.
The ANOVA table for multiple regression has the following form.
Source of     Degrees of   Sum of      Mean Squares              F statistic   P-value
Variation     Freedom      Squares     (Variance)
Regression    k            SSR         MSR = SSR / k             F = MSR/MSE   Prob > F
Error         n − k − 1    SSE         MSE = SSE / (n − k − 1)
Total         n − 1        SST
For the OmniPower data, the hypotheses are
$$H_0: \beta_1 = \beta_2 = 0$$
$$H_a: \beta_1 \neq 0 \text{ and/or } \beta_2 \neq 0$$
From the regression output, F = 48.48 with a P-value of 0.0000. We reject $H_0$ and conclude that at least one of the explanatory variables (price and/or promotional expenditures) is related to sales.
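A sketch of the overall F test computed directly from the ANOVA sums of squares (values from the OmniPower output):

```python
from scipy import stats

def overall_f_test(ssr, sse, n, k):
    """F test of H0: beta_1 = ... = beta_k = 0."""
    msr = ssr / k            # mean square for regression
    mse = sse / (n - k - 1)  # mean square error
    f = msr / mse
    p_value = stats.f.sf(f, k, n - k - 1)  # P(F > f)
    return f, p_value

print(overall_f_test(39_472_730.77, 12_620_946.67, 34, 2))  # F ≈ 48.48, P ≈ 0.0000
```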
Residual Analysis
Three types of residual plots are appropriate for multiple regression.
1. Residuals versus $\hat{y}$ (the predicted values of y)
This plot should look patternless. If the residuals show a pattern (e.g., a trend, bend, or clumping), there is evidence of a possible curvilinear effect in at least one explanatory variable, a possible violation of the assumption of equal variance, and/or the need to transform the y variable.
2. Residuals versus each x
Patterns in the plot of the residuals versus an explanatory variable may indicate the existence of a
curvilinear effect and, therefore, the need to add a curvilinear explanatory variable to the
multiple regression model.
3. Residuals versus time
This plot is used to investigate patterns in the residuals in order to validate the independence
assumption when one of the x-variables is related to time or is itself time.
Below are the residual plots for the OmniPower sales example. There is very little or no pattern
in the relationship between the residuals and the predicted value of y, the value of x1 (price), or
the value of x2 (promotional expenditures). Thus, you can conclude that the multiple regression
model is appropriate for predicting sales.
There is no need to plot the residuals versus time because the data were not collected in time order.
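The first two kinds of plots can be produced with a sketch like the following, assuming residuals e, fitted values y_hat, and explanatory variables from an earlier fit (names follow the numpy sketch above):

```python
import matplotlib.pyplot as plt

def residual_plots(e, y_hat, xs, names):
    """Plot residuals versus the fitted values and versus each explanatory variable."""
    fig, axes = plt.subplots(1, 1 + len(xs), figsize=(4 * (1 + len(xs)), 3))
    for ax, v, name in zip(axes, [y_hat, *xs], ["predicted y", *names]):
        ax.scatter(v, e)
        ax.axhline(0, color="gray")  # a patternless scatter around 0 is what we want
        ax.set_xlabel(name)
        ax.set_ylabel("residuals")
    fig.tight_layout()
    plt.show()

# residual_plots(e, y_hat, [x1, x2], ["price", "promotion"])
```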
[Plot: Residuals versus Predicted Bars]
[Plots: Residuals versus Price and Residuals versus Promotion]
The third regression assumption states that the errors are normally distributed. We can check it the same way as in simple regression, by forming a histogram or a normal probability (Q-Q) plot of the residuals. If the third assumption holds, the histogram should be approximately symmetric and bell-shaped, and the points in the normal probability plot should be close to a 45° line. But obvious skewness, too many residuals lying more than, say, two standard deviations from the mean, or some other nonnormal property indicates a violation of the third assumption.
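Both checks can be sketched as follows, assuming an array of residuals e from an earlier fit:

```python
import matplotlib.pyplot as plt
from scipy import stats

def normality_plots(e):
    """Histogram and normal Q-Q plot of the residuals."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(e, bins=10)  # should look roughly symmetric and bell-shaped
    ax1.set_xlabel("residuals")
    ax1.set_ylabel("frequency")
    stats.probplot(e, dist="norm", plot=ax2)  # points should hug the 45° line
    fig.tight_layout()
    plt.show()
```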
Neither the histogram nor the normal probability plot for the OmniPower example shows any severe signs of departure from normality.
[Figures: histogram of the residuals and Q-Q normal plot of the residuals]
Collinearity
Most explanatory variables in a multiple regression problem are correlated to some degree with
one another. For example, in the OmniPower case the correlation matrix is
                Price (x1)   Promotion (x2)   Bars (y)
Price (x1)      1.0000
Promotion (x2)  0.0968       1.0000
Bars (y)        -0.7351      0.5351           1.0000
The correlation between price and promotion is 0.0968. Thus, we find only a slight degree of linear association between the two explanatory variables.
Low correlations among the explanatory variables generally do not result in serious deterioration
of the quality of the least squares estimates. However, when the explanatory variables are highly
correlated, it becomes difficult to determine the separate effect of any particular explanatory
variable on the response variable. We interpret the regression coefficients as measuring the change
in the response variable when the corresponding explanatory variable increases by 1 unit while all
the other explanatory variables are held constant. The interpretation may be impossible when the
explanatory variables are highly correlated, because when the explanatory variable changes by 1
unit, some or all of the other explanatory variables will change.
Collinearity (also called multicollinearity or intercorrelation) is a condition that exists when
two or more of the explanatory variables are highly correlated with each other. When highly
correlated explanatory variables are included in the regression model, they can adversely affect the
regression results. Two of the most serious problems that can arise are:
1. The estimated regression coefficients may be far from the population parameters, including the possibility that the statistic and the parameter being estimated may have opposite signs. For example, the true slope $\beta_2$ might actually be +10, while its estimate $b_2$ turns out to be −3.
2. You might find a regression that is very highly significant based on the F test but for which
not even one of the t tests of the individual x variables is significant. Thus, variables that are
really related to the response variable can look like they aren't related, based on their P-values.
In other words, the regression result is telling you that the x variables taken as a group explain
a lot about y, but it is impossible to single out any particular x variables as being responsible.
Statisticians have developed several routines for determining whether collinearity is high enough
to cause problems. Here are the three most widely used techniques:
1. Pairwise correlations between xs
The rule of thumb suggests that collinearity is a potential problem if the absolute value of the
correlation between any two explanatory variables exceeds 0.7.
(Note: Some statisticians suggest a cutoff of 0.5 instead of 0.7.)
2. Pairwise correlations between y and xs
The rule of thumb suggests that collinearity may be a serious problem if any of the pairwise
correlations among the x variables is larger than the largest of the correlations between the y
variable and the x variables.
3. Variance inflation factors (VIF)
The variance inflation factor for the j-th explanatory variable is
$$VIF_j = \frac{1}{1 - r_j^2}$$
where $r_j^2$ is the coefficient of determination for a regression model using variable $x_j$ as the response variable and all other x variables as explanatory variables.
The VIF tells how much the variance of the regression coefficient has been inflated due to collinearity. The higher the VIF, the higher the standard error of its coefficient and the less it can contribute to the regression model. More specifically, $r_j^2$ shows how well the j-th explanatory variable can be predicted by the other explanatory variables. The $1 - r_j^2$ term measures what that explanatory variable has left to bring to the model. If $r_j^2$ is high, then not only is that variable superfluous, but it can damage the regression model.
Since $r_j^2$ cannot be less than zero, the minimum value of the VIF is 1. If a set of explanatory variables is uncorrelated, then each $r_j^2 = 0$ and each $VIF_j$ is equal to 1. As $r_j^2$ increases, $VIF_j$ increases also. For example, if $r_j^2 = 0.9$, then $VIF_j = 1/(1 - 0.9) = 10$; if $r_j^2 = 0.99$, then $VIF_j = 1/(1 - 0.99) = 100$.
How large the VIFs must be to suggest a serious problem with collinearity is not completely clear. In general, any individual $VIF_j$ larger than 10 is considered an indication of a potential collinearity problem. (Note: Some statisticians suggest using the cutoff of 5 instead of 10.)
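In practice the VIFs need not be computed by hand; here is a sketch using statsmodels, assuming a design matrix X whose first column is the constant term (as in the numpy sketch earlier):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vifs(X):
    """VIF for each explanatory variable; column 0 (the constant) is skipped."""
    return [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]

# vifs(X)  # values near 1 indicate little collinearity; above 10 (or 5) is a red flag
```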
In the OmniPower sales data, the correlation between the two explanatory variables, price and
promotional expenditure, is 0.0968. Because there are only two explanatory variables in the model,
$$VIF_1 = VIF_2 = \frac{1}{1 - 0.0968^2} = 1.009$$
Since all VIFs (two in this example) are less than 10 (or less than the more conservative value of 5), you can conclude that there is no problem with collinearity for the OmniPower sales data.
One solution to the collinearity problem is to delete the variable with the largest VIF value.
The reduced model is often free of collinearity problems.
Another solution is to redefine some of the variables so that each x variable has a clear, unique role in explaining y. For example, if $x_1$ and $x_2$ are collinear, you might try using $x_1$ and the ratio $x_2 / x_1$ instead.
If possible, every attempt should be made to avoid including explanatory variables that are
highly correlated. In practice, however, strict adherence to this policy is rarely achievable.