Copyright © 2015 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
Multiple Regression and Model
Building
14.1 The Multiple Regression Model and the
Least Squares Point Estimate
14.2 Model Assumptions and the Standard Error
14.3 R² and Adjusted R²
14.4 The Overall F Test
14.5 Testing the Significance of an Independent
Variable
14.6 Confidence and Prediction Intervals
14-2
Multiple Regression and Model
Building Continued
14.7 The Sales Territory Performance Case:
Evaluating Employee Performance
14.8 Using Dummy Variables to Model
Qualitative Independent Variables
14.9 Using Squared and Interactive Terms
14.10 Model Building and the Effects of
Multicollinearity
14.11 Residual Analysis in Multiple Regression
14.12 Logistic Regression
14-3
LO 14-1: Explain the
multiple regression
model and the related
least squares point
estimates.
14.1 The Multiple Regression Model and
the Least Squares Point Estimate
Simple linear regression used one independent
variable to explain the dependent variable
Some relationships are too complex to be described
using a single independent variable
Multiple regression uses two or more independent
variables to describe the dependent variable
This allows multiple regression models to handle more
complex situations
In principle, there is no limit to the number of
independent variables a model can use
Multiple regression has only one dependent variable
14-4
LO14-1
14-5
LO14-1
The Least Squares Estimates and Point
Estimation and Prediction
1. Estimation/prediction equation
ŷ = b0 + b1x1 + b2x2 + … + bkxk
is the point estimate of the mean value of the
dependent variable when the values of the
independent variables are x1, x2,…, xk
2. It is also the point prediction of an individual value of
the dependent variable when the values of the
independent variables are x1, x2,…, xk
3. b0, b1, b2,…, bk are the least squares point
estimates of the parameters β0, β1, β2,…, βk
4. x1, x2,…, xk are specified values of the independent
predictor variables x1, x2,…, xk
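As a minimal sketch (not from the text), the least squares point estimates and the point prediction can be computed with Python's statsmodels; all data values below are hypothetical, chosen only to make the example run:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: dependent variable y and two independent variables
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    y = np.array([3.1, 4.2, 7.8, 8.1, 12.3, 12.9])

    X = sm.add_constant(np.column_stack([x1, x2]))  # prepend the intercept column
    model = sm.OLS(y, X).fit()

    print(model.params)                      # b0, b1, b2: least squares point estimates
    print(model.predict([[1.0, 2.5, 3.0]]))  # ŷ at x1 = 2.5, x2 = 3.0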
14-6
LO14-1
EXAMPLE 14.1 The Tasty Sub
Shop Case
y = β0 + β1x1 + β2x2 + … + βkxk + ε
14-8
LO14-2
14-9
LO14-2
Sum of Squares
Sum of squared errors:
SSE = Σei² = Σ(yi − ŷi)²
Standard error: point estimate of the residual
standard deviation σ:
s = √MSE = √(SSE / (n − (k + 1)))
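As a sketch, these quantities can be computed directly from a model's residuals (hypothetical helper, assuming a fit with k independent variables):

    import numpy as np

    def sse_mse_s(y, y_hat, k):
        # SSE = Σ(yi - ŷi)²; MSE = SSE / (n - (k + 1)); s = √MSE
        resid = np.asarray(y) - np.asarray(y_hat)
        sse = float(np.sum(resid ** 2))
        mse = sse / (len(resid) - (k + 1))
        return sse, mse, mse ** 0.5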
14-10
LO 14-3: Calculate and
interpret the multiple
and adjusted multiple
coefficients of
determination.
14.3 R² and Adjusted R²
14-12
LO14-3
14-13
LO14-3
The Adjusted R²
Adding an independent variable to multiple
regression will raise R²
R² will rise slightly even if the new variable
has no relationship to y
The adjusted R² corrects this tendency in R²
As a result, it gives a better estimate of the
importance of the independent variables
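For reference, one standard (and algebraically equivalent) form of the adjustment is

adjusted R² = 1 − (1 − R²)(n − 1) / (n − (k + 1))

where n is the number of observations and k the number of independent variables; the penalty grows with k, so adjusted R² rises only when a new variable explains enough additional variation.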
14-14
LO 14-4: Test the
significance of a
multiple regression
model by using an F
test.
14.4 The Overall F Test
To test
H0: β1= β2 = …= βk = 0 versus
Ha: At least one of β1, β2,…, βk ≠ 0
Test statistic:
F(model) = (Explained variation / k) / (Unexplained variation / [n − (k + 1)])
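A sketch of the same computation in Python (hypothetical helper; after an OLS fit, statsmodels reports this statistic as model.fvalue and its p-value as model.f_pvalue):

    import numpy as np

    def overall_f(y, y_hat, k):
        # F(model) = (explained variation / k) / (unexplained variation / (n - (k + 1)))
        y, y_hat = np.asarray(y), np.asarray(y_hat)
        n = len(y)
        explained = np.sum((y_hat - y.mean()) ** 2)
        unexplained = np.sum((y - y_hat) ** 2)
        return (explained / k) / (unexplained / (n - (k + 1)))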
14-15
LO 14-5: Test the
significance of a
single independent
variable.
14.5 Testing the Significance of an
Independent Variable
A variable in a multiple regression model is
not likely to be useful unless there is a
significant relationship between it and y
To test significance, we use the null
hypothesis H0: βj = 0 versus the
alternative hypothesis Ha: βj ≠ 0
14-16
LO14-5
Testing Significance of an Independent
Variable #2
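(The detail on this slide appears as a figure in the original deck. The standard test statistic is t = bj / sbj, which, if the regression assumptions hold and H0: βj = 0 is true, follows a t distribution with n − (k + 1) degrees of freedom; H0 is rejected at level α when |t| > tα/2.)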
14-17
LO14-5
Testing Significance of an
Independent Variable #3
Customary to test significance of every
independent variable
If we can reject H0: βj = 0 at α = 0.05, we have
strong evidence that the independent variable xj is
significantly related to y
If we can reject H0: βj = 0 at α = 0.01, we have
very strong evidence that the independent
variable xj is significantly related to y
The smaller the significance level at which
H0 can be rejected, the stronger the evidence
that xj is significantly related to y
14-18
LO14-5
A Confidence Interval for the
Regression Parameter βj
If the regression assumptions hold, a 100(1 −
α)% confidence interval for βj
is [bj ± tα/2 sbj]
tα/2 is based on n − (k + 1) degrees of
freedom
14-19
LO 14-6: Find and
interpret a confidence
interval for a mean
value and a prediction
interval for an individual
value.
14.6 Confidence and Prediction
Intervals
The point estimate given by the regression
equation at particular values x1,
x2,…, xk of the independent variables is
ŷ = b0 + b1x1 + b2x2 + … + bkxk
It is unlikely that this value will equal the
mean value of y for these x values
Therefore, we need to place bounds on how
far away the predicted value might be
We can do this by calculating a confidence
interval for the mean value of y and a
prediction interval for an individual value of y
14-20
LO14-6
Distance Value
Both the confidence interval for the mean
value of y and the prediction interval for an
individual value of y employ a quantity called
the distance value
With simple regression, we were able to
calculate the distance value fairly easily
However, for multiple regression, calculating
the distance value requires matrix algebra
14-21
LO14-6
A Confidence Interval for a Mean
Value of y
Assume the regression assumptions hold
Confidence interval for the mean value of y:
[ŷ ± tα/2 s √(Distance value)]
Prediction interval for an individual value of y:
[ŷ ± tα/2 s √(1 + Distance value)]
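As a sketch, statsmodels produces both intervals at a new point (hypothetical data; the mean_ci_* columns give the confidence interval for the mean, obs_ci_* the prediction interval for an individual value):

    import numpy as np
    import statsmodels.api as sm

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    y = np.array([3.1, 4.2, 7.8, 8.1, 12.3, 12.9])
    model = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

    frame = model.get_prediction([[1.0, 2.5, 3.0]]).summary_frame(alpha=0.05)
    print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])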
14-22
14.7 The Sales Territory Performance
Case: Evaluating Employee Performance
y = Yearly sales of the company's product
x1 = Number of months the representative has been employed
x2 = Sales of products in the sales territory
x3 = Dollar advertising expenditure in the territory
x4 = Weighted average of the company's market share in the territory for the previous four years
x5 = Change in the company's market share in the territory over the previous four years
14-23
Partial Excel Output of a Regression Analysis
of the Sales Territory Performance Data
14-25
LO14-7
How to Construct Dummy
Variables
A dummy variable always has a value of
either 0 or 1
For example, to model sales at two locations,
we would code the first location as 0 and
the second as 1
Operationally, it does not matter which is
coded 0 and which is coded 1
14-26
LO14-7
What If We Have More Than Two
Categories?
Consider having three categories, say A, B
and C
Cannot code this using one dummy variable
Coding A=0, B=1, and C=2 would be invalid because
it assumes the difference between A and B is the
same as the difference between B and C
We must use multiple dummy variables
Specifically, k categories require k − 1 dummy
variables, as in the sketch below
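A minimal sketch of this coding with pandas (hypothetical labels A, B, C; drop_first=True produces the k − 1 dummies, with A as the baseline category):

    import pandas as pd

    labels = pd.Series(["A", "B", "C", "A", "B", "C"], name="category")
    dummies = pd.get_dummies(labels, drop_first=True, dtype=float)
    print(dummies)  # two columns, B and C; category A is coded (0, 0)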
14-27
LO14-7
14-28
LO14-7
Interaction Models
So far, have only considered dummy
variables as stand-alone variables
The model so far is y = β0 + β1x + β2D + ε
where D is a dummy variable
However, we can also look at the interaction
between the dummy variable and other variables
That model takes the form
y = β0 + β1x + β2D + β3xD + ε
With an interaction term, both the intercept
and slope are shifted
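A sketch of fitting the interaction model with statsmodels formulas (all data hypothetical; x:D denotes the interaction term):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "x": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
        "D": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
        "y": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 4.9, 8.1, 11.0, 14.2, 17.0, 20.1],
    })
    fit = smf.ols("y ~ x + D + x:D", data=df).fit()
    print(fit.params)  # the D coefficient shifts the intercept; x:D shifts the slope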
14-29
LO 14-8: Use
squared and
interaction variables.
14.9 Using Squared and
Interaction Variables
The quadratic regression model is:
y = β0 + β1x + β2x² + ε
where
1. β0 + β1x + β2x² is μy, the mean value of y
2. β0, β1, and β2 are the regression parameters
3. ε is an error term
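A numpy-only sketch of fitting the quadratic model by least squares (made-up data that curve upward):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.2, 3.1, 5.0, 8.2, 12.4, 17.9])

    X = np.column_stack([np.ones_like(x), x, x ** 2])  # columns: 1, x, x²
    b, *_ = np.linalg.lstsq(X, y, rcond=None)          # b0, b1, b2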
14-30
LO14-8
14-31
LO 14-9: Describe
multicollinearity and
build a multiple
regression model.
14.10 Model Building and the
Effects of Multicollinearity
Multicollinearity: when “independent”
variables are related to one another
Considered severe when the simple
correlation exceeds 0.9
Even moderate multicollinearity can be a
problem
Another measurement is variance inflation
factors
Multicollinearity is a problem when VIF > 10
and a moderate problem when VIF > 5
VIFj = 1 / (1 − R²j)
where R²j is the R² from regressing xj on the
other independent variables
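One way to compute VIFs, sketched with statsmodels (hypothetical design matrix; x2 is deliberately close to 2·x1 so the VIFs come out large):

    import numpy as np
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
    X = np.column_stack([np.ones_like(x1), x1, x2])  # constant, x1, x2

    vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
    print(vifs)  # flag VIF > 10 as severe, VIF > 5 as moderate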
14-32
LO14-9
Effect of Adding Independent
Variable
Adding any independent variable will increase
R²
Even adding an unimportant independent
variable
Thus, R² cannot tell us that adding an
independent variable is undesirable
14-33
LO14-9
A Better Criterion is the Standard
Error
A better criterion is the size of the standard
error s
If s increases when an independent variable
is added, we should not add that variable
However, a decrease in s alone is not enough
An independent variable should be included
only if it reduces s enough to offset the
resulting increase in tα/2, so that the
prediction interval for y becomes shorter
s = √(SSE / (n − (k + 1)))
14-34
LO14-9
C Statistic
Another quantity for comparing regression
models is called the C (a.k.a. Cp) statistic
First, calculate the mean square error for the
model containing all p potential independent
variables (s²p)
Next, calculate the SSE for a reduced model
with k independent variables
C = SSE / s²p − [n − 2(k + 1)]
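A hypothetical helper making the computation concrete (sse_k is the reduced model's SSE, s2_p the full model's mean square error):

    def c_statistic(sse_k, s2_p, n, k):
        # Mallows' C for a k-variable subset: SSE / s_p² - (n - 2(k + 1))
        return sse_k / s2_p - (n - 2 * (k + 1))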
14-35
LO14-9
C Statistic Continued
We want the value of C to be small
Adding unimportant independent variables
will raise the value of C
While we want C to be small, we also wish to
find a model for which C roughly equals k+1
A model with C substantially greater than k+1
has substantial bias and is undesirable
If a model has a small value of C and C for
this model is less than k+1, then it is not
biased and the model should be considered
desirable
14-36
LO14-9
The Partial F Test: An F Test for a Portion
of a Regression Model
To test
H0: All of the βj coefficients corresponding to the
independent variables in the subset are zero
Ha: At least one of the βj coefficients is not equal to
zero
F = [(SSER − SSEC) / k*] / [SSEC / (n − (k + 1))]
where SSER and SSEC are the SSEs of the reduced and
complete models and k* is the number of coefficients
set to zero in H0
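A hypothetical helper spelling out the computation:

    def partial_f(sse_reduced, sse_complete, n, k, k_star):
        # k_star = number of coefficients tested; k = number of variables
        # in the complete model
        numerator = (sse_reduced - sse_complete) / k_star
        denominator = sse_complete / (n - (k + 1))
        return numerator / denominator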
14-37
LO 14-10: Use residual
analysis to check the
assumptions of multiple
regression.
14.11 Residual Analysis in
Multiple Regression
For an observed value of yi, the residual is
ei = yi − ŷi = yi − (b0 + b1xi1 + … + bkxik)
If the regression assumptions hold, the residuals
should look like a random sample from a normal
distribution with mean 0 and variance σ2
Residual plots
Residuals versus each independent variable
Residuals versus predicted y’s
Residuals in time order (if the response is a time
series)
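A matplotlib sketch of the three plots (hypothetical helper; pass the fitted model's residuals, predicted values, and predictor columns):

    import matplotlib.pyplot as plt

    def residual_plots(resid, fitted, predictors, names):
        for x, name in zip(predictors, names):    # residuals vs each predictor
            plt.figure(); plt.scatter(x, resid); plt.axhline(0.0)
            plt.xlabel(name); plt.ylabel("residual")
        plt.figure(); plt.scatter(fitted, resid)  # residuals vs predicted y
        plt.axhline(0.0); plt.xlabel("predicted y"); plt.ylabel("residual")
        plt.figure(); plt.plot(resid, marker="o") # time order, if applicable
        plt.xlabel("observation order"); plt.ylabel("residual")
        plt.show()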
14-38
LO14-10
Residual Plots for the Sales
Territory Performance Model