
Multiple regression

• Regression with more than one predictor is called "multiple regression".

• It uses more than a single predictor (independent variable) to make predictions.

• We wish to build a model that fits the data better than the simple linear regression model:

yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + … + βₚxₚᵢ + εᵢ
Model and Required Conditions

• We allow for k independent variables to potentially be related to the dependent variable:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

where Y is the dependent variable, X₁, …, Xₖ are the independent variables, β₀, β₁, …, βₖ are the coefficients, and ε is the random error variable.
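As a sketch of how such a model is fit in practice, here is a minimal least-squares example on synthetic data (NumPy-based; the predictor names, data, and coefficient values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two made-up predictors and a response generated from known coefficients
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 1, n)  # beta0 = 3, beta1 = 1.5, beta2 = -2

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of (beta0, beta1, beta2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates close to (3.0, 1.5, -2.0)
```

With a reasonable sample size, the estimated coefficients land close to the values used to generate the data.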


Model Assessment
• The model is assessed using three measures:
– The standard error of estimate
– The coefficient of determination
– The F-test of the analysis of variance

• The standard error of estimate is used in the calculations for the other measures.
Standard Error of Estimate
• The standard deviation of the error is estimated by the Standard Error of Estimate:

s_ε = √( SSE / (n − k − 1) )

Coefficient of Determination

• The definition is:

R² = 1 − SSE / Σ(Yᵢ − Ȳ)²
Testing the Validity of the Model

• Consider the question: is there at least one independent variable linearly related to the dependent variable?

• To answer this question, we test the hypotheses:

H₀: β₁ = β₂ = … = βₖ = 0
H₁: At least one βᵢ is not equal to zero.

• If at least one βᵢ is not equal to zero, the model has some validity.
• The hypotheses can be tested by an ANOVA procedure, with test statistic F = MSR/MSE. The output is:

ANOVA
              df             SS       MS                      F      Significance F
Regression    k = 6          3123.8   520.6 = SSR/k           17.14  0.0000
Residual      n − k − 1 = 93 2825.6   30.4 = SSE/(n − k − 1)
Total         n − 1 = 99     5949.5

SSR: Sum of Squares for Regression (= 3123.8)
SSE: Sum of Squares for Error (= 2825.6)
SST: Sum of Squares Total (= 5949.5)
• As in analysis of variance, we have:

Total Variation in Y (SST) = SSR + SSE.

Large F indicates a large SSR; that is, much of the variation in Y is explained by the regression model. Therefore, if F is large, the model is considered valid and the null hypothesis should be rejected.

The test statistic and rejection region:

F = (SSR / k) / (SSE / (n − k − 1)) = MSR / MSE

Reject H₀ if F > F_{α, k, n−k−1}
ANOVA
              df    SS       MS      F      Significance F
Regression    6     3123.8   520.6   17.14  0.0000
Residual      93    2825.6   30.4
Total         99    5949.5

F_{α, k, n−k−1} = F_{0.05, 6, 100−6−1} = 2.17
F = 17.14 > 2.17
Also, the p-value (Significance F) = 0.0000.
Reject the null hypothesis.
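The F statistic in the table can be recomputed from the sums of squares; a quick check in Python (numbers taken from the ANOVA table above):

```python
# ANOVA quantities from the example (k = 6 predictors, n = 100 observations)
SSR, SSE = 3123.8, 2825.6
n, k = 100, 6

MSR = SSR / k            # mean square for regression
MSE = SSE / (n - k - 1)  # mean square for error
F = MSR / MSE            # F ≈ 17.14, matching the table

print(round(MSR, 1), round(MSE, 1), round(F, 2))
```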

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative: at least one of the βᵢ is not equal to zero.

Thus, at least one independent variable is linearly related to Y. This linear regression model is valid.
Testing Individual Coefficients

• The hypotheses for each βᵢ are:

H₀: βᵢ = 0
H₁: βᵢ ≠ 0

Test statistic:

t = (bᵢ − βᵢ) / s_{bᵢ},  d.f. = n − k − 1

• Output:

               Coefficients  Standard Error  t Stat  P-value
Intercept      38.14         6.99            5.45    0.0000
Number         -0.0076       0.0013          -6.07   0.0000
Nearest        1.65          0.63            2.60    0.0108
Office Space   0.020         0.0034          5.80    0.0000
Enrollment     0.21          0.13            1.59    0.1159
Income         0.41          0.14            2.96    0.0039
Distance       -0.23         0.18            -1.26   0.2107

(Enrollment and Distance: p > 0.05, insufficient evidence; the test for the intercept is usually ignored.)
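A small sketch of the decision rule applied to the coefficient table above (names and numbers copied from the output; the intercept row is omitted since its test is usually ignored):

```python
# (name, coefficient, standard error, p-value) from the regression output above
results = [
    ("Number",       -0.0076, 0.0013, 0.0000),
    ("Nearest",       1.65,   0.63,   0.0108),
    ("Office Space",  0.020,  0.0034, 0.0000),
    ("Enrollment",    0.21,   0.13,   0.1159),
    ("Income",        0.41,   0.14,   0.0039),
    ("Distance",     -0.23,   0.18,   0.2107),
]

alpha = 0.05
# Reject H0: beta_i = 0 when the p-value is below alpha
significant = [name for name, coef, se, p in results if p < alpha]
not_significant = [name for name, coef, se, p in results if p >= alpha]

print(significant)      # predictors linearly related to Y, given the others
print(not_significant)  # insufficient evidence: Enrollment, Distance
```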
Example: Sex discrimination in wages
• Do female employees tend to receive lower starting salaries than similarly qualified and experienced male employees?

Variables collected
• 93 employees on data file (61 female, 32 male).

• bsal: Annual salary at time of recruitment.
• sal77: Annual salary in 1977.
• educ: Years of education.
• exper: Months of work experience prior to hire at the bank.
• fsex: 1 if female, 0 if male.
• senior: Months worked at the bank since hire.
• age: Age in months.
Comparison for males and females
• This shows men started at higher salaries than women (t = 6.3, p < .0001).

• But it doesn't control for other characteristics.

[Figure: Oneway Analysis of bsal By fsex — side-by-side plot of bsal (4000–8000) for Female vs. Male]
Relationships of bsal with other variables

• Seniority and education predict bsal well. We want to control for them when judging the gender effect.

[Figure: Bivariate fits of bsal (4000–8000) by senior, age, educ, and exper, each with a linear fit]
Multiple regression model

• For any combination of values of the predictor variables, the average value of the response (bsal) lies on a straight line:

bsalᵢ = β₀ + β₁fsexᵢ + β₂seniorᵢ + β₃ageᵢ + β₄educᵢ + β₅experᵢ + εᵢ


Output from regression
(fsex = 1 for females, = 0 for males)

Term     Estimate  Std Error  t Ratio  Prob>|t|
Int.     6277.9    652        9.62     <.0001
Fsex     -767.9    128.9      -5.95    <.0001
Senior   -22.6     5.3        -4.26    <.0001
Age      0.63      0.72       0.88     0.3837
Educ     92.3      24.8       3.71     0.0004
Exper    0.50      1.05       0.47     0.6364

Predictions
• Example: prediction of beginning wages for a woman with 10 months seniority, who is 25 years old (300 months), with 12 years of education and 2 years (24 months) of experience:

bsalᵢ = β₀ + β₁fsexᵢ + β₂seniorᵢ + β₃ageᵢ + β₄educᵢ + β₅experᵢ + εᵢ

• Pred. bsal = 6277.9 − 767.9(1) − 22.6(10) + 0.63(300) + 92.3(12) + 0.50(24)
             = 6592.6
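The same prediction as a short script, using the estimates from the regression output above:

```python
# Estimated coefficients from the regression output (Term: Estimate)
coef = {
    "intercept": 6277.9,
    "fsex": -767.9,
    "senior": -22.6,
    "age": 0.63,
    "educ": 92.3,
    "exper": 0.50,
}

# A woman with 10 months seniority, age 25 years (300 months),
# 12 years of education, and 2 years (24 months) of experience
x = {"fsex": 1, "senior": 10, "age": 300, "educ": 12, "exper": 24}

pred_bsal = coef["intercept"] + sum(coef[name] * value for name, value in x.items())
print(round(pred_bsal, 1))  # 6592.6
```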
Interpretation of coefficients in multiple regression
• Each estimated coefficient is the amount Y is expected to increase when the value of its corresponding predictor is increased by one unit, holding constant the values of the other predictors.

• Example: the estimated coefficient of education equals 92.3. For each additional year of education, we expect salary to increase by about 92 dollars, holding all other variables constant.

• The estimated coefficient of fsex equals -767.9. For employees who started at the same time, had the same education and experience, and were the same age, women earned $767 less on average than men.
Which variable is the strongest predictor of the outcome?

• The predictor with the strongest linear association with the outcome variable is the one whose coefficient has the largest absolute value of t, which equals the coefficient divided by its SE.

• Example: in the wages regression, seniority is a better predictor than education because it has a larger |t| (4.26 vs. 3.71).
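A quick sketch of this ranking, using the t ratios from the output table above (intercept excluded; note that in this table fsex itself has the largest |t|, while among the control variables seniority outranks education):

```python
# t ratios copied from the wages regression output (Term: t Ratio)
t_ratios = {"Fsex": -5.95, "Senior": -4.26, "Age": 0.88, "Educ": 3.71, "Exper": 0.47}

# Rank predictors by the absolute value of their t ratios
ranking = sorted(t_ratios, key=lambda name: abs(t_ratios[name]), reverse=True)
print(ranking)  # ['Fsex', 'Senior', 'Educ', 'Age', 'Exper']
```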
Hypothesis tests for coefficients

• The reported t-statistics (coefficient / SE) and p-values are used to test whether a particular coefficient equals 0, given that all other coefficients are in the model.

• Examples:

1) The test of whether the coefficient of education equals zero has p-value = 0.0004. Hence, reject the null hypothesis; education is a useful predictor of bsal when all the other predictors are in the model.

2) The test of whether the coefficient of experience equals zero has p-value = 0.6364. Hence, we cannot reject the null hypothesis; experience does not appear to be a particularly useful predictor of bsal when all other predictors are in the model.
Checking assumptions

• Plot the residuals versus the predicted values from the regression.

• Also plot the residuals versus each of the predictors.

• If these plots show non-random patterns, the assumptions might be violated.
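As an illustration of spotting one such pattern numerically, here is a rough sketch on synthetic data: a plot is the usual diagnostic, but the correlation between |residual| and fitted value gives a quick signal of non-constant variance (the data-generating setup is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Synthetic data where the error spread grows with x (heteroscedastic)
x = rng.uniform(1, 10, n)
y = 2.0 * x + rng.normal(0, 1, n) * x  # noise standard deviation proportional to x

# Fit a simple regression and compute residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# A fan-shaped residual plot shows up as |residual| increasing with the fitted value
fan_signal = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(round(fan_signal, 2))  # clearly positive => suspect non-constant variance
```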
Plot of residuals versus predicted values

• This plot has a fan shape.

• It suggests non-constant variance (heteroscedasticity).

• We need to transform the variables.

[Figure: Residual by Predicted plot for response sal77 — residuals (−3000 to 5000) fan out as predicted sal77 grows from 7000 to 17000]
Plots of residuals vs. predictors

[Figure: Plots of the bsal residuals (−1000 to 1500) versus each predictor: senior, age, educ, exper, and fsex]
Collinearity
• When predictors are highly correlated with one another, the standard errors of the estimated coefficients are inflated, making individual coefficients harder to estimate precisely and their t-tests less powerful.
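One common way to detect collinearity is the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors; values above about 10 are often taken as a warning sign. A minimal sketch on synthetic data (the data and the threshold are illustrative):

```python
import numpy as np

def vif(X: np.ndarray) -> list[float]:
    """Variance inflation factor for each column of the predictor matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        fitted = A @ beta
        # R^2 of predictor j regressed on the others
        ssr = np.sum((fitted - X[:, j].mean()) ** 2)
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = ssr / sst
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1 -> severe collinearity
x3 = rng.normal(size=n)                   # independent predictor

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])  # first two are huge, third is near 1
```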
