
Multiple regression

• Regression with more than one predictor is called "multiple regression".

• It uses more than a single predictor (independent variable) to make predictions.

• We wish to build a model that fits the data better than the simple linear regression model:

yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + … + βₚxₚᵢ + εᵢ
Model and Required Conditions

• We allow for k independent variables to potentially be related to the dependent variable:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

where Y is the dependent variable, X₁, …, Xₖ are the independent variables, β₀, β₁, …, βₖ are the coefficients, and ε is the random error variable.
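As a sketch of how such a model is fit in practice, here is a minimal least-squares example on synthetic data (NumPy-based; the predictor names, data, and coefficient values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two made-up predictors and a response generated from known coefficients
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 1, n)  # beta0 = 3, beta1 = 1.5, beta2 = -2

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of (beta0, beta1, beta2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates close to (3.0, 1.5, -2.0)
```

With a reasonable sample size, the estimated coefficients land close to the values used to generate the data.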


Model Assessment
• The model is assessed using three measures:
– The standard error of estimate
– The coefficient of determination
– The F-test of the analysis of variance

• The standard error of estimate is used in the calculations for the other measures.
Standard Error of Estimate
• The standard deviation of the error is estimated by the Standard Error of Estimate:

s_ε = √( SSE / (n − k − 1) )

Coefficient of Determination

• The definition is:

R² = 1 − SSE / Σ(Yᵢ − Ȳ)²
Testing the Validity of the Model

• Consider the question: is there at least one independent variable linearly related to the dependent variable?

• To answer this question, we test the hypotheses:

H₀: β₁ = β₂ = … = βₖ = 0
H₁: At least one βᵢ is not equal to zero.

• If at least one βᵢ is not equal to zero, the model has some validity.
• The hypotheses can be tested by an ANOVA procedure, with test statistic F = MSR/MSE. The output is:

ANOVA
              df             SS       MS                      F      Significance F
Regression    k = 6          3123.8   520.6 = SSR/k           17.14  0.0000
Residual      n − k − 1 = 93 2825.6   30.4 = SSE/(n − k − 1)
Total         n − 1 = 99     5949.5

SSR: Sum of Squares for Regression (= 3123.8)
SSE: Sum of Squares for Error (= 2825.6)
SST: Sum of Squares Total (= 5949.5)
• As in analysis of variance, we have:

Total Variation in Y (SST) = SSR + SSE.

Large F indicates a large SSR; that is, much of the variation in Y is explained by the regression model. Therefore, if F is large, the model is considered valid and the null hypothesis should be rejected.

The test statistic and rejection region:

F = (SSR / k) / (SSE / (n − k − 1)) = MSR / MSE

Reject H₀ if F > F_{α, k, n−k−1}
ANOVA
              df    SS       MS      F      Significance F
Regression    6     3123.8   520.6   17.14  0.0000
Residual      93    2825.6   30.4
Total         99    5949.5

F_{α, k, n−k−1} = F_{0.05, 6, 100−6−1} = 2.17
F = 17.14 > 2.17
Also, the p-value (Significance F) = 0.0000.
Reject the null hypothesis.
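The F statistic in the table can be recomputed from the sums of squares; a quick check in Python (numbers taken from the ANOVA table above):

```python
# ANOVA quantities from the example (k = 6 predictors, n = 100 observations)
SSR, SSE = 3123.8, 2825.6
n, k = 100, 6

MSR = SSR / k            # mean square for regression
MSE = SSE / (n - k - 1)  # mean square for error
F = MSR / MSE            # F ≈ 17.14, matching the table

print(round(MSR, 1), round(MSE, 1), round(F, 2))
```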

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative: at least one of the βᵢ is not equal to zero.

Thus, at least one independent variable is linearly related to Y. This linear regression model is valid.
Testing Individual Coefficients

• The hypotheses for each βᵢ are:

H₀: βᵢ = 0
H₁: βᵢ ≠ 0

Test statistic:

t = (bᵢ − βᵢ) / s_{bᵢ},  d.f. = n − k − 1

• Output:

               Coefficients  Standard Error  t Stat  P-value
Intercept      38.14         6.99            5.45    0.0000
Number         -0.0076       0.0013          -6.07   0.0000
Nearest        1.65          0.63            2.60    0.0108
Office Space   0.020         0.0034          5.80    0.0000
Enrollment     0.21          0.13            1.59    0.1159
Income         0.41          0.14            2.96    0.0039
Distance       -0.23         0.18            -1.26   0.2107

(Enrollment and Distance: p > 0.05, insufficient evidence; the test for the intercept is usually ignored.)
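A small sketch of the decision rule applied to the coefficient table above (names and numbers copied from the output; the intercept row is omitted since its test is usually ignored):

```python
# (name, coefficient, standard error, p-value) from the regression output above
results = [
    ("Number",       -0.0076, 0.0013, 0.0000),
    ("Nearest",       1.65,   0.63,   0.0108),
    ("Office Space",  0.020,  0.0034, 0.0000),
    ("Enrollment",    0.21,   0.13,   0.1159),
    ("Income",        0.41,   0.14,   0.0039),
    ("Distance",     -0.23,   0.18,   0.2107),
]

alpha = 0.05
# Reject H0: beta_i = 0 when the p-value is below alpha
significant = [name for name, coef, se, p in results if p < alpha]
not_significant = [name for name, coef, se, p in results if p >= alpha]

print(significant)      # predictors linearly related to Y, given the others
print(not_significant)  # insufficient evidence: Enrollment, Distance
```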
Example: Sex discrimination in wages
• Do female employees tend to receive lower starting salaries than similarly qualified and experienced male employees?

Variables collected
• 93 employees on data file (61 female, 32 male).

• bsal: Annual salary at time of recruitment.
• sal77: Annual salary in 1977.
• educ: Years of education.
• exper: Months of work experience prior to hire at the bank.
• fsex: 1 if female, 0 if male.
• senior: Months worked at the bank since hire.
• age: Age in months.
Comparison for males and females
• This shows men started at higher salaries than women (t = 6.3, p < .0001).

• But it doesn't control for other characteristics.

[Figure: Oneway Analysis of bsal By fsex — side-by-side plot of bsal (4000–8000) for Female vs. Male]
Relationships of bsal with other variables

• Seniority and education predict bsal well. We want to control for them when judging the gender effect.

[Figure: Bivariate fits of bsal (4000–8000) by senior, age, educ, and exper, each with a linear fit]
Multiple regression model

• For any combination of values of the predictor variables, the average value of the response (bsal) lies on a straight line:

bsalᵢ = β₀ + β₁fsexᵢ + β₂seniorᵢ + β₃ageᵢ + β₄educᵢ + β₅experᵢ + εᵢ


Output from regression
(fsex = 1 for females, = 0 for males)

Term     Estimate  Std Error  t Ratio  Prob>|t|
Int.     6277.9    652        9.62     <.0001
Fsex     -767.9    128.9      -5.95    <.0001
Senior   -22.6     5.3        -4.26    <.0001
Age      0.63      0.72       0.88     0.3837
Educ     92.3      24.8       3.71     0.0004
Exper    0.50      1.05       0.47     0.6364

Predictions
• Example: prediction of beginning wages for a woman with 10 months seniority, who is 25 years old (300 months), with 12 years of education and 2 years (24 months) of experience:

bsalᵢ = β₀ + β₁fsexᵢ + β₂seniorᵢ + β₃ageᵢ + β₄educᵢ + β₅experᵢ + εᵢ

• Pred. bsal = 6277.9 − 767.9(1) − 22.6(10) + 0.63(300) + 92.3(12) + 0.50(24)
             = 6592.6
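The same prediction as a short script, using the estimates from the regression output above:

```python
# Estimated coefficients from the regression output (Term: Estimate)
coef = {
    "intercept": 6277.9,
    "fsex": -767.9,
    "senior": -22.6,
    "age": 0.63,
    "educ": 92.3,
    "exper": 0.50,
}

# A woman with 10 months seniority, age 25 years (300 months),
# 12 years of education, and 2 years (24 months) of experience
x = {"fsex": 1, "senior": 10, "age": 300, "educ": 12, "exper": 24}

pred_bsal = coef["intercept"] + sum(coef[name] * value for name, value in x.items())
print(round(pred_bsal, 1))  # 6592.6
```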
Interpretation of coefficients in multiple regression
• Each estimated coefficient is the amount Y is expected to increase when the value of its corresponding predictor is increased by one unit, holding constant the values of the other predictors.

• Example: the estimated coefficient of education equals 92.3. For each additional year of education, we expect salary to increase by about 92 dollars, holding all other variables constant.

• The estimated coefficient of fsex equals -767.9. For employees who started at the same time, had the same education and experience, and were the same age, women earned $767 less on average than men.
Which variable is the strongest predictor of the outcome?

• The predictor with the strongest linear association with the outcome variable is the one whose coefficient has the largest absolute value of t, which equals the coefficient divided by its SE.

• Example: in the wages regression, seniority is a better predictor than education because it has a larger |t| (4.26 vs. 3.71).
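A quick sketch of this ranking, using the t ratios from the output table above (intercept excluded; note that in this table fsex itself has the largest |t|, while among the control variables seniority outranks education):

```python
# t ratios copied from the wages regression output (Term: t Ratio)
t_ratios = {"Fsex": -5.95, "Senior": -4.26, "Age": 0.88, "Educ": 3.71, "Exper": 0.47}

# Rank predictors by the absolute value of their t ratios
ranking = sorted(t_ratios, key=lambda name: abs(t_ratios[name]), reverse=True)
print(ranking)  # ['Fsex', 'Senior', 'Educ', 'Age', 'Exper']
```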
Hypothesis tests for coefficients

• The reported t-statistics (coefficient / SE) and p-values are used to test whether a particular coefficient equals 0, given that all other coefficients are in the model.

• Examples:

1) The test of whether the coefficient of education equals zero has p-value = 0.0004. Hence, reject the null hypothesis; education is a useful predictor of bsal when all the other predictors are in the model.

2) The test of whether the coefficient of experience equals zero has p-value = 0.6364. Hence, we cannot reject the null hypothesis; experience does not appear to be a particularly useful predictor of bsal when all other predictors are in the model.
Checking assumptions

• Plot the residuals versus the predicted values from the regression.

• Also plot the residuals versus each of the predictors.

• If these plots show non-random patterns, the assumptions might be violated.
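As an illustration of spotting one such pattern numerically, here is a rough sketch on synthetic data: a plot is the usual diagnostic, but the correlation between |residual| and fitted value gives a quick signal of non-constant variance (the data-generating setup is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Synthetic data where the error spread grows with x (heteroscedastic)
x = rng.uniform(1, 10, n)
y = 2.0 * x + rng.normal(0, 1, n) * x  # noise standard deviation proportional to x

# Fit a simple regression and compute residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# A fan-shaped residual plot shows up as |residual| increasing with the fitted value
fan_signal = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(round(fan_signal, 2))  # clearly positive => suspect non-constant variance
```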
Plot of residuals versus predicted values

• This plot has a fan shape.

• It suggests non-constant variance (heteroscedasticity).

• We need to transform the variables.

[Figure: Residual by Predicted plot for response sal77 — residuals (−3000 to 5000) fan out as predicted sal77 grows from 7000 to 17000]
Plots of residuals vs. predictors

[Figure: Plots of the bsal residuals (−1000 to 1500) versus each predictor: senior, age, educ, exper, and fsex]
Collinearity
• When predictors are highly correlated with one another, the standard errors of the estimated coefficients are inflated, making individual coefficients harder to estimate precisely and their t-tests less powerful.
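One common way to detect collinearity is the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors; values above about 10 are often taken as a warning sign. A minimal sketch on synthetic data (the data and the threshold are illustrative):

```python
import numpy as np

def vif(X: np.ndarray) -> list[float]:
    """Variance inflation factor for each column of the predictor matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        fitted = A @ beta
        # R^2 of predictor j regressed on the others
        ssr = np.sum((fitted - X[:, j].mean()) ** 2)
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = ssr / sst
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1 -> severe collinearity
x3 = rng.normal(size=n)                   # independent predictor

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])  # first two are huge, third is near 1
```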
