
AB1202

Statistics and Analysis

Lecture 9
Simple Linear Regression and
Multiple Regression
Chin Chee Kai
cheekai@ntu.edu.sg
Nanyang Business School
Nanyang Technological University
NBS 2016S1 AB1202 CCK-STAT-018
2
Simple Linear Regression
• Least Square Regression Model
• Testing Significance of Slope and Y-Intercept
• Regression Model F Test
• Confidence and Prediction Intervals
• Relationship Between Coefficient of
Determination 𝑅2 and Correlation 𝑟
Multiple Regression
• Model, Assumptions and Standard Error
• 𝑅2 and Adjusted 𝑅2
• Regression Model F Test
• Testing Significance of Explanatory Variables
• Interpreting Regression Reports
Least Square Regression Model

• What we want: y = β₀ + β₁x + ε, ε ~ N(0, σ²_ε)
• What we can see: (xᵢ, yᵢ), i = 1 … n (sample)
• What we can get: ŷ = b₀ + b₁x (model)
• Total squared error SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
▫ SSE is also the total unexplained variation
• Total explained variation SSModel = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²
• Total variation SSTotal = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
• SSTotal = SSModel + SSE, i.e.
  Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² + Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
• s²_ε = SSE / (n − 2) (point estimate of σ²_ε)
• s_ε is called the standard error.
(Figure: scatter plot with the fitted line ŷ = b₀ + b₁x and the true line β₀ + β₁x; at a sample point (xᵢ, yᵢ), the full variation yᵢ − ȳ splits into an explained part ŷᵢ − ȳ and an unexplained part yᵢ − ŷᵢ.)
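The definitions above can be checked numerically. Below is a minimal sketch in Python with NumPy; the five (xᵢ, yᵢ) sample points are made up for illustration:

```python
import numpy as np

# Hypothetical sample (x_i, y_i), i = 1..n
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# Least-squares estimates: b1 = SSXY / SSXX, b0 = ybar - b1*xbar
SSXY = np.sum((x - x.mean()) * (y - y.mean()))
SSXX = np.sum((x - x.mean()) ** 2)
b1 = SSXY / SSXX
b0 = y.mean() - b1 * x.mean()

yhat = b0 + b1 * x
SSE = np.sum((y - yhat) ** 2)              # unexplained variation
SSModel = np.sum((yhat - y.mean()) ** 2)   # explained variation
SSTotal = np.sum((y - y.mean()) ** 2)      # total variation

s_eps = np.sqrt(SSE / (n - 2))             # standard error of the regression
print(b0, b1, s_eps)
```

Running it confirms SSTotal = SSModel + SSE for this data set, up to floating-point rounding.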
Testing Significance of Slope & Y-Intercept

• SSXY = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
• SSXX = Σᵢ₌₁ⁿ (xᵢ − x̄)²
• SSYY = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
• b₁ = SSXY / SSXX, b₀ = ȳ − b₁x̄
• Even if b₁ or b₀ is nearly zero, it might still be very significant, so we must test for significance.
(Figure: fitted line with small estimates b₁ = 0.0095 and b₀ = 0.07, which may nonetheless be significant.)
• S.d. of b₁: s_b₁ = s / √SSXX
• S.d. of b₀: s_b₀ = s · √(1/n + x̄² / SSXX)
• Test statistics: t = b₁ / s_b₁ and t = b₀ / s_b₀
• d.f. v = n − 2 for both
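The two t-tests can be sketched as follows (Python with NumPy and SciPy assumed; the sample points are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

SSXX = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / SSXX
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))  # standard error

s_b1 = s / np.sqrt(SSXX)                          # s.d. of b1
s_b0 = s * np.sqrt(1 / n + x.mean() ** 2 / SSXX)  # s.d. of b0

t1, t0 = b1 / s_b1, b0 / s_b0                     # test statistics, d.f. v = n - 2
p1 = 2 * stats.t.sf(abs(t1), n - 2)               # two-tailed p-value for H0: beta1 = 0
print(t1, t0, p1)
```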
Regression Model F Test

• Is the simple regression model significant?
• We can test for significance of the slope with the t-test as above.
• Or we can use the F-test
▫ This is like ANOVA, except that here we have only one beta to test:
 H₀: β₁ = 0
 H₁: β₁ ≠ 0
• Test statistic:
  F = (SSModel / 1) / (SSE / (n − 2)), df₁ v₁ = 1, df₂ v₂ = n − 2
• Always right-tailed.
• The F-test's view of the regression model's significance extends easily into higher dimensions in multiple regression (but not so for the t-test).
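A sketch of the F-test, again on made-up data; in simple LR the F statistic equals the square of the slope's t statistic, which the code checks:

```python
import numpy as np
from scipy import stats

# Hypothetical sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

SSXX = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / SSXX
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
SSModel = np.sum((yhat - y.mean()) ** 2)
SSE = np.sum((y - yhat) ** 2)

F = (SSModel / 1) / (SSE / (n - 2))   # df1 = 1, df2 = n - 2
p = stats.f.sf(F, 1, n - 2)           # always right-tailed

# Equivalence with the slope t-test in simple LR: F = t^2
t1 = b1 / (np.sqrt(SSE / (n - 2)) / np.sqrt(SSXX))
print(F, t1 ** 2, p)
```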
Confidence and Prediction Intervals

• A confidence interval (in simple LR) refers to a CI for the mean value of y when x takes on a specific value x₀.
▫ Confidence interval = ŷ ± t_α/2 · s · √(1/n + (x₀ − x̄)² / SSXX) tells us with 100(1 − α)% confidence where the mean value β₀ + β₁x₀ will be.
• A prediction interval gives a CI that predicts where the next occurrence of y will be when x happens to be x₀.
▫ Prediction interval = ŷ ± t_α/2 · s · √(1 + 1/n + (x₀ − x̄)² / SSXX) tells us with 100(1 − α)% confidence where the y value will be when x = x₀.
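Both intervals can be sketched in a few lines (made-up sample; x₀ = 3.5 is an arbitrary illustration point). The prediction interval is always wider, because of the extra "1 +" under the square root:

```python
import numpy as np
from scipy import stats

# Hypothetical sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

SSXX = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / SSXX
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 3.5                                # specific x value of interest
yhat0 = b0 + b1 * x0
t = stats.t.ppf(0.975, n - 2)           # t_{alpha/2} for 95% intervals

half_ci = t * s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / SSXX)      # for mean of y at x0
half_pi = t * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / SSXX)  # for next y at x0
print((yhat0 - half_ci, yhat0 + half_ci))
print((yhat0 - half_pi, yhat0 + half_pi))
```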
Relationship Between R² and r

• R² = Explained variation / Total variation = SSModel / SSTotal = (SSTotal − SSE) / SSTotal
• But SSTotal = SSYY. So R² = 1 − SSE / SSYY
• r = Correl(Y, X) = SSXY / √(SSXX · SSYY)
• So SSXY² / (SSXX · SSYY) = r²
• Now the relationship between R² and r is: R² = r²
• This means SSModel / SSYY = SSXY² / (SSXX · SSYY), which means:
• SSModel = SSXY² / SSXX = b₁² · SSXX
• Such a simple relationship does not carry over to higher-dimensional multiple regression.
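The identities R² = r² and SSModel = b₁² · SSXX can be verified numerically (made-up sample again):

```python
import numpy as np

# Hypothetical sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

xc, yc = x - x.mean(), y - y.mean()
SSXX, SSYY, SSXY = np.sum(xc**2), np.sum(yc**2), np.sum(xc * yc)

b1 = SSXY / SSXX
yhat = y.mean() + b1 * xc                 # fitted values
SSE = np.sum((y - yhat) ** 2)
SSModel = np.sum((yhat - y.mean()) ** 2)

R2 = 1 - SSE / SSYY                       # coefficient of determination
r = SSXY / np.sqrt(SSXX * SSYY)           # correlation
print(R2, r ** 2)
```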
Multiple Regression – Model, Assumptions and Standard Error

• Expanding simple LR to a higher-dimensional model:
  y = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_k x_k + ε, ε ~ N(0, σ²_ε)
• We can no longer "see" the scatter plot easily.
• Rely on extended definitions:
• Total squared error SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
• Total explained variation SSModel = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²
• Total variation SSTotal = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
• SSTotal = SSModel + SSE
• s²_ε = SSE / (n − (k + 1)) (point estimate of σ²_ε)
• s_ε is called the standard error.
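A sketch of these extended definitions, using the small Y, X1, X2 data set that appears in the Excel example at the end of this deck; NumPy's lstsq does the least-squares fit:

```python
import numpy as np

# Data from the Excel example (k = 2 explanatory variables, n = 7)
X1 = np.array([5.0, 6, 7, 6, 8, 6, 5])
X2 = np.array([7.0, 6, 4, 3, 2, 3, 4])
y = np.array([10.0, 12, 15, 14, 18, 15, 14])
n, k = len(y), 2

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(A, y, rcond=None)   # b = (b0, b1, b2)

yhat = A @ b
SSE = np.sum((y - yhat) ** 2)
SSModel = np.sum((yhat - y.mean()) ** 2)
SSTotal = np.sum((y - y.mean()) ** 2)

s_eps = np.sqrt(SSE / (n - (k + 1)))        # standard error; d.f. = n - (k+1)
print(b, s_eps)
```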
R² and Adjusted R²

• R² still has the same definition:
• R² = Explained variation / Total variation = SSModel / SSTotal = (SSTotal − SSE) / SSTotal
• But there are problematic cases.
(Figure: a simple LR fitted to only 2 sample points, giving b₁ = 0.82, b₀ = 2. Any 2 collinear sample points give R² = 1, perfect correlation; this does not mean Y is strongly linearly dependent on X in reality. If more of the unobserved (X, Y) points were sampled, Y would appear to have very little linear dependency on X indeed.)
• To avoid low sample-to-variable cases, which tend to inflate R² in a misleading way, we use the Adjusted R²:
• Adjusted R² = (R² − k / (n − 1)) · (n − 1) / (n − (k + 1))
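The form of Adjusted R² above is algebraically identical to the common textbook form 1 − (1 − R²)(n − 1)/(n − (k + 1)). A quick check with the values from the Excel example later in the deck (n = 7, k = 2, R² = 0.9512):

```python
# Adjusted R-squared two ways, using the Excel example's values
n, k, R2 = 7, 2, 0.9512

adj_slide = (R2 - k / (n - 1)) * (n - 1) / (n - (k + 1))   # the slide's form
adj_text = 1 - (1 - R2) * (n - 1) / (n - (k + 1))          # common textbook form

print(adj_slide, adj_text)
```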
Regression Model F Test

• We must still answer the important question: is the multiple regression model significant?
• As promised, we extend the F-test from simple LR:
▫ H₀: β₁ = β₂ = ⋯ = β_k = 0
▫ H₁: at least one βᵢ is not 0
• Test statistic:
  F = (SSModel / k) / (SSE / (n − (k + 1))) = MST / MSE
  where MST = SSModel / k and MSE = SSE / (n − (k + 1))
  df₁ v₁ = k, df₂ v₂ = n − (k + 1)
• Always right-tailed, just like in ANOVA.
• We use computer software or calculators to perform the calculations.
• No wonder software like Excel returns an ANOVA table when asked to perform multiple-regression calculations.
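The model F-test can be sketched on the Excel example's data; the resulting F and p-value match the ANOVA table on the last slide:

```python
import numpy as np
from scipy import stats

# Data from the Excel example (k = 2, n = 7)
X1 = np.array([5.0, 6, 7, 6, 8, 6, 5])
X2 = np.array([7.0, 6, 4, 3, 2, 3, 4])
y = np.array([10.0, 12, 15, 14, 18, 15, 14])
n, k = len(y), 2

A = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ b
SSE = np.sum((y - yhat) ** 2)
SSModel = np.sum((yhat - y.mean()) ** 2)

MST = SSModel / k                 # df1 = k
MSE = SSE / (n - (k + 1))         # df2 = n - (k+1)
F = MST / MSE
p = stats.f.sf(F, k, n - (k + 1))  # right-tailed, as in ANOVA
print(F, p)
```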
Testing Significance of Explanatory Variables

• If the regression model F-test shows significance, we would like to know which explanatory variable(s) are significant, or positive, or negative.
• Significance test:
▫ H₀: βᵢ = 0, H₁: βᵢ ≠ 0
▫ Two-tailed
• Negativity test:
▫ H₀: βᵢ ≥ 0, H₁: βᵢ < 0
▫ Left-tailed
• Positivity test:
▫ H₀: βᵢ ≤ 0, H₁: βᵢ > 0
▫ Right-tailed
• S.d. of bᵢ: s_bᵢ is obtained from computer software.
• Test statistic t = bᵢ / s_bᵢ
• d.f. v = n − (k + 1)
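The per-coefficient tests can be sketched as below, using the Excel example's data. One standard way software obtains s_bᵢ is from the diagonal of s²_ε·(AᵀA)⁻¹, where A is the design matrix; the t statistics and two-tailed p-values then match the coefficient table on the next slide:

```python
import numpy as np
from scipy import stats

# Data from the Excel example (k = 2, n = 7)
X1 = np.array([5.0, 6, 7, 6, 8, 6, 5])
X2 = np.array([7.0, 6, 4, 3, 2, 3, 4])
y = np.array([10.0, 12, 15, 14, 18, 15, 14])
n, k = len(y), 2

A = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(A, y, rcond=None)

resid = y - A @ b
s2 = resid @ resid / (n - (k + 1))                    # s_eps squared
s_b = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))   # s.d. of b0, b1, b2

t = b / s_b                                           # test statistics, d.f. = n - (k+1)
p_two = 2 * stats.t.sf(np.abs(t), n - (k + 1))        # significance test (two-tailed)
p_neg = stats.t.cdf(t, n - (k + 1))                   # negativity test (left-tailed)
p_pos = stats.t.sf(t, n - (k + 1))                    # positivity test (right-tailed)
print(np.round(t, 3), np.round(p_two, 4))
```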
Interpreting Regression Reports - Excel


SUMMARY OUTPUT

Data:
  Y   X1  X2
  10   5   7
  12   6   6
  15   7   4
  14   6   3
  18   8   2
  15   6   3
  14   5   4

Regression Statistics
  Multiple R         0.9753   ← √R²
  R Square           0.9512   ← R²
  Adjusted R Square  0.9268   ← Adj R²
  Standard Error     0.6808   ← s
  Observations       7        ← n

ANOVA
              df            SS      MS      F         Significance F
  Regression  2  (df₁)      36.146  18.073  38.99213  0.00238
  Residual    4  (df₂)      1.854   0.4635
  Total       6  (df₁+df₂)  38

  ← Regression SS = SSModel, Residual SS = SSE, Total SS = SSTotal;
    the MS column gives MST and MSE; F = MST / MSE;
    Significance F is the p-value = P(F_dist > F).

             Coefficients   Standard Error  t Stat  P-value   Lower 95%  Upper 95%
  Intercept  12.934  (b₀)   2.6699 (s_b₀)   4.8445  0.008373   5.52141   20.3472
  X1          0.8504 (b₁)   0.3341 (s_b₁)   2.545   0.063644  -0.07735    1.778075
  X2         -1.0036 (b₂)   0.2015 (s_b₂)  -4.981   0.007591  -1.56308   -0.44422

  ← The 95% limits are bᵢ ± t₀.₀₂₅ · s_bᵢ. P-values are always reported
    as the result of a two-tailed test.
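The Regression Statistics box can be reproduced from the data above in a few lines; the printed values agree with Excel's after rounding:

```python
import numpy as np

# The Y, X1, X2 data from the report above
X1 = np.array([5.0, 6, 7, 6, 8, 6, 5])
X2 = np.array([7.0, 6, 4, 3, 2, 3, 4])
y = np.array([10.0, 12, 15, 14, 18, 15, 14])
n, k = len(y), 2

A = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
SSE = np.sum((y - A @ b) ** 2)
SSTotal = np.sum((y - y.mean()) ** 2)

R2 = 1 - SSE / SSTotal                              # R Square
multiple_R = np.sqrt(R2)                            # Multiple R
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - (k + 1))     # Adjusted R Square
s = np.sqrt(SSE / (n - (k + 1)))                    # Standard Error
print(multiple_R, R2, adj_R2, s)  # 0.9753, 0.9512, 0.9268, 0.6808 (rounded)
```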