
Chapter 13

Simple Linear Regression

David Chow Nov 2014

Learning Objectives

Understand the simple linear regression model

Model assumptions

Meaning of the coefficients $b_0$ and $b_1$

Predict the value of a dependent variable using regression results

Estimate mean values and predict individual values

Overview

This PPT

Linear regression model

Estimate regression line

Measures of variation

Residual analysis

Statistical inference

Testing & interval estimate

Excel file

Eg: House price

Scatter plot

Reg: standard output

Reg: residuals

Data for review question

What We Know…

1. Equation of a straight line

2. Scatter plot

Linear / non-linear relationship

Strong / weak relationship

3. Correlation: relationship between two variables

Correlation doesn’t imply a causal relationship

Eg: No. of firefighters and damage (in \$)

4. Hypothesis testing

Regression Analysis

Dependent variable (y): the variable we wish to predict or explain

Independent variable (x): the variable used to predict or explain y

Regression analysis is used to:

Explain the impact of changes in x on the dependent variable y

Predict the value of y based on the value of x(s)

Simple Linear Regression:

Only one x

Linear relationship between x and y

Examples

Y = , X =

Useful tip: A visual check BEFORE running

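Linear Regression Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where:

$Y_i$ = dependent variable

$\beta_0$ = population Y-intercept

$\beta_1$ = population slope coefficient

$X_i$ = independent variable

$\varepsilon_i$ = random error term

Linear component: $\beta_0 + \beta_1 X_i$

Random error component: $\varepsilon_i$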

Assumptions

1. X and Y are linearly related

2. $\beta_0$ and $\beta_1$ are population parameters

3. Given an X-value (say, $X_i$), $Y = Y_i$ is random, because of $\varepsilon$

4. Assume $E(\varepsilon) = 0$, so that $E(Y) = \beta_0 + \beta_1 X$

Linear Regression Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

[Figure: the population regression line $E(Y) = \beta_0 + \beta_1 X$, with intercept $\beta_0$ and slope $\beta_1$. For a given $X_i$, the random error $\varepsilon_i$ is the vertical distance between the observed value of Y and the predicted value of Y on the line.]

Model Assumptions: LINE

Linearity

The relationship between X and Y is linear

Independence of errors

Error values are statistically independent

Normality of error

Error values are normally distributed for any given value of X

Equal variance (also called homoscedasticity)

The probability distribution of the errors has constant variance

Assumption on error terms:

$\varepsilon \sim N(0, \sigma^2)$

See the appendix for a graphical presentation

Estimate the Regression Line and Predict Y Values

Est. Regression Equation (Prediction Line)

The regression equation provides an estimate of the population regression line

$\hat{Y}_i = b_0 + b_1 X_i$

where:

$\hat{Y}_i$ = estimated (or predicted) Y value for observation i

$b_0$ = estimate of intercept

$b_1$ = estimate of slope

$X_i$ = value of X for observation i

Least Squares Method (The Best Fitted Line)

$b_0$ and $b_1$ are obtained by finding the values that minimize the sum of the squared differences between $Y$ and $\hat{Y}$:

$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum (Y_i - (b_0 + b_1 X_i))^2$

The computations (to find $b_0$ and $b_1$) are done by Excel

Formulae for $b_0$ and $b_1$ are in the appendix
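
As a rough illustration of what Excel computes here, the sketch below applies the standard closed-form least squares solution, $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $b_0 = \bar{Y} - b_1 \bar{X}$, to the ten-house sample used in this chapter's house price example. The function name `least_squares` and the variable names are illustrative assumptions, not part of the slides.

```python
# House price sample from this chapter:
# price in thousands of dollars (Y), size in square feet (X).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]


def least_squares(x, y):
    """Return (b0, b1) that minimize the sum of squared differences
    between y and the fitted line b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares of X around its mean.
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ssxy / ssx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(square_feet, price)
print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")
```

The printed coefficients should agree with the Intercept and Square Feet entries of the Excel regression output for the same data.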


Interpreting $b_0$ and $b_1$

$b_0$ is the estimated mean value of Y when the value of X is zero

$b_1$ is the estimated change in the mean value of Y as a result of a one-unit increase in X

Eg: House Price (in Excel File)

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

A random sample of 10 houses is selected

Dependent var (Y) = house price (in \$1000s)

Independent var (X) = square feet

Eg: House Price Data & Scatter Plot

| House Price in \$1000s (Y) | Square Feet (X) |
| --- | --- |
| 245 | 1400 |
| 312 | 1600 |
| 279 | 1700 |
| 308 | 1875 |
| 199 | 1100 |
| 219 | 1550 |
| 405 | 2350 |
| 324 | 2450 |
| 319 | 1425 |
| 255 | 1700 |

[Figure: scatter plot of house price (\$1000s) against square feet]

Using Excel

1. Choose Data

2. Choose Data Analysis

3. Choose Regression

4. Input data range and output options

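Using Excel: Regression Output

The regression equation is:

$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$

Regression Statistics

| Statistic | Value |
| --- | --- |
| Multiple R | 0.76211 |
| R Square | 0.58082 |
| Adjusted R Square | 0.52842 |
| Standard Error | 41.33032 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
| --- | --- | --- | --- | --- | --- |
| Regression | 1 | 18934.9348 | 18934.9348 | 11.0848 | 0.01039 |
| Residual | 8 | 13665.5652 | 1708.1957 | | |
| Total | 9 | 32600.5000 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 | -35.57720 | 232.07386 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 | 0.03374 | 0.18580 |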

Eg: House Price

House price model: Scatter Plot and Prediction Line

[Figure: scatter plot of house price (\$1000s) against square feet with the prediction line; slope = 0.10977, intercept = 98.248]

$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$

Eg: Interpreting $b_0$

$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$

$b_0$ is the estimated mean value of Y when the value of X is zero (if X = 0 is in the range of observed X values)

Because a house cannot have a square footage of 0, $b_0$ has no practical application

Generally speaking, the intercept $b_0$ is not our focus in regression

Eg: Interpreting $b_1$

$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{square feet})$

$b_1$ estimates the change in the mean value of Y as a result of a one-unit increase in X

Here, $b_1$ = 0.10977 tells us that the mean value of a house increases by 0.10977(\$1000) = \$109.77, on average, for each additional square foot of size

Eg: Making Predictions

Predict the price for a house with 2000 sq feet:

$\widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{sq. ft.}) = 98.24833 + 0.10977(2000) \approx 317.78$

The predicted price for a house with 2000 square feet is 317.78 (\$1,000s) = \$317,780
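
The prediction for a 2000 square foot house can be reproduced with a few lines of Python. This is a minimal sketch that refits the line from the ten-house sample so that the unrounded coefficients are used (plugging in the rounded 98.24833 and 0.10977 gives a value about 0.005 higher); the helper name `predict_price` is an illustrative assumption.

```python
# House price sample from this chapter:
# price in thousands of dollars (Y), size in square feet (X).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n

# Least squares slope and intercept (unrounded).
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price)) / sum(
    (x - x_bar) ** 2 for x in square_feet
)
b0 = y_bar - b1 * x_bar


def predict_price(sq_ft):
    """Predicted house price in thousands of dollars for a given size."""
    return b0 + b1 * sq_ft


predicted = predict_price(2000)
print(f"Predicted price: {predicted:.2f} thousand dollars")
```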

Making Predictions

General rule: Predict only within the relevant range of Xs

[Figure: scatter plot of house price (\$1000s) against square feet, with the relevant range for interpolation marked]

Do not try to extrapolate beyond the range of observed X’s

Measures of Variation and $r^2$

Measures of Variation

Total variation is made up of two parts:

$SST = SSR + SSE$

Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares

$SST = \sum (Y_i - \bar{Y})^2$

$SSR = \sum (\hat{Y}_i - \bar{Y})^2$

$SSE = \sum (Y_i - \hat{Y}_i)^2$

where:

$\bar{Y}$ = mean value of the dependent variable

$Y_i$ = observed value of the dependent variable

$\hat{Y}_i$ = predicted value of Y for the given $X_i$ value

Measures of Variation

SST = total sum of squares (Total Variation)

Measures the variation of the $Y_i$ values around their mean $\bar{Y}$

SSR = regression sum of squares (Explained Variation)

Variation attributable to the relationship between X and Y

SSE = error sum of squares (Unexplained Variation)

Variation in Y attributable to factors other than X
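
The three sums of squares can be checked numerically. The sketch below, written under the assumption that the line is fitted by ordinary least squares to the chapter's ten-house sample, computes SST, SSR and SSE directly from their definitions; all variable names are illustrative.

```python
# House price sample from this chapter:
# price in thousands of dollars (Y), size in square feet (X).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n

# Least squares fit.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price)) / sum(
    (x - x_bar) ** 2 for x in square_feet
)
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in square_feet]

# Total, explained and unexplained variation.
sst = sum((y - y_bar) ** 2 for y in price)
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(price, predicted))

print(f"SST = {sst:.4f}")
print(f"SSR = {ssr:.4f}")
print(f"SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
```

The results should match the SS column of the Excel ANOVA table for this example, and SSR + SSE should equal SST up to floating-point rounding.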

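Measures of Variation

[Figure: for a single observation at $X_i$, the total deviation $Y_i - \bar{Y}$ is split into the part explained by the regression line, $\hat{Y}_i - \bar{Y}$, and the unexplained part, $Y_i - \hat{Y}_i$. Squaring and summing these give $SST = \sum (Y_i - \bar{Y})^2$, $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ and $SSE = \sum (Y_i - \hat{Y}_i)^2$.]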

Coefficient of Determination, $r^2$

The coefficient of determination is the portion of the total variation in Y that is explained by variation in X

The coefficient of determination is also called r-squared and is denoted as $r^2$

$r^2 = \dfrac{SSR}{SST} = \dfrac{\text{regression sum of squares}}{\text{total sum of squares}}$

$0 \le r^2 \le 1$
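
A quick numerical check of this ratio, using the SSR and SST values reported in the Excel ANOVA table for the house price example. The variable names are illustrative, and the comment about Multiple R relies on the fact that in simple linear regression it equals the square root of $r^2$.

```python
import math

# Values taken from the ANOVA table of the house price example.
ssr = 18934.9348  # regression sum of squares
sst = 32600.5000  # total sum of squares

r_squared = ssr / sst

# With a single X, Excel's "Multiple R" is the square root of r squared.
multiple_r = math.sqrt(r_squared)

print(f"r^2 = {r_squared:.5f}")
print(f"Multiple R = {multiple_r:.5f}")
```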

Examples of $r^2$

$0 < r^2 < 1$

Weaker linear relationships between X and Y:

Some but not all of the variation in Y is explained by variation in X

[Figure: two scatter plots of Y against X in which the points follow a line only loosely]

Q: Graph for $r^2 = 1$?

Measures of Variation and $r^2$ in Excel

House Price Again

$r^2 = \dfrac{SSR}{SST} = \dfrac{18934.9348}{32600.5000} = 0.58082$

In the Excel output, SSR and SST are the Regression and Total entries of the ANOVA SS column, and $r^2$ is reported as R Square

58.08% of the variation in house prices is explained by variation in sq feet

Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

$S_{YX} = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$

where

SSE = error sum of squares

n = sample size
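
As a small worked check of this formula, the sketch below plugs in the SSE and sample size from the house price example. The variable names are illustrative.

```python
import math

# Values taken from the Excel output of the house price example.
sse = 13665.5652  # error (residual) sum of squares
n = 10            # sample size

# Standard error of the estimate: sqrt(SSE / (n - 2)).
s_yx = math.sqrt(sse / (n - 2))

print(f"S_YX = {s_yx:.5f}")
```

The result is in the same unit as Y, here thousands of dollars, and should match the Standard Error entry under Regression Statistics.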

Standard Error of Estimate in Excel

House Price Again

$S_{YX} = 41.33032$, reported as Standard Error under Regression Statistics in the Excel output

Comparing Standard Errors

$S_{YX}$ is a measure of the variation of observed Y values from the regression line

[Figure: two scatter plots of Y against X with fitted lines, one with widely scattered points (large $S_{YX}$) and one with points close to the line (small $S_{YX}$)]

$S_{YX}$ carries the same unit as Y

It should be judged relative to the size of the Y values in the sample data

Eg: $S_{YX}$ = \$41.33K is moderately small relative to house prices in the \$200K - \$400K range

Example: Salary Data

Years of employment and salary (\$1000) for 7 subjects:

| Years | Salary |
| --- | --- |
| 6.6 | 32 |
| 7.4 | 42 |
| 8.8 | 52 |
| 9.7 | 61 |
| 10.5 | 62 |
| 10.7 | 66 |
| 11.8 | 65 |

Residual Analysis

(Autocorrelation is NOT covered)

Residual Analysis

The residual for observation i, $e_i$, is the difference between its observed and predicted value

$e_i = Y_i - \hat{Y}_i$

Recall the regression assumptions

L: X and Y linearly related?

I: Errors statistically independent?

N: Errors normally distributed?

E: Errors have constant variance (homoscedasticity)?

A residual plot (residuals vs X) is very useful in checking the assumptions

L: Linearity

[Figure: scatter plots of Y against x and the matching plots of residuals against x, for a relationship that is not linear and for one that is linear]

I: Independence

[Figure: plots of residuals against X, showing residuals that are not independent and residuals that are independent]

N: Normality

N: Are the errors normally distributed?

Many ways to check for normality

Stem-and-Leaf

Histogram

Normal Probability Plot, etc.

My recommendation is always

E: Equal Variance

[Figure: scatter plots of Y against x and the matching plots of residuals against x, for non-constant variance and for constant variance]

A Residual Plot by Excel

House Price Again

RESIDUAL OUTPUT

| Observation | Predicted House Price | Residuals |
| --- | --- | --- |
| 1 | 251.92316 | -6.923162 |
| 2 | 273.87671 | 38.12329 |
| 3 | 284.85348 | -5.853484 |
| 4 | 304.06284 | 3.937162 |
| 5 | 218.99284 | -19.99284 |
| 6 | 268.38832 | -49.38832 |
| 7 | 356.20251 | 48.79749 |
| 8 | 367.17929 | -43.17929 |
| 9 | 254.6674 | 64.33264 |
| 10 | 284.85348 | -29.85348 |

[Figure: House Price Model Residual Plot, residuals against square feet]

• Key: Any pattern in the residual plot?

• NOTE: Autocorrelation is NOT covered in this course
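
The residual output for the house price example can be reproduced as follows. This is a minimal sketch that assumes an ordinary least squares fit to the chapter's ten-house sample; it only computes and prints the numbers, leaving the plot of residuals against square feet to whichever charting tool is at hand. Variable names are illustrative.

```python
# House price sample from this chapter:
# price in thousands of dollars (Y), size in square feet (X).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n

# Least squares fit.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price)) / sum(
    (x - x_bar) ** 2 for x in square_feet
)
b0 = y_bar - b1 * x_bar

# Residual = observed value minus predicted value.
predicted = [b0 + b1 * x for x in square_feet]
residuals = [y - y_hat for y, y_hat in zip(price, predicted)]

for i, (y_hat, e) in enumerate(zip(predicted, residuals), start=1):
    print(f"{i:2d}  predicted = {y_hat:10.5f}  residual = {e:10.5f}")
```

Least squares residuals sum to zero when the model includes an intercept, so a sum far from zero would point to a mistake in the computation.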

Testing for Significance and Interval Estimate

The standard error of the regression slope coefficient ($b_1$) is estimated by

$S_{b_1} = \dfrac{S_{YX}}{\sqrt{SSX}} = \dfrac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$

where:

$S_{b_1}$ = estimate of the standard error of the slope

$S_{YX} = \sqrt{\dfrac{SSE}{n-2}}$ = standard error of the estimate
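
To see the formula at work, the sketch below computes SSX from the square footage of the ten sampled houses and combines it with the standard error of the estimate from the Excel output. Variable names are illustrative.

```python
import math

# Square footage of the ten houses in this chapter's example.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

# Standard error of the estimate, taken from the Excel output.
s_yx = 41.33032

# SSX: sum of squared deviations of X around its mean.
x_bar = sum(square_feet) / len(square_feet)
ssx = sum((x - x_bar) ** 2 for x in square_feet)

# Standard error of the slope.
s_b1 = s_yx / math.sqrt(ssx)

print(f"SSX = {ssx:.1f}")
print(f"S_b1 = {s_b1:.5f}")
```

The result should match the Standard Error entry for Square Feet in the coefficients table.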

t Test for the Slope

Typical question: X and Y linearly related?

$H_0$: $\beta_1 = 0$ (no linear relationship)

$H_1$: $\beta_1 \ne 0$ (linear relationship does exist)

Test statistic

$t_{STAT} = \dfrac{b_1 - \beta_1}{S_{b_1}}$, d.f. = n − 2

where:

$b_1$ = regression slope coefficient

$\beta_1$ = hypothesized slope

$S_{b_1}$ = standard error of the slope

Inferences About the Slope in Excel

House Price Again

Hypotheses: $H_0$: $\beta_1 = 0$; $H_1$: $\beta_1 \ne 0$

| | Coefficients | Standard Error | t Stat | P-value |
| --- | --- | --- | --- | --- |
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 |

In the Square Feet row, the Coefficients entry is $b_1$ and the Standard Error entry is $S_{b_1}$

$t_{STAT} = \dfrac{b_1 - \beta_1}{S_{b_1}} = \dfrac{0.10977 - 0}{0.03297} = 3.32938$

Next, how to test if slope equals a particular value?

Inferences About the Slope in Excel

House Price Again

Test Statistic: $t_{STAT}$ = 3.329

Critical values?

$\alpha = 0.05$

d.f. = 10 − 2 = 8

$\pm t_{\alpha/2} = \pm 2.3060$

[Figure: t distribution with a rejection region of area $\alpha/2$ = .025 in each tail, beyond −2.3060 and 2.3060; $t_{STAT}$ = 3.329 falls in the upper rejection region]

Decision: Reject $H_0$
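
The critical-value approach above can be sketched in a few lines. The slope, its standard error and the critical value 2.3060 are taken from the slides; the variable names are illustrative, and the critical value is hard-coded rather than looked up from a t distribution.

```python
# Values from the Excel output of the house price example.
b1 = 0.10977        # estimated slope
s_b1 = 0.03297      # standard error of the slope
beta1_h0 = 0.0      # hypothesized slope under H0

# Critical value for alpha = 0.05 (two-tail) with 10 - 2 = 8 d.f.,
# as given on the slide.
t_crit = 2.3060

t_stat = (b1 - beta1_h0) / s_b1
reject_h0 = abs(t_stat) > t_crit

print(f"t_STAT = {t_stat:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Changing `beta1_h0` to another number gives the test of whether the slope equals that particular value.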

Inferences About the Slope in Excel

House Price Again

The p-value Approach

p-value = 0.01039 (the P-value entry for Square Feet in the Excel output)

Decision: Reject $H_0$, since p-value < α

F Test for Model Significance

$H_0$: $\beta_1 = 0$; $H_1$: $\beta_1 \ne 0$

F Test statistic:

$F_{STAT} = \dfrac{MSR}{MSE}$

where

$MSR = \dfrac{SSR}{k}$

$MSE = \dfrac{SSE}{n - k - 1}$

$F_{STAT}$ follows an F distribution

Numerator d.f. = k

Denominator d.f. = n − k − 1

k = the number of independent variables in the regression model
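
As a worked instance of these formulas, the sketch below computes MSR, MSE and the F statistic from the sums of squares of the house price example, with k = 1 and n = 10. The critical value 5.32 for α = 0.05 with 1 and 8 degrees of freedom is the one quoted in this chapter; variable names are illustrative.

```python
# Values from the Excel ANOVA table of the house price example.
ssr = 18934.9348  # regression sum of squares
sse = 13665.5652  # error sum of squares
n = 10            # sample size
k = 1             # number of independent variables

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse

# Critical value for alpha = 0.05 with 1 and 8 d.f., as quoted in this chapter.
f_crit = 5.32
reject_h0 = f_stat > f_crit

print(f"MSR = {msr:.4f}, MSE = {mse:.4f}")
print(f"F_STAT = {f_stat:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With a single independent variable the F statistic equals the square of the slope t statistic, which is why both tests report the same p-value.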

F-Test for Model Significance

House Price Again

$F_{STAT} = \dfrac{MSR}{MSE} = \dfrac{18934.9348}{1708.1957} = 11.0848$

With 1 and 8 degrees of freedom

p-value for the F-Test: Significance F = 0.01039 in the Excel ANOVA output

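F Test for Significance

$H_0$: $\beta_1 = 0$

$H_1$: $\beta_1 \ne 0$

$\alpha$ = .05

$df_1$ = 1, $df_2$ = 8

Test Statistic: $F_{STAT} = \dfrac{MSR}{MSE} = 11.08$

Critical Value: $F_{.05}$ = 5.32

[Figure: F distribution with a rejection region of area $\alpha$ = .05 to the right of $F_{.05}$ = 5.32; values below 5.32 fall in the do-not-reject region]

Decision: Reject $H_0$ at $\alpha$ = 0.05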

Interval Estimate for the Slope

House Price Again

$b_1 \pm t_{\alpha/2} S_{b_1}$, d.f. = n − 2

Excel Printout

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 | -35.57720 | 232.07386 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 | 0.03374 | 0.18580 |

At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)

We are 95% confident that the average impact on sales price is between \$33.74 and \$185.80 per sq foot of size

Same conclusion as t-test?
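
The interval in the Excel printout can be rebuilt by hand from $b_1 \pm t_{\alpha/2} S_{b_1}$. The sketch below uses the rounded slope and standard error from the printout and the critical value 2.3060 (α = 0.05, 8 d.f.) quoted earlier in this chapter, so the limits agree with Excel only up to rounding. Variable names are illustrative.

```python
# Values from the Excel output of the house price example.
b1 = 0.10977     # estimated slope (price in thousands of dollars per sq ft)
s_b1 = 0.03297   # standard error of the slope
t_crit = 2.3060  # t value for 95% confidence with 8 d.f., from the slides

margin = t_crit * s_b1
lower = b1 - margin
upper = b1 + margin

print(f"95% CI for the slope: ({lower:.5f}, {upper:.5f})")
# Convert from thousands of dollars to dollars per square foot.
print(f"In dollars per sq ft: ({lower * 1000:.2f}, {upper * 1000:.2f})")

# The interval excludes 0 exactly when the two-tail t test rejects H0.
excludes_zero = lower > 0 or upper < 0
print("Interval excludes 0:", excludes_zero)
```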

t Test for a Correlation Coefficient

Hypotheses

$H_0$: $\rho = 0$ (no correlation between X and Y)

$H_1$: $\rho \ne 0$ (correlation exists)

Test statistic

$t_{STAT} = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$ (with n − 2 degrees of freedom)

where

$r = +\sqrt{r^2}$ if $b_1 > 0$

$r = -\sqrt{r^2}$ if $b_1 < 0$

t-test For A Correlation Coefficient

Is there evidence of a linear relationship between square feet and house price at the .05 level of significance?

$H_0$: $\rho = 0$ (no correlation)

$H_1$: $\rho \ne 0$ (correlation exists)

$\alpha$ = .05, df = 10 − 2 = 8

$t_{STAT} = \dfrac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{.762 - 0}{\sqrt{\dfrac{1 - .762^2}{10 - 2}}} \approx 3.329$

Critical Value =

Decision:
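
The test statistic can be checked numerically. In this sketch $r$ is recovered from the R Square value in the Excel output, with its sign taken from the slope as described in this chapter; variable names are illustrative. The result matches the slope t statistic, as expected in simple linear regression.

```python
import math

# Values from the Excel output of the house price example.
r_squared = 0.58082
b1 = 0.10977   # estimated slope, used only for the sign of r
n = 10         # sample size
rho_h0 = 0.0   # hypothesized population correlation

# r takes the sign of the slope b1.
r = math.copysign(math.sqrt(r_squared), b1)

t_stat = (r - rho_h0) / math.sqrt((1 - r ** 2) / (n - 2))

print(f"r = {r:.3f}")
print(f"t_STAT = {t_stat:.3f}")
```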

Estimating Mean Values and Predicting Individual Values

Goal: Form intervals around Y to express uncertainty about the value of Y for a given $X_i$

[Figure: the prediction line $\hat{Y} = b_0 + b_1 X_i$ with, at a given $X_i$, a confidence interval for the mean of Y and a wider prediction interval for an individual Y]
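
The slide does not give the interval formulas, so the sketch below relies on the standard textbook forms as an assumption: the confidence interval for the mean of Y is $\hat{Y} \pm t_{\alpha/2} S_{YX} \sqrt{h_i}$ and the prediction interval for an individual Y is $\hat{Y} \pm t_{\alpha/2} S_{YX} \sqrt{1 + h_i}$, with $h_i = 1/n + (X_i - \bar{X})^2 / SSX$. It uses the ten-house sample and the critical value 2.3060 (α = 0.05, 8 d.f.) from this chapter; the function name `intervals_at` is illustrative.

```python
import math

# House price sample from this chapter:
# price in thousands of dollars (Y), size in square feet (X).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

T_CRIT = 2.3060  # t value for 95% confidence with 8 d.f., from the slides


def intervals_at(x_new, x, y, t_crit):
    """Return (y_hat, ci_half_width, pi_half_width) at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_new
    # h grows as x_new moves away from the mean of X.
    h = 1 / n + (x_new - x_bar) ** 2 / ssx
    ci_half = t_crit * s_yx * math.sqrt(h)      # mean of Y
    pi_half = t_crit * s_yx * math.sqrt(1 + h)  # individual Y
    return y_hat, ci_half, pi_half


y_hat, ci_half, pi_half = intervals_at(2000, square_feet, price, T_CRIT)
print(f"Predicted price: {y_hat:.2f}")
print(f"95% CI for the mean: {y_hat - ci_half:.2f} to {y_hat + ci_half:.2f}")
print(f"95% PI for one house: {y_hat - pi_half:.2f} to {y_hat + pi_half:.2f}")
```

The prediction interval is always the wider of the two because it also accounts for the variation of an individual house around the mean.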

Suggested Procedures for a Regression Analysis

1. Start with a scatter plot to observe possible relationship

2. Run regression

3. Perform residual analysis

Plot the residuals vs. X to check for assumptions such as linearity & homoscedasticity

Use a histogram to see if the residuals are normally distributed

4. If there is violation of any assumption, use alternative methods or models

5. If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals

6. Avoid making predictions or forecasts outside the relevant range

Review Questions 1-3

1. In a regression analysis, the error term ε is a random variable with an expected value of

2. In a regression analysis, if r 2 = 1, SST =

3. A regression analysis between sales (in \$1000) and advertising (in \$100) resulted in the following equation:

Y-hat = 500 + 4X

a) How to interpret b 1 ?

b) Based on the above equation, if advertising is \$10,000, the point estimate for sales (in dollars) is

Review Question 4

4. Shown below is a portion of an Excel printout for a regression analysis relating Y (quantity demanded) and X (unit price).

ANOVA

| | df | SS |
| --- | --- | --- |
| Regression | 1 | 5048.818 |
| Residual | 46 | 3132.661 |
| Total | 47 | 8181.479 |

| | Coefficients | Standard Error |
| --- | --- | --- |
| Intercept | 80.390 | 3.102 |
| X | -2.137 | 0.248 |

a) Perform a t test to determine if Y and X are related. Let α = 0.05

b) Compute R 2 and interpret its meaning

Review Question 5:

Restaurant Ratings