
The Simple Linear Regression

Model and Correlation


Contents
Types of Regression Models
Determining the Simple Linear Regression
Equation
Measures of Variation in Regression and
Correlation
Assumptions of Regression and Correlation
Estimation of Predicted Values
Correlation - Measuring the Strength of the
Association
Introduction to Regression
Models
Hypothesis Testing
For example, hypothesis 1: X is statistically
significantly related to Y.
The relationship is positive (as X increases, Y
increases) or negative (as X increases, Y decreases).
The magnitude of the relationship is small,
medium, or large.
If the magnitude is small, then a unit change in X is
associated with a small change in Y.


Linear Regression Gauss-Markov Assumptions
1. Normality
Y values are normally distributed for each X
Probability distribution of error is normal
2. Homoscedasticity (constant variance)
3. Independence of errors: $E(e_i e_j) = 0$ for $i \neq j$
4. Linearity
5. Variables are measured without error
(NONSTOCHASTIC)


$Y_i = \alpha + \beta X_i$
Linear Regression Gauss-Markov Assumptions


If these assumptions hold -
The formulas that we use to estimate the coefficients in
a regression yield BLUE (Best Linear Unbiased
Estimators)

Best = Most Efficient = smallest variance
Unbiased = Expected value of estimator=true
population value
BLUE
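The "Unbiased" part of BLUE can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population intercept (2), slope (3), error standard deviation (1) and the ten fixed X values are hypothetical and do not come from the slides. Errors are drawn as independent normal values with constant variance, in line with the assumptions listed above.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope: deviation cross-products over squared x deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)                  # fixed seed so the run is repeatable
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 3.0   # hypothetical population values
xs = list(range(10))                    # fixed (non-stochastic) X values

slopes = []
for _ in range(2000):
    # Independent, normal, constant-variance errors around the true line
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, 1.0) for x in xs]
    slopes.append(ols_slope(xs, ys))

mean_slope = sum(slopes) / len(slopes)
print(round(mean_slope, 3))  # averages out close to the true slope of 3
```

Individual estimates scatter around 3, but their average settles near the true value, which is what "expected value of estimator = true population value" means.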
Normality & Constant
Variance Assumptions
[Figure: identical normal error distributions f(e) drawn at X1 and X2; axes X and Y]
Variation of Errors Around
the Regression Line
[Figure: normal error distributions f(e) centered on the regression line at X1 and X2; axes X and Y]
Y values are normally distributed
around the regression line.
For each X value, the spread or
variance around the regression
line is the same.
When is this
realistic?
Regression Models
Answer the question: what is the relationship between the
variables?
An equation is used:
1 numerical dependent (response) variable: Y,
what is to be predicted
1 or more numerical or categorical
independent (explanatory) variables: X

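Decomposition of Effects
[Figure: scatter of Y against X with the fitted line $\hat{y} = a + bx$ and the mean $\bar{y}$, showing the three deviations for one observation $y_i$]
$y_i - \hat{y}_i$ = error
$\hat{y}_i - \bar{y}$ = regression effect
$y_i - \bar{y}$ = total effect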
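Decomposition of the
sum of squares
total effect = error effect + regression model effect
$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$ per case i
$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ for data set
The squared form holds for the data set as a whole rather than case by case, because the cross-product term vanishes only when summed over all cases of a least-squares fit.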
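The step from the per-case identity to the data-set sum of squares can be sketched as follows. The sketch assumes a least-squares fit $\hat{Y}_i = a + bX_i$ with residuals $e_i = Y_i - \hat{Y}_i$, and uses the two normal equations $\sum e_i = 0$ and $\sum e_i X_i = 0$.

```latex
\begin{aligned}
\sum_{i=1}^{n} (Y_i - \bar{Y})^2
  &= \sum_{i=1}^{n} \bigl[(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})\bigr]^2 \\
  &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2
     + 2 \sum_{i=1}^{n} e_i (\hat{Y}_i - \bar{Y}) \\
\sum_{i=1}^{n} e_i (\hat{Y}_i - \bar{Y})
  &= (a - \bar{Y}) \sum_{i=1}^{n} e_i + b \sum_{i=1}^{n} e_i X_i = 0
\end{aligned}
```

With the cross-product gone, what remains is total SS = error SS + model SS.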
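Decomposition of the
sum of squares
Total SS = model SS + error SS
and if we divide each sum of squares by its df (n cases, k explanatory variables):
total variance = $\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$
error variance = $\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - k - 1}$
model variance = $\frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{k}$
This yields the Variance Decomposition: we
have a total variance, a model variance, and an
error variance. The degrees of freedom add up,
$n - 1 = (n - k - 1) + k$, just as the sums of squares do;
the three variances themselves are not additive.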
Specifying the Model
Derivation of the Intercept
$y_i = a + b x_i + e_i$
$e_i = y_i - a - b x_i$
$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - b \sum_{i=1}^{n} x_i$
Because by definition $\sum_{i=1}^{n} e_i = 0$:
$0 = \sum_{i=1}^{n} y_i - n a - b \sum_{i=1}^{n} x_i$
$n a = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i$
$a = \bar{y} - b \bar{x}$
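Derivation of the Regression
Coefficient
Given: $y_i = a + b x_i + e_i$
$e_i = y_i - a - b x_i$
$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a - b x_i)$
$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
$\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = -2 \sum_{i=1}^{n} x_i y_i + 2 b \sum_{i=1}^{n} x_i x_i$ (x and y measured as deviations from their means, so that a = 0)
$0 = -2 \sum_{i=1}^{n} x_i y_i + 2 b \sum_{i=1}^{n} x_i x_i$
$b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$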
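$b_j = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$
$r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i^2\right)}}$
where
$x_i = X_i - \bar{X}$
$y_i = Y_i - \bar{Y}$
from which it can be seen that the regression coefficient b
is a function of r:
$b_j = r \cdot \frac{sd_y}{sd_x}$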
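The link between the slope and the correlation can be checked numerically. The sketch below uses the seven-store produce data from this document's worked example; the variable names are illustrative only. It computes b from the deviation formula and again as r times the ratio of standard deviations, then recovers the intercept from a = mean(y) - b * mean(x).

```python
import math

# Seven produce stores from this document's example
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000
n = len(square_feet)

x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
x_dev = [x - x_bar for x in square_feet]   # x_i = X_i - mean(X)
y_dev = [y - y_bar for y in annual_sales]  # y_i = Y_i - mean(Y)

sum_xy = sum(x * y for x, y in zip(x_dev, y_dev))
sum_xx = sum(x * x for x in x_dev)
sum_yy = sum(y * y for y in y_dev)

b_direct = sum_xy / sum_xx                 # b = sum(xy) / sum(x^2)
r = sum_xy / math.sqrt(sum_xx * sum_yy)    # correlation coefficient
sd_x = math.sqrt(sum_xx / (n - 1))
sd_y = math.sqrt(sum_yy / (n - 1))
b_from_r = r * sd_y / sd_x                 # b = r * sd_y / sd_x
a = y_bar - b_direct * x_bar               # intercept

print(round(b_direct, 4), round(b_from_r, 4), round(r, 4), round(a, 3))
```

Both routes give the same slope, about 1.4866, with an intercept of about 1636.415, matching the Excel coefficients reported for the produce stores.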
Model Specification
Is Based on Theory
Economic, Psychological & business theory
Mathematical theory
Previous research
Common sense

We ASSUME causality flows
from X to Y
Thinking Challenge:
Which Is More Logical?
[Figure: four candidate plots of Sales against Advertising]
Types of
Regression Models
Regression Models
Simple (1 Explanatory Variable): Linear, Non-Linear
Multiple (2+ Explanatory Variables): Linear, Non-Linear
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Linear Regression Model
Linear Equations
Y = bX + a
a = Y-intercept
b = Slope = (Change in Y) / (Change in X)
[Figure: straight line on X and Y axes with the intercept and slope marked]
The Scatter Diagram
Plot of all $(X_i, Y_i)$ pairs
[Figure: scatter plot]
Simple Linear
Regression Model
Relationship Between Variables Is a Linear Function
The Straight Line that Best Fits the Data
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$Y_i$ = Dependent (Response) Variable
$\beta_0$ = Y intercept (Constant term)
$\beta_1$ = Slope
$X_i$ = Independent (Explanatory) Variable
$\varepsilon_i$ = Random Error
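Population
Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ (Observed Value)
$\mu_{YX} = \beta_0 + \beta_1 X_i$ (E(Y))
[Figure: population regression line on X and Y axes with observed values scattered around it]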
Sample Linear
Regression Model
$\hat{Y}_i = b_0 + b_1 X_i$
$\hat{Y}_i$ = Predicted Value of Y for observation i
$X_i$ = Value of X for observation i
$b_0$ = Sample Y-intercept used as estimate of
the population $\beta_0$
$b_1$ = Sample Slope used as estimate of the
population $\beta_1$
Estimating Parameters:
Least Squares Method
Thinking Challenge
How would you draw a line through the
points? How do you determine which line
fits best?
Least Squares
Best fit means the differences between actual Y
values & predicted Y values are a
minimum
But positive differences offset negative ones, so the
differences are squared before they are summed
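Both points can be seen numerically. The sketch below uses the seven-store produce data from this document's worked example; the nudge sizes (50 for the intercept, 0.05 for the slope) are arbitrary illustrative choices. The raw differences around the fitted line cancel to zero, which is why they cannot serve as the criterion, while the sum of squared differences is smallest at the least-squares line.

```python
# Seven produce stores from this document's example
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000
n = len(square_feet)

x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

# Least-squares slope and intercept
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
      / sum((x - x_bar) ** 2 for x in square_feet))
b0 = y_bar - b1 * x_bar

def sum_squared_errors(intercept, slope):
    """Sum of squared differences between actual and predicted Y."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(square_feet, annual_sales))

# Raw differences: positives and negatives cancel
residual_sum = sum(y - (b0 + b1 * x) for x, y in zip(square_feet, annual_sales))

# Squared differences: any nudge away from the fitted line makes them larger
best_sse = sum_squared_errors(b0, b1)
nudges = [(50, 0), (-50, 0), (0, 0.05), (0, -0.05)]
nudged_sse = [sum_squared_errors(b0 + da, b1 + db) for da, db in nudges]

print(abs(residual_sum) < 1e-6)
print(all(s > best_sse for s in nudged_sse))
```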
What should we
expect?
If Y and X are not related, then
E(Y|X) = E(Y): we should predict the
same Y for every value of X.
[Figure: horizontal line at the mean of Y, Y = constant + (0)X = E(Y)]
What should we
expect?
If Y and X are related, then E(Y|X) ≠ E(Y): we
should predict a different Y for every value of X.
Therefore, the slope will not be zero.
[Figure: sloped line through the point (mean of X, mean of Y), B ≠ 0]
What should we
expect?
At the mean of X, we will predict the
mean of Y. When X deviates from its
mean, we expect Y to also deviate from its
mean
Therefore, we can also think about X
explaining deviation of Y from its mean
value.
Simple Linear Regression
Equation: Example
You wish to examine the
relationship between the
square footage of produce
stores and their annual sales.
Sample data for 7 stores
were obtained. Find the
equation of the straight
line that fits the data best.
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Scatter Diagram
Example
[Figure: Excel scatter plot of Annual Sales ($000) against Square Feet, with the means $\bar{X} = 2350$ and $\bar{Y} = 5130$ marked]
Equation for the Best
Straight Line
$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$
From Excel Printout:
               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
If X = 0, then $\hat{Y}$ = 1636.415. Realistic?
Graph of the Best
Straight Line
[Figure: Annual Sales ($000) against Square Feet with the fitted line passing through the mean point (2350, 5130)]
Interpreting the Results
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of 1.487 means that for each increase of one
unit in X, Y is estimated to increase by 1.487 units.
For each increase of 1 square foot in the size of the
store, the model predicts that expected annual
sales increase by $1,487 (1.487 in units of $000).
Does X explain a significant
portion of the variation in
Y?
Explaining Variation in
Y
If X and Y have no relationship, we
should predict the mean of Y for every X
value.
We would like to measure whether
knowing the value of X helps us explain
why Y differs from its mean value.
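Measures of Variation:
The Sum of Squares
$SST = \sum (Y_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
[Figure: one observation $(X_i, Y_i)$ with its deviations from the fitted line and from the mean $\bar{Y}$; axes X and Y]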
Measures of Variation:
The Sum of Squares
SST = Total Sum of Squares
measures the variation of the $Y_i$ values around their
mean $\bar{Y}$
SSR = Regression Sum of Squares
explained variation attributable to the relationship
between X and Y
SSE = Error Sum of Squares
variation attributable to factors other than the
relationship between X and Y
Measures of Variation:
The Sum of Squares
SST = Total Sum of Squares
This is the identical measure that we used in ANOVA
SSR = Regression Sum of Squares
We called this Sum of Squares Among in ANOVA
SSE = Error Sum of Squares
We called this Sum of Squares Within in ANOVA
Measures of Variation
The Sum of Squares:
Example
Excel Output for Produce Stores
             df   SS             MS          F             Significance F
Regression   1    30380456.12    30380456    81.17909015   0.000281201
Residual     5    1871199.595    374239.9
Total        6    32251655.71
In the SS column, Regression = SSR, Residual = SSE, Total = SST.
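The three sums of squares in the Excel output can be reproduced directly from their definitions. The sketch below refits the line to the seven-store produce data and is meant only as a cross-check; the variable names are illustrative.

```python
# Seven produce stores from this document's example
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000
n = len(square_feet)

x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
      / sum((x - x_bar) ** 2 for x in square_feet))
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in annual_sales)                   # total
sse = sum((y - p) ** 2 for y, p in zip(annual_sales, predicted))    # error
ssr = sum((p - y_bar) ** 2 for p in predicted)                      # regression

print(round(ssr, 2), round(sse, 2), round(sst, 2))
print(abs(sst - (ssr + sse)) < 1e-3)  # SST = SSR + SSE
```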
Interpreting ANOVA
Results
The F-test tests the null hypothesis that
the regression does not explain a
significant proportion of the variation in
Y
The degrees of freedom for the F-test of
a simple regression are 1 and n-2
In this example, F=81.2 with 1 and 5
degrees of freedom.
The Coefficient of
Determination
$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$
Measures the proportion of variation that is
explained by the independent variable X in
the regression model
Coefficients of Determination
($r^2$) and Correlation (r)
[Figure: four scatter plots, each with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$ on X and Y axes: $r^2 = 1$, r = +1; $r^2 = 1$, r = -1; $r^2 = .8$, r = +0.9; $r^2 = 0$, r = 0]
$R^2$ and F connection
$F = \frac{SSR/(2-1)}{SSE/(n-2)} = \frac{r^2 \cdot SST/(2-1)}{(1-r^2) \cdot SST/(n-2)} = \frac{r^2}{1-r^2} \cdot \frac{n-2}{1}$
The F-test can be written in terms of $r^2$.
The F-test is the test that $r^2 = 0$.
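As a quick arithmetic check of this identity, the sketch below takes SSR, SSE and SST as printed in the ANOVA table for the produce stores and computes F both ways. Nothing is refitted; the variable names are illustrative.

```python
# Values copied from the ANOVA table for the produce stores
ssr = 30380456.12
sse = 1871199.595
sst = 32251655.71
n = 7

# F as a ratio of mean squares, with 1 and n - 2 degrees of freedom
f_from_ss = (ssr / (2 - 1)) / (sse / (n - 2))

# F rewritten in terms of r^2
r_squared = ssr / sst
f_from_r2 = r_squared / (1 - r_squared) * (n - 2)

print(round(f_from_ss, 3), round(f_from_r2, 3))
```

Both expressions come out at about 81.18, the F reported by Excel.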
Standard Error of
Estimate
$S_{yx} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$
The standard deviation of the variation of
observations around the regression line
Measures of Variation:
Example
Excel Output for Produce Stores
Regression Statistics
Multiple R           0.9705572
R Square             0.94198129   ($r^2$ = .94)
Adjusted R Square    0.93037754
Standard Error       611.751517   ($S_{yx}$)
Observations         7
94% of the variation in annual sales can be
explained by the variability in the size of the
store as measured by square footage
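The regression statistics follow from the sums of squares alone. The sketch below starts from the SSR, SSE and SST printed in the ANOVA table for the produce stores. The adjusted R square line uses the usual formula 1 - (1 - r²)(n - 1)/(n - k - 1) with k = 1, which the slides do not state and is included here as an assumption about how Excel arrives at its figure.

```python
import math

# Values copied from the ANOVA table for the produce stores
ssr = 30380456.12
sse = 1871199.595
sst = 32251655.71
n, k = 7, 1  # observations, explanatory variables

r_squared = ssr / sst
multiple_r = math.sqrt(r_squared)           # positive here because the slope is positive
s_yx = math.sqrt(sse / (n - 2))             # standard error of estimate
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)  # assumed formula

print(round(multiple_r, 4), round(r_squared, 4),
      round(adjusted_r_squared, 4), round(s_yx, 2))
```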
Inferences about the
Slope: t Test
t Test for a Population Slope
Is There a Linear Relationship Between X & Y?
Null and Alternative Hypotheses
$H_0: \beta_1 = 0$ (No Linear Relationship)
$H_1: \beta_1 \neq 0$ (Linear Relationship)
Test Statistic:
$t = \frac{b_1 - \beta_1}{S_{b_1}}$
Where
$S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
and df = n - 2
Example: Produce Stores
Data for 7 Stores: the square feet and annual sales ($000)
tabulated in the Simple Linear Regression Equation example.
Regression Model Obtained:
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model
is 1.487.
Is there a linear
relationship between the
square footage of a store
and its annual sales?
Inferences about the
Slope: t Test Example
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$\alpha = .05$
df = 7 - 2 = 5
Critical Value(s): t = ±2.5706 (.025 in each rejection tail)
Test Statistic (From Excel Printout):
               t Stat      P-value
Intercept      3.6244333   0.0151488
X Variable 1   9.009944    0.0002812
Decision: Reject $H_0$, because 9.009944 > 2.5706
Conclusion: There is evidence of a
relationship.
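The test statistic in the printout can be rebuilt from the data. The sketch below uses the seven-store produce data and the critical value 2.5706 quoted on the slide; it computes the standard error of estimate, the standard error of the slope, and the t statistic for a hypothesized slope of 0.

```python
import math

# Seven produce stores from this document's example
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000
n = len(square_feet)

x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
sum_xx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sum((x - x_bar) * (y - y_bar)
         for x, y in zip(square_feet, annual_sales)) / sum_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))    # standard error of estimate
s_b1 = s_yx / math.sqrt(sum_xx)    # standard error of the slope

t_stat = (b1 - 0) / s_b1           # hypothesized slope is 0
T_CRITICAL = 2.5706                # two-tailed, alpha = .05, df = 5 (from the slide)
reject_h0 = abs(t_stat) > T_CRITICAL

print(round(s_b1, 4), round(t_stat, 3), reject_h0)
```

The t statistic comes out at about 9.01, well beyond the critical value, so the null hypothesis of no linear relationship is rejected.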
Connection of F and t in simple
regression
The t test for B = 0 is identical to the F test for $r^2 = 0$ in
simple regression. The t statistic is the square root of
the F statistic (t = 1.4866/.1650 = 9.01): $F_{1,n-2} = t_{n-2}^2$
Excel Printout for Produce Stores
ANOVA
             df   SS             F          Significance F
Regression   1    30380456.12    81.17909   0.0002812
Residual     5    1871199.595
Total        6    32251655.71
               Coefficients   Standard Error   P-value     Lower 95%
Intercept      1636.41473     451.4953308      0.0151488   475.810926
X Variable 1   1.48663366     0.164999212      0.0002812   1.06249037
Note: Significance F and the P-value of the slope are identical in simple regression!
Inferences about the Slope:
Confidence Interval Example
Confidence Interval Estimate of the Slope
$b_1 \pm t_{n-2} S_{b_1}$
Excel Printout for Produce Stores
               Lower 95%     Upper 95%
Intercept      475.810926    2797.01853
X Variable 1   1.06249037    1.91077694
At the 95% level of confidence, the confidence interval for the
slope is (1.062, 1.911). It does not include 0.
Conclusion: There is a significant linear relationship
between annual sales and the size of the store.
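The interval for the slope is a one-line calculation once the slope and its standard error are known. The sketch below plugs in the coefficient and standard error from the Excel printout and the critical value 2.5706 from the t test; because that critical value is rounded, the limits agree with Excel's to about five decimal places rather than exactly.

```python
# Values copied from the Excel printout for the produce stores
b1 = 1.48663366        # sample slope
s_b1 = 0.164999212     # standard error of the slope
T_CRITICAL = 2.5706    # t with df = n - 2 = 5 for 95% confidence

margin = T_CRITICAL * s_b1
lower, upper = b1 - margin, b1 + margin
excludes_zero = not (lower <= 0 <= upper)

print(round(lower, 3), round(upper, 3), excludes_zero)
```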
Slope estimates make line pivot
around mean point
Different estimates of B
tilt the line around the
mean point
If B is different this will
give small differences in the
forecast for Y near the
mean, but big differences
away from the mean
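[Figure: Sales against Square Footage showing the regression line Y = 1636 + 1.49X together with the lines for the lower 95% estimate of B (1.06) and the upper 95% estimate of B (1.91), all passing through the mean point]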
Estimation of
Predicted Values
Confidence Interval Estimate for $\mu_{YX}$,
The Mean of Y given a particular $X_i$
$\hat{Y}_i \pm t_{n-2} \cdot S_{yx} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
$t_{n-2}$: t value from table with df = n - 2
$S_{yx}$: Standard error of the estimate
Size of the interval varies according to the
distance of $X_i$ from the mean, $\bar{X}$.
Estimation of
Predicted Values
Confidence Interval Estimate for
Individual Response $Y_i$ at a Particular $X_i$
$\hat{Y}_i \pm t_{n-2} \cdot S_{yx} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
Addition of this 1 increases the width of the
interval relative to that for the mean of Y
Confidence Bands
Error associated with a forecast has two
components:
Error at the mean (standard error of
estimate)
Error in estimating B
Therefore, the confidence intervals
around forecasts will be larger as we
move away from the mean of X
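The widening of the bands can be tabulated for a few store sizes. The sketch below uses the seven-store produce data and the critical value 2.5706 from the slides. The helper `half_widths` is a hypothetical name introduced for this illustration; it returns the half-width of the interval for the mean of Y and of the interval for an individual Y at a chosen X.

```python
import math

# Seven produce stores from this document's example
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000
n = len(square_feet)
T_CRITICAL = 2.5706  # t with df = n - 2 = 5 for 95% confidence

x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
sum_xx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sum((x - x_bar) * (y - y_bar)
         for x, y in zip(square_feet, annual_sales)) / sum_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))

def half_widths(x0):
    """Return (mean-of-Y half-width, individual-Y half-width) at X = x0."""
    h = 1 / n + (x0 - x_bar) ** 2 / sum_xx
    return (T_CRITICAL * s_yx * math.sqrt(h),
            T_CRITICAL * s_yx * math.sqrt(1 + h))

for x0 in (x_bar, 2000, 5000):
    mean_hw, indiv_hw = half_widths(x0)
    print(round(x0), round(mean_hw, 1), round(indiv_hw, 1))
```

Both half-widths are smallest at the mean of X and grow as X moves away from it, and the interval for an individual Y is always the wider of the two.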
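Interval Estimates for
Different Values of X
[Figure: regression line on X and Y axes with two bands around it, the narrower confidence interval for the mean of Y and the wider confidence interval for an individual $Y_i$, shown at a given X and at $\bar{X}$]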
Example: Produce Stores
Data for 7 Stores: the square feet and annual sales ($000)
tabulated in the Simple Linear Regression Equation example.
Regression Model Obtained:
$\hat{Y}_i = 1636.415 + 1.487 X_i$
Predict the annual
sales for a store with
2000 square feet.

Estimation of Predicted
Values: Example
Confidence Interval Estimate for $\mu_{YX}$
Find the 95% confidence interval for the average annual sales
for stores of 2,000 square feet
Predicted Sales $\hat{Y}_i = 1636.415 + 1.487 X_i = 4610.45$ ($000)
$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$
$\hat{Y}_i \pm t_{n-2} \cdot S_{yx} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.45 \pm 612.66$
Confidence interval for mean Y
Estimation of Predicted
Values: Example
Confidence Interval Estimate for Individual Y
Find the 95% confidence interval for annual sales of one
particular store of 2,000 square feet
Predicted Sales $\hat{Y}_i = 1636.415 + 1.487 X_i = 4610.45$ ($000)
$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$
$\hat{Y}_i \pm t_{n-2} \cdot S_{yx} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.45 \pm 1687.70$
Confidence interval for
individual Y
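Both interval formulas can be evaluated directly for a store of 2,000 square feet. The sketch below uses the seven-store produce data and the critical value 2.5706 from the slides. Because it keeps the unrounded coefficients, its point prediction is about 4609.7 rather than the figure obtained by plugging X = 2000 into the rounded equation 1636.415 + 1.487X.

```python
import math

# Seven produce stores from this document's example
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # $000
n = len(square_feet)
T_CRITICAL = 2.5706  # t with df = n - 2 = 5 for 95% confidence
X_NEW = 2000         # store size of interest, in square feet

x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
sum_xx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sum((x - x_bar) * (y - y_bar)
         for x, y in zip(square_feet, annual_sales)) / sum_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))

y_hat = b0 + b1 * X_NEW
h = 1 / n + (X_NEW - x_bar) ** 2 / sum_xx

mean_half_width = T_CRITICAL * s_yx * math.sqrt(h)            # mean of Y
individual_half_width = T_CRITICAL * s_yx * math.sqrt(1 + h)  # one store

print(round(y_hat, 2), round(mean_half_width, 1), round(individual_half_width, 1))
```

Evaluated this way, the half-width is roughly 612.7 ($000) for the average sales of all 2,000-square-foot stores and roughly 1687.7 ($000) for one particular store.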
