
Chapter 6

Linear Regression and Correlation Analysis
1. Introduction to regression analysis
• Regression analysis is used to:
- Describe a relationship between two or more variables in mathematical terms.
- Predict the value of a dependent variable based on the value of at least one independent variable.
- Explain the impact of changes in an independent variable on the dependent variable.
1. Introduction to regression analysis
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
Names for y and the x's in a regression model
Names for y            Names for the x's
Dependent variable     Independent variables
Regressand             Regressors
Effect variable        Causal variables
Explained variable     Explanatory variables

Simple Linear Regression Model
• Only one independent variable, x
• The relationship between x and y is described by a linear function
• Changes in y are assumed to be caused by changes in x
Types of Regression Models
[Figure: four scatter plots showing a positive linear relationship, a negative linear relationship, a non-linear relationship, and no relationship]

Population Linear Regression
The population regression model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where
y = dependent variable
x = independent variable
β0 = population y-intercept
β1 = population slope coefficient
ε = random error term, or residual
β0 + β1x is the linear component and ε is the random error component.
Linear Regression Assumptions
• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear
Population Linear Regression
[Figure: the line y = β0 + β1x + ε with intercept β0 and slope β1; for a given value xi, the random error εi is the vertical distance between the observed value of y and the predicted value of y on the line]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

$$\hat{y}_i = b_0 + b_1 x$$

where
ŷi = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable
The individual random error terms ei have a mean of zero.


Least Squares Criterion
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$
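The Least Squares Equation
• The formulas for b1 and b0 are:

$$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

or, equivalently,

$$b_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_x^2}, \qquad \sigma_x^2 = \overline{x^2} - (\bar{x})^2$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$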
Interpretation of the Slope and the Intercept
• b0 is the estimated average value of y when the value of x is zero
• b1 is the estimated change in the average value of y as a result of a one-unit change in x
Example
• A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
• A random sample of 10 houses is selected
◦ Dependent variable (y): house price in $1000s
◦ Independent variable (x): square feet

Sample Data for House Price Model
House Price in $1000s (y)   Square Feet (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
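As a minimal sketch, the least squares formulas for b1 and b0 can be applied to the ten houses listed above. The function name `fit_simple_regression` and the variable names are illustrative choices, not part of the slides.

```python
# House price sample from the table above.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s


def fit_simple_regression(x, y):
    """Return (b0, b1) from the least squares sum formulas for one predictor."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope: (sum xy - sum x * sum y / n) / (sum x^2 - (sum x)^2 / n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: y_bar - b1 * x_bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = fit_simple_regression(square_feet, price)
print(f"price_hat = {b0:.4f} + {b1:.5f} * square_feet")
```

Read with the interpretation given earlier, b1 is the estimated change in average price (in $1000s) per additional square foot. b0 is the estimated price at zero square feet, which lies far outside the sampled sizes.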
Least Squares Regression Properties
• The sum of the residuals from the least squares regression line is 0:
$$\sum (y - \hat{y}) = 0$$
• The sum of the squared residuals, $\sum (y - \hat{y})^2$, is a minimum (minimized)
• The simple regression line always passes through the mean of the y variable and the mean of the x variable:
$$\bar{y} = b_0 + b_1 \bar{x}$$
• The least squares coefficients are unbiased estimates of β0 and β1
Explained and Unexplained Variation
• Total variation is made up of two parts:

$$SST = SSE + SSR$$

Total sum of squares: $SST = \sum (y - \bar{y})^2$
Sum of squares error: $SSE = \sum (y - \hat{y})^2$
Sum of squares regression: $SSR = \sum (\hat{y} - \bar{y})^2$

where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
Coefficient of Determination R²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
• The coefficient of determination is also called R-squared and is denoted as R²:

$$R^2 = \frac{RSS}{TSS}, \qquad 0 \le R^2 \le 1$$

where
RSS = sum of squares explained by regression (the regression sum of squares, SSR)
TSS = total sum of squares (SST)
Examples of Approximate R² Values
[Figure: scatter plots of y against x illustrating each case]
• R² = 1: perfect linear relationship between x and y. 100% of the variation in y is explained by variation in x.
• 0 < R² < 1: weaker linear relationship between x and y. Some but not all of the variation in y is explained by variation in x.
• R² = 0: no linear relationship between x and y. The value of y does not depend on x. (None of the variation in y is explained by variation in x.)
y       x        ŷ           (ŷ - ȳ)²     (y - ȳ)²
245     1400     251.8283    1202.127     1722.25
312     1600     273.7683    162.0962     650.25
279     1700     284.7383    3.103587     56.25
308     1875     303.9358    304.0071     462.25
199     1100     218.9183    4567.286     7656.25
219     1550     268.2833    331.8482     4556.25
405     2350     356.0433    4836.271     14042.25
324     2450     367.0133    6482.391     1406.25
319     1425     254.5708    1019.474     1056.25
255     1700     284.7383    3.103587     992.25
Total:
2865    17150    2863.838    18911.71     32600.5
Coefficient of determination

$$R^2 = \frac{RSS}{TSS} = \frac{18911.71}{32600.5} \approx 0.58$$
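The decomposition of total variation can be checked numerically on the house price sample. The sketch below refits the line at full precision, so its regression sum of squares comes out slightly different from the tabulated 18911.71; the tabulated fitted values appear to use a slope rounded to 0.1097. Variable names are illustrative.

```python
# House price sample (x = square feet, y = price in $1000s).
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(price)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n

# Least squares fit in deviation form (algebraically the same as the sum formulas).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price))
s_xx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in price)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained by regression
sse = sum((y - yh) ** 2 for y, yh in zip(price, y_hat))  # unexplained

r_squared = ssr / sst
print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}  R^2={r_squared:.4f}")
```

In words, roughly 58% of the variation in house prices in this sample is explained by variation in square feet.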
2. Correlation analysis
• Correlation is a technique used to measure the strength of the relationship between two variables.
• The stronger the correlation, the better the regression line fits the data, and vice versa.
Scatter Plot Examples
[Figure: scatter plots of y against x showing a high degree of correlation, a low degree of correlation, and no relationship]
The correlation coefficient (r)
• The correlation coefficient is used to measure the strength of the linear relationship between two variables
• The product moment correlation coefficient is calculated using the formula:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$

or, equivalently,

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_x \sigma_y}$$
Note
• In the single independent variable case, the coefficient of determination is

$$R^2 = r^2$$

where
r : simple correlation coefficient
Features of r
• Unit free
• Range between -1 and 1
• The closer to -1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
Examples of Approximate r Values
[Figure: five scatter plots of y against x with r = -1, r = -.6, r = 0, r = +.3, and r = +1]
Example calculation

$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sqrt{\overline{x^2} - (\bar{x})^2}\,\sqrt{\overline{y^2} - (\bar{y})^2}}$$

Example
• The data below relates the working experience (years) to the productivity (items/h) of 10 workers in a small firm

Working experience (years)   Productivity (items/h)
1                            2
3                            8
4                            9
5                            15
6                            15
7                            20
9                            23
12                           25
14                           22
15                           36
Example calculation
$$\sum x = 76, \qquad \sum x^2 = 782$$
$$\sum y = 175, \qquad \sum y^2 = 3933$$
$$\sum xy = 1722$$
Estimate b0 and b1
Linear regression equation

Interpretation of b0 and b1?


Coefficient of determination and correlation coefficient
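One way to work through these questions for the working-experience data is sketched below. It recomputes the sums from the ten (experience, productivity) pairs and then applies the sum formulas for b1, b0, and r. Variable names are illustrative.

```python
import math

# Working experience (years) and productivity (items/h) for the 10 workers.
experience = [1, 3, 4, 5, 6, 7, 9, 12, 14, 15]
productivity = [2, 8, 9, 15, 15, 20, 23, 25, 22, 36]

n = len(experience)
sum_x = sum(experience)
sum_y = sum(productivity)
sum_xy = sum(x * y for x, y in zip(experience, productivity))
sum_x2 = sum(x * x for x in experience)
sum_y2 = sum(y * y for y in productivity)

# Slope and intercept of the least squares line.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

# Correlation coefficient and coefficient of determination.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2

print(f"sums: x={sum_x}, y={sum_y}, xy={sum_xy}, x2={sum_x2}, y2={sum_y2}")
print(f"y_hat = {b0:.4f} + {b1:.4f} x,  r = {r:.4f},  r^2 = {r_squared:.4f}")
```

Under this fit, b1 is the estimated change in items per hour for each additional year of experience, and b0 is the estimated productivity of a worker with no experience.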
Steps in Regression
1- For Xi (independent variable) and Yi (dependent variable), calculate:
ΣYi
ΣXi
ΣXiYi
ΣXi²
ΣYi²

2- Calculate the correlation coefficient, r:

$$r = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{\sqrt{n\sum X_i^2 - (\sum X_i)^2}\,\sqrt{n\sum Y_i^2 - (\sum Y_i)^2}}$$

-1 ≤ r ≤ 1
[This can be tested for significance. H0: ρ=0. If the correlation is not significant, there is no evidence that X and Y are linearly related. You really should not be doing this regression!]

Simple Regression 39
Steps in Regression
3- Calculate the coefficient of determination: r² = (r)²
0 ≤ r² ≤ 1
This is the proportion of the variation in the dependent variable (Yi) explained by the independent variable (Xi).

4- Calculate the regression coefficient b1 (the slope):

$$b_1 = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2}$$

Note that you have already calculated the numerator and the denominator as parts of r. Other than a single division operation, no new calculations are required.
BTW, r and b1 are related. If a correlation is negative, the slope term must be negative; a positive slope means a positive correlation.

5- Calculate the regression coefficient b0 (the Y-intercept, or constant):

$$b_0 = \bar{Y} - b_1 \bar{X}$$

The Y-intercept (b0) is the predicted value of Y when X = 0.

Steps in Regression
6- The regression equation (a straight line) is:

$$\hat{Y}_i = b_0 + b_1 X_i$$

7- [OPTIONAL] Then we can test the regression for statistical significance.
There are 3 ways to do this in simple regression:

(a) t-test for correlation:
H0: ρ=0
H1: ρ≠0

$$t_{n-2} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

(b) t-test for slope term:
H0: β1=0
H1: β1≠0
Steps in Regression
(c) F-test (we can do it in MS Excel):

$$F = \frac{MS_{Explained}}{MS_{Unexplained}} = \frac{MS_{Regression}}{MS_{Residual}}$$

where the numerator is the Mean Square (variation) explained by the regression equation, and the denominator is the Mean Square (variation) unexplained by the regression.

• This is part of the output you see when you run a regression in SPSS, MS Excel, and similar tools.
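The t statistic for the correlation and the F statistic can be sketched in a few lines, using the five water/yield pairs from this chapter's tomato example as input. The degrees of freedom used for the mean squares (1 for the regression, n - 2 for the residual) are the usual ones for simple regression and are an assumption here, since the slides do not state them. Comparing the statistics with critical values still needs a t or F table.

```python
import math

# Water (gallons) and tomato yield (bushels) from this chapter's example.
water = [1, 2, 3, 4, 5]
tomato_yield = [2, 5, 8, 10, 15]

n = len(water)
x_bar = sum(water) / n
y_bar = sum(tomato_yield) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(water, tomato_yield))
s_xx = sum((x - x_bar) ** 2 for x in water)
s_yy = sum((y - y_bar) ** 2 for y in tomato_yield)

# (a) t-test for correlation, with n - 2 degrees of freedom.
r = s_xy / math.sqrt(s_xx * s_yy)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# (c) F-test: mean square explained over mean square unexplained.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in water]
ss_regression = sum((f - y_bar) ** 2 for f in fitted)
ss_residual = sum((y - f) ** 2 for y, f in zip(tomato_yield, fitted))
ms_regression = ss_regression / 1       # assumed df: one predictor
ms_residual = ss_residual / (n - 2)     # assumed df: n - 2
f_stat = ms_regression / ms_residual

print(f"t = {t_stat:.3f}, F = {f_stat:.3f}")
```

With a single predictor the two tests agree, since F equals t squared.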

Example: Water and Tomato Yield
• n = 5 pairs of X,Y observations
• Independent variable (X) is amount of water (in gallons) used on crop
• Dependent variable (Y) is yield (bushels of tomatoes).

Yi      Xi      XiYi    Xi²     Yi²
2       1       2       1       4
5       2       10      4       25
8       3       24      9       64
10      4       40      16      100
15      5       75      25      225
Total:
40      15      151     55      418
Example: Water and Tomato Yield
Step 1-
ΣYi = 40
ΣXi = 15
ΣXiYi = 151
ΣXi² = 55
ΣYi² = 418

Step 2-

$$r = \frac{(5)(151) - (15)(40)}{\sqrt{(5)(55) - (15)^2}\,\sqrt{(5)(418) - (40)^2}} = \frac{155}{\sqrt{50}\,\sqrt{490}} = .9903$$
Example: Water and Tomato Yield

Step 3- r² = (.9903)² = 98.06%

Step 4-
$$b_1 = \frac{155}{50} = 3.1$$
The slope is positive. There is a positive relationship between water and crop yield.

Step 5-
$$b_0 = \frac{40}{5} - 3.1\left(\frac{15}{5}\right) = -1.3$$

Step 6- Thus, Ŷi = -1.3 + 3.1Xi
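Steps 1 through 6 can be reproduced directly from the five observations. The sketch below follows the same order as the slides; the variable names are illustrative.

```python
import math

# Step 0: the five (water, yield) observations.
water = [1, 2, 3, 4, 5]            # X, gallons
tomato_yield = [2, 5, 8, 10, 15]   # Y, bushels
n = len(water)

# Step 1: the five sums.
sum_y = sum(tomato_yield)
sum_x = sum(water)
sum_xy = sum(x * y for x, y in zip(water, tomato_yield))
sum_x2 = sum(x * x for x in water)
sum_y2 = sum(y * y for y in tomato_yield)

# Step 2: correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denom_x = n * sum_x2 - sum_x ** 2
denom_y = n * sum_y2 - sum_y ** 2
r = numerator / (math.sqrt(denom_x) * math.sqrt(denom_y))

# Step 3: coefficient of determination.
r_squared = r ** 2

# Step 4: slope reuses the numerator and the X part of the denominator.
b1 = numerator / denom_x

# Step 5: intercept.
b0 = sum_y / n - b1 * sum_x / n

# Step 6: the fitted line.
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
print(f"Y_hat = {b0:.1f} + {b1:.1f} X")
```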

Example: Water and Tomato Yield

$$\hat{Y}_i = -1.3 + 3.1 X_i$$

Ŷi: # bushels of tomatoes
-1.3: Does no water result in a negative yield?
3.1: Every gallon adds 3.1 bushels of tomatoes
Xi: # gallons of water
Example: Water and Tomato Yield
Yi      Xi      Ŷi      ei      ei²
2       1       1.8     .2      .04
5       2       4.9     .1      .01
8       3       8.0     0       0
10      4       11.1    -1.1    1.21
15      5       14.2    .8      .64
Σei = 0         Σei² = 1.90

• Σei² = 1.90. This is a minimum, since regression minimizes Σei² (SSE)
• Now we can answer a question like: How many bushels of tomatoes can we expect if we use 3.5 gallons of water? -1.3 + 3.1 (3.5) = 9.55 bushels.
• Notice the danger of predicting outside the range of X. The more water, the greater the yield? No. Too much water can ruin the crop.
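The residual table and the 3.5-gallon prediction can be reproduced with the sketch below. The helper `predict_yield` is a hypothetical name, and its refusal to predict outside the observed range of X is just one possible way to encode the warning above, not something the slides prescribe.

```python
water = [1, 2, 3, 4, 5]            # X, gallons
tomato_yield = [2, 5, 8, 10, 15]   # Y, bushels
b0, b1 = -1.3, 3.1                 # fitted coefficients from the example

# Fitted values, residuals e_i = Y_i - Y_hat_i, and their squares.
fitted = [b0 + b1 * x for x in water]
residuals = [y - f for y, f in zip(tomato_yield, fitted)]
sum_residuals = sum(residuals)
sse = sum(e ** 2 for e in residuals)


def predict_yield(gallons):
    """Predict yield, but only inside the observed range of X."""
    if not min(water) <= gallons <= max(water):
        raise ValueError("prediction requested outside the observed range of X")
    return b0 + b1 * gallons


prediction = predict_yield(3.5)
print(f"sum of residuals = {sum_residuals:.2f}, SSE = {sse:.2f}")
print(f"predicted yield at 3.5 gallons = {prediction:.2f} bushels")
```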
Sources of Variation in Regression

Total Variation:
$$\sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$$

Explained Variation:
$$\sum (\hat{Y}_i - \bar{Y})^2 = b_0 \sum Y_i + b_1 \sum X_i Y_i - \frac{(\sum Y_i)^2}{n}$$

Unexplained Variation:
$$\sum (Y_i - \hat{Y}_i)^2 = \sum Y_i^2 - b_0 \sum Y_i - b_1 \sum X_i Y_i$$

From our previous problem,
Total variation in Y = 418 - (40)²/5 = 98
Explained variation (explained by X) = -1.3(40) + 3.1(151) - (40)²/5 = 96.10
Unexplained variation = 418 - (-1.3)(40) - 3.1(151) = 1.90

The coefficient of determination, r², is the proportion of the variation in Y explained by X.

$$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{96.10}{98} = .98$$

In other words, 98% of the total variation in crop yield is explained by the linear relationship of yield with amount of water used on the crop.
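The three shortcut formulas need only the sums and the fitted coefficients, so they can be checked in a few lines. This is a sketch with illustrative variable names; the inputs are the sums and coefficients from the water and tomato example.

```python
# Sums and coefficients from the water and tomato example.
n = 5
sum_y = 40
sum_xy = 151
sum_y2 = 418
b0, b1 = -1.3, 3.1

# Shortcut formulas for the three sources of variation.
total_variation = sum_y2 - sum_y ** 2 / n
explained_variation = b0 * sum_y + b1 * sum_xy - sum_y ** 2 / n
unexplained_variation = sum_y2 - b0 * sum_y - b1 * sum_xy

r_squared = explained_variation / total_variation

print(f"total = {total_variation:.2f}")
print(f"explained = {explained_variation:.2f}")
print(f"unexplained = {unexplained_variation:.2f}")
print(f"r^2 = {r_squared:.4f}")
```

The explained and unexplained parts add up to the total, which is a quick sanity check on hand calculations.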

The Multiple Regression Model
Idea: Examine the linear relationship between 1 dependent variable (y) and 2 or more independent variables (xi)
Population model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

where β0 is the Y-intercept, β1, ..., βk are the population slopes, and ε is the random error.

Estimated multiple regression model:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

where ŷ is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, ..., bk are the estimated slope coefficients.
Estimates b0, b1, b2, ..., bk

$$\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2 + \dots + b_k \sum x_k$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 + \dots + b_k \sum x_1 x_k$$
$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2 + \dots + b_k \sum x_2 x_k$$
$$\vdots$$
$$\sum x_k y = b_0 \sum x_k + b_1 \sum x_1 x_k + b_2 \sum x_2 x_k + \dots + b_k \sum x_k^2$$
Interpretation of Estimated Coefficients
• Slope (bi)
◦ Estimates that the average value of y changes by bi units for each 1-unit increase in xi, given that all other variables are unchanged
• Intercept (b0)
◦ The estimated average value of y when all xi = 0
Multiple Regression Model
Two variable model:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$

[Figure: the fitted regression plane over the x1 and x2 axes; for a sample observation yi at (x1i, x2i), the residual e = (y - ŷ) is the vertical distance between yi and the predicted value ŷi on the plane]
The best fit equation, ŷ, is found by minimizing the sum of squared errors, Σe²
Multiple Regression Assumptions
Errors (residuals) from the regression model:

$$e = (y - \hat{y})$$

• The errors are normally distributed
• The mean of the errors is zero
• Errors have a constant variance
• The model errors are independent
Example
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.
Data are collected for 15 weeks:

Week    Pie Sales    Price ($)    Advertising ($100s)
1       350          5.50         3.3
2       460          7.50         3.3
3       350          8.00         3.0
4       430          8.00         4.5
5       350          6.80         3.0
6       380          7.50         4.0
7       430          4.50         3.0
8       470          6.40         3.7
9       450          7.00         3.5
10      490          5.00         4.0
11      340          7.20         3.5
12      300          7.90         3.2
13      440          5.90         4.0
14      450          5.00         3.5
15      300          7.00         2.7
Example
Dependent variable (y): Pie sales
Independent variable 1 (x1): Price ($)
Independent variable 2 (x2): Advertising ($100s)

Estimated (predicted) regression equation:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$

Estimates b0, b1, b2

$$\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2$$
$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2$$
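The three normal equations form a 3-by-3 linear system in b0, b1, and b2. The sketch below builds that system from the 15 weeks of pie data and solves it with a small Gauss-Jordan routine. The helper names `solve_linear_system` and `sum_of_products` are illustrative; in practice a statistics package or linear algebra library would normally do this step.

```python
# Pie sales data: 15 weekly observations.
sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00,
         7.20, 7.90, 5.90, 5.00, 7.00]
advertising = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0,
               3.5, 3.2, 4.0, 3.5, 2.7]


def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(size):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][size] / m[i][i] for i in range(size)]


def sum_of_products(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


n = len(sales)
ones = [1.0] * n                       # column for the intercept
columns = [ones, price, advertising]

# Row i of the normal equations: sum(c_i * y) = sum_j b_j * sum(c_i * c_j).
lhs = [[sum_of_products(ci, cj) for cj in columns] for ci in columns]
rhs = [sum_of_products(ci, sales) for ci in columns]

b0, b1, b2 = solve_linear_system(lhs, rhs)
print(f"sales_hat = {b0:.3f} + ({b1:.3f}) * price + {b2:.3f} * advertising")
```

Under this fit the price coefficient is negative and the advertising coefficient is positive, each read as the estimated change in average weekly sales when the other variable is unchanged.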
Multiple Coefficient of Determination
• Reports the proportion of total variation in y explained by all x variables taken together

$$R^2 = \frac{RSS}{TSS} = \frac{\text{Regression sum of squares}}{\text{Total sum of squares}}$$
Multiple correlation (R)
• Multiple correlation provides a measure of the overall strength of the relationship between the dependent variable and the independent variables.
• It is defined as the positive square root of the coefficient of determination:

$$R = \sqrt{R^2}$$
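For the pie sales data, R² and R can be computed once the two-variable model is fitted. The sketch below fits the model in deviation form, solving the 2-by-2 system for b1 and b2 by Cramer's rule. That is a shortcut chosen here for brevity and is not shown in the slides. It then applies the definitions of RSS and TSS.

```python
import math

sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00,
         7.20, 7.90, 5.90, 5.00, 7.00]
advertising = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0,
               3.5, 3.2, 4.0, 3.5, 2.7]


def mean(values):
    return sum(values) / len(values)


def centered_products(u, v):
    """Sum of (u - u_bar)(v - v_bar)."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


# Deviation-form normal equations for two predictors.
s11 = centered_products(price, price)
s22 = centered_products(advertising, advertising)
s12 = centered_products(price, advertising)
s1y = centered_products(price, sales)
s2y = centered_products(advertising, sales)

det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(sales) - b1 * mean(price) - b2 * mean(advertising)

y_bar = mean(sales)
fitted = [b0 + b1 * p + b2 * a for p, a in zip(price, advertising)]
rss = sum((f - y_bar) ** 2 for f in fitted)   # regression sum of squares
tss = sum((y - y_bar) ** 2 for y in sales)    # total sum of squares

r_squared = rss / tss
multiple_r = math.sqrt(r_squared)
print(f"R^2 = {r_squared:.4f}, R = {multiple_r:.4f}")
```

Here a little over half of the week-to-week variation in pie sales is explained by price and advertising taken together.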
Residual analysis
• Define residuals
• Create residual plots
• Interpret plots

What is a residual
Residual = observed value - predicted value

Notation:
e = residual
Y = observed value
Y' = predicted value

Properties:
Σe = 0
ē = 0 (the mean of the residuals is zero)
Typical patterns for residual plots

How to use residual plots
Is linear regression appropriate?
• Random pattern: use linear regression
• Non-random pattern: consider another technique


TRANSFORMATIONS
TRANSFORMATIONS BASIC
• A transformation applies a math operation to a variable
• Example

Original variable    Math operation    Transformed variable
X                    Addition          Xt = X + 5
Y                    Multiplication    Yt = 2 * Y
A                    Square root       At = √A
B                    Reciprocal        Bt = 1/B
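Do you know?
Linear transformations (no effect on the strength of the correlation; a negative c only flips the sign of r):
Xt = c * X
Xt = X / c
Xt = X + c
Non-linear transformations (change the correlation):
Anything else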
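A quick numerical check of this distinction is sketched below. The X and Y values are the nine pairs used in the transformation example later in this chapter. The `correlation` helper is an illustrative implementation of the product moment formula.

```python
import math


def correlation(x, y):
    """Product moment correlation coefficient."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_values = [2, 1, 6, 14, 15, 30, 40, 74, 75]

r_raw = correlation(x_values, y_values)

# Linear transformations of X: scaling and shifting leave r unchanged.
r_linear = correlation([2 * v + 5 for v in x_values], y_values)

# A negative multiplier keeps the strength but flips the sign.
r_negated = correlation([-v for v in x_values], y_values)

# A non-linear transformation (square root of Y) changes r.
r_sqrt = correlation(x_values, [math.sqrt(v) for v in y_values])

print(f"raw: {r_raw:.4f}, linear: {r_linear:.4f}, "
      f"negated: {r_negated:.4f}, sqrt(y): {r_sqrt:.4f}")
```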
Application to Regression
• Nonlinear transformations change linear correlation
◦ Increase correlation: better regression
◦ Reduce correlation: worse regression
Methods of Transforming Variables to Achieve Linearity

METHOD                       TRANSFORMATION(S)                                            REGRESSION EQUATION       PREDICTED VALUE (ŷ)
Standard linear regression   None                                                         y = b0 + b1x              ŷ = b0 + b1x
Exponential model            Dependent variable = log(y)                                  log(y) = b0 + b1x         ŷ = 10^(b0 + b1x)
Quadratic model              Dependent variable = sqrt(y)                                 sqrt(y) = b0 + b1x        ŷ = (b0 + b1x)^2
Reciprocal model             Dependent variable = 1/y                                     1/y = b0 + b1x            ŷ = 1 / (b0 + b1x)
Logarithmic model            Independent variable = log(x)                                y = b0 + b1log(x)         ŷ = b0 + b1log(x)
Power model                  Dependent variable = log(y); independent variable = log(x)   log(y) = b0 + b1log(x)    ŷ = 10^(b0 + b1log(x))
NOTE
• These methods need to be tested on the data to which they are applied to be sure that they increase rather than decrease the linearity of the relationship. Testing the effect of a transformation method involves looking at residual plots and correlation coefficients.
How to Perform a Transformation to Achieve Linearity
• Conduct a standard regression analysis on the raw data.
• Construct a residual plot.
◦ If the plot pattern is random, do not transform data.
◦ If the plot pattern is not random, continue.
• Compute the coefficient of determination (R²).
• Choose a transformation method (see above table).
• Transform the independent variable, dependent variable, or both.
• Conduct a regression analysis, using the transformed variables.
• Compute the coefficient of determination (R²), based on the transformed variables.
◦ If the transformed R² is greater than the raw-score R², the transformation was successful. Congratulations!
◦ If not, try a different transformation method.
The best transformation method (exponential model, quadratic model, reciprocal model, etc.) will depend on the nature of the original data. The only way to determine which method is best is to try each and compare the results (i.e., residual plots, correlation coefficients).
EXAMPLE
X    1    2    3    4     5     6     7     8     9
Y    2    1    6    14    15    30    40    74    75

When we apply a linear regression to the untransformed raw data, the residual plot shows a non-random pattern (a U-shaped curve), which suggests that the data are nonlinear.

• Suppose we repeat the analysis, using a quadratic model to transform the dependent variable. For a quadratic model, we use the square root of y, rather than y, as the dependent variable. Using the transformed data, our regression equation is:
y't = b0 + b1x
where
• yt = transformed dependent variable, which is equal to the square root of y
• y't = predicted value of the transformed dependent variable yt
• x = independent variable
• b0 = y-intercept of transformation regression line
• b1 = slope of transformation regression line
• The table below shows the transformed data we analyzed.

x     1       2    3       4       5       6       7       8      9
yt    1.41    1    2.45    3.74    3.87    5.48    6.32    8.6    8.66

The residual plot shows residuals based on predicted raw scores from the transformation regression equation. The plot suggests that the transformation to achieve linearity was successful.
Since the transformation was based on the quadratic model (yt = the square root of y), the transformation regression equation can be expressed in terms of the original units of variable Y as:
y' = (b0 + b1x)²
where
• y' = predicted value of y in its original units
• x = independent variable
• b0 = y-intercept of transformation regression line
• b1 = slope of transformation regression line
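The whole procedure for this example can be sketched as follows: fit the raw data, fit the square-root-transformed data, compare the two coefficients of determination, and convert a prediction back to the original units. Note one assumption: the transformed R² below is computed on the transformed scale (sqrt(y) against x), which is a simplification. The helper names are illustrative.

```python
import math

x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_values = [2, 1, 6, 14, 15, 30, 40, 74, 75]


def fit_line(x, y):
    """Return (b0, b1, r_squared) for a simple least squares fit of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1, s_xy ** 2 / (s_xx * s_yy)


# Standard regression on the raw data.
b0_raw, b1_raw, r2_raw = fit_line(x_values, y_values)

# Quadratic model: regress sqrt(y) on x.
y_transformed = [math.sqrt(v) for v in y_values]
b0_t, b1_t, r2_t = fit_line(x_values, y_transformed)


def predict_original_units(x_new):
    """Back-transform: y' = (b0 + b1 * x)^2."""
    return (b0_t + b1_t * x_new) ** 2


print(f"raw R^2 = {r2_raw:.3f}, transformed R^2 = {r2_t:.3f}")
print(f"predicted y at x = 5: {predict_original_units(5):.2f}")
```

A higher R² on the transformed data is consistent with the conclusion that the square root transformation improved linearity here, though the residual plot remains the primary check.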