
Linear Regression and Correlation

Introduction
So far we have confined our discussion to distributions involving only one variable.
Sometimes, in practical applications, we come across a set of data where each item comprises the values of two or more variables.
In this chapter
We develop numerical measures to express the relationship between two variables.
Is the relationship strong or weak? Is it direct or inverse?
In addition, we develop an equation to express the relationship between the variables. This will allow us to estimate one variable on the basis of the other.

Examples
Is there a relationship between the amount
Healthtex spends per month on advertising and
the sales in the month?
Can we base an estimate of the cost to heat a
home in January on the number of square feet in
the home?
Is there a relationship between the number of
hours that students studied for an exam and the
score earned?
What is Correlation Analysis?
Correlation analysis is the study of the relationship between variables.
For example, suppose the sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month.
The manager selects a random sample of 10
representatives and determines the number of
sales calls each representative made last month
and the number of copiers sold.
Sales calls and copiers sold for 10 salespeople

Sales Representative   Number of Sales Calls   Number of Copiers Sold
Tom Keller             20                      30
Jeff Hall              40                      60
Brian Virost           20                      40
Greg Fish              30                      60
Susan Welch            10                      30
Carlos Ramirez         10                      40
Rich Niles             20                      40
Mike Kiel              20                      50
Mark Reynolds          20                      30
Soni Jones             30                      70
Correlation Analysis
By reviewing the data, it seems that there is some relationship between the number of sales calls and the number of units sold.
Here we develop some techniques to portray more precisely the relationship between the two variables, sales calls and copiers sold. This group of statistical techniques is called correlation analysis.
Example:
Copier Sales of America sells copiers to
businesses of all sizes throughout the United
States and Canada. Ms. Marcy Bancer was
recently promoted to the position of national sales
manager.
At the upcoming sales meeting, the sales
representatives from all over the country will be in
attendance.
She would like to impress upon them the
importance of making that extra sales call each
day.
She decides to gather some information on the
relationship between the number of sales calls
and the number of copiers sold.
She selected a random sample of 10 sales
representatives and determined the number of
sales calls they made last month and the number
of copiers they sold.
What observations can you make about the
relationship between the number of sales calls
and the number of copiers sold?
Develop the scatter diagram to display the
information.
Solution
We refer to the number of sales calls as the independent variable and the number of copiers sold as the dependent variable.
DEPENDENT VARIABLE: The variable that is being
predicted or estimated.
INDEPENDENT VARIABLE: A variable that provides
the basis for estimation. It is the predictor variable.
Scatter Diagram
[Scatter diagram: Units Sold (0 to 80) plotted against Sales Calls (0 to 45) for the 10 sales representatives.]
The scatter diagram shows graphically that the
sales representatives who make more calls tend
to sell more copiers.
It is reasonable for Ms. Bancer to tell her salespeople that the more sales calls they make, the more copiers they can expect to sell.
The Coefficient of Correlation
Originated by Karl Pearson, the coefficient of correlation, r, describes the strength of the relationship between two sets of interval-scaled or ratio-scaled variables.
It takes values from −1 to +1.
If two sets of data have r = +1, they are said to be perfectly positively correlated.
If r = −1, they are said to be perfectly negatively correlated;
and if r = 0, they are uncorrelated.
Perfect Positive Relationship, r = +1
[Scatter diagram: all points lie exactly on a straight line sloping upward.]
Perfectly Negative Correlation, r = -1
No Relationship, r = 0
[Scatter diagram: points scattered with no apparent pattern.]
The following drawing summarizes the strength and direction of the
coefficient of correlation.
−1.0           Perfect negative correlation
−1.0 to −.50   Strong negative correlation
near −.50      Moderate negative correlation
−.50 to 0      Weak negative correlation
0              No correlation
0 to .50       Weak positive correlation
near .50       Moderate positive correlation
.50 to 1.0     Strong positive correlation
1.0            Perfect positive correlation

Values of r below 0 indicate negative correlation; values above 0 indicate positive correlation.
Correlation Coefficient

r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_X\, s_Y}
Sales Representative   Calls (X)   Sales (Y)   X − X̄   Y − Ȳ   (X − X̄)(Y − Ȳ)
Tom Keller             20          30          −2       −15       30
Jeff Hall              40          60          18        15      270
Brian Virost           20          40          −2        −5       10
Greg Fish              30          60           8        15      120
Susan Welch            10          30         −12       −15      180
Carlos Ramirez         10          40         −12        −5       60
Rich Niles             20          40          −2        −5       10
Mike Kiel              20          50          −2         5      −10
Mark Reynolds          20          30          −2       −15       30
Soni Jones             30          70           8        25      200
Total                                                            900
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_X\, s_Y} = \frac{900}{(10 - 1)(9.189)(14.337)} = 0.759
Interpretation of result
The value of r is positive, so we see that there is a direct relationship between the number of sales calls and the number of copiers sold.
The value of 0.759 is fairly close to 1.00, so we
conclude that the relationship is strong.
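This calculation can be verified with a few lines of code. Below is a minimal Python sketch (variable names are illustrative, not from the text) that reproduces r for the copier data using the same formula.

```python
# Minimal sketch: Pearson's r for the copier data,
# r = sum((X - Xbar)(Y - Ybar)) / ((n - 1) * sX * sY)
from statistics import mean, stdev

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X: sales calls
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y: copiers sold

n = len(calls)
x_bar, y_bar = mean(calls), mean(sold)
s_x, s_y = stdev(calls), stdev(sold)               # sample standard deviations

sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold))
r = sum_products / ((n - 1) * s_x * s_y)
print(round(r, 3))                                 # 0.759
```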

The Coefficient of Determination
In the previous example the coefficient of correlation, 0.759, was interpreted as being strong.
Terms such as weak, moderate, and strong, however, do not have precise meaning.
A measure that has a more easily interpreted meaning is the coefficient of determination.
It is computed by squaring the coefficient of correlation.
For the above example, r² = 0.576, found by (0.759)².
This is a proportion or percent; we can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.

COEFFICIENT OF DETERMINATION: The proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X.
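Since the coefficient of determination is simply the square of r, a one-line check (illustrative) is:

```python
# Minimal sketch: coefficient of determination from r
r = 0.759
print(round(r ** 2, 3))   # 0.576, i.e. about 57.6 percent of the variation explained
```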
Testing the Significance of the
Correlation Coefficient
The sales manager of Copier Sales of America found a strong correlation between the number of sales calls and the number of copiers sold.
However, only 10 salespeople were sampled.
Could it be that the correlation in the population is
actually 0?
This would mean the correlation of 0.759 was due
to chance.
The population in this example is all the
salespeople employed by the firm.
Resolving this dilemma requires a test to answer
the obvious question: Could there be zero
correlation in the population from which the
sample was selected?
To put it another way, did the computed r come
from a population of paired observations with
zero correlation?
We let ρ (the Greek letter rho) represent the correlation in the population.

State the null and alternate hypotheses:

H_0: ρ = 0   (The correlation in the population is zero.)
H_1: ρ ≠ 0   (The correlation in the population is different from zero.)
t test for the coefficient of correlation

t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}

with n − 2 degrees of freedom
Using the .05 level of significance, the decision rule states that if the computed t falls in the area between plus 2.306 and minus 2.306, the null hypothesis is not rejected.
Applying the formula, we get:

t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.759\sqrt{10 - 2}}{\sqrt{1 - 0.759^2}} = 3.297

The computed t is in the rejection region. Thus the null hypothesis is rejected.
This means the correlation in the population is not zero.
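As a rough check of this test, here is a minimal Python sketch (names assumed for illustration) that computes the t statistic from r and n and compares it with the critical value read from a t table.

```python
# Minimal sketch: t test for the significance of r
import math

r, n = 0.759, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 3))   # 3.297, outside +/-2.306 (t table, df = 8, alpha = .05),
                     # so H0: rho = 0 is rejected
```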
Regression Analysis
Introduction
In the previous sections we developed measures to express the strength and the direction of the relationship between two variables.
In this section we wish to develop an equation to express the linear (straight-line) relationship between two variables.
In addition, we want to be able to estimate the value of the dependent variable Y based on a selected value of the independent variable X.
The technique used to develop the equation and provide the estimates is called regression analysis.
Regression Equation
We want to develop a linear equation that expresses the relationship between the number of sales calls and the number of units sold.
REGRESSION EQUATION: An equation that expresses the linear relationship between two variables.
The scatter diagram is reproduced below with a line drawn through the dots to show where a straight line would probably fit the data.
[Scatter diagram of Units Sold versus Sales Calls with a straight line drawn through the points.]
Least Squares Principle
Determining a regression equation by minimizing
the sum of the squares of the vertical distances
between the actual Y values and the predicted
values of Y.
General Form of the Linear Regression Equation

Y' = a + bX

where:
Y' is the predicted value of the Y variable for a selected X value.
a is the Y-intercept. It is the estimated value of Y when X = 0.
b is the slope of the line, or the average change in Y' for each change of one unit (either increase or decrease) in the independent variable X.
X is any value of the independent variable that is selected.
The formulas for a and b are:

SLOPE OF THE REGRESSION LINE

b = r\,\frac{s_Y}{s_X}

where:
r is the correlation coefficient.
s_Y is the standard deviation of Y (the dependent variable).
s_X is the standard deviation of X (the independent variable).
Y-INTERCEPT

a = \bar{Y} - b\bar{X}

where:
Ȳ is the mean of Y (the dependent variable).
X̄ is the mean of X (the independent variable).
Example:
Recall the example involving Copier Sales of
America. The sales manager gathered
information on the number of sales calls and the
number of copiers sold for a random sample of 10
sales representatives.
As a part of her presentation at the upcoming
sales meeting, Ms. Bancer, the sales manager,
would like to offer specific information about the
relationship between the number of sales calls
and the number of copiers sold.
Use the least squares method to determine a
linear equation to express the relationship
between the two variables.
What is the expected number of copiers sold by a
representative who made 20 calls?
The calculations necessary to determine the regression equation are:

b = r\left(\frac{s_Y}{s_X}\right) = 0.759\left(\frac{14.337}{9.189}\right) = 1.1842

a = \bar{Y} - b\bar{X} = 45 - (1.1842)(22) = 18.9476

Thus the regression equation is:

Y' = a + bX = 18.9476 + 1.1842X
So if a salesperson makes 20 calls, he or she can expect to sell 42.6316 copiers, found by:

Y' = 18.9476 + 1.1842(20) = 42.6316

The b value of 1.1842 means that for each additional sales call made, the sales representative can expect to increase the number of copiers sold by about 1.2. To put it another way, five additional sales calls in a month will result in about six more copiers being sold, found by 1.1842(5) = 5.921.
The a value of 18.9476 is the point where the equation crosses the Y axis. A literal translation is that if no sales calls are made, that is, X = 0, 18.9476 copiers will be sold.
Note that X = 0 is outside the range of values included in the sample and therefore should not be used to estimate the number of copiers sold.
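A minimal Python sketch of these calculations (variable names are illustrative; it reuses the r value found in the correlation section) is shown below.

```python
# Minimal sketch: least-squares slope and intercept from r and the standard
# deviations, then a prediction for a representative who makes 20 calls
from statistics import mean, stdev

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

r = 0.759                              # from the correlation section
b = r * stdev(sold) / stdev(calls)     # slope: b = r * sY / sX
a = mean(sold) - b * mean(calls)       # intercept: a = Ybar - b * Xbar
y_hat = a + b * 20                     # predicted copiers sold for 20 calls
print(round(b, 4), round(a, 2), round(y_hat, 2))   # ~1.1842  ~18.95  ~42.63
```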
Drawing the Line of Regression
Sales Representative   Sales Calls (X)   Estimated Sales (Y')
Tom Keller             20                42.6316
Jeff Hall              40                66.3156
Brian Virost           20                42.6316
Greg Fish              30                54.4736
Susan Welch            10                30.7896
Carlos Ramirez         10                30.7896
Rich Niles             20                42.6316
Mike Kiel              20                42.6316
Mark Reynolds          20                42.6316
Soni Jones             30                54.4736
[Scatter diagram with the regression line drawn through the points (X = 20, Y' = 42.6316) and (X = 40, Y' = 66.3156).]
Features of the line of best fit
There is no other line through the data for which
the sum of the squared deviations is smaller.
In addition, this line passes through the point represented by the mean of the X values and the mean of the Y values, that is, (X̄, Ȳ).
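This property can be checked numerically (a small illustrative sketch using the coefficients found earlier):

```python
# Minimal sketch: the regression line passes through (Xbar, Ybar)
a, b = 18.9476, 1.1842      # intercept and slope from the example
x_bar, y_bar = 22, 45       # means of sales calls and copiers sold
print(round(a + b * x_bar, 2), y_bar)   # 45.0 45, equal up to rounding
```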
The Standard Error of Estimate
In the preceding scatter diagram, not all the points lie exactly on the regression line.
If all the points lay on the line, there would be no error in estimating the number of units sold.
To put it another way, if all the points were on the regression line, units sold could be predicted with 100 percent accuracy.
Thus there would be no error in predicting the Y variable based on an X variable.
Perfect prediction in business and economics is practically impossible.
Therefore, there is a need for a measure that describes how precise the prediction of Y is based on X, or, conversely, how inaccurate the estimates might be.
This measure is called the standard error of estimate.
The standard error of estimate, denoted by s_{Y·X}, is the same concept as the standard deviation.
The standard deviation measures the dispersion around the mean.
The standard error of estimate measures the dispersion about the regression line.
Standard Error of Estimate

s_{Y·X} = \sqrt{\frac{\sum (Y - Y')^2}{n - 2}}
The standard deviation is based on the squared deviations from the mean, whereas the standard error of estimate is based on the squared deviations between each Y and its predicted value, Y'.
If s_{Y·X} is small, this means that the data are relatively close to the regression line and the regression equation can be used to predict Y with little error.
If s_{Y·X} is large, this means that the data are widely scattered around the regression line and the regression equation will not provide a precise estimate of Y.
For the previous example, determine the
standard error of estimate as a measure of how
well the values fit the regression line.
Sales Representative   Actual Sales (Y)   Estimated Sales (Y')   Deviation (Y − Y')   Deviation Squared (Y − Y')²
Tom Keller             30                 42.6316                −12.6316             159.557
Jeff Hall              60                 66.3156                 −6.3156              39.887
Brian Virost           40                 42.6316                 −2.6316               6.925
Greg Fish              60                 54.4736                  5.5264              30.541
Susan Welch            30                 30.7896                 −0.7896               0.623
Carlos Ramirez         40                 30.7896                  9.2104              84.831
Rich Niles             40                 42.6316                 −2.6316               6.925
Mike Kiel              50                 42.6316                  7.3684              54.293
Mark Reynolds          30                 42.6316                −12.6316             159.557
Soni Jones             70                 54.4736                 15.5264             241.069
Total                                                              0.0000             784.211
The standard error of estimate is 9.901, found by using the formula:

s_{Y·X} = \sqrt{\frac{\sum (Y - Y')^2}{n - 2}} = \sqrt{\frac{784.211}{10 - 2}} = 9.901
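A minimal Python sketch of this calculation (names illustrative; it reuses the fitted coefficients from the regression section) follows.

```python
# Minimal sketch: standard error of estimate,
# s_yx = sqrt(sum((Y - Y')**2) / (n - 2)), using the fitted line from above
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = 18.9476, 1.1842                      # regression coefficients
fitted = [a + b * x for x in calls]         # Y' for each salesperson
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sold, fitted))
s_yx = math.sqrt(sse / (len(calls) - 2))
print(round(s_yx, 3))                       # ~9.901
```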
MULTIPLE REGRESSION
Introduction
In the preceding two sections, our discussion on
regression and correlation analysis was confined
to only two variables.
However, in real life, we come across several
situations where the relationship is not that
simple.
One variable may be affected by two or more
independent variables.
For example, sales of a product, Y, may be related to a number of variables such as price, income, advertising expenditure, season, number, size and location of the retail outlets, quality of the product, and so forth.
If in such cases we take the effect of only one independent variable, then the magnitude of the error in the result is likely to be high.
In view of this, it is desirable to use two or more independent variables in the estimating equation.
The statistical technique of extending linear regression so as to consider two or more independent variables is known as multiple linear regression.
The multiple linear regression takes the following form:

Y = a + b1X1 + b2X2 + b3X3 + ... + bkXk

where Y is the dependent variable, which is to be predicted; X1, X2, ..., Xk are the k known variables on which the predictions are to be based; and a, b1, b2, ..., bk are parameters whose values are to be determined by the method of least squares.
Example
The following data relate to radio advertising expenditure, newspaper advertising expenditure and sales. Fit a regression equation of Y on X1 and X2.

Radio ad expenditure (000 Rs.) (X1):       4    7    9   12
Newspaper ad expenditure (000 Rs.) (X2):   1    2    5    8
Sales (Rs. lakh) (Y):                      7   12   17   20
Solution
It may be noted that since there are three variables, viz. Y, X1 and X2, there will be three normal equations:

ΣY = na + b1ΣX1 + b2ΣX2
ΣX1Y = aΣX1 + b1ΣX1² + b2ΣX1X2
ΣX2Y = aΣX2 + b1ΣX1X2 + b2ΣX2²
X1   X2    Y   X1²   X1X2   X2²   X1Y   X2Y
 4    1    7    16      4     1    28     7
 7    2   12    49     14     4    84    24
 9    5   17    81     45    25   153    85
12    8   20   144     96    64   240   160
32   16   56   290    159    94   505   276
Substituting the totals from the table into the normal equations:

(1)   56 = 4a + 32b1 + 16b2
(2)  505 = 32a + 290b1 + 159b2
(3)  276 = 16a + 159b1 + 94b2
Multiplying (1) by 8 and subtracting (2) from (4):

(4)  448 = 32a + 256b1 + 128b2
(2)  505 = 32a + 290b1 + 159b2
(5)   57 = 34b1 + 31b2
Multiplying (3) by 2 and subtracting (2) from (6):

(6)  552 = 32a + 318b1 + 188b2
(2)  505 = 32a + 290b1 + 159b2
(7)   47 = 28b1 + 29b2
Multiplying (5) by 14 and (7) by 17 and subtracting (9) from (8):

(8)  798 = 476b1 + 434b2
(9)  799 = 476b1 + 493b2

−1 = −59b2
b2 = 1/59 = 0.0169
Substituting the value of b2 = 1/59 in (5) above:

57 = 34b1 + 31(1/59)
or 34b1 = 57 − 0.525
or b1 = 1.661
Substituting the values of b1 = 1.661 and b2 = 0.0169 in (1) above:

56 = 4a + (1.661 × 32) + (0.0169 × 16)
or 4a = 56 − 53.152 − 0.2704
or 4a = 2.5776
a = 0.6444
Therefore the multiple regression of Y on X1 and X2 is:

Y = 0.6444 + 1.661X1 + 0.0169X2

Since b1 is much larger than b2, radio advertising expenditure is more important than newspaper advertising expenditure.
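As a numerical cross-check, here is a minimal Python sketch using NumPy's least-squares solver as an alternative to solving the normal equations by hand (variable names are illustrative).

```python
# Minimal sketch: fit Y = a + b1*X1 + b2*X2 by least squares with NumPy
import numpy as np

x1 = np.array([4, 7, 9, 12], dtype=float)    # radio advertising expenditure
x2 = np.array([1, 2, 5, 8], dtype=float)     # newspaper advertising expenditure
y = np.array([7, 12, 17, 20], dtype=float)   # sales

A = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix [1, X1, X2]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coef
print(round(a, 4), round(b1, 3), round(b2, 4))    # ~0.6441  ~1.661  ~0.0169
```

The coefficients agree with the hand calculation up to rounding.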
