
SIMPLE LINEAR REGRESSION

REGRESSION AND CORRELATION ANALYSIS


• So far we have always focused on a single variable at a time.
In practice, however, variables are often interrelated, and a proper analysis
of the relationship can facilitate decision making and/or help predict one
variable on the basis of some others.
E.g.: Is there a relationship between the age and weight of primary school students?
… between the sales of a certain product and the sex of the customers?
… between GDP, employment and capital stock?

• If there is a relationship between some variables, it can be

Deterministic: Y = f (X), where Y is the dependent variable, X is the independent variable,
and f denotes a function of X.
E.g.: If Y is the sales value of a product, X is the number of units sold and p is the
fixed unit price, then Y = pX.

Probabilistic: Y = f (X) + ε, where ε is a random variable. It accounts for the difference
between Y and f (X).
E.g.: If Y is the test mark and X is the time spent on studies, then Y = f (X) + ε,
since time is not the only factor determining test results.
• Statistics is concerned with probabilistic relationships and focuses on
two interesting questions:
i. What is the probable form of the relationship?
ii. How strong is the relationship?
Accordingly, we can perform two types of analysis:

Regression analysis: it is concerned with the specification and estimation of the relationship.

Correlation analysis: it is concerned with the measurement of the strength of the relationship.

• Regression models can be classified according to


a) The number of independent variables, k

Simple regression: k = 1, i.e. only one independent variable.
E.g.: GDP is regressed on employment only.

Multiple regression: k ≥ 2, i.e. at least two independent variables.
E.g.: GDP is regressed on employment and capital stock.

b) The functional form, i.e. f

Linear regression: the relationship between the dependent variable (Y) and each
independent variable (X) can be graphed with a straight line.

Non-linear regression: the relationship between X and Y cannot be illustrated with
a straight line.

Simple linear regression model:

Y = β0 + β1X + ε

f (X) = β0 + β1X : the deterministic component, a straight line. It is the mean value of Y
associated with a given value of X, E(Y) in brief, granted that the random variable ε has
a zero mean.

ε : the net effect of all the variables other than X that influence Y.
Regression line: E(Y) = β0 + β1X

[Figure: a straight line in the (X, E(Y)) plane, rising from (x1, y1) to (x2, y2).]

Y-intercept coefficient: β0 = E(Y) when X = 0.
Gradient or slope coefficient: β1 = ΔY/ΔX (rise over run), i.e. E(Y) changes by β1 units
when X goes up by one unit (ΔX = 1).

E.g.: Let E(Y) = 2 + 5 X.


β0 = 2, i.e. E(Y) = 2 if X = 0.
β1 = 5, i.e. E(Y) increases by 5 units whenever X goes up by 1 unit.
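As a quick illustration of the probabilistic model behind this example, here is a minimal Python sketch (not part of the original slides) that simulates Y = 2 + 5X + ε; the normal error with standard deviation 3 and the sample size are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 2.0, 5.0              # the slide's example: E(Y) = 2 + 5X
x = rng.uniform(0, 10, size=1000)    # hypothetical X values
eps = rng.normal(0, 3, size=1000)    # zero-mean random error (assumed sigma = 3)
y = beta0 + beta1 * x + eps          # probabilistic model: Y = f(X) + eps

# With a zero-mean error, the average of Y for X near a given value
# is close to the deterministic component E(Y) = 2 + 5X.
near_4 = (x > 3.9) & (x < 4.1)
print(y[near_4].mean())              # roughly 22
print(beta0 + beta1 * 4)             # exactly 22
```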

Note:

1) The Y-intercept coefficient, β0, is an important mathematical term, but in
practical applications it often does not have any logical interpretation
because zero is not a possible, or at least not a reasonable, value for X.
2) The slope coefficient, β1, determines whether there is a linear relationship
between X and Y. It can assume any real value: zero, positive or
negative.
i. If β1 = 0 the straight line is horizontal (parallel with the X axis).
E(Y) = β0 for all X values, suggesting that X and E(Y) are not
related to each other.
ii. If β1 > 0 the straight line is ascending.
E(Y) and X increase or decrease together, i.e. they have a
positive linear relationship.
iii. If β1 < 0 the straight line is descending.
E(Y) and X move in opposite directions, i.e. they have a
negative linear relationship.

[Figure: three regression lines illustrating the three cases.
β1 = 0: a horizontal line, y1 = y2 for x1 < x2 (no linear relationship).
β1 > 0: an ascending line, y1 < y2 for x1 < x2 (positive linear relationship).
β1 < 0: a descending line, y1 > y2 for x1 < x2 (negative linear relationship).]
[Figure: the population regression line E(Y) = β0 + β1X. At X = xj a vertical line segment
represents the sub-population {Yj}; each point on this segment is a possible value of Yj,
and the distance of an observed point from E(Yj) is εj.]

At any given value of X, e.g. at X = xj, the possible values of ε generate a
sub-population of Y, {Yj}.
– In practice, however, the β0 and β1 population parameters are unknown,
ε is unobservable and we know at most 1-2 elements of each sub-population of Y.

β0 and β1 must be estimated from a sample of paired observations
of X and Y: (x1, y1), (x2, y2), …, (xn, yn).

[Figure: an observation (xi, yi) plotted together with the true/population regression line
and the fitted/sample regression line.]

True/population regression line: E(Y) = β0 + β1X, with error term εi = yi − E(yi).

Fitted/sample regression line: Ŷ = β̂0 + β̂1X, with residual ei = yi − ŷi.

Residual: the difference between the true (observed) and estimated Y values.
This is the error we commit by replacing the true Y value with its estimate.

We would like to keep the overall error as small as possible.

– But how do we measure the overall error?
There are several possibilities. For example, it can be measured by
the sum of errors or by the sum of squared errors:

Σei (i = 1, …, n)          Σei² = SSE (Sum of Squares for Errors)

Sum of errors: positive and negative errors might offset each other.
Sum of squared errors: positive and negative errors do not cancel each other,
and relatively large errors are penalized automatically.

E.g.: Suppose that there are three different pairs of residuals (n = 2).

Best:  Case A: e1 = 2, e2 = −2    Σe = e1 + e2 = 0, but Σe² = e1² + e2² = 8
       Case B: e1 = 4, e2 = −4    Σe = 0, but Σe² = 32
Worst: Case C: e1 = 0, e2 = 8     Σe = 8 and Σe² = 64

Σe cannot differentiate between Cases A and B, while Σe² can rank these three
cases properly.

LEAST SQUARES METHOD
– A regression estimation technique that calculates the β-hats so as to
minimize SSE.
Given the sample, no other straight line can produce a smaller
SSE than the least squares line.
The least squares estimators of β0 and β1 are given by

β̂1 = SSxy / SSx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²    and    β̂0 = ȳ − β̂1x̄

where

SSx = Σ(xi − x̄)² = Σxi² − n·x̄²    (Sum of Squares for X)
SSxy = Σ(xi − x̄)(yi − ȳ) = Σxi yi − n·x̄·ȳ

Note: Although in practice we hardly ever use these formulas ‘by hand’, it is
worth doing these calculations manually a couple of times (a small code sketch follows below).
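A minimal Python sketch of these formulas, using made-up data purely to show the mechanics; np.polyfit is included only as an independent check.

```python
import numpy as np

def least_squares(x, y):
    """Least squares estimates of beta0 and beta1 for simple linear regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    ss_x = np.sum(x**2) - n * x.mean()**2            # SSx  = sum(xi^2)  - n*xbar^2
    ss_xy = np.sum(x * y) - n * x.mean() * y.mean()  # SSxy = sum(xi*yi) - n*xbar*ybar
    b1 = ss_xy / ss_x                                # slope estimate
    b0 = y.mean() - b1 * x.mean()                    # intercept estimate
    return b0, b1

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(least_squares(x, y))
print(np.polyfit(x, y, 1))   # returns [slope, intercept]; should agree
```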
Ex 1:
After several semesters without success, Pat Statsdud (a student in a lower
quarter of a statistics subject) decided to try to improve. Pat needed to know
the secret of success for university students. After many hours of discussion
with other, more successful students, Pat postulated a rather radical theory:
the longer one studied, the better one’s grade. To test the theory, Pat took a
random sample of 100 students in an economics subject and asked each to
report the average amount of time he or she studied economics and the final
mark received. These data are stored in columns 1 (study time in hours) and
2 (final mark out of 100) in file Xr11-17.

Pat’s theory suggests that study time determines final mark.


Therefore, study time is the independent variable (X), and final mark
is the dependent variable (Y).

a) Graph the paired observations of X and Y.

– The graphical technique used to depict the relationship between two
quantitative variables is called a scatter diagram.
The independent variable is plotted on the horizontal axis and the
dependent variable is plotted on the vertical axis.

[Scatter diagram: final mark (vertical axis, 0 to 120) plotted against study time
(horizontal axis, 0 to 50 hours), with a line of ‘best fit’ drawn by eye.]
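A scatter diagram like this can be produced with a few lines of Python. The sketch below assumes the Xr11-17 data have been exported to a two-column CSV file; the file name and the reading step are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("Xr11-17.csv")   # illustrative; adjust to the actual file format
x = data.iloc[:, 0]                 # column 1: study time in hours
y = data.iloc[:, 1]                 # column 2: final mark out of 100

plt.scatter(x, y)                   # independent variable on the horizontal axis
plt.xlabel("Study time")
plt.ylabel("Final mark")
plt.show()
```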

This scatter diagram reveals that

i. a straight-line model fits the data reasonably well;
ii. if the study time is zero, the final mark is around 20.
    The Y-intercept of the line of ‘best fit’ is about 20.
iii. higher study times are usually accompanied by higher final marks.
    The line of ‘best fit’ is ascending, i.e. it has a positive gradient.
    At least in this sample, there is a positive linear relationship
    between study time and final mark.
b) Determine the sample regression line.
Let us do it manually first.
n = 100, Σx = 2795, Σy = 7406, Σx² = 86239 and Σxy = 222239

SSx = Σxi² − n·x̄² = 86239 − 100·(2795/100)² = 8118.75

SSxy = Σxi yi − n·x̄·ȳ = 222239 − 100·(2795/100)·(7406/100) = 15241.3

Therefore

β̂1 = SSxy / SSx = 15241.3 / 8118.75 ≈ 1.877
β̂0 = ȳ − β̂1x̄ = 7406/100 − 1.877·(2795/100) ≈ 21.598

and the least squares sample regression line is

ŷ = β̂0 + β̂1x = 21.598 + 1.877x
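The same calculation can be reproduced from the summary sums in a few lines of Python (a sketch, not the original Excel workflow):

```python
n      = 100
sum_x  = 2795
sum_y  = 7406
sum_x2 = 86239
sum_xy = 222239

x_bar, y_bar = sum_x / n, sum_y / n
ss_x  = sum_x2 - n * x_bar**2         # 8118.75
ss_xy = sum_xy - n * x_bar * y_bar    # 15241.3

b1 = ss_xy / ss_x                     # approx. 1.877
b0 = y_bar - b1 * x_bar               # approx. 21.59 (the slide's 21.598 uses the rounded slope)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```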

Excel Output:

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.891
R Square 0.794
Adjusted R Square 0.792
Standard Error 8.700
Observations 100.000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 21.590 2.835 7.614 0.000 15.963 27.216
Stdytime 1.877 0.097 19.443 0.000 1.686 2.069

ŷ = β̂0 + β̂1x = 21.590 + 1.877x
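An equivalent of this Excel output can be obtained in Python; a minimal sketch with statsmodels, assuming (as before) that the exercise data are available as a two-column CSV file:

```python
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("Xr11-17.csv")   # illustrative file name
x = data.iloc[:, 0]                 # study time
y = data.iloc[:, 1]                 # final mark

X = sm.add_constant(x)              # add the intercept column
model = sm.OLS(y, X).fit()          # ordinary least squares
print(model.params)                 # intercept ~21.590, slope ~1.877
print(model.rsquared)               # ~0.794
print(model.summary())              # coefficients, standard errors, t stats, p-values, CIs
```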

c) Interpret the coefficients.

β̂0 = 21.590, which means that when X = 0, i.e. study time is zero, the final
mark is expected to be about 21.6.

β̂1 = 1.877, which means that for each additional hour of study time, the
final mark is expected to increase by 1.877.

[Scatter diagram of final mark against study time with the least squares fitted line
drawn through the points.]
Note: The fitted line is ascending, as we expected.


MEASURING THE OVERALL FIT OF THE
ESTIMATED MODEL

[Figure: the fitted line Ŷ = β̂0 + β̂1X passing through the point of sample means (x̄, ȳ);
for an observation (xi, yi) the vertical distances yi − ȳ, ŷi − ȳ and yi − ŷi are marked.]

The least squares line goes through the sample means. Is ŷ a better estimator of y
than the sample mean ȳ? For (xi, yi) yes, but in general?

yi − ȳ = (ŷi − ȳ) + (yi − ŷi)

Error associated with using ȳ to estimate y = Improvement due to using ŷ instead of ȳ
+ Error associated with using ŷ to estimate y
– Similar relationships hold for all data points, and it can be shown that

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
    SSy    =    SSR     +    SSE

Total variation in Y = Sum of squares for regression + Sum of squares for errors

The variation in Y is thus partitioned into two parts:
SSR measures the amount of variation in Y that can be explained by the model,
i.e. by the variation in X;
SSE measures the amount of variation in Y that remains unexplained.

From this identity, dividing both sides by SSy, we obtain 1 = SSR/SSy + SSE/SSy.

(Sample) coefficient of determination: R² = SSR/SSy.
It measures the proportion of the total variation in Y that
can be explained by the estimated regression model.
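A short Python sketch of this decomposition, again with made-up data: it fits the least squares line, forms SSy, SSR and SSE, and checks that SSy = SSR + SSE and R² = SSR/SSy.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)                 # least squares slope and intercept
y_hat = b0 + b1 * x

ss_y = np.sum((y - y.mean())**2)             # total variation in Y
ss_r = np.sum((y_hat - y.mean())**2)         # explained by the regression
ss_e = np.sum((y - y_hat)**2)                # unexplained (SSE)

print(np.isclose(ss_y, ss_r + ss_e))         # True: SSy = SSR + SSE
print("R^2 =", ss_r / ss_y)
```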
R² has the following properties:
a) 0 ≤ R² ≤ 1
b) R² = 1 if and only if each observation is on the fitted line, i.e. the fit is
perfect.
c) R² = 0 if and only if the model is completely useless.
d) R² is a measure of goodness-of-fit, and the better the fit of the
estimated model to the sample data, the closer R² is to 1.
e) A useful computational formula for R² is

R² = SSxy² / (SSx · SSy)     where     SSy = Σ(yi − ȳ)² = Σyi² − n·ȳ²

(Ex 1)
d) Calculate the coefficient of determination.
Let us do it manually first. We have already found SSxy and SSx, but we
also need SSy.

Since Σy = 7406 and Σy² = 584518,

SSy = Σyi² − n·ȳ² = 584518 − 100·(7406/100)² = 36029.64

R² = SSxy² / (SSx · SSy) = 15241.3² / (8118.75 · 36029.64) ≈ 0.794

With Excel:
Regression Statistics (R Square = R²)
Multiple R 0.891
R Square 0.794
Adjusted R Square 0.792
Standard Error 8.700
Observations 100.000

This suggests that about 79% of the total variation in the final marks can be
explained by, or is due to, the variation in study time. The remaining 21% is
unexplained.
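Continuing the earlier sketch that worked from the summary sums, R² can be reproduced the same way (the sums are those reported above; this is illustrative, not the Excel output):

```python
n      = 100
sum_y  = 7406
sum_y2 = 584518
ss_x   = 8118.75      # computed earlier from the X sums
ss_xy  = 15241.3

ss_y = sum_y2 - n * (sum_y / n)**2     # 36029.64
r_squared = ss_xy**2 / (ss_x * ss_y)   # approx. 0.794
print(round(r_squared, 3))
```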
• The square root of R², the sample coefficient of determination, is Pearson's
coefficient of correlation between the observed and estimated Y values.

In general, the sample correlation coefficient (r) measures the strength of linear
association between two variables in a sample.
It has the following properties:
a) −1 ≤ r ≤ 1
b) r = -1 indicates that there is a perfect negative linear relationship
between the variables (all observations are on a single straight line
sloping downward).
c) r = 1 indicates that there is a perfect positive linear relationship
between the variables (all observations are on a single straight line
sloping upward).
d) r = 0 suggests that there is no linear relationship between the two
variables.
e) The sign of r shows the nature (negative/positive) of the linear
relationship, and the closer |r| is to 1, the stronger the linear relationship
between the variables.
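A quick illustration of these properties in Python (made-up arrays, for demonstration only); np.corrcoef returns the sample correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(np.corrcoef(x, 2 * x + 1)[0, 1])    # exactly  1: perfect positive linear relationship
print(np.corrcoef(x, -3 * x + 10)[0, 1])  # exactly -1: perfect negative linear relationship
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(np.corrcoef(x, y)[0, 1])            # close to 1: strong positive linear relationship
```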
Note: When r is computed for y and y-hat, it is always non-negative.

r(y, ŷ) = √R² = R,   hence   0 ≤ r(y, ŷ) ≤ 1

(Ex 1)
e) Calculate the sample correlation coefficient between y and y-hat.

r(y, ŷ) = √R² = √0.794 ≈ 0.891
Regression Statistics
Multiple R 0.891
R Square 0.794
Adjusted R Square 0.792
Standard Error 8.700
Observations 100.000
There is a reasonably strong association between the observed and
estimated Y values.

In the context of regression, R² is a more useful statistic than r(y, ŷ). Since R²
measures the proportion of the variation in Y that can be explained by the
regression model, it can be interpreted precisely.
