b) The functional form, f, must be specified; here it is assumed to be linear.
[Figure: the line E(Y) = β0 + β1X.
Gradient or slope coefficient: β1 = Rise/Run = (y2 − y1)/(x2 − x1); E(Y) changes by β1 units when X goes up by one unit (ΔX = 1).
Y-intercept coefficient: β0 = E(Y) when X = 0.]
[Figure: three cases of the linear relationship, each with x1 < x2.
No linear relationship: y1 = y2 and β1 = 0.
Positive linear relationship: y1 < y2 and β1 > 0.
Negative linear relationship: y1 > y2 and β1 < 0.]
[Figure: an observed point from {Yj} plotted against the population regression line E(Y) = β0 + β1X; observed values scatter around the line.]
β0 and β1 must be estimated from a sample of paired observations of X and Y: (x1, y1), (x2, y2), …, (xn, yn).
– But how to measure the overall error?
There are several possibilities. For example, it can be measured by
the sum of errors or by the sum of squared errors:
    Σi=1..n ei    or    SSE = Σi=1..n ei²    (SSE: Sum of Squares for Errors)
The sum of errors Σei cannot differentiate between fits whose positive and negative errors cancel out, while the sum of squared errors Σei² can rank such cases properly.
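A short Python sketch makes the point concrete. The data and the two candidate lines are made up for illustration (they are not the lecture's Cases A and B): residuals of opposite sign cancel in a plain sum, so only SSE separates a perfect fit from a poor one.

```python
# Made-up data and two candidate lines (assumed for illustration):
# positive and negative residuals cancel in a plain sum, so only SSE
# can rank the fits sensibly.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

def errors(b0, b1):
    """Residuals e_i = y_i - (b0 + b1 * x_i) for the line y = b0 + b1*x."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Line 1: the exact line y = 2x      -> every residual is zero.
# Line 2: the flat line y = 5 (mean) -> residuals cancel in the plain sum.
for b0, b1 in [(0.0, 2.0), (5.0, 0.0)]:
    e = errors(b0, b1)
    print("sum of errors:", sum(e), " SSE:", sum(ei**2 for ei in e))
# Both lines have a zero sum of errors, but only the first has SSE = 0.
```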
LEAST SQUARES METHOD
– A regression estimation technique that calculates the β-hats so as to minimize SSE.
Given the sample, no other straight line can produce a smaller
SSE than the least squares line.
The least squares estimators of β0 and β1 are given by

    β̂1 = SSxy / SSx    and    β̂0 = ȳ − β̂1x̄

where the sums of squares are

    SSx = Σi=1..n (xi − x̄)² = Σi=1..n xi² − nx̄²
    SSxy = Σi=1..n (xi − x̄)(yi − ȳ) = Σi=1..n xiyi − nx̄ȳ
Note: Although in practice we hardly ever use these formulas 'by hand', it is worth doing these calculations manually a couple of times.
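As a manual-style exercise, here is a minimal Python sketch of the formulas above, applied to a small made-up sample (the data are assumed for illustration only):

```python
# A minimal sketch of the least squares formulas, applied to a small
# made-up sample (data assumed for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# SS_x = sum(x_i^2) - n*x_bar^2,  SS_xy = sum(x_i*y_i) - n*x_bar*y_bar
ss_x = sum(xi**2 for xi in x) - n * x_bar**2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

b1 = ss_xy / ss_x        # beta_1-hat, the slope estimate (about 1.99)
b0 = y_bar - b1 * x_bar  # beta_0-hat, the intercept estimate (about 0.05)
print(b1, b0)
```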
Ex 1:
After several semesters without success, Pat Statsdud (a student in a lower
quarter of a statistics subject) decided to try to improve. Pat needed to know
the secret of success for university students. After many hours of discussion
with other, more successful students, Pat postulated a rather radical theory:
the longer one studied, the better one’s grade. To test the theory, Pat took a
random sample of 100 students in an economics subject and asked each to
report the average amount of time he or she studied economics and the final
mark received. These data are stored in columns 1 (study time in hours) and
2 (final mark out of 100) in file Xr11-17.
[Figure: scatter plot of final mark (vertical axis, 0–120) against study time in hours (horizontal axis, 0–50).]
Therefore

    β̂1 = SSxy / SSx = 15241.3 / 8118.75 ≈ 1.877
    β̂0 = ȳ − β̂1x̄ = 7406/100 − 1.877 × 2795/100 ≈ 21.598
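The arithmetic can be checked in Python from the summary statistics reported on the slide (n = 100, Σx = 2795, Σy = 7406, SSxy = 15241.3, SSx = 8118.75):

```python
# Reproducing Ex 1's estimates from the slide's summary statistics.
n = 100
sum_x, sum_y = 2795, 7406
ss_xy, ss_x = 15241.3, 8118.75

b1 = ss_xy / ss_x                  # slope: about 1.877
b0 = sum_y / n - b1 * (sum_x / n)  # intercept: about 21.59
# Note: plugging in the rounded slope 1.877 gives 21.598, while the
# unrounded slope gives about 21.590; both figures appear on the slides.
print(round(b1, 3), round(b0, 3))
```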
Excel Output:
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.891
R Square 0.794
Adjusted R Square 0.792
Standard Error 8.700
Observations 100.000
c) Interpret the coefficients.
β̂0 ≈ 21.590, which means that when X = 0, i.e. study time is zero, the final mark is expected to be about 21.6.

β̂1 ≈ 1.877, which means that for each additional hour of study time, the final mark is expected to increase by about 1.877 marks.
[Figure: scatter plot of final mark (0–120) against study time (0–50 hours), with the least squares fitted line superimposed.]
The fitted line is Ŷ = β̂0 + β̂1X, and each observation can be decomposed as

    yi − ȳ = (ŷi − ȳ) + (yi − ŷi)

Squaring and summing over the sample yields the identity

    Σi=1..n (yi − ȳ)² = Σi=1..n (ŷi − ȳ)² + Σi=1..n (yi − ŷi)²
          SSy        =        SSR         +        SSE

From this identity, dividing both sides by SSy, we obtain

    1 = SSR/SSy + SSE/SSy
(Sample) coefficient of determination: R² = SSR/SSy = 1 − SSE/SSy
It measures the proportion of the total variation in Y that
can be explained by the estimated regression model.
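The decomposition SSy = SSR + SSE can be verified numerically; the sketch below fits a least squares line to a small made-up sample and checks the identity:

```python
# A numerical check, on a small made-up sample, that SS_y = SSR + SSE
# holds for the least squares line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares slope and intercept via the deviation-form formulas.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ss_y = sum((yi - y_bar) ** 2 for yi in y)              # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
r_squared = ssr / ss_y
print(ss_y, ssr + sse)  # the two sums agree
```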
R2 has the following properties:
a) 0 ≤ R² ≤ 1
b) R2 = 1 if and only if each observation is on the fitted line, i.e. the fit is
perfect.
c) R2 = 0 if and only if the model is completely useless.
d) R2 is a measure of goodness-of-fit, and the better the fit of the
estimated model to the sample data, the closer R2 is to 1.
e) A useful computational formula for R² is

    R² = SSxy² / (SSx SSy)    where    SSy = Σi=1..n (yi − ȳ)² = Σi=1..n yi² − nȳ²
(Ex 1)
d) Calculate the coefficient of determination.
Let us do it manually first. We have already found SSxy and SSx, but we
also need SSy.
Since Σy = 7406 and Σy² = 584518,

    SSy = Σyi² − nȳ² = 584518 − 7406²/100 = 36029.64

so

    R² = SSxy² / (SSx SSy) = 15241.3² / (8118.75 × 36029.64) ≈ 0.794
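The same arithmetic in Python, using the summary statistics reported above:

```python
# Reproducing Ex 1's R^2 from the slide's summary statistics.
n = 100
sum_y, sum_y2 = 7406, 584518
ss_xy, ss_x = 15241.3, 8118.75

ss_y = sum_y2 - sum_y**2 / n          # 584518 - 7406^2/100 = 36029.64
r_squared = ss_xy**2 / (ss_x * ss_y)  # about 0.794
print(round(ss_y, 2), round(r_squared, 3))
```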
With Excel (R² is reported as "R Square"):
Regression Statistics
Multiple R 0.891
R Square 0.794
Adjusted R Square 0.792
Standard Error 8.700
Observations 100.000
This suggests that about 79% of the total variation in the final marks can be
explained, or is due to, the variation in study time. The remaining 21% is
unexplained.
• The square root of R², the sample coefficient of determination, is Pearson's coefficient of correlation between the observed and estimated Y values.
In general, the sample correlation coefficient (r) measures the strength of linear
association between two variables in a sample.
It has the following properties:
a) −1 ≤ r ≤ 1
b) r = -1 indicates that there is a perfect negative linear relationship
between the variables (all observations are on a single straight line
sloping downward).
c) r = 1 indicates that there is a perfect positive linear relationship
between the variables (all observations are on a single straight line
sloping upward).
d) r = 0 suggests that there is no linear relationship between the two
variables.
e) The sign of r shows the nature (negative/positive) of the linear relationship, and the closer |r| is to 1, the stronger the relationship between the variables.
Note: When r is computed for y and ŷ, it is always non-negative:

    r(y, ŷ) = √R² = R,    so    0 ≤ r(y, ŷ) ≤ 1
(Ex 1)
e) Calculate the sample correlation coefficient between y and y-hat.
    r(y, ŷ) = √R² = √0.794 ≈ 0.891
This is the "Multiple R" value (0.891) reported in the Excel output above.
There is a reasonably strong association between the observed and estimated Y values.
In the context of regression, R² is a more useful statistic than r(y, ŷ): since R² measures the proportion of the variation in Y that can be explained by the regression model, it can be interpreted precisely.
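A quick numerical check (again on a small made-up sample) that the correlation between the observed y and the fitted ŷ equals the positive square root of R²:

```python
import math

# A numerical check, on a small made-up sample, that the correlation
# between observed y and fitted y-hat equals the positive sqrt of R^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

def corr(u, v):
    """Pearson correlation coefficient of two equal-length samples."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    num = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    den = math.sqrt(sum((ui - u_bar) ** 2 for ui in u)
                    * sum((vi - v_bar) ** 2 for vi in v))
    return num / den

r_y_yhat = corr(y, y_hat)
r_squared = (sum((yh - y_bar) ** 2 for yh in y_hat)
             / sum((yi - y_bar) ** 2 for yi in y))
print(r_y_yhat, math.sqrt(r_squared))  # equal, and always non-negative
```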