Contents
I Objectives of the analysis
I Model specification
I Least Squares Estimators (LSE): construction and properties
I Statistical inference:
I For the slope
I For the variance
I Prediction for a new observation (the actual value or the average value)
Chapter 4. Simple linear regression
Learning objectives
I Ability to construct a model to describe the influence of X on
Y
I Ability to find estimates
I Ability to construct confidence intervals and carry out hypothesis
tests
I Ability to estimate the average value of Y for a given x (point
estimate and confidence intervals)
I Ability to estimate the individual value of Y for a given x
(point estimate and confidence intervals)
Chapter 4. Simple Linear Regression
Bibliography
I Newbold, P. Statistics for Business and Economics (2013), Ch. 10
I Ross, S. Introductory Statistics (2005), Ch. 12
Introduction
Examples
I Study how the father's height influences the son's height.
Types of relationships
I Deterministic: Given a value of X , the value of Y can be
perfectly identified.
y = f (x)
Example: The relationship between the temperature in degrees
Celsius (X) and degrees Fahrenheit (Y) is:
y = 1.8x + 32
[Figure: plot of degrees Fahrenheit (Grados Fahrenheit) vs. degrees Celsius (Grados centígrados)]
Introduction
Types of relationships
I Nondeterministic (random/stochastic): Given a value of X ,
the value of Y cannot be perfectly known.
y = f (x) + u
[Figure: scatter plot of costs (Costos) vs. volume (Volumen)]
I Linear: When f(x) = β0 + β1 x.

[Figure: two scatter plots of Y vs. X showing linear patterns]
Types of relationships
I Nonlinear: When f(x) is nonlinear. For example,
f(x) = log(x), f(x) = x² + 3, ...
[Figure: nonlinear relationship (Relación no lineal), Y vs. X]
Types of relationships
I Lack of relationship: When f (x) = 0.
[Figure: absence of relationship (Ausencia de relación), Y vs. X]
Measures of linear dependence
Covariance
The covariance is defined as

cov(x, y) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / (n − 1) = (Σ_{i=1}^n xi yi − n x̄ ȳ) / (n − 1)

and the correlation coefficient as

r(x,y) = cor(x, y) = cov(x, y) / (sx sy)

where

sx² = Σ_{i=1}^n (xi − x̄)² / (n − 1)   and   sy² = Σ_{i=1}^n (yi − ȳ)² / (n − 1)

I −1 ≤ cor(x, y) ≤ 1
I cor(x, y) = cor(y, x)
I cor(ax + b, cy + d) = sign(a) sign(c) cor(x, y) for arbitrary
numbers a, b, c, d
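As a sanity check, both forms of the covariance and the correlation can be computed directly. A minimal Python sketch, using the production/price data that appear in Example 4.1 later in this chapter:

```python
import math

# Wheat data from Example 4.1: production (x) and price per kilo (y)
x = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
y = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Two equivalent forms of the sample covariance (denominator n - 1)
cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
cov_alt = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / (n - 1)

# Sample variances and the correlation coefficient
sx2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
sy2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
r = cov_xy / math.sqrt(sx2 * sy2)
```

The negative correlation (about −0.85) anticipates the negative slope found when the regression line is fitted below.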
Simple linear regression model
Yi = β0 + β1 xi + ui

where
I Yi is the value of the dependent variable Y when the random
variable X takes a specific value xi
I xi is the specific value of the random variable X
I ui is an error, a random variable that is assumed to be normal
with mean 0 and unknown variance σ², ui ~ N(0, σ²)
I β0 and β1 are the population coefficients:
I β0: population intercept
I β1: population slope
The (population) parameters that we need to estimate are: β0, β1
and σ².
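The model can be simulated to see what data generated by it look like. A minimal Python sketch; the parameter values, the seed, and the grid of x values are purely illustrative assumptions:

```python
import random

random.seed(42)

# Hypothetical population parameters (chosen only for illustration)
beta0, beta1, sigma = 74.0, -1.35, 5.0

# Simulate Y_i = beta0 + beta1 * x_i + u_i with u_i ~ N(0, sigma^2)
x = [22, 24, 25, 25, 25, 28, 30, 32, 35, 40]
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
```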
Simple linear regression model
Our objective is to find the estimators/estimates β̂0, β̂1 of β0, β1
in order to obtain the regression line:

ŷ = β̂0 + β̂1 x

which is the best fit to the data with a linear pattern. Example:
Let's say that the regression line for the last example is

Price-hat = 15.65 + 1.29 · Production
[Figure: fitted regression line over the costs (Costos) vs. volume (Volumen) scatter; labels mark an observed data point (y) and the estimated regression line (recta de regresión estimada)]
Simple linear regression model: model assumptions
The errors are assumed to have zero mean and to be uncorrelated:

E[ui] = 0,   E[ui uj] = 0 for i ≠ j

[Figure: costs (Costos) vs. volume (Volumen) scatter with a linear pattern]

If not, the regression line is not an adequate model for the data.

[Figure: plot of fitted model where a straight line does not fit the data]
Simple linear regression model: model assumptions
Homoscedasticity
The vertical spread around the line should remain roughly
constant.
[Figure: costs (Costos) vs. volume (Volumen) scatter with roughly constant vertical spread]
Simple linear regression model: model assumptions
Independence
I The observations are assumed to be independent.
I In general, time series fail this assumption.
Simple linear regression model: model assumptions
Normality
I A priori, we assume that the observations are normal:

yi = β0 + β1 xi + ui,   ui ~ N(0, σ²)   so that   yi ~ N(β0 + β1 xi, σ²)

where β0, β1, σ² are unknown parameters.
(Ordinary) Least Squares Estimators: LSE
In 1809 Gauss proposed the least squares method to obtain the
estimators β̂0 and β̂1 that provide the best fit

ŷi = β̂0 + β̂1 xi

The method is based on a criterion in which we minimize the sum
of squares of the residuals, SSR, that is, the sum of squared
vertical distances between the observed yi and predicted ŷi values:

Σ_{i=1}^n ei² = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)²

[Figure: observed value (Valor observado), predicted value (Valor previsto) and residual (Residuo) ei = yi − ŷi relative to the fitted line]
Least Squares Estimators
The resulting estimators are

β̂1 = cov(x, y) / sx² = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²

β̂0 = ȳ − β̂1 x̄

[Figure: fitted regression line (recta de regresión) ŷ = β̂0 + β̂1 x, with intercept β̂0 and slope (pendiente) β̂1 marked]
Fitting the regression line
Example 4.1. For the Spanish wheat production data from the 1980s, with
production (X) and price per kilo in pesetas (Y), we have the following
table:

production  30  28  32  25  25  25  22  24  35  40
price       25  30  27  40  42  40  50  45  30  25

The regression line is

ŷ = 74.116 − 1.3537 x
Fitting the regression line in software
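The original software output is not reproduced here; a minimal Python sketch that recovers the same fit from the least-squares formulas above:

```python
# Wheat data from Example 4.1: production (x) and price per kilo (y)
x = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
y = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Least squares estimators: slope b1 and intercept b0
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

print(f"yhat = {b0:.4f} {b1:+.4f} x")  # fitted line, approx. 74.1151 - 1.3537 x
```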
Estimating the error variance
To estimate the error variance, σ², we can simply take the
uncorrected sample variance,

σ̂² = Σ_{i=1}^n ei² / n

which is the so-called maximum likelihood estimator of σ².
ŷi = 74.116 − 1.3537 xi

xi             30     28     32     25     25     25     22     24     35     40
yi             25     30     27     40     42     40     50     45     30     25
ŷi             33.50  36.21  30.79  40.27  40.27  40.27  44.33  41.62  26.73  19.96
ei = yi − ŷi  −8.50  −6.21  −3.79  −0.27   1.72  −0.27   5.66   3.37   3.26   5.03
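The fitted values, residuals, and both variance estimates can be reproduced as follows (the coefficient values are the rounded estimates from the text):

```python
# Fitted values and residuals for the wheat data, plus the two
# estimates of the error variance
x = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
y = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(x)
b0, b1 = 74.1151, -1.3537  # estimates obtained earlier

yhat = [b0 + b1 * xi for xi in x]
e = [yi - yh for yi, yh in zip(y, yhat)]

sse = sum(ei ** 2 for ei in e)
sigma2_ml = sse / n   # maximum likelihood (uncorrected) estimator
sR2 = sse / (n - 2)   # residual variance used in the inference formulas
```

With these data sR² ≈ 25.99, the value used in the inference computations below.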
Statistical inference for the slope

H0: β1 = 0
H1: β1 ≠ 0

1. The 95% confidence interval for β1 is

β̂1 ∓ t_{n−2,α/2} sqrt(sR² / ((n−1) sX²)):   −1.3537 ∓ 2.306 sqrt(25.99 / (9 · 32.04))

that is, −2.046 ≤ β1 ≤ −0.661.

2. Since the interval (with the same α) doesn't contain 0, we reject the
null β1 = 0 at the 0.05 level. Also, the (observed) test statistic is

t = β̂1 / sqrt(sR² / ((n−1) sX²)) = −1.3537 / sqrt(25.99 / (9 · 32.04)) = −4.509

and |t| = 4.509 > t_{8,0.025} = 2.306.
Statistical inference for the intercept

H0: β0 = 0
H1: β0 ≠ 0

2. Since the interval (with the same α) doesn't contain 0, we reject the
null hypothesis that β0 = 0. Also, the (observed) test statistic is

t = β̂0 / sqrt(sR² (1/n + x̄² / ((n−1) sX²))) = 74.1151 / sqrt(25.99 (1/10 + 28.6² / (9 · 32.04))) = 8.484
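Both observed test statistics can be recomputed from the quantities in the text (a sketch; the inputs are the rounded values used in the slides):

```python
import math

# Quantities computed earlier for the wheat data
n, xbar = 10, 28.6
b0, b1 = 74.1151, -1.3537
sR2 = 25.99      # residual variance, SSE / (n - 2)
ssx = 9 * 32.04  # (n - 1) * sX^2 = sum of (xi - xbar)^2

# t statistics for H0: beta1 = 0 and H0: beta0 = 0
t_slope = b1 / math.sqrt(sR2 / ssx)
t_intercept = b0 / math.sqrt(sR2 * (1 / n + xbar ** 2 / ssx))
```

Both values exceed t_{8,0.025} = 2.306 in absolute value, so both null hypotheses are rejected at the 0.05 level.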
We have:

(n − 2) sR² / σ²  ~  χ²_{n−2}

so the confidence interval for σ² is

(n − 2) sR² / χ²_{n−2,α/2}  ≤  σ²  ≤  (n − 2) sR² / χ²_{n−2,1−α/2}
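A sketch of the 95% confidence interval for σ² with the wheat data; the χ² quantiles are taken from standard tables rather than computed:

```python
# 95% CI for the error variance sigma^2, with n - 2 = 8 degrees
# of freedom; chi-square quantiles from standard tables
n, sR2 = 10, 25.99
chi2_upper = 17.535  # chi^2_{8, 0.025}
chi2_lower = 2.180   # chi^2_{8, 0.975}

ci_low = (n - 2) * sR2 / chi2_upper
ci_high = (n - 2) * sR2 / chi2_lower
```

Note how wide the interval is: with only n = 10 observations, σ² is estimated very imprecisely.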
In both cases the point prediction is:

ŷ0 = β̂0 + β̂1 x0 = ȳ + β̂1 (x0 − x̄)

Remember that:

Var(Ŷ0) = Var(Ȳ) + (x0 − x̄)² Var(β̂1) = σ² (1/n + (x0 − x̄)² / ((n−1) sX²))

And thus the confidence interval for the actual value Y0 is:

ŷ0 ∓ t_{n−2,α/2} sqrt(sR² (1 + 1/n + (x0 − x̄)² / ((n−1) sX²)))

The size of this interval is bigger than that for the average
prediction.
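A sketch of the prediction interval for the actual value at a hypothetical production level x0 = 30, using the rounded quantities from the text:

```python
import math

# Quantities computed earlier for the wheat data
n, xbar, ybar = 10, 28.6, 35.4
b1 = -1.3537
sR2 = 25.99
ssx = 9 * 32.04  # (n - 1) * sX^2
t_crit = 2.306   # t_{8, 0.025}

x0 = 30
y0_hat = ybar + b1 * (x0 - xbar)  # point prediction
half = t_crit * math.sqrt(sR2 * (1 + 1 / n + (x0 - xbar) ** 2 / ssx))
lo, hi = y0_hat - half, y0_hat + half
```

Dropping the leading "1 +" inside the square root gives the narrower interval for the average value E[Y | x0].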
Estimating/predicting the average and actual values
[Figure: price in pesetas (Precio en ptas.) vs. production in kg (Producción en kg.) with the fitted regression line]
Regression line: R-squared and variability decomposition
I Coefficient of determination, R-squared, is used to assess the
goodness-of-fit of the model. It is defined as

R² = r²(x,y) ∈ [0, 1]
ANOVA table
Note that the value of the F statistic is the square of that for the t
statistic in the simple regression significance test.
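This relationship can be verified numerically on the wheat data; a minimal sketch computing the ANOVA F statistic and the slope t statistic from scratch:

```python
import math

# ANOVA decomposition for the wheat data: check that F equals t^2
x = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
y = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

ssx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

ssm = sum((yh - ybar) ** 2 for yh in yhat)            # explained sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual sum of squares

F = ssm / (sse / (n - 2))                 # ANOVA F statistic, 1 and n-2 df
t = b1 / math.sqrt((sse / (n - 2)) / ssx)  # t statistic for the slope
```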