
Introductory Econometrics

Simple and Multiple Regression Analysis

Monash Econometrics and Business Statistics

2019

1 / 43
Recap

- We reviewed the concepts of:

1. a random variable and its probability density function (pdf);
2. the mean and the variance of a random variable;
3. the joint pdf of two random variables;
4. the covariance and correlation between two random variables;
5. the mean and the variance of a linear combination of random
variables;
6. how and why diversification reduces risk, and how good
estimators in econometrics reduce risk in a similar fashion;
7. conditional distribution of one random variable given another;
8. the conditional expectation function as the fundamental object
for modelling.

2 / 43
Lecture Outline: Wooldridge: Ch 2 - 2.3, Appendix B-1,
B-2, B-4

- Explaining y using x with a model
- Simple linear regression (textbook reference Chapter 2, 2-1)
- The OLS estimator (textbook reference, Ch 2, 2-2, 2-3)
- Simple linear regression in matrix form
- Geometric interpretation of least squares (not in the textbook)
- Stretching our imagination to multiple regression (Appendix E of the textbook, section E-1)

3 / 43
What is a model?

- To study the relationship between random variables, we use a model
- A complete model for a random variable is a model of its probability distribution
- For example, a possible model for the heights of adult men is that they are normally distributed; since the normal distribution is fully described by its mean and variance, we only have to estimate the mean and variance from a sample of observations on adult men and we have estimated a model
- But when studying two or more random variables, modelling the joint distribution is difficult and requires a lot of data.
- So, we model the conditional distribution of y given x.
- In particular, we provide a model for E(y | x)

4 / 43
Terminology and notation
- Notation: Unlike first year, we use lower case letters for random variables
- Terminology: There are many different ways people refer to the target variable y that we want to explain using variable x. These include:

      y                     x
      Dependent Variable    Independent Variable
      Explained Variable    Explanatory Variable
      Response Variable     Control Variable
      Predicted Variable    Predictor Variable
      Regressand            Regressor

- The expressions
  “Run a regression of y on x”, and,
  “Regress y on x”
  both mean “Estimate the model y = β0 + β1 x + u using the ordinary least squares method”
5 / 43
Modelling mean

E(y | x) = β0 + β1 x                        ŷ = Ê(y | x) = β̂0 + β̂1 x
(Conditional expectation function)          (Sample regression function)

6 / 43
The simple linear regression model in a picture

7 / 43
The simple linear regression model in equation form

- The following equation specifies the conditional mean of y given x:

      y = β0 + β1 x + u   with   E(u | x) = 0

- It is an incomplete model because it does not specify the probability distribution of y conditional on x
- If we add the assumptions that Var(u | x) = σ² and that the conditional distribution of u given x is normal, then we have a complete model, which is called the “classical linear model” and was shown in the picture on the previous slide.
- In summary: we assume that, in the big scheme of things, the data are generated by this model, and we want to use observed data to learn the unknowns β0, β1 and σ² in order to predict y using x.

8 / 43
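To make the "data are generated by this model" story concrete, here is a minimal simulation sketch in Python/NumPy; the parameter values β0 = 1, β1 = 0.5 and σ = 2 are made up for illustration, not taken from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative (made-up) population parameters of the classical linear model
    beta0, beta1, sigma = 1.0, 0.5, 2.0

    n = 200
    x = rng.uniform(0, 10, size=n)      # regressor
    u = rng.normal(0.0, sigma, size=n)  # error: normal, mean 0, variance sigma^2, independent of x
    y = beta0 + beta1 * x + u           # y = beta0 + beta1*x + u

    # The conditional mean of y given x is beta0 + beta1*x,
    # so the sample average of y - (beta0 + beta1*x) should be close to zero
    print(np.mean(y - (beta0 + beta1 * x)))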
The OLS estimator

- Let’s leave the theory universe and go back to the data world
- We have a random sample of n observations on two variables x and y in two columns of a spreadsheet
- Let’s denote them by (x1, y1), (x2, y2), ..., (xn, yn)
- We want to see if these two columns of numbers are related to each other
- In the simple case that there is only one x, we look at the scatter plot, which is a very informative visual tool and gives us a good idea of the correlation between y and x (bodyfat and weight)
- Unfortunately though, in some business and economic applications the signal is too weak to be detected by data visualisation alone (asset return and size)

9 / 43
The OLS estimator
- How can we determine a straight line that fits our data best?
- We find β̂0 and β̂1 that minimise the sum of squared residuals

      SSR(b0, b1) = Σ_{i=1}^{n} (yi − b0 − b1 xi)²

- My favourite visual explanation of this is in http://youtu.be/jEEJNz0RK4Q
- The first order conditions are:

      ∂SSR/∂b0 evaluated at (β̂0, β̂1):   −2 Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) = 0
      ∂SSR/∂b1 evaluated at (β̂0, β̂1):   −2 Σ_{i=1}^{n} xi (yi − β̂0 − β̂1 xi) = 0

- Two equations and two unknowns, so we can solve to get β̂0 and β̂1
10 / 43
- We obtain the set of equations

      Σ_{i=1}^{n} ûi = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) = 0
      Σ_{i=1}^{n} xi ûi = Σ_{i=1}^{n} xi (yi − β̂0 − β̂1 xi) = 0

- And after some algebra (see page 28 of the textbook, 5th ed., or page 26, 6th ed.)

      β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²

      β̂0 = ȳ − β̂1 x̄

11 / 43
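A minimal NumPy sketch of these two formulas (the function name and the toy data are mine, not from the textbook):

    import numpy as np

    def ols_simple(x, y):
        """OLS intercept and slope for a regression of y on a single x."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        xbar, ybar = x.mean(), y.mean()
        b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
        b0 = ybar - b1 * xbar
        return b0, b1

    # Made-up example data
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 2.9, 4.2, 4.8, 6.1]
    b0_hat, b1_hat = ols_simple(x, y)
    print(b0_hat, b1_hat)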
- Several important things to note:

  1. β̂1 = σ̂xy / σ̂x², i.e. the sample covariance of x and y divided by the sample variance of x.
     Those of you who are doing finance now realise where the name “beta” of a stock comes from!

  2. From the formula for β̂0 we see that ȳ = β̂0 + β̂1 x̄, which shows that (x̄, ȳ) lies on the regression line, i.e. regression prediction is most accurate at the sample average.

  3. Both β̂0 and β̂1 are functions of sample observations only. They are estimators, i.e. formulae that can be computed from the sample. If we collect a different sample from the same population, we get different estimates. So these estimators are random variables.

12 / 43
Estimators we have seen so far

Population parameter → its estimator:

- population mean µy = E(y)  →  sample average ȳ = (1/n) Σ_{i=1}^{n} yi
- population variance σy² = E[(y − µy)²]  →  sample variance sy² = σ̂y² = (1/(n−1)) Σ_{i=1}^{n} (yi − ȳ)²
- population standard deviation σy = √(σy²)  →  sample standard deviation σ̂y = √(σ̂y²)
- population covariance σxy = E[(x − µx)(y − µy)]  →  sample covariance σ̂xy = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)
- population correlation coefficient ρxy = σxy / (σx σy)  →  sample correlation coefficient ρ̂xy = σ̂xy / (σ̂x σ̂y)
- slope and intercept of the PRF, β1 and β0 in E(y | x) = β0 + β1 x  →  their OLS estimators β̂1 = σ̂xy / σ̂x² and β̂0 = ȳ − β̂1 x̄

13 / 43
Simple linear regression in matrix form

- For each of the n observations (x1, y1), (x2, y2), ..., (xn, yn), our model implies

      yi = β0 + β1 xi + ui,   i = 1, ..., n

- We stack all of these on top of each other:

      [y1]   [1]        [x1]        [u1]
      [y2] = [1] β0  +  [x2] β1  +  [u2]
      [ ⋮]   [⋮ ]       [ ⋮]        [ ⋮]
      [yn]   [1]        [xn]        [un]

14 / 43
- We define the n × 1 vectors y (dependent variable vector) and u (error vector):

      y = [y1]          u = [u1]
          [y2]              [u2]
          [ ⋮]              [ ⋮]
          [yn]              [un]

  the n × 2 matrix of regressors

      X = [1  x1]
          [1  x2]
          [⋮   ⋮]
          [1  xn]

  and the 2 × 1 parameter vector

      β = [β0]
          [β1]

  which allow us to write the model simply as

      y = Xβ + u
15 / 43
- In multiple regression, where we have k explanatory variables plus the intercept,

      y = X β + u,    with  y: n × 1,  X: n × (k+1),  β: (k+1) × 1,  u: n × 1

  where

      X = [1  x11  x12  ···  x1k]
          [1  x21  x22  ···  x2k]
          [⋮   ⋮    ⋮          ⋮]
          [1  xn1  xn2  ···  xnk]

  and

      β = [β0, β1, ..., βk]′

- Remember that y and X are observable, while β and u are unknown and unobservable
- Also remember that Xb, for any (k + 1) × 1 vector b, is an n × 1 vector that is a linear combination of the columns of X
16 / 43
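As a sketch of how y and X would be assembled in practice (NumPy, made-up numbers; the column of ones is the intercept column):

    import numpy as np

    # Made-up data: n = 4 observations on k = 2 explanatory variables
    x1 = np.array([1.0, 2.0, 3.0, 4.0])
    x2 = np.array([5.0, 3.0, 6.0, 2.0])
    y  = np.array([10.0, 8.0, 14.0, 9.0])

    # X is n x (k+1): a column of ones, then one column per regressor
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # For any (k+1) x 1 vector b, Xb is an n x 1 linear combination of the columns of X
    b = np.array([1.0, 2.0, 0.5])
    print(X @ b)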
The power of abstraction
- In the real world we have goals such as:
  - We want to determine the added value of education to wage after controlling for experience and IQ, based on a random sample of 1000 observations
  - We want to establish if the return on a firm's share price is related to its value measured by its book-to-market ratio and the firm's size, based on 24 yearly observations on 500 stocks
  - We want to predict the value of a house in the city of Monash given characteristics such as land area, number of bedrooms, number of bathrooms, proximity to public transport, etc., based on data on the last 100 houses sold in Monash
  - We want to forecast hourly electricity load based on all information available at the time the forecast is made, based on hourly electricity load in the last 3 years
- In all cases we want to get a good estimate of β in the model y = Xβ + u based on our sample of observations
- We have now turned these very diverse questions into a single mathematical problem
17 / 43
The OLS solution

- OLS finds the vector in the column space of X which is closest to y. Let's see what this means
- To visualise, we are limited to 3 dimensions at most
- Assume we want to explain house prices with the number of bedrooms
- We have data on 3 houses, with 4, 1 and 1 bedrooms, which sold for 10, 4 and 6 (× $100,000)
- We want to find a good β̂:

      [10]   [1]        [4]           [1  4]
      [ 4] = [1] β̂0  +  [1] β̂1 + û  = [1  1] β̂ + û
      [ 6]   [1]        [1]           [1  1]

18 / 43
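A sketch of this three-house example in NumPy. Here np.linalg.lstsq is used as a black box to find the least-squares β̂; the geometry behind it is what the next slides develop.

    import numpy as np

    # Prices (in units of $100,000) and number of bedrooms for the three houses
    price    = np.array([10.0, 4.0, 6.0])
    bedrooms = np.array([4.0, 1.0, 1.0])

    X = np.column_stack([np.ones(3), bedrooms])   # intercept column plus bedrooms

    beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)
    residuals = price - X @ beta_hat
    print(beta_hat)    # [intercept, slope]
    print(residuals)   # not all zero: the price vector is not in the column space of X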
Vectors and vector spaces
- An n-dimensional vector is an arrow from the origin to the point in ℝⁿ with coordinates given by its elements
- Example: v = (4, 3)′

19 / 43
- The length of a vector is the square root of the sum of squares of its coordinates:

      length(v) = (v′v)^(1/2)

- Example: vector v on the previous slide has length (4² + 3²)^(1/2) = 5
- For u and v of the same dimension,

      u′v = 0  ⇔  u and v are perpendicular (orthogonal) to each other

- Example: (3, 2)′ and (−1, 1.5)′

20 / 43
- For any constant c, the vector cv is on the line that passes through v
- This line is called the “space spanned by v”
- Example: (3, 3)′ and (−2, −2)′ are in the space spanned by v = (1, 1)′

21 / 43
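A small NumPy check of these facts, using the example vectors from the last three slides:

    import numpy as np

    v = np.array([4.0, 3.0])
    print(np.sqrt(v @ v))       # length of v: sqrt(4^2 + 3^2) = 5

    u = np.array([3.0, 2.0])
    w = np.array([-1.0, 1.5])
    print(u @ w)                # 0.0, so u and w are orthogonal

    v1 = np.array([1.0, 1.0])
    print(3 * v1, -2 * v1)      # (3, 3) and (-2, -2) are scalar multiples of v1, i.e. in its span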
- Consider a matrix X that has two columns that are not a multiple of each other
- Geometrically, these vectors form a two-dimensional plane that contains all linear combinations of these two vectors, i.e. the set of Xb for all b. This plane is called the column space of X
- If the number of rows of X is also two, then the column space of X is the entire ℝ², that is, any two-dimensional y can be written as a linear combination of the columns of X
- Consider the house price example, but only with two houses:

      [10]   [1]        [4]           [1  4]
      [ 4] = [1] β̂0  +  [1] β̂1 + û  = [1  1] β̂ + û

- The price vector can be perfectly explained with a linear combination of the columns of X, i.e. with zero û

22 / 43
[Slides 23-29: a sequence of figures (not reproduced) in the plane with axes labelled obs 1 and obs 2 (running roughly from −10 to 10), showing the two columns of X, the linear combinations they span, and how the price vector (10, 4)′ is reached exactly as a combination of the two columns.]

23-29 / 43
- This shows that with only two observations, we would suggest

      predicted price = 2 + 2 × bedrooms

- While this fits the first two houses perfectly, when we add the third house (1 bedroom, sold for 6 hundred thousand) we make an error of 2 hundred thousand dollars
- With 3 observations, the 3-dimensional price vector no longer lies in the space of linear combinations of the columns of X.
- The closest point in the column space of X to y is ...

30 / 43
[Slides 31-32: figures (not reproduced) illustrating the closest point to y in the column space of X: the orthogonal projection of y onto that plane, with the residual vector û perpendicular to it.]

31-32 / 43
- Hence, the shortest û is the one that is perpendicular to the columns of X, i.e.

      X′û = 0

- Since û = y − Xβ̂, we get

      X′(y − Xβ̂) = 0
      ⇒ X′y = X′Xβ̂
      ⇒ β̂ = (X′X)⁻¹ X′y

- For X′X to be invertible, the columns of X must be linearly independent
- The first equation above (X′û = 0) is more important than the closed-form formula for the OLS estimator below it. It summarises the geometry of OLS: the OLS residuals are orthogonal to the columns of X

33 / 43
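A sketch of this formula in NumPy, solving the normal equations directly (in practice np.linalg.lstsq or a QR decomposition is numerically safer). The data are simulated just to have something to run on:

    import numpy as np

    def ols(X, y):
        """OLS estimator beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations."""
        return np.linalg.solve(X.T @ X, X.T @ y)   # requires linearly independent columns of X

    # Made-up data
    rng = np.random.default_rng(1)
    n = 50
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

    beta_hat = ols(X, y)
    u_hat = y - X @ beta_hat
    print(beta_hat)
    print(X.T @ u_hat)   # essentially zero: the residuals are orthogonal to the columns of X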
- The vector of OLS predicted values is the orthogonal projection of y onto the column space of X:

      ŷ = Xβ̂ = X(X′X)⁻¹X′y

- By definition,

      y = ŷ + û

- Since y, ŷ and û form a right-angled triangle, we know (remember the length of a vector)

      y′y = ŷ′ŷ + û′û

  i.e.,

      Σ_{i=1}^{n} yi² = Σ_{i=1}^{n} ŷi² + Σ_{i=1}^{n} ûi²        (L2)

34 / 43
- Since (1 1 ··· 1)û = 0, we also have

      Σ_{i=1}^{n} yi = Σ_{i=1}^{n} ŷi,

  i.e. the sample average of the fitted values ŷi equals ȳ
- Subtracting nȳ² from both sides of (L2) and rearranging, we get

      Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} ûi²

  or

      SST = SSE + SSR

- This leads to the definition of the coefficient of determination R², which is a measure of goodness of fit:

      R² = SSE/SST = 1 − SSR/SST

35 / 43
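A sketch of the SST = SSE + SSR decomposition and R² in NumPy (it relies on the model containing an intercept; the small data set is made up):

    import numpy as np

    def r_squared(y, y_hat):
        """Coefficient of determination for a regression that includes an intercept."""
        sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
        sse = np.sum((y_hat - y.mean()) ** 2)    # explained sum of squares
        ssr = np.sum((y - y_hat) ** 2)           # residual sum of squares
        assert np.isclose(sst, sse + ssr)        # SST = SSE + SSR
        return 1 - ssr / sst

    # Tiny made-up example
    X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
    y = np.array([1.2, 1.9, 3.2, 3.9])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(r_squared(y, X @ beta_hat))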
Interpretation of OLS estimates
Example: The causal effect of education on wage

- Consider an extension of the wage equation that is in tutorial 2:

      wage = β0 + β1 educ + β2 IQ + u

  where IQ is the IQ score (in the population, it has a mean of 100 and a standard deviation of 15).
- We are primarily interested in β1, because we want to know the value that education adds to a person's wage.
- Without IQ in the equation, the coefficient of educ will show how strongly wage and educ are correlated, but both could be caused by a person's ability.
- By explicitly including IQ in the equation, we obtain a more persuasive estimate of the causal effect of education, provided that IQ is a good proxy for intelligence.

36 / 43
OLS in action
Example: The causal effect of education on wage

- If we only estimate a regression of wage on education, we cannot be sure whether we are measuring the effect of education, or whether education is acting as a proxy for smartness. This is important, because if the education system does not add any value other than separating smart people from not-so-smart people, society could achieve that much more cheaply with national IQ tests!
37 / 43
Interpretation of OLS estimates
Example: The causal effect of education on wage

- The coefficient of educ now shows that, for two people with the same IQ score, the one with 1 more year of education is expected to earn $42 more.

38 / 43
Interpretation of OLS estimates

- Consider k = 2 for simplicity
- The conditional expectation of y given x1 and x2 (also known as the population regression function) is

      E(y | x1, x2) = β0 + β1 x1 + β2 x2

- The estimated regression (also known as the sample regression function) is

      ŷ = β̂0 + β̂1 x1 + β̂2 x2

- The formula

      ∆ŷ = β̂1 ∆x1 + β̂2 ∆x2

  allows us to compute how predicted y changes when x1 and x2 change by any amount.

39 / 43
- Now, in

      ∆ŷ = β̂1 ∆x1 + β̂2 ∆x2

  what happens if we hold x2 fixed and increase x1 by 1 unit, that is, ∆x2 = 0 and ∆x1 = +1? We will have

      ∆ŷ = β̂1

- In other words, β̂1 is the slope of ŷ with respect to x1 when x2 is held fixed.
- We also refer to β̂1 as the estimate of the partial effect of x1 on y holding x2 constant
- Yet another legitimate interpretation is that β̂1 estimates the effect of x1 on y after the influence of x2 has been removed (or has been controlled for)

40 / 43
- Similarly,

      ∆ŷ = β̂2   if   ∆x1 = 0 and ∆x2 = +1

- Let's go back to the regression output and interpret the parameters:

      predicted wage = −128.89 + 42.06 educ + 5.14 IQ
      n = 935,  R² = .134

- 42.06 shows that for two people with the same IQ, the one with one more year of education is predicted to earn $42.06 more in monthly wages.
- Or: every extra year of education increases the predicted wage by $42.06, keeping IQ constant (or “after controlling for IQ”, or “after removing the effect of IQ”, or “all else constant”, or “all else equal”, or “ceteris paribus”)

41 / 43
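As a quick arithmetic check of the ∆ŷ = β̂1 ∆educ + β̂2 ∆IQ formula, using the coefficients reported above (the 5-point IQ change is just a hypothetical):

    # Coefficients reported on the slide
    b_educ, b_iq = 42.06, 5.14

    d_educ, d_iq = 1, 5                      # hypothetical changes in educ and IQ
    d_wage = b_educ * d_educ + b_iq * d_iq   # predicted change in monthly wage
    print(d_wage)                            # 42.06 + 25.70 = 67.76 dollars per month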
Summary
- Given a sample of n observations, OLS finds the orthogonal projection of the dependent variable vector onto the column space of the explanatory variables
- The residual vector is orthogonal to each of the explanatory variable vectors, including the column of ones for the intercept:

      X′û = 0

- This leads to the formula for the OLS estimator in multiple regression:

      β̂ = (X′X)⁻¹ X′y

- It also leads to

      SST = SSE + SSR

- We learned how to interpret the coefficients of an estimated regression model
- Note the difference between the population parameter β and its OLS estimator β̂: β is constant and does not change, whereas β̂ is a function of the sample and its value changes for different samples. Why are these good estimators?
42 / 43
Back to what is a model

- The probability that the height of a randomly selected man lies in a certain interval is the area under the pdf over that interval

43 / 43
