Lecture 2

Simple Linear Regression

Basic Idea

Ordinary Least Squares (OLS) Method

Estimation

Assumptions

Properties of OLS

Interpretation

Goodness of Fit

Hypothesis Testing

True Model

Dependent Variable

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

Fitted Line

(the model's prediction)

$E(Y|X_i) = \beta_0 + \beta_1 X_i$

Leftover term

Stochastic error, $\varepsilon_i$

Estimated Model

$Y_i = b_0 + b_1 X_i + e_i$

$\hat{Y}_i = b_0 + b_1 X_i$

Residual, $e_i$

[Figure: the conditional expected value $E(Y|X)$, the population average of $Y$, $E(Y)$, and the sample regression line $\hat{Y} = b_0 + b_1 X$]

PRF vs SRF

OLS

OLS is the most basic and most commonly used regression technique.

Given $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

We wish to estimate $\hat{Y}_i = b_0 + b_1 X_i$

OLS estimates $\beta_0$ and $\beta_1$ by choosing $b_0$ and $b_1$ such that the sum of squared residuals (RSS) is minimized.

Residual

The residual is $e_i = Y_i - \hat{Y}_i$

OLS minimizes the sum of squared residuals (RSS), which means:

OLS minimizes $\sum_{i=1}^{n} e_i^2$, where $i = 1, 2, \ldots, n$
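A minimal sketch of the minimization above, assuming the standard closed-form solution for simple regression: $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $b_0 = \bar{Y} - b_1 \bar{X}$. The function names `ols_fit` and `rss` and the five data points are hypothetical, chosen only for illustration.

```python
def ols_fit(x, y):
    """Return (b0, b1), the intercept and slope that minimize the RSS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared X deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: forces the fitted line through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


def rss(x, y, b0, b1):
    """Sum of squared residuals e_i = Y_i - (b0 + b1 * X_i)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(x, y)
best = rss(x, y, b0, b1)
# Any other line, such as this slightly shifted one, gives a larger RSS.
worse = rss(x, y, b0 + 0.1, b1 - 0.05)
print(b0, b1, best, worse)
```

Comparing `best` with `worse` is only a spot check, not a proof; the closed-form solution is what guarantees the minimum.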

OLS

Goodness of Fit

The best fitting line may not be all that good,

so it is desirable to have some measure of fit

for how good the line is. We want to know

how well the regression line does in explaining

the movement of the dependent variable.

$R^2$ provides us with a measure of how much of the movement in the dependent variable can be explained by the regression model.

Goodness of Fit: $R^2$

Goodness of Fit

Once a regression equation is estimated, we wish to determine the quality of the estimated equation, that is, its goodness of fit.

To do so, we use Total Sum of Squares (TSS),

Explained Sum of Squares (ESS), and Residual

Sum of Squares (RSS):

Goodness of Fit = $R^2$

Goodness of Fit

Decomposition of Variance

$R^2$
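The decomposition can be checked numerically. The sketch below assumes the usual definitions, TSS $= \sum (Y_i - \bar{Y})^2$, ESS $= \sum (\hat{Y}_i - \bar{Y})^2$, RSS $= \sum e_i^2$, and $R^2$ = ESS/TSS, for an OLS fit with an intercept. The data are hypothetical.

```python
# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates for the simple regression of y on x.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Total, explained, and residual sums of squares.
tss = sum((yi - y_bar) ** 2 for yi in y)
ess = sum((yh - y_bar) ** 2 for yh in y_hat)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r2 = ess / tss
print(tss, ess + rss, r2, 1 - rss / tss)
```

Because TSS = ESS + RSS holds for OLS with an intercept, ESS/TSS and 1 - RSS/TSS give the same $R^2$ up to rounding.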

$R^2$ and Adjusted $R^2$

$b_0 = -299.59$

$b_1 = 0.722$

Adjusted $R^2$ = 99.8%
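For reference, a small sketch of how adjusted $R^2$ is commonly computed from $R^2$, assuming the usual formula $1 - (1 - R^2)(n - 1)/(n - k - 1)$, where $n$ is the sample size and $k$ the number of explanatory variables excluding the intercept. The function name `adjusted_r2` and the input values are hypothetical and are not taken from the regression output above.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables
    (k excludes the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical inputs, not the values behind the 99.8% reported above.
r2 = 0.90
adj_one_regressor = adjusted_r2(r2, n=30, k=1)
adj_five_regressors = adjusted_r2(r2, n=30, k=5)
print(adj_one_regressor, adj_five_regressors)
```

Holding $R^2$ fixed, adding regressors lowers the adjusted value, which is the penalty for lost degrees of freedom.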

Are We Finished?

Does the SRF represent the PRF?

Do the coefficients reflect the parameters?

OLS Assumptions

1. The regression model is: (a) linear in the coefficients, and

(b) correctly specified with the right independent

variables

2. No explanatory variable is a perfect linear function of any

other explanatory variable(s) (no perfect multicollinearity)

3. No explanatory variable is correlated with the error term

4. No serial correlation

5. Zero population mean of error term

6. Homoskedasticity of error term

7. Normally distributed error term

Linear Regression

Linear in Parameters

$WAGE_{it} = \beta_0 + \beta_1 EDUC_{it} + \beta_2 TENURE_{it} + \beta_3 UNION_{it} + \varepsilon_{it}$

Linear in Variables

$Y = A K^{a} L^{b}$

Could be estimated using logarithms as

$\ln Y = \ln A + a \ln K + b \ln L$
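A sketch of the log transformation in practice: data are generated from a Cobb-Douglas function with hypothetical parameters and no error term, then the logged equation is estimated by OLS through the normal equations. The helpers `solve` and `ols` and all numeric values are assumptions made for this illustration.

```python
import math


def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """OLS with an intercept via the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(xr[i] * xr[j] for xr in X) for j in range(k)] for i in range(k)]
    xty = [sum(xr[i] * yi for xr, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical parameters and inputs, made up for illustration only.
A, a, b = 2.0, 0.3, 0.6
K = [1.0, 2.0, 4.0, 3.0, 5.0, 8.0]
L = [2.0, 1.0, 3.0, 6.0, 4.0, 5.0]
Y = [A * k ** a * l ** b for k, l in zip(K, L)]  # no error term, for clarity

# Regress ln Y on ln K and ln L: the model is linear in the coefficients.
ln_rows = [(math.log(k), math.log(l)) for k, l in zip(K, L)]
ln_y = [math.log(v) for v in Y]
ln_A_hat, a_hat, b_hat = ols(ln_rows, ln_y)
A_hat = math.exp(ln_A_hat)
print(A_hat, a_hat, b_hat)
```

With real data an error term would be present and the estimates would only approximate the true parameters; the noiseless case simply shows that the logged equation recovers them.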

Linearity

Other Example:

$Y_i = \beta_0 + \beta_1 X_i^2 + \varepsilon_i$

Transform: $X_i^* = X_i^2$

Thus: $Y_i = \beta_0 + \beta_1 X_i^* + \varepsilon_i$

No perfect multicollinearity among the explanatory variables.

Perfect multicollinearity arises when:

Two explanatory variables are really the same variable, or

One (or more) of them has zero variance, or

Two independent variables sum to equal a third, or

One variable is another variable with a constant added to or subtracted from it.
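To see why the last case above is a problem, the sketch below builds the cross-product matrix $X'X$ for an intercept plus two regressors and computes its determinant. When the second regressor is the first plus a constant, the determinant is zero, so the normal equations have no unique solution. The helper names and data are hypothetical.

```python
def cross_products(columns):
    """Return X'X for a design matrix given as a list of columns."""
    k = len(columns)
    return [[sum(u * v for u, v in zip(columns[i], columns[j]))
             for j in range(k)] for i in range(k)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Hypothetical data, made up for illustration only.
const = [1.0] * 5                        # intercept column
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_shifted = [v + 10.0 for v in x1]      # x1 with a constant added
x2_other = [2.0, 1.0, 4.0, 3.0, 6.0]     # not an exact linear function of x1

det_collinear = det3(cross_products([const, x1, x2_shifted]))
det_ok = det3(cross_products([const, x1, x2_other]))
print(det_collinear, det_ok)
```

A zero determinant means $X'X$ cannot be inverted, which is why OLS cannot separate the effects of perfectly collinear variables.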

Explanatory variables are uncorrelated with the error term.

If the observed values of the $X_i$ are correlated with the error term, then the estimated coefficients on the $X_i$ will be biased.

If $X$ and $\varepsilon$ are positively correlated, then the estimated coefficient on $X$ will tend to be higher than it would be if they were uncorrelated;

If $X$ and $\varepsilon$ are negatively correlated, then the estimated coefficient on $X$ will tend to be lower than it would be if they were uncorrelated.

Why? Because OLS will mistakenly attribute to $X$ the variation in $Y$ caused by $\varepsilon$.
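The direction of this bias can be illustrated with a small simulation. This is a sketch under assumed values: a true slope of 2, standard normal errors, and an $X$ built to be uncorrelated, positively correlated, or negatively correlated with the error term. The seed, sample size, and names are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope estimate b1 from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


random.seed(0)
n = 20000
beta0, beta1 = 1.0, 2.0  # hypothetical true parameters

eps = [random.gauss(0, 1) for _ in range(n)]
base = [random.gauss(0, 1) for _ in range(n)]

x_uncorr = base                                     # unrelated to eps
x_pos = [b + 0.5 * e for b, e in zip(base, eps)]    # positively correlated
x_neg = [b - 0.5 * e for b, e in zip(base, eps)]    # negatively correlated


def simulate(x):
    """Generate Y from the true model and return the OLS slope."""
    y = [beta0 + beta1 * xi + e for xi, e in zip(x, eps)]
    return ols_slope(x, y)


slope_uncorr = simulate(x_uncorr)
slope_pos = simulate(x_pos)
slope_neg = simulate(x_neg)
print(slope_uncorr, slope_pos, slope_neg)
```

In this setup the slope estimate lands near 2 only when $X$ is unrelated to the error term; positive correlation pushes it above 2 and negative correlation pushes it below.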

No serial correlation

Homoscedasticity

Homoskedasticity

Assumption 7 states that the observations of the error term are drawn from a distribution that is normal (i.e. bell-shaped and symmetric).

This assumption of normality is not required for OLS estimation. However, it is useful for hypothesis testing. Without the normality assumption, most of our hypothesis tests would not be strictly valid, especially in small samples.

Model

Assumption 7: The error term is normally

distributed.

Needed for Statistical Inference

Statistical Inference

Population: the entire group of items that

interests us.

Sample: the part of the population that we

actually observe.

Statistical inference: using the sample to draw conclusions about the characteristics of the population from which the sample came.

We use samples because it is often not practical or possible to consider the entire population.

But each time we use a different sample, we will obtain different estimates!

Sampling Distributions

A sample statistic, such as the sample mean or a

regression coefficient, is a random variable that

depends on which particular observations

happen to be selected for the random sample

Sampling error is the difference between the

value of one particular sample mean and the

average of the means of all possible samples of

this size; this error is not due to a poorly designed

experiment or sloppy procedure. It is the

inevitable result of the fact that the observations

in our sample are chosen by chance.
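A small simulation can make the idea of a sampling distribution concrete. The sketch below assumes a hypothetical population model with intercept 1 and slope 2, draws many random samples of 30 observations, and records the OLS slope from each. The seed, sample size, and number of replications are arbitrary choices.

```python
import random
import statistics

random.seed(1)
BETA0, BETA1 = 1.0, 2.0  # hypothetical population parameters


def sample_slope(n):
    """Draw one random sample of size n and return its OLS slope."""
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [BETA0 + BETA1 * xi + random.gauss(0, 1) for xi in x]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Each sample gives a different estimate; together they trace out
# the sampling distribution of the slope estimator.
slopes = [sample_slope(30) for _ in range(2000)]
mean_slope = statistics.mean(slopes)
sd_slope = statistics.stdev(slopes)
print(min(slopes), max(slopes), mean_slope, sd_slope)
```

The spread between the smallest and largest slope is sampling error at work, while the average of the estimates sits close to the assumed population slope.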

OLS estimators are BLUE if the assumptions of the Gauss-Markov Theorem are satisfied.

B: BEST, as the OLS estimators have the minimum possible variance among linear unbiased estimators;

L: LINEAR

U: UNBIASED

E: ESTIMATOR.

Finally, we can say that estimators that are BLUE are efficient estimators. Why? Because such an estimator is unbiased and has the minimum possible variance among all linear unbiased estimators.
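The "best" property can be illustrated by comparing OLS with another linear unbiased estimator of the slope. The sketch below uses a hypothetical alternative, the slope of the line through the first and last observations, with fixed $X$ values and errors that are independent with constant variance. All names and values are assumptions for illustration.

```python
import random
import statistics

random.seed(2)
BETA0, BETA1 = 1.0, 2.0                  # hypothetical true parameters
X = [float(i) for i in range(1, 11)]     # fixed regressor values 1..10
X_BAR = sum(X) / len(X)
S_XX = sum((x - X_BAR) ** 2 for x in X)


def ols_slope(y):
    """OLS slope: a weighted sum of the Y values, hence linear in Y."""
    y_bar = sum(y) / len(y)
    return sum((x - X_BAR) * (yi - y_bar) for x, yi in zip(X, y)) / S_XX


def endpoint_slope(y):
    """Alternative linear unbiased estimator: uses only two points."""
    return (y[-1] - y[0]) / (X[-1] - X[0])


ols_draws, endpoint_draws = [], []
for _ in range(5000):
    # Errors are independent with constant variance, as assumed above.
    y = [BETA0 + BETA1 * x + random.gauss(0, 1) for x in X]
    ols_draws.append(ols_slope(y))
    endpoint_draws.append(endpoint_slope(y))

mean_ols = statistics.mean(ols_draws)
mean_endpoint = statistics.mean(endpoint_draws)
var_ols = statistics.variance(ols_draws)
var_endpoint = statistics.variance(endpoint_draws)
print(mean_ols, mean_endpoint, var_ols, var_endpoint)
```

Both estimators center on the true slope, but the OLS estimates are more tightly clustered, which is what "best" means here.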

BLUE

