
ECON1203 Statistics

Chapter 16 Simple Linear Regression & Correlation

Contents
1. Model
2. Estimating the Coefficients
3. Error Variable: Required Conditions
4. Assessing the Model
5. Using the Regression Equation
6. Regression Diagnostics (Part 1)

Introduction

Regression analysis predicts one variable based on other variables.
The dependent variable ( Y ) is the variable to be forecast.
The statistics practitioner believes it is related to k independent variables ( X1, X2, …, Xk ).
Correlation analysis determines whether a relationship exists:
o Scatter diagram
o Coefficient of correlation
o Covariance

16.1 Model

Deterministic models determine the dependent variable exactly from the independent variables. They are unrealistic because there may be other influencing variables.
Probabilistic models include the randomness of real life (e.g. an error variable).
The error variable ( ε ) is the difference between the actual and the estimated dependent variable.
The first-order linear model (or simple linear regression model) is a straight-line model with one independent variable:
o y = β₀ + β₁x + ε
  y = dependent variable
  x = independent variable
  β₀ = y-intercept
  β₁ = slope of the line
  ε = error variable

16.2 Estimating the Coefficients

The least squares line ( ŷ = b₀ + b₁x ) uses the least squares method to minimise the sum of squared deviations ( Σᵢ (yᵢ − ŷᵢ)² , summing over i = 1, …, n ).
The sum of squares for error (SSE) is the minimised sum of squared deviations.
Residuals are the deviations between the actual data points and the line:
o eᵢ = yᵢ − ŷᵢ
Least squares line coefficients:
o b₁ = s_xy / s_x²
o b₀ = ȳ − b₁x̄
  where
  s_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
  s_x² = Σᵢ (xᵢ − x̄)² / (n − 1)
  x̄ = Σᵢ xᵢ / n
  ȳ = Σᵢ yᵢ / n
Shortcuts:
o s_xy = [ Σᵢ xᵢyᵢ − ( Σᵢ xᵢ )( Σᵢ yᵢ ) / n ] / (n − 1)
o s_x² = [ Σᵢ xᵢ² − ( Σᵢ xᵢ )² / n ] / (n − 1)
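The coefficient formulas above can be sketched in a few lines of Python. The data set here is made up for illustration; it does not come from the notes.

```python
# Minimal sketch of the least squares formulas, on made-up data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample covariance and sample variance of x, as defined above
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)

b1 = s_xy / s_x2          # slope
b0 = y_bar - b1 * x_bar   # y-intercept

# The shortcut formula gives the same covariance without centring first
s_xy_shortcut = (sum(x * y for x, y in zip(xs, ys))
                 - sum(xs) * sum(ys) / n) / (n - 1)
```

For this data the slope works out to b₁ = 1.99 and the intercept to b₀ = 0.05, and the shortcut agrees with the definitional formula.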

Excel:
o Have two columns of data: one for the dependent variable; the other for the independent variable.
o Click Data, Data Analysis, and Regression.
o Specify the Input Y Range and the Input X Range.

16.3 Error Variable: Required Conditions (Assumptions of the Classical Linear Regression Model)

1. The model is linear in the coefficients ( β₀ and β₁ ).
2. The observed pairs ( xᵢ, yᵢ ) are randomly sampled.
3. There is sample variation in xᵢ; the xᵢ are not all equal.
4. The mean of ε is 0, regardless of x: E( εᵢ | xᵢ ) = 0.
   a. Therefore, ε and x are uncorrelated.
5. The variance of εᵢ is a constant: Var( εᵢ ) = σ².
   a. In reality this is not necessarily true (e.g. higher income may increase the variance of expenditure, because higher-income households have a greater range of choices).
6. The error variables are uncorrelated: Cov( εᵢ, εⱼ ) = 0 for i ≠ j.
7. The error variables are normally distributed: εᵢ ~ N( 0, σ² ).

16.4 Assessing the Model

There are three ways to assess how well the linear model fits the data:
1. The standard error of estimate
2. The t-test of the slope
3. The coefficient of determination

Sum of Squares for Error (SSE)

SSE = Σᵢ (yᵢ − ŷᵢ)² = (n − 1)( s_y² − s_xy² / s_x² )

Standard Error of Estimate ( s_ε )

s_ε = √( SSE / (n − 2) ); it is usually compared with ȳ.
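A short sketch of SSE and the standard error of estimate, continuing the same made-up data set (the b₀ and b₁ values are the coefficients computed for it, not numbers from the notes):

```python
import math

# SSE and standard error of estimate for a fitted line (illustrative data).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
b1, b0 = 1.99, 0.05  # least squares coefficients for this data

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)   # sum of squares for error
s_e = math.sqrt(sse / (n - 2))         # standard error of estimate
```

A small s_ε relative to ȳ suggests the line fits the data closely.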

Testing the Slope

We can use hypothesis testing to infer the population slope ( β₁ ) from the sample slope ( b₁ ).
If β₁ = 0, there is no linear relationship (but there may be, e.g., a quadratic relationship).
The sample slope ( b₁ ) is an unbiased estimator of the population slope ( E( b₁ ) = β₁ ). It is also consistent, because its estimated standard error ( s_b₁ = s_ε / √( (n − 1) s_x² ) ) decreases as n increases.
Test statistic for β₁: t = ( b₁ − β₁ ) / s_b₁ [ where ν = n − 2 ]
Confidence interval estimator of β₁: b₁ ± t_{α/2} s_b₁ [ where ν = n − 2 ]
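The slope test can be sketched as follows, continuing the illustrative example (n = 5, so ν = 3 and the two-tailed 5% critical value is t_{0.025, 3} = 3.182; the other numbers carry over from the earlier sketch and are not from the notes):

```python
import math

# Slope test under H0: beta1 = 0, with nu = n - 2 (illustrative numbers).
n = 5
b1 = 1.99            # sample slope
s_e = 0.18886        # standard error of estimate (approx.)
s_x2 = 2.5           # sample variance of x

s_b1 = s_e / math.sqrt((n - 1) * s_x2)   # estimated standard error of b1
t = (b1 - 0) / s_b1                      # test statistic under H0: beta1 = 0

# 95% confidence interval estimator of beta1, using t_{0.025, 3} = 3.182
ci = (b1 - 3.182 * s_b1, b1 + 3.182 * s_b1)
```

Here t far exceeds 3.182, so H₀: β₁ = 0 would be rejected at the 5% level.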

Coefficient of Determination

Coefficient of determination: R² = s_xy² / ( s_x² s_y² ) = 1 − SSE / Σᵢ (yᵢ − ȳ)²
R² is the proportion of the variation in y explained by the model:
  R² = 1 − SSE / Variation in y = Explained variation / Variation in y
Decomposing the variation in y:
  ( yᵢ − ȳ ) = ( yᵢ − ŷᵢ ) + ( ŷᵢ − ȳ )   [ unexplained residual + explained deviation ]
  Σᵢ ( yᵢ − ȳ )² = Σᵢ ( yᵢ − ŷᵢ )² + Σᵢ ( ŷᵢ − ȳ )²
  Variation in y = SSE + SSR
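The decomposition above can be verified numerically on the same made-up data set (b₀ and b₁ are the coefficients computed earlier for it):

```python
# Check: total variation in y = SSE + SSR, and R^2 = 1 - SSE / total.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, b0 = 1.99, 0.05
y_bar = sum(ys) / len(ys)
y_hat = [b0 + b1 * x for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)              # variation in y
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # unexplained (SSE)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained (SSR)
r2 = 1 - sse / total
```

For this data R² is close to 1, i.e. almost all the variation in y is explained by x.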

Coefficient of Correlation

We can use hypothesis testing to infer the population coefficient of correlation ( ρ ) from the sample coefficient of correlation ( r ).
Sample coefficient of correlation: r = s_xy / ( s_x s_y )
Test statistic for ρ: t = r √( (n − 2) / (1 − r²) ) [ where ν = n − 2 and the variables are bivariate normally distributed ]
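A sketch of r and its test statistic, again with illustrative numbers carried over from the small example (s_xy, s_x² and the variation in y computed earlier; none of these come from the notes):

```python
import math

# r = s_xy / (s_x * s_y) and t = r * sqrt((n - 2) / (1 - r^2)), nu = n - 2.
n = 5
s_xy = 4.975                        # sample covariance (illustrative)
s_x = math.sqrt(2.5)                # sample std dev of x
s_y = math.sqrt(39.708 / (n - 1))   # sample std dev of y

r = s_xy / (s_x * s_y)
t = r * math.sqrt((n - 2) / (1 - r ** 2))
```

With ν = 3, this t would again be compared against t_{0.025, 3} = 3.182; note that this test and the slope t-test are equivalent in simple regression.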

16.5 Using the Regression Equation

ŷ = b₀ + b₁x_g is a point estimator.
There are two interval estimators:
1. Prediction interval (for a particular value of y):
   ŷ ± t_{α/2, n−2} s_ε √( 1 + 1/n + ( x_g − x̄ )² / ( (n − 1) s_x² ) )
2. Confidence interval estimator of the expected value of y:
   ŷ ± t_{α/2, n−2} s_ε √( 1/n + ( x_g − x̄ )² / ( (n − 1) s_x² ) )
The farther the given value x_g is from x̄, the greater the estimated error, through the term ( x_g − x̄ )² / ( (n − 1) s_x² ).
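The two interval estimators can be sketched for a given x_g, continuing the illustrative example (all constants below carry over from the earlier sketch; x_g = 4 is an arbitrary choice):

```python
import math

# Prediction interval vs. confidence interval for E(y) at x_g (illustrative).
n = 5
b1, b0 = 1.99, 0.05
x_bar, s_x2 = 3.0, 2.5
s_e = 0.18886
t_crit = 3.182   # t_{0.025, 3}

x_g = 4.0
y_hat = b0 + b1 * x_g
leverage = 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x2)

pred_half = t_crit * s_e * math.sqrt(1 + leverage)   # prediction interval half-width
ci_half = t_crit * s_e * math.sqrt(leverage)         # CI for E(y) half-width
```

The prediction interval is always wider than the confidence interval for the expected value, because predicting an individual y must also absorb the error variable ε.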

16.6 Regression Diagnostics (Part 1)

Residual analysis

Standard deviation of the i-th residual: s_rᵢ = s_ε √( 1 − hᵢ ),
where hᵢ = 1/n + ( xᵢ − x̄ )² / ( (n − 1) s_x² ).
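The residual standard deviations can be sketched like this, on the same made-up x values (s_ε carries over from the earlier sketch):

```python
import math

# Leverage h_i and standard deviation of each residual (illustrative data).
xs = [1, 2, 3, 4, 5]
n = len(xs)
x_bar = sum(xs) / n
s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
s_e = 0.18886   # standard error of estimate (approx.)

h = [1 / n + (x - x_bar) ** 2 / ((n - 1) * s_x2) for x in xs]
s_resid = [s_e * math.sqrt(1 - hi) for hi in h]
# The leverages sum to 2, the number of coefficients (intercept + slope).
```

Points far from x̄ have high leverage, so their residuals have a smaller standard deviation: the line is pulled towards them.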

Normality

The residuals should be normally distributed.

Homoscedasticity

The variance of the error variable should be constant.

Independence of the error variable

The error variable should be independent.

Outliers
Outliers may be:
1. Recording errors
2. Points that should not have been included in the sample
3. Valid and should belong to the sample

Influential observations

Some points are influential in determining the least squares line: removing such a point can change the fitted line substantially.

Procedure
1. Develop a model that has a theoretical basis; find an independent variable
that you believe is linearly related to the dependent variable.
2. Gather data for the two variables from (preferably) a controlled
experiment, or observational data.
3. Draw a scatter diagram. Determine whether a linear model is appropriate.
Identify outliers and influential observations.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions:
a. Is the error variable normal?
b. Is the variance constant?
c. Are the errors independent?

6. Assess the model's fit:
   a. Compute the standard error of estimate.
   b. Test β₁ (or ρ) to determine whether there is a linear relationship.
   c. Compute the coefficient of determination.
7. If the model fits the data, use the regression equation to:
   a. Predict a particular value of the dependent variable
   b. Estimate its mean
