
Applied Business Forecasting and Planning
Multiple Regression Analysis

Introduction

In simple linear regression we studied the relationship between one explanatory variable and one response variable. Now we look at situations where several explanatory variables work together to explain the response.

Introduction

Following our principles of data analysis, we look first at each variable separately, then at relationships among the variables. We look at the distribution of each variable to be used in multiple regression to determine whether there are any unusual patterns that may be important in building our regression model.

Multiple Regression

Example: In a study of direct operating cost, Y, for 67 branch offices of a consumer finance chain, four independent variables were considered:

X1: average size of loan outstanding during the year,
X2: average number of loans outstanding,
X3: total number of new loan applications processed, and
X4: office salary scale index.

The model for this example is

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$

Formal Statement of the Model

General regression model:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

$\beta_0, \beta_1, \ldots, \beta_k$ are parameters.
$x_1, x_2, \ldots, x_k$ are known constants.
$\varepsilon$, the error terms, are independent $N(0, \sigma^2)$.

Estimating the parameters of the model

The values of the regression parameters $\beta_i$ are not known; we estimate them from data. As in the simple linear regression case, we use the least-squares method to fit a linear function

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$

to the data. The least-squares method chooses the $b$'s that make the sum of squares of the residuals as small as possible.

Estimating the parameters of the model

The least-squares estimates are the values that minimize the quantity

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Since the formulas for the least-squares estimates are complicated and hand calculation is out of the question, we are content to understand the least-squares principle and let software do the computations.
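As a concrete illustration of the least-squares principle, here is a minimal sketch in Python; the data and variable names are made up, not from the cost example above.

```python
# Minimal least-squares sketch: build a design matrix with a leading column
# of ones for b0, then let np.linalg.lstsq choose the b's that minimize the
# residual sum of squares.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

X = np.column_stack([np.ones_like(x1), x1, x2])  # columns: 1, x1, x2
b, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)
print("residual sum of squares =", rss)
```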

Estimating the parameters of the model

The estimate of $\beta_i$ is $b_i$, and it indicates the change in the mean response per unit increase in $x_i$ when the rest of the independent variables in the model are held constant. The parameters $\beta_i$ are frequently called partial regression coefficients because they reflect the partial effect of one independent variable when the rest of the independent variables are included in the model and are held constant.

Estimating the parameters of the model

The observed variability of the responses about this fitted model is measured by the variance

$s^2 = \frac{1}{n-k-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

and the regression standard error

$s = \sqrt{s^2}$

Estimating the parameters of the model

In the model, $\sigma^2$ and $\sigma$ measure the variability of the responses about the population regression equation. It is natural to estimate $\sigma^2$ by $s^2$ and $\sigma$ by $s$.

Analysis of Variance Table

The basic ideas of the regression ANOVA table are the same in simple and multiple regression. The sum of squares decomposition and the associated degrees of freedom are:

$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$

SST = SSR + SSE

df: $n - 1 = k + (n - k - 1)$

Analysis of Variance Table

Source       Sum of Squares   df        Mean Square         F-test
Regression   SSR              k         MSR = SSR/k         MSR/MSE
Error        SSE              n-k-1     MSE = SSE/(n-k-1)
Total        SST              n-1

F-test for the overall fit of the model

To test the statistical significance of the regression relation between the response variable $y$ and the set of variables $x_1, \ldots, x_k$, i.e. to choose between the alternatives:

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_a:$ not all $\beta_i$ $(i = 1, \ldots, k)$ equal zero

we use the test statistic:

$F = \frac{MSR}{MSE}$

F-test for the overall fit of the model

The decision rule at significance level $\alpha$ is: reject $H_0$ if

$F \geq F(\alpha;\, k,\, n-k-1)$

where the critical value $F(\alpha;\, k,\, n-k-1)$ can be found from an F-table.

The existence of a regression relation by itself does not assure that useful predictions can be made by using it. Note that when $k = 1$, this test reduces to the F-test for testing in simple linear regression whether or not $\beta_1 = 0$.
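In practice the F statistic is read off the software output. A sketch of how it might be obtained with statsmodels, on simulated data (all names and numbers below are hypothetical):

```python
# Sketch: overall F-test from a statsmodels OLS fit on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))                      # k = 3 explanatory variables
y = 2 + X @ np.array([1.5, 0.0, -0.8]) + rng.normal(scale=0.5, size=30)

model = sm.OLS(y, sm.add_constant(X)).fit()
print("F =", model.fvalue)           # MSR / MSE
print("p-value =", model.f_pvalue)   # reject H0 when p < alpha
```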

Interval estimation of $\beta_i$

For our regression model,

$\frac{b_i - \beta_i}{s(b_i)}$

has a t-distribution with $n - k - 1$ degrees of freedom. Therefore, an interval estimate for $\beta_i$ with confidence coefficient $1 - \alpha$ is:

$b_i \pm t(\alpha/2;\, n-k-1)\, s(b_i)$

where

$s(b_i) = \sqrt{\frac{MSE}{\sum (x - \bar{x})^2}}$

Significance tests for $\beta_i$

To test:

$H_0: \beta_i = 0$
$H_a: \beta_i \neq 0$

we may use the test statistic:

$t = \frac{b_i}{s(b_i)}$

Reject $H_0$ if

$t \geq t(\alpha/2;\, n-k-1)$  or  $t \leq -t(\alpha/2;\, n-k-1)$
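A sketch of how the individual t statistics, p-values, and the interval estimates from the previous slide can be obtained with statsmodels (simulated, hypothetical data):

```python
# Sketch: marginal t-tests and confidence intervals for the coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = 2 + X @ np.array([1.5, 0.0, -0.8]) + rng.normal(scale=0.5, size=30)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.tvalues)               # t = b_i / s(b_i), df = n - k - 1
print(fit.pvalues)               # two-sided p-values for H0: beta_i = 0
print(fit.conf_int(alpha=0.05))  # b_i +/- t(alpha/2; n-k-1) * s(b_i)
```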

Multiple Regression Model Building

Often we have many explanatory variables, and our goal is to use these to explain the variation in the response variable. A model using just a few of the variables often predicts about as well as the model using all the explanatory variables.

Multiple Regression Model Building

We may find that the reciprocal of a variable is a better choice than the variable itself, or that including the square of an explanatory variable improves prediction. We may find that the effect of one explanatory variable depends upon the value of another explanatory variable. We account for this situation by including interaction terms.

Multiple Regression Model Building

The simplest way to construct an interaction term is to multiply the two explanatory variables together, as in the sketch below.
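A minimal sketch of this construction in pandas (the column names are hypothetical):

```python
# Sketch: an interaction term is just the elementwise product of two columns.
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 9, 15], "ads": [100, 80, 120, 60]})
df["price_x_ads"] = df["price"] * df["ads"]   # interaction regressor
print(df)
```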
How can we find a good model?

Selecting the Best Regression Equation

After a lengthy list of potentially useful independent variables has been compiled, some of the independent variables can be screened out. An independent variable

may not be fundamental to the problem,
may be subject to large measurement error, or
may effectively duplicate another independent variable in the list.

Selecting the Best Regression Equation

Once the investigator has tentatively decided upon the functional forms of the regression relations (linear, quadratic, etc.), the next step is to obtain a subset of the explanatory variables (x) that best explains the variability in the response variable y.

Selecting the Best Regression Equation

An automatic search procedure that sequentially develops the subset of explanatory variables to be included in the regression model is called a stepwise procedure. It was developed to economize on computational effort. It ends with the identification of a single regression model as best.
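A rough sketch of one such automatic search, forward stepwise selection scored by adjusted R-squared, is shown below. Real stepwise routines usually score candidates with partial F or t tests, so this is only an illustration of the idea (data simulated, names hypothetical):

```python
# Sketch: forward stepwise selection. At each step, add the candidate that
# gives the largest increase in adjusted R-squared; stop when nothing helps.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(y, X):
    chosen, remaining = [], list(X.columns)
    best_adj_r2 = -np.inf
    while remaining:
        scores = []
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(X[chosen + [var]])).fit()
            scores.append((fit.rsquared_adj, var))
        adj_r2, var = max(scores)
        if adj_r2 <= best_adj_r2:
            break                       # no candidate improves the fit
        best_adj_r2 = adj_r2
        chosen.append(var)
        remaining.remove(var)
    return chosen

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(50, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1 + 2 * X["x1"] - 3 * X["x3"] + rng.normal(scale=0.5, size=50)
print(forward_stepwise(y, X))           # typically ['x1', 'x3']
```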

Example: Sales Forecasting

Multiple regression is a popular technique for predicting product sales with the help of other variables that are likely to have a bearing on sales.

Example: The growth of cable television has created vast new potential in the home entertainment business. The following table gives the values of several variables measured in a random sample of 20 local television stations which offer their programming to cable subscribers. A TV industry analyst wants to build a statistical model for predicting the number of subscribers that a cable station can expect.

Example: Sales Forecasting

Y  = number of cable subscribers (SUBSCRIB)
X1 = advertising rate which the station charges local advertisers for one minute of prime-time space (ADRATE)
X2 = kilowatt power of the station's non-cable signal (KILOWATT)
X3 = number of families living in the station's area of dominant influence (ADI), a geographical division of radio and TV audiences (APIPOP)
X4 = number of competing stations in the ADI (COMPETE)

Example: Sales Forecasting

The sample data are fitted by a multiple regression model using Excel. The marginal t-test provides a way of choosing the variables for inclusion in the equation. The fitted model is

$SUBSCRIBE = \beta_0 + \beta_1 ADRATE + \beta_2 APIPOP + \beta_3 COMPETE + \beta_4 SIGNAL$

Example: Sales Forecasting

Excel summary output:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.884267744
R Square            0.781929444
Adjusted R Square   0.723777295
Standard Error      142.9354188
Observations        20

ANOVA
             df   SS            MS            F             Significance F
Regression    4   1098857.84    274714.4601   13.44626923   7.52E-05
Residual     15   306458.0092   20430.53395
Total        19   1405315.85

            Coefficients   Standard Error   t Stat         P-value       Lower 95%   Upper 95%
Intercept   51.42007002    98.97458277      0.51952803     0.610973806   -159.539    262.3795
AD_Rate     -0.267196347   0.081055107      -3.296477624   0.004894126   -0.43996    -0.09443
Signal      -0.020105139   0.045184758      -0.444954014   0.662706578   -0.11641    0.076204
APIPOP      0.440333955    0.135200486      3.256896248    0.005307766   0.152161    0.728507
Compete     16.230071      26.47854322      0.61295181     0.549089662   -40.2076    72.66778

Example: Sales Forecasting

Do we need all four variables in the model? Based on the partial t-test, the variables SIGNAL and COMPETE are the least significant variables in our model. Let's drop the least significant variables one at a time, as sketched below.
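The "drop one at a time" idea can be sketched as backward elimination on p-values; this is an illustration of the procedure, not necessarily how the slides' Excel runs were produced:

```python
# Sketch: backward elimination -- refit after dropping the variable with the
# largest p-value until every remaining p-value is at or below alpha.
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    X = X.copy()
    while True:
        fit = sm.OLS(y, sm.add_constant(X)).fit()
        pvals = fit.pvalues.drop("const")        # ignore the intercept
        if pvals.empty or pvals.max() <= alpha:
            return fit
        X = X.drop(columns=[pvals.idxmax()])     # drop the worst, refit

# hypothetical usage, assuming a DataFrame df holding the station data:
# final = backward_eliminate(df["SUBSCRIB"],
#                            df[["ADRATE", "SIGNAL", "APIPOP", "COMPETE"]])
```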

Example: Sales Forecasting

Excel summary output:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.882638739
R Square            0.779051144
Adjusted R Square   0.737623233
Standard Error      139.3069743
Observations        20

ANOVA
             df   SS            MS           F             Significance F
Regression    3   1094812.92    364937.64    18.80498277   1.69966E-05
Residual     16   310502.9296   19406.4331
Total        19   1405315.85

            Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept   51.31610447    96.4618242       0.531983558   0.602046756   -153.1737817   255.806
AD_Rate     -0.259538026   0.077195983      -3.36206646   0.003965102   -0.423186162   -0.09589
APIPOP      0.433505145    0.130916687      3.311305499   0.004412929   0.15597423     0.711036
Compete     13.92154404    25.30614013      0.550125146   0.589831583   -39.72506442   67.56815

Example: Sales Forecasting

The variable COMPETE is the next variable to drop.

Example: Sales Forecasting

Excel summary output:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8802681
R Square            0.774871928
Adjusted R Square   0.748386273
Standard Error      136.4197776
Observations        20

ANOVA
             df   SS            MS            F            Significance F
Regression    2   1088939.802   544469.901    29.2562866   3.13078E-06
Residual     17   316376.0474   18610.35573
Total        19   1405315.85

            Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   96.28121395    50.16415506      1.919322948    0.07188916    -9.556049653   202.1184776
AD_Rate     -0.254280696   0.075014548      -3.389751739   0.003484198   -0.41254778    -0.096013612
APIPOP      0.495481252    0.065306012      7.587069489    7.45293E-07   0.357697418    0.633265086

Example: Sales Forecasting

All the variables in the model are statistically significant; therefore our final model is:

Final model: $SUBSCRIBE = 96.28 - 0.25\,ADRATE + 0.495\,APIPOP$

Interpreting the Final Model

What is the interpretation of the estimated parameters? Is the association positive or negative? Does this make sense intuitively, based on what the data represent? What other variables could be confounders? Are there other analyses that you might consider doing? What new questions are raised?

Multicollinearity

In multiple regression analysis, one is often concerned with the nature and significance of the relations between the explanatory variables and the response variable. Questions that are frequently asked are:

What is the relative importance of the effects of the different independent variables?
What is the magnitude of the effect of a given independent variable on the dependent variable?

Multicollinearity

Can any independent variable be dropped from the model because it has little or no effect on the dependent variable? Should any independent variables not yet included in the model be considered for possible inclusion?

Simple answers can be given to these questions if

the independent variables in the model are uncorrelated among themselves, and
they are uncorrelated with any other independent variables that are related to the dependent variable but omitted from the model.

Multicollinearity

When the independent variables are correlated among themselves, multicollinearity (or collinearity) among them is said to exist. In many non-experimental situations in business, economics, and the social and biological sciences, the independent variables tend to be correlated among themselves. For example, in a regression of family food expenditures on the variables family income, family savings, and the age of the head of household, the explanatory variables will be correlated among themselves.

Multicollinearity

Further, the explanatory variables will also be correlated with other socioeconomic variables not included in the model that do affect family food expenditures, such as family size.

Multicollinearity

Some key problems that typically arise when the explanatory variables being considered for the regression model are highly correlated among themselves are:

1. Adding or deleting an explanatory variable changes the regression coefficients.
2. The estimated standard deviations of the regression coefficients become large when the explanatory variables in the regression model are highly correlated with each other.
3. The estimated regression coefficients individually may not be statistically significant even though a definite statistical relation exists between the response variable and the set of explanatory variables.

Multicollinearity Diagnostics

A widely used formal method of detecting the presence of multicollinearity is the variance inflation factor (VIF). It measures how much the variances of the estimated regression coefficients are inflated as compared to when the independent variables are not linearly related:

$VIF_j = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, \ldots, k$

$R_j^2$ is the coefficient of determination from the regression of the jth independent variable on the remaining k-1 independent variables.
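A sketch of the same computation with statsmodels, on simulated data with deliberate collinearity (names and numbers are hypothetical):

```python
# Sketch: variance inflation factors via statsmodels. The helper regresses
# column j of the design matrix on the remaining columns and returns
# 1 / (1 - R_j^2).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.3, size=100)   # deliberately collinear with x1
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for j, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, j))
```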

Multicollinearity Diagnostics

A VIF near 1 suggests that multicollinearity is not a problem for the independent variable: its estimated coefficient and associated t value will not change much as the other independent variables are added to or deleted from the regression equation.

A VIF much greater than 1 indicates the presence of multicollinearity. A maximum VIF value in excess of 10 is often taken as an indication that multicollinearity may be unduly influencing the least-squares estimates: the estimated coefficient attached to the variable is unstable, and its associated t statistic may change considerably as the other independent variables are added or deleted.

Multicollinearity Diagnostics

The simple correlation coefficients between all pairs of explanatory variables (i.e., $X_1, X_2, \ldots, X_k$) are helpful in selecting appropriate explanatory variables for a regression model and are also critical for examining multicollinearity. While a correlation very close to +1 or -1 does suggest multicollinearity, the absence of high correlations between pairs of explanatory variables does not imply that multicollinearity is absent (unless there are only two explanatory variables).

Example: Sales Forecasting

Pearson correlation coefficients, N = 20; the p-value for H0: rho = 0 is shown in parentheses beneath each r.

           SUBSCRIB    ADRATE      KILOWATT    APIPOP      COMPETE
SUBSCRIB   1.00000     -0.02848    0.44762     0.90447     0.79832
                       (0.9051)    (0.0478)    (<.0001)    (<.0001)
ADRATE     -0.02848    1.00000     -0.01021    0.32512     0.34147
           (0.9051)                (0.9659)    (0.1619)    (0.1406)
KILOWATT   0.44762     -0.01021    1.00000     0.45303     0.46895
           (0.0478)    (0.9659)                (0.0449)    (0.0370)
APIPOP     0.90447     0.32512     0.45303     1.00000     0.87592
           (<.0001)    (0.1619)    (0.0449)                (<.0001)
COMPETE    0.79832     0.34147     0.46895     0.87592     1.00000
           (<.0001)    (0.1406)    (0.0370)    (<.0001)

Example: Sales Forecasting

$SUBSCRIBE = 51.42 - 0.27\,ADRATE - 0.02\,SIGNAL + 0.44\,APIPOP + 16.23\,COMPETE$

$SUBSCRIBE = 51.32 - 0.26\,ADRATE + 0.43\,APIPOP + 13.92\,COMPETE$

$SUBSCRIBE = 96.28 - 0.25\,ADRATE + 0.495\,APIPOP$

Example: Sales Forecasting

VIF calculation: fit the model

$APIPOP = \beta_0 + \beta_1 SIGNAL + \beta_2 ADRATE + \beta_3 COMPETE$

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.878054
R Square            0.770978
Adjusted R Square   0.728036
Standard Error      264.3027
Observations        20

ANOVA
             df   SS        MS        F         Significance F
Regression    3   3762601   1254200   17.9541   2.25472E-05
Residual     16   1117695   69855.92
Total        19   4880295

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   -472.685       139.7492         -3.38238   0.003799   -768.9402258   -176.43
Compete     159.8413       28.29157         5.649786   3.62E-05   99.86587622    219.8168
ADRATE      0.048173       0.149395         0.322455   0.751283   -0.268529713   0.364876
Signal      0.037937       0.083011         0.457012   0.653806   -0.138038952   0.213913

Example: Sales Forecasting

Fit the model

$COMPETE = \beta_0 + \beta_1 ADRATE + \beta_2 APIPOP + \beta_3 SIGNAL$

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.882936
R Square            0.779575
Adjusted R Square   0.738246
Standard Error      1.34954
Observations        20

ANOVA
             df   SS         MS         F          Significance F
Regression    3   103.0599   34.35329   18.86239   1.66815E-05
Residual     16   29.14013   1.821258
Total        19   132.2

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   3.10416        0.520589         5.96278    1.99E-05   2.000559786    4.20776
ADRATE      0.000491       0.000755         0.649331   0.525337   -0.001110874   0.002092
Signal      0.000334       0.000418         0.799258   0.435846   -0.000552489   0.001221
APIPOP      0.004167       0.000738         5.649786   3.62E-05   0.002603667    0.005731

Example: Sales Forecasting

Fit the model

$SIGNAL = \beta_0 + \beta_1 ADRATE + \beta_2 APIPOP + \beta_3 COMPETE$

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.512244
R Square            0.262394
Adjusted R Square   0.124092
Standard Error      790.8387
Observations        20

ANOVA
             df   SS         MS        F          Significance F
Regression    3   3559789    1186596   1.897261   0.170774675
Residual     16   10006813   625425.8
Total        19   13566602

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   5.171093       547.6089         0.009443   0.992582   -1155.707711   1166.05
APIPOP      0.339655       0.743207         0.457012   0.653806   -1.235874129   1.915184
Compete     114.8227       143.6617         0.799258   0.435846   -189.7263711   419.3718
ADRATE      -0.38091       0.438238         -0.86919   0.397593   -1.309935875   0.548109

Example: Sales Forecasting

Fit the model

$ADRATE = \beta_0 + \beta_1 SIGNAL + \beta_2 APIPOP + \beta_3 COMPETE$

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.399084
R Square            0.159268
Adjusted R Square   0.001631
Standard Error      440.8588
Observations        20

ANOVA
             df   SS         MS         F          Significance F
Regression    3   589101.7   196367.2   1.010346   0.413876018
Residual     16   3109703    194356.5
Total        19   3698805

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   253.7304       298.6063         0.849716   0.408018   -379.2865355   886.7474
Signal      -0.11837       0.136186         -0.86919   0.397593   -0.407073832   0.170329
APIPOP      0.134029       0.415653         0.322455   0.751283   -0.747116077   1.015175
Compete     52.3446        80.61309         0.649331   0.525337   -118.5474784   223.2367

Example: Sales Forecasting

VIF calculation results:

Variable   R-Squared   VIF
ADRATE     0.159268    1.19
COMPETE    0.779575    4.54
SIGNAL     0.262394    1.36
APIPOP     0.770978    4.36

There is no significant multicollinearity.

Qualitative Independent Variables

Many variables of interest in business, economics, and the social and biological sciences are not quantitative but qualitative. Examples of qualitative variables are gender (male, female), purchase status (purchase, no purchase), and type of firm. Qualitative variables can also be used in multiple regression.

Qualitative Independent Variables

An economist wished to relate the speed with which a particular insurance innovation is adopted (y) to the size of the insurance firm (x1) and the type of firm. The dependent variable is measured by the number of months elapsed between the time the first firm adopted the innovation and the time the given firm adopted the innovation. The first independent variable, size of firm, is quantitative and is measured by the amount of total assets of the firm. The second independent variable, type of firm, is qualitative and is composed of two classes: stock companies and mutual companies.

Indicator Variables

Indicator, or dummy, variables are used to determine the relationship between qualitative independent variables and a dependent variable. Indicator variables take on the values 0 and 1. For the insurance innovation example, where the qualitative variable has two classes, we might define the indicator variable $x_2$ as follows:

$x_2 = \begin{cases} 1 & \text{if stock company} \\ 0 & \text{otherwise} \end{cases}$

Indicator Variables

A qualitative variable with c classes will be represented by c - 1 indicator variables. A regression function with an indicator variable with two levels (c = 2) will yield two estimated lines.

Interpretation of Regression Coefficients

In our insurance innovation example, the regression model is:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

where $x_1$ = size of firm, and $x_2$ = 1 if stock company, 0 otherwise.

Interpretation of Regression Coefficients

To understand the meaning of the regression coefficients in this model, consider first the case of a mutual firm. For such a firm, $x_2 = 0$ and we have:

$\hat{y} = b_0 + b_1 x_1 + b_2(0) = b_0 + b_1 x_1$   (mutual firms)

For a stock firm, $x_2 = 1$ and the response function is:

$\hat{y} = b_0 + b_1 x_1 + b_2(1) = (b_0 + b_2) + b_1 x_1$   (stock firms)

Interpretation of Regression Coefficients

The response function for the mutual firms is a straight line with y-intercept $\beta_0$ and slope $\beta_1$. For stock firms, this also is a straight line, with the same slope $\beta_1$ but with y-intercept $\beta_0 + \beta_2$. With reference to the insurance innovation example, the mean time elapsed before the innovation is adopted is a linear function of size of firm ($x_1$), with the same slope $\beta_1$ for both types of firms.

Interpretation of Regression Coefficients

$\beta_2$ indicates how much lower or higher the response function for stock firms is than the one for mutual firms. $\beta_2$ measures the differential effect of type of firm. In general, $\beta_2$ shows how much higher (lower) the mean response line is for the class coded 1 than the line for the class coded 0, for any level of $x_1$.

Example: Insurance Innovation Adoption

Here is the data set for the insurance innovation example:

Months Elapsed   Size   Type of Firm
17               151    Mutual
26               92     Mutual
21               175    Mutual
30               31     Mutual
22               104    Mutual
--               277    Mutual
12               210    Mutual
19               120    Mutual
--               290    Mutual
16               238    Stock
28               164    Stock
15               272    Stock
11               295    Stock
38               68     Stock
31               85     Stock
21               224    Stock
20               166    Stock
13               305    Stock
30               124    Stock
14               246    Stock

Example: Insurance Innovation Adoption

Fitting the regression model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

where $x_1$ = size of firm and $x_2$ = 1 if stock company, 0 otherwise, the fitted response function is:

$\hat{y} = 33.87 - 0.1061 x_1 + 8.77 x_2$

Example: Insurance Innovation Adoption

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.95993655
R Square            0.92147818
Adjusted R Square   0.91224031
Standard Error      2.78630562
Observations        20

ANOVA
             df   SS            MS         F          Significance F
Regression    2   1548.820517   774.4103   99.75016   4.04966E-10
Residual     17   131.979483    7.763499
Total        19   1680.8

               Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept      33.8698658     1.562588138      21.67549   8E-14      30.57308841    37.16664321
Size           -0.10608882    0.007799653      -13.6017   1.45E-10   -0.122544675   -0.089632969
Type of firm   8.76797549     1.286421264      6.815789   3.01E-06   6.053860079    11.4820909

Example: Insurance Innovation Adoption

The fitted response function is:

$\hat{y} = 33.87 - 0.1061 x_1 + 8.77 x_2$

Stock firms' response function is:

$\hat{y} = (33.87 + 8.77) - 0.1061 x_1$

Mutual firms' response function is:

$\hat{y} = 33.87 - 0.1061 x_1$

Interpretation?
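A sketch of the same kind of fit in statsmodels, using a few of the complete rows from the table above just to show the mechanics (the full 20-firm fit is what produced 33.87, -0.1061, and 8.77):

```python
# Sketch: indicator-variable regression for the insurance example, on a
# subset of the rows, so the printed estimates will not match the slide.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "months": [17, 26, 21, 30, 22, 28, 15, 11, 38, 31],
    "size":   [151, 92, 175, 31, 104, 164, 272, 295, 68, 85],
    "type":   ["Mutual"] * 5 + ["Stock"] * 5,
})
df["x2"] = (df["type"] == "Stock").astype(int)  # 1 if stock company, 0 otherwise

fit = smf.ols("months ~ size + x2", data=df).fit()
print(fit.params)   # b0, b1 (size), b2 (differential effect of stock firms)
```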

Accounting for Seasonality in a Multiple Regression Model

Seasonal patterns are not easily accounted for by the typical causal variables that we use in regression analysis. An indicator variable can be used effectively to account for seasonality in our time series data. The number of seasonal indicator variables to use depends on the data: if we have p periods in our data series, we can use no more than p - 1 seasonal indicator variables.

Example: Private Housing Starts (PHS)

Housing starts in the United States are measured in thousands of units. These data are plotted for 1990Q1 through 1999Q4. There are typically few housing starts during the first quarter of the year (January, February, March); there is usually a big increase in the second quarter (April, May, June), followed by some decline in the third quarter (July, August, September), and a further decline in the fourth quarter (October, November, December).

Example: Private Housing Starts (PHS)

[Figure: time-series plot of Private Housing Starts (PHS) in thousands of units, quarterly; "1" marks the first quarter of each year.]

Example: Private Housing Starts (PHS)

To account for and measure this seasonality in a regression model, we will use three dummy variables: Q2 for the second quarter, Q3 for the third quarter, and Q4 for the fourth quarter. These will be coded as follows:

Q2 = 1 for all second quarters and zero otherwise.
Q3 = 1 for all third quarters and zero otherwise.
Q4 = 1 for all fourth quarters and zero otherwise.

Example: Private Housing Starts (PHS)

Data for private housing starts (PHS), the mortgage rate (MR), and these seasonal indicator variables are shown in the following slide. Examine the data carefully to verify your understanding of the coding for Q2, Q3, and Q4. Since we have assigned dummy variables for the second, third, and fourth quarters, the first quarter is the base quarter for our regression model. Note that any quarter could be used as the base, with indicator variables to adjust for differences in other quarters. A sketch of building such dummies appears below.
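A sketch of building the dummies with pandas from a quarterly date index, using the first four rows of the data (the DataFrame layout is an assumption):

```python
# Sketch: seasonal indicator variables from a DatetimeIndex; the first
# quarter is the base, so it gets no dummy of its own.
import pandas as pd

dates = pd.to_datetime(["1990-03-31", "1990-06-30", "1990-09-30", "1990-12-31"])
df = pd.DataFrame({"PHS": [217, 271.3, 233, 173.6]}, index=dates)
q = df.index.quarter
for k in (2, 3, 4):
    df[f"Q{k}"] = (q == k).astype(int)   # 1 in quarter k, 0 otherwise
print(df)
```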

Example: Private Housing Starts (PHS)

PERIOD      PHS     MR        Q2   Q3   Q4
31-Mar-90   217     10.1202   0    0    0
30-Jun-90   271.3   10.3372   1    0    0
30-Sep-90   233     10.1033   0    1    0
31-Dec-90   173.6   9.9547    0    0    1
31-Mar-91   146.7   9.5008    0    0    0
30-Jun-91   254.1   9.5265    1    0    0
30-Sep-91   239.8   9.2755    0    1    0
31-Dec-91   199.8   8.6882    0    0    1
31-Mar-92   218.5   8.7098    0    0    0
30-Jun-92   296.4   8.6782    1    0    0
30-Sep-92   276.4   8.0085    0    1    0
31-Dec-92   238.8   8.2052    0    0    1
31-Mar-93   213.2   7.7332    0    0    0
30-Jun-93   323.7   7.4515    1    0    0
30-Sep-93   309.3   7.0778    0    1    0
31-Dec-93   279.4   7.0537    0    0    1
31-Mar-94   252.6   7.2958    0    0    0
30-Jun-94   354.2   8.4370    1    0    0
30-Sep-94   325.7   8.5882    0    1    0
31-Dec-94   265.9   9.0977    0    0    1
31-Mar-95   214.2   8.8123    0    0    0
30-Jun-95   296.7   7.9470    1    0    0
30-Sep-95   308.2   7.7012    0    1    0
31-Dec-95   257.2   7.3508    0    0    1
31-Mar-96   240     7.2430    0    0    0
30-Jun-96   344.5   8.1050    1    0    0
30-Sep-96   324     8.1590    0    1    0
31-Dec-96   252.4   7.7102    0    0    1
31-Mar-97   237.8   7.7905    0    0    0
30-Jun-97   324.5   7.9255    1    0    0
30-Sep-97   314.6   7.4692    0    1    0
31-Dec-97   256.8   7.1980    0    0    1
31-Mar-98   258.4   7.0547    0    0    0
30-Jun-98   360.4   7.0938    1    0    0
30-Sep-98   348     6.8657    0    1    0
31-Dec-98   304.6   6.7633    0    0    1
31-Mar-99   294.1   6.8805    0    0    0
30-Jun-99   377.1   7.2037    1    0    0
30-Sep-99   355.6   7.7990    0    1    0
31-Dec-99   308.1   7.8338    0    0    1

Example: Private Housing Starts (PHS)

The regression model for private housing starts (PHS) is:

$PHS = \beta_0 + \beta_1 MR + \beta_2 Q2 + \beta_3 Q3 + \beta_4 Q4 + \varepsilon$

In this model we expect $b_1$ to have a negative sign, and we would expect $b_2$, $b_3$, and $b_4$ all to have positive signs. Why? Regression results for this model are shown in the next slide.

Example: Private Housing Starts (PHS)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.885398221
R Square            0.78393001
Adjusted R Square   0.759236296
Standard Error      26.4498851
Observations        40

ANOVA
             df   SS            MS            F             Significance F
Regression    4   88837.93624   22209.48406   31.74613731   3.33637E-11
Residual     35   24485.87476   699.5964217
Total        39   113323.811

            Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   473.0650749    35.54169837      13.31014264    2.93931E-15   400.9115031    545.2186467
MR          -30.04838192   4.257226391      -7.058206249   3.21421E-08   -38.69102153   -21.40574231
Q2          95.74106935    11.84748487      8.081130334    1.6292E-09    71.689367      119.7927717
Q3          73.92904763    11.82881519      6.249911462    3.62313E-07   49.91524679    97.94284847
Q4          20.54778131    11.84139803      1.73524961     0.091495355   -3.491564078   44.5871267

Example: Private Housing Starts (PHS)

Use the prediction equation to make a forecast for each quarter of 1999; the fourth-quarter computation is sketched below. Prediction equation:

$\widehat{PHS} = 473.06 - 30.05\,MR + 95.74\,Q2 + 73.93\,Q3 + 20.55\,Q4$
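A sketch of the 1999Q4 computation, using the MR value for 31-Dec-99 from the data table:

```python
# Sketch: plug the 1999Q4 mortgage rate into the fitted equation;
# Q2 = Q3 = 0 and Q4 = 1 for a fourth quarter.
mr, q2, q3, q4 = 7.8338, 0, 0, 1
phs_hat = 473.06 - 30.05 * mr + 95.74 * q2 + 73.93 * q3 + 20.55 * q4
print(round(phs_hat, 1))   # roughly 258, versus an actual 1999Q4 PHS of 308.1
```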

Example: Private Housing Starts (PHS)

[Figure: Private Housing Starts (PHS) with a simple regression forecast (PHSF1) and a multiple regression forecast (PHSF2), in thousands of units.]

Regression Diagnostics and Residual Analysis

It is important to check the adequacy of the model before it becomes part of the decision-making process. Residual plots can be used to check the model assumptions. It is important to study outlying observations to decide whether they should be retained or eliminated and, if retained, whether their influence should be reduced in the fitting process or the regression function revised.

Time Series Data and the Problem of Serial Correlation

In regression models we assume that the errors $\varepsilon_i$ are independent. In business and economics, many regression applications involve time series data. For such data, the assumption of uncorrelated or independent error terms is often not appropriate.

Problems of Serial Correlation

If the error terms in the regression model are autocorrelated, the use of ordinary least squares procedures has a number of important consequences:

MSE underestimates the variance of the error terms.
The confidence intervals and tests using the t and F distributions are no longer strictly applicable.
The standard errors of the regression coefficients underestimate the variability of the estimated regression coefficients. Spurious regression can result.

First-order serial correlation

The error term in the current period is directly related to the error term in the previous time period. Let the subscript t represent time; then the simple linear regression model is:

$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$

where

$\varepsilon_t = \rho \varepsilon_{t-1} + \nu_t$

$\varepsilon_t$ = error at time t
$\rho$ = the parameter that measures the correlation between adjacent error terms
$\nu_t$ = normally distributed error terms with mean zero and variance $\sigma^2$

Example

The effects of positive serial correlation in a simple linear regression model:

Misleading forecasts of future y values.
The standard error of the estimate, $s_{y \cdot x}$, will underestimate the variability of the y's about the true regression line.
Strong autocorrelation can make two unrelated variables appear to be related.

Durbin-Watson Test for Serial Correlation

Recall the first-order serial correlation model:

$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$
$\varepsilon_t = \rho \varepsilon_{t-1} + \nu_t$

The hypotheses to be tested are:

$H_0: \rho = 0$
$H_a: \rho > 0$

The alternative hypothesis is $\rho > 0$ since business and economic time series tend to show positive correlation.

Durbin-Watson Test for Serial Correlation

The Durbin-Watson statistic is defined as:

$DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$

where

$e_t = y_t - \hat{y}_t$ is the residual for time period t, and
$e_{t-1} = y_{t-1} - \hat{y}_{t-1}$ is the residual for time period t - 1.

Durbin-Watson Test for Serial Correlation

The autocorrelation coefficient $\rho$ can be estimated by the lag-1 residual autocorrelation $r_1(e)$:

$r_1(e) = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}$

and it can be shown that

$DW \approx 2(1 - r_1(e))$

Durbin-Watson Test for Serial Correlation

Since $-1 < r_1(e) < 1$, then $0 < DW < 4$:

If $r_1(e) = 0$, then DW = 2 (there is no correlation).
If $r_1(e) > 0$, then DW < 2 (positive correlation).
If $r_1(e) < 0$, then DW > 2 (negative correlation).

Durbin-Watson Test for Serial Correlation

Decision rule:

If DW > U, do not reject H0.
If DW < L, reject H0.
If L <= DW <= U, the test is inconclusive.

The critical upper (U) and lower (L) bounds can be found in the Durbin-Watson table of your textbook. To use this table you need to know the significance level ($\alpha$), the number of independent parameters in the model (k), and the sample size (n).
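A minimal sketch of the statistic computed directly from its definition (the residuals below are made up, chosen to show a DW well under 2):

```python
# Sketch: Durbin-Watson statistic from a residual series.
import numpy as np

def durbin_watson(e):
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# residuals with long same-sign runs (positive autocorrelation) -> DW < 2
print(durbin_watson([0.5, 0.6, 0.4, 0.5, -0.4, -0.5, -0.6, -0.5]))
```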

Example

The Blaisdell Company wished to predict its sales by using industry sales as a predictor variable. The following table gives seasonally adjusted quarterly data on company sales and industry sales for the period 1983-1987.

Example

Year   Quarter   t    CompSale   InduSale
1983   1         1    20.96      127.3
1983   2         2    21.4       130
1983   3         3    21.96      132.7
1983   4         4    21.52      129.4
1984   1         5    22.39      135
1984   2         6    22.76      137.1
1984   3         7    23.48      141.2
1984   4         8    23.66      142.8
1985   1         9    24.1       145.5
1985   2         10   24.01      145.3
1985   3         11   24.54      148.3
1985   4         12   24.3       146.4
1986   1         13   25         150.2
1986   2         14   25.64      153.1
1986   3         15   26.36      157.3
1986   4         16   26.98      160.7
1987   1         17   27.52      164.2
1987   2         18   27.78      165.6
1987   3         19   28.24      168.7
1987   4         20   28.78      171.7

Example

[Figure: Blaisdell Company example; scatter plot of company sales ($ millions) against industry sales ($ millions).]

Example

The scatter plot suggests that a linear regression model is appropriate. The least-squares method was used to fit a regression line to the data. The residuals were plotted against the fitted values. The plot shows that the residuals are consistently above or below the fitted value for extended periods.

Example

[Figure: residuals from the Blaisdell regression plotted against the fitted values.]

Example

To confirm this graphic diagnosis we will use the Durbin-Watson test for:

$H_0: \rho = 0$
$H_a: \rho > 0$

The test statistic is:

$DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$

Example

t    CompSale   InduSale   e_t        e_t - e_(t-1)   (e_t - e_(t-1))^2   e_t^2
1    20.96      127.3      -0.02605                                       0.000679
2    21.4       130        -0.06202   -0.03596        0.001293            0.003846
3    21.96      132.7      0.022021   0.084036        0.007062            0.000485
4    21.52      129.4      0.163754   0.141733        0.020088            0.026815
5    22.39      135        0.04657    -0.11718        0.013732            0.002169
6    22.76      137.1      0.046377   -0.00019        3.76E-08            0.002151
7    23.48      141.2      0.043617   -0.00276        7.61E-06            0.001902
8    23.66      142.8      -0.05844   -0.10205        0.010415            0.003415
9    24.1       145.5      -0.0944    -0.03596        0.001293            0.008911
10   24.01      145.3      -0.14914   -0.05474        0.002997            0.022243
11   24.54      148.3      -0.14799   0.001152        1.33E-06            0.021901
12   24.3       146.4      -0.05305   0.094937        0.009013            0.002815
13   25         150.2      -0.02293   0.030125        0.000908            0.000526
14   25.64      153.1      0.105852   0.12878         0.016584            0.011205
15   26.36      157.3      0.085464   -0.02039        0.000416            0.007304
16   26.98      160.7      0.106102   0.020638        0.000426            0.011258
17   27.52      164.2      0.029112   -0.07699        0.005927            0.000848
18   27.78      165.6      0.042316   0.013204        0.000174            0.001791
19   28.24      168.7      -0.04416   -0.08648        0.007478            0.00195
20   28.78      171.7      -0.03301   0.011152        0.000124            0.00109
                                      Sums:           0.097941            0.133302

Example

$DW = \frac{0.09794}{0.13330} = 0.735$

Using the Durbin-Watson table of your textbook, for k = 1, n = 20, and $\alpha = .01$ we find U = 1.15 and L = 0.95. Since DW = 0.735 falls below L = 0.95, we reject the null hypothesis and conclude that the error terms are positively autocorrelated. A statsmodels check is sketched below.
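A sketch reproducing this number from the company and industry sales above, using statsmodels' built-in Durbin-Watson helper:

```python
# Sketch: fit the Blaisdell regression and compute DW from its residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

y = np.array([20.96, 21.4, 21.96, 21.52, 22.39, 22.76, 23.48, 23.66, 24.1,
              24.01, 24.54, 24.3, 25, 25.64, 26.36, 26.98, 27.52, 27.78,
              28.24, 28.78])
x = np.array([127.3, 130, 132.7, 129.4, 135, 137.1, 141.2, 142.8, 145.5,
              145.3, 148.3, 146.4, 150.2, 153.1, 157.3, 160.7, 164.2, 165.6,
              168.7, 171.7])

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))   # about 0.735, matching the table above
```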

Remedial Measures for Serial Correlation

Addition of one or more independent variables to the regression model: one major cause of autocorrelated error terms is the omission from the model of one or more key variables that have time-ordered effects on the dependent variable.

Use transformed variables: the regression model is specified in terms of changes rather than levels.

Extensions of the Multiple Regression Model

In some situations, nonlinear terms may be needed as independent variables in a regression analysis. Business or economic logic may suggest that non-linearity is expected. A graphic display of the data may be helpful in determining whether non-linearity is present. One common economic cause for non-linearity is diminishing returns. For example, the effect of advertising on sales may diminish as increased advertising is used.

Extensions of the Multiple Regression Model

Some common forms of nonlinear functions are:

$Y = \beta_0 + \beta_1 X + \beta_2 X^2$
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$
$Y = \beta_0 + \beta_1 (1/X)$
$Y = e^{\beta_0} X^{\beta_1}$
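The first three forms are still linear in the betas, so ordinary least squares applies after transforming X; a sketch on simulated data with diminishing returns:

```python
# Sketch: fit Y = b0 + b1*X + b2*X^2 by OLS on transformed regressors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=60)
y = 5 + 2 * x - 0.15 * x**2 + rng.normal(scale=0.5, size=60)  # diminishing returns

X = sm.add_constant(np.column_stack([x, x**2]))   # regressors: X and X^2
fit = sm.OLS(y, X).fit()
print(fit.params)   # estimates of beta0, beta1, beta2
```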

Extensions of the Multiple Regression Model

To illustrate the use and interpretation of a nonlinear term, we return to the problem of developing a forecasting model for private housing starts (PHS). So far we have looked at the following model:

$PHS = \beta_0 + \beta_1 MR + \beta_2 Q2 + \beta_3 Q3 + \beta_4 Q4 + \varepsilon$

where MR is the mortgage rate and Q2, Q3, and Q4 are indicator variables for quarters 2, 3, and 4.

Example: Private Housing Starts (PHS)

First we add real disposable personal income per capita (DPI) as an independent variable. Our new model for this data set is:

$PHS = \beta_0 + \beta_1 MR + \beta_2 Q2 + \beta_3 Q3 + \beta_4 Q4 + \beta_5 DPI + \varepsilon$

Regression results for this model are shown in the next slide.

Example: Private Housing Starts (PHS)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.943791346
R Square            0.890742104
Adjusted R Square   0.874187878
Standard Error      19.05542121
Observations        39

ANOVA
             df   SS            MS         F          Significance F
Regression    5   97690.01942   19538      53.80753   6.51194E-15
Residual     33   11982.59955   363.1091
Total        38   109672.619

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   -31.06403714   105.1938477      -0.2953    0.769613   -245.0826992   182.9546249
MR          -20.1992545    4.124906847      -4.8969    2.5E-05    -28.59144723   -11.80706176
Q2          97.03478074    8.900711541      10.90191   1.78E-12   78.9261326     115.1434289
Q3          75.40017073    8.827185877      8.541813   7.17E-10   57.44111179    93.35922967
Q4          20.35306822    8.83373887       2.304015   0.027657   2.380677107    38.32545934
DPI         0.022407799    0.004356973      5.142974   1.21E-05   0.013543464    0.031272134

Example: Private Housing Starts (PHS)

The prediction model is:

$\widehat{PHS} = -31.06 - 20.19\,MR + 97.03\,Q2 + 75.40\,Q3 + 20.35\,Q4 + 0.02\,DPI$

In comparison with the previous model, we see that the R-squared has improved: it has changed from 78% to 89%. The standard error of the estimate has decreased from 26.49 for the previous model to 19.05 for the new model.

Example: Private Housing Starts (PHS)

The value of the DW statistic has changed from 0.88 for the previous model to 0.78 for the new model. At the 5% level the critical values for the DW test, from the Durbin-Watson table, for k = 5 and n = 39 are L = 1.22 and U = 1.79. Since the value of the DW statistic is smaller than L = 1.22, we reject the null hypothesis $H_0: \rho = 0$. This implies that there is serial correlation in both models; the assumption of independence of the error terms is not valid.

Example: Private Housing Starts (PHS)

The plot of PHS against DPI shows a curvilinear relation. Next we introduce a nonlinear term into the regression: the square of disposable personal income per capita (DPI²) is included in the regression model.

[Figure: scatter plot of Private Housing Starts (PHS) against Disposable Personal Income (DPI), showing a curvilinear relation.]

Example: Private Housing Starts (PHS)

We also add the dependent variable, lagged one quarter (LPHS), as an independent variable in order to help reduce serial correlation. The third model that we fit to our data set is:

$PHS = \beta_0 + \beta_1 MR + \beta_2 Q2 + \beta_3 Q3 + \beta_4 Q4 + \beta_5 DPI + \beta_6 DPI^2 + \beta_7 LPHS + \varepsilon$

Regression results for this model are shown in the next slide.
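Before the output, a sketch of constructing the squared and lagged regressors with pandas, using the first rows of the data slide (note the DPI SQUARED column shown later appears to be on a different scale; plain squaring is shown here):

```python
# Sketch: build DPI^2 and a one-quarter lag of PHS; the row without a lagged
# value is dropped before fitting.
import pandas as pd

df = pd.DataFrame({"PHS": [271.3, 233, 173.6, 146.7],
                   "DPI": [18063, 18031, 17856, 17748]})
df["DPI2"] = df["DPI"] ** 2
df["LPHS"] = df["PHS"].shift(1)   # PHS lagged one quarter
print(df.dropna())
```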

Example: Private Housing Starts (PHS)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.97778626
R Square            0.956065971
Adjusted R Square   0.946145384
Standard Error      12.46719572
Observations        39

ANOVA
             df   SS            MS            F          Significance F
Regression    7   104854.2589   14979.17985   96.37191   3.07085E-19
Residual     31   4818.360042   155.4309691
Total        38   109672.619

              Coefficients   Standard Error   t Stat         P-value    Lower 95%      Upper 95%
Intercept     716.5926532    1017.664989      0.704153784    0.486593   -1358.949934   2792.13524
MR            -13.65521724   3.093504134      -4.414158396   0.000114   -19.96446404   -7.345970448
Q2            106.9813297    6.069780998      17.62523718    1.04E-17   94.60192287    119.3607366
Q3            27.72122303    9.111432565      3.042465916    0.004748   9.138323433    46.30412262
Q4            -13.37855186   7.653050858      -1.748133144   0.09034    -28.98706069   2.22995698
DPI           -0.060399279   0.104412354      -0.578468704   0.567127   -0.273349798   0.15255124
DPI SQUARED   0.000335974    0.000536397      0.626354647    0.535668   -0.000758014   0.001429963
LPHS          0.655786939    0.097265424      6.742241114    1.51E-07   0.457412689    0.854161189

Example: Private Housing Starts (PHS)

The inclusion of DPI² and lagged PHS has increased the R-squared to 96%. The standard error of the estimate has decreased to 12.47. The value of the DW statistic has increased to 2.32, which is greater than U = 1.79 and rules out positive serial correlation. The third model worked best for this data set. The following slide gives the data set.

Example: Private Housing Starts (PHS)

PERIOD      PHS     MR        LPHS    Q2   Q3   Q4   DPI     DPI SQUARED
30-Jun-90   271.3   10.3372   217     1    0    0    18063   1,631,359.85
30-Sep-90   233     10.1033   271.3   0    1    0    18031   1,625,584.81
31-Dec-90   173.6   9.9547    233     0    0    1    17856   1,594,183.68
31-Mar-91   146.7   9.5008    173.6   0    0    0    17748   1,574,957.52
30-Jun-91   254.1   9.5265    146.7   1    0    0    17861   1,595,076.61
30-Sep-91   239.8   9.2755    254.1   0    1    0    17816   1,587,049.28
31-Dec-91   199.8   8.6882    239.8   0    0    1    17811   1,586,158.61
31-Mar-92   218.5   8.7098    199.8   0    0    0    18000   1,620,000.00
30-Jun-92   296.4   8.6782    218.5   1    0    0    18085   1,635,336.13
30-Sep-92   276.4   8.0085    296.4   0    1    0    18036   1,626,486.48
31-Dec-92   238.8   8.2052    276.4   0    0    1    18330   1,679,944.50
31-Mar-93   213.2   7.7332    238.8   0    0    0    17975   1,615,503.13
30-Jun-93   323.7   7.4515    213.2   1    0    0    18247   1,664,765.05
30-Sep-93   309.3   7.0778    323.7   0    1    0    18246   1,664,582.58
31-Dec-93   279.4   7.0537    309.3   0    0    1    18413   1,695,192.85
31-Mar-94   252.6   7.2958    279.4   0    0    0    18154   1,647,838.58
30-Jun-94   354.2   8.4370    252.6   1    0    0    18409   1,694,456.41
30-Sep-94   325.7   8.5882    354.2   0    1    0    18493   1,709,955.25
31-Dec-94   265.9   9.0977    325.7   0    0    1    18667   1,742,284.45
31-Mar-95   214.2   8.8123    265.9   0    0    0    18834   1,773,597.78
30-Jun-95   296.7   7.9470    214.2   1    0    0    18798   1,766,824.02
30-Sep-95   308.2   7.7012    296.7   0    1    0    18871   1,780,573.21
31-Dec-95   257.2   7.3508    308.2   0    0    1    18942   1,793,996.82
31-Mar-96   240     7.2430    257.2   0    0    0    19071   1,818,515.21
30-Jun-96   344.5   8.1050    240     1    0    0    19081   1,820,422.81
30-Sep-96   324     8.1590    344.5   0    1    0    19161   1,835,719.61
31-Dec-96   252.4   7.7102    324     0    0    1    19152   1,833,995.52
31-Mar-97   237.8   7.7905    252.4   0    0    0    19331   1,868,437.81
30-Jun-97   324.5   7.9255    237.8   1    0    0    19315   1,865,346.13
30-Sep-97   314.6   7.4692    324.5   0    1    0    19385   1,878,891.13
31-Dec-97   256.8   7.1980    314.6   0    0    1    19478   1,896,962.42
31-Mar-98   258.4   7.0547    256.8   0    0    0    19632   1,927,077.12
30-Jun-98   360.4   7.0938    258.4   1    0    0    19719   1,944,194.81
30-Sep-98   348     6.8657    360.4   0    1    0    19905   1,980,963.41
31-Dec-98   304.6   6.7633    348     0    0    1    20194   2,038,980.00
31-Mar-99   294.1   6.8805    304.6   0    0    0    20377   2,076,010.87
30-Jun-99   377.1   7.2037    294.1   1    0    0    20472   2,095,440.74
30-Sep-99   355.6   7.7990    377.1   0    1    0    20756   2,153,982.23
31-Dec-99   308.1   7.8338    355.6   0    0    1    21124   2,231,020.37
