
TUTORIAL 7: Multiple Linear Regression

I. Multiple Regression

A regression with two or more explanatory variables is called a multiple regression. Multiple linear regression is an extremely effective tool for answering statistical questions involving many variables. The procedures PROC REG and PROC GLM can be used to perform regression in SAS. In this tutorial we concentrate on using PROC REG. Much of the syntax is similar to that used for fitting simple linear regression models; see Tutorial 6 for a review of this material.

A. PROC REG

PROC REG is the basic SAS procedure for performing regression analysis. The general form of the PROC REG procedure is:

PROC REG DATA=dataset;
  MODEL response_variable = explanatory_variables;
  PLOT variable1 * variable2 <options>;
  OUTPUT OUT = newdata <options>;
RUN;

The MODEL statement is used to specify the response and explanatory variables to be used in the regression model. For example, the statement:

MODEL y = x1 x2;

fits a multiple linear regression model with the variable y as the response variable and the variables x1 and x2 as explanatory variables.
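To make the arithmetic behind such a fit concrete, here is a minimal sketch (in Python rather than SAS, with made-up data) of what a multiple regression computes: an ordinary least-squares fit on a design matrix with an intercept column. The variable names mirror the MODEL statement above; the data values are hypothetical and chosen so the fit recovers the coefficients exactly.

```python
import numpy as np

# Hypothetical data: y depends exactly on x1 and x2, so the
# least-squares fit should recover the coefficients precisely.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 10.0 + 3.0 * x1 - 2.0 * x2

# Design matrix with an intercept column, as PROC REG includes by default
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve the least-squares problem: minimize ||y - X b||^2
b, *_ = np.linalg.lstsq(X, y, rcond=None)

yhat = X @ b          # predicted values (predicted. in PROC REG)
resid = y - yhat      # residuals (residual. in PROC REG)

print(b)              # recovers [10, 3, -2]
```
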

The fit of the model and the model assumptions can be checked graphically using the PLOT statement. This statement can be used to make all the relevant plots needed for the regression model. In the regression models we have discussed so far, it is assumed that the errors are independent and normally distributed with mean 0 and variance σ². After performing regression, it is necessary to check these assumptions by analyzing the residuals and studying a series of residual plots. To plot the residuals against the explanatory variables use the statement:

PLOT residual.*(x1 x2);

Note that residual. (the period is required) is the variable name for the residuals created by PROC REG. To plot the residuals against the predicted values we would use the statement:

PLOT residual.*predicted.;

Note that predicted. (the period is again required) is the variable name for the predicted values from the regression model.

The OUTPUT statement is used to produce a new data set containing the original data used in the regression model, as well as the predicted values and residuals. This new data set can, in turn, be used to produce further diagnostic plots and check the model fit. When using the OUTPUT statement, there are a number of options that control the contents of the output data set. The statement:

OUTPUT out = outdata r = resid p = yhat;

creates a new data set named outdata which contains the residuals and predicted values. The residuals are given the name resid, and the predicted values the name yhat. The data set outdata can then be used to further study the residuals.

Ex. Data was collected on 15 houses recently sold in a city. It consisted of the sales price (in $), house size (in square feet), the number of bedrooms, the number of bathrooms, the lot size (in square feet) and the annual real estate tax (in $).

The following program reads in the data and fits a multiple regression model with price as the response variable and size and lot as the explanatory variables. It also produces residual plots of the residuals against both explanatory variables as well as the predicted values.

DATA houses;
  INPUT tax bedroom bath price size lot;
  DATALINES;
590  2 1    50000  770 22100
1050 3 2    85000 1410 12000
20   3 1    22500 1060  3500
870  2 2    90000 1300 17500
1320 3 2   133000 1500 30000
1350 2 1    90500  820 25700
2790 3 2.5 260000 2130 25000
680  2 1   142500 1170 22000
1840 3 2   160000 1500 19000
3680 4 2   240000 2790 20000
1660 3 1    87000 1030 17500
1620 3 2   118600 1250 20000
3100 3 2   140000 1760 38000
2070 2 3   148000 1550 14000
650  3 1.5  65000 1450 12000
;
RUN;

PROC REG data=houses;
  MODEL price = size lot;                  * Model statement;
  PLOT residual.*(predicted. size lot);    * Residual plots;
RUN;

This program gives rise to the following output:

 

                       Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               2   44825992653   22412996326      19.10    0.0002
Error              12   14082023347    1173501946
Corrected Total    14   58908016000

Root MSE             34256    R-Square    0.7609
Dependent Mean      122140    Adj R-Sq    0.7211
Coeff Var         28.04684

                       Parameter Estimates

                Parameter     Standard
Variable   DF    Estimate        Error    t Value    Pr > |t|
Intercept   1      -61969        32257      -1.92      0.0788
size        1    97.65137     18.16474       5.38      0.0002
lot         1     2.22295      1.13918       1.95      0.0748

We also obtained three residual plots, which are not shown here. The output can be used to test a variety of hypotheses regarding the model. For example, by studying the parameter estimates table we see that the coefficient corresponding to size is significant (p-value=0.0002) when controlling for lot size. However, the coefficient corresponding to lot size is not significant (p-value=0.0748) when controlling for house size.
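The summary statistics in this output are simple functions of the sums of squares. As a quick sketch (in Python, with the numbers copied from the tables above), each printed quantity can be verified by hand:

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table above
ssm, sse, sst = 44825992653, 14082023347, 58908016000
dfm, dfe, dft = 2, 12, 14

mse = sse / dfe
f = (ssm / dfm) / mse                    # F Value
r2 = ssm / sst                           # R-Square
adj_r2 = 1 - (sse / dfe) / (sst / dft)   # Adj R-Sq
root_mse = math.sqrt(mse)                # Root MSE

# t statistic for size: the estimate divided by its standard error
t_size = 97.65137 / 18.16474

print(round(f, 2), round(r2, 4), round(adj_r2, 4), round(t_size, 2))
```
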

B. Testing a subset of variables using a partial F-test

Sometimes we are interested in simultaneously testing whether a certain subset of the coefficients is equal to 0 (e.g. β3 = β4 = 0). We can do this using a partial F-test. This test involves comparing the SSE from a reduced model (excluding the parameters we are testing) with the SSE from the full model (including all of the parameters).

We can perform a partial F-test in PROC REG by including a TEST statement. For example, the statement

TEST var1=0, var2=0;

tests the null hypothesis that the regression coefficients corresponding to var1 and var2 are both equal to 0. However, note that any number of variables can be included in the TEST statement.

Ex. Housing data cont.

Suppose we include the variables bedroom, bath and size in our model and are interested in testing whether the number of bedrooms and bathrooms are significant after taking size into consideration. The following program performs the partial F-test:

PROC REG data=houses;
  MODEL price = bedroom bath size;   * Model statement;
  TEST bath=0, bedroom=0;            * Partial F-test;
RUN;

This gives rise to the following output:

                       Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               3   43908504107   14636168036      10.73    0.0013
Error              11   14999511893    1363591990
Corrected Total    14   58908016000

Root MSE             36927    R-Square    0.7454
Dependent Mean      122140    Adj R-Sq    0.6759
Coeff Var         30.23321

                       Parameter Estimates

                Parameter     Standard
Variable   DF    Estimate        Error    t Value    Pr > |t|
Intercept   1       27923        56306       0.50      0.6297
bedroom     1      -35525        25037      -1.42      0.1836
bath        1  2269.34398        22209       0.10      0.9205
size        1   130.79392     36.20864       3.61      0.0041

          Test 1 Results for Dependent Variable price

                           Mean
Source         DF        Square    F Value    Pr > F
Numerator       2    1775487498       1.30    0.3108
Denominator    11    1363591990

The final table shows the results of the partial F-test. Since F=1.30 (p-value=0.3108) we cannot reject the null hypothesis (β2 = β3 = 0). It appears that bedroom and bath do not contribute significant information to the sales price once size has been taken into consideration.
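The printed partial F statistic is just the ratio of the numerator and denominator mean squares; equivalently, it measures the increase in SSE when the tested variables are dropped. A sketch (in Python, using the numbers from the Test 1 Results table above):

```python
# Mean squares from the Test 1 Results table above
num_ms, den_ms = 1775487498, 1363591990
q, df_full = 2, 11           # numerator and denominator degrees of freedom

f = num_ms / den_ms
print(round(f, 2))           # 1.3 -- the printed F Value of 1.30

# Equivalently, F = ((SSE_reduced - SSE_full) / q) / (SSE_full / df_full),
# so the SSE of the reduced (size-only) model can be recovered:
sse_full = den_ms * df_full          # matches the Error SS above (up to rounding)
sse_reduced = sse_full + q * num_ms
print(sse_full, sse_reduced)
```
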

C. Model Selection

Often we have data on a large number of explanatory variables and wish to construct a regression model using some subset of them. The use of a subset will make the resulting model easier to interpret and more manageable, especially if more data is to be collected in the future. Unnecessary terms in the model may also yield less precise inference.

One approach to model selection is to consider all possible subsets of the pool of explanatory variables and find the model that best fits the data according to some criterion. Different criteria may be used to select the best model, such as adjusted R² or Mallows' Cp. These criteria assign scores to each model and allow us to choose the model with the best score.

In SAS we can perform model selection using Mallows' Cp by including the selection option in the MODEL statement. The statement:

MODEL y = x1 x2 x3 x4 x5 /selection = cp;

steps through each possible model consisting of a subset of the 5 explanatory variables (x1, x2, x3, x4 and x5) and calculates a Cp score for each one. The models are then listed in order of their Cp scores, making it easy to pick the one with the smallest score.

Ex. Housing data continued.

To use Mallows' Cp to determine which subset of the 5 possible explanatory variables best models the data, we can use the following code:

PROC REG data=houses;
  MODEL price = tax bedroom bath size lot / selection = cp;
RUN;

This gives rise to the following output:

Number in
  Model       C(p)    R-Square    Variables in Model
      3     2.3274      0.8115    tax bedroom size
      2     2.7395      0.7628    tax size
      2     2.8314      0.7609    size lot
      3     3.0608      0.7967    bedroom size lot
      2     3.6142      0.7451    bedroom size
      3     3.9514      0.7787    tax size lot
      4     4.0001      0.8182    tax bedroom size lot
      1     4.1292      0.6943    tax
      3     4.1942      0.7738    bath size lot
      4     4.3138      0.8118    tax bedroom bath size
      3     4.4539      0.7686    tax bath size
      1     4.5857      0.6851    size
      2     4.9046      0.7191    tax bath
      4     4.9963      0.7980    bedroom bath size lot
      4     5.5470      0.7869    tax bath size lot
      3     5.6023      0.7454    bedroom bath size
      2     5.9088      0.6988    bath size
      5     6.0000      0.8182    tax bedroom bath size lot
      2     6.0115      0.6967    tax bedroom
      2     6.1249      0.6944    tax lot
      3     6.8608      0.7199    tax bath lot
      3     6.8755      0.7196    tax bedroom bath
      3     7.9613      0.6977    tax bedroom lot
      4     8.8538      0.7201    tax bedroom bath lot
      3    13.7814      0.5801    bedroom bath lot
      2    16.2074      0.4907    bath lot
      2    18.5511      0.4433    bedroom bath
      1    20.4532      0.3645    bath
      2    23.5548      0.3422    bedroom lot
      1    29.3255      0.1852    lot
      1    31.1563      0.1482    bedroom

From the output we see that the model with tax, bedroom and size minimizes the Cp criterion.
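The scores in the table can be reproduced by hand using the usual definition of Mallows' Cp, Cp = SSE_p / MSE_full + 2p − n, where p counts the parameters including the intercept and MSE_full comes from the model with all candidate variables. A sketch in Python, working from the R-Square values above (the small discrepancies come from the R-Square values being rounded to four decimals):

```python
# Mallows' Cp from the R-Square values above
n, sst = 15, 58908016000

# Full model (tax bedroom bath size lot): R-Square 0.8182, p = 6
mse_full = (1 - 0.8182) * sst / (n - 6)

def cp(r2, p):
    """Cp = SSE_p / MSE_full + 2p - n, with p counting the intercept."""
    sse_p = (1 - r2) * sst
    return sse_p / mse_full + 2 * p - n

print(cp(0.8182, 6))            # full model: Cp equals p (= 6 here)
print(round(cp(0.7628, 3), 2))  # "tax size": close to the printed 2.7395
```
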

When the number of explanatory variables is large it is often not feasible to fit all possible models. It is instead more efficient to use a search algorithm to find the best model. A number of such search algorithms exist. They include: forward selection, backward elimination and stepwise regression.

In SAS we can perform model selection using these algorithms by including the selection option in the MODEL statement. The statement:

MODEL y = x1 x2 x3 x4 /selection = forward;

fits a model chosen by forward selection. The other algorithms can be used by exchanging forward with either backward or stepwise in the MODEL statement.
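SAS does not show the internals of these algorithms, but the forward-selection idea can be sketched as follows: start from the intercept-only model and, at each step, add the variable that most reduces the SSE, stopping when no addition gives a meaningful improvement. The sketch below (Python, not SAS) uses synthetic data: four mutually orthogonal ±1 predictors, with y depending only on the first and third, so the selection order is predictable.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

def forward_select(vars_, y, tol=1e-8):
    """Greedy forward selection: repeatedly add the column that most
    reduces the SSE; stop when the best improvement falls below tol."""
    n = len(y)
    selected = []
    current = sse(np.ones((n, 1)), y)      # intercept-only model
    while len(selected) < len(vars_):
        best_j, best_sse = None, current
        for j in range(len(vars_)):
            if j in selected:
                continue
            cols = [np.ones(n)] + [vars_[k] for k in selected + [j]]
            s = sse(np.column_stack(cols), y)
            if s < best_sse:
                best_j, best_sse = j, s
        if best_j is None or current - best_sse < tol:
            break
        selected.append(best_j)
        current = best_sse
    return selected, current

# Synthetic example: four mutually orthogonal +/-1 predictors; y depends
# only on the first and third, so those should be picked, in that order.
x1 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
x2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
x3 = np.array([1, -1, -1, 1, 1, -1, -1, 1], dtype=float)
x4 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
y = 2.0 * x1 + 0.5 * x3

selected, final_sse = forward_select([x1, x2, x3, x4], y)
print(selected, final_sse)
```

Backward elimination runs the same greedy loop in reverse (start full, drop the least useful variable), and stepwise regression alternates the two moves.
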

D. Variance Inflation Factor

In multiple regression one would like the explanatory variables to be highly correlated with the response variable. However, it is not desirable for the explanatory variables to be correlated with one another. Multicollinearity exists when two or more of the explanatory variables used in the regression model are moderately or highly correlated. The presence of a high degree of multicollinearity among the explanatory variables can result in the following problems: (i) the standard deviation of the regression coefficients may be disproportionately large, (ii) the coefficient estimates are unstable, and (iii) the regression coefficients may not be interpretable.

A method for detecting the presence of multicollinearity is the variance inflation factor (VIF). A large VIF (>10) is taken as an indication that multicollinearity may be influencing the estimates.

Including the option VIF in the MODEL statement prints the variance inflation factors for each of the explanatory variables.

Ex. Housing data continued.

If we want to determine whether there is any multicollinearity present in our model, we need to add the VIF option to the MODEL statement as shown below:

PROC REG data=houses;
  MODEL price = size lot / VIF;
RUN;

This gives rise to the same output as in the example of Section A. The only difference is an additional column in the parameter estimates section showing each explanatory variable's VIF score.

                          Parameter Estimates

                Parameter     Standard                            Variance
Variable   DF    Estimate        Error    t Value    Pr > |t|    Inflation
Intercept   1      -61969        32257      -1.92      0.0788            0
size        1    97.65137     18.16474       5.38      0.0002      1.03891
lot         1     2.22295      1.13918       1.95      0.0748      1.03891

Note that all the VIF values are rather small, so there do not appear to be any problems with multicollinearity in this example.
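With only two explanatory variables, the VIF has a simple closed form: each variable's R² against the other predictors reduces to the squared correlation between size and lot, so VIF = 1/(1 − r²) and both variables share the same score. A sketch (in Python, not part of the SAS output) reproducing the printed value from the houses data:

```python
import numpy as np

# size and lot columns from the houses data set above
size = np.array([770, 1410, 1060, 1300, 1500, 820, 2130, 1170,
                 1500, 2790, 1030, 1250, 1760, 1550, 1450], dtype=float)
lot = np.array([22100, 12000, 3500, 17500, 30000, 25700, 25000, 22000,
                19000, 20000, 17500, 20000, 38000, 14000, 12000], dtype=float)

# With two predictors, the R^2 of each regressed on the other is just the
# squared correlation between them, so both get the same VIF.
r = np.corrcoef(size, lot)[0, 1]
vif = 1.0 / (1.0 - r**2)
print(round(vif, 5))    # close to the printed 1.03891
```
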