
TUTORIAL 7: Multiple Linear Regression

I. Multiple Regression

A regression with two or more explanatory variables is called a multiple regression. Multiple linear regression is an extremely effective tool for answering statistical questions involving many variables. The procedures PROC REG and PROC GLM can be used to perform regression in SAS. In this tutorial we concentrate on using PROC REG. Much of the syntax is similar to that used for fitting simple linear regression models; see Tutorial 6 for a review of this material.

A. PROC REG

PROC REG is the basic SAS procedure for performing regression analysis. The general form of the PROC REG procedure is:

PROC REG DATA=dataset;
   MODEL response_variable = explanatory_variables;
   PLOT variable1 * variable2 <options>;
   OUTPUT OUT = newdata <options>;
RUN;

The MODEL statement is used to specify the response and explanatory variables to be used in the regression model. For example, the statement:

MODEL y = x1 x2;

fits a multiple linear regression model with the variable y as the response variable and the variables x1 and x2 as explanatory variables.
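Behind this statement is the ordinary least-squares fit y = b0 + b1*x1 + b2*x2 + error. As a minimal illustration outside SAS (a Python sketch on made-up data; the variable names are only for demonstration):

```python
# Least-squares fit behind MODEL y = x1 x2: minimize ||y - X*beta||^2,
# where X has an intercept column. The data below are made up.
import numpy as np

y  = np.array([3.0, 5.0, 7.0, 10.0, 11.0, 14.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 4.0])

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix [1, x1, x2]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # estimates b0, b1, b2
fitted = X @ beta
residuals = y - fitted
print(beta)
```

Because the model contains an intercept, the residuals from this fit always sum to zero.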

The fit of the model and the model assumptions can be checked graphically using the PLOT statement. This statement can be used to make all the relevant plots needed for the regression model. In the regression models we have discussed so far, it is assumed that the errors are independent and normally distributed with mean 0 and variance σ². After performing regression, it is necessary to check these assumptions by analyzing the residuals and studying a series of residual plots. To plot the residuals against the explanatory variables use the statement:

PLOT residual.*(x1 x2);

Note that residual. (the period is required) is the variable name for the residuals created by PROC REG. To plot the residuals against the predicted values we would use the statement:

PLOT residual.*predicted.;

Note that predicted. (the period is again required) is the variable name for the predicted values from the regression model.

The OUTPUT statement is used to produce a new data set containing the original data used in the regression model, as well as the predicted values and residuals. This new data set can, in turn, be used to produce further diagnostic plots and check the model fit. When using the OUTPUT statement, a number of options control the contents of the new data set. The statement:

OUTPUT out = outdata r = resid p = yhat;

creates a new data set named outdata which contains the residuals and predicted values. The residuals are given the name resid, and the predicted values the name yhat. The data set outdata can then be used to further study the residuals.
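A useful property of these two columns: least-squares residuals are, by construction, orthogonal to every explanatory variable, which is why residual-vs-x plots of a well-specified model show no linear trend. A Python sketch on made-up data illustrating what the resid and yhat columns contain:

```python
# Analogues of the OUTPUT statement's columns: a predicted value (p=yhat)
# and a residual (r=resid) for each observation. OLS residuals are
# orthogonal to each explanatory variable. Made-up data for illustration.
import numpy as np

y  = np.array([2.0, 4.0, 5.0, 8.0, 9.0, 13.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 3.0])
X  = np.column_stack([np.ones_like(x1), x1, x2])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat  = X @ beta        # analogue of the p=yhat column
resid = y - yhat        # analogue of the r=resid column
print(resid @ x1, resid @ x2)   # both are ~0
```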

Ex. Data was collected on 15 houses recently sold in a city. It consisted of the sales price (in $), house size (in square feet), the number of bedrooms, the number of bathrooms, the lot size (in square feet) and the annual real estate tax (in $).

The following program reads in the data and fits a multiple regression model with price as the response variable and size and lot as the explanatory variables. It also produces residual plots of the residuals against both explanatory variables as well as the predicted values.

DATA houses;
   INPUT tax bedroom bath price size lot;
   DATALINES;
 590 2   1   50000  770 22100
1050 3   2   85000 1410 12000
  20 3   1   22500 1060  3500
 870 2   2   90000 1300 17500
1320 3   2  133000 1500 30000
1350 2   1   90500  820 25700
2790 3 2.5  260000 2130 25000
 680 2   1  142500 1170 22000
1840 3   2  160000 1500 19000
3680 4   2  240000 2790 20000
1660 3   1   87000 1030 17500
1620 3   2  118600 1250 20000
3100 3   2  140000 1760 38000
2070 2   3  148000 1550 14000
 650 3 1.5   65000 1450 12000
;
RUN;

PROC REG data=houses;
   MODEL price = size lot;                  * Model statement;
   PLOT residual.*(predicted. size lot);    * Residual plots;
RUN;

This program gives rise to the following output:

                        Analysis of Variance

                               Sum of          Mean
Source             DF         Squares        Square    F Value    Pr > F
Model               2     44825992653   22412996326      19.10    0.0002
Error              12     14082023347    1173501946
Corrected Total    14     58908016000

Root MSE            34256    R-Square    0.7609
Dependent Mean     122140    Adj R-Sq    0.7211
Coeff Var         28.0468

Parameter Estimates

                 Parameter      Standard
Variable   DF     Estimate         Error    t Value    Pr > |t|
Intercept   1       -61969         32257      -1.92      0.0788
size        1     97.65137      18.16474       5.38      0.0002
lot         1      2.22295       1.13918       1.95      0.0748

We also obtained three residual plots, which are not shown here. The output can be used to test a variety of hypotheses about the model. For example, from the Parameter Estimates table we see that the coefficient corresponding to size is significant (p-value = 0.0002) when controlling for lot size. However, the coefficient corresponding to lot size is not significant (p-value = 0.0748) when controlling for house size.
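As a cross-check outside SAS, the same coefficient estimates can be reproduced by ordinary least squares on the 15 houses. This is only a verification sketch in Python, not part of the SAS workflow:

```python
# Reproduce the PROC REG estimates for MODEL price = size lot by
# ordinary least squares on the same 15 houses from the DATA step.
import numpy as np

size = np.array([770, 1410, 1060, 1300, 1500, 820, 2130, 1170,
                 1500, 2790, 1030, 1250, 1760, 1550, 1450], float)
lot  = np.array([22100, 12000, 3500, 17500, 30000, 25700, 25000, 22000,
                 19000, 20000, 17500, 20000, 38000, 14000, 12000], float)
price = np.array([50000, 85000, 22500, 90000, 133000, 90500, 260000, 142500,
                  160000, 240000, 87000, 118600, 140000, 148000, 65000], float)

X = np.column_stack([np.ones(15), size, lot])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
print(beta)   # intercept ~ -61969, size ~ 97.65, lot ~ 2.22
```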

B. Testing a subset of variables using a partial F-test

Sometimes we are interested in simultaneously testing whether a certain subset of the coefficients are all equal to 0 (e.g. β₃ = β₄ = 0). We can do this using a partial F-test. This test involves comparing the SSE from a reduced model (excluding the parameters we are testing) with the SSE from the full model (including all of the parameters).

We can perform a partial F-test in PROC REG by including a TEST statement. For example, the statement

TEST var1=0, var2=0;

tests the null hypothesis that the regression coefficients corresponding to var1 and var2 are both equal to 0. Note that any number of variables can be included in the TEST statement.

Ex. Housing data cont.

Suppose we include the variables bedroom, bath and size in our model and are interested in testing whether the number of bedrooms and bathrooms are significant after taking size into consideration. The following program performs the partial F-test:

PROC REG data=houses;
   MODEL price = bedroom bath size;    * Model statement;
   TEST bath=0, bedroom=0;             * Partial F-test;
RUN;

This gives rise to the following output:

                        Analysis of Variance

                               Sum of          Mean
Source             DF         Squares        Square    F Value    Pr > F
Model               3     43908504107   14636168036      10.73    0.0013
Error              11     14999511893    1363591990
Corrected Total    14     58908016000

Root MSE            36927    R-Square    0.7454
Dependent Mean     122140    Adj R-Sq    0.6759
Coeff Var         30.2332

Parameter Estimates

                 Parameter      Standard
Variable   DF     Estimate         Error    t Value    Pr > |t|
Intercept   1        27923         56306       0.50      0.6297
bedroom     1       -35525         25037      -1.42      0.1836
bath        1   2269.34398         22209       0.10      0.9205
size        1    130.79392      36.20864       3.61      0.0041

Test 1 Results for Dependent Variable price

                          Mean
Source        DF        Square    F Value    Pr > F
Numerator      2    1775487498       1.30    0.3108
Denominator   11    1363591990

The final table shows the results of the partial F-test. Since F = 1.30 (p-value = 0.3108) we cannot reject the null hypothesis that the coefficients of bedroom and bath are both equal to 0. It appears that bedroom and bath do not contribute significant information about the sales price once size has been taken into consideration.
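The numbers in the Test 1 table fit the partial F formula F = [(SSE_reduced − SSE_full)/q] / (SSE_full/df_full). The Python sketch below backs the reduced-model SSE out of the printed numerator mean square (q = 2 dropped terms) and recomputes F:

```python
# Partial F-test: F = ((SSE_reduced - SSE_full)/q) / (SSE_full/df_full).
# From the printed output: SSE_full = 14999511893 on df_full = 11, and the
# numerator mean square 1775487498 equals (SSE_reduced - SSE_full)/q
# for the q = 2 dropped terms (bedroom and bath).
sse_full = 14999511893.0
df_full  = 11
q        = 2
sse_reduced = sse_full + q * 1775487498.0   # back out the reduced-model SSE

F = ((sse_reduced - sse_full) / q) / (sse_full / df_full)
print(round(F, 2))   # ~1.30, matching the "Test 1 Results" table
```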

C. Model Selection

Often we have data on a large number of explanatory variables and wish to construct a regression model using some subset of them. The use of a subset will make the resulting model easier to interpret and more manageable, especially if more data is to be collected in the future. Unnecessary terms in the model may also yield less precise inference.

One approach to model selection is to consider all possible subsets of the pool of explanatory variables and find the model that best fits the data according to some criterion. Different criteria may be used to select the best model, such as adjusted R² or Mallows' Cp. These criteria assign a score to each model and allow us to choose the model with the best score.

In SAS we can perform model selection using Mallows' Cp by including the selection option in the MODEL statement. The statement:

MODEL y = x1 x2 x3 x4 x5 /selection = cp;

steps through each possible model consisting of a subset of the 5 explanatory variables (x1, x2, x3, x4 and x5) and calculates a Cp score for each one. Thereafter it chooses the model that minimizes the score.

Ex. Housing data continued.

To use Mallows' Cp to determine which subset of the 5 possible explanatory variables best models the data, we can use the following code:

PROC REG data=houses;
   MODEL price = tax bedroom bath size lot / selection = cp;
RUN;

This gives rise to the following output:

Number in
  Model       C(p)    R-Square    Variables in Model
     3      2.3274      0.8115    tax bedroom size
     2      2.7395      0.7628    tax size
     2      2.8314      0.7609    size lot
     3      3.0608      0.7967    bedroom size lot
     2      3.6142      0.7451    bedroom size
     3      3.9514      0.7787    tax size lot
     4      4.0001      0.8182    tax bedroom size lot
     1      4.1292      0.6943    tax
     3      4.1942      0.7738    bath size lot
     4      4.3138      0.8118    tax bedroom bath size
     3      4.4539      0.7686    tax bath size
     1      4.5857      0.6851    size
     2      4.9046      0.7191    tax bath
     4      4.9963      0.7980    bedroom bath size lot
     4      5.5470      0.7869    tax bath size lot
     3      5.6023      0.7454    bedroom bath size
     2      5.9088      0.6988    bath size
     5      6.0000      0.8182    tax bedroom bath size lot
     2      6.0115      0.6967    tax bedroom
     2      6.1249      0.6944    tax lot
     3      6.8608      0.7199    tax bath lot
     3      6.8755      0.7196    tax bedroom bath
     3      7.9613      0.6977    tax bedroom lot
     4      8.8538      0.7201    tax bedroom bath lot
     3     13.7814      0.5801    bedroom bath lot
     2     16.2074      0.4907    bath lot
     2     18.5511      0.4433    bedroom bath
     1     20.4532      0.3645    bath
     2     23.5548      0.3422    bedroom lot
     1     29.3255      0.1852    lot
     1     31.1563      0.1482    bedroom

From the output we see that the model with tax, bedroom and size minimizes the Cp criterion.
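The statistic itself is Cp = SSE_p / MSE_full − (n − 2p), where p counts the parameters in the candidate model including the intercept and MSE_full comes from the model with all variables. The Python sketch below checks the "size lot" row, approximating MSE_full from the full model's rounded R-Square, so the result is only accurate to a couple of decimals:

```python
# Mallows' Cp = SSE_p / MSE_full - (n - 2p), p = parameters incl. intercept.
# Check the "size lot" row: its SSE appears in the first ANOVA table, and
# MSE_full is approximated from the full model's R-Square (0.8182) and the
# corrected total SS, so this only matches the output approximately.
n   = 15
sst = 58908016000.0
sse_size_lot = 14082023347.0        # Error SS for MODEL price = size lot
sse_full = (1 - 0.8182) * sst       # full model: tax bedroom bath size lot
mse_full = sse_full / (n - 6)       # 6 parameters in the full model
p = 3                               # intercept + size + lot
cp = sse_size_lot / mse_full - (n - 2 * p)
print(cp)   # ~2.83, close to the 2.8314 in the selection output
```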

When the number of explanatory variables is large it is often not feasible to fit all possible models. It is instead more efficient to use a search algorithm to find the best model. A number of such search algorithms exist. They include: forward selection, backward elimination and stepwise regression.

In SAS we can perform model selection using these algorithms by including the selection option in the MODEL statement. The statement:

MODEL y = x1 x2 x3 x4 /selection = forward;

fits the best model using forward selection. The other algorithms can be used by exchanging forward with either backward or stepwise in the MODEL statement.
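The idea behind forward selection can be sketched in a few lines of Python: start from the intercept-only model and repeatedly add the candidate variable that most improves the fit. (SAS's implementation uses an F-test entry criterion with a significance-level cutoff; the fixed SSE-drop threshold below is a simplification for illustration.)

```python
# Simplified forward selection: at each step add the variable that most
# reduces SSE, stopping when no candidate reduces SSE by at least min_drop.
import numpy as np

def sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_select(cols, y, min_drop=1.0):
    # cols: dict of name -> 1-D array of predictor values
    chosen, remaining = [], dict(cols)
    current = np.ones((len(y), 1))          # start from intercept-only model
    best_sse = sse(current, y)
    while remaining:
        name, cand_sse = min(
            ((nm, sse(np.column_stack([current, x]), y))
             for nm, x in remaining.items()),
            key=lambda t: t[1])
        if best_sse - cand_sse < min_drop:
            break                           # no candidate helps enough
        current = np.column_stack([current, remaining.pop(name)])
        chosen.append(name)
        best_sse = cand_sse
    return chosen

# Tiny made-up demo: y depends on x1 and x2 but not on the pure-noise x3.
rng = np.random.default_rng(0)
x1, x2, x3 = rng.normal(size=(3, 40))
y = 2 * x1 - 3 * x2 + rng.normal(scale=0.1, size=40)
chosen_vars = forward_select({"x1": x1, "x2": x2, "x3": x3}, y, min_drop=0.5)
print(chosen_vars)   # x1 and x2 enter; x3 is left out
```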

D. Variance Inflation Factor

In multiple regression one would like the explanatory variables to be highly correlated with the response variable. However, it is not desirable for the explanatory variables to be correlated with one another. Multicollinearity exists when two or more of the explanatory variables used in the regression model are moderately or highly correlated. The presence of a high degree of multicollinearity among the explanatory variables can result in the following problems: (i) the standard deviations of the regression coefficients may be disproportionately large, (ii) the coefficient estimates are unstable, and (iii) the regression coefficients may not be interpretable.

A method for detecting the presence of multicollinearity is the variance inflation factor (VIF). A large VIF (>10) is taken as an indication that multicollinearity may be influencing the estimates. Including the VIF option in the MODEL statement prints the variance inflation factors for each of the explanatory variables.

Ex. Housing data continued.

If we want to determine whether there is any multicollinearity present in our model we add the VIF option to the MODEL statement as shown below:

PROC REG data=houses;
   MODEL price = size lot / VIF;
RUN;

This gives rise to the same output as in the previous example. The only difference is an additional column in the Parameter Estimates section showing each explanatory variable's VIF.

Parameter Estimates

                 Parameter      Standard                              Variance
Variable   DF     Estimate         Error    t Value    Pr > |t|     Inflation
Intercept   1       -61969         32257      -1.92      0.0788             0
size        1     97.65137      18.16474       5.38      0.0002       1.03891
lot         1      2.22295       1.13918       1.95      0.0748       1.03891

Note that all the VIF values are rather small, so there do not appear to be any problems with multicollinearity in this example.
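With only two explanatory variables, both VIFs equal 1/(1 − r²), where r is the sample correlation between them (regressing either variable on the other gives R² = r²). The Python sketch below verifies the 1.03891 printed above from the housing data:

```python
# With two explanatory variables the VIF is 1 / (1 - r^2), where r is their
# sample correlation; this recovers the 1.03891 reported for size and lot.
import numpy as np

size = np.array([770, 1410, 1060, 1300, 1500, 820, 2130, 1170,
                 1500, 2790, 1030, 1250, 1760, 1550, 1450], float)
lot  = np.array([22100, 12000, 3500, 17500, 30000, 25700, 25000, 22000,
                 19000, 20000, 17500, 20000, 38000, 14000, 12000], float)

r = np.corrcoef(size, lot)[0, 1]
vif = 1.0 / (1.0 - r ** 2)
print(vif)   # ~1.0389 for both variables
```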