
SAS: Running a Lasso Regression Analysis

Lasso regression is a shrinkage and variable selection method for linear regression models. Its goal is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes the regression coefficients for some variables to shrink toward zero. Variables whose regression coefficients are exactly zero after shrinkage are excluded from the model; variables with non-zero coefficients are those most strongly associated with the response variable. Explanatory variables can be quantitative, categorical, or both.
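The shrinkage-to-zero behavior described above can be illustrated with a small sketch (in Python with scikit-learn rather than SAS, on synthetic data; the variable counts and penalty value are assumptions for the example, not the assignment's values):

```python
# Illustrative sketch: fit a lasso on synthetic data where only two of
# seven predictors matter, and show that the lasso shrinks the
# coefficients of the irrelevant predictors exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))                              # 7 candidate predictors
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=200)   # only 2 matter

model = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(model.coef_ == 0))
print("coefficients:", np.round(model.coef_, 2))
print("predictors excluded (coef == 0):", n_zero)
```

Predictors whose coefficients hit exactly zero are the ones the lasso drops from the model.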
In this assignment I used the Gapminder data set to run a lasso regression analysis. All variables used were quantitative.
The response variable was lifeexpectancy.
The predictor variables were alcconsumption, employrate, co2emissions, incomeperperson,
oilperperson, suicideper100th and urbanrate.
As the code below shows, the data were randomly split into a training set (70% of the
observations) and a test set (30%).
The least angle regression (LARS) algorithm with k=10 fold cross-validation was used to estimate the
lasso regression model on the training set, and the model was validated using the test set.
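The same workflow can be sketched in Python with scikit-learn, as a rough analogue of the SAS procedure below: a random 70/30 split, then a lasso fit by the LARS algorithm with 10-fold cross-validation on the training set. The data here are synthetic stand-ins for the Gapminder variables, not the actual data set:

```python
# Rough Python analogue of the SAS workflow: 70/30 random split, then
# LassoLarsCV (lasso via LARS, penalty chosen by k=10 fold CV) fit on
# the training set and validated on the test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(123)
X = rng.normal(size=(300, 7))                       # 7 predictors, as in the assignment
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=123)

model = LassoLarsCV(cv=10).fit(X_train, y_train)    # k=10 fold CV over the LARS path
print("test R^2:", round(model.score(X_test, y_test), 3))
```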

Code
libname mydata "/courses/d1406ae5ba27fe300" access=readonly;

DATA new; set mydata.gapminder;
keep country alcconsumption employrate co2emissions incomeperperson lifeexpectancy
oilperperson suicideper100th urbanrate;
* delete observations with missing data;
if cmiss(of _all_) then delete;
run;

ods graphics on;

* split data randomly into training (70%) and test (30%) sets;
proc surveyselect data=new out=traintest seed=123
samprate=0.7 method=srs outall;
run;

* lasso multiple regression with the LARS algorithm and k=10 fold cross-validation;
proc glmselect data=traintest plots=all seed=123;
partition ROLE=selected(train="1" test="0");
model lifeexpectancy = alcconsumption employrate co2emissions incomeperperson
oilperperson suicideper100th urbanrate / selection=lar(choose=cv stop=none)
cvmethod=random(10);
run;

Summary of Findings

The best model retained 4 of the 7 predictor variables: incomeperperson, urbanrate,
suicideper100th and alcconsumption. The most important predictor of life expectancy
was incomeperperson, followed by urbanrate.

The CV PRESS column shows the sum of the residual sums of squares across the
cross-validation test sets. There is an asterisk at step 4: this is the model the
procedure selected as best. It is the model with the lowest cross-validated residual
sum of squares, and adding further variables increases this value, from 681.1791 to
780.9117 at the next step.
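The idea behind CV PRESS can be sketched as follows (a Python illustration on synthetic data, not the assignment's numbers): predict each observation from a model fit on the other cross-validation folds, then sum the squared prediction residuals.

```python
# Sketch of a cross-validated PRESS statistic: sum of squared residuals
# on held-out folds. Data and penalty value are illustrative assumptions.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5))
y = X[:, 0] + rng.normal(size=150)

# each observation is predicted by a model fit on the other 9 folds
pred = cross_val_predict(Lasso(alpha=0.1), X, y, cv=10)
press = float(np.sum((y - pred) ** 2))
print("CV PRESS:", round(press, 2))
```

The model (here, the penalty value) with the lowest such sum is the one a CV-based selection rule would mark with the asterisk.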

The Coefficient Progression plot for lifeexpectancy shows that incomeperperson had the
largest regression coefficient. The plot shows the relative importance of the predictor
selected at each step of the selection process and how the regression coefficients
changed as each new predictor was added.
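The curves in that plot come from the LARS path, which can be computed directly; a Python sketch on synthetic data (stand-ins for the Gapminder variables) shows how the entry order reflects predictor importance:

```python
# Sketch of the LARS/lasso coefficient path behind a Coefficient
# Progression plot: predictors enter the model one at a time as the
# penalty relaxes, strongest association first.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 7))
y = 3.0 * X[:, 0] + 1.0 * X[:, 3] + rng.normal(size=200)

# coefs has one row per predictor; each column is one step along the path
alphas, active, coefs = lars_path(X, y, method="lasso")
print("entry order of predictors:", list(active))
```

Plotting each row of `coefs` against the penalty reproduces the progression curves.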

Positively associated with life expectancy were incomeperperson, alcconsumption,
urbanrate and employrate; negatively associated were oilperperson, co2emissions
and suicideper100th.

The final plot shows the change in the average squared error (ASE) at each step in the
process. The model had a lower ASE (less prediction error) in the test data set and was
fairly accurate in predicting.

Also, the output shows the R-Square and adjusted R-Square for the selected model and the
mean square error for both the training and test data. It also shows the estimated regression
coefficients for the selected model.
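Those training and test error measures can be sketched in Python as well (on synthetic stand-in data; the split and seed are illustrative assumptions):

```python
# Sketch of reporting training vs. test mean squared error and R-square
# for a lasso model selected by LARS with 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 7))
y = 2.0 * X[:, 1] + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=4)
model = LassoLarsCV(cv=10).fit(X_tr, y_tr)

mse_train = mean_squared_error(y_tr, model.predict(X_tr))
mse_test = mean_squared_error(y_te, model.predict(X_te))
print("train MSE:", round(mse_train, 3), " test MSE:", round(mse_test, 3))
print("test R-square:", round(r2_score(y_te, model.predict(X_te)), 3))
```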
