
Subset Selection

Case: Package Pricing at Mission Hospital


To develop a model that could assist Mission Hospital in predicting the
package price at the time of admission.
The possible set of predictors includes a number of demographic
variables as well as a number of health indicators.
We remove all rows with missing information, resulting in a data set
of 191 observations.
Next, we randomly divide the data set into two parts: a training set
containing 150 observations and a test set containing 41
observations.
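The cleaning and splitting steps above can be sketched as follows. This is a minimal illustration on synthetic data (the actual case file and its columns are not reproduced here); the column names are hypothetical, but the row counts match the case: 191 complete observations split into 150 for training and 41 for testing.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the case data: 200 rows, a few with missing values.
# The column names here are hypothetical, not from the Mission Hospital file.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=200).astype(float),
    "total_cost": rng.normal(100000, 20000, size=200),
})
df.loc[rng.choice(200, size=9, replace=False), "age"] = np.nan

df = df.dropna()                           # remove rows with missing information
train = df.sample(n=150, random_state=1)   # 150 training observations
test = df.drop(train.index)                # remaining 41 observations for testing
```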

Subset Selection
When we have a large number of predictors in the model, there will
in general be many predictors that have little or no effect on the
response variable.
Leaving the insignificant variables in the model makes it harder to see
the big picture, i.e., the effect of the important variables.
The model becomes easier to interpret when the unimportant
variables are removed.
Prediction accuracy also generally improves with this simpler
model.

Subset Selection
The idea is to identify a subset of all predictors that are related to the
response variable, and then fit the model using only this subset.
We will discuss following two approaches for subset selection:
Best Subset Selection
Stepwise Selection

Best Subset Selection


In this approach, we run a linear regression for each possible
combination of the predictors (X).
How do we judge which subset is the best?
One simple approach is to take the subset with the smallest RSS or
the largest R².
Unfortunately, one can show that the model that includes all the
variables will always have the largest R² (and smallest RSS).


Best Subset Selection


To compare different models, we can use other criteria:

Adjusted R²

AIC (Akaike information criterion)

BIC (Bayesian information criterion)

For least squares models, Cp and AIC are proportional to each other.
These methods add a penalty to RSS for the number of variables (i.e.,
complexity) in the model.

Best Subset Selection


A small value of AIC or BIC indicates a low error, and thus a better
model.
A large value of Adjusted R² indicates a better model.
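For least squares fits, these criteria can be computed directly from the residual sum of squares. The functions below are a sketch using the standard Gaussian-likelihood forms (up to additive constants); the function names are illustrative, not from any library.

```python
import numpy as np

def criteria(rss, n, d):
    """AIC and BIC for a least squares fit with d predictors.

    rss: residual sum of squares; n: number of observations.
    Both criteria share the same error term; BIC's complexity
    penalty (log(n) per variable) is heavier than AIC's (2 per
    variable) whenever n > 7, so BIC favors smaller models.
    """
    aic = n * np.log(rss / n) + 2 * d           # lower is better
    bic = n * np.log(rss / n) + np.log(n) * d   # lower is better
    return aic, bic

def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2: higher is better; tss is the total sum of squares."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```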


Stepwise Selection
Best Subset Selection is computationally intensive, especially when we
have a large number of predictors (large p).
More attractive methods:
Forward Stepwise Selection: begins with the model containing no predictors,
then adds the one predictor at a time that improves the model the most, until
no further improvement is possible.
Backward Stepwise Selection: begins with the model containing all
predictors, then deletes the one predictor at a time whose removal improves
the model the most, until no further improvement is possible.

Forward Regression

[Regression output shown on the slides; not reproduced here]

Backward Regression

[Regression output shown on the slides; not reproduced here]

Best Models

Method                  Criterion     No. of Variables   Test Error
Best Subset Selection   Adjusted R²   18                 0.132
                        AIC           14                 0.136
                        BIC           —                  0.145
Forward Selection       Adjusted R²   18                 0.126
                        AIC           13                 0.128
                        BIC           —                  0.153
Backward Selection      Adjusted R²   18                 0.196
                        AIC           14                 0.188
                        BIC           —                  0.194

Plot of Total Cost and Predicted Values for Model 4 for Test Data Set

[Plot shown on the slides; not reproduced here]
