Chapter 7: Model Selection

What is a good model?


Fits the observed data
Minimizes the residual sum of squares (see the formula below)

Does not overfit the data


Capable of making predictions for new, unseen observations

Explanatory/Predictive power
Models with weaker predictive power may nonetheless have a more natural causal interpretation

Computational complexity
Commonly underestimated

For the purpose of this course, we will focus on the first two points.
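For reference, the residual sum of squares is the quantity minimized by least squares:

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where \hat{y}_i denotes the fitted value for observation i.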

Model building
In most cases, especially with observational data, we do not control the
levels of the regressor variables
Oftentimes, there are many potential explanatory variables measured
along with the response
The goal is to build a model to make predictions, or simply to
understand which explanatory variables are influential

Challenges
The purpose of the model may be unclear
Predict or estimate?

In observational studies, the available explanatory variables may be
questionable
Variables may be given in a metric that makes model building
very difficult
The presence of high-leverage cases, outliers and influential
observations
The real world and the idealized world are different

Likelihood ratio statistic


Suppose we have data y = (y_1, y_2, \ldots, y_n) such that the observations are iid
and come from an unknown distribution
Further assume that there are two competing models:
M_1, with likelihood function L_1 = \prod_{i=1}^{n} f_1(y_i)
M_2, with likelihood function L_2 = \prod_{i=1}^{n} f_2(y_i)
The likelihood ratio statistic is

\Lambda = \frac{L_1}{L_2}

Values of \Lambda > 1 favour the model in the numerator (M_1), whereas
values of \Lambda < 1 favour the model in the denominator (M_2)
It is a good rule of thumb to select M_1 when \Lambda > 1, and to select M_2
when \Lambda < 1
However, models with more parameters have more flexibility to
maximize the likelihood function
As a result, models with more parameters are favoured by the likelihood ratio statistic
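To see this in R, here is a minimal sketch; the mtcars data and the particular models are illustrative assumptions, not part of the course material:

# Two competing linear models for the same response
m1 <- lm(mpg ~ wt, data = mtcars)        # M1: weight only
m2 <- lm(mpg ~ wt + hp, data = mtcars)   # M2: weight and horsepower

# Likelihood ratio Lambda = L1 / L2, computed on the log scale for stability
lambda <- exp(as.numeric(logLik(m1)) - as.numeric(logLik(m2)))
lambda   # less than 1: the larger model M2 attains the higher likelihood

Because M1 is nested in M2, the larger model always attains at least as high a likelihood, which is exactly the caveat above.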

Model selection procedures


In general, there are two types of model selection techniques:
1. Automated selection
Forward selection
Backward selection
Stepwise selection

2. Manual selection
Question 3 of Assignment 2

What we have
No prior knowledge/information
Response variable
y = (y_1, y_2, \ldots, y_n)
Covariates (including interaction terms)
x = (x_1, x_2, \ldots, x_p)

Forward selection
1. Start with a model with only the intercept (the null model)
M_0: y ~ 1
This model is the base model.
2. For each covariate x_i not yet in the model, fit
M_{1,i}: y ~ 1 + x_i
3. Pick the model M_{1,i} with the smallest p-value of the F-test against the base model
If that p-value is greater than \alpha, then the base model is the final model
Otherwise, the model M_{1,i} becomes the new base model

4. Repeat 2 and 3 until the final model is obtained
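A sketch of forward selection in R (the data set and covariate names are illustrative). Note that add1() reports the F-tests described above, while step() automates the search but ranks models by AIC rather than by F-test p-values:

# Step 1: intercept-only null model
null_model <- lm(mpg ~ 1, data = mtcars)

# Steps 2-3: F-tests for adding each candidate covariate to the base model
add1(null_model, scope = ~ wt + hp + qsec, test = "F")

# Automated forward selection (AIC-based)
forward_fit <- step(null_model, scope = ~ wt + hp + qsec,
                    direction = "forward")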

Backward selection
1. Start with the full model containing all covariates
M_p: y ~ 1 + x_1 + x_2 + \ldots + x_p
This model is the base model.
2. For each covariate x_i, remove it and fit
M_{p-1,i}: y ~ 1 + x_1 + \ldots + x_{i-1} + x_{i+1} + \ldots + x_p
If all the p-values of the F-tests are smaller than \alpha, then the base model is the final model

3. Pick the model M_{p-1,i} with the largest p-value of the F-test
The model M_{p-1,i} becomes the new base model

4. Repeat 2 and 3 until the final model is obtained
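The corresponding sketch in R, again with illustrative names; drop1() reports the F-tests, while step() automates the elimination using AIC:

# Step 1: full model with all candidate covariates
full_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Steps 2-3: F-tests for dropping each covariate from the base model
drop1(full_model, test = "F")

# Automated backward elimination (AIC-based)
backward_fit <- step(full_model, direction = "backward")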

Forward or backward?
The algorithms do not necessarily produce the same results
In general, the backward elimination method tends to perform better
Suppose the best combination consists of two covariates (x_1, x_2), but another covariate, say x_3,
is the most significant covariate when only the intercept is present

The forward selection method will propose \hat{y} = \hat{\beta}_0 + \hat{\beta}_3 x_3, and may never reach the better pair (x_1, x_2); the simulation at the end of this section illustrates this

The stepwise selection method is a combination of the forward and backward selections
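A small simulation in R illustrating the forward-selection trap; the data-generating process below is invented purely for illustration:

# y depends on x1 and x2; x3 is a noisy proxy for their sum
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.5)
y  <- x1 + x2 + rnorm(n)

# Marginally, x3 correlates with y more strongly than x1 or x2 alone,
# so the first forward-selection step tends to pick x3
add1(lm(y ~ 1), scope = ~ x1 + x2 + x3, test = "F")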

Stepwise selection
1. Start with a base model of your choice (somewhat in between the
full and null model)
Model with only the main effects

2. Add a covariate using the forward selection technique
3. Remove a covariate using the backward selection technique
4. Repeat 2 and 3 until no covariate can be added or removed

Note: A covariate may be added or removed multiple times


In R, all 3 techniques can be performed using the function step()
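For example (with illustrative model formulas), the direction argument of step() selects the technique: "forward", "backward", or "both" for stepwise:

# Base model with the main effects only
base_model <- lm(mpg ~ wt + hp, data = mtcars)

# Stepwise search between the null model and a full model with an interaction
stepwise_fit <- step(base_model,
                     scope = list(lower = ~ 1, upper = ~ wt * hp + qsec),
                     direction = "both")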
