Data Pre-processing
Over-fitting
Model Tuning
Data Splitting
Some Further Considerations
Debanjan Mitra
1 Data Pre-processing
Data Transformations
Dealing with Missing Values
2 Over-fitting
3 Model Tuning
4 Data Splitting
Resampling: Cross-validation
Resampling: Bootstrap
Oversampling
Data Pre-processing
Required for models that need all predictors to be on the same scale
To center a predictor, we subtract its average value from all the
values of that predictor
To scale, we divide the centered values by the standard deviation of
that predictor
This improves the numerical stability of the model under consideration
The only downside is a reduction in the interpretability of the
individual values
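The two operations above can be sketched in a few lines of numpy (the predictor values are illustrative):

```python
import numpy as np

# Hypothetical predictor values
x = np.array([2.0, 4.0, 6.0, 8.0])

centered = x - x.mean()       # center: subtract the average value
scaled = centered / x.std()   # scale: divide by the standard deviation

# The transformed predictor has mean 0 and standard deviation 1
print(scaled.mean(), scaled.std())
```

After this transformation every predictor is on the same unitless scale, which is what distance- and gradient-based models require.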
For large datasets, a predictor with too many missing values may
simply be removed, provided the missingness is not informative
Otherwise, missing values should be imputed
Imputation is another layer of modeling, where we estimate values
of a predictor based on the other predictor variables
Usually, the training set is used to build the imputation model
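A minimal numpy sketch of the simplest imputation model, column means fit on the training set only (the data matrices are illustrative; a fuller model would regress each predictor on the others):

```python
import numpy as np

# Toy training and test matrices; np.nan marks missing entries
train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 14.0]])
test = np.array([[np.nan, 12.0]])

# "Fit" the imputation model on the training set: per-column means
col_means = np.nanmean(train, axis=0)

def impute(X, means):
    X = X.copy()
    rows, cols = np.where(np.isnan(X))   # locations of missing values
    X[rows, cols] = means[cols]          # fill with training-set column means
    return X

train_filled = impute(train, col_means)
test_filled = impute(test, col_means)   # same means reused on new data
```

The key point the code illustrates is that the means are learned once from the training set and then applied unchanged to any new data.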
The problem of overfitting occurs when the model fits not only
the pattern but also the noise in the data
The immediate outcome is a very impressive result on the training
data (i.e., low training error) but a very poor result on the test data
(i.e., high test error)
Different models face this problem in different ways, with the same
end result: high test error
There are different methods to overcome this problem for different
models (to be discussed for each model we consider in this course)
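A small numpy sketch of the phenomenon (the data and seed are illustrative): a degree-9 polynomial has enough freedom to fit the noise in ten training points, driving the training error toward zero, while a simple line matches the true pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative noisy data: the true pattern is y = x, plus noise
x_train = np.linspace(0, 1, 10)
y_train = x_train + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = x_test + rng.normal(scale=0.2, size=10)

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

simple = np.polyfit(x_train, y_train, 1)    # matches the true pattern
flexible = np.polyfit(x_train, y_train, 9)  # can fit the noise exactly

# Training error can only drop as the model becomes more flexible,
# but a low training error says nothing about the test error
print("train:", mse(simple, x_train, y_train), mse(flexible, x_train, y_train))
print("test: ", mse(simple, x_test, y_test), mse(flexible, x_test, y_test))
```

Running this, the flexible fit wins decisively on the training data while the test error typically tells the opposite story.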
Model Tuning
Data Splitting
When the dataset is small, the test set may be avoided, as one may
need the entire dataset to train the model well
In such cases, the performance of the model may be evaluated using
resampling techniques
We now present two commonly used resampling techniques
k-Fold Cross-validation
Bootstrap
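Both techniques can be sketched with index arrays in numpy (the sample size, number of folds, and seed are illustrative; model fitting itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 3  # sample size and number of folds

# k-fold cross-validation: shuffle the indices and split them into
# k disjoint folds; each fold serves once as the held-out set
indices = rng.permutation(n)
folds = np.array_split(indices, k)
for i, held_out in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit the model on train_idx, evaluate on held_out, average the scores

# Bootstrap: draw n indices with replacement; the out-of-bag rows
# (those never drawn) can serve as a validation set
boot = rng.integers(0, n, size=n)
oob = np.setdiff1d(np.arange(n), boot)
```

Every observation is held out exactly once under cross-validation, whereas a bootstrap sample repeats some rows and leaves others out entirely.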
Oversampling
How to Oversample?
1 Separate records with response 1 and response 0 into two distinct
sets, called strata.
2 Randomly select records from each stratum for the training data.
Typically, one might select half of the 1s and an equal number of 0s.
3 The remaining 1s are put in the validation set.
4 0s are randomly selected for the validation set in sufficient numbers
to maintain the original ratio of 1s to 0s.
5 If a test set is required, it can be randomly taken from the validation
set.
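The steps above can be sketched with index arrays (the imbalanced response below, 10 ones against 90 zeros, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative imbalanced response: 10 ones, 90 zeros
y = np.array([1] * 10 + [0] * 90)

# Step 1: separate the two strata
ones = np.where(y == 1)[0]
zeros = np.where(y == 0)[0]

# Step 2: half of the 1s, and an equal number of 0s, go to training
n_half = len(ones) // 2
train_ones = rng.choice(ones, size=n_half, replace=False)
train_zeros = rng.choice(zeros, size=n_half, replace=False)
train_idx = np.concatenate([train_ones, train_zeros])

# Step 3: the remaining 1s go to the validation set
valid_ones = np.setdiff1d(ones, train_ones)

# Step 4: draw 0s so the validation set keeps the original 1:0 ratio
ratio = len(zeros) / len(ones)
remaining_zeros = np.setdiff1d(zeros, train_zeros)
valid_zeros = rng.choice(remaining_zeros,
                         size=int(len(valid_ones) * ratio), replace=False)
valid_idx = np.concatenate([valid_ones, valid_zeros])
```

The training set ends up balanced (5 ones, 5 zeros), while the validation set preserves the original 1:9 class ratio.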
Oversampling: Cautions
Oversampled data can be used to train models, but it is often not
suitable for evaluating model performance
To get an unbiased estimate of model performance, the model
should be applied to regular data (i.e., not oversampled data)
If it is absolutely necessary to test model performance on
oversampled validation data, the results must be adjusted (to be
discussed in detail later)
Reading