

Some General Strategies for Data Analysis

Debanjan Mitra

Indian Institute of Management Udaipur

June 18, 2019



Contents

1 Data Pre-processing
Data Transformations
Dealing with Missing Values

2 Over-fitting

3 Model Tuning

4 Data Splitting
Resampling: Cross-validation
Resampling: Bootstrap
Oversampling

5 Some Further Considerations


Data Pre-processing

Data pre-processing generally includes transformations and cleaning


The need for, and the type of, data pre-processing depends on the type of model being used
For example, tree-based models are fairly insensitive to the characteristics of the predictors
However, regression-based methods are heavily affected by the characteristics of the predictors
One may pre-process the data for both supervised and unsupervised learning


Data Transformations for Individual Variables

Needed because some modeling techniques may have strict requirements on the scale or distribution of the predictors
Common transformations include centering, scaling, and transformations to reduce skewness


Data Transformations: Centering and Scaling

Required for models that need all predictors to be in the same units
To center a predictor, we subtract its average value from all values of that predictor
To scale, we divide the centered values by the standard deviation of that predictor
This improves the numerical stability of the model under consideration
The only downside is a reduction in interpretability of the individual values
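A minimal sketch in Python (assuming a NumPy array X with samples in rows and predictors in columns; the function name is illustrative):

```python
import numpy as np

def center_and_scale(X):
    """Center each predictor (column) to mean 0 and scale it to unit standard deviation."""
    mean = X.mean(axis=0)          # per-predictor average
    sd = X.std(axis=0, ddof=1)     # per-predictor sample standard deviation
    return (X - mean) / sd

# Example: three samples, two predictors on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Z = center_and_scale(X)            # each column of Z now has mean 0 and sd 1
```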


Data Transformations: Resolving Skewness


For regression-based methods, resolving skewness for the response
variable may be necessary
General rule of thumb: a variable whose ratio of the highest value to
the lowest value is greater than 20 has significant skewness
Skewness can be identified from the skewness statistic as well, given by
$$ sk = \frac{m_3}{m_2^{3/2}}, \qquad \text{where} \quad m_3 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^3, \quad m_2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2 $$

If sk is close to zero, the distribution of the variable is roughly symmetric
For right-skewed distributions sk is a large positive number; for left-skewed distributions it is a large negative number
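As a minimal sketch (a direct transcription of the formula above for a 1-D NumPy array x; note that library routines such as scipy.stats.skew use slightly different normalizations):

```python
import numpy as np

def skewness(x):
    """Sample skewness sk = m3 / m2^(3/2), with m2 and m3 as defined above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = x - x.mean()
    m3 = (dev ** 3).sum() / (n - 1)
    m2 = (dev ** 2).sum() / (n - 1)
    return m3 / m2 ** 1.5

x = np.random.lognormal(size=1000)   # a right-skewed sample
print(skewness(x))                   # a large positive number
```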

Data Transformations: Resolving Skewness

Replacing the variable by its log, square root, or inverse transform may reduce skewness
Alternatively, the Box-Cox transformation may be used for predictors with values greater than zero
Box-Cox transformation:
$$ x^* = \frac{x^\lambda - 1}{\lambda} \quad \text{if } \lambda \neq 0, \qquad x^* = \log(x) \quad \text{if } \lambda = 0. $$

The value of λ can be estimated from the training data, typically by maximum likelihood
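A minimal sketch using SciPy (scipy.stats.boxcox estimates λ by maximum likelihood when no value is supplied; the log-normal sample is illustrative):

```python
import numpy as np
from scipy import stats

x = np.random.lognormal(size=1000)     # strictly positive, right-skewed data

# Returns the transformed values and the maximum-likelihood estimate of lambda
x_star, lam = stats.boxcox(x)
print(f"estimated lambda: {lam:.3f}")  # close to 0 here, i.e., roughly a log transform
```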


Data Transformations for Multiple Predictors

These transformations act on groups of predictors, typically the entire set under consideration
Primary goals are resolving outliers and reducing the dimension of the data


Data Transformations: Resolving Outliers

Generally speaking, outliers can be hard to define and detect
Sometimes, the dilemma is that the outlying data may be an indication of a special part of the population under study that is just starting to get sampled!
When one or more samples are suspected to be outliers, the first step is to make sure that the values are scientifically valid
Suspected values should not be hastily removed or changed, especially if the sample size is small
Some predictive models, such as tree-based models, are resistant to outliers


Data Transformations: Resolving Outliers


However, some models are affected more than others, such as
regression-based models
For models that are sensitive to outliers, the spatial sign transformation can minimize the problem
This procedure projects the predictor values onto a multidimensional
sphere, and makes all values equidistant from the center of the
sphere
Mathematically,
$$ x^*_{ij} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{P} x_{ij}^2}} $$

Caution 1: Center and scale the predictors prior to using this transformation
Caution 2: Do not remove predictors after applying this transformation, as it acts on the predictors as a group
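A minimal sketch of the spatial sign transformation (assuming, per Caution 1, that the columns of X have already been centered and scaled):

```python
import numpy as np

def spatial_sign(X):
    """Divide each sample (row) by its Euclidean norm, projecting it onto the unit sphere."""
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    return X / norms

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # assumed already centered and scaled
X_star = spatial_sign(X)
print(np.linalg.norm(X_star, axis=1)[:5])  # every row is now at distance 1 from the center
```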

Data Transformations: Dimension Reduction

Data reduction techniques are another class of predictor transformations
Principal components analysis (PCA) is one such technique
PCA reduces the dimension of the data while retaining most of the information contained in the data
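A minimal sketch with scikit-learn (keeping two components is an illustrative choice; predictors are centered and scaled first):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # 200 samples, 10 predictors

X_scaled = StandardScaler().fit_transform(X)   # center and scale first
pca = PCA(n_components=2)                      # keep the first two principal components
scores = pca.fit_transform(X_scaled)           # reduced (200, 2) representation
print(pca.explained_variance_ratio_)           # share of the variability retained
```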


Dealing with Missing Values

Important to understand why the values are missing
“Informative Missingness”: the pattern of missing data may itself be related to the outcome
Missing values are more often related to predictor variables than to the sample - Kuhn and Johnson (2013)
Because of this, missingness may be concentrated in a subset of predictors rather than occurring across all predictors


Dealing with Missing Values

For large datasets, a predictor having too many missing values may simply be removed, if the missingness is not informative
Otherwise, missing values should be imputed
Imputation is another layer of modeling in which we try to estimate the values of a predictor based on the other predictor variables
Usually, the training set is used to build an imputation model


Dealing with Missing Values

A popular technique for imputation is the k-nearest neighbour algorithm
For a missing value, first find the samples in the training set that are “closest” to it
Then, calculate a weighted average of these nearby points to fill in the missing value
Advantage: the imputed data are confined within the range of the training set values
Disadvantage: the entire training set is required every time a missing value needs to be imputed - computationally costly
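A minimal sketch using scikit-learn's k-nearest-neighbour imputer (the tiny arrays and the choice n_neighbors=2 with distance weighting are illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

X_train = np.array([[1.0, 2.0],
                    [2.0, 4.0],
                    [3.0, 6.0],
                    [4.0, 8.0]])
X_new = np.array([[2.5, np.nan]])      # one missing value to fill in

# Fit on the training set; average the nearest neighbours, weighted by inverse distance
imputer = KNNImputer(n_neighbors=2, weights="distance")
imputer.fit(X_train)
print(imputer.transform(X_new))        # the missing entry is replaced by a weighted average
```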


The Problem of Overfitting

The problem of overfitting occurs when the model fits not only the pattern but also the noise in the data
The immediate outcome is a very impressive result on the training data (i.e., low training error) but a very poor result on the test data (i.e., high test error)
Different models face this problem differently, with the same end result - high test error
There are different methods to overcome this problem for different models (to be discussed for each model we consider in this course)


The Problem of Overfitting

The key message to remember is that, in general, the model we fit to the training data should generalize as well as possible to new data
For example, in regression-based models, this is achieved by selecting the predictors optimally - not too many, not too few
In tree-based methods, this is achieved by pruning the tree optimally
For model-specific techniques to deal with over-fitting, refer to the specific discussions on different models


Model Tuning

Some models have important parameters that cannot be directly estimated from the data
However, choosing appropriate values of these parameters is necessary for optimal performance of these models
Such parameters are called tuning parameters
For example, “k” in the k-nearest neighbour algorithm


Determining the Tuning Parameter

A general approach for determining the tuning parameter of a model is the following:
1. Define a set of candidate values
2. For each candidate value, generate reliable estimates of model utility
3. Choose the optimal setting
To get a reasonable estimate of model utility for a given value of the tuning parameter, one may use the test set after fitting the model to the training set, as sketched below
Alternative approaches based on resampling techniques may also be used
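A minimal sketch of this approach, tuning k for a k-nearest-neighbour classifier (the candidate grid, dataset, and accuracy metric are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define candidate values, estimate model utility for each, choose the best
candidates = [1, 3, 5, 7, 9, 15]
scores = {}
for k in candidates:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)   # test-set accuracy as the utility estimate

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```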


Data Splitting

The dataset is usually partitioned into two or three parts - ‘training set’, ‘validation set’ and ‘test set’
Not all applications need all three partitions; in most applications that use a single model, just two partitions are sufficient
This idea of partitioning the data into different parts works well when the dataset is large
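A minimal sketch of a three-way partition (the 60/20/20 proportions are an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First split off a 20% test set, then split the rest into training and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split
```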


Data Splitting

When the dataset is small, the test set may be avoided, as one may need the entire dataset to train the model well
In such cases, the performance of the model may be evaluated based on resampling techniques
We now present two commonly used resampling techniques


k-Fold Cross-validation

Cross-validation starts with partitioning the data into k “folds”, or non-overlapping subsamples
Often k = 5, with each fold having 20% of the observations
Each time, one fold serves as the validation set, and the remaining k − 1 folds serve as the training set
We can then combine the model's performance over the k validation sets to evaluate the overall performance
This is a general technique that can be used with any model
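A minimal sketch of 5-fold cross-validation with scikit-learn (the logistic-regression model and synthetic data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 5 folds serves once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # per-fold accuracies and their average
```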


Bootstrap

A bootstrap sample taken from the original data is of the same size as the original data, and is drawn by sampling with replacement
Naturally, some records will be selected multiple times in a bootstrap sample while some records will not be selected at all
Bootstrapping is the basis of tools like Bagging, Random Forests, etc.
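A minimal sketch of drawing one bootstrap sample with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                       # the "original data", n = 10

# Same size as the original data, sampled with replacement
boot = rng.choice(data, size=data.size, replace=True)
print(boot)                                # some values repeat; others do not appear at all
```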


Oversampling

Sometimes, the categories of the response variable show severe imbalance in their occurrences
For example, when Class 1 is the important class (responders to a certain online marketing campaign, say), we may have too few 1s in our data as compared to 0s (non-responders)
Naturally, when we randomly partition the data into training and test sets, 0 is the dominant class in the training set
In such cases, classifiers are affected: they tend to classify records into the dominant class


Oversampling

One approach to this problem is stratified sampling
The process is called oversampling, or weighted sampling
Next, we describe the process of oversampling, assuming there are two classes of the response: 1 (say, responder, or buyer) and 0 (non-responder, non-buyer)
This process, as described here, does not extend to a multi-class response


How to Oversample?

1. Separate the records with response 1 and response 0 into two distinct sets, called strata.
2. Randomly select records from each stratum for the training data. Typically, one might select half of the 1s, and an equal number of 0s.
3. The remaining 1s are put in the validation set.
4. 0s are randomly selected for the validation set in sufficient numbers to maintain the original ratio of 1s to 0s.
5. If a test set is required, it can be randomly taken from the validation set.
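A minimal sketch of steps 1-4 with pandas (assuming a DataFrame df with a binary column named response; the column names and class proportions are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000),
                   "response": rng.choice([0, 1], size=1000, p=[0.95, 0.05])})

ones, zeros = df[df.response == 1], df[df.response == 0]   # step 1: two strata

train_1 = ones.sample(frac=0.5, random_state=0)            # step 2: half of the 1s...
train_0 = zeros.sample(n=len(train_1), random_state=0)     # ...and an equal number of 0s
train = pd.concat([train_1, train_0])

valid_1 = ones.drop(train_1.index)                         # step 3: the remaining 1s
ratio = len(zeros) / len(ones)                             # original 0-to-1 ratio
valid_0 = zeros.drop(train_0.index).sample(n=int(ratio * len(valid_1)), random_state=0)
valid = pd.concat([valid_1, valid_0])                      # step 4: ratio preserved
```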


Oversampling: Cautions

Oversampled data can be used to train models, but they are often not suitable for evaluating model performance
To get an unbiased estimate of model performance, the model should be applied to regular data (i.e., not oversampled data)
In case it is absolutely necessary to test model performance on oversampled validation data, the results must be adjusted (to be discussed in detail later)


Some Advanced Topics

Approaches for reducing the number of predictors
Model selection algorithms - forward, backward, etc.
Simulated annealing
Measurement error in predictors


Reading

This presentation is based on the textbook and the following books:
“An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani
“Applied Predictive Modeling” by Kuhn and Johnson
