

Some General Strategies for Data Analysis

Debanjan Mitra

Indian Institute of Management Udaipur

June 18, 2019



Contents

1 Data Pre-processing
Data Transformations
Dealing with Missing Values

2 Over-fitting

3 Model Tuning

4 Data Splitting
Resampling: Cross-validation
Resampling: Bootstrap
Oversampling

5 Some Further Considerations


Data Pre-processing

Data pre-processing generally includes transformations and cleaning


The need for, and the type of, data pre-processing depends on the type of model being used
For example, tree-based models are fairly insensitive to the characteristics of the predictors
However, regression-based methods are heavily affected by the characteristics of the predictors
One may pre-process the data for both supervised and unsupervised learning


Data Transformations for Individual Variables

Needed because some modeling techniques may have strict requirements on the scale or distribution of the predictors
Common transformations include centering, scaling, and transformations to reduce skewness


Data Transformations: Centering and Scaling

Required for models that need all predictors to be in the same units
To center a predictor, we subtract its average value from all values of that predictor
To scale, we divide the centered values by the standard deviation of that predictor
This improves the numerical stability of the model under consideration
The only downside is a reduction in interpretability of the individual values
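A minimal sketch in Python (assuming a NumPy array X with samples in rows and predictors in columns; the function name is illustrative):

```python
import numpy as np

def center_and_scale(X):
    """Center each predictor (column) to mean 0 and scale it to unit standard deviation."""
    mean = X.mean(axis=0)          # per-predictor average
    sd = X.std(axis=0, ddof=1)     # per-predictor sample standard deviation
    return (X - mean) / sd

# Example: three samples, two predictors on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Z = center_and_scale(X)            # each column of Z now has mean 0 and sd 1
```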


Data Transformations: Resolving Skewness


For regression-based methods, resolving skewness for the response
variable may be necessary
General rule of thumb: a variable whose ratio of the highest value to
the lowest value is greater than 20 has significant skewness
Skewness can be identified from the skewness statistic as well, given by
$$ sk = \frac{m_3}{m_2^{3/2}}, \qquad \text{where} \quad m_3 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^3, \quad m_2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2 $$

If sk is close to zero, the distribution of the variable is roughly symmetric
For right-skewed distributions sk is a large positive number; for left-skewed distributions it is a large negative number
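As a minimal sketch (a direct transcription of the formula above for a 1-D NumPy array x; note that library routines such as scipy.stats.skew use slightly different normalizations):

```python
import numpy as np

def skewness(x):
    """Sample skewness sk = m3 / m2^(3/2), with m2 and m3 as defined above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = x - x.mean()
    m3 = (dev ** 3).sum() / (n - 1)
    m2 = (dev ** 2).sum() / (n - 1)
    return m3 / m2 ** 1.5

x = np.random.lognormal(size=1000)   # a right-skewed sample
print(skewness(x))                   # a large positive number
```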

Data Transformations: Resolving Skewness

Replacing the variable by its log, square root, or inverse transform may reduce skewness
Alternatively, the Box-Cox transformation may be used for predictors with values greater than zero
Box-Cox transformation:
$$ x^* = \frac{x^\lambda - 1}{\lambda} \quad \text{if } \lambda \neq 0, \qquad x^* = \log(x) \quad \text{if } \lambda = 0. $$

The value of λ can be estimated from the training data, typically by maximum likelihood
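A minimal sketch using SciPy (scipy.stats.boxcox estimates λ by maximum likelihood when no value is supplied; the log-normal sample is illustrative):

```python
import numpy as np
from scipy import stats

x = np.random.lognormal(size=1000)     # strictly positive, right-skewed data

# Returns the transformed values and the maximum-likelihood estimate of lambda
x_star, lam = stats.boxcox(x)
print(f"estimated lambda: {lam:.3f}")  # close to 0 here, i.e., roughly a log transform
```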


Data Transformations for Multiple Predictors

These transformations act on groups of predictors, typically the entire set under consideration
Primary goals are resolving outliers and reducing the dimension of the data


Data Transformations: Resolving Outliers

Generally speaking, outliers can be hard to define and detect
Sometimes, the dilemma is that the outlying data may be an indication of a special part of the population under study that is just starting to get sampled!
When one or more samples are suspected to be outliers, the first step is to make sure that the values are scientifically valid
Suspected values should not be hastily removed or changed, especially if the sample size is small
Some predictive models, such as tree-based models, are resistant to outliers


Data Transformations: Resolving Outliers


However, some models are affected more than others, such as
regression-based models
For models that are sensitive to outliers, the spatial sign transformation can minimize the problem
This procedure projects the predictor values onto a multidimensional
sphere, and makes all values equidistant from the center of the
sphere
Mathematically,
$$ x^*_{ij} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{P} x_{ij}^2}} $$

Caution 1: Center and scale the predictors prior to using this transformation
Caution 2: Do not remove predictors after applying this transformation, as it acts on the predictors as a group
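A minimal sketch of the spatial sign transformation (assuming, per Caution 1, that the columns of X have already been centered and scaled):

```python
import numpy as np

def spatial_sign(X):
    """Divide each sample (row) by its Euclidean norm, projecting it onto the unit sphere."""
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    return X / norms

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # assumed already centered and scaled
X_star = spatial_sign(X)
print(np.linalg.norm(X_star, axis=1)[:5])  # every row is now at distance 1 from the center
```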

Data Transformations: Dimension Reduction

Data reduction techniques are another class of predictor transformations
Principal components analysis (PCA) is one such technique
PCA reduces the dimension of the data while retaining most of the information contained in the data
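A minimal sketch with scikit-learn (keeping two components is an illustrative choice; predictors are centered and scaled first):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # 200 samples, 10 predictors

X_scaled = StandardScaler().fit_transform(X)   # center and scale first
pca = PCA(n_components=2)                      # keep the first two principal components
scores = pca.fit_transform(X_scaled)           # reduced (200, 2) representation
print(pca.explained_variance_ratio_)           # share of the variability retained
```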


Dealing with Missing Values

Important to understand why the values are missing
“Informative Missingness”: the pattern of missing data may itself be related to the outcome
Missing values are more often related to predictor variables than to the sample - Kuhn and Johnson (2013)
Because of this, missingness may be concentrated in a subset of predictors rather than occurring across all predictors


Dealing with Missing Values

For large datasets, a predictor having too many missing values may simply be removed, if the missingness is not informative
Otherwise, missing values should be imputed
Imputation is another layer of modeling in which we try to estimate the values of a predictor based on the other predictor variables
Usually, the training set is used to build an imputation model


Dealing with Missing Values

A popular technique for imputation is the k-nearest neighbour algorithm
For a missing value, first find the samples in the training set that are “closest” to it
Then, calculate a weighted average of these nearby points to fill in the missing value
Advantage: the imputed data are confined within the range of the training set values
Disadvantage: the entire training set is required every time a missing value needs to be imputed - computationally costly
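A minimal sketch using scikit-learn's k-nearest-neighbour imputer (the tiny arrays and the choice n_neighbors=2 with distance weighting are illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

X_train = np.array([[1.0, 2.0],
                    [2.0, 4.0],
                    [3.0, 6.0],
                    [4.0, 8.0]])
X_new = np.array([[2.5, np.nan]])      # one missing value to fill in

# Fit on the training set; average the nearest neighbours, weighted by inverse distance
imputer = KNNImputer(n_neighbors=2, weights="distance")
imputer.fit(X_train)
print(imputer.transform(X_new))        # the missing entry is replaced by a weighted average
```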


The Problem of Overfitting

The problem of overfitting occurs when the model fits not only the pattern but also the noise in the data
The immediate outcome is a very impressive result on the training data (i.e., low training error) but a very poor result on the test data (i.e., high test error)
Different models face this problem differently, with the same end result - high test error
There are different methods to overcome this problem for different models (to be discussed for each model we consider in this course)


The Problem of Overfitting

The key message to remember is that, in general, the model we fit to the training data should generalize as well as possible to new data
For example, in regression-based models, this is achieved by selecting the predictors optimally - not too many, not too few
In tree-based methods, this is achieved by pruning the tree optimally
For model-specific techniques to deal with over-fitting, refer to the specific discussions on different models


Model Tuning

Some models have important parameters that cannot be directly estimated from the data
However, choosing appropriate values of these parameters is necessary for optimal performance of these models
Such parameters are called tuning parameters
For example, “k” in the k-nearest neighbour algorithm


Determining the Tuning Parameter

A general approach for determining the tuning parameter of a model is the following:
1. Define a set of candidate values
2. For each candidate value, generate reliable estimates of model utility
3. Choose the optimal setting
To get a reasonable estimate of model utility for a given value of the tuning parameter, one may use the test set after fitting the model to the training set, as sketched below
Alternative approaches based on resampling techniques may also be used
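A minimal sketch of this approach, tuning k for a k-nearest-neighbour classifier (the candidate grid, dataset, and accuracy metric are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define candidate values, estimate model utility for each, choose the best
candidates = [1, 3, 5, 7, 9, 15]
scores = {}
for k in candidates:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)   # test-set accuracy as the utility estimate

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```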


Data Splitting

The dataset is usually partitioned into two or three parts - ‘training set’, ‘validation set’ and ‘test set’
Not all applications need all three partitions; in most applications that use a single model, just two partitions are sufficient
This idea of partitioning the data into different parts works well when the dataset is large
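A minimal sketch of a three-way partition (the 60/20/20 proportions are an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First split off a 20% test set, then split the rest into training and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split
```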


Data Splitting

When the dataset is small, the test set may be avoided, as one may need the entire dataset to train the model well
In such cases, the performance of the model may be evaluated based on resampling techniques
We now present two commonly used resampling techniques


k-Fold Cross-validation

Cross-validation starts with partitioning the data into k “folds”, or non-overlapping subsamples
Often k = 5, with each fold having 20% of the observations
Each time, one fold serves as the validation set, and the remaining k − 1 folds serve as the training set
We can then combine the model's performance over the k validation sets to evaluate the overall performance
This is a general technique that can be used with any model
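A minimal sketch of 5-fold cross-validation with scikit-learn (the logistic-regression model and synthetic data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 5 folds serves once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # per-fold accuracies and their average
```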


Bootstrap

A bootstrap sample taken from the original data is of the same size as the original data, and is drawn by sampling with replacement
Naturally, some records will be selected multiple times in a bootstrap sample while some records will not be selected at all
Bootstrapping is the basis of tools like Bagging, Random Forests, etc.
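A minimal sketch of drawing one bootstrap sample with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                       # the "original data", n = 10

# Same size as the original data, sampled with replacement
boot = rng.choice(data, size=data.size, replace=True)
print(boot)                                # some values repeat; others do not appear at all
```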


Oversampling

Sometimes, the categories of the response variable show severe imbalance in their occurrences
For example, when Class 1 is the important class (responders to a certain online marketing campaign, say), we may have too few 1s in our data as compared to 0s (non-responders)
Naturally, when we randomly partition the data into training and test sets, 0 is the dominant class in the training set
In such cases, classifiers are affected: they tend to classify records into the dominant class


Oversampling

One approach to this problem is stratified sampling
The process is called oversampling, or weighted sampling
Next, we describe the process of oversampling, assuming there are two classes of the response: 1 (say, responder, or buyer) and 0 (non-responder, non-buyer)
This process, as described here, does not extend to a multi-class response


How to Oversample?

1. Separate the records with response 1 and response 0 into two distinct sets, called strata.
2. Randomly select records from each stratum for the training data. Typically, one might select half of the 1s, and an equal number of 0s.
3. The remaining 1s are put in the validation set.
4. 0s are randomly selected for the validation set in sufficient numbers to maintain the original ratio of 1s to 0s.
5. If a test set is required, it can be randomly taken from the validation set.
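A minimal sketch of steps 1-4 with pandas (assuming a DataFrame df with a binary column named response; the column names and class proportions are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000),
                   "response": rng.choice([0, 1], size=1000, p=[0.95, 0.05])})

ones, zeros = df[df.response == 1], df[df.response == 0]   # step 1: two strata

train_1 = ones.sample(frac=0.5, random_state=0)            # step 2: half of the 1s...
train_0 = zeros.sample(n=len(train_1), random_state=0)     # ...and an equal number of 0s
train = pd.concat([train_1, train_0])

valid_1 = ones.drop(train_1.index)                         # step 3: the remaining 1s
ratio = len(zeros) / len(ones)                             # original 0-to-1 ratio
valid_0 = zeros.drop(train_0.index).sample(n=int(ratio * len(valid_1)), random_state=0)
valid = pd.concat([valid_1, valid_0])                      # step 4: ratio preserved
```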


Oversampling: Cautions

Oversampled data can be used to train models, but they are often not suitable for evaluating model performance
To get an unbiased estimate of model performance, the model should be applied to regular data (i.e., not oversampled data)
In case it is absolutely necessary to test model performance on oversampled validation data, the results must be adjusted (to be discussed in detail later)


Some Advanced Topics

Approaches for reducing the number of predictors
Model selection algorithms - forward, backward, etc.
Simulated annealing
Measurement error in predictors


Reading

This presentation is based on the textbook and the following books:
“An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani
“Applied Predictive Modeling” by Kuhn and Johnson
