Data Preprocessing
Hands-on Webinar By
Dr. Haleema
Training and Development Director – ITExpertTraining
Adjunct Faculty - University of Stirling, RAK Campus, UAE
Agenda
What is EDA?
Why EDA?
Steps in Understanding Data through EDA
What is Data Pre-processing and Why is it needed?
Steps in Data Preprocessing
Hands-on with Bank’s Term Deposit Sale Project Data
What is EDA?
Data analysis is the process of understanding data by identifying trends in the data set with the
help of statistical methods.
Exploratory Data Analysis (EDA), an approach developed by John Tukey in the 1970s, is the first
step in the data analysis process.
EDA in Machine Learning refers to the critical process of performing initial investigations on data
to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of
summary statistics and graphical representations.
In short, EDA is a way of visualizing, summarizing and interpreting the information hidden in data
stored in row-and-column format.
Why EDA?
It is not a good practice for a Data Scientist / Researcher to start building a Machine Learning model
without identifying the patterns that exist in the data and the features that are useful in model
building.
EDA can help to detect mistakes, debunk assumptions, and understand the relationships between
different key variables. Such insights may eventually lead to feature engineering and to the
selection of an appropriate predictive model.
Steps in understanding data through EDA
1. Variable transformations
Label Encoding / One-Hot Encoding of Categorical Data.
Z-score transformation: a linear rescaling of the data values so that they have a mean of 0 and a
standard deviation of 1.
2. Missing value treatment
Drop the column if the column contains too many missing values.
Replace missing values with mean or median for numerical data, with mode for categorical
data.
3. Outlier treatment
IQR Strategy.
Logarithmic Transformation.
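The three steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the webinar's own code: the toy data frame and its column names (`age`, `balance`, `job`, `mostly_empty`) are invented, the 50% cut-off for "too many missing values" is an assumed threshold, and the 1.5 × IQR fences are a common convention rather than something stated in the slides. Missing values are handled first here so that the encoders do not see them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "balance": [100.0, 250.0, 300.0, 275.0, 220.0, 50000.0],
    "job": ["admin", "technician", None, "admin", "services", "admin"],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Missing value treatment
# Drop columns where more than half of the values are missing (assumed threshold).
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)
# Median for a numerical column, mode for a categorical column.
df["age"] = df["age"].fillna(df["age"].median())
df["job"] = df["job"].fillna(df["job"].mode()[0])

# Outlier treatment
# IQR strategy: cap values that fall outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = df["balance"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["balance_capped"] = df["balance"].clip(lower=lower, upper=upper)
# Logarithmic transformation: log(1 + x) compresses the long right tail.
df["balance_log"] = np.log1p(df["balance"])

# Variable transformations
# Label encoding: one integer code per category.
df["job_label"] = df["job"].astype("category").cat.codes
# One-hot encoding: one indicator column per category.
job_dummies = pd.get_dummies(df["job"], prefix="job")
# Z-score transformation: subtract the mean, divide by the standard deviation.
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)

print(df)
print(job_dummies)
```

Capping is only one way to apply the IQR strategy; dropping the rows outside the fences is another, and which is preferable depends on how much data can be spared.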
Hands-on with Bank Term Deposit Sale Project Data
EDA
Importing the necessary libraries in Python.
Loading the dataset as Pandas Data Frame and checking the size.
Displaying the first and last five records from the data frame.
Checking the data types of the attributes and converting certain numerical variables into categorical.
Checking the columns for null values and dropping duplicate records if any.
Displaying the statistical summary of numerical columns and identifying the skewness in the data.
Non-Graphical and Graphical Univariate Analysis on numerical and categorical columns.
Bivariate Analysis using pair plot for numerical columns.
Bivariate Analysis using cross tab for categorical columns.
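The non-graphical part of this checklist might look like the sketch below. The small in-memory data frame is a hypothetical stand-in for the bank dataset (its columns `age`, `balance`, `education`, `marital` and `subscribed` are assumptions, not the real schema); in the hands-on session the data would instead be loaded from a file, for example with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-in for the bank dataset; the real column names may differ.
df = pd.DataFrame({
    "age": [30, 45, 45, 52, 28, 61],
    "balance": [1200, 300, 300, 15000, 50, 700],
    "education": [1, 2, 2, 3, 1, 2],  # numeric codes that are really categories
    "marital": ["single", "married", "married", "married", "single", "divorced"],
    "subscribed": ["no", "yes", "yes", "no", "no", "yes"],
})

# Size, then first and last five records.
print(df.shape)
print(df.head())
print(df.tail())

# Data types; convert a numeric code column into a categorical one.
print(df.dtypes)
df["education"] = df["education"].astype("category")

# Null values per column, then drop duplicate records.
print(df.isna().sum())
df = df.drop_duplicates()

# Statistical summary and skewness of the numerical columns.
numeric = df.select_dtypes(include="number")
print(numeric.describe())
print(numeric.skew())

# Non-graphical univariate analysis of a categorical column.
print(df["marital"].value_counts())

# Bivariate analysis of two categorical columns with a cross tab.
crosstab = pd.crosstab(df["marital"], df["subscribed"])
print(crosstab)
```

For the graphical steps, plotting libraries are typically layered on top of this, for instance `seaborn.pairplot(df)` for the pair plot of numerical columns; that call is left out here to keep the sketch dependent on pandas alone.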
Hands-on with Bank Term Deposit Sale Project Data
Data Preprocessing
Drop columns that don’t contribute significantly to predicting the target
Handle missing values and outliers
Label encoding of Categorical data
One hot encoding of Categorical data
Normalizing/Standardizing the data
SMOTE for imbalanced data
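To make the last item concrete: SMOTE (Synthetic Minority Over-sampling Technique) balances classes by creating new minority-class points that lie on the line segment between an existing minority point and one of its nearest minority neighbours. The NumPy function below, `smote_sketch`, is a hypothetical, simplified illustration of that idea on invented toy data; it is not the code used in the webinar. In practice the `SMOTE` class from the imbalanced-learn package is commonly used instead.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Simplified SMOTE: interpolate between a random minority sample
    and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                      # random minority sample
        neighbour = neighbours[base, rng.integers(k)]  # one of its neighbours
        gap = rng.random()                          # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Invented toy data: six majority-class rows, three minority-class rows.
X_majority = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [0.3, 0.3], [0.1, 0.1], [0.2, 0.3]])
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]])

# Generate enough synthetic rows to match the majority class size.
X_new = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority))
X_balanced_minority = np.vstack([X_minority, X_new])
print(X_balanced_minority.shape)
```

Oversampling is usually applied to the training split only, so that the evaluation data still reflects the original class balance.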
Thank you!!