
Exploratory Data Analysis and

Data Preprocessing
Hands-on Webinar By

Dr. Haleema
Training and Development Director – ITExpertTraining
Adjunct Faculty - University of Stirling, RAK Campus, UAE
Agenda

 What is EDA?
 Why EDA?
 Steps in Understanding Data through EDA
 What is Data Pre-processing and Why is it needed?
 Steps in Data Preprocessing
 Hands-on with Bank’s Term Deposit Sale Project Data
What is EDA?

 Data analysis is the process of understanding data by identifying trends in the dataset with the
help of statistical methods.
 Exploratory Data Analysis (EDA) is the first step in the data analysis process. The approach was
developed by John Tukey in the 1970s.
 EDA in Machine Learning refers to the critical process of performing initial investigations on data in
order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help
of summary statistics and graphical representations.
 In short, EDA is a way of visualizing, summarizing and interpreting the information hidden in data
stored as rows and columns.
Why EDA?

 It is not a good practice for a Data Scientist / Researcher to start building a Machine Learning model
without identifying the patterns that exist in the data and the features that are useful in model
building.
 EDA can help to detect mistakes, debunk assumptions, and understand the relationships between
key variables. Such insights may eventually guide feature engineering and the selection of an
appropriate predictive model.
Steps in understanding data through EDA

1. Identification of variables and data types


2. Analyzing the basic metrics
 Size of the dataset
 Statistical summary of numerical columns
3. Non-Graphical Univariate Analysis
 Checking for Null values in each column
 Getting the count of unique values in categorical columns
4. Graphical Univariate Analysis
 Getting insights on the distribution of data in each column
 Checking for Outliers in the data using boxplots
5. Bivariate Analysis
 Checking the relationship between the predictor variables and the target variable using a pair plot/cross
tab
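The five steps above can be sketched with pandas as follows. This is only an illustrative outline: the toy DataFrame and its column names (age, balance, job, subscribed) are assumptions made for the example and are not the webinar's dataset. The outlier check uses the 1.5 x IQR fences that a boxplot draws, computed numerically so that the sketch runs without a plotting library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 95, 41],
    "balance": [1200.0, 300.0, np.nan, 5400.0, 800.0, 150.0, 700.0, 2500.0],
    "job": ["admin", "technician", "admin", "retired",
            "services", "student", "retired", "admin"],
    "subscribed": ["no", "no", "yes", "yes", "no", "no", "yes", "no"],
})

# 1. Identification of variables and data types
print(df.dtypes)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# 2. Basic metrics: size of the dataset and statistical summary
n_rows, n_cols = df.shape
summary = df[numeric_cols].describe()
print(n_rows, n_cols)
print(summary)

# 3. Non-graphical univariate analysis
null_counts = df.isnull().sum()                 # null values per column
unique_counts = df[categorical_cols].nunique()  # distinct values per categorical column
print(null_counts)
print(unique_counts)

# 4. Graphical univariate analysis
# With matplotlib installed, df["age"].plot.box() draws the boxplot.
# Here the same 1.5 * IQR fences are computed directly.
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)

# 5. Bivariate analysis: cross tab of a categorical predictor against the target
job_vs_target = pd.crosstab(df["job"], df["subscribed"])
print(job_vs_target)
```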
Types of variables
What is Data Pre-processing and Why is it needed?
 Data Preprocessing refers to the process of preparing the data for model building.
 Data is said to be unclean if it contains missing attribute values, noise or outliers, or
duplicate or wrong data. The presence of any of these degrades the quality of the results.
 Data preprocessing is crucial in any data mining process because it directly affects the
success rate of the project.
Steps in Data Preprocessing

1. Variable transformations
 Label Encoding / One-Hot Encoding of Categorical Data.
 Z-score transformation: linearly rescales the data values so that they have a mean of 0 and a
standard deviation of 1.
2. Missing value treatment
 Drop the column if the column contains too many missing values.
 Replace missing values with mean or median for numerical data, with mode for categorical
data.
3. Outlier treatment
 IQR Strategy.
 Logarithmic Transformation.
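A minimal pandas sketch of the numeric treatments listed above, under some assumptions: the toy columns (income, city) are hypothetical, outliers are capped at the 1.5 x IQR fences (one common reading of "IQR strategy"; dropping the rows is another), and the z-score uses the population standard deviation. The order of the steps here is a choice made for the example, not a prescription.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "income": [30.0, 45.0, np.nan, 52.0, 38.0, 400.0],
    "city": ["Dubai", None, "Dubai", "Sharjah", "Dubai", "Sharjah"],
})

# Missing value treatment: median for numerical data, mode for categorical data
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Outlier treatment, IQR strategy: cap values outside the 1.5 * IQR fences
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
income_capped = df["income"].clip(lower=lower, upper=upper)

# Outlier treatment, logarithmic transformation: compresses large values
# (log1p computes log(1 + x), so zeros are handled safely)
income_log = np.log1p(df["income"])

# Z-score transformation: subtract the mean, divide by the standard deviation
income_z = (income_capped - income_capped.mean()) / income_capped.std(ddof=0)

print(df)
print(income_capped.tolist())
print(income_log.round(3).tolist())
print(income_z.round(3).tolist())
```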
Hands-on with Bank Term Deposit Sale Project Data

EDA
 Importing the necessary libraries in Python.
 Loading the dataset as a pandas DataFrame and checking the size.
 Displaying the first and last five records from the DataFrame.
 Checking the data types of the attributes and converting certain numerical variables into categorical.
 Checking the columns for null values and dropping duplicate records if any.
 Displaying the statistical summary of numerical columns and identifying the skewness in the data.
 Non-Graphical and Graphical Univariate Analysis on numerical and categorical columns.
 Bivariate Analysis using pair plot for numerical columns.
 Bivariate Analysis using cross tab for categorical columns.
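The loading, type conversion, duplicate handling, skewness and cross tab steps could look roughly like this. The inline CSV is a hypothetical stand-in for the bank's data file, and every column name in it is an assumption. In practice `pd.read_csv` would be pointed at the real file. The pair plot is left as a comment because it needs seaborn and a display.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the project data file (columns are assumptions).
csv_text = """age,balance,education_level,housing,subscribed
30,1500,2,yes,no
45,200,1,no,no
52,8000,3,yes,yes
30,1500,2,yes,no
38,600,2,no,no
61,12000,3,no,yes
"""

# Load the dataset as a DataFrame and check its size
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)

# First and last five records
print(df.head())
print(df.tail())

# Check data types and convert a numerically coded variable to categorical
print(df.dtypes)
df["education_level"] = df["education_level"].astype("category")

# Check for null values, then count and drop duplicate records
print(df.isnull().sum())
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Statistical summary and skewness of the numerical columns
numeric = df.select_dtypes(include="number")
print(numeric.describe())
skewness = numeric.skew()  # > 0 suggests a right-skewed distribution
print(skewness)

# Bivariate analysis for numerical columns: with seaborn installed,
# seaborn.pairplot(df, hue="subscribed") draws the pair plot.

# Bivariate analysis for categorical columns: row-normalised cross tab
housing_vs_target = pd.crosstab(df["housing"], df["subscribed"], normalize="index")
print(housing_vs_target)
```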
Hands-on with Bank Term Deposit Sale Project Data

Data Preprocessing
 Drop columns that don’t contribute significantly to predicting the target
 Handle missing values and outliers
 Label encoding of Categorical data
 One hot encoding of Categorical data
 Normalizing/Standardizing the data
 SMOTE for imbalanced data
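One possible sketch of these preprocessing steps, using hypothetical columns (customer_id, age, education, marital, subscribed). Label encoding is done here with pandas ordered categories rather than scikit-learn's LabelEncoder, and scaling is written out by hand. The SMOTE part assumes the optional imbalanced-learn package and is skipped when it is not installed. `k_neighbors=2` is chosen only because the toy data has three minority rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "customer_id": range(1, 13),
    "age": [25, 31, 38, 42, 47, 52, 29, 35, 60, 44, 33, 56],
    "education": ["primary", "secondary", "tertiary", "secondary", "primary", "tertiary",
                  "secondary", "secondary", "tertiary", "primary", "secondary", "tertiary"],
    "marital": ["single", "married", "married", "divorced", "married", "single",
                "single", "married", "married", "divorced", "single", "married"],
    "subscribed": ["no", "no", "no", "yes", "no", "no",
                   "no", "no", "yes", "no", "no", "yes"],
})

# Drop a column that carries no predictive information (a row identifier)
df = df.drop(columns=["customer_id"])

# Label encoding: map an ordered categorical variable to integer codes
education_order = ["primary", "secondary", "tertiary"]
df["education"] = pd.Categorical(
    df["education"], categories=education_order, ordered=True
).codes

# One-hot encoding: one 0/1 column per category of an unordered variable
df = pd.get_dummies(df, columns=["marital"], dtype=int)

# Separate the target and encode it as 0/1
y = (df.pop("subscribed") == "yes").astype(int)

# Standardizing (z-score) and normalizing (min-max to [0, 1]) a numeric column
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)
df["age_minmax"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
print(df.head())
print(y.value_counts())

# SMOTE for imbalanced data (optional dependency: imbalanced-learn)
try:
    from imblearn.over_sampling import SMOTE
except ImportError:
    SMOTE = None

if SMOTE is not None:
    smote = SMOTE(k_neighbors=2, random_state=0)
    X_res, y_res = smote.fit_resample(df.to_numpy(dtype=float), y.to_numpy())
    print(np.bincount(y_res))  # class counts after oversampling
```

In a modelling workflow, oversampling is normally applied to the training split only, so that synthetic rows do not leak into the evaluation data.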
Thank you!!
