# Data Exploration with Python
Andrew Michelson, MD
Pulmonary/Critical Care
Institute for Informatics
Washington University School of Medicine in St. Louis

Disclosures
No relevant financial disclosures.

Many topics could be their own courses, so this will be a brief overview

The best techniques to analyze and clean your data will depend on the question you are answering

Class Structure

Objectives
1. Learn how to import data into Python

The Data
Source: MIMIC-III Demo Data

Contents:
• Vital Signs: Blood pressure, heart rate, respiratory rate, etc…

The Working Environment
1. Python

2. jupyter-notebook

3. Import libraries
A. Pandas
B. Numpy
C. Seaborn
D. Datetime
E. Matplotlib
F. Scipy.stats

Importing Data Into Python
1. Python is a versatile and powerful language that can accept data from
many formats

2. In this class we import CSV documents from the MIMIC-III demo data

Importing Data Into Python

Jupyter-Notebook
• Open Jupyter-Notebook
• Run Section 2: Import Libraries for DataSet Exploration
• Fill in the blank to import the following files:
• ICUSTAYS.csv
• PATIENTS.csv
• D_ITEMS.csv
• D_LABITEMS.csv
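
As a rough sketch of the import step, `pd.read_csv` reads a CSV into a DataFrame. The in-memory text below is an invented stand-in for a file such as PATIENTS.csv, and the column names (`subject_id`, `gender`, `dob`) are assumptions about the file layout rather than something shown on the slides.

```python
import io

import pandas as pd

# Invented stand-in for a CSV file on disk; the rows are made up.
fake_patients_csv = io.StringIO(
    "subject_id,gender,dob\n"
    "1,F,2094-03-05\n"
    "2,M,2090-06-05\n"
)

# With a real file you would pass its path instead,
# e.g. pd.read_csv("PATIENTS.csv").
patients = pd.read_csv(fake_patients_csv, parse_dates=["dob"])
print(patients.head())
```

Passing `parse_dates` asks pandas to convert the listed columns to datetimes while reading, which saves a separate conversion step later.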

Variable Identification
Variable Name: Variable name

Variable type:
• Continuous (ex, age)
• Categorical (ex, sex)

Data Type:
• String
• Category
• Integer
• Float
• ManyString

Independent vs Dependent:

Variable Identification

Patients dataframe

Note: you can use >> DataFrame.tail( ) to view the last rows of the data frame

By adding a number within the parentheses you can specify how many rows to view
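
A small illustration of viewing the first and last rows, using an invented ten-row DataFrame (the column names are placeholders):

```python
import pandas as pd

# Invented data: ten patients with alternating gender codes.
patients = pd.DataFrame(
    {"subject_id": range(1, 11), "gender": list("FMFMFMFMFM")}
)

first_rows = patients.head()    # first 5 rows by default
last_three = patients.tail(3)   # last 3 rows
print(last_three)
```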

Variable Identification

ICU Stays

Variable Identification
How do we know how many rows and columns we have in total?

>> DataFrame.shape

How do we know the data type of each column?

>> DataFrame.info()
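
The two commands above might be used as follows. The DataFrame is invented for illustration and its column names are assumptions.

```python
import pandas as pd

# Invented ICU stay records.
icustays = pd.DataFrame(
    {
        "icustay_id": [101, 102, 103],
        "first_careunit": ["MICU", "SICU", "MICU"],
        "los": [1.5, 3.25, 0.8],
    }
)

# shape is an attribute (no parentheses): (rows, columns)
n_rows, n_cols = icustays.shape
print(n_rows, n_cols)

# info() prints column names, non-null counts, and data types
icustays.info()

# dtypes gives the data type of each column as a Series
print(icustays.dtypes)
```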

Variable Identification
Remove extraneous information that takes up space (both on screen and in memory)
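
One possible way to do this is to drop columns that the analysis does not need and to store repetitive text columns as the `category` type. The choice of `row_id` as the column to drop is only an example, and the data are invented.

```python
import pandas as pd

patients = pd.DataFrame(
    {
        "row_id": [1, 2],
        "subject_id": [10, 11],
        "gender": ["F", "M"],
    }
)

# Drop a column that is not needed (example choice only).
patients = patients.drop(columns=["row_id"])

# A text column with few distinct values can take less memory as 'category'.
patients["gender"] = patients["gender"].astype("category")
print(patients.dtypes)
```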

Variable identification in Python

variables

Complete until section 3.2: Merge Patients & ICU Data to Create a single DataFrame

Manipulating Data in Python
Often data is collected from different sources and then
merged together for analysis.

on=[‘’])

merged correctly
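
A minimal sketch of merging two DataFrames and checking the result, assuming both tables share a `subject_id` key (an assumption about the files; the rows are invented). The `validate` and `indicator` arguments are optional checks, not something the class notebook is known to use.

```python
import pandas as pd

patients = pd.DataFrame({"subject_id": [1, 2, 3], "gender": ["F", "M", "F"]})
icustays = pd.DataFrame(
    {
        "subject_id": [1, 1, 2],
        "icustay_id": [100, 101, 200],
        "los": [2.0, 1.0, 4.5],
    }
)

merged = pd.merge(
    icustays,
    patients,
    on=["subject_id"],
    how="left",               # keep every ICU stay
    validate="many_to_one",   # raises if subject_id repeats in patients
    indicator=True,           # adds a _merge column showing match status
)

# Quick checks that the merge behaved as expected.
print(len(icustays), len(merged))
print(merged["_merge"].value_counts())
```

Comparing row counts before and after, and inspecting `_merge`, are two quick ways to confirm that no rows were duplicated or left unmatched.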

Variable identification in Python

DataFrame

merge

Missing Data
Very Common in clinical data

## Why is data missing?

• Data extraction
• Data collection

Missing Data Categorization
1. Missing completely at random:
• The propensity for a data point to be missing is completely
random and not dependent on observed or unobserved data

2. Missing at random:
• Systematic differences between the missing and observed values,
but these can be entirely explained by other observed variables

Missing Data Categorization
3. Missing not at random
• There is a relationship between the propensity of a value to be
missing and its value

Missing Data Treatment

Missing Data: Case Deletion

• Delete all data where any missing value is present
• Analyze all cases where data is available
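
The two approaches can be sketched in pandas as follows, using invented vital sign values. `dropna()` removes every row containing a missing value, while column-wise statistics such as `mean()` skip missing values and so use every available observation.

```python
import numpy as np
import pandas as pd

vitals = pd.DataFrame(
    {
        "heart_rate": [80.0, np.nan, 95.0, 70.0],
        "resp_rate": [16.0, 18.0, np.nan, 14.0],
    }
)

# Delete any row in which at least one value is missing.
complete_cases = vitals.dropna()

# Use all available values per column (NaN is skipped by default).
column_means = vitals.mean()

print(len(vitals), len(complete_cases))
print(column_means)
```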

Missing Data: Imputation
Goal is to fill missing data with estimated values

## Most common methods: mean/median/mode:

• Population-wide
• Cohort-wide
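
As an illustration of the difference between population-wide and cohort-wide imputation, the sketch below fills missing heart rates with the overall median and, alternatively, with the median of each care unit. The data and the use of care unit as the cohort are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "unit": ["MICU", "MICU", "MICU", "SICU", "SICU", "SICU"],
        "heart_rate": [100.0, 110.0, np.nan, 70.0, np.nan, 80.0],
    }
)

# Population-wide: one median for everyone.
population_fill = df["heart_rate"].fillna(df["heart_rate"].median())

# Cohort-wide: median computed within each unit.
cohort_median = df.groupby("unit")["heart_rate"].transform("median")
cohort_fill = df["heart_rate"].fillna(cohort_median)

print(population_fill.tolist())
print(cohort_fill.tolist())
```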

Missing Data: Statistical-Model Imputation
Linear Regression
• Limitations:
• Reduces variability
• Overestimates the model fit and correlation coefficient
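
The limitations listed above can be seen in a small simulation. This is a sketch with invented data: values of `y` are removed at random, imputed from a line fitted to the observed pairs, and the correlation is compared before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Hide roughly 30% of y at random.
missing = rng.random(200) < 0.3
y_obs = np.where(missing, np.nan, y)

# Fit a line on the observed pairs only.
slope, intercept = np.polyfit(x[~missing], y_obs[~missing], deg=1)

# Fill the gaps with the fitted values.
y_imputed = np.where(missing, slope * x + intercept, y_obs)

# Imputed points sit exactly on the line, so scatter around it shrinks
# and the correlation is inflated.
resid_obs = y_obs[~missing] - (slope * x[~missing] + intercept)
resid_all = y_imputed - (slope * x + intercept)
r_obs = np.corrcoef(x[~missing], y_obs[~missing])[0, 1]
r_all = np.corrcoef(x, y_imputed)[0, 1]
print(resid_obs.std(), resid_all.std())
print(r_obs, r_all)
```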

## K-nearest Neighbor Imputation

• Limitations:
• The choice of k is critical in getting the desired results
• Very slow

Missing Data: Statistical-Model Imputation
Multiple Imputation by Chained Equations (MICE)
• Assumes data is missing at random
• Runs multiple regression models
• Each value is modeled conditionally
• Multiple data sets are made (usually at least 10)

Assessing Missing data in Python
Look for null entries
>> DataFrame.isnull( ).sum( )

Look for non-null entries

>> DataFrame.notnull( ).sum( )
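
For example, on an invented DataFrame (the column names are placeholders), the counts and the percentage of missing values per column could be computed like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4],
        "dod": [None, "2150-01-01", None, None],
        "heart_rate": [80.0, np.nan, 72.0, 90.0],
    }
)

null_counts = df.isnull().sum()          # missing values per column
non_null_counts = df.notnull().sum()     # present values per column
percent_missing = df.isnull().mean() * 100

print(null_counts)
print(percent_missing)
```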

Assessing Missing Data

Go to section 3.3: Assess Missing Data in NEW Patients DataFrame and complete UP TO, but not including, Import Vital Signs

Data Mapping
Process of extracting and unifying data for further analysis

not of interest

## The same value can have different names

• Sometimes the difference in names is important; other
times it's not

Data Mapping
Vital Signs:
• Blood Pressure (systolic/diastolic)
• Heart Rate
• Respiratory Rate
• Oxygen saturation (%)
• Temperature

In MIMIC-III, vital signs are mixed with other measurements in CHARTEVENTS.CSV

Data Mapping with Vital Signs
Systolic Blood Pressure Synonyms in THIS dataset:
• 'Non Invasive Blood Pressure systolic',
• 'Arterial Blood Pressure systolic',
• 'Manual Blood Pressure Systolic Left',
• 'Manual Blood Pressure Systolic Right'

Data Mapping with Vital Signs
Count variable frequency
>> DataFrame.column.value_counts( )
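
A short example of counting how often each label appears, using an invented `label` column:

```python
import pandas as pd

chart = pd.DataFrame(
    {
        "label": [
            "Heart Rate",
            "Heart Rate",
            "Respiratory Rate",
            "Heart Rate",
            "Temperature",
        ]
    }
)

# Counts are sorted from most to least frequent.
label_counts = chart["label"].value_counts()
print(label_counts)
```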

Data Mapping with Dictionaries

Dictionaries are data structures that consist of a collection of key-value pairs that can be changed. (Since Python 3.7 they keep insertion order, but values are still looked up by key.)

Dictionary = {
<key>: <value>
}

Data Mapping with Vital Signs
To accommodate synonyms, or extract items of interest from a
larger data set, you can use a dictionary
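
One way this could look in practice: a dictionary maps each systolic blood pressure synonym to a single name, and `Series.map` applies it. The unified name "Systolic BP", the column names, and the measurement values are all invented for the example.

```python
import pandas as pd

# Keys are the synonyms; the shared value is a hypothetical unified name.
sbp_synonyms = {
    "Non Invasive Blood Pressure systolic": "Systolic BP",
    "Arterial Blood Pressure systolic": "Systolic BP",
    "Manual Blood Pressure Systolic Left": "Systolic BP",
    "Manual Blood Pressure Systolic Right": "Systolic BP",
}

chart = pd.DataFrame(
    {
        "label": [
            "Arterial Blood Pressure systolic",
            "Heart Rate",
            "Non Invasive Blood Pressure systolic",
        ],
        "valuenum": [118.0, 77.0, 121.0],
    }
)

# Labels missing from the dictionary become NaN.
chart["vital"] = chart["label"].map(sbp_synonyms)

# Keep only the rows that matched an item of interest.
sbp = chart.dropna(subset=["vital"])
print(sbp)
```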

Import the remaining data and assess missingness

Go to section 4.2: Import Vital Signs and complete up to section 5: Univariate & Bivariate Analysis

Univariate Analysis
Explore variables individually

Central Tendency:
• Mean
• Median
• Mode
• Min
• Max

Measure of Dispersion:
• Interquartile Range
• Standard Deviation/Variance
• Skewness
• Kurtosis

Visualization:
• Histogram
• Box plot

Univariate Analysis: Skewness
Measure of the asymmetry of the probability distribution of a variable
• Positive or Right
• Negative or Left

• Minimal: between -0.5 and 0.5
• Moderate: between -1 and -0.5 or between 0.5 and 1
• Severe: < -1 or > 1
https://en.wikipedia.org/wiki/Skewness
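
The thresholds above can be applied to the value returned by pandas' `Series.skew()`. The helper `describe_skew` is a hypothetical function written for this example, and the two samples are simulated.

```python
import numpy as np
import pandas as pd


def describe_skew(value):
    """Label a skewness value using the cut-offs listed above."""
    size = abs(value)
    if size <= 0.5:
        return "minimal"
    if size <= 1:
        return "moderate"
    return "severe"


rng = np.random.default_rng(1)
symmetric = pd.Series(rng.normal(size=5000))
right_skewed = pd.Series(rng.exponential(size=5000))

for name, series in [("normal", symmetric), ("exponential", right_skewed)]:
    skew = series.skew()
    print(name, round(skew, 2), describe_skew(skew))
```

A positive value indicates a right (positive) skew, a negative value a left (negative) skew.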

Univariate Analysis: Kurtosis
“The kurtosis parameter is a measure of the combined weight of the tails relative to the rest
of the distribution.”

• Kurtosis > 3: Positive
• Kurtosis = 3: Normal (no excess kurtosis)
• Kurtosis < 3: Negative

https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics#kurtosis
https://bishalbanksonfinance.wordpress.com/tag/probabality-distribution/
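
When computing kurtosis in pandas, note that `Series.kurt()` returns excess kurtosis (kurtosis minus 3), so a normal distribution scores near 0 rather than 3. The sketch below uses simulated samples to show the three cases.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
normal = pd.Series(rng.normal(size=20000))
heavy_tailed = pd.Series(rng.laplace(size=20000))
flat = pd.Series(rng.uniform(size=20000))

# Series.kurt() reports excess kurtosis; add 3 to compare with the
# ">3 / <3" convention used above.
for name, series in [("normal", normal), ("laplace", heavy_tailed), ("uniform", flat)]:
    excess = series.kurt()
    print(name, round(excess, 2), round(excess + 3, 2))
```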

Bivariate Analysis
A method to determine the relationship between 2 variables

1. Visualization: Scatter plots

2. Regression analysis: Find the equation for the line or curve that best fits the data
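
As a minimal sketch of the regression step, `np.polyfit` with `deg=1` finds the best-fitting straight line and `np.corrcoef` gives the correlation. The data are simulated, and the relationship between the two variables is invented for the example, not a clinical finding.

```python
import numpy as np

rng = np.random.default_rng(3)
heart_rate = rng.uniform(60, 120, size=100)
# Invented linear relationship plus noise.
resp_rate = 0.2 * heart_rate + 2 + rng.normal(scale=1.0, size=100)

# Least-squares line: resp_rate ~ slope * heart_rate + intercept
slope, intercept = np.polyfit(heart_rate, resp_rate, deg=1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(heart_rate, resp_rate)[0, 1]

print(round(slope, 3), round(intercept, 3), round(r, 3))
```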

Outliers
What is an outlier?

• A data point that appears far away and diverges from the overall pattern in a sample
• Can be univariate or bivariate

Outliers
How do outliers occur?
• Natural
• Sampling error
• Data entry error
• Data processing error
• Measurement error
• Intentional outlier
• Experimental error

Outliers
Why are they important?

• Alters population variance, leading to non-normal data distributions

• Alters performance of downstream analyses
• Biases results

## How do you detect outliers?

• Visualization
• Bar charts
• Box plots
• Scatter plots (looking for bivariate outliers)
• There are many, many ways, but we will focus on visualization today!
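
As a numeric companion to the box plot, the sketch below flags points beyond 1.5 times the interquartile range from the quartiles, which is the usual default whisker rule in box plots. It is one of the many possible approaches, not a method prescribed in this class, and the heart rates are invented.

```python
import pandas as pd

heart_rate = pd.Series([72, 75, 78, 80, 82, 85, 88, 90, 300])

q1, q3 = heart_rate.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences matching the conventional box plot whiskers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = heart_rate[(heart_rate < lower) | (heart_rate > upper)]
print(lower, upper)
print(outliers.tolist())
```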

Outliers: Univariate

Outliers: Bivariate

Outliers
How do you treat outliers? (Subject for an entire course!)

• Delete observations:
• Data entry error
• Data processing error
• Very few (subjective)

• Transform values
• Log conversion
• Binning
• Differential observation weights

• Impute
• Would avoid with natural outliers
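
Binning, one of the transformations listed above, can be sketched with `pd.cut`. The length-of-stay values and the bin edges are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented lengths of stay in days, with one extreme value.
los = pd.Series([0.5, 1.0, 2.0, 3.0, 45.0])

# Intervals are closed on the right: (0, 1], (1, 7], (7, inf)
los_bin = pd.cut(
    los,
    bins=[0, 1, 7, np.inf],
    labels=["<=1 day", "1-7 days", ">7 days"],
)
print(los_bin.tolist())
```

After binning, the extreme value simply falls in the top category and no longer dominates summary statistics.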

Assessing Data in Python: Pivot Tables
DataFrames must be properly structured before they can be plotted

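Before pivoting (one row per measurement):
Patient       Label               Value
John Smith    Heart Rate          75
John Smith    Respiratory Rate    15

After pivoting (one row per patient):
Patient       Heart Rate    Respiratory Rate
John Smith    75            15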
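
The restructuring shown above can be reproduced with `pivot_table`. The second patient and all values other than John Smith's are invented, and the lowercase column names are assumptions.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "patient": ["John Smith", "John Smith", "Jane Doe", "Jane Doe"],
        "label": ["Heart Rate", "Respiratory Rate", "Heart Rate", "Respiratory Rate"],
        "value": [75, 15, 88, 20],
    }
)

# One row per patient, one column per label.
# Repeated measurements would be averaged (the default aggfunc is the mean).
wide_df = long_df.pivot_table(
    values="value", index=["patient"], columns="label"
).reset_index()
wide_df.columns.name = None  # drop the leftover 'label' axis name

print(wide_df)
```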

Visualize Data Within Python
Declare the graph properties
>> fig, ax = plt.subplots(rows,columns, figsize = (width,height))

## Locate a subset of data from within the larger dataframe

>> DataFrame.loc[DataFrame.column == 'value', 'return column name']

## Use Seaborn to make distribution and boxplots

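>> sns.histplot(data, kde = True, ax = ax[ X ])
(sns.distplot has been deprecated since Seaborn 0.11; histplot with kde = True gives a similar plot)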

>> sns.boxplot(x = data, ax = ax[ X ])

>> DataFrame.pivot_table(values = 'value', index = ['columns'], columns = 'label').reset_index()

## Use Seaborn to plot bivariate data

>>sns.pairplot(pivoted table)

Visualize Data Within Python
Seaborn can make a heatmap to help you more rapidly identify correlations
>> sns.heatmap(dflabs.corr(), vmax = 1)

Univariate & Bivariate Visualization with
Vital Signs

Go to section 5: Univariate & Bivariate Analysis and complete until section 6: Data Transformation

Data Transformation
Skewed data
• Skewed data can violate model assumptions (logistic regression)
• Skewed data can amplify a class imbalance, degrading model performance towards the tail of the
distribution

Heteroskedasticity
• The relationship between two variables shows increasing scatter (non-constant standard
error) at extremes of measurement of the dependent variable
• Two forms:
• Conditional: Unpredictable volatility
• Unconditional: Predictable volatility
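
Heteroskedasticity can be illustrated with a small simulation in which the noise grows with the predictor. This is a sketch on invented data: after fitting a line, the residual spread is compared between the lower and upper halves of `x`.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(1, 10, 400)
# Noise standard deviation increases with x (non-constant variance).
y = 3.0 * x + rng.normal(scale=0.5 * x)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare residual spread at low and high values of x.
low_spread = residuals[x < 5.5].std()
high_spread = residuals[x >= 5.5].std()
print(round(low_spread, 2), round(high_spread, 2))
```

A scatter plot of residuals against `x` would show the same pattern as a widening fan.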

Data Transformation: Heteroskedasticity
Conditional

Data Transformation: Heteroskedasticity
Unconditional

Data Transformation
One way to reduce skewness and heteroskedasticity is to transform (normalize) your data:
• Remove/manage outliers
• Log
• Cube Root
• Binning
• Normalization
• Sigmoid
• Hyperbolic tangent
• Etc…

Again, there are many different ways to do this and the best way will depend on your
planned analyses and the question you are answering

Data Transformation
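To apply the log function to data, pass a Pandas Series to np.log and assign the result back:
>> DataFrame.column = np.log(DataFrame.column)

To take the cube root of a value:
>> DataFrame.column = np.cbrt(DataFrame.column)

Note: np.log returns -inf for zero and NaN for negative values, and **(1/3) returns NaN for negative values, which is why np.cbrt is used here.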
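
To see the effect of these transformations on skewness, the sketch below applies both to a simulated right-skewed variable. The column name `los` and the log-normal distribution are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
# Simulated, strongly right-skewed positive values.
df = pd.DataFrame({"los": rng.lognormal(mean=1.0, sigma=1.0, size=2000)})

df["los_log"] = np.log(df["los"])     # requires strictly positive values
df["los_cbrt"] = np.cbrt(df["los"])   # also defined for zero and negatives

# Skewness of each column, before and after transformation.
skews = df.skew()
print(skews.round(2))
```

On this simulated variable the log transform removes most of the skew, while the cube root reduces it less.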

Data Transformation

Questions?
Thank you!

