# Data Exploration with Python
Andrew Michelson, MD
Pulmonary/Critical Care
Institute for Informatics
Washington University School of Medicine in St. Louis

## Institute for Informatics (I 2)

Disclosures
No relevant financial disclosures.

Many topics could be their own courses, so this will be a brief overview

The best techniques to analyze and clean your data will depend on the question you are answering

Class Structure

Objectives
1. Learn how to import data into Python

The Data
Source: MIMIC-III Demo Data

Contents:
• Vital Signs: Blood pressure, heart rate, respiratory rate, etc…

The Working Environment
1. Python

2. jupyter-notebook

3. Import libraries
A. Pandas
B. Numpy
C. Seaborn
D. Datetime
E. Matplotlib
F. Scipy.stats

Importing Data Into Python
1. Python is a versatile and powerful language that can accept data from
many formats

2. In this class we import CSV documents from the MIMIC-III demo data
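
A minimal sketch of a CSV import, assuming pandas' `read_csv`. The column names and rows are made up for illustration and are not the actual MIMIC-III contents; in the notebook you would pass a file path instead of the in-memory buffer used here.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (made-up rows).
csv_text = """subject_id,gender,dob
1,F,2090-01-15
2,M,2085-07-30
"""

# read_csv accepts a file path or any file-like object;
# parse_dates converts the listed columns to datetime64.
patients = pd.read_csv(io.StringIO(csv_text), parse_dates=["dob"])

print(patients.head())
```

Parsing dates at import time is a design choice: it avoids a separate conversion step later, at the cost of a slightly slower read.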

Importing Data Into Python

Jupyter-Notebook
• Open Jupyter-Notebook
• Run Section 2: Import Libraries for DataSet Exploration
• Fill in the blank to import the following files:
• ICUSTAYS.csv
• PATIENTS.csv
• D_ITEMS.csv
• D_LABITEMS.csv

Variable Identification
Variable Name: Variable name

Variable type:
• Continuous (ex, age)
• Categorical (ex, sex)

Data Type:
• String
• Category
• Integer
• Float
• ManyString

Independent vs Dependent:

Variable Identification

Patients dataframe

Note: you can use >> DataFrame.tail( ) to view the last rows of the data frame

By adding a number within the parentheses you can specify how many rows to view
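
A short sketch of viewing the first and last rows, using a made-up frame in place of the patients data:

```python
import pandas as pd

# Made-up frame standing in for the patients data.
df = pd.DataFrame({"subject_id": range(1, 11), "gender": list("FMFMFMFMFM")})

first_five = df.head()    # default is 5 rows
last_three = df.tail(3)   # pass a number to choose how many rows

print(last_three)
```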

Variable Identification

ICU Stays

Variable Identification
How do we know how many rows and columns we have in total?

>> DataFrame.shape

## How do we know the data type of each column?

>> DataFrame.info()
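
As a sketch, both calls on a small made-up frame (the column names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "subject_id": [1, 2, 3],
        "age": [64.5, 71.0, 58.2],
        "gender": ["F", "M", "F"],
    }
)

n_rows, n_cols = df.shape   # shape is an attribute, not a method
print(n_rows, n_cols)

df.info()                   # prints column names, non-null counts, and dtypes
print(df.dtypes)            # the dtypes alone, as a Series
```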

Variable Identification
Remove Extraneous Information that takes up space (visible and memory)
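
One way to do this is `DataFrame.drop`. The sketch below assumes a bookkeeping column called `row_id` that is not needed for analysis; the name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "row_id": [101, 102, 103],   # hypothetical bookkeeping column
        "subject_id": [1, 2, 3],
        "gender": ["F", "M", "F"],
    }
)

before = df.memory_usage(deep=True).sum()

# drop returns a new DataFrame; the original is left unchanged.
slim = df.drop(columns=["row_id"])

after = slim.memory_usage(deep=True).sum()
print(list(slim.columns), before, after)
```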

Variable identification in Python

variables

## Complete until section 3.2: Merge Patients & ICU Data to Create a single DataFrame

Manipulating Data in Python
Often data is collected from different sources and then
merged together for analysis.

on=[‘’])

merged correctly
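
A hedged sketch of merging two sources with `pd.merge`. The two frames are made up; the shared key `subject_id` and the choice of an outer join are assumptions for illustration, not necessarily what the notebook uses. `indicator=True` adds a `_merge` column that helps check whether the merge behaved as expected.

```python
import pandas as pd

patients = pd.DataFrame({"subject_id": [1, 2, 3], "gender": ["F", "M", "F"]})
icustays = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 4],
        "icustay_id": [10, 11, 12, 13],
    }
)

merged = pd.merge(
    patients,
    icustays,
    how="outer",              # keep rows from both sides
    on=["subject_id"],        # key column shared by both frames
    validate="one_to_many",   # raise if a patient appears twice on the left
    indicator=True,           # adds _merge: both / left_only / right_only
)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the ones worth inspecting before continuing.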

Variable identification in Python

DataFrame

merge

Missing Data
Very Common in clinical data

## Why is data missing?

• Data extraction
• Data collection

Missing Data Categorization
1. Missing completely at random:
• The propensity for a data point to be missing is completely
random and not dependent on observed or unobserved data

2. Missing at random:
• Systematic differences between the missing and observed values,
but these can be entirely explained by other observed variables

Missing Data Categorization
3. Missing not at random
• There is a relationship between the propensity of a value to be
missing and its value

Missing Data Treatment

Missing Data: Case Deletion

• Delete all data where any missing value is present
• Analyze all cases where data is available
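
The two approaches are often called listwise and pairwise deletion (terms not used on the slide). A small sketch with made-up vital signs:

```python
import numpy as np
import pandas as pd

vitals = pd.DataFrame(
    {
        "heart_rate": [72.0, 88.0, np.nan, 95.0],
        "resp_rate": [16.0, np.nan, 20.0, 22.0],
        "temp_c": [36.8, 37.1, 38.0, np.nan],
    }
)

# Delete every row in which any value is missing.
complete_cases = vitals.dropna()

# Use whatever data is available for each calculation:
# column means skip NaN, and corr() uses pairwise-complete rows.
available_means = vitals.mean()
pairwise_corr = vitals.corr()

print(len(vitals), len(complete_cases))
```

Here deleting incomplete rows leaves a single row, which shows how quickly this approach can shrink a clinical dataset.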

Missing Data: Imputation
Goal is to fill missing data with estimated values

## Most common methods: mean/median/mode:

• Population-wide
• Cohort-wide
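
A sketch of median imputation at both levels. The `unit` column that defines the cohorts and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "unit": ["MICU", "MICU", "MICU", "SICU", "SICU", "SICU"],
        "heart_rate": [80.0, np.nan, 100.0, 60.0, 70.0, np.nan],
    }
)

# Population-wide: one median for everyone.
df["hr_population"] = df["heart_rate"].fillna(df["heart_rate"].median())

# Cohort-wide: the median of each patient's own group.
cohort_median = df.groupby("unit")["heart_rate"].transform("median")
df["hr_cohort"] = df["heart_rate"].fillna(cohort_median)

print(df)
```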

Missing Data: Statistical-Model Imputation
Linear Regression
• Limitations:
• Reduces variability
• Overestimates the model fit and correlation coefficient

## K-nearest Neighbor Imputation

• Limitations:
• The choice of k is critical in getting desired results
• Very slow
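
A minimal sketch of linear-regression imputation with one predictor, using `np.polyfit` on the complete rows. The variables (`age`, `sbp`) and values are made up; a real analysis would likely use more predictors and a modelling library.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [30.0, 40.0, 50.0, 60.0, 70.0],
        "sbp": [110.0, 120.0, np.nan, 140.0, 150.0],
    }
)

observed = df["sbp"].notna()

# Fit sbp = slope * age + intercept on the complete rows only.
slope, intercept = np.polyfit(
    df.loc[observed, "age"], df.loc[observed, "sbp"], deg=1
)

# Fill the missing values with predictions from the fitted line.
df["sbp_imputed"] = df["sbp"].fillna(slope * df["age"] + intercept)

print(df)
```

Every imputed value falls exactly on the fitted line, which is why this method reduces variability and can overstate model fit.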

Missing Data: Statistical-Model Imputation
Multiple Imputation by Chained Equations (MICE)
• Assumes data is missing at random
• Runs multiple regression models
• Each value is modeled conditionally
• Multiple data sets are made (usually at least 10)

Assessing Missing data in Python
Look for null entries
>> DataFrame.isnull().sum()

## Look for non-null entries

>> DataFrame.notnull().sum()
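
A sketch on a made-up frame. Note that `sum` must be called with parentheses; without them pandas returns the method object rather than the counts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4],
        "dod": [None, "2101-05-02", None, None],   # made-up values
        "age": [64.0, np.nan, 58.0, 71.0],
    }
)

missing_per_column = df.isnull().sum()    # count of null entries
present_per_column = df.notnull().sum()   # count of non-null entries
missing_fraction = df.isnull().mean()     # share of rows that are null

print(missing_per_column)
```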

Assessing Missing Data

## Go to section 3.3: Assess Missing Data in NEW Patients DataFrame and complete UP TO, but not including, Import Vital Signs

Data Mapping
Process of extracting and unifying data for further analysis

not of interest

## The same value can have different names

• Sometimes the difference in names is important; other
times it's not

Data Mapping
Vital Signs:
• Blood Pressure (systolic/diastolic)
• Heart Rate
• Respiratory Rate
• Oxygen saturation (%)
• Temperature

## In MIMIC-III vital signs are mixed with other measurements in the CHARTEVENTS.CSV

Data Mapping with Vital Signs
Systolic Blood Pressure Synonyms in THIS dataset:
• 'Non Invasive Blood Pressure systolic',
• 'Arterial Blood Pressure systolic',
• 'Manual Blood Pressure Systolic Left',
• 'Manual Blood Pressure Systolic Right'

Data Mapping with Vital Signs
Count variable frequency
>> DataFrame.series.value_counts( )
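
In the line above, `series` stands for a column of the DataFrame. A sketch with a made-up `label` column:

```python
import pandas as pd

chart = pd.DataFrame(
    {
        "label": [
            "Heart Rate",
            "Heart Rate",
            "Respiratory Rate",
            "Heart Rate",
            "Temperature",
        ]
    }
)

# Counts are sorted from most to least frequent.
counts = chart["label"].value_counts()

print(counts)
```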

Data Mapping with Dictionaries

## Dictionaries are mutable data structures that consist of collections of key-value pairs (insertion-ordered since Python 3.7)

Dictionary = {
<key>: <value>
}

Data Mapping with Vital Signs
To accommodate synonyms, or extract items of interest from a
larger data set, you can use a dictionary
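
A sketch of the idea using the systolic blood pressure synonyms listed earlier. The unified name `"Systolic BP"` and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Each synonym (key) maps to one unified name (value).
sbp_synonyms = {
    "Non Invasive Blood Pressure systolic": "Systolic BP",
    "Arterial Blood Pressure systolic": "Systolic BP",
    "Manual Blood Pressure Systolic Left": "Systolic BP",
    "Manual Blood Pressure Systolic Right": "Systolic BP",
}

chart = pd.DataFrame(
    {
        "label": [
            "Arterial Blood Pressure systolic",
            "Heart Rate",
            "Non Invasive Blood Pressure systolic",
        ],
        "value": [118.0, 75.0, 122.0],
    }
)

# map() returns NaN for labels that are not keys in the dictionary.
chart["vital"] = chart["label"].map(sbp_synonyms)

# Keep only the rows whose label was recognized.
sbp = chart.dropna(subset=["vital"])

print(sbp)
```

The same dictionary therefore does two jobs: it unifies synonyms and filters out measurements that are not of interest.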

Import the remaining data and assess missingness

## Go to section 4.2 Import Vital Signs and complete up to section 5: Univariate & Bivariate Analysis

Univariate Analysis
Explore variables individually

| Central Tendency | Measure of Dispersion | Visualization |
|---|---|---|
| Mean | Interquartile Range | Histogram |
| Median | Standard Deviation/Variance | Box plot |
| Mode | Skewness | |
| Min | Kurtosis | |
| Max | | |
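
Most of these measures are available directly on a pandas Series. A sketch with made-up heart rates:

```python
import pandas as pd

heart_rate = pd.Series([60, 72, 75, 75, 80, 88, 110], name="heart_rate")

summary = heart_rate.describe()     # count, mean, std, min, quartiles, max
iqr = heart_rate.quantile(0.75) - heart_rate.quantile(0.25)
mode = heart_rate.mode().tolist()   # mode() can return several values
variance = heart_rate.var()         # sample variance (ddof=1)

print(summary)
print(iqr, mode, variance)
```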

Univariate Analysis: Skewness
Measure of the asymmetry of the probability distribution of a variable
• Positive or Right
• Negative or Left

• Minimal: between -0.5 and 0.5
• Moderate: between -1 and -0.5 or between 0.5 and 1
• Severe: < -1 or > 1
https://en.wikipedia.org/wiki/Skewness

Univariate Analysis: Kurtosis
“The kurtosis parameter is a measure of the combined weight of the tails relative to the rest
of the distribution.”

• Kurtosis > 3: Positive
• Kurtosis = 3: No excess kurtosis/Normal
• Kurtosis < 3: Negative

https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics#kurtosis
https://bishalbanksonfinance.wordpress.com/tag/probabality-distribution/
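
A sketch of computing both statistics in pandas on made-up data. One caveat when comparing against the threshold of 3 above: pandas' `kurt()` uses the Fisher definition and reports excess kurtosis, so a normal distribution scores about 0 rather than 3.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])

sym_skew = symmetric.skew()          # near 0 for symmetric data
right_skew = right_skewed.skew()     # positive: long right tail

# Excess kurtosis (Fisher): add 3 to compare with the "normal = 3" scale.
excess_kurtosis = right_skewed.kurt()

print(sym_skew, right_skew, excess_kurtosis)
```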

Bivariate Analysis
A method to determine the relationship between 2 variables

1. Visualization: Scatter plots
2. Regression analysis: Find the equation for the line or curve that best fits the data
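
As a sketch of the regression step, a straight line can be fitted with `np.polyfit`. The two variables and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "heart_rate": [60.0, 70.0, 80.0, 90.0, 100.0],
        "resp_rate": [12.0, 14.0, 17.0, 18.0, 21.0],
    }
)

# Pearson correlation between the two variables.
r = df["heart_rate"].corr(df["resp_rate"])

# Least-squares line: resp_rate ~ slope * heart_rate + intercept.
slope, intercept = np.polyfit(df["heart_rate"], df["resp_rate"], deg=1)

print(r, slope, intercept)
```

Raising `deg` fits a polynomial curve instead of a line.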

Outliers
What is an outlier?

• A data point that appears far away and diverges from the overall pattern in a sample
• Can be univariate or bivariate

Outliers
How do outliers occur?
• Natural
• Sampling error
• Data entry error
• Data processing error
• Measurement error
• Intentional outlier
• Experimental error

Outliers
Why are they important?

• Alters population variance, leading to non-normal data distributions
• Alters performance of downstream analyses
• Biases results

## How do you detect outliers?

• Visualization
• Bar charts
• Box plots
• Scatter plots (looking for bivariate outliers)
• There are many, many ways, but we will focus on visualization today!
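
As a numeric companion to the box plot, the sketch below flags points beyond 1.5 times the interquartile range from the quartiles, which is the convention box plots typically use for their whiskers. The 1.5 multiplier is that convention, not a threshold taken from these slides, and the data are made up.

```python
import pandas as pd

heart_rate = pd.Series([62, 70, 74, 75, 78, 80, 85, 190])

q1 = heart_rate.quantile(0.25)
q3 = heart_rate.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = heart_rate[(heart_rate < lower_fence) | (heart_rate > upper_fence)]

print(lower_fence, upper_fence, outliers.tolist())
```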

Outliers: Univariate

Outliers: Bivariate

Outliers
How do you treat outliers? (Subject for an entire course!)

• Delete observations:
• Data entry error
• Data processing error
• Very few (subjective)

• Transform values
• Log conversion
• Binning
• Differential observation weights

• Impute
• Would avoid with natural outliers
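
A sketch of two of the transformations listed above, log conversion and binning. The variable name, the values, and the bin edges are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

lactate = pd.Series([0.8, 1.1, 1.4, 2.0, 2.5, 15.0], name="lactate")

# Log conversion compresses the long right tail.
log_lactate = np.log(lactate)

# Binning replaces exact values with ordered categories.
# Intervals are right-inclusive: (0, 2], (2, 4], (4, inf).
binned = pd.cut(lactate, bins=[0, 2, 4, np.inf], labels=["low", "mid", "high"])

print(log_lactate.round(2).tolist())
print(binned.tolist())
```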

Assessing Data in Python: Pivot Tables
DataFrames must be properly structured before they can be plotted

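Long format (one row per measurement):

| Patient | Label | Value |
|---|---|---|
| John Smith | Heart Rate | 75 |
| John Smith | Respiratory Rate | 15 |

Wide format after pivoting (one row per patient):

| Patient | Heart Rate | Respiratory Rate |
|---|---|---|
| John Smith | 75 | 15 |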
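
The John Smith example can be sketched with `pivot_table`. The lowercase column names are assumptions for illustration.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "patient": ["John Smith", "John Smith"],
        "label": ["Heart Rate", "Respiratory Rate"],
        "value": [75, 15],
    }
)

# One row per patient, one column per label.
# Repeated patient/label pairs are averaged (default aggfunc is the mean).
wide_df = long_df.pivot_table(
    values="value", index=["patient"], columns="label"
).reset_index()
wide_df.columns.name = None   # drop the leftover "label" axis name

print(wide_df)
```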

Visualize Data Within Python
Declare the graph properties
>> fig, ax = plt.subplots(rows,columns, figsize = (width,height))

## Locate a subset of data from within the larger dataframe

>> DataFrame.loc[DataFrame.column == 'value of interest', 'return column name']

## Use Seaborn to make distribution and boxplots

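>> sns.distplot(data, ax=ax[ X ])
(distplot is deprecated in recent Seaborn releases; sns.histplot(data, kde=True, ax=ax[ X ]) is the replacement)

>> sns.boxplot(x = data, ax = ax[ X ])

Pivot the DataFrame so that each label becomes its own column
>> DataFrame.pivot_table(values = 'value', index = ['columns'], columns = 'label').reset_index()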

## Use Seaborn to plot bivariate data

>> sns.pairplot(pivoted_table)
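
A sketch of the figure setup and the `.loc` subset step. It uses plain Matplotlib for the two panels instead of Seaborn so that it depends on fewer libraries; the Seaborn calls above take the same `ax` argument. The data are made up, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

chart = pd.DataFrame(
    {
        "label": ["Heart Rate"] * 6 + ["Respiratory Rate"] * 3,
        "value": [60, 72, 75, 80, 88, 140, 14, 16, 22],
    }
)

# Select the 'value' column for the rows whose label is Heart Rate.
heart_rate = chart.loc[chart["label"] == "Heart Rate", "value"]

# One row, two columns of axes.
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].hist(heart_rate, bins=5)
ax[0].set_title("Histogram")
ax[1].boxplot(heart_rate)
ax[1].set_title("Box plot")
fig.tight_layout()
```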

Visualize Data Within Python
Seaborn can make a heatmap to help you more rapidly identify correlations
>> sns.heatmap(dflabs.corr(), vmax = 1)
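
The heatmap is drawn from the correlation matrix that `DataFrame.corr()` returns. A sketch of that matrix on made-up lab values (the column names are illustrative):

```python
import pandas as pd

labs = pd.DataFrame(
    {
        "sodium": [135.0, 138.0, 140.0, 142.0, 145.0],
        "chloride": [98.0, 101.0, 103.0, 104.0, 108.0],
        "potassium": [4.8, 4.1, 4.4, 3.9, 4.0],
    }
)

# Pearson correlation by default; this matrix is what sns.heatmap receives.
corr = labs.corr()

print(corr.round(2))
```

If the frame also holds non-numeric columns, recent pandas versions need `corr(numeric_only=True)` or a numeric subset.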

Univariate & Bivariate Visualization with Vital Signs

## Go to section 5: Univariate & Bivariate Analysis and complete until section 6: Data Transformation

Data Transformation
Skewed data
• Skewed data can violate model assumptions (logistic regression)
• Can amplify a class imbalance, degrading model performance towards the tail of the
distribution

Heteroskedasticity
• The relationship between two variables shows increasing scatter (non-constant standard
error) at extremes of measurement of the dependent variable
• Two forms:
• Conditional: Unpredictable volatility
• Unconditional: Predictable volatility

Data Transformation: Heteroskedasticity
Conditional

Data Transformation: Heteroskedasticity
Unconditional

Data Transformation
One way to improve skewness and heteroskedasticity is to normalize your data
• Remove/manage outliers
• Log
• Cube Root
• Binning
• Normalization
• Sigmoid
• Hyperbolic tangent
• Etc…

Again, there are many different ways to do this and the best way will depend on your
planned analyses and the question you are answering

Data Transformation
To apply the log function to data, pass a Pandas Series to np.log:
>> DataFrame.column = np.log(DataFrame.column)

## To take the cube root of a value

>> DataFrame.column = DataFrame.column**(1/3)
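
A sketch of both transformations on a made-up column. It uses `np.cbrt` for the cube root as an alternative to `**(1/3)`, because `np.cbrt` also handles negative values, whereas a fractional power of a negative float gives NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, 10.0, 100.0, 1000.0]})

df["log_value"] = np.log(df["value"])     # natural log; requires values > 0
df["cbrt_value"] = np.cbrt(df["value"])   # cube root; also defined for negatives

# Compare skewness before and after the log transform.
print(df["value"].skew(), df["log_value"].skew())
```

Writing to new columns instead of overwriting keeps the raw values available for comparison.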

Data Transformation

Questions?
Thank you!

