
Enabling Digital Diagnostics

With a Data Science Platform

Wade L. Schulz, MD, PhD
Department of Laboratory Medicine

Richard Torres, MD, MS
Department of Laboratory Medicine

SLIDE 0
Overview

Healthcare Data and Data Sources


Yale/YNHH Data Science Platform
The Data Science Journey: Computational Healthcare Use Case
Question
Data
Architecture
Implementation
Other Use Cases in Healthcare

SLIDE 1
Healthcare (Big?) Data

SLIDE 2
Healthcare (Big?) Data

SLIDE 3
The Dream for Big Data in Healthcare

SLIDE 4
The Data Science Platform
Kerberized

SLIDE 5
Data Science Journey

Step 1: Get Data
Step 2: ?
Step 3: Profit

SLIDE 6
Creating Data Science Questions

Define the question / hypothesis
Assess the data source, clean as necessary
Iteratively test the experiment and implementation

Analytics types: Descriptive, Diagnostic, Predictive, Prescriptive

SLIDE 7
Clinical Laboratory Use Case

Advanced Analytics for Laboratory Diagnostics (Yale-New Haven)


~10 million laboratory tests
~46 million results per year
Significant number require physician interpretation
Lab based example
Analogous to anatomic (tissue) pathology
Applicable beyond

SLIDE 8
General Problem Definition

Can we use big data and machine learning to improve our diagnostic reproducibility and
throughput for laboratory testing?

Hypothesis: The use of a ML algorithm and real-time data processing pipeline will improve
our turnaround time and interpretive reproducibility. Quantitative results will improve
diagnostic accuracy.

SLIDE 9
Current Data Acquisition: Complete Blood Count (CBC) Example

SLIDE 10
Goal Definition - Peripheral Blood Smear

Current
Frequent dacrocytes and occasional target cells, which can be
seen in iron deficiency anemia. Platelets and leukocytes
unremarkable.

Future
Normal RBCs: 95.5% (Low)
Dacrocytes: 2.7% (High)
Target cells: 1.2% (High)
Elliptocytes: 0.5% (Normal)
Schistocytes: 0.1% (Normal)

Based on red cell indicators and integrated laboratory studies, the patient
has a likely diagnosis of iron deficiency anemia; recommend iron
replacement therapy.
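
The quantitative "Future" report above could be derived from per-cell classifier output roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `morphology_percentages` and the example cell counts are hypothetical, chosen only so that the percentages match the slide. The Low/High/Normal flags are left out because the slide does not give reference ranges.

```python
from collections import Counter

def morphology_percentages(labels):
    """Return {class name: percent of all classified cells}, rounded to 0.1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no cells were classified")
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}

# Hypothetical classifier output for 1,000 red cells (one label per cell)
labels = (["normal"] * 955 + ["dacrocyte"] * 27 + ["target cell"] * 12
          + ["elliptocyte"] * 5 + ["schistocyte"] * 1)

report = morphology_percentages(labels)
# Print the classes from most to least frequent
for name, pct in sorted(report.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct}%")
```

Flagging each value would additionally need a per-class reference range, which would have to come from the laboratory rather than from this sketch.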

SLIDE 11
Architecting a Pipeline

Ingestion
- HDF, Flume, Storm

Storage
- HDFS, Elastic

Workflow Coordination
- HDF, Oozie, Python

Analysis
- Spark, Python
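
How the four stages relate can be sketched in plain Python. This is only an illustration under stated assumptions: every function below is a hypothetical in-memory stand-in, the record layout (`{"test": ...}`) is invented, and none of it calls the real HDF, Flume, Storm, HDFS, Elastic, Oozie, or Spark APIs.

```python
def ingest(raw_records):
    """Ingestion stand-in (HDF, Flume, or Storm on the slide): drop empty records."""
    return [r for r in raw_records if r]

def store(records, storage):
    """Storage stand-in (HDFS / Elastic on the slide): append to a Python list."""
    storage.extend(records)
    return storage

def analyze(storage):
    """Analysis stand-in (Spark / Python on the slide): count records per test."""
    counts = {}
    for rec in storage:
        counts[rec["test"]] = counts.get(rec["test"], 0) + 1
    return counts

def run_pipeline(raw_records):
    """Workflow coordination stand-in (HDF, Oozie, Python): run stages in order."""
    storage = []
    store(ingest(raw_records), storage)
    return analyze(storage)

# Hypothetical input: three valid records and one empty one
result = run_pipeline([{"test": "CBC"}, None, {"test": "CBC"}, {"test": "smear"}])
print(result)
```

Keeping each stage behind its own function mirrors the point of the slide: the tool behind a stage can be swapped without changing the stages around it.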

SLIDE 12
Data Acquisition Technology Assessment

Storm
- Great capacity, scalability
- Strong real-time streaming architecture
- Requires Java development
- Often difficult to debug/test

HDF / NiFi
- Good capacity
- Strong real-time streaming architecture
- No coding / development needed for this use case
- Easier to debug (with data provenance)
- Other laboratory data already processed in HDF

Flume
- Good capacity, easy to scale
- Built-in features should work for file acquisition, but may require custom dev

SLIDE 13
Data Analysis Technology Assessment

Spark
- Scalability
- Easy to develop / test
- Great integration with existing data sets
- Less mature ML frameworks
- Minimal GPU support

Python
- Less scalable
- More difficult to integrate existing data
- Great ML support with ability to use GPU
- Easily tested locally then deployed within our cluster-enabled Docker Swarm

SLIDE 14
The Tech Stack

SLIDE 15
Advanced Analytics (Machine Learning!)

Normal Erythrocyte

SLIDE 16
Data Science!

[Figure: Precision : Training Size. Y-axis: Percent Total Cells. Series: Training Cells, Precision.]

Confusion matrix (rows: actual class; columns: predicted class)

Actual \ Predicted  normal  echinocyte  dacrocyte  schistocyte  elliptocyte  acanthocyte  target cell  stomatocyte  spherocyte  total
normal                 141           0          0            0            0            0            0            1           0    142
echinocyte               0          49          0            1            0            0            0            0           0     50
dacrocyte                2           0         10            1            0            0            0            0           1     14
schistocyte              0           0          1          121            0            0            0            0           0    122
elliptocyte              0           0          0            2            9            0            0            0           0     11
acanthocyte              0           0          0            3            0           19            0            0           0     22
target cell              0           0          0            0            0            0          140            0           0    140
stomatocyte              0           0          0            0            0            0            0           14           0     14
spherocyte               0           0          0            0            0            0            0            1          22     23
total                  143          49         11          128            9           19          140           16          23    538

Data from: Durant, Olsen, and Torres 2016
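
Per-class precision and recall can be read off the confusion matrix above with a few lines of Python. This is a minimal sketch for checking the numbers, not the authors' evaluation code: the counts are copied from the slide, and the variable names are illustrative.

```python
classes = ["normal", "echinocyte", "dacrocyte", "schistocyte", "elliptocyte",
           "acanthocyte", "target cell", "stomatocyte", "spherocyte"]

# Rows = actual class, columns = predicted class (counts from the slide)
matrix = [
    [141, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 49, 0, 1, 0, 0, 0, 0, 0],
    [2, 0, 10, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 121, 0, 0, 0, 0, 0],
    [0, 0, 0, 2, 9, 0, 0, 0, 0],
    [0, 0, 0, 3, 0, 19, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 140, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 14, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 22],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(classes)))  # diagonal = agreements
accuracy = correct / total

metrics = {}
for i, name in enumerate(classes):
    predicted = sum(row[i] for row in matrix)  # column total: cells predicted as class i
    actual = sum(matrix[i])                    # row total: cells truly of class i
    metrics[name] = (matrix[i][i] / predicted, matrix[i][i] / actual)

print(f"overall accuracy: {accuracy:.3f}")
for name, (precision, recall) in metrics.items():
    print(f"{name:12s} precision={precision:.3f} recall={recall:.3f}")
```

Precision divides each diagonal entry by its column total and recall by its row total, so the rarer classes (for example dacrocyte, with 14 actual cells) are the ones where a handful of errors moves the metric most.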

SLIDE 17
Conclusions: Advanced Healthcare Analytics

Machine learning can do things!


Other frameworks from the data science toolbox can be integrated with the Hadoop
ecosystem, even in a Kerberized environment
Hortonworks DataFlow / NiFi provides an easy-to-use data ingestion / data pipeline system
Big data workflows are complex, but often have reusable architecture

SLIDE 18
Take Home: Clinical Image Analysis

Very deep convolutional neural networks work wonderfully well


Human Learning Points:
Linkable data exists
Training is computationally intensive, but classifying is less so
Images are bigger than text
Instrument data sources are very varied and require customization
Future random access to image data is critical
Professionally annotated data is limiting

SLIDE 19
Questions?

Wade L. Schulz, MD, PhD
Department of Laboratory Medicine

Richard Torres, MD, MS
Department of Laboratory Medicine

Thomas J.S. Durant, MD, MPT
Department of Laboratory Medicine

SLIDE 20
