2017-04 Enabling Digital Diagnostics

Enabling Digital Diagnostics
With a Data Science Platform
Wade L. Schulz, MD, PhD, Department of Laboratory Medicine
Richard Torres, MD, MS, Department of Laboratory Medicine
SLIDE 0
Overview
Healthcare Data and Data Sources
Yale/YNHH Data Science Platform
The Data Science Journey: Computational Healthcare Use Case
- Question
- Data
- Architecture
- Implementation
Other Use Cases in Healthcare
SLIDE 1
Healthcare (Big?) Data
SLIDE 2
Healthcare (Big?) Data
SLIDE 3
The Dream for Big Data in Healthcare
SLIDE 4
The Data Science Platform
Kerberized
SLIDE 5
Data Science Journey
Step 1: Get Data
Step 2: ?
Step 3: Profit
SLIDE 6
Creating Data Science Questions
Define the question / hypothesis

Assess the data source, clean as necessary
Iteratively test the experiment and implementation
Descriptive, Diagnostic, Predictive, Prescriptive
SLIDE 7
Clinical Laboratory Use Case
Advanced Analytics for Laboratory Diagnostics (Yale-New Haven)

~10 million laboratory tests
~46 million results per year
Significant number require physician interpretation
Lab-based example
Analogous to anatomic (tissue) pathology
Applicable beyond
SLIDE 8
General Problem Definition
Can we use big data and machine learning to improve our diagnostic reproducibility and
throughput for laboratory testing?
Hypothesis: The use of a ML algorithm and real-time data processing pipeline will improve
our turnaround time and interpretive reproducibility. Quantitative results will improve
diagnostic accuracy.
SLIDE 9
Current Data Acquisition: Complete Blood Count (CBC) Example
SLIDE 10
Goal Definition - Peripheral Blood Smear
Current
Frequent dacrocytes and occasional target cells, which can be
seen in iron deficiency anemia. Platelets and leukocytes
unremarkable.
Future
Normal RBCs: 95.5% (Low)
Dacrocytes: 2.7% (High)
Target cells: 1.2% (High)
Elliptocytes: 0.5% (Normal)
Schistocytes: 0.1% (Normal)
Based on red cell indicators and integrated laboratory studies, the patient
likely has iron deficiency anemia. Recommend iron replacement therapy.
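The "Future" report above could, in principle, be assembled from per-cell classifier output. The following is a minimal sketch under stated assumptions: the function name `morphology_report` is hypothetical, and the reference limits in `REFERENCE_LIMITS` are invented for illustration only (the slides give the flags but not the limits behind them).

```python
from collections import Counter

# HYPOTHETICAL reference limits in percent of cells, as (low, high).
# These numbers are NOT from the slides; they only illustrate flagging.
REFERENCE_LIMITS = {
    "normal": (97.0, 100.0),
    "dacrocyte": (0.0, 1.0),
    "target cell": (0.0, 1.0),
    "elliptocyte": (0.0, 1.0),
    "schistocyte": (0.0, 0.5),
}


def morphology_report(labels, limits=REFERENCE_LIMITS):
    """Turn a list of per-cell labels into (name, percent, flag) rows."""
    if not labels:
        raise ValueError("at least one classified cell is required")
    counts = Counter(labels)
    total = sum(counts.values())
    report = []
    # most_common() orders the rows from the most to the least frequent class
    for name, count in counts.most_common():
        percent = 100.0 * count / total
        low, high = limits.get(name, (0.0, 100.0))
        if percent < low:
            flag = "Low"
        elif percent > high:
            flag = "High"
        else:
            flag = "Normal"
        report.append((name, round(percent, 1), flag))
    return report


if __name__ == "__main__":
    # 1,000 simulated cell labels matching the proportions on the slide
    cells = (["normal"] * 955 + ["dacrocyte"] * 27 + ["target cell"] * 12
             + ["elliptocyte"] * 5 + ["schistocyte"] * 1)
    for name, percent, flag in morphology_report(cells):
        print(f"{name}: {percent}% ({flag})")
```

Real flagging would depend on validated laboratory reference ranges, which this sketch does not attempt to supply.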
SLIDE 11
Architecting a Pipeline
Ingestion
- HDF, Flume, Storm
Storage
- HDFS, Elastic
Workflow Coordination
- HDF, Oozie, Python
Analysis
- Spark, Python
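The four stages can be illustrated with plain Python stand-ins. This is only a sketch of how the stages hand data to one another: none of the tools named on the slide is used, and the record fields `sample_id` and `label` are hypothetical.

```python
import json
from collections import Counter


def ingest(lines):
    """Ingestion stand-in: parse one JSON record per line, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def store(records, storage):
    """Storage stand-in: group records by the hypothetical 'sample_id' key."""
    for record in records:
        storage.setdefault(record["sample_id"], []).append(record)
    return storage


def analyze(storage):
    """Analysis stand-in: count the hypothetical 'label' values per sample."""
    return {
        sample_id: Counter(record["label"] for record in records)
        for sample_id, records in storage.items()
    }


def run_pipeline(lines):
    """Workflow coordination stand-in: run the stages in order."""
    storage = store(ingest(lines), {})
    return analyze(storage)


if __name__ == "__main__":
    raw = [
        '{"sample_id": "S1", "label": "normal"}',
        '{"sample_id": "S1", "label": "dacrocyte"}',
        '{"sample_id": "S2", "label": "normal"}',
    ]
    print(run_pipeline(raw))
```

Because `ingest` is a generator, records flow to storage one at a time, which loosely mirrors the streaming style of the ingestion tools listed above.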
SLIDE 12
Data Acquisition Technology Assessment
Storm
- Great capacity, scalability
- Strong real-time streaming architecture
- Requires Java development
- Often difficult to debug/test
HDF / NiFi
- Good capacity
- Strong real-time streaming architecture
- No coding / development needed for this use case
- Easier to debug (with data provenance)
- Other laboratory data already processed in HDF
Flume
- Good capacity, easy to scale
- Built-in features should work for file acquisition, but may require custom dev
SLIDE 13
Data Analysis Technology Assessment
Spark
- Scalability
- Easy to develop / test
- Great integration with existing data sets
- Less mature ML frameworks
- Minimal GPU support
Python
- Less scalable
- More difficult to integrate existing data
- Great ML support with ability to use GPU
- Easily tested locally then deployed within our cluster-enabled Docker Swarm
SLIDE 14
The Tech Stack
SLIDE 15
Advanced Analytics (Machine Learning!)
Normal Erythrocyte
SLIDE 16
Data Science!
[Figure: Precision : Training Size. Labels: Percent Total Cells, Training Cells, Precision.]

Confusion matrix (rows: actual class, columns: predicted class)
Actual       normal  echinocyte  dacrocyte  schistocyte  elliptocyte  acanthocyte  target cell  stomatocyte  spherocyte  total
normal          141           0          0            0            0            0            0            1           0    142
echinocyte        0          49          0            1            0            0            0            0           0     50
dacrocyte         2           0         10            1            0            0            0            0           1     14
schistocyte       0           0          1          121            0            0            0            0           0    122
elliptocyte       0           0          0            2            9            0            0            0           0     11
acanthocyte       0           0          0            3            0           19            0            0           0     22
target cell       0           0          0            0            0            0          140            0           0    140
stomatocyte       0           0          0            0            0            0            0           14           0     14
spherocyte        0           0          0            0            0            0            0            1          22     23
total           143          49         11          128            9           19          140           16          23    538
Data from: Durant, Olsen, and Torres 2016
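Per-class precision and recall follow directly from the confusion matrix counts. The sketch below uses the numbers from the slide; the function names `per_class_metrics` and `overall_accuracy` are hypothetical helpers, not part of the cited work.

```python
# Confusion matrix from the slide: rows are actual classes,
# columns are predicted classes, in the same class order.
CLASSES = [
    "normal", "echinocyte", "dacrocyte", "schistocyte", "elliptocyte",
    "acanthocyte", "target cell", "stomatocyte", "spherocyte",
]
MATRIX = [
    [141, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 49, 0, 1, 0, 0, 0, 0, 0],
    [2, 0, 10, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 121, 0, 0, 0, 0, 0],
    [0, 0, 0, 2, 9, 0, 0, 0, 0],
    [0, 0, 0, 3, 0, 19, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 140, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 14, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 22],
]


def per_class_metrics(matrix, classes):
    """Return {class: (precision, recall)} for a square confusion matrix."""
    metrics = {}
    for i, name in enumerate(classes):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        metrics[name] = (precision, recall)
    return metrics


def overall_accuracy(matrix):
    """Fraction of all cells that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for name, (p, r) in per_class_metrics(MATRIX, CLASSES).items():
        print(f"{name:12s} precision={p:.3f} recall={r:.3f}")
    print(f"overall accuracy={overall_accuracy(MATRIX):.3f}")
```

With these counts, 525 of 538 cells fall on the diagonal (about 97.6% overall). The weakest class by recall is dacrocyte at 10 of 14, which is also one of the smallest classes in the table.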
SLIDE 17
Conclusions: Advanced Healthcare Analytics
Machine learning can do things!
Other frameworks from the data science toolbox can be integrated with the Hadoop
ecosystem, even in a Kerberized environment
Hortonworks DataFlow / NiFi provides an easy-to-use data ingestion / data pipeline system
Big data workflows are complex, but often have a reusable architecture
SLIDE 18
Take Home: Clinical Image Analysis
Very deep convolutional neural networks work wonderfully well

Human Learning Points:
Linkable data exists
Training is computationally intensive, but classification is less so
Images are bigger than text
Instrument data sources are highly varied and require customization
Future random access to image data is critical
Professionally annotated data is the limiting factor
SLIDE 19
Questions?
Wade L. Schulz, MD, PhD, Department of Laboratory Medicine
Richard Torres, MD, MS, Department of Laboratory Medicine
Thomas J.S. Durant, MD, MPT, Department of Laboratory Medicine
SLIDE 20
