SLIDE 0
Getting Started
https://github.com/ComputationalHealth/odsceast17
https://github.com/ComputationalHealth/odsceast17-data
SLIDE 1
VirtualBox / Hortonworks Sandbox
VirtualBox: https://www.virtualbox.org/
Hortonworks Sandbox: HDP 2.6
Large download; flash drives with the VirtualBox image are available if you
have not already downloaded it
SLIDE 2
VirtualBox Networking Issues?
SLIDE 3
Code for Sandbox
SLIDE 4
Success!
SLIDE 5
Hadoop and Ambari
Ambari: Hadoop management interface
HDFS: Storage
YARN: Resource manager
Hive: SQL-like
HBase: Non-relational DB
Oozie: Workflow
ZooKeeper: Coordinator
Storm: Stream processing
Kafka: Message queue
Ranger: Security
Spark: Compute/analytics
Zeppelin: Notebook
NiFi: Stream processing
SLIDE 6
SSH to Download Code
SLIDE 7
SSH to Download Code
SLIDE 8
Code Repository
SLIDE 9
HDF / NiFi
Step 1: Get Data
Step 2: ?
Step 3: Profit
SLIDE 12
Healthcare No Exception
SLIDE 13
For the architect/developer:
What to do when there is no Step 2?
SLIDE 14
Three Healthcare Use Cases
Laboratory Data
Patient Monitoring
Image Analysis / Deep Learning
SLIDE 15
Workshop Goal: Create Reusable Workflow Framework
SLIDE 16
Data Science Toolbox
SLIDE 17
Data Flow in a Healthcare System
SLIDE 18
Data Flow in a Healthcare System
SLIDE 19
Use Case 1: Create a Laboratory Data Workflow
SLIDE 20
What We Will Build
HL7 Data Generator
Kafka
NiFi
HDFS
Zeppelin
SLIDE 21
Problem #1: Sample/Test Data
SLIDE 22
Problem #2: Healthcare Data Standards
SLIDE 23
Health Level 7 (HL7) Observation Result, Unsolicited (ORU)
MSH|^~\&|LCS|LCA|LIS|TEST9999|199807311532||ORU^R01|3629|P|2.2
PID|2|2161348462|20809880170|1614614|20809880170^TESTPAT||19760924|M|||^^^^|||||||86427531^^^03|
ORC|NW|8642753100012^LIS|20809880170^LCS||||||19980727000000|||HAVILAND
OBR|1|8642753100012^LIS|20809880170^LCS|008342^UPPER RESPIRATORY CULTURE^L|||19980727175800||||||SRC:THROAT
OBX|1|ST|008342^UPPER RESPIRATORY CULTURE^L||FINALREPORT|||||N|F||| 19980729160500|BN
OBX|2|CE|997231^RESULT 1^L||M415|||||N|F|||19980729160500|BN
NTE|1|L|MORAXELLA (BRANHAMELLA) CATARRHALIS
NTE|2|L| HEAVY GROWTH
NTE|3|L| BETA LACTAMASE POSITIVE
https://corepointhealth.com/resource-center/hl7-resources/hl7-msh-message-header
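As a rough illustration of how a pipe-delimited message like the ORU sample above breaks down, the following Python sketch splits a message into segments and fields using only the standard library. The names `parse_hl7` and `components` and the shortened `SAMPLE` message are illustrative assumptions, not code from the workshop repository. The sketch ignores the repetition (`~`), escape (`\`) and subcomponent (`&`) characters that a full parser would need to handle.

```python
def parse_hl7(message):
    """Split an HL7 v2 message into a list of (segment_id, fields) tuples.

    Assumes one segment per line and '|' as the field separator, as in the
    sample message. Repetition and escape characters are not handled.
    """
    segments = []
    for line in message.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        segments.append((fields[0], fields))
    return segments


def components(field):
    """Split a single field into its '^'-separated components."""
    return field.split("^")


# Shortened subset of the sample message, used only for illustration.
SAMPLE = "\n".join([
    r"MSH|^~\&|LCS|LCA|LIS|TEST9999|199807311532||ORU^R01|3629|P|2.2",
    "OBX|2|CE|997231^RESULT 1^L||M415|||||N|F|||19980729160500|BN",
    "NTE|1|L|MORAXELLA (BRANHAMELLA) CATARRHALIS",
])

parsed = parse_hl7(SAMPLE)
msh = parsed[0][1]
# After a plain split on '|', index 8 of the MSH segment holds the message type.
message_type = components(msh[8])
obx = parsed[1][1]
# Index 3 of an OBX segment is the observation identifier, index 5 the value.
observation_id = components(obx[3])
```

Here `message_type` carries the message code and trigger event, and `obx[5]` carries the coded result value.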
SLIDE 24
Workshop Part #1: The Lab Data Generator
Follow along: if you have Python 2.7 and/or Jupyter installed, open
the Laboratory Data Generatory.ipynb notebook from the GitHub repository
(/odsceast17/1-generation/data-generator/Laboratory Data Generatory.ipynb)
SLIDE 25
A Note on Laboratory Data
Generating good test data may require more than a simple random
number generator
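One possible way to go beyond uniform random numbers is to draw values from a normal distribution and clamp them to a plausible range. The sketch below uses only the Python standard library. The function name `generate_lab_values` and all numeric parameters (mean, standard deviation, limits) are hypothetical choices for illustration, not values taken from the workshop notebook.

```python
import random


def generate_lab_values(n, mean, std_dev, low, high, seed=None):
    """Draw n values from a normal distribution, clamped to [low, high]."""
    rng = random.Random(seed)  # a seeded generator makes test data reproducible
    values = []
    for _ in range(n):
        value = rng.gauss(mean, std_dev)
        # Clamp to the allowed range and round to one decimal place.
        values.append(round(min(max(value, low), high), 1))
    return values


# Hypothetical parameters for a glucose-like test, chosen only for illustration.
glucose = generate_lab_values(n=1000, mean=100.0, std_dev=15.0,
                              low=40.0, high=400.0, seed=42)
```

Most generated values cluster around the mean, with occasional outliers, which is closer to how real results are distributed than uniformly random numbers.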
SLIDE 26
Architect the Pipeline
SLIDE 27
Data Ingest - Kafka
SLIDE 28
Start a Data/Kafka Producer
1. Less Fancy: Random values, CLI data loader (any Python, can run inside Sandbox if )
   1. /odsceast17/1-generation/
2. Fancy: Normally distributed data (if you have Python >= 2.7 with the previous dependencies and were able to configure the hosts file)
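A producer along the lines of the options above might be sketched as follows. This is not the workshop's loader: the helper names `publish` and `make_kafka_sender` and the topic name `hl7-lab-results` are assumptions made for illustration. The Kafka-backed sender relies on the third-party `kafka-python` package and is imported lazily, so the dry run at the bottom works without a broker.

```python
def publish(send, topic, messages):
    """Send each message as UTF-8 bytes via send(topic, payload); return the count."""
    count = 0
    for message in messages:
        send(topic, message.encode("utf-8"))
        count += 1
    return count


def make_kafka_sender(bootstrap_servers):
    """Return (send, flush) callables backed by kafka-python's KafkaProducer."""
    from kafka import KafkaProducer  # requires the kafka-python package

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)

    def send(topic, payload):
        producer.send(topic, value=payload)

    return send, producer.flush


# Dry run with an in-memory sender, so the sketch runs without a broker.
sent = []
published = publish(
    lambda topic, payload: sent.append((topic, payload)),
    "hl7-lab-results",  # hypothetical topic name
    ["NTE|1|L|HEAVY GROWTH", "NTE|2|L|BETA LACTAMASE POSITIVE"],
)
```

To send to a real broker, replace the lambda with the `send` callable returned by `make_kafka_sender`, passing your broker's host and port, and call the returned `flush` once publishing is done.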
SLIDE 29
Architect the Pipeline: Data Transformation / Load
SLIDE 30
Workshop Part #2: Python HL7 Parsing
NiFi if installed
Python to HDFS if NiFi not installed
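For the Python route, one plausible transformation step is to flatten the OBX (observation) segments of each message into delimited rows before loading them. The sketch below assumes Python 3 and the standard library only. The function names, the column names and the shortened `SAMPLE` message are hypothetical and do not come from the workshop code.

```python
import csv
import io


def obx_rows(message):
    """Yield (set_id, value_type, observation_id, value, observed_at) per OBX segment."""
    for line in message.strip().splitlines():
        fields = line.strip().split("|")
        if fields[0] != "OBX":
            continue
        # Pad short segments so the fixed indexes below never raise IndexError.
        fields += [""] * (15 - len(fields))
        yield (fields[1], fields[2], fields[3].split("^")[0], fields[5], fields[14])


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["set_id", "value_type", "observation_id", "value", "observed_at"])
    writer.writerows(rows)
    return buffer.getvalue()


# Shortened subset of the sample ORU message, used only for illustration.
SAMPLE = "\n".join([
    r"MSH|^~\&|LCS|LCA|LIS|TEST9999|199807311532||ORU^R01|3629|P|2.2",
    "OBX|2|CE|997231^RESULT 1^L||M415|||||N|F|||19980729160500|BN",
    "NTE|1|L|MORAXELLA (BRANHAMELLA) CATARRHALIS",
])

csv_text = to_csv(obx_rows(SAMPLE))
```

The resulting text can be written to a local file and copied into HDFS, for example with the `hdfs dfs -put` command; the target path depends on your setup.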
SLIDE 31
Workshop Part #3: Data Visualization
Spark / Zeppelin
SLIDE 32
Workflow Overview
SLIDE 33
Implementation Architecture
SLIDE 34
Extending Hadoop
SLIDE 35
Repeatable Architecture Patterns
- Continuous Patient Monitoring
Requirements:
Highly scalable, fault-tolerant processing pipeline
Batch analytics of entire data set
Real-time visualization of more recent data
SLIDE 36
Architect the Pipeline: Patient Monitoring
SLIDE 37
Expanding to Other Use Cases: Patient Monitoring
SLIDE 38
Repeatable Architecture Patterns
- Image Analysis / Deep Learning
Requirements:
Ability to capture data from vendor instrument
Integrate Python-based deep learning libraries
Store features for batch and real-time analysis
SLIDE 39
Expanding to Other Use Cases: Image Analysis
SLIDE 40
Conclusions
SLIDE 41
Creating Data Science Workflows
A Healthcare Use Case
SLIDE 42