Big Data
PERSPECTIVES
Jason P. Albert, University of Pennsylvania
jasonalb@wharton.upenn.edu
3/24/2013
1 Terabyte = 1024 Gigabytes
1 Petabyte = 1024 Terabytes
1 Exabyte = 1024 Petabytes
1 Zettabyte = 1024 Exabytes
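These base-1024 relationships are easy to sanity-check in code (an illustrative sketch, not from the slides):

```python
# Binary (base-1024) storage units: each step is 1024x the previous.
units = ["Gigabyte", "Terabyte", "Petabyte", "Exabyte", "Zettabyte"]

# Bytes in one of each unit, starting from 1 Gigabyte = 1024**3 bytes.
bytes_per_unit = {name: 1024 ** (i + 3) for i, name in enumerate(units)}

for small, large in zip(units, units[1:]):
    ratio = bytes_per_unit[large] // bytes_per_unit[small]
    print(f"1 {large} = {ratio} {small}s")  # each ratio is 1024
```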
Consume: Volume, Velocity, Variety
Store: Gigabytes, Terabytes, Petabytes
Process: Cluster, Classify, Predict
Present: Visualize, Interact, Evaluate

[Diagram: the three V's (Volume, Velocity, Variety) and the MAD criteria (Magnetic, Agile, Deep) mapped across the Consume, Store, Process, Present pipeline]
Is it considered MAD?
Is it an End-to-End Solution?
Options to Consider
Two promising options with low market penetration (Gartner):
MapReduce and alternatives
In-memory Computing
Big Data
MAP REDUCE
Hadoop Distributed File System (HDFS)
Data distributed and replicated over multiple systems
Block oriented

MapReduce
Map function produces intermediate key/value pairs
Reduce function merges intermediate values
Facilitates parallel processing of multiple terabytes of data on large clusters of commodity platforms
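As an illustrative sketch (not HDFS itself), block splitting and replica placement can be mimicked in a few lines of Python:

```python
import itertools

# Toy sketch of HDFS-style block placement: a file is split into
# fixed-size blocks, and each block is replicated on several nodes.
BLOCK_SIZE = 64          # bytes here, purely for illustration
REPLICATION = 3

nodes = ["node1", "node2", "node3", "node4"]
data = b"x" * 200        # a 200-byte "file" -> 4 blocks

# Split the file into fixed-size blocks.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Assign each block to REPLICATION nodes, round-robin.
placement = {}
cycle = itertools.cycle(nodes)
for block_id in range(len(blocks)):
    placement[block_id] = [next(cycle) for _ in range(REPLICATION)]

print(len(blocks), "blocks")   # 4 blocks
print(placement[0])            # ['node1', 'node2', 'node3']
```

Losing one node leaves every block with surviving replicas, which is what makes commodity (failure-prone) hardware workable.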
Scale out
Fully depreciated
Repurposed
Low cost
MapReduce Workflow
$ hadoop jar wordcount.jar WordCount /usr/input /usr/output
Mapper outputs intermediate data
Data is exchanged between nodes
Intermediate data with the same key goes to the same reducer
Reduce(keyword, listOfValues):
    sum = 0
    for each x in listOfValues:
        sum += x
    output.collect(keyword, sum)
Input: Jack be nimble, Jack be quick, Jack jump over the candlestick.

After input splitting:
(0, "Jack be nimble,")
(15, "Jack be quick,")
(28, "Jack jump over the candlestick.")

After map:
(Jack, 1), (be, 1), (nimble,, 1), (Jack, 1), (be, 1), (quick,, 1),
(Jack, 1), (jump, 1), (over, 1), (the, 1), (candlestick., 1)
After shuffle:
(Jack, (1,1,1)), (be, (1,1)), (nimble,, (1)), (quick,, (1)), (jump, (1)), (over, (1)), (the, (1)), (candlestick., (1))

After reduce:
(Jack, 3), (be, 2), (nimble,, 1), (quick,, 1), (jump, 1), (over, 1), (the, 1), (candlestick., 1)
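The full word-count flow above can be simulated in plain Python (a sketch of the programming model, not of Hadoop itself; note that tokens keep their punctuation, so "nimble," and "candlestick." are counted as-is):

```python
from collections import defaultdict

text = "Jack be nimble, Jack be quick, Jack jump over the candlestick."

# Map: emit (token, 1) for every whitespace-separated token.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Shuffle: group all values for the same key together.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: sum the grouped values for each key.
def reduce_fn(key, values):
    return (key, sum(values))

mapped = map_fn(text)
grouped = shuffle(mapped)
counts = dict(reduce_fn(k, v) for k, v in grouped.items())
print(counts)  # "Jack" appears 3 times, "be" twice
```

In real Hadoop the shuffle step is the network exchange between mapper and reducer nodes; here it is just an in-process dictionary.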
Example: LinkedIn "People You May Know"
Application Behavior Analytics
Risk & Fraud Analysis
Social Network "Connectedness"
http://techcrunch.com/2012/05/18/study-twitter-sentiment-mirrored-facebooks-stock-price-today/ http://www.cs.ucr.edu/~vagelis/publications/wsdm2012-microblog-financial.pdf
MapReduce is Different
MapReduce handles processing differently:
Distributed programming
Fault tolerant

MapReduce handles modeling differently:
Schema-less
Oriented toward exploration and discovery

MapReduce handles data differently:
Mostly unstructured data objects
Vast number of attributes and data sources
Data sources added and/or updated frequently
External references
Quality is unknown

http://developer.yahoo.com/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html
MapReduce
Does it handle Big Data? Is it considered MAD?
Magnetic
Agile (MapReduce requires algorithm development)
Deep
Is it an End-to-End Solution?
Big Data
IN-MEMORY COMPUTING
In-Memory Computing
Overview
All relevant structured data held in memory
Cache-aware memory organization (the current bottleneck sits between CPU and main memory)
Data partitioning for parallel execution

Current methodology: computation sits in the application stack, with the database stack optimized for disk access on platforms with limited main memory and slow disk I/O.
Future methodology: leverage current innovations in hardware and software to move computation into the database stack.
In-Memory Workflow
In-memory computing applies a combination of:
Optimization: query pruning and data distribution
Execution: SQL statement plan for computational parallelization
Stores: column store with partitioning and compression (5-30x ratio)
Persistence: temporal tables and MVCC

Example hardware: IBM x3850 X5 with QPI scaling or a MAX5 memory tray; 2, 3, or 4 TB RAM; 2-4 CPUs at 10 cores each; >4 TB across 8x HDD
http://ark.intel.com/
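The column-store compression idea above can be illustrated with a toy run-length encoder (an illustrative sketch only; production column stores also use dictionary encoding and other schemes):

```python
from itertools import groupby

# A single column stored contiguously, as in a column store.
# Long runs of repeated values compress very well.
region_column = ["EU", "EU", "EU", "US", "US", "EU", "EU", "EU", "EU", "US"]

def rle_encode(column):
    """Run-length encode a column into [(value, run_length), ...]."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(runs):
    """Expand run-length pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

encoded = rle_encode(region_column)
print(encoded)  # [('EU', 3), ('US', 2), ('EU', 4), ('US', 1)]
assert rle_decode(encoded) == region_column  # lossless round trip
```

Because each column holds values of one type, often sorted or low-cardinality, compression ratios like the 5-30x quoted above become plausible.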
Genome Analysis:
Optimized data warehouse: sequence alignment 81 minutes + variant calling 65 minutes
In-memory: sequence alignment 15 minutes + variant calling 19.5 minutes (6.5 min estimated)
Approximately 2 hours saved
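The quoted savings follow from simple arithmetic on the slide's timings:

```python
# Timings from the slide, in minutes.
warehouse = 81 + 65        # sequence alignment + variant calling
in_memory = 15 + 19.5      # sequence alignment + variant calling

savings_min = warehouse - in_memory
print(f"{warehouse} min vs {in_memory} min: {savings_min} min saved "
      f"(~{savings_min / 60:.1f} hours)")  # 111.5 min saved (~1.9 hours)
```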
In-Memory Computing
Does it handle Big Data? Is it considered MAD?
Magnetic (unstructured data still requires pre-processing)
Agile
Deep (unsupervised and supervised)
Is it an End-to-End Solution?
Big Data
18M Records
Hadoop-HANA Connector
2) Retrieves records
3) Joins 1.1B PoS records to session data
5) Writes back to database for persistence
6) Uses results to provide recommendations for future website visitors
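The join step can be sketched as a hash join in Python (record layouts and field names here are hypothetical, not from the slides):

```python
from collections import defaultdict

# Hypothetical point-of-sale and web-session records.
pos_records = [
    {"customer_id": 1, "product": "laptop"},
    {"customer_id": 2, "product": "phone"},
    {"customer_id": 1, "product": "mouse"},
]
sessions = [
    {"customer_id": 1, "pages_viewed": ["laptop", "tablet"]},
    {"customer_id": 2, "pages_viewed": ["phone"]},
]

# Hash join on customer_id: build a lookup from the smaller side,
# then probe it with the larger side (the PoS records).
session_by_customer = {s["customer_id"]: s for s in sessions}

joined = defaultdict(lambda: {"purchases": [], "pages_viewed": []})
for record in pos_records:
    session = session_by_customer.get(record["customer_id"])
    if session:
        entry = joined[record["customer_id"]]
        entry["purchases"].append(record["product"])
        entry["pages_viewed"] = session["pages_viewed"]

print(dict(joined))
```

At the 1.1B-record scale on the slide, this probe side is what the in-memory engine parallelizes across partitions.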
Better together
Do they handle Big Data together? MapReduce enables Magnetism:
Preprocesses unstructured data
Data provisioning
Most significant constraint: Shortage of talent to take advantage of the insights gained from large datasets
Deep analytical talent with technical skills in statistics to provide insights
Data-savvy analysts to interpret, challenge, and base decisions on results
Support personnel who develop, implement, and maintain the architecture

Source: Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute
Big Data
QUESTIONS?