
3/24/2013

Big Data
Jason P. Albert, University of Pennsylvania
jasonalb@wharton.upenn.edu

Big Data

PERSPECTIVES

What is Big Data?


High-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. (Gartner)

1 Terabyte = 1024 Gigabytes
1 Petabyte = 1024 Terabytes
1 Exabyte = 1024 Petabytes
1 Zettabyte = 1024 Exabytes

1 ZB (1,099,511,627,776 GB) × 7.9 ≈ 8,686,141,859,430 GB



How do we handle Big Data?


MAD Information Management is the approach:
Must be Magnetic, attracting all data sources
Must be Agile for easy accommodation of data at a rapid pace
Must provide sophisticated statistical methods for its Deep data repository

Why is MAD a departure from a traditional Data Warehouse?


What is the Scope of the Solution?


An End-to-End Solution must be considered:

Consume: Volume, Velocity, Variety
Store: Gigabytes, Terabytes, Petabytes
Process: Cluster, Classify, Predict
Present: Visualize, Interact, Evaluate


Perspectives on Big Data


Does it handle Big Data? Volume, Velocity, Variety
Is it considered MAD? Magnetic, Agile, Deep
Is it an End-to-End Solution? Consume, Store, Process, Present

Options to Consider
Two promising options with low market penetration (Gartner):
MapReduce and alternatives
In-memory Computing


Big Data

MAP REDUCE

Hadoop = MapReduce + HDFS


An open source, batch-oriented, data-intensive, general-purpose framework for creating distributed applications that process big data (i.e. Volume, Velocity, Variety).

Hadoop Distributed File System (HDFS)
Data distributed and replicated over multiple systems
Block oriented

MapReduce
Map function processes input data into intermediate key/value pairs
Reduce function merges intermediate values
Facilitates parallel processing of multi-terabyte datasets on large clusters of commodity platforms

Scale-out hardware: fully depreciated, repurposed, low cost

MapReduce Workflow
$ hadoop jar wordcount.jar WordCount /usr/input /usr/output

1. Input data is distributed
2. Map Tasks work on a split of data
3. Mapper outputs intermediate data
4. Data is exchanged between nodes
5. Intermediate data with the same key goes to the same reducer
6. Reducer output is stored

Map(key, value):
  for each word x in value: output.collect(x, 1)

Reduce(keyword, listOfValues):
  for each x in listOfValues: sum += x
  output.collect(keyword, sum)

Example input: "Jack be nimble, Jack be quick, Jack jump over the candlestick."

Splits (byte offset, line):
(0, "Jack be nimble,") (15, "Jack be quick") (28, "Jack jump over the candlestick")

Map output:
(Jack, 1), (be, 1), (nimble,, 1), (Jack, 1), (be, 1), (quick,, 1), (Jack, 1), (jump, 1), (over, 1), (the, 1), (candlestick., 1)

Grouped at the reducers:
(Jack, (1,1,1)), (be, (1,1)), (nimble,, (1)), (quick, (1)), (jump, (1)), (over, (1)), (the, (1)), (candlestick., (1))

Reduce output:
(Jack, 3), (be, 2), (nimble,, 1), (quick, 1), (jump, 1), (over, 1), (the, 1), (candlestick., 1)
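For reference, a minimal sketch of the WordCount job that the command above assumes, written against the standard Hadoop MapReduce Java API (the class name mirrors the command; everything else follows the well-known example):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word x in the input line, emit (x, 1).
  // The input key is the byte offset of the line, as in the splits shown above.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(line.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the 1s for each word and emit (word, total)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /usr/input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /usr/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as wordcount.jar, it would be launched exactly as in the command at the top of this slide.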


Scale-Out: MapReduce + HDFS


Case Study: Recommendations


1) 9 TB of W3C Extended Log File Format data
2) MapReduce program: sessionExtractor (a hypothetical mapper sketch follows below)

Example output (session-to-person mapping):
Session: SDF92MGSLOK4M23K, ASD90K23MOLFWQIE
Person: B041Q3EV, EM9IU67Y, N23KFMWE

Example applications:
LinkedIn "People you may know"
Application Behavior Analytics
Risk & Fraud Analysis
Social Network "Connectedness"
Text Analysis
Regressions (Financial)
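The sessionExtractor program itself is not shown in the deck. Purely as an illustration, its map step might resemble the following, assuming space-delimited W3C Extended Log lines; the field positions, field choices, and class name are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map step: group page requests by session identifier.
// Assumes a space-delimited W3C Extended Log line whose layout was declared
// in the #Fields: directive; the indices below are illustrative only.
public class SessionExtractorMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private static final int SESSION_FIELD = 9;   // e.g. cs(Cookie) position
  private static final int USER_FIELD = 3;      // e.g. cs-username position
  private static final int URI_FIELD = 5;       // e.g. cs-uri-stem position

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String raw = line.toString();
    if (raw.startsWith("#")) {
      return;  // skip #Software, #Version, #Fields, #Date directives
    }
    String[] fields = raw.split(" ");
    if (fields.length <= Math.max(SESSION_FIELD, Math.max(USER_FIELD, URI_FIELD))) {
      return;  // malformed or truncated record
    }
    // Emit (session, user + page) so the reducer can assemble one session record
    context.write(new Text(fields[SESSION_FIELD]),
                  new Text(fields[USER_FIELD] + "\t" + fields[URI_FIELD]));
  }
}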


Supplemental Case Study


Product Sentiment Analysis over Time
Load 1 month of Twitter feeds and opinion boards onto HDFS
Process using the Word Count example over positive and negative words associated with a product over time (an illustrative mapper sketch follows after the references below)

This type of analysis is being done with some success:


http://techcrunch.com/2012/05/18/study-twitter-sentiment-mirrored-facebooks-stock-price-today/
http://www.cs.ucr.edu/~vagelis/publications/wsdm2012-microblog-financial.pdf
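Not from the deck: one way such a word-count-style sentiment tally could be mapped, emitting +1/-1 per matched opinion word keyed by product and day. The input layout, word lists, and class name are assumptions for illustration only:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sentiment tally: assumes each input line is
// "<date>\t<product>\t<message text>"; emits (product|date, +1/-1) per opinion word.
public class SentimentMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Set<String> POSITIVE =
      new HashSet<>(Arrays.asList("great", "love", "excellent", "fast"));
  private static final Set<String> NEGATIVE =
      new HashSet<>(Arrays.asList("broken", "slow", "hate", "refund"));

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t", 3);
    if (parts.length < 3) {
      return;                                         // skip malformed records
    }
    Text key = new Text(parts[1] + "|" + parts[0]);   // product|date
    for (String word : parts[2].toLowerCase().split("\\W+")) {
      if (POSITIVE.contains(word)) {
        context.write(key, new IntWritable(1));
      } else if (NEGATIVE.contains(word)) {
        context.write(key, new IntWritable(-1));
      }
    }
  }
  // The standard summing reducer then yields a net sentiment score per product per day.
}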

MapReduce is Different
MapReduce handles processing differently:
Distributed programming
Fault tolerant

MapReduce handles modeling differently:
Schema-less
Oriented toward exploration and discovery

MapReduce handles data differently:
Mostly unstructured data objects
Vast number of attributes and data sources
Data sources added and/or updated frequently
Quality is unknown

External References:
http://developer.yahoo.com/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html

MapReduce
Does it handle Big Data?
Is it considered MAD?
Magnetism
Agile: MapReduce requires algorithm development
Deep
Is it an End-to-End Solution?


Big Data

IN-MEMORY COMPUTING

In-Memory Computing
Overview
All relevant structured data kept in-memory
Cache-aware memory organization (the current bottleneck is between CPU and main memory)
Data partitioning for parallel execution

Current Methodology: computation sits in the Application Stack, and the Database Stack is optimized for disk access on platforms with limited main memory and slow disk I/O.
Future Methodology: leverage current innovations in hardware and software to move computation into the Database Stack.

In-Memory Workflow
In-memory computing applies a combination of:
Optimization: query pruning and data distribution
Execution: SQL statement plans for computational parallelization
Stores: column store with partitioning/compression (5-30x ratio); see the dictionary-encoding sketch below
Persistence: temporal tables and MVCC

Example hardware: IBM x3850 X5 (QPI scaling or MAX5 tray), 2/3/4 TB RAM, 2-4 CPUs @ 10 cores each, > 4 TB across 8x HDD


http://ark.intel.com/
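Not from the deck: a self-contained sketch of dictionary encoding, the basic mechanism behind the 5-30x column-store compression ratios quoted above. Real column stores add run-length encoding and bit-packing and operate directly on the encoded form; this merely illustrates the idea:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal dictionary encoding of one column: store each distinct value once
// and keep only a small integer code per row. Low-cardinality columns
// (country, status, product category) compress dramatically this way.
public class DictionaryEncodedColumn {
  private final Map<String, Integer> dictionary = new HashMap<>(); // value -> code
  private final List<String> values = new ArrayList<>();           // code -> value
  private final List<Integer> codes = new ArrayList<>();           // one code per row

  public void append(String value) {
    Integer code = dictionary.get(value);
    if (code == null) {
      code = values.size();
      dictionary.put(value, code);
      values.add(value);
    }
    codes.add(code);
  }

  public String get(int row) {
    return values.get(codes.get(row));     // decode on access
  }

  public int distinctValues() {
    return values.size();
  }

  public static void main(String[] args) {
    DictionaryEncodedColumn country = new DictionaryEncodedColumn();
    String[] rows = {"DE", "US", "DE", "DE", "US", "FR", "DE"};
    for (String r : rows) {
      country.append(r);
    }
    // 7 rows but only 3 distinct strings are stored; with millions of rows the
    // per-row cost approaches a few bits, which is where the large ratios come from.
    System.out.println(country.distinctValues() + " distinct values for " + rows.length + " rows");
    System.out.println("row 2 decodes to " + country.get(2));
  }
}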


Scale-Out Strategy for In-Memory


Capturing and Presenting


Data Provisioning
IM-DBMS does not currently accommodate transaction workloads
Trigger Replication: new transactions replicate to an in-memory DB, facilitating real-time operational analysis, planning, and simulation
Extraction: ETL (Extract, Transform, Load) tools, with support for a large variety of external and internal source systems, handle other data sources in near real-time but require job scheduling (a minimal loading sketch follows below)

e.g. SAP HANA
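As a rough illustration only (the deck refers to ETL tools and trigger replication, not hand-written loaders), pushing extracted rows into an in-memory database can be as simple as a batched JDBC insert; the connection URL, credentials, and table layout below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical near-real-time extraction step: push a batch of source-system rows
// into an in-memory database over plain JDBC. Connection URL, credentials, and
// table/column names are placeholders, not taken from the deck.
public class InMemoryBatchLoader {

  public static void load(String[][] rows) throws Exception {
    String url = "jdbc:sap://imdb-host:30015/";   // illustrative in-memory DB endpoint
    try (Connection conn = DriverManager.getConnection(url, "LOADER", "secret")) {
      conn.setAutoCommit(false);                  // commit the whole batch at once
      String sql = "INSERT INTO POS_SALES (STORE_ID, PRODUCT_ID, AMOUNT) VALUES (?, ?, ?)";
      try (PreparedStatement stmt = conn.prepareStatement(sql)) {
        for (String[] row : rows) {
          stmt.setString(1, row[0]);
          stmt.setString(2, row[1]);
          stmt.setBigDecimal(3, new java.math.BigDecimal(row[2]));
          stmt.addBatch();                        // buffer rows client-side
        }
        stmt.executeBatch();                      // one round trip per batch
      }
      conn.commit();
    }
  }
}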


Case Study: Sales Analysis


1) Load 1.1 billion PoS records in < 1 sec
2) Identify top-selling categories
3) Drill down into a category in < 1 sec
4) Plan/actuals as schema & visualize

Link to Video: PoS from HANA using Business Objects Explorer


Examples of Performance Gains


Report on Product Dimensions
120 million line items
Standard ERP solution: several minutes on a pre-aggregated dataset; more for drilldown
In-Memory: less than 1 second on line-item-level data; minute delay for drilldown

Genome Analysis
Optimized Data Warehouse: sequence alignment 81 minutes + variant calling 65 minutes = 146 minutes
In-Memory: sequence alignment 15 minutes + variant calling 19.5 minutes (6.5 min estimated) = 34.5 minutes
Approximately 2 hours (111.5 minutes) saved

In-Memory Computing
Does it handle Big Data?
Is it considered MAD?
Magnetism: unstructured data still requires pre-processing
Agile
Deep: unsupervised and supervised methods
Is it an End-to-End Solution?


Big Data

HDFS + MAP REDUCE + IN MEMORY



Case Study: Recommendations


1) 9 TB of W3C Extended Log File Format data

2) MapReduce program: sessionExtractor


Example output (session-to-product mapping):
Session: SDF92MGSLOK4M23K, ASD90K23MOLFWQIE
Product: B041Q3EV, EM9IU67Y, N23KFMWE

18M Records

Hadoop-HANA Connector

Scale-Out: MapReduce + HDFS


Recall this slide as the Foundation


+ Case Study: Predictive Analysis


1) Add connection details to the Data Reader component
2) Retrieve records
3) Join 1.1B PoS records to session data
4) K-Means cluster of sessions; explore the outcome
5) Write back to the database for persistence
6) Use to provide recommendations for future website visitors
(An illustrative K-Means sketch follows below.)
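The clustering step itself is not shown in the deck. For orientation only, a bare-bones K-Means over per-session numeric features (e.g. page views, distinct products, total spend) could look like the sketch below; the features, k, and iteration count are arbitrary choices:

import java.util.Arrays;
import java.util.Random;

// Illustrative only: a bare-bones K-Means over per-session feature vectors.
// Not the tool or algorithm implementation used in the deck's demo.
public class SessionKMeans {

  // Returns the cluster index assigned to each session.
  static int[] cluster(double[][] sessions, int k, int iterations, long seed) {
    Random rnd = new Random(seed);
    int dims = sessions[0].length;
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = sessions[rnd.nextInt(sessions.length)].clone(); // random init
    }
    int[] assignment = new int[sessions.length];
    for (int it = 0; it < iterations; it++) {
      // Assignment step: nearest centroid by squared Euclidean distance
      for (int i = 0; i < sessions.length; i++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = 0;
          for (int j = 0; j < dims; j++) {
            double diff = sessions[i][j] - centroids[c][j];
            d += diff * diff;
          }
          if (d < bestDist) { bestDist = d; best = c; }
        }
        assignment[i] = best;
      }
      // Update step: move each centroid to the mean of its members
      double[][] sums = new double[k][dims];
      int[] counts = new int[k];
      for (int i = 0; i < sessions.length; i++) {
        counts[assignment[i]]++;
        for (int j = 0; j < dims; j++) sums[assignment[i]][j] += sessions[i][j];
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int j = 0; j < dims; j++) centroids[c][j] = sums[c][j] / counts[c];
        }
      }
    }
    return assignment;
  }

  public static void main(String[] args) {
    // Toy features per session: page views, distinct products, total spend
    double[][] sessions = {{2, 1, 20}, {3, 2, 35}, {40, 12, 900}, {38, 10, 850}};
    System.out.println(Arrays.toString(cluster(sessions, 2, 10, 42)));
  }
}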

Scale-Out Strategy for In-Memory


Recall this slide as the Foundation


Better together
Does it handle Big Data?
MapReduce enables Magnetism: preprocesses unstructured data
In-Memory enables Agility: Data Provisioning (Replication, Extraction)
Both MapReduce and In-Memory enable Deep analysis: during MapReduce preprocessing, and Unsupervised & Supervised methods in-memory
Is it an End-to-End Solution?


SAP HANA + Intel Distribution of Hadoop


Announced February 27, 2013:
http://www.sap.com/corporate-en/news.epx?PressID=20498


MAD Improvement Focus


Transformative potential in five domains:
U.S. Healthcare
E.U. Public Sector administration
Retail
Manufacturing
Personal Location

Most significant constraint: shortage of talent to take advantage of the insights gained from large datasets
Deep analytical talent with technical skills in statistics to provide insights
Data-savvy analysts to interpret, challenge, and base decisions on results
Support personnel who develop, implement, and maintain the architecture

Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute

Big Data

QUESTIONS?