
Big Data Analytics, Data Science & Fast Data

Kunal Joshi
joshik@vmware.com

2
EMC PROVEN PROFESSIONAL

Copyright © 2012 EMC Corporation. All Rights Reserved. 1


Agenda

BIG DATA
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
DATA SCIENCE
4. Introduction to Data Science
5. Data Science - Use Cases
FAST DATA
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data


Big Data Pioneers

1,000,000,000 Queries A Day

250,000,000 New Photos / Day

290,000,000 Updates / Day

Other Companies using Big Data

4,000,000 Claims / Day

2,800,000,000 Trades / Day

31,000,000,000 Interactions / Day

Moore’s Law
Gordon Moore (co-founder of Intel)

The number of transistors that can be placed on a processor DOUBLES approximately every TWO years.



Introduction to Big Data Analytics

Your Thoughts?

What is Big Data?

What makes data, “Big” Data?



Big Data Defined
• “Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
– Requires new data architectures, analytic sandboxes
– New tools
– New analytical methods
– Integrating multiple skills into new role of data scientist
• Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities

Source: McKinsey, May 2011 article, Big Data: The next frontier for innovation, competition, and productivity

Key Characteristics of Big Data
1. Data Volume
– 44x increase from 2010 to 2020 (1.2 zettabytes to 35.2 ZB)
2. Processing Complexity
– Changing data structures
– Use cases warranting additional transformations and analytical techniques
3. Data Structure
– Greater variety of data structures to mine and analyze

Big Data Characteristics: Data Structures
Data Growth is Increasingly Unstructured

Ordered from more structured to less structured:

Structured
• Data containing a defined data type, format, structure
• Example: Transaction data and OLAP

Semi-Structured
• Textual data files with a discernible pattern, enabling parsing
• Example: XML data files that are self-describing and defined by an XML schema

“Quasi” Structured
• Textual data with erratic data formats; can be formatted with effort, tools, and time
• Example: Web clickstream data that may contain some inconsistencies in data values and formats

Unstructured
• Data that has no inherent structure and is usually stored as different types of files
• Example: Text documents, PDFs, images and video
Four Main Types of Data Structures
• Structured Data
• Semi-Structured Data: web page markup, as seen via View → Source
• Quasi-Structured Data: a clickstream URL such as
http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
• Unstructured Data: The Red Wheelbarrow, by William Carlos Williams
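
The clickstream URL on this slide shows why such data is called quasi-structured: the fields can be recovered, but only with some parsing effort. Below is a minimal Python sketch, assuming the search parameters sit in the URL fragment (after `#`) as they do in this example. The helper name `parse_clickstream_url` is hypothetical and not part of any tool named in the deck.

```python
from urllib.parse import urlsplit, parse_qs

URL = ("http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t"
       "&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1"
       "&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl="
       "&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651")

def parse_clickstream_url(url):
    """Return a dict of field -> value taken from the query string or fragment."""
    parts = urlsplit(url)
    raw = parts.query or parts.fragment        # this URL keeps its fields after '#'
    parsed = parse_qs(raw, keep_blank_values=True)
    return {name: values[0] for name, values in parsed.items()}

fields = parse_clickstream_url(URL)
# The current query, the previous query, and what the user had typed so far.
print(fields["q"], "|", fields["pq"], "|", fields["oq"])
```

Empty fields such as `gs_sm` are kept as empty strings, which is the kind of inconsistency a clickstream pipeline would have to decide how to handle.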

Business Drivers for Big Data Analytics
Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven

Driver: Examples
1. Desire to optimize business operations: Sales, pricing, profitability, efficiency
2. Desire to identify business risk: Customer churn, fraud, default
3. Predict new business opportunities: Upsell, cross-sell, best new customer prospects
4. Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending, Basel II

Challenges with a Traditional Data Warehouse
[Figure: traditional data warehouse architecture with four numbered problem areas. Labels: Data Sources; Departmental Warehouse; “Spread Marts”; Enterprise Warehouse; Departmental Applications; Prioritized Operational Processes; Reporting; Siloed Analytics; Non-Agile Models; Static schemas accrete over time; Non-Prioritized Data Provisioning; Errant data & marts]

Implications of a Traditional Data Warehouse

• High-value data is hard to reach and leverage
• Predictive analytics & data mining activities are last in line for data
– Queued after prioritized operational processes
• Data is moving in batches from EDW to local analytical tools
– In-memory analytics (such as R, SAS, SPSS, Excel)
– Sampling can skew model accuracy
• Isolated, ad hoc analytic projects, rather than centrally-managed harnessing of analytics
– Non-standardized initiatives
– Frequently, not aligned with corporate business goals

Result: Slow “time-to-insight” & reduced business impact

Opportunities for a New Approach to Analytics
New Applications Driving Data Volume

[Figure: volume of information, SMALL to LARGE, by decade]
1990s (RDBMS & DATA WAREHOUSE): measured in TERABYTES (1 TB = 1,000 GB)
2000s (CONTENT & DIGITAL ASSET MANAGEMENT): measured in PETABYTES (1 PB = 1,000 TB)
2010s (NO-SQL & KEY/VALUE): will be measured in EXABYTES (1 EB = 1,000 PB)
Considerations for Big Data Analytics
Criteria for Big Data Projects
1. Speed of decision making
2. Throughput
3. Analysis flexibility

New Analytic Architecture: Analytic Sandbox
Data assets gathered from multiple sources and technologies for analysis
• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”



State of the Practice in Analytics: Mini-Case Study
Big Data Enabled Loan Processing at XYZ bank

[Figure: underwriting risk level, Traditional (TRADITIONAL DATA LEVERAGED) vs. Big Data Enabled (BIG DATA LEVERAGED)]

Your Thoughts?



Big Data Analytics: Industry Examples

1. Health Care: Reducing Cost of Care
2. Public Services: Preventing Pandemics
3. Life Sciences: Genomic Mapping
4. IT Infrastructure: Unstructured Data Analysis
5. Online Services: Social Media for Professionals

[Figure: Data Collectors: Medical, Government, Internet, Phone/TV, Retail, Financial]

1
Big Data Analytics: Healthcare

Situation
• Poor police response and problems with medical care, triggered by shooting of a Rutgers student
• The event drove local doctor to map crime data and examine local health care

Use of Big Data
• Dr. Jeffrey Brenner generated his own crime maps from medical billing records of 3 hospitals

Key Outcomes
• City hospitals & ERs provided expensive, low-quality care
• Reduced hospital costs by 56% by realizing that 80% of city's medical costs came from 13% of its residents, mainly low-income or elderly
• Now offers preventative care over the phone or through home visits
2
Big Data Analytics: Public Services

Situation
• Threat of global pandemics has increased exponentially
• Pandemics spread at faster rates and are more resistant to antibiotics

Use of Big Data
• Created a network of viral listening posts
• Combines data from viral discovery in the field, research in disease hotspots, and social media trends
• Using Big Data to make accurate predictions on spread of new pandemics

Key Outcomes
• Identified a fifth form of human malaria, including its origin
• Identified why efforts failed to control swine flu
• Proposing more proactive approaches to preventing outbreaks

3
Big Data Analytics: Life Sciences

Situation
• Broad Institute (MIT & Harvard) mapping the Human Genome

Use of Big Data
• In 13 yrs, mapped 3 billion genetic base pairs; 8 petabytes
• Developed 30+ software packages, now shared publicly, along with the genomic data

Key Outcomes
• Using genetic mappings to identify cellular mutations causing cancer and other serious diseases
• Innovating how genomic research informs new pharmaceutical drugs

4
Big Data Analytics: IT Infrastructure

Situation
• Explosion of unstructured data required new technology to analyze it quickly and efficiently

Use of Big Data
• Doug Cutting created Hadoop to divide large processing tasks into smaller tasks across many computers
• Analyzes social media data generated by hundreds of thousands of users

Key Outcomes
• New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hrs
• Applications include social media, sentiment analysis, wartime chatter, and natural language processing

5
Big Data Analytics: Online Services

Situation
• Opportunity to create social media space for professionals

Use of Big Data
• Collects and analyzes data from over 100 million users
• Adding 1 million new users per week

Key Outcomes
• LinkedIn Skills, InMaps, Job Recommendations, Recruiting
• Established a diverse data scientist group, as founder believes this is the start of Big Data revolution



Greenplum Unified Analytic Platform

[Figure: platform layers, top to bottom]
• Unify your team: DATA SCIENCE TEAM (Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User)
• Drive Collaboration: GREENPLUM CHORUS – Analytic Productivity Layer
• Keep Your Options Open: Partner Tools & Services
• The Power of Data Co-Processing: GREENPLUM DATABASE, GREENPLUM HD, Data Platform Admin, Greenplum gNet
• Cloud, x86 Infrastructure, or Appliance



Greenplum Hadoop

[Figure: spectrum from STRUCTURED to UNSTRUCTURED. Labels: Schema on load, SequenceFile, Directories, MapReduce, Hive, Java, XML, JSON, …, No ETL, Pig, Flat files]

Greenplum Database

[Figure: spectrum from STRUCTURED to UNSTRUCTURED. Labels: SQL, BI Tools, Partitioning, RDBMS, Indexing, Greenplum MapReduce, Tables and Schemas]

What do we Mean by Hadoop
• A framework for handling big data
– An implementation of the MapReduce paradigm
– Hadoop glues the storage and analytics together and provides reliability, scalability, and management

Two Main Components

Storage (Big Data)
– HDFS: Hadoop Distributed File System
– Reliable, redundant, distributed file system optimized for large files

MapReduce (Analytics)
– Programming model for processing sets of data
– Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)
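
The MapReduce model described above can be illustrated with the classic word count. This is a single-process Python sketch of the programming model only, not Hadoop code: the function names are hypothetical, and each input string merely stands in for the split a real mapper would read from HDFS.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group emitted values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single answer."""
    return key, sum(values)

def word_count(documents):
    mapped = []
    for doc in documents:              # each document stands in for one mapper's split
        mapped.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

splits = ["big data fast data", "fast data meets big data"]
counts = word_count(splits)
print(counts)
```

Because each mapper only sees its own split and each reducer only sees one key, both phases can be spread over many machines without coordination beyond the shuffle.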

Hadoop Distributed File System

MapReduce and HDFS
[Figure: (1) a large data set (log files, sensor data) is stored in the Hadoop Distributed File System (HDFS); (2) the client/developer submits Map and Reduce jobs to the Job Tracker; (3) the Job Tracker hands them to the Task Trackers; (4) each Task Tracker runs its Map and Reduce jobs against the data in HDFS]
Components of Hadoop
• As you move from Pig to Hive to HBase, you are increasingly moving away from the mechanics of Hadoop and get an RDBMS view of the Big Data world

Less Hadoop visible (DBMS view)
– HBase: queries against defined tables
– Hive: SQL-based language
– Pig: data flow language & execution environment
More Hadoop visible (mechanics of Hadoop)

Greenplum Database
Extreme Performance for Analytics

• Optimized for BI and analytics
– Deep integration with statistical packages
– High performance parallel implementations
• Simple and automatic parallelization
– Just load and query like any database
– Tables are automatically distributed across nodes
– No need for manual partitioning or tuning
• Extremely scalable
– MPP* shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes, where each node adds storage, query & load performance
*MPP: Massively Parallel Processing
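
To make "tables are automatically distributed across nodes" and "all nodes can scan and process in parallel" concrete, here is a small Python sketch of hash distribution followed by a two-step aggregate. It is an illustration under assumptions, not Greenplum's implementation: the CRC32 hash, the three-segment cluster, and all function names are hypothetical, and the segments are visited in a simple loop rather than on separate machines.

```python
import zlib
from collections import Counter

NUM_SEGMENTS = 3  # hypothetical cluster size

def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment from a stable hash of the distribution key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key_field):
    """Spread rows across segments; each segment owns its slice (shared-nothing)."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key_field])].append(row)
    return segments

def local_totals(segment_rows):
    """Work one segment does on its own: partial SUM(amount) per customer."""
    totals = Counter()
    for row in segment_rows:
        totals[row["customer"]] += row["amount"]
    return totals

def parallel_sum(rows):
    """Master step: merge the partial results returned by every segment."""
    merged = Counter()
    # Each local_totals call is independent, so a real MPP system runs them concurrently.
    for segment_rows in distribute(rows, "customer"):
        merged.update(local_totals(segment_rows))
    return dict(merged)

rows = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
    {"customer": "c", "amount": 1},
]
totals = parallel_sum(rows)
print(totals)
```

Adding a segment in this model adds both storage and scan capacity, which is the intuition behind the linear-scalability claim on the slide.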
Greenplum DB & HD
Massively Parallel Access and Movement
• Maximize Solution Flexibility
• Minimize Data Duplication
• Access Hadoop Data in Real Time From Greenplum DB
• Import and export in Text, Binary and Compressed Formats
• Custom formats via user-written MapReduce Java program and GPDB Format classes

[Figure: Greenplum DB (GP DB Master Host, Segments 1 to 3) connected through gNet over 10Gb Ethernet to Hadoop (Nodes 1 to 3). Access paths: External Tables (Text, Binary, User-Defined) and MapReduce]



Exploiting Parallelism
In-Database Analytics

[Figure: a Master Segment Processor connected through the Interconnect Switch to independent Segment Processors, each with independent memory and an independent direct storage connection. Analytical software (math & statistical functions) runs on the segments and returns analytic results]



Big Data Requires Data Science

[Figure: BUSINESS VALUE (Low to High) plotted against TIME (Past to Future)]
• Business Intelligence: standard reporting; "What happened?" (past, lower business value)
• Data Science: predictive analysis; "What if…..?" (future, higher business value)

Data science and business intelligence

“TRADITIONAL BI”
• Repetitive
• Structured
• Operational
• GBs to 10s of TBs

“BIG DATA ANALYTICS”
• Experimental, Ad Hoc
• Mostly Semi-Structured
• External + Operational
• 10s of TBs to PBs

Profile of a Data Scientist

Data Science as a Process

[Figure: process stages: Domain, Data Prep, Variable Selection, Model Building, Model Execution, Communication & Operationalization, with Evaluate feeding back, all resting on People and Ecosystem]

People and Ecosystem
• People
– Scientists / Analysts
– Business Analysts
– Consumers of analysis
– Stakeholders
– EMC sales and services
• Ecosystem
– Sector (Telecom, banking, security agency etc.)
– Modeling software and other tools used by analysts (MADlib, SAS, R etc.)
– Database (Greenplum) & Data Sources



Data Science as a Process

Stage: Domain

Discovery & prioritized identification of opportunities
• Customer Retention
• Fraud detection
• Pricing
• Marketing effectiveness and optimization
• Product Recommendation
• Others……



Data Science as a Process

Stage: Data Prep

• What are the data sources?
• Do we have access to them?
• How big are they?
• How often are they updated?
• How far back do they go?
• Which of these data sources are being used for analysis? Can we use a data source which is currently unused? What problems would that help us solve?



Data Science as a Process

Stage: Variable Selection

• Selection of raw variables which are potentially relevant to problem being solved
• Transformations to create a set of candidate variables
• Clustering and other types of categorization which could provide insights



Data Science as a Process

Stage: Model Building

Pick suitable statistics, or suitable model form and algorithm, and build model



Data Science as a Process

Stage: Model Execution

The model needs to be executable in database on big data with reasonable execution time



Data Science as a Process

Stage: Communication & Operationalization

The model results need to be communicated & operationalized to have a measurable impact on the business



Data Science as a Process

Stage: Evaluate

• Accuracy of results and forecasts
• Analysis of real-world experiments
• A/B testing on target samples
• End-user and LOB feedback



Use Case 1: Trip modeling
• Problem: Analyze behaviour of visitors to MakeMyTrip.com
• Particularly interested in unregistered visitors
– About 99% of total visitor traffic



Applications of model
• Tailor promotions for popular types of trips
– Most popular types probably already well-known; potential in next tier down
• ... and for different types of customers
• Present customised promotions to visitors based on clicks
• Ad optimization: present ads based on modelled behavior

Hypertargeting
• Serving content to customers based on individual
characteristics and preferences, rather than broad
generalizations

Available data
• Data available from server:
– Date/time
– IP address
– Parts of site visited
• Geographic location can be obtained via geo lookup on IP
• Personal information available for registered visitors only

Approach
• Use clustering to identify trip/visitor types
– Sport (IPL, F1, Football, etc.)
– Festivals
– Other seasonal movements
• Decision trees to predict which type of trip a visitor is likely to make
– Based on successively more information as they move through the site
• Use registered visitor info to augment models
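
The clustering step in this approach can be sketched with plain k-means. The deck does not say which clustering algorithm or features were used, so the choice of k-means, the two features, and the sample visits below are all assumptions made for illustration. The decision-tree step is not shown.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-centre."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:                    # keep the old centroid if a cluster empties
                centroids[i] = mean(members)
    return centroids, assignment

# Hypothetical visit features: (sport-event pages viewed, trip length in days)
visits = [(8, 2), (9, 3), (7, 2), (1, 10), (0, 12), (2, 11)]
centroids, labels = kmeans(visits, centroids=[visits[0], visits[3]])
print(labels)
```

With these sample points the short, sport-heavy visits and the long trips fall into two separate clusters; each cluster label could then serve as the target that a decision tree tries to predict from a visitor's early clicks.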

Use Case 2: Municipal traffic analysis
• Client domain: Municipal city government
• Available data:
– Cross-city loop detectors measuring traffic volume
– Detailed city bus movement information from Bluetooth devices
– Video detection of traffic volume, velocity
• Goal: Exploit available data for unrealized business insights and value

Data loading and manipulation
• Parallel data loading
– Data loaded from local file system and distributed across Greenplum servers in parallel.
– Loading 9 months of traffic volume data (16 GB, 464 million rows) in 69.4 seconds.
• SQL data manipulation
– Standard SQL permits city personnel to use existing skillsets.
– Greenplum SQL extensions offer control over data distribution.
– Open source packages (e.g. in Python, R) can be conveniently deployed within Greenplum for visualization and analytics purposes.

Basic reporting on traffic volume
• Easy generation of reports via straightforward user-defined functions
• Standard graphing utilities called from within Greenplum to create figures
• Detector downtimes can be clearly spotted in the figure, or via an SQL query, thus
mitigating maintenance challenges
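
Spotting detector downtime "via an SQL query" could look roughly like the following. The deck does not show the actual schema or query, so the table `readings`, its columns, and the one-reading-per-minute assumption are hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for Greenplum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (detector_id TEXT, minute INTEGER, volume INTEGER)")
# Hypothetical data: one reading per minute is expected from each detector.
rows = [("D1", m, 20) for m in (0, 1, 2, 3, 4, 5)]
rows += [("D2", m, 15) for m in (0, 1, 2, 9, 10)]   # D2 is silent for minutes 3..8
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# For every reading, look up the next reading of the same detector and
# report the pair when the distance between them exceeds the allowed gap.
GAP_QUERY = """
SELECT r.detector_id,
       r.minute + 1 AS down_from,
       (SELECT MIN(n.minute) FROM readings n
         WHERE n.detector_id = r.detector_id AND n.minute > r.minute) - 1 AS down_to
FROM readings r
WHERE (SELECT MIN(n.minute) FROM readings n
        WHERE n.detector_id = r.detector_id AND n.minute > r.minute) - r.minute > ?
ORDER BY r.detector_id, r.minute
"""

def find_downtimes(connection, max_gap_minutes=1):
    """Return (detector, first missing minute, last missing minute) per gap."""
    return connection.execute(GAP_QUERY, (max_gap_minutes,)).fetchall()

downtimes = find_downtimes(conn)
print(downtimes)
```

The correlated subquery keeps the sketch portable; on a database with window functions the same lookup is usually written with `LEAD` over a partition by detector.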

Basic reporting on city buses
• Data from Bluetooth devices has a wealth of information on city buses that we can report on:
– Travel route of each bus
– Deviations of arrival times compared to provided timetable
– Occurrences of driver errors (e.g. taking a wrong turn) and possible causes
– Occurrences where buses of the same service arrive at the same stop within seconds of each other
– Whether new bus services translate into lower traffic volume on the roads where they are introduced

Result visualizations (Google Earth)

Applications for traffic network modelling
• Compute the fastest path between any two locations at a
future time point
• Identify potential bottlenecks in the traffic
• Identify phase transition points for massive traffic congestion
using simulation techniques
• Study the likely impact of new roads and traffic policies,
without having to observe real disruptive events to
determine the impact
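
Computing the fastest path between two locations is a shortest-path problem once each road segment has a predicted travel time. The sketch below uses Dijkstra's algorithm on a tiny hand-made network. The deck does not state which algorithm was used, and the graph, its node names, and the travel times are hypothetical.

```python
import heapq

def fastest_path(graph, source, target):
    """Dijkstra's algorithm over a dict {node: {neighbour: travel_minutes}}."""
    best = {source: 0}
    previous = {}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue                       # stale queue entry
        for neighbour, minutes in graph.get(node, {}).items():
            new_cost = cost + minutes
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    if target not in best:
        return None, []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]

# Hypothetical predicted travel times (minutes) for one future time slot.
road_network = {
    "A": {"B": 4, "C": 2},
    "B": {"D": 5},
    "C": {"B": 1, "D": 8},
    "D": {},
}
minutes, route = fastest_path(road_network, "A", "D")
print(minutes, route)
```

Swapping in a different set of edge weights per time slot is one simple way to answer the "at a future time point" part of the question.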

Parallel traffic network modelling
• Greenplum's parallel architecture permits traffic network analysis on a city scale
• Travel time can be predicted via model learning, involving hundreds of thousands of optimizations in parallel, across the entire traffic network
• Variables that can be considered include
– Distance between two locations
– Concurrent traffic volume
– Time of day
– Weather
– Construction work
• Computationally prohibitive for traditional non-parallel database environments

Use Case 3: Product Recommendation Analysis
• Eight banks became one
– Branches across the US
• Consolidation of products and customers
– Employees faced with new products and customers
– Visibility into churn and retention was challenged
• Analytics focus was historically reporting-centric
– Descriptive “hindsight”

Customer Segmentation
Customer segments
– First, define a measurement of
customer value
– Then create clusters of
customers based on customer
value, and then product
profiles.



Association Rules
Product associations
– Now find products that are
common in the segment, but
not owned by the given
household.

Product X
Product A
Product Y
Product B
Product Z



Product Recommendations
Next best offer
– Now, filter down to products
associated with high-value
customers in the same segment.

Product X
Product A
Product Y
Product B
Product Z
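
The three steps on the last slides (segment customers, find products common in the segment but not owned by the household, keep those associated with high-value customers) can be sketched as follows. This is an illustration under assumptions, not the bank's model: the upper-median cutoff for "high-value", the 50% popularity threshold, and the sample households are all hypothetical.

```python
from collections import Counter

def recommend(household, segment_households, owned, value, min_share=0.5, top_n=2):
    """Suggest products common among high-value peers that the household lacks."""
    peers = [h for h in segment_households if h != household]
    if not peers:
        return []
    # "High-value" here means at or above the peers' upper median (an assumption).
    cutoff = sorted(value[h] for h in peers)[len(peers) // 2]
    high_value = [h for h in peers if value[h] >= cutoff]
    popularity = Counter(p for h in high_value for p in owned[h])
    candidates = [
        (count / len(high_value), product)
        for product, count in popularity.items()
        if product not in owned[household] and count / len(high_value) >= min_share
    ]
    # Most widely held first; ties broken alphabetically for a stable result.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [product for _, product in candidates[:top_n]]

# Hypothetical segment: product holdings and a customer-value score per household.
owned = {
    "h1": {"A", "B"},
    "h2": {"A", "B", "X", "Y"},
    "h3": {"A", "X", "Y"},
    "h4": {"B", "Z"},
    "h5": {"A", "X"},
}
value = {"h1": 40, "h2": 90, "h3": 80, "h4": 10, "h5": 70}
offers = recommend("h1", list(owned), owned, value)
print(offers)
```

Here household `h1` owns products A and B, and the sketch proposes X and Y because both high-value peers hold them, while Z is dropped because only a low-value peer owns it.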



Increased customer value
Customer Comments
– “The Greenplum Solution has
scaled from 6 to 11 TB of data.”
– Processing moved from 7 hours for one month of data to 7.5 hours for 2.5 years of data

Product Recommender


Ferrari VS Freight Train

              Ferrari              Freight Train
0-100 KMPH    2.3 seconds          100 seconds
Top Speed     360 KMPH             140 KMPH
Stops / hr    1000                 5
Horse Power   660 bhp              16,000 bhp
Throughput    220 KG in 27 mins    55,000,000 KG in 60 mins

Fast Data VS Big Data

                        Fast Data                       Big Data
Transactions / Second   100,000+ per second             n/a
Concurrent hits         10,000+ per second              10 per second
Update Patterns         Read / Write                    Appends
Data Complexity         Simple joins on a few tables    Can be highly complex
Data Volumes            GBs to TBs                      PB to ZB
Access Tools            GemFire / SQLFire               GP DB, GP Hadoop

Not a fast OLTP DB!
APPLICATION(S)

Fast Data is
• More than just an OLTP DB
• Super Fast access to Data
• Server side flexibility
• Data is HA
• Supports transactions
• Setup is fault tolerant
• Can handle thousands of concurrent hits
• Distributed hence horizontally scalable
• Runs on cheap x86 hardware

CAP Theorem
A distributed system can only achieve TWO out of the three qualities of Consistency, Availability and Partition Tolerance

Consistency / Availability / Partition Tolerance



Fast Data =
Database
• Storage
• Persistence
• Transactions
• Queries
• High Availability
• Load Balancing
• Data Replication
• L1 Caching

+ Messaging System
• Data Distribution
• Event Propagation
• Guaranteed Delivery

+ Service Bus
• System Integration
• Data Transformation
• Service Loose Coupling

+ Grid Controller
• Task Decomposition
• Distributed Task Assignment
• Map-Reduce, Scatter-Gather
• Result Summarization

+ Complex Event Processor
• Business Event Detection
• Real-time Analysis
• Event Driven Architectures

Fast Data combines select features from all of these products into a low-latency, linearly scalable, memory-based data fabric



A Typical Fast Data Setup

[Figure: tiers, top to bottom: Load Balancer, Web Tier, Application Tier, Database Tier, Storage Tier]
• Add/remove web/application/data servers
• Add/remove storage; disks may be direct or network attached
• Optional reliable, asynchronous feed to a Big Data Store



Memory-based Performance
Perform

Fast Data uses memory on a peer machine to make data updates durable, allowing the updating thread to return 10x to 100x faster than updates that must be written through to disk, without risking any data loss. Typical latencies are in the few hundreds of microseconds instead of in the tens to hundreds of milliseconds.

One can optionally write updates to disk / data warehouse / big data store asynchronously and reliably.
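
The pattern described here, a synchronous copy to a peer's memory plus an asynchronous write to slower storage, is often called write-behind. The following Python sketch shows the idea inside one process. It is not the GemFire or SQLFire API: the class `WriteBehindStore`, its dictionaries standing in for a peer machine and for disk, and the method names are all hypothetical.

```python
import queue
import threading

class WriteBehindStore:
    """Toy in-memory store: synchronous copy to a peer, asynchronous write to 'disk'."""

    def __init__(self):
        self.primary = {}        # local memory
        self.peer = {}           # stands in for memory on a second machine
        self.disk = {}           # stands in for the slow, durable back end
        self._pending = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def put(self, key, value):
        self.primary[key] = value
        self.peer[key] = value           # redundancy before returning to the caller
        self._pending.put((key, value))  # the disk write is deferred, not skipped

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self.disk[key] = value       # a real system would batch and retry here
            self._pending.task_done()

    def flush(self):
        """Block until every queued update has reached the back end."""
        self._pending.join()

store = WriteBehindStore()
for i in range(5):
    store.put(f"order-{i}", {"qty": i})
store.flush()
print(len(store.disk))
```

The caller of `put` never waits for the slow write, which is where the latency gain comes from; durability in the meantime rests on the second in-memory copy.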



WAN Distribution

Distribute

Fast Data can keep clusters that are distributed around the world synchronized in real time and can operate reliably in Disconnected, Intermittent and Low-Bandwidth network environments.



Distributed Events

Notify

Targeted, guaranteed-delivery event notification and Continuous Queries
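
A continuous query differs from an ordinary query in that it is registered once and then evaluated against every later update. The sketch below shows that idea with a toy key-value region. It is not the GemFire continuous-query API: the `Region` class, `register_continuous_query`, and the trade data are hypothetical, and delivery guarantees are not modelled.

```python
class Region:
    """Toy key-value region that notifies listeners whose predicate matches."""

    def __init__(self):
        self._data = {}
        self._queries = []   # (predicate, callback) pairs

    def register_continuous_query(self, predicate, callback):
        self._queries.append((predicate, callback))

    def put(self, key, value):
        self._data[key] = value
        for predicate, callback in self._queries:
            if predicate(value):          # evaluated on every update, not on demand
                callback(key, value)

alerts = []
trades = Region()
# Hypothetical query: "tell me about every trade above 1,000 units".
trades.register_continuous_query(
    predicate=lambda trade: trade["qty"] > 1000,
    callback=lambda key, trade: alerts.append(key),
)
trades.put("t1", {"symbol": "EMC", "qty": 500})
trades.put("t2", {"symbol": "VMW", "qty": 5000})
print(alerts)
```

Only the second trade matches, so only the interested client is notified, which is what "targeted" means on the slide.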



Parallel Queries

Compute

[Figure: Batch Controller or Client]

Scatter-Gather (Map-Reduce) Queries
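
A scatter-gather query sends the same small query to every partition, lets each one answer from its local data, and combines the partial answers. Here is a minimal Python sketch using a thread pool. The three in-process lists stand in for data servers, and the partition contents and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of one logical data set, one per server.
PARTITIONS = [
    [{"sku": "a", "qty": 3}, {"sku": "b", "qty": 1}],
    [{"sku": "a", "qty": 2}],
    [{"sku": "c", "qty": 7}, {"sku": "b", "qty": 4}],
]

def local_query(partition, sku):
    """Runs next to the data: total quantity for one SKU in this partition."""
    return sum(row["qty"] for row in partition if row["sku"] == sku)

def scatter_gather(sku):
    """Scatter the query to every partition in parallel, then gather and combine."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        partials = list(pool.map(lambda part: local_query(part, sku), PARTITIONS))
    return sum(partials)

total_a = scatter_gather("a")
print(total_a)
```

Only one number per partition travels back to the caller, regardless of how many rows each partition had to scan.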



Data-Aware Routing
Execute

[Figure: Batch Controller or Client invoking a Data Aware Function]

Fast Data provides “data aware function routing”: moving the behavior to the correct data instead of moving the data to the behavior.
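
Data-aware function routing can be pictured as follows: the cluster works out which member owns a key and runs the function there, so only the function and its small result cross the network. The sketch is an in-process illustration, not GemFire's function-execution API. The `Cluster` and `Partition` classes, the CRC32-based ownership rule, and the order data are hypothetical.

```python
import zlib

class Partition:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def execute(self, function, key):
        """Run the function locally, against data this partition already holds."""
        return function(self.data[key]), self.name

class Cluster:
    def __init__(self, partition_names):
        self.partitions = [Partition(n) for n in partition_names]

    def owner(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, value):
        self.owner(key).data[key] = value

    def execute_on(self, key, function):
        """Ship the (small) function to the data's owner instead of fetching the data."""
        return self.owner(key).execute(function, key)

cluster = Cluster(["node-1", "node-2", "node-3"])
cluster.put("order-42", {"lines": [{"qty": 2, "price": 10.0}, {"qty": 1, "price": 5.0}]})

def order_total(order):
    return sum(line["qty"] * line["price"] for line in order["lines"])

result, ran_on = cluster.execute_on("order-42", order_total)
print(result, ran_on)
```

The same ownership rule is used for `put` and `execute_on`, which is what guarantees the function lands on the member that holds the order.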



Accessing Fast Data
GemFire
• Key-Value store with OQL Queries
• Stores Objects (Java, C++, C#, .NET) or unstructured data
• Spring-GemFire
• L2 Cache plugin for Hibernate
• HTTP Session replication module

SQLFire
• Stores Relational Data with SQL interface
• Supports JDBC, ODBC, Java and .NET interfaces
• Uses existing relational tools

[Figure: example object model with Order, Order Line Item, Quantity, Product, SKU, Unit Price, Discount]

2
EMC PROVEN PROFESSIONAL

Copyright © 2012 EMC Corporation. All Rights Reserved.


Agenda

1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data


Use Cases

Applying the technology


A few examples of Fast Data technology
applied to real business cases

Mainframe Migration

A mainframe-based, nightly customer account reconciliation batch run

[Figure: batch run timeline, 0 to 120 min]

Mainframe: CPU Unavailable / CPU Busy / I/O Wait: 76% / 15% / 9%

COTS Cluster: Batch now runs in 60 seconds

93% Network Wait! Time could have been reduced further with higher network bandwidth
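
The arithmetic behind these figures can be worked through in a few lines. Two assumptions are mine, not the slide's: that the original batch filled the full 120-minute window, and that network wait would shrink in proportion to added bandwidth while the remaining work stays fixed.

```python
mainframe_seconds = 120 * 60   # assumed: the old batch used the full 120-minute window
cluster_seconds = 60           # from the slide

speedup = mainframe_seconds / cluster_seconds                 # 120x
freed_minutes = (mainframe_seconds - cluster_seconds) / 60    # 119 minutes

network_wait = 0.93 * cluster_seconds        # seconds spent waiting on the network
other_work = cluster_seconds - network_wait  # everything else


def projected_runtime(bandwidth_factor):
    """Assumes network wait scales inversely with bandwidth; other work is fixed."""
    return other_work + network_wait / bandwidth_factor


runtime_2x = projected_runtime(2)
```

Under those assumptions, doubling the bandwidth would bring the 60-second run down to roughly 32 seconds, which is the sense in which the network is the remaining bottleneck.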

Mainframe Migration

So What? So the batch runs faster – who cares?

1. It ran on cheaper, modern, scalable hardware
2. If something goes wrong with the batch, you only wait 60 seconds to find out
3. Now, the hardware and the data are available to do other things in the remaining 119 minutes:
• Fraud detection
• Regulatory compliance
• Re-run risk calculations with 119 different scenarios
• Up-sell to customers
4. You can move from batch to real-time processing!

Online Betting

A popular online gambling site attracts new players through ads on affiliate sites

Customized Banner Ad on affiliate site


In a fraction of a second, the banner ad server must:
Generate a tracking id specific to the request
Apply temporal, sequential, regional, contractual and other policies in order to decide which banner to deliver
Customize the banner
Record that the banner ad was delivered

[Figure: Affiliate's Web Server and Banner Ad Server, with steps 1 to 4]
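
The four steps can be sketched as a single request handler. Everything specific here is made up for illustration: the banner catalogue, the two policy types (regional and temporal), the rule that the last eligible banner wins, and the in-memory `delivery_log` that stands in for the store recording deliveries.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical banner catalogue with regional and time-of-day policies.
BANNERS = [
    {"id": "uk-football", "regions": {"UK"}, "hours": range(0, 24)},
    {"id": "late-night-poker", "regions": {"UK", "DE"}, "hours": range(20, 24)},
]
delivery_log = []


def choose_banner(region, hour):
    """Applies the policies; among eligible banners the last one listed wins."""
    eligible = [b for b in BANNERS if region in b["regions"] and hour in b["hours"]]
    return eligible[-1] if eligible else None


def serve_banner(affiliate_id, region, now):
    tracking_id = uuid.uuid4().hex                 # generate a tracking id
    banner = choose_banner(region, now.hour)       # apply policies
    if banner is None:
        return None
    # Customize the banner for this affiliate and request.
    creative = f"{banner['id']}?aff={affiliate_id}&tid={tracking_id}"
    # Record that the banner ad was delivered.
    delivery_log.append({
        "tracking_id": tracking_id,
        "banner": banner["id"],
        "affiliate": affiliate_id,
        "at": now.isoformat(),
    })
    return creative


evening = datetime(2012, 6, 1, 21, 0, tzinfo=timezone.utc)
result = serve_banner("aff-7", "UK", evening)
no_match = serve_banner("aff-7", "FR", evening)
```

Every request both reads policy data and writes a delivery record, which is why the latency of the data tier dominates the response time.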

Online Betting (Contd.)

Their initial RDBMS-based system:

Limited their ability to sign up new affiliates
Limited their ability to add new products on their site
Limited the delivery performance experienced by their affiliates and their customers
Limited their ability to add additional internal applications and policies to the process

Their new Fast Data-based system:

Responded with sub-millisecond latency
Met their target of 2,500 banner ad deliveries per second
Provided for future scalability
Improved performance to the browser by 4x
Cost less

Asset/Position Monitoring

Centralized data storage was not possible

Multi-agency, multi-force integration
Numerous applications needed access to multiple data sources simultaneously
Networks constantly changing, unreliable, mobile deployments
Upwards of 60,000 object updates each minute
Over 70 data feeds

Needed a real-time situational awareness system to track assets, usable by warfighters in theatre

Northrop Grumman (integrator) investigated the following technologies before deciding on GemFire:
• RDBMS – Oracle, Sybase, Postgres, TimesTen, MySQL
• ODBMS – Objectivity
• JCache – GemFire, Oracle Coherence
• JMS – SonicMQ, BEA WebLogic, IBM, JBoss
• TIBCO Rendezvous
• Web Services

Asset/Position Monitoring

655 sites, 11 thousand users

Real-time, 3-dimensional NASA World Wind user interface
60,000 position updates per minute
Real-time info available on the desk of:
• President of the United States
• US Secretary of Defense
• Each of the Joint Chiefs of Staff
• Every commander in the US Military

Global Foreign Exchange Trading System

The project achieved:


Low-latency trade insertion
Permanent Archival of every trade
Kept pace with fast-ticking market data
Rapid, Event-Based Position Calculation
Distribution of Position Updates Globally
Consistent Global Views of Positions
Pass the Book
Regional Close-of-day
High Availability
Disaster Recovery
Regional Autonomy

Global Foreign Exchange Trading System

In that same application, Fast Data replaced:


Sybase Database In Every Region
• Still need 1 instance for archival purposes
TIBCO Rendezvous for Local Area Messaging
IBM MQ Series for WAN Distribution
Veritas N+1 Clustering for H/A
• In fact, we save the physical +1 node itself
3DNS or Wide IP

Admin personnel reduced from 1.5 to 0.5

Agenda

1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data


Application High Level Overview
APPLICATION(S)

A single DB can't handle both OLTP and OLAP workloads
How to get the best of Fast & Big Data

[Figure: APPLICATION(S), Fast Data Setup and Big Data Setup, with arrows labeled "Concurrent hits" and "In case record isn't available"]
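
One plausible reading of this diagram is a read-through arrangement: the application sends its concurrent hits to the in-memory Fast Data tier, and a record that isn't available there is loaded from the Big Data store. The sketch below assumes that reading; `FastDataCache` and `BigDataStore` are hypothetical names, not product classes.

```python
class BigDataStore:
    """Stands in for the larger, slower store (for example a data warehouse)."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def load(self, key):
        self.reads += 1
        return self.rows.get(key)


class FastDataCache:
    """In-memory tier that serves hits and loads misses from the big data store."""

    def __init__(self, backing_store):
        self.memory = {}
        self.backing_store = backing_store

    def get(self, key):
        if key in self.memory:
            return self.memory[key]            # served from memory
        value = self.backing_store.load(key)   # record isn't available in memory
        if value is not None:
            self.memory[key] = value           # keep it for later hits
        return value


warehouse = BigDataStore({"cust:1": {"name": "Ada"}})
cache = FastDataCache(warehouse)
first = cache.get("cust:1")      # miss: loaded from the big data store
second = cache.get("cust:1")     # hit: served from memory
missing = cache.get("cust:999")  # not in either tier
```

Repeated reads of the same record reach the big data store only once, which keeps the analytical store free of the application's OLTP-style traffic.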
