Kunal Joshi
joshik@vmware.com
2
EMC PROVEN PROFESSIONAL
Copyright © 2011
2012 EMC Corporation. All Rights Reserved.
Other Companies using Big Data
Moore’s Law
Gordon Moore (co-founder of Intel)
The number of transistors that can be placed in a processor DOUBLES approximately every TWO years.
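As a rough worked example of the doubling rule, the sketch below projects a transistor count forward under an idealized doubling period of exactly two years. The function name and the starting count are illustrative assumptions, not figures from the slide.

```python
def projected_transistors(initial_count, years, doubling_period=2.0):
    """Idealized Moore's Law projection: the count doubles every `doubling_period` years."""
    return initial_count * 2 ** (years / doubling_period)

# Hypothetical starting point: a chip with 1,000,000 transistors.
start = 1_000_000
for years in (2, 10, 20):
    print(years, "years:", round(projected_transistors(start, years)))
```

Twenty years at this rate is ten doublings, a factor of 1024.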
Your Thoughts?
Key Characteristics of Big Data
1. Data Volume
44x increase from 2009 to 2020
(0.8 zettabytes to 35.2 ZB; about 1.2 ZB in 2010)
2. Processing Complexity
Changing data structures
Use cases warranting additional transformations and
analytical techniques
3. Data Structure
Greater variety of data structures to mine and analyze
Module 1: Introduction to BDA
Big Data Characteristics: Data Structures
Data Growth is Increasingly Unstructured
Four Main Types of Data Structures
• Structured Data
• Semi-Structured Data (example: a web page's View Source)
• Quasi-Structured Data (example: a search URL)
http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
• Unstructured Data (example: The Red Wheelbarrow, by William Carlos Williams)
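To illustrate why a search URL like the one above still carries recoverable structure, the sketch below pulls key/value pairs out of it with Python's standard library. The URL is shortened here for brevity, and reading `pq` as "previous query" is an assumption, not something the slide states.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened version of the search URL shown above (shortening is an assumption for brevity).
url = "http://www.google.com/#hl=en&q=data+scientist&pq=big+data&oq=data+sci"

# The key=value pairs sit in the URL fragment (after '#'), not in the query string.
fragment = urlsplit(url).fragment
fields = parse_qs(fragment)

current_query = fields["q"][0]    # what the user searched for
previous_query = fields["pq"][0]  # assumed to be the previous query
print(previous_query, "->", current_query)
```

The data has no fixed schema, but with some effort and tooling it can be turned into structured fields.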
Business Drivers for Big Data Analytics
Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven
Driver: Examples
1. Desire to optimize business operations: Sales, pricing, profitability, efficiency
2. Desire to identify business risk: Customer churn, fraud, default
3. Predict new business opportunities: Upsell, cross-sell, best new customer prospects
4. Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending, Basel II
Challenges with a Traditional Data Warehouse
[Figure: traditional data warehouse architecture with four numbered problem areas. Labels: data sources; departmental warehouse; enterprise applications; reporting; “spread marts”; prioritized operational processes; siloed analytics; non-agile models; static schemas accrete over time.]
Implications of a Traditional Data Warehouse
Opportunities for a New Approach to Analytics
New Applications Driving Data Volume
Considerations for Big Data Analytics
Criteria for Big Data Projects
1. Speed of decision making
2. Throughput
3. Analysis flexibility
New Analytic Architecture: Analytic Sandbox
• Data assets gathered from multiple sources and technologies for analysis
• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”
Agenda
1. Health Care
• Reducing Cost of Care
2. Public Services
• Preventing Pandemics
3. Life Sciences
• Genomic Mapping
4. IT Infrastructure
• Unstructured Data Analysis
[Figure: data collectors. Labels: Medical; Government; Internet; Phone/TV; Retail.]
1. Big Data Analytics: Healthcare
Use of Big Data
• Dr. Jeffrey Brenner generated his own crime maps from medical billing records of 3 hospitals
Key Outcomes
• City hospitals & ERs provided expensive but low-quality care
• Reduced hospital costs by 56% by realizing that 80% of the city's medical costs came from 13% of its residents, mainly low-income or elderly
• Now offers preventative care over the phone or through home visits
2. Big Data Analytics: Public Services
3. Big Data Analytics: Life Sciences
Situation
• Broad Institute (MIT & Harvard) mapping the Human Genome
4. Big Data Analytics: IT Infrastructure
5. Big Data Analytics: Online Services
Agenda
Drive Collaboration
DATA SCIENCE TEAM
Greenplum gNet
[Figure: structured-to-unstructured data spectrum. Labels: schema on load; no ETL; directories; flat files; SequenceFile; XML, JSON, …; MapReduce; Java; Hive; Pig.]
Greenplum Database
STRUCTURED UNSTRUCTURED
SQL BI Tools
Partitioning
What Do We Mean by Hadoop?
• A framework for handling big data
• An implementation of the MapReduce paradigm
• Hadoop glues the storage and analytics together and provides reliability, scalability, and management
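The MapReduce paradigm mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps using word counting; it is not Hadoop's API, and the function names are assumptions chosen for readability.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["big data is big", "fast data is fast"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the framework handles the shuffle, scheduling and failure recovery.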
Module 5: Advanced Analytics - Technology and Tools
Hadoop Distributed File System
MapReduce and HDFS
[Figure: a client/developer submits a job to the Job Tracker, which hands map jobs and reduce jobs to the Task Trackers.]
Components of Hadoop
• Pig: data flow language & execution environment
• Hive: SQL-based language
• As you move from Pig to Hive to HBase, you are increasingly moving away from the mechanics of Hadoop and toward a DBMS view
[Figure: scale running from "mechanics of Hadoop more visible" to "DBMS view".]
Greenplum Database
Extreme Performance for Analytics
Greenplum DB & HD
Massively Parallel Access and Movement
• Maximize solution flexibility
• Minimize data duplication
• Access Hadoop data in real time from Greenplum DB
• Import and export in text, binary and compressed formats
• Custom formats via user-written MapReduce Java program and GPDB Format classes
[Figure: Greenplum DB master host and segments 1 to 3 connected over gNet (10Gb Ethernet) to Hadoop nodes 1 to 3. Labels: external tables (text, binary, user-defined); MapReduce.]
[Figure labels: interconnect switch; interconnect; independent segment processors; independent memory; independent direct storage connection; storage; analytical software; math & statistical functions; analytic results.]
Agenda
[Figure: business value (low to high) plotted against time (past to future).]
Business Intelligence (lower value, looking at the past)
• Standard reporting
• What happened?
Data Science (higher value, looking to the future)
• Predictive analysis
• What if…..?
Data science and business intelligence
“TRADITIONAL BI”
Repetitive
“BIG DATA ANALYTICS”
10s of TB to PBs
Profile of a Data Scientist
Data Science as a Process
[Figure: process stages. Labels: Domain; Data Prep; Variable Selection; Model Building; Model Execution; Communication & Operationalization; Evaluate.]
• People
– Scientists / Analysts
– Business Analysts
– Consumers of analysis
– Stakeholders
– EMC sales and services
• Ecosystem
– Sector (Telecom, banking, security agency etc.)
– Modeling software and other tools used by analysts (MADlib, SAS, R etc.)
– Database (Greenplum) & Data Sources
• Product Recommendation
• Others……
Domain
• What are the data sources?
• Do we have access to them?
• How big are they?
• How often are they updated?
• How far back do they go?
• Which of these data sources are being used for analysis? Can we use a data source which is currently unused? What problems would that help us solve?
Hypertargeting
• Serving content to customers based on individual
characteristics and preferences, rather than broad
generalizations
Available data
• Data available from server:
Date/time
IP address
Parts of site visited
Approach
• Use clustering to identify trip/visitor types
Sport (IPL, F1, football, etc.)
Festivals
Other seasonal movements
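A minimal sketch of the clustering step, assuming each visitor has already been reduced to two numeric features. The features (trips per year, share of trips on event days), the data and the use of plain k-means are all illustrative assumptions; the slides do not say which algorithm or features were used.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means. Uses the first k points as initial centroids.
    Returns (centroids, labels)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels

# Hypothetical visitors: (trips per year, share of trips on event days).
visitors = [(1, 0.9), (2, 0.8), (1, 1.0), (20, 0.1), (22, 0.0), (19, 0.2)]
centroids, labels = kmeans(visitors, k=2)
print(labels)
```

Here the first group looks like occasional event-driven travellers and the second like frequent travellers. In practice the features would be scaled before clustering so that one dimension does not dominate the distance.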
Use Case 2 - Municipal Traffic Analysis
• Client domain: Municipal city government
• Available data:
Cross-city loop detectors measuring traffic volume
Detailed city bus movement information from Bluetooth devices
Video detection of traffic volume, velocity
Data loading and manipulation
• Parallel data loading
– Data loaded from local file system and distributed across Greenplum
servers in parallel.
– Loading 9 months of traffic volume data (16 GB, 464 million rows) in 69.4
seconds.
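The load figures quoted above translate into a throughput as follows. This is simple arithmetic on the numbers from the slide; the only assumption is that 1 GB is taken as 1000 MB.

```python
rows = 464_000_000   # rows loaded (from the slide)
size_gb = 16         # data volume in GB (from the slide)
seconds = 69.4       # load time in seconds (from the slide)

rows_per_second = rows / seconds
mb_per_second = size_gb * 1000 / seconds   # assuming 1 GB = 1000 MB

print(f"{rows_per_second / 1e6:.1f} million rows/s")
print(f"{mb_per_second:.0f} MB/s")
```

That works out to roughly 6.7 million rows and about 230 MB per second across the cluster.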
Basic reporting on traffic volume
• Easy generation of reports via straightforward user-defined functions
• Standard graphing utilities called from within Greenplum to create figures
• Detector downtimes can be clearly spotted in the figure, or via an SQL query, thus
mitigating maintenance challenges
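The downtime check described above amounts to looking for gaps between consecutive readings. The slide does this with an SQL query inside Greenplum; the sketch below shows the same idea in plain Python instead. The function name, the 5-minute reporting interval and the sample readings are hypothetical.

```python
from datetime import datetime, timedelta

def find_downtimes(timestamps, expected_interval, tolerance=2):
    """Return (start, end) pairs where consecutive readings are further apart
    than `tolerance` times the expected reporting interval."""
    ordered = sorted(timestamps)
    threshold = expected_interval * tolerance
    return [
        (prev, curr)
        for prev, curr in zip(ordered, ordered[1:])
        if curr - prev > threshold
    ]

# Hypothetical readings from one loop detector, nominally every 5 minutes.
t0 = datetime(2012, 1, 1, 8, 0)
readings = [t0 + timedelta(minutes=m) for m in (0, 5, 10, 45, 50)]
gaps = find_downtimes(readings, timedelta(minutes=5))
print(gaps)
```

The single gap reported here runs from 08:10 to 08:45, the stretch in which the detector sent nothing.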
Basic reporting on city buses
• Data from Bluetooth devices has a wealth of information on city buses that we can report on:
– Travel route of each bus
– Deviations of arrival times compared to the provided timetable
– Occurrences of driver errors (e.g. taking a wrong turn) and possible causes
– Occurrences where buses on the same service arrive at the same stop within seconds of each other
– Whether new bus services translate into lower traffic volume on the roads where they are introduced
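One of the reports listed above, buses of the same service reaching a stop within seconds of each other, can be sketched as a grouping and a scan over sorted arrival times. The record layout, the 60-second threshold and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

def find_bunching(arrivals, max_gap_seconds=60):
    """arrivals: iterable of (service, stop, bus_id, arrival_time_in_seconds).
    Returns (service, stop, first_bus, second_bus) for consecutive arrivals of
    different buses on the same service at the same stop within max_gap_seconds."""
    by_service_stop = defaultdict(list)
    for service, stop, bus_id, t in arrivals:
        by_service_stop[(service, stop)].append((t, bus_id))

    bunched = []
    for (service, stop), events in by_service_stop.items():
        events.sort()
        for (t1, bus1), (t2, bus2) in zip(events, events[1:]):
            if bus1 != bus2 and t2 - t1 <= max_gap_seconds:
                bunched.append((service, stop, bus1, bus2))
    return bunched

# Hypothetical arrival records.
arrivals = [
    ("66", "Main St", "bus-A", 1000),
    ("66", "Main St", "bus-B", 1030),
    ("66", "Main St", "bus-C", 1900),
    ("12", "Main St", "bus-D", 1010),
]
bunched = find_bunching(arrivals)
print(bunched)
```

Only the two buses on service 66 that arrive 30 seconds apart are flagged; the bus on service 12 is ignored because it belongs to a different service.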
Result visualizations (Google Earth)
Applications for traffic network modelling
• Compute the fastest path between any two locations at a
future time point
• Identify potential bottlenecks in the traffic
• Identify phase transition points for massive traffic congestion
using simulation techniques
• Study the likely impact of new roads and traffic policies,
without having to observe real disruptive events to
determine the impact
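For the first application above, a fastest-path query, a minimal sketch using Dijkstra's algorithm is shown below. It assumes the edge weights are travel times in minutes already predicted for the time point of interest; the graph, node names and weights are hypothetical, and the slides do not state which routing algorithm was used.

```python
import heapq

def fastest_path(graph, source, target):
    """Dijkstra's algorithm. graph: {node: [(neighbour, travel_minutes), ...]}.
    Returns (total_minutes, [nodes on the path]) or (inf, []) if unreachable."""
    queue = [(0, source, [source])]
    best = {source: 0}
    while queue:
        minutes, node, path = heapq.heappop(queue)
        if node == target:
            return minutes, path
        if minutes > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, cost in graph.get(node, []):
            candidate = minutes + cost
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour, path + [neighbour]))
    return float("inf"), []

# Hypothetical road network with predicted travel times in minutes.
roads = {
    "A": [("B", 7), ("C", 3)],
    "C": [("B", 2), ("D", 8)],
    "B": [("D", 4)],
}
print(fastest_path(roads, "A", "D"))
```

The route via C and B takes 9 minutes, beating both the direct A to B leg and the C to D leg.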
Parallel traffic network modelling
• Greenplum's parallel architecture permits traffic network analysis on a city scale
• Travel time can be predicted via model learning, involving hundreds of
thousands of optimizations in parallel, across the entire traffic network
• Variables that can be considered include
Distance between two locations
Concurrent traffic volume
Time of day
Weather
Construction work
Use Case 3 - Product Recommendation Analysis
• Eight banks became one
Branches across the US
Customer Segmentation
Customer segments
– First, define a measurement of customer value
– Then create clusters of customers based on customer value, and then product profiles.
Product X
Product A
Product Y
Product B
Product Z
Product Recommender
Fast Data vs. Big Data
Not a fast OLTP DB!
APPLICATION(S)
Fast Data is
• More than just an OLTP DB
• Super Fast access to Data
• Server side flexibility
• Data is highly available (HA)
• Supports transactions
• Setup is fault tolerant
• Can handle thousands of concurrent hits
• Distributed hence horizontally scalable
• Runs on cheap x86 hardware
CAP Theorem
A distributed system can guarantee at most TWO of the three qualities of Consistency, Availability and Partition Tolerance at the same time
Fast Data takes select features from all of these products and combines them into a low-latency, linearly scalable, memory-based data fabric
Load Balancer
Web Tier, Application Tier, Database Tier
• Add/remove web/application/data servers
Storage Tier
• Add/remove storage
• Disks may be direct or network attached
Optional reliable, asynchronous feed to a Big Data Store
One can optionally write updates to disk / data warehouse / big data store
asynchronously and reliably.
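The asynchronous write path described above is often called write-behind. Below is a minimal single-process sketch, assuming a dict stands in for the disk, data warehouse or big data store. The class and method names are hypothetical and this is not the product's API; the reliability guarantees (retries, persistence of the queue itself) are not modelled.

```python
import queue
import threading

class WriteBehindCache:
    """In-memory store that acknowledges writes immediately and persists them
    to a slower backing store from a background thread."""

    def __init__(self, backing_store):
        self._memory = {}
        self._backing_store = backing_store   # any dict-like slow store
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, value):
        self._memory[key] = value          # fast path: memory only
        self._pending.put((key, value))    # persistence happens later

    def get(self, key):
        return self._memory[key]

    def flush(self):
        """Block until every queued update has reached the backing store."""
        self._pending.join()

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._backing_store[key] = value
            self._pending.task_done()

warehouse = {}                      # stand-in for a disk or big data store
cache = WriteBehindCache(warehouse)
cache.put("order:1", {"amount": 42})
cache.flush()
print(warehouse)
```

The caller never waits on the slow store during `put`; `flush` exists only so the example can show that the update does arrive.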
Distribute
Fast Data can keep clusters that are distributed around the world synchronized in real-time and can operate reliably in Disconnected, Intermittent and Low-Bandwidth network environments.
Notify
Scatter-Gather (Map-Reduce) Queries
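A scatter-gather query sends the same request to every partition in parallel and then merges the partial answers. The sketch below imitates that with threads in one process; the partition contents and function names are hypothetical and do not reflect any product API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of one logical data set, one per node.
partitions = [
    [("EUR", 120), ("USD", 80)],
    [("USD", 50), ("GBP", 30)],
    [("EUR", 10)],
]

def local_query(partition, currency):
    """Runs on each node against its local data only."""
    return sum(amount for ccy, amount in partition if ccy == currency)

def scatter_gather(currency):
    # Scatter: send the same query to every partition in parallel.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: local_query(p, currency), partitions))
    # Gather: combine the partial results into one answer.
    return sum(partials)

print(scatter_gather("USD"))
```

Each node touches only its own data, so the work scales with the number of partitions while the merge step stays small.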
Fast Data provides 'data aware function routing': moving the behavior to the correct data instead of moving the data to the behavior.
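The routing idea can be sketched as follows: a key is hashed to find the node that owns it, and the function is executed on that node so only the result travels back. The `Node` and `Cluster` classes and the CRC-based placement are illustrative assumptions, not the product's actual partitioning scheme.

```python
import zlib

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def execute(self, func, key):
        """Run `func` next to the data; only the small result travels back."""
        return func(self.data[key])

class Cluster:
    def __init__(self, node_names):
        self.nodes = [Node(n) for n in node_names]

    def owner(self, key):
        # Stable hash so the same key always maps to the same node.
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key, value):
        self.owner(key).data[key] = value

    def run_on_owner(self, func, key):
        return self.owner(key).execute(func, key)

cluster = Cluster(["node-1", "node-2", "node-3"])
cluster.put("portfolio:7", [100, 250, 75])
total = cluster.run_on_owner(sum, "portfolio:7")
print(total)
```

The list of positions never leaves its node; the caller receives just the total.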
Product SKU
• Stores Objects (Java, C++, C#, .NET) or unstructured data
SQLFire
• Stores Relational Data with SQL interface
• Supports JDBC, ODBC, Java and .NET interfaces
• Uses existing relational tools
Mainframe Migration
[Figure: time axis in minutes (0, 60, 120); label: Mainframe.]
93% network wait! Time could have been reduced further with higher network bandwidth
Online Betting
A popular online gambling site attracts new players through ads on affiliate sites
[Figure: numbered request flow (steps 1 to 4) involving a Banner Ad Server.]
Online Betting (Contd.)
Asset/Position Monitoring
Needed a real-time situational awareness system to track assets that could be used by the war fighters in theatre
Northrop Grumman (integrator) investigated the following technologies before deciding on GemFire
• RDBMS – Oracle, Sybase, Postgres, TimesTen, MySQL
• ODBMS – Objectivity
• JCache – GemFire, Oracle Coherence
• JMS – SonicMQ, BEA WebLogic, IBM, JBoss
• TIBCO Rendezvous
• Web Services
Global Foreign Exchange Trading System
Agenda
How to get the best of Fast & Big Data
Concurrent hits