
BIG DATA

Philippe Julio Big Data Consulting Practice Manager

AGENDA

BIG DATA
Who is KEYRUS ?
Big Data & Analytics, What is it ?
Positioning
Software & Tools
Technical Architecture
Value Proposition

Big Data

KEYRUS
A UNIQUE VALUE PROPOSITION
A GROUP STRONG AND AGILE
SPECIALIST IN ORGANIZATIONS PERFORMANCE
OUR VALUES FOR THE BENEFIT OF OUR CUSTOMERS
AN INTERNATIONAL DIMENSION

153m
2012 Revenues

350
Large accounts* & LME

3800
SME customers

1650
Employees

12
countries
on 4 continents

*including 80 Global Fortune 500

The infrastructures and processes (quality, HR, ...) of a large professional services group
Simple and formalized governance to maintain agility at all times
A customer-focused decision center
Listed on NYSE-Euronext Paris

An ability to act on performance management strategy, systems and organizations
Different Business Units to serve different types of clients (large corporations, mid-market, and SMEs)
Functional, Industry and Technology skills

Entrepreneurship
Customer proximity
Building our brand on quality of service
A culture of innovation that defines how we operate and is also part of our value proposition
Diversity as a key component of our HR policy

Expertise in deploying international projects
Nearshore & offshore capacities

Revenue by Sector
Keyrus - All rights reserved

Industries: 31%
Banking - Insurance: 19%
Telecom: 8%
Services - Distribution: 16%
Public Services: 14%
Utilities: 12%

Belgium, Brazil, Canada, China, Spain, France, Mauritius, Israel, Luxembourg, Switzerland, Tunisia, USA

BIG DATA & ANALYTICS, WHAT IS IT ?

5 Billion
# of cell phone users worldwide in 2010

10x
Growth in digital data every 5 years

2 Billion
# of Internet users worldwide in 2010

30 Billion
Pieces of content shared on Facebook every month

35 ZB
By 2020, the Digital Universe will be 44 times as big as it was in 2009

BIG DATA
LARGE HADRON COLLIDER OF CERN (SWITZERLAND)

LARGE HADRON COLLIDER (LHC) of CERN



15PB of Data /Year !!


BIG DATA ?
NOT ONLY DATA VOLUME

Volume
Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. TB, Records, Transactions, Tables, Files

Velocity
Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business. Batch, Near time, Real time, Streams

Value
Innovate new business models
Replace/Support human decision
Custom actions
Discover needs
Improve performance
Create transparency

Variety
Big data extends beyond structured data, including semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more.
Multi-structured: Unstructured, Semi-Structured, Structured

WHAT DOES ANALYTICS MEAN ?
FOR EXAMPLE, IN SALES MANAGEMENT

Analysis
Quarterly sales reporting Sales growth plan

Simulation & Forecast


Run alternative sales scenarios to identify the best product mix for next quarter Run simulations to determine the ideal number of sales professionals to assign to a particular new territory

What will happen?

How can it be done better?

Future

What happened and when? Facts

How and why it happened? Interpretation

Strategy

Forecast vs. results analysis Predictable patterns Decision-making

Past


ANALYZING DATA
SKILL OF THE FUTURE

McKinsey: by 2018, the United States alone could face a shortage of:
140,000-190,000 people with deep analytical skills
1.5M managers/analysts with the know-how to use the analysis of big data to make effective decisions
www.mckinsey.com/mgi/publications/big_data/


Stacy Collett Computerworld , August 23, 2010



DATA SCIENTIST, STATISTICIAN


WHAT IS THE DIFFERENCE

Data Scientist
Working on global data
Modeling complex business problems
Using Big Data software packages (Mahout, Lucene)
Discovering business insights
Identifying opportunities
Skills for coding, integrating and preparing large, varied data sets
Advanced analytics and modeling skills to reveal and understand hidden relationships
Business knowledge and communication skills to present results

Statistician
Working on data sampling
From data sampling to global data by projection
Using statistical software packages (SAS, SPSS)
Skills for probability, regression and modeling
Practical experience in data cleansing, simulation and data visualization
Skills for data interpretation, analysis, categorization, correlation, explanation
Communication skills to present results


BIG DATA POSITIONING

HYPE CYCLE 2012


GARTNER ANALYSIS


STRATEGIC TECHNOLOGY FROM GARTNER


TRENDS 2013

Strategic Big Data


Big Data is moving from a focus on individual projects to an influence on enterprises' strategic information architecture

Actionable Analytics
Provides simulation, prediction, optimization and other analytics, to empower even more decision flexibility at the time and place of every business process action

In Memory Computing
The execution of certain types of hours-long batch processes can be squeezed into minutes or even seconds

Integrated Ecosystems
Packaging of software and services to address infrastructure or application workload


BIG DATA STATISTICS


REPORT
Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity." US Bureau of Labor Statistics | McKinsey Global Institute Analysis

[Chart: Amount of Stored Data By Sector, in petabytes, 2009. Bar values: 966, 848, 715, 619, 434, 364, 269, 227. Vertical axis: Petabytes, 0 to 1000.]

35 ZB (1 ZB = 10 ** 21 bytes) -> a stack of 50 GB Blu-ray discs reaching from the Earth to the Moon, twice
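The Blu-ray comparison can be checked with a few lines of arithmetic. The sketch below is only a sanity check: the disc thickness (1.2 mm) and the mean Earth-Moon distance (384,400 km) are assumptions that do not appear on the slide, and the function name is illustrative.

```python
ZETTABYTE = 10 ** 21            # bytes per zettabyte, as stated on the slide
DISC_CAPACITY = 50 * 10 ** 9    # one 50 GB disc, in bytes
DISC_THICKNESS_M = 1.2e-3       # assumption: 1.2 mm per disc
EARTH_MOON_M = 384_400e3        # assumption: mean Earth-Moon distance in metres


def stack_in_moon_distances(zettabytes: float) -> float:
    """Height of the disc stack, expressed in Earth-Moon distances."""
    discs = zettabytes * ZETTABYTE / DISC_CAPACITY
    height_m = discs * DISC_THICKNESS_M
    return height_m / EARTH_MOON_M


trips = stack_in_moon_distances(35)
print(f"{trips:.1f} Earth-Moon distances")  # prints "2.2 Earth-Moon distances"
```

Under these assumptions the stack is roughly 840,000 km high, which is consistent with the "x2" on the slide.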


BIG DATA BUSINESS DRIVERS


ON MAJOR INDUSTRIES

Bank/Insurance: risk management (Basel III), customer qualification, fraud management
Telecommunications: a more reliable network where failures can be predicted and prevented, customer attrition
Media: more content that is lined up with your personal preferences
Marketing: e-reputation, trend analysis on web sites
Healthcare: prevention systems, epidemiological surveillance
Life Science: better targeted medicines with fewer complications and side effects
Retail: a personal experience with products and offers that are just what you need
Government: government services that are based on hard data, not just gut feeling
IT: support optimization, electric consumption analysis
Gaming: determining the future direction of the games


BIG DATA DOMAINS


A LARGE ACTIVITY

Digital marketing optimization (e.g., web analytics, attribution, golden path analysis)
Data exploration and discovery (e.g., data scientists, identifying new data-driven products, new markets)
Fraud detection and prevention (e.g., revenue protection, site integrity, credit card protection, suspect transactions, fight against money laundering)
Machine-generated data analytics (e.g., remote device insight, remote sensing, location-based intelligence)
Social network and relationship analysis (e.g., influencer marketing, crowdsourcing, attrition prediction)
Data retention (e.g., long-term conservation of data, data archiving)

Source: Teradata


NEW DATA & MANAGEMENT ECONOMICS


TRENDS

Compute Trend
New Analytics
(Massively Parallel Processing, MapReduce, Algorithms)

Storage Trend
New Data Structure
(Distributed File Systems, NoSQL, NewSQL)

Elastic Data Warehouse


Master/Slave

Enterprise data warehouse General purpose data warehouse Proprietary and dedicated data warehouse OLTP is the data warehouse

Object Storage

Multi-Structured Data

Master/Master

Distributed FS

Federated/ Sharded

Master Data Management Data Quality



BIG DATA SOFTWARE & TOOLS

BIG DATA IS MOSTLY OPEN SOURCE SOFTWARE


OPEN SOURCE NOT ONLY FREE


Shared source code
Publicly available and free
Support subscription not free
No software vendor lock-in
For the use and benefit of all without favour

Open Source software

Commercial software


DATA WAREHOUSE
GARTNER ANALYSIS

Data Warehouse appliances


EMC Greenplum
Parallel Data Warehouse (Microsoft)
IBM Netezza
Oracle Exadata
SAP HANA
ParAccel Analytic Database
Teradata
HP Vertica

Massively Parallel Processing Hadoop Connectivity



Column-Oriented database
Source Gartner January 2013

In-Memory database


DATA MANAGEMENT
GARTNER ANALYSIS

Data Integration
Source: Gartner, October 2012
Data acquisition
Consolidation
Data migrations/conversions
Synchronization of data between operational applications
Interenterprise data sharing
Delivery of data services in an SOA context

Data Quality
Source: Gartner, October 2012
Profiling
Parsing and standardization
Data cleansing
Matching
Monitoring
Enrichment

Master Data
Source: Gartner, October 2012
Identify, link and synchronize the information across heterogeneous data sources
Create and manage a central database of record or index
Support master data and governance requirements through workflow

2011 position (in orange) to 2012 position (in red)


BIG DATA QUALITY


BUSINESS AND IT IMPACTS
Business IT

Governance


Accessibility

External data access
Open data access
Easy data collection
Wrong figures
Visualization not clear for decision-making
Incorrect data, duplicates
Decision-making impact
Data update
All data in the context
Global data
Data understanding
Data life cycle
From sources to users
Data loss
Data intrusion
Data access rights

Business consistency

Technical consistency

Freshness

Completeness

Explicable

Traceability

Security


BUSINESS INTELLIGENCE
GARTNER ANALYSIS

Predictive analysis
Advanced visualization
Geospatial analysis
Cloud analytics platform
Innovation
Acquisitions in recent years:
IBM > Cognos, Algorithmics
SAP > BusinessObjects
Oracle > Hyperion, Siebel, Endeca

Source Gartner - February 2012


HADOOP OVERVIEW
OPEN SOURCE FRAMEWORK

What is Hadoop ?
Top-level Apache Foundation project
Large, active user base, mailing lists, user groups
Very active community, strong development team

Why Hadoop ?
Searching
Log Processing
Data Analytics
Video and Image Analysis
Data Retention

Open source software: a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware

HADOOP PROVIDERS
FORRESTER ANALYSIS
Amazon is the most prominent Hadoop cloud service provider
IBM has the deepest Hadoop platform and application portfolio
EMC Greenplum is the first mover in Hadoop appliances
MapR has a strong OEM business for its Hadoop distribution
Cloudera is the Hadoop pure play with the greatest adoption
Hortonworks provides professional services to the Hadoop ecosystem
Pentaho executes Hadoop MapReduce models and Pig scripts for data integration and analytics products
DataStax embeds Cassandra for real-time Hadoop applications
Datameer provides a user-friendly Hadoop modeling tool
Platform Computing brings proven cluster management tools to Hadoop
Zettaset specializes in Hadoop cluster management tools
Outerthought focuses on Hadoop search applications
HStreaming provides complex event processing middleware for Hadoop
Source: Forrester Research, Inc., February 2012


CLOUDERA
HADOOP DISTRIBUTION - CDH


Hadoop is a framework based on a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware.

HDFS / MapReduce: Hadoop Distributed File System for storage and Hadoop MapReduce for compute. High availability and scalability. Open source software.

Hive: data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Built on top of Hadoop, it provides tools to enable easy data extract/transform/load, a mechanism to impose structure on a variety of data formats, access to files stored either directly in HDFS or in other data storage systems such as HBase, and query execution via MapReduce.

Pig: a high-level data-flow language and execution framework for parallel computation. It makes MapReduce programs simple to write, abstracts you from specific detail, and keeps the focus on data processing, data flow and data manipulation, for enhancing extract, transform and load of data into HDFS or from HDFS into any target system. Open source software.

Sqoop: a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Hadoop Framework (CDH4, June 2012)
Job Workflow: APACHE OOZIE
Data Processing Lib: DataFu for Pig
Data Mining Lib: APACHE MAHOUT
Build/Test: APACHE BIGTOP
Web Console: HUE
Interactive SQL: Impala
Metadata: APACHE HIVE MetaStore
Batch Processing Languages: APACHE PIG, APACHE HIVE
Data Integration: APACHE FLUME, APACHE SQOOP
Cloud Deployment: APACHE WHIRR
Hadoop Core Kernel: MapReduce, HDFS
Connectivity: ODBC/JDBC/FUSE/HTTPS
Fast Read/Write Access: APACHE HBASE
Coordination: APACHE ZOOKEEPER
Cloudera Manager Free Edition (Installation Wizard)


MAPREDUCE
MASSIVELY PARALLEL PROCESSING
MapReduce
MapReduce is the programming paradigm popularized by Google researchers
Open-source Hadoop implementation of MapReduce by Yahoo
Open source software framework for distributed computation
Parallel computation (Map) on each block (Split) of data in an HDFS file, outputting a stream of (Key, Value) pairs to the local file system
JobTracker schedules and manages jobs
TaskTracker executes individual map() and reduce() tasks on each cluster node
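To make the map and reduce steps concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use the Hadoop API: the function names (map_phase, shuffle, reduce_phase, run_job) are illustrative, and the in-memory list of strings merely stands in for HDFS splits.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(split: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (key, value) pair for every word in one split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def run_job(splits: List[str]) -> Dict[str, int]:
    # On a real cluster each split would be mapped on a different node.
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


splits = ["big data is big", "data is everywhere"]
counts = run_job(splits)
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In Hadoop the same three stages run in parallel across the cluster; the sketch only shows the data flow between them.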

Algorithms
Association Rule Learning Algorithms
Genetic Algorithms
Neural Network Algorithms
Statistical Algorithms (Pandas)
Machine Learning Algorithms (Mahout, Weka, scikit-learn)
Natural Language Processing Algorithms
Trading Algorithms
Clinical design Algorithms
Searching Algorithms (Lucene, Solr, Katta, ElasticSearch, OpenSearch Server)

Languages
PHP Erlang Python Ruby R Java


NOSQL DATABASES CATEGORIES


MAJOR CATEGORIES

NoSQL = Not only SQL
Popular name for a subset of structured storage software designed to deliver increased optimization for high-performance operations on large datasets
Basically available, scalable, eventually consistent
Easy to use
Tolerant of scale by way of horizontal distribution

Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable
Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB
Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)
Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)
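Horizontal distribution, as mentioned for NoSQL stores, can be illustrated with a toy key-value store that spreads keys over several nodes by hashing. This is a sketch under simplifying assumptions, not how any of the products listed here is implemented: the ShardedStore class and the node names are hypothetical, the "nodes" are in-memory dictionaries, and simple modulo hashing is used where production systems typically rely on consistent hashing and replication.

```python
import hashlib
from typing import Dict, List, Optional


class ShardedStore:
    """Toy key-value store that spreads keys over several nodes."""

    def __init__(self, node_names: List[str]) -> None:
        self._names = sorted(node_names)
        # Each "node" is just an in-memory dict in this sketch.
        self.nodes: Dict[str, Dict[str, str]] = {n: {} for n in self._names}

    def node_for(self, key: str) -> str:
        """Pick a node from a stable hash of the key (modulo hashing)."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self._names[int(digest, 16) % len(self._names)]

    def put(self, key: str, value: str) -> None:
        self.nodes[self.node_for(key)][key] = value

    def get(self, key: str) -> Optional[str]:
        return self.nodes[self.node_for(key)].get(key)


store = ShardedStore(["node-a", "node-b", "node-c"])
for i in range(9):
    store.put(f"user:{i}", f"profile-{i}")

print(store.get("user:4"))  # profile-4
print({name: len(data) for name, data in store.nodes.items()})
```

Adding a node to such a scheme raises capacity without a bigger server, which is the point of scaling out; with modulo hashing, however, most keys would move when the node count changes.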


BIG DATA TECHNICAL ARCHITECTURE

CLOUD FOR BIG DATA


CLOUD MODELS

Cloud Computing: Private, Public, Hybrid

Cloud models
SaaS: Applications (e.g., SalesForce.com, Facebook, Twitter, LinkedIn)
PaaS: Platform Tools & Services (Java, Ruby, Python, PHP, Erlang), e.g., Amazon Web Services, Microsoft Windows Azure, Google
IaaS: Operating Systems (Linux, Windows, Unix), Virtualization, Hardware (server, storage, network), e.g., Amazon Web Services, CloudWatt


INFRASTRUCTURE AS A SERVICE
IAAS MODEL

General Purpose
Combine server with storage & networking (Hyper-Scale Server)
Specialized software enables general-purpose system designs to provide high-performance data services

Data services move to the infrastructure


Application Legacy
Application Data Services Metadata Mgnt Data Services

Emerging
Application Data Services

Future
Application

Metadata Mgnt Storage Storage

Metadata Mgnt Storage

Infrastructure

BI ARCHITECTURE VS. BIG DATA ARCHITECTURE


ALIGNING ARCHITECTURE ON BUSINESS

BI & DWH Architecture - Traditional
SQL based
Commercial software: SAP BO, IBM Cognos, Oracle Hyperion
High availability
Enterprise database
Right design for structured data
Current storage hardware (SAN, NAS, DAS)
Diagram components: App Servers, Network Switches, Database Servers, SAN Switch, Storage Array

Analytics Architecture - New Generation
Not only SQL based
Hadoop, Cassandra
High scalability, availability and flexibility
Compute and storage in the same box for reducing the network latency
Right design for semi-structured and unstructured data
Diagram components: Edge Nodes, Network Switches, Data Nodes


HADOOP ARCHITECTURE
OVERVIEW

Network Switches


Edge Nodes: 2 x EdgeNode, 2 CPU 6 core, 96GB RAM, 6 x HDD 600GB 15K (Raid10), 2 x 10GbE Ports
Control Nodes: 2 x NameNode/BackupNode, 2 CPU 6 core, 96GB RAM, 6 x HDD 600GB 15K (Raid10), 2 x 10GbE Ports
Worker Nodes: 3 to n DataNode, 2 CPU 6 core, 48GB RAM, 12 x HDD 3TB 7.5K, 2 x 10GbE Ports
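The DataNode specification (12 x 3TB disks per node, 3 to n nodes) translates into usable HDFS capacity roughly as sketched below. Two figures are assumptions rather than slide content: a replication factor of 3, which is the HDFS default, and 25% of raw disk space reserved for intermediate and non-HDFS data. The function name is illustrative.

```python
def usable_capacity_tb(
    data_nodes: int,
    disks_per_node: int = 12,        # from the slide: 12 x HDD per DataNode
    disk_tb: float = 3.0,            # from the slide: 3TB per disk
    replication: int = 3,            # assumption: HDFS default replication
    reserved_fraction: float = 0.25  # assumption: space kept for temp data
) -> float:
    """Approximate usable HDFS capacity in TB for a given cluster size."""
    raw_tb = data_nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserved_fraction) / replication


minimum_cluster = usable_capacity_tb(3)
print(minimum_cluster)          # 27.0
print(usable_capacity_tb(10))   # 90.0
```

With these assumptions the minimum three-node cluster offers 108 TB of raw disk but only about 27 TB of usable storage, which is why worker nodes are added as data grows.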


ENTERPRISE DATA ARCHITECTURE


360 INSIGHT
ENGINEERS DATA ADMINISTRATOR SYSTEM OPERATORS DATA SCIENTISTS ANALYSTS BUSINESS USERS CUSTOMERS

Dev./Int. Meta Data/ ETL Tools Cloudera Manager

Modeling Tools

BI / Analytics

Enterprise Reporting

Web/Mobile Applications

Enterprise Data Warehouse Online Serving Systems


Logs

Files

Web Data

RDBMS


BIG DATA VALUE PROPOSITION

BIG DATA - TCO / ROI APPROACH


KEY QUESTIONS

Evaluate the investment opportunity


What can we expect from the investment ?
Is it worth investing in-house ?
How long to pay back the investment ?
What is the competitive advantage value ?
What is the risk if we don't start the project ?

Costs
Hardware & software products costs
Services & Support costs
Training & communication costs
Energy & professional costs

Benefits
Increase productivity
Increase margins and revenues
Reduce time to access relevant information
Reduce time to decision making
Enhance quality of information
Enhance user satisfaction

TCO = Costs
ROI = (Benefits - TCO) / TCO
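As a worked illustration of the two formulas, TCO = Costs and ROI = (Benefits - TCO) / TCO, the sketch below sums the four cost categories and derives the ROI. All amounts are made-up figures chosen only to show the arithmetic, and the function and key names are illustrative.

```python
from typing import Dict


def total_cost_of_ownership(costs: Dict[str, float]) -> float:
    """TCO is the sum of all cost categories."""
    return sum(costs.values())


def return_on_investment(benefits: float, tco: float) -> float:
    """ROI = (Benefits - TCO) / TCO."""
    return (benefits - tco) / tco


# Hypothetical figures, for illustration only.
costs = {
    "hardware_software": 400_000.0,
    "services_support": 250_000.0,
    "training_communication": 50_000.0,
    "energy_professional": 100_000.0,
}
benefits = 1_000_000.0

tco = total_cost_of_ownership(costs)
roi = return_on_investment(benefits, tco)
print(f"TCO = {tco:,.0f}, ROI = {roi:.0%}")  # TCO = 800,000, ROI = 25%
```

A positive ROI means the estimated benefits exceed the total cost of ownership; a value of 25% means each unit spent returns 1.25 units of benefit.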

BIG DATA VALUE PROPOSITION
Keyrus, leader in Business Intelligence (Consulting & Delivery)
Works closely with the big data leaders
Works with high level profiles: Statistician, Architect, BIDW Specialist, Consultant, Manager
Develops partnerships
Develops innovation
Uses open source software
No software vendor lock-in
Low TCO

Apache Hadoop framework


HDFS, MapReduce, Hive

Big data integration software


Informatica, Talend

Big data analytics & visualization software


SAS, SAP, QlikTech, Tableau Software

DWH appliances and big data connectivity


Vertica, Exadata, Greenplum, Netezza, Teradata, SAP HANA, MS Parallel Data Warehouse

QUESTIONS & ANSWERS


WHO, WHAT, WHEN, WHERE


THANK YOU
FOR YOUR ATTENTION
