
Hadoop meets Mature BI:

Where the rubber meets the road for Data Scientists


Michael Hiskey
Futurist + Product Evangelist
VP, Marketing & Business Development Kognitio
@Kognitio @mphnyc #MPP_R #OANYC

The Data Scientist


Sexiest job of the 21st Century?


Key Concept: Graduation


Projects will need to Graduate from the Data Science Lab and become part of Business as Usual

Demand for the Data Scientist


Organizational appetite is for tens of data scientists, not hundreds


Don't be a Railroad Stoker!


Railroad stokers were highly skilled engineers, but the world innovated around them.


Business Intelligence

Straddle IT and Business

Numbers: tables, charts, indicators
Time: history, lag
Access: to view (portal), to data, to depth, control/secure
Consumption: digestion

What the business wants: more granularity with ease and simplicity, a richer data model, faster response, lower latency, self-service


What has changed?


More connected-users?

More-connected users?

According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005.

In 2010 this was 1,200 exabytes.

Data flow

Data Variety


What?
New value comes from your existing data

Respondents were asked to choose up to two descriptions about how their organizations view big data from the choices above. Choices have been abbreviated, and selections have been normalized to equal 100%. n=1144
Source: IBM Institute for Business Value/Said Business School Survey


20th Century Fox


Hadoop ticks many, but not all, of the boxes




Null storage concerns


No need to triage
No need to pre-process
No need to align to a schema
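The "load everything, decide later" idea above is schema-on-read: records are stored untouched, and a schema is applied only when a question is asked. A minimal Python sketch (the field names and the query are hypothetical):

```python
import json

# Land raw events exactly as they arrive: no triage, no pre-processing,
# no write-time schema (schema-on-read).
raw_store = []

def ingest(line):
    raw_store.append(line)  # store the record untouched

def spend_by_user(store):
    # The schema is applied only now, at read time; records that don't
    # fit the current question are skipped, not rejected at load time.
    totals = {}
    for line in store:
        try:
            rec = json.loads(line)
            totals[rec["user"]] = totals.get(rec["user"], 0) + rec["spend"]
        except (ValueError, KeyError, TypeError):
            continue
    return totals

ingest('{"user": "a", "spend": 10}')
ingest('{"user": "a", "spend": 5}')
ingest('not json at all')  # still stored; triage deferred to read time
print(spend_by_user(raw_store))  # {'a': 15}
```

The same records could later be re-read against a different schema without reloading anything.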

The drive for deeper understanding


[Chart: analytic workloads plotted by Analytical Complexity against Technology/Automation: behaviour modelling, statistical analysis, clustering, machine-learning algorithms, dynamic simulation, dynamic interaction, fraud detection, reporting & BPM, campaign management]


Hadoop is just too slow for interactive BI!


While Hadoop shines as a processing platform, it is painfully slow as a query tool: the wait breaks the analyst's train of thought.

Analytics needs low latency, no I/O wait


High speed in-memory processing
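The point above can be sketched in miniature: pay the disk I/O cost once, keep the working set in RAM, and every subsequent interactive query is a pure in-memory scan. (A Python sketch with hypothetical data; an in-memory buffer stands in for a file on a slow disk.)

```python
import csv
import io
from collections import defaultdict

# Hypothetical sales table "on disk" (a StringIO stands in for the file).
DISK_FILE = io.StringIO("dept,sales\ntoys,100\ntoys,50\nbooks,70\n")

# Pay the I/O cost once: pull the whole table into RAM...
rows = list(csv.DictReader(DISK_FILE))

# ...then each interactive query scans memory only: no I/O wait, so the
# response comes back inside the analyst's train of thought.
def sales_by_dept(rows):
    totals = defaultdict(int)
    for r in rows:
        totals[r["dept"]] += int(r["sales"])
    return dict(totals)

print(sales_by_dept(rows))  # {'toys': 150, 'books': 70}
```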


Analytical Platform: Reference Architecture


Application & Client Layer: all BI tools, all OLAP clients, Excel, reporting

Analytical Platform Layer (with optional near-line storage)

Persistence Layer: Hadoop clusters, cloud storage, enterprise data warehouses, legacy systems

The Future
Predictive analytics, advanced analytics, big data, in-memory, data scientists, the logical data warehouse


connect www.kognitio.com linkedin.com/companies/kognitio tinyurl.com/kognitio NA: +1 855 KOGNITIO EMEA: +44 1344 300 770 twitter.com/kognitio youtube.com/kognitio

Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists
The key challenge for Data Scientists is not the proliferation of their roles, but the ability to graduate key Big Data projects from the Data Science Lab and productionize them across their broader organizations. Over the next 18 months, "Big Data" will become just "Data"; this means everyone (even business users) will need a way to use it, without reinventing the way they interact with their current reporting and analysis. Doing this requires interactive analysis with existing tools and massively parallel code execution, tightly integrated with Hadoop. Your Data Warehouse is dying; Hadoop will drive a material shift away from price per TB of persistent data storage.

Wanted
Dead or Alive

The NoSQL Posse

SQL

The new bounty hunters: Drill, Impala, Pivotal, Stinger

It's all about getting work done

Tasks are evolving:
It used to be a simple fetch of a value
Then it was calculating a dynamic aggregate
Now it is complex algorithms!

Behind the numbers

First, a simple fetch and aggregate:

select sum(sales) total_sales, dept
from sales_history
where year = 2006 and month = 5 and region = 1
group by dept
having sum(sales) > 50000;

Then, a dynamic aggregate with window functions:

select Trans_Year,
       count(distinct Account_ID) Num_Accts,
       sum(count(distinct Account_ID)) over (partition by Trans_Year),
       cast(sum(total_spend)/1000 as int) Total_Spend,
       cast(sum(total_spend)/1000 as int) / count(distinct Account_ID),
       rank() over (partition by Trans_Year order by count(distinct Account_ID)),
       rank() over (partition by Trans_Year order by sum(total_spend))
from ( select Account_ID,
              Extract(Year from Effective_Date) Trans_Year,
              count(Transaction_ID) Num_Trans,
              sum(spend) total_spend
       from sales_fact
       where period between date '2006-01-01' and date '2006-12-31'
       group by 1, 2 )
group by Trans_Year;

And now, a complex algorithm: a massively parallel R script invoked from SQL:

create external script LM_PRODUCT_FORECAST environment rsint
receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER,
           PRODNO INTEGER, DAILYSALES INTEGER )
partition by PRODNO order by PRODNO, ROW_ID
sends ( R_OUTPUT varchar )
isolate partitions
script S'endofr(
# Simple R script to run a linear fit on daily sales
prod1<-read.csv(file=file("stdin"), header=FALSE, row.names=1)
colnames(prod1)<-c("DOW","ID","PRODNO","DAILYSALES")
dim1<-dim(prod1)
daily1<-aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW), sum)
daily1[,2]<-daily1[,2]/sum(daily1[,2])
basesales<-array(0,c(dim1[1],2))
basesales[,1]<-prod1$ID
basesales[,2]<-(prod1$DAILYSALES/daily1[prod1$DOW+1,2])
colnames(basesales)<-c("ID","BASESALES")
fit1=lm(BASESALES ~ ID,as.data.frame(basesales))
)endofr';

For once technology is on our side


For the first time we have the full triumvirate:
Excellent computing power
Unlimited storage
Fast networks

now that RAM is cheap!


Hadoop is inherently disk-oriented

Lots of disks, not so many CPUs: typically a low ratio of CPU to disk

