No Big Data Hacking: Time for a Complete ETL Solution with Oracle Data Integrator 12c

Jerome Francoisse | Oracle OpenWorld 2015


info@rittmanmead.com www.rittmanmead.com @rittmanmead 1
Jérôme Françoisse
Consultant for Rittman Mead

Oracle BI/DW Architect/Analyst/Developer

ODI Trainer

Providing ODI support on OTN Forums

ODI 12c Beta Program Member

Blogger at http://www.rittmanmead.com/blog/

Email : jerome.francoisse@rittmanmead.com

Twitter : @JeromeFr



About Rittman Mead

Optimizing your investment in Oracle Data Integration

World's leading specialist partner for technical excellence, solutions delivery and innovation in Oracle Data Integration, Business Intelligence, Analytics and Big Data

70+ consultants worldwide including 1 Oracle ACE Director and 3 Oracle ACEs

Providing our customers targeted expertise; we are a company that doesn't try to do everything, only what we excel at

Founded on the values of collaboration, learning, integrity and getting things done

Comprehensive service portfolio designed to support the full lifecycle of any analytics solution



User Engagement

Average user adoption for BI platforms is below 25%

Rittman Mead's User Engagement Service can help

Visual Redesign
Business User Training
Engagement Toolkit
Ongoing Support



The Oracle BI, DW and Big Data Product Architecture



The place of Big Data in the Reference Architecture



Hive
SQL Interface over HDFS

Set-based transformation

SerDe to map complex file structure



HiveQL
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/jfrancoi/apache_data/FlumeData.1412752921353' OVERWRITE INTO TABLE apachelog;
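
To see what the "input.regex" property above does to a single line, the following minimal Python sketch applies the same pattern (with the HiveQL string escaping removed) to one log line. The sample line and the `parse_line` helper are illustrative assumptions and not part of the original deck. The sketch also assumes the SerDe requires the whole line to match, which `re.fullmatch` mimics.

```python
import re

# Same pattern as "input.regex" above, with the HiveQL string escaping removed.
APACHE_LOG = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names taken from the CREATE TABLE statement.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]


def parse_line(line):
    """Return a dict of column -> value, or None when the line does not match."""
    match = APACHE_LOG.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical sample line, for illustration only.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08"')
row = parse_line(sample)
print(row["host"], row["status"], row["request"])
```

Each capture group becomes one STRING column, which is why every column in the table is declared as STRING. Quoted fields keep their surrounding double quotes.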



Pig
Dataflow language

Pipeline of transformations

Can benefit from UDF



Pig Latin
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar
raw_logs = LOAD '/user/mrittman/rm_logs' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs
  GENERATE FLATTEN
  (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
  ) AS
  (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray, request: chararray, status: chararray, bytes_string: chararray, referrer: chararray, browser: chararray);
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*');
logs_base_page = FOREACH logs_base_nobots GENERATE SUBSTRING(time,0,2) as day, SUBSTRING(time,3,6) as month, SUBSTRING(time,7,11) as year, FLATTEN(STRSPLIT(request,' ',5)) AS (method:chararray, request_page:chararray, protocol:chararray), remoteAddr, status;
logs_base_page_cleaned = FILTER logs_base_page BY NOT (SUBSTRING(request_page,0,3) == '/wp' or request_page == '/' or SUBSTRING(request_page,0,7) == '/files/' or SUBSTRING(request_page,0,12) == '/favicon.ico');
logs_base_page_cleaned_by_page = GROUP logs_base_page_cleaned BY request_page;
page_count = FOREACH logs_base_page_cleaned_by_page GENERATE FLATTEN(group) as request_page, COUNT(logs_base_page_cleaned) as hits;
page_count_sorted = ORDER page_count BY hits DESC;
page_count_top_10 = LIMIT page_count_sorted 10;
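
The pipeline above can be hard to follow as one block, so here is a minimal in-memory Python sketch of the same sequence of steps: drop bots, split the request, drop unwanted pages, count hits per page, keep the top 10. The `top_pages` function, its `(request, browser)` input shape and the sample rows are assumptions made for illustration; malformed requests are simply skipped, which may differ from how Pig handles them.

```python
import re
from collections import Counter

# Same alternatives as the Pig "matches" expression above (whole-string match).
BOT_PATTERN = re.compile(
    r".*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|"
    r"EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*"
)


def top_pages(rows, limit=10):
    """rows: iterable of (request, browser) pairs already extracted from the log."""
    hits = Counter()
    for request, browser in rows:
        if BOT_PATTERN.fullmatch(browser):
            continue  # FILTER ... BY NOT (browser matches ...)
        parts = request.split(" ")
        if len(parts) < 2:
            continue  # assumption: skip requests without a page component
        page = parts[1]  # method, request_page, protocol
        if (page.startswith("/wp") or page == "/"
                or page.startswith("/files/")
                or page.startswith("/favicon.ico")):
            continue  # same exclusions as logs_base_page_cleaned
        hits[page] += 1  # GROUP BY request_page + COUNT
    return hits.most_common(limit)  # ORDER BY hits DESC + LIMIT


# Hypothetical rows, for illustration only.
rows = [
    ("GET /blog/a/ HTTP/1.1", "Mozilla/5.0"),
    ("GET /blog/a/ HTTP/1.1", "Mozilla/5.0"),
    ("GET /blog/b/ HTTP/1.1", "Googlebot/2.1"),
    ("GET /wp-admin/ HTTP/1.1", "Mozilla/5.0"),
    ("GET /blog/c/ HTTP/1.1", "Mozilla/5.0"),
    ("GET / HTTP/1.1", "Mozilla/5.0"),
]
print(top_pages(rows))
```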



Spark
Open-source Computing framework

Dataflow processes

RDDs

in-Memory

Scala, Python or Java



Spark
package com.cloudera.analyzeblog

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext

def main(args: Array[String]) {
  val sc = new SparkContext(new SparkConf().setAppName("analyzeBlog"))
  val sqlContext = new SQLContext(sc)
  import sqlContext._

  val raw_logs = "/user/mrittman/rm_logs"
  //val rowRegex = """^([0-9.]+)\s([\w.-]+)\s([\w.-]+)\s(\[[^\[\]]+\])\s"((?:[^"]|\")+)"\s(\d{3})\s(\d+|-)\s"((?:[^"]|\")+)"\s"((?:[^"]|\")+)"$""".r
  val rowRegex = """^([\d.]+) (\S+) (\S+) \[([\w\d:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) ([\d\-]+) "([^"]+)" "([^"]+)".*""".r

  val logs_base = sc.textFile(raw_logs) flatMap {
    case rowRegex(host, identity, user, time, request, status, size, referer, agent) =>
      Seq(accessLogRow(host, identity, user, time, request, status, size, referer, agent))
    case _ => Nil
  }

  val logs_base_nobots = logs_base.filter( r => ! r.request.matches(".*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*"))

  val logs_base_page = logs_base_nobots.map { r =>
    val request = getRequestUrl(r.request)
    val request_formatted = if (request.charAt(request.length-1).toString == "/") request else request.concat("/")
    (r.host, request_formatted, r.status, r.agent)
  }

  val logs_base_page_schemaRDD = logs_base_page.map(p => pageRow(p._1, p._2, p._3, p._4))

  logs_base_page_schemaRDD.registerAsTable("logs_base_page")

  val page_count = sql("SELECT request_page, count(*) as hits FROM logs_base_page GROUP BY request_page").registerAsTable("page_count")

  val postsLocation = "/user/mrittman/posts.psv"

  val posts = sc.textFile(postsLocation).map{ line =>
    val cols=line.split('|')
    postRow(cols(0),cols(1),cols(2),cols(3),cols(4),cols(5),cols(6).concat("/"))
  }

  posts.registerAsTable("posts")

  val pages_and_posts_details = sql("SELECT p.request_page, p.hits, ps.title, ps.author FROM page_count p JOIN posts ps ON p.request_page = ps.generated_url ORDER BY hits DESC LIMIT 10")

  pages_and_posts_details.saveAsTextFile("/user/mrittman/top_10_pages_and_author4")

}
}
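
The two steps that this Spark job adds over the Pig version are the trailing-slash normalisation of the requested page and the join of the page counts with the posts on the generated URL. A minimal in-memory Python sketch of just those two steps follows. The `with_trailing_slash` and `top_posts` names, the dict-based inputs and the sample data are assumptions for illustration, and the sketch assumes each generated URL belongs to a single post.

```python
def with_trailing_slash(url):
    """Mirror of request_formatted: append "/" unless the URL already ends with one."""
    return url if url.endswith("/") else url + "/"


def top_posts(page_hits, posts, limit=10):
    """Inner-join page hit counts with posts on generated_url, highest hits first.

    page_hits: dict mapping request_page -> hits
    posts: list of dicts with "title", "author" and "generated_url" keys
    """
    # Assumption: generated_url is unique per post.
    by_url = {post["generated_url"]: post for post in posts}
    joined = [
        (page, hits, by_url[page]["title"], by_url[page]["author"])
        for page, hits in page_hits.items()
        if page in by_url  # inner join: pages without a post are dropped
    ]
    return sorted(joined, key=lambda r: r[1], reverse=True)[:limit]


# Hypothetical data, for illustration only.
page_hits = {with_trailing_slash("/a"): 5, "/b/": 9, "/c/": 1}
posts = [
    {"title": "A", "author": "x", "generated_url": "/a/"},
    {"title": "B", "author": "y", "generated_url": "/b/"},
]
print(top_posts(page_hits, posts))
```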



How it's done

A few experts writing code

Hard to maintain

No Governance

New tools every month



Déjà vu?
DECLARE
   CURSOR c1 IS
      SELECT account_id, oper_type, new_value FROM action
      ORDER BY time_tag
      FOR UPDATE OF status;
BEGIN
   FOR acct IN c1 LOOP -- process each row one at a time
      acct.oper_type := upper(acct.oper_type);
      IF acct.oper_type = 'U' THEN
         UPDATE accounts SET bal = acct.new_value
            WHERE account_id = acct.account_id;
         IF SQL%NOTFOUND THEN -- account didn't exist. Create it.
            INSERT INTO accounts
               VALUES (acct.account_id, acct.new_value);
            UPDATE action SET status =
               'Update: ID not found. Value inserted.'
               WHERE CURRENT OF c1;
         ELSE
            UPDATE action SET status = 'Update: Success.'
               WHERE CURRENT OF c1;
         END IF;
      ELSIF acct.oper_type = 'I' THEN
         BEGIN
            INSERT INTO accounts
               VALUES (acct.account_id, acct.new_value);
            UPDATE action set status = 'Insert: Success.'
               WHERE CURRENT OF c1;
         EXCEPTION
            WHEN DUP_VAL_ON_INDEX THEN -- account already exists
               UPDATE accounts SET bal = acct.new_value
                  WHERE account_id = acct.account_id;
               UPDATE action SET status =
                  'Insert: Acct exists. Updated instead.'
                  WHERE CURRENT OF c1;
         END;
      ELSIF acct.oper_type = 'D' THEN
         DELETE FROM accounts
            WHERE account_id = acct.account_id;
         IF SQL%NOTFOUND THEN -- account didn't exist.
            UPDATE action SET status = 'Delete: ID not found.'
               WHERE CURRENT OF c1;
         ELSE
            UPDATE action SET status = 'Delete: Success.'
               WHERE CURRENT OF c1;
         END IF;
      ELSE -- oper_type is invalid
         UPDATE action SET status =
            'Invalid operation. No action taken.'
            WHERE CURRENT OF c1;
      END IF;
   END LOOP;
   COMMIT;
END;

source: docs.oracle.com



Moved to ETL Solutions


Can we do that for Big Data?

Yes! ODI provides an excellent framework for running Hadoop ETL jobs

- ODI uses all the native technologies, pushing the transformations down to Hadoop

Hive, Pig, Spark, HBase, Sqoop and OLH/OSCH KMs provide native Hadoop loading / transformation

- Requires the Big Data Option

Also benefits from everything else in ODI

- Orchestration and Monitoring
- Data firewall and Error handling



Can we do that for Big Data?

[Diagram: ODI sits on top of the whole flow. Left: sources (Files - Logs, API, NoSQL Database, OLTP Database) loaded with Flume and Sqoop. Centre: Hadoop (Hive, HBase, HDFS). Right: Enterprise DWH, reached through Files, BigData SQL, OLH/OSCH and Sqoop.]

Import Hive Table Metadata into ODI Repository

Connections to Hive, Hadoop (and Pig) set up earlier

Define physical and logical schemas, reverse-engineer the table definitions into the repository

- Can be temperamental with tables using non-standard SerDes; make sure the JARs are registered



Demo - Logical - Business Rules



Demo - Hive Physical Mapping



HiveQL
INSERT INTO TABLE default.movie_rating
SELECT
MOVIE.movie_id movie_id ,
MOVIE.title title ,
MOVIE.year year ,
ROUND(MOVIEAPP_LOG_ODISTAGE_1.rating) avg_rating
FROM
default.movie MOVIE JOIN (
SELECT
AVG(MOVIEAPP_LOG_ODISTAGE.rating) rating ,
MOVIEAPP_LOG_ODISTAGE.movieid movieid
FROM
default.movieapp_log_odistage MOVIEAPP_LOG_ODISTAGE
WHERE
(MOVIEAPP_LOG_ODISTAGE.activity = 1
)
GROUP BY
MOVIEAPP_LOG_ODISTAGE.movieid
) MOVIEAPP_LOG_ODISTAGE_1
ON MOVIE.movie_id = MOVIEAPP_LOG_ODISTAGE_1.movieid
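
As a reading aid for the generated statement above, here is a minimal in-memory Python sketch of the same business rule: keep log rows with activity 1, average the rating per movie, inner-join to the movie table and round the average. The `movie_ratings` function, the dict-based rows and the sample data are assumptions for illustration. Python's `round` rounds halves to the nearest even integer, so results on exact .5 averages may differ from Hive's ROUND.

```python
from collections import defaultdict


def movie_ratings(movies, log):
    """movies: dicts with movie_id, title, year; log: dicts with movieid, activity, rating."""
    ratings = defaultdict(list)
    for event in log:
        # WHERE activity = 1; NULL ratings are ignored, as SQL AVG does.
        if event["activity"] == 1 and event["rating"] is not None:
            ratings[event["movieid"]].append(event["rating"])

    result = []
    for movie in movies:
        values = ratings.get(movie["movie_id"])
        if values:  # inner join: movies without a rating are dropped
            avg = sum(values) / len(values)
            result.append((movie["movie_id"], movie["title"],
                           movie["year"], round(avg)))
    return result


# Hypothetical data, for illustration only.
movies = [
    {"movie_id": 1, "title": "A", "year": 2000},
    {"movie_id": 2, "title": "B", "year": 2001},
]
log = [
    {"movieid": 1, "activity": 1, "rating": 4},
    {"movieid": 1, "activity": 1, "rating": 5},
    {"movieid": 1, "activity": 1, "rating": 5},
    {"movieid": 2, "activity": 2, "rating": 3},
]
print(movie_ratings(movies, log))
```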



Demo - Pig Physical Mapping



Pig
MOVIE = load 'default.movie' using org.apache.hive.hcatalog.pig.HCatLoader as
(movie_id:int, title:chararray, year:int, budget:int, gross:int,
plot_summary:chararray);
MOVIEAPP_LOG_ODISTAGE = load 'default.movieapp_log_odistage' using
org.apache.hive.hcatalog.pig.HCatLoader as (custid:int, movieid:int, genreid:int,
time:chararray, recommended:int, activity:int, rating:int, sales:float);
FILTER0 = filter MOVIEAPP_LOG_ODISTAGE by activity == 1;
AGGREGATE = foreach FILTER0 generate movieid as movieid, rating as rating;
AGGREGATE = group AGGREGATE by movieid;
AGGREGATE = foreach AGGREGATE generate
group as movieid,
AVG($1.rating) as rating;
JOIN0 = join MOVIE by movie_id, AGGREGATE by movieid;
JOIN0 = foreach JOIN0 generate
MOVIE::movie_id as movie_id, MOVIE::title as title, MOVIE::year as year,
ROUND(AGGREGATE::rating) as avg_rating;
store JOIN0 into 'default.movie_rating' using org.apache.hive.hcatalog.pig.HCatStorer;



Demo - Spark Physical Mapping



pySpark
OdiOutFile -FILE=/tmp/C___Calc_Ratings__Hive___Pig___Spark_.py -CHARSET_ENCODING=UTF-8
# -*- coding: utf-8 -*-
from pyspark import SparkContext, SparkConf
from pyspark.sql import *
config = SparkConf().setAppName("C___Calc_Ratings__Hive___Pig___Spark_").setMaster("yarn-client")
sc = SparkContext(conf = config)
sqlContext = SQLContext(sc)
sparkVersion = reduce(lambda sum, elem: sum*10 + elem, map(lambda x: int(x) if x.isdigit() else 0, sc.version.strip().split('.')), 0)
import sys
from datetime import *
hiveCtx = HiveContext(sc)
def convertRowToDict(row):
    ret = {}
    for num in range(0, len(row.__FIELDS__)) :
        ret[row.__FIELDS__[num]] = row[num]
    return ret
from pyspark_ext import *
#Local defs
#Replace None RDD element to new defined 'NoneRddElement' object, which overload the [] operator.
#For example, MOV["MOVIE_ID"] return None rather than TypeError: 'NoneType' object is unsubscriptable when MOV is none RDD element.
def convert_to_none(x):
    return NoneRddElement() if x is None else x
#Transform RDD element from dict to tuple to support RDD subtraction.
#For example (MOV, (RAT, LAN)) transform to (tuple(sorted(MOV.items())), (tuple(sorted(RAT.items())),tuple(sorted(LAN.items())))
def dict2Tuple(t):
    return tuple(map(dict2Tuple, t)) if isinstance(t, (list, tuple)) else tuple(sorted(t.items()))
#reverse dict2Tuple(t)
def tuple2Dict(t):
    return dict((x,y) for x,y in t) if not isinstance(t[0][0], (list, tuple)) else tuple(map(tuple2Dict, t))
from operator import is_not
from functools import partial
def SUM(x): return sum(filter(None,x));
def MAX(x): return max(x);
def MIN(x): return min(x);
def AVG(x): return None if COUNT(x) == 0 else SUM(x)/COUNT(x);
def COUNT(x): return len(filter(partial(is_not, None),x));
def safeAggregate(x,y): return None if not y else x(y);
def getValue(type,value,format='%Y-%m-%d'):
    try:
        if type is date:
            return datetime.strptime(value,format).date()
        else: return type(value)
    except ValueError:return None;
def getScaledValue(scale, value):
    try: return '' if value is None else ('%0.'+ str(scale) +'f')%float(value);
    except ValueError:return '';
def getStrValue(value, format='%Y-%m-%d'):
    if value is None : return ''
    if isinstance(value, date): return value.strftime(format)
    if isinstance(value, str): return unicode(value, 'utf-8')
    if isinstance(value, unicode) : return value
    try: return unicode(value)
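
The generated script above is Python 2 (`unicode`, and a `filter` that returns a list). To make the intent of its None-aware aggregate helpers easier to see, here is a minimal Python 3 sketch of `COUNT`, `SUM` and `AVG`. It is an illustrative port written for this note and is not what ODI generates; the sample values are assumptions.

```python
from functools import partial
from operator import is_not


def COUNT(values):
    """Number of elements that are not None."""
    return len(list(filter(partial(is_not, None), values)))


def SUM(values):
    """Sum of the elements, skipping None."""
    return sum(v for v in values if v is not None)


def AVG(values):
    """Average over the non-None elements, or None when there are none."""
    count = COUNT(values)
    return None if count == 0 else SUM(values) / count


# Hypothetical ratings with a missing value, for illustration only.
ratings = [4, None, 2]
print(COUNT(ratings), SUM(ratings), AVG(ratings))
```

Returning None for an empty group keeps the behaviour close to SQL, where AVG over no rows yields NULL.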



pySpark

OdiOSCommand "-OUT_FILE=/tmp/C___Calc_Ratings__Hive___Pig___Spark_.out" "-ERR_FILE=/tmp/C___Calc_Ratings__Hive___Pig___Spark_.err" "-WORKING_DIR=/tmp"
/usr/lib/spark/bin/spark-submit --master yarn-client /tmp/C___Calc_Ratings__Hive___Pig___Spark_.py --py-files /tmp/pyspark_ext.py --executor-memory 1G --driver-cores 1 --executor-cores 1 --num-executors 2

Oozie

workflow scheduler system to manage Apache Hadoop jobs

execution, scheduling, monitoring

integrated in the Hadoop ecosystem

no additional footprint

Limitation - No Load Plans



HDFS

Oracle Big Data SQL

Gives us the ability to easily bring Hadoop (Hive) data into Oracle-based mappings

Oracle SQL to transform and join in Hive

Faster access to Hive data for real-time ETL scenarios



Supplement with Oracle Reference Data - SQOOP

Mapping physical details specify the Sqoop KM for extract (LKM SQL to Hive Sqoop)

IKM Hive Append used for join and load into Hive target

Missing?

Streaming Capabilities

Spark Streaming

Kafka



Further Reading / Testing

http://www.rittmanmead.com/2015/04/odi12c-advanced-big-data-option-overview-install/

http://www.rittmanmead.com/2015/04/so-whats-the-real-point-of-odi12c-for-big-data-generating-pig-and-spark-mappings/

Oracle BigData Lite VM - 4.2.1


Questions?

Blogs:

- www.rittmanmead.com/blog

Contact:

- info@rittmanmead.com
- jerome.francoisse@rittmanmead.com

Twitter:

- @rittmanmead
- @JeromeFr


Rittman Mead Sessions

No Big Data Hacking: Time for a Complete ETL Solution with Oracle Data Integrator 12c [UGF5827]
Jérôme Françoisse | Sunday, Oct 25, 8:00am | Moscone South 301

Empowering Users: Oracle Business Intelligence Enterprise Edition 12c Visual Analyzer [UGF5481]
Edelweiss Kammermann | Sunday, Oct 25, 10:00am | Moscone West 3011

A Walk Through the Kimball ETL Subsystems with Oracle Data Integration Solutions [UGF6311]
Michael Rainey | Sunday, Oct 25, 12:00pm | Moscone South 301

Oracle Business Intelligence Cloud Service: Moving Your Complete BI Platform to the Cloud [UGF4906]
Mark Rittman | Sunday, Oct 25, 2:30pm | Moscone South 301

Oracle Data Integration Product Family: a Cornerstone for Big Data [CON9609]
Mark Rittman | Wednesday, Oct 28, 12:15pm | Moscone West 2022

Developer Best Practices for Oracle Data Integrator Lifecycle Management [CON9611]
Jérôme Françoisse | Thursday, Oct 29, 2:30pm | Moscone West 2022
