
Oozie: Now and Beyond

! PRESENTED BY Mona Chitnis
! Hadoop User Group, Yahoo Sunnyvale, October 16, 2013
Team In Action
2 Yahoo Confidential & Proprietary
! Alejandro Abdelnur
! Mohammad Islam
! Rohini Palaniswamy
! Robert Kanter
! Virag Kothari
! Mona Chitnis
! Ryota Egashira
! Michelle Chiang
! Bowen Zhang
OVERVIEW
Why Oozie?
The Problem
! Doing something on the grid often required multiple steps
  ! MapReduce job
  ! Pig job
  ! Streaming job
  ! HDFS operation (mkdir, chmod, etc.)
! Multiple ad-hoc solutions existed
  ! custom job control
  ! shell scripts
  ! cron
! The cost of building and running apps was high
  ! development and applications engineering
  ! support, operations, and hardware
The Need
! Workflow scheduler with better support for grid jobs (native integration with Hadoop)
  ! orchestrate dependencies between jobs
  ! execute at a specific time or on data availability
  ! retry jobs in the event of failures (reliable)
! Common framework for communication and execution of production processes
  ! sync (clocked dataset) awareness
  ! async (unspecified frequency) data awareness
! Horizontally scalable and extensible system
! Open-source
! Workflows to couple resources instead of having a monolithic code base
A server-based workflow scheduling system to manage Hadoop jobs
Oozie: A Workflow Engine
! Oozie executes workflows defined as a DAG of jobs
! Job types include MapReduce, Pig, Hive, shell script, custom Java code, etc.
! Introduced in Oozie 1.x

[Figure: example workflow DAG. Nodes shown: start, M/R job, M/R job, decision (branches MORE / ENOUGH), fork, Pig job, M/R job, join, Java, FS job, end, kill. Transitions are labeled OK / ERROR.]
Control-flow nodes
(start, kill, end | fork, join, decision)
Action nodes
(map reduce, pig, hive, distcp, java, fs, sub-workflow, shell, ssh, email)
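The control-flow nodes named above can be sketched in a workflow definition roughly as follows. This is a minimal illustration, not the workflow from the slide: every node name, the moreDataNeeded property, the Pig script name and the schema version (0.4) are assumptions.

```xml
<!-- Hypothetical sketch: all node names and the moreDataNeeded property are illustrative. -->
<workflow-app name="dag-sketch-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="first-mr"/>

  <action name="first-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <ok to="check"/>
    <error to="fail"/>
  </action>

  <!-- decision: the first case whose predicate evaluates to true is taken, else the default -->
  <decision name="check">
    <switch>
      <case to="extra-mr">${wf:conf('moreDataNeeded') eq 'true'}</case>
      <default to="split"/>
    </switch>
  </decision>

  <action name="extra-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <ok to="split"/>
    <error to="fail"/>
  </action>

  <!-- fork starts both paths in parallel; join waits for all of them -->
  <fork name="split">
    <path start="pig-node"/>
    <path start="mr-node"/>
  </fork>

  <action name="pig-node">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>process.pig</script>
    </pig>
    <ok to="merge"/>
    <error to="fail"/>
  </action>

  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <ok to="merge"/>
    <error to="fail"/>
  </action>

  <join name="merge" to="end"/>

  <kill name="fail">
    <message>Workflow failed at [${wf:lastErrorNode()}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action has an ok and an error transition, which is what the OK / ERROR labels in the diagram refer to. The map-reduce bodies are left nearly empty here; a real job would add a configuration block.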
Example M/R Action
JT and NN
Mapper
Reducer
Queue Name
Input Directory
Output Directory
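The callouts above (JT and NN, mapper, reducer, queue name, input and output directories) map onto a map-reduce action roughly as in this sketch. The class names, node names and parameter names (jobTracker, nameNode, queueName, inputDir, outputDir) are assumptions for illustration, not the values from the slide.

```xml
<!-- Hypothetical sketch of a map-reduce action; class and parameter names are illustrative. -->
<workflow-app name="map-reduce-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <!-- JT and NN -->
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- Mapper -->
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.MyMapper</value>
        </property>
        <!-- Reducer -->
        <property>
          <name>mapred.reducer.class</name>
          <value>com.example.MyReducer</value>
        </property>
        <!-- Queue name -->
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
        <!-- Input directory -->
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <!-- Output directory -->
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Map/Reduce failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The parameters would normally be supplied in the job properties file at submission time.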
Workflow State Transitions
Source: Chicago HUG, Dec 2012
Oozie (Coordinator): A Scheduler
! Oozie executes workflows based on
! time dependency (frequency)
! data dependency
! Introduced in 2.x
[Figure: the Oozie Client talks to the Oozie Server through the WS API. Inside the server, the Oozie Coordinator checks data availability in HDFS/HCat and drives the Oozie Workflow engine.]
Oozie (Bundle): A Pipeline Framework
! Users can define and execute a bundle of coordinator apps
! large-scale data processing (inter-related coordinators)
! operability and manageability of pipelines
! Users can start/stop/suspend/resume/rerun at the bundle level
! Introduced in 3.x; bundles are optional
[Figure: same layout as the coordinator diagram, with a Bundle layer added above the Oozie Coordinator: Oozie Client, WS API, Oozie Server (Bundle, Oozie Coordinator, Oozie Workflow), and a data availability check against HDFS/HCat.]
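A bundle that groups two coordinator apps could look roughly like the sketch below. The bundle and coordinator names, paths, kick-off time and the startTime property are assumptions for illustration; the schema version shown (0.1) is also an assumption about the deployment.

```xml
<!-- Hypothetical sketch: names, paths and times are illustrative. -->
<bundle-app name="click-pipeline" xmlns="uri:oozie:bundle:0.1">
  <controls>
    <!-- when the bundle should start submitting its coordinators -->
    <kick-off-time>2013-10-16T00:00Z</kick-off-time>
  </controls>

  <coordinator name="ingest-coord">
    <app-path>${nameNode}/user/${user.name}/apps/ingest-coord</app-path>
    <configuration>
      <property>
        <name>startTime</name>
        <value>2013-10-16T01:00Z</value>
      </property>
    </configuration>
  </coordinator>

  <coordinator name="rollup-coord">
    <app-path>${nameNode}/user/${user.name}/apps/rollup-coord</app-path>
  </coordinator>
</bundle-app>
```

Operations such as suspend, resume or rerun issued against the bundle job then apply to the coordinators it contains.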
Layers of Abstraction in Oozie
[Figure: three nested layers]
1. Bundle: groups Coord Jobs
2. Coordinator: each Coord Job creates Coord Actions
3. Workflow: each Coord Action runs a WF Job, which is made up of Hadoop jobs (M/R Job, Pig Job)
Architectural Overview

[Figure: Oozie architecture]
! Oozie (Java Web-App) with a Security layer
! Web Services (JSON/REST API): WS API and WS Callback
! Job commands: submit, start, rerun, suspend, resume, kill, info
! Action commands: start action, end action, check action, callback, signal, job notification
! Commands are executed asynchronously via the Command Queue by the Command Executor thread pool
! DAG Engine, with WF store and WF lib, backed by an Oracle DB
! Recovery daemon thread
! Instrumentation
! Action Executors (M/R, fs, Pig, sub-wf): pluggable, to support additional action types
Oozie Security, Multi-tenancy and Scalability
[Figure: request flow between the End User, the Oozie Server, and the Hadoop Cluster (YARN RM, Launcher Mapper, Actual M/R Job)]
1. Auth.: End User to Oozie Server (Kerberos, Y! specific)
2. Create Launcher Job (super-user)
3. Execute User Job (doAs)
4. Response
5. Async Callback
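For the doAs step, Hadoop has to allow the Oozie service user to impersonate end users. A minimal sketch of the corresponding Hadoop proxy-user settings is shown below, assuming the Oozie server runs as a user named oozie; the host name and group are placeholders, not values from the talk.

```xml
<!-- core-site.xml on the Hadoop cluster. Hypothetical values: service user "oozie", host and group are placeholders. -->
<configuration>
  <property>
    <!-- hosts from which the oozie user may impersonate other users -->
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>oozie-server.example.com</value>
  </property>
  <property>
    <!-- groups whose members the oozie user may impersonate -->
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>users</value>
  </property>
</configuration>
```

With this in place, jobs are submitted by the Oozie service user but run with the identity of the authenticated end user.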
USE CASES
Use Case 1: Time Triggers
Execute your workflow every 15 minutes

00:15 00:30 00:45 01:00
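A time-only trigger like this one can be sketched as a coordinator with a 15-minute frequency and no datasets. The application name, start and end times, and the workflow path are assumptions for illustration.

```xml
<!-- Hypothetical sketch: name, times and app-path are illustrative. -->
<coordinator-app name="every-15-min" frequency="${coord:minutes(15)}"
                 start="2013-10-16T00:15Z" end="2013-10-17T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <!-- the workflow application to run at each 15-minute tick -->
      <app-path>${nameNode}/user/${user.name}/apps/my-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

With these values the first actions would be materialized for 00:15, 00:30, 00:45 and 01:00.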
Use Cases and Common Patterns
Use Case 2: Time and Data Triggers
Materialize a workflow every hour, but run it only when its input data (loaded to the grid every hour) is ready

01:00 02:00 03:00 04:00
[Figure: at each hour Oozie checks Hadoop: does the input data exist?]
Use Case 2: Time and Data Triggers
<coordinator-app name="coord1" frequency="${coord:hours(1)}">
  <!-- Dataset definition -->
  <datasets>
    <dataset name="logs" frequency="${coord:hours(1)}" initial-instance="2009-01-01T23:59Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <!-- Input events definition: instances are resolved with the time at which
       the coordinator action is materialized (created) -->
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <!-- Action definition -->
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property> <name>inputData</name><value>${coord:dataIn('inputLogs')}</value> </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
Use Case 3: Rolling Window
Access 15-minute datasets and roll them up into hourly datasets

00:15 00:30 00:45 01:00 -> 01:00
01:15 01:30 01:45 02:00 -> 02:00
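A rolling window of this kind can be sketched as an hourly coordinator that reads the four most recent 15-minute instances. All names, paths and times below are assumptions for illustration; only the EL functions (coord:current, coord:dataIn, coord:dataOut) are standard coordinator functions.

```xml
<!-- Hypothetical sketch: dataset names, paths and times are illustrative. -->
<coordinator-app name="rollup-hourly" frequency="${coord:hours(1)}"
                 start="2013-10-16T01:00Z" end="2013-10-17T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- input: one instance every 15 minutes -->
    <dataset name="quarterHourly" frequency="${coord:minutes(15)}"
             initial-instance="2013-10-16T00:15Z" timezone="UTC">
      <uri-template>${nameNode}/data/15min/${YEAR}/${MONTH}/${DAY}/${HOUR}${MINUTE}</uri-template>
    </dataset>
    <!-- output: one instance every hour -->
    <dataset name="hourly" frequency="${coord:hours(1)}"
             initial-instance="2013-10-16T01:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/hourly/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- for the 01:00 action: 00:15, 00:30, 00:45 and 01:00 -->
    <data-in name="input" dataset="quarterHourly">
      <start-instance>${coord:current(-3)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  <output-events>
    <data-out name="output" dataset="hourly">
      <instance>${coord:current(0)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>${nameNode}/user/${user.name}/apps/rollup-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name>
          <value>${coord:dataIn('input')}</value>
        </property>
        <property>
          <name>outputData</name>
          <value>${coord:dataOut('output')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

Because consecutive hourly actions read disjoint sets of four instances, the window rolls rather than overlaps.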
Use Case 4: Sliding Window
Access the last 24 hours of data and roll them up every hour

01:00 02:00 03:00 ... 24:00 -> 24:00
02:00 03:00 04:00 ... +1 day 01:00 -> +1 day 01:00
03:00 04:00 05:00 ... +1 day 02:00 -> +1 day 02:00
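A sliding window differs from the rolling case only in that consecutive actions overlap: each hourly action reads the 24 most recent hourly instances. The sketch below assumes hypothetical names, paths and times.

```xml
<!-- Hypothetical sketch: dataset names, paths and times are illustrative. -->
<coordinator-app name="last-24h" frequency="${coord:hours(1)}"
                 start="2013-10-17T00:00Z" end="2013-10-18T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="hourlyLogs" frequency="${coord:hours(1)}"
             initial-instance="2013-10-16T01:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- 24 instances: the current hour and the 23 hours before it -->
    <data-in name="input" dataset="hourlyLogs">
      <start-instance>${coord:current(-23)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/user/${user.name}/apps/last-24h-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name>
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

With these values the first action covers 01:00 through 24:00 of the first day, and each later action shifts the window forward by one hour.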
Proven Scale and Multi-tenancy
! 17 clusters
! 13,000 jobs/server day
! 2.8 M jobs/month
! 16% of all Hadoop jobs
! 75 products
! 2,000+ projects
! 255 monthly users
! 5.4 M compute hrs/month
! 770,000 workflows
  ! Between 1 and 8 actions
  ! Avg. 4 actions/workflow
! 250 coordinator jobs/day
! 67% of Oozie jobs kicked off through coordinators
Where are We Today
Mix Of Job Types For Workflows
[Figure: bar chart, share of jobs by type: Pig 39%, MapReduce 29%, Java 28%, Other 4%]
SAMPLE USE OF JOB TYPES
Pig: data processing/filtering; aggregation
MapReduce: publishing data (HDFS/HCat)
Java: legacy code and logic
Others: DistCp and shell; data copy/transfer
FEATURE DEEP-DIVE
Existing Features (Oozie 3.x)
! HBase access through Oozie, via credentials
! HCatalog access through Oozie, via credentials
! Email action
! DistCp action (intra as well as inter-cluster copy)
! Shell action (run any script e.g. perl, python, hadoop CLI)
! Workflow dry-run & Fork-Join validation
! Bulk monitoring (REST API)
! Coordinator EL functions for parameterized workflows
! Job DAG
What's New in Oozie
HBase Credentials
! Add to workflow.xml:
  ! Add a "credentials" section. The type is "hbase".
  ! Specify the action that uses the credentials.
  ! Put hbase-site.xml in the Oozie application path, and use <file> in workflow.xml to put hbase-site.xml in the distributed cache. A copy of hbase-site.xml can be found at gateway:/home/gs/conf/hbase/hbase-site.xml.
  ! Put the JARs guava-*.jar, zookeeper-*.jar, hbase-*.jar and protobuf-java-*.jar in the workflow lib dir

! Make sure you are using Oozie XSD version 0.3 or above for the <credentials> tag.

<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.3">
  <credentials>
    <credential name="hbase.cert" type="hbase"> </credential>
    <!-- optional properties: zookeeper.znode.parent, hbase.zookeeper.quorum -->
  </credentials>
  <start to="map-reduce-action" />
  <action name="map-reduce-action" cred="hbase.cert">
    <map-reduce>
      <configuration>
        <property> <name>mapred.mapper.class</name>
          <value>SampleMapperHBase</value> </property>
        <property> <name>mapred.reducer.class</name>
          <value>org.apache.oozie.example.DemoReducer</value> </property>
      </configuration>
      <file>hbase-site.xml#hbase-site.xml</file>
    </map-reduce>

! Refer to http://twiki.corp.yahoo.com/view/CCDI/UseHbaseCred
Oozie 4.0
1. HCatalog Integration
2. Job Notifications
3. SLA Monitoring
HCatalog Integration
! Oozie now supports HCatalog datasets, in addition to HDFS
  ! Query HCat server directly -OR-
  ! Receive partition-created notifications
! With HDFS datasets, Oozie polls the NameNode to check data availability
  ! Delay
  ! Single source
[Figure: Oozie repeatedly asks the NameNode "data exists?" for HDFS paths /data/click/2013/03/10, /data/click/2013/03/11, /data/click/2013/03/12, ...]
Latest Oozie 4.0 Features: HCatalog Integration
! HCat metastore has info about HDFS datasets, locations and file formats.
! Using the HCat loader and storer, datasets can be consumed uniformly using Pig, Hive and Map/Reduce in Oozie, using the database, table, partition abstraction.
! Oozie is notified of partition availability via JMS messages, to trigger workflows immediately
! Use JARs hcatalog-core.jar, webhcat-java-client.jar, hive-common.jar, hive-exec.jar, hive-metastore.jar, hive-serde.jar and libfb303.jar in workflow lib
! Docs - http://oozie.apache.org/docs/4.0.0/DG_HCatalogIntegration.html

<coordinator-app name="hcat-coord" ... >
  <datasets>
    <dataset name="inp-logs" frequency="${coord:hours(1)}">
      <uri-template>${hcatNode}/${db}/${table}/ds=${YEAR}-${MONTH}-${DAY};region=${region}</uri-template>
      <done-flag></done-flag>
    </dataset>
    <dataset name="out-logs" frequency="${coord:days(1)}">
      <uri-template>${hcatNode}/${db}/${outputtable}/ds=${dataOut};region=${region}</uri-template>
      <done-flag></done-flag>
    </dataset>
  ...
  <property>
    <name>FILTER</name>
    <value>${coord:dataInPartitionFilter('input', 'pig')}</value>
  </property>

Pig action script:
A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY $FILTER;
C = foreach B generate foo, bar;
store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
With HCatalog + Notifications
High-level Diagram
[Figure: Oozie, Message Bus (e.g., ActiveMQ), HCatalog, Data Producer and HDFS, with the labeled arrows listed below]
1. Query/Poll Partition
2. Register Topic
! Data Producer: produce data (distcp, pig, M/R, ...) into HDFS at /data/click/2013/03/12
! Data Producer: update metadata in HCatalog
  (ALTER TABLE click ADD PARTITION(data=2013/03/12)
  location hdfs://data/click/2013/03/12)
3. Push notification <New Partition>
4. Notify New Partition
! Start workflow
Latest Oozie 4.0 Features: Job Notifications
! Notification events are sent on job status changes
! Messages are sent to the configured JMS-compliant message broker
! Users should write message listeners to listen on select topics (e.g. username)
! To filter further, apply JMS selectors on messages
  ! E.g. user, jobid, app-type, status, msg-type (JOB or SLA)
! Docs - http://oozie.apache.org/docs/4.0.0/DG_JMSNotifications.html

Filter desired app-types for notification:
<property>
  <name>oozie.service.EventHandlerService.filter.app.types</name>
  <value>workflow_job, workflow_action, coordinator_job, coordinator_action</value>
</property>

Notification Msg Example: Coordinator Action Failure Event
! Header (Selectors)
  AppType - Coordinator_Action
  Status - FAILURE
  User
  App-Name
! Message Body (JSON)
  ID (coord action id)
  Parent ID (coord Job ID)
  NominalTime
  StartTime
  EndTime
  Status - FAILED, KILLED, SUSPENDED, TIMEDOUT
  Error-Code, Error-Message (if KILLED or FAILED)
Latest Oozie 4.0 Features: SLA Monitoring
! Oozie can actively track SLAs on jobs
  ! Start-time, End-time, Duration
! Event Status
  ! START_MET, START_MISS
  ! END_MET, END_MISS
  ! DURATION_MET, DURATION_MISS
! At any time, the SLA processing stage will reflect:
  ! Not_Started <-- Job not yet begun
  ! In_Process <-- Job started and is running, and SLAs are being tracked
  ! Met <-- caused by an END_MET
  ! Miss <-- caused by an END_MISS
! Access/Filter SLA info via
  ! Web-console dashboard
  ! REST API
  ! JMS Messages
  ! Email alert
! Docs - http://oozie.apache.org/docs/4.0.0/DG_SLAMonitoring.html

<workflow-app xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2" name="sla-wf">
  ...
  <end name="end"/>
  <sla:info>
    <sla:nominal-time>${nominalTime}</sla:nominal-time>
    <sla:should-start>${shouldStart}</sla:should-start>
    <sla:should-end>${shouldEnd}</sla:should-end>
    <sla:max-duration>${duration}</sla:max-duration>
    <sla:alert-events>start_miss,end_miss</sla:alert-events>
    <sla:alert-contact>joe@yahoo</sla:alert-contact>
  </sla:info>
</workflow-app>
SLA Monitoring Dashboard
Checking Oozie Job
1. CLI (yoozie_client)

$ oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-joe
----------------------------------------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path : hdfs://localhost:8020/user/joe/workflows/map-reduce
Status : SUCCEEDED
Run : 0
User : joe
Group : users
Created : 2009-05-26 05:01
Started : 2009-05-26 05:01
Ended : 2009-05-26 05:01
Actions
---------------------------------------------------------------------------------------------------------------------
Action Name Type Status Transition External Id External Status Error Code Start End
------------------------------------------------------------------------------------------------------------------------------------------------------
hadoop1 map-reduce OK end job_200904281535_0254 SUCCEEDED - 2009-05-26 05:01 2009-05-26 05:01
------------------------------------------------------------------------------------------------------------------------------------------------------
Demo
Checking / Debugging Oozie Jobs
2. Web-Console
e.g. http://my-oozie-server:4080/oozie
Docs - https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook

What else is out there?
Oozie vs. Other Workflow Systems
                     | Oozie                                                   | Azkaban              | Luigi
Champion             | Yahoo! (now ASF)                                        | LinkedIn             | Spotify
Apache Affiliation   | TLP                                                     | License only         | License only
Language             | Java                                                    | Java                 | Python
Adoption             | High, part of all standard Hadoop distributions         | Low                  | Low
Code Complexity      | High (>100K lines)                                      | Medium (<50K lines)  | Low (<10K lines)
Hadoop Job Support   | Extensive built-in support                              | Limited job types    | Limited job types
Docs & Support       | Excellent                                               | Limited              | Limited
Auth.                | Kerberos, custom                                        | xml-based, custom    | Linux-based
Reruns               | Yes (recovery, retries at all levels)                   | Partial              | After removing output, idempotent
UI                   | Average                                                 | Good                 | -
Oozie at ASF
The Next Release
! Scalability and performance improvements to handle higher loads
! More 1 and 5 min frequency jobs
! High Availability with Load Balancing
! Flexible Cron-Based Scheduling
! Handling rolling cluster upgrades for Hadoop 2.0


Roadmap
Q & A