
How to make your map-reduce jobs perform as well as Pig: Lessons from Pig optimizations

Thejas Nair, Pig team @ Yahoo!
Apache Pig PMC member

http://pig.apache.org

What is Pig?
Pig Latin, a high-level data processing language, plus an engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Pig Latin example


Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;

Comparison with MR in Java


[Bar charts: lines of code and development time in minutes, Hadoop MR in Java vs Pig]

1/20 the lines of code
1/16 the development time

What about Performance?

Pig Compared to Map Reduce


Faster development time
Data flow versus programming logic
Many standard data operations (e.g. join) included
Manages all the details of connecting jobs and data flow
Copes with Hadoop version change issues

And, You Don't Lose Power


UDFs can be used to load, evaluate, aggregate, and store data
External binaries can be invoked
Metadata is optional
Flexible data model, nested data types
Explicit data flow programming

Pig performance
PigMix: Pig vs MapReduce

Pig optimization principles


vs RDBMS: there is an absence of accurate models for data, operators, and the execution environment
Use available reliable info; trust user choice
Use rules that help in most cases
Rules based on runtime information

Logical Optimizations
Script (A = load; B = foreach; C = filter) -> Parser -> Logical Plan (A -> B -> C) -> Logical Optimizer -> Optimized Logical Plan (A -> C -> B)

Restructure the given logical dataflow graph:
Apply filter, project, limit early
Merge foreach, filter statements
Operator rewrites
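The filter-pushdown idea above can be sketched in a few lines of Python. This is an illustrative toy model, not Pig's optimizer; all names are invented. Note that the pushdown is only safe because the filter reads a field (age) the foreach passes through unchanged:

```python
records = [{"name": "fred", "age": 17},
           {"name": "jane", "age": 22},
           {"name": "amy", "age": 25}]

calls = {"n": 0}

def expensive_transform(rec):
    # stands in for a costly "foreach ... generate" expression
    calls["n"] += 1
    return {"name": rec["name"].upper(), "age": rec["age"]}

# unoptimized plan: foreach runs on every record, filter afterwards
out_before = [r for r in [expensive_transform(r) for r in records]
              if r["age"] >= 18]
calls_before = calls["n"]

calls["n"] = 0
# optimized plan: the filter is pushed above the foreach
out_after = [expensive_transform(r) for r in records if r["age"] >= 18]
calls_after = calls["n"]

assert out_before == out_after      # same result
assert calls_after < calls_before   # 2 transform calls instead of 3
```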

Physical Optimizations
Optimized L. Plan (X -> Y -> Z) -> Translator -> Phy/MR plan (M(PX-PYm) R(PYr) -> M(Z)) -> Optimizer -> Optimized Phy/MR Plan (M(PX-PYm) C(PYc) R(PYr) -> M(Z))
(M = map, C = combine, R = reduce stage of a job)

Physical plan: a sequence of MR jobs containing physical operators
Built-in rules, e.g. use of the combiner
Or specified in the query, e.g. the join type
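What the combiner rule buys can be shown with a toy Python model of a word count (illustrative only; not Pig or Hadoop code). The combiner pre-aggregates on the map side, so fewer pairs cross the shuffle while the final result is unchanged:

```python
from collections import defaultdict

def map_phase(records):
    # emit (key, 1) per record
    return [(r, 1) for r in records]

def combine(pairs):
    # combiner: partial aggregation on the map side
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return list(acc.items())

def reduce_phase(all_pairs):
    acc = defaultdict(int)
    for k, v in all_pairs:
        acc[k] += v
    return dict(acc)

map1 = map_phase(["a", "b", "a"])
map2 = map_phase(["a", "c"])

# with the combiner: fewer pairs shuffled, same final counts
shuffled = combine(map1) + combine(map2)
assert reduce_phase(shuffled) == {"a": 3, "b": 1, "c": 1}
assert len(shuffled) < len(map1 + map2)
```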

Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;

[Diagram: map 1 reads Pages block n and emits tuples tagged (1, user); map 2 reads Users block m and emits tuples tagged (2, name). After the shuffle, reducer 1 holds (1, fred), (2, fred), (2, fred) and reducer 2 holds (1, jane), (2, jane), (2, jane); each reducer joins the two tagged streams per key.]
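The reduce-side hash join can be sketched in Python (a toy model of the shuffle, not Pig internals): tag each tuple with its input relation, group by join key, then cross the two sides per key.

```python
from collections import defaultdict

users = [("fred", 25), ("jane", 22)]
pages = [("fred", "a.com"), ("fred", "b.com"), ("jane", "c.com")]

def reduce_side_join(left, right):
    # map phase: tag each tuple with the relation it came from
    tagged = [(k, (0, v)) for k, v in left] + [(k, (1, v)) for k, v in right]
    # shuffle: group tuples by join key
    groups = defaultdict(lambda: ([], []))
    for k, (tag, v) in tagged:
        groups[k][tag].append(v)
    # reduce phase: cross product of the two sides for each key
    return [(k, l, r) for k, (ls, rs) in groups.items()
            for l in ls for r in rs]

joined = reduce_side_join(users, pages)
assert ("fred", 25, "a.com") in joined
assert len(joined) == 3
```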

Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';

[Diagram: as in the hash join, but a sampling step (S) and partitioner (P) detect the skewed key fred on the Pages side and split it across reducers: reducer 1 gets (1, fred, p1), (1, fred, p2) and reducer 2 gets (1, fred, p3), (1, fred, p4), while the matching Users tuple (2, fred) is replicated to both reducers.]
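A toy Python model of the skewed join (illustrative only; Pig's sampler and partitioner are more sophisticated, and the threshold here is invented): tuples for a skewed key are spread across reducers, and the other side's matching tuples are replicated to all of them.

```python
from collections import defaultdict

pages = [("fred", f"p{i}") for i in range(1, 5)] + [("jane", "p5")]
users = [("fred", 25), ("jane", 22)]

NUM_REDUCERS = 2
SKEW_THRESHOLD = 3  # toy rule: keys with more tuples than this are split

# sampling pass: find skewed keys on the left (Pages) input
counts = defaultdict(int)
for k, _ in pages:
    counts[k] += 1
skewed = {k for k, c in counts.items() if c > SKEW_THRESHOLD}

reducers = [([], []) for _ in range(NUM_REDUCERS)]
# left side: tuples of skewed keys are spread round-robin across reducers
for i, (k, v) in enumerate(pages):
    r = i % NUM_REDUCERS if k in skewed else hash(k) % NUM_REDUCERS
    reducers[r][0].append((k, v))
# right side: tuples of skewed keys are replicated to every reducer
for k, v in users:
    targets = range(NUM_REDUCERS) if k in skewed else [hash(k) % NUM_REDUCERS]
    for r in targets:
        reducers[r][1].append((k, v))

joined = [(k, p, a) for left, right in reducers
          for k, p in left for k2, a in right if k == k2]
assert len(joined) == 5  # 4 fred pages + 1 jane page, no reducer overloaded
```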

Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';

[Diagram: Pages and Users are both sorted on the join key (aaron ... zach). Map 1 reads the Pages block covering aaron-amr and seeks to the matching range of the sorted Users file; map 2 does the same for amy-barb. The join happens entirely map-side, with no shuffle.]
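The map-side merge join boils down to a classic sorted-merge with two cursors. A minimal Python sketch (toy model, not Pig's implementation), assuming both inputs are sorted by join key:

```python
def merge_join(left, right):
    # both inputs must be sorted on the join key; advance two cursors in step
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # join the left row with the run of equal keys on the right
            jj = j
            while jj < len(right) and right[jj][0] == lk:
                out.append((lk, left[i][1], right[jj][1]))
                jj += 1
            i += 1
    return out

users = [("aaron", 31), ("amy", 22), ("barb", 40)]
pages = [("aaron", "a.com"), ("amy", "b.com"), ("amy", "c.com"), ("zach", "d.com")]
assert merge_join(users, pages) == [
    ("aaron", 31, "a.com"), ("amy", 22, "b.com"), ("amy", 22, "c.com")]
```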

Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';

[Diagram: each map task streams one Pages block (aaron-zach keys split across maps) while holding the entire small Users relation (aaron ... zach) in memory; every map gets a full replica of Users, so the join completes map-side with no reduce phase.]
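A toy Python model of the replicated (fragment-replicate) join, illustrative only: the small relation becomes an in-memory hash table copied to every map task, and each map probes it while streaming its block of the large relation.

```python
users = [("aaron", 31), ("amy", 22), ("zach", 19)]      # small relation
page_blocks = [[("aaron", "a.com"), ("amr", "b.com")],  # one block per map task
               [("amy", "c.com"), ("barb", "d.com")]]

# build the in-memory lookup table; every "map task" receives a full copy
table = {}
for name, age in users:
    table.setdefault(name, []).append(age)

joined = []
for block in page_blocks:                 # each map streams its Pages block
    for user, url in block:
        for age in table.get(user, []):   # probe the replicated table
            joined.append((user, age, url))

assert joined == [("aaron", 31, "a.com"), ("amy", 22, "c.com")]
```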

Group/cogroup optimizations
On sorted and collected data
grp = group Users by name using 'collected';

[Diagram: the input is sorted and collected on name, so all tuples for a key live in the same block; map 1 forms the groups for aaron, aaron, barney and map 2 for carol onward, with no reduce phase.]
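Because "collected" data keeps all tuples of a key contiguous within one block, each map can emit complete groups on its own. A toy Python sketch (not Pig code):

```python
# collected data: every key's tuples sit contiguously in a single block,
# so each map task forms complete groups with no shuffle
blocks = [[("aaron", 1), ("aaron", 2), ("barney", 3)],
          [("carol", 4), ("carol", 5)]]

groups = {}
for block in blocks:                   # each block handled by one map task
    current_key, bag = None, []
    for key, val in block:
        if key != current_key and current_key is not None:
            groups[current_key] = bag  # key changed: emit the finished group
            bag = []
        current_key = key
        bag.append(val)
    if current_key is not None:
        groups[current_key] = bag      # flush the last group in the block

assert groups == {"aaron": [1, 2], "barney": [3], "carol": [4, 5]}
```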

Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';

Plan: load A -> filter B -> split -> (group C1 -> eval udf -> store into 'bydemo') and (group C2 -> eval udf -> store into 'bystate')

Multi-Store Map-Reduce Plan


[Diagram: a single MR job. Map side: filter -> split -> one local rearrange per grouping key. Reduce side: a multiplex operator routes each record to the matching package -> foreach pipeline, one per store.]
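The payoff of the multi-store plan is that one scan of the input feeds both outputs. A minimal Python sketch of that single-pass idea (toy model; the sample rows are invented):

```python
from collections import defaultdict

users = [("fred", 25, "M", "CA"),
         ("jane", 22, "F", "NY"),
         (None, 30, "M", "CA")]

by_demo = defaultdict(int)
by_state = defaultdict(int)
for name, age, gender, state in users:  # one scan feeds both stores
    if name is None:                     # shared filter (name is not null)
        continue
    by_demo[(age, gender)] += 1          # aggregation for 'bydemo'
    by_state[state] += 1                 # aggregation for 'bystate'

assert dict(by_demo) == {(25, "M"): 1, (22, "F"): 1}
assert dict(by_state) == {"CA": 1, "NY": 1}
```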

Memory Management
Use disk if large objects don't fit into memory
Setting the JVM limit above physical memory gives very poor performance
Spilling on memory-threshold notification from the JVM is unreliable
Instead: a pre-set size limit for large bags, with custom spill logic for different bag types, e.g. the distinct bag
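The pre-set-limit idea can be sketched as a toy spillable bag in Python (illustrative only; Pig's real bags track estimated byte size rather than item count, and the class name here is invented):

```python
import pickle
import tempfile

class SpillableBag:
    """Toy bag that spills its contents to disk past a pre-set item limit."""

    def __init__(self, memory_limit=1000):
        self.memory_limit = memory_limit
        self.items = []        # in-memory portion
        self.spill_files = []  # on-disk portions

    def add(self, item):
        self.items.append(item)
        if len(self.items) >= self.memory_limit:
            self._spill()

    def _spill(self):
        # write the in-memory items to a temp file and start fresh
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(self.items, f)
        f.close()
        self.spill_files.append(f.name)
        self.items = []

    def __iter__(self):
        # stream spilled chunks first, then whatever is still in memory
        for path in self.spill_files:
            with open(path, "rb") as f:
                yield from pickle.load(f)
        yield from self.items

bag = SpillableBag(memory_limit=3)
for i in range(8):
    bag.add(i)
assert sorted(bag) == list(range(8))
assert len(bag.spill_files) == 2  # spilled twice at the 3-item threshold
```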

Other optimizations
Aggressive use of the combiner and secondary sort
Lazy deserialization in loaders
Better serialization format
Faster regex library, compiled patterns
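The compiled-pattern point is easy to show in Python (the pattern and data here are invented examples): compiling once and reusing the pattern avoids re-processing the regex string on every record.

```python
import re

lines = ["user=fred age=25", "user=jane age=22"]

# compile the pattern once, outside the per-record loop,
# instead of passing the pattern string to re.match on every record
pattern = re.compile(r"user=(\w+) age=(\d+)")

users = [m.group(1) for m in map(pattern.match, lines) if m]
ages = [int(m.group(2)) for m in map(pattern.match, lines) if m]
assert users == ["fred", "jane"]
assert ages == [25, 22]
```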

Future optimization work


Improve memory management
Join + group in a single MR job, if the same keys are used
Even better skew handling
Adaptive optimizations
Automated Hadoop tuning

Pig - fast and flexible


More flexibility in 0.8, 0.9:
UDFs in scripting languages (Python)
MR job as a relation
Relation as a scalar
Turing-complete Pig (0.9)

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Further reading
Docs: http://pig.apache.org/docs/r0.7.0/
Papers and talks: http://wiki.apache.org/pig/PigTalksPapers
Training videos on vimeo.com (search "hadoop pig")
