What is Pig?
Pig has two parts:
Pig Latin, a high-level data processing language.
An engine that executes Pig Latin locally or on a Hadoop cluster.
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
[Figure: two charts, each comparing Hadoop and Pig]
Pig performance
PigMix: Pig vs. MapReduce
Logical Optimizations
Script (A = load; B = foreach; C = filter)
-> Parser
-> Logical Plan (A -> B -> C)
-> Logical Optimizer
-> Optimized Logical Plan (A -> C -> B)
Restructure the given logical dataflow graph:
  Apply filter, project, and limit early
  Merge foreach and filter statements
  Operator rewrites
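The A -> B -> C to A -> C -> B rewrite above can be sketched in a few lines. This is only an illustration under stated assumptions: the plan representation (a list of dicts with the hypothetical keys op, uses and passes_through) and the function push_filters_early are invented here and are not Pig's internal classes.

```python
# Toy plan optimizer: move a filter ahead of a foreach when that is safe.
# The dict-based plan format is hypothetical, not Pig's logical plan API.

def push_filters_early(plan):
    """Swap each filter with a preceding foreach when the filter only
    reads fields that the foreach passes through unchanged."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            first, second = plan[i], plan[i + 1]
            if (first["op"] == "foreach" and second["op"] == "filter"
                    and set(second["uses"]) <= set(first["passes_through"])):
                plan[i], plan[i + 1] = second, first
                changed = True
    return plan

plan = [
    {"name": "A", "op": "load"},
    {"name": "B", "op": "foreach", "passes_through": ["name", "age"]},
    {"name": "C", "op": "filter", "uses": ["age"]},
]
optimized = push_filters_early(plan)
print(" -> ".join(step["name"] for step in optimized))  # A -> C -> B
```

Running the filter first means the foreach processes fewer rows, which is the point of applying filters early.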
Physical Optimizations
Optimized Logical Plan (X -> Y -> Z)
-> Translator
-> Phy/MR Plan: M(PX-PYm) R(PYr) -> M(Z)
-> Optimizer
-> Optimized Phy/MR Plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
Physical plan: a sequence of MapReduce jobs made up of physical operators.
Physical optimizations come from two sources:
  Built-in rules, e.g. use of the combiner
  Hints specified in the query, e.g. join type
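The combiner rule, the C(PYc) stage in the optimized plan, can be illustrated with a count that is split into map, combine and reduce parts. This is a minimal sketch in plain Python, not Pig code; the function names and the sample blocks are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(records):
    # Map part of a count: emit (key, 1) for every record.
    return [(name, 1) for name in records]

def combine(pairs):
    # Combiner: pre-aggregate partial counts inside one map task.
    partial = defaultdict(int)
    for key, n in pairs:
        partial[key] += n
    return list(partial.items())

def reduce_phase(all_pairs):
    # Reduce part: sum whatever partial counts arrive for each key.
    totals = defaultdict(int)
    for key, n in all_pairs:
        totals[key] += n
    return dict(totals)

# Two hypothetical input blocks, one per map task.
map_inputs = [["fred", "fred", "jane"], ["fred", "jane", "jane"]]

without_combiner = [p for block in map_inputs for p in map_phase(block)]
with_combiner = [p for block in map_inputs for p in combine(map_phase(block))]

print(len(without_combiner), len(with_combiner))  # records sent to reducers
print(reduce_phase(with_combiner))
```

Both paths give the same totals, but the combiner path sends fewer records from the map tasks to the reducers.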
Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
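[Diagram: map tasks tag each record with its input number and join key, e.g. Map 2 reads Users block m and emits (2, name). All records with the same key reach the same reducer: Reducer 1 receives (1, fred) (2, fred) (2, fred); Reducer 2 receives (1, jane) (2, jane) (2, jane).]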
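The tag-and-shuffle flow in the hash join diagram can be simulated in memory. The sketch below is an assumption-laden illustration, not Pig's implementation: the tag values follow the diagram (1 for Pages, 2 for Users), while partition, hash_join and the sample data are invented for the example.

```python
import zlib
from collections import defaultdict

PAGES_TAG, USERS_TAG = 1, 2

def partition(key, num_reducers):
    # Deterministic stand-in for the shuffle's hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def hash_join(users, pages, num_reducers=2):
    # Map side: tag each record with its input and route it by join key.
    shuffled = [defaultdict(list) for _ in range(num_reducers)]
    for user, url in pages:
        shuffled[partition(user, num_reducers)][user].append((PAGES_TAG, (user, url)))
    for name, age in users:
        shuffled[partition(name, num_reducers)][name].append((USERS_TAG, (name, age)))
    # Reduce side: for each key, cross the Users records with the Pages records.
    joined = []
    for reducer_input in shuffled:
        for tagged in reducer_input.values():
            left = [rec for tag, rec in tagged if tag == USERS_TAG]
            right = [rec for tag, rec in tagged if tag == PAGES_TAG]
            joined.extend(u + p for u in left for p in right)
    return joined

users = [("fred", 25), ("jane", 30)]
pages = [("fred", "p1"), ("fred", "p2"), ("jane", "p3"), ("amy", "p4")]
result = sorted(hash_join(users, pages))
print(result)
```

Because every record for a key lands on one reducer, a very frequent key overloads that reducer.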
Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
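[Diagram: Map 2 reads Users block m and emits (2, name). A box labeled SP sits between the map tasks and the reducers. Records for the key fred are spread over both reducers: Reducer 1 receives (1, fred, p1) (1, fred, p2) (2, fred); Reducer 2 receives (1, fred, p3) (1, fred, p4) (2, fred).]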
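The idea in the skew join diagram is to spread the Pages records of a heavy key over several reducers and send the matching Users record to each of them. The following is a rough sketch under assumptions: an exact count stands in for sampling, and skew_join, hot_threshold and the round-robin assignment are hypothetical choices made for the example, not Pig's actual partitioning logic.

```python
import zlib
from collections import Counter, defaultdict
from itertools import count

def skew_join(pages, users, num_reducers=2, hot_threshold=3):
    def partition(key):
        return zlib.crc32(key.encode("utf-8")) % num_reducers

    # Stand-in for the sampling step: count Pages records per key.
    freq = Counter(user for user, _ in pages)
    hot = {key for key, n in freq.items() if n >= hot_threshold}

    reducers = [defaultdict(lambda: {"pages": [], "users": []})
                for _ in range(num_reducers)]
    next_slot = count()
    for user, url in pages:
        # Hot keys are dealt out round-robin; other keys are hashed as usual.
        r = next(next_slot) % num_reducers if user in hot else partition(user)
        reducers[r][user]["pages"].append((user, url))
    for name, age in users:
        # Users records for hot keys are copied to every reducer.
        targets = range(num_reducers) if name in hot else [partition(name)]
        for r in targets:
            reducers[r][name]["users"].append((name, age))

    joined = []
    for reducer_input in reducers:
        for sides in reducer_input.values():
            joined.extend(p + u for p in sides["pages"] for u in sides["users"])
    loads = [sum(len(s["pages"]) for s in r.values()) for r in reducers]
    return joined, loads

pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 25), ("jane", 30)]
joined, loads = skew_join(pages, users)
print(sorted(joined))
print(loads)  # Pages records handled by each reducer
```

Each Pages record still goes to exactly one reducer, so the output has no duplicates even though the fred record from Users is replicated.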
Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
[Diagram: Pages and Users are both sorted by key, aaron through zach. Map 1 reads the Pages range aaron to amr and starts reading Users at aaron; Map 2 reads the Pages range amy to barb and starts reading Users at amy.]
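Within one map task, a merge join walks both sorted inputs in step. The sketch below shows only that merge step on in-memory lists; how a map task locates its starting position in Users is not modeled, and merge_join and the sample data are assumptions made for the example.

```python
def merge_join(pages, users):
    """Join two lists that are already sorted by key.
    pages holds (user, url) tuples; users holds (name, age) tuples."""
    joined = []
    i = j = 0
    while i < len(pages) and j < len(users):
        pkey, ukey = pages[i][0], users[j][0]
        if pkey < ukey:
            i += 1
        elif pkey > ukey:
            j += 1
        else:
            # Find the run of equal keys on the Users side.
            j_end = j
            while j_end < len(users) and users[j_end][0] == pkey:
                j_end += 1
            # Pair every Pages record for this key with that run.
            while i < len(pages) and pages[i][0] == pkey:
                joined.extend(pages[i] + u for u in users[j:j_end])
                i += 1
            j = j_end
    return joined

pages = [("aaron", "p1"), ("aaron", "p2"), ("amy", "p3"), ("zach", "p4")]
users = [("aaron", 20), ("barb", 30), ("zach", 40)]
merged = merge_join(pages, users)
print(merged)
```

No shuffle or reduce phase is needed, which is why the join can run entirely in the map tasks when both inputs are sorted.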
Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
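[Diagram: Pages is the larger input (aaron through zach); Users is the smaller input (aaron . zach). Map 1 reads the Pages range aaron to amr together with all of Users; Map 2 reads the Pages range amy to barb together with all of Users.]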
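In the replicated join diagram, every map task holds a full copy of the small input. A minimal sketch of that idea, assuming Users fits in memory; replicated_join, page_splits and the sample data are hypothetical names used only here.

```python
from collections import defaultdict

def replicated_join(page_splits, users):
    joined = []
    for split in page_splits:  # one loop iteration stands for one map task
        # Each map task loads the whole (small) Users input into memory.
        lookup = defaultdict(list)
        for name, age in users:
            lookup[name].append((name, age))
        # Stream the Pages split and probe the in-memory table.
        for user, url in split:
            joined.extend((user, url) + u for u in lookup.get(user, []))
    return joined

page_splits = [
    [("aaron", "p1"), ("amr", "p2")],   # Map 1
    [("amy", "p3"), ("barb", "p4")],    # Map 2
]
users = [("aaron", 20), ("barb", 30), ("zach", 40)]
replicated = replicated_join(page_splits, users)
print(replicated)
```

As with the merge join, no reduce phase is needed; the cost is one full read of the small input per map task.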
Group/cogroup optimizations
On sorted and collected data
grp = group Users by name using 'collected';
[Diagram: Pages is sorted by key: aaron aaron barney carol ... zach. Map 1 reads aaron aaron barney; Map 2 reads carol onward, so each key falls entirely within one map task.]
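When all records for a key are contiguous and sit inside one map task's input, the grouping can be done in the map task itself. A small sketch of that case using itertools.groupby; collected_group and the sample split are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

def collected_group(split):
    """Group one map task's records, assuming all records for a key are
    contiguous and wholly inside this split."""
    return [(key, list(records))
            for key, records in groupby(split, key=itemgetter(0))]

map_1_split = [("aaron", "p1"), ("aaron", "p2"), ("barney", "p3")]
groups = collected_group(map_1_split)
print(groups)
```

If a key's records were scattered, groupby would emit that key more than once, which is why this approach depends on the data layout.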
Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by (age, gender);
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
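[Diagram: A: load -> B: filter, then the plan splits into two branches: group -> eval udf -> store into bydemo, and group -> eval udf -> store into bystate.]
[Diagram: each branch has its own local rearrange; in the reduce phase a multiplex operator routes records to a package and foreach for each branch.]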
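The multi-store plan reads and filters the input once and serves both group-by branches from the same job. The sketch below imitates that with a branch tag on each shuffled key; it is an in-memory illustration under assumptions, and run_multi_store, the tag values and the sample rows are invented for the example.

```python
from collections import defaultdict

def run_multi_store(users):
    # Map: load and filter once, then emit one tagged key per branch.
    shuffled = defaultdict(int)
    for name, age, gender, city, state in users:
        if name is None:
            continue
        shuffled[(0, (age, gender))] += 1   # branch 0: group by (age, gender)
        shuffled[(1, state)] += 1           # branch 1: group by state
    # Reduce: demultiplex on the branch tag and fill each output.
    bydemo, bystate = {}, {}
    for (branch, key), n in shuffled.items():
        (bydemo if branch == 0 else bystate)[key] = n
    return bydemo, bystate

rows = [
    ("fred", 25, "m", "sf", "ca"),
    ("jane", 25, "f", "la", "ca"),
    (None, 30, "m", "ny", "ny"),
    ("amy", 25, "f", "sd", "ca"),
]
bydemo, bystate = run_multi_store(rows)
print(bydemo)
print(bystate)
```

Without this optimization the input would be scanned and filtered once per store statement.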
Memory Management
Use disk if large objects don't fit into memory
JVM limit > physical memory: very poor performance
Spill on memory threshold notification from JVM: unreliable
Pre-set limit for large bags
Custom spill logic for different bags, e.g. distinct bag
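The pre-set limit approach can be pictured as a bag that writes its contents to disk once it holds a fixed number of items. This is a simplified sketch, not Pig's bag implementation: SpillableBag, max_in_memory and the pickle-based spill format are assumptions made for the example.

```python
import pickle
import tempfile

class SpillableBag:
    """Bag that keeps at most max_in_memory items in memory and spills
    full batches to temporary files."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.in_memory = []
        self.spill_files = []

    def add(self, item):
        self.in_memory.append(item)
        if len(self.in_memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the current batch to disk and free the in-memory list.
        f = tempfile.TemporaryFile()
        pickle.dump(self.in_memory, f)
        self.spill_files.append(f)
        self.in_memory = []

    def __iter__(self):
        # Read spilled batches back in order, then the in-memory tail.
        for f in self.spill_files:
            f.seek(0)
            yield from pickle.load(f)
        yield from self.in_memory

bag = SpillableBag(max_in_memory=3)
for value in range(8):
    bag.add(value)
print(len(bag.spill_files), list(bag))
```

A fixed limit is predictable, unlike waiting for a memory notification, though choosing the limit is left to configuration. A distinct bag would need different spill logic, for example sorting and de-duplicating each batch before writing it.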
Other optimizations
Aggressive use of combiner, secondary sort
Lazy deserialization in loaders
Better serialization format
Faster regex library, compiled patterns
Further reading
Docs - http://pig.apache.org/docs/r0.7.0/
Papers and talks - http://wiki.apache.org/pig/PigTalksPapers
Training videos on vimeo.com (search "hadoop pig")