Академический Документы
Профессиональный Документы
Культура Документы
P.J. McBrien
Imperial College London
1/1
Introduction
LAN
S1
H1
S2
Sn1
...
Hn1
H2
Sn
Hn
2/1
MapReduce
MapReduce
data
nodes
D4
load
map
nodes
M5
M4
D3
M3
D2
D1
M2
M1
Big Data: Pig Latin
shuffle
reduce
nodes
R3
R2
R1
3/1
MapReduce
M2
simply cannot make up their minds,
or they cannot get the Prime
Minister to make up his mind. So
they go on in strange paradox,
decided only to be undecided,
resolved to be irresolute, adamant
for drift, solid for fluidity,
all-powerful to be impotent.
4/1
MapReduce
5/1
MapReduce
(adamant,1) (admiralty,1)
(all-powerful,1) (always,1) (am,1)
(anyone,1) (are,1) (assured,1) (be,3)
(can,1) (cannot,2) (decided,1)
(drift,1) (entirely,1) (even,1)
(everything,1) (farther,1) (first,1)
(fluid,1) (fluidity,1) (for,2) (get,1)
(go,1) (government,1) (he,2) (his,2)
(i,1) (impotent,1) (in,2)
(irresolute,1) (is,3)
6/1
MapReduce
Combine
Often (and in particular for aggregate operators on grouped data), the Reduce
process may be partially calculated on the Map nodes. Such a partial Reduce process
is called a Combine operations.
For example Sum(B) = Sum(B1 ) + Sum(B2 ) if B = B1 B2 using bag semantics
7/1
MapReduce
8/1
MapReduce
(adamant,1) (admiralty,1)
(all-powerful,1) (always,1) (am,1)
(anyone,1) (are,1) (assured,1) (be,3)
(can,1) (cannot,2) (decided,1)
(drift,1) (entirely,1) (even,1)
(everything,1) (farther,1) (first,1)
(fluid,1) (fluidity,1) (for,2) (get,1)
(go,1) (government,1) (he,2) (his,2)
(i,1) (impotent,1) (in,2)
(irresolute,1) (is,3)
9/1
MapReduce
MapReduce Implementations
10 / 1
Pig Latin
LOAD
The LOAD operator makes available a data source as a relation.
11 / 1
Pig Latin
LOAD
The LOAD operator makes available a data source as a relation.
account.tsv
100[tab]current[tab]McBrien, P.[tab][tab]67
101[tab]deposit[tab]McBrien, P.[tab]5.25[tab]67
103[tab]current[tab]Boyd, M.[tab][tab]34
107[tab]current[tab]Poulovassilis, A.[tab][tab]56
119[tab]deposit[tab]Poulovassilis, A.[tab]5.50[tab]56
125[tab]current[tab]Bailey, J.[tab][tab]56
11 / 1
Pig Latin
LOAD
The LOAD operator makes available a data source as a relation.
account.tsv
100[tab]current[tab]McBrien, P.[tab][tab]67
101[tab]deposit[tab]McBrien, P.[tab]5.25[tab]67
103[tab]current[tab]Boyd, M.[tab][tab]34
107[tab]current[tab]Poulovassilis, A.[tab][tab]56
119[tab]deposit[tab]Poulovassilis, A.[tab]5.50[tab]56
125[tab]current[tab]Bailey, J.[tab][tab]56
account =
LOAD / v o l / automed / d a t a / b a n k b r a n c h / a c c o u n t . t s v
AS ( no : i n t , t y p e : c h a r a r r a y , cname : c h a r a r r a y , r a t e : f l o a t , s o r t c o d e : i n t ) ;
11 / 1
Pig Latin
copy account.pig
account =
LOAD / v o l / automed/ d a t a / b a n k b r a n c h / a c c o u n t . t s v
AS ( no : i n t , t y p e : c h a r a r r a y , cname : c h a r a r r a y , r a t e : f l o a t , s o r t c o d e : i n t ) ;
STORE a c c o u n t INTO a c c o u n t c o p y USING P i g S t o r a g e ( , ) ;
Non-interactive
p i g x l o c a l c o p y a c c o u n t . p i g
12 / 1
Pig Latin
copy account.pig
account =
LOAD / v o l / automed/ d a t a / b a n k b r a n c h / a c c o u n t . t s v
AS ( no : i n t , t y p e : c h a r a r r a y , cname : c h a r a r r a y , r a t e : f l o a t , s o r t c o d e : i n t ) ;
STORE a c c o u n t INTO a c c o u n t c o p y USING P i g S t o r a g e ( , ) ;
Interactive
p i g x l o c a l
g r u n t>a c c o u n t =
LOAD / v o l / automed/ d a t a / b a n k b r a n c h / a c c o u n t . t s v
AS ( no : i n t , t y p e : c h a r a r r a y , cname : c h a r a r r a y , r a t e : f l o a t , s o r t c o d e : i n t ) ;
g r u n t>STORE a c c o u n t INTO a c c o u n t c o p y USING P i g S t o r a g e ( , ) ;
12 / 1
Pig Latin
copy account.pig
account =
LOAD / v o l / automed/ d a t a / b a n k b r a n c h / a c c o u n t . t s v
AS ( no : i n t , t y p e : c h a r a r r a y , cname : c h a r a r r a y , r a t e : f l o a t , s o r t c o d e : i n t ) ;
STORE a c c o u n t INTO a c c o u n t c o p y USING P i g S t o r a g e ( , ) ;
12 / 1
Pig Latin
Union
Difference
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
13 / 1
Pig Latin
Union
Difference
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
Project
FOREACH haliasi GENERATE hcolnamei,. . .
Projects certain column names from an alias
sortcode account
a c c o u n t s o r t c o d e b a g=
FOREACH a c c o u n t
GENERATE s o r t c o d e ;
a c c o u n t s o r t c o d e=
DISTINCT a c c o u n t s o r t c o d e b a g ;
13 / 1
Pig Latin
Union
Difference
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
Select
FILTER haliasi BY hpredicatei
Only passes those tuples in haliasi that match the hpredicatei
rate>0 account
a c c o u n t w i t h r a t e=
FILTER a c c o u n t
BY r a t e > 0 . 0 ;
13 / 1
Pig Latin
Union
Difference
no
100
101
103
107
119
125
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
type
current
deposit
current
current
deposit
current
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
Product
CROSS haliasi,haliasi
Produce the Cartesian product of two relations
13 / 1
Pig Latin
Union
Difference
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
Join
branch
rate>0 account
branch with interest account =
JOIN b r a n c h BY b r a n c h : : s o r t c o d e ,
a c c o u n t w i t h r a t e BY a c c o u n t w i t h r a t e : : s o r t c o d e ;
13 / 1
Pig Latin
Union
Difference
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
Union
UNION haliasi,haliasi
Perform a bag based union between two relations
13 / 1
Pig Latin
Union
Difference
no
100
101
103
107
119
125
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
type
current
deposit
current
current
deposit
current
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
Difference
No direct implementation. Can achieve the same result by performing a LEFT join,
and then elimating rows with null values.
no account no movement
a cco u n t a n d m o vem en t=
JOIN a c c o u n t BY no LEFT ,
movement BY no ;
a c c o u n t w i t h o u t m o v e m e n t=
FILTER a cco u n t a n d m o vem en t
BY movement : : no I S NULL ;
a c c o u n t n o w i t h o u t m o v e m e n t=
FOREACH a c c o u n t w i t h o u t m o v e m e n t
GENERATE no
P.J. McBrien (Imperial College London)
13 / 1
Pig Latin
sortcode
56
34
67
branch
bname
Wimbledon
Goodge St
Strand
cash
94340.45
8900.67
34005.00
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
a = FILTER a c c o u n t BY t y p e== c u r r e n t ;
ap = FOREACH a GENERATE no , s o r t c o d e ;
B
ap
no sortcode
100
67
103
34
107
56
C
ap
no sortcode
100
67
107
56
D
ap
sortcode
67
34
56
56
14 / 1
Pig Latin
sortcode
56
34
67
branch
bname
Wimbledon
Goodge St
Strand
cash
94340.45
8900.67
34005.00
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
a = FILTER b r a n c h BY ca s h <50000;
b = FILTER a c c o u n t BY t y p e== d e p o s i t ;
ab = JOIN a BY s o r t c o d e , b BY s o r t c o d e ;
abp = FOREACH ab GENERATE a : : s o r t c o d e AS s o r t c o d e ;
B
abp
sortcode
56
C
abp
sortcode
67
D
abp
sortcode
34
15 / 1
Pig Latin
B
sortcode (cash<50000 branch type=deposit account)
C
sortcode cash<50000 branch sortcode type=deposit account
D
sortcode cash<50000 branch sortcode type=deposit account
P.J. McBrien (Imperial College London)
16 / 1
Pig Latin
branch
bname
cash
Wimbledon 94340.45
Goodge St 8900.67
Strand
34005.00
no
100
101
100
107
103
100
107
101
119
movement
amount
tdate
2300.00 5/1/1999
4000.00 5/1/1999
-223.45 8/1/1999
-100.00 11/1/1999
145.50 12/1/1999
10.23 15/1/1999
345.56 15/1/1999
1230.00 15/1/1999
5600.00 18/1/1999
no
100
101
103
107
119
125
key
key
key
key
account
cname
rate sortcode
McBrien, P.
NULL
67
McBrien, P.
5.25
67
Boyd, M.
NULL
34
Poulovassilis, A. NULL
56
Poulovassilis, A. 5.50
56
Bailey, J.
NULL
56
type
current
deposit
current
current
deposit
current
branch(sortcode)
branch(bname)
movement(mid)
account(no)
fk
movement(no) account(no)
fk
account(sortcode) branch(sortcode)
no movement
17 / 1
Pig Latin
no movement
movement no bag =
FOREACH movement
GENERATE no ;
movement no =
DISTINCT movement no bag ;
DUMP movement no ;
18 / 1
Pig Latin
19 / 1
Pig Latin
20 / 1
Pig Latin
mid
1000
1001
1002
1004
1005
1006
1007
1008
1009
no
100
101
100
107
103
100
107
101
119
movement
amount
2300.00
4000.00
-223.45
-100.00
145.50
10.23
345.56
1230.00
5600.00
tdate
5/1/1999
5/1/1999
8/1/1999
11/1/1999
12/1/1999
15/1/1999
15/1/1999
15/1/1999
18/1/1999
21 / 1
Pig Latin
group
100
101
103
107
119
account movements
movement
{h1000,100,2300.0,1999-01-05i,h1002,100,-223.45,1999-01-08i,h1006,100,10.23,1999-01-15i}
{h1001,101,4000.0,1999-01-05i,h1008,101,1230.0,1999-01-15i}
{h1005,103,145.5,1999-01-12i}
{h1004,107,-100.0,1999-01-11i,h1007,107,345.56,1999-01-15i}
{h1009,119,5600.0,1999-01-18i}
21 / 1
Pig Latin
movement copy =
FOREACH a cco u n t m o vem en ts
GENERATE FLATTEN ( movement ) ;
mid
1000
1001
1002
1004
1005
1006
1007
1008
1009
movement copy
no
amount
100
2300.00
101
4000.00
100
-223.45
107
-100.00
103
145.50
100
10.23
107
345.56
101
1230.00
119
5600.00
tdate
5/1/1999
5/1/1999
8/1/1999
11/1/1999
12/1/1999
15/1/1999
15/1/1999
15/1/1999
18/1/1999
21 / 1
Pig Latin
account balance =
FOREACH a cco u n t m o vem en ts
GENERATE g r o u p AS no ,
SUM( movement . amount ) AS b a l a n c e ;
account balance
no
balance
100
2086.78
101
5230.00
103
145.50
107
245.56
119
5600.00
21 / 1
Pig Latin
Function
int COUNT(bag)
int COUNT STAR(bag)
double AVG(bag)
double MAX(bag)
double MIN(bag)
double SUM(bag)
bag DIFF(bag a,bag b)
Result
Returns
Returns
values).
Returns
Returns
Returns
Returns
Returns
22 / 1
Pig Latin
branch
bname
Wimbledon
Goodge St
Strand
cash
94340.45
8900.67
34005.00
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
B
abr
group no mv
100
3
101
2
103
1
107
2
119
1
125
1
C
abr
group no mv
100
3
101
2
103
1
107
2
119
1
125
0
Big Data: Pig Latin
D
abr
group no mv
100
3
101
2
103
1
107
2
119
1
23 / 1
Pig Latin
mid
1000
1001
1002
1004
1005
1006
1007
1008
1009
no
100
101
100
107
103
100
107
101
119
movement
amount
2300.00
4000.00
-223.45
-100.00
145.50
10.23
345.56
1230.00
5600.00
tdate
5/1/1999
5/1/1999
8/1/1999
11/1/1999
12/1/1999
15/1/1999
15/1/1999
15/1/1999
18/1/1999
24 / 1
Pig Latin
group
100
101
103
107
119
account movements
movement
{h100,2300.0i,h100,-223.45i,h100,10.23i}
{h101,4000.0i,h101,1230.0i}
{h1103,145.5,i}
{h107,-100.0i,h107,345.56i}
{h119,5600.0i}
24 / 1
Pig Latin
24 / 1
Pig Latin
24 / 1
Pig Latin
Nested Statements
SQL Query to find total of credits and of debits
SELECT
a c c o u n t . no ,
COUNT( movement . mid ) AS n o t r a n s ,
SUM(CASE WHEN amount >0.0 THEN amount ELSE 0 . 0 END) AS c r e d i t ,
SUM(CASE WHEN amount <0.0 THEN amount ELSE 0 . 0 END) AS d e b i t
FROM
a c c o u n t LEFT JOIN movement ON a c c o u n t . no=movement . no
GROUP BY a c c o u n t . no
no
100
103
119
125
107
101
no trans
3
1
1
0
2
2
credit
2310.23
145.50
5600.00
0.0
345.56
5230.00
debit
-223.45
0.0
0.0
0.0
-100.00
0.0
Powerful feature of SQL is that any statement that returns a value may appear in an
aggregate function.
25 / 1
Pig Latin
Nested Statements
SQL Query to find total of credits and of debits
SELECT
a c c o u n t . no ,
COUNT( movement . mid ) AS n o t r a n s ,
SUM(CASE WHEN amount >0.0 THEN amount ELSE 0 . 0 END) AS c r e d i t ,
SUM(CASE WHEN amount <0.0 THEN amount ELSE 0 . 0 END) AS d e b i t
FROM
a c c o u n t LEFT JOIN movement ON a c c o u n t . no=movement . no
GROUP BY a c c o u n t . no
25 / 1
Pig Latin
sortcode
56
34
67
mid
1000
1001
1002
1004
1005
1006
1007
1008
1009
branch
bname
cash
Wimbledon 94340.45
Goodge St 8900.67
Strand
34005.00
no
100
101
100
107
103
100
107
101
119
movement
amount
tdate
2300.00 5/1/1999
4000.00 5/1/1999
-223.45 8/1/1999
-100.00 11/1/1999
145.50 12/1/1999
10.23 15/1/1999
345.56 15/1/1999
1230.00 15/1/1999
5600.00 18/1/1999
no
100
101
103
107
119
125
key
key
key
key
type
current
deposit
current
current
deposit
current
account
cname
rate sortcode
McBrien, P.
NULL
67
McBrien, P.
5.25
67
Boyd, M.
NULL
34
Poulovassilis, A. NULL
56
Poulovassilis, A. 5.50
56
Bailey, J.
NULL
56
branch(sortcode)
branch(bname)
movement(mid)
account(no)
fk
movement(no) account(no)
fk
account(sortcode) branch(sortcode)
26 / 1
Pig Latin
withdrawl =
FILTER movement
BY
amount <0;
account with withdrawl bag =
JOIN a c c o u n t BY no ,
w i t h d r a w l BY no ;
account with withdrawl =
DISTINCT a c c o u n t w i t h w i t h d r a w l b a g ;
branch with withdrawl =
JOIN a c c o u n t w i t h w i t h d r a w l BY s o r t c o d e ,
b r a n c h BY s o r t c o d e ;
branch with withdrawl no =
FOREACH b r a n c h w i t h w i t h d r a w l
GENERATE bname , a c c o u n t : : no ;
DUMP b r a n c h w i t h w i t h d r a w l n o ;
P.J. McBrien (Imperial College London)
27 / 1
Pig Latin
branch =
LOAD / v o l / automed / d a t a / b a n k b r a n c h / b r a n c h . t s v
AS ( s o r t c o d e : i n t , bname : c h a r a r r a y , c a s h : f l o a t ) ;
account =
LOAD / v o l / automed / d a t a / b a n k b r a n c h / a c c o u n t . t s v
AS ( no : i n t , t y p e : c h a r a r r a y , cname : c h a r a r r a y , r a t e : f l o a t , s o r t c o d e : i n t ) ;
movement =
LOAD / v o l / automed / d a t a / b a n k b r a n c h / movement . t s v
AS ( mid : i n t , no : i n t , amount : d ou b l e , t d a t e : b y t e a r r a y ) ;
ac c ount m ove m e nt =
JOIN a c c o u n t BY no LEFT , movement BY no ;
customer details =
GROUP ac c ount m ove m e nt BY a c c o u n t : : cname ;
customer balance =
FOREACH c u s t o m e r d e t a i l s
GENERATE g r o u p AS cname , SUM( ac c ount m ove m e nt . movement : : amount ) AS b a l a n c e ;
DUMP
custom
e r b aCollege
l a n c eLondon)
;
P.J. McBrien
(Imperial
28 / 1
Pig Latin
branch . sortcode ,
b r a n c h . bname ,
COUNT( CASE WHEN t y p e= c u r r e n t THEN no ELSE NULL END) AS c u r r e n t ,
COUNT( CASE WHEN t y p e= d e p o s i t THEN no ELSE NULL END) AS d e p o s i t
FROM
a c c o u n t JOIN b r a n c h ON a c c o u n t . s o r t c o d e=b r a n c h . s o r t c o d e
GROUP BY b r a n c h . s o r t c o d e , b r a n c h . bname
ORDER BY b r a n c h . s o r t c o d e , b r a n c h . bname
branch account =
JOIN b r a n c h BY s o r t c o d e , a c c o u n t BY s o r t c o d e ;
branch detail =
GROUP b r a n c h a c c o u n t BY ( b r a n c h : : s o r t c o d e , b r a n c h : : bname ) ;
branch account types =
FOREACH b r a n c h d e t a i l {
current =
FILTER b r a n c h a c c o u n t
BY t y p e == c u r r e n t ;
deposit =
FILTER b r a n c h a c c o u n t
BY t y p e == d e p o s i t ;
GENERATE g r o u p . s o r t c o d e AS s o r t c o d e ,
g r o u p . bname AS bname ,
COUNT( c u r r e n t . no ) AS c u r r e n t ,
COUNT( d e p o s i t . no ) AS d e p o s i t ;
}
branch account types ordered =
ORDER b r a n c h a c c o u n t t y p e s
BY
s o r t c o d e , bname ;
P.J. McBrien (Imperial College London)
29 / 1
Pig Execution
Pig scripts are interpreted into a sequence of Hadoop Map, Combine, Shuffle, and
Reduce operations.
In general, a Pig script may require multiple MapReduce processes to be run.
Map and Combine processes run on nodes containing data.
Number of Reduce nodes used specified in the Pig script (and defaults to 1!)
Temporary files are used to allow output of one MapReduce process to be fed
back as input to another MapReduce process.
Projects (from GENERATE in Pig) are automatically pushed inside Joins, but
otherwise little optimisation is performed by the Pig interpreter.
30 / 1
Pig Execution
sortcode
56
34
67
branch
bname
Wimbledon
Goodge St
Strand
cash
94340.45
8900.67
34005.00
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
JOIN
DISTINCT
GENERATE
UNION
31 / 1
Pig Execution
Pig Operator
FILTER R BY A == val
FOREACH R GENERATE A, B, . . .
CROSS R, S
GROUP R BY A
JOIN R BY A, S BY B
JOIN R BY A LEFT OUTER, S BY B;
JOIN R BY A RIGHT OUTER, S BY B;
UNION R, S
Map or Reduce
Map
Map
Reduce
Combine,Reduce
Reduce
Reduce
Reduce
Reduce
32 / 1
Pig Joins
Types of Join
map
nodes
M5
s2
M4
s1
M3
r3
M2
r2
M1
r1
R3
h(r.a) K3
h(s.b) K3
R2
h(r.a) K2
h(s.b) K2
R1
h(r.a) K1
h(s.b) K1
33 / 1
Pig Joins
Types of Join
map
nodes
M5
s1
Replicated Joins
M4
r4
M3
r3
t u = JOIN r BY a , s BY b USING r e p l i c a t e d
replicate
M2
r2
M1
r1
33 / 1
Pig Joins
branch
bname
Wimbledon
Goodge St
Strand
cash
94340.45
8900.67
34005.00
no
100
101
103
107
119
125
type
current
deposit
current
current
deposit
current
account
cname
McBrien, P.
McBrien, P.
Boyd, M.
Poulovassilis, A.
Poulovassilis, A.
Bailey, J.
rate
NULL
5.25
NULL
NULL
5.50
NULL
sortcode
67
67
34
56
56
56
The size of branch is such it easily fits on one node, whilst account does not.
B
ba = JOIN account BY sortcode RIGHT, branch BY sortcode USING replicated;
C
ba = JOIN account BY sortcode LEFT, branch BY sortcode USING replicated;
D
ba = JOIN account BY sortcode, branch BY sortcode USING replicated;
P.J. McBrien (Imperial College London)
34 / 1