
Hive and Pig

Need for High-Level Languages


Hadoop is great for large-data processing!
But writing Java programs for everything is verbose and slow
Not everyone wants to (or can) write Java code

Solution: develop higher-level data processing languages


Hive: HQL is like SQL
Pig: Pig Latin is a bit like Perl

Hive and Pig


Hive: data warehousing application in Hadoop
Query language is HQL, variant of SQL
Tables stored on HDFS as flat files
Developed by Facebook, now open source

Pig: large-scale data processing system
Scripts are written in Pig Latin, a dataflow language
Developed by Yahoo!, now open source
Roughly 1/3 of all Yahoo! internal jobs

Common idea:
Provide higher-level language to facilitate large-data processing
Higher-level language compiles down to Hadoop jobs

Hive: Background
Started at Facebook
Data was collected by nightly cron jobs into Oracle DB
ETL via hand-coded Python
Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that

Source: cc-licensed slide by Cloudera

Hive Components
Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe


Data Model
Tables
Typed columns (int, float, string, boolean)
Also list and map types (for JSON-like data)

Partitions
For example, range-partition tables by date

Buckets
Hash partitions within ranges (useful for sampling, join optimization)
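The partition and bucket ideas above can be sketched in a few lines of Python. This is an illustration only, not Hive's implementation: the function names, the bucket count, the sample rows, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count for the illustration


def partition_for(event_date: str) -> str:
    """Range-partition by date: one partition per calendar day."""
    return event_date


def bucket_for(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash-partition within a partition (CRC32 is a stand-in hash)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


# Hypothetical rows: (date, user)
rows = [
    ("2009-03-01", "amy"),
    ("2009-03-01", "fred"),
    ("2009-03-02", "amy"),
]

# Where each row would land: (partition, bucket)
placement = [(partition_for(d), bucket_for(u)) for d, u in rows]

# Sampling: reading one bucket returns every row for the users hashed to it,
# without scanning the other buckets.
sample = [(d, u) for d, u in rows if bucket_for(u) == bucket_for("amy")]
```

Because the same key always hashes to the same bucket, two tables bucketed the same way on the join key can be joined bucket by bucket, which is the join optimization the slide refers to.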


Metastore
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other relational databases


Physical Layout
Warehouse directory in HDFS
E.g., /user/hive/warehouse

Tables stored in subdirectories of warehouse


Partitions form subdirectories of tables

Actual data stored in flat files


Control char-delimited text, or SequenceFiles
With custom SerDe, can use arbitrary format
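As a rough sketch of the layout described above, the following Python builds the directory that would hold a table or one of its partitions. The helper names, the `logs` table, and the `ds` partition column are hypothetical; the warehouse root is the example path from the slide, and partition subdirectories are written in the `name=value` form.

```python
from posixpath import join

WAREHOUSE = "/user/hive/warehouse"  # example warehouse root from the slide


def table_dir(table: str) -> str:
    """Tables are subdirectories of the warehouse directory."""
    return join(WAREHOUSE, table)


def partition_dir(table: str, **partition) -> str:
    """Each partition column adds one `name=value` subdirectory level."""
    levels = [f"{name}={value}" for name, value in partition.items()]
    return join(table_dir(table), *levels)


shakespeare_path = table_dir("shakespeare")
logs_path = partition_dir("logs", ds="2009-03-01")  # hypothetical table
```

The data files themselves would sit inside the directory returned here.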


Hive: Example
Hive looks similar to an SQL database
Relational join on two tables:
Table of word counts from Shakespeare collection
Table of word counts from the bible

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC
LIMIT 10;

word   s.freq   k.freq
the    25848    62394
I      23031    8854
and    19671    38985
to     18038    13526
of     16700    34654
a      14170    8057
you    12702    2720
my     11297    4135
in     10797    12445
is     8882     6884
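To make the semantics of the query concrete, here is a small Python sketch of the same join, filter, sort, and limit. It assumes each table fits in a dictionary and uses only the ten rows shown in the result as input (the real tables hold many more words); the function name is made up for the illustration.

```python
# Word counts taken from the result rows on the slide.
shakespeare = {"the": 25848, "I": 23031, "and": 19671, "to": 18038, "of": 16700,
               "a": 14170, "you": 12702, "my": 11297, "in": 10797, "is": 8882}
bible = {"the": 62394, "I": 8854, "and": 38985, "to": 13526, "of": 34654,
         "a": 8057, "you": 2720, "my": 4135, "in": 12445, "is": 6884}


def top_common_words(s, k, limit=10):
    """Inner join on word, keep freq >= 1 on both sides, sort by s.freq DESC."""
    joined = [
        (word, s_freq, k[word])
        for word, s_freq in s.items()
        if word in k and s_freq >= 1 and k[word] >= 1
    ]
    joined.sort(key=lambda row: row[1], reverse=True)
    return joined[:limit]


result = top_common_words(shakespeare, bible)
```

Hive produces the same rows without any of this code being written by hand; the query is compiled into MapReduce jobs instead.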

Source: Material drawn from Cloudera training VM

Hive: Behind the Scenes


SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC
LIMIT 10;

(Abstract Syntax Tree)

(TOK_QUERY
  (TOK_FROM
    (TOK_JOIN
      (TOK_TABREF shakespeare s)
      (TOK_TABREF bible k)
      (= (. (TOK_TABLE_OR_COL s) word) (. (TOK_TABLE_OR_COL k) word))))
  (TOK_INSERT
    (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT
      (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word))
      (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq))
      (TOK_SELEXPR (. (TOK_TABLE_OR_COL k) freq)))
    (TOK_WHERE
      (AND
        (>= (. (TOK_TABLE_OR_COL s) freq) 1)
        (>= (. (TOK_TABLE_OR_COL k) freq) 1)))
    (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq)))
    (TOK_LIMIT 10)))

(one or more MapReduce jobs)

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

Hive: Behind the Scenes


STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        s
          TableScan
            alias: s
            Filter Operator
              predicate:
                  expr: (freq >= 1)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: word
                      type: string
                sort order: +
                Map-reduce partition columns:
                      expr: word
                      type: string
                tag: 0
                value expressions:
                      expr: freq
                      type: int
                      expr: word
                      type: string
        k
          TableScan
            alias: k
            Filter Operator
              predicate:
                  expr: (freq >= 1)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: word
                      type: string
                sort order: +
                Map-reduce partition columns:
                      expr: word
                      type: string
                tag: 1
                value expressions:
                      expr: freq
                      type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          outputColumnNames: _col0, _col1, _col2
          Filter Operator
            predicate:
                expr: ((_col0 >= 1) and (_col2 >= 1))
                type: boolean
            Select Operator
              expressions:
                    expr: _col1
                    type: string
                    expr: _col0
                    type: int
                    expr: _col2
                    type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://localhost:8022/tmp/hive-training/364214370/10002
            Reduce Output Operator
              key expressions:
                    expr: _col1
                    type: int
              sort order:
              tag: -1
              value expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: int
                    expr: _col2
                    type: int
      Reduce Operator Tree:
        Extract
          Limit
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: 10
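The plan above is a reduce-side join followed by a sort-and-limit job. The Python sketch below imitates that shape in memory so the roles of the tags (0 for `s`, 1 for `k`) and of the two stages are easier to see. It is a simplification, not Hive's execution engine: the function names are invented, the shuffle is a plain dictionary, and the input rows are toy data (only "the" and "I" carry counts from the slides; the other rows are made up).

```python
from collections import defaultdict


def stage1_map(table_rows, tag):
    """Map side: filter freq >= 1 and emit (word, (tag, freq))."""
    for word, freq in table_rows:
        if freq >= 1:
            yield word, (tag, freq)


def shuffle(pairs):
    """Group map output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def stage1_reduce(groups):
    """Reduce side: inner join of tag-0 values with tag-1 values per word."""
    for word, values in groups.items():
        left = [freq for tag, freq in values if tag == 0]
        right = [freq for tag, freq in values if tag == 1]
        for s_freq in left:
            for k_freq in right:
                yield word, s_freq, k_freq


def stage2(rows, limit):
    """Second job: order by s.freq descending, then apply the limit."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


# Toy inputs; "zounds", "thou", and "begat" are invented rows.
s_rows = [("the", 25848), ("I", 23031), ("zounds", 0), ("thou", 5)]
k_rows = [("the", 62394), ("I", 8854), ("begat", 139)]

pairs = list(stage1_map(s_rows, 0)) + list(stage1_map(k_rows, 1))
result = stage2(stage1_reduce(shuffle(pairs)), limit=10)
```

Words that appear in only one table produce no output, because the inner join needs at least one value with each tag.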

Example Data Analysis Task


Find users who tend to visit good pages.
Visits
user   url                  time
Amy    www.cnn.com          8:00
Amy    www.crap.com         8:05
Amy    www.myblog.com       10:00
Amy    www.flickr.com       10:05
Fred   cnn.com/index.htm    12:00
...

Pages
url               pagerank
www.cnn.com       0.9
www.flickr.com    0.9
www.myblog.com    0.7
www.crap.com      0.2
...

Pig Slides adapted from Olston et al.

Conceptual Dataflow
Load Visits(user, url, time)
Canonicalize URLs
Load Pages(url, pagerank)

Join url = url

Group by user

Compute Average Pagerank

Filter avgPR > 0.5


System-Level Dataflow
[Figure: system-level dataflow. Visits branch: load, canonicalize. Pages branch: load. The two branches meet in a join by url, followed by group by user, compute average pagerank, and filter, producing the answer.]

MapReduce Code
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }
    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }
    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }

            oc.collect(key, new LongWritable(sum));
        }
    }
    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }
    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;
        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }
    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join,
            new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join,
            new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group,
            new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group,
            new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100,
            new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
            new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}


Pig Latin Script


Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;

Pages = load '/data/pages' as (url, pagerank);

VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;

store GoodUsers into '/data/good_users';
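For readers who do not know Pig Latin, the script can be read as the following in-memory Python. This is only a sketch of what each statement means, run on the sample rows from the example task: `canonicalize` is a hypothetical stand-in for the `Canonicalize` function the script calls (its real behavior is not given in the slides), and nothing here is distributed.

```python
from collections import defaultdict

# Sample rows from the "Example Data Analysis Task" slide.
visits = [
    ("Amy", "www.cnn.com", "8:00"),
    ("Amy", "www.crap.com", "8:05"),
    ("Amy", "www.myblog.com", "10:00"),
    ("Amy", "www.flickr.com", "10:05"),
    ("Fred", "cnn.com/index.htm", "12:00"),
]
pages = {"www.cnn.com": 0.9, "www.flickr.com": 0.9,
         "www.myblog.com": 0.7, "www.crap.com": 0.2}


def canonicalize(url):
    """Hypothetical canonicalization: drop the path, force a www. prefix."""
    host = url.split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host


def good_users(visits, pages, threshold=0.5):
    ranks_by_user = defaultdict(list)               # join + group by user
    for user, url, _time in visits:
        rank = pages.get(canonicalize(url))
        if rank is not None:                        # inner join drops misses
            ranks_by_user[user].append(rank)
    averages = {u: sum(r) / len(r) for u, r in ranks_by_user.items()}  # AVG
    return {u: avg for u, avg in averages.items() if avg > threshold}  # filter


result = good_users(visits, pages)
```

On the sample data both users pass the filter: Amy averages about 0.675 over four pages, and Fred's single visit resolves to www.cnn.com with pagerank 0.9.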


Java vs. Pig Latin


1/20 the lines of code
[Bar chart: lines of code, Hadoop vs. Pig]

1/16 the development time
[Bar chart: development time in minutes, Hadoop vs. Pig]

Performance on par with raw Hadoop!


Pig takes care of


Schema and type checking

Translating into efficient physical dataflow (i.e., sequence of one or more MapReduce jobs)

Exploiting data reduction opportunities (e.g., early partial aggregation via a combiner)

Executing the system-level dataflow (i.e., running the MapReduce jobs)

Tracking progress, errors, etc.
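Early partial aggregation is easiest to see with an average. The sketch below is plain Python, not Pig's code: each map task collapses its values into a (sum, count) pair, and the reducer merges the pairs. The function names and the split of Amy's four pageranks across two map tasks are assumptions for the illustration.

```python
def combine(values):
    """Map-side combiner: collapse raw values into one (sum, count) pair."""
    values = list(values)
    return (sum(values), len(values))


def reduce_average(partials):
    """Reducer: merge the partial pairs, then divide once at the end."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


map_task_1 = [0.9, 0.2]   # pageranks for one user seen by one map task
map_task_2 = [0.7, 0.9]   # pageranks for the same user seen by another

partials = [combine(map_task_1), combine(map_task_2)]
average = reduce_average(partials)
```

Only two small pairs cross the network instead of four raw values. Shipping (sum, count) rather than per-task averages matters: averaging the averages would give the wrong answer whenever the tasks see different numbers of values.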

Hive + HBase?

Integration
Reasons to use Hive on HBase:
A lot of data sits in HBase because it is used in a real-time environment, but it is never used for analysis
Give people who don't code (business analysts) access to HBase data that is usually only queried through MapReduce
When a more flexible storage solution is needed, so that rows can be updated live by either a Hive job or an application and the change is immediately visible to the other

Reasons not to do it:


Running SQL queries on HBase to answer live user requests (it's still an MR job)
Hoping to see interoperability with other SQL analytics systems

Integration
How it works:
Hive can use tables that already exist in HBase or manage its own ones, but they still all reside in the same HBase instance
[Diagram: Hive table definitions and HBase. One definition points to an existing table; another manages its table from Hive.]

Integration
How it works:
When using an already existing table, defined as EXTERNAL, you can create multiple Hive tables that point to it
[Diagram: Hive table definitions and HBase. One definition points to some column; another points to other columns, with different names.]

Integration
How it works:
Columns are mapped however you want, changing names and giving types

Hive table definition (persons)    HBase table (people)
name STRING                        d:fullname
age INT                            d:age
siblings MAP<string, string>       f:
(not mapped)                       d:address

Integration
Drawbacks (that can be fixed with brain juice):
Binary keys and values (like integers represented on 4 bytes) aren't supported since Hive prefers string representations (HIVE-1634)
Compound row keys aren't supported; there's no way of using multiple parts of a key as different fields
This means that concatenated binary row keys are completely unusable, which is what people often use for HBase
Filters are done at Hive level instead of being pushed to the region servers
Partitions aren't supported

Data Flows
Data is being generated all over the place:
Apache logs
Application logs
MySQL clusters
HBase clusters

Data Flows
Moving application log files
Wild log file -> Read nightly -> Transforms format -> Dumped into HDFS
Wild log file -> Tailed continuously -> Parses into HBase format -> Inserted into HBase

Data Flows
Moving MySQL data
MySQL -> Dumped nightly with CSV import -> HDFS
MySQL -> Tungsten replicator -> Parses into HBase format -> Inserted into HBase

Data Flows
Moving HBase data
HBase Prod -> CopyTable MR job (read in parallel) -> Imported in parallel into HBase MR

* HBase replication currently only works for a single slave cluster; in our case HBase replicates to a backup cluster.

Use Cases
Front-end engineers
They need some statistics regarding their latest product

Research engineers
Ad-hoc queries on user data to validate some assumptions
Generating statistics about recommendation quality

Business analysts
Statistics on growth and activity
Effectiveness of advertiser campaigns
User behavior vs. past activities to determine, for example, why certain groups react better to email communications
Ad-hoc queries on stumbling behaviors of slices of the user base

Use Cases
Using a simple table in HBase:

CREATE EXTERNAL TABLE blocked_users(
  userid INT,
  blockee INT,
  blocker INT,
  created BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:blockee,f:blocker,f:created")
TBLPROPERTIES("hbase.table.name" = "m2h_repl-userdb.stumble.blocked_users");

HBase is a special case here: it has a unique row key, which is mapped with :key
Not all the columns in the table need to be mapped
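The mapping string pairs each Hive column, in order, with either the row key or a `family:qualifier` cell. A minimal Python sketch of that pairing is shown below. It is not the storage handler's code: the function names are invented, cell values are kept as strings, and the sample row (including the unmapped cell) is made up.

```python
ROW_KEY = ":key"


def parse_columns_mapping(hive_columns, mapping):
    """Pair each Hive column with its entry in hbase.columns.mapping."""
    specs = [spec.strip() for spec in mapping.split(",")]
    if len(specs) != len(hive_columns):
        raise ValueError("one mapping entry is needed per Hive column")
    return dict(zip(hive_columns, specs))


def to_hive_record(row_key, cells, column_map):
    """Build one Hive row from an HBase row key and its cells."""
    record = {}
    for column, spec in column_map.items():
        record[column] = row_key if spec == ROW_KEY else cells.get(spec)
    return record


column_map = parse_columns_mapping(
    ["userid", "blockee", "blocker", "created"],
    ":key,f:blockee,f:blocker,f:created",
)

# Made-up HBase row; "f:unmapped" shows that extra cells are simply ignored.
record = to_hive_record(
    "1001",
    {"f:blockee": "2002", "f:blocker": "1001",
     "f:created": "1262304000", "f:unmapped": "x"},
    column_map,
)
```

The cell `f:unmapped` never reaches the Hive row, which matches the note that not every HBase column has to be mapped.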

Use Cases
Using a complicated table in HBase:

CREATE EXTERNAL TABLE ratings_hbase(
  userid INT,
  created BIGINT,
  urlid INT,
  rating INT,
  topic INT,
  modified BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b@0,:key#b@1,:key#b@2,default:rating#b,default:topic#b,default:modified#b")
TBLPROPERTIES("hbase.table.name" = "ratings_by_userid");

#b means binary, @ means position in composite key (SU-specific hack)
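To show what a composite binary row key looks like, the Python sketch below packs and unpacks the three key parts with `struct`. The byte layout is an assumption, since the slides do not specify it: a 4-byte `userid`, an 8-byte `created`, and a 4-byte `urlid`, big-endian and concatenated without separators. The helper names and sample values are hypothetical.

```python
import struct

# Assumed layout: int32 userid, int64 created, int32 urlid, big-endian.
KEY_FORMAT = ">iqi"


def encode_key(userid, created, urlid):
    """Concatenate the three parts into one 16-byte row key."""
    return struct.pack(KEY_FORMAT, userid, created, urlid)


def decode_key(raw):
    """Split a row key back into (userid, created, urlid) by position."""
    return struct.unpack(KEY_FORMAT, raw)


key = encode_key(42, 1262304000, 7)
parts = decode_key(key)
```

A key like this cannot be read as a string, which is why the stock integration of the time could not use it and the `#b` and `@` annotations were added. With big-endian encoding and non-negative ids, keys also sort by `userid` first, so all ratings for one user sit next to each other.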

Graph Databases


NEO4J (Graphbase)
A graph is a collection of nodes (things) and edges (relationships) that connect pairs of nodes.
Attach properties (key-value pairs) to nodes and relationships
Relationships connect two nodes, and both nodes and relationships can hold an arbitrary number of key-value pairs.
A graph database can be thought of as a key-value store with full support for relationships.
http://neo4j.org/
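The property-graph model described above can be sketched as a toy in-memory structure in Python. This is not the Neo4j API: the class, its method names, the `KNOWS` relationship type, and the sample people are all invented for the illustration.

```python
from collections import deque


class PropertyGraph:
    """Toy property graph: nodes and relationships both carry key-value pairs."""

    def __init__(self):
        self.nodes = {}          # node id -> property dict
        self.relationships = []  # (start id, type, end id, property dict)

    def add_node(self, **properties):
        node_id = len(self.nodes)
        self.nodes[node_id] = properties
        return node_id

    def relate(self, start, rel_type, end, **properties):
        self.relationships.append((start, rel_type, end, properties))

    def outgoing(self, node_id, rel_type):
        return [end for start, kind, end, _ in self.relationships
                if start == node_id and kind == rel_type]

    def traverse(self, start, rel_type, max_depth):
        """Breadth-first walk along one relationship type, up to max_depth hops."""
        seen, queue, found = {start}, deque([(start, 0)]), []
        while queue:
            node_id, depth = queue.popleft()
            if depth == max_depth:
                continue
            for nxt in self.outgoing(node_id, rel_type):
                if nxt not in seen:
                    seen.add(nxt)
                    found.append(nxt)
                    queue.append((nxt, depth + 1))
        return found


g = PropertyGraph()
amy = g.add_node(name="Amy")
fred = g.add_node(name="Fred")
bob = g.add_node(name="Bob")
g.relate(amy, "KNOWS", fred, since=2009)   # the relationship has a property too
g.relate(fred, "KNOWS", bob)

within_two_hops = [g.nodes[n]["name"] for n in g.traverse(amy, "KNOWS", 2)]
```

A real graph database stores the relationships so that following them does not require scanning a list, as `outgoing` does here.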


NEO4J Features

Dual license: open source and commercial
Well suited for many web use cases such as tagging, metadata annotations, social networks, wikis and other network-shaped or hierarchical data sets
Intuitive graph-oriented model for data representation. Instead of static and rigid tables, rows and columns, you work with a flexible graph network consisting of nodes, relationships and properties.
Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.
A disk-based, native storage manager completely optimized for storing graph structures for maximum performance and scalability
Massive scalability. Neo4j can handle graphs of several billion nodes/relationships/properties on a single machine and can be sharded to scale out across multiple machines
Fully transactional like a real database
Neo4j traverses depths of 1000 levels and beyond at millisecond speed (many orders of magnitude faster than relational systems)
