Common idea:
Provide higher-level language to facilitate large-data processing
Higher-level language compiles down to Hadoop jobs
Hive: Background
Started at Facebook
Data was collected by nightly cron jobs into Oracle DB
ETL via hand-coded Python
Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that
Hive Components
Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe
Data Model
Tables
Typed columns (int, float, string, boolean)
Also, list and map types (for JSON-like data)
Partitions
For example, range-partition tables by date
Buckets
Hash partitions within ranges (useful for sampling, join optimization)
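To make the partition/bucket distinction concrete, here is a minimal Python sketch, not Hive's actual implementation. It range-partitions rows by date and then hash-partitions each date's rows into a fixed number of buckets. The bucket count, the sample rows, and the use of CRC32 as the hash are all assumptions for illustration; Hive uses its own hash function.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count chosen for this sketch


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key to a bucket with a stable hash (CRC32 here, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def layout(rows):
    """Place (date, user) rows by date partition first, then by hash bucket of user."""
    table = defaultdict(lambda: defaultdict(list))
    for date, user in rows:
        table[date][bucket_for(user)].append(user)
    return table


# Made-up sample rows: (partition date, bucketing key)
rows = [("2009-01-01", "amy"), ("2009-01-01", "fred"), ("2009-01-02", "amy")]
table = layout(rows)
```

Because the same key always lands in the same bucket number, reading a single bucket gives a sample of roughly 1/NUM_BUCKETS of the keys, and two tables bucketed the same way on the join key can be joined bucket by bucket.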
Metastore
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other relational databases
Physical Layout
Warehouse directory in HDFS
E.g., /user/hive/warehouse
Hive: Example
Hive looks similar to an SQL database
Relational join on two tables:
Table of word counts from Shakespeare collection (shakespeare)
Table of word counts from the bible (bible)

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC
LIMIT 10;

word  s.freq  k.freq
the   25848   62394
I     23031   8854
and   19671   38985
to    18038   13526
of    16700   34654
a     14170   8057
you   12702   2720
my    11297   4135
in    10797   12445
is    8882    6884
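For readers who think in code rather than SQL, the query's semantics (inner join on word, filter, sort descending by the Shakespeare frequency, limit) can be sketched in plain Python. This is only an in-memory illustration over the first five result rows from the slide, not how Hive executes the query; the function name is made up.

```python
# First five rows of the slide's result, split back into the two source tables.
shakespeare = {"the": 25848, "I": 23031, "and": 19671, "to": 18038, "of": 16700}
bible = {"the": 62394, "I": 8854, "and": 38985, "to": 13526, "of": 34654}


def top_common_words(s, k, limit=10):
    """Inner-join two word->freq tables on word, filter, order by s.freq DESC, limit."""
    joined = [
        (word, s_freq, k[word])
        for word, s_freq in s.items()
        if word in k and s_freq >= 1 and k[word] >= 1
    ]
    joined.sort(key=lambda row: row[1], reverse=True)
    return joined[:limit]


top3 = top_common_words(shakespeare, bible, limit=3)
```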
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        s
          TableScan
            alias: s
            Filter Operator
              predicate:
                expr: (freq >= 1)
                type: boolean
              Reduce Output Operator
                key expressions:
                  expr: word
                  type: string
                sort order: +
                Map-reduce partition columns:
                  expr: word
                  type: string
                tag: 0
                value expressions:
                  expr: freq
                  type: int
                  expr: word
                  type: string
        k
          TableScan
            alias: k
            Filter Operator
              predicate:
                expr: (freq >= 1)
                type: boolean
              Reduce Output Operator
                key expressions:
                  expr: word
                  type: string
                sort order: +
                Map-reduce partition columns:
                  expr: word
                  type: string
                tag: 1
                value expressions:
                  expr: freq
                  type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
            Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          outputColumnNames: _col0, _col1, _col2
          Filter Operator
            predicate:
              expr: ((_col0 >= 1) and (_col2 >= 1))
              type: boolean
            Select Operator
              expressions:
                expr: _col1
                type: string
                expr: _col0
                type: int
                expr: _col2
                type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: 10
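The plan describes a reduce-side join: each mapper filters its table, emits rows keyed by word with a tag (0 for s, 1 for k), and the reducer pairs up values with different tags for the same key. The following Python sketch simulates that idea in memory. It is a simplified illustration under assumptions (function names and the sample rows "thou" and "zion" are made up), not Hive's execution engine.

```python
from collections import defaultdict
from itertools import product


def map_side(rows, tag):
    """Filter Operator + Reduce Output Operator: keep freq >= 1, emit (key, (tag, value))."""
    for word, freq in rows:
        if freq >= 1:
            yield word, (tag, freq)


def shuffle(pairs):
    """Group map output by key, as the partition/sort on 'word' would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_side(groups):
    """Join Operator: for each key, pair every tag-0 value with every tag-1 value."""
    out = []
    for word, values in groups.items():
        left = [freq for tag, freq in values if tag == 0]
        right = [freq for tag, freq in values if tag == 1]
        for s_freq, k_freq in product(left, right):
            out.append((word, s_freq, k_freq))
    return out


def join(s_rows, k_rows):
    pairs = list(map_side(s_rows, 0)) + list(map_side(k_rows, 1))
    return reduce_side(shuffle(pairs))


# "the" appears in both tables; "thou" and "zion" are made-up unmatched rows.
result = join([("the", 25848), ("thou", 5)], [("the", 62394), ("zion", 3)])
```

Keys that appear with only one tag produce no output, which is what makes this an inner join.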
Visits
user  url                time
Amy   www.cnn.com        8:00
Amy   www.crap.com       8:05
Amy   www.myblog.com     10:00
Amy   www.flickr.com     10:05
Fred  cnn.com/index.htm  12:00
...

Pages
url             pagerank
www.cnn.com     0.9
www.flickr.com  0.9
www.myblog.com  0.7
www.crap.com    0.2
...
Conceptual Dataflow
Load Visits(user, url, time)
Load Pages(url, pagerank)
Canonicalize URLs
Group by user
System-Level Dataflow
Visits: load, canonicalize
Pages: load
join by url
...
group by user
...
the answer
Pig Slides adapted from Olston et al.
MapReduce Code
[Slide shows a full page of hand-written Java MapReduce code (class MRExample), too dense to reproduce legibly. It defines the mapper and reducer classes LoadAndFilterUsers, LoadJoined, Join, ReduceUrls, LoadClicks, and LimitClicks, and chains five jobs through JobControl: "Load Pages", "Load and Filter Users", "Join Users and Pages", "Group URLs", and "Top 100 sites".]
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';
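The same dataflow can be traced by hand in a few lines of Python, using the Visits and Pages rows from the example tables. This is a single-machine sketch of what the script computes, not of how Pig compiles it to MapReduce. The canonicalize rule below is a hypothetical stand-in, since the slides do not say how URLs are canonicalized.

```python
from collections import defaultdict

visits = [
    ("Amy", "www.cnn.com", "8:00"),
    ("Amy", "www.crap.com", "8:05"),
    ("Amy", "www.myblog.com", "10:00"),
    ("Amy", "www.flickr.com", "10:05"),
    ("Fred", "cnn.com/index.htm", "12:00"),
]
pages = {
    "www.cnn.com": 0.9,
    "www.flickr.com": 0.9,
    "www.myblog.com": 0.7,
    "www.crap.com": 0.2,
}


def canonicalize(url):
    """Hypothetical rule: drop the path and make sure the host starts with 'www.'."""
    host = url.split("/")[0]
    return host if host.startswith("www.") else "www." + host


def good_users(visits, pages, threshold=0.5):
    """Join visits to pages by url, group by user, average pagerank, filter."""
    ranks_by_user = defaultdict(list)
    for user, url, _time in visits:
        url = canonicalize(url)
        if url in pages:  # inner join: visits to unknown pages are dropped
            ranks_by_user[user].append(pages[url])
    averages = {user: sum(r) / len(r) for user, r in ranks_by_user.items()}
    return {user: avg for user, avg in averages.items() if avg > threshold}


result = good_users(visits, pages)
```

With these rows, both Amy (average about 0.675) and Fred (0.9) pass the 0.5 threshold. Note that Fred only joins because his URL is canonicalized first.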
[Bar charts comparing Hadoop and Pig; one axis is labeled "Minutes"]
Hive + HBase?
Integration
Reasons to use Hive on HBase:
A lot of data sits in HBase because of its use in a real-time environment, but is never used for analysis
Give people who don't code (business analysts) access to data in HBase that is usually only queried through MapReduce
When a more flexible storage solution is needed, so that rows can be updated live by either a Hive job or an application, and each side sees the other's changes immediately
Integration
How it works:
Hive can use tables that already exist in HBase or manage its own ones, but they still all reside in the same HBase instance
[Diagram: Hive table definitions; one points to an existing table in HBase]
Integration
How it works:
When using an already existing table, defined as EXTERNAL, you can create multiple Hive tables that point to it
Integration
How it works:
Columns are mapped however you want, changing names and giving types
Table: people
Hive columns: age INT, siblings MAP<string, string>
HBase columns: d:fullname, d:age, d:address, f:
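As an illustration of such a mapping, the sketch below builds a Hive DDL statement from a list of (Hive column, Hive type, HBase column) triples, using the same HBaseStorageHandler, "hbase.columns.mapping", and "hbase.table.name" properties that appear in the ratings example later in these slides. The helper function is hypothetical, and the id and name columns, as well as reusing people as the name of both tables, are assumptions. Mapping a whole column family ("f:") to a MAP column follows the siblings example.

```python
def hbase_backed_table_ddl(hive_table, hbase_table, columns):
    """Build a CREATE EXTERNAL TABLE statement for a Hive table backed by HBase.

    columns: list of (hive_name, hive_type, hbase_column) triples, in order.
    The HBase row key is referenced as ':key'.
    """
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    mapping = ",".join(hbase_col for _, _, hbase_col in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table}({col_defs}) "
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
        f'WITH SERDEPROPERTIES ("hbase.columns.mapping" = "{mapping}") '
        f'TBLPROPERTIES ("hbase.table.name" = "{hbase_table}");'
    )


# Hypothetical mapping: Hive names and types on the left, HBase columns on the right.
ddl = hbase_backed_table_ddl(
    "people",
    "people",
    [
        ("id", "STRING", ":key"),
        ("name", "STRING", "d:fullname"),
        ("age", "INT", "d:age"),
        ("siblings", "MAP<string, string>", "f:"),
    ],
)
```

The order of entries in the mapping string has to match the order of the Hive columns, which is why the sketch derives both from the same list.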
Integration
Drawbacks (that can be fixed with brain juice):
Binary keys and values (like integers represented on 4 bytes) aren't supported since Hive prefers string representations (HIVE-1634)
Compound row keys aren't supported; there's no way of using multiple parts of a key as different fields
This means that concatenated binary row keys, which are what people often use for HBase, are completely unusable
Filters are done at Hive level instead of being pushed to the region servers
Partitions aren't supported
Data Flows
Data is being generated all over the place:
Apache logs
Application logs
MySQL clusters
HBase clusters
[Diagram: logs are tailed continuously -> parsed into HBase format -> inserted into HBase]
[Diagram: HDFS]
[Diagram: HBase -> CopyTable MR job -> HBase MR; read in parallel]
* HBase replication currently only works for a single slave cluster, in our case HBase replicates to a backup cluster.
Use Cases
Front-end engineers
They need some statistics regarding their latest product
Research engineers
Ad-hoc queries on user data to validate some assumptions
Generating statistics about recommendation quality
Business analysts
Statistics on growth and activity
Effectiveness of advertiser campaigns
User behavior vs. past activities to determine, for example, why certain groups react better to email communications
Ad-hoc queries on stumbling behaviors of slices of the user base
Use Cases
CREATE EXTERNAL TABLE ratings_hbase(
  userid INT, created BIGINT, urlid INT,
  rating INT, topic INT, modified BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b@0,:key#b@1,:key#b@2,default:rating#b,default:topic#b,default:modified#b")
TBLPROPERTIES ("hbase.table.name" = "ratings_by_userid");
#b means binary, @ means position in composite key (SU-specific hack)
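To show what a concatenated binary row key of this shape could look like, the Python sketch below packs userid (4-byte INT), created (8-byte BIGINT), and urlid (4-byte INT) into one 16-byte key and splits it again by position. Big-endian byte order is an assumption here, chosen to match the convention of HBase's byte utilities. The sample values and function names are made up.

```python
import struct

# Assumed layout: big-endian, userid INT (4 bytes), created BIGINT (8 bytes), urlid INT (4 bytes)
KEY_FORMAT = ">iqi"


def pack_key(userid, created, urlid):
    """Concatenate the three fields into one binary row key."""
    return struct.pack(KEY_FORMAT, userid, created, urlid)


def unpack_key(row_key):
    """Recover the fields by their position in the composite key (@0, @1, @2)."""
    return struct.unpack(KEY_FORMAT, row_key)


key = pack_key(42, 1300000000000, 7)  # made-up sample values
fields = unpack_key(key)
```

A key like this is not valid text, which is why a string-oriented mapping cannot read it and why the binary (#b) and position (@) annotations are needed.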
Graph Databases
NEO4J (Graphbase)
A graph is a collection of nodes (things) and edges (relationships) that connect pairs of nodes.
Attach properties (key-value pairs) on nodes and relationships
Relationships connect two nodes, and both nodes and relationships can hold properties
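The property-graph model described above can be sketched in a few lines of plain Python. This is an in-memory toy to illustrate nodes, typed relationships, and properties on both. It does not use or imitate the Neo4j API, and all names and sample data are made up.

```python
from collections import deque


class Graph:
    def __init__(self):
        self.nodes = {}      # node id -> properties (key-value pairs)
        self.out_edges = {}  # node id -> list of (relationship type, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.out_edges.setdefault(node_id, [])

    def relate(self, source, rel_type, target, **props):
        """Add a directed relationship between two existing nodes."""
        self.out_edges[source].append((rel_type, target, props))

    def reachable(self, start, rel_type, max_depth):
        """Breadth-first traversal along one relationship type, up to max_depth hops."""
        seen = {start}
        queue = deque([(start, 0)])
        found = []
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            for edge_type, target, _props in self.out_edges[node]:
                if edge_type == rel_type and target not in seen:
                    seen.add(target)
                    found.append(target)
                    queue.append((target, depth + 1))
        return found


g = Graph()
g.add_node("amy", name="Amy")
g.add_node("fred", name="Fred")
g.add_node("bob", name="Bob")
g.relate("amy", "KNOWS", "fred", since=2009)
g.relate("fred", "KNOWS", "bob")
friends_of_friends = g.reachable("amy", "KNOWS", max_depth=2)
```

A traversal follows relationships directly from node to node, so its cost depends on the part of the graph it visits rather than on the total size of the data set.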
NEO4J Features
Well suited for many web use cases such as tagging, metadata annotations, social networks, wikis and other network-shaped or hierarchical data sets
Intuitive graph-oriented model for data representation. Instead of static and rigid tables, rows and columns, you work with a flexible graph network consisting of nodes, relationships and properties.
Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.
A disk-based, native storage manager completely optimized for storing graph structures for maximum performance and scalability
Massive scalability. Neo4j can handle graphs of several billion nodes/relationships/properties on a single machine and can be sharded to scale out across multiple machines
Fully transactional like a real database
Neo4j traverses depths of 1000 levels and beyond at millisecond speed (many orders of magnitude faster than relational systems)