
Hive/HBase Integration
or, MaybeSQL?
April 2010

John Sichi
Facebook
Agenda

» Use Cases
» Architecture
» Storage Handler
» Load via INSERT
» Query Processing
» Bulk Load
» Q&A

Motivations

» Data, data, and more data


› 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
› About 8x increase per year
» Queries, queries, and more queries
› More than 200 unique users querying per day
› 7500+ queries per day on the production cluster; a mixture of ad-hoc queries and ETL/reporting queries
» They want it all and they want it now
› Users expect faster response time on fresher data
› Sampled subsets aren’t always good enough

How Can HBase Help?

» Replicate dimension tables from transactional databases with low latency and without sharding
› (Fact data can stay in Hive since it is append-only)
» Only move changed rows
› “Full scrape” is too slow and doesn’t scale as data keeps growing
› Hive by itself is not good at row-level operations
» Integrate into Hive’s map/reduce query execution plans for full parallel distributed processing
» Multiversioning for snapshot consistency?

Use Case 1: HBase As ETL Data Target

[Diagram: Source Files/Tables → Hive INSERT … SELECT … → HBase]

Use Case 2: HBase As Data Source

[Diagram: HBase and Other Files/Tables → Hive SELECT … JOIN … GROUP BY … → Query Result]

Use Case 3: Low Latency Warehouse

[Diagram: Continuous Update feeds HBase; Periodic Load feeds Other Files/Tables; Hive Queries read from both]

HBase Architecture

[Diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]


Hive Architecture

All Together Now!

Hive CLI With HBase

» Minimum configuration needed:

hive \
--auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
-hiveconf hbase.zookeeper.quorum=zk1,zk2…

hive> create table …

Storage Handler

CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" =
  "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list"
);
Column Mapping

» First column in table is always the row key
» Other columns can be mapped to either:
› An HBase column (any Hive type)
› An HBase column family (must be MAP type in Hive; see the sketch below)
» Multiple Hive columns can map to the same HBase column or family
» Limitations
› Currently no control over type mapping (always string in HBase)
› Currently no way to map HBase timestamp attribute
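A minimal sketch of the column-family case, using a hypothetical user_attrs table and attrs family, and following the mapping convention from the previous slide (first column = row key):

CREATE TABLE user_attrs(
  userid int,
  attrs map<string, string>)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "attrs:");

Each qualifier in the attrs family then shows up as a key in the Hive map, with its value as the (string) map value.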

Load Via INSERT

INSERT OVERWRITE TABLE users
SELECT * FROM …;

» Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
» HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
» Multiple rows with same key -> only one row written (see the sketch below)
» Limitations
› No write atomicity yet
› No way to delete rows
› Write parallelism is query-dependent (map vs reduce)
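For example (assuming a hypothetical users_staging Hive table with matching columns):

INSERT OVERWRITE TABLE users
SELECT userid, name, email, notes
FROM users_staging;

If users_staging contains several rows with the same userid, only one of them is written to HBase, since the row key must be unique.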

Map-Reduce Job for INSERT

[Diagram: map/reduce job writing to HBase; adapted from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png]
Map-Only Job for INSERT

[Diagram: map-only job writing directly to HBase]
Query Processing

SELECT name, notes FROM users WHERE userid='xyz';


» Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
» HBase determines the splits (one per table region)
» HBaseSerDe produces lazy rows/maps for RowResults
» Column selection is pushed down
» Any SQL can be used (join, aggregation, union…); see the sketch below
» Limitations
› Currently no filter pushdown
› How do we achieve locality?
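For example, a join plus aggregation mixing the HBase-backed users table with a hypothetical native Hive table event_log:

SELECT u.name, COUNT(*) AS num_events
FROM users u
JOIN event_log e ON (u.userid = e.userid)
GROUP BY u.name;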

Metastore Integration

» DDL can be used to create metadata in Hive and HBase simultaneously and consistently
» CREATE EXTERNAL TABLE: register existing HBase table (see the sketch below)
» DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
» Limitations
› No two-phase-commit for DDL operations
› ALTER TABLE is not yet implemented
› Partitioning is not yet defined
› No secondary indexing
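A minimal sketch, assuming the user_list HBase table from the storage handler slide already exists:

CREATE EXTERNAL TABLE users_existing(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES ("hbase.table.name" = "user_list");

DROP TABLE users_existing then removes only the Hive metadata, leaving the HBase table intact.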

Bulk Load

» Ideally…
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT … ;
» But for now, you have to do some work and issue multiple Hive commands:
1. Sample source data for range partitioning
2. Save sampling results to a file
3. Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
4. Import HFiles into HBase
5. HBase can merge files if necessary
Range Partitioning During Sort

[Diagram: TotalOrderPartitioner routes sorted rows into key ranges (A-G, H-Q, R-Z), one HFile per range; loadtable.rb imports the HFiles into HBase]
Sampling Query For Range Partitioning

Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.

select user_id from
  (select user_id
   from hive_user_table
   tablesample(bucket 1 out of 1000 on user_id) s
   order by user_id) sorted_user_5k_sample
where (row_sequence() % 501) = 0;
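Two supporting details, sketched along the lines of the Hive/HBaseBulkLoad wiki page (the jar path and table/file names are illustrative): row_sequence() is a contrib UDF that must be registered first, and the sampled keys (step 2) must be saved where TotalOrderPartitioner can read them:

add jar /path/to/hive_contrib.jar;
create temporary function row_sequence as
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- step 2: save the 9 chosen keys as a sequence file of sortable keys
create external table hb_range_keys(user_id_range_start string)
row format serde
  'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/hb_range_key_list';

insert overwrite table hb_range_keys
select user_id from ... ;  -- the sampling query above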

Sorting Query For Bulk Load
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=
  org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, …
from hive_user_table
cluster by user_id;
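And a sketch of step 4, assuming the loadtable.rb script shipped in the HBase 0.20 bin directory and the table/path names used above:

hbase org.jruby.Main loadtable.rb user_list /tmp/hbsort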

Deployment

» Latest Hive trunk (will be in Hive 0.6.0)
» Requires Hadoop 0.20+
» Tested with HBase 0.20.3 and ZooKeeper 3.2.2
» 20-node hbtest cluster at Facebook
» No performance numbers yet
› Currently setting up tests with about 6TB (gz compressed)

Questions?

» hive-user@hadoop.apache.org
» jsichi@facebook.com
» http://wiki.apache.org/hadoop/Hive/HBaseIntegration
» http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

» Special thanks to Samuel Guo for the early versions of the integration code
Hey, What About HBQL?

» HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
» HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs

