
Hive/HBase Integration
or, MaybeSQL?
April 2010

John Sichi
Facebook
Agenda

» Use Cases
» Architecture
» Storage Handler
» Load via INSERT
» Query Processing
» Bulk Load
» Q&A

Motivations

» Data, data, and more data


› 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
› About 8x increase per year
» Queries, queries, and more queries
› More than 200 unique users querying per day
› 7500+ queries per day on the production cluster; a mixture of ad-hoc queries and ETL/reporting queries
» They want it all and they want it now
› Users expect faster response time on fresher data
› Sampled subsets aren’t always good enough

How Can HBase Help?

» Replicate dimension tables from transactional databases with low latency and without sharding
› (Fact data can stay in Hive since it is append-only)
» Only move changed rows
› “Full scrape” is too slow and doesn’t scale as data keeps growing
› Hive by itself is not good at row-level operations
» Integrate into Hive’s map/reduce query execution plans for full parallel distributed processing
» Multiversioning for snapshot consistency?

Use Case 1: HBase As ETL Data Target

[Diagram: Source Files/Tables → Hive INSERT … SELECT … → HBase]

Use Case 2: HBase As Data Source

[Diagram: HBase and Other Files/Tables → Hive SELECT … JOIN … GROUP BY … → Query Result]

Use Case 3: Low Latency Warehouse

[Diagram: Continuous Update feeds HBase; Periodic Load feeds Other Files/Tables; Hive Queries read from both]

HBase Architecture

[Diagram from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]


Hive Architecture

All Together Now!

Hive CLI With HBase

» Minimum configuration needed:

hive \
--auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
-hiveconf hbase.zookeeper.quorum=zk1,zk2…

hive> create table …

Storage Handler

CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" =
  "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list"
);
Column Mapping

» First column in table is always the row key
» Other columns can be mapped to either:
› An HBase column (any Hive type)
› An HBase column family (must be MAP type in Hive; see the sketch below)
» Multiple Hive columns can map to the same HBase column or family
» Limitations
› Currently no control over type mapping (always string in HBase)
› Currently no way to map HBase timestamp attribute
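A minimal sketch of the column-family case, using a hypothetical user_attrs table and attrs family, and following the mapping convention from the previous slide (first column = row key):

CREATE TABLE user_attrs(
  userid int,
  attrs map<string, string>)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "attrs:");

Each qualifier in the attrs family then shows up as a key in the Hive map, with its value as the (string) map value.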

Load Via INSERT

INSERT OVERWRITE TABLE users
SELECT * FROM …;

» Hive task writes rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat
» HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)
» Multiple rows with same key -> only one row written (see the sketch below)
» Limitations
› No write atomicity yet
› No way to delete rows
› Write parallelism is query-dependent (map vs reduce)
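For example (assuming a hypothetical users_staging Hive table with matching columns):

INSERT OVERWRITE TABLE users
SELECT userid, name, email, notes
FROM users_staging;

If users_staging contains several rows with the same userid, only one of them is written to HBase, since the row key must be unique.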

Map-Reduce Job for INSERT

[Diagram: map/reduce job writing to HBase; adapted from http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png]
Map-Only Job for INSERT

[Diagram: map-only job writing directly to HBase]
Query Processing

SELECT name, notes FROM users WHERE userid='xyz';


» Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase
» HBase determines the splits (one per table region)
» HBaseSerDe produces lazy rows/maps for RowResults
» Column selection is pushed down
» Any SQL can be used (join, aggregation, union…); see the sketch below
» Limitations
› Currently no filter pushdown
› How do we achieve locality?
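For example, a join plus aggregation mixing the HBase-backed users table with a hypothetical native Hive table event_log:

SELECT u.name, COUNT(*) AS num_events
FROM users u
JOIN event_log e ON (u.userid = e.userid)
GROUP BY u.name;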

Metastore Integration

» DDL can be used to create metadata in Hive and HBase simultaneously and consistently
» CREATE EXTERNAL TABLE: register existing HBase table (see the sketch below)
» DROP TABLE: will drop HBase table too unless it was created as EXTERNAL
» Limitations
› No two-phase-commit for DDL operations
› ALTER TABLE is not yet implemented
› Partitioning is not yet defined
› No secondary indexing
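A minimal sketch, assuming the user_list HBase table from the storage handler slide already exists:

CREATE EXTERNAL TABLE users_existing(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES ("hbase.table.name" = "user_list");

DROP TABLE users_existing then removes only the Hive metadata, leaving the HBase table intact.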

Bulk Load

» Ideally…
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT … ;
» But for now, you have to do some work and issue multiple Hive commands:
1. Sample source data for range partitioning
2. Save sampling results to a file
3. Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts data, producing a large number of region files)
4. Import HFiles into HBase
5. HBase can merge files if necessary
Range Partitioning During Sort

[Diagram: TotalOrderPartitioner routes sorted rows into key ranges (A-G, H-Q, R-Z), one HFile per range; loadtable.rb imports the HFiles into HBase]
Sampling Query For Range Partitioning

Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.

select user_id from
  (select user_id
   from hive_user_table
   tablesample(bucket 1 out of 1000 on user_id) s
   order by user_id) sorted_user_5k_sample
where (row_sequence() % 501) = 0;
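Two supporting details, sketched along the lines of the Hive/HBaseBulkLoad wiki page (the jar path and table/file names are illustrative): row_sequence() is a contrib UDF that must be registered first, and the sampled keys (step 2) must be saved where TotalOrderPartitioner can read them:

add jar /path/to/hive_contrib.jar;
create temporary function row_sequence as
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- step 2: save the 9 chosen keys as a sequence file of sortable keys
create external table hb_range_keys(user_id_range_start string)
row format serde
  'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/hb_range_key_list';

insert overwrite table hb_range_keys
select user_id from ... ;  -- the sampling query above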

Sorting Query For Bulk Load
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=
  org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, …
from hive_user_table
cluster by user_id;
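And a sketch of step 4, assuming the loadtable.rb script shipped in the HBase 0.20 bin directory and the table/path names used above:

hbase org.jruby.Main loadtable.rb user_list /tmp/hbsort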

Deployment

» Latest Hive trunk (will be in Hive 0.6.0)
» Requires Hadoop 0.20+
» Tested with HBase 0.20.3 and ZooKeeper 3.2.2
» 20-node hbtest cluster at Facebook
» No performance numbers yet
› Currently setting up tests with about 6TB (gz compressed)

Questions?

» hive-user@hadoop.apache.org
» jsichi@facebook.com
» http://wiki.apache.org/hadoop/Hive/HBaseIntegration
» http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

» Special thanks to Samuel Guo for the early versions of the integration code
Hey, What About HBQL?

» HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations
» HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs

