
First Taste of Apache Hive

About Hadoop

Hadoop consists of HDFS and MapReduce. When we talk about Hadoop, we usually refer to both components, but they can be deployed and work independently. Name Node: the HDFS data dictionary; also used to record various counters used by MapReduce jobs. Job Tracker: used for MapReduce job management, including resource management, job dispatch, and job status. YARN: Hadoop 2.0 has re-architected the functions provided by the Job Tracker and split resource management and job management into different modules.

[Diagram: Hadoop stack — MapReduce (distributed computing framework, managed by the Job Tracker) and HDFS (distributed storage and file system, managed by the Name Node)]

About Hive
Hive is an RDBMS-style SQL engine built on top of Hadoop. Hive can have native (managed) tables and external tables.

The data for native tables lives inside Hadoop HDFS. The source of an external table can be data stored in HDFS, data in the Hadoop-based NoSQL database HBase, any RDBMS via a JDBCStorageHandler, or any other NoSQL database with a supported storage handler. Even with external tables from external data sources, Hive still needs HDFS for temporary space and intermediate data storage.
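A minimal sketch of the difference, assuming made-up table names and an illustrative HDFS path:

-- Managed (native) table: Hive owns the data under its warehouse directory;
-- DROP TABLE removes both the metadata and the data.
CREATE TABLE quotes_managed (tick STRING, close_price DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive records only the location; DROP TABLE removes the
-- metadata but leaves the files in place.
CREATE EXTERNAL TABLE quotes_external (tick STRING, close_price DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/quotes';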

Hive stores metadata (database/schema definitions, table definitions, index definitions, partition definitions, privileges) in the metastore, an RDBMS such as the embedded Java Derby DB, MySQL, Oracle, etc. The Hive query language (HiveQL) syntax is very similar to MySQL's SQL, but it provides only a subset of the functionality.

[Diagram: Hive architecture — App/User Interface → Hive Driver (QL) → HDFS, MapReduce, Metastore (Data Dictionary), External Data Sources]

Hive Process Flow


[Diagram: Hive process flow — App/User Interface → Driver: Parser → Parse Tree → Logic Plan (SQL Ops) → Logic Optimization → Physical Plan (MR Tasks) → Physical Optimization → Execute → Hadoop Job Tracker]

1. The user sends a query to Hive using its CLI, the JDBC/ODBC interface, or the Thrift server.
2. The query is passed to the Hive QL driver.
3. The driver compiles (parses) the query, applies a small set of rule-based optimizations, and generates one or more MapReduce jobs.
4. The driver submits the generated MapReduce jobs to the Hadoop job tracker for execution.
5. Logic optimization is SQL-level optimization.
6. Physical optimization scans the generated MapReduce tasks to merge or eliminate redundant tasks.
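To see what the driver produces, EXPLAIN prints the compiled plan: the stage dependencies and the map/reduce operator trees that correspond to the generated MapReduce jobs. A minimal sketch, using the tick_simple sample table that appears later in this deck:

-- Plain EXPLAIN shows the stage dependencies and stage plans.
EXPLAIN
SELECT sector, count(*)
FROM tick_simple
GROUP BY sector;

-- EXPLAIN EXTENDED adds more detail, such as file paths and serde information.
EXPLAIN EXTENDED
SELECT sector, count(*)
FROM tick_simple
GROUP BY sector;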

Compare to Oracle
Think of a Hive query as an Oracle query with large parallel operations across a very large Oracle RAC, without cache fusion. A normal Oracle PX operation consists of two sets of PX processes, producers and consumers, plus a PX coordinator. For Hive, the producer is the Hadoop Mapper, the consumer is the Hadoop Reducer, and the PX coordinator is the Hadoop Job Tracker. The differences: a typical Oracle PX execution plan can have multiple operations, and the allocated PX processes are reused to execute them. A typical Hive query can have one or more relatively independent Hadoop jobs (while the jobs are interdependent from Hive's point of view, they are independent from Hadoop's point of view). If more than one job is required for a Hive query, the Mapper and Reducer slots (Hive's equivalent of PX processes) have to be reallocated and re-launched. This is a significant source of latency for complicated Hive queries. On the other hand, a Hive query requiring multiple Hadoop MapReduce jobs can be compared to an Oracle query plan with multiple PX coordinators to handle unmergeable subqueries and subqueries inside the SELECT list that require PX.

Quick Start
Cloudera provides preinstalled Hadoop VM images we can play with: https://ccp.cloudera.com/display/SUPPORT/Demo+VMs Note that earlier VM versions (CDH3.x) could have broken Hive functionality, so try to use CDH4.x. CDH4 uses Hadoop 2.0 (YARN), which splits the job tracker into multiple modules. A major issue for a Hive query with multiple MapReduce jobs is job dispatch from the job tracker and the allocation of Mapper and Reducer slots. One major purpose of Hadoop 2.0 (YARN) is to optimize resource allocation, but it looks like Hive has yet to take advantage of this feature.

Play With Hive


Stock list and quotes were downloaded from NASDAQ and Y! Finance.
The Hive command line (CLI); the log configuration file lets you adjust the log level.
It is easy to create a new database (schema).
Display the database names (MySQL syntax).
Session level query related log, such as the query text and plan.
Equivalent to Oracle alter session set current_schema=stocks (MySQL-style syntax).
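The statements behind those screenshots are, in essence, the following (the database name stocks matches the later examples):

-- Create a new database (schema), list databases, and switch to it.
CREATE DATABASE IF NOT EXISTS stocks;
SHOW DATABASES;
USE stocks;   -- MySQL-style; the Oracle equivalent is alter session set current_schema=stocks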

Create Table and Load Data

1. create table creates a table whose data will be managed by Hive.
2. create external table creates a table linked to another data source, for example data stored on HDFS that was loaded or generated by methods unrelated to Hive.
3. If data is going to be loaded from another source, it is better to specify its row format and column format.
4. Here I loaded data from three local files using the into table syntax. The source data can also be HDFS data; Hive will simply copy the data files. The overwrite into table syntax deletes whatever is already stored and copies the new file; if I had used that option, I would have ended up with only one exchange list.
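A hedged sketch of what that looks like (the column list, delimiters, and file paths are assumptions, not the exact ones used on the slide):

CREATE TABLE tick_simple (
  symbol       STRING,
  company_name STRING,
  sector       STRING,
  industry     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- into table appends, so loading three exchange files keeps all of them.
LOAD DATA LOCAL INPATH '/home/cloudera/nasdaq.csv' INTO TABLE tick_simple;
LOAD DATA LOCAL INPATH '/home/cloudera/nyse.csv'   INTO TABLE tick_simple;
LOAD DATA LOCAL INPATH '/home/cloudera/amex.csv'   INTO TABLE tick_simple;

-- overwrite into table would replace what is stored, leaving only one exchange list:
-- LOAD DATA LOCAL INPATH '/home/cloudera/nyse.csv' OVERWRITE INTO TABLE tick_simple;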

A Simple Query At Work


Two MR jobs are generated, most likely one for the GROUP BY and one for the ORDER BY.
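The query itself is not reproduced on the slide; a hedged guess at one of that shape, using the sample table columns, would be:

-- The GROUP BY compiles to one MapReduce job; the global ORDER BY needs a second one.
SELECT sector, count(*) AS cnt
FROM tick_simple
GROUP BY sector
ORDER BY cnt DESC;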

Hadoop MR job tracking information. A detail URL is provided; from there, we can check performance metrics and task logs. The line following the tracking URL is the command to kill the job, equivalent to Oracle's alter system kill session.

Job Status

Details can be accessed from the menu above: Counters for performance metrics; Map Tasks and Reduce Tasks for individual map and reduce task performance data and logs.

Job log can be accessed here

1. Job counters are performance and resource usage metrics specific to the job. 2. For a long running job, look for the very large numbers, especially IO and network communication.

The job log can give detail about how the job is managed, for example how and when resources are allocated. It is equivalent to the Oracle 10046 trace file for the query coordinator of a PX operation. For a Hive query requiring multiple MR jobs on a busy cluster, job dispatch latency is going to be a huge headache.

Tracking status for all Map Tasks; there is also a page for all Reduce Tasks. From there, we can drill down further to an individual task attempt, especially when most of the tasks have completed while one or two are still hanging around.

Individual task performance data. There are performance counters specific to the task; the log link is the more interesting one for app dev troubleshooting.

The task log is very useful for finding business logic errors. The timestamps can help you identify performance bottleneck operations. It is equivalent to the Oracle 10046 trace file for a PX slave process. For a small MR job, the overhead of getting the job scheduled and the tasks launched is significant.

Partition Support
Here I created a table PRICES to store historical stock prices. The table is partitioned by stock tick. Then I loaded stock quote CSV files for several ticks; each tick has its own file in Hive's HDFS storage.
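A hedged sketch of the DDL and load statements (the price column list and file paths are illustrative, not the exact ones from the slides):

-- Each value of the partition column tick maps to its own directory (and here, one file) in HDFS.
CREATE TABLE prices (
  quote_date  STRING,
  open_price  DOUBLE,
  high_price  DOUBLE,
  low_price   DOUBLE,
  close_price DOUBLE,
  volume      BIGINT
)
PARTITIONED BY (tick STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- One quote CSV file per tick, loaded into its own partition.
LOAD DATA LOCAL INPATH '/home/cloudera/msft.csv' INTO TABLE prices PARTITION (tick='MSFT');
LOAD DATA LOCAL INPATH '/home/cloudera/fb.csv'   INTO TABLE prices PARTITION (tick='FB');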

Partition Pruning

The queries use the partition column explicitly. MSFT has a longer history, so its partition is larger; FB is new, so there is very little data inside its partition.
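For example, continuing the hypothetical prices table above, the predicate on the partition column lets Hive scan only the matching partition directory:

SELECT count(*) FROM prices WHERE tick = 'MSFT';  -- larger partition, longer history
SELECT count(*) FROM prices WHERE tick = 'FB';    -- tiny partition, very little data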

JOIN

Join Optimization: Map Join

1. Map join is equivalent to an Oracle PX hash join where one table is very small and broadcast is used for PX distribution.
2. For Hive, the hash table is built by the Hive driver, and the Hadoop distributed cache is used to send the data to the nodes that will run the map join.
3. The join is done at the mapper stage (hence the name); no reducer is used. With one fewer layer of resource allocation and data transfer, the elapsed time is less than half of that without this optimization.
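Two ways to ask for a map join, sketched against the hypothetical tables used earlier (the hint and settings are standard Hive of that era, but thresholds vary by version):

-- 1. Explicit hint: build the hash table from the small table t and broadcast it.
SELECT /*+ MAPJOIN(t) */ p.tick, p.close_price, t.sector
FROM prices p JOIN tick_simple t ON (p.tick = t.symbol);

-- 2. Let Hive convert the join automatically when one side is small enough.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- small-table size threshold in bytes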

Join Optimization: Skewed Data


A usual Hive join works like a hash-partitioned sort merge join: data from the two sources are hash partitioned and distributed to reducers according to the hash values. For very large data sets, if the data is highly skewed on a few join key values (for example, one or two hot values), the join will be concentrated on one or two reducers, and execution time can be very long, or the job can even run out of memory. Skewed data is not a Hive-specific issue: an Oracle hash join with PX suffers the same problem, and even without PX, skewed data causes high CPU usage because of hash value collisions. For Hive, if one data source is small, a map join can be used to reduce the impact of data skew.

Join Optimization: Skewed Data


Hive also introduced another join strategy: skew join. It splits the job into two jobs: one for the non-skewed data and one for the skewed data. For the non-skewed data, the join runs as normal. For the skewed data, the set of join keys is small and most likely known at runtime, so Hive can apply another join optimization such as MAPJOIN. Session level parameters tell Hive to consider a skew join:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=<row-count threshold for a key to be treated as skewed; the default is 100,000>;

Note that the implementation is not very stable. The join is even sensitive to the order of the tables in the query, and query execution can be terminated because of the table order. Because it adds more Hive stages (Hive local jobs plus Hadoop jobs), this strategy should be considered only if a query runs intolerably long.
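A minimal sketch of enabling it around a join (table and column names are made up; the threshold is just the documented default):

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;   -- keys with more rows than this are treated as skewed

-- Skewed keys are set aside at map time and joined in a follow-up job,
-- typically as a map join; the rest go through the normal reduce-side join.
SELECT c.user_id, u.segment
FROM clicks c JOIN users u ON (c.user_id = u.user_id);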

Aggregation Optimization: Map-Side Aggregation


A typical MapReduce or Pig job uses a Combiner for partial aggregation on the map side, to reduce the data size passed to the reducers. Note that Oracle parallel operations have a similar feature for GROUP BY (frequently you can see two GROUP BY steps in a PX plan). Hive does not support the Combiner; instead, it introduced a concept called map-side aggregation, and it is enabled by default:
set hive.map.aggr=true;
To verify the current setting, use: set hive.map.aggr;

My sample test query, with the impact verified from the job status tracking page:
Query: select sector, count(*) from tick_simple group by sector;
With hive.map.aggr=true, the map input record count is 6430 and the map output count is 55, so only 55 records are sent to the reducer side. With hive.map.aggr=false, the map input record count is 6430 and the map output count is 6430, roughly 116 times more data sent to the reducer side. It does not make much difference for this small test, but it makes a huge difference for production data with billions of input records.

Function Support
Hive ships with the usual built-in functions. Standard functions: abs, round, floor, ucase, concat, etc. Aggregate functions: avg, sum, etc. Table generating functions (comparable to Oracle table functions): explode. User defined functions can be written in Java, packaged in a jar, and loaded on a per-session basis; however, making a user defined function permanent requires modifying the Hive source code and rebuilding the Hive package. User defined functions give us the opportunity to create common functions shared by multiple products, or to take advantage of Hive's general QL capability while wrapping project-specific requirements into user defined functions. For example, for RMX, the source data are events emitted by ad servers, collected by the data highway, and dumped on the Hadoop grid, and the RMX data pipelines use MapReduce or Pig jobs to reprocess the data, which is then consumed by other applications. A common requirement is to split/extract the data related to advertiser, publisher, network, etc. in tabular format from each event. This can easily be written as a Hive table generating function, and the remaining work is just standard SQL operations.
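Hedged examples of a built-in table generating function and of loading a user defined function for one session (the table with an array column, the jar path, and the class name are all placeholders):

-- Built-in table generating function: explode turns an array column into rows.
SELECT tick, quote
FROM prices_arrays LATERAL VIEW explode(quotes) q AS quote;

-- Per-session user defined function written in Java and packaged in a jar.
ADD JAR /tmp/my-hive-udfs.jar;
CREATE TEMPORARY FUNCTION clean_name AS 'com.example.hive.CleanNameUDF';
SELECT clean_name(company_name) FROM tick_simple LIMIT 10;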

Function At Work
Hive does not provide a built-in method to handle CSV with quotes, so my test table has a data quality issue after the company name column, because some names have commas inside. As a result, my first test query failed to give correct data for sector names. For the following test, I create a table with a single column and load the one tick symbol file nasdaq.csv. Then I use the split function with a CSV regex to extract the correct sector name. I have not figured out how to use the split function to extract all columns and build another table with CTAS, but it is not hard to use the Java String split function to build a Hive table generating function for that purpose.
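A hedged sketch of that approach (the raw table is mine, and the array index for the sector field depends on the exact nasdaq.csv layout, so treat it as an assumption):

-- One raw CSV line per row.
CREATE TABLE nasdaq_raw (line STRING);
LOAD DATA LOCAL INPATH '/home/cloudera/nasdaq.csv' INTO TABLE nasdaq_raw;

-- split() takes a Java regex; this one matches only commas that sit outside
-- double quotes, so quoted company names are not broken apart.
-- Adjust the array index to wherever the sector column sits in your file.
SELECT split(line, ',(?=([^"]*"[^"]*")*[^"]*$)')[6] AS sector
FROM nasdaq_raw
LIMIT 10;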

Other Interesting Things


Index: Hive has two types of indexes. A normal (compact) index stores (key, HDFS file name, key offset) in another Hive table; Hive also supports bitmap indexes. However, since Hadoop typically uses a 64MB or 128MB block size, indexes are not really very useful in Hive. Furthermore, for a partitioned table, a Hive index only works if Hive can prune the partitions based on the index data.
RDBMS connections: Hive currently does not have a built-in storage handler to access data in other RDBMS databases such as Oracle and MySQL (there is a third-party JDBCStorageHandler, though). It would be interesting if we could use Hive rather than Sqoop to dump or access data from Oracle or MySQL directly.
NoSQL: Hive does provide built-in storage handlers to access data in some NoSQL databases such as HBase, Cassandra, and DynamoDB.
Logging and EXPLAIN: if the log level is set appropriately, the Hive log file hive.log has information similar to an Oracle 10053 trace file for the query parsing process. This can help to figure out why a specific feature is or is not used. Hive also provides EXPLAIN for the execution plan, and the plan can also be found in the session level log file (though it is not user-friendly formatted).
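As a hedged sketch of the index DDL (the handler class is the standard compact index handler shipped with Hive; the table and column reuse the earlier hypothetical example):

-- Compact index: stores (key, HDFS file name, offset) in an index table.
CREATE INDEX tick_simple_sector_idx
ON TABLE tick_simple (sector)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- The index data is only populated when the index is rebuilt.
ALTER INDEX tick_simple_sector_idx ON tick_simple REBUILD;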

My Wish List
Better YARN (Hadoop 2.0) integration, so that Hive can take advantage of the new job and resource management and allocation architecture. Better documentation and manuals: the user documents are sparse and incomplete. Documentation about the source code: for an open source product without professional technical support, a clear way to help users navigate the source code modules and understand the design ideas is very important.

References
Book: Programming Hive — Jason Rutherglen, Dean Wampler, Edward Capriolo.
Home page: http://hive.apache.org; check its wiki for links to presentations and manuals.
Source code: http://svn.apache.org/viewvc/hive — this is the only place to find out why the feature you like does not work, or why your query encounters an error.
Hive internals: http://www.slideshare.net/mobile/recruitcojp/internal-hive
Side note: comparing with your understanding of Oracle, especially query optimization, will help you understand Hive and make it perform better. However, Hive developers may take more hints from MySQL, which is also an open source project.
