About Hadoop
Hadoop consists of HDFS and MapReduce. When we talk about Hadoop, we usually refer to both components, but they can be deployed and work independently.
Name Node: the HDFS data dictionary; also used to record various counters used by MapReduce jobs.
Job Tracker: used for MapReduce job management, covering resource management, job dispatch, and job status.
YARN: Hadoop 2.0 has re-architected the functions provided by the Job Tracker and split resource management and job management into separate modules.
(Diagram: Hadoop architecture showing the Name Node and Job Tracker.)
About Hive
Hive is an RDBMS-like SQL engine built on top of Hadoop. Hive can have native (managed) tables and external tables.
The data for native tables lives inside Hadoop HDFS. The source of an external table can be data stored in HDFS, the Hadoop-based NoSQL database HBase, any other RDBMS via the JDBCStorageHandler, or any other NoSQL database with a supported storage handler. Even with external tables from external data sources, Hive still needs HDFS for temporary space and intermediate data storage.
Hive stores metadata (database/schema definitions, table definitions, index definitions, partition definitions, privileges) in the metastore, an RDBMS such as Apache Derby, MySQL, Oracle, etc. The Hive query language (QL) syntax is very similar to MySQL's SQL dialect, but provides only a subset of its functionality.
(Diagram: Hive architecture — App/User Interface, Hive Driver (QL), HDFS, MapReduce, Metastore (Data Dictionary), External Data Sources. The driver compiles the query and then executes it.)
Compare to Oracle
Think of a Hive query as an Oracle query with large parallel operations across a very large Oracle RAC, without cache fusion. A normal Oracle PX operation consists of two sets of PX processes, producers and consumers, plus a PX coordinator. For Hive, the producer is the Hadoop Mapper, the consumer is the Hadoop Reducer, and the PX coordinator is the Hadoop Job Tracker. The differences: a typical Oracle PX execution plan can have multiple operations, and the allocated PX processes are reused to execute them. A typical Hive query can consist of one or more relatively independent Hadoop jobs (while the jobs are interdependent from Hive's point of view, they are independent from Hadoop's point of view). If more than one job is required for a Hive query, the Mapper and Reducer slots (Hive's equivalent of PX processes) have to be reallocated and re-launched. This is a significant source of latency for complicated Hive queries. On the other hand, a Hive query requiring multiple Hadoop MapReduce jobs can be compared to an Oracle query plan with multiple PX coordinators dealing with unmergeable subqueries and subqueries inside the SELECT list that require PX.
Quick Start
Cloudera provides preinstalled Hadoop VM images we can play with: https://ccp.cloudera.com/display/SUPPORT/Demo+VMs. Note that earlier-version (CDH3..) VMs could have broken Hive functionality, so try to use CDH4... CDH4 uses Hadoop 2.0 (YARN), which splits the job tracker into multiple modules. A major issue for a Hive query with multiple MapReduce jobs is job dispatch from the job tracker and Mapper/Reducer slot allocation. One major purpose of Hadoop 2.0 (YARN) is to optimize resource allocation, but it looks like Hive has yet to take advantage of this feature.
Session-level, query-related log, such as the query text and plan.
1. create table creates a table whose data will be managed by Hive.
2. create external table creates a table linked to another data source, for example data stored on HDFS that was loaded or generated by methods not related to Hive.
3. If data is going to be loaded from another source, it is better to specify its row format and column format.
4. Here I loaded data from three local files using the load data ... into table syntax. The source data can also be HDFS data; Hive simply copies the data files. The overwrite into table syntax deletes whatever is already stored and copies the new file; if I had used this option, I would have ended up with only one exchange list.
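A minimal HiveQL sketch of these steps (the table names, columns, and file paths are hypothetical, not the exact ones used in this test):

create table exchange_list (symbol string, company string, sector string)
row format delimited fields terminated by ','
stored as textfile;

create external table exchange_list_ext (symbol string, company string, sector string)
row format delimited fields terminated by ','
location '/user/demo/exchange_list';

-- append the file contents to whatever is already in the table
load data local inpath '/tmp/nyse.csv' into table exchange_list;
-- overwrite deletes the existing files first, leaving only the new data
load data local inpath '/tmp/nasdaq.csv' overwrite into table exchange_list;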
Hadoop MR job tracking information. A detail URL is provided; from there, we can check performance metrics and task logs. The line following the tracking URL is the command to kill the job, equivalent to Oracle's alter system kill session.
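For reference, the kill command that Hive prints is the standard Hadoop one and looks roughly like this (the job id below is made up):

hadoop job -kill job_201301011234_0001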
Job Status
Details can be accessed from the menu above: Counters for performance metrics, Map Tasks and Reduce Tasks for individual map and reduce task performance data and logs.
1. Job counters are performance and resource usage metrics specific to the job.
2. For a long-running job, look for the very large numbers, especially I/O and network communication.
The job log can give details about how the job is managed, for example how and when resources are allocated. It is equivalent to an Oracle 10046 trace file for the query coordinator of a PX operation. For a Hive query requiring multiple MR jobs on a busy cluster, job dispatch latency is going to be a huge headache.
Tracking status for all Map Tasks. There is also a page for all Reduce Tasks. From there, we can drill down further into individual task attempts, especially when most of the tasks are completed while one or two are still hanging around.
Individual task performance data. There are performance counters specific to the task. The log link is more interesting for application development troubleshooting.
The task log is very useful for finding business logic errors. The timestamps can help you identify the performance bottleneck operations. It is equivalent to an Oracle 10046 trace file for a PX slave process. For a small MR job, the overhead of getting the job scheduled and the tasks launched is significant.
Partition Support
Here I created a table PRICES to store stock history prices. The table is partitioned by stock tick. Then I loaded stock quote CSV files for several ticks; each tick has its own file in Hive's HDFS storage.
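A sketch of how such a partitioned table can be created and loaded (the column list and file paths are assumptions, not the exact DDL used here):

create table prices (
  quote_date string,
  open float, high float, low float, close float,
  volume bigint
)
partitioned by (tick string)
row format delimited fields terminated by ',';

load data local inpath '/tmp/MSFT.csv' into table prices partition (tick='MSFT');
load data local inpath '/tmp/FB.csv' into table prices partition (tick='FB');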
Partition Pruning
The queries use the partition column explicitly. MSFT has a longer history, so its partition is larger; FB is new, so there is only very little data inside its partition.
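For example, queries of this shape scan only the single partition named in the predicate (assuming the prices table sketched above):

select count(*) from prices where tick = 'MSFT';
select count(*) from prices where tick = 'FB';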
JOIN
Note the implementation is not very stable. The join is even sensitive to the table order presented in the query, and query execution could be terminated because of the table order. Because it adds more Hive stages (Hive local jobs plus Hadoop jobs), this strategy should be considered only if a query job runs intolerably long.
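A minimal join sketch to illustrate the table order sensitivity, reusing table names from the other tests (the column names are assumptions and the exact query that failed is not reproduced here). Hive streams the last table in the join, so the usual advice is to list the largest table last:

select t.symbol, t.sector, p.close
from tick_simple t
join prices p on (t.symbol = p.tick)
where p.tick = 'MSFT';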
My sample query for the test; the impact is verified from the job status tracking:
Query: select sector, count(*) from tick_simple group by sector;
With hive.map.aggr=true, the map input record count is 6430 and the map output count is 55, so only 55 records will be sent to the reducer side.
With hive.map.aggr=false, the map input record count is 6430 and the map output count is 6430, about 116 times more data sent to the reducer side.
It will not make much difference for this small test, but it will make a huge difference for production data with billions of input records.
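The two runs differ only in this session setting; a sketch of the exact commands:

set hive.map.aggr=true;
select sector, count(*) from tick_simple group by sector;

set hive.map.aggr=false;
select sector, count(*) from tick_simple group by sector;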
Function Support
Hive supports the usual operations via built-in functions.
Standard functions: abs, round, floor, ucase, concat, etc.
Aggregate functions: avg, sum, etc.
Table-generating functions (the analog of Oracle table functions): explode.
User defined functions can be written in Java, packaged in a jar, and loaded on a per-session basis. However, making a user defined function permanent requires modifying the Hive source code and rebuilding the Hive package. User defined functions give us the opportunity to create common functions shared by multiple products, or to take advantage of Hive's general QL capability while wrapping project specific requirements into user defined functions. For example, for RMX the source data are events spilled out from ad servers, collected by the data highway, and dumped on the Hadoop GRID, and the RMX data pipelines use MapReduce jobs or PIG jobs to reprocess the data, which is then consumed by other applications. A common requirement is to split/extract the data related to advertiser, publisher, network, etc. in tabular format from each event. This can easily be written as a Hive table-generating function, and the remaining work is just standard SQL operations.
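Loading a user defined function for a session looks roughly like this (the jar path, class name, function name, and table are all hypothetical placeholders for the RMX scenario described above):

add jar /tmp/rmx_udfs.jar;
create temporary function explode_event as 'com.example.hive.udtf.ExplodeEvent';

-- the table-generating function turns one raw event line into tabular columns
select explode_event(raw_line) as (advertiser, publisher, network)
from ad_events;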
Function At Work
Hive does not provide a built-in method to handle CSV with quotes, so my test table has data quality issues after the company name column, because some names have commas inside. As a result, my first test query failed to give correct data for sector names. For the following test, I created a table with a single column and loaded the single tick symbol file nasdaq.csv. Then I used the split function with a CSV regex to extract the correct sector name. However, I have not figured out how to use the SPLIT function to extract all columns and build another table with CTAS. But it is not hard to use the Java String split function to build a Hive table-generating function for that purpose.
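The sector extraction with SPLIT looks roughly like this (the raw table name and the column position are assumptions; the regex splits on commas that are outside double quotes):

select split(line, ',(?=([^"]*"[^"]*")*[^"]*$)')[5] as sector
from nasdaq_raw;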
My Wish List
Better YARN (Hadoop 2.0) integration, so that Hive can take advantage of the new job and resource management and allocation architecture.
Better documentation and manuals: the user documents are sparse and incomplete.
Documentation about the source code: for open source products without professional technical support, a clear way to help users navigate the source code modules and understand the design ideas is very important.
References
Book: Programming Hive, by Jason Rutherglen, Dean Wampler, and Edward Capriolo.
Home page: http://hive.apache.org; check its wiki for links to presentations and manuals.
Source code: http://svn.apache.org/viewvc/hive; this is the only place to find out why the feature you like does not work, or why your query encounters an error.
Hive internals: http://www.slideshare.net/mobile/recruitcojp/internal-hive
Side note: comparing with your understanding of Oracle, especially query optimization, will help you understand Hive and make it perform better. However, the Hive developers might take more hints from MySQL, which is also an open source project.