
Apache Hive

Hive
• A data warehousing package built on top of
Hadoop.
• Used for data analysis on structured data.
• Targeted towards users comfortable with SQL.
• Its query language is similar to SQL and is called HiveQL.
• Abstracts the complexity of Hadoop.
• No Java is required.
• Developed by Facebook.
Features of Hive
How is it Different from SQL
•The major difference is that a Hive query
executes on a Hadoop infrastructure rather than
a traditional database.
•This allows Hive to handle huge data sets - data
sets so large that high-end, expensive, traditional
databases would fail.
•Internally, a Hive query executes as a
series of automatically generated MapReduce
jobs.
When not to use Hive
• Semi-structured or completely unstructured data.
• Hive is not designed for online transaction
processing.
• It is best for batch jobs over large sets of data.
• Latency for Hive queries is generally high (on the
order of minutes), even when data sets are very
small (say a few hundred megabytes).
• It cannot be compared with systems such as
Oracle, where analyses are conducted on a
significantly smaller amount of data.
Install Hive
•To install Hive:
• Untar the .gz file: tar -xvzf hive-0.13.0-bin.tar.gz
•To initialize the environment variables, export the
following:
• export HADOOP_HOME=/home/usr/hadoop-0.20.2
(Specifies the location of the Hadoop installation
directory.)
• export HIVE_HOME=/home/usr/hive-0.13.0-bin
(Specifies the location of the Hive installation.)
• export PATH=$PATH:$HIVE_HOME/bin
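As a quick sanity check (a sketch, assuming the variables above are exported and Hadoop is reachable), the shell can be invoked non-interactively:
• hive -e 'SHOW DATABASES;'
A successful run prints at least the built-in default database.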

Hive configurations
• Hive's default configuration is stored in the hive-default.xml
file in the conf directory.
• Hive comes configured to use Derby as the metastore.
Hive Modes
To start the hive shell, type hive and Enter.
• Hive in Local mode
 No HDFS is required; all files are processed on the local file
system.
 hive> SET mapred.job.tracker=local;
• Hive in MapReduce(hadoop) mode
 hive> SET mapred.job.tracker=master:9001;
Introducing data types
• The primitive data types in Hive include integers,
Boolean, floating point, Date, Timestamp and Strings.
• The table below lists the size of each type:
Type       Size
---------------------------------------------------------
TINYINT    1 byte
SMALLINT   2 bytes
INT        4 bytes
BIGINT     8 bytes
FLOAT      4 bytes (single-precision floating point)
DOUBLE     8 bytes (double-precision floating point)
BOOLEAN    TRUE/FALSE value
STRING     max size is 2 GB
• Complex data types: ARRAY, MAP, STRUCT (see the example below).
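A minimal sketch of a table that uses all three complex types (the table and column names here are illustrative, not from the slides):
CREATE TABLE employee_complex (
  name     STRING,
  skills   ARRAY<STRING>,
  dept_ids MAP<STRING, INT>,
  address  STRUCT<street:STRING, city:STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
Individual elements are then addressed as skills[0], dept_ids['HR'] and address.city in SELECT statements.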
Configuring Hive
• Hive is configured using an XML configuration file called
hive-site.xml, located in Hive's conf directory.
• Execution engines
 Hive was originally written to use MapReduce as its execution engine,
and that is still the default.
 Apache Tez can be used as the execution engine instead, and work is
underway to support Spark as well. Both Tez and Spark are general
directed acyclic graph (DAG) engines that offer more flexibility and
higher performance than MapReduce.
 It is easy to switch the execution engine on a per-query basis, so you
can see the effect of a different engine on a particular query.
 Set Hive to use Tez: hive> SET hive.execution.engine=tez;
 The execution engine is controlled by the hive.execution.engine
property, which defaults to “mr” (for MapReduce).
Hive Architecture
Components
• Thrift Client
 It is possible to interact with Hive from any
programming language that supports the Thrift protocol, e.g.
 Python
 Ruby
• JDBC Driver
 Hive provides a pure Java JDBC driver that allows Java applications
to connect to Hive, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver
• ODBC Driver
 An ODBC driver allows applications that support the ODBC
protocol to connect to Hive
Components
• Metastore
 This is the central repository for Hive metadata.
 By default, Hive is configured to use Derby as the metastore. As a result of the
configuration, a metastore_db directory is created in each working folder.
• What are the problems with the default metastore?
 Users cannot see the tables created by others if they do not use the same
metastore_db.
 Only one embedded Derby database can access the database files at any given
point of time.
 This results in only one open Hive session with a metastore; it is not possible to
have multiple sessions when Derby is the metastore.
Solution
 We can use a standalone database, either on the same machine or on a
remote machine, as the metastore; any JDBC-compliant database can be used.
 MySQL is a popular choice for the standalone metastore.
Configuring MySQL as metastore
 Install MySQL Admin/Client
 Create a Hadoop user and grant permissions to the user
 mysql -u root -p
 mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
 mysql> GRANT ALL ON *.* TO 'hadoop'@'localhost' WITH GRANT OPTION;
 Modify the following properties in hive-site.xml to use MySQL instead of Derby. This creates a
database in MySQL named Hive:
 name : javax.jdo.option.ConnectionURL
 value :
jdbc:mysql://localhost:3306/Hive?createDatabaseIfNotExist=true
 name : javax.jdo.option.ConnectionDriverName
 value : com.mysql.jdbc.Driver
 name : javax.jdo.option.ConnectionUserName
 value : hadoop
 name : javax.jdo.option.ConnectionPassword
 value : hadoop
Hive Program Structure
• The Hive Shell
 The shell is the primary way that we will interact with Hive, by issuing
commands in HiveQL.
 HiveQL is heavily influenced by MySQL, so if you are familiar with
MySQL, you should feel at home using Hive.
 The command must be terminated with a semicolon to tell Hive to
execute it.
 HiveQL is generally case insensitive.
 The Tab key will autocomplete Hive keywords and functions.
• Hive can run in non-interactive mode.
 Use the -f option to run the commands in a specified file:
 hive -f script.hql
 For short scripts, you can use the -e option to specify the commands
inline, in which case the final semicolon is not required.
 hive -e 'SELECT * FROM dummy'
Ser-de
• A SerDe is a combination of a Serializer and a
Deserializer (hence, Ser-De).
• The Serializer takes a Java object that Hive
has been working with and turns it into something that
Hive can write to HDFS or another supported system.
• The Serializer is used when writing data, such as through an
INSERT-SELECT statement.
• The Deserializer interface takes a string or binary
representation of a record and translates it into a Java
object that Hive can manipulate.
• The Deserializer is used at query time to execute SELECT
statements.
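A table can also name its SerDe explicitly. The sketch below is not from the original slides (the table name is illustrative); it uses Hive's built-in LazySimpleSerDe with a tab delimiter:
CREATE TABLE serde_demo (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '\t')
STORED AS TEXTFILE;
At query time the Deserializer turns each stored line back into id and name; the Serializer is used when the table is written to with an INSERT statement.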
Hive Tables
A Hive table is logically made up of the data stored in HDFS and the
associated metadata describing the layout of that data, kept in the metastore.
• Managed Table
 When you create a table in Hive and load data into a managed table, it is moved into
Hive’s warehouse directory.
 CREATE TABLE managed_table (dummy STRING);
 LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
• External Table
 Alternatively, you may create an external table, which tells Hive to refer to the data that
is at an existing location outside the warehouse directory.
 The location of the external data is specified at table creation time:
 CREATE EXTERNAL TABLE external_table (dummy STRING)
 LOCATION '/user/tom/external_table';
 LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
• When you drop an external table, Hive will leave the data untouched and
only delete the metadata.
• Hive does not do any transformation while loading data into tables. Load
operations are currently pure copy/move operations that move data files
into locations corresponding to Hive tables.
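To see the difference in practice, dropping the two tables created above behaves differently (a sketch based on the statements above):
DROP TABLE managed_table;   -- removes both the metadata and the data files in the warehouse directory
DROP TABLE external_table;  -- removes only the metadata; the files under /user/tom/external_table remain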
Storage Format
Text File
When you create a table with no ROW FORMAT or STORED AS
clauses, the default format is delimited text with one row per
line. For example:
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Storage Format
RC: Record Columnar File
The RC format was designed for clusters with MapReduce in
mind. It is a huge step up over standard text files. It’s a mature
format with ways to ingest into the cluster without ETL. It is
supported in several hadoop system components.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
Storage Format
ORC: Optimized Row Columnar File
The ORC format is available from Hive 0.11 onwards. As the name
implies, it is more optimized than the RC format. If you want to
hold onto speed and compress the data as much as possible,
then ORC is best. (It can also be requested simply as STORED AS ORC.)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
Practice Session
• CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database_name>;
  e.g. hive> CREATE SCHEMA testdb;
• SHOW DATABASES;
• DROP SCHEMA userdb;
• CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [ROW FORMAT row_format]
  [STORED AS file_format];
• Loading data:
  LOAD DATA [LOCAL] INPATH 'hdfs_file_or_directory_path' INTO TABLE table_name;
Create Table
• Managed Table
CREATE TABLE Student (sno int, sname string, year
int) row format delimited fields terminated by ',';
• External Table
CREATE EXTERNAL TABLE Student(sno int, sname
string, year int) row format delimited fields
terminated by ',' LOCATION '/user/external_table';
Load Data to table
To store the local files to hive location
• LOAD DATA local INPATH
'/home/cloudera/SampleDataFile/student_ma
rks.csv' INTO table Student;
To store file located in HDFS file system to hive
table location
• LOAD DATA INPATH
'/user/cloudera/Student_Year.csv' INTO table
Student;
Table Commands
• Insert Data
 INSERT OVERWRITE TABLE targettable
select col1, col2 from source (to overwrite data)
 INSERT INTO TABLE targettbl
select col1, col2 from source (to append data)
• Multitable insert
 From sourcetable
INSERT OVERWRITE TABLE table1
select col1,col2 where condition1
INSERT OVERWRITE TABLE table2
select col1,col2 where condition2
• Create table..as Select
 Create table table1 as select col1,col2 from source;
• Create a new table with existing schema like other table
 Create table newtable like existingtable;
Database Commands
• Display the list of all databases
 Show Databases;
• Create a new database with default properties
 Create Database DBName;
• Create a database with a comment
 Create Database DBName comment 'holds backup data';
• Use a database
 Use DBName;
• View the database details
 DESCRIBE DATABASE EXTENDED DbName;
Table Commands
• To list all tables
Show Tables;
• Displaying all contents of the table
select * from <table-name>;
select * from Student_Year where year = 2011;
• Display header information along with Data
set hive.cli.print.header=true;
• Using Group by
select year,count(sno) from Student_Year group by
year;
Table Commands
• SubQueries
 A subquery is a SELECT statement that is embedded in another SQL
statement.
 Hive has limited support for subqueries, permitting a subquery in the
FROM clause of a SELECT statement, or in the WHERE clause in certain
cases.
 The following query finds the average maximum temperature for
every year and weather station:
SELECT year, AVG(max_temperature)
FROM (
SELECT year, MAX(temperature) AS max_temperature
FROM records2
GROUP BY year
) mt
GROUP BY year;
Table Commands
Alter table
• To add a column
 ALTER TABLE student ADD COLUMNS (Year string);
• To modify a column
 ALTER TABLE table_name CHANGE old_col_name new_col_name
new_data_type;
• To rename a table
 Alter table Employee RENAME to emp;
• To drop a partition
 ALTER TABLE MyTable DROP PARTITION (age=17);
• DROP TABLE
 DROP TABLE operatordetails;
• Describe the table schema
 Desc Employee;
 Describe extended Employee; -- displays detailed information
View
• A view is a sort of “virtual table” that is defined by a SELECT
statement.
• Views may also be used to restrict users’ access to particular
subsets of tables that they are authorized to see.
• In Hive, a view is not materialized to disk when it is created; rather,
the view’s SELECT statement is executed when the statement that
refers to the view is run.
• Views are included in the output of the SHOW TABLES command,
and you can see more details about a particular view, including the
query used to define it, by issuing the DESCRIBE EXTENDED
view_name command.
 Create a view
 CREATE VIEW view_name (id, name) AS SELECT id, name FROM users;
 Drop a view
 DROP VIEW viewName;
Joins
• Only equality joins, outer joins, and left semi
joins are supported in Hive.
• Hive does not support non-equality join
conditions, because such conditions are very
difficult to express as a MapReduce job.
• More than two tables can be joined in Hive.
Example-Join
• hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
• hive> SELECT * FROM items;
2 Tie
4 Coat
3 Hat
1 Scarf
Table Commands
• Using Join
 One of the nice things about using Hive, rather than raw MapReduce,
is that Hive makes performing commonly used operations very simple.
• We can perform an inner join on the two tables as follows:
 hive> SELECT sales.*, items.* FROM sales JOIN items ON (sales.id =
items.id);
 hive> SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1);
• You can see how many MapReduce jobs Hive will use for any particular
query by prefixing it with the EXPLAIN keyword.
• For even more detail, prefix the query with EXPLAIN EXTENDED.
 EXPLAIN SELECT sales.*, items.* FROM sales JOIN items ON (sales.id
= items.id);
Table Commands
• Outer joins
 Outer joins allow you to find non-matches in the
tables being joined.
 hive> SELECT sales.*, items.* FROM sales LEFT
OUTER JOIN items ON (sales.id = items.id);
 hive> SELECT sales.*, items.* FROM sales RIGHT
OUTER JOIN items ON (sales.id = items.id);
 hive>SELECT sales.*, items.* FROM sales FULL
OUTER JOIN items ON (sales.id = items.id);
Map Side Join
• If all but one of the tables being joined are small, the join can be
performed as a map-only job.
• The query does not need a reducer: for every mapper of a, table b is read
completely. A restriction is that a FULL/RIGHT OUTER JOIN b cannot be
performed.
• SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;
Partitioning in Hive
• Using partitions, you can make it faster to execute
queries on slices of the data.
• A table can have one or more partition columns.
• A separate data directory is created for each distinct
value combination in the partition columns.
Partitioning in Hive
• Partitions are defined at table creation time using the
PARTITIONED BY clause.
Static Partition (Example-1)
 CREATE TABLE student_partnew (name STRING,id
int,marks String) PARTITIONED BY (pyear STRING) row
format delimited fields terminated by ',';
 LOAD DATA LOCAL INPATH '/home/notroot/std_2011.csv'
INTO TABLE student_partnew PARTITION (pyear='2011');
 LOAD DATA LOCAL INPATH '/home/notroot/std_2012.csv'
INTO TABLE student_partnew PARTITION (pyear='2012');
 LOAD DATA LOCAL INPATH '/home/notroot/std_2013.csv'
INTO TABLE student_partnew PARTITION (pyear='2013');
Partitioning in Hive
Static Partition (Example-2)
• CREATE TABLE student_New (id int,name string,marks
int,year int) row format delimited fields terminated by ',';
• LOAD DATA local INPATH
'/home/notroot/Sandeep/DataSamples/Student_new.csv'
INTO table Student_New;
• CREATE TABLE student_part (id int,name string,marks int)
PARTITIONED BY (pyear STRING);
• INSERT into TABLE student_part PARTITION(pyear='2012')
SELECT id,name,marks from student_new WHERE
year='2012';
SHOW Partition
• SHOW PARTITIONS month_part;
Partitioning in Hive
Dynamic Partition
• To enable dynamic partitions
 set hive.exec.dynamic.partition=true;
(Enables dynamic partitions; by default it is false.)
 set hive.exec.dynamic.partition.mode=nonstrict;
(To allow all partition columns of a table to be determined dynamically,
nonstrict mode must be enabled.)
 set hive.exec.max.dynamic.partitions.pernode=300;
(The default value is 100; modify it according to the
number of partitions expected in your case.)
 set hive.exec.max.created.files=150000;
(The default value is 100000, but for larger tables the number of created
files can exceed the default, so it may need to be increased.)
Partitioning in Hive
• CREATE TABLE Stage_oper_Month (oper_id string, Creation_Date string,
oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status
string, EYEAR STRING, EMONTH STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',';
• LOAD DATA local INPATH
'/home/notroot/Sandeep/DataSamples/user_info.csv' INTO TABLE
Stage_oper_Month;
• CREATE TABLE Fact_oper_Month (oper_id string, Creation_Date string, oper_name
String, oper_age int, oper_dept String, oper_dept_id int) PARTITIONED BY
(opr_status string, eyear STRING, eMONTH STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month INSERT OVERWRITE TABLE Fact_oper_Month
PARTITION (opr_status, eyear, eMONTH) SELECT oper_id, Creation_Date,
oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH
DISTRIBUTE BY opr_status, eyear, eMONTH;
• Select from the partitioned table:
 Select oper_id, oper_name, oper_dept from Fact_oper_Month where
eyear=2010 and emonth=1;
Bucketing Features
• Partitioning gives effective results when there is a limited number of
partitions and the partitions are of comparatively equal size.
• To overcome the limitations of partitioning, Hive provides the bucketing
concept, another technique for decomposing table data sets into more
manageable parts.
• Bucketing is based on (hash function of the bucketed
column) mod (total number of buckets).
• Use the CLUSTERED BY clause to divide the table into buckets.
• Bucketing can be done along with partitioning on Hive tables, and even
without partitioning.
• Bucketed tables create almost equally distributed data file parts.
• To populate a bucketed table, we need to set the property
 set hive.enforce.bucketing = true;
Bucketing Advantages
• Bucketed tables offer more efficient sampling than non-bucketed
tables. With sampling, we can try out queries on a fraction of the data
for testing and debugging purposes when the original data sets are
very large.
• As the data files are equal sized parts, map-side joins will be faster
on bucketed tables than non-bucketed tables. In Map-side join, a
mapper processing a bucket of the left table knows that the
matching rows in the right table will be in its corresponding bucket,
so it only retrieves that bucket (which is a small fraction of all the
data stored in the right table).
• Similar to partitioning, bucketed tables provide faster query
responses than non-bucketed tables.
Bucketing Example
• We can create bucketed tables with the help of the CLUSTERED BY clause and
the optional SORTED BY clause in the CREATE TABLE statement, and the
DISTRIBUTE BY clause in the INSERT ... SELECT statement used to populate them.
• CREATE TABLE Month_bucketed (oper_id string, Creation_Date string,
oper_name String, oper_age int,oper_dept String, oper_dept_id int,
opr_status string, eyear string , emonth string) CLUSTERED BY(oper_id)
SORTED BY (oper_id,Creation_Date) INTO 10 BUCKETS ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
Similar to partitioned tables, we cannot directly load bucketed tables
with the LOAD DATA (LOCAL) INPATH command; instead we need to use an INSERT
OVERWRITE TABLE … SELECT … FROM statement on another table to populate
the bucketed tables.
• INSERT OVERWRITE TABLE Month_bucketed SELECT oper_id,
Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id,
opr_status, EYEAR, EMONTH FROM stage_oper_month DISTRIBUTE BY
oper_id sort by oper_id, Creation_Date;
Partitioning with Bucketing
• CREATE TABLE Month_Part_bucketed (oper_id string,
Creation_Date string, oper_name String, oper_age int,oper_dept
String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear
STRING, eMONTH STRING) CLUSTERED BY(oper_id) SORTED BY
(oper_id,Creation_Date) INTO 12 BUCKETS ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month stg INSERT OVERWRITE TABLE
Month_Part_bucketed PARTITION(opr_status, eyear, eMONTH)
SELECT stg.oper_id, stg.Creation_Date, stg.oper_name,
stg.oper_age, stg.oper_dept, stg.oper_dept_id, stg.opr_status,
stg.EYEAR, stg.EMONTH DISTRIBUTE BY opr_status, eyear,
eMONTH;
Note: Unlike partition columns (which are not included in the table's
column definitions), bucketed columns are included in the table
definition, as shown in the above code
for the oper_id and Creation_Date columns.
Table Sampling in Hive
Table sampling in Hive means extracting a small fraction of data from
the original large data set. It is similar to the LIMIT operator in Hive.
Difference between LIMIT and TABLESAMPLE in Hive:
 In many cases a LIMIT clause executes the entire query and then only returns
a limited number of results.
 Sampling, in contrast, selects only a portion of the data to run the query against.
To see the performance difference between bucketed and non-bucketed
tables.
 Query-1: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept
FROM month_bucketed TABLESAMPLE(BUCKET 12 OUT OF 12 ON oper_id);
 Query-2: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept
FROM stage_oper_month limit 18;
Note: Query-1 should always perform faster than Query-2.
To perform random sampling with Hive
 SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM
month_bucketed TABLESAMPLE (1 percent);
Hive UDF
• A UDF is Java code that must satisfy the following two properties:
• The UDF must implement at least one evaluate() method
• The UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF
Sample UDF
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) {
      return null;
    }
    return new Text(s.toString().toLowerCase());
  }
}
• hive> add jar my_jar.jar;
• hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
• hive> select empid , my_lower(empname) from employee;
Hive UDAF
• A UDAF works on multiple input rows and creates a single output
row. Aggregate functions include such functions as COUNT and
MAX.
• An aggregate function is more difficult to write than a regular UDF.
• A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF
• It must contain one or more nested static classes implementing
org.apache.hadoop.hive.ql.exec.UDAFEvaluator
An evaluator must implement five methods:
• init()
 The init() method initializes the evaluator and resets its internal
state.
 In MaximumIntUDAFEvaluator, we set the IntWritable object
holding the final result to null.
Hive UDAF
• iterate()
 The iterate() method is called every time there is a new value to be
aggregated. The evaluator should update its internal state with the result
of performing the aggregation. The arguments that iterate() takes
correspond to those in the Hive function from which it was called.
 In this example, there is only one argument. The value is first checked to
see whether it is null, and if it is, it is ignored. Otherwise, the result
instance variable is set either to value’s integer value (if this is the first
value that has been seen) or to the larger of the current result and value
(if one or more values have already been seen). We return true to indicate
that the input value was valid.
• terminatePartial()
 The terminatePartial() method is called when Hive wants a result for the
partial aggregation. The method must return an object that encapsulates
the state of the aggregation.
 In this case, an IntWritable suffices because it encapsulates either the
maximum value seen or null if no values have been processed.
Hive UDAF
• merge()
 The merge() method is called when Hive decides to combine one
partial aggregation with another. The method takes a single object,
whose type must correspond to the return type of the
terminatePartial() method.
 In this example, the merge() method can simply delegate to the
iterate() method because the partial aggregation is represented in the
same way as a value being aggregated. This is not generally the
case (we’ll see a more general example later), and the method should
implement the logic to combine the evaluator’s state with the state of
the partial aggregation.
• terminate()
 The terminate() method is called when the final result of the
aggregation is needed. The evaluator should return its state as a value.
 In this case, we return the result instance variable.
Hive UDAF
package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class HiveUDAFSample extends UDAF {

  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {

    private IntWritable result;

    public void init() {
      result = null;
    }

    public boolean iterate(IntWritable value) {
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }

    public IntWritable terminatePartial() {
      return result;
    }

    public boolean merge(IntWritable other) {
      return iterate(other);
    }

    public IntWritable terminate() {
      return result;
    }
  }
}
Hive UDAF
To use the UDAF in Hive:
• hive> add jar my_jar.jar;
• hive> CREATE TEMPORARY FUNCTION maximum AS
'com.hadoopbook.hive.HiveUDAFSample';
• hive>SELECT maximum(salary) FROM employee;
Performance Tuning
Partitioning Tables:
• Hive partitioning is an effective method to improve
query performance on larger tables. Partitioning allows
you to store data in separate sub-directories under the
table location. It greatly helps queries that filter
on the partition key(s). Although the
selection of the partition key is always a sensitive decision,
it should always be a low-cardinality attribute; e.g. if your
data is associated with a time dimension, then date
could be a good partition key. Similarly, if the data is
associated with location, like a country or state, then
it's a good idea to have hierarchical partitions like
country/state.
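A minimal sketch of such a hierarchical partition layout (the table and column names are illustrative):
CREATE TABLE sales_log (txn_id STRING, amount DOUBLE)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- every distinct (country, state) pair gets its own sub-directory,
-- e.g. .../sales_log/country=IN/state=KA/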
Performance Tuning
De-normalizing data:
• Normalization is a standard process used to model
your data tables with certain rules to deal with
redundancy of data and anomalies. In simpler words, if
you normalize your data sets, you end up creating
multiple relational tables which can be joined at the
run time to produce the results. Joins are expensive
and difficult operations to perform and are one of the
common reasons for performance issues. Because of
that, it’s a good idea to avoid highly normalized table
structures because they require join queries to derive
the desired metrics.
Performance Tuning
Compress map/reduce output:
• Compression techniques significantly reduce the intermediate data
volume, which internally reduces the amount of data transfers
between mappers and reducers. All this generally occurs over the
network. Compression can be applied on the mapper and reducer
output individually. Keep in mind that gzip compressed files are not
splittable. That means this should be applied with caution. A
compressed file size should not be larger than a few hundred
megabytes. Otherwise it can potentially lead to an imbalanced job.
• Other compression codec options include Snappy, LZO, bzip2, etc.
• For map output compression, set mapred.compress.map.output to
true.
• For job output compression, set mapred.output.compress to true.
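A sketch of the session-level settings described above; Snappy is shown only as one possible codec, and these are the classic mapred.* property names used by older Hadoop releases such as the one assumed in this deck:
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;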
Performance Tuning
Map join:
• Map joins are really efficient if a table on the
other side of a join is small enough to fit in
memory. Hive supports the parameter
hive.auto.convert.join, which, when set to
"true", tells Hive to try to convert joins to map
joins automatically. Make sure this
auto-conversion is enabled in the Hive
environment before relying on it.
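Reusing the sales and items tables from the earlier join example, a sketch of an auto-converted map join looks like this (Hive decides whether items is small enough; the size threshold is governed by the hive.mapjoin.smalltable.filesize property):
set hive.auto.convert.join=true;
SELECT sales.*, items.* FROM sales JOIN items ON (sales.id = items.id);
-- if items fits in memory, Hive converts this into a map-only join automatically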
Performance Tuning
Bucketing:
• Bucketing improves the join performance if the bucket key and join
keys are common. Bucketing in Hive distributes the data in different
buckets based on the hash results on the bucket key. It also reduces
the I/O scans during the join process if the process is happening on
the same keys (columns).
• Additionally it’s important to ensure the bucketing flag is set (SET
hive.enforce.bucketing=true;) every time before writing data to the
bucketed table. To leverage the bucketing in the join operation we
should SET hive.optimize.bucketmapjoin=true. This setting hints to
Hive to do bucket level join during the map stage join. It also
reduces the scan cycles to find a particular key because bucketing
ensures that the key is present in a certain bucket.
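Putting those settings together, a sketch of a bucket map join on the bucketed column oper_id (shown as a self-join of the Month_bucketed table created earlier, purely for illustration):
set hive.enforce.bucketing=true;
set hive.optimize.bucketmapjoin=true;
SELECT a.oper_id, a.oper_name
FROM Month_bucketed a JOIN Month_bucketed b ON (a.oper_id = b.oper_id);
-- each mapper reads one bucket of a and only the matching bucket of b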
Performance Tuning
Parallel execution:
• Hive queries are internally translated into a
number of MapReduce jobs, but having
multiple MapReduce jobs is not enough; the real
advantage comes from executing them in
parallel, and, as noted above, simply writing a
query does not achieve this.
• SELECT table1.a FROM
table1 JOIN table2 ON (table1.a =table2.a )
join table3 ON (table3.a=table1.a)
join table4 ON (table4.b=table3.b);
• Output: Execution time: 800 sec
Checking the execution plan for this query shows:
• Total MapReduce jobs: 2.
• They are launched and run serially.
Performance Tuning
Parallel execution:
• To achieve this, we rewrote the query in a
way that segregates it into independent units which
Hive can execute as independent MapReduce jobs
running in parallel. The rewritten query:
• SELECT r1.a FROM
(SELECT table1.a FROM table1 JOIN table2 ON table1.a
=table2.a ) r1
JOIN
(SELECT table3.a FROM table3 JOIN table4 ON table3.b
=table4.b ) r2
ON (r1.a =r2.a) ;
• Output: Same results, but execution time: 464 sec
Observations:
• Total MapReduce jobs: 5.
• The jobs are launched and run in parallel.
• The query execution time decreased (by around 50% in our case).
Points to note:
• The hive.exec.parallel parameter needs to be set to TRUE.
• To control how many jobs at most can be executed in
parallel, set the hive.exec.parallel.thread.number parameter
(see the settings sketch below).
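The corresponding settings (a minimal sketch; 8 is the usual default thread count and should be tuned to the cluster's capacity):
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;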
Thank You
• Questions?
• Feedback?
explorehadoop@gmail.com