
BIG DATA ANALYTICS AND DEVELOPERS TRAINING
Hive Advanced
Today’s Objectives
• Use HQL commands to perform DML queries

• Implement joins in Hive

• Perform performance-tuning and query optimization in Hive

• Explain various execution types in Hive

• Explain various Hive files and formats

• Discuss security in Hive


Hive Datatypes
Primary Data Types
• Numbers:
  - TINYINT: 1-byte signed integer
  - SMALLINT: 2-byte signed integer
  - INT: 4-byte signed integer
  - BIGINT: 8-byte signed integer
  - DECIMAL: user-defined precision and scale
  - FLOAT: 4-byte single-precision floating point
  - DOUBLE: 8-byte double-precision floating point
• Strings:
  - VARCHAR: variable length, 1-65535 characters
  - CHAR: fixed character length
• Misc: BOOLEAN, BINARY

Complex Data Types
• ARRAY: ordered collection of elements of the same data type
• MAP: unordered collection of key/value pairs; keys are of a primary data type, values may be of any type
• STRUCT: collection of elements of different data types
• UNION: collection of heterogeneous data types, e.g.
  UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
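The complex types above can be combined in a single table definition. A minimal sketch, using a hypothetical employee table (all table and column names are illustrative):

```sql
-- Hypothetical table illustrating Hive's complex data types
CREATE TABLE employee (
  name        STRING,
  salary      FLOAT,
  skills      ARRAY<STRING>,                       -- ordered, same-type elements
  deductions  MAP<STRING, FLOAT>,                  -- primary-type keys, any-type values
  address     STRUCT<street:STRING, city:STRING, zip:INT>,
  misc        UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<a:INT, b:STRING>>
);
```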
HiveQL Queries
Hive Query Language (HiveQL) queries are written to perform database operations in Hive.

The most common type of HiveQL query is the SELECT statement.

Apart from the SELECT statement, other important HiveQL constructs include the LIMIT clause, nested
queries, CASE...WHEN...THEN expressions, LIKE and RLIKE operators, and GROUP BY queries.

Select Statement
SELECT <column1>, <column2> FROM <table_name>;
Select Columns Whose Names Start with a Given String
SELECT `string.*` FROM <table_name>;
Limit Clause
SELECT * FROM <table_name> limit 10;
Contd.
Nested Queries
SELECT * FROM <table_name> where <condition> <compares> (SELECT <column>
FROM <table_name>);
CASE...WHEN...THEN Queries
SELECT <column1>, CASE WHEN <condition1> THEN <option1> WHEN <condition2>
THEN <option2> ELSE <option3> END AS <column2> FROM <table_name>;
LIKE and RLIKE Queries
SELECT * FROM <table_name> WHERE <column1> LIKE '%string%';
SELECT * FROM <table_name> WHERE <column2> RLIKE '.*(string).*';
For example, 'foobar' RLIKE '^f.*r$' evaluates to TRUE
GROUP BY Queries
SELECT <column1>, <column2> FROM <table_name> GROUP BY <column1>;
HAVING Queries
SELECT <column1>, <column2> FROM <table_name> GROUP BY <column1> HAVING
<column1=value1> OR <column1=value2>;
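Putting the templates above together, here is a sketch on a hypothetical sales table (all table and column names are illustrative):

```sql
-- CASE...WHEN with LIKE and LIMIT on a hypothetical sales table
SELECT region,
       CASE WHEN amount > 1000 THEN 'large'
            WHEN amount > 100  THEN 'medium'
            ELSE 'small' END AS size_bucket
FROM sales
WHERE product LIKE '%phone%'
LIMIT 10;

-- GROUP BY with HAVING
SELECT region, count(*) FROM sales GROUP BY region HAVING count(*) > 5;
```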
Manipulating Column Values Using Functions
Types of Functions in Hive:
• Built-in functions
• User-defined functions

Built-in functions in Hive include arithmetic, mathematical, and aggregate functions.

Applying Arithmetic Functions
SELECT <column1> + <column2> FROM <table_name>;
Applying Mathematical Functions
SELECT round(<column>) FROM <table_name>;
Applying Aggregate Functions
SELECT count(*) FROM <table_name>;
SELECT avg(<column>) FROM <table_name>;
String functions such as upper() are also built in:
SELECT upper(<column>) FROM <table_name>;
User-Defined Functions
Creating a user-defined function requires writing a Java class that extends
org.apache.hadoop.hive.ql.exec.UDF and implements the evaluate() method to define the
custom behavior.
The full signature of one evaluate() variant could be "public Text evaluate(Text s)".
The steps involved in this process are:

1. Creation of a Java class.


2. Creation of a JAR file from the class.
3. Creation of the function from the JAR file and the Java file created in the previous
steps.
Add the JAR File to Hive
ADD JAR [Location of the JAR File]/<jar_file_name>;
Create the User-Defined Function touppercase
CREATE TEMPORARY FUNCTION touppercase AS '[Java class location].<class_name>';
Use the User Defined Function
SELECT touppercase(<column>) FROM <table_name>;
Check Your Understanding
Q.1. Write a user-defined function (program) to convert the first letter of a word to
uppercase.
Understand Some More Java-Based Hive
package com.taps.hadoop.bdsession.hive.advanced;

import org.apache.hadoop.hive.ql.exec.UDF; // Base class for simple Hive UDFs
import org.apache.hadoop.io.Text;          // Implement evaluate() to define your custom String manipulations

public final class FUpper extends UDF {

    public Text evaluate(final Text data) {
        if (data == null) {
            return null;
        }
        String s = data.toString();
        // Uppercase the first character and append the rest unchanged
        return new Text(s.substring(0, 1).toUpperCase() + s.substring(1));
    }
}

CREATE TEMPORARY FUNCTION fupper AS 'com.taps.hadoop.bdsession.hive.advanced.FUpper';

hive> SELECT fupper(title) FROM titles GROUP BY fupper(title);


JOINS in Hive
Types of Joins:
• Inner Join
• Outer Join (Left Outer Join, Right Outer Join, Full Outer Join)
• Cross Join

Inner Join
SELECT <col1>, <col2> FROM <table1> t1 JOIN <table2> t2 ON (<condition>);
Left Outer Join
SELECT <col1>, <col2> FROM <table1> LEFT OUTER JOIN <table2> ON (<condition>);
Right Outer Join
SELECT <col1>, <col2> FROM <table1> RIGHT OUTER JOIN <table2> ON (<condition>);
Full Outer Join
SELECT <col1>, <col2> FROM <table1> FULL OUTER JOIN <table2> ON (<condition>);
Cross Join
SELECT * FROM <table1> JOIN <table2>;
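A concrete sketch of the join templates above, using hypothetical customers and orders tables (all names are illustrative):

```sql
-- Inner join: only customers that have at least one order
SELECT c.name, o.order_id
FROM customers c JOIN orders o ON (c.id = o.customer_id);

-- Left outer join: all customers, with NULL order_id where no order exists
SELECT c.name, o.order_id
FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id);
```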
Hive Best Practices

The Hive Best Practices Involve

Use of Partitions: Dividing data into small partitions to make it manageable

Denormalization: Hive does not require normalization to avoid complexity

Use of Bucketing: Distributing data into user-defined clusters by calculating a hash code

Apart from these points, effective use of compressions is also one of the Hive best practices.
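The partitioning, bucketing, and compression practices above can be sketched as follows; the table and column names are hypothetical, and hive.exec.compress.intermediate / hive.exec.compress.output are the standard Hive compression settings:

```sql
-- Hypothetical table partitioned by day and bucketed by user id
CREATE TABLE page_views (
  user_id INT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Enable compression of intermediate and final job output
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
```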
Performance-Tuning and Query Optimizations
Tuning or optimizing Hive queries requires an understanding of how a Hive query works.
To see how a Hive query is executed, the EXPLAIN command is used.
Usually, the output of an EXPLAIN command consists of three parts:
• Abstract syntax tree
• Different stage dependencies
• Description of each stage

The LIMIT command restricts the output to a certain number of rows, but only after it
has processed the complete result. To optimize this, Hive allows setting the parameters
shown in the table below.

Parameter Description
hive.limit.optimize.enable Allows sampling on a limited set of files instead of the complete source files.
hive.limit.row.max.size Decides the maximum number of rows LIMIT should consider while executing the query.
hive.limit.optimize.limit.file Decides the maximum number of files to be sampled to run the LIMIT query.
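The LIMIT optimization parameters above are set per session; a sketch with illustrative values:

```sql
-- Sample LIMIT optimization settings (values are illustrative)
SET hive.limit.optimize.enable=true;
SET hive.limit.row.max.size=100000;
SET hive.limit.optimize.limit.file=10;

-- Inspect the resulting plan
EXPLAIN SELECT * FROM <table_name> LIMIT 10;
```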
Various Execution Types
Various execution types in Hive:
• Local Execution
• Parallel Execution
• Indexes
• Speculative Execution

Local Execution: Setting hive.exec.mode.local.auto to TRUE enables Hive to decide whether a
job should be executed on the local node.
Parallel Execution: Hive breaks a job into stages that, by default, are executed sequentially.
Setting hive.exec.parallel to true enables parallel execution of certain stages of the job.

While Hive allows creation of indexes on columns to accelerate the execution of the GROUP BY
command, it also allows speculative execution of jobs by setting the parameters given below:

Parameter Description
mapred.map.tasks.speculative.execution Runs more than one instance of map tasks.
mapred.reduce.tasks.speculative.execution Runs more than one instance of reduce
tasks.
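The execution-type settings above are all session-level properties; a sketch enabling each of them:

```sql
-- Let Hive decide when a job can run in local mode
SET hive.exec.mode.local.auto=true;
-- Run independent job stages in parallel
SET hive.exec.parallel=true;
-- Enable speculative execution of map and reduce tasks
SET mapred.map.tasks.speculative.execution=true;
SET mapred.reduce.tasks.speculative.execution=true;
```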
Hive File and Record Formats
File Formats:
• Text Files
  - Input Format: org.apache.hadoop.mapred.TextInputFormat
  - Output Format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
• Sequence File
• RCFile

Record Formats (SerDes): A SerDe holds the logic to convert unstructured data into
records and is implemented using Java. Commonly used SerDes:
• Regex
• Avro
• ORC
• Thrift

In addition to the Regex and Avro SerDes, JavaScript Object Notation (JSON) is a standard
format that uses readable text to transmit data objects consisting of attribute-value pairs.
The JSON SerDe is used to transmit data between applications.
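A file format is chosen at table-creation time with the STORED AS clause; a sketch for the three formats listed above (table names are illustrative):

```sql
-- Choosing a file format at table-creation time
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq  (line STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_rc   (line STRING) STORED AS RCFILE;
```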
Contd.
Features of JSON SerDe

Read files stored in JSON format

Supports complex JSON nodes such as arrays and maps

Supports nested data structures

Converts processed data to JSON records using the INSERT INTO command.

JSON SerDe can be used by performing the following steps:


1. Build the JSON SerDe source code and create a JAR file from the Java class file.
2. Copy the created JAR to the node where Hive is installed and add the JAR to Hive.
3. Create a table, which uses files containing JSON records, using this SerDe.
4. Use the JSON SerDe to retrieve records.
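Step 3 above might look like the following; the SerDe class name depends on which JSON SerDe build you use (the class shown is from the commonly used openx JSON SerDe and is an assumption here, as are the table and column names):

```sql
-- Table backed by files of JSON records (SerDe class name is an assumption)
CREATE TABLE events (
  user_id INT,
  action  STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
```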
HiveThrift Service
HiveThrift services run on the HiveThrift Server, to which the following clients can connect:
• JDBC
• ODBC
• Python
• PHP

Code for Building a Sample JDBC HiveThrift Client

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
    private static String DRIVER_NAME = "org.apache.hadoop.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(DRIVER_NAME);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            return;
        }
        // Connection URL, user, and password shown here are placeholders
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("SHOW TABLES");
    }
}

Refer to the WCBDD study guide for the complete code and explanation.
Security in Hive
Hive security is, primarily, a matter of two concepts: authentication and authorization.

Setting hive.files.umask.value decides the default permissions for newly created items,
while setting hive.metastore.authorization.storage.checks to TRUE examines whether the
user has the permission to drop a table in Hive. Likewise, authorizations can be added
in Hive.

Privilege Description
ALL All operations
CREATE Privilege to create tables
DROP Privilege to drop tables
ALTER Privilege to alter tables
INDEX Privilege to create an index on a table
LOCK Privilege to lock/unlock tables at the time of concurrency
SELECT Privilege to run SELECT queries
UPDATE Privilege to load or update a table/partition
Points to Remember.
The most common operation in SQL is the SELECT statement.
There are two types of functions in Hive:
Built-in functions
User-defined functions
Hive supports a variety of built-in functions, such as:
Arithmetic functions
Mathematical functions
Aggregate functions
Points to Remember.
Hive, too, supports the LIMIT clause, with which you can restrict the SELECT statement
results to only a specific number of rows.

Hive supports nested queries, where the output of the inner query can be specified as the
input to the outer query.

Hive allows use of CASE statements to enable you to classify records depending on various
inputs.

LIKE and RLIKE operators compare and match strings or substrings from a given set of
records.

Hive supports joining two or more tables together to get useful aggregate information. The
various joins Hive supports are as follows:
Points to Remember.
Inner joins
Outer joins

Map side joins are recommended when you need to join two tables in which one table is
smaller than the other table.

The concept of partitions in Hive helps in maintaining a new partition for every new day
without too much effort.

Hive does not have any primary and foreign keys because it is not meant to run complex
relational queries.

By keeping denormalized data, you avoid multiple disk seeks, which is generally the case
when there are foreign key relations.
Points to Remember.
Hive does a complete table scan to process a query.

Bucketing is an optimization technique similar to partitioning. Buckets distribute the data
load into a user-defined set of clusters by calculating the hash code of the key mentioned
in the query.

Compression of data on HDFS makes data smaller to query on, which ultimately helps in
reducing the query time.

An EXPLAIN output normally consists of the following three parts:


Abstract syntax tree
Different stage dependencies
Description of each stage
Points to Remember.
Speculative execution is a Hadoop feature that invokes duplicate tasks on multiple nodes,
and whichever node completes the task first, that attempt is considered a valid task
attempt.

A sequence file improves performance because it is a file format that contains key value
pairs in binary format.

RCFile is a record columnar file format. RCFiles are designed for:


Fast data load
Fast query execution
Efficient data storage and better disk utilization
Ability to adapt dynamic data access patterns
Points to Remember.
RCFile combines the advantages of row and column store by horizontal and vertical
partitioning strategy.
SerDe is the abbreviated form for Serializer/Deserializer. SerDe holds the logic to convert
unstructured data into records. SerDes are implemented using Java.
Commonly used built-in and third-party SerDes available for Hive are:
Regex
Avro
ORC
Thrift
Points to Remember.
Even though the CLI is the most popular way to access Hive, it is difficult to integrate the
command line with user-facing applications.
To overcome the limitations of the CLI, Hive supports other ways to connect to it and execute
queries, such as the Hive metastore, HiveThrift, and HiveServer.
Hive Server is an optional service that allows a remote client to submit requests to Hive.
Hive supports UNIX-like authentication for the tables and folders created by the user.
Hive supports user, group, and role-based authorizations.
