HIVE
www.edureka.co/big-data-and-hadoop
Course Topics
Module 1 » Understanding Big Data and Hadoop
Module 2 » Hadoop Architecture and HDFS
Module 3 » Hadoop MapReduce Framework
Module 4 » Advance MapReduce
Module 5 » PIG
Module 6 » HIVE
Module 7 » Advance HIVE and HBase
Module 8 » Advance HBase
Module 9 » Processing Distributed Data with Apache Spark
Module 10 » Oozie and Hadoop Project
Objectives
At the end of this module, you will be able to:
» Understand what Hive is and its use cases
Hive Background
Started at Facebook
Data was collected by nightly cron jobs into Oracle DB
“ETL” via hand-coded Python
Grew from tens of GBs (2006) to 1 TB/day of new data (2007), now 10x that
Traditional RDBMS…
Solution…
Hard to program
Users know SQL well
What is Hive?
Data Warehousing package built on top of Hadoop
Used for data analysis
Targeted towards users comfortable with SQL
Its query language, HiveQL, is similar to SQL
For managing and querying structured data
Abstracts complexity of Hadoop
No need to learn Java and Hadoop APIs
Developed by Facebook and contributed to community
Facebook analyzed several terabytes of data every day using Hive
What is Hive? (Contd.)
Defines a SQL-like query language called QL
Data warehouse infrastructure
Allows programmers to plug in custom mappers and reducers
Provides tools to enable easy data ETL
Where to use Hive?
Data Mining
Customer-facing Business Intelligence
Predictive Modeling, Hypothesis Testing
Why go for Hive When Pig is there?
Pig Latin: Pig is used by programmers and researchers
HiveQL: Hive is used by analysts generating daily reports
Hive Architecture
[Architecture diagram, top to bottom]
Client tools: Karmasphere, Hue, Qubole, others…
Connectivity: JDBC, ODBC
Hive: Driver (compiles, optimizes, executes), Metastore
Hadoop: Master, *Resource Manager, DFS, Name Node
Hive Components
Shell
Driver
Compiler
Execution Engine
Metastore
Metastore
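[Diagram: metastore configurations]
» Hive service JVM: contains the Driver and the Metastore
» Local Metastore: each Hive service JVM runs its own Driver and Metastore, and the metastores share a MySQL database
» Remote Metastore: the Drivers connect to separate Metastore Server JVMs, which use a MySQL database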
Limitations of HIVE?
Abilities of HIVE Query Language
Hive Query Language provides the basic SQL-like operations
Differences with Traditional RDBMS
Schema on Read vs Schema on Write
» Hive does not verify the data when it is loaded, but rather when a query is issued.
» Schema on read makes for a very fast initial load, since the data does not have to be read, parsed and
serialized to disk in the database’s internal format. The load operation is just a file copy or move.
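The schema-on-read idea above can be sketched in plain Java. This is only an illustration, not Hive code: the class name SchemaOnReadSketch, the two-column comma-separated layout, and the rule that an unparseable field is treated as a missing value are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Hive code): loading stores raw lines untouched,
// and the schema is applied only when a query runs.
final class SchemaOnReadSketch {
    private final List<String> rawLines = new ArrayList<>();

    // "Load" is just a copy of the raw text: nothing is parsed or validated.
    void load(List<String> lines) {
        rawLines.addAll(lines);
    }

    // A field that does not match the declared DOUBLE type becomes null.
    private static Double parseDouble(String field) {
        try {
            return Double.valueOf(field.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Query time: split each line, apply the schema, and sum the second
    // column, skipping values that could not be parsed.
    double sumAmount() {
        double total = 0.0;
        for (String line : rawLines) {
            String[] fields = line.split(",");
            Double amount = fields.length > 1 ? parseDouble(fields[1]) : null;
            if (amount != null) {
                total += amount;
            }
        }
        return total;
    }

    int rowCount() {
        return rawLines.size();
    }
}
```

Note that the malformed row is accepted at load time and only shows up as a missing value when the query reads it.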
Type System
Primitive Types
Integers
» TINYINT – 1 byte integer
» SMALLINT – 2 byte integer
» INT – 4 byte integer
» BIGINT – 8 byte integer
Boolean Type
» BOOLEAN – TRUE/FALSE
Complex Types
Complex Types can be built up from primitive types and other composite types using the following three operators:
» Structs
» Maps
» Arrays
Hive Data Models
HIVE Data (in the order of granularity)
Databases
» Namespaces
Tables
» Schemas in namespaces
Partitions
» How data is stored in HDFS
» Grouping data based on some column
» Can have one or more columns
[Figure labels: Databases, Tables, and the example columns timestamp, Userid, IP]
Partitions
Partitioning means dividing a table into coarse-grained parts based on the value of a partition column,
such as a date. This makes queries on slices of the data faster.
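Why partitioning speeds up queries on a slice can be sketched in plain Java. This is an in-memory stand-in, not Hive code: the class PartitionSketch and the choice of txndate as the partition column are assumptions for the example. The directory naming (column=value) follows the convention Hive uses for partition directories.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch (not Hive code): rows are grouped into one "directory"
// per partition value, so a query on one date touches one directory only.
final class PartitionSketch {
    // Partition directory name (e.g. "txndate=2011-06-01") -> rows stored in it.
    private final Map<String, List<String>> partitions = new TreeMap<>();

    void insert(String txndate, String row) {
        String dir = "txndate=" + txndate;
        partitions.computeIfAbsent(dir, k -> new ArrayList<>()).add(row);
    }

    // A filter on the partition column reads a single directory instead of
    // scanning every row of the table.
    List<String> scan(String txndate) {
        List<String> rows = partitions.get("txndate=" + txndate);
        return rows == null ? Collections.<String>emptyList() : rows;
    }

    int partitionCount() {
        return partitions.size();
    }
}
```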
Buckets
Buckets give extra structure to the data that may be used for more efficient queries.
» A join of two tables that are bucketed on the same columns, including the join column, can be implemented as
a Map Side Join.
» Bucketing by user ID means we can quickly evaluate a user based query by running it on a randomized sample
of the total set of users.
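How rows end up in buckets, and why one bucket works as a sample, can be sketched in plain Java. The class BucketSketch is hypothetical, and the sketch assumes that the hash of an integer column value is the value itself and that the bucket index is the non-negative hash modulo the number of buckets.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Hive code) of hash bucketing on a user ID.
final class BucketSketch {
    // Bucket index: non-negative hash of the key, modulo the bucket count.
    // Assumption: the hash of an int value is the value itself.
    static int bucketFor(int userId, int numBuckets) {
        return (Integer.hashCode(userId) & Integer.MAX_VALUE) % numBuckets;
    }

    // Reading a single bucket yields a sample of roughly 1/numBuckets of the
    // users without scanning the whole table.
    static List<Integer> sampleBucket(List<Integer> userIds, int bucket, int numBuckets) {
        List<Integer> sample = new ArrayList<>();
        for (int id : userIds) {
            if (bucketFor(id, numBuckets) == bucket) {
                sample.add(id);
            }
        }
        return sample;
    }
}
```

Because two tables bucketed the same way on the join column place matching keys in buckets with the same index, a join can pair bucket i of one table with bucket i of the other.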
Create Database and Table
Create Database.
» Create database retail;
Use Database.
» Use retail;
Create Database and Table (Contd.)
Create table for storing transactional records.
» Create table txnrecords(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING,
product STRING, city STRING, state STRING, Spendby STRING) row format delimited fields terminated by ','
stored as textfile;
External Tables
Create the table in another HDFS location and not in the warehouse directory
Hive does not delete the data (the HDFS files) even when the table is dropped
It leaves the data untouched; only the metadata about the table is deleted
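The difference in drop behaviour can be sketched with a small in-memory model. This is not Hive code: the class DropTableSketch, the set standing in for HDFS, and the example paths used with it are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (not Hive code): dropping a managed table removes both
// metadata and data; dropping an external table removes the metadata only.
final class DropTableSketch {
    private final Map<String, String> metastore = new HashMap<>(); // table -> data location
    private final Set<String> externalTables = new HashSet<>();
    private final Set<String> hdfsDirs = new HashSet<>();          // stand-in for HDFS

    void createTable(String name, String location, boolean external) {
        metastore.put(name, location);
        hdfsDirs.add(location);
        if (external) {
            externalTables.add(name);
        }
    }

    void dropTable(String name) {
        String location = metastore.remove(name);      // metadata is always deleted
        boolean external = externalTables.remove(name);
        if (location != null && !external) {
            hdfsDirs.remove(location);                 // managed table: data is deleted too
        }
    }

    boolean hasMetadata(String name) {
        return metastore.containsKey(name);
    }

    boolean dataExists(String location) {
        return hdfsDirs.contains(location);
    }
}
```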
Load Data
Load the data into the table.
» LOAD DATA LOCAL INPATH '/home/edureka/txns' OVERWRITE INTO TABLE txnrecords;
Queries
Select
» Select count(*) from txnrecords;
Aggregation
» Select count (DISTINCT category) from txnrecords;
Grouping
» Select category, sum( amount ) from txnrecords group by category;
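For readers more at home in Java than SQL, the three queries above compute the following over the rows of txnrecords. This is a plain-Java sketch of the results only, not of how Hive executes them; the class QuerySketch and the reduced two-column row type Txn are assumptions for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch: what the three queries compute, on in-memory rows.
final class QuerySketch {
    // One transaction row, reduced to the two columns the queries use.
    static final class Txn {
        final String category;
        final double amount;

        Txn(String category, double amount) {
            this.category = category;
            this.amount = amount;
        }
    }

    // Select count(*) from txnrecords;
    static long countAll(List<Txn> rows) {
        return rows.size();
    }

    // Select count (DISTINCT category) from txnrecords;
    static long countDistinctCategory(List<Txn> rows) {
        Set<String> seen = new HashSet<>();
        for (Txn t : rows) {
            seen.add(t.category);
        }
        return seen.size();
    }

    // Select category, sum( amount ) from txnrecords group by category;
    static Map<String, Double> sumByCategory(List<Txn> rows) {
        Map<String, Double> sums = new TreeMap<>();
        for (Txn t : rows) {
            sums.merge(t.category, t.amount, Double::sum);
        }
        return sums;
    }
}
```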
Managing Outputs
Inserting Output into another table.
» INSERT OVERWRITE TABLE results SELECT * FROM txnrecords;
Hive Command Blog
http://www.edureka.co/blog/hive-commands/
Hive Script
Hive scripts are used to execute a set of Hive commands collectively. This reduces the time and effort
invested in writing and executing each command manually.
Hive supports scripting from version 0.10.0 onwards.
[Figure: myqueries.sql, a Hive script]
Hive Script (Contd.)
Command to execute the Hive script: hive -f myqueries.sql
The script runs all the queries one after another in a single go and saves the output in the hive/output
directory.
Hive Script Blog
http://www.edureka.co/blog/apache-hadoop-hive-script/
Joining Two Tables
User Table
Id | Email         | Language | Location
1  | edureka@1.com | EN       | US
2  | edureka@2.com | EN       | GB
3  | edureka@3.com | FR       | FR
Transaction Table
Id | Product Id | UserId | Purchase Amount | Item Description
1  | Prod-1     | 1      | 300             | A jumper
2  | Prod-1     | 2      | 300             | A jumper
3  | Prod-1     | 2      | 300             | A jumper
4  | Prod-2     | 3      | 100             | A rubber chicken
5  | Prod-1     | 3      | 300             | A jumper
Joining Two Tables (Contd.)
Result of joining the User Table and the Transaction Table: the number of distinct user locations for each product
Product | Location
Prod-1  | 3
Prod-2  | 1
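The Product/Location result can be reproduced with a small plain-Java sketch of the join. It is not how Hive executes the query; the class JoinSketch and its method names are hypothetical. It joins each transaction to its user on UserId and then counts the distinct locations per product.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch: join transactions to users, then count the distinct
// user locations for each product.
final class JoinSketch {
    // userLocation: user Id -> Location; purchases: rows of {Product Id, UserId}.
    static Map<String, Integer> distinctLocationsPerProduct(
            Map<Integer, String> userLocation, String[][] purchases) {
        Map<String, Set<String>> locations = new TreeMap<>();
        for (String[] p : purchases) {
            String location = userLocation.get(Integer.valueOf(p[1]));
            if (location == null) {
                continue; // inner join: skip transactions without a matching user
            }
            locations.computeIfAbsent(p[0], k -> new HashSet<>()).add(location);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : locations.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    // The data from the User Table and Transaction Table on the slides.
    static Map<String, Integer> slideExample() {
        Map<Integer, String> userLocation = new HashMap<>();
        userLocation.put(1, "US");
        userLocation.put(2, "GB");
        userLocation.put(3, "FR");
        String[][] purchases = {
            {"Prod-1", "1"}, {"Prod-1", "2"}, {"Prod-1", "2"}, {"Prod-2", "3"}, {"Prod-1", "3"}
        };
        return distinctLocationsPerProduct(userLocation, purchases);
    }
}
```

On the slide data, slideExample() gives 3 for Prod-1 (US, GB, FR) and 1 for Prod-2 (FR).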
Hive UDF
Revisiting Use Case in Healthcare
[Flow diagram, driven by a Hive script]
» Load CSV file into Hive (HDFS)
» Read data from Hive table
» De-identify columns and store the data back in a Hive table
HealthCare UDF
package myudf;

import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64; // assumed: encodeBase64String is the Commons Codec method

// Method of the UDF class (the rest of the class is not shown on the slide).
// Encrypts a string with AES and returns the result Base64-encoded.
private String encrypt(String strToEncrypt, byte[] key) throws GeneralSecurityException
{
Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
cipher.init(Cipher.ENCRYPT_MODE, secretKey);
String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
return encryptedString.trim();
}
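The encrypt method above can be tried outside Hive with a self-contained sketch. This is not the course's UDF: the class DeIdentifySketch is hypothetical, java.util.Base64 is swapped in for the Commons Codec call so that nothing beyond the JDK is needed, and the decrypt method is an addition showing how the de-identified values could be turned back into the originals.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Illustrative sketch: AES round trip with the same cipher settings as the
// encrypt method, using only JDK classes.
final class DeIdentifySketch {
    private static final String TRANSFORMATION = "AES/ECB/PKCS5Padding";

    // Runs the cipher in the given mode; the key must be 16, 24 or 32 bytes.
    private static byte[] run(int mode, byte[] input, byte[] key) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(mode, new SecretKeySpec(key, "AES"));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES operation failed", e);
        }
    }

    // Encrypts the text and returns it Base64-encoded.
    static String encrypt(String plain, byte[] key) {
        byte[] encrypted = run(Cipher.ENCRYPT_MODE, plain.getBytes(StandardCharsets.UTF_8), key);
        return Base64.getEncoder().encodeToString(encrypted);
    }

    // Reverses encrypt: Base64-decodes, then decrypts with the same key.
    static String decrypt(String encoded, byte[] key) {
        byte[] decrypted = run(Cipher.DECRYPT_MODE, Base64.getDecoder().decode(encoded), key);
        return new String(decrypted, StandardCharsets.UTF_8);
    }
}
```

With ECB mode the same input and key always produce the same output, so equal values in a column stay equal after de-identification.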
HealthCare UDF (Contd.)
» Adding the myudf jar.
» Creating a function deIdentify for the UDF.
» Storing the output in a local directory, hive/output.
» Storing the output on HDFS in the out directory.
» The output after decrypting the healthcare dataset.
Assignment for Hive
Refer to the documents present in the LMS under Assignment.
Pre-work
Go through http://www.edureka.in/blog/map-side-join-vs-join/
Agenda for Next Class
Joins in Hive
Dynamic Partitioning in Hive
Custom MapReduce Scripts
Hive UDF
Introduction to HBase
HBase Storage Architecture
Cluster Deployment
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make
the course better!
Please spare a few minutes to take the survey after the webinar.