Overview of Hive

Objectives

After completing this lesson, you should be able to:


• Define Hive
• Describe the Hive data flow
• Create a Hive database

Hive
• Hive is an open source Apache project and was originally
developed by Facebook.
• Hive enables analysts who are familiar with SQL to query
data stored in HDFS by using HiveQL (a SQL-like
language).
• It is an infrastructure built on top of Hadoop that supports
the analysis of large data sets.
• Hive transforms HiveQL queries into standard MapReduce
jobs (a high-level abstraction on top of MapReduce).
• Hive communicates with the JobTracker to initiate the
MapReduce job.


Features

• Uses familiar SQL syntax (HiveQL)
• Interactive
• Scalable: works with "big data" on a cluster
• Best suited to data warehouse applications
• Easy OLAP queries: far easier than writing MapReduce in Java
• Highly optimized

Use Case: Storing Clickstream Data


Defining Tables over HDFS

A table in Hive is mapped to HDFS directories
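
As a minimal sketch of this mapping: a managed table gets its own directory under the warehouse location (set by hive.metastore.warehouse.dir), while an external table points at a directory you name with LOCATION. The table names clicks_managed and clicks_external and the path /user/oracle/demo/clicks are hypothetical and do not come from the lesson.

```sql
-- Managed table: Hive creates and owns a directory for it under the
-- warehouse location (hive.metastore.warehouse.dir).
CREATE TABLE IF NOT EXISTS clicks_managed (
  custid INT,
  activity INT);

-- External table: Hive only records the directory given in LOCATION.
-- The path below is a hypothetical example.
CREATE EXTERNAL TABLE IF NOT EXISTS clicks_external (
  custid INT,
  activity INT)
LOCATION '/user/oracle/demo/clicks';

-- The "Location:" field in the output shows the HDFS directory
-- that backs each table.
DESCRIBE FORMATTED clicks_managed;
DESCRIBE FORMATTED clicks_external;
```

Dropping the managed table deletes its directory and data; dropping the external table removes only the definition and leaves the files in place.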

Defining Tables over HDFS

{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
...

CREATE EXTERNAL TABLE default.movieapp_log_json (
  custid int,
  movieid int,
  genreid int,
  time string,
  recommended string,
  activity int,
  rating int,
  price float,
  position int)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json'

Callouts:
• HiveQL: a simple SQL-like syntax used to query the clickstream data
• ROW FORMAT SERDE: the SerDe option
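
Once the table is defined, the JSON files can be queried like any other table. The queries below are illustrative sketches only: they use columns from the table definition above and make no assumption about what the activity codes mean.

```sql
-- Count log records per activity code.
SELECT activity, COUNT(*) AS events
FROM default.movieapp_log_json
GROUP BY activity;

-- Show a few records that reference a movie.
SELECT custid, movieid, genreid
FROM default.movieapp_log_json
WHERE movieid IS NOT NULL
LIMIT 10;
```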


Another Example

Where is the data?

LOAD DATA
• MOVES data from a distributed filesystem into Hive

LOAD DATA LOCAL
• COPIES data from your local filesystem into Hive
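
A minimal sketch of both forms, assuming a hypothetical staging table applog_stage and hypothetical file paths (none of these names appear in the lesson):

```sql
-- Hypothetical staging table that holds one text line per row.
CREATE TABLE IF NOT EXISTS applog_stage (line STRING);

-- Source is a path in the distributed filesystem (HDFS):
-- the file is MOVED into the table's directory.
LOAD DATA INPATH '/user/oracle/incoming/applog_1.txt'
INTO TABLE applog_stage;

-- Source is a path on the local filesystem:
-- the file is COPIED, so the local original remains.
LOAD DATA LOCAL INPATH '/tmp/applog_2.txt'
INTO TABLE applog_stage;
```

INTO TABLE appends to whatever the table already holds; OVERWRITE INTO TABLE replaces the existing contents instead.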


Partitioning
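
A minimal sketch of partitioning, assuming a hypothetical table clicks_by_day partitioned by a log_date string column (neither name comes from the lesson). Each partition value is stored in its own subdirectory of the table directory, so a query that filters on the partition column reads only the matching directories.

```sql
-- The partition column is declared separately from the data columns.
CREATE TABLE IF NOT EXISTS clicks_by_day (
  custid INT,
  movieid INT,
  activity INT)
PARTITIONED BY (log_date STRING);

-- Register one partition; its data lives in a subdirectory
-- named log_date=2012-07-01 under the table directory.
ALTER TABLE clicks_by_day
ADD IF NOT EXISTS PARTITION (log_date = '2012-07-01');

-- List the partitions the metastore knows about.
SHOW PARTITIONS clicks_by_day;

-- Filtering on the partition column restricts the scan
-- to that partition's directory.
SELECT COUNT(*)
FROM clicks_by_day
WHERE log_date = '2012-07-01';
```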

Hive: Data Units

Databases

Tables

Partitions


The Hive Metastore Database

• Contains metadata regarding databases, tables, and
partitions
• Contains information about how the rows and columns are
delimited in the HDFS files that are used in the queries
• Is an RDBMS database, such as MySQL, where Hive
persists table schemas and other system metadata
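
The statements below are a small sketch of how this metadata can be inspected from HiveQL. They are answered from the metastore and do not scan the table's data files. The table name is the one defined earlier in this lesson; substitute any table that exists in your environment.

```sql
-- Databases and tables registered in the metastore.
SHOW DATABASES;
SHOW TABLES IN default;

-- Column names and types for one table.
DESCRIBE default.movieapp_log_json;

-- Full stored metadata: location, SerDe library,
-- input and output formats, and table type.
DESCRIBE FORMATTED default.movieapp_log_json;
```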

Hive Framework

External interfaces: CLI, JDBC, Hue

HiveServer2


Ways to use Hive

■ Interactive via hive> prompt / command-line interface (CLI)
■ Through Ambari / Hue
■ Through JDBC/ODBC server
■ Through Thrift service
■ Via Oozie

Creating a Hive Database
1. Start hive.

2. Create the database.

3. Verify the database creation.
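
As a sketch, steps 2 and 3 might look like the following once the hive command has opened the hive> prompt. The database name moviework_demo is a hypothetical placeholder, not a name used in the lesson.

```sql
-- Step 2: create the database (no error if it already exists).
CREATE DATABASE IF NOT EXISTS moviework_demo;

-- Step 3: verify that it appears in the list of databases.
SHOW DATABASES;

-- Optional: make it the current database for later statements.
USE moviework_demo;
```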


Data Manipulation in Hive

Hive SELECT with a WHERE clause:

SELECT a, sum(b)
FROM myTable
WHERE a < 100
GROUP BY a

[Diagram: four Map Tasks feed two Reduce Tasks, which produce the Result.]

Data Manipulation in Hive: Nested Queries
SELECT mt.a, mt.timesTwo, otherTable.z AS ID FROM (
  SELECT a, sum(b*b) AS timesTwo
  FROM myTable
  GROUP BY a) mt
JOIN otherTable
ON otherTable.z = mt.a
GROUP BY mt.a, mt.timesTwo, otherTable.z

[Diagram: Job #1 runs four Map Tasks and two Reduce Tasks over the inner query and writes a Temporary Result. Job #2 runs four Map Tasks and two Reduce Tasks over that result and writes the Output.]

Notes:
• Subqueries are treated as sequential MapReduce jobs.
• Jobs execute from the innermost query outward.
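
One way to see the job sequence for a nested query is to prefix it with EXPLAIN, which prints the plan as a list of stages and their dependencies instead of running the query. The sketch below assumes hypothetical column types for myTable and otherTable, and it lists every non-aggregated select column in the outer GROUP BY so that the statement is valid HiveQL. The number and kind of stages reported depend on the Hive version and optimizer settings.

```sql
-- Hypothetical schemas; the lesson does not define these tables.
CREATE TABLE IF NOT EXISTS myTable (a INT, b INT);
CREATE TABLE IF NOT EXISTS otherTable (z INT);

-- Print the stage plan rather than executing the query.
EXPLAIN
SELECT mt.a, mt.timesTwo, otherTable.z AS ID
FROM (
  SELECT a, sum(b * b) AS timesTwo
  FROM myTable
  GROUP BY a) mt
JOIN otherTable
  ON otherTable.z = mt.a
GROUP BY mt.a, mt.timesTwo, otherTable.z;
```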


Steps in a Hive Query

HiveQL:

SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

Map task: if face_card: emit(suit, card)
Shuffle
Reduce task: emit(suit, count(suit))

Hadoop Cluster (Job Tracker or Resource Manager)
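
To try the query, a small cards table is enough. The schema and sample rows below are assumptions made for illustration, and INSERT ... VALUES requires Hive 0.14 or later.

```sql
-- Hypothetical schema matching the columns the query uses.
CREATE TABLE IF NOT EXISTS cards (
  suit STRING,
  face_value INT);

-- Hypothetical sample rows: three face cards and two low cards.
INSERT INTO TABLE cards VALUES
  ('hearts', 11), ('hearts', 5),
  ('spades', 12), ('spades', 13),
  ('clubs', 2);

-- Map side: the WHERE filter keeps only rows with face_value > 10.
-- Shuffle: the surviving rows are grouped by suit.
-- Reduce side: one count is emitted per suit.
SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;
```

If the table holds only these rows, the result is a count of 1 for hearts and 2 for spades; clubs does not appear because its only card fails the WHERE filter.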

Hive-Based Applications

• Log processing
• Text mining
• Document indexing
• Business analytics
• Predictive modeling


Summary

In this lesson, you should have learned how to:


• Define Hive
• Describe the Hive data flow
• Create a Hive database
