Objectives
Hive
• Hive is an open-source Apache project, originally
developed by Facebook.
• Hive enables analysts who are familiar with SQL to query
data stored in HDFS by using HiveQL (a SQL-like
language).
• It is an infrastructure built on top of Hadoop that supports
the analysis of large data sets.
• Hive transforms HiveQL queries into standard MapReduce
jobs (a high-level abstraction on top of MapReduce).
• Hive communicates with the JobTracker to initiate the
MapReduce job.
Features
Use Case: Storing Clickstream Data
Defining Tables over HDFS
{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
...
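A table defined over records like these needs a column for each JSON key. The following is a minimal Python sketch, not part of the original slides, that reads the two sample records above and lists the keys and the value types observed. The helper name `observed_types` is hypothetical, and the mapping from these types to Hive column types is left open.

```python
import json

# The two sample clickstream records shown above, one JSON object per line.
lines = [
    '{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}',
    '{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}',
]

def observed_types(records):
    """Hypothetical helper: collect the non-null Python types seen per key."""
    types = {}
    for record in records:
        for key, value in record.items():
            seen = types.setdefault(key, set())
            if value is not None:  # JSON null carries no type information
                seen.add(type(value).__name__)
    return types

records = [json.loads(line) for line in lines]
types = observed_types(records)

for key, seen in types.items():
    print(key, sorted(seen))
```

Note that `movieid`, `genreid`, and `recommended` are null in the first record, so any table over this data has to tolerate missing values in those columns.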
Another Example
Where is the data?
LOAD DATA
• MOVES data from a distributed filesystem into Hive
Partitioning
Hive: Data Units
Databases
Tables
Partitions
Hive Framework
[Figure: external interfaces (CLI, JDBC, Hue) and HiveServer2]
Creating a Hive Database
1. Start hive.
[Figure: reduce tasks producing the result]
Data Manipulation in Hive: Nested Queries
SELECT mt.a, mt.timesTwo, otherTable.z as ID FROM(
[Figure: map and reduce tasks of Job #1 and Job #2, ending in the output]
Notes:
• Subqueries are treated as sequential MapReduce jobs.
• Jobs execute from the innermost query outward.
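The inside-out execution order described in the notes can be sketched in plain Python as two stages that run one after the other. This is an illustration only, not Hive itself. The slide does not show the full query, so the input rows, the `timesTwo = a * 2` computation, and the join on column `a` are all assumptions, and the function names are hypothetical.

```python
# Hypothetical input rows; the original slide does not show the tables.
my_table = [{"a": 1}, {"a": 2}, {"a": 3}]
other_table = [{"a": 1, "z": "x1"}, {"a": 3, "z": "x3"}]

def job1_inner_query(rows):
    """Job 1: the innermost query, assumed here to compute timesTwo = a * 2."""
    return [{"a": row["a"], "timesTwo": row["a"] * 2} for row in rows]

def job2_outer_query(mt_rows, other_rows):
    """Job 2: the outer query, assumed here to join the two inputs on column a."""
    z_by_a = {row["a"]: row["z"] for row in other_rows}
    return [
        {"a": mt["a"], "timesTwo": mt["timesTwo"], "ID": z_by_a[mt["a"]]}
        for mt in mt_rows
        if mt["a"] in z_by_a
    ]

# Job 2 cannot start until Job 1 has produced its intermediate result.
intermediate = job1_inner_query(my_table)
result = job2_outer_query(intermediate, other_table)
print(result)
```

The point of the sketch is the data dependency: the outer stage consumes the complete output of the inner stage, which is why the jobs run sequentially.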
Map: if face_card: emit(suit, card)
Reduce: emit(suit, count(suit))
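The pseudocode above can be made runnable as a small in-memory simulation. This is a sketch under stated assumptions, not how Hadoop executes it: face cards are taken to be jack, queen, and king, a card is modelled as a `(rank, suit)` tuple, and the names `map_card`, `reduce_suit`, and `run_job` are hypothetical.

```python
from collections import defaultdict

FACE_RANKS = {"J", "Q", "K"}  # assumption: jack, queen, king are the face cards
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "spades", "clubs", "diamonds"]

def map_card(card):
    """Map step: emit (suit, card) only for face cards."""
    rank, suit = card
    if rank in FACE_RANKS:
        yield suit, card

def reduce_suit(suit, cards):
    """Reduce step: emit (suit, number of face cards seen for that suit)."""
    return suit, len(cards)

def run_job(deck):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for card in deck:
        for suit, value in map_card(card):
            groups[suit].append(value)
    return dict(reduce_suit(suit, cards) for suit, cards in groups.items())

deck = [(rank, suit) for suit in SUITS for rank in RANKS]
counts = run_job(deck)
print(counts)
```

For a standard 52-card deck, each suit ends up with a count of 3.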
Hive-Based Applications
• Log processing
• Text mining
• Document indexing
• Business analytics
• Predictive modeling
Summary