Introduction To The Big Data Ecosystem

Overview of Hive
Objectives
After completing this lesson, you should be able to:

• Define Hive
• Describe the Hive data flow
• Create a Hive database
Hive
• Hive is an open source Apache project and was originally
developed by Facebook.
• Hive enables analysts who are familiar with SQL to query
data stored in HDFS by using HiveQL (a SQL-like
language).
• It is an infrastructure built on top of Hadoop that supports
the analysis of large data sets.
• Hive transforms HiveQL queries into standard MapReduce
jobs (a high-level abstraction on top of MapReduce).
• Hive communicates with the JobTracker (or, on YARN-based
clusters, the ResourceManager) to initiate the MapReduce job.
Features
• Uses familiar SQL syntax (HiveQL)
• Interactive
• Scalable: works with “big data” on a cluster
• Most appropriate for data warehouse applications
• Easy OLAP queries: far easier than writing MapReduce in Java
• Highly optimized
Use Case: Storing Clickstream Data
Defining Tables over HDFS
A table in Hive is mapped to HDFS directories
Defining Tables over HDFS
{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
...
CREATE EXTERNAL TABLE default.movieapp_log_json(
  custid int,
  movieid int,
  genreid int,
  time string,
  recommended string,
  activity int,
  rating int,
  price float,
  position int)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json'

Callouts:
• The statement is HiveQL (simple SQL-like syntax to query the clickstream data).
• The ROW FORMAT SERDE clause is the SerDe option.
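Once such a table is defined, the JSON files under its LOCATION directory can be queried like any SQL table. The following is a minimal sketch, assuming the movieapp_log_json table has been created and the JsonSerDe class is available to Hive. The particular aggregate and the alias num_records are illustrative choices, not part of the original lesson.

```sql
-- Count log records per movie, skipping records that carry no movie id.
-- Table and column names come from the CREATE EXTERNAL TABLE statement;
-- the alias num_records is an illustrative name.
SELECT movieid, COUNT(*) AS num_records
FROM default.movieapp_log_json
WHERE movieid IS NOT NULL
GROUP BY movieid
ORDER BY num_records DESC
LIMIT 10;
```

Because the table is EXTERNAL, dropping it removes only the metadata; the files in HDFS stay where they are.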
Another Example
Where is the data?
LOAD DATA
• MOVES data from a distributed file system into Hive
LOAD DATA LOCAL
• COPIES data from your local file system into Hive
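The difference between the two forms can be sketched as follows. The table name, column name, and both file paths are hypothetical placeholders chosen for illustration; only the LOAD DATA syntax itself is standard HiveQL.

```sql
-- Hypothetical staging table; all names and paths below are placeholders.
CREATE TABLE IF NOT EXISTS applog_staging (raw_record STRING);

-- Without LOCAL: the source is a path in the distributed file system.
-- The file is MOVED into the table's directory, so it no longer
-- exists at the source path afterwards.
LOAD DATA INPATH '/user/oracle/incoming/applog_part1.json'
INTO TABLE applog_staging;

-- With LOCAL: the source is a path on the local file system.
-- The file is COPIED, so the local original stays in place.
LOAD DATA LOCAL INPATH '/tmp/applog_part2.json'
INTO TABLE applog_staging;
```

Adding the OVERWRITE keyword before INTO TABLE replaces the table's existing contents instead of appending to them.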
Partitioning
Hive: Data Units
Databases
Tables
Partitions
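To make the three data units concrete, here is a minimal sketch that creates one of each. The database, table, and column names are hypothetical and chosen only for illustration.

```sql
-- Database: a namespace for tables (hypothetical name).
CREATE DATABASE IF NOT EXISTS weblogs;
USE weblogs;

-- Table: declares a partition column in addition to its data columns.
-- Each distinct log_date value is stored in its own subdirectory
-- of the table's directory.
CREATE TABLE IF NOT EXISTS applog_by_day (
  custid INT,
  movieid INT,
  activity INT)
PARTITIONED BY (log_date STRING);

-- Partition: add one explicitly, then list the table's partitions.
ALTER TABLE applog_by_day
  ADD IF NOT EXISTS PARTITION (log_date = '2012-07-01');
SHOW PARTITIONS applog_by_day;
```

A query that filters on log_date only needs to read the matching partition directories rather than the whole table.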
The Hive Metastore Database
• Contains metadata regarding databases, tables, and partitions
• Contains information about how the rows and columns are
delimited in the HDFS files that are used in the queries
• Is a relational database (RDBMS), such as MySQL, where Hive
persists table schemas and other system metadata
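The metadata held in the metastore can be inspected from HiveQL itself. The statements below are a short sketch, assuming the movieapp_log_json table from the earlier slide exists in the default database.

```sql
-- Each of these statements is answered from metastore metadata,
-- not by scanning the data files in HDFS.
SHOW DATABASES;
SHOW TABLES IN default;

-- Prints column definitions plus storage details such as the
-- location, SerDe, and input/output formats recorded for the table.
DESCRIBE FORMATTED default.movieapp_log_json;
```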
Hive Framework
[Diagram: external interfaces (CLI, JDBC, Hue) and HiveServer2]
Ways to use Hive
■ Interactively, via the hive> prompt (command-line interface, CLI)
■ Through Ambari/Hue
■ Through a JDBC/ODBC server
■ Through the Thrift service
■ Via Oozie
Creating a Hive Database
1. Start hive.
2. Create the database.
3. Verify the database creation.
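The three steps can be sketched as follows. The database name moviedemo and its comment are hypothetical; substitute your own.

```sql
-- Step 1 happens in the shell: run `hive` to get the hive> prompt.

-- Step 2: create the database (hypothetical name).
CREATE DATABASE IF NOT EXISTS moviedemo
COMMENT 'Hypothetical database for this lesson';

-- Step 3: verify that it exists and inspect its properties.
SHOW DATABASES LIKE 'moviedemo';
DESCRIBE DATABASE moviedemo;

-- Optionally make it the current database for later statements.
USE moviedemo;
```

IF NOT EXISTS makes the CREATE statement safe to rerun, which is convenient when the steps are kept in a script.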
Data Manipulation in Hive
Hive SELECT with a WHERE clause:

SELECT a, sum(b)
FROM myTable
WHERE a < 100
GROUP BY a

[Diagram: the query runs as a single MapReduce job in which map tasks feed reduce tasks, which produce the result.]
Data Manipulation in Hive: Nested Queries
SELECT mt.a, mt.timesTwo, otherTable.z AS ID
FROM (
  SELECT a, sum(b*b) AS timesTwo
  FROM myTable
  GROUP BY a) mt
JOIN otherTable
ON otherTable.z = mt.a
GROUP BY mt.a, mt.timesTwo, otherTable.z

[Diagram: Job #1 (map tasks, then reduce tasks) runs the inner query and writes a temporary result. Job #2 (map tasks, then reduce tasks) reads that temporary result, performs the join and the outer GROUP BY, and writes the output.]

Notes:
• Subqueries are treated as sequential MapReduce jobs.
• Jobs execute from the innermost query outward.
Steps in a Hive Query
HiveQL:
SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;

Map task: if face_card: emit(suit, card)
Shuffle
Reduce task: emit(suit, count(suit))

Hadoop Cluster (Job Tracker or Resource Manager)
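One way to see how Hive breaks this query into stages is to prefix it with EXPLAIN, which prints the execution plan without running the query. This is a sketch that assumes a cards table with suit and face_value columns, as on the slide; the exact plan text varies by Hive version and execution engine.

```sql
-- Prints the stage plan for the query, including the map-side
-- filter and the reduce-side aggregation, without executing it.
EXPLAIN
SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;
```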
Hive-Based Applications
• Log processing
• Text mining
• Document indexing
• Business analytics
• Predictive modeling
Summary
In this lesson, you should have learned how to:

• Define Hive
• Describe the Hive data flow
• Create a Hive database
