EEDC

34330

Execution Environments for Distributed Computing

Apache Hive

Master in Computer Architecture, Networks and Systems - CANS

Homework number: 3
Group number: EEDC-1
Group members:
Hugo Prez vhpvmx@gmail.com
Sergio Mendoza sergiomendo@gmail.com
Carlos Fenoy carles.fenoy@gmail.com

Outline
Introduction
Hive Database
Data Model
Query Language
Hive Architecture
Conclusions

Introduction
Origins at Facebook...
Facebook has 500,000,000 logs per day
Facebook shares a billion pieces of content daily
Facebook stores a vast amount of data

Introduction
What's the problem?
250 million photos per day
2.7 billion likes and comments per day
2 billion total registered users
100 billion friendships
...

TOO MUCH DATA!!

Introduction
What is Apache Hive?
Hive is a data warehouse infrastructure

And what is a data warehouse (DW)?
A DW is a database for reporting and analysis

Introduction
How does Apache Hive work?
Hive is built on top of Hadoop
Hive stores data in HDFS
Hive compiles SQL-like (HiveQL) queries into MapReduce jobs
and runs those jobs on the cluster
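
To make the compilation step concrete, here is a minimal sketch of how a query such as SELECT page, COUNT(*) FROM page_views GROUP BY page could be split into map, shuffle, and reduce phases. It is a pure-Python simulation, not Hive's implementation; the page_views table, its rows, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, page); not from the slides.
rows = [
    ("u1", "home"),
    ("u2", "home"),
    ("u1", "profile"),
    ("u3", "home"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user_id, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; here the aggregate plays the role of COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


result = reduce_phase(shuffle_phase(map_phase(rows)))
print(result)
```

On a real cluster the map and reduce phases run in parallel on many nodes, and the shuffle moves data between them over the network.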

Introduction
How does Apache Hive work?

HiveQL query

Introduction
How does a simple web app work?

MySQL query

Data Model
Hive structures data into well-understood database concepts such as tables, columns,
and rows.

HiveQL
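Hive defines a simple SQL-like query language,
called QL (HiveQL)
- Supports DDL and DML.
- Users can embed custom map-reduce scripts.
- Supports user-defined functions (UDF), user-defined aggregate functions (UDAF) and user-defined table functions (UDTF).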
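
The three kinds of functions differ in how many rows they consume and produce. The sketch below illustrates that contract in plain Python. It is only an analogy: real Hive UDFs are written against Hive's Java API, and the function names here are hypothetical stand-ins loosely modelled on the built-ins upper, count and explode.

```python
def upper_udf(value):
    """UDF: one input row produces exactly one output value."""
    return value.upper()


def count_udaf(values):
    """UDAF: many input rows are aggregated into a single value."""
    return sum(1 for _ in values)


def explode_udtf(items):
    """UDTF: one input row (holding a collection) produces many output rows."""
    for item in items:
        yield item


names = ["hive", "hadoop", "hdfs"]
print([upper_udf(n) for n in names])   # applied row by row
print(count_udaf(names))               # applied to the whole group
print(list(explode_udtf(["a", "b"])))  # one row fanned out into several
```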

HiveQL Extract
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
  FROM (
    MAP b.school, a.status
      USING 'meme-extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt DESC
) subq2;
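
The slides do not show the contents of top10.py, so the following is only a guess at what such a reduce script might look like. Hive streams rows to custom scripts as tab-separated lines on standard input and reads tab-separated lines back from standard output. This sketch assumes each line holds school, meme and cnt, and keeps the n most frequent memes per school. The function name and the sample data are hypothetical, and the demo reads from an in-memory stream rather than stdin so it can be run on its own.

```python
import heapq
import io
from collections import defaultdict


def top_n_per_school(lines, n=10):
    """Yield the n highest-count memes for each school as tab-separated lines."""
    by_school = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        school, meme, cnt = line.split("\t")
        by_school[school].append((int(cnt), meme))
    for school in sorted(by_school):
        # nlargest orders by count first, then by meme name to break ties.
        for cnt, meme in heapq.nlargest(n, by_school[school]):
            yield f"{school}\t{meme}\t{cnt}"


# A real script would pass sys.stdin here; sample rows are made up.
sample = io.StringIO("UPC\tlolcat\t7\nUPC\trickroll\t12\nMIT\tlolcat\t3\n")
for row in top_n_per_school(sample, n=1):
    print(row)
```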

Architecture
- External Interfaces: Hive provides both user interfaces, such as the command line (CLI) and web UI, and application programming interfaces (APIs), such as JDBC and ODBC.
- Thrift Server: exposes a very simple client API to execute HiveQL statements.
- Metastore: the system catalog. All other components of Hive interact with the metastore.

Architecture
- Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.
- Compiler: translates statements into a plan that consists of a DAG of map-reduce jobs.
- Execution Engine: the driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order, so each job runs only after the jobs it depends on.
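
Submitting jobs in topological order means that no job starts before the jobs it depends on. The sketch below shows one standard way to compute such an order (Kahn's algorithm) for a small job DAG. It does not claim to be how Hive's driver is implemented; the job names and the dependency map are hypothetical.

```python
from collections import deque


def topological_order(deps):
    """Return the jobs in an order where every job follows its dependencies.

    deps maps each job name to the set of job names it depends on.
    Every dependency must itself appear as a key in deps.
    """
    remaining = {job: set(parents) for job, parents in deps.items()}
    dependents = {job: [] for job in deps}
    for job, parents in deps.items():
        for parent in parents:
            dependents[parent].append(job)

    # Start with the jobs that have no unmet dependencies.
    ready = deque(sorted(job for job, parents in remaining.items() if not parents))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in dependents[job]:
            remaining[child].discard(job)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical plan: two scans feed a join, which feeds an aggregation.
plan = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "aggregate": {"join"},
}
print(topological_order(plan))
```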

Metastore
The metastore is the system catalog, which contains metadata about the tables stored in Hive.
- Database: a namespace for tables.
- Table: table metadata contains the list of columns and their types, the owner, and storage and SerDe information.
- Partition: each partition can have its own columns, SerDe and storage information.
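
The relationship between these three metastore objects can be sketched as a small data model. This is an illustration only and does not reflect the real metastore schema. The class and field names, the example HDFS paths, and the rule that a partition without its own columns falls back to the table's columns are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Column = Tuple[str, str]  # (column name, column type)


@dataclass
class StorageInfo:
    location: str  # where the data lives, e.g. an HDFS directory
    serde: str     # serializer/deserializer class used to read the rows


@dataclass
class Partition:
    values: Dict[str, str]                  # partition key -> value
    storage: StorageInfo                    # may differ from the table's
    columns: Optional[List[Column]] = None  # None: assume table columns apply


@dataclass
class Table:
    name: str
    owner: str
    columns: List[Column]
    storage: StorageInfo
    partitions: List[Partition] = field(default_factory=list)

    def columns_for(self, partition: Partition) -> List[Column]:
        """Return the partition's own columns, or the table's as a fallback."""
        return partition.columns if partition.columns is not None else self.columns


@dataclass
class Database:
    name: str                                        # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)


# Hypothetical example data.
serde = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
updates = Table(
    name="status_updates",
    owner="analyst",
    columns=[("userid", "int"), ("status", "string")],
    storage=StorageInfo("/warehouse/status_updates", serde),
)
updates.partitions.append(
    Partition({"ds": "2012-01-01"},
              StorageInfo("/warehouse/status_updates/ds=2012-01-01", serde))
)
db = Database("default", {updates.name: updates})
print(db.tables["status_updates"].columns_for(updates.partitions[0]))
```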

Query Compiler
- Parser: transforms a query string into a parse tree representation.
- Semantic Analyzer: transforms the parse tree into a block-based internal query representation.
- Logical Plan Generator: converts the internal query representation into a logical plan, which consists of a tree of logical operators.
- Optimizer: performs multiple passes over the logical plan and rewrites it in several ways.
- Physical Plan Generator: converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs.

Conclusions
Hive provides a way to run business
intelligence over very large data sets on top of the mature
Hadoop map-reduce platform.
The SQL-like HiveQL shortens the learning
curve compared with writing low-level map-reduce
programs.

Questions?

Links:
http://i.stanford.edu/~ragho/hive-icde2010.pdf
http://www.vldb.org/pvldb/2/vldb09-938.pdf
http://hive.apache.org/
https://cwiki.apache.org/Hive/languagemanualtransform.html
http://biggdata.blogspot.com/2011/04/refreshingtrendingtopics-website-data.html
http://code.google.com/p/hivemrc/wiki/AboutHiveCore
