
EEDC

34330

Execution Environments for Distributed Computing


Master in Computer Architecture, Networks and Systems - CANS

Apache Hive

Homework number: 3
Group number: EEDC-1
Group members:
- Hugo Prez, vhpvmx@gmail.com
- Sergio Mendoza, sergiomendo@gmail.com
- Carlos Fenoy, carles.fenoy@gmail.com

Outline
Introduction Hive Database Data Model Query Language Hive Architecture Conclusions

Introduction
Origins at Facebook...
- Facebook generates 500,000,000 logs per day
- Facebook shares a billion pieces of content daily
- Facebook stores a vast amount of data

Introduction
What's the problem?
- 250 million photos per day
- 2.7 billion likes and comments per day
- 2 billion total registered users
- 100 billion friendships
- ...

TOO MUCH DATA!!

Introduction
What is Apache Hive?
Hive is a data warehouse infrastructure

And what is a data warehouse (DW)?

A DW is a database for reporting and analysis.

Introduction
How does Apache Hive work?
Hive is built on top of Hadoop.
Hive stores data in HDFS.
Hive compiles SQL-like queries into MapReduce jobs and runs those jobs on the cluster.
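
To make the compilation step concrete, the following is a minimal sketch of how a `GROUP BY ... COUNT(1)` query can be expressed as a map phase, a shuffle and a reduce phase. It is plain Python run in a single process, not Hive or Hadoop code; the sample rows and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table stored in HDFS.
rows = [
    {"school": "UPC", "meme": "cat"},
    {"school": "UPC", "meme": "cat"},
    {"school": "UPC", "meme": "dog"},
    {"school": "MIT", "meme": "cat"},
]


def map_phase(row):
    """Emit one (key, value) pair per row; the key holds the GROUP BY columns."""
    yield (row["school"], row["meme"]), 1


def shuffle(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate the values of one key; COUNT(1) becomes a sum of ones."""
    return key + (sum(values),)


def run_group_by_count(table):
    """SELECT school, meme, COUNT(1) FROM table GROUP BY school, meme."""
    pairs = [pair for row in table for pair in map_phase(row)]
    return sorted(reduce_phase(k, v) for k, v in shuffle(pairs).items())


result = run_group_by_count(rows)
print(result)
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data over the network; the sketch only shows the division of work.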

Introduction
How does Apache Hive work?

HiveQL query

Introduction
How does a simple web app work?

MySQL query


Data Model
Hive structures data into well-understood database concepts such as tables, columns and rows.

HiveQL
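Hive defines a simple SQL-like query language, called HiveQL (QL).

- Supports DDL and DML.

- Users can embed custom map-reduce scripts.

- Supports user-defined functions (UDF), user-defined aggregate functions (UDAF) and user-defined table-generating functions (UDTF).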
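
The three function kinds differ in how many rows go in and come out. The sketch below shows only those shapes, using plain Python functions as analogies; the function names are hypothetical and this is not the Hive UDF API.

```python
def udf_upper(value):
    """UDF shape: one value from one row in, one value out."""
    return value.upper()


def udaf_count(values):
    """UDAF shape: values from many rows in, one aggregated value out."""
    return sum(1 for _ in values)


def udtf_explode(values):
    """UDTF shape: one row holding a collection in, zero or more rows out."""
    for item in values:
        yield (item,)


print(udf_upper("hive"))
print(udaf_count(["a", "b", "c"]))
print(list(udtf_explode(["x", "y"])))
```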

HiveQL Extract
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
  FROM (
    MAP b.school, a.status
      USING 'meme-extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt DESC
) subq2;
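
The slides do not show the contents of `top10.py`. As a rough idea of what such a reduce script could look like, here is a hypothetical sketch. It relies on the fact that Hive streams rows to an embedded script as tab-separated lines on standard input and reads tab-separated lines back from standard output. It also assumes that all rows of one school reach the script next to each other; the function name `top_n` and the sample data are invented for illustration.

```python
import io
from heapq import nlargest
from itertools import groupby


def top_n(lines, n=10):
    """Yield the n highest-count rows per school as tab-separated lines.

    Each input line is expected to hold: school <TAB> meme <TAB> cnt.
    Rows of the same school are assumed to be adjacent in the input.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for school, group in groupby(rows, key=lambda r: r[0]):
        best = nlargest(n, group, key=lambda r: int(r[2]))
        for _, meme, cnt in best:
            yield f"{school}\t{meme}\t{cnt}"


# Stand-in for standard input; a script run by Hive would read sys.stdin
# and print each yielded line instead.
sample = io.StringIO("UPC\tcat\t5\nUPC\tdog\t9\nMIT\tcat\t2\n")
output = list(top_n(sample, n=1))
print(output)
```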


Architecture

Architecture
- External Interfaces: provide both user interfaces such as the command line (CLI) and web UI, and application programming interfaces (APIs) such as JDBC and ODBC.
- Thrift Server: exposes a very simple client API to execute HiveQL statements.
- Metastore: the system catalog. All other components of Hive interact with the metastore.

Architecture
- Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.
- Compiler: translates statements into a plan which consists of a DAG of map-reduce jobs.
- The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order.
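
Topological order means that a job is submitted only after every job it depends on. A minimal sketch using Python's standard `graphlib` module (Python 3.9 or later) follows; the job names and the shape of the plan are hypothetical and not taken from Hive.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each job maps to the set of jobs it depends on.
plan = {
    "scan_status_updates": set(),
    "scan_profiles": set(),
    "join": {"scan_status_updates", "scan_profiles"},
    "group_by": {"join"},
}


def submission_order(dag):
    """Return the jobs in an order where dependencies always come first."""
    return list(TopologicalSorter(dag).static_order())


order = submission_order(plan)
print(order)
```

The two scan jobs do not depend on each other, so either may come first; only the relative order of dependent jobs is guaranteed.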

Metastore
The metastore is the system catalog, which contains metadata about the tables stored in Hive.
- Database: a namespace for tables.
- Table: the metadata for a table contains the list of columns and their types, owner, storage and SerDe information.
- Partition: each partition can have its own columns, SerDe and storage information.
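
The relationship between these three objects can be pictured with a few data classes. This is only an illustrative model, not the metastore's actual schema; every class name, field name, path and value below is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class StorageInfo:
    location: str  # hypothetical directory holding the data files
    serde: str     # name of the serializer/deserializer used for the rows


@dataclass
class Partition:
    values: dict   # partition column name -> value
    columns: list  # a partition may carry its own column list
    storage: StorageInfo


@dataclass
class Table:
    name: str
    owner: str
    columns: list  # (column name, type) pairs
    storage: StorageInfo
    partitions: list = field(default_factory=list)


@dataclass
class Database:
    name: str
    tables: dict = field(default_factory=dict)  # namespace: table name -> Table


table = Table(
    name="status_updates",
    owner="eedc",
    columns=[("userid", "int"), ("status", "string")],
    storage=StorageInfo("/warehouse/status_updates", "LazySimpleSerDe"),
)
table.partitions.append(
    Partition(
        values={"ds": "2011-04-01"},
        columns=table.columns,
        storage=StorageInfo("/warehouse/status_updates/ds=2011-04-01", "LazySimpleSerDe"),
    )
)
db = Database(name="default")
db.tables[table.name] = table
print(db.tables["status_updates"].partitions[0].storage.location)
```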

Query Compiler
- Parser: transforms a query string to a parse tree representation.
- Semantic Analyzer: transforms the parse tree to a block-based internal query representation.
- Logical Plan Generator: converts the internal query representation to a logical plan, which consists of a tree of logical operators.
- Optimizer: performs multiple passes over the logical plan and rewrites it in several ways.
- Physical Plan Generator: converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs.
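
The five stages can be read as a chain of functions, each consuming the previous stage's output. The toy sketch below handles only queries of the form `SELECT col, ... FROM table` and uses invented data structures; it illustrates the hand-off between stages and is not how Hive's compiler is implemented. The single optimizer pass shown is column pruning, chosen as an example of a plan rewrite.

```python
def parse(query):
    """Parser: query string -> parse tree (here a small dict)."""
    tokens = query.replace(",", " ").split()
    from_at = [t.upper() for t in tokens].index("FROM")
    return {"select": tokens[1:from_at], "from": tokens[from_at + 1]}


def analyze(tree, catalog):
    """Semantic analyzer: resolve names against a catalog (metastore stand-in)."""
    known = catalog[tree["from"]]
    missing = [c for c in tree["select"] if c not in known]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return {"table": tree["from"], "columns": tree["select"], "all_columns": known}


def logical_plan(block):
    """Logical plan generator: query block -> chain of logical operators."""
    return [("TableScan", block["table"], block["all_columns"]),
            ("Select", block["columns"])]


def optimize(plan):
    """Optimizer: one rewrite pass that prunes unused columns from the scan."""
    scan, select = plan
    return [("TableScan", scan[1], select[1]), select]


def physical_plan(plan):
    """Physical plan generator: here always a single map-only job."""
    return [{"job": "map_only", "operators": plan}]


def compile_query(query, catalog):
    """Run all five stages in order."""
    return physical_plan(optimize(logical_plan(analyze(parse(query), catalog))))


catalog = {"profiles": ["userid", "school"]}  # hypothetical table definition
jobs = compile_query("SELECT school FROM profiles", catalog)
print(jobs)
```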


Conclusions
Hive provides a way to run business intelligence workloads over very large data sets on top of the mature Hadoop map-reduce platform. The SQL-like HiveQL shortens the learning curve compared with writing low-level map-reduce programs.

Questions?

Links:
- http://i.stanford.edu/~ragho/hive-icde2010.pdf
- http://www.vldb.org/pvldb/2/vldb09-938.pdf
- http://hive.apache.org/
- https://cwiki.apache.org/Hive/languagemanualtransform.html
- http://biggdata.blogspot.com/2011/04/refreshingtrendingtopics-website-data.html
- http://code.google.com/p/hivemrc/wiki/AboutHiveCore
