Apache Hive
Outline
Introduction
Hive Database
Data Model
Query Language
Hive Architecture
Conclusions
Introduction
Origins at Facebook
Facebook generates 500,000,000 logs per day
Facebook shares a billion pieces of content daily
Facebook stores a vast amount of data
What's the problem?
250 million photos per day
2.7 billion likes and comments per day
2 billion total registered users
100 billion friendships
...
What is Apache Hive?
Hive is a data warehouse infrastructure
How does Apache Hive work?
Hive is built on top of Hadoop
Hive stores data in HDFS
Hive compiles SQL queries into MapReduce jobs and runs the jobs on the cluster
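As a rough illustration of the compile-to-MapReduce idea, the sketch below simulates in plain Python how a query such as SELECT school, COUNT(1) FROM profiles GROUP BY school could be split into map, shuffle, and reduce phases. The sample rows, the function names, and the single-process simulation are assumptions made for illustration; this is not how Hive itself is implemented.

```python
from collections import defaultdict

# Hypothetical rows of a `profiles` table: (userid, school).
PROFILES = [
    (1, "MIT"),
    (2, "Stanford"),
    (3, "MIT"),
    (4, "Berkeley"),
    (5, "MIT"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _userid, school in rows:
        yield school, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key, which implements COUNT(1)."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_school(rows):
    """Run the three phases in sequence on an in-memory table."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for school, cnt in sorted(count_by_school(PROFILES).items()):
        print(school, cnt)
```

On a real cluster the map and reduce functions would run in parallel on different machines, and the shuffle step would move data between them over the network.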
[Figure: HiveQL query]
How does a simple web app work?
[Figure: MySQL query]
Data Model
Hive structures data into well-understood database concepts such as tables, columns, and rows.
HiveQL
Hive defines a simple SQL-like query language, called HiveQL (QL), which supports both DDL and DML statements.
HiveQL Extract
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
  FROM (
    MAP b.school, a.status
      USING 'meme-extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt DESC
) subq2;
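To make the REDUCE ... USING step more concrete, here is a minimal sketch of what a script in the role of top10.py might contain. Hive passes rows to such scripts as tab-separated lines on standard input and reads tab-separated lines back from standard output. The actual contents of top10.py are not shown in the slides, so the logic below (keep the ten highest counts per school) and the name top_memes are assumptions.

```python
import heapq
from collections import defaultdict

TOP_N = 10  # assumption: "top10" means the ten highest counts per school


def top_memes(lines, n=TOP_N):
    """Yield tab-separated (school, meme, cnt) rows, at most n per school."""
    per_school = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        school, meme, cnt = line.split("\t")
        per_school[school].append((int(cnt), meme))
    for school in sorted(per_school):
        # Highest counts first; ties are broken by meme text for determinism.
        best = heapq.nsmallest(
            n, per_school[school], key=lambda cm: (-cm[0], cm[1])
        )
        for cnt, meme in best:
            yield f"{school}\t{meme}\t{cnt}"


# As a streaming script, the function would be wired to the standard streams:
#     import sys
#     for row in top_memes(sys.stdin):
#         print(row)
```

This version buffers every row in memory for simplicity. A production script could instead rely on the sort order that the query already imposes and emit each school's rows as soon as the school changes.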
Architecture
External Interfaces - Hive provides both user interfaces, such as the command line (CLI) and a web UI, and application programming interfaces (APIs), such as JDBC and ODBC.
Thrift Server - exposes a very simple client API to execute HiveQL statements.
Metastore - the system catalog. All other components of Hive interact with the metastore.
Driver - manages the life cycle of a HiveQL statement during compilation, optimization, and execution.
Compiler - translates statements into a plan that consists of a DAG of map-reduce jobs.
Execution Engine - the driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order.
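The topological submission order can be sketched with Python's standard graphlib module (Python 3.9 or later). The job names, the PLAN dictionary, and the submit callback are hypothetical stand-ins for a compiled plan and the execution engine, not Hive's own data structures.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each job maps to the set of jobs it depends on.
PLAN = {
    "join_status_profiles": set(),
    "count_memes": {"join_status_profiles"},
    "top10_per_school": {"count_memes"},
}


def submission_order(plan):
    """Return job names so that every job follows all of its dependencies."""
    return list(TopologicalSorter(plan).static_order())


def run_plan(plan, submit):
    """Submit each job in topological order.

    `submit` stands in for the execution engine (an assumption).
    """
    for job in submission_order(plan):
        submit(job)


if __name__ == "__main__":
    run_plan(PLAN, lambda job: print("submitting", job))
```

Submitting in topological order guarantees that a job never starts before the jobs producing its input have been submitted.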
Metastore
The metastore is the system catalog, which contains metadata about the tables stored in Hive.
Database - a namespace for tables.
Table - the metadata for a table contains the list of columns and their types, the owner, and storage and SerDe information.
Partition - each partition can have its own columns, SerDe, and storage information.
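The three metastore objects can be pictured as a small set of record types. The sketch below uses Python dataclasses. All class and field names are illustrative assumptions, as is the rule that a partition without its own columns falls back to the table's columns. The real metastore schema is not shown in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Column = Tuple[str, str]  # (column name, column type)


@dataclass
class StorageInfo:
    location: str  # for example, a directory in HDFS
    serde: str     # name of the serializer/deserializer


@dataclass
class Partition:
    values: Tuple[str, ...]                # partition key values
    storage: StorageInfo
    columns: Optional[List[Column]] = None  # None: use the table's columns


@dataclass
class Table:
    name: str
    owner: str
    columns: List[Column]
    storage: StorageInfo
    partitions: List[Partition] = field(default_factory=list)

    def columns_for(self, partition):
        """Return the partition's own columns, else the table's (assumption)."""
        if partition.columns is not None:
            return partition.columns
        return self.columns


@dataclass
class Database:
    name: str  # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)

    def add_table(self, table):
        self.tables[table.name] = table
```

Keeping storage and SerDe information on both the table and the partition mirrors the slide's point that a partition may differ from its parent table.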
Query Compiler
Parser - transforms a query string into a parse tree representation.
Semantic Analyzer - transforms the parse tree into a block-based internal query representation.
Logical Plan Generator - converts the internal query representation into a logical plan, which consists of a tree of logical operators.
Optimizer - performs multiple passes over the logical plan and rewrites it in several ways.
Physical Plan Generator - converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs.
Conclusions
Hive provides a way to perform business intelligence over huge data sets on top of the mature Hadoop map-reduce platform. The SQL-like HiveQL shortens the learning curve compared with writing low-level map-reduce programs.
Questions?