
How does Hive on Hadoop work?

Apache Hadoop

Hadoop is not a database; it is a distributed framework used to store and
process large data sets across clusters of computers.
It has two core components: HDFS (Hadoop Distributed File System) and
MapReduce. HDFS is the storage layer, used to store large amounts of data across
computer clusters.
MapReduce is a programming model that processes large data sets by splitting
them into several blocks of data. These blocks are distributed across the nodes of
the different machines in the computer cluster and processed in parallel.

The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive,
that complement the core Hadoop modules.
 Sqoop: used to import and export data between HDFS and relational
databases (RDBMS).
 Pig: a procedural language platform used to develop scripts for MapReduce
operations.
 Hive: a platform used to develop SQL-like scripts for MapReduce operations.

Apache HIVE

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It
is excellent big data software that helps in writing, reading, and managing huge datasets
held in distributed storage. It is an open-source project built on top of Hadoop to
summarize Big Data, and it makes querying and analysis easy.
Hive provides a special language similar to SQL, known as HiveQL; queries written in it
are converted into MapReduce programs that can be executed on datasets in HDFS (Hadoop
Distributed File System).
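
As a first taste, this is what a HiveQL query looks like; behind the scenes Hive may
compile it into one or more MapReduce jobs. The table and column names here are
purely illustrative, not from the original answer:

    -- an illustrative HiveQL query; employees is a hypothetical table
    SELECT department, AVG(salary)
    FROM employees
    GROUP BY department;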
Hive is seen as a data warehouse infrastructure and is used as an ETL (Extract-
Transform-Load) tool. It improves flexibility in schema design through data
serialization and deserialization (SerDe) libraries, and it is an excellent tool for
querying historical data.
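
For example, schema and serialization are declared when a table is created. The
sketch below uses Hive's built-in delimited-text SerDe; the table definition itself
is hypothetical:

    -- a minimal sketch of schema design with a built-in SerDe
    CREATE TABLE web_logs (
      log_time  TIMESTAMP,
      user_id   INT,
      url       STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;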

Uses of Hive:

 To query large datasets: Apache Hive is used mainly for analytics on huge
datasets. It offers an easy way to approach and quickly carry out complex
queries over datasets stored in the Hadoop ecosystem.
 For extensibility: Apache Hive exposes a range of user APIs (such as
user-defined functions) that help in building custom behavior for the query engine.
 For someone familiar with SQL: If you are familiar with SQL, Hive
will be very easy to use, as you will see many similarities between the two. Hive
supports clauses such as SELECT, WHERE, GROUP BY, and ORDER BY, just like SQL
(see the example query after this list).
 To work on structured data: Hive is widely adopted for working with
structured data.
 To analyze historical data: Apache Hive is a great tool for analyzing and
querying historical data collected over a long period.
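
To illustrate the SQL similarity mentioned above, here is a short HiveQL sketch
combining those clauses. The sales table and its columns are assumptions for the
example, not part of the original answer:

    -- a hedged sketch combining SELECT, WHERE, GROUP BY, and ORDER BY
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    WHERE sale_date >= '2019-01-01'
    GROUP BY region
    ORDER BY total_sales DESC;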

Apache HIVE Architecture

The Hive query language is called HiveQL, but it is not exactly the structured query
language defined by the SQL standard. HiveQL offers various extensions that are not part
of SQL: you can write multi-table inserts and CREATE TABLE AS SELECT statements, though
it has only basic support for indexes. HiveQL was not designed for Online Transaction
Processing, long lacked view materialization, and offers only limited sub-query support.
In recent versions, however, it is possible to have full ACID properties along with
UPDATE, INSERT, and DELETE functionality (a sketch follows this paragraph).
The way in which Hive stores and queries data closely resembles a regular database.
But since Hive is built on top of the Hadoop ecosystem, it has to adhere to the rules
set forth by the Hadoop framework.
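
A minimal sketch of an ACID (transactional) table, assuming a Hive version and
cluster configuration with transactions enabled; the table and values are
hypothetical. In many Hive versions, ACID tables must be stored as ORC:

    -- assumes Hive ACID transactions are enabled in the configuration
    CREATE TABLE employees_txn (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    INSERT INTO employees_txn VALUES (1, 'Asha', 55000.0);
    UPDATE employees_txn SET salary = 60000.0 WHERE id = 1;
    DELETE FROM employees_txn WHERE id = 1;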

There are 4 main components in the Hive architecture.

1. Hadoop core components (HDFS, MapReduce) -

i) HDFS: When we load data into a Hive table, Hive internally stores the data at an
HDFS path, by default under the Hive warehouse directory (/user/hive/warehouse).
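
A hedged sketch of this behavior; the file path and table name are hypothetical,
and the warehouse location assumes the default hive.metastore.warehouse.dir setting:

    -- load a file from HDFS into a (hypothetical) managed table
    CREATE TABLE page_views (user_id INT, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    LOAD DATA INPATH '/data/raw/page_views.tsv' INTO TABLE page_views;
    -- the data now sits under /user/hive/warehouse/page_views/ by default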

ii) MapReduce: When we run a query such as the one sketched below, Hive runs a
MapReduce job by compiling the query into Java class files, building a jar, and
executing that jar.
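
For instance, an aggregation like the following would typically be compiled into a
MapReduce job (the page_views table is reused from the sketch above and remains
purely illustrative):

    -- an aggregation that Hive would compile into a MapReduce job
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url;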

2. Metastore - a namespace for tables. This is a crucial part of Hive, as all of the
metadata related to Hive, such as details about tables, columns, partitions, and
locations, is kept here. Usually the Metastore is backed by a relational
database, e.g. MySQL.
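
For illustration, the metadata held in the Metastore can be inspected from HiveQL
itself (the table name is reused from the earlier hypothetical sketch):

    -- inspect Metastore-managed metadata from HiveQL
    SHOW TABLES;
    DESCRIBE FORMATTED page_views;  -- columns, HDFS location, SerDe, properties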

3. Driver - the component that parses the query, performs semantic analysis on the
different query blocks and query expressions, and eventually generates an
execution plan with the help of the table and partition metadata looked up
from the Metastore. The execution plan created by the compiler is a DAG
(directed acyclic graph) of stages.
A set of jar files that ship with the Hive package helps convert these
HiveQL queries into equivalent MapReduce jobs (Java) and execute them on
MapReduce.
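
The plan the Driver produces can be viewed with the EXPLAIN statement; the query
below reuses the earlier illustrative table:

    -- show the DAG of stages the compiler generates for a query
    EXPLAIN
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url;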

4. Hive Clients - the interfaces through which we submit Hive queries,
e.g. the Hive CLI and Beeline are terminal interfaces; web interfaces such as
Hue and Ambari can be used for the same purpose.
On connecting to Hive via the CLI or a web interface, a connection to the
Metastore is established as well.
If you have any queries, kindly visit our blog for more information about Big Data Hadoop
and its ecosystem: How Big Data Works | Big Data Hadoop Tutorials for freshers

https://www.quora.com/How-does-Hive-on-Hadoop-work/answer/Prwatech?prompt_topic_bio=1
