
Hadoop Architecture

Apache Hadoop is an open-source software framework for the storage and
large-scale processing of data sets on clusters of commodity hardware.
There are five main building blocks inside this runtime environment
(from bottom to top):

 the cluster is the set of host machines (nodes). Nodes may be
partitioned into racks. This is the hardware part of the infrastructure.

 the YARN infrastructure (Yet Another Resource Negotiator) is the
framework responsible for providing the computational resources
(e.g., CPUs, memory, etc.) needed for application execution. Two
important elements are:

o the Resource Manager (one per cluster) is the master. It
knows where the slaves are located (Rack Awareness) and
how many resources they have. It runs several services; the
most important is the Resource Scheduler, which decides how
to assign the resources.

o the Node Manager (many per cluster) is the slave of the
infrastructure. When it starts, it announces itself to the
Resource Manager and periodically sends it a heartbeat.
Each Node Manager offers some resources to the cluster;
its resource capacity is the amount of memory and the
number of vcores. At run-time, the Resource Scheduler
decides how to use this capacity: a Container is a fraction
of the Node Manager's capacity, and it is used by the client
for running a program.

 the HDFS Federation is the framework responsible for providing
permanent, reliable, and distributed storage. This is typically used
for storing inputs and outputs (but not intermediate ones).

 other alternative storage solutions can be used instead. For instance,
Amazon uses the Simple Storage Service (S3).

 the MapReduce Framework is the software layer implementing
the MapReduce paradigm.

The YARN infrastructure and the HDFS Federation are completely
decoupled and independent: the first provides resources for running
an application, while the second provides storage. The MapReduce
framework is only one of many possible frameworks that run on top of
YARN (although it is currently the only one implemented).

YARN: Application Startup


[Figure: YARN application startup, showing the Job Submitter, the Resource
Manager, and Node Managers on the cluster nodes]

The actors involved are:

 the Job Submitter (the client)

 the Resource Manager (the master)

 the Node Manager (the slave)

The application startup process is the following:

1. a client submits an application to the Resource Manager

2. the Resource Manager allocates a container

3. the Resource Manager contacts the related Node Manager

4. the Node Manager launches the container

5. the Container executes the Application Master

The Application Master is responsible for the execution of a single
application. It asks for containers from the Resource Scheduler (Resource
Manager) and executes specific programs (e.g., the main of a Java class)
on the obtained containers. The Application Master knows the application
logic and thus it is framework-specific. The MapReduce framework
provides its own implementation of an Application Master.
The Resource Manager is a single point of failure in YARN. By using
Application Masters, YARN spreads the metadata related to running
applications over the cluster. This reduces the load on the Resource
Manager and makes it quickly recoverable.

IBM developed GPFS (General Parallel File System) in 1998 as a SAN file system for use in HPC
applications.

IBM later hooked GPFS into Hadoop, and today IBM offers GPFS as a storage option for Hadoop.

GPFS is basically a storage file system developed as a SAN file system. Being a SAN storage
system, it cannot be attached directly to the Hadoop nodes that make up a cluster. This is where
IBM FPO (File Placement Optimization) comes into the picture and bridges the gap.

FPO is essentially an emulation of a key component of HDFS: moving the workload from the
application to the data. Basically, it moves the job to the data instead of moving the data to the job.

GPFS is POSIX compliant, which enables any other application running on top of the Hadoop
cluster to access data stored in the file system in a straightforward manner. With HDFS, only
Hadoop applications can access the data, and they must go through the Java-based HDFS API.
The major difference is thus framework versus file system: GPFS gives users the flexibility to
access stored data from both Hadoop and non-Hadoop systems, which frees them to create more
flexible workflows (Big Data, ETL, or online). For example, one can create a series of ETL
processes with multiple execution steps, local data, or Java processes to manipulate the data;
the ETL can also be plugged into MapReduce to execute the workflow.

1. Hierarchical storage management

Allows efficient usage of disk drives with different performance characteristics

2. High performance support for MapReduce applications

Stripes data across disks by using metablocks, which allows a MapReduce split to be spread over
local disks

3. High performance support for traditional applications

 Manages metadata by using the local node when possible rather than reading
metadata into memory unnecessarily
 Caches data on the client side to increase throughput of random reads
 Supports concurrent reads and writes by multiple programs
 Provides sequential access that enables fast sorts, improving performance for
query languages such as Pig and Jaql
4. High availability
Has no single point of failure because the architecture supports the following attributes:
 Distributed metadata
 Replication of both metadata and data
 Node quorums
 Automatic distributed node failure recovery and reassignment
5. POSIX compliance

Is fully POSIX compliant, which provides the following benefits:

 Support for a wide range of traditional applications
 Support for UNIX utilities, which enable file copying by using FTP or SCP
 Updating and deleting data
 No limitations or performance issues when using a Lucene text index
6. Data replication

Provides cluster-to-cluster replication over a wide area network

Hadoop - HDFS

Hadoop File System was developed using distributed file system design. It
runs on commodity hardware. Unlike other distributed systems, HDFS is
highly fault-tolerant and designed using low-cost hardware.

HDFS holds a very large amount of data and provides easy access. To store
such huge data, the files are stored across multiple machines. These files
are stored in a redundant fashion to rescue the system from possible data
losses in case of failure. HDFS also makes applications available to parallel
processing.

Features of HDFS
 It is suitable for distributed storage and processing.

 Hadoop provides a command interface to interact with HDFS.

 The built-in servers of namenode and datanode help users to easily check the
status of the cluster.
 Streaming access to file system data.

 HDFS provides file permissions and authentication.

HDFS Architecture
Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software. It is software that can be
run on commodity hardware. The system having the namenode acts as the
master server and it does the following tasks:

 Manages the file system namespace.

 Regulates client’s access to files.

 It also executes file system operations such as renaming, closing, and opening
files and directories.

Datanode

The datanode is a commodity hardware having the GNU/Linux operating
system and datanode software. For every node (commodity
hardware/system) in a cluster, there will be a datanode. These nodes
manage the data storage of their system.

 Datanodes perform read-write operations on the file systems, as per client
request.

 They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.

Block

Generally the user data is stored in the files of HDFS. The file in a file
system will be divided into one or more segments and/or stored in
individual data nodes. These file segments are called blocks. In other
words, the minimum amount of data that HDFS can read or write is called a
Block. The default block size is 64MB, but it can be increased as per the
need by changing the HDFS configuration.
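As a sketch of how a file maps onto blocks (plain Java arithmetic, illustrative only; the 200MB file size below is a made-up value):

```java
public class HdfsBlocks {
    // Number of blocks needed to store a file: ceiling of size / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Size of the last block, which may be smaller than the block size.
    static long lastBlockSize(long fileSizeBytes, long blockSizeBytes) {
        long rem = fileSizeBytes % blockSizeBytes;
        return rem == 0 ? blockSizeBytes : rem;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // the default 64MB block size
        long fileSize = 200L * 1024 * 1024;  // a hypothetical 200MB file
        System.out.println(blockCount(fileSize, blockSize));                      // 4
        System.out.println(lastBlockSize(fileSize, blockSize) / (1024 * 1024));   // 8
    }
}
```

A 200MB file thus occupies three full 64MB blocks plus one partial 8MB block, each replicated independently across datanodes.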

Goals of HDFS
 Fault detection and recovery : Since HDFS includes a large number of
commodity hardware components, failure of components is frequent. Therefore HDFS should
have mechanisms for quick and automatic fault detection and recovery.

 Huge datasets : HDFS should have hundreds of nodes per cluster to manage
the applications having huge datasets.

 Hardware at data : A requested task can be done efficiently when the
computation takes place near the data. Especially where huge datasets are
involved, this reduces the network traffic and increases the throughput.

Hadoop - MapReduce
MapReduce is a framework using which we can write applications to process
huge amounts of data, in parallel, on large clusters of commodity hardware in a
reliable manner.

MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. Map takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples
(key/value pairs). The Reduce task then takes the output from a map as
input and combines those data tuples into a smaller set of tuples. As the
sequence of the name MapReduce implies, the reduce task is always performed
after the map job.

The major advantage of MapReduce is that it is easy to scale data processing

over multiple computing nodes. Under the MapReduce model, the data
processing primitives are called mappers and reducers. Decomposing a data
processing application into mappers and reducers is sometimes nontrivial. But,
once we write an application in the MapReduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has
attracted many programmers to use the MapReduce model.

The Algorithm
 Generally the MapReduce paradigm is based on sending the computation to where the
data resides!

 MapReduce program executes in three stages, namely map stage, shuffle stage,
and reduce stage.

o Map stage : The map or mapper’s job is to process the input data.
Generally the input data is in the form of a file or directory and is stored in
the Hadoop file system (HDFS). The input file is passed to the mapper
function line by line. The mapper processes the data and creates several
small chunks of data.

o Reduce stage : This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes
from the mapper. After processing, it produces a new set of output,
which will be stored in the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.

 The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.

 Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.

 After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
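The map, shuffle, and reduce stages above can be sketched as a single-process word count in plain Java. This is a simulation of the model only, not the Hadoop API; the method names are invented for illustration:

```java
import java.util.*;

public class MapReduceSketch {
    // Map stage: break each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle stage: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce stage: combine each key's list of values into a single count.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
        // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In real Hadoop, the mappers and reducers run on different nodes and the shuffle moves data over the network; here all three stages simply run in sequence in one JVM.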

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and
produces a set of <key, value> pairs as the output of the job, conceivably
of different types.

The key and the value classes should be serializable by the
framework and hence need to implement the Writable interface.
Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework. Input and Output types of a
MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

       Input            Output
Map    <k1, v1>         list(<k2, v2>)
Reduce <k2, list(v2)>   list(<k3, v3>)
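To illustrate why keys must be both serializable and sortable, here is a plain-Java sketch of a key type. The class name and methods are invented stand-ins that mirror Hadoop's Writable/WritableComparable contract without requiring the Hadoop libraries:

```java
import java.io.*;

// A stand-in for a WritableComparable key: it can be written to and
// read from a byte stream, and compared so the framework can sort it.
public class TextKey implements Comparable<TextKey> {
    private String value;

    public TextKey(String value) { this.value = value; }

    // Serialize the key to a stream (mirrors Writable.write).
    public void write(DataOutput out) throws IOException {
        out.writeUTF(value);
    }

    // Deserialize the key from a stream (mirrors Writable.readFields).
    public void readFields(DataInput in) throws IOException {
        value = in.readUTF();
    }

    // Ordering lets the framework sort keys during the shuffle.
    @Override public int compareTo(TextKey other) {
        return value.compareTo(other.value);
    }

    public String get() { return value; }

    public static void main(String[] args) throws IOException {
        TextKey k = new TextKey("hadoop");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        k.write(new DataOutputStream(bytes));

        TextKey copy = new TextKey("");
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.get());                           // hadoop
        System.out.println(k.compareTo(new TextKey("pig")) < 0);  // true
    }
}
```

Hadoop's real Writable interface uses exactly this write/readFields pair; the framework calls them to move keys and values between nodes, and compareTo to order keys before they reach a reducer.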

Jaql (JAQL) is a functional data processing and query language most commonly used
for JSON query processing on Big Data.

It started as an open-source project at Google, but the last Google release was on
7/12/2010; IBM took it over as the primary data processing language for their Hadoop
software package BigInsights.

Although it was developed for JSON, it supports a variety of other data sources
like CSV, TSV, and XML.

JAQL supports lazy evaluation, so expressions are only materialized when needed.

The basic concept of JAQL is

source -> operator(parameter) -> sink ;

where a sink can be a source for a downstream operator. So typically a JAQL program has the
following structure, expressing a data processing graph:

source -> operator1(parameter) -> operator2(parameter) -> operator2(parameter)

-> operator3(parameter) -> operator4(parameter) -> sink ;

Most commonly, for readability reasons, JAQL programs are line-broken after the arrow, as is also a
common idiom in Twitter SCALDING:

source -> operator1(parameter)

-> operator2(parameter)
-> operator2(parameter)
-> operator3(parameter)
-> operator4(parameter)
-> sink ;
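The same source -> operator -> sink shape can be mimicked with Java streams, where each stage consumes the previous stage's output. This is an analogy only, not Jaql; the data and operations below are made up:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {
    public static List<String> run(List<Integer> source) {
        // source -> operator1 -> operator2 -> operator3 -> sink
        return source.stream()
                .filter(n -> n % 2 == 0)         // operator1: keep even numbers
                .map(n -> n * 10)                // operator2: transform each element
                .map(n -> "value:" + n)          // operator3: format as a string
                .collect(Collectors.toList());   // sink: materialize the result
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3, 4))); // [value:20, value:40]
    }
}
```

Like Jaql, Java streams are lazily evaluated: nothing runs until the terminal sink operation is reached.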

Core Operators

Use the EXPAND expression to flatten nested arrays. This expression takes as input an array of
nested arrays [ [ T ] ] and produces an output array [ T ], by promoting the elements of each nested
array to the top-level output array.


Use the FILTER operator to filter away elements from the specified input array. This operator takes
as input an array of elements of type T and outputs an array of the same type, retaining those
elements for which a predicate evaluates to true. It is the Jaql equivalent of the SQL WHERE clause.

data = [
  {name: "Jon Doe", income: 20000, mgr: false},
  {name: "Vince Wayne", income: 32500, mgr: false},
  {name: "Jane Dean", income: 72000, mgr: true},
  {name: "Alex Smith", income: 25000, mgr: false}
];

data -> filter $.mgr;

[
  {
    "income": 72000,
    "mgr": true,
    "name": "Jane Dean"
  }
]

data -> filter $.income < 30000;

[
  {
    "income": 20000,
    "mgr": false,
    "name": "Jon Doe"
  },
  {
    "income": 25000,
    "mgr": false,
    "name": "Alex Smith"
  }
]


Use the GROUP expression to group one or more input arrays on a grouping key and apply an
aggregate function per group.

Use the JOIN operator to express a join between two or more input arrays. This operator supports
multiple types of joins, including natural, left-outer, right-outer, and outer joins.

Use the SORT operator to sort an input by one or more fields.


The TOP expression selects the first k elements of its input. If a comparator is provided, the output is
semantically equivalent to sorting the input, then selecting the first k elements.

Use the TRANSFORM operator to realize a projection or to apply a function to all items of an input.
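As a rough analogy, the core operators above correspond to familiar Java stream operations. The sketch below (plain Java, not Jaql; the Employee record and method names are invented) reuses the sample data from the FILTER example:

```java
import java.util.*;
import java.util.stream.*;

public class CoreOpsSketch {
    record Employee(String name, int income, boolean mgr) {}

    static final List<Employee> DATA = List.of(
            new Employee("Jon Doe", 20000, false),
            new Employee("Vince Wayne", 32500, false),
            new Employee("Jane Dean", 72000, true),
            new Employee("Alex Smith", 25000, false));

    // FILTER: keep only the managers (like: data -> filter $.mgr).
    static List<String> managerNames() {
        return DATA.stream().filter(Employee::mgr)
                .map(Employee::name).collect(Collectors.toList());
    }

    // GROUP: total income per manager flag, one aggregate per group.
    static Map<Boolean, Integer> incomeByMgr() {
        return DATA.stream().collect(
                Collectors.groupingBy(Employee::mgr, Collectors.summingInt(Employee::income)));
    }

    // SORT + TOP: sort descending and select the first k = 2 elements.
    static List<Integer> topTwoIncomes() {
        return DATA.stream().map(Employee::income)
                .sorted(Comparator.reverseOrder()).limit(2).collect(Collectors.toList());
    }

    // EXPAND: flatten an array of nested arrays [[T]] into [T].
    static List<Integer> expand(List<List<Integer>> nested) {
        return nested.stream().flatMap(List::stream).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(managerNames());                              // [Jane Dean]
        System.out.println(topTwoIncomes());                             // [72000, 32500]
        System.out.println(expand(List.of(List.of(1, 2), List.of(3))));  // [1, 2, 3]
    }
}
```

TRANSFORM corresponds to map above, and JOIN has no single-stream equivalent here; the point is only that each Jaql operator consumes one array and produces another.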

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce.

Below is an example of a "Word Count" program in Pig Latin:

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word

word_groups = GROUP filtered_words BY word;

-- count the entries in each group

word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count,
group AS word;

-- order the records by count

ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

The above program will generate parallel executable tasks which can be distributed across multiple
machines in a Hadoop cluster to count the number of words in a dataset such as all the webpages
on the internet.