Introduction to Hadoop Ecosystem
Introduction
 “Hadoop is a framework that allows for the
distributed processing of large data sets across
clusters of computers using simple programming
models”.

 In other words, Hadoop is a ‘software library’ that
allows its users to process large datasets across
distributed clusters of computers, thereby
enabling them to gather, store and analyze
huge sets of data.

 Hadoop provides various tools and technologies,
collectively termed the Hadoop ecosystem, to
enable the development and deployment of Big Data
solutions.
Hadoop Ecosystem
 The Hadoop ecosystem can be defined as a
comprehensive collection of tools and
technologies that can be effectively implemented and
deployed to provide Big Data solutions in a
cost-effective manner.

 MapReduce and Hadoop Distributed File System
(HDFS) are two core components of the Hadoop
ecosystem that provide a great starting point to manage
Big Data; however, they are not sufficient to deal with
all Big Data challenges.
 Along with these two, the Hadoop ecosystem provides
a collection of various elements to support the
complete development and deployment of Big
Data solutions.

 All these elements enable users to process large
datasets in real time and provide tools to support
various types of Hadoop projects, schedule jobs
and manage cluster resources.
 In short, MapReduce and HDFS provide the necessary
services and basic structure to deal with the core
requirements of Big Data solutions.

 Other services and tools of the ecosystem provide the
environment and components required to build and
manage purpose-driven Big Data applications.
Hadoop Distributed File System (HDFS)

 Hadoop Distributed File System (HDFS) is designed to
reliably store very large files across machines in a large
cluster.

 HDFS distributes large data files into blocks (a client-side
code sketch appears at the end of this section)
 Blocks are managed by different nodes in the cluster
 Each block is replicated on multiple nodes
 The NameNode stores metadata information about files and
blocks
 Some of the terms related to HDFS:
 Huge documents:

 HDFS is a file system intended for keeping huge
documents for future analysis.

 Appliance hardware:

 A device that is dedicated to a specific function, in
contrast to a general-purpose computer.
 Streaming information access:

 HDFS is created for batch processing.

 Priority is given to high throughput of data
access rather than low latency of data access.

 A dataset is commonly produced or copied from the
source, and then various analyses are performed on that
dataset over time.
 Low-latency information access:

 Applications that require access to information in
milliseconds do not work well with HDFS.

 Loads of small documents:

 Because the NameNode holds file system metadata in
memory, the number of documents a file system can hold is
limited by the amount of memory on the server.
 As a rule of thumb, each file and directory
takes around 150 bytes of memory on the NameNode.
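
A minimal client-side sketch (not from these slides) of storing a large file in HDFS with the Java FileSystem API; the block splitting and replication described above happen transparently. The paths and the NameNode URI are illustrative assumptions.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with the cluster's own URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // HDFS splits the file into blocks and replicates them across DataNodes;
        // the client only ever sees a single logical file.
        fs.copyFromLocalFile(new Path("/data/local/access.log"),
                             new Path("/user/demo/access.log"));
        fs.close();
    }
}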
HDFS Architecture
 HDFS follows a master-slave architecture.

 It comprises a NameNode and a number of DataNodes.

 The NameNode is the master that manages the various
DataNodes.

 The NameNode manages the HDFS cluster metadata,
whereas DataNodes store the data.
 Files and directories are presented by clients to the
NameNode.

 These files and directories are managed on the
NameNode.

 Operations on them, such as modifying, opening
and closing them, are performed by the
NameNode.
 Internally, a file is divided into one
or more blocks, which are stored in a group of
DataNodes.

 DataNodes can also execute operations like the
creation, deletion, and replication of blocks, depending
on the instructions from the NameNode.
 In the HDFS architecture, data is stored in different
blocks.

 Blocks are managed by the different nodes.

 The default block size was 64 MB in earlier Hadoop versions;
Hadoop 2.x and later default to 128 MB (a configuration
sketch follows at the end of this section).
 Some of the failure management tasks:

 Monitoring:

 DataNodes send continuous signals (“heartbeats”) to the
NameNode.

 If the NameNode stops hearing these heartbeats, the DataNode
is considered to have failed and is no longer
available.

 The blocks on the failed node are then served from their replicas.


 Rebalancing:

 In this process, blocks are shifted from
one location to another wherever free space is
available.

 Metadata replication:

 Replicas of the corresponding metadata files are maintained on the
same HDFS.
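
A small sketch, assuming a standard Hadoop 2.x+ Java client, of how the block size and replication factor discussed above can be chosen per file at creation time; the values, buffer size and path are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        short replication = 3;                // replicas per block
        long blockSize = 128L * 1024 * 1024;  // 128 MB blocks
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/user/demo/big-file.dat"),
                true, bufferSize, replication, blockSize);
        out.writeBytes("example payload\n");
        out.close();
        fs.close();
    }
}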
NameNodes and DataNodes
 An HDFS cluster has two node types working in a
master-slave design:

 A NameNode (the master) and various DataNodes
(slaves).

 The NameNode manages the file system namespace.

 It stores the metadata for all the files and
directories in the file system.
 This metadata is stored on the local disk as two files:

 The namespace image (fsimage) and the edit log.

 The NameNode is aware of the DataNodes on which all
the blocks of a given file are found.

 A client accesses the file system on behalf of the user
by communicating with the DataNodes and NameNode
(a read sketch is given at the end of this section).
 DataNodes are the workhorses of the file system.

 They store and retrieve blocks when they are asked to
(by clients, or the NameNode), and they report back to
the NameNode periodically.

 Without the NameNode, the file system cannot be
used.
 In fact, if the machine running the NameNode crashes, all
files on the file system would be lost.

 To overcome this, Hadoop provides two methods.

 One way is to back up the files that make up the persistent
state of the file system metadata.

 Another way is to run a secondary NameNode.

 The secondary NameNode is updated periodically by merging
the namespace image with the edit log.
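
A minimal sketch of the read path just described, assuming the standard Hadoop Java client: open() asks the NameNode where the blocks live, and the returned stream then reads the block data from the DataNodes. The path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() contacts the NameNode for block locations; the stream
        // then fetches the block contents from the DataNodes.
        FSDataInputStream in = fs.open(new Path("/user/demo/access.log"));
        IOUtils.copyBytes(in, System.out, 4096, false);  // print the file
        in.close();
        fs.close();
    }
}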


Features of HDFS
 The key features are:

 Data Replication
 Data Resilience
 Highly fault-tolerant
 Data Integrity
 High throughput
 Suitable for applications with large data sets
 Streaming access to file system data
Data Replication
 The default replication factor is 3.
 HDFS primarily maintains one replica of each block
locally.

 A second replica of the block is then placed on a
different rack to guard against rack failure.

 A third replica is maintained on a different server of the
same remote rack.

 Finally, additional replicas are sent to random locations
in local and remote clusters (a sketch that inspects replica
locations follows below).
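
A hedged sketch of how the replica placement described above can be observed from a client: getFileBlockLocations() reports, for each block, the DataNodes holding a replica. The path is illustrative and the hosts reported depend on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block lists the hosts that hold one of its replicas.
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}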
Data Resilience

 Resiliency is the ability of a server, network, storage
system, or an entire data center, to recover quickly and
continue operating even when there has been an
equipment failure, power outage or other disruption.
Fault tolerance
 Fault tolerance is the property that enables a system to
continue operating properly in the event of the failure of
(one or more faults within) some of its components.

 An HDFS instance may consist of thousands of server
machines, each storing part of the file system’s data.
 Since there is a huge number of components and each
component has a non-trivial probability of failure, there is
almost always some component that is non-functional.

 Detection of faults and quick, automatic recovery from them
is a core architectural goal of HDFS.

Data Integrity
 Consider a situation: a block of data fetched from a
DataNode arrives corrupted.
 This corruption may occur because of faults in a storage
device, network faults, or buggy software.

 An HDFS client creates a checksum of every block of
its file and stores it in hidden files in the HDFS
namespace.

 When a client retrieves the contents of a file, it verifies
that the corresponding checksums match.
 If they do not match, the client can retrieve the block from
a replica.
 HDFS ensures data integrity throughout the cluster
with the help of the following features:

 Maintaining Transaction Logs:

 HDFS maintains transaction logs in order to monitor
every operation and carry out effective auditing and
recovery of data in case something goes wrong.
 Checksum Validation:

 A checksum is an effective error-detection technique
wherein a numerical value is computed for a transmitted
message on the basis of the bits contained in
the message.

 HDFS uses checksum validation to verify the
contents of a file.
 Validation is carried out as follows (a small checksum
sketch follows this list):

1. When a file is requested by the client, its contents are
verified using the checksum.

2. If the checksums of the received and sent messages match,
the file operations proceed further; otherwise, an error is
reported.

3. The message receiver verifies the checksum of the message
to ensure that it is the same as in the sent message. If a
difference is identified between the two values, the message is
discarded on the assumption that it has been tampered with or
altered in transit. Checksum files are hidden to avoid tampering.
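
The following is not HDFS's internal code; it is a plain-Java illustration of the checksum idea in the steps above, using CRC32 (HDFS itself computes CRC-based checksums over chunks of each block). The message bytes are illustrative.

import java.util.zip.CRC32;

public class ChecksumDemo {
    static long checksumOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = "block contents".getBytes();
        long sentChecksum = checksumOf(sent);   // stored alongside the data

        byte[] received = sent.clone();         // imagine these bytes arrived over the network
        if (checksumOf(received) == sentChecksum) {
            System.out.println("checksums match - data accepted");
        } else {
            System.out.println("mismatch - discard and read another replica");
        }
    }
}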
 Creating Data Blocks:

 HDFS maintains replicated copies of data blocks to
avoid corruption of a file due to the failure of a server.

 The servers that store data blocks are also called block servers.


 High Throughput

 Throughput is a measure of how many units of
information a system can process in a given amount of
time. HDFS is designed to provide high throughput.

 Suitable for applications with large data sets

 HDFS is suitable for applications that need to
store, collect or analyze large data sets.

 Streaming access to file system data

 Even though HDFS gives priority to batch processing,
it provides streaming access to file data.
Data Pipelining
 A connection between multiple DataNodes that
supports movement of data across servers is termed
a pipeline (a client-side write sketch follows this list).

 The client retrieves a list of DataNodes on which to place
replicas of a block.
 The client writes the block to the first DataNode.
 The first DataNode forwards the data to the next
DataNode in the pipeline.
 When all replicas are written, the client moves on to
write the next block of the file.
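
A minimal client-side sketch of the write pipeline above, assuming the standard Hadoop Java client: the client only writes to a stream, while block splitting and DataNode-to-DataNode replication happen inside the client library. The path and data are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelinedWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/demo/events.log"));

        for (int i = 0; i < 1000; i++) {
            out.writeBytes("event-" + i + "\n");  // data flows DataNode to DataNode behind the scenes
        }
        out.hflush();  // make the written data visible to readers
        out.close();   // the last block completes once all replicas acknowledge
        fs.close();
    }
}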
MapReduce
Working of MapReduce
 Input phase
 Here we have a Record Reader that translates each record
in an input file and sends the parsed data to the mapper in
the form of key-value pairs.

 The Mapper
 Reads data as key/value pairs
◦ The key is often discarded

 Outputs zero or more key/value pairs

 Map is a user-defined function, which takes a series of
key-value pairs and processes each one of them to generate zero
or more key-value pairs.
 Intermediate Keys − The key-value pairs generated
by the mapper are known as intermediate keys.

 Combiner − A combiner is a type of local Reducer
that groups similar data from the map phase into
identifiable sets.
 It takes the intermediate keys from the mapper as input
and applies user-defined code to aggregate the values
in the small scope of one mapper.

 It is not a part of the main MapReduce algorithm; it is
optional.
 Shuffle and Sort
 Output from the mapper is sorted by key.

 All values with the same key are guaranteed to go to
the same machine.

 The Reducer task starts with the Shuffle and Sort step.
It downloads the grouped key-value pairs onto the local
machine, where the Reducer is running.

 The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in
the Reducer task.
 The Reducer
 Called once for each unique key

 Gets a list of all values associated with a key as input

 The reducer outputs zero or more final key/value pairs
◦ Usually just one output per input key.

 Output Phase − In the output phase, we have an
output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file
using a record writer.
Simple Example
MapReduce: Word Count Example
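
The word-count figure from the original slide is not reproduced here; in its place, a minimal Java sketch of the same job, in the style of the standard Hadoop tutorials (class names are illustrative, not taken from these slides).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every input line, emit (word, 1) for each word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also usable as the optional combiner): sum the counts per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation (optional)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}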
MapReduce-Example

 Let us take a real-world example to comprehend the
power of MapReduce.

 Twitter receives around 500 million tweets per day,
which is nearly 6,000 tweets per second.

 The following steps show how Twitter manages
its tweets with the help of MapReduce.
 The MapReduce algorithm performs the following
actions (a mapper sketch for the first two steps follows this list):

 Tokenize − Tokenizes the tweets into maps of tokens
and writes them as key-value pairs.

 Filter − Filters unwanted words from the maps of
tokens and writes the filtered maps as key-value pairs.

 Count − Generates a token counter per word.

 Aggregate Counters − Prepares an aggregate of
similar counter values into small manageable units.
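
A hedged sketch of just the Tokenize and Filter steps as a single mapper: each tweet line is split into words, and a few unwanted (stop) words are dropped before emitting (word, 1) pairs. The stop-word list and class name are illustrative; the Count and Aggregate steps would then be handled by a combiner and reducer like the IntSumReducer in the word-count sketch above.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TweetTokenFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Illustrative stop-word list for the Filter step.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "rt", "and"));
    private static final IntWritable ONE = new IntWritable(1);
    private final Text token = new Text();

    @Override
    protected void map(LongWritable key, Text tweet, Context context)
            throws IOException, InterruptedException {
        for (String word : tweet.toString().toLowerCase().split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;                  // Filter: skip unwanted tokens
            }
            token.set(word);
            context.write(token, ONE);     // Tokenize output: (word, 1)
        }
    }
}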
MapReduce Features

 Automatic parallelization and distribution

 Fault-Tolerance

 Used to process large data sets.

 Users can write MapReduce programs in many languages,
such as Java, Python and Ruby (non-Java languages are
supported through Hadoop Streaming).
