Introduction to Hadoop Ecosystem
Introduction
 “Hadoop is a framework that allows for the
distributed processing of large data sets across
clusters of computers using simple programming
models”.

 In other words, Hadoop is a ‘software library’ that
allows its users to process large datasets across
distributed clusters of computers, thereby
enabling them to gather, store and analyze
huge sets of data.

 Hadoop provides various tools and technologies,
collectively termed the Hadoop ecosystem, to
enable the development and deployment of Big Data
solutions.
Hadoop Ecosystem
 The Hadoop ecosystem can be defined as a
comprehensive collection of tools and
technologies that can be effectively implemented and
deployed to provide Big Data solutions in a
cost-effective manner.

 MapReduce and Hadoop Distributed File System
(HDFS) are two core components of the Hadoop
ecosystem that provide a great starting point to manage
Big Data; however, they are not sufficient to deal with
all Big Data challenges.
 Along with these two, the Hadoop ecosystem provides
a collection of various elements to support the
complete development and deployment of Big
Data solutions.

 All these elements enable users to process large
datasets in real time and provide tools to support
various types of Hadoop projects, schedule jobs
and manage cluster resources.
 In short, MapReduce and HDFS provide the necessary
services and basic structure to deal with the core
requirements of Big Data solutions.

 Other services and tools of the ecosystem provide the
environment and components required to build and
manage purpose-driven Big Data applications.
Hadoop Distributed File System (HDFS)

 Hadoop Distributed File System (HDFS) is designed to
reliably store very large files across machines in a large
cluster.

 HDFS distributes large data files into blocks (a client-side
code sketch appears at the end of this section)
 Blocks are managed by different nodes in the cluster
 Each block is replicated on multiple nodes
 The NameNode stores metadata information about files and
blocks
 Some of the terms related to HDFS:
 Huge documents:

 HDFS is a file system intended for keeping huge
documents for future analysis.

 Appliance hardware:

 A device that is dedicated to a specific function, in
contrast to a general-purpose computer.
 Streaming information access:

 HDFS is created for batch processing.

 Priority is given to high throughput of data
access rather than low latency of data access.

 A dataset is commonly produced or copied from the
source, and then various analyses are performed on that
dataset over time.
 Low-latency information access:

 Applications that require access to information in
milliseconds do not work well with HDFS.

 Loads of small documents:

 Because the NameNode holds file system metadata in
memory, the number of documents a file system can hold is
limited by the amount of memory on the server.
 As a rule of thumb, each file and directory
takes around 150 bytes of memory on the NameNode.
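
A minimal client-side sketch (not from these slides) of storing a large file in HDFS with the Java FileSystem API; the block splitting and replication described above happen transparently. The paths and the NameNode URI are illustrative assumptions.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with the cluster's own URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // HDFS splits the file into blocks and replicates them across DataNodes;
        // the client only ever sees a single logical file.
        fs.copyFromLocalFile(new Path("/data/local/access.log"),
                             new Path("/user/demo/access.log"));
        fs.close();
    }
}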
HDFS Architecture
 HDFS follows a master-slave architecture.

 It comprises a NameNode and a number of DataNodes.

 The NameNode is the master that manages the various
DataNodes.

 The NameNode manages the HDFS cluster metadata,
whereas DataNodes store the data.
 Files and directories are presented by clients to the
NameNode.

 These files and directories are managed on the
NameNode.

 Operations on them, such as modifying, opening
and closing them, are performed by the
NameNode.
 Internally, a file is divided into one
or more blocks, which are stored in a group of
DataNodes.

 DataNodes can also execute operations like the
creation, deletion, and replication of blocks, depending
on the instructions from the NameNode.
 In the HDFS architecture, data is stored in different
blocks.

 Blocks are managed by the different nodes.

 The default block size was 64 MB in earlier Hadoop versions;
Hadoop 2.x and later default to 128 MB (a configuration
sketch follows at the end of this section).
 Some of the failure management tasks:

 Monitoring:

 DataNodes send continuous signals (“heartbeats”) to the
NameNode.

 If the NameNode stops hearing these heartbeats, the DataNode
is considered to have failed and is no longer
available.

 The blocks on the failed node are then served from their replicas.


 Rebalancing:

 In this process, blocks are shifted from
one location to another wherever free space is
available.

 Metadata replication:

 Replicas of the corresponding metadata files are maintained on the
same HDFS.
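
A small sketch, assuming a standard Hadoop 2.x+ Java client, of how the block size and replication factor discussed above can be chosen per file at creation time; the values, buffer size and path are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        short replication = 3;                // replicas per block
        long blockSize = 128L * 1024 * 1024;  // 128 MB blocks
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/user/demo/big-file.dat"),
                true, bufferSize, replication, blockSize);
        out.writeBytes("example payload\n");
        out.close();
        fs.close();
    }
}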
NameNodes and DataNodes
 An HDFS cluster has two node types working in a
master-slave design:

 A NameNode (the master) and various DataNodes
(slaves).

 The NameNode manages the file system namespace.

 It stores the metadata for all the files and
directories in the file system.
 This metadata is stored on the local disk as two files:

 The namespace image (fsimage) and the edit log.

 The NameNode is aware of the DataNodes on which all
the blocks of a given file are found.

 A client accesses the file system on behalf of the user
by communicating with the DataNodes and NameNode
(a read sketch is given at the end of this section).
 DataNodes are the workhorses of the file system.

 They store and retrieve blocks when they are asked to
(by clients, or the NameNode), and they report back to
the NameNode periodically.

 Without the NameNode, the file system cannot be
used.
 In fact, if the machine running the NameNode crashes, all
files on the file system would be lost.

 To overcome this, Hadoop provides two methods.

 One way is to back up the files that make up the persistent
state of the file system metadata.

 Another way is to run a secondary NameNode.

 The secondary NameNode is updated periodically by merging
the namespace image with the edit log.
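
A minimal sketch of the read path just described, assuming the standard Hadoop Java client: open() asks the NameNode where the blocks live, and the returned stream then reads the block data from the DataNodes. The path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() contacts the NameNode for block locations; the stream
        // then fetches the block contents from the DataNodes.
        FSDataInputStream in = fs.open(new Path("/user/demo/access.log"));
        IOUtils.copyBytes(in, System.out, 4096, false);  // print the file
        in.close();
        fs.close();
    }
}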


Features of HDFS
 The key features are:

 Data Replication
 Data Resilience
 Highly fault-tolerant
 Data Integrity
 High throughput
 Suitable for applications with large data sets
 Streaming access to file system data
Data Replication
 The default replication factor is 3.
 HDFS primarily maintains one replica of each block
locally.

 A second replica of the block is then placed on a
different rack to guard against rack failure.

 A third replica is maintained on a different server of the
same remote rack.

 Finally, additional replicas are sent to random locations
in local and remote clusters (a sketch that inspects replica
locations follows below).
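
A hedged sketch of how the replica placement described above can be observed from a client: getFileBlockLocations() reports, for each block, the DataNodes holding a replica. The path is illustrative and the hosts reported depend on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block lists the hosts that hold one of its replicas.
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}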
Data Resilience

 Resiliency is the ability of a server, network, storage
system, or an entire data center, to recover quickly and
continue operating even when there has been an
equipment failure, power outage or other disruption.
Fault tolerance
 Fault tolerance is the property that enables a system to
continue operating properly in the event of the failure of
(one or more faults within) some of its components.

 An HDFS instance may consist of thousands of server
machines, each storing part of the file system’s data.
 Since there is a huge number of components and each
component has a non-trivial probability of failure, there is
almost always some component that is non-functional.

 Detection of faults and quick, automatic recovery from them
is a core architectural goal of HDFS.

Data Integrity
 Consider a situation: a block of data fetched from a
DataNode arrives corrupted.
 This corruption may occur because of faults in a storage
device, network faults, or buggy software.

 An HDFS client creates a checksum of every block of
its file and stores it in hidden files in the HDFS
namespace.

 When a client retrieves the contents of a file, it verifies
that the corresponding checksums match.
 If they do not match, the client can retrieve the block from
a replica.
 HDFS ensures data integrity throughout the cluster
with the help of the following features:

 Maintaining Transaction Logs:

 HDFS maintains transaction logs in order to monitor
every operation and carry out effective auditing and
recovery of data in case something goes wrong.
 Checksum Validation:

 A checksum is an effective error-detection technique
wherein a numerical value is computed for a transmitted
message on the basis of the bits contained in
the message.

 HDFS uses checksum validation to verify the
contents of a file.
 Validation is carried out as follows (a small checksum
sketch follows this list):

1. When a file is requested by the client, its contents are
verified using the checksum.

2. If the checksums of the received and sent messages match,
the file operations proceed further; otherwise, an error is
reported.

3. The message receiver verifies the checksum of the message
to ensure that it is the same as in the sent message. If a
difference is identified between the two values, the message is
discarded on the assumption that it has been tampered with or
altered in transit. Checksum files are hidden to avoid tampering.
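
The following is not HDFS's internal code; it is a plain-Java illustration of the checksum idea in the steps above, using CRC32 (HDFS itself computes CRC-based checksums over chunks of each block). The message bytes are illustrative.

import java.util.zip.CRC32;

public class ChecksumDemo {
    static long checksumOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = "block contents".getBytes();
        long sentChecksum = checksumOf(sent);   // stored alongside the data

        byte[] received = sent.clone();         // imagine these bytes arrived over the network
        if (checksumOf(received) == sentChecksum) {
            System.out.println("checksums match - data accepted");
        } else {
            System.out.println("mismatch - discard and read another replica");
        }
    }
}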
 Creating Data Blocks:

 HDFS maintains replicated copies of data blocks to
avoid corruption of a file due to the failure of a server.

 The servers that store data blocks are also called block servers.


 High Throughput

 Throughput is a measure of how many units of
information a system can process in a given amount of
time. HDFS is designed to provide high throughput.

 Suitable for applications with large data sets

 HDFS is suitable for applications that need to
store, collect or analyze large data sets.

 Streaming access to file system data

 Even though HDFS gives priority to batch processing,
it provides streaming access to file data.
Data Pipelining
 A connection between multiple DataNodes that
supports movement of data across servers is termed
a pipeline (a client-side write sketch follows this list).

 The client retrieves a list of DataNodes on which to place
replicas of a block.
 The client writes the block to the first DataNode.
 The first DataNode forwards the data to the next
DataNode in the pipeline.
 When all replicas are written, the client moves on to
write the next block of the file.
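
A minimal client-side sketch of the write pipeline above, assuming the standard Hadoop Java client: the client only writes to a stream, while block splitting and DataNode-to-DataNode replication happen inside the client library. The path and data are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelinedWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/demo/events.log"));

        for (int i = 0; i < 1000; i++) {
            out.writeBytes("event-" + i + "\n");  // data flows DataNode to DataNode behind the scenes
        }
        out.hflush();  // make the written data visible to readers
        out.close();   // the last block completes once all replicas acknowledge
        fs.close();
    }
}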
MapReduce
Working of MapReduce
 Input phase
 Here we have a Record Reader that translates each record
in an input file and sends the parsed data to the mapper in
the form of key-value pairs.

 The Mapper
 Reads data as key/value pairs
◦ The key is often discarded

 Outputs zero or more key/value pairs

 Map is a user-defined function, which takes a series of
key-value pairs and processes each one of them to generate zero
or more key-value pairs.
 Intermediate Keys − The key-value pairs generated
by the mapper are known as intermediate keys.

 Combiner − A combiner is a type of local Reducer
that groups similar data from the map phase into
identifiable sets.
 It takes the intermediate keys from the mapper as input
and applies user-defined code to aggregate the values
in the small scope of one mapper.

 It is not a part of the main MapReduce algorithm; it is
optional.
 Shuffle and Sort
 Output from the mapper is sorted by key.

 All values with the same key are guaranteed to go to
the same machine.

 The Reducer task starts with the Shuffle and Sort step.
It downloads the grouped key-value pairs onto the local
machine, where the Reducer is running.

 The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in
the Reducer task.
 The Reducer
 Called once for each unique key

 Gets a list of all values associated with a key as input

 The reducer outputs zero or more final key/value pairs
◦ Usually just one output per input key.

 Output Phase − In the output phase, we have an
output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file
using a record writer.
Simple Example
MapReduce: Word Count Example
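
The word-count figure from the original slide is not reproduced here; in its place, a minimal Java sketch of the same job, in the style of the standard Hadoop tutorials (class names are illustrative, not taken from these slides).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every input line, emit (word, 1) for each word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also usable as the optional combiner): sum the counts per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation (optional)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}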
MapReduce-Example

 Let us take a real-world example to comprehend the
power of MapReduce.

 Twitter receives around 500 million tweets per day,
which is nearly 6,000 tweets per second.

 The following steps show how Twitter manages
its tweets with the help of MapReduce.
 The MapReduce algorithm performs the following
actions (a mapper sketch for the first two steps follows this list):

 Tokenize − Tokenizes the tweets into maps of tokens
and writes them as key-value pairs.

 Filter − Filters unwanted words from the maps of
tokens and writes the filtered maps as key-value pairs.

 Count − Generates a token counter per word.

 Aggregate Counters − Prepares an aggregate of
similar counter values into small manageable units.
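
A hedged sketch of just the Tokenize and Filter steps as a single mapper: each tweet line is split into words, and a few unwanted (stop) words are dropped before emitting (word, 1) pairs. The stop-word list and class name are illustrative; the Count and Aggregate steps would then be handled by a combiner and reducer like the IntSumReducer in the word-count sketch above.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TweetTokenFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Illustrative stop-word list for the Filter step.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "rt", "and"));
    private static final IntWritable ONE = new IntWritable(1);
    private final Text token = new Text();

    @Override
    protected void map(LongWritable key, Text tweet, Context context)
            throws IOException, InterruptedException {
        for (String word : tweet.toString().toLowerCase().split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;                  // Filter: skip unwanted tokens
            }
            token.set(word);
            context.write(token, ONE);     // Tokenize output: (word, 1)
        }
    }
}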
MapReduce Features

 Automatic parallelization and distribution

 Fault-Tolerance

 Used to process large data sets.

 Users can write MapReduce programs in many languages,
such as Java, Python and Ruby (non-Java languages are
supported through Hadoop Streaming).
