
Hadoop provides 2 services:
1. Storage service: for any kind of data
2. Processing service: for any kind of data

Hadoop is actually a collection of multiple computers connected in a
network. These systems are classified into 2 types: masters and slaves.
There are 5 daemon processes that run Hadoop:
1. Primary Name Node
2. Secondary Name Node
3. Job Tracker
4. Data Node
5. Task Tracker

[Diagram: a master machine runs the Name Node, Job Tracker, and Secondary
Name Node; each slave machine runs a Task Tracker (TT) and a Data Node (DN)
that stores the data]

For writing data to HDFS (a client-side code sketch follows this list):
1. The name node identifies locations where the file splits can be
   stored and communicates them to the client tool.
2. The client tool interacts with the corresponding data nodes to
   write the splits on their local disks.
   a. Writing happens in parallel.
   b. This leads to high performance when writing to disk
      (distributed writing).
3. Every 30 seconds, each data node sends a block report to the
   name node.
4. The name node updates/cross-checks its metadata based on the
   block reports.
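As a minimal client-side sketch of this write path, using Hadoop's Java
FileSystem API (the HDFS path is illustrative, and the configuration is
assumed to come from the cluster's core-site.xml/hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS etc. from the cluster configuration files.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             // Behind this call the client asks the name node where the
             // blocks may go, then streams the data to those data nodes.
             FSDataOutputStream out =
                 fs.create(new Path("/user/cloudera/demo.txt"))) {
            out.writeUTF("hello hdfs");
        }
    }
}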
Processing data on HDFS (a word-count sketch follows this list):
1. The client submits a processing job to the job tracker.
2. The job tracker identifies the files to be processed in the job.
3. It sends a request to the name node for the file details.
4. The name node responds with the location of each split (including
   the replicated copies).
5. The job tracker identifies suitable task trackers for processing
   the individual splits.
6. Each task tracker ensures the local splits are processed and
   generates intermediate results that are stored on the local Linux
   disk of the slave computer.
7. This process is called mapping (Map).
8. Since the intermediate results are distributed across different
   computers, a shuffling phase ensures all results are clubbed
   together and stored in one location for further processing.
9. The job tracker takes the responsibility of identifying the
   correct system for this clubbing.
10. This clubbed result undergoes sorting to produce a consolidated,
    sorted output.
11. Then the job tracker runs another process, with the help of the
    task tracker, on the slave node containing the consolidated and
    sorted data. This process is called the reducer (Reduce).
12. The reducer produces the final results and writes them back to
    HDFS with the help of the name node.
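To make the Map and Reduce steps above concrete, here is a sketch of the
classic word-count job in Hadoop's Java MapReduce API; the class names and
the input/output paths passed on the command line are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the slave holding each split, emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // intermediate result on local disk
            }
        }
    }

    // Reduce phase: after shuffle/sort, all counts for a word arrive together.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // final result, written back to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}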
In the original Hadoop setup, the name node is a single point of failure
for storage and processing.
The secondary name node periodically takes snapshots of the metadata in
the primary name node.
If the primary name node fails, the cluster is down, but the data in the
secondary name node can be used to rebuild the primary name node and run
the cluster again.
Recent versions of Hadoop added another name node that works as a
failover site (a standby) for the primary name node.
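As an illustrative sketch only, name node high availability of this kind is
configured in hdfs-site.xml roughly as below; the nameservice ID "mycluster"
and the host names are assumptions, not values from these notes:

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value> <!-- active and standby name nodes -->
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>master1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>master2.example.com:8020</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value> <!-- fail over to the standby automatically -->
</property>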
All task trackers inform the job tracker about their presence every
3 seconds, and every 10th communication signal is a block report. Each
task tracker also sends information about the status of the tasks it has
handled in the last 30 seconds.
HDFS commands:
Cloudera provides a pseudo cluster with 1 node containing all the
daemons; you can download this from the Cloudera site.
For data management on HDFS, Hadoop provides a client tool called hdfs.
It gives various groups of commands for different purposes.
You cannot cd into HDFS, so there is no pwd.
While working in the Linux shell, you can query the contents of HDFS.
On HDFS, Cloudera creates a home directory for every Linux user.
That is the default target for relative paths.
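A few commonly used commands of the hdfs tool, as a sketch (the file and
directory names are illustrative; relative paths resolve against the
user's HDFS home directory):

hdfs dfs -ls /                   # list the HDFS root
hdfs dfs -mkdir input            # create a directory under the home directory
hdfs dfs -put data.txt input/    # copy a local file into HDFS
hdfs dfs -cat input/data.txt     # print a file stored on HDFS
hdfs dfs -get input/data.txt .   # copy a file from HDFS to the local disk
hdfs dfs -rm input/data.txt      # delete a file on HDFS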
The replication factor defines how many copies of a file are to be
maintained on HDFS by Hadoop.
This helps in many ways, such as backup and fail-safe operation.
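For example, the default replication factor is 3; it can be changed for
an existing file with the hdfs tool (the path is illustrative):

hdfs dfs -setrep -w 2 /user/cloudera/demo.txt   # keep 2 copies; -w waits until done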
MapReduce is a framework provided by Hadoop for processing data on HDFS
(see the word-count sketch above).
There are various tools built on top of Hadoop, and they are called the
Hadoop ecosystem.
Difference between HBase and HDFS:
HBase is a database management system built on top of HDFS.
HDFS can store sequential data in flat files, whereas HBase supports
random reads and writes on files that are stored on HDFS (a small client
sketch follows).
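As a small sketch of what random access looks like from the HBase Java
client (the table "users" and column family "info" are assumed to already
exist; they are not from these notes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Random write: update a single row by its key.
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Ravi"));
            table.put(put);
            // Random read: fetch that row directly, with no sequential scan.
            Result result = table.get(new Get(Bytes.toBytes("user123")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}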
IBM came up with IBM BigData, in which DB2 is a major role player.
