2. Processing service: for any kind of data

Hadoop is actually a collection of multiple computers connected in a network. These systems are classified into 2 types: Masters and Slaves.

There are 5 daemon processes that run Hadoop:
1. Primary Name Node
2. Secondary Name Node
3. Job Tracker
4. Data Node
5. Task Tracker
[Diagram: the master daemons (Name Node, Job Tracker, Secondary Name Node) coordinate the slave nodes; each slave runs a Task Tracker (TT) and a Data Node (DN), and the data nodes store the data.]
For writing data to HDFS
1. The name node identifies the locations where the file splits can be stored and communicates them to the client tool.
2. The client tool interacts with the corresponding data nodes to write the splits on their local disks.
   a. Writing happens in parallel.
   b. This leads to high performance when writing to disk (distributed writing).
3. Every 30 seconds, each data node sends a block report to the name node.
4. The name node updates/cross-checks its metadata based on the block report.
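The same write can also be driven programmatically through Hadoop's Java FileSystem API, which the client tools use underneath. A minimal sketch, with hypothetical file names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml, which points
            // the client at the name node.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The client asks the name node where the splits should go,
            // then streams the blocks to those data nodes directly.
            fs.copyFromLocalFile(
                new Path("/tmp/input.txt"),            // local source (hypothetical)
                new Path("/user/cloudera/input.txt")); // HDFS destination (hypothetical)
            fs.close();
        }
    }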
Processing data on HDFS
1. The client submits a processing job to the job tracker.
2. The job tracker identifies the files to be processed in the job.
3. It sends a request to the name node for the file details.
4. The name node responds with the location of each split (including the replicated copies).
5. The job tracker identifies the proper task trackers for processing the individual splits.
6. Each task tracker ensures the local splits are processed, generating intermediate results that are stored on the local Linux disk of the slave computer.
7. This process is called mapping (Map).
8. Since the intermediate results are distributed across different computers, a shuffling phase ensures all the results are clubbed together and stored in one location for further processing.
9. The job tracker takes the responsibility of identifying the correct system for this clubbing.
10. The clubbed result undergoes sorting to produce another consolidated, sorted output.
11. The job tracker then runs another process, with the help of the task tracker, on the slave node containing the consolidated and sorted data. This process is called the reducer.
12. The reducer produces the final results and writes them back to HDFS with the help of the name node. (A complete Map and Reduce pair is sketched below.)

In the original Hadoop setup, the name node is a single point of failure for both storage and processing. The secondary name node periodically takes snapshots of the metadata in the primary name node. If the primary name node fails, the cluster is down, but the data in the secondary name node can be used to rebuild the primary name node and run the cluster again. Recent versions of Hadoop added another name node that works as a failover site for the primary name node.

All task trackers inform the job tracker about their presence every 3 seconds, and every 10th communication signal is a block report. Each task tracker sends information about the status of the tasks it has handled in the last 30 seconds.
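To make the map, shuffle, and reduce phases above concrete, here is the classic word-count job written against Hadoop's org.apache.hadoop.mapreduce API. This is a sketch: the class names are illustrative, and the input and output HDFS directories are passed as command-line arguments.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: runs on the task tracker that holds a local split
        // and emits (word, 1) pairs as the intermediate result.
        public static class TokenMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: runs after shuffle/sort has clubbed all values
        // for a given word onto one node; writes final counts to HDFS.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir on HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar (the jar name here is hypothetical), it would be submitted with: hadoop jar wordcount.jar WordCount input output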
HDFS commands:
Cloudera provides a pseudo cluster with 1 node containing all the daemons; you can download it from the Cloudera site.
For data management on HDFS, Hadoop provides a client tool called hdfs. It gives various groups of commands for different purposes (see the sample commands at the end of this section). You cannot cd into HDFS, so there is no pwd; while working in the Linux shell, you can query the contents of HDFS. On HDFS, Cloudera creates a home directory for every Linux user, and that is the default target for commands.
The replication factor defines how many copies of a file are to be maintained on HDFS by Hadoop. This helps in many ways, such as backups and fail-safe operation.
MapReduce is a framework provided by Hadoop for processing data on HDFS. There are various tools built on top of Hadoop, and they are collectively called the Hadoop ecosystem.
Difference between HBase and HDFS: HBase is a database management system built on top of HDFS. HDFS can store sequential data in flat files, whereas HBase supports random reads and writes on files that are stored on HDFS.
IBM came up with IBM BigData, where DB2 is a major role player.
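A few commonly used commands from the hdfs client tool, as a quick reference (file names are illustrative, not exhaustive):

    # List the contents of the current user's HDFS home directory
    # (the default target mentioned above)
    hdfs dfs -ls

    # Copy a local file into HDFS
    hdfs dfs -put /tmp/input.txt input.txt

    # Print a file stored on HDFS
    hdfs dfs -cat input.txt

    # Change the replication factor of a file to 2 copies
    hdfs dfs -setrep 2 input.txt

    # Remove a file from HDFS
    hdfs dfs -rm input.txt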