
MapReduce and Hadoop Cluster Configuration

Robin Raj
B K Birla Institute of Engineering and Technology, Pilani
Email: robinraj0004@gmail.com

Abstract- Everyone wants relevant information so that their business can grow faster and bigger, and every organization wants to know what its customers like and dislike. Obtaining this information requires storing large amounts of data and analyzing that data. There are many options for this; one solution is the Hadoop architecture. Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. A Hadoop cluster basically depends on two cluster architectures: HDFS and MapReduce. This paper describes the parallel computing performance of MapReduce and HDFS and also describes the three-V architecture, where the three Vs stand for Volume, Variety, and Velocity.

Keywords- Hadoop, HDFS, MapReduce

I. INTRODUCTION

Managing large amounts of data is a big task in the present era. Most people believe that Big Data is a technology, but in reality Big Data is a problem: the problem of storing large amounts of data on a daily basis and of retrieving or finding the relevant data or information within it. There are many tools and technologies on the market to solve this problem; one of them is the Hadoop system. Hadoop provides a solution to both problems, data storage and data retrieval. The advantages of Hadoop are high efficiency, high reliability, high scalability, and fault tolerance, and it has been adopted by many enterprises, although it still has many disadvantages [1]. According to Moore's law, the performance of CPUs doubles about every 18 months [second].

The Hadoop architecture basically consists of two clusters: HDFS (Hadoop Distributed File System) and MapReduce. HDFS provides a distributed environment for storing large amounts of data, which addresses the Volume problem of Big Data. MapReduce consists of two parts: the mapper, which collects relevant data from the HDFS cluster, and the reducer, which produces the final or desired result from the mapped data. So the analysis of data is the main purpose of the MapReduce cluster. In earlier versions of Hadoop, the biggest issue was the single point of failure at the master node: because only one master node is available to store all the metadata (information about the connected data nodes in the cluster), the complete system gets stuck if the master node fails. This problem is solved in newer versions of Hadoop.

But one issue still remains: when Hadoop came onto the market, most companies that work with large amounts of data adopted this technology. Hadoop is open source, but setting up Hadoop is costly. So the aim of this paper is to provide a solution for companies or organizations that have not yet upgraded to a newer Hadoop version because of the cost, the high risk of data loss, and the complexity. The diagram below shows a basic Hadoop architecture.

Figure 1. Basic Hadoop architecture: a client interacts with an HDFS cluster (Name Node with Data Nodes) and a MapReduce cluster (Job Tracker with Task Trackers).

II. HDFS Cluster Configuration & Myth

Figure 1 is a basic representation of a Hadoop cluster. It consists of two clusters: an HDFS cluster and a MapReduce cluster. The master node of the HDFS cluster is also known as the NameNode, and the storage units attached to the NameNode are called Data Nodes. Within the Hadoop cluster, the HDFS cluster's responsibility is to provide storage space for large amounts of data. In the MapReduce cluster, the master node is known as the Job Tracker and the slave nodes are known as Task Trackers. Analytics is the main purpose of the MapReduce cluster; it basically provides memory (primary memory such as RAM) and CPU (processing units). Data from the client is stored on the Data Nodes in chunks, which enables fast storage. Storing a very large amount of data, on the order of terabytes, takes a lot of time, and in today's scenario a lot of data is generated every day from different sources. Hadoop solves this problem using the HDFS cluster: HDFS first collects the data from the client and splits it into multiple chunks or blocks, and the blocks are then stored on the data nodes. By default, the minimum size of a single block of data is 64 MB.
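As a rough sketch of the block-splitting step described above, the following Python snippet (a simplified illustration, not the actual HDFS implementation; the function name and the 200 MB example file size are assumptions for the example) computes how a file of a given size is divided into fixed-size blocks before the blocks are placed on data nodes.

```python
# Simplified illustration of HDFS-style block splitting.
# The real HDFS client streams data; this sketch only computes
# the block layout for a given file size and block size.

DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default mentioned above

def split_into_blocks(file_size_bytes, block_size=DEFAULT_BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        # The final block may be smaller than the configured block size.
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# Example: a 200 MB file yields three full 64 MB blocks plus one 8 MB block.
layout = split_into_blocks(200 * 1024 * 1024)
print(len(layout))                      # 4
print(layout[-1][1] // (1024 * 1024))   # 8
```

Each block can then be reported to the NameNode and stored on a data node independently, which is what makes the distributed storage described above possible.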
One myth concerns block placement: the NameNode does not decide which data node stores a given block of data. The client decides where, and on which data node, the data is to be stored; the responsibility for placing data therefore lies with the client, not the NameNode. The client gets information about the data nodes from the NameNode and then starts storing data.

Another myth about the HDFS cluster is that data is stored on multiple data nodes in parallel, while in reality the storing process is done serially. This depends on the configuration properties of the HDFS cluster: if the configuration properties are set well, the storing process is done in parallel. Below is the configuration process for an HDFS cluster.

Setup of HDFS NameNode and Data Nodes (multi-node cluster):

1. Set up the NameNode and data node hdfs-site

ON NAMENODE SIDE (Master Node)

[root@server hadoop]# vim /etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.name.dir</name>
<value>/namenode</value>
</property>
</configuration>

ON ALL DATANODE SIDES (Data Node)

[root@client hadoop]# vim /etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.data.dir</name>
<value>/dataname</value>
</property>
</configuration>

2. Set up the NameNode and data node core-site

ON NAMENODE SIDE

[root@server hadoop]# vim /etc/hadoop/core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://ip of namenode:10001</value>
</property>
</configuration>

[root@server hadoop]# hadoop namenode -format
// This command will format the namenode.

ON ALL DATANODE SIDES

[root@server hadoop]# vim /etc/hadoop/core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://ip of namenode:10001</value>
</property>
</configuration>

III. MapReduce Cluster Configuration

Hadoop uses MapReduce for processing. Hadoop MapReduce is used for analyzing large datasets across multiple nodes. MapReduce was originally a programming framework proposed by Google for large-scale data processing; it was put forward to solve problems in information retrieval based on the idea of "divide and conquer" [first]. In a Hadoop MapReduce environment, the number of maps is relatively fixed and usually driven by the number of HDFS blocks in the input data set [10]. In this case, the parameter settings of a job have not been optimized, and the size of a single input split is exactly equal to the HDFS block size. Suppose a Hadoop cluster is executing a job with an input data size of 1 TB and an HDFS block size of 64 MB. The block size and the number of splits can be increased by the client system. The client system is not a user; it is the application server of a company, through which users interact to store data in, and retrieve data from, the Hadoop cluster.

In a MapReduce cluster there is one master node, known as the Job Tracker, and multiple slave nodes, known as Task Trackers. The client is common to both the HDFS and MapReduce clusters, because the client system interacts with the HDFS cluster to store data and with the MapReduce cluster to get relevant information from the stored data. The role of the Job Tracker is only to connect the Task Trackers with the HDFS cluster; the Task Trackers connect directly with the data nodes to process the data. Hadoop MapReduce primarily depends on parameter selection, and tuning the Hadoop MapReduce parameters is an efficient way to improve job completion performance with respect to time and disk utilization.
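The split arithmetic above can be checked numerically. The following Python snippet (an illustrative calculation, not Hadoop code; the function name is an assumption for the example) computes the default number of map tasks, which follows from the number of input splits when one map runs per split:

```python
# Illustrative calculation: with default settings, one map task is
# launched per input split, and the split size equals the HDFS block
# size, so a 1 TB input with 64 MB blocks yields 16384 map tasks.

def num_map_tasks(input_size_bytes, split_size_bytes):
    """Number of map tasks when one map runs per input split."""
    # Ceiling division: a final partial split still needs a map task.
    return -(-input_size_bytes // split_size_bytes)

TB = 1024 ** 4
MB = 1024 ** 2

print(num_map_tasks(1 * TB, 64 * MB))    # 16384 maps with the default split
print(num_map_tasks(1 * TB, 256 * MB))   # 4096 maps with a larger split size
```

This also shows why letting the client raise the split size, as described above, reduces the number of map tasks and thus the per-task scheduling overhead for very large inputs.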
So the basic point that needs to be focused on is removing the single-point-of-failure risk from the older version. This can be achieved by creating another master (name) node that interacts with the main master node every 3 seconds, because each data node attached to the NameNode reports its connection information to the master node within 3 seconds. If the main master node fails, the standby master node then continues working with the data nodes.

IV. Conclusion

Hadoop is a widely known framework for data analysis on large datasets. Hadoop performs well due to its capability of analyzing datasets in a parallel and distributed environment. Hadoop is an open-source framework; the Hadoop Distributed File System (HDFS) and MapReduce are its modules. HDFS is responsible for data storage, while MapReduce is responsible for data processing. Huge datasets, such as web logs, can be processed for analysis by Hadoop. We have also examined some myths in the market about the Hadoop system.

References

[1] Dongyu Feng, Ligu Zhu, Lei Zhang, "Review of Hadoop Performance Optimization", 2015 10th International Conference on Computer Science & Education (ICCSE).
[2] Apache. Hadoop. [2015-11-19]. http://hadoop.apache.org/
[3] Yang H C, Dasdan A, Hsiao R L, et al. Map-reduce-merge: simplified relational data processing on large clusters[J]. Sigmod, 2007:1029-1040.
[4] Vrba Z, et al. Kahn Process Networks are a Flexible Alternative to MapReduce[C]// High Performance Computing and Communications, 2009. HPCC '09. 11th IEEE International Conference on. IEEE, 2009:154-162.
[5] Bu Y, Howe B, Balazinska M, et al. HaLoop: Efficient Iterative Data Processing on Large Clusters[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2):285-296.
[6] Bu Y, Howe B, Balazinska M, et al. The HaLoop approach to large-scale iterative data analysis[J]. VLDB Journal, 2012, 21(2):169-190.
[7] Zhang Y, Gao Q, Gao L, et al. iMapReduce: A Distributed Computing Framework for Iterative Computation[J]. Journal of Grid Computing, 2012, 10(1):47-68.
[8] Zhang Y, Gao Q, Gao L, et al. PrIter: A Distributed Framework for Prioritizing Iterative Computations[J]. IEEE Transactions on Parallel & Distributed Systems, 2011, 24(9):1884-1893.
[9] Zhang Y, Chen S. i2MapReduce: incremental iterative MapReduce[C]// Proceedings of the 2nd International Workshop on Cloud Intelligence. ACM, 2013.
[10] Elnikety E, Elsayed T, Ramadan H E. iHadoop: Asynchronous Iterations for MapReduce[C]// Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on. IEEE, 2011:81-90.
[11] Guo L, Sun H, Luo Z. A Data Distribution Aware Task Scheduling Strategy for MapReduce System[M]// Cloud Computing. Springer Berlin Heidelberg, 2009:694-699.
[12] Fu J, Du Z. Load Balancing Strategy on Periodical MapReduce Job[J]. Computer Science, 2013, 40(03):38-40.
[13] Ghemawat S, Gobioff H, Leung S T. The Google file system[C]// ACM SIGOPS Operating Systems Review. ACM, 2003:29-43.
[14] Apache. HDFS Architecture. [2015-11-19]. http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[15] Beaver D, Kumar S, Li H C, et al. Finding a Needle in Haystack: Facebook's Photo Storage[J]. Proc of OSDI, 2010:47-60.
[16] Alibaba. TFS (Taobao FileSystem). [2015-11-19]. http://code.taobao.org/p/tfs/wiki/intro
[17] Lai L, Shen L, Zheng Y, et al. Analysis for REPERA: A Hybrid Data Protection Mechanism in Distributed Environment[J]. International Journal of Cloud Applications & Computing, 2012, 2(1):71-82.
[18] Li X, Dai X, Li W, et al. Improved HDFS scheme based on erasure code and dynamical-replication system[J]. Journal of Computer Applications, 2012, 32(8):2150-2153+2158.
