
International Journal of Engineering Applied Sciences and Technology, 2018

Vol. 3, No. 2, Pages 1-5


Published Online June – July 2018 in IJEAST (http://www.ijeast.com)

Parameterized Pipelined Map Reduce Based Approach for Performance Improvement of Parallel Programming Model

Ms. Kavita U. Rahane, Ms. Poonam B. Dhole, Mr. Vikram K. Abhang
Assistant Professor, Computer Engineering
Amrutvahini COE, Sangamner, Maharashtra, India
Email Id: kavita.rahane@avcoe.org, poonam.dhole26@gmail.com, abhangv@gmail.com

Abstract – Nowadays, processing huge amounts of data is an open challenge for web resources. The Map Reduce programming model is a solution to this problem: the framework computes distributed batches of jobs, and the materialized output of the Mapper and Reducer simplifies the fault tolerance problem. We propose an improved version of the Map Reduce programming model called Parameterized Pipelined Map Reduce, used as a solution to the problems of information retrieval. Parameterized Pipelined Map Reduce permits data transfer by pipeline, with a timing parameter shared among the processes, extending the batched Map Reduce programming model. The important point is obtaining that parameter from the mapper, which is done through different policies, named the letter based policy, word length policy, sentence based policy, job based policy and analysis based policy. This technique improves the system utilization rate as well as reducing the completion time of the job. In our proposed work we send the parameter directly between mapper and reducer. Our results show a 25% improvement in system performance.

Keywords: Hadoop, Map-Reduce, Parallel Processing, Parameterized Pipelined Map-Reduce, Pipelined Map-Reduce.

NOMENCLATURE: HDFS – Hadoop Distributed File System.

I. INTRODUCTION

The purpose behind developing the project is to retrieve data from distributed databases in minimum time. In the previous system the reducer has to wait for the output from the mapper, and it has to check again and again whether the mapper has finished its execution or not. To avoid this, a new system called parameterized pipelined map reduce is proposed. The parameterized pipelined map reduce approach removes the drawbacks of the previous system by using a timing parameter: the mapper simply generates one parameter which tells the reducer when to retrieve data from the mapper [1][2][3].

In the existing pipelined map reduce technique more time is required for execution, and the existing map reduce also has a higher computation cost, so by using parameterized pipelined map reduce the overall time taken for computation can be reduced and the task can be completed in minimum time. The proposed system retrieves data from the Hadoop distributed file system using the parallel programming model called map reduce, in which a user defined map method turns each input record into intermediate key-value pairs and a user defined Reduce method aggregates the values collected for each distinct key; the framework parallelizes the execution of these functions and provides fault tolerance automatically (the model is reviewed in detail in Section II) [3].

The proposed system also overcomes drawbacks of the existing system such as wastage of the reducer's CPU cycles: the mapper generates a timing parameter so that the reducer comes to know when to fetch data for execution. The technique improves the performance of the system in terms of time, taking minimum time to process a huge amount of data, and as the number of nodes in the Hadoop cluster is increased the execution time decreases [3][5].

II. LITERATURE SURVEY

A. Programming Review
To use the Map Reduce framework, the coder expresses the desired computation as a series of jobs. The input to a job is an input specification that will generate key-value pairs. Every job consists of two steps: first, a user defined map method is applied to each input record to produce a list of intermediate key-value pairs; second, a user defined Reduce method is invoked once for each distinct key in the map output and is passed the list of intermediate result values related to that key. The Map Reduce programming model parallelizes the execution of these functions and ensures fault tolerance automatically. Optionally, the user can also supply a combiner method [5]. Combiners are similar to reducer functions, except that they are not passed all the values for a given key: instead, a combiner emits an output that aggregates the input values it was passed. Combiners usually perform map-side "pre-aggregation," which minimizes the network traffic between the map step and the reduce step.

public interface Mapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value, OutputCollector<K2, V2> output);
    void close();
}

public interface Reducer<K2, V2, K3, V3> {
    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output);
    void close();
}
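To make the model concrete, the following sketch shows a word count job written in the style of the interfaces above. It is an illustration only, not code from the proposed system, and it assumes the classic org.apache.hadoop.mapred API of the Hadoop 0.20 line used later in this paper (which additionally passes a Reporter to map and reduce and provides the MapReduceBase helper class).

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Map step: emit an intermediate (word, 1) pair for every token in the line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);
            }
        }
    }
}

// Reduce step: invoked once per distinct key with the list of its values.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        output.collect(word, new IntWritable(sum));
    }
}

Because this reduce function only sums partial counts, the same class could also be registered as the combiner, giving exactly the map-side pre-aggregation described above.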
B. Block Representation Of Hadoop
The Map Reduce model is designed to process large volumes of data in parallel by splitting a Job into various independent Tasks. A Job is a complete Map Reduce program, i.e. the execution of a Mapper and Reducer over a data set, while a Task is the execution of a single Mapper or Reducer on a block of the data. A Map Reduce Job normally divides the input data set into independent sets of data which are then processed by the map tasks in a fully parallel way. The Hadoop Map Reduce framework includes one Master node that runs a Job tracker instance, which is responsible for taking Job requests from a client node and splitting those jobs into tasks. After that, these tasks are executed by the worker nodes, each of which runs a Task Tracker instance. Fig. 1 depicts the different components of the Map Reduce framework.

Fig. 1. Hadoop Map Reduce (Master Node with the Name Node; Slave nodes each running a Task Tracker with its Task Instances)

C. HDFS
HDFS has the desired properties for massive parallel data processing: (1) it works on commodity clusters even when there are hardware failures, (2) it supports streaming access to information, (3) it deals with large data sets, (4) it uses a simple coherent model, and (5) it is portable across various hardware and software platforms. HDFS [6] is designed as a master/slave architecture (Fig. 2). An HDFS cluster consists of a Name Node, a master node that manages the file system namespace and regulates access to files by clients. Additionally, there are a number of Data Nodes, normally one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and permits user information to be held in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of Data Nodes. The Name Node executes file system namespace operations such as closing and renaming files and directories, and it determines the mapping of blocks to Data Nodes. Client read and write requests are served by the Data Nodes, which are also responsible for block creation, deletion and replication as instructed by the Name Node.
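As a small illustration of this division of labour (a sketch assuming the standard org.apache.hadoop.fs.FileSystem client API of Hadoop 0.20 and a hypothetical Name Node address), a client creates and reads files through the Name Node's namespace while the file bytes themselves stream to and from the Data Nodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The 0.20-era property naming the Name Node; the address is hypothetical.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write: the Name Node allocates blocks, the bytes go to Data Nodes.
        FSDataOutputStream out = fs.create(new Path("/demo/input.txt"));
        out.writeBytes("to be or not to be\n");
        out.close();

        // Read: block locations come from the Name Node, data from Data Nodes.
        FSDataInputStream in = fs.open(new Path("/demo/input.txt"));
        byte[] buf = new byte[64];
        int n = in.read(buf);
        System.out.println(new String(buf, 0, n, "UTF-8"));
        in.close();
    }
}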
D. Map Task
The master node takes the input set, splits it into smaller units and distributes them to the worker nodes. A worker node may repeat this, leading to a multilevel tree structure. Each worker node performs the operation on its smaller subproblem and passes the answer back to its master node.

E. Reduce Task
The master node collects the outputs of all the subproblems and merges them together.
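The map and reduce halves are tied together by a job configuration submitted to the Job tracker. The driver below is a sketch only, reusing the hypothetical WordCountMapper and WordCountReducer classes from the Section II-A example and the JobConf/JobClient API of the 0.20-era org.apache.hadoop.mapred package; the input and output paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(WordCountDriver.class);
        job.setJobName("wordcount");

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // map-side pre-aggregation
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        // Submits the job to the Job tracker and waits for completion.
        JobClient.runJob(job);
    }
}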

F. Pipelined Map Reduce


Normally, Reduce tasks send HTTP requests to pull map output from each Task tracker, which means map task execution is completely decoupled from Reduce task execution. To support pipelining, the map task is modified to push data to the reducers immediately instead. For a better understanding of how this works, we begin by illustrating a simple pipelined design [7]. Hadoop is changed to send data directly from the map task to the Reduce task. Once a client submits a new job, the Job tracker assigns its map and Reduce tasks to the available Task tracker slots; we assume there are enough free slots to assign all the tasks of each job. In this changed version of Hadoop, each Reduce task contacts every map task at the initiation of the job and opens a TCP socket which will be used to pipeline the output of the map function. As each map output record is created, the mapper examines to which partition (Reduce task) the record should be sent and sends it through the suitable socket. The Reduce task receives the pipelined data from each map task and stores it in an in-memory buffer, spilling sorted runs of the buffer to disk if required. Whenever the Reduce task learns that every map task has completed its work, it performs the final merge of all the sorted output, applies the Reduce method, and writes the final output to the Hadoop Distributed File System.
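The essential change is the mapper pushing each record to the socket of its target partition instead of spilling it for a later HTTP pull. The fragment below is a minimal sketch of that dispatch step only; the PipelineSink name, the hash partitioning and the use of plain TCP sockets with Java serialization are our illustrative assumptions, not the actual modified-Hadoop code of [7].

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;

// One open connection per reduce partition, created when the job starts.
public class PipelineSink {
    private final ObjectOutputStream[] partitions;

    public PipelineSink(String[] reducerHosts, int port) throws IOException {
        partitions = new ObjectOutputStream[reducerHosts.length];
        for (int p = 0; p < reducerHosts.length; p++) {
            partitions[p] = new ObjectOutputStream(
                    new Socket(reducerHosts[p], port).getOutputStream());
        }
    }

    // Hash-partition the key, then push the record to the matching reducer
    // immediately, instead of buffering the whole map output on disk.
    public synchronized void emit(Serializable key, Serializable value)
            throws IOException {
        int p = (key.hashCode() & Integer.MAX_VALUE) % partitions.length;
        partitions[p].writeObject(key);
        partitions[p].writeObject(value);
        partitions[p].flush();
    }
}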
Pros of the system:
1) Allows sending and receiving data between tasks and between jobs, reducing disk I/O.
2) Reduces completion time.
3) Enables the user to take snapshots of approximate output.

Cons of the system:
1) The reducer has to check periodically whether the map task has finished its execution. If the mapper has not completed its execution, reducer cycles are wasted, which eventually degrades the performance of the system.

Fig. 2. Hadoop data flow

Fig. 3. Pipelined Map Reduce data flow

III. PROPOSED SYSTEM

In this section we extend pipelined map reduce to improve the performance of the system. We name this system parameterized pipelined map reduce.

A. Architecture of Parameterized Pipelined Map Reduce
Fig. 4 shows the architecture of parameterized pipelined map reduce. In pipelined map reduce [7], Hadoop is modified so that the mapper can send data directly to the reducer as soon as it is generated: the Job tracker assigns the map and reduce tasks associated with a job to the task trackers when a client submits a new job, and as each map output record is generated the mapper determines to which partition the record should be sent and forwards it immediately via the appropriate socket. With this pipelining, however, the reducer has to check periodically whether the map task has finished its execution, and if the mapper has not completed, reducer cycles are wasted. To overcome this we introduce the parameterized pipelined map reduce technique, which reduces the empty processing cycles that appear in the current technology.

Fig. 4. Parameterized Pipelined Map Reduce Architecture (HDFS → Map → Local Store → Pipeline → Reduce → HDFS, with the timing parameter carried alongside the key-value data)

In the proposed system the mapper writes its output to a local store together with a timing parameter, which contains a key-value pair indicating when to retrieve the next data batch at the reduce level. This local store is forwarded to the reducer through the pipeline; the reducer extracts the data from the pipeline and stores the result on HDFS. At the end of every processing iteration the reducer receives the previous parameter, which is usable for the next access time. This improves the performance of the system compared to pipelined map reduce.
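A minimal sketch of what such a timing parameter could look like on the mapper side is given below. The ParameterRecord class and the estimate() heuristic based on pending output size are our illustrative assumptions; the paper itself leaves the esti() policy abstract (letter based, word length, sentence based, job based or analysis based).

import java.io.Serializable;

// The timing parameter: a key-value pair telling the reducer when to
// come back for the next data batch from this mapper's local store.
public class ParameterRecord implements Serializable {
    public final String mapperId;      // key: which mapper produced the batch
    public final long nextFetchMillis; // value: when to retrieve the next batch

    public ParameterRecord(String mapperId, long nextFetchMillis) {
        this.mapperId = mapperId;
        this.nextFetchMillis = nextFetchMillis;
    }

    // One possible esti() policy (an assumption): scale the delay by the
    // number of records still waiting in the mapper's local store.
    public static ParameterRecord estimate(String mapperId,
                                           int pendingRecords,
                                           long perRecordCostMillis) {
        long delay = (long) pendingRecords * perRecordCostMillis;
        return new ParameterRecord(mapperId,
                System.currentTimeMillis() + delay);
    }
}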
B. Mathematical Model
Let PPMR be the prospective solution of the Parameterized Pipelined Map Reduce project:

PPMR = {s, e, i, o, F, DD, NDD, success, failure}, where
s = initial state, the input dataset given to the system, and e = the end state, the processed data;
i = input of the system; o = output of the system, the processed data and graph generation;
F = the functions used in the system for Map and Reduce;
DD = deterministic data, which helps identify the load-store or assignment functions;
NDD = non-deterministic data of the system S to be solved;
Success = the desired outcome (such as the graph) is generated;
Failure = the desired outcome is not generated, or a forced exit occurs due to a system error.

Set theory: Set S = {J, T, M, R, P, D}, where
Set J = {j1, …, jn} represents the jobs to be executed on Hadoop;
Set T = {t1, …, tn} represents the total number of tasks;
Set M = {tm1, …, tmn} represents the Mapper tasks;
Set R = {tr1, …, trn} represents the Reducer tasks;
Set P = {0, 0; 0, 0} represents the parameter set;
Decision = {0, 1};
MP = mapper policy;
R ∈ M ∈ T;
M => P -> Policy; Policy -> P => D;
D = input to the Reducer.

C. Algorithm
1. Start all nodes
2. fs = select input file
3. Initialize vector v1
4. While (fs.newline())
5. {
6.    v1[i] = fs.readline();
7. }
8. len = Σ v1 over time t
9. Start input parser
10. Initialize para1
11. // Mapper
12. While (len)
13. {
14.    parsed ∈ v1
15.    Generate key-value pair
16.    Forward to temp HDFS
17.    Estimate para1 = esti(v1.getdata)
18.    Forward para1 to reducer via temp HDFS
19. }
20. // Reducer
21. Capture para1 from temp HDFS
22. For (delay para1)
23. {
24.    Get key-value pair from HDFS
25.    Generate output
26.    Get new para1
27.    Continue till end of v1
}
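A sketch of how the reducer side of this algorithm might be realized follows. It reuses the hypothetical ParameterRecord from Section III-A; fetchBatch and processBatch are placeholders standing in for the temp-HDFS reads and reduce work of steps 24-25, and sleeping until para1's timestamp is our concrete reading of the "delay para1" loop in step 22.

import java.util.Collections;
import java.util.List;

public class ParameterizedReducerLoop {
    private int iterations = 0;

    // Steps 20-27: wait out each timing parameter, then fetch and reduce
    // the batch the mapper has made available, until the input is exhausted.
    public void run(String mapperId) throws InterruptedException {
        ParameterRecord para1 = captureInitialParameter(mapperId); // step 21
        while (para1 != null) {
            long wait = para1.nextFetchMillis - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait); // step 22: no busy polling of the mapper
            }
            List<String> batch = fetchBatch(para1);  // step 24 (placeholder)
            processBatch(batch);                     // step 25 (placeholder)
            para1 = nextParameter(para1);            // steps 26-27
        }
    }

    private ParameterRecord captureInitialParameter(String mapperId) {
        return new ParameterRecord(mapperId, System.currentTimeMillis());
    }

    private List<String> fetchBatch(ParameterRecord p) {
        return Collections.emptyList(); // real system: read batch from temp HDFS
    }

    private void processBatch(List<String> batch) {
        // real system: apply the user-defined reduce function here
    }

    private ParameterRecord nextParameter(ParameterRecord p) {
        // Dummy: stop after three batches; the real system reads the next
        // parameter from the pipeline and returns null at the end of v1.
        return (++iterations < 3) ? ParameterRecord.estimate(p.mapperId, 10, 5) : null;
    }
}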
D. Platform
We use a cluster of 4 nodes; every system involved in the experiments has a Pentium IV 2.4 GHz processor and 2 GB of memory and runs the Ubuntu 12.04 operating system. Hadoop 0.20.2 and Java 1.6.0_13 are used for the performance check, and Eclipse 3.3 for the Java code.

IV. RESULT

A. Dataset
The data we use for the experiments are enwiki-20130403-pages-articles-multistream-index.txt [8] and enwiki-latest-abstract.xml [9-10].

B. Result sets
Our parameterized pipelined map reduce is able to achieve significantly better cluster utilization and hence reduce completion time by approximately 20%. From analysis of the timing diagrams: at time 30 the reducer's work is done and it has to keep waiting for the output of the mapper, but the mapper takes 50 time instances to complete, as shown in Fig. 5. All of those CPU cycles are wasted in the previous techniques, but in our technique they are saved, as shown in Fig. 6.

Fig. 5. Timeline of map and reduce task completion time for a 5 GB word count job with pipelining

Fig. 6. Timeline of map and reduce task completion time for a 5 GB word count job with parameterized pipelining

V. CONCLUSION AND FUTURE WORK

In this paper we implemented parameterized pipelined map reduce. This design performs much better than pipelined map reduce because it reduces completion time by up to 25%.
In future work, we will study the applicability of the Map Reduce technique in cloud environments.

REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, pp. 107-112, 2008.
[2] J. M. Hellerstein, P. J. Haas and H. J. Wang, "Online Aggregation," in SIGMOD, 1997.
[3] T. Condie, N. Conway, P. Alvaro and J. M. Hellerstein, "MapReduce Online," in Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), April 2010.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004.
[5] D. Borthakur, The Hadoop Distributed File System: Architecture and Design, 2007.
[6] Li Wang, Zhiwei Ni, Yiwen Zhang, Zhangjun Wu and Liyang Tang, "Pipelined MapReduce: An Improved MapReduce Parallel Programming Model," in International Conference on Intelligent Computation Technology and Automation, pp. 871-874, 2011.
[7] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," in Proc. Advances in Neural Information Processing Systems 19, pp. 281-288, MIT Press, Cambridge, MA, 2007.
[8] http://hadoop.apache.org
[9] http://wiki.freebase.com/wiki/Data_dumps
[10] http://aws.amazon.com/publicdatasets/
