International Journal of Engineering Applied Sciences and Technology, 2018
Vol. 3, No. 2, Pages 1-5
Published Online June – July 2018 in IJEAST (http://www.ijeast.com)

Abstract – Processing huge amounts of data is now an open challenge in web resources. The Map Reduce programming model is a solution to this problem: the framework computes distributed batches of jobs, and the materialized output of the Mapper and Reducer simplifies fault tolerance. We propose an improved version of the Map Reduce programming model called Parameterized Pipelined Map Reduce, which addresses the problems of information recovery. Parameterized Pipelined Map Reduce permits pipelined data transfer among the processes governed by a timing parameter, extending the batched Map Reduce programming model. The key step is obtaining that parameter from the mapper, which is done through different policies: letter based, word length, sentence based, job based, and analysis based. The technique improves the system utilization rate and reduces job completion time. In our proposed work we send the parameter directly to the mapper and reducer. Our results show a 25% improvement in system performance.

Keywords: Hadoop, Map-Reduce, Parallel Processing, Parameterized Pipelined Map-Reduce, Pipelined Map-Reduce.

NOMENCLATURE: HDFS – Hadoop Distributed File System.

I. INTRODUCTION
The purpose behind developing the system is to retrieve data from distributed databases in minimum time. In the previous system the reducer has to wait for the output from the mapper, and has to check repeatedly whether the mapper has finished its execution. To avoid this, a new system called Parameterized Pipelined Map Reduce is proposed. The parameterized pipelined approach removes the drawbacks of the previous system using a timing parameter: the mapper generates one parameter which tells the reducer when to retrieve data from the mapper [1][2][3].

In the existing pipelined map reduce technique, more time is required for execution, and existing map reduce also has a higher computation cost. By using parameterized pipelined map reduce, the overall computation time can be reduced and the task can be completed in minimum time. The proposed system retrieves data from the Hadoop Distributed File System; the information is retrieved using the parallel programming model called Map Reduce, in which the programmer expresses the desired computation as a series of jobs [3].

By generating a timing parameter in the mapper, the proposed system avoids the wastage of the reducer's CPU cycles: the reducer comes to know when to fetch data for execution. The technique improves the performance of the system in terms of time, taking minimum time to process a huge amount of data. As the number of nodes in the Hadoop cluster increases, the execution time decreases [3][5].
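The core idea, a mapper that publishes a timing parameter so the reducer blocks until output is ready instead of polling, can be sketched as a producer/consumer pair. This is an illustrative in-memory simulation, not Hadoop API code; the queue, class, and method names here are assumptions.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: the mapper announces, via a timing parameter, when intermediate
// output is ready, so the reducer blocks on it instead of busy-waiting.
public class ParameterizedPipeline {
    // Intermediate key-value store standing in for the temporary HDFS area.
    static final Queue<Map.Entry<String, Integer>> TEMP = new ConcurrentLinkedQueue<>();
    // Channel carrying the timing parameter from mapper to reducer.
    static final BlockingQueue<Integer> PARAM = new LinkedBlockingQueue<>();

    static void mapper(List<String> records) {
        for (String r : records)
            for (String w : r.split("\\s+"))
                TEMP.add(new AbstractMap.SimpleEntry<>(w, 1));
        // Timing parameter: here simply the number of pairs now ready to fetch.
        PARAM.add(TEMP.size());
    }

    static Map<String, Integer> reducer() throws InterruptedException {
        int ready = PARAM.take();          // blocks until the mapper signals
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < ready; i++) {
            Map.Entry<String, Integer> kv = TEMP.poll();
            counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread m = new Thread(() -> mapper(Arrays.asList("a b a", "b c")));
        m.start();
        System.out.println(reducer());     // prints {a=2, b=2, c=1}
        m.join();
    }
}
```

The blocking `take()` is what removes the repeated "is the mapper finished?" check described above: the reducer consumes no CPU cycles until the parameter arrives.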
II. LITERATURE SURVEY

A. Programming Review
To use the Map Reduce framework, the programmer expresses the desired computation as a series of jobs. The input to a job is an input specification that yields key-value pairs. Every job consists of two steps: first, a user defined Map method is applied to each input record to produce a list of intermediate key-value pairs; second, a user defined Reduce method is invoked once for each distinct key in the map output and is passed the list of intermediate values associated with that key. The Map Reduce programming model parallelizes the execution of these functions and provides fault tolerance automatically. Optionally, the user can also supply a Combiner method [5]. Combiners are similar to reducer functions, except that they are not passed all the values for a given key: instead, a combiner emits an output that aggregates the input values it was passed. Combiners usually perform map-side "pre-aggregation," which minimizes the network traffic between the map step and the reduce step.

public interface Mapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value, OutputCollector<K2, V2> output);
    void close();
}

public interface Reducer<K2, V2, K3, V3> {
    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output);
    void close();
}

B. Block Representation Of Hadoop
The Map Reduce model is designed to process large volumes of data in parallel by splitting a Job into various independent Tasks. A Job is a complete Map Reduce program: the execution of a Mapper and Reducer over a data set. A Task is the execution of a Mapper or Reducer on a block of the data. The Map Reduce Job normally divides the input data set into independent sets of data, which are then executed by the map tasks in a fully parallel way. The Hadoop Map Reduce framework includes one Master node that runs a Job Tracker instance, which is responsible for taking Job requests from a client node and splitting those jobs into tasks. These tasks are then executed by the worker nodes, each of which runs a Task Tracker instance. Fig. 1 depicts the different components of the Map Reduce framework.

[Fig. 1. Hadoop Map Reduce: the User submits jobs to the Master Node (Name Node), which distributes task instances to Slave nodes running Task Trackers.]

C. HDFS
HDFS has the properties desired for massive parallel data processing: (1) it works on commodity clusters in the presence of hardware failures, (2) it supports streaming access to information, (3) it deals with large data sets, (4) it uses a simple coherent model, and (5) it is portable across various hardware and software platforms. HDFS [6] is designed as a master/slave architecture (Fig. 2). An HDFS cluster consists of a Name Node, a master node that manages the file system namespace and regulates access to files by clients. Additionally, there are a number of Data Nodes, normally one per node in the cluster, which manage storage attached to the nodes they run on. HDFS exposes a file system namespace and permits user information to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of Data Nodes. The Name Node executes file system namespace operations such as closing and renaming files and directories, and it determines the mapping of blocks to Data Nodes.

Client read and write requests are served by the Data Nodes. The Data Nodes are also responsible for block creation, deletion, and replication as instructed by the Name Node.

D. Map Task
The master node takes the input set, splits it into smaller units, and distributes them to the worker nodes. A worker node may repeat this, leading to a multilevel tree structure. The worker node performs the operation on the smaller sub-problem and passes the answer back to its master node.
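As a concrete instance of the interfaces in Section II.A, a word count job can be expressed with a Map method emitting (word, 1) pairs and a Reduce method summing them. The sketch below uses a minimal in-memory OutputCollector and driver rather than Hadoop's real classes; all names here are illustrative, not the Hadoop API.

```java
import java.util.*;

// Minimal in-memory stand-ins for the Mapper/Reducer interface sketch
// (illustrative only; not the real Hadoop classes).
public class WordCount {
    interface OutputCollector<K, V> { void collect(K key, V value); }

    // Map step: emit (word, 1) for every word in the input record.
    static void map(Long offset, String line, OutputCollector<String, Integer> out) {
        for (String word : line.toLowerCase().split("\\W+"))
            if (!word.isEmpty()) out.collect(word, 1);
    }

    // Reduce step: sum all intermediate values for one distinct key.
    static void reduce(String key, Iterator<Integer> values, OutputCollector<String, Integer> out) {
        int sum = 0;
        while (values.hasNext()) sum += values.next();
        out.collect(key, sum);
    }

    // Tiny driver playing the role of the framework: group the map output
    // by key, then invoke reduce once per distinct key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        OutputCollector<String, Integer> mapOut =
            (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v);
        long offset = 0;
        for (String line : lines) map(offset++, line, mapOut);

        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            reduce(e.getKey(), e.getValue().iterator(), result::put);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

In real Hadoop the grouping step in `run` is performed by the framework's shuffle phase between the map and reduce tasks.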
E. Reduce Task
The master node collects the answers to all the sub-problems and merges them together to form the output.
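The split-then-merge pattern described in the Map Task and Reduce Task subsections can be sketched with a thread pool standing in for the worker nodes. The names and the thread-pool choice are illustrative assumptions, not part of the paper's system.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the master/worker pattern: the master splits the input,
// workers solve sub-problems in parallel, the master merges the results.
public class SplitAndMerge {
    // Each worker counts words in its own chunk (the sub-problem).
    static Map<String, Integer> countWords(List<String> chunk) {
        Map<String, Integer> partial = new HashMap<>();
        for (String line : chunk)
            for (String w : line.split("\\s+"))
                partial.merge(w, 1, Integer::sum);
        return partial;
    }

    static Map<String, Integer> master(List<String> input, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Map<String, Integer>>> parts = new ArrayList<>();
        int chunk = (input.size() + workers - 1) / workers;
        for (int i = 0; i < input.size(); i += chunk) {       // split (map side)
            List<String> sub = input.subList(i, Math.min(i + chunk, input.size()));
            parts.add(pool.submit(() -> countWords(sub)));
        }
        Map<String, Integer> merged = new TreeMap<>();        // merge (reduce side)
        for (Future<Map<String, Integer>> f : parts)
            f.get().forEach((k, v) -> merged.merge(k, v, Integer::sum));
        pool.shutdown();
        return merged;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(master(Arrays.asList("a b", "b c", "c a"), 2));
        // prints {a=2, b=2, c=2}
    }
}
```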
F. Algorithm
1. Start all nodes
2. fs = select input file
3. Initialize vector v1
4. while (fs.newline())
5. {
6.     v1[i] = fs.readline();
7. }
8. len = ∑ v1 over time t
9. Start the input parser
10. Initialize para1
11. // Mapper
12. while (len)
13. {
14.     for each parsed element ∈ v1
15.     Generate key-value pair
16.     Forward to temporary HDFS
17.     Estimate para1 = esti(v1.getdata)
18.     Forward para1 to reducer via temporary HDFS
19. }

[Fig. 5. Timeline of map and reduce task completion time for a 5 GB word count job with pipelining.]
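A runnable reading of the mapper phase of the algorithm above is sketched below. The paper does not define esti() at this point, so the estimate used here, one pair per whitespace token, is an assumption standing in for the word-length based policy named in the abstract; the class and variable names are also illustrative.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Runnable sketch of the algorithm's mapper phase. esti() is an assumed
// policy: it returns the number of pairs the reducer should expect.
public class ParamMapper {
    static List<Map.Entry<String, Integer>> tempHdfs = new ArrayList<>(); // stand-in for temp HDFS
    static int para1;                                                     // timing parameter

    static int esti(List<String> v1) {
        int pairs = 0;                                // assumed word-length policy:
        for (String line : v1)                        // one pair per whitespace token
            pairs += line.split("\\s+").length;
        return pairs;
    }

    static void mapper(List<String> v1) {
        for (String line : v1)                        // steps 12-16: parse, emit pairs
            for (String w : line.split("\\s+"))
                tempHdfs.add(new AbstractMap.SimpleEntry<>(w, 1));
        para1 = esti(v1);                             // steps 17-18: estimate, forward para1
    }

    public static void main(String[] args) throws IOException {
        // Steps 1-7: read the input file line by line into vector v1.
        Path fs = Files.createTempFile("input", ".txt");
        Files.write(fs, Arrays.asList("one two", "three"));
        List<String> v1 = Files.readAllLines(fs);
        mapper(v1);
        System.out.println("pairs=" + tempHdfs.size() + " para1=" + para1);
    }
}
```

The reducer would then read para1 from the temporary HDFS area and fetch exactly that many pairs, as in the pipeline discussed in the Introduction.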
[Fig. 6. Timeline of map and reduce task completion time for a 5 GB word count job with parameterized pipelining.]

REFERENCES

[1] J. Dean and S. Ghemawat, 2008 "Map Reduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, pp. 107-112.
[2] J. M. Hellerstein, P. J. Haas, and H. J. Wang, 1997 "Online Aggregation," in SIGMOD.
[3] T. Condie, N. Conway, P. Alvaro, and J. M. Hellerstein, April 2010 "MapReduce Online," in Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010).
[4] J. Dean and S. Ghemawat, 2004 "Map Reduce: Simplified Data Processing on Large Clusters," in OSDI.
[5] D. Borthakur, 2007 "The Hadoop Distributed File System: Architecture and Design."
[6] L. Wang, Z. Ni, Y. Zhang, Z. J. Wu, and L. Tang, 2011 "Pipelined Map Reduce: An Improved Map Reduce Parallel Programming Model," in International Conference on Intelligent Computation Technology and Automation, pp. 871-874.
[7] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun, 2007 "Map-Reduce for Machine Learning on Multicore," in Proc. Advances in Neural Information Processing Systems 19, pp. 281-288, MIT Press, Cambridge, MA.
[8] http://hadoop.apache.org
[9] http://wiki.freebase.com/wiki/Data_dumps
[10] http://aws.amazon.com/publicdatasets/