
MapReduce

Hadoop is built on the Java platform and inherits Java's multi-threading concepts at its core. To take advantage of the parallel processing that Hadoop provides, we need to understand how to express a query in Hadoop. For this we define a MapReduce job.

MapReduce, as the name suggests, combines two terms: Map + Reduce. A MapReduce job therefore works by breaking the processing task into two phases, MAPPER and
REDUCER. In a MapReduce job we write a mapper function and a reducer function, and in both phases the input and the output take the form of key-value pairs.

In order to write a MapReduce program, we need a data file on which the operation is to be performed. Remember that we are dealing with Big Data, so file size is not a limit; for a small data set, however, MapReduce is not a good idea, and we turn to Hadoop only for large data sets. To speed up the processing of big data sets, we need to run parts of the program in parallel. This sounds very simple in theory: process different parts of the data in different processes, using all the available hardware threads on a machine. In practice, some of the following issues may occur:
1. Dividing the work into equal-sized pieces is tedious.
2. The partial results must be concatenated, shuffled, and sorted.
3. Hardware requirements and limitations apply.
To get rid of these problems we adopt Hadoop, and we then need to express our queries in terms of MapReduce jobs.
MAP PHASE & REDUCE PHASE: each phase takes key-value pairs as input and produces key-value pairs as output.

Map()
{
Set up the data for the reducer function to do its work
Filter out what is needed
Drop bad records
}

Reduce()
{
Aggregate the data received from the mappers
Present the data in a readable way
}
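
In Java terms, these two phases correspond to subclasses of Hadoop's Mapper and Reducer. Below is a minimal skeleton sketch using the org.apache.hadoop.mapreduce API; the concrete type choices (LongWritable/Text in, Text/IntWritable out) are illustrative assumptions, not requirements:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Skeleton mapper: receives (byte offset, line of text) and emits
// filtered, re-keyed (key, value) pairs for the reducer.
class SkeletonMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Set up the data for the reducer: parse the line, filter out
        // what is needed, drop bad records, then emit, e.g.:
        // context.write(new Text("some-key"), new IntWritable(42));
    }
}

// Skeleton reducer: receives (key, list of values) and emits one
// aggregated result per key.
class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Aggregate the values for this key, then emit, e.g.:
        // context.write(key, new IntWritable(/* aggregate */ 0));
    }
}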


Problem: Write a MapReduce job to process NCDC data from 1901 to 2001 in
order to select the maximum temperature for each year.

Solution: To write a MapReduce job we basically need to write three components, each of which is sketched in Java below:
1. A map function
2. A reduce function
3. A function to run the job

Input Data:
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

The map function receives the following key-value pairs, where the key is the byte offset of each line within the file:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

The map function extracts the required data, the year and the air temperature:
(1950, 0)
(1950, 22)
(1950, 11)
(1949, 111)
(1949, 78)
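
A mapper along these lines might look as follows in Java. This is a sketch, not a definitive implementation: the fixed-width column offsets (15-19 for the year, 87-92 for the signed temperature, 92 for the quality code), the 9999 missing-value sentinel, and the set of acceptable quality codes follow the standard NCDC record layout, which the elided portions of the sample lines above are assumed to match:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999; // NCDC sentinel for "no reading"

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);   // year field of the record
        int airTemperature;
        if (line.charAt(87) == '+') {           // skip an explicit '+' sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Drop bad records: missing readings and suspect quality codes.
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}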
The MapReduce framework then sorts and groups the key-value pairs by key:
(1949, [111, 78])
(1950, [0, 22, 11])

The reducer now has to iterate through each list and pick out the maximum reading:
(1949, 111)
(1950, 22)
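
The corresponding reducer simply scans each list of values for its maximum. A minimal sketch:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        // Iterate through the grouped readings for this year.
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue)); // e.g. (1949, 111)
    }
}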

The Mapper is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
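
Finally, the third component, the function that runs the job, wires the mapper and reducer together. A minimal driver sketch, assuming a Hadoop 2.x+ API, the two classes above, and input/output paths passed on the command line:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = Job.getInstance();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait; exit 0 on success, 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a driver would typically be packaged into a JAR and launched with the hadoop jar command, passing the NCDC input directory and a not-yet-existing output directory as arguments.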
