Академический Документы
Профессиональный Документы
Культура Документы
CS 15-319
Programming Models- Part III
Lecture 6, Feb 1, 2012
Todays session
Programming Models Part III
Announcement:
Project update is due today
Pregel,
Dryad, and
MapReduce GraphLab
Message
Passing
Examples of Interface (MPI)
parallel
Traditional processing
models of
Parallel parallel
computer programming
Why architectures
parallelism?
Last 2 Sessions
Basics
A close look at MapReduce data flow
Additional functionality
Scheduling and fault-tolerance in MapReduce
Comparison with existing techniques and models
Basics
A close look at MapReduce data flow
Additional functionality
Scheduling and fault-tolerance in MapReduce
Comparison with existing techniques and models
It is likely that the input data set will not fit on a single computers
hard drive
Map Phase
mappers M0 M1 M2 M3
The outputs from the mappers are denoted as
intermediate outputs (IOs) and are brought
IO0 IO1 IO2 IO3
into a second set of tasks called Reducers
Reduce Phase
Shuffling Data
The process of bringing together IOs into a set
of Reducers is known as shuffling process Reducers R0 R1
The map and reduce functions receive and emit (K, V) pairs
Input Splits Intermediate Outputs Final Outputs
Nodes are spread over different racks embraced in one or many data centers
A salient point is that the bandwidth between two nodes is dependent on their
relative locations in the network topology
For example, nodes that are on the same rack will have higher bandwidth
between them as opposed to nodes that are off-rack
Basics
A close look at MapReduce data flow
Additional functionality
Scheduling and fault-tolerance in MapReduce
Comparison with existing techniques and models
Files loaded from local HDFS store Files loaded from local HDFS store
InputFormat InputFormat
file file
Split Split Split Split Split Split
file file
RecordReaders RR RR RR RR RR RR RecordReaders
OutputFormat OutputFormat
Writeback to local Writeback to local
HDFS store HDFS store
Carnegie Mellon University in Qatar 16
Input Files
Input files are where the data for a MapReduce task is
initially stored
file
The RecordReader class actually loads data from its source and
converts it into (K, V) pairs suitable for reading by Mappers
Files loaded from local HDFS store
The RecordReader is invoked repeatedly
on the input until the entire split is consumed InputFormat
file
Each invocation of the RecordReader leads Split Split Split
file
to another call of the map function defined
by the programmer RR RR RR
file
The Reducer performs the user-defined work of Split Split Split
file
the second phase of the MapReduce program
RR RR RR
Sort
Reduce
Carnegie Mellon University in Qatar 22
Partitioner
Each mapper may emit (K, V) pairs to any partition
Files loaded from local HDFS store
Reduce
Carnegie Mellon University in Qatar 23
Sort
Each Reducer is responsible for reducing Files loaded from local HDFS store
the values associated with (several)
intermediate keys InputFormat
file
The set of intermediate keys on a single Split Split Split
file
node is automatically sorted by
MapReduce before they are presented RR RR RR
to the Reducer
Map Map Map
Partitioner
Sort
Reduce
file
Split Split Split
The instances of OutputFormat provided by
file
Hadoop write to files on the local disk or in HDFS
RR RR RR
OutputFormat
Carnegie Mellon University in Qatar 25
MapReduce
In this part, the following concepts of MapReduce will
be described:
Basics
A close look at MapReduce data flow
Additional functionality
Scheduling and fault-tolerance in MapReduce
Comparison with existing techniques and models
R = Rack
MT
N = Node
N MT = Map Task
MT RT = Reduce Task
R Y = Year
N MT T = Temperature
Carnegie Mellon University in Qatar 27
MapReduce
In this part, the following concepts of MapReduce will
be described:
Basics
A close look at MapReduce data flow
Additional functionality
Scheduling and fault-tolerance in MapReduce
Comparison with existing techniques and models
I.e., JT does not push map and reduce tasks to TTs but rather TTs pull
them by making pertaining requests
If the job is still in the map phase, JT asks another TT to re-execute all
Mappers that previously ran at the failed TT
If a tasks progress score is less than (average 0.2), and the task
has run for at least 1 minute, it is marked as a straggler
T1 Not a straggler
PS= 2/3
T2 A straggler
PS= 1/12
Time
Basics
A close look at MapReduce data flow
Additional functionality
Scheduling and fault-tolerance in MapReduce
Comparison with existing techniques and models
37
Carnegie Mellon University in Qatar
What Makes MapReduce Unique?
MapReduce is characterized by:
Pregel,
Dryad, and
MapReduce GraphLab
Message
Passing
Examples of Interface (MPI)
parallel
Traditional processing
models of
parallel
Parallel
computer programming
Programming Models- Part IV
Why architectures
parallelism?