
Parallel programming, MapReduce model

UNIT II

Serial vs. Parallel Programming


A serial program consists of a sequence of instructions, where each instruction is executed one after the other.

In a parallel program, the processing is broken up into parts, each of which can be executed concurrently.

The Basics of Parallel Programming


Identifying sets of tasks that can run concurrently and/or partitions of data that can be processed concurrently.

Sometimes it's just not possible: the Fibonacci function, where each value depends on the previous two (F(k) = F(k-1) + F(k-2)), so the terms cannot be computed independently.


A common situation is having a large amount of consistent data which must be processed, e.g. a huge array which can be broken up into sub-arrays.

A common implementation technique: master/worker


The MASTER:
  initializes the array and splits it up according to the number of available WORKERs
  sends each WORKER its subarray
  receives the results from each WORKER

The WORKER:
  receives the subarray from the MASTER
  performs processing on the subarray
  returns results to the MASTER

An example of the MASTER/WORKER technique

Approximating pi

The area of the square: As = (2r)^2 = 4r^2
The area of the circle: Ac = pi * r^2

So:
  pi = Ac / r^2
  r^2 = As / 4
  pi = 4 * Ac / As

Parallelize this method


Randomly generate points in the square
Count the number of generated points that are both in the circle and in the square
r = the number of points in the circle divided by the number of points in the square
PI = 4 * r

NUMPOINTS = 100000; // some large number - the bigger, the closer the approximation

p = number of WORKERs;
numPerWorker = NUMPOINTS / p;
countCircle = 0; // one of these for each WORKER

// each WORKER does the following:
for (i = 0; i < numPerWorker; i++) {
    generate 2 random numbers that lie inside the square;
    xcoord = first random number;
    ycoord = second random number;
    if (xcoord, ycoord) lies inside the circle
        countCircle++;
}

MASTER:
  receives from the WORKERs their countCircle values
  computes PI from the sum of these values:
      PI = 4.0 * (sum of countCircle values) / NUMPOINTS;
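A runnable sketch of this scheme in Python, using a process pool as the set of WORKERs (the worker count and helper names are illustrative, not from the slide):

import random
from multiprocessing import Pool

NUMPOINTS = 100000   # the bigger, the closer the approximation
NUM_WORKERS = 4      # p = number of WORKERs (illustrative)

def worker(num_points):
    # WORKER: re-seed so forked processes don't share one random stream,
    # then count points falling inside the quarter circle of radius 1.
    random.seed()
    count_circle = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            count_circle += 1
    return count_circle

if __name__ == "__main__":
    num_per_worker = NUMPOINTS // NUM_WORKERS
    # MASTER: split the work, collect each WORKER's count, combine.
    with Pool(NUM_WORKERS) as pool:
        counts = pool.map(worker, [num_per_worker] * NUM_WORKERS)
    pi = 4.0 * sum(counts) / (num_per_worker * NUM_WORKERS)
    print(pi)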

MapReduce

How to painlessly process terabytes of data?

A Brief History

Functional programming (e.g., Lisp)


map() function
Applies a function to each value of a sequence

reduce() function
Combines all elements of a sequence using a binary operator
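Python's built-ins illustrate both combinators (Python here is a stand-in for Lisp):

from functools import reduce

# map: apply a function to each value of a sequence -> [1, 4, 9, 16]
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
# reduce: combine all elements with a binary operator -> 30
total = reduce(lambda a, b: a + b, squares)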

What is MapReduce?
This model derives from the map and reduce combinators of a functional language like Lisp. It is a restricted parallel programming model meant for large clusters.

User implements Map() and Reduce()

Parallel computing framework


Libraries take care of EVERYTHING else:
  Parallelization
  Fault tolerance
  Data distribution
  Load balancing

Useful model for many practical tasks

Map and Reduce

Map()
Process a key/value pair to generate intermediate key/value pairs

Reduce()
Merge all intermediate values associated with the same key
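A minimal single-machine sketch of this model in Python; the real framework runs the phases across a cluster, but the contract between Map() and Reduce() is the same (the function names here are my own):

from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    # Map phase: each input key/value pair generates intermediate key/value pairs.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: merge all intermediate values associated with the same key.
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in intermediate.items()}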

Example: Counting Words

Map()
  Input: <filename, file text>
  Parses the file and emits <word, count> pairs, e.g. <hello, 1>

Reduce()
  Sums all values for the same key and emits <word, TotalCount>, e.g. <hello, (3 5 2 7)> => <hello, 17>
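Using the run_mapreduce sketch from above, the word-count example becomes:

def wc_map(filename, text):
    # Parses the file contents and emits <word, 1> pairs.
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    # Sums all values for the same key.
    return sum(counts)

print(run_mapreduce(wc_map, wc_reduce, [("doc1", "hello world hello")]))
# {'hello': 2, 'world': 1}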

MapReduce: Programming Model

[Diagram: word-count dataflow through the MapReduce framework]
Input: the lines "How now brown cow" and "How does it work now" go to the map tasks (M)
Map output: <How,1> <now,1> <brown,1> <cow,1> <How,1> <does,1> <it,1> <work,1> <now,1>
Grouped by key: <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>
Reduce output (R): brown 1, cow 1, does 1, How 2, it 1, now 2, work 1

Example Use of MapReduce

Counting words in a large set of documents

map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value
    EmitIntermediate(w, "1");

reduce(String key, Iterator values)
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values
    result += ParseInt(v);
  Emit(AsString(result));

MapReduce Examples

Distributed grep
  Map function emits <word, line_number> if the word matches the search criteria
  Reduce function is the identity function

URL access frequency
  Map function processes web logs and emits <url, 1>
  Reduce function sums the values and emits <url, total>
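A single-machine sketch of the distributed grep example, keeping the slide's convention of emitting <word, line_number> pairs (the pattern and input lines are illustrative):

import re

PATTERN = re.compile(r"error\d*")  # illustrative search criterion

def grep_map(line_number, line):
    # Emits <word, line_number> if the word matches the search criterion.
    return [(word, line_number) for word in line.split() if PATTERN.fullmatch(word)]

def grep_reduce(word, line_numbers):
    # The identity function: matches pass through unchanged.
    return line_numbers

lines = ["all good", "error42 in parser", "all good", "error7 again"]
print([p for n, l in enumerate(lines, 1) for p in grep_map(n, l)])
# [('error42', 2), ('error7', 4)]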

MapReduce: Programming Model

More formally,
Map(k1, v1) --> list(k2, v2)
Reduce(k2, list(v2)) --> list(v2)
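The same signatures written as Python type aliases (a sketch; requires Python 3.9+):

from typing import Callable, TypeVar

K1, V1, K2, V2 = TypeVar("K1"), TypeVar("V1"), TypeVar("K2"), TypeVar("V2")

# Map(k1, v1) --> list(k2, v2)
MapFn = Callable[[K1, V1], list[tuple[K2, V2]]]
# Reduce(k2, list(v2)) --> list(v2)
ReduceFn = Callable[[K2, list[V2]], list[V2]]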

MapReduce Runtime System


1. Partitions input data
2. Schedules execution across a set of machines
3. Handles machine failure
4. Manages interprocess communication

MapReduce Benefits

Greatly reduces parallel programming complexity


Reduces synchronization complexity
Automatically partitions data
Provides failure transparency
Handles load balancing

Practical

Approximately 1,000 Google MapReduce jobs run every day.

Google Computing Environment


Typical clusters contain 1000s of machines
Dual-processor x86s running Linux, with 2-4 GB of memory
Commodity networking: typically 100 Mb/s or 1 Gb/s
IDE drives connected to individual machines
Distributed file system

How MapReduce Works

User to-do list:
  Indicate:
    Input/output files
    M: number of map tasks
    R: number of reduce tasks
    W: number of machines
  Write the map and reduce functions
  Submit the job

This requires no knowledge of parallel/distributed systems!!! What about everything else?

MapReduce Execution Overview


1. The user program, via the MapReduce library, shards the input data.

[Diagram: the User Program splits the Input Data into Shard 0 through Shard 6]

* Shards are typically 16-64 MB in size

Data Distribution

Input files are split into M pieces on the distributed file system
  Typically ~64 MB blocks
Intermediate files created by map tasks are written to local disk
Output files are written to the distributed file system

MapReduce Execution Overview


2. The user program creates process copies distributed across a machine cluster. One copy will be the Master and the others will be workers.

[Diagram: the User Program forks one Master and several Workers]

MapReduce Execution Overview

3. The master distributes the M map tasks and R reduce tasks to idle workers.

M == the number of shards
R == the number of parts the intermediate key space is divided into

[Diagram: the Master sends a Do_map_task message to an idle Worker]
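The split of the intermediate key space into R parts is done by a partitioning function; the default in the MapReduce paper is a hash of the key mod R. A minimal sketch (the stable-hash caveat is an implementation note, not from the slide):

def partition(key: str, R: int) -> int:
    # Maps an intermediate key to one of the R reduce partitions.
    # NOTE: Python's built-in hash() is salted per process; a real system
    # would use a stable hash (e.g., from hashlib) so every worker agrees.
    return hash(key) % R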

Assigning Tasks
Many copies of the user program are started
One instance becomes the Master
The Master finds idle machines and assigns them tasks
It tries to exploit data locality by running map tasks on the machines that hold the data

MapReduce Execution Overview

4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs.

Output is buffered in RAM.

[Diagram: Shard 0 -> Map worker -> key/value pairs]

MapReduce Execution Overview


5. Each worker flushes its intermediate values, partitioned into R regions, to local disk and notifies the Master process.

[Diagram: the Map worker writes to local storage and sends the disk locations to the Master]

MapReduce Execution Overview


6. The Master process gives the disk locations to an available reduce-task worker, which reads all the associated intermediate data from the map workers' (remote) storage.

[Diagram: the Master sends disk locations to a Reduce worker, which reads from remote storage]

MapReduce Execution Overview


7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key with its associated values. The reduce function's output is appended to the reduce task's partition output file.

[Diagram: the Reduce worker sorts its data and appends to a partition output file]
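The sort-then-group step of a reduce worker in miniature (the data is illustrative):

from itertools import groupby
from operator import itemgetter

pairs = [("now", 1), ("how", 1), ("now", 1), ("brown", 1)]
pairs.sort(key=itemgetter(0))  # sort intermediate data by key
for key, group in groupby(pairs, key=itemgetter(0)):
    values = [v for _, v in group]
    print(key, sum(values))  # here the user's reduce(key, values) would be called
# brown 1 / how 1 / now 2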

MapReduce Execution Overview


8. The Master process wakes up the user process when all tasks have completed. The output is contained in R output files.

[Diagram: the Master sends a wakeup to the User Program; the output is in R files]

Observations
No reduce can begin until the map phase is complete
Tasks are scheduled based on the location of data
If a map worker fails any time before reduce finishes, its task must be completely rerun
The Master must communicate the locations of intermediate files
The MapReduce library does most of the hard work for us!

[Diagram: generic MapReduce dataflow]
Data stores 1..n each feed input key/value pairs to a map task.
Each map emits (key 1, values...), (key 2, values...), (key 3, values...) pairs.
== Barrier ==: aggregates intermediate values by output key.
Each key's intermediate values go to a reduce task, producing the final values for key 1, key 2, and key 3.

Fault Tolerance

Workers are periodically pinged by the master
  No response = failed worker
Map-task failure: re-execute the task
  All of its output was stored locally, so it is lost with the failed machine
Reduce-task failure: only re-execute partially completed tasks
  Completed output is already stored in the global file system
The master writes periodic checkpoints
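A toy sketch of the master-side failure detection (the timeout value and bookkeeping are assumptions, not the paper's actual mechanism):

import time

PING_TIMEOUT = 10.0  # seconds; illustrative

def find_failed_workers(last_ping):
    # last_ping maps worker id -> time of the last successful ping response.
    # No response within the timeout = failed worker; its map tasks would be
    # marked idle and re-executed on another machine.
    now = time.monotonic()
    return [wid for wid, t in last_ping.items() if now - t > PING_TIMEOUT]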

Fault Tolerance

On errors, workers send a "last gasp" UDP packet to the master
  This detects records that cause deterministic crashes, so they can be skipped on re-execution
Input file blocks are stored on multiple machines
When the computation is almost done, in-progress tasks are rescheduled
  This avoids stragglers

Conclusions
Simplifies large-scale computations that fit this model
Allows the user to focus on the problem without worrying about details
The underlying computer architecture is not very important

Portable model

MapReduce Applications

Relational operations using MapReduce


Enterprise applications rely on structured data processing, i.e. the relational data model and SQL
Parallel databases support parallel execution
  Drawback: they lack scale and fault tolerance
MapReduce provides both

A relational join can be executed in parallel using MapReduce
E.g. given a sales table and a city table, compute the gross sales by city
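A single-machine sketch of a reduce-side join for this example; the toy table layouts (city_id keys, tag strings) are assumptions for illustration:

from collections import defaultdict

sales = [(1, 100.0), (2, 50.0), (1, 25.0)]   # (city_id, amount)
cities = [(1, "Pune"), (2, "Mumbai")]        # (city_id, city_name)

# Map: tag each record with its source table, keyed by the join key.
tagged = [(cid, ("sale", amt)) for cid, amt in sales] + \
         [(cid, ("city", name)) for cid, name in cities]

# Shuffle: group all tagged records by city_id.
groups = defaultdict(list)
for cid, record in tagged:
    groups[cid].append(record)

# Reduce: join the two record streams and sum the sales per city.
for cid, records in groups.items():
    name = next(v for tag, v in records if tag == "city")
    total = sum(v for tag, v in records if tag == "sale")
    print(name, total)
# Pune 125.0 / Mumbai 50.0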


Enterprise Batch Processing using MapReduce

Enterprise context: there is interest in leveraging the MapReduce model for high-throughput batch processing and analysis of data.

Batch processing operations


End-of-day processing
Needs to access and compute over large datasets
Time bound
Constraint: online availability of the transaction processing system

Opportunity to accelerate batch processing

Example: revaluing customer portfolios

References

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
Josh Carter, http://multipartmixed.com/software/mapreduce_presentation.pdf
Ralf Lämmel, "Google's MapReduce Programming Model - Revisited"
http://code.google.com/edu/parallel/mapreduce-tutorial.html
