OpenHPI Course
Week 5: Distributed Memory Parallelism
Unit 5.1: Hardware
Dr. Peter Tröger + Teaching Team
Summary: Week 4
Parallelism for
- Speed: compute faster
- Throughput: compute more in the same time
- Scalability: compute faster / more with additional resources
Huge scalability only with shared-nothing systems
[Figure: scaling up adds processing elements A1-A3 to one machine with shared main memory; scaling out adds further machines B1-B3, each with its own main memory.]
Parallel Hardware
- Shared memory system
  - Typically a single machine, common address space for tasks
  - Hardware scaling is limited (power / memory wall)
- Shared nothing (distributed memory) system
  - Tasks on multiple machines, can only access local memory
  - Global task coordination by explicit messaging
  - Easy scale-out by adding machines to the network
[Figure: in a shared memory system, tasks on several processing elements reach one shared memory through their caches; in a shared nothing system, each processing element has its own local memory and the tasks coordinate by exchanging messages.]
Example: IBM Blue Gene/Q (IBM System Technology Group)
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
(© 2011 IBM Corporation)
Interconnection Networks
- Bus systems
  - Static approach, low costs
- Star-connected networks
  - Static approach with a central switch
  - Fewer links, still very good performance
  - Scalability depends on the central switch
[Figure: bus topology and star topology connecting processing elements (PE).]
Interconnection Networks
- Crossbar switch
- Fat tree
  - Use wider links in the higher levels of the interconnect tree
[Figure: crossbar switch connecting PE1 ... PEn in both dimensions; fat tree with switches as inner nodes and PEs as leaves.]
Interconnection Networks
11
Linear array
PE
PE
PE
PE
Ring
Linear array with connected endings
PE
PE
PE
PE
83
PE
PE
PE
PE
PE
PE
PE
PE
PE
N-way
D-dimensional
torus
Compromise
between cost and connectivity
Mesh with wrap-around connection
4-way 2D mesh
4-way 2D torus
8-way 2D mesh
PE
PE
PE
PE
PE
PE
PE
PE
PE
[IBM]
12
Workload
Surface-To-Volume Effect
- Fine-grained decomposition to use all processing elements?
- Coarse-grained decomposition to reduce communication overhead?
- A trade-off question! [nicerweb.com]
Surface-To-Volume Effect
Heatmap example with 64 data cells:
- Version (a): 64 tasks
  - 64 x 4 = 256 messages, 256 data values
  - 64 processing elements used in parallel
- Version (b): 4 tasks
  - 16 messages, 64 data values
  - 4 processing elements used in parallel
[Foster]
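The numbers above follow directly from the decomposition granularity. A minimal sketch of the arithmetic, assuming an n x n grid cut into t x t square blocks with a four-neighbor edge exchange (the function name is illustrative, not from the slides):

#include <stdio.h>

/* Messages and transferred values for an n x n grid decomposed into
   t x t square blocks, assuming every block exchanges its four edges
   with its neighbors (boundary effects ignored for simplicity). */
static void surface_to_volume(int n, int t)
{
    int tasks    = t * t;           /* number of parallel tasks      */
    int edge     = n / t;           /* side length of one block      */
    int messages = tasks * 4;       /* four edge messages per task   */
    int values   = messages * edge; /* one edge of cells per message */
    printf("%2d tasks: %3d messages, %3d data values, %2d PEs usable\n",
           tasks, messages, values, tasks);
}

int main(void)
{
    surface_to_volume(8, 8);  /* version (a): 64 tasks of single cells */
    surface_to_volume(8, 2);  /* version (b): 4 tasks of 4x4 cells     */
    return 0;
}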
Surface-To-Volume Effect
Rule of thumb:
- Agglomerate tasks to avoid communication
- Stop when parallelism is no longer exploited well enough
- Agglomerate in all dimensions at the same time
Influencing factors:
- Communication technology + topology
- Serial performance per processing element
- Degree of application parallelism
Task communication vs. network topology:
- The resulting task graph must be mapped to the network topology
- Task-to-task communication may need multiple hops
[Foster]
Given:
- a number of homogeneous processing elements with performance characteristics,
- some interconnection topology of the processing elements with performance characteristics,
- an application dividable into parallel tasks.
Questions:
- What is the optimal task granularity?
- How should the tasks be placed on processing elements?
- Do we still get speedup / scale-up by this parallelization?
Task mapping is still research; today it is mostly manual tuning
- More options with configurable networks / dynamic routing
- Reconfiguration of hardware communication paths
Message Passing
[Figure: one application started as multiple instances (Instance 0 ... Instance 3) spread over the execution hosts.]
SPMD program
Every instance executes the same program; only its rank differs (the figure shows identical copies of the code on Instance 1 ... Instance 4, all working on the same input data):

// (determine rank and comm_size)
int token;
if (rank != 0) {
    // Receive from your left neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your right neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
         0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}
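Such an SPMD program is typically built with an MPI compiler wrapper and started once per instance by a launcher; a common setup looks like the following (file and binary names are illustrative, and the wrapper/launcher names depend on the MPI installation):

$ mpicc -o ring ring.c          (compile and link against the MPI library)
$ mpiexec -n 4 ./ring           (start four instances of the same program)

Each instance then prints the "Process ... received token ..." line from the code above.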
MPI Communicators
- Each application instance (process) has a rank, starting at zero
- Communicator: handle for a group of processes
  - Unique rank numbers inside the communicator group
  - An instance can determine the communicator size and its own rank
- Default communicator: MPI_COMM_WORLD
- An instance may be in multiple communicator groups
[Figure: one communicator with four instances, Rank 0 to Rank 3, each seeing Size 4.]
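A minimal sketch of how an instance obtains this information at runtime, using the standard MPI calls (the printed text is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, comm_size;
    MPI_Init(&argc, &argv);                     /* start the MPI runtime            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* own rank inside the communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);  /* number of processes in it        */
    printf("I am rank %d of %d\n", rank, comm_size);
    MPI_Finalize();                             /* shut down the MPI runtime        */
    return 0;
}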
Communication
[mpitutorial.com]
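For reference, the blocking point-to-point calls used in the ring example have the following signatures in the MPI standard (the const qualifier on the send buffer was added with MPI-3):

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);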
Deadlocks
Consider:
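A classic case, sketched here as a fragment in the style of the ring example (variable names are illustrative): both ranks post a blocking receive first, so neither ever reaches its send and both block forever.

// (determine rank and comm_size; assume exactly two ranks here)
int other = (rank == 0) ? 1 : 0;
int in, out = rank;
// Both ranks wait for the other one's message before sending their own:
MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);   // never reached

Swapping send and receive on one of the two ranks, or using MPI_Sendrecv, breaks the cycle. Two matching MPI_Send calls posted first may or may not deadlock, depending on whether the implementation buffers the messages.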
Collective Communication
Barrier
MPI_Barrier(comm)
[Figure: each instance again runs the token-ring SPMD program from above, framed by MPI_Barrier(comm) calls; a barrier blocks every caller until all processes of the communicator have reached it.]
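A small usage sketch, again as a fragment (do_local_work() is an illustrative placeholder): the barriers ensure that all ranks enter and leave the phase together, so the measured time is comparable across ranks.

// (determine rank and comm_size)
double t0, t1;
MPI_Barrier(MPI_COMM_WORLD);          /* wait until every rank arrives here */
t0 = MPI_Wtime();
do_local_work();                      /* illustrative placeholder           */
MPI_Barrier(MPI_COMM_WORLD);          /* every rank has finished the phase  */
t1 = MPI_Wtime();
if (rank == 0)
    printf("phase took %f seconds\n", t1 - t0);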
Broadcast
int MPI_Bcast(void *buffer, int count,
              MPI_Datatype datatype, int rootRank, MPI_Comm comm)
- rootRank is the rank of the chosen root process
- The root process broadcasts the data in buffer to all other processes, itself included
- On return, all processes have the same data in their buffer
[Figure: before the broadcast only the root holds D0; afterwards every process holds D0.]
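A minimal usage sketch as a fragment (buffer size and values are illustrative):

// (determine rank and comm_size)
int data[4];
if (rank == 0) {                      /* only the root prepares the data */
    for (int i = 0; i < 4; i++)
        data[i] = i * i;
}
MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);
/* From here on, data[] holds {0, 1, 4, 9} on every rank. */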
Scatter
int MPI_Scatter(void *sendbuf, int sendcnt,
                MPI_Datatype sendtype, void *recvbuf, int recvcnt,
                MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
- The sendbuf buffer on the root process is divided; the parts are sent to all processes, including the root
- MPI_SCATTERV allows a varying count of data per rank
[Figure: the root's buffer D0 ... D5 is split so that process i receives Di; the reverse arrow is labeled Gather.]
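A usage sketch as a fragment with one value per rank (values are illustrative); sendbuf only has to be a valid buffer on the root, all other ranks may pass NULL:

// (determine rank and comm_size; #include <stdlib.h> for malloc/free)
int *sendbuf = NULL;
int part;
if (rank == 0) {
    sendbuf = malloc(comm_size * sizeof(int));
    for (int i = 0; i < comm_size; i++)
        sendbuf[i] = i * 10;          /* part intended for rank i           */
}
MPI_Scatter(sendbuf, 1, MPI_INT, &part, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* Rank r now holds part == r * 10. */
free(sendbuf);                        /* free(NULL) is a no-op on non-roots */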
Gather
int MPI_Gather(void *sendbuf, int sendcnt,
               MPI_Datatype sendtype, void *recvbuf, int recvcnt,
               MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
- Each process (including the root process) sends the data in its sendbuf buffer to the root process
- Incoming data in recvbuf is stored in rank order
- The recvbuf parameter is ignored on all non-root processes
[Figure: the pieces D0 ... D5, one per process, are collected into the root's buffer.]
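The mirror image of the scatter fragment above (values illustrative); recvbuf only has to be allocated on the root:

// (determine rank and comm_size; #include <stdlib.h> for malloc/free)
int mine = rank * rank;               /* local contribution of this rank */
int *all = NULL;
if (rank == 0)
    all = malloc(comm_size * sizeof(int));
MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* On rank 0, all[r] now holds r*r, stored in rank order. */
free(all);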
Reduction
int MPI_Reduce(void *sendbuf, void *recvbuf,
               int count, MPI_Datatype datatype, MPI_Op op,
               int rootRank, MPI_Comm comm)
- Similar to MPI_Gather
- Additional reduction operation op to aggregate the received data: maximum, minimum, sum, product, boolean operators, max-loc, min-loc
- The MPI implementation can overlap communication and reduction calculation for faster results
[Figure: the values D0A, D0B, D0C contributed by the processes are combined by the reduction operator (e.g. +) into one result at the root.]
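A usage sketch as a fragment: every rank contributes one value, the root receives the sum (the contribution is illustrative):

// (determine rank and comm_size)
int mine = rank + 1;                  /* local contribution: 1, 2, ..., n  */
int sum  = 0;
MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)                        /* only the root holds the result    */
    printf("sum = %d\n", sum);        /* comm_size * (comm_size + 1) / 2   */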
What Else
Variations:
- MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, ... (a non-blocking sketch follows below)
- Definition of virtual topologies for better task mapping
- Complex data types
- Packing / unpacking (sprintf / sscanf)
- Group / communicator management
- Error handling
- Profiling interface
Several implementations available:
- MPICH - Argonne National Laboratory
- OpenMPI - consortium of universities and industry
- ...
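A sketch of the non-blocking variants mentioned above, as a fragment: each rank exchanges one value with its ring neighbors without deadlock risk, because both transfers are merely started and then completed together.

// (determine rank and comm_size)
int right = (rank + 1) % comm_size;
int left  = (rank - 1 + comm_size) % comm_size;
int sendval = rank, recvval;
MPI_Request req[2];
MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &req[0]);  /* start receive */
MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &req[1]);  /* start send    */
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);                            /* complete both */
/* recvval now holds the rank of the left neighbor. */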
CSP: Processes
Communication in CSP
Channels in Scala
import scala.actors.Actor._
import scala.actors.{Channel, OutputChannel}

Scope-based channel sharing:

actor {
  var out: OutputChannel[String] = null
  val child = actor {
    react {
      case "go" => out ! "hello"
    }
  }
  val channel = new Channel[String]
  out = channel
  child ! "go"
  channel.receive {
    case msg => println(msg.length)
  }
}

Sending channels in messages:

case class ReplyTo(out: OutputChannel[String])

val child = actor {
  react {
    case ReplyTo(out) => out ! "hello"
  }
}

actor {
  val channel = new Channel[String]
  child ! ReplyTo(channel)
  channel.receive {
    case msg => println(msg.length)
  }
}
Channels in Go
package main

import "fmt"

func sayHello(ch1 chan string) {
    ch1 <- "Hello World\n"
}

func main() {
    ch1 := make(chan string)
    go sayHello(ch1)
    fmt.Printf(<-ch1)
}

$ 8g chanHello.go                 (compile application)
$ 8l -o chanHello chanHello.8     (link application)
$ ./chanHello                     (run application)
Hello World
$
Task/Channel Model
Actor Model
Concurrency in Erlang
Actors in Erlang
-module(tut15).
-export([test/0, ping/2, pong/0]).
[erlang.org]
- Ping actor, sending messages to Pong; blocking recursive receive, scanning the mailbox
- Pong actor; blocking recursive receive, scanning the mailbox
Actors in Scala
import scala.actors.Actor
import scala.actors.Actor._

// Case classes, acting as message types
case class Inc(amount: Int)
case class Value

// Implementation of the counter actor
class Counter extends Actor {
  var counter: Int = 0

  def act() = {
    while (true) {
      // Blocking receive loop, scanning the mailbox
      receive {
        case Inc(amount) =>
          counter += amount
        case Value =>
          println("Value is " + counter)
          exit()
      }
    }
  }
}

object ActorTest extends Application {
  val counter = new Counter
  counter.start()
  for (i <- 0 until 100000) {
    counter ! Inc(1)
  }
  counter ! Value
  // Output: Value is 100000
}
Actor Deadlocks
- Synchronous send operator !? available in Scala
- Sends a message and blocks in receive afterwards
- Intended for the request-response pattern

// actorA
actorB !? Msg1(value) match {
  case Response1(r) =>
    // ...
}
receive {
  case Msg2(value) =>
    reply(Response2(value))
}

// actorB
actorA !? Msg2(value) match {
  case Response2(r) =>
    // ...
}
receive {
  case Msg1(value) =>
    reply(Response1(value))
}

// actorB (deadlock-free: asynchronous send, single receive loop)
actorA ! Msg2(value)
while (true) {
  receive {
    case Msg1(value) =>
      reply(Response1(value))
    case Response2(r) =>
      // ...
  }
}

[http://savanne.be/articles/concurrency-in-erlang-scala/]
MapReduce
- Map step
  - Convert input tuples [key, value] with a map() function into one or multiple intermediate tuples [key2, value2] per input
- Shuffle step
  - Collect all intermediate tuples with the same key
- Reduce step
  - Combine all intermediate tuples with the same key by some reduce() function into one result per key
- The developer just defines the stateless map() and reduce() functions
- The framework automatically ensures parallelization
- A persistence layer is needed for input and output only
[developers.google.com]
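To make the three steps concrete, here is a small sequential word-count sketch in C (purely illustrative, no framework involved): map turns every input word into an intermediate (word, 1) tuple, shuffle groups equal keys by sorting, and reduce sums the values per key.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Shuffle helper: order the intermediate tuples so equal keys are adjacent. */
static int by_key(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void)
{
    const char *input[] = { "to", "be", "or", "not", "to", "be" };
    int n = sizeof(input) / sizeof(input[0]);

    /* Map step: each word becomes the tuple (word, 1); since the value is
       always 1, keeping only the keys is sufficient here.                */
    const char **tuples = malloc(n * sizeof(*tuples));
    for (int i = 0; i < n; i++)
        tuples[i] = input[i];

    /* Shuffle step: collect all tuples with the same key next to each other. */
    qsort(tuples, n, sizeof(*tuples), by_key);

    /* Reduce step: combine all tuples with the same key into one result. */
    for (int i = 0; i < n; ) {
        int j = i, count = 0;
        while (j < n && strcmp(tuples[j], tuples[i]) == 0) { count++; j++; }
        printf("%s: %d\n", tuples[i], count);
        i = j;
    }
    free(tuples);
    return 0;
}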
MapReduce Concept
[hadoop.apache.org]
[developer.yahoo.com]
Advantages
Summary: Week 5