OpenHPI Course
Week 5: Distributed Memory Parallelism
Unit 5.1: Hardware
Dr. Peter Tröger + Teaching Team
Summary: Week 4
Parallelism for
- Speed: compute faster
- Throughput: compute more in the same time
- Scalability: compute faster / more with additional resources
Huge scalability only with shared-nothing systems
[Figure: scaling up adds processing elements A1-A3 to one machine with shared main memory; scaling out adds further machines B1-B3, each with its own main memory.]
Parallel Hardware
- Shared memory system
  - Typically a single machine, common address space for tasks
  - Hardware scaling is limited (power / memory wall)
- Shared nothing (distributed memory) system
  - Tasks on multiple machines, can only access local memory
  - Global task coordination by explicit messaging
  - Easy scale-out by adding machines to the network
[Figure: in a shared memory system, tasks on several processing elements reach one shared memory through their caches; in a shared nothing system, each processing element has its own local memory and the tasks coordinate by exchanging messages.]
Example: IBM Blue Gene/Q (IBM System Technology Group)
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
(© 2011 IBM Corporation)
Interconnection Networks
- Bus systems
  - Static approach, low costs
- Star-connected networks
  - Static approach with a central switch
  - Fewer links, still very good performance
  - Scalability depends on the central switch
[Figure: bus topology and star topology connecting processing elements (PE).]
Interconnection Networks
- Crossbar switch
- Fat tree
  - Use wider links in the higher levels of the interconnect tree
[Figure: crossbar switch connecting PE1 ... PEn in both dimensions; fat tree with switches as inner nodes and PEs as leaves.]
Interconnection Networks
11
Linear array
PE
PE
PE
PE
Ring
Linear array with connected endings
PE
PE
PE
PE
83
PE
PE
PE
PE
PE
PE
PE
PE
PE
N-way
D-dimensional
torus
Compromise
between cost and connectivity
Mesh with wrap-around connection
4-way 2D mesh
4-way 2D torus
8-way 2D mesh
PE
PE
PE
PE
PE
PE
PE
PE
PE
[IBM]
12
Workload
Surface-To-Volume Effect
- Fine-grained decomposition to use all processing elements?
- Coarse-grained decomposition to reduce communication overhead?
- A trade-off question! [nicerweb.com]
Surface-To-Volume Effect
Heatmap example with 64 data cells:
- Version (a): 64 tasks
  - 64 x 4 = 256 messages, 256 data values
  - 64 processing elements used in parallel
- Version (b): 4 tasks
  - 16 messages, 64 data values
  - 4 processing elements used in parallel
[Foster]
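The numbers above follow directly from the decomposition granularity. A minimal sketch of the arithmetic, assuming an n x n grid cut into t x t square blocks with a four-neighbor edge exchange (the function name is illustrative, not from the slides):

#include <stdio.h>

/* Messages and transferred values for an n x n grid decomposed into
   t x t square blocks, assuming every block exchanges its four edges
   with its neighbors (boundary effects ignored for simplicity). */
static void surface_to_volume(int n, int t)
{
    int tasks    = t * t;           /* number of parallel tasks      */
    int edge     = n / t;           /* side length of one block      */
    int messages = tasks * 4;       /* four edge messages per task   */
    int values   = messages * edge; /* one edge of cells per message */
    printf("%2d tasks: %3d messages, %3d data values, %2d PEs usable\n",
           tasks, messages, values, tasks);
}

int main(void)
{
    surface_to_volume(8, 8);  /* version (a): 64 tasks of single cells */
    surface_to_volume(8, 2);  /* version (b): 4 tasks of 4x4 cells     */
    return 0;
}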
Surface-To-Volume Effect
Rule of thumb:
- Agglomerate tasks to avoid communication
- Stop when parallelism is no longer exploited well enough
- Agglomerate in all dimensions at the same time
Influencing factors:
- Communication technology + topology
- Serial performance per processing element
- Degree of application parallelism
Task communication vs. network topology:
- The resulting task graph must be mapped to the network topology
- Task-to-task communication may need multiple hops
[Foster]
Given:
- a number of homogeneous processing elements with performance characteristics,
- some interconnection topology of the processing elements with performance characteristics,
- an application dividable into parallel tasks.
Questions:
- What is the optimal task granularity?
- How should the tasks be placed on processing elements?
- Do we still get speedup / scale-up by this parallelization?
Task mapping is still research; today it is mostly manual tuning
- More options with configurable networks / dynamic routing
- Reconfiguration of hardware communication paths
Message Passing
[Figure: one application started as multiple instances (Instance 0 ... Instance 3) spread over the execution hosts.]
SPMD program
Every instance executes the same program; only its rank differs (the figure shows identical copies of the code on Instance 1 ... Instance 4, all working on the same input data):

// (determine rank and comm_size)
int token;
if (rank != 0) {
    // Receive from your left neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your right neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
         0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}
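Such an SPMD program is typically built with an MPI compiler wrapper and started once per instance by a launcher; a common setup looks like the following (file and binary names are illustrative, and the wrapper/launcher names depend on the MPI installation):

$ mpicc -o ring ring.c          (compile and link against the MPI library)
$ mpiexec -n 4 ./ring           (start four instances of the same program)

Each instance then prints the "Process ... received token ..." line from the code above.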
MPI Communicators
- Each application instance (process) has a rank, starting at zero
- Communicator: handle for a group of processes
  - Unique rank numbers inside the communicator group
  - An instance can determine the communicator size and its own rank
- Default communicator: MPI_COMM_WORLD
- An instance may be in multiple communicator groups
[Figure: one communicator with four instances, Rank 0 to Rank 3, each seeing Size 4.]
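A minimal sketch of how an instance obtains this information at runtime, using the standard MPI calls (the printed text is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, comm_size;
    MPI_Init(&argc, &argv);                     /* start the MPI runtime            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* own rank inside the communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);  /* number of processes in it        */
    printf("I am rank %d of %d\n", rank, comm_size);
    MPI_Finalize();                             /* shut down the MPI runtime        */
    return 0;
}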
Communication
[mpitutorial.com]
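For reference, the blocking point-to-point calls used in the ring example have the following signatures in the MPI standard (the const qualifier on the send buffer was added with MPI-3):

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);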
Deadlocks
Consider:
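A classic case, sketched here as a fragment in the style of the ring example (variable names are illustrative): both ranks post a blocking receive first, so neither ever reaches its send and both block forever.

// (determine rank and comm_size; assume exactly two ranks here)
int other = (rank == 0) ? 1 : 0;
int in, out = rank;
// Both ranks wait for the other one's message before sending their own:
MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);   // never reached

Swapping send and receive on one of the two ranks, or using MPI_Sendrecv, breaks the cycle. Two matching MPI_Send calls posted first may or may not deadlock, depending on whether the implementation buffers the messages.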
Collective Communication
Barrier
MPI_Barrier(comm)
[Figure: each instance again runs the token-ring SPMD program from above, framed by MPI_Barrier(comm) calls; a barrier blocks every caller until all processes of the communicator have reached it.]
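A small usage sketch, again as a fragment (do_local_work() is an illustrative placeholder): the barriers ensure that all ranks enter and leave the phase together, so the measured time is comparable across ranks.

// (determine rank and comm_size)
double t0, t1;
MPI_Barrier(MPI_COMM_WORLD);          /* wait until every rank arrives here */
t0 = MPI_Wtime();
do_local_work();                      /* illustrative placeholder           */
MPI_Barrier(MPI_COMM_WORLD);          /* every rank has finished the phase  */
t1 = MPI_Wtime();
if (rank == 0)
    printf("phase took %f seconds\n", t1 - t0);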
Broadcast
int MPI_Bcast(void *buffer, int count,
              MPI_Datatype datatype, int rootRank, MPI_Comm comm)
- rootRank is the rank of the chosen root process
- The root process broadcasts the data in buffer to all other processes, itself included
- On return, all processes have the same data in their buffer
[Figure: before the broadcast only the root holds D0; afterwards every process holds D0.]
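A minimal usage sketch as a fragment (buffer size and values are illustrative):

// (determine rank and comm_size)
int data[4];
if (rank == 0) {                      /* only the root prepares the data */
    for (int i = 0; i < 4; i++)
        data[i] = i * i;
}
MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);
/* From here on, data[] holds {0, 1, 4, 9} on every rank. */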
Scatter
int MPI_Scatter(void *sendbuf, int sendcnt,
                MPI_Datatype sendtype, void *recvbuf, int recvcnt,
                MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
- The sendbuf buffer on the root process is divided; the parts are sent to all processes, including the root
- MPI_SCATTERV allows a varying count of data per rank
[Figure: the root's buffer D0 ... D5 is split so that process i receives Di; the reverse arrow is labeled Gather.]
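A usage sketch as a fragment with one value per rank (values are illustrative); sendbuf only has to be a valid buffer on the root, all other ranks may pass NULL:

// (determine rank and comm_size; #include <stdlib.h> for malloc/free)
int *sendbuf = NULL;
int part;
if (rank == 0) {
    sendbuf = malloc(comm_size * sizeof(int));
    for (int i = 0; i < comm_size; i++)
        sendbuf[i] = i * 10;          /* part intended for rank i           */
}
MPI_Scatter(sendbuf, 1, MPI_INT, &part, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* Rank r now holds part == r * 10. */
free(sendbuf);                        /* free(NULL) is a no-op on non-roots */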
Gather
int MPI_Gather(void *sendbuf, int sendcnt,
               MPI_Datatype sendtype, void *recvbuf, int recvcnt,
               MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
- Each process (including the root process) sends the data in its sendbuf buffer to the root process
- Incoming data in recvbuf is stored in rank order
- The recvbuf parameter is ignored on all non-root processes
[Figure: the pieces D0 ... D5, one per process, are collected into the root's buffer.]
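The mirror image of the scatter fragment above (values illustrative); recvbuf only has to be allocated on the root:

// (determine rank and comm_size; #include <stdlib.h> for malloc/free)
int mine = rank * rank;               /* local contribution of this rank */
int *all = NULL;
if (rank == 0)
    all = malloc(comm_size * sizeof(int));
MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* On rank 0, all[r] now holds r*r, stored in rank order. */
free(all);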
Reduction
int MPI_Reduce(void *sendbuf, void *recvbuf,
               int count, MPI_Datatype datatype, MPI_Op op,
               int rootRank, MPI_Comm comm)
- Similar to MPI_Gather
- Additional reduction operation op to aggregate the received data: maximum, minimum, sum, product, boolean operators, max-loc, min-loc
- The MPI implementation can overlap communication and reduction calculation for faster results
[Figure: the values D0A, D0B, D0C contributed by the processes are combined by the reduction operator (e.g. +) into one result at the root.]
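A usage sketch as a fragment: every rank contributes one value, the root receives the sum (the contribution is illustrative):

// (determine rank and comm_size)
int mine = rank + 1;                  /* local contribution: 1, 2, ..., n  */
int sum  = 0;
MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)                        /* only the root holds the result    */
    printf("sum = %d\n", sum);        /* comm_size * (comm_size + 1) / 2   */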
What Else
Variations:
- MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, ... (a non-blocking sketch follows below)
- Definition of virtual topologies for better task mapping
- Complex data types
- Packing / unpacking (sprintf / sscanf)
- Group / communicator management
- Error handling
- Profiling interface
Several implementations available:
- MPICH - Argonne National Laboratory
- OpenMPI - consortium of universities and industry
- ...
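A sketch of the non-blocking variants mentioned above, as a fragment: each rank exchanges one value with its ring neighbors without deadlock risk, because both transfers are merely started and then completed together.

// (determine rank and comm_size)
int right = (rank + 1) % comm_size;
int left  = (rank - 1 + comm_size) % comm_size;
int sendval = rank, recvval;
MPI_Request req[2];
MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &req[0]);  /* start receive */
MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &req[1]);  /* start send    */
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);                            /* complete both */
/* recvval now holds the rank of the left neighbor. */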
CSP: Processes
Communication in CSP
Channels in Scala
import scala.actors.Actor._
import scala.actors.{Channel, OutputChannel}

Scope-based channel sharing:

actor {
  var out: OutputChannel[String] = null
  val child = actor {
    react {
      case "go" => out ! "hello"
    }
  }
  val channel = new Channel[String]
  out = channel
  child ! "go"
  channel.receive {
    case msg => println(msg.length)
  }
}

Sending channels in messages:

case class ReplyTo(out: OutputChannel[String])

val child = actor {
  react {
    case ReplyTo(out) => out ! "hello"
  }
}

actor {
  val channel = new Channel[String]
  child ! ReplyTo(channel)
  channel.receive {
    case msg => println(msg.length)
  }
}
Channels in Go
package main

import "fmt"

func sayHello(ch1 chan string) {
    ch1 <- "Hello World\n"
}

func main() {
    ch1 := make(chan string)
    go sayHello(ch1)
    fmt.Printf(<-ch1)
}

$ 8g chanHello.go                 (compile application)
$ 8l -o chanHello chanHello.8     (link application)
$ ./chanHello                     (run application)
Hello World
$
Task/Channel Model
Actor Model
Concurrency in Erlang
Actors in Erlang
-module(tut15).
-export([test/0, ping/2, pong/0]).
[erlang.org]
- Ping actor, sending messages to Pong; blocking recursive receive, scanning the mailbox
- Pong actor; blocking recursive receive, scanning the mailbox
Actors in Scala
import scala.actors.Actor
import scala.actors.Actor._

// Case classes, acting as message types
case class Inc(amount: Int)
case class Value

// Implementation of the counter actor
class Counter extends Actor {
  var counter: Int = 0

  def act() = {
    while (true) {
      // Blocking receive loop, scanning the mailbox
      receive {
        case Inc(amount) =>
          counter += amount
        case Value =>
          println("Value is " + counter)
          exit()
      }
    }
  }
}

object ActorTest extends Application {
  val counter = new Counter
  counter.start()
  for (i <- 0 until 100000) {
    counter ! Inc(1)
  }
  counter ! Value
  // Output: Value is 100000
}
Actor Deadlocks
- Synchronous send operator !? available in Scala
- Sends a message and blocks in receive afterwards
- Intended for the request-response pattern

// actorA
actorB !? Msg1(value) match {
  case Response1(r) =>
    // ...
}
receive {
  case Msg2(value) =>
    reply(Response2(value))
}

// actorB
actorA !? Msg2(value) match {
  case Response2(r) =>
    // ...
}
receive {
  case Msg1(value) =>
    reply(Response1(value))
}

// actorB (deadlock-free: asynchronous send, single receive loop)
actorA ! Msg2(value)
while (true) {
  receive {
    case Msg1(value) =>
      reply(Response1(value))
    case Response2(r) =>
      // ...
  }
}

[http://savanne.be/articles/concurrency-in-erlang-scala/]
MapReduce
- Map step
  - Convert input tuples [key, value] with a map() function into one or multiple intermediate tuples [key2, value2] per input
- Shuffle step
  - Collect all intermediate tuples with the same key
- Reduce step
  - Combine all intermediate tuples with the same key by some reduce() function into one result per key
- The developer just defines the stateless map() and reduce() functions
- The framework automatically ensures parallelization
- A persistence layer is needed for input and output only
[developers.google.com]
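To make the three steps concrete, here is a small sequential word-count sketch in C (purely illustrative, no framework involved): map turns every input word into an intermediate (word, 1) tuple, shuffle groups equal keys by sorting, and reduce sums the values per key.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Shuffle helper: order the intermediate tuples so equal keys are adjacent. */
static int by_key(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void)
{
    const char *input[] = { "to", "be", "or", "not", "to", "be" };
    int n = sizeof(input) / sizeof(input[0]);

    /* Map step: each word becomes the tuple (word, 1); since the value is
       always 1, keeping only the keys is sufficient here.                */
    const char **tuples = malloc(n * sizeof(*tuples));
    for (int i = 0; i < n; i++)
        tuples[i] = input[i];

    /* Shuffle step: collect all tuples with the same key next to each other. */
    qsort(tuples, n, sizeof(*tuples), by_key);

    /* Reduce step: combine all tuples with the same key into one result. */
    for (int i = 0; i < n; ) {
        int j = i, count = 0;
        while (j < n && strcmp(tuples[j], tuples[i]) == 0) { count++; j++; }
        printf("%s: %d\n", tuples[i], count);
        i = j;
    }
    free(tuples);
    return 0;
}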
MapReduce Concept
[hadoop.apache.org]
[developer.yahoo.com]
Advantages
Summary: Week 5