Parallel Programming - Slides

Figure 1.1 Astrophysical N-body simulation by Scott Linssen (undergraduate University of North Carolina at Charlotte [UNCC] student).
Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers
Barry Wilkinson and Michael Allen, Prentice Hall, 1998
Main memory
Instructions (to processor)
Data (to or from processor)
Processor
Figure 1.2 Conventional computer having a single processor and memory.
One address space
Memory modules
Interconnection network
Processors
Figure 1.3 Traditional shared memory multiprocessor model.
Interconnection network
Messages
Processor
Local memory
Computers
Figure 1.4 Message-passing multiprocessor model (multicomputer).
Interconnection network
Messages
Processor
Shared memory
Computers
Figure 1.5 Shared memory multiprocessor implementation.
Program
Instructions
Program
Instructions
Processor
Processor
Data
Data
Figure 1.6 MPMD structure.
Computers (C), each with a processor (P) and memory (M)
Network with direct links between computers
Figure 1.7 Static link multicomputer.
Computer (node)
Switch
Processor
Memory
Links to other nodes
Figure 1.8 Node with a switch for internode message transfers.
Link
Node
Node
Figure 1.9 A link between two nodes with separate wires in each direction.
Figure 1.10 Ring.
Links
Computer/processor
Figure 1.11 Two-dimensional array (mesh).
Root
Links
Processing element
Figure 1.12 Tree structure.
Node addresses: 000, 001, 010, 011, 100, 101, 110, 111
Figure 1.13 Three-dimensional hypercube.
Node addresses: 0000 through 1111
Figure 1.14 Four-dimensional hypercube.
Ring
Figure 1.15 Embedding a ring onto a torus.
Nodal address 1011
x
Axis labels on each side (Gray code order): 00, 01, 11, 10
Figure 1.16 Embedding a mesh into a hypercube.
Root
A
Figure 1.17 Embedding a tree into a mesh.
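The axis labels in Figure 1.16 run in Gray code order, so neighbouring mesh positions map to hypercube nodes whose addresses differ in one bit. The sketch below illustrates that idea only; the function names `gray_code`, `mesh_to_hypercube`, and `hamming_distance` are hypothetical, and the choice of placing the row bits above the column bits is an assumption, not something the figure states.

```c
#include <assert.h>

/* Binary-reflected Gray code of i: consecutive values differ in one bit. */
unsigned gray_code(unsigned i)
{
    return i ^ (i >> 1);
}

/* Hypercube node address for mesh position (row, col), where each
   coordinate is encoded in `bits` bits. Row bits are placed above
   column bits (an assumption made for this illustration). */
unsigned mesh_to_hypercube(unsigned row, unsigned col, unsigned bits)
{
    assert(bits > 0 && bits < 16);
    return (gray_code(row) << bits) | gray_code(col);
}

/* Number of bit positions in which two node addresses differ. */
unsigned hamming_distance(unsigned a, unsigned b)
{
    unsigned diff = a ^ b, count = 0;
    while (diff) {
        count += diff & 1u;
        diff >>= 1;
    }
    return count;
}
```

For a 4 × 4 mesh (`bits` = 2), any two horizontally or vertically adjacent positions give addresses at Hamming distance 1, which is what makes them directly linked in the hypercube.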
Packet
Head
Movement
Flit buffer
Request/Acknowledge signal(s)
Figure 1.18 Distribution of flits.
Source processor
Destination processor
Data
R/A
Figure 1.19 A signaling method between processors for wormhole routing (Ni and McKinley, 1993).
Curves: Packet switching, Wormhole routing, Circuit switching
Vertical axis: Network latency
Horizontal axis: Distance (number of nodes between source and destination)
Figure 1.20 Network delay characteristics.
Node 1, Node 2, Node 3, Node 4
Messages
Figure 1.21 Deadlock in store-and-forward networks.
Virtual channel buffer
Node
Node
Route
Physical link
Figure 1.22 Multiple virtual channels mapped onto a single physical channel.
Ethernet
Workstation/file server
Workstations
Figure 1.23 Ethernet-type single wire network.
Frame fields, in order of transmission:
Preamble (64 bits)
Destination address (48 bits)
Source address (48 bits)
Type (16 bits)
Data (variable)
Frame check sequence (32 bits)
Figure 1.24 Ethernet frame format.
Network
Workstation/file server
Workstations
Figure 1.25 Network of workstations connected via a ring.
Workstations
Workstation/file server
Figure 1.26 Star connected network.
Parallel programming cluster
(a) Using specially designed adaptors
(b) Using separate Ethernet interfaces
Figure 1.27 Overlapping connectivity Ethernets.
Process 1, Process 2, Process 3, Process 4
Computing
Waiting to send a message
Message
Slope indicating time to send message
Time
Figure 1.28 Space-time diagram of a message-passing program.
(a) One processor: total time $t_s$, made up of a serial section $f t_s$ and parallelizable sections $(1 - f) t_s$
(b) Multiple processors (n processors): total time $t_p = f t_s + (1 - f) t_s / n$
Figure 1.29 Parallelizing sequential problem: Amdahl's law.
(a) Speedup factor, S(n), against number of processors, n (4 to 20), with curves for f = 0%, 5%, 10%, and 20%
(b) Speedup factor, S(n), against serial fraction, f (0.2 to 1.0), with curves for n = 16 and n = 256
Figure 1.30 (a) Speedup against number of processors. (b) Speedup against serial fraction, f.
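The curves in Figure 1.30 follow from Amdahl's law: with serial fraction f and n processors, the speedup factor is S(n) = t_s / (f t_s + (1 - f) t_s / n) = n / (1 + (n - 1) f). The sketch below evaluates that expression; the function names `amdahl_speedup` and `amdahl_limit` are hypothetical and chosen for illustration only.

```c
#include <assert.h>

/* Speedup factor S(n) under Amdahl's law for serial fraction f (0..1)
   and n processors: S(n) = n / (1 + (n - 1) f). */
double amdahl_speedup(double f, int n)
{
    assert(n >= 1 && f >= 0.0 && f <= 1.0);
    return (double)n / (1.0 + (n - 1) * f);
}

/* Limiting speedup as n grows without bound: 1 / f (requires f > 0). */
double amdahl_limit(double f)
{
    assert(f > 0.0 && f <= 1.0);
    return 1.0 / f;
}
```

With f = 5%, for example, 20 processors give a speedup of a little over 10, and no number of processors can exceed a speedup of 20.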
Source file
Compile to suit processor
Executables
Processor 0
Processor n - 1
Figure 2.1 Single program, multiple data operation.
Process 1
spawn();
Start execution of process 2
Process 2
Time
Figure 2.2 Spawning a process.
Process 1
send(&x, 2);
Movement of data
Process 2
recv(&y, 1);
Figure 2.3 Passing a message between processes using send() and recv() library calls.
Process 1
Time
send();
Suspend
process
Both processes
continue
Process 2
Request to send
Acknowledgment
recv();
Message
(a) When send() occurs before recv()

Process 1
Process 2
Time
recv();
Request to send
send();
Both processes
continue
Suspend
process
Message
Acknowledgment
(b) When recv() occurs before send()
Figure 2.4 Synchronous send() and recv() library calls using a three-way protocol.
Process 1
send();
Continue process
Message buffer
Process 2
recv();
Read message buffer
Time
Figure 2.5 Using a message buffer.
Process 0
Process 1
Process n - 1
data
buf
Action
Code: bcast(); in every process
Figure 2.6 Broadcast operation.
Process 0
Process 1
Process n - 1
data
buf
Action
Code: scatter(); in every process
Figure 2.7 Scatter operation.
Process 0
Process 1
Process n - 1
data
buf
Action
Code: gather(); in every process
Figure 2.8 Gather operation.
Process 0
Process 1
Process n - 1
data
buf
Action (+)
Code: reduce(); in every process
Figure 2.9 Reduce operation (addition).
Three workstations, each with a PVM daemon and an application program (executable)
Messages sent through network
Figure 2.10 Message passing between workstations using PVM.
Three workstations, each with a PVM daemon
Application program (executable)
Messages sent through network
Figure 2.11 Multiple processes allocated to each processor (workstation).
Process 1
Array holding data
Pack
Send buffer
pvm_psend();
Continue process
Process 2
Array to receive data
pvm_precv(); Wait for message
Figure 2.12 pvm_psend() and pvm_precv() system calls.
Process_1 (send buffer holds x, s, y)
pvm_initsend();
pvm_pkint( &x );
pvm_pkstr( &s );
pvm_pkfloat( &y );
pvm_send(process_2 );
Message
Process_2 (receive buffer)
pvm_recv(process_1 );
pvm_upkint( &x );
pvm_upkstr( &s );
pvm_upkfloat( &y );
Figure 2.13 PVM packing messages, sending, and unpacking.
Master
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pvm3.h>
#define SLAVE "spsum"
#define PROC 10
#define NELEM 1000
int main() {
    int mytid, tids[PROC];
    int n = NELEM, nproc = PROC;
    int no, i, who, msgtype;
    int data[NELEM], result[PROC], tot = 0;
    char fn[255];
    FILE *fp;
    mytid = pvm_mytid();                    /* Enroll in PVM */
    /* Start slave tasks */
    no = pvm_spawn(SLAVE, (char**)0, 0, "", nproc, tids);
    if (no < nproc) {
        printf("Trouble spawning slaves \n");
        for (i = 0; i < no; i++) pvm_kill(tids[i]);
        pvm_exit(); exit(1);
    }
    /* Open input file and initialize data */
    strcpy(fn, getenv("HOME"));
    strcat(fn, "/pvm3/src/rand_data.txt");
    if ((fp = fopen(fn, "r")) == NULL) {
        printf("Can't open input file %s\n", fn);
        exit(1);
    }
    for (i = 0; i < n; i++) fscanf(fp, "%d", &data[i]);
    /* Broadcast data to slaves */
    pvm_initsend(PvmDataDefault);
    msgtype = 0;
    pvm_pkint(&nproc, 1, 1);
    pvm_pkint(tids, nproc, 1);
    pvm_pkint(&n, 1, 1);
    pvm_pkint(data, n, 1);
    pvm_mcast(tids, nproc, msgtype);
    /* Get results from slaves */
    msgtype = 5;
    for (i = 0; i < nproc; i++) {
        pvm_recv(-1, msgtype);
        pvm_upkint(&who, 1, 1);
        pvm_upkint(&result[who], 1, 1);
        printf("%d from %d\n", result[who], who);
    }
    /* Compute global sum */
    for (i = 0; i < nproc; i++) tot += result[i];
    printf("The total is %d.\n\n", tot);
    pvm_exit();                             /* Program finished. Exit PVM */
    return(0);
}
Slave
#include <stdio.h>
#include <pvm3.h>
#define PROC 10
#define NELEM 1000
int main() {
    int mytid;
    int tids[PROC];
    int n, me = 0, i, msgtype;
    int x, nproc, master;
    int low, high;
    int data[NELEM], sum = 0;
    mytid = pvm_mytid();
    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkint(data, n, 1);
    /* Determine my tid */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }
    /* Add my portion of data */
    x = n/nproc;
    low = me * x;
    high = low + x;
    for (i = low; i < high; i++)
        sum += data[i];
    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkint(&sum, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);
    /* Exit PVM */
    pvm_exit();
    return(0);
}
Figure 2.14 Sample PVM program.
Process 0
Process 1
Destination
send(,1,);
lib()
send(,1,);
Source
recv(,0,);
lib()
recv(,0,);
(a) Intended behavior

Process 0
Process 1
send(,1,);
lib()
send(,1,);
recv(,0,);
lib()
recv(,0,);
(b) Possible behavior

Figure 2.15 Unsafe message passing with libraries.
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAXSIZE 1000
int main(int argc, char **argv)
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++) fscanf(fp, "%d", &data[i]);
    }
    /* Broadcast data */
    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);
    /* Add my portion of data */
    x = MAXSIZE/numprocs;
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);
    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);
    MPI_Finalize();
    return 0;
}
Figure 2.16 Sample MPI program.
Vertical axis: Time (intercept: Startup time)
Horizontal axis: Number of data items (n)
Figure 2.17 Theoretical communication time.
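Figure 2.17 shows communication time as a straight line: a fixed startup time plus a term that grows with the number of data items. A minimal sketch of that linear model follows; the function name `comm_time` and the parameter names `t_startup` and `t_data` (time per data item) are assumptions made for this illustration.

```c
#include <assert.h>

/* Linear communication-time model: a fixed startup time plus a
   per-item transmission time for each of the n data items. */
double comm_time(double t_startup, double t_data, int n)
{
    assert(n >= 0 && t_startup >= 0.0 && t_data >= 0.0);
    return t_startup + n * t_data;
}
```

With a startup time of 100 units and 2 units per item, sending 50 items costs 200 units, half of which is startup, which is why a few large messages are usually cheaper than many small ones under this model.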
Curves: $c_2 g(x) = 6x^2$, $f(x) = 4x^2 + 2x + 12$, $c_1 g(x) = 2x^2$
Vertical axis: 0 to 160
Horizontal axis: x, with $x_0 = 3$ marked
Figure 2.18 Growth of function $f(x) = 4x^2 + 2x + 12$.
Node addresses: 000 through 111
1st step, 2nd step, 3rd step
Figure 2.19 Broadcast in a three-dimensional hypercube.
Message starts at P000
Step 1: P000, P001
Step 2: P000, P010, P001, P011
Step 3: P000, P100, P010, P110, P001, P101, P011, P111
Figure 2.20 Broadcast as a tree construction.
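The steps in Figures 2.19 and 2.20 can be simulated sequentially: in each step, every node that already holds the message passes it to the neighbour whose address differs in one more bit. The sketch below records the step at which each node receives the message. The function name `hypercube_broadcast_steps`, the choice of node 0 as the source, and the lowest-bit-first order are assumptions chosen to match the step lists in Figure 2.20.

```c
#include <assert.h>

/* For each node of a d-dimensional hypercube, record the step (1..d) at
   which it receives a broadcast that starts at node 0; the source is
   given step 0. In step s, every node that already holds the message
   sends it to the neighbour whose address differs in bit s-1. */
void hypercube_broadcast_steps(int d, int recv_step[])
{
    assert(d >= 0 && d < 20);
    int nodes = 1 << d;

    for (int i = 0; i < nodes; i++)
        recv_step[i] = -1;              /* -1: message not yet received */
    recv_step[0] = 0;

    for (int s = 1; s <= d; s++) {
        int bit = 1 << (s - 1);
        for (int node = 0; node < nodes; node++) {
            /* only nodes that received the message before step s send */
            if (recv_step[node] >= 0 && recv_step[node] < s) {
                int partner = node ^ bit;
                if (recv_step[partner] < 0)
                    recv_step[partner] = s;
            }
        }
    }
}
```

The number of holders doubles in every step, so all 2^d nodes are reached after d steps.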
Steps
1
6
Figure 2.21 Broadcast in a mesh.
Message
Source
Destinations
Figure 2.22 Broadcast on an Ethernet network.
Source
Sequential
N destinations
Figure 2.23 1-to-N fan-out broadcast.
Source
Sequential message issue
Destinations
Figure 2.24 1-to-N fan-out broadcast on a tree structure.
Process 1
Process 2
Process 3
Time
Computing
Waiting
Message-passing system routine
Message
Figure 2.25 Space-time diagram of a parallel program.
55
Vertical axis: Number of repetitions or time
Horizontal axis: Statement number or regions of program (ticks up to 10)
Figure 2.26 Program profile.
Input data
Processes
Results
Figure 3.1 Disconnected computational graph (embarrassingly parallel problem).
Master: spawn(); send() (send initial data); recv() (collect results)
Slaves: recv(); send()
Figure 3.2 Practical embarrassingly parallel computational graph with dynamic process creation and the master-slave approach.
Display area 640 × 480, mapped onto processes
(a) Square region for each process (80 × 80)
(b) Row region for each process (10 rows of 640)
Figure 3.3 Partitioning into regions for individual processes.
Imaginary axis: -2 to +2
Real axis: -2 to +2
Figure 3.4 Mandelbrot set.
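Each point in Figure 3.4 is coloured by how many iterations of z = z² + c it takes for |z| to reach 2, starting from z = 0. The sketch below computes that count for one point. The type name `complex_pt`, the function name `cal_pixel`, and the `max_iter` parameter are illustrative assumptions; the escape test uses the squared magnitude to avoid a square root.

```c
#include <assert.h>

/* A point in the complex plane (hypothetical type for this sketch). */
typedef struct {
    double real;
    double imag;
} complex_pt;

/* Iterate z = z*z + c from z = 0 and return the number of iterations
   performed before |z| reaches 2, capped at max_iter. */
int cal_pixel(complex_pt c, int max_iter)
{
    complex_pt z = {0.0, 0.0};
    double temp, lengthsq;
    int count = 0;

    assert(max_iter > 0);
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;
        z.imag = 2.0 * z.real * z.imag + c.imag;
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;
        count++;
    } while (lengthsq < 4.0 && count < max_iter);
    return count;
}
```

Points inside the set, such as the origin, never escape and return `max_iter`; points far outside escape after a single iteration. Because every pixel is computed independently, the image divides naturally among processes.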
Work pool
(xc, yc)
(xa, ya)
(xb, yb)
(xe, ye)
(xd, yd)
Task
Return results/
request new task
Figure 3.5 Work pool approach.
61
Rows outstanding in slaves (count)

0
Row sent
disp_height
Increment
Row returned
Terminate
Decrement
Figure 3.6 Counter termination.
62
Total area = 4
Area = $\pi$
Figure 3.7 Computing $\pi$ by a Monte Carlo method.
f(x)
$y = \sqrt{1 - x^2}$
x (tick at 1)
Figure 3.8 Function being integrated in computing $\pi$ by a Monte Carlo method.
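Figure 3.7 suggests the method: a circle of area π sits inside a square of area 4, so the fraction of random points that land inside the circle approaches π/4. The sketch below is a sequential version under stated assumptions: the names `next_uniform` and `monte_carlo_pi` are hypothetical, and the small linear congruential generator is a stand-in chosen only to keep the example self-contained and repeatable.

```c
#include <assert.h>
#include <stdint.h>

/* Simple 64-bit linear congruential generator returning a value in
   [0, 1). An illustrative stand-in for a real random number source. */
static double next_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0;   /* 2^53 */
}

/* Estimate pi by sampling points in the square [-1, 1] x [-1, 1] and
   counting the fraction that falls inside the unit circle. */
double monte_carlo_pi(long samples, uint64_t seed)
{
    uint64_t state = seed;
    long inside = 0;

    assert(samples > 0);
    for (long i = 0; i < samples; i++) {
        double x = 2.0 * next_uniform(&state) - 1.0;
        double y = 2.0 * next_uniform(&state) - 1.0;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return 4.0 * (double)inside / (double)samples;
}
```

In a parallel version each slave would run the sampling loop on its own stream of random numbers and return a partial count for the master to combine.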
Master
Partial sum
Request
Slaves
Random
number
Random number
process
Figure 3.9 Parallel Monte Carlo integration.
x1
x2
xk-1
xk
xk+1
xk+2
x2k-1
x2k
Figure 3.10 Parallel computation of a sequence.
66
$x_0 \ldots x_{(n/m)-1}$, $x_{n/m} \ldots x_{(2n/m)-1}$, ..., $x_{(m-1)n/m} \ldots x_{n-1}$
+ (one addition per part)
Partial sums
+
Sum
Figure 4.1 Partitioning a sequence of numbers into parts and adding the parts.
Initial problem
Divide
problem
Final tasks
Figure 4.2 Tree construction.
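Original list: $x_0 \ldots x_{n-1}$
Process tree, level by level: P0; P0, P4; P0, P2, P4, P6; P0, P1, P2, P3, P4, P5, P6, P7
Figure 4.3 Dividing a list into parts.
$x_0 \ldots x_{n-1}$
Process tree, level by level: P0, P1, P2, P3, P4, P5, P6, P7; P0, P2, P4, P6; P0, P4; P0
Final sum
Figure 4.4 Partial summation.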
Found/
Not found
OR
OR
OR
Figure 4.5
Part of a search tree.
71
Figure 4.6
Quadtree.
72
Image area
First division
into four parts
Second division
Figure 4.7
Dividing an image.
73
Unsorted numbers
Buckets
Sort
contents
of buckets
Merge lists
Sorted numbers
Figure 4.8 Bucket sort.
74
Unsorted numbers
p processors
Buckets
Sort
contents
of buckets
Merge lists
Sorted numbers
Figure 4.9 One parallel version of bucket sort.
75
n/m numbers
Unsorted numbers
p processors
Small
buckets
Empty
small
buckets
Large
buckets
Sort
contents
of buckets
Merge lists
Sorted numbers
Figure 4.10 Parallel version of bucket sort.
76
Process 0, Process 1, ..., Process n - 2, Process n - 1
Send buffer
Receive buffer (entries up to n - 1)
Figure 4.11 All-to-all broadcast.
Process   Before all-to-all     After all-to-all
P0        A0,0 A0,1 A0,2 A0,3   A0,0 A1,0 A2,0 A3,0
P1        A1,0 A1,1 A1,2 A1,3   A0,1 A1,1 A2,1 A3,1
P2        A2,0 A2,1 A2,2 A2,3   A0,2 A1,2 A2,2 A3,2
P3        A3,0 A3,1 A3,2 A3,3   A0,3 A1,3 A2,3 A3,3
Figure 4.12 Effect of all-to-all on an array.
f(x)
f(p)
f(q)
Figure 4.13 Numerical integration using rectangles.
f(x)
f(p)
f(q)
Figure 4.14 More accurate numerical integration using rectangles.
f(x)
f(p)
f(q)
Figure 4.15 Numerical integration using the trapezoidal method.
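The trapezoidal method in Figure 4.15 approximates the area under f(x) between p and q by the trapezoid with parallel sides f(p) and f(q), and sums such areas over many narrow intervals. The sketch below is a sequential version; the function names `trapezoid` and `square` (a sample integrand) are hypothetical and used for illustration only.

```c
#include <assert.h>

/* Approximate the integral of f over [a, b] using n trapezoids of
   equal width. Interior points are shared by two trapezoids, so they
   carry weight 1 and the two end points carry weight 1/2. */
double trapezoid(double (*f)(double), double a, double b, int n)
{
    assert(n > 0);
    double delta = (b - a) / n;
    double area = 0.5 * (f(a) + f(b));

    for (int i = 1; i < n; i++)
        area += f(a + i * delta);
    return area * delta;
}

/* Sample integrand for illustration: f(x) = x * x. */
double square(double x)
{
    return x * x;
}
```

A static parallel version would give each process one contiguous sub-interval of [a, b] and add the partial areas, for example with a reduce operation.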
f(x)
C
Figure 4.16 Adaptive quadrature construction.
f(x)
C=0
Figure 4.17 Adaptive quadrature with false termination.
Center of mass
Distant cluster of bodies
Figure 4.18 Clustering distant bodies.
84
Subdivision
direction
Particles
Partial quadtree
Figure 4.19 Recursive division of two-dimensional space.
85
Figure 4.20 Orthogonal recursive bisection method.
log n numbers
+
+
+
+
+
+
+
+
Binary Tree
Result
Figure 4.21 Process diagram for Problem 4-12(b).
y
f(x)
a, with f(a)
b, with f(b)
Figure 4.22 Bisection method for finding the zero crossing location of a function.
Figure 4.23 Convex hull (Problem 4-22).
89
P0
P1
P2
P3
P4
P5
Figure 5.1 Pipelined processes.
90
sum
a[0], a[1], a[2], a[3], a[4]
Each stage: sin, sout
Figure 5.2 Pipeline for an unfolded loop.
f(t)
Stages f0, f1, f2, f3, f4, each with fin and fout
Signal without frequency f0, signal without frequency f1, signal without frequency f2, signal without frequency f3
Filtered signal
Figure 5.3 Pipeline for a frequency filter.
Vertical axis: processes P0 to P5 (p - 1 marked)
Horizontal axis: Time
Entries: Instance 1 to Instance 7, each passing through the processes in turn
Figure 5.4 Space-time diagram of a pipeline.
Rows: Instance 0 to Instance 4
Entries: P0 to P5 for each instance
Horizontal axis: Time
Figure 5.5 Alternative space-time diagram.
Input sequence: d9 d8 d7 d6 d5 d4 d3 d2 d1 d0
P0 to P9
(a) Pipeline structure
Vertical axis: P0 to P9
Horizontal axis: Time (p - 1 and n marked)
Entries: d0 to d9
(b) Timing diagram
Figure 5.6 Pipeline processing 10 data elements.
Processes P0 to P5 against Time
Information passed to next stage
Information transfer sufficient to start next process
(a) Processes with the same execution time
(b) Processes not with the same execution time
Figure 5.7 Pipeline processing where information passes to next stage before end of process.
Processor 0
P0
P1
P2
Processor 1
P3
P4
P5
P6
Processor 2
P7
P8
P9
P10
P11
Figure 5.8 Partitioning processes onto processors.
97
Multiprocessor
Host
computer
Figure 5.9 Multiprocessor system with a line configuration.
98
P0, P1, P2, P3, P4, each passing on a partial sum
Figure 5.10 Pipelined addition.
Master process
$d_{n-1} \ldots d_2 d_1 d_0$
Slaves: P0, P1, P2, ..., $P_{n-1}$
Sum
Figure 5.11 Pipelined addition of numbers with a master process and ring configuration.
Master process
Numbers: $d_0$, $d_1$, ..., $d_{n-1}$
Slaves: P0, P1, P2, ..., $P_{n-1}$
Sum
Figure 5.12 Pipelined addition of numbers with direct access to slave processes.
Processes P0 to P4 against Time (cycles), 1 to 10
Input sequence entering P0: 4, 3, 1, 2, 5
Figure 5.13 Steps in insertion sort with five numbers.
Series of numbers: $x_{n-1} \ldots x_1 x_0$
P0, P1, P2
Compare
Smaller numbers
$x_{max}$
Largest number
Next largest number
Figure 5.14 Pipeline for sorting using insertion sort.
Master process
$d_{n-1} \ldots d_2 d_1 d_0$
Sorted sequence
P0, P1, P2, ..., $P_{n-1}$
Figure 5.15 Insertion sort with results returned to the master process using a bidirectional line configuration.
Sorting phase (2n - 1)
Returning sorted numbers (n)
Shown for n = 5
Processes P0 to P4 against Time
Figure 5.16 Insertion sort with results returned.
Series of numbers: $x_{n-1} \ldots x_1 x_0$
P0: 1st prime number; compare multiples; passes on numbers that are not multiples of 1st prime number
P1: 2nd prime number
P2: 3rd prime number
Figure 5.17 Pipeline for sieve of Eratosthenes.
P0
P1
Compute x0
x0
Compute x1
P2
x0
x1
Compute x2
P3
x0
x1
x2
Compute x3
x0
x1
x2
x3
Figure 5.18 Solving an upper triangular set of linear equations using a pipeline.
P5
P4
Processes
P3
Final computed value
P2
P1
P0
First value passed onward

Time
Figure 5.19 Pipeline processing using back substitution.
P0
divide
send(x0)
end
Time
P1
recv(x0)
send(x0)
multiply/add
divide/subtract
send(x1)
end
P2
recv(x0)
send(x0)
multiply/add
recv(x1)
send(x1)
multiply/add
divide/subtract
send(x2)
end
P3
recv(x0)
send(x0)
multiply/add
recv(x1)
send(x1)
multiply/add
recv(x2)
send(x2)
multiply/add
divide/subtract
send(x3)
end
P4
recv(x0)
send(x0)
multiply/add
recv(x1)
send(x1)
multiply/add
recv(x2)
send(x2)
multiply/add
recv(x3)
send(x3)
multiply/add
divide/subtract
send(x4)
end
Figure 5.20 Operations in back substitution pipeline.
109
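The operations listed for Figure 5.20 follow the order in which the unknowns become available: x0 needs only a division, and each later unknown needs one multiply/add per earlier unknown followed by a divide/subtract. The sketch below is a sequential version of that computation. The function name `back_substitute` and the row-major array layout are assumptions, and the system is taken to have the form in which equation i involves only x0 to xi, as the figure's ordering suggests.

```c
#include <assert.h>

/* Solve a triangular system in which equation i involves only the
   unknowns x[0]..x[i]:
       a[i][0]*x[0] + ... + a[i][i]*x[i] = b[i]
   The coefficient matrix a is stored row-major as an n-by-n array. */
void back_substitute(int n, const double *a, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < i; j++)
            sum += a[i * n + j] * x[j];       /* multiply/add */
        assert(a[i * n + i] != 0.0);
        x[i] = (b[i] - sum) / a[i * n + i];   /* divide/subtract */
    }
}
```

In the pipelined form, process i performs iteration i of the outer loop and forwards each x value as soon as it has used it, so later processes can start before earlier ones finish.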
Input: y4 y3 y2 y1
Four stages, with inputs x1, x2, x3, x4 and a1, a2, a3, a4; each stage has yin and yout
Output
Figure 5.21 Pipeline for Problem 5-9.
Display
Display
Audio input
(digitized)
Pipeline
Audio input
(digitized)
(a) Pipeline solution
(b) Direct decomposition
Figure 5.22 Audio histogram display.
111
Processes: P0, P1, P2, ..., $P_{n-1}$
Active
Waiting
Time
Barrier
Figure 6.1 Processes reaching the barrier at different times.
Processes: P0, P1, ..., $P_{n-1}$
Barrier(); in each process
Processes wait until all reach their barrier call
Figure 6.2 Library call barriers.
Processes: P0, P1, ..., $P_{n-1}$
Barrier(); in each process
Counter, C
Increment and check for n
Figure 6.3 Barrier using a centralized counter.
Master
Arrival phase: for(i=0;i<n;i++) recv(Pany);
Departure phase: for(i=0;i<n;i++) send(Pi);
Slave processes
Barrier: send(Pmaster); recv(Pmaster);
Barrier: send(Pmaster); recv(Pmaster);
Figure 6.4 Barrier implementation in a message-passing system.
P0
P1
P2
P3
Arrival
at barrier
P4
P5
P6
P7
Synchronizing
message
Departure
from barrier
Figure 6.5 Tree barrier.
116
P0
P1
P2
P3
P4
P5
P6
P7
1st stage
Time
2nd stage
3rd stage
Figure 6.6 Butterfly construction.
117
Instruction
a[] = a[] + k;
Processors
a[0]=a[0]+k;
a[1]=a[1]+k;
a[n-1]=a[n-1]+k;
a[0]
a[1]
a[n-1]
Figure 6.7 Data parallel computation.
118
Numbers: x0 to x15
Each position is labelled with the range of i summed so far (for example, i=0 to 15 in the last position after the final step)
Step 1 (j = 0): Add
Step 2 (j = 1): Add
Step 3 (j = 2): Add
Final step (j = 3): Add
Figure 6.8 Data parallel prefix sum operation.
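In Figure 6.8, step j adds to every element the value held 2^j positions to its left, so after log n steps each position holds the sum of all numbers up to and including itself. The sketch below simulates those steps sequentially. The function name `prefix_sum` and the `MAX_N` bound are assumptions; the copy into `prev` stands in for the data parallel rule that all reads in a step happen before any writes.

```c
#include <assert.h>
#include <string.h>

#define MAX_N 1024   /* largest array this sketch accepts (assumption) */

/* Replace x[i] with x[0] + ... + x[i] using the log n step scheme:
   in step j, every x[i] with i >= 2^j adds the old value of x[i - 2^j]. */
void prefix_sum(int x[], int n)
{
    int prev[MAX_N];

    assert(n >= 0 && n <= MAX_N);
    for (int stride = 1; stride < n; stride *= 2) {   /* stride = 2^j */
        memcpy(prev, x, (size_t)n * sizeof(int));     /* values before this step */
        for (int i = stride; i < n; i++)              /* body of the forall */
            x[i] = prev[i] + prev[i - stride];
    }
}
```

For the input 1, 2, ..., 8 the three steps produce 1, 3, 6, 10, 15, 21, 28, 36.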
Computed
value
Error
Exact value
t+1
Iteration
Figure 6.9 Convergence rate.
120
Process 0: data $x_0$
Process 1: data $x_1$
Process n - 1: data $x_{n-1}$
Send buffer
Receive buffer
Allgather(); in every process
Figure 6.10 Allgather operation.
Vertical axis: Execution time ( = 1), 0 to $2 \times 10^6$
Horizontal axis: Number of processors, p (0 to 32)
Curves: Overall, Communication, Computation
Figure 6.11 Effects of computation and communication in Jacobi iteration.
Metal plate
Enlarged
$h_{i-1,j}$
$h_{i,j-1}$, $h_{i,j}$, $h_{i,j+1}$
$h_{i+1,j}$
Figure 6.12 Heat distribution problem.
$x_1, x_2, \ldots, x_{k-1}, x_k$
$x_{k+1}, x_{k+2}, \ldots, x_{2k-1}, x_{2k}$
$x_{i-k}$
$x_{i-1}, x_i, x_{i+1}$
$x_{i+k}$
$x_{k^2}$
Figure 6.13 Natural ordering of heat distribution problem.
Processes arranged by row and column; each process shown executes:
send(g, Pi-1,j);
send(g, Pi+1,j);
send(g, Pi,j-1);
send(g, Pi,j+1);
recv(w, Pi-1,j);
recv(x, Pi+1,j);
recv(y, Pi,j-1);
recv(z, Pi,j+1);
Figure 6.14 Message passing for heat distribution problem.
P0, P1, ..., $P_{p-1}$
Blocks
Strips (columns)
Figure 6.15 Partitioning heat distribution problem.
Square blocks: side $n/\sqrt{p}$
Strips
Figure 6.16 Communication consequences of partitioning.
Vertical axis: $t_{startup}$ (0 to 2000)
Horizontal axis: Processors, p (1 to 1000, logarithmic scale)
Regions: Strip partition best, Block partition best
Figure 6.17 Startup times for block and strip partitions.
Process i
Array held
by process i
One row
of points
Ghost points
Copy
Array held
by process i+1
Process i+1
Figure 6.18 Configuring array into contiguous rows for each process, with ghost points.
20°C
100°C
4 ft
10 ft
10 ft
Figure 6.19 Room for Problem 6-14.
vehicle
Figure 6.20 Road junction for Problem 6-16.
Airflow
Actual dimensions
selected at will
Figure 6.21 Figure for Problem 6-23.
132
Processors P0 to P5 against Time
(a) Imperfect load balancing leading to increased execution time
(b) Perfect load balancing
Figure 7.1 Load balancing.
Work pool
Queue
Master
process
Tasks
Send task
Request task
(and possibly
submit new tasks)
Slave worker processes
Figure 7.2 Centralized work pool.
134
Initial tasks
Master, Pmaster
Process M0
Process $M_{n-1}$
Slaves
Figure 7.3 A distributed work pool.
135
Process
Process
Requests/tasks
Process
Process
Figure 7.4 Decentralized work pool.
136
Slave Pi
Requests
Local
selection
algorithm
Requests
Slave Pj
Local
selection
algorithm
Figure 7.5 Decentralized selection algorithm requesting tasks between slaves.
137
Master process
P0, P1, P2, P3, ..., $P_{n-1}$
Figure 7.6 Load balancing using a pipeline structure.
Pcomm
If buffer empty,
make request
Request for task
Receive task
from request
If buffer full,
send task
If free,
request
task
Receive
task from
request
Ptask
Figure 7.7 Using a communication process in line load balancing.
139
P0
Task
when
requested
P1
P3
P2
P5
P4
P6
Figure 7.8 Load balancing using a tree.
140
Parent
Process
Inactive
Final
acknowledgment
First task
Acknowledgment
Task
Other processes
Active
Figure 7.9 Termination using message acknowledgments.
Token passed to next processor when reached local termination condition
P0, P1, P2, ..., $P_{n-1}$
Figure 7.10 Ring termination detection algorithm.
Token
AND
Terminated
Figure 7.11 Process algorithm for local termination.
Task
P0, ..., Pj, ..., Pi, ..., $P_{n-1}$
Figure 7.12 Passing task to previous processes.
AND
Terminated
AND
AND
Terminated
Terminated
Figure 7.13 Tree termination.
145
Summit
F
E
D
C
A
Base camp
Possible intermediate camps

Figure 7.14
Climbing a mountain.
146
17
E
9
51
24
D
13
10
A
14
8
Figure 7.15 Graph of mountain climb.
147
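Source (rows) against destination (columns); - marks no edge:
     A    B    C    D    E    F
A    -   10    -    -    -    -
B    -    -    8   13   24   51
C    -    -    -   14    -    -
D    -    -    -    -    9    -
E    -    -    -    -    -   17
F    -    -    -    -    -    -
(a) Adjacency matrix
Source vertex, followed by (destination, weight) entries; each list ends in NULL:
A: B 10
B: C 8, D 13, E 24, F 51
C: D 14
D: E 9
E: F 17
F:
(b) Adjacency list
Figure 7.16 Representing a graph.
Vertex i, distance $d_i$
Vertex j, distance $d_j$
Edge weight $w_{i,j}$
Figure 7.17 Moore's shortest-path algorithm.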
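Figure 7.17 shows the single update at the heart of Moore's algorithm: if reaching vertex j through vertex i (d_i + w_i,j) is shorter than the current d_j, then d_j is replaced and vertex j is queued for re-examination. The sketch below applies that update with a first-in, first-out vertex queue. The function name `moore_shortest_paths` is hypothetical, and the weight table is an assumption reconstructed from the adjacency list in Figure 7.16, with 0 standing for "no edge".

```c
#include <assert.h>
#include <limits.h>

#define NV 6          /* vertices A..F numbered 0..5 */
#define NO_EDGE 0     /* marker for a missing edge in w[][] */

/* Directed edge weights, reconstructed from Figure 7.16 (assumption). */
static const int w[NV][NV] = {
    /* A */ {0, 10, 0,  0,  0,  0},
    /* B */ {0,  0, 8, 13, 24, 51},
    /* C */ {0,  0, 0, 14,  0,  0},
    /* D */ {0,  0, 0,  0,  9,  0},
    /* E */ {0,  0, 0,  0,  0, 17},
    /* F */ {0,  0, 0,  0,  0,  0}
};

/* Compute the shortest distance from source to every vertex.
   Unreachable vertices keep the value INT_MAX. */
void moore_shortest_paths(int source, int dist[NV])
{
    int queue[NV], head = 0, count = 0;   /* circular FIFO of vertices */
    int in_queue[NV] = {0};

    assert(source >= 0 && source < NV);
    for (int i = 0; i < NV; i++)
        dist[i] = INT_MAX;
    dist[source] = 0;
    queue[0] = source;
    count = 1;
    in_queue[source] = 1;

    while (count > 0) {
        int i = queue[head];
        head = (head + 1) % NV;
        count--;
        in_queue[i] = 0;
        for (int j = 0; j < NV; j++) {
            if (w[i][j] == NO_EDGE)
                continue;
            int newdist = dist[i] + w[i][j];
            if (newdist < dist[j]) {
                dist[j] = newdist;
                if (!in_queue[j]) {           /* queue each vertex at most once */
                    queue[(head + count) % NV] = j;
                    count++;
                    in_queue[j] = 1;
                }
            }
        }
    }
}
```

Starting from A, the shortest route to F under these weights is A, B, D, E, F with total weight 49. The vertex queue plays the role of the work pool in the parallel versions.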
Master process
Start at
source
vertex
Vertex
Vertex w[]
w[]
New
distance
dist
dist
Process A
Vertex w[]
New
distance
Process C
Other processes
dist
Process B
Figure 7.18
Distributed graph search.
150
Entrance
Search path
Exit
Figure 7.19 Sample maze for Problem 7-9.
151
Gold
Entrance
Figure 7.20 Plan of rooms for Problem 7-10.
152
Room B
Door
Room A
Figure 7.21 Graph representation for Problem 7-10.
Bus
Cache
Processors
Memory modules
Figure 8.1 Shared memory multiprocessor using a single bus.
TABLE 8.1 SOME EARLY PARALLEL PROGRAMMING LANGUAGES
Language            Originator/date                   Comments
Concurrent Pascal   Brinch Hansen, 1975 (a)           Extension to Pascal
Ada                 U.S. Dept. of Defense, 1979 (b)   Completely new language
Modula-P            Bräunl, 1986 (c)                  Extension to Modula 2
C*                  Thinking Machines, 1987 (d)       Extension to C for SIMD systems
Concurrent C        Gehani and Roome, 1989 (e)        Extension to C
Fortran D           Fox et al., 1990 (f)              Extension to Fortran for data parallel programming
a. Brinch Hansen, P. (1975), The Programming Language Concurrent Pascal, IEEE Trans. Software Eng., Vol. 1, No. 2 (June), pp. 199-207.
b. U.S. Department of Defense (1981), The Programming Language Ada Reference Manual, Lecture Notes in Computer Science, No. 106, Springer-Verlag, Berlin.
c. Bräunl, T., R. Norz (1992), Modula-P User Manual, Computer Science Report, No. 5/92 (August), Univ. Stuttgart, Germany.
d. Thinking Machines Corp. (1990), C* Programming Guide, Version 6, Thinking Machines System Documentation.
e. Gehani, N., and W. D. Roome (1989), The Concurrent C Programming Language, Silicon Press, New Jersey.
f. Fox, G., S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu (1990), Fortran D Language Specification, Technical Report TR90-141, Dept. of Computer Science, Rice University.
Main program
FORK
Spawned processes
FORK
FORK
JOIN
JOIN
JOIN
JOIN
Figure 8.2 FORK-JOIN construct.
Code
Heap
IP
Stack
Interrupt routines
Files
(a) Process
Code
Stack
Heap
Thread
IP
Interrupt routines
Stack
Thread
IP
Files
(b) Threads
Figure 8.3 Differences between a process and threads.
Main program
thread1
pthread_create(&thread1, NULL, proc1, &arg);
proc1(&arg)
{
return(*status);
}
pthread_join(thread1, *status);
Figure 8.4 pthread_create() and pthread_join().
158
Main program
Thread
pthread_create();
pthread_create();
Thread
pthread_create();
Thread
Termination
Termination
Termination
Figure 8.5 Detached threads.
159
Shared variable, x
Write
Write
Read Read
+1
+1
Process 1
Process 2
Figure 8.6 Conflict in accessing shared variable.
Process 1
while (lock == 1) do_nothing;
lock = 1;
Critical section
lock = 0;
Process 2
while (lock == 1) do_nothing;
lock = 1;
Critical section
lock = 0;
Figure 8.7 Control of critical sections through busy waiting.
Resources: R1, R2
Processes: P1, P2
(a) Two-process deadlock
Resources: R1, R2, ..., $R_{n-1}$, $R_n$
Processes: P1, P2, ..., $P_{n-1}$, $P_n$
(b) n-process deadlock
Figure 8.8 Deadlock (deadly embrace).
Main memory
Block
7
6
5
4
3
2
1
0
Address
tag
Cache
Cache
Block in cache
Processor 1
Processor 2
Figure 8.9 False sharing in caches.
163
sum
Array a[]
addr
Figure 8.10 Shared memory locations for Section 8.4.1 program example.
164
global_index sum
Array a[]
addr
Figure 8.11 Shared memory locations for Section 8.4.2 program example.
165
TABLE 8.2 LOGIC CIRCUIT DESCRIPTION FOR FIGURE 8.12
Gate   Function   Input 1   Input 2   Output
1      AND        Test1     Test2     Gate1
2      NOT        Gate1               Output1
3      OR         Test3     Gate1     Output2
Inputs: Test1, Test2, Test3
Outputs: Output1, Output2
Figure 8.12 Sample logic circuit.
Log
Movement
of logs
River
Frog
Figure 8.13 River and frog for Problem 8-23.
167
Pool of threads
Request
Request
serviced
Slaves
Master Signal
Figure 8.14 Thread pool for Problem 8-24.
168
Compare a[i] with a[0] through a[n-1]
Increment counter, x
b[x] = a[i]
Figure 9.1 Finding the rank in parallel.
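Figure 9.1 outlines rank sort: each number is compared with every other number, a counter x records how many are smaller, and the number is then placed at b[x]. The sketch below is a sequential version; the function name `rank_sort` is hypothetical, and the code assumes all numbers are distinct, since equal values would receive the same rank and overwrite one another.

```c
#include <assert.h>

/* Rank sort: the rank of a[i] is the count of elements smaller than
   it, which is also its position in the sorted output b[].
   Assumes the n values in a[] are distinct. */
void rank_sort(const int a[], int b[], int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int x = 0;                     /* counter, x */
        for (int j = 0; j < n; j++)
            if (a[i] > a[j])
                x++;
        b[x] = a[i];
    }
}
```

The outer loop iterations are independent, so with n processors each one can compute the rank of a single number.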
Compare a[i] with a[0], a[1], a[2], and a[3]: each comparison gives 0/1
Add pairs: each sum gives 0/1/2
Add (tree): final sum gives 0/1/2/3/4
Figure 9.2 Parallelizing the rank computation.
Master
a[]
b[]
Read
numbers
Place selected
number
Slaves
Figure 9.3 Rank sort using a master and slaves.
Sequence of steps
P1
A
P2
1
Send(A)
If A > B send(B)
else send(A)
If A > B load A
else load B
2
Compare
Figure 9.4 Compare and exchange on a message-passing system: Version 1.
P1
A
P2
1
Send(A)
B
Send(B)
2
If A > B load B
3
If A > B load A
Compare
Compare
Figure 9.5 Compare and exchange on a message-passing system: Version 2.
P1 original numbers: 88, 50, 28, 25
P2 original numbers: 98, 80, 43, 42
P2 merge: 98, 88, 80, 50, 43, 42, 28, 25
P2 keeps higher numbers: 98, 88, 80, 50
P2 returns lower numbers to P1 (final numbers): 43, 42, 28, 25
Figure 9.6 Merging two sublists: Version 1.
P1 original numbers: 88, 50, 28, 25
P2 original numbers: 98, 80, 43, 42
Both processes merge: 98, 88, 80, 50, 43, 42, 28, 25
P1 keeps lower numbers (final numbers)
P2 keeps higher numbers (final numbers)
Figure 9.7 Merging two sublists: Version 2.
Original
sequence:
Phase 1
Place
largest
number
Phase 2
Place
next
largest
number
Phase 3
Time
Figure 9.8 Steps in bubble sort.
176
Phase 1
1
1
Phase 2
1
Time
Phase 3
3
Phase 4
4
Figure 9.9 Overlapping bubble sort actions in a pipeline.
177
P0 to P7
Step
Time
Figure 9.10 Odd-even transposition sort sorting eight numbers.
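Odd-even transposition sort alternates two kinds of step: one compares and exchanges the pairs (0,1), (2,3), and so on, and the other the pairs (1,2), (3,4), and so on. The sketch below simulates the n steps sequentially; the function names `compare_exchange` and `odd_even_transposition_sort` are hypothetical. In a parallel version all pairs within one step would be handled at the same time.

```c
#include <assert.h>

/* Leave the smaller value in *lo and the larger value in *hi. */
static void compare_exchange(int *lo, int *hi)
{
    if (*lo > *hi) {
        int tmp = *lo;
        *lo = *hi;
        *hi = tmp;
    }
}

/* Sort a[0..n-1] into increasing order using n alternating steps. */
void odd_even_transposition_sort(int a[], int n)
{
    assert(n >= 0);
    for (int step = 0; step < n; step++) {
        int start = step % 2;   /* 0: pairs (0,1),(2,3)..; 1: pairs (1,2),(3,4).. */
        for (int i = start; i + 1 < n; i += 2)
            compare_exchange(&a[i], &a[i + 1]);
    }
}
```

Each step touches disjoint pairs, so with one number per process the whole sort takes n parallel steps.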
Smallest
number
Largest
number
Figure 9.11 Snakelike sorted list.
179
(a) Original placement of numbers
(b) Phase 1: Row sort
(c) Phase 2: Column sort
(d) Phase 3: Row sort
(e) Phase 4: Column sort
(f) Final phase: Row sort
Figure 9.12 Shearsort.
(a) Operations between elements in rows
(b) Transpose operation
(c) Operations between elements in rows (originally columns)
Figure 9.13 Using the transpose operation to maintain operations in rows.
Unsorted list
4
P0
P0
Divide
list
4
P4
P0
P0
P2
P1
P2
P0
P4
P3
P4
P2
P6
P5
P6
P4
P7
P6
Merge
2
Sorted list
P0
P4
P0
Process allocation
Figure 9.14 Mergesort using tree allocation of processes.
182
Unsorted list
Pivot
4
Sorted list
P0
P0
P0
P0
P4
P2
P6
P4
P6
P1
P7
Process allocation
Figure 9.15 Quicksort using tree allocation of processes.
183
Unsorted list
Pivot
4
6
Sorted list
Pivots
Figure 9.16 Quicksort showing pivot withheld in processes.
184
Work pool
Sublists
Request
sublist
Return
sublist
Slave processes
Figure 9.17 Work pool implementation of

quicksort.
185
Nodes: 000 through 111
(a) Phase 1: list split by pivot p1 into ≤ p1 and > p1
(b) Phase 2: lists split by pivots p2 and p3 into ≤ p2, > p2, ≤ p3, > p3
(c) Phase 3: lists split by pivots p4, p5, p6, and p7 into ≤ p4, > p4, ≤ p5, > p5, ≤ p6, > p6, ≤ p7, > p7
Figure 9.18 Hypercube quicksort algorithm when the numbers are originally in node 000.
Nodes: 000 through 111
(a) Phase 1: broadcast pivot, p1; numbers split into ≤ p1 and > p1
(b) Phase 2: broadcast pivot, p2 and broadcast pivot, p3; numbers split into ≤ p2, > p2, ≤ p3, > p3
(c) Phase 3: broadcast pivot, p4, p5, p6, and p7; numbers split into ≤ p4, > p4, ≤ p5, > p5, ≤ p6, > p6, ≤ p7, > p7
Figure 9.19 Hypercube quicksort algorithm when numbers are distributed among nodes.
Nodes: 000 through 111
(a) Phase 1 communication
(b) Phase 2 communication
(c) Phase 3 communication
Figure 9.20 Hypercube quicksort communication.
Nodes in Gray code order: 000, 001, 011, 010, 110, 111, 101, 100
(a) Phase 1: broadcast pivot, p1; numbers split into ≤ p1 and > p1
(b) Phase 2: broadcast pivot, p2 and broadcast pivot, p3; numbers split into ≤ p2, > p2, ≤ p3, > p3
(c) Phase 3: broadcast pivot, p4, p5, p6, and p7; numbers split into ≤ p4, > p4, ≤ p5, > p5, ≤ p6, > p6, ≤ p7, > p7
Figure 9.21 Quicksort hypercube algorithm with Gray code ordering.
Figure 9.22 Odd-even merging of two sorted lists.
Sorted lists a[] and b[]: 2 4 5 8 and 1 3 6 7.
Merge of even-index elements: c[] = 1 2 5 6. Merge of odd-index elements: d[] = 3 4 7 8.
Compare and exchange produces the final sorted list e[].
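The values in Figure 9.22 can be traced with a short recursive sketch of odd-even merging. It assumes zero-based indexing, equal-length inputs whose length is a power of two, and that a[] holds 2 4 5 8 and b[] holds 1 3 6 7; the function name is illustrative.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length."""
    n = len(a)
    if n == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    c = odd_even_merge(a[0::2], b[0::2])  # elements at even indices
    d = odd_even_merge(a[1::2], b[1::2])  # elements at odd indices
    e = [c[0]]                            # smallest element overall
    for i in range(n - 1):
        # Compare and exchange c[i + 1] with d[i].
        e.append(min(c[i + 1], d[i]))
        e.append(max(c[i + 1], d[i]))
    e.append(d[-1])                       # largest element overall
    return e


a = [2, 4, 5, 8]  # assumed assignment of the figure's values to a[]
b = [1, 3, 6, 7]  # assumed assignment of the figure's values to b[]
print(odd_even_merge(a[0::2], b[0::2]))  # c[]
print(odd_even_merge(a[1::2], b[1::2]))  # d[]
print(odd_even_merge(a, b))              # e[]
```

The two recursive merges are independent, which is what allows them to run on separate processes.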
Figure 9.23 Odd-even mergesort.
Labels: inputs a1, a2, a3, a4, ..., an-1, an and b1, b2, b3, b4, ..., bn-1, bn; odd mergesort; even mergesort; compare and exchange; outputs c1 to c7, ..., c2n-2, c2n-1, c2n.
Figure 9.24 Bitonic sequences.
Panels: (a) single maximum; (b) single maximum and single minimum. Axes: value against a0, a1, a2, a3, ..., an-2, an-1.
Figure 9.25 Creating two bitonic sequences from one bitonic sequence.
Labels: bitonic sequence; compare and exchange; two bitonic sequences.
Figure 9.26 Sorting a bitonic sequence.
Labels: unsorted numbers; compare and exchange; sorted list.
Figure 9.27 Bitonic mergesort.
Labels: unsorted numbers; bitonic sorting operation; direction of increasing numbers; sorted list.
Figure 9.28 Bitonic mergesort on eight numbers.
Compare and exchange ai with ai+n/2 (n numbers). Each list shown is a bitonic list [Fig. 9.24 (a) or (b)].
Step 1 (n = 2): ai with ai+1. Forms bitonic lists of four numbers.
Step 2 (n = 4): ai with ai+2 (split), then (n = 2) ai with ai+1 (sort). Forms a bitonic list of eight numbers.
Step 3 (n = 8): ai with ai+4 (split), then (n = 4) ai with ai+2 (split), then (n = 2) ai with ai+1 (sort). Sorts the bitonic list into lower and higher halves.
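The compare-and-exchange distances listed for Figure 9.28 (1; 2, 1; 4, 2, 1) follow from the usual iterative formulation of bitonic mergesort. The sketch below is one common way to write it, not necessarily the formulation used in the slides; it assumes the list length is a power of two, and the sample data is invented for illustration.

```python
def compare_distances(n):
    """Distances j for which a[i] is compared with a[i + j], in order."""
    out = []
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            out.append(j)
            j //= 2
        k *= 2
    return out


def bitonic_sort(values):
    """Sort a list whose length is a power of two into increasing order."""
    a = list(values)
    n = len(a)
    if (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                 # k is the size of the lists being formed
        j = k // 2
        while j >= 1:             # j is the compare-and-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(compare_distances(8))
print(bitonic_sort([8, 3, 4, 7, 9, 2, 1, 5]))  # illustrative input
```

All exchanges at a given distance touch disjoint pairs, so each inner pass could be executed by n/2 processes at once.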
Figure 9.29 Compare-and-exchange algorithm for Problem 9-5.
Step 1: 88 50 28 25 | 98 80 43 42
Step 2: 50 42 28 25 | 98 88 80 43
Step 3: 43 42 28 25 | 98 88 80 50
Terminates when insertions at top/bottom of lists.
Figure 10.1 An n × m matrix.
Labels: rows and columns; elements a0,0, a0,1, ..., a0,m-2, a0,m-1 through an-1,0, an-1,1, ..., an-1,m-2, an-1,m-1.
Figure 10.2 Matrix multiplication, C = A × B.
Labels: row i of A; column; multiply; sum results; ci,j.
Figure 10.3 Matrix-vector multiplication c = A × b.
Labels: row i; row sum; ci.
Figure 10.4 Block matrix multiplication.
Labels: multiply; sum results.
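Figure 10.5 Submatrix multiplication.
(a) Matrices: A and B are 4 × 4, with elements a0,0 to a3,3 and b0,0 to b3,3.
(b) Multiplying A0,0 × B0,0 to obtain C0,0:
$$A_{0,0} \times B_{0,0} = \begin{bmatrix} a_{0,0} & a_{0,1} \\ a_{1,0} & a_{1,1} \end{bmatrix} \begin{bmatrix} b_{0,0} & b_{0,1} \\ b_{1,0} & b_{1,1} \end{bmatrix} = \begin{bmatrix} a_{0,0}b_{0,0} + a_{0,1}b_{1,0} & a_{0,0}b_{0,1} + a_{0,1}b_{1,1} \\ a_{1,0}b_{0,0} + a_{1,1}b_{1,0} & a_{1,0}b_{0,1} + a_{1,1}b_{1,1} \end{bmatrix}$$
$$A_{0,1} \times B_{1,0} = \begin{bmatrix} a_{0,2} & a_{0,3} \\ a_{1,2} & a_{1,3} \end{bmatrix} \begin{bmatrix} b_{2,0} & b_{2,1} \\ b_{3,0} & b_{3,1} \end{bmatrix} = \begin{bmatrix} a_{0,2}b_{2,0} + a_{0,3}b_{3,0} & a_{0,2}b_{2,1} + a_{0,3}b_{3,1} \\ a_{1,2}b_{2,0} + a_{1,3}b_{3,0} & a_{1,2}b_{2,1} + a_{1,3}b_{3,1} \end{bmatrix}$$
$$C_{0,0} = A_{0,0} \times B_{0,0} + A_{0,1} \times B_{1,0} = \begin{bmatrix} a_{0,0}b_{0,0} + a_{0,1}b_{1,0} + a_{0,2}b_{2,0} + a_{0,3}b_{3,0} & a_{0,0}b_{0,1} + a_{0,1}b_{1,1} + a_{0,2}b_{2,1} + a_{0,3}b_{3,1} \\ a_{1,0}b_{0,0} + a_{1,1}b_{1,0} + a_{1,2}b_{2,0} + a_{1,3}b_{3,0} & a_{1,0}b_{0,1} + a_{1,1}b_{1,1} + a_{1,2}b_{2,1} + a_{1,3}b_{3,1} \end{bmatrix}$$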
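The submatrix products of Figure 10.5 can be checked numerically. The following sketch multiplies two square matrices block by block and is meant only as an illustration; the helper names and the block size parameter s are assumptions, and n is assumed to be a multiple of s.

```python
def matmul(a, b):
    """Ordinary matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(a, b):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def block(m, p, q, s):
    """Return the s x s submatrix in block row p and block column q."""
    return [row[q * s:(q + 1) * s] for row in m[p * s:(p + 1) * s]]


def block_matmul(a, b, s):
    """Compute C = A x B using s x s submatrices."""
    n = len(a)
    nb = n // s
    c = [[0] * n for _ in range(n)]
    for p in range(nb):
        for q in range(nb):
            acc = [[0] * s for _ in range(s)]
            for r in range(nb):
                # C[p][q] accumulates A[p][r] x B[r][q], as in the figure.
                acc = matadd(acc, matmul(block(a, p, r, s), block(b, r, q, s)))
            for i in range(s):
                c[p * s + i][q * s:(q + 1) * s] = acc[i]
    return c


A = [[4 * i + j for j in range(4)] for i in range(4)]
B = [[(i + 2 * j) % 5 for j in range(4)] for i in range(4)]
print(block_matmul(A, B, 2) == matmul(A, B))
```

Each (p, q) iteration is independent of the others, so the submatrix products can be assigned to separate processes.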
Figure 10.6 Direct implementation of matrix multiplication.
Labels: row i, a[i][]; column j, b[][j]; processor Pi,j; c[i][j].
Figure 10.7 Accumulation using a tree construction.
Labels: products a0,0 b0,0, a0,1 b1,0, a0,2 b2,0, a0,3 b3,0 on P0 to P3; partial sums on P0 and P2; final sum c0,0 on P0.
Figure 10.8 Submatrix multiplication and summation.
Labels: submatrices App, Apq, Aqp, Aqq and Bpp, Bpq, Bqp, Bqq; processors P0 to P7; P0 + P1 gives Cpp; P2 + P3 gives Cpq; P4 + P5 gives Cqp; P6 + P7 gives Cqq.
Figure 10.9 Movement of A and B elements.
Labels: A; B; i; processor Pi,j.
Figure 10.10 Step 2: Alignment of elements of A and B.
Labels: row i of A shifted i places, giving ai,j+i; column j of B shifted j places, giving bi+j,j.
Figure 10.11 Step 4: One-place shift of elements of A and B.
Labels: A; B; i; j; processor Pi,j.
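The alignment and one-place shift steps of Figures 10.10 and 10.11 can be simulated sequentially. The sketch below assumes an n × n mesh with wraparound connections holding one element of A and one of B per processor, which is the arrangement commonly known as Cannon's algorithm; that name and the function name are assumptions, since the captions do not name the method.

```python
def mesh_matmul(a, b):
    """Simulate matrix multiplication by alignment and repeated shifts."""
    n = len(a)
    # Alignment: row i of A moves i places left, column j of B moves j
    # places up, so processor (i, j) holds a[i][j + i] and b[i + j][j].
    a = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every processor multiplies its current pair and accumulates.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # One-place shift: A moves left along rows, B moves up along columns.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(mesh_matmul(A, B))
```

After alignment, the indices of the pair held at each processor always agree on the summation index, so n multiply steps cover every term of c[i][j].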
Figure 10.12 Matrix multiplication using a systolic array.
Labels: pumping action; rows of A enter as a0,3 a0,2 a0,1 a0,0 through a3,3 a3,2 a3,1 a3,0, each row one cycle delayed; columns of B enter as b3,0 b2,0 b1,0 b0,0 through b3,3 b2,3 b1,3 b0,3; cells hold c0,0 to c3,3.
Figure 10.13 Matrix-vector multiplication using a systolic array.
Labels: pumping action; rows of A enter as a0,3 a0,2 a0,1 a0,0 through a3,3 a3,2 a3,1 a3,0; vector b3 b2 b1 b0; cells hold c0 to c3.
Figure 10.14 Gaussian elimination.
Labels: row i; row j; column i; element aji; step through; already cleared to zero; cleared to zero.
Figure 10.15 Broadcast in parallel implementation of Gaussian elimination.
Labels: row i; n − i + 1 elements (including b[i]); broadcast ith row; already cleared to zero.
Figure 10.16 Pipeline implementation of Gaussian elimination.
Labels: P0, P1, P2, ..., Pn-1; row; broadcast rows.
Figure 10.17 Strip partitioning.
Labels: row boundaries 0, n/p, 2n/p, 3n/p; P0 to P3.
Figure 10.18 Cyclic partitioning to equalize workload.
Labels: row boundaries 0, n/p, 2n/p, 3n/p; P0, P1.
Figure 10.19 Finite difference method.
Labels: solution space; f(x, y); axes x and y.
Figure 10.20 Mesh of points numbered in natural order.
Labels: points x1 to x100; boundary points (see text).
Figure 10.21 Sparse matrix for Laplace's equation.
The ith equation (row i of A) has five nonzero entries: ai,i-n = 1, ai,i-1 = 1, ai,i = −4, ai,i+1 = 1, ai,i+n = 1.
Unknown vector x: x1, x2, ..., xN-1, xN. Right-hand side: 0, 0, ..., 0, 0.
Notes: those equations with a boundary point on the diagonal are unnecessary for the solution; the right-hand side is to include boundary values and some zero entries (see text).
Figure 10.22 Gauss-Seidel relaxation with natural order, computed sequentially.
Labels: sequential order of computation; point computed; point to be computed.
Figure 10.23 Red-black ordering.
Labels: red; black.
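Red-black ordering (Figure 10.23) splits the mesh points like a checkerboard, so that every point of one colour depends only on points of the other colour. A minimal sketch for Laplace's equation follows; it assumes the four-point average as the update rule and fixed boundary values, and the function name and demo values are illustrative.

```python
def red_black_sweep(grid):
    """Perform one red-black Gauss-Seidel sweep over the interior points."""
    rows, cols = len(grid), len(grid[0])
    for colour in (0, 1):          # 0 = "red" points, 1 = "black" points
        # All points of one colour could be updated simultaneously,
        # because their four neighbours all have the other colour.
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                if (i + j) % 2 == colour:
                    grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                         grid[i][j - 1] + grid[i][j + 1])


# Demo: a 6 x 6 mesh whose edges are held at 20.0, interior started at 0.0.
size = 6
grid = [[20.0 if i in (0, size - 1) or j in (0, size - 1) else 0.0
         for j in range(size)] for i in range(size)]
for _ in range(200):
    red_black_sweep(grid)
print(round(grid[2][3], 6))
```

In a parallel version each sweep becomes two phases separated by a synchronization point, one per colour.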
Figure 10.24 Nine-point stencil.
Figure 10.25 Multigrid processor allocation.
Labels: coarsest grid points; finer grid points; processor.
Figure 10.26 Printed circuit board for Problem 10-18.
Labels: regions at 40°C, 50°C, and 60°C; ambient temperature at edges of board = 20°C.
Figure 11.1 Pixmap.
Labels: origin (0, 0); i; picture element (pixel) p(i, j).
Figure 11.2 Image histogram.
Axes: number of pixels against gray level (up to 255).
Figure 11.3 Pixel values for a 3 × 3 group.
Labels: x0 x1 x2 / x3 x4 x5 / x6 x7 x8.
Figure 11.4 Four-step data transfer for the computation of mean.
Step 1: each pixel adds pixel from left. Step 2: each pixel adds pixel from right. Step 3: each pixel adds pixel from above. Step 4: each pixel adds pixel from below.
Figure 11.5 Parallel mean data accumulation.
Panels: (a) Step 1; (b) Step 2; (c) Step 3; (d) Step 4.
Partial sums shown for pixels x0 to x8: x0 + x1, x3 + x4, x6 + x7, then the row sums x0 + x1 + x2, x3 + x4 + x5, and x6 + x7 + x8.
Figure 11.6 Approximate median algorithm requiring six steps.
Labels: largest in row; next largest in row; next largest in column.
Figure 11.7 Using a 3 × 3 weighted mask.
Labels: mask w0 to w8; pixels x0 to x8; result x4'.
Figure 11.8 Mask to compute mean.
Labels: k = 9.
Figure 11.9 A noise reduction mask.
Labels: k = 16.
Figure 11.10 High-pass sharpening filter mask.
Labels: k = 9.
Figure 11.11 Edge detection using differentiation.
Labels: intensity transition; first derivative; second derivative.
Figure 11.12 Gray level gradient and direction.
Labels: image; f(x, y); constant intensity; gradient; y.
Figure 11.13 Prewitt operator.
Figure 11.14 Sobel operator.
Figure 11.15 Edge detection with Sobel operator.
Panels: (a) original image (Annabel); (b) effect of Sobel operator.
Figure 11.16 Laplace operator.
Figure 11.17 Pixels used in Laplace operator.
Labels: upper pixel x1; left pixel x3; x4; right pixel x5; lower pixel x7.
Figure 11.18 Effect of Laplace operator.
Figure 11.19 Mapping a line into (a, b) space.
Panels: (a) (x, y) plane: line y = ax + b through the pixel in image (x1, y1); (b) parameter space: lines b = −x1a + y1 and b = −xa + y, with the point (a, b).
Figure 11.20 Mapping a line into (r, θ) space.
Panels: (a) (x, y) plane: line y = ax + b, with r = x cos θ + y sin θ; (b) (r, θ) plane: point (r, θ).
Figure 11.21 Normal representation using image coordinate system.
Figure 11.22 Accumulators, acc[r][θ], for the Hough transform.
Labels: accumulator; axis ticks 0, 5, 10, 15 and 0, 10, 20, 30.
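The accumulator array acc[r][θ] of Figure 11.22 can be filled by evaluating r = x cos θ + y sin θ for every edge pixel and every sampled angle. The sketch below assumes one-degree angle steps over 0° to 179°, rounds r to the nearest integer, and ignores negative r; these choices and the function name are assumptions made for illustration.

```python
import math


def hough_accumulate(points, r_max, theta_steps=180):
    """Return acc[r][t], the votes for each (r, theta) pair."""
    acc = [[0] * theta_steps for _ in range(r_max + 1)]
    for x, y in points:
        for t in range(theta_steps):
            theta = math.pi * t / theta_steps
            r = round(x * math.cos(theta) + y * math.sin(theta))
            if 0 <= r <= r_max:
                acc[r][t] += 1      # one vote per pixel per angle
    return acc


# Ten pixels on the vertical line x = 5 all vote for r = 5 at theta = 0.
points = [(5, y) for y in range(10)]
acc = hough_accumulate(points, r_max=20)
print(acc[5][0])
```

Each pixel's votes are independent of every other pixel's, so the outer loop can be divided among processes, provided that updates to shared accumulators are combined safely.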
Figure 11.23 Two-dimensional DFT.
Labels: xjk; transform rows; Xjm; transform columns; Xlm.
Figure 11.24 Convolution using Fourier transforms.
Panels: (a) direct convolution: image fj,k convolved with filter hj,k gives gj,k; (b) using Fourier transform: image f(j, k) and filter h(j, k) are transformed to F(j, k) and H(j, k), multiplied to give G(j, k), and inverse transformed to g(j, k).
Figure 11.25 Master-slave approach for implementing the DFT directly.
Labels: master process; w0, w1, ..., wn-1; slave processes; X[0], X[1], ..., X[n-1].
Figure 11.26 One stage of a pipeline implementation of DFT algorithm.
Labels: process j; inputs x[j], X[k], a, wk; values for next iteration X[k] + a × x[j], a × wk, wk.
Figure 11.27 Discrete Fourier transform with a pipeline.
Panels: (a) pipeline structure: stages P0, P1, P2, P3, ..., PN-1 receive x[0], x[1], x[2], x[3], ..., x[N-1] and pass on X[k], a, and wk; output sequence X[0], X[1], X[2], X[3], ...; (b) timing diagram: pipeline stages P0, P1, P2, ..., PN-2, PN-1 against time, producing X[0] to X[6].
Figure 11.28 Decomposition of N-point DFT into two N/2-point DFTs.
Labels: input sequence x0, x1, x2, x3, ..., xN-2, xN-1; two N/2-point DFTs producing Xeven and Xodd; combined using wk to give the transform outputs Xk and Xk+N/2; k = 0, 1, ..., N/2.
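The decomposition of Figure 11.28 leads directly to a recursive fast Fourier transform. The sketch below assumes w = e^(−2πi/N), no 1/N scale factor, and N a power of two; the slides may use a different sign or scaling convention, so treat these as assumptions. A direct DFT is included only for comparison.

```python
import cmath


def dft(x):
    """Direct N-point DFT, O(N^2), using w = exp(-2*pi*i/N)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def fft(x):
    """Recursive FFT built from two N/2-point transforms."""
    n = len(x)
    if n == 1:
        return list(x)
    x_even = fft(x[0::2])          # transform of even-indexed inputs
    x_odd = fft(x[1::2])           # transform of odd-indexed inputs
    out = [0] * n
    for k in range(n // 2):
        wk = cmath.exp(-2j * cmath.pi * k / n)
        out[k] = x_even[k] + wk * x_odd[k]
        out[k + n // 2] = x_even[k] - wk * x_odd[k]
    return out


samples = [1, 2, 3, 4, 0, 0, 0, 0]   # illustrative input
print(max(abs(p - q) for p, q in zip(fft(samples), dft(samples))) < 1e-9)
```

The two half-size transforms do not depend on each other, which is the source of parallelism in the FFT.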
Figure 11.29 Four-point discrete Fourier transform.
Labels: inputs x0 to x3; outputs X0 to X3.
Figure 11.30 Sixteen-point DFT decomposition.
Xk = (0,2,4,6,8,10,12,14) + wk(1,3,5,7,9,11,13,15)
= {(0,4,8,12) + wk(2,6,10,14)} + wk{(1,5,9,13) + wk(3,7,11,15)}
= {[(0,8) + wk(4,12)] + wk[(2,10) + wk(6,14)]} + wk{[(1,9) + wk(5,13)] + wk[(3,11) + wk(7,15)]}
Resulting input order, with binary index: x0 (0000), x8 (1000), x4 (0100), x12 (1100), x2 (0010), x10 (1010), x6 (0110), x14 (1110), x1 (0001), x9 (1001), x5 (0101), x13 (1101), x3 (0011), x11 (1011), x7 (0111), x15 (1111).
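The input order in Figure 11.30 is the bit-reversed order of the indices 0 to 15. A short sketch that reproduces it follows; the function name is illustrative.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    return int(format(i, "0{}b".format(bits))[::-1], 2)


# Position p in the reordered input holds x[bit_reverse(p, 4)].
order = [bit_reverse(p, 4) for p in range(16)]
print(order)
print([format(i, "04b") for i in order])
```

Reversing the bits twice returns the original index, so the same permutation maps the reordered sequence back to natural order.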
Figure 11.31 Sixteen-point FFT computational flow.
Labels: inputs x0 to x15; outputs X0 to X15.
Figure 11.32 Mapping processors onto 16-point FFT computation.
Labels: process and row (P/r); inputs x0 (0000) to x15 (1111); processes P0 to P3; outputs X0 to X15.
Figure 11.33 FFT using transpose algorithm: first two steps.
Labels: P0 to P3; elements x0 to x15.
Figure 11.34 Transposing array for transpose algorithm.
Labels: P0 to P3; elements x0 to x15.
Figure 11.35 FFT using transpose algorithm: last two steps.
Labels: P0 to P3; elements in transposed order x0 x4 x8 x12, x1 x5 x9 x13, x2 x6 x10 x14, x3 x7 x11 x15.
Figure 11.36 Image for Problem 11-3.
Labels: values 7, 6, 5, 4, 3, 2; mask.
Figure 12.1 State space tree.
Labels: first choice: C0 or not including C0; second choice: C1 or not including C1; third choice; ...; Cn-1 or not including Cn-1.
Figure 12.2 Single-point crossover.
Labels: Parent A = A1 A2 and Parent B = B1 B2, each with positions 1 to m and cut between positions p and p + 1; Child 1 = A1 B2; Child 2 = B1 A2.
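Single-point crossover (Figure 12.2) cuts both parents after position p and swaps the tails. A minimal sketch follows, assuming the parents are equal-length sequences and that Child 1 takes the head of Parent A; the function name and the bit-string parents are invented for illustration.

```python
def single_point_crossover(parent_a, parent_b, p):
    """Cut both parents after the first p elements and swap the tails."""
    child1 = parent_a[:p] + parent_b[p:]   # A1 followed by B2
    child2 = parent_b[:p] + parent_a[p:]   # B1 followed by A2
    return child1, child2


a = "11111111"   # hypothetical Parent A
b = "00000000"   # hypothetical Parent B
print(single_point_crossover(a, b, 3))
```

In a genetic algorithm the cut position p would normally be drawn at random for each pair of parents.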
Figure 12.3 Island model.
Labels: subpopulation; migration path (every island sends to every other island).
Figure 12.4 Stepping stone model.
Labels: island subpopulations; limited migration path.
Figure D.1 PRAM model.
Labels: program; instructions; clock; processors with local memory; data; shared memory.
Figure D.2 List ranking by pointer jumping.
Labels: list elements 0 to 7, each with a distance d[i] and a successor pointer s[i]; initial values d[i] = 1 for i = 0 to 6 and d[7] = 0; s[7] = Null.
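List ranking by pointer jumping (Figure D.2) can be simulated sequentially by computing each round from a snapshot of the previous one, which mimics the synchronous steps of a PRAM. The sketch below assumes that None plays the role of the null pointer and that the last element starts with distance 0; the function name is illustrative.

```python
def list_rank(successor):
    """Return, for each element, its distance from the end of the list."""
    n = len(successor)
    s = list(successor)
    d = [0 if s[i] is None else 1 for i in range(n)]
    rounds = 0
    while any(p is not None for p in s):
        # Each round reads only the old d and s, as if all elements
        # were updated by separate processors at the same time.
        new_d, new_s = d[:], s[:]
        for i in range(n):
            if s[i] is not None:
                new_d[i] = d[i] + d[s[i]]
                new_s[i] = s[s[i]]
        d, s = new_d, new_s
        rounds += 1
    return d, rounds


# Eight elements linked 0 -> 1 -> ... -> 7 -> null.
chain = [1, 2, 3, 4, 5, 6, 7, None]
print(list_rank(chain))
```

Every round doubles the distance each pointer spans, so a list of n elements needs about log2(n) rounds.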
Figure D.3 A view of the bulk synchronous parallel model.
Labels: threads or processes; local computation (maximum time w); communication (maximum of h sends or receives); barrier synchronization.
Figure D.4 LogP parameters.
Labels: processors Pi and Pk; message; next message; time.
