
C MPI

: 91 1 1

: (03) 5776085 x 305

E-mail : c00tch00@nchc.gov.tw

C MPI ........................................................................................................1
.............................................................................................................................4
1.1  MPI ...................................................................................................5
1.2  ........................................................................6
1.3  IBM SP2 MPI ...................................................................................7
1.3.1  IBM SP2 MPI C ..............................................................7
1.3.2  IBM SP2 Job command file ......................................................................7
1.3.3  IBM SP2 ..............................................................9
1.4  PC Cluster MPI ...............................................................................11
1.4.1  PC Cluster C MPI ...............................................11
1.4.2  PC Cluster Job command file .............................................................12
1.4.3  PC Cluster ............................................................13
...................................................................................14
2.1  MPI .........................................................................................................15
2.1.1  mpi.h include file ..............................................................................................15
2.1.2  MPI_Init, MPI_Finalize ....................................................................................15
2.1.3  MPI_Comm_size, MPI_Comm_rank ...............................................................16
2.1.4  MPI_Send, MPI_Recv ......................................................................................17
2.2  T2SEQ ....................................................................20
2.3  T2CP ...............................................................................22
2.4  MPI_Scatter, MPI_Gather, MPI_Reduce .............................................................27
2.4.1  MPI_Scatter, MPI_Gather ..............................................................................27
2.4.2  MPI_Reduce, MPI_Allreduce ..........................................................................29
2.5  T2DCP ................................................................................31
...............................................................................35
3.1  MPI_Sendrecv, MPI_Bcast .......................................................................................36
3.1.1  MPI_Sendrecv ..............................................................................................36
3.1.2  MPI_Bcast .....................................................................................................36
3.2  T3SEQ ........................................................................38
3.3  T3CP .......................................................40
3.4  () T3DCP_1 ............................................47
3.5  () T3DCP_2 ............................................52
...................................................................................57
4.1  T4SEQ ....................................................................58
4.2  MPI_Scatterv, MPI_Gatherv .................................................................................60
4.3  MPI_Pack, MPI_Unpack, MPI_Barrier, MPI_Wtime ......................................62
4.3.1  MPI_Pack, MPI_Unpack .............................................................................62
4.3.2  MPI_Barrier, MPI_Wtime ...........................................................................64
4.4  T4DCP ................................................................................66
.................................................................................................72
5.1  T5SEQ ..................................................................................73
5.2  T5CP .................................................................77
5.3  T5DCP ......................................................85
5.4  MPI .....................................................................................91
5.4.1  (Cartesian Topology) .....................................................91
5.4.2  MPI_Cart_create, MPI_Cart_coords, MPI_Cart_shift ..........................................92
5.4.3  MPI_Type_vector, MPI_Type_commit ...................................................................95
5.5  T5_2D ...............................................................97
MPI .............................................................................................110
6.1  Nonblocking ........................................................................................111
6.2  ..................................................................................................120
6.3  ......................................................................124
6.4  .........................................................................................126
6.4.1  .......................................................................................126
6.4.2  .......................................................................132
.....................................................................................................134
7.1  .............................................................................................135
7.2  .....................................................................................................140
7.3  .........................................................................................150
SOR ............................................................................................158
8.1  SOR ....................................................................................159
8.2  SOR ..................................................................................164
8.3  SOR ..........................................................................................173
8.4  SOR ................................................................181
.....................................................................................................191
9.1  .................................................................................192
9.2  .................................................................................196
.........................................................................................................................207
Parallel Processing of 1-D Arrays without Partition........................................................208
Parallel Processing of 1-D Arrays with Partition.............................................................209
Parallel on the 1st Dimension of 2-D Arrays without Partition.......................................210
Parallel on the 1st Dimension of 2-D Arrays with Partition............................................211
Partition on the 1st dimension of 3-D Arrays ..................................................................212


MPI
MPI
MPI

IBM SP2 MPI

PC cluster MPI

1.1  MPI

MPI (Message Passing Interface) is a standard for message-passing parallel programs.
It defines a library of functions that can be called from Fortran, C and C++ programs,
and it is available on parallel computers as well as on workstation or PC clusters.
The current specifications are MPI 1.2 and, since 1998, MPI 2.0.  MPICH 1.2 from
Argonne National Lab is a freely available implementation of MPI 1.2 (with parts of
MPI 2.0); it can be downloaded from

http://www-unix.mcs.anl.gov/mpi/mpich

or by anonymous ftp from ftp.mcs.anl.gov, directory pub/mpi, file mpich-1.2.1.tar.Z
(also available as mpich-1.2.1.tar.gz).

1.2

IBM SP2IBM SP2 SMPHP SPP2000SGI Origin2000 Fujitsu


VPP300 MPI PC cluster MPICH
PC clusterIBM SP2 IBM SP2 SMP
CPU
CPU IBM SP2 CPU CPU
CPU
( HP SPP2000) CPU CPU
CPU CPU CPU
(time sharing) CPU
HP SPP2000 SGI ORIGIN2000 16 CPU
SP2 VPP300 CPU
SP2 SMP node 4 CPU
42 node SMP lusterSP2 SP2 SMP
(job scheduler) LoadLeveler (batch job)
LoadLeveler job command file llsubmit SP2
SPP2000ORIGIN2000 VPP300 NQS (Network Queue System)
NQS job command file qsub
PC cluster DQS (Distributed Queue System)
NQS

1.3  Running MPI programs on the IBM SP2

To use MPI on the IBM SP2, the C shell start-up file .cshrc in the home directory must
put the directories containing the MPI include files (mpif.h, mpif90.h, mpi.h), the
compilers (mpxlf, mpxlf90, mpcc, mpCC), the MPI library and the LoadLeveler commands
(llsubmit, llq, llstatus, llcancel) on the search path:

set lpath=(. ~ /usr/lpp/ppe.poe/include /usr/lpp/ppe.poe/lib)
set lpath=($lpath /usr/lpp/ppe.poe/bin /home/loadl/bin )
set path=($path $lpath)

After editing .cshrc, execute 'source .cshrc' (or log out and log in again) to make the
new settings effective.

1.3.1  Compiling C MPI programs on the IBM SP2

The compiler for C MPI programs is mpicc under MPICH; on the IBM SP2 and SP2 SMP it is
mpcc.  For example:

mpcc -O3 -qarch=auto -qstrict -o file.x file.c

-O3              level 3 optimization
-qarch=auto      generate code for the architecture of the machine doing the compilation
-qstrict         do not let the optimizer change the semantics of the program
-o file.x        name the executable file.x (the default name is a.out)
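Before compiling a real application it can help to verify the tool chain with a minimal
MPI program (a sketch only; the file name hello.c and the printed text are examples,
not part of the original material):

/* hello.c -- minimal MPI program to verify compiler and runtime
   (compile with: mpcc -O3 -o hello.x hello.c)                     */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int nproc, myid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("process %d of %d is alive\n", myid, nproc);
    MPI_Finalize();
    return 0;
}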

1.3.2  IBM SP2 Job command file

On the IBM SP2 (ivy) parallel jobs are run through LoadLeveler, which needs a job
command file.  The following job command file, called jobp4, runs the executable
file.x on 4 CPUs:
#!/bin/csh
#@ executable = /usr/bin/poe
#@ arguments = /your_working_directory/file.x
#@ output = outp4
#@ error = outp4
#@ job_type = parallel

euilib us

#@ class = medium
#@ min_processors = 4
#@ max_processors = 4
#@ requirements = (Adapter == "hps_user")
#@ wall_clock_limit = 20
#@ queue
executable = /usr/bin/poe    poe is the Parallel Operating Environment, which starts the
                             parallel job
arguments  =                 the parallel executable that poe is to run
output     =                 file that receives the standard output (stdout)
error      =                 file that receives the error messages
class      =                 job class on the SP2; the command llclass lists the classes:
                               short   (CPU 12, 10 120MHz CPUs)
                               medium  (CPU 24, 64 160MHz CPUs)
                               long    (CPU 96, 24 120MHz CPUs)
min_processors =             minimum number of CPUs for the job
max_processors =             maximum number of CPUs for the job
requirements = (Adapter == "hps_user")   use the high-performance switch in user space
wall_clock_limit =           wall clock time limit of the job
queue                        submit the job

At most 4 CPUs may be used in the short class, 32 CPUs in the medium class and 8 CPUs in
the long class.  Since MPI 1.2 cannot change the number of CPUs while a program is
running, min_processors and max_processors should be given the same value, and
wall_clock_limit should be large enough for the job to finish.

On the IBM SP2 SMP (ivory) the LoadLeveler job command file looks slightly different;
the following job command file, again called jobp4, runs the executable file.x on 4
CPUs:
#!/bin/csh
#@ network.mpi= css0,shared,us
#@ executable = /usr/bin/poe
#@ arguments = /your_working_directory/file.x
#@ output = outp4
#@ error = outp4

euilib us

#@ job_type = parallel
#@ class = medium
#@ tasks_per_node = 4
#@ node = 1
#@ wall_clock_limit = 20
#@ queue
IBM SP2 SMP Node 375MHz CPU 4GB 8GB
class
= SP2 SMP CPU llclass :
short
(CPU 12 3 Node 6 CPU)

tasks_per_node=4

(CPU 24 32 Node 128 CPU)


(CPU 48 4 Node 16 CPU)
class Node 8GB
Node CPU

node=1

Node CPU

medium
bigmem

CPU medium class 16 Node 64 CPU class

1.3.3  Submitting and monitoring jobs on the IBM SP2

On the IBM SP2 and SP2 SMP a LoadLeveler job command file is submitted with the
llsubmit command.  For the job command file jobp4:

llsubmit jobp4

The llq command lists the jobs known to LoadLeveler.  To see only one class or one user
id, pipe the llq output through grep, e.g. for the medium class:

llq | grep medium

A typical llq listing looks like this:

job_id        user_id    submitted    status  priority  class   running on
------------  ---------  -----------  ------  --------  ------  ----------
ivy1.1781.0   u43ycc00   8/13 11:24   R       50        medium  ivy39
ivy1.1814.0   u50pao00   8/13 20:12   R       50        short   ivy35

job_id       identifier assigned by LoadLeveler
user_id      login name of the job owner
submitted    date/time the job was submitted
status       R  = Running
             I  = Idle (= waiting in queue)
             ST = Start execution
             NQ = Not Queued
priority     priority of the job
class        job class
running on   the (first) node on which the job's CPUs are running
To kill a job, use the llcancel command:

llcancel job_id

The job_id is obtained with llq; after the llcancel, llq can be used to check that the
job is gone.

1.4  PC Cluster MPI

To use MPICH on the PC cluster, the C shell start-up file .cshrc in the home directory
must put the MPICH include files (mpif.h, mpi.h), the compiler scripts (mpif77, mpicc,
mpiCC), the MPI library and the DQS commands of the PC cluster on the search path:

setenv PGI /usr/local/pgi
set path = ( . ~ /usr/local/pgi/linux86/bin $path)
set path = ( /home/package/DQS/bin $path)
set path = ( /home/package/mpich/bin $path)

PGI stands for Portland Group Inc.; its C and C++ compilers are pgcc and pgCC.  The
second line adds the PGI compilers, the third line the DQS commands, and the fourth
line the MPICH installation that was built with the PGI compilers.

1.4.1  Compiling C MPI programs on the PC cluster

The MPICH compiler script for C programs is mpicc, which by default calls the GNU C
compiler gcc:

mpicc -O3 -o file.x file.c

-O3          level 3 optimization of gcc
-o file.x    name the executable file.x (the default name is a.out)
file.c       the C source file

If the MPICH that was built with the PGI compiler is used instead, mpicc calls pgcc; a
makefile for pgcc looks like this:

OBJ   = file.o
EXE   = file.x
MPI   = /home/package/mpich_PGI
LIB   = $(MPI)/lib/libmpich.a
MPICC = $(MPI)/bin/mpicc
OPT   = -O2 -I$(MPI)/include
$(EXE) : $(OBJ)
	$(MPICC) $(LFLAG) -o $(EXE) $(OBJ) $(LIB)
.c.o :
	$(MPICC) $(OPT) -c $<

The program is then built by typing make.

1.4.2  PC Cluster Job command file

Parallel jobs on the PC cluster are submitted through DQS (Distributed Queue System),
which also needs a job command file.  The following job command file, again called
jobp4, runs the executable hubksp on four CPUs:
#!/bin/csh
#$ -l qty.eq.4,HPCS00
#$ -N HUP4
#$ -A user_id
#$ -cwd
#$ -j y
cat $HOSTS_FILE > MPI_HOST
mpirun -np 4 -machinefile MPI_HOST hubksp >& outp4
#!/bin/csh                   the script is interpreted by the C shell
#$ -l qty.eq.4,HPCS          DQS resource request: qty (quantity) is the number of CPUs,
                             HPCS is the queue (class) name of the cluster
#$ -N HUP4                   the name (Name) of the job is HUP4
#$ -A user_id                the account (Account) to which the job is charged
#$ -cwd                      run the job in the current working directory
                             (otherwise the home directory is used)
#$ -j y                      join the error messages into the standard output
cat $HOSTS_FILE > MPI_HOST   DQS writes the list of allocated nodes into $HOSTS_FILE;
                             copy it into the file MPI_HOST
mpirun -np 4 -machinefile MPI_HOST hubksp >& outp4
                             start hubksp on 4 CPUs using the nodes listed in MPI_HOST
                             and redirect its output to the file outp4
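A small test program can be used to check that mpirun really starts the processes on
the nodes listed in MPI_HOST (a sketch only; MPI_Get_processor_name is a standard MPI
call that returns the node name):

/* whereami.c -- print which node each MPI process runs on */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int  myid, nproc, len;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(name, &len);
    printf("rank %d of %d running on %s\n", myid, nproc, name);
    MPI_Finalize();
    return 0;
}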

1.4.3  Submitting and monitoring jobs on the PC cluster

On the PC cluster a DQS job command file is submitted with the qsub command; for the
job command file jobp4 of the previous section:

qsub jobp4

The qstat command lists the jobs in the cluster, and qstat -f additionally shows every
queue (node) of the cluster.  After qsub jobp4 the qstat output looks like this:
c00tch00  HUP4   hpcs01   62  0:1  r  RUNNING  02/26/99 10:51:23
c00tch00  HUP4   hpcs02   62  0:1  r  RUNNING  02/26/99 10:51:23
c00tch00  HUP4   hpcs03   62  0:1  r  RUNNING  02/26/99 10:51:23
c00tch00  HUP4   hpcs04   62  0:1  r  RUNNING  02/26/99 10:51:23
---- Pending Jobs -----------------------------------------------------------------------
c00tch00  RAD5            70  0:2     QUEUED   02/26/99 19:24:32

The columns show the user_id, the job name, the node, the DQS job_id (62), the task
identifier on that node (0:1), the state (r, RUNNING) and the date and time
(month/day/year hour:minute:second).  Jobs listed under "Pending Jobs" are QUEUED,
waiting until enough CPUs become free.
To kill a job, use the qdel command:

qdel job_id

The job_id is obtained with qstat; after the qdel, qstat can be used to check that the
job is gone.

This chapter shows how a sequential program is turned into an MPI parallel program.

2.1 introduces the basic MPI functions MPI_Init, MPI_Finalize, MPI_Comm_size,
    MPI_Comm_rank, MPI_Send and MPI_Recv.
2.2 presents the sequential program T2SEQ.
2.3 parallelizes T2SEQ with these MPI functions as program T2CP.
2.4 introduces the collective functions MPI_Scatter, MPI_Gather, MPI_Reduce and
    MPI_Allreduce.
2.5 uses them to parallelize T2SEQ with data partition as program T2DCP.

MPI

2.1

MPI
MPI_Init, MPI_Finalize,
MPI_Comm_size, MPI_Comm_rank,
MPI_Send, MPI_Recv

2.1.1  The include file mpi.h

Every C program that calls MPI functions must contain the statement #include <mpi.h>
before the first MPI call; mpi.h declares the MPI functions and defines the MPI
constants.  For example:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
main ( argc, argv)
int argc;
char **argv;
{
...
...
MPI_Finalize();
return 0;
}
startend(int myid, int nproc, int is1, int is2, int* istart, int* iend)
{
...
return 0;
}
MPI mpi.h MPI
MPI

2.1.2  MPI_Init, MPI_Finalize

MPI_Init starts the parallel environment and must be called before any other MPI
function; MPI_Finalize ends it.  All other MPI calls have to appear between MPI_Init
and MPI_Finalize:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
main ( argc, argv)
int argc;
char **argv;
{
MPI_Init(&argc, &argv);
...
MPI_Finalize();
return 0;
}

2.1.3  MPI_Comm_size, MPI_Comm_rank

After MPI_Init, MPI_Comm_size returns the number of CPUs (nproc) taking part in the job
and MPI_Comm_rank returns the rank (myid) of the calling CPU.  Ranks start at 0: the
first CPU has myid 0, the second CPU myid 1, the third CPU myid 2, and so on.  The
number of CPUs is not fixed in the program; it is chosen when the job is submitted
(min_processors / max_processors in the LoadLeveler job command file, or the -np
argument of mpirun).  The two functions are called as:

MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);

MPI_COMM_WORLD is the default communicator defined in mpi.h; it contains all CPUs of
the job.  Since MPI 1.2 cannot add or remove CPUs while a program is running, the
example programs always use MPI_COMM_WORLD.  For example:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int     nproc, myid;
main ( argc, argv)
int argc;
char **argv;
{
MPI_Init(&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
...
...
MPI_Finalize();
return 0;
}

2.1.4  MPI_Send, MPI_Recv

Data is transferred between CPUs either by 'point to point communication' or by
'collective communication'.  MPI_Send and MPI_Recv form the point-to-point pair: one
CPU sends a message and one CPU receives it, and every MPI_Send must be matched by an
MPI_Recv on the receiving CPU.  MPI_Send is called as:

MPI_Send ((void *)&data, icount, DATA_TYPE, idest, itag, MPI_COMM_WORLD);

data        starting address of the data to be sent (a scalar or an array)
icount      number of elements of data to send
DATA_TYPE   MPI data type of the elements (Table 1.1)
idest       rank (CPU id) of the destination CPU
itag        message tag chosen by the programmer

MPI data type           C data type          description
MPI_CHAR                signed char          1-byte character
MPI_SHORT               signed short int     2-byte integer
MPI_INT                 signed int           4-byte integer
MPI_LONG                signed long int      4-byte integer
MPI_UNSIGNED_CHAR       unsigned char        1-byte unsigned character
MPI_UNSIGNED_SHORT      unsigned short int   2-byte unsigned integer
MPI_UNSIGNED            unsigned int         4-byte unsigned integer
MPI_UNSIGNED_LONG       unsigned long int    4-byte unsigned integer
MPI_FLOAT               float                4-byte floating point
MPI_DOUBLE              double               8-byte floating point
MPI_LONG_DOUBLE         long double          8-byte floating point
MPI_PACKED              (packed data, see MPI_Pack)

Table 1.1  MPI data types for the C language

MPI_Recv is called as:

MPI_Recv ((void *)&data, icount, DATA_TYPE, isrc, itag, MPI_COMM_WORLD, istat);

data        starting address of the receive buffer
icount      (maximum) number of elements to receive
DATA_TYPE   MPI data type of the elements
isrc        rank (CPU id) of the source CPU
itag        message tag; it must match the itag of the corresponding MPI_Send
istat       status of the received message

istat can be declared as an integer array using the constant MPI_STATUS_SIZE defined in
mpi.h, or with the type MPI_Status; the example programs use

MPI_Status   istat[8];

When the receiving CPU does not know in advance which CPU will send the message, the
constant MPI_ANY_SOURCE can be given instead of a fixed source rank:

MPI_Recv ((void *)&buff, icount, DATA_TYPE, MPI_ANY_SOURCE, itag,
          MPI_COMM_WORLD, istat);

The rank of the CPU that actually sent the message can then be read from the status:

isrc = istat[0].MPI_SOURCE;
MPI point-to-point communication (MPI_Send, MPI_Recv) matches messages by their
'envelope', which consists of

1. the rank (CPU id) of the sending CPU
2. the rank (CPU id) of the receiving CPU
3. the message tag
4. the communicator

A message is delivered to an MPI_Recv only when all four parts of the envelope match.
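The following sketch shows the envelope and the status in action (an illustrative
example, not one of the original programs): rank 0 sends one double to rank 1, which
receives with MPI_ANY_SOURCE and reads the actual sender and tag from the status.

/* envelope.c -- one send, one receive, inspect the status on the receiver */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int        myid, itag = 10;
    double     data = 3.14;
    MPI_Status istat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        MPI_Send((void *)&data, 1, MPI_DOUBLE, 1, itag, MPI_COMM_WORLD);
    } else if (myid == 1) {
        MPI_Recv((void *)&data, 1, MPI_DOUBLE, MPI_ANY_SOURCE, itag,
                 MPI_COMM_WORLD, &istat);
        printf("got %f from rank %d, tag %d\n",
               data, istat.MPI_SOURCE, istat.MPI_TAG);
    }
    MPI_Finalize();
    return 0;
}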


2.2  The sequential program T2SEQ

The first part of T2SEQ generates test data for the arrays b, c, d and writes them to
the file 'input.dat'; the second part reads the file back, computes the array a and its
sum in a for loop, and prints part of a:

/*  PROGRAM T2SEQ
    sequential version of 1-dimensional array operation   */
#include <stdio.h>
#include <stdlib.h>
#define n      200

main ()
{
double suma, a[n], b[n], c[n], d[n];
int
i, j;
FILE *fp;
/*  test data generation and write out to file 'input.dat'   */

for (i = 0; i < n; i++) {


j=i+1;
b[i] = 3. / (double) j + 1.0;
c[i] = 2. / (double) j + 1.0;
d[i] = 1. / (double) j + 1.0;
}
fp = fopen( "input.dat", "w");
fwrite( (void *)&b, sizeof(b), 1, fp );
fwrite( (void *)&c, sizeof(c), 1, fp );
fwrite( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
/*  read 'input.dat', compute and write out the result   */

fp = fopen( "input.dat", "r");


fread( (void *)&b, sizeof(b), 1, fp );
fread( (void *)&c, sizeof(c), 1, fp );
fread( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
suma = 0.;
for (i = 0; i < n; i++) {
a[i] = b[i] + c[i] * d[i];
suma += a[i];
}
for (i = 0; i < n; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
}
printf( "sum of A=%f\n",suma);
return 0;
}
The output of program T2SEQ:

10.000	3.056	2.562	2.383	2.290	2.234	2.196	2.168
2.148	2.131	2.118	2.108	2.099	2.091	2.085	2.079
2.074	2.070	2.066	2.063	2.060	2.057	2.054	2.052
2.050	2.048	2.046	2.044	2.043	2.041	2.040	2.039
2.037	2.036	2.035	2.034	2.033	2.032	2.031	2.031
sum of A=438.548079

2.3  Computation partition without data partition -- T2CP

A sequential program can be parallelized by partitioning only the computation while
every CPU keeps the complete data (this section), or by partitioning the data as well
(section 2.5).  In T2CP every CPU holds the full arrays a, b, c, d, but each CPU only
computes its own part of the loop; the subroutine startend assigns each CPU a
contiguous index range, CPU0 the first part, CPU1 the next, and so on, as shown in
Fig. 2.1:
[Figure 2.1  Computing partition without data partition: cpu0 .. cpu3 each own the
 index range istart..iend of the same ntotal-element arrays; array elements outside a
 CPU's territory are not computed by it.]
In Fig. 2.1 every CPU stores the whole arrays but only computes the elements between
its own istart and iend.  MPI 1.2 has no parallel I/O, so CPU0 (myid == 0) reads the
input file and, in a for loop over the other CPUs, sends each of them its part of the
arrays b, c, d with MPI_Send; the CPUs with myid > 0 receive their parts with MPI_Recv.
The itag argument of each MPI_Send must match the itag of the corresponding MPI_Recv.
After the computation every CPU sends its part of a back to CPU0, and CPU0 accumulates
the total sum suma of a and prints the result.
/*  PROGRAM T2CP
    computation partition without data partition of 1-dimensional arrays   */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define n 200
main ( argc, argv)
int argc;
char **argv;
{
double      suma, a[n], b[n], c[n], d[n];
int         i, j, k;
FILE        *fp;
int         nproc, myid, istart, iend, icount;
int         itag, isrc, idest, istart1, icount1;
int         gstart[16], gend[16], gcount[16];
MPI_Status  istat[8];
MPI_Comm    comm;
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
startend( nproc, 0, n - 1, gstart, gend, gcount);
istart=gstart[myid];
iend=gend[myid];
comm=MPI_COMM_WORLD;
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
/*

READ 'input.dat', COMPUTE AND WRITE OUT THE RESULT */


if ( myid==0) {

fp = fopen( "input.dat", "r");


fread( (void *)&b, sizeof(b), 1, fp );
fread( (void *)&c, sizeof(c), 1, fp );
fread( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
for (idest = 1; idest < nproc; idest++) {
istart1=gstart[idest];
icount1=gcount[idest];
itag=10;
MPI_Send ((void *)&b[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
itag=20;
MPI_Send ((void *)&c[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
itag=30;
MPI_Send ((void *)&d[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
}
}
else {
icount=gcount[myid];
isrc=0;
itag=10;
MPI_Recv ((void *)&b[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
itag=20;
MPI_Recv ((void *)&c[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
itag=30;
MPI_Recv ((void *)&d[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
}
/*
compute, collect computed result and write out the result
*/
for (i = istart; i <= iend; i++) {
a[i] = b[i] + c[i] * d[i];
}
itag=110;
if (myid > 0) {
icount=gcount[myid];
idest=0;
MPI_Send((void *)&a[istart], icount, MPI_DOUBLE, idest, itag, comm);
}

else {
for ( isrc=1; isrc < nproc; isrc++ ) {
icount1=gcount[isrc];
istart1=gstart[isrc];
MPI_Recv((void *)&a[istart1], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
}
}
if (myid == 0) {
for (i = 0; i < n; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
}
suma=0.0;
for (i = 0; i < n; i++)
suma+=a[i];
printf( "sum of A=%f\n",suma);
}
MPI_Finalize();
return 0;
}
startend(int nproc, int is1, int is2, int gstart[16], int gend[16], int gcount[16])
{
int     i, ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
for ( i=0; i < nproc; i++ ) {
if(i < ir) {
gstart[i]=is1+i*(iblock+1);
gend[i]=gstart[i]+iblock;
}
else {
gstart[i]=is1+i*iblock+ir;
gend[i]=gstart[i]+iblock-1;
}
if(ilength < 1) {
gstart[i]=1;
gend[i]=0;

}
gcount[i]=gend[i]-gstart[i] + 1;
}
}
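As a quick check of the partitioning logic, the following sketch (not part of the
original program) prints the index range of every rank; it assumes it is compiled and
linked together with the startend function listed above:

/* check_startend.c -- inspect the partition produced by startend() */
#include <stdio.h>

extern int startend(int nproc, int is1, int is2,
                    int gstart[16], int gend[16], int gcount[16]);

int main(void)
{
    int i, nproc = 4;                               /* same case as T2CP */
    int gstart[16], gend[16], gcount[16];

    startend(nproc, 0, 199, gstart, gend, gcount);  /* n = 200 elements  */
    for (i = 0; i < nproc; i++)
        printf("rank %d: istart=%d iend=%d count=%d\n",
               i, gstart[i], gend[i], gcount[i]);
    return 0;
}

With nproc = 4 and the index range 0..199 it prints 0-49, 50-99, 100-149 and 150-199,
which matches the ISTART/IEND values in the T2CP output below.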

The output of program T2CP:

ATTENTION: 0031-408 4 nodes allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4	1	50	99
NPROC,MYID,ISTART,IEND=4	0	0	49
NPROC,MYID,ISTART,IEND=4	3	150	199
NPROC,MYID,ISTART,IEND=4	2	100	149
10.000	3.056	2.562	2.383	2.290	2.234	2.196	2.168
2.148	2.131	2.118	2.108	2.099	2.091	2.085	2.079
2.074	2.070	2.066	2.063	2.060	2.057	2.054	2.052
2.050	2.048	2.046	2.044	2.043	2.041	2.040	2.039
2.037	2.036	2.035	2.034	2.033	2.032	2.031	2.031
sum of A=438.548079

MPI programs follow the SPMD (Single Program Multiple Data) model: the same program
runs on every CPU, and the CPUs are distinguished only by their rank (myid).  Work is
divided either with if statements on myid or by letting the loop index range depend on
the rank, as T2CP does.  The part of T2CP that every CPU executes identically is:

MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
startend( nproc, 0, n - 1, gstart, gend, gcount);
istart=gstart[myid];
iend=gend[myid];
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);

Each CPU obtains its own myid and therefore its own istart and iend.  It is common in
SPMD programs to let CPU0 do the extra work such as I/O; CPU0 is then called the master
CPU and the other CPUs the slave CPUs.
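A minimal sketch of this SPMD master/slave pattern (illustrative only):

/* spmd.c -- one executable; the rank decides master or worker behaviour */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int nproc, myid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        /* master: e.g. read input, distribute work, collect results */
        printf("master, %d processes in total\n", nproc);
    } else {
        /* workers: compute their own index range, as T2CP does */
        printf("worker %d\n", myid);
    }
    MPI_Finalize();
    return 0;
}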

2.4  MPI_Scatter, MPI_Gather, MPI_Reduce

MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Reduce and MPI_Allreduce are 'collective
communication' functions: every CPU of the communicator must call them.

2.4.1  MPI_Scatter, MPI_Gather

MPI_Scatter cuts the array t of the root CPU (iroot) into nproc consecutive blocks of n
elements (nproc is the number of CPUs) and sends one block to each CPU in rank order,
including the root CPU itself: CPU0 gets the first block, CPU1 the second, and so on,
as shown in Fig. 2.2:

[Figure 2.2  MPI_Scatter: the array t1 t2 t3 t4 of CPU0 is scattered so that CPU0
 receives t1, CPU1 receives t2, CPU2 receives t3 and CPU3 receives t4.]
MPI_Scatter :
iroot = 0
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);

t           the array to be scattered (significant only on the root CPU)
n           number of elements sent to each CPU
MPI_DOUBLE  data type of the elements sent
b           receive buffer of each CPU
n           number of elements received by each CPU
MPI_DOUBLE  data type of the elements received
iroot       rank (CPU id) of the CPU that owns t

MPI_Gather is the inverse of MPI_Scatter: every CPU, including the destination CPU
idest, sends its n-element array a, and the destination CPU stores the blocks in t in
rank order -- the n elements of CPU0 first, then those of CPU1, CPU2, and so on, as
shown in Fig. 2.3:

[Figure 2.3  MPI_Gather: the blocks t1, t2, t3, t4 held by CPU0..CPU3 are collected
 into the array t1 t2 t3 t4 of CPU0.]
MPI_Gather is called as:

idest = 0;
MPI_Gather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);

a           the data sent by each CPU
n           number of elements sent by each CPU
MPI_DOUBLE  data type of the elements sent
t           the array that collects the data (significant only on CPU idest)
n           number of elements received from each CPU
MPI_DOUBLE  data type of the elements received
idest       rank (CPU id) of the CPU that collects the data

MPI_Allgather works like MPI_Gather followed by a broadcast: after the call every CPU,
not just one destination CPU, holds the complete gathered array, so there is no idest
argument:

MPI_Allgather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, comm);
[Figure 2.4  MPI_Allgather: every CPU contributes its block t1..t4 and every CPU ends
 up with the complete array t1 t2 t3 t4.]
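The following sketch puts MPI_Scatter and MPI_Gather together in a complete little
program (an illustrative example, not one of the original programs; it assumes exactly
4 processes and a block size of 2):

/* scatgath.c -- root scatters t[0..7] in blocks of 2, every rank doubles
   its block, root gathers the results back; run with 4 processes         */
#include <stdio.h>
#include <mpi.h>

#define NP 4
#define NB 2                       /* block size per process */

int main(int argc, char **argv)
{
    int    i, myid, iroot = 0;
    double t[NP*NB], b[NB];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == iroot)
        for (i = 0; i < NP*NB; i++) t[i] = (double) i;
    MPI_Scatter((void *)&t, NB, MPI_DOUBLE,
                (void *)&b, NB, MPI_DOUBLE, iroot, MPI_COMM_WORLD);
    for (i = 0; i < NB; i++) b[i] *= 2.0;
    MPI_Gather((void *)&b, NB, MPI_DOUBLE,
               (void *)&t, NB, MPI_DOUBLE, iroot, MPI_COMM_WORLD);
    if (myid == iroot)
        for (i = 0; i < NP*NB; i++) printf("%.1f ", t[i]);
    MPI_Finalize();
    return 0;
}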

2.4.2  MPI_Reduce, MPI_Allreduce

A 'reduction operation' combines a value from every CPU into one result, for example
adding up the partial sums computed on the CPUs.  MPI_Reduce delivers the result to one
CPU (iroot) only; MPI_Allreduce delivers it to every CPU.  MPI_Reduce is shown in
Fig. 2.5 and MPI_Allreduce in Fig. 2.6:

[Figure 2.5  MPI_Reduce with MPI_SUM: the partial values suma of CPU0..CPU3
 (0.2 1.5 / 0.5 0.6 / 0.3 0.4 / 0.7 1.0) are added and only CPU0 receives
 sumall = 1.7 3.5.]

[Figure 2.6  MPI_Allreduce with MPI_SUM: the same sum, but every CPU receives
 sumall = 1.7 3.5.]

MPI_Reduce MPI_Allreduce :
iroot = 0;
MPI_Reduce ((void *)&suma, (void *)&sumall, count, MPI_DOUBLE, MPI_SUM,
iroot, comm);
MPI_Allreduce((void *)&suma, (void *)&sumall, count, MPI_DOUBLE, MPI_SUM,
comm);
suma        the local value (e.g. the partial sum) of each CPU
sumall      the result of the reduction (on CPU iroot for MPI_Reduce, on every CPU for
            MPI_Allreduce)
count       number of elements to be reduced
MPI_DOUBLE  data type of the elements
MPI_SUM     the reduction operation (Table 2.1)
iroot       rank (CPU id) of the CPU that receives the result (MPI_Reduce only)

MPI operation   meaning                    C data types
MPI_SUM         sum                        MPI_INT, MPI_FLOAT,
MPI_PROD        product                    MPI_DOUBLE, MPI_LONG_DOUBLE
MPI_MAX         maximum
MPI_MIN         minimum
MPI_MAXLOC      max value and location     MPI_FLOAT_INT, MPI_DOUBLE_INT,
MPI_MINLOC      min value and location     MPI_LONG_INT, MPI_2INT
MPI_LAND        logical AND                MPI_SHORT, MPI_LONG, MPI_INT,
MPI_LOR         logical OR                 MPI_UNSIGNED_SHORT, MPI_UNSIGNED,
MPI_LXOR        logical exclusive OR       MPI_UNSIGNED_LONG
MPI_BAND        binary AND                 MPI_SHORT, MPI_LONG, MPI_INT,
MPI_BOR         binary OR                  MPI_UNSIGNED_SHORT, MPI_UNSIGNED,
MPI_BXOR        binary exclusive OR        MPI_UNSIGNED_LONG

Table 2.1  MPI reduction functions

The data types used with MPI_MAXLOC and MPI_MINLOC are pairs, corresponding to C
structures:

data type        description (C structure)
MPI_FLOAT_INT    { MPI_FLOAT,  MPI_INT }
MPI_DOUBLE_INT   { MPI_DOUBLE, MPI_INT }
MPI_LONG_INT     { MPI_LONG,   MPI_INT }
MPI_2INT         { MPI_INT,    MPI_INT }
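A short sketch of a reduction that uses one of these pair types (an illustrative
example): each rank contributes one value together with its rank, and MPI_Reduce with
MPI_MAXLOC tells the root both the global maximum and the rank that owns it.

/* maxloc.c -- global maximum plus its location with MPI_MAXLOC */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int myid;
    struct { double val; int rank; } in, out;    /* matches MPI_DOUBLE_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    in.val  = 1.0 / (double)(myid + 1);          /* some local value */
    in.rank = myid;
    MPI_Reduce((void *)&in, (void *)&out, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
               0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("max = %f on rank %d\n", out.val, out.rank);
    MPI_Finalize();
    return 0;
}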


2.5  Data and computation partition -- T2DCP

In T2DCP the data is partitioned as well: with np CPUs each CPU stores only ntotal/np
elements of the arrays a, b, c, d.  CPU0 reads the ntotal elements of b, c, d into the
buffer array t and distributes them with MPI_Scatter:
iroot=0;
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);
CPU MPI_Gather a
CPU0
idest=0;
MPI_Gather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);
T2DCP differs from T2CP in that the data is distributed with MPI_Scatter and collected
with MPI_Gather instead of with MPI_Send / MPI_Recv loops.  Because each CPU now stores
only its own block, the array dimension shrinks from ntotal to n = ntotal / np, defined
as:
#define ntotal
#define np
#define n

200
4
50

With these definitions the local arrays are declared with dimension n and the buffer t
with dimension ntotal:

double a[n], b[n], c[n], d[n], t[ntotal];

Each CPU's for loop now runs from 0 to n-1 and accumulates the partial sum suma of its
own n elements; MPI_Reduce with MPI_SUM adds the partial sums of all CPUs into sumall
on CPU0:

iroot=0;
MPI_Reduce ((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, iroot, comm);
T2DCP :
/*

PROGRAM T2DCP */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ntotal 200
#define np 4
#define n      50
main ( argc, argv)
int argc;
char **argv;
{
/*  Data & Computational Partition Using MPI_Scatter, MPI_Gather
    value of n must be modified when run on other than 4 processors   */
int         i, j, k;
FILE        *fp;
double      a[n], b[n], c[n], d[n], t[ntotal], suma, sumall;
int         nproc, myid, istart, iend, iroot, idest;
MPI_Comm    comm;
MPI_Status  istat[8];
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm = MPI_COMM_WORLD;
istart = 0;
iend = n-1;


/*  read input data and distribute input data  */


if (nproc != np) {
printf( "nproc not equal to np= %d\t%d\t",nproc, np);
printf(" program will stop");
MPI_Finalize();
return 0;
}
if (myid == 0) {
fp = fopen( "input.dat", "r");
fread( (void *)&t, sizeof(t), 1, fp );
}
iroot=0;
MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);
if(myid == 0) {
fread( (void *)&t, sizeof(t), 1, fp );
}
MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&c, n, MPI_DOUBLE, iroot, comm);
if(myid == 0) {
fread( (void *)&t, sizeof(t), 1, fp );
}
MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&d, n, MPI_DOUBLE, iroot, comm);

/*
compute, gather computed data, and write out the result
*/
suma=0.0;
/* for(i=0; i<ntotal; i++) { */
for(i=istart; i<=iend; i++) {
a[i]=b[i]+c[i]*d[i];
suma=suma+a[i];
}
idest=0;
MPI_Gather((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);
MPI_Reduce((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, idest, comm);
if(myid == 0) {
for (i = 0; i < ntotal; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
t[i],t[i+5],t[i+10],t[i+15],t[i+20],t[i+25],t[i+30],t[i+35]);

}
printf( "sum of A=%f\n",sumall);
}
MPI_Finalize();
return 0;
}

The output of program T2DCP:

ATTENTION: 0031-408 4 nodes allocated by LoadLeveler, continuing...
10.000	3.056	2.562	2.383	2.290	2.234	2.196	2.168
2.148	2.131	2.118	2.108	2.099	2.091	2.085	2.079
2.074	2.070	2.066	2.063	2.060	2.057	2.054	2.052
2.050	2.048	2.046	2.044	2.043	2.041	2.040	2.039
2.037	2.036	2.035	2.034	2.033	2.032	2.031	2.031
sum of A=438.548079

This chapter introduces boundary data exchange.

3.1 introduces MPI_Sendrecv and MPI_Bcast.
3.2 presents the sequential program T3SEQ.
3.3 parallelizes T3SEQ without data partition: T3CP_1 distributes the input data with
    MPI_Send / MPI_Recv, and T3CP_2 replaces that part with MPI_Bcast.
3.4 is the data-partitioned version T3DCP_1, which exchanges one boundary element on
    each side.
3.5 is T3DCP_2, which exchanges two boundary elements on each side.

3.1  MPI_Sendrecv, MPI_Bcast

MPI_Sendrecv is a point-to-point function, MPI_Bcast a collective one.

3.1.1  MPI_Sendrecv

When every CPU has to send data to one neighbouring CPU and at the same time receive
data from the neighbour on its other side, MPI_Sendrecv performs the MPI_Send and the
MPI_Recv in one call; this is both convenient and safe against deadlock.  It is called
as:
itag = 110;
MPI_Sendrecv ((void *)&b[iend],     icount, DATA_TYPE, r_nbr, itag,
              (void *)&b[istartm1], icount, DATA_TYPE, l_nbr, itag, comm, istat);

b[iend]        starting address of the data to be sent
icount         number of elements to send
DATA_TYPE      data type of the elements sent
r_nbr          rank (CPU id) of the destination CPU (the right neighbour)
itag           message tag of the data sent
b[istartm1]    starting address of the receive buffer
icount         number of elements to receive
DATA_TYPE      data type of the elements received
l_nbr          rank (CPU id) of the source CPU (the left neighbour)
itag           message tag of the data received
istat          status of the receive
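A self-contained sketch of MPI_Sendrecv (an illustrative example): every rank passes
its rank number around a ring, sending to the right neighbour and receiving from the
left one in a single call, with no deadlock.

/* ring.c -- one-step ring shift with MPI_Sendrecv */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int        nproc, myid, l_nbr, r_nbr, itag = 110;
    int        sendval, recvval;
    MPI_Status istat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    l_nbr = (myid + nproc - 1) % nproc;
    r_nbr = (myid + 1) % nproc;
    sendval = myid;
    MPI_Sendrecv((void *)&sendval, 1, MPI_INT, r_nbr, itag,
                 (void *)&recvval, 1, MPI_INT, l_nbr, itag,
                 MPI_COMM_WORLD, &istat);
    printf("rank %d received %d from rank %d\n", myid, recvval, l_nbr);
    MPI_Finalize();
    return 0;
}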

3.1.2  MPI_Bcast

MPI_Bcast ('Bcast' stands for broadcast) is a collective operation that copies data
from one CPU to every other CPU of the communicator; it is commonly used to give every
CPU a complete copy of the input data.  It is called as:
iroot=0;

MPI_Bcast ( (void *)&b, icount, DATA_TYPE, iroot, comm);


b
icount
DATA_TYPE

CPU id

iroot

The effect of MPI_Bcast is shown in Fig. 3.1:

[Figure 3.1  MPI_Bcast: the array b1 b2 b3 b4 of CPU0 is copied to CPU1, CPU2 and
 CPU3.]
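A minimal sketch of MPI_Bcast (an illustrative example; N = 4 is an arbitrary array
length):

/* bcast.c -- rank 0 owns b[], after the call every rank has a copy */
#include <stdio.h>
#include <mpi.h>

#define N 4

int main(int argc, char **argv)
{
    int    i, myid, iroot = 0;
    double b[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == iroot)
        for (i = 0; i < N; i++) b[i] = (double)(i + 1);
    MPI_Bcast((void *)&b, N, MPI_DOUBLE, iroot, MPI_COMM_WORLD);
    printf("rank %d: b[0]=%.1f b[%d]=%.1f\n", myid, b[0], N-1, b[N-1]);
    MPI_Finalize();
    return 0;
}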


3.2  The sequential program T3SEQ

In the for loop of T3SEQ each a[i] is computed from c[i], d[i] and the three
neighbouring elements b[i-1], b[i], b[i+1].  When this loop is divided among CPUs, the
b elements just outside each CPU's territory must be obtained from the neighbouring CPU
(boundary data exchange).  The program also determines the maximum amax of the array a:
/*

PROGRAM T3SEQ
Boundary Data Exchange Program - Sequential Version

*/
#include <stdio.h>
#include <stdlib.h>
#define ntotal   200

main ()
{
double  amax, a[ntotal], b[ntotal], c[ntotal], d[ntotal];
int     i, j;
FILE    *fp;

extern double max(double, double);


/*  read 'input.dat', compute, and write out the result   */

fp = fopen( "input.dat", "r");


fread( (void *)&b, sizeof(b), 1, fp );
fread( (void *)&c, sizeof(c), 1, fp );
fread( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
amax = -1.0e12;
for (i = 1; i < ntotal-1; i++) {
a[i]=c[i]*d[i]+(b[i-1]+2.0*b[i]+b[i+1])*0.25;
amax=max(amax,a[i]);
}
for (i = 0; i < ntotal; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",

a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
}
printf( "MAXIMUM VALUE OF A ARRAY is=%f\n",amax);
return 0;
}
double max(double a, double b)
{
if(a >= b)
return a;
else
return b;
}
The output of program T3SEQ:

0.000	3.063	2.563	2.383	2.290	2.234	2.196	2.168
2.148	2.131	2.118	2.108	2.099	2.091	2.085	2.079
2.074	2.070	2.066	2.063	2.060	2.057	2.054	2.052
2.050	2.048	2.046	2.044	2.043	2.041	2.040	2.039
2.037	2.036	2.035	2.034	2.033	2.032	2.031	2.031
MAXIMUM VALUE OF A ARRAY is=5.750000


3.3  Computation partition without data partition -- T3CP

How is T3SEQ parallelized when the data is not partitioned?  (The data-partitioned
versions follow in sections 3.4 and 3.5.)  In T3CP_1 the subroutine startend again
assigns each CPU its index range; the territories of CPU0, CPU1, ... and the elements
that have to be exchanged at their borders are shown in Fig. 3.2:

[Figure 3.2  Each CPU owns the elements from its istart to its iend; the element
 istart-1 comes from the left neighbour and the element iend+1 from the right
 neighbour; the leftmost and the rightmost CPU have no left/right neighbour
 (mpi_proc_null).]
Each CPU executes only the part of the T3SEQ loop between its own istart and iend:

amax=-1.e12;
for (i=1; i<ntotal-1; i++) {
    a[i]=c[i]*d[i] + ( b[i-1] + 2.0*b[i] + b[i+1] )*0.25;
    amax = max(amax, a[i]);
}

The loop index i of T3SEQ starts at 1, so on CPU0 the loop must start at 1 instead of
at istart; this is handled with istart1:

istart1=istart;
if (myid == 0) istart1=1;

Likewise the loop ends at ntotal-2, i.e. at iend-1 on the last CPU, handled with iend1:

iend1= iend;
if (myid == nproc-1) iend1= iend - 1;

When i equals istart, a[i] needs b[istart-1], and when i equals iend it needs
b[iend+1]; the indices istartm1 (istart minus 1) and iendp1 (iend plus 1) are used to
address these two elements of b:

istartm1=istart-1;
iendp1=iend+1;
The b[i-1] and b[i+1] elements outside each CPU's territory are exchanged with
MPI_Sendrecv before the for loop.  Because startend assigns the index ranges in rank
order, the left neighbour of a CPU is the CPU with rank myid-1 (l_nbr) and the right
neighbour the CPU with rank myid+1 (r_nbr).  The leftmost CPU has no left neighbour and
the rightmost CPU has no right neighbour; for them the neighbour rank is set to
MPI_PROC_NULL (defined in mpi.h), which turns the corresponding send and receive into a
no-operation:

l_nbr = myid-1;
r_nbr = myid+1;
if (myid == 0)       l_nbr = MPI_PROC_NULL;
if (myid == nproc-1) r_nbr = MPI_PROC_NULL;
To obtain b[i-1] (see CPU1 in Fig. 3.2), every CPU sends its b[iend] to the right
neighbour and receives the left neighbour's b[iend] into its own b[istartm1]; on the
CPUs whose neighbour is MPI_PROC_NULL nothing is sent or received:

itag = 110;
MPI_Sendrecv ((void *)&b[iend],     1, MPI_DOUBLE, r_nbr, itag,
              (void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat);

To obtain b[i+1] (again see CPU1 in Fig. 3.2), every CPU sends its b[istart] to the
left neighbour and receives the right neighbour's b[istart] into its own b[iendp1]:

itag = 120;
MPI_Sendrecv ((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
              (void *)&b[iendp1], 1, MPI_DOUBLE, r_nbr, itag, comm, istat);
After the exchange every CPU runs the for loop over its own index range and obtains the
local maximum amax of its part of a.  MPI_Allreduce with the operation MPI_MAX then
combines the local maxima into the global maximum gmax on every CPU:

MPI_Allreduce ( (void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm );

MPI_Allreduce is used instead of MPI_Reduce so that every CPU, not only the root,
obtains the reduced value.  The complete program follows:
/*

PROGRAM T3CP
Boundary data exchange with computing partition without data partition
Using MPI_Send, MPI_Recv to distribute input data

*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ntotal 200
main ( argc, argv)

int argc;
char **argv;
{
double amax, gmax, a[ntotal], b[ntotal], c[ntotal], d[ntotal];
int         i, j, k;
FILE        *fp;
int         nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp;
int         itag, isrc, idest, istart1, icount1, istart2, iend1, istartm1, iendp1;
int         gstart[16], gend[16], gcount[16];
MPI_Status  istat[8];
MPI_Comm    comm;
extern double max(double, double);
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm=MPI_COMM_WORLD;
startend (nproc, 0, ntotal-1, gstart, gend, gcount);
istart=gstart[myid];
iend=gend[myid];
icount=gcount[myid];
lastp=nproc-1;
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
istartm1=istart-1;
iendp1=iend+1;
istart2=istart;
if (myid == 0) istart2=istart+1;
iend1=iend;
if(myid == lastp ) iend1=iend-1;
l_nbr = myid - 1;
r_nbr = myid + 1;
if (myid == 0) l_nbr=MPI_PROC_NULL;
if (myid == lastp) r_nbr=MPI_PROC_NULL;


/*

READ 'input.dat', and distribute input data

*/

if ( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&b, sizeof(b), 1, fp );
fread( (void *)&c, sizeof(c), 1, fp );
fread( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
for (idest = 1; idest < nproc; idest++) {
istart1=gstart[idest];
icount1=gcount[idest];
itag=10;
MPI_Send ((void *)&b[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
itag=20;
MPI_Send ((void *)&c[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
itag=30;
MPI_Send ((void *)&d[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
}
}
else {
isrc=0;
itag=10;
MPI_Recv ((void *)&b[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
itag=20;
MPI_Recv ((void *)&c[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
itag=30;
MPI_Recv ((void *)&d[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
}
/*
Exchange data outside the territory
*/
itag=110;
MPI_Sendrecv((void *)&b[iend],
1, MPI_DOUBLE, r_nbr, itag,
(void *)&b[istartm1],1, MPI_DOUBLE, l_nbr, itag, comm, istat);
itag=120;
MPI_Sendrecv((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
(void *)&b[iendp1],1, MPI_DOUBLE, r_nbr, itag, comm, istat);

/*
Compute, gather and write out the computed result
*/
amax= -1.0e12;
for (i=istart2; i<=iend1; i++) {
a[i]=c[i]*d[i]+(b[i-1]+2.0*b[i]+b[i+1])*0.25;
amax=max(amax,a[i]);
}
itag=130;
if (myid > 0) {
    idest=0;
    MPI_Send((void *)&a[istart], icount, MPI_DOUBLE, idest, itag, comm);
}
else {
    for (isrc=1; isrc<nproc; isrc++) {
        istart1=gstart[isrc];
        icount1=gcount[isrc];
        MPI_Recv((void *)&a[istart1], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
    }
}
MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
amax=gmax;
if( myid == 0) {
for (i = 0; i < ntotal; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
}
printf ("MAXIMUM VALUE OF ARRAY A is %f\n", amax);
}
MPI_Finalize();
return 0;
}
double max(double a, double b)
{
if(a >= b)
return a;
else
return b; }

The output of program T3CP_1:

ATTENTION: 0031-408 4 nodes allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4	0	0	49
NPROC,MYID,ISTART,IEND=4	1	50	99
NPROC,MYID,ISTART,IEND=4	2	100	149
NPROC,MYID,ISTART,IEND=4	3	150	199
0.000	3.063	2.563	2.383	2.290	2.234	2.196	2.168
2.148	2.131	2.118	2.108	2.099	2.091	2.085	2.079
2.074	2.070	2.066	2.063	2.060	2.057	2.054	2.052
2.050	2.048	2.046	2.044	2.043	2.041	2.040	2.039
2.037	2.036	2.035	2.034	2.033	2.032	2.031	2.031
MAXIMUM VALUE OF ARRAY A is 5.750000


CPU0 MPI_Bcast
CPU

MPI_Bcast T3CP MPI_Send MPI_Recv


if ( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&b, sizeof(b), 1, fp );
fread( (void *)&c, sizeof(c), 1, fp );
fread( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
}
iroot=0;
MPI_Bcast( (void *)&b, ntotal, MPI_DOUBLE, iroot, comm);
MPI_Bcast( (void *)&c, ntotal, MPI_DOUBLE, iroot, comm);
MPI_Bcast( (void *)&d, ntotal, MPI_DOUBLE, iroot, comm);


3.4  Data partition with one boundary element exchanged -- T3DCP_1

With np CPUs each CPU now stores only n = ntotal/np elements of each array.  One extra
element is added on each side for the values exchanged with the neighbours, so the
local arrays are dimensioned [n+2] and the owned elements use the indices 1 to n:
double a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal], amax,gmax;
The data layout is shown in Fig. 3.3:

[Figure 3.3  Data partition with one boundary element on each side: every CPU owns the
 local indices 1..n; local index 0 receives the last owned element of the left
 neighbour and local index n+1 the first owned element of the right neighbour
 (mpi_proc_null at both ends).]
Each CPU's for loop therefore runs from 1 to n:

istart=1;
iend=n;

On CPU0 the computation starts at index 2, and on the last CPU it ends at index n-1:

istart2= istart;
if (myid == 0) istart2=2;
iend1= iend;
if (myid == nproc-1) iend1= iend - 1;

Each CPU sends its b[iend] to the right neighbour and receives the left neighbour's
b[iend] into its own b[istart-1]:

istartm1 = istart - 1;
itag=110;
MPI_Sendrecv ((void *)&b[iend],     1, MPI_DOUBLE, r_nbr, itag,
              (void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat);

Each CPU sends its b[istart] to the left neighbour and receives the right neighbour's
b[istart] into its own b[iend+1]:

iendp1 = iend+1;
itag=120;
MPI_Sendrecv ((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
              (void *)&b[iendp1], 1, MPI_DOUBLE, r_nbr, itag, comm, istat);


ntotal bcd t MPI_Scatter CPU b
cd dimension 1 MPI_Scatter
b[1]c[1]d[1]MPI_Gather
iroot=0;
MPI_Scatter (t,

n, MPI_DOUBLE,

MPI_Gather (a[1], n, MPI_DOUBLE,

b[1], n, MPI_DOUBLE, iroot, comm)


t,

n, MPI_DOUBLE, iroot, comm);

T3DCP_1 :
/*

PROGRAM T3DCP_1
Boundary data exchange with data & computing partition
Using MPI_Gather, MPI_Scatter to gather & scatter data

*/

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ntotal 200
#define n      50
#define np     4
main ( argc, argv)


int argc;
char **argv;
{
double amax, gmax, a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal];
int         i, j, k;
FILE        *fp;
int         nproc, myid, istart, iend, istart2, iend1, istartm1, iendp1;
int         r_nbr, l_nbr, lastp, iroot, itag;
MPI_Status  istat[8];
MPI_Comm    comm;
extern double max(double, double);
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm=MPI_COMM_WORLD;
istart=1;
iend=n;
lastp=nproc-1;
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
istartm1=istart-1;
iendp1=iend+1;
istart2=istart;
if(myid == 0) istart2=2;
iend1=iend;
if(myid == lastp ) iend1=iend-1;

l_nbr = myid - 1;
r_nbr = myid + 1;
if(myid == 0)
l_nbr=MPI_PROC_NULL;
if(myid == lastp) r_nbr=MPI_PROC_NULL;
/*

READ 'input.dat', and distribute input data

*/

if( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&t, sizeof(t), 1, fp );
}
iroot=0;
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b[1], n, MPI_DOUBLE, iroot, comm);
if( myid==0)
fread( (void *)&t, sizeof(t), 1, fp );
MPI_Scatter ((void *)&t, n, MPI_DOUBLE,( void *)&c[1], n, MPI_DOUBLE, iroot, comm);
if( myid==0) {
fread( (void *)&t, sizeof(t), 1, fp );
fclose( fp );
}
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&d[1], n, MPI_DOUBLE, iroot, comm);
/*
Exchange data outside the territory
*/
itag=110;
MPI_Sendrecv((void *)&b[iend],
1,MPI_DOUBLE, r_nbr, itag,
(void *)&b[istartm1], 1,MPI_DOUBLE, l_nbr, itag, comm, istat);
itag=120;
MPI_Sendrecv((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
(void *)&b[iendp1],1, MPI_DOUBLE, r_nbr, itag, comm, istat);
/*
Compute, gather and write out the computed result
*/
amax= -1.0e12;
for (i=istart2; i<=iend1; i++) {
a[i]=c[i]*d[i] + ( b[i-1] + 2.0*b[i] + b[i+1] )*0.25;
amax=max(amax,a[i]);

}
MPI_Gather((void *)&a[istart], n, MPI_DOUBLE,(void *)&t, n, MPI_DOUBLE,iroot, comm);
MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
amax=gmax;
if( myid == 0) {
for (i = 0; i < ntotal; i+=40) {

printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
t[i],t[i+5],t[i+10],t[i+15],t[i+20],t[i+25],t[i+30],t[i+35]);
}
printf ("MAXIMUM VALUE OF ARRAY A is %f\n", amax);
}
MPI_Finalize();
return 0;
}
double max(double a, double b)
{
if(a >= b)
return a;
else
return b;
}
The output of program T3DCP_1:

ATTENTION: 0031-408 4 nodes allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4	1	1	50
NPROC,MYID,ISTART,IEND=4	3	1	50
NPROC,MYID,ISTART,IEND=4	0	1	50
NPROC,MYID,ISTART,IEND=4	2	1	50
0.000	3.063	2.563	2.383	2.290	2.234	2.196	2.168
2.148	2.131	2.118	2.108	2.099	2.091	2.085	2.079
2.074	2.070	2.066	2.063	2.060	2.057	2.054	2.052
2.050	2.048	2.046	2.044	2.043	2.041	2.040	2.039
2.037	2.036	2.035	2.034	2.033	2.032	2.031	2.031
MAXIMUM VALUE OF ARRAY A is 5.750000


3.5  Data partition with two boundary elements exchanged -- T3DCP_2

Suppose the for loop of T3DCP_1 is changed so that every a[i] needs two neighbouring
elements of b on each side:

for (i=2; i<ntotal-2; i++)
    a[i]=c[i]*d[i]+( b[i-2] + 2.0*b[i-1] + 2.0*b[i] + 2.0*b[i+1] + b[i+2] )*0.125;

Then two elements have to be exchanged at each border, so the local arrays of T3DCP_1
get four extra elements, dimension [n+4], and the owned part uses the indices 2 to n+1,
as shown in Fig. 3.4:

double a[n+4], b[n+4], c[n+4], d[n+4], t[ntotal], amax, gmax;
istart = 2;
iend = n+1;

[Figure 3.4  Data partition with two boundary elements on each side: local indices 0
 and 1 receive the last two owned elements of the left neighbour, local indices n+2 and
 n+3 the first two owned elements of the right neighbour; indices 2..n+1 are owned
 (mpi_proc_null at both ends).]
The for loop must skip the first two and the last two global elements, so on CPU0 it
starts at istart3 and on the last CPU it ends at iend2:

istart3=istart;
if (myid == 0) istart3=4;
iend2= iend;
if (myid == nproc-1) iend2= iend - 2;

Each CPU sends its last two owned elements, b[iend-1] and b[iend], to the right
neighbour and receives the corresponding two elements of the left neighbour into
b[istart-2] and b[istart-1]:

iendm1=iend-1;
istartm2=istart-2;
itag = 110;
MPI_Sendrecv ((void *)&b[iendm1],   2, MPI_DOUBLE, r_nbr, itag,
              (void *)&b[istartm2], 2, MPI_DOUBLE, l_nbr, itag, comm, istat);

Each CPU sends its first two owned elements, b[istart] and b[istart+1], to the left
neighbour and receives the corresponding two elements of the right neighbour into
b[iend+1] and b[iend+2]:

iendp1=iend+1;
itag=120;
MPI_Sendrecv ((void *)&b[istart],   2, MPI_DOUBLE, l_nbr, itag,
              (void *)&b[iendp1],   2, MPI_DOUBLE, r_nbr, itag, comm, istat);

The complete program T3DCP_2 follows:
/*  PROGRAM T3DCP_2
    Two element of boundary data exchange with data & computing partition
    Using MPI_Gather, MPI_Scatter to gather & scatter data                  */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ntotal 200
#define n      50
#define np     4
main ( argc, argv)



int argc;
char **argv;
{
double amax, gmax, a[n+4], b[n+4], c[n+4], d[n+4], t[ntotal];
int         i, j, k;
FILE        *fp;
int         nproc, myid, istart, iend, istart3, iend2, istartm2, iendm1, iendp1;
int         r_nbr, l_nbr, lastp, iroot, itag;
MPI_Status  istat[8];
MPI_Comm    comm;
extern double max(double, double);
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm=MPI_COMM_WORLD;
istart=2;
iend=n+1;
lastp=nproc-1;
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
istartm2=istart-2;
iendp1=iend+1;
iendm1=iend-1;
istart3=istart;
if(myid == 0) istart3=4;
iend2=iend;
if(myid == lastp ) iend2=iend-2;
l_nbr = myid - 1;
r_nbr = myid + 1;
if(myid == 0)
l_nbr=MPI_PROC_NULL;
if(myid == lastp) r_nbr=MPI_PROC_NULL;
/*

READ 'input.dat', and distribute input data

*/


if ( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&t, sizeof(t), 1, fp );
}
iroot=0;
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b[2], n, MPI_DOUBLE, iroot, comm);
if( myid==0)
fread( (void *)&t, sizeof(t), 1, fp );
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&c[2], n, MPI_DOUBLE, iroot, comm);
if ( myid==0) {
fread( (void *)&t, sizeof(t), 1, fp );
fclose( fp );
}
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&d[2], n, MPI_DOUBLE, iroot, comm);
/*
Exchange data outside the territory
*/
itag=110;
MPI_Sendrecv((void *)&b[iendm1], 2, MPI_DOUBLE, r_nbr, itag,
(void *)&b[istartm2], 2, MPI_DOUBLE, l_nbr, itag, comm, istat);
itag=120;
MPI_Sendrecv((void *)&b[istart], 2, MPI_DOUBLE, l_nbr, itag,
(void *)&b[iendp1], 2, MPI_DOUBLE, r_nbr, itag, comm, istat);
/*  Compute, gather and write out the computed result   */
amax= -1.0e12;
for (i=istart3; i<=iend2; i++) {
a[i]=c[i]*d[i] + ( b[i-2] + 2.0*b[i-1] + 2.0*b[i] + 2.0*b[i+1] + b[i+2] )*0.125;
amax=max(amax,a[i]);
}
MPI_Gather((void *)&a[istart], n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, iroot, comm);
MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
amax=gmax;
if( myid == 0) {
for (i = 0; i < ntotal; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
t[i],t[i+5],t[i+10],t[i+15],t[i+20],t[i+25],t[i+30],t[i+35]);

}
printf ("MAXIMUM VALUE OF ARRAY A is %f\n", amax);
}
MPI_Finalize();
return 0;
}
double max(double a, double b)
{
if(a >= b)
return a;
else
return b;
}
The output of program T3DCP_2:

ATTENTION: 0031-408 4 nodes allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4	0	2	51
NPROC,MYID,ISTART,IEND=4	1	2	51
NPROC,MYID,ISTART,IEND=4	3	2	51
NPROC,MYID,ISTART,IEND=4	2	2	51
0.000	3.078	2.565	2.384	2.291	2.234	2.196	2.168
2.148	2.131	2.118	2.108	2.099	2.091	2.085	2.079
2.074	2.070	2.066	2.063	2.060	2.057	2.054	2.052
2.050	2.048	2.046	2.044	2.043	2.041	2.040	2.039
2.037	2.036	2.035	2.034	2.033	2.032	2.031	2.031
MAXIMUM VALUE OF ARRAY A is 4.484722



This chapter deals with arrays whose number of grid points is not divisible by the
number of CPUs.

4.1 presents the sequential program T4SEQ, whose arrays have 161 elements
    (161 = 7 x 23).
4.2 introduces MPI_Scatterv and MPI_Gatherv, the collective functions that, unlike
    MPI_Scatter and MPI_Gather, allow a different number of elements per CPU.
4.3 introduces MPI_Pack, MPI_Unpack, MPI_Barrier and MPI_Wtime.
4.4 uses these MPI functions to parallelize T4SEQ as program T4DCP.

4.1  The sequential program T4SEQ

The arrays a, b, c, d of T4SEQ have 161 elements (161 = 7 x 23).  Besides the arrays
there are the scalars p, q, r, which are written to the input file together with the
arrays and which every CPU will need as initial values:
/*

PROGRAM T4SEQ
Sequential Version of an odd-dimensioned array with -1, +1 access

*/
#include <stdio.h>
#include <stdlib.h>
#define ntotal 161
main ()
{
double  a[ntotal], b[ntotal], c[ntotal], d[ntotal], p, q, r, pqr[3];
int     i, j;
FILE    *fp;
extern double max(double, double);
/*

READ 'input.dat', COMPUTE AND WRITE OUT THE RESULT */


for (i = 0; i < ntotal; i++) {
b[i]=3.0/(double)(i+1)+1.0;
c[i]=2.0/(double)(i+1)+1.0;
d[i]=1.0/(double)(i+1)+1.0;
}
p=1.45;
q=2.62;
r=0.5;
pqr[0]=p;
pqr[1]=q;
pqr[2]=r;
fp = fopen( "input.dat", "w");
fwrite((void *)&b, sizeof(b), 1, fp );
fwrite((void *)&c, sizeof(c), 1, fp );

fwrite((void *)&d, sizeof(d), 1, fp );


fwrite((void *)&pqr, sizeof(pqr), 1, fp );
fclose( fp );
fp = fopen( "input.dat", "r");
fread( (void *)&b, sizeof(b), 1, fp );
fread( (void *)&c, sizeof(c), 1, fp );
fread( (void *)&d, sizeof(d), 1, fp );
fread( (void *)&pqr, sizeof(pqr), 1, fp );
fclose( fp );
p=pqr[0];
q=pqr[1];
r=pqr[2];
for (i = 1; i < ntotal-1; i++) {
a[i]=c[i]*d[i]*p+(b[i-1]+2.0*b[i]+b[i+1])*q+r;
}
for (i = 0; i < ntotal-1; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
}
return 0;
}

The output of program T4SEQ:

0.000	18.550	15.720	14.682	14.143	13.812	13.588	13.427
13.305	13.210	13.133	13.070	13.018	12.973	12.935	12.901
12.872	12.847	12.824	12.803	12.785	12.768	12.753	12.739
12.726	12.714	12.703	12.693	12.684	12.675	12.667	12.660


4.2  MPI_Scatterv, MPI_Gatherv

As mentioned in 4.1, the arrays of T4SEQ have 161 elements, and 161 is not divisible by
2, 4, 6 or 8 CPUs.  MPI_Scatter and MPI_Gather send the same number of elements to
every CPU, so they cannot be used here (one could fall back on MPI_Send and MPI_Recv
for every CPU).  MPI provides MPI_Scatterv and MPI_Gatherv for exactly this situation:
they work like MPI_Scatter and MPI_Gather but take an array of counts and an array of
displacements, so that each CPU can receive or contribute a different number of
elements.  MPI_Scatterv is called as:
MPI_Scatterv ((void *)&t, gcount, gdisp, MPI_DOUBLE,
              (void *)&c[1], mycount, MPI_DOUBLE, iroot, comm);

MPI_Scatterv distributes the array t of CPU iroot so that the CPU with rank i receives
gcount[i] elements of t, starting at offset gdisp[i]:

t           the array to be distributed (significant only on CPU iroot)
gcount      integer array: gcount[i] is the number of elements sent to CPU i
gdisp       integer array: gdisp[i] is the offset in t of the first element for CPU i
MPI_DOUBLE  data type of the elements sent
c[1]        receive buffer of the calling CPU
mycount     number of elements received by the calling CPU
MPI_DOUBLE  data type of the elements received
iroot       rank (CPU id) of the CPU that owns t
The arrays gcount and gdisp are obtained from the index ranges computed by startend
(gdisp[i] is gstart[i] shifted to a zero-based offset).  The relevant declarations and
calls are:

double      a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal];
int         nproc, myid, mycount, istart, iend, l_nbr, r_nbr;
int         gcount[np], gdisp[np], gstart[np], gend[np];
MPI_Status  istat[8];

startend (nproc, 1, ntotal, gstart, gend, gcount);
mycount = gcount[myid];

The arrays a, b, c, d are dimensioned [n+2] because, as in chapter 3, one boundary
element is exchanged on each side; the receive buffers of MPI_Scatterv are therefore
b[1], c[1], d[1], and the displacements gdisp give the offset of each CPU's block in
the buffer t.
MPI_Gatherv is called as:

MPI_Gatherv ((void *)&a[1], mycount, MPI_DOUBLE,
             (void *)&t, gcount, gdisp, MPI_DOUBLE, iroot, comm);

MPI_Gatherv is the inverse of MPI_Scatterv: every CPU (including iroot) sends mycount
elements starting at a[1], and CPU iroot stores the gcount[i] elements of CPU i at
offset gdisp[i] of t:

a[1]        starting address of the data sent by each CPU
mycount     number of elements sent by the calling CPU
MPI_DOUBLE  data type of the elements sent
t           the array that collects the data (significant only on CPU iroot)
gcount      integer array: gcount[i] is the number of elements received from CPU i
gdisp       integer array: gdisp[i] is the offset in t where CPU i's data is placed
MPI_DOUBLE  data type of the elements received
iroot       rank (CPU id) of the CPU that collects the data

4.3  MPI_Pack, MPI_Unpack, MPI_Barrier, MPI_Wtime

Besides the arrays a, b, c, d distributed with MPI_Scatterv and MPI_Gatherv, T4SEQ has
the scalars p, q, r, which every CPU needs.  They could be broadcast one by one with
MPI_Bcast, but each call transfers only one variable and every transfer costs start-up
time, so it is cheaper to combine them into a single message.  MPI_Pack copies
noncontiguous data into contiguous memory locations (a buffer area, usually a character
array) so that it can be sent as one message, and MPI_Unpack extracts the data again on
the receiving side.

This section also introduces the collective function MPI_Barrier, which synchronizes
all CPUs of a communicator, and the MPI function MPI_Wtime, which returns the wall
clock time and is used to measure elapsed time.
4.3.1  MPI_Pack, MPI_Unpack

In T4DCP the scalars p, q, r of T4SEQ are packed into one buffer on CPU0 with MPI_Pack,
the buffer is broadcast, and the other CPUs unpack it with MPI_Unpack.  Since p, q, r
are three 4-byte floating point numbers, 12 bytes are enough, and the buffer buf1 is
declared as a character array of 12 bytes:

#define bufsize 12
char buf1[bufsize];

MPI_Pack is called as:

MPI_Pack ((void *)&p, 1, MPI_FLOAT, (void *)&buf1, bufsize, &ipos, comm);
p          address of the variable to be packed
1          number of elements to pack
MPI_FLOAT  data type of the element packed
buf1       the buffer that receives the packed data
bufsize    size of buf1 in bytes
ipos       current position (in bytes) in buf1; it is updated by the call
comm       the communicator

Because ipos is advanced by every call, successive MPI_Pack calls append their data to
buf1.  On CPU0 the three scalars p, q, r are packed into buf1 like this:
if (myid == 0) {
scanf ("%f %f %f", &p, &q, &r);
ipos = 0;
MPI_Pack ((void *)&p, 1, MPI_ FLOAT, (void *)&buf1, bufsize, &ipos, comm);
MPI_Pack ((void *)&q, 1, MPI_ FLOAT, (void *)&buf1, bufsize, &ipos, comm);
MPI_Pack ((void *)&r, 1, MPI_ FLOAT, (void *)&buf1, bufsize, &ipos, comm);
}
buf1 is then broadcast to every CPU:

iroot=0;
MPI_Bcast ((void *)&buf1, bufsize, MPI_CHAR, iroot, comm);
MPI_Unpack is called as:

MPI_Unpack ((void *)&buf1, bufsize, &ipos, (void *)&p, 1, MPI_FLOAT, comm);

buf1       the buffer holding the packed data
bufsize    size of buf1 in bytes
ipos       current position (in bytes) in buf1; it is updated by the call
p          address of the variable to be unpacked
1          number of elements to unpack
MPI_FLOAT  data type of the element unpacked

After the broadcast the CPUs other than CPU0 unpack p, q, r from buf1 in the same order
in which they were packed:

if (myid > 0) {
    ipos=0;
    MPI_Unpack ((void *)&buf1, bufsize, &ipos, (void *)&p, 1, MPI_FLOAT, comm);
    MPI_Unpack ((void *)&buf1, bufsize, &ipos, (void *)&q, 1, MPI_FLOAT, comm);
    MPI_Unpack ((void *)&buf1, bufsize, &ipos, (void *)&r, 1, MPI_FLOAT, comm);
}
For only three variables of the same type, the packing can also be done by hand: copy
p, q, r into a small float array, broadcast the array, and copy the values back:

float p, q, r, buf1[3];

if (myid == 0) {
    scanf ("%f %f %f", &p, &q, &r);
    buf1[0]=p;
    buf1[1]=q;
    buf1[2]=r;
}
iroot=0;
MPI_Bcast ((void *)&buf1, 3, MPI_FLOAT, iroot, comm);
if (myid > 0) {
    p = buf1[0];
    q = buf1[1];
    r = buf1[2];
}
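For reference, a complete round trip of the MPI_Pack / MPI_Bcast / MPI_Unpack calls
described above might look like the following sketch.  It keeps the 12-byte buffer of
the text (a fully portable program would size the buffer with MPI_Pack_size) and
broadcasts the buffer as MPI_PACKED, which serves the same purpose as the MPI_CHAR used
in the text:

/* packbcast.c -- pack p,q,r on rank 0, broadcast, unpack elsewhere */
#include <stdio.h>
#include <mpi.h>

#define BUFSIZE 12                     /* assumed: 3 packed 4-byte floats */

int main(int argc, char **argv)
{
    int   myid, ipos, iroot = 0;
    float p = 0.0f, q = 0.0f, r = 0.0f;
    char  buf1[BUFSIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        p = 1.45f;  q = 2.62f;  r = 0.5f;
        ipos = 0;
        MPI_Pack((void *)&p, 1, MPI_FLOAT, (void *)buf1, BUFSIZE, &ipos, MPI_COMM_WORLD);
        MPI_Pack((void *)&q, 1, MPI_FLOAT, (void *)buf1, BUFSIZE, &ipos, MPI_COMM_WORLD);
        MPI_Pack((void *)&r, 1, MPI_FLOAT, (void *)buf1, BUFSIZE, &ipos, MPI_COMM_WORLD);
    }
    MPI_Bcast((void *)buf1, BUFSIZE, MPI_PACKED, iroot, MPI_COMM_WORLD);
    if (myid > 0) {
        ipos = 0;
        MPI_Unpack((void *)buf1, BUFSIZE, &ipos, (void *)&p, 1, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack((void *)buf1, BUFSIZE, &ipos, (void *)&q, 1, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack((void *)buf1, BUFSIZE, &ipos, (void *)&r, 1, MPI_FLOAT, MPI_COMM_WORLD);
    }
    printf("rank %d: p=%.2f q=%.2f r=%.2f\n", myid, p, q, r);
    MPI_Finalize();
    return 0;
}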

4.3.2  MPI_Barrier, MPI_Wtime

MPI_Barrier is a collective call that synchronizes all CPUs of the communicator: every
CPU waits inside MPI_Barrier until all CPUs have called it, and only then do they all
continue.  It is called as:

MPI_Barrier (MPI_COMM_WORLD);

The wall clock time is obtained from the MPI function MPI_Wtime:

time1=MPI_Wtime();

time1 must be declared double.  Because the CPUs of a parallel job are not started at
exactly the same moment, the timing should be taken between MPI_Init and MPI_Finalize
and preceded by an MPI_Barrier, so that all CPUs start the clock together:

MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
MPI_Barrier (MPI_COMM_WORLD);
time1=MPI_Wtime();
...
time2=MPI_Wtime() - time1;
printf ("myid, clock time= %d\t%f\n", myid, time2);
MPI_Finalize();
return 0;

Why call MPI_Barrier before taking time1?  The job scheduler starts the executable file
on the different CPUs one after another, so without the barrier some CPUs would start
timing earlier than others and the measured times would differ from CPU to CPU.

4.4  T4DCP

T4DCP is the data-partitioned version of T4SEQ.  The arrays have ntotal = 161 elements;
with 4 CPUs startend assigns 41, 40, 40 and 40 elements.  The local array length n is
therefore defined as ntotal/np + 1:

#define ntotal   161
#define np       4
#define n        41

The local arrays get two extra elements for the boundary exchange, dimension (n+2),
with the owned part at the indices 1 to n:

double  a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal];

Since ntotal is not a multiple of np, some CPUs own n (=41) elements and the others
n-1 (=40).  startend provides the counts gcount used by MPI_Scatterv / MPI_Gatherv, and
the displacements gdisp into t are derived from gstart:

int gcount[np], gstart[np], gend[np];

startend (nproc, 1, ntotal, gstart, gend, gcount);

Each CPU's element count and local loop limits are:

mycount=gcount[myid];
istart=1;
iend=mycount;

The boundary data exchange of each CPU is shown in Fig. 4.1:

[Figure 4.1  Boundary data exchange in T4DCP: every CPU owns the local indices
 istart..iend; istart-1 receives data from the left neighbour and iend+1 from the right
 neighbour (mpi_proc_null at both ends).]
The for loop of the computation,

for (i=1; i<ntotal-1; i++)
    a[i]=c[i]*d[i] + ( b[i-1] + 2.0*b[i] + b[i+1] )*0.25;

is executed by each CPU over its own local index range, adjusted at the two ends:

istart2= istart;
if (myid == 0) istart2=2;
iend1= iend;
if (myid == nproc-1) iend1= iend - 1;

The complete program T4DCP follows:
/*

PROGRAM T4DCP
Boundary data exchange with data & computing partition
Using MPI_Gatherv, MPI_Scatterv to gather & scatter data

*/
#include <stdio.h>

#include <stdlib.h>
#include <mpi.h>
#define ntotal 161
#define n 41
#define np 4
main ( argc, argv)
int argc;
char **argv;
{
double      p, q, r, pqr[3], a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal], clock;
int         i, j, k;
FILE        *fp;
int         nproc, myid, istart, iend, istart2, iend1, istartm1, iendp1;
int         r_nbr, l_nbr, lastp, iroot, itag, icount;
int         gstart[16], gend[16], gcount[16], gdisp[16];
MPI_Status  istat[8];
MPI_Comm    comm;

MPI_Init (&argc, &argv);


MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
MPI_Barrier(MPI_COMM_WORLD);
clock=MPI_Wtime();
startend (nproc, 1, ntotal, gstart, gend, gcount);
for (i = 0; i < nproc; i++) {
gdisp [i] = gstart[i]-1;
}
comm=MPI_COMM_WORLD;
istart=1;
iend=gend[myid];
icount=gcount[myid];
lastp=nproc-1;
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
istartm1=istart-1;
iendp1=iend+1;

istart2=istart;
if(myid == 0) istart2=2;
iend1=iend;
if(myid == lastp ) iend1=iend-1;
l_nbr = myid - 1;
r_nbr = myid + 1;
if(myid == 0) l_nbr=MPI_PROC_NULL;
if(myid == lastp) r_nbr=MPI_PROC_NULL;
/*

READ 'input.dat', and distribute input data

*/

if( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&t, sizeof(t), 1, fp );
}
iroot=0;
MPI_Scatterv ((void *)&t, gcount, gdisp, MPI_DOUBLE,
(void *)&b[1], icount,MPI_DOUBLE, iroot, comm);
if( myid==0)
fread( (void *)&t, sizeof(t), 1, fp );
MPI_Scatterv ((void *)&t, gcount, gdisp, MPI_DOUBLE,
(void *)&c[1], icount, MPI_DOUBLE, iroot, comm);
if( myid==0) {
fread( (void *)&t, sizeof(t), 1, fp );
fread( (void *)&pqr, sizeof(pqr), 1, fp );
fclose( fp );
}
MPI_Scatterv ((void *)&t, gcount, gdisp, MPI_DOUBLE,
(void *)&d[1], icount, MPI_DOUBLE, iroot, comm);
MPI_Bcast ((void *)&pqr, 3, MPI_DOUBLE, 0, comm);
p=pqr[0];
q=pqr[1];
r=pqr[2];
/*
Exchange data outside the territory
*/

itag=110;
MPI_Sendrecv((void *)&b[iend],     1, MPI_DOUBLE, r_nbr, itag,
             (void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat);

itag=120;
MPI_Sendrecv((void *)&b[istart], 1,MPI_DOUBLE, l_nbr, itag,
(void *)&b[iendp1],1,MPI_DOUBLE, r_nbr, itag, comm, istat);
/*  Compute, gather and write out the computed result  */

for (i=istart2; i<=iend1; i++) {
a[i]=c[i]*d[i]*p + ( b[i-1] + 2.0*b[i] + b[i+1] )*q + r;
}
MPI_Gatherv ((void *)&a[istart], icount, MPI_DOUBLE,
(void *)&t, gcount, gdisp, MPI_DOUBLE, iroot, comm);
if( myid == 0) {
for (i = 0; i < ntotal-1; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
t[i],t[i+5],t[i+10],t[i+15],t[i+20],t[i+25],t[i+30],t[i+35]);
}
}
clock=MPI_Wtime() - clock;
printf( "myid, clock time= %d\t%.3f\n", myid, clock);
MPI_Finalize();
return 0;
}
startend(int nproc, int is1, int is2, int gstart[16], int gend[16], int gcount[16])
{
int     i, ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
for ( i=0; i < nproc; i++ ) {
if(i < ir) {
gstart[i]=is1+i*(iblock+1);
gend[i]=gstart[i]+iblock;
}
else {
gstart[i]=is1+i*iblock+ir;

gend[i]=gstart[i]+iblock-1;
}
if(ilength < 1) {
gstart[i]=1;
gend[i]=0;
}
gcount[i]=gend[i]-gstart[i] + 1;
}
}
The output of program T4DCP:

ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4	0	1	41
NPROC,MYID,ISTART,IEND=4	1	1	40
NPROC,MYID,ISTART,IEND=4	2	1	40
NPROC,MYID,ISTART,IEND=4	3	1	40
0.000	18.550	15.720	14.682	14.143	13.812	13.588	13.427
13.305	13.210	13.133	13.070	13.018	12.973	12.935	12.901
12.872	12.847	12.824	12.803	12.785	12.768	12.753	12.739
12.726	12.714	12.703	12.693	12.684	12.675	12.667	12.660
myid, clock time= 0	0.002
myid, clock time= 1	0.002
myid, clock time= 2	0.002
myid, clock time= 3	0.002

The elapsed time is roughly the same on every CPU.


This chapter deals with multi-dimensional arrays.

5.1 presents the sequential program T5SEQ.
5.2 parallelizes it without data partition as program T5CP.
5.3 is the data-partitioned version T5DCP.
5.4 introduces further MPI facilities (Cartesian topology, derived data types).
5.5 partitions the arrays in two dimensions in program T5_2D.

5.1  The sequential program T5SEQ

T5SEQ works on two- and three-dimensional arrays; part of them are declared as global
variables and the rest as local variables, and the first part of the program generates
the test data in place instead of reading a file:
/*

PROGRAM T5SEQ
Sequential version of multiple dimensional array with -1,+1 data access

*/
#include <stdio.h>
#include <stdlib.h>
#define kk 20
#define km 3
#define mm 160
#define nn  120
double f1[mm][nn][km], f2[mm][nn][km], hxu[mm][nn], hxv[mm][nn],
hmmx[mm][nn], hmmy[mm][nn];
double vecinv[kk][kk], am7[kk];
main ()
{
double u1[mm][nn][kk], v1[mm][nn][kk], ps1[mm][nn];
double d7[mm][nn], d8[mm][nn], d00[mm][nn][kk];
double clock, sumf1, sumf2;
int     i, j, k, ka, isec1, isec2, nsec1, nsec2;
/*

Test data generation


*/
wtime(&isec1, &nsec1);
for (i=0; i<mm; i++)
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)
u1[i][j][k]=1.0/(double) (i+1) + 1.0/(double) (j+1) + 1.0/(double) (k+1);
for (i=0; i<mm; i++)
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)

v1[i][j][k]=2.0/(double) (i+1) + 1.0/(double) (j+1) + 1.0/(double) (k+1);


for (i=0; i<mm; i++) {
for (j=0; j<nn; j++) {
ps1[i][j] = 1.0/(double)(i+1) + 1.0/(double)(j+1);
hxu[i][j] = 2.0/(double)(i+1) + 1.0/(double)(j+1);
hxv[i][j] = 1.0/(double)(i+1) + 2.0/(double)(j+1);
hmmx[i][j] = 2.0/(double)(i+1) + 1.0/(double)(j+1);
hmmy[i][j] = 1.0/(double)(i+1) + 2.0/(double)(j+1);
}
}
for (k=0; k<kk; k++) {
am7[k]=1.0/(double) (k+1);
for (ka=0; ka<kk; ka++) {
vecinv[k][ka]=1.0/(double) (ka+1) + 1.0/(double) (k+1);
}
}
/*
Start the computation
*/
for (i=0; i<mm; i++) {
    for (j=0; j<nn; j++) {
        for (k=0; k<km; k++) {
            f1[i][j][k]=0.0;
            f2[i][j][k]=0.0;
        }
    }
}
for (i=0; i<mm-1; i++)
for (j=1; j<nn-1; j++)
d7[i][j] = ( ps1[i+1][j]+ps1[i][j] )*0.50*hxu[i][j];
for (i=1; i<mm-1; i++)
for (j=0; j<nn-1; j++)
d8[i][j] = ( ps1[i][j+1]+ps1[i][j] )*0.50*hxv[i][j];
for (i=1; i<mm-1; i++)

for (j=1; j<nn-1; j++)


for (k=0; k<kk; k++)
d00[i][j][k]=(d7[i][j]*u1[i][j][k]-d7[i-1][j]*u1[i-1][j][k])*hmmx[i][j]
+(d8[i][j]*v1[i][j][k]-d8[i][j-1]*v1[i][j-1][k])*hmmy[i][j];
for (i=1; i<mm-1; i++)
for (ka=0; ka<kk; ka++)
for (j=1; j<nn-1; j++)
for (k=0; k<km; k++)
f1[i][j][k]=f1[i][j][k]-vecinv[ka][k]*d00[i][j][ka];
sumf1=0.0;
sumf2=0.0;
for (i=1; i<mm-1; i++) {
for (j=1; j<nn-1; j++) {
for (k=0; k<km; k++) {
f2[i][j][k]=-am7[k]*ps1[i][j];
sumf1 +=f1[i][j][k];
sumf2 +=f2[i][j][k];
}
}
}
/*

Output data for validation

*/

printf( "SUMF1,SUMF2= %.5f\t%.5f\n", sumf1, sumf2 );


printf( " F2[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
f2[i][1][1],f2[i+5][1][1],f2[i+10][1][1],f2[i+15][1][1],
f2[i+20][1][1],f2[i+25][1][1],f2[i+30][1][1],f2[i+35][1][1]);
}
wtime(&isec2, &nsec2);
clock=(double) (isec2-isec1) + (double) (nsec2-nsec1)/1.0e9;
printf( " clock time = %f\n", clock);
return 0;
}
#include <sys/time.h>
int wtime(int *isec, int *nsec)

{
struct timestruc_t tb;
int iret;
iret=gettimer(TIMEOFDAY, &tb);
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
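The wtime routine above uses the AIX-specific gettimer call; on other systems a
portable substitute based on gettimeofday can be used (a sketch only; it fills the same
isec/nsec pair):

/* portable wtime() substitute based on gettimeofday() */
#include <sys/time.h>

int wtime(int *isec, int *nsec)
{
    struct timeval tv;

    gettimeofday(&tv, (void *)0);
    *isec = (int) tv.tv_sec;
    *nsec = (int) tv.tv_usec * 1000;   /* microseconds -> nanoseconds */
    return 0;
}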
The output of T5SEQ on the IBM SP2 SMP:

SUMF1,SUMF2= 26172.46054	-2268.89180
F2[i][1][1],i=0,159,5
0.000	-0.333	-0.295	-0.281	-0.274	-0.269	-0.266	-0.264
-0.262	-0.261	-0.260	-0.259	-0.258	-0.258	-0.257	-0.257
-0.256	-0.256	-0.255	-0.255	-0.255	-0.255	-0.255	-0.254
-0.254	-0.254	-0.254	-0.254	-0.254	-0.253	-0.253	-0.253
clock time = 0.090299


5.2  Computation partition of multi-dimensional arrays -- T5CP

In C the last (rightmost) index of an array varies fastest in memory, so a
multi-dimensional array is partitioned on its first (leftmost) dimension: all elements
with the same first index are then contiguous and can be sent in a single message.
When the computation accesses index i-1 and i+1 of the partitioned dimension, the slabs
just outside each CPU's territory must be exchanged with the neighbouring CPUs, exactly
as for the one-dimensional arrays:
istart-1
| istart
| |

istart-1
| istart
| |
|
|
istart-1
| istart
| |

| |
| iend+1
iend

|
| |
istart
|

nn |
. |
.
j=1

P0
5.1
#define kk

| |
| iend+1
iend
P1

|
iend

| |
| iend+1
iend

P2

ps1(i,j)

P3

ps1(mm,nn)

20

#define km 3
#define mm 160
#define nn
120
double

u1[mm][nn][kk], v1[mm][nn][kk], ps1[mm][nn];


77

itag = 20;
MPI_Sendrecv ((void *)&ps1[istart][0], nn, MPI_DOUBLE, l_nbr, itag,
              (void *)&ps1[iendp1][0], nn, MPI_DOUBLE, r_nbr, itag, comm, istat);

(Figure 5.2: partition of u1(mm,nn,kk) on the first dimension over processes P0-P3; each process holds m = mm/np planes of nn*kk elements.)

Since the computation of d00 uses u1 at index i-1, each process sends its last plane u1[iend] to the right neighbor and receives the plane u1[istartm1] from the left neighbor:
nnkk = nn*kk;
itag = 10;
MPI_Sendrecv ((void *)&u1[iend][0][0],     nnkk, MPI_DOUBLE, r_nbr, itag,
              (void *)&u1[istartm1][0][0], nnkk, MPI_DOUBLE, l_nbr, itag, comm, istat);
The MPI bookkeeping variables nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp, iroot, itag, isrc, idest, istart1, icount1, istart2, iend1, istartm1 and iendp1 are used throughout the program; in the Fortran version of these examples they would be kept in a COMMON area.

The T5CP program is listed below:
/*

PROGRAM T5CP
Computing partition on the first dimension of multiple dimensional
array with -1,+1 data exchange without data partition

*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define kk 20
#define km 3
#define mm 160
#define nn 120
double f1[mm][nn][km], f2[mm][nn][km], hxu[mm][nn], hxv[mm][nn],
hmmx[mm][nn], hmmy[mm][nn];
double vecinv[kk][kk], am7[kk];
main ( argc, argv)
int argc;
char **argv;
{
double u1[mm][nn][kk], v1[mm][nn][kk], ps1[mm][nn];
double d7[mm][nn], d8[mm][nn], d00[mm][nn][kk];
double clock, sumf1, sumf2, gsumf1, gsumf2;
int          i, j, k, ka, nnkk;
int          nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp, iroot;
int          itag, isrc, idest, istart1, icount1, istart2, iend1, istartm1, iendp1;
int          gstart[16], gend[16], gcount[16];
MPI_Status   istat[8];
MPI_Comm     comm;
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm=MPI_COMM_WORLD;
MPI_Barrier(comm);
clock=MPI_Wtime();
startend (nproc, 0, mm-1, gstart, gend, gcount);
istart=gstart[myid];
iend=gend[myid];
icount=gcount[myid];
lastp=nproc-1;
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
istartm1 = istart-1;
iendp1 = iend+1;
istart2 = istart;
if (myid == 0) istart2 = 1;
iend1 = iend;
if (myid == lastp ) iend1 = iend-1;
l_nbr = myid - 1;
r_nbr = myid + 1;
if (myid == 0) l_nbr = MPI_PROC_NULL;
if (myid == lastp) r_nbr = MPI_PROC_NULL;
/*

Test data generation

*/

/* for (i=0; i<mm; i++) */


for (i=istart; i<=iend; i++)
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)
u1[i][j][k]=1.0/(double) (i+1) + 1.0/(double) (j+1) + 1.0/(double) (k+1);
/* for (i=0; i<mm; i++) */
for (i=istart; i<=iend; i++)
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)
v1[i][j][k]=2.0/(double) (i+1) + 1.0/(double) (j+1) + 1.0/(double) (k+1);
for (i=0; i<mm; i++) {
for (j=0; j<nn; j++) {
ps1[i][j] = 1.0/(double)(i+1) + 1.0/(double)(j+1);
hxu[i][j] = 2.0/(double)(i+1) + 1.0/(double)(j+1);
hxv[i][j] = 1.0/(double)(i+1) + 2.0/(double)(j+1);
hmmx[i][j] = 2.0/(double)(i+1) + 1.0/(double)(j+1);
hmmy[i][j] = 1.0/(double)(i+1) + 2.0/(double)(j+1);
}
}
for (k=0; k<kk; k++) {
am7[k] = 1.0/(double) (k+1);
for (ka=0; ka<kk; ka++) {
vecinv[k][ka] = 1.0/(double) (ka+1) + 1.0/(double) (k+1);
}
}
/*
   Start the computation
*/
nnkk = nn*kk;
itag = 10;
MPI_Sendrecv ((void *)&u1[iend][0][0],     nnkk, MPI_DOUBLE, r_nbr, itag,
              (void *)&u1[istartm1][0][0], nnkk, MPI_DOUBLE, l_nbr, itag, comm, istat);
itag = 20;
MPI_Sendrecv ((void *)&ps1[istart][0],  nn, MPI_DOUBLE, l_nbr, itag,
              (void *)&ps1[iendp1][0],  nn, MPI_DOUBLE, r_nbr, itag, comm, istat);
/* for (i=0; i<mm; i++) { */
for (i=istart; i<=iend; i++) {
for (j=0; j<nn; j++) {
for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}
}

/* for (i=0; i<mm-1; i++) */
for (i=istart; i<=iend1; i++)
for (j=1; j<nn-1; j++)
d7[i][j] = ( ps1[i+1][j]+ps1[i][j] )*0.50*hxu[i][j];

/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)
for (j=0; j<nn-1; j++)
d8[i][j] = ( ps1[i][j+1]+ps1[i][j] )*0.50*hxv[i][j];
itag=30;
MPI_Sendrecv ((void *)&d7[iend][0],
nn,MPI_DOUBLE,r_nbr,itag,
(void *)&d7[istartm1][0],nn,MPI_DOUBLE,l_nbr,itag,
comm, istat);

/* for (i=1; i<mm-1; i++) */


for (i=istart2; i<=iend1; i++)
for (j=1; j<nn-1; j++)
for (k=0; k<kk; k++)
d00[i][j][k]=(d7[i][j]*u1[i][j][k]-d7[i-1][j]*u1[i-1][j][k])*hmmx[i][j]
+(d8[i][j]*v1[i][j][k]-d8[i][j-1]*v1[i][j-1][k])*hmmy[i][j];
/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)
for (ka=0; ka<kk; ka++)
for (j=1; j<nn-1; j++)
for (k=0; k<km; k++)
f1[i][j][k]=f1[i][j][k]-vecinv[ka][k]*d00[i][j][ka];
sumf1=0.0;
sumf2=0.0;
/* for (i=1; i<mm-1; i++) { */
for (i=istart2; i<=iend1; i++) {
for (j=1; j<nn-1; j++) {
for (k=0; k<km; k++) {
f2[i][j][k]=-am7[k]*ps1[i][j];
sumf1 +=f1[i][j][k];
sumf2 +=f2[i][j][k];
}
}
}
/*
   Output data for validation
*/
iroot=0;
MPI_Reduce ((void *)&sumf1,(void *)&gsumf1, 1, MPI_DOUBLE, MPI_SUM, iroot, comm);
MPI_Reduce ((void *)&sumf2,(void *)&gsumf2, 1, MPI_DOUBLE,
MPI_SUM, iroot, comm);
itag=40;
if (myid != 0) {
icount1 = icount*nn*km;
MPI_Send ((void *)&f2[istart][0][0], icount1, MPI_DOUBLE, iroot, itag, comm);
}
else {
for (isrc=1; isrc<nproc; isrc++) {
istart1 = gstart[isrc];
icount1 = gcount[isrc]*nn*km;
MPI_Recv ((void *)&f2[istart1][0][0], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
}
}
if (myid == 0) {
printf( "SUMF1,SUMF2= %.5f\t%.5f\n", gsumf1, gsumf2 );
printf( " F2[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
f2[i][1][1],f2[i+5][1][1],f2[i+10][1][1],f2[i+15][1][1],
f2[i+20][1][1],f2[i+25][1][1],f2[i+30][1][1],f2[i+35][1][1]);
}
}
clock=MPI_Wtime() - clock;
printf( " myid, clock time = %d\t%.5f\n", myid, clock);
MPI_Finalize();
return 0;
}

The output of T5CP on four CPUs is:
NPROC,MYID,ISTART,IEND=4	0	0	39
NPROC,MYID,ISTART,IEND=4	1	40	79
NPROC,MYID,ISTART,IEND=4	2	80	119
NPROC,MYID,ISTART,IEND=4	3	120	159
SUMF1,SUMF2= 26172.46054	-2268.89180
 F2[i][1][1],i=0,159,5
0.000	-0.333	-0.295	-0.281	-0.274	-0.269	-0.266	-0.264
-0.262	-0.261	-0.260	-0.259	-0.258	-0.258	-0.257	-0.257
-0.256	-0.256	-0.255	-0.255	-0.255	-0.255	-0.255	-0.254
-0.254	-0.254	-0.254	-0.254	-0.254	-0.253	-0.253	-0.253
 myid, clock time = 0	0.03366
 myid, clock time = 1	0.03054
 myid, clock time = 2	0.03195
 myid, clock time = 3	0.03338
T5SEQ takes 0.090 seconds on one CPU, while T5CP takes about 0.033 seconds on four CPUs, a parallel speed-up of 0.090/0.033 = 2.73.

5.3 T5DCP
T5DCP partitions both the computation and the data of T5SEQ on the first dimension.  Here mm (=160) is divisible by the number of CPUs np, so each process holds m = mm/np rows; if mm were not divisible by np, m would have to be set to mm/np+1.  The defines become:
#define kk 20
#define km 3
#define mm 160
#define nn 120
#define m  40
The first dimension of the partitioned arrays is declared as (m+2) so that the boundary rows istart-1 and iend+1 can also be stored:
double f1[m+2][nn][km], f2[m+2][nn][km], hxu[m+2][nn], hxv[m+2][nn],
       hmmx[m+2][nn], hmmy[m+2][nn];
double u1[m+2][nn][kk], v1[m+2][nn][kk], ps1[m+2][nn];
double d7[m+2][nn], d8[m+2][nn], d00[m+2][nn][kk], tt[mm][nn][km];
Because mm is divisible by np, every process contributes the same amount of f2 data, and the partitioned f2 can be collected into the global array tt with MPI_Gather:
iroot=0;
icount1= m*nn*km;
MPI_Gather((void *)&f2[istart][0][0], icount1, MPI_DOUBLE,
           (void *)&tt,               icount1, MPI_DOUBLE, iroot, comm);
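If mm were not divisible by np, the per-process counts would differ and MPI_Gatherv would be needed instead.  A minimal sketch, assuming the gcount[] and gstart[] arrays produced by startend() and not part of the original T5DCP:

int  rcount[16], displ[16], ip;
for (ip = 0; ip < nproc; ip++) {
   rcount[ip] = gcount[ip]*nn*km;       /* number of doubles contributed by process ip */
   displ[ip]  = (gstart[ip]-1)*nn*km;   /* where that block starts inside tt           */
}
MPI_Gatherv((void *)&f2[istart][0][0], gcount[myid]*nn*km, MPI_DOUBLE,
            (void *)&tt, rcount, displ, MPI_DOUBLE, iroot, comm);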
The complete T5DCP program is:
/*
   PROGRAM T5DCP
   Computing & data partition on the first dimension of multiple
   dimensional arrays with -1,+1 data exchange
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define kk 20
#define km 3
#define mm 160
#define nn 120
#define m  40
double f1[m+2][nn][km], f2[m+2][nn][km], hxu[m+2][nn], hxv[m+2][nn],
hmmx[m+2][nn], hmmy[m+2][nn];
double vecinv[kk][kk], am7[kk];
main ( argc, argv)
int argc;
char **argv;
{
double u1[m+2][nn][kk], v1[m+2][nn][kk], ps1[m+2][nn];
double d7[m+2][nn], d8[m+2][nn], d00[m+2][nn][kk], tt[mm][nn][km];
double clock, sumf1, sumf2, gsumf1, gsumf2;
int          i, j, k, ka, ii, nnkk;
int          nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp, iroot, istartg;
int          itag, icount1, istart2, iend1, istartm1, iendp1;
int          gstart[16], gend[16], gcount[16];
MPI_Status   istat[8];
MPI_Comm     comm;
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm=MPI_COMM_WORLD;
MPI_Barrier(comm);
clock=MPI_Wtime();
startend( nproc, 1, mm, gstart, gend, gcount);
istart = 1;
iend = m;
lastp = nproc-1;
istartg = gstart[myid];
printf( "NPROC,MYID,ISTART,IEND,istartg=%d\t%d\t%d\t%d\t%d\n",
nproc,myid,istart,iend,istartg);
istartm1 = istart-1;
iendp1 = iend+1;
istart2 = istart;
if (myid == 0) istart2 = 2;
iend1 = iend;
if (myid == lastp ) iend1 = iend-1;
l_nbr = myid - 1;
r_nbr = myid + 1;
if (myid == 0)
l_nbr = MPI_PROC_NULL;
if (myid == lastp) r_nbr = MPI_PROC_NULL;
/* for (i=0; i<mm; i++) */
for (i=istart; i<=iend; i++) {
ii = i + istartg -1;
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)
u1[i][j][k]=1.0/(double) ii + 1.0/(double) (j+1) + 1.0/(double) (k+1);
}
/* for (i=0; i<mm; i++) */
for (i=istart; i<=iend; i++) {
ii = i + istartg -1;
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)
v1[i][j][k]=2.0/(double) ii + 1.0/(double) (j+1) + 1.0/(double) (k+1);
}
for (i=istart; i<=iend; i++) {
ii = i + istartg -1;
for (j=0; j<nn; j++) {
ps1[i][j] = 1.0/(double) ii + 1.0/(double)(j+1);
hxu[i][j] = 2.0/(double) ii + 1.0/(double)(j+1);
hxv[i][j] = 1.0/(double) ii + 2.0/(double)(j+1);
hmmx[i][j] = 2.0/(double) ii + 1.0/(double)(j+1);
hmmy[i][j] = 1.0/(double) ii + 2.0/(double)(j+1);


}
}
for (k=0; k<kk; k++) {
am7[k] = 1.0/(double) (k+1);
for (ka=0; ka<kk; ka++) {
vecinv[k][ka] = 1.0/(double) (ka+1) + 1.0/(double) (k+1);
}
}
/*
   Start the computation
*/
nnkk = nn*kk;
itag = 10;
MPI_Sendrecv ((void *)&u1[iend][0][0],     nnkk, MPI_DOUBLE, r_nbr, itag,
              (void *)&u1[istartm1][0][0], nnkk, MPI_DOUBLE, l_nbr, itag, comm, istat);
itag = 20;
MPI_Sendrecv ((void *)&ps1[istart][0],  nn, MPI_DOUBLE, l_nbr, itag,
              (void *)&ps1[iendp1][0],  nn, MPI_DOUBLE, r_nbr, itag, comm, istat);


/* for (i=0; i<mm; i++) { */
for (i=istart; i<=iend; i++) {
for (j=0; j<nn; j++) {
for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}
}

/* for (i=0; i<mm-1; i++) */
for (i=istart; i<=iend1; i++)


for (j=1; j<nn-1; j++)
d7[i][j] = ( ps1[i+1][j]+ps1[i][j] )*0.50*hxu[i][j];
/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)
for (j=0; j<nn-1; j++)
d8[i][j] = ( ps1[i][j+1]+ps1[i][j] )*0.50*hxv[i][j];
itag=30;
MPI_Sendrecv ((void *)&d7[iend][0],
nn, MPI_DOUBLE, r_nbr, itag,
(void *)&d7[istartm1][0], nn, MPI_DOUBLE, l_nbr, itag, comm, istat);
/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)
for (j=1; j<nn-1; j++)
for (k=0; k<kk; k++)
d00[i][j][k]=(d7[i][j]*u1[i][j][k]-d7[i-1][j]*u1[i-1][j][k])*hmmx[i][j]
+(d8[i][j]*v1[i][j][k]-d8[i][j-1]*v1[i][j-1][k])*hmmy[i][j];

/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)
for (ka=0; ka<kk; ka++)
for (j=1; j<nn-1; j++)
for (k=0; k<km; k++)
f1[i][j][k]=f1[i][j][k]-vecinv[ka][k]*d00[i][j][ka];
sumf1=0.0;
sumf2=0.0;

/* for (i=1; i<mm-1; i++) { */
for (i=istart2; i<=iend1; i++) {
for (j=1; j<nn-1; j++) {
for (k=0; k<km; k++) {
f2[i][j][k]=-am7[k]*ps1[i][j];
sumf1 +=f1[i][j][k];
sumf2 +=f2[i][j][k];
}
}
}

/*
   Output data for validation
*/
MPI_Allreduce ((void *)&sumf1, (void *)&gsumf1, 1, MPI_DOUBLE, MPI_SUM, comm);
MPI_Allreduce ((void *)&sumf2, (void *)&gsumf2, 1, MPI_DOUBLE, MPI_SUM, comm);
icount1 = m*nn*km;
iroot=0;
MPI_Gather((void *)&f2[istart][0][0], icount1, MPI_DOUBLE,
           (void *)&tt,               icount1, MPI_DOUBLE, iroot, comm);
if (myid == 0) {
printf( "SUMF1,SUMF2= %.5f\t%.5f\n", gsumf1, gsumf2 );
printf( " tt[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
tt[i][1][1],tt[i+5][1][1],tt[i+10][1][1],tt[i+15][1][1],
tt[i+20][1][1],tt[i+25][1][1],tt[i+30][1][1],tt[i+35][1][1]);
}
}
clock=MPI_Wtime() - clock;
printf( " myid, clock time= %d\t%.5f\n", myid, clock);
MPI_Finalize();
return 0;
}
The output of T5DCP on four CPUs is:
ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND,istartg=4	0	1	40	1
NPROC,MYID,ISTART,IEND,istartg=4	2	1	40	81
NPROC,MYID,ISTART,IEND,istartg=4	1	1	40	41
NPROC,MYID,ISTART,IEND,istartg=4	3	1	40	121
SUMF1,SUMF2= 26172.46054	-2268.89180
 tt[i][1][1],i=0,159,5
0.000	-0.333	-0.295	-0.281	-0.274	-0.269	-0.266	-0.264
-0.262	-0.261	-0.260	-0.259	-0.258	-0.258	-0.257	-0.257
-0.256	-0.256	-0.255	-0.255	-0.255	-0.255	-0.255	-0.254
-0.254	-0.254	-0.254	-0.254	-0.254	-0.253	-0.253	-0.253
 myid, clock time= 0	0.03041
 myid, clock time= 1	0.02975
 myid, clock time= 2	0.02992
 myid, clock time= 3	0.02978
T5SEQ takes 0.090 seconds on one CPU, while T5DCP takes about 0.030 seconds on four CPUs, a parallel speed-up of 0.090/0.030 = 3.00.  This is slightly faster than T5CP's 0.033 seconds, partly because the result is collected with a single MPI_Gather rather than individual send/recv calls.

5.4 MPI Cartesian topology and derived data types
Section 5.3 partitioned the arrays on one dimension only.  This section partitions them on the first two dimensions, using the additional MPI functions MPI_Cart_create, MPI_Cart_coords, MPI_Cart_shift, MPI_Type_vector and MPI_Type_commit.

5.4.1 Cartesian Topology

(Figure 5.3: a 4x3 Cartesian process grid.  The first (i) direction is called sideways and points to the right, the second (j) direction is called updown and points up; CPU0..CPU11 carry the coordinates (0,0), (0,1), (0,2), (1,0), ..., (3,2).)

Suppose a two-dimensional array a(mm,nn) is distributed over this 4x3 grid, so that each CPU holds an m x n block with m = mm/4 and n = nn/3.  With mm = 200 and nn = 150 the defines and the local dimension become:
#define mm 200
#define nn 150
#define m   50
#define n   50
#define ip   4
#define jp   3
double a[m+2][n+2];
Each CPU holds one block of a in this 4x3 layout; Figure 5.3 shows which CPU owns which block, with the X (i) direction split into ip=4 pieces and the Y (j) direction into jp=3 pieces.

5.4.2 MPI_Cart_create, MPI_Cart_coords and MPI_Cart_shift
After MPI_Comm_size has returned the number of processes nproc, MPI_Cart_create is called to arrange these processes into the Cartesian grid of Figure 5.3:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ip 4
#define jp 3
#define ndim 2
int
nproc, myid, r_nbr, l_nbr, t_nbr, b_nbr, comm2d, my_coord[ndim];
int
ipart[ndim], periods[ndim], sideways, updown, right, up, reorder;
MPI_Status istat[8];
MPI_Comm comm;
main ( argc, argv)
int argc;
char **argv;
{
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
ipart[0]=ip;
ipart[1]=jp;
periods[0]=0;
periods[1]=0;
reorder=1;
MPI_Cart_create(MPI_COMM_WORLD, ndim, ipart, periods, reorder, &comm2d);
.....
return 0;
}

MPI_COMM_WORLD   the original communicator containing all processes
ndim             number of dimensions of the Cartesian grid (2 in Figure 5.3)
ipart            number of processes in each dimension; ipart[0]=4 and ipart[1]=3 for Figure 5.3
periods          whether each dimension is periodic (1) or not (0); both are 0 in Figure 5.3
reorder          1 allows MPI to renumber the processes, 0 keeps the original ranks
comm2d           the new communicator describing the 4x3 Cartesian grid

Because reorder is not 0, MPI_Cart_create may assign new ranks to the CPUs, so MPI_Comm_rank must be called again on the new communicator comm2d to obtain each process's id (myid) inside comm2d.

MPI_Comm_rank is called as follows to get each process's id in communicator comm2d:
MPI_Comm_rank (comm2d, &myid);
comm2d   the Cartesian communicator
myid     the process id of the calling process inside comm2d
This myid corresponds to CPU0, CPU1, CPU2, ... in Figure 5.3.  MPI_Cart_coords is then used to obtain each CPU's coordinates my_coord in the Cartesian grid:
MPI_Cart_coords (comm2d, myid, ndim, my_coord);

comm2d     the Cartesian communicator
myid       the process id of the calling process inside comm2d
ndim       number of dimensions of the grid (2 in Figure 5.3)
my_coord   output array containing the Cartesian coordinates of this process
my_coord[0] is the coordinate in the first (sideways) direction and my_coord[1] the coordinate in the second (updown) direction, as written beside CPU0, CPU1, CPU2, ... in Figure 5.3.  A process with my_coord[0] equal to 0 has no left neighbor and one with my_coord[0] equal to ip-1 has no right neighbor; likewise my_coord[1] equal to 0 means there is no neighbor below and my_coord[1] equal to jp-1 means there is no neighbor above.
MPI_Cart_shift is used to find the process ids of the neighboring CPUs.
MPI_Cart_shift is called as follows:
int sideways, updown, right, up;
sideways=0;
updown=1;
right=1;
up=1;
MPI_Cart_shift (comm2d, sideways, right, &l_nbr, &r_nbr);
comm2d     the Cartesian communicator
sideways   the direction to shift in, here the first (i) dimension
right      the shift distance, here one step in the positive direction
l_nbr      the process id of the left-hand neighbor CPU
r_nbr      the process id of the right-hand neighbor CPU
MPI_Cart_shift (comm2d, updown, up, &b_nbr, &t_nbr);
updown     the direction to shift in, here the second (j) dimension
up         the shift distance, here one step in the positive direction
b_nbr      the process id of the neighbor CPU below
t_nbr      the process id of the neighbor CPU above
Here l_nbr, r_nbr, b_nbr and t_nbr stand for left_neighbor, right_neighbor, bottom_neighbor and top_neighbor; at the edges of the non-periodic grid the corresponding value is MPI_PROC_NULL.
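As a small illustration (a sketch, not part of the original text), the neighbor ids returned by MPI_Cart_shift can be passed directly to MPI_Sendrecv; where a process sits on the grid boundary the id is MPI_PROC_NULL and the call simply does nothing in that direction:

/* exchange one boundary row with the left/right neighbors (array and bounds as in 5.4.1) */
itag = 10;
MPI_Sendrecv ((void *)&a[iend][1],     n, MPI_DOUBLE, r_nbr, itag,
              (void *)&a[istart-1][1], n, MPI_DOUBLE, l_nbr, itag, comm2d, istat);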

5.4.3 MPI derived data types MPI_Type_vector and MPI_Type_commit

(Figure 5.4: the 4x3 partition of a(mm,nn).  Inside each CPU's block, the data needed by the upper/lower (j-1, j+1) neighbors is marked x along the top and bottom of the block, and the data needed by the left/right (i-1, i+1) neighbors is marked y along its left and right edges.)

Since C stores a[mm][nn] with the second index j varying fastest, the y-marked data (fixed i, j=1..n) is contiguous in memory, but the x-marked data (fixed j, i=1..m) is not: within an m x n block consecutive x elements are n elements apart (n+2 apart when the local array is declared [m+2][n+2] as in section 5.5).  To send such strided data in one message, a derived data type is created with MPI_Type_vector and registered with MPI_Type_commit:
MPI_Type_vector (count, blocklen, stride, oldtype, &newtype);
MPI_Type_commit (&newtype);
count      number of blocks
blocklen   number of contiguous elements in each block
stride     distance (in elements of oldtype) between the start of consecutive blocks
oldtype    the element data type
newtype    the resulting derived data type
For the x-marked data in Figure 5.4 the calls are:
MPI_Type_vector (m, 1, n, MPI_REAL, &vector2d);
MPI_Type_commit (&vector2d);
After this, one item of type vector2d describes a whole x-marked strip.
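As a hedged sketch (not in the original notes), a single x-marked strip can then be exchanged with the top/bottom neighbor by sending one element of type vector2d, which is exactly what T5_2D does in section 5.5:

/* send my top strip (j = jend) up, receive the ghost strip (j = jstart-1) from below */
itag = 30;
MPI_Sendrecv ((void *)&a[istart][jend],     1, vector2d, t_nbr, itag,
              (void *)&a[istart][jstart-1], 1, vector2d, b_nbr, itag, comm2d, istat);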


5.5 T5_2D
The relevant declarations of T5SEQ are:
#define kk 20
#define km 3
#define mm 160
#define nn 120
double f1[mm][nn][km], f2[mm][nn][km], hxu[mm][nn], hxv[mm][nn],
       hmmx[mm][nn], hmmy[mm][nn];
double vecinv[kk][kk], am7[kk];
main ()
{
double u1[mm][nn][kk], v1[mm][nn][kk], ps1[mm][nn];
double d7[mm][nn], d8[mm][nn], d00[mm][nn][kk];
double clock, sumf1, sumf2;
Program T5_2D applies the two-dimensional partition of section 5.4 to both mm and nn.  Here ip is 4 and jp is 2, so each process holds an m x n block with m = mm/ip and n = nn/jp, and the declarations become:
#include <stdlib.h>
#include <mpi.h>
#define kk 20
#define km 3
#define mm 160
#define nn 120
#define m  40
#define n  60
#define ip  4
#define jp  2

double f1[m+2][n+2][km], f2[m+2][n+2][km], hxu[m+2][n+2],


hxv[m+2][n+2], hmmx[m+2][n+2], hmmy[m+2][n+2];
double vecinv[kk][kk], am7[kk];


int          nproc, myid, r_nbr, l_nbr, t_nbr, b_nbr, comm2d, my_coord[2];
int          ipart[2], periods[2], reorder, sideways, updown, right, up, icomm;
MPI_Status istat[8];
main ( argc, argv)
int argc;
char **argv;
{
double u1[m+2][n+2][kk], v1[m+2][n+2][kk], ps1[m+2][n+2];
double d7[m+2][n+2], d8[m+2][n+2], d00[m+2][n+2][kk], tt[mm][nn][km];
double clock, sumf1, sumf2, gsumf1, gsumf2;
All partitioned arrays are dimensioned [m+2][n+2] so that the boundary data of the neighboring blocks can be stored as well.

(Figure 5.5: the 4x2 partition of d8(m+2,n+2), ps1(m+2,n+2), u1(m+2,n+2,kk) and v1(m+2,n+2,kk) over cpu0..cpu7 with Cartesian coordinates (0,0)..(3,1); index i runs sideways (right) and index j runs up; the x marks show the strip each block exchanges with its upper/lower neighbor.)

The MPI functions of section 5.4 are used to set up the 4x2 Cartesian grid in the routine nbr2d():


nbr2d()
{
ipart[0]=ip;
ipart[1]=jp;
periods[0]=0;
periods[1]=0;
reorder=1;
sideways=0;
updown=1;
right=1;
up=1;
MPI_Cart_create(MPI_COMM_WORLD, 2, ipart, periods, reorder, &comm2d);
MPI_Comm_rank( comm2d,&myid);
MPI_Cart_coords( comm2d, myid, 2, my_coord);
MPI_Cart_shift( comm2d, sideways, right, &l_nbr, &r_nbr);
MPI_Cart_shift( comm2d, updown, up, &b_nbr, &t_nbr);
printf(" myid,coord,l,r,t,b_nbr=%d\t%d\t%d\t%d\t%d\t%d\t%d\n",
myid,my_coord[0],my_coord[1],l_nbr,r_nbr,t_nbr,b_nbr);

}
The x-marked data in Figure 5.5 that each CPU exchanges with its upper/lower neighbor is not contiguous: for the two-dimensional arrays there are m elements, one every n+2 doubles, and for the three-dimensional arrays there are m blocks of kk elements, one block every kk*(n+2) doubles.  The corresponding MPI derived data types are:
n2=n+2;
MPI_Type_vector (m, 1, n2, MPI_DOUBLE, &vector2d);
MPI_Type_commit (&vector2d);
n2kk=n2*kk;
MPI_Type_vector (m, kk, n2kk, MPI_DOUBLE, &vector3d);
MPI_Type_commit (&vector3d);
One element of type vector2d (or vector3d) now describes a whole x-marked strip.  In the j direction the for loops run from 1 to n, so:
jstart=1;
jend=n;
jstartm1=jstart-1;
jendp1=jend+1;

As shown in Figure 5.6, each CPU sends its top strip (the x-marked data at j=jend) to the upper neighbor and receives into the ghost strip at j=jstartm1 from the lower neighbor with MPI_Sendrecv:
MPI_Sendrecv((void *)& ps1[istart][jend], 1, vector2d, t_nbr, itag,
(void *)&ps1[istart][jstartm1], 1, vector2d, b_nbr, itag, comm2d, istat);
MPI_Sendrecv ((void *)&v1[istart][jend][0], 1, vector3d, t_nbr, itag,
(void *)&v1[istart][jstartm1][0], 1, vector3d, b_nbr, itag, comm2d, istat);

(Figure 5.6: inside each block of d8(m+2,n+2), ps1(m+2,n+2), u1(m+2,n+2,kk) and v1(m+2,n+2,kk), the x-marked strips at j=jstart and j=jend are exchanged with the lower/upper neighbors, and the y-marked rows at i=istart and i=iend are exchanged with the left/right neighbors; index i runs sideways (right), index j runs up.)

Similarly, each CPU sends its bottom strip (j=jstart) to the lower neighbor and receives into the ghost strip at j=jendp1 from the upper neighbor:
MPI_Sendrecv ((void *)&ps1[istart][jstart], 1, vector2d, b_nbr, itag,
              (void *)&ps1[istart][jendp1], 1, vector2d, t_nbr, itag, comm2d, istat);
MPI_Sendrecv ((void *)&v1[istart][jstart][0], 1, vector3d, b_nbr, itag,
              (void *)&v1[istart][jendp1][0], 1, vector3d, t_nbr, itag, comm2d, istat);

In the i direction the for loops run from 1 to m, so:
istart=1;
iend=m;
istartm1=istart-1;
iendp1=iend+1;
The exchange in the i direction is the same as in T5DCP of section 5.3, because the y-marked data is contiguous.  Each CPU sends its row i=istart to the left neighbor and receives into the ghost row i=iendp1 from the right neighbor:
MPI_Sendrecv ((void *)&ps1[istart][jstart], n, MPI_DOUBLE, l_nbr, itag,
              (void *)&ps1[iendp1][jstart], n, MPI_DOUBLE, r_nbr, itag, comm2d, istat);
and sends its row i=iend to the right neighbor while receiving into the ghost row i=istartm1 from the left neighbor:
n2kk=(n+2)*kk;
MPI_Sendrecv ((void *)&u1[iend][jstart][0],     n2kk, MPI_DOUBLE, r_nbr, itag,
              (void *)&u1[istartm1][jstart][0], n2kk, MPI_DOUBLE, l_nbr, itag,
              comm2d, istat);
Finally, because the global array tt and the partitioned arrays f1, f2 have different dimensions, the result cannot be collected with MPI_Gather or MPI_Gatherv; instead MPI_Send and MPI_Recv are used and CPU 0 copies each received block into tt with the routine copy1:
double tt[mm][nn][km];
double f1[m+2][n+2][km], f2[m+2][n+2][km];

The complete T5_2D program is:
/*

PROGRAM T5_2D
Computing & data partition on the first 2 dimensions of multiple
dimensional arrays with -1,+1 data exchange */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define kk 20
#define km 3
#define mm 160
#define nn 120
#define m  40
#define n  60
#define ip  4
#define jp  2
#define np  8

double f1[m+2][n+2][km], f2[m+2][n+2][km], hxu[m+2][n+2],


hxv[m+2][n+2], hmmx[m+2][n+2], hmmy[m+2][n+2];
double vecinv[kk][kk], am7[kk];
int          nproc, myid, myid_i, myid_j, lastp, lastp_i, lastp_j,
             r_nbr, l_nbr, t_nbr, b_nbr, my_coord[2], g_coord[np][2];
int          ipart[2], periods[2], reorder, sideways, updown, right, up;
int          istart,iend, istart2, iend1, istartm1, iendp1;
int          jstart,jend, jstart2, jend1, jstartm1, jendp1;
int          istartg[16], iendg[16], jstartg[16], jendg[16];
MPI_Comm     comm2d;
MPI_Status   istat[8];
MPI_Datatype vector2d, vector3d;
main ( argc, argv)
int argc;
char **argv;
{
double u1[m+2][n+2][kk], v1[m+2][n+2][kk], ps1[m+2][n+2];
double d7[m+2][n+2], d8[m+2][n+2], d00[m+2][n+2][kk];
double clock, sumf1, sumf2, gsumf1, gsumf2, tt[mm][nn][km];
int          i, j, k, ka, ii, jj, n2, n2kk;
int          itag, isrc, idest, iroot, ig, jg, count;
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
if (nproc != np) {
if (myid == 0)
printf(" nproc not equal to np=%d, program will stop\n", np);
MPI_Finalize();
return 0;
}
nbr2d();
MPI_Barrier(comm2d);
clock=MPI_Wtime();
MPI_Gather ((void *)&my_coord, 2, MPI_INTEGER,
(void *)&g_coord, 2, MPI_INTEGER, 0, comm2d);
startend( ip, 1, mm, istartg, iendg);
startend( jp, 1, nn, jstartg, jendg);
istart = 1;
iend = m;
jstart = 1;
jend = n;
myid_i=my_coord[0];
myid_j=my_coord[1];
ig = istartg[myid_i];
jg = jstartg[myid_j];
lastp_i=ip-1;
lastp_j=jp-1;
printf( "NPROC,MYID,ISTART,IEND,ig,jg=%d\t%d\t%d\t%d\t%d\t%d\n",
nproc, myid, istart, iend, ig, jg);
istartm1 = istart-1;
iendp1 = iend+1;
jstartm1 = jstart-1;
jendp1 = jend+1;
istart2 = istart;
if (myid_i == 0) istart2 = 2;
jstart2 = jstart;
if (myid_j == 0) jstart2 = 2;
iend1 = iend;
if (myid_i == lastp_i ) iend1 = iend-1;
jend1 = jend;
if (myid_j == lastp_j ) jend1 = jend-1;


/*
Test data generation
*/
/* for (i=0; i<mm; i++) */
for (i=istart; i<=iend; i++) {
ii = i + ig -1;
/* for (j=0; j<nn; j++) */
for (j=jstart; j<=jend; j++) {
jj = j + jg -1;
for (k=0; k<kk; k++)
u1[i][j][k]=1.0/(double) ii + 1.0/(double) jj + 1.0/(double) (k+1);
}
}
/* for (i=0; i<mm; i++) */
for (i=istart; i<=iend; i++) {
ii = i + ig -1;
/* for (j=0; j<nn; j++) */
for (j=jstart; j<=jend; j++) {
jj = j + jg -1;
for (k=0; k<kk; k++)
v1[i][j][k]=2.0/(double) ii + 1.0/(double) jj + 1.0/(double) (k+1);
}
}
for (i=istart; i<=iend; i++) {
ii = i + ig -1;
/*
for (j=0; j<nn; j++) {
*/
for (j=jstart; j<=jend; j++) {
jj = j + jg -1;
ps1[i][j] = 1.0/(double) ii + 1.0/(double) jj;
hxu[i][j] = 2.0/(double) ii + 1.0/(double) jj;
hxv[i][j] = 1.0/(double) ii + 2.0/(double) jj;
hmmx[i][j] = 2.0/(double) ii + 1.0/(double) jj;
hmmy[i][j] = 1.0/(double) ii + 2.0/(double) jj;
}
}
for (k=0; k<kk; k++) {


am7[k] = 1.0/(double) (k+1);
for (ka=0; ka<kk; ka++) {
vecinv[k][ka] = 1.0/(double) (ka+1) + 1.0/(double) (k+1);
}
}
/*
   Start the computation
*/
n2   = n+2;
n2kk = n2*kk;
MPI_Type_vector(m, kk, n2kk, MPI_DOUBLE, &vector3d );
MPI_Type_commit(&vector3d );
MPI_Type_vector(m, 1, n2, MPI_DOUBLE, &vector2d );
MPI_Type_commit(&vector2d );
itag = 10;
MPI_Sendrecv ((void *)&u1[iend][jstart][0],
n2kk, MPI_DOUBLE, r_nbr, itag,
(void *)&u1[istartm1][jstart][0],n2kk, MPI_DOUBLE, l_nbr, itag,
comm2d, istat);
itag = 20;
MPI_Sendrecv ((void *)&ps1[istart][jstart], n, MPI_DOUBLE, l_nbr, itag,
(void *)&ps1[iendp1][jstart], n, MPI_DOUBLE, r_nbr, itag,
comm2d, istat);
itag = 30;
MPI_Sendrecv ((void *)&v1[istart][jend][0],
1, vector3d, t_nbr, itag,
(void *)&v1[istart][jstartm1][0], 1, vector3d, b_nbr, itag,
comm2d, istat);
for (i=istart; i<=iend; i++) {
for (j=jstart; j<=jend; j++) {
for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}

}
/* for (i=0; i<mm-1; i++) */
/* for (j=0; j<nn-1; j++) */
for (i=istart; i<=iend1; i++)
for (j=jstart; j<=jend1; j++)
d7[i][j] = ( ps1[i+1][j]+ps1[i][j] )*0.50*hxu[i][j];


/* for (i=1; i<mm-1; i++) */
/* for (j=0; j<nn-1; j++) */
for (i=istart2; i<=iend1; i++)
for (j=jstart; j<=jend1; j++)
d8[i][j] = ( ps1[i][j+1]+ps1[i][j] )*0.50*hxv[i][j];
itag=50;
MPI_Sendrecv ((void *)&d7[iend][jstart],     n, MPI_DOUBLE, r_nbr, itag,
              (void *)&d7[istartm1][jstart], n, MPI_DOUBLE, l_nbr, itag,
              comm2d, istat);
itag=60;
MPI_Sendrecv ((void *)&d8[istart][jend],     1, vector2d, t_nbr, itag,
              (void *)&d8[istart][jstartm1], 1, vector2d, b_nbr, itag,
              comm2d, istat);
/* for (i=1; i<mm-1; i++) */
/* for (j=1; j<nn-1; j++) */
for (i=istart2; i<=iend1; i++)
for (j=jstart2; j<=jend1; j++)
for (k=0; k<kk; k++)
d00[i][j][k]=(d7[i][j]*u1[i][j][k]-d7[i-1][j]*u1[i-1][j][k])*hmmx[i][j]
+(d8[i][j]*v1[i][j][k]-d8[i][j-1]*v1[i][j-1][k])*hmmy[i][j];
/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)
for (ka=0; ka<kk; ka++)
/* for (j=1; j<nn-1; j++) */
for (j=jstart2; j<=jend1; j++)
for (k=0; k<km; k++)
f1[i][j][k]=f1[i][j][k]-vecinv[ka][k]*d00[i][j][ka];
sumf1=0.0;
sumf2=0.0;
/* for (i=1; i<mm-1; i++) { */
for (i=istart2; i<=iend1; i++) {
/* for (j=1; j<nn-1; j++) { */
for (j=jstart2; j<=jend1; j++) {
for (k=0; k<km; k++) {


f2[i][j][k]=-am7[k]*ps1[i][j];
sumf1 +=f1[i][j][k];
sumf2 +=f2[i][j][k];
}
}
}
MPI_Allreduce ((void *)&sumf1, (void *)&gsumf1, 1, MPI_DOUBLE, MPI_SUM, comm2d);
MPI_Allreduce ((void *)&sumf2, (void *)&gsumf2, 1, MPI_DOUBLE, MPI_SUM, comm2d);
count=km*(n+2)*(m+2);
iroot = 0;
itag = 70;
if (myid > 0)
MPI_Send ( (void *)&f2, count, MPI_DOUBLE, iroot, itag, comm2d);
else {
copy1(myid, tt);
for (isrc=1; isrc < nproc; isrc++) {
MPI_Recv ((void *)&f2, count, MPI_DOUBLE, isrc, itag, comm2d, istat);
copy1 (isrc, tt);
}
}
if (myid == 0) {
printf( "sumf1,sumf2= %.5f\t%.5f\n", gsumf1, gsumf2 );
printf( " tt[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
tt[i][1][1],tt[i+5][1][1],tt[i+10][1][1],tt[i+15][1][1],
tt[i+20][1][1],tt[i+25][1][1],tt[i+30][1][1],tt[i+35][1][1]);
}
clock=MPI_Wtime() - clock;
printf( " myid, clock time= %d\t%.5f\n", myid, clock);
}
MPI_Finalize();
return 0;
}


nbr2d()
{
ipart[0]=ip;
ipart[1]=jp;
periods[0]=0;
periods[1]=0;
reorder=1;
sideways=0;
updown=1;
right=1;
up=1;
MPI_Cart_create(MPI_COMM_WORLD, 2, ipart, periods, reorder, &comm2d);
MPI_Comm_rank( comm2d,&myid);
MPI_Cart_coords( comm2d, myid, 2, my_coord);
MPI_Cart_shift( comm2d, sideways, right, &l_nbr, &r_nbr);
MPI_Cart_shift( comm2d, updown, up, &b_nbr, &t_nbr);
printf(" myid,coord,l,r,t,b_nbr=%d\t%d\t%d\t%d\t%d\t%d\t%d\n",
myid,my_coord[0],my_coord[1],l_nbr,r_nbr,t_nbr,b_nbr);
return 0;
}
copy1(int id, double tt[mm][nn][km])
{
/*
   copy partitioned array f2 to global array tt
*/
int i, j, k, ii, jj, ig, jg;
ii=g_coord[id][0];
jj=g_coord[id][1];
for (i=1; i<=m; i++) {
ig=istartg[ii]+i-2;
for (j=1; j<=n; j++) {
jg=jstartg[jj]+j-2;
for (k=0; k<km; k++)
tt[ig][jg][k] = f2[i][j][k];
}
}
return 0;
}

The output of T5_2D on eight CPUs is:
ATTENTION: 0031-408 8 tasks allocated by LoadLeveler, continuing...
 myid,coord,l,r,t,b_nbr=0	0	0	-3	2	1	-3
 myid,coord,l,r,t,b_nbr=1	0	1	-3	3	-3	0
 myid,coord,l,r,t,b_nbr=2	1	0	0	4	3	-3
 myid,coord,l,r,t,b_nbr=3	1	1	1	5	-3	2
 myid,coord,l,r,t,b_nbr=4	2	0	2	6	5	-3
 myid,coord,l,r,t,b_nbr=5	2	1	3	7	-3	4
 myid,coord,l,r,t,b_nbr=6	3	0	4	-3	7	-3
 myid,coord,l,r,t,b_nbr=7	3	1	5	-3	-3	6
sumf1,sumf2= 26172.46985	-2268.89180
 tt[i][1][1],i=0,159,5
0.000	-0.333	-0.295	-0.281	-0.274	-0.269	-0.266	-0.264
-0.262	-0.261	-0.260	-0.259	-0.258	-0.258	-0.257	-0.257
-0.256	-0.256	-0.255	-0.255	-0.255	-0.255	-0.255	-0.254
-0.254	-0.254	-0.254	-0.254	-0.254	-0.253	-0.253	-0.253
 myid, clock time= 3	0.02869
 myid, clock time= 2	0.02974
 myid, clock time= 0	0.03281
 myid, clock time= 1	0.02929
 myid, clock time= 7	0.01998
 myid, clock time= 4	0.02046
 myid, clock time= 5	0.02030
 myid, clock time= 6	0.02039
T5SEQ takes 0.090 seconds on one CPU, while T5_2D takes about 0.033 seconds on eight CPUs, a parallel speed-up of 0.090/0.033 = 2.7; for this problem size that is no better than the 0.030 seconds T5DCP already reached on four CPUs.

This chapter discusses further ways to improve the efficiency of MPI programs: nonblocking communication in place of blocking send/receive, combining several messages into one, trading a little extra computation for less communication, and parallel input/output.

6.1 Nonblocking communication

MPI_Send and MPI_Recv are blocking calls: MPI_Send does not return until the send buffer may safely be reused (the CPU waits until the data has been copied out of sendbuf), and MPI_Recv does not return until the receive buffer has been completely filled with the incoming data.  Figure 6.1 illustrates this.

(Figure 6.1 Blocking Send/Recv: on processor 0, MPI_Send copies sendbuf to the system buffer while the CPU idles, after which sendbuf can be reused; on processor 1, MPI_Recv copies the system buffer to recvbuf while the CPU idles, after which recvbuf contains valid data.)

With nonblocking communication, MPI_Isend and MPI_Irecv only start the transfer and return immediately, so the CPU can continue computing; MPI_Wait is called later, before the send buffer is reused or the received data is used.  Figure 6.2 illustrates this.

(Figure 6.2 Nonblocking Send/Recv: MPI_Isend and MPI_Irecv return immediately and computation proceeds while the system copies sendbuf to the system buffer and the system buffer to recvbuf; MPI_Wait marks the point after which sendbuf can be reused or recvbuf contains valid data.)


MPI_Isend is called as:
MPI_Isend ((void *)&data, count, DATATYPE, dest, tag, MPI_COMM_WORLD, &request);
data             starting address of the data to be sent
count            number of data items
DATATYPE         data type of the items
dest             process id of the destination CPU
tag              message tag
MPI_COMM_WORLD   the communicator
request          handle that identifies this pending operation

MPI_Irecv is called as:
MPI_Irecv ((void *)&data, count, DATATYPE, src, tag, MPI_COMM_WORLD, &request);
data             starting address of the receive buffer
count            number of data items
DATATYPE         data type of the items
src              process id of the source CPU
tag              message tag
MPI_COMM_WORLD   the communicator
request          handle that identifies this pending operation

MPI_Wait is called as:
MPI_Wait (&request, istat);
request          the handle returned by MPI_Isend or MPI_Irecv that is to be waited on
istat            status of the completed operation
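A minimal sketch of the pattern (illustrative only; the buffer and neighbor names are assumptions, not taken from the original program):

MPI_Request req;
MPI_Status  stat;
/* start receiving the neighbor's boundary row into the ghost row */
MPI_Irecv ((void *)&ps1[iendp1][0], nn, MPI_DOUBLE, r_nbr, itag, comm, &req);
/* do work that does not touch ps1[iendp1] while the message is in transit */
for (j = 0; j < nn; j++)
   d8[istart][j] = ps1[istart][j]*0.5;
/* wait before using the ghost row */
MPI_Wait (&req, &stat);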

Program T5DCP of section 5.3 is now modified so that the MPI_Sendrecv calls for ps1 and u1 are replaced by MPI_Isend/MPI_Irecv pairs.  The loops that do not need the exchanged boundary data are executed first; MPI_Wait for the ps1 exchange is called just before the d7 loop that uses ps1[i+1], and MPI_Wait for the u1 exchange is called just before the d00 loop that uses u1[i-1], so the computation overlaps the communication.  The d7 boundary exchange itself is still done with MPI_Sendrecv.  The nonblocking version of T5DCP is called T6DCP:
/*
   PROGRAM T6DCP
   Computing & data partition on the first dimension of multiple
   dimensional arrays with -1,+1 boundary data exchange using non-blocking send/recv
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define kk 20
#define km 3
#define mm 160
#define nn 120
#define m  40
#define np  4

double f1[m+2][nn][km], f2[m+2][nn][km], hxu[m+2][nn], hxv[m+2][nn],


hmmx[m+2][nn], hmmy[m+2][nn];
double vecinv[kk][kk], am7[kk];
main ( argc, argv)
int argc;
char **argv;
{
double u1[m+2][nn][kk], v1[m+2][nn][kk], ps1[m+2][nn];
double d7[m+2][nn], d8[m+2][nn], d00[m+2][nn][kk], tt[mm][nn][km];
double clock, sumf1, sumf2, gsumf1, gsumf2;
int          i, j, k, ka, ii, nnkk;
int          nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp, iroot, istartg;
int          itag, icount1, istart2, iend1, istartm1, iendp1;
int          gstart[16],gend[16],gcount[16];
MPI_Status   istat[8];
MPI_Comm     comm;
MPI_Request  requ1, reqps1;

MPI_Init (&argc, &argv);


MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm=MPI_COMM_WORLD;
MPI_Barrier(comm);
clock=MPI_Wtime();
startend( nproc, 1, mm, gstart, gend, gcount);
istart = 1;
iend = m;
icount = m;
lastp = nproc-1;
istartg = gstart[myid];
printf( "NPROC,MYID,ISTART,IEND,istartg=%d\t%d\t%d\t%d\t%d\n",
nproc, myid, istart, iend, istartg);
istartm1 = istart-1;
iendp1 = iend+1;
istart2 = istart;
if (myid == 0) istart2 = 2;
iend1 = iend;
if (myid == lastp ) iend1 = iend-1;
l_nbr = myid - 1;
r_nbr = myid + 1;
if (myid == 0) l_nbr = MPI_PROC_NULL;
if (myid == lastp) r_nbr = MPI_PROC_NULL;
/*
Test data generation
*/
/* for (i=0; i<mm; i++) */
for (i=istart; i<=iend; i++) {
ii = i + istartg -1;
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)
u1[i][j][k]=1.0/(double) ii + 1.0/(double) (j+1) + 1.0/(double) (k+1);
}
/* for (i=0; i<mm; i++) */
for (i=istart; i<=iend; i++) {
ii = i + istartg -1;
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)
v1[i][j][k]=2.0/(double) ii + 1.0/(double) (j+1) + 1.0/(double) (k+1);
}
for (i=istart; i<=iend; i++) {
ii = i + istartg -1;
for (j=0; j<nn; j++) {
ps1[i][j] = 1.0/(double) ii + 1.0/(double)(j+1);


hxu[i][j] = 2.0/(double) ii + 1.0/(double)(j+1);
hxv[i][j] = 1.0/(double) ii + 2.0/(double)(j+1);
hmmx[i][j] = 2.0/(double) ii + 1.0/(double)(j+1);
hmmy[i][j] = 1.0/(double) ii + 2.0/(double)(j+1);
}
}
for (k=0; k<kk; k++) {
am7[k] = 1.0/(double) (k+1);
for (ka=0; ka<kk; ka++) {
vecinv[k][ka] = 1.0/(double) (ka+1) + 1.0/(double) (k+1);
}
}
/*

Start the computation

*/
nnkk = nn*kk;
itag = 10;
/* MPI_Sendrecv ((void *)&u1[iend][0][0],     nnkk, MPI_DOUBLE, r_nbr, itag,
                 (void *)&u1[istartm1][0][0], nnkk, MPI_DOUBLE, l_nbr, itag, comm, istat);
*/
MPI_Isend ((void *)&u1[iend][0][0],     nnkk, MPI_DOUBLE, r_nbr, itag, comm, &requ1);
MPI_Irecv ((void *)&u1[istartm1][0][0], nnkk, MPI_DOUBLE, l_nbr, itag, comm, &requ1);
itag = 20;
/* MPI_Sendrecv ((void *)&ps1[istart][0], nn, MPI_DOUBLE, l_nbr, itag,
(void *)&ps1[iendp1][0], nn, MPI_DOUBLE, r_nbr, itag, comm, istat);
*/
MPI_Isend ((void *)&ps1[istart][0], nn, MPI_DOUBLE, l_nbr, itag, comm, &reqps1);
MPI_Irecv ((void *)&ps1[iendp1][0], nn, MPI_DOUBLE, r_nbr, itag, comm, &reqps1);
/* for (i=0; i<mm; i++) { */
for (i=istart; i<=iend; i++) {
for (j=0; j<nn; j++) {
for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}
}


/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)
for (j=0; j<nn-1; j++)
d8[i][j] = ( ps1[i][j+1]+ps1[i][j] )*0.50*hxv[i][j];
MPI_Wait (&reqps1, istat);

/* for (i=0; i<mm-1; i++) */
for (i=istart; i<=iend1; i++)
for (j=1; j<nn-1; j++)
d7[i][j] = ( ps1[i+1][j]+ps1[i][j] )*0.50*hxu[i][j];
MPI_Wait (&requ1, istat);
itag=30;
MPI_Sendrecv ((void *)&d7[iend][0],     nn, MPI_DOUBLE, r_nbr, itag,
              (void *)&d7[istartm1][0], nn, MPI_DOUBLE, l_nbr, itag, comm, istat);


/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)


for (j=1; j<nn-1; j++)
for (k=0; k<kk; k++)
d00[i][j][k]=(d7[i][j]*u1[i][j][k]-d7[i-1][j]*u1[i-1][j][k])*hmmx[i][j]
+(d8[i][j]*v1[i][j][k]-d8[i][j-1]*v1[i][j-1][k])*hmmy[i][j];
/* for (i=1; i<mm-1; i++) */
for (i=istart2; i<=iend1; i++)
for (ka=0; ka<kk; ka++)
for (j=1; j<nn-1; j++)
for (k=0; k<km; k++)
f1[i][j][k]=f1[i][j][k]-vecinv[ka][k]*d00[i][j][ka];
sumf1=0.0;
sumf2=0.0;
/* for (i=1; i<mm-1; i++) { */
for (i=istart2; i<=iend1; i++) {


for (j=1; j<nn-1; j++) {
for (k=0; k<km; k++) {
f2[i][j][k]=-am7[k]*ps1[i][j];
sumf1 +=f1[i][j][k];
sumf2 +=f2[i][j][k];
}
}
}
/*
Output data for validation
*/
MPI_Allreduce ((void *)&sumf1,(void *)&gsumf1, 1, MPI_DOUBLE, MPI_SUM, comm);
MPI_Allreduce ((void *)&sumf2,(void *)&gsumf2, 1, MPI_DOUBLE, MPI_SUM, comm);
icount1 = m*nn*km;
iroot=0;
MPI_Gather((void *)&f2[istart][0][0],icount1,MPI_DOUBLE,
(void *)&tt,
icount1,MPI_DOUBLE, iroot, comm);
if (myid == 0) {
printf( "SUMF1,SUMF2= %.5f\t%.5f\n", gsumf1, gsumf2 );
printf( " tt[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
tt[i][1][1],tt[i+5][1][1],tt[i+10][1][1],tt[i+15][1][1],
tt[i+20][1][1],tt[i+25][1][1],tt[i+30][1][1],tt[i+35][1][1]);
}
clock=MPI_Wtime() - clock;
printf( " myid, clocktime= %d\t%.5f\n", myid, clock);
}
MPI_Finalize();
return 0;
}

The output of T6DCP on four CPUs is:
ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND,istartg=4	3	1	40	121
NPROC,MYID,ISTART,IEND,istartg=4	0	1	40	1
NPROC,MYID,ISTART,IEND,istartg=4	2	1	40	81
NPROC,MYID,ISTART,IEND,istartg=4	1	1	40	41
SUMF1,SUMF2= 26172.46054	-2268.89180
 F2[i][1][1],i=1,160,5
0.000	-0.333	-0.295	-0.281	-0.274	-0.269	-0.266	-0.264
-0.262	-0.261	-0.260	-0.259	-0.258	-0.258	-0.257	-0.257
-0.256	-0.256	-0.255	-0.255	-0.255	-0.255	-0.255	-0.254
-0.254	-0.254	-0.254	-0.254	-0.254	-0.253	-0.253	-0.253
 myid, clock time= 0	0.02943
 myid, clock time= 1	0.02882
 myid, clock time= 2	0.02873
 myid, clock time= 3	0.02895
Using MPI_Sendrecv, T5DCP takes about 0.030 seconds on four CPUs; with MPI_Isend, MPI_Irecv and MPI_Wait, T6DCP takes about 0.029 seconds, so the nonblocking version is slightly faster than the blocking one.


6.2 Combining several messages into one

When a CPU has to send several pieces of data to the same destination, they can be packed into one buffer with MPI_Pack, sent as a single message, and unpacked on the receiving CPU with MPI_Unpack.  Suppose, for example, that two arrays ps1 and ps2 each need their row iend sent to the right neighbor and their ghost row istartm1 received from the left neighbor.  With two separate MPI_Sendrecv calls this would be:
itag=110;
MPI_Sendrecv ((void *)&ps1[iend][0],
n, MPI_DOUBLE, r_nbr, itag,
(void *)&ps1[istartm1][0], n, MPI_DOUBLE, l_nbr, itag, comm, istat);
itag=120;
MPI_Sendrecv ((void *)&ps2[iend][0],     n, MPI_DOUBLE, r_nbr, itag,
              (void *)&ps2[istartm1][0], n, MPI_DOUBLE, l_nbr, itag, comm, istat);
Using MPI_Pack, the two iend rows are packed into buf1, buf1 is sent to the right neighbor with a single MPI_Sendrecv (receiving buf2 from the left neighbor), and MPI_Unpack then extracts the two istartm1 ghost rows from buf2:
#define n 120
#define bufsize n*2*8
char         buf1[bufsize], buf2[bufsize];
int          ipos, itag, icount, l_nbr, r_nbr;
MPI_Comm     comm;
MPI_Status   istat[8];
MPI_Barrier (comm);
ipos=0;
MPI_Pack ( (void *)&ps1[iend][0], n, MPI_DOUBLE, (void *)&buf1, bufsize, &ipos, comm);
MPI_Pack ( (void *)&ps2[iend][0], n, MPI_DOUBLE, (void *)&buf1, bufsize, &ipos, comm);
itag=120;
MPI_Sendrecv ((void *)&buf1, bufsize, MPI_CHAR, r_nbr, itag,
(void *)&buf2, bufsize, MPI_CHAR, l_nbr, itag, comm, istat);
if (myid > 0) {
ipos=0;
MPI_Unpack ( (void *)&buf2, bufsize, &ipos,
             (void *)&ps1[istartm1][0], n, MPI_DOUBLE, comm);
MPI_Unpack ( (void *)&buf2, bufsize, &ipos,
(void *)&ps2[istartm1][0], n, MPI_DOUBLE, comm);
}
The same effect can be obtained without MPI_Pack/MPI_Unpack by copying the data into a buffer explicitly:
double buf1[2][n], buf2[2][n];
for (j=0; j<n; j++) {
buf1[0][j]=ps1[iend][j];
buf1[1][j]=ps2[iend][j];
}
icount=n*2;
itag=120;
MPI_Sendrecv ((void *)&buf1, icount, MPI_DOUBLE, r_nbr, itag,
(void *)&buf2, icount, MPI_DOUBLE, l_nbr, itag, comm, istat);
if (myid > 0) {
for (j=0; j<n; j++) {
ps1[istartm1][j]=buf2[0][j];
ps2[istartm1][j]=buf2[1][j];
}
}
Either way, one MPI_Sendrecv call now replaces two, and each message carries 2n doubles instead of n.  This matters because, as Figures 6.3 and 6.4 show for the IBM SP2, IBM SP2 SMP, HP SPP2000 and Fujitsu VPP300, the effective point-to-point transfer rate keeps rising with message length until the message reaches roughly 1 MB; short messages are dominated by the fixed start-up (latency) cost, so fewer, longer messages move the same data in less time.  Packing can also be done with the derived data type MPI_Type_struct, which is introduced in section 7.1.
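As a rough illustration (the numbers here are assumed for the sake of the example, not measured values from these notes): if a message costs a fixed start-up time of 40 microseconds plus the transfer time at 30 Mbytes/s, two separate 960-byte messages cost about 2*40 + 2*32 = 144 microseconds, while one combined 1920-byte message costs about 40 + 64 = 104 microseconds; combining the messages saves one start-up time.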

(Figure 6.3: point-to-point message passing test on the IBM SP2; transfer rate in Mbytes/s (up to about 35) versus message length from 8 bytes to 16 Mbytes for the IBM SP2_160 and IBM SP2_120.)

(Figure 6.4: point-to-point communication test on different computers; transfer rate in Mbytes/s (up to about 1000) versus message length for the Fujitsu VPP300, HP SPP2000, IBM SP2_375 and IBM SP2_160.)

6.3 Trading a little computation for less communication

Sometimes a small amount of redundant computation on each CPU removes a communication step altogether.  In the following loops ps2 and u2 are needed at index i-1 and i+1, and d1, which is computed from ps2, is afterwards needed at index i-1, so d1[istartm1] has to be obtained from the left neighbor:
for (i=istart; i<=iend1; i++) {
for (j=0; j<jend1; j++) {
d1[i][j]=(ps2[i+1][j]+ps2[i][j])*HXU[i][j]*0.50;
d2[i][j]=(ps2[i][j+1]+ps2[i][j])*HXV[i][j]*0.50;
}
}
MPI_Sendrecv((void *)&d1[iend][0],
nn, MPI_DOUBLE, r_nbr, itag,
(void *)&d1[istartm1][0], nn, MPI_DOUBLE, l_nbr, itag, comm, istat);
for (i=istart2; i<=iend1; i++)
for (j=1; j<n1; j++)
for (k=0; k<kk; k++)
d11[i][j][k]= (d1[i][j]*u2[i][j][k]-d1[i-1][j]*u2[i-1][j][k])*hmmx[i][j]
+ (d2[i][j]*v2[i][j][k]-d2[i][j-1]*v2[i][j-1][k])*hmmy[i][j];

Since ps2 is already available at i-1 and i+1 (its boundary rows were exchanged earlier), each CPU can simply start the d1 loop one row earlier, at istartm1, and compute d1[istartm1] itself instead of receiving it:
for (i=istartm1; i<=iend1; i++) {
for (j=0; j<jend1; j++) {
d1[i][j]=(ps2[i+1][j]+ps2[i][j])*HXU[i][j]*0.50;
d2[i][j]=(ps2[i][j+1]+ps2[i][j])*HXV[i][j]*0.50;
}
}
for (i=istart2; i<=iend1; i++)


for (j=1; j<n1; j++)
for (k=0; k<kk; k++)
d11[i][j][k]= (d1[i][j]*u2[i][j][k]-d1[i-1][j]*u2[i-1][j][k])*hmmx[i][j]
+ (d2[i][j]*v2[i][j][k]-d2[i][j-1]*v2[i][j-1][k])*hmmy[i][j];

In the RFS and CHEF application codes, for example, the boundary data of u1, v1, t1, q1, ps1, u3, v3, t3, q3, ps3 and wp1 is combined with MPI_Pack and MPI_Unpack and exchanged in single messages.

6.4 Parallel input and output

So far the input data has been read by one CPU and distributed to the other CPUs with MPI_Scatter, MPI_Scatterv or MPI_Bcast, and the results have been collected on one CPU with MPI_Gather or MPI_Gatherv before being written out.  An alternative is to let every CPU read and write its own share of the data directly.
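As a small sketch of the first approach (rank 0 reads, then broadcasts; the file and array names follow PIOSEQ below, but the snippet itself is not part of the original notes):

double b[mm];
if (myid == 0) {                       /* only rank 0 touches the file */
   fp = fopen("input.dat", "r");
   fread((void *)&b, sizeof(b), 1, fp);
   fclose(fp);
}
MPI_Bcast((void *)&b, mm, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* every process gets a copy */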

6.4.1 Reading input data in parallel
The program PIOSEQ below generates the test arrays b, c, d, writes them to a single file input.dat, and in addition splits them into np pieces written to the files input.11, input.12, input.13, ... so that each of the np processes of the parallel version can later read its own file:
/*
   PROGRAM PIOSEQ
*/
#include <stdio.h>
#include <stdlib.h>
#define mm 200
#define np  4
#define m  50
main ()
{
double       suma, a[mm], b[mm], c[mm], d[mm];
int          i, j, iu, size, ip, istart, iend;
FILE         *fp;
char         string[10];
/*
   test data generation and write to file 'input.dat'
*/

for (i = 0; i < mm; i++) {


j=i+1;
b[i] = 3. / (double) j + 1.0;
c[i] = 2. / (double) j + 1.0;
d[i] = 1. / (double) j + 1.0;
}
fp = fopen( "input.dat", "w");
fwrite( (void *)&b, sizeof(b), 1, fp );


fwrite( (void *)&c, sizeof(c), 1, fp );
fwrite( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
/*
prepare parallel input data for np process
*/
for (ip=0; ip<np; ip++) {
iu=11+ip;
sprintf(string, "input.%d", iu);
fp = fopen(string, "w");
startend ( ip, np, 0, mm-1, &istart, &iend);
size = (iend-istart+1)*sizeof(double);
fwrite ((void *)&b[istart], size, 1, fp);
fwrite ((void *)&c[istart], size, 1, fp);
fwrite ((void *)&d[istart], size, 1, fp);
fclose( fp );
}
/*
sequential processing
*/
fp = fopen( "input.dat", "r");
fread( (void *)&b, sizeof(b), 1, fp );
fread( (void *)&c, sizeof(c), 1, fp );
fread( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
suma = 0.;
for (i = 0; i < mm; i++) {
a[i] = b[i] + c[i] * d[i];
suma += a[i];
}
for (i = 0; i < mm; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
}
printf( "sum of array A=%f\n", suma);
exit(0);
}
startend(int myid,int nproc,int is1,int is2,int* istart,int* iend)
{
int ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
if(myid < ir) {
*istart=is1+myid*(iblock+1);
*iend=*istart+iblock;
}
else {
*istart=is1+myid*iblock+ir;
*iend=*istart+iblock-1;
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}

For three-dimensional arrays the same splitting is written per block of the first dimension:
double a[mm][nn][kk], b[mm][nn][kk], c[mm][nn][kk], d[mm][nn][kk];
size = (iend-istart+1)*nn*kk*sizeof(double);
fwrite ((void *)&b[istart][0][0], size, 1, fp);
fwrite ((void *)&c[istart][0][0], size, 1, fp);
fwrite ((void *)&d[istart][0][0], size, 1, fp);
In the parallel program PIODCP each CPU then reads its own data from its own file:
/*

PROGRAM PIODCP
Each processor read its own data from individual file

*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define mm 200
#define np  4
#define m  50

main ( argc, argv)


int argc;
char **argv;
{
char         string[10];
int          i, j, k, iu, size;
FILE         *fp;
double       a[m], b[m], c[m], d[m], t[mm], suma, sumall;
int          nproc, myid, istart, iend, iroot, idest;
MPI_Comm     comm;
MPI_Status istat[8];
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm=MPI_COMM_WORLD;
istart=0;
iend=m-1;
/*
READ INPUT DATA and DISTRIBUTE INPUT DATA
*/
if(nproc != np) {
printf( "nproc not equal to np= %d\t%d\t",nproc, np);
printf(" program will stop");
MPI_Finalize();
return 0;
}
iu=11+myid;
sprintf(string, "input.%d", iu);
fp = fopen(string, "r");
size = m*sizeof(double);
fread ((void *)&b[istart], size, 1, fp);
fread ((void *)&c[istart], size, 1, fp);
fread ((void *)&d[istart], size, 1, fp);
fclose( fp );
/*
COMPUTE, GATHER COMPUTED DATA, and WRITE OUT the RESULT
*/
suma=0.0;
/* for(i=0; i<ntotal; i++) { */
for(i=0; i<m; i++) {


a[i]=b[i]+c[i]*d[i];
suma=suma+a[i];
}
idest=0;
MPI_Gather((void *)&a,m,MPI_DOUBLE, (void *)&t, m, MPI_DOUBLE, idest, comm);
MPI_Reduce((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, idest, comm);
if (myid == 0) {
for (i = 0; i < mm; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
t[i],t[i+5],t[i+10],t[i+15],t[i+20],t[i+25],t[i+30],t[i+35]);
}
printf( "sum of array A=%f\n",sumall);
}
sprintf(string, "output.%d", iu);
fp = fopen(string, "w");
size = m*sizeof(double);
fwrite ((void *)&b[istart], size, 1, fp);
fclose( fp );
MPI_Finalize();
return 0;
}

For three-dimensional arrays the corresponding reads are:
size = m*nn*kk*sizeof(double);
fread ((void *)&b[istart][0][0], size, 1, fp);
fread ((void *)&c[istart][0][0], size, 1, fp);
fread ((void *)&d[istart][0][0], size, 1, fp);
Reading in parallel like this balances the I/O load over the CPUs, but when all the input.xx files live on the same shared file system the CPUs still compete for the same disk.  If each CPU has a local disk, a further improvement is to copy each file to the local disk (for example /var/tmp) with the system call and let each CPU read its own input.xx from /var/tmp:
#define mm 200
#define np  4
#define m  50
#include <mpi.h>
double       a[m+2], b[m+2], c[m+2], d[m+2], tt[mm];
char         fname[30], cmd[30];
int          nproc, myid, istart, iend, i, iu;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
. . . . . .
iu=11+myid;
sprintf(cmd, "cp input.%d /var/tmp", iu);
system (cmd);
sprintf(fname, "/var/tmp/input.%d", iu);
fp = fopen(fname, "r");
size = m*sizeof(double);
fread ((void *)&b[1], size, 1, fp);
fread ((void *)&c[1], size, 1, fp);
fread ((void *)&d[1], size, 1, fp);
fclose( fp );


6.4.2 Writing output data in parallel
Instead of gathering the result onto one CPU, each CPU can write its own part of array A to its own file output.xx:
#define mm 200
#define np  4
#define m  50
#include <mpi.h>

double       a[m], b[m], c[m], d[m], tt[mm];
char         fname[10];
int          nproc, myid, istart, iend, i, iu, size;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
. . . . . .
iu=11+myid;
sprintf(fname, "output.%d", iu);
fp = fopen(fname, "w");
size = m*sizeof(double);
fwrite ((void *)&b, size, 1, fp);
fwrite ((void *)&c, size, 1, fp);
fwrite ((void *)&d, size, 1, fp);
fclose( fp );

For three-dimensional arrays the declarations and writes become:
double a[m][nn][kk], b[m][nn][kk], c[m][nn][kk], d[m][nn][kk], tt[mm][nn][kk];
size = m*nn*kk*sizeof(double);
fwrite ((void *)&b, size, 1, fp);
fwrite ((void *)&c, size, 1, fp);
fwrite ((void *)&d, size, 1, fp);
After the run, a small post-processing program can read the np files output.11, output.12, output.13, ... and merge them into a single file:

#define np  4
#define mm 200
char         fname[10];
double       a[mm], b[mm], c[mm], d[mm];
int          i, iu, size;
for (i=0; i<np; i++) {
iu=11+i;
sprintf (fname, "output.%d", iu);
fp = fopen (fname, "r");
startend (i, np, 0, mm-1, &istart, &iend);
size = (iend-istart+1)*sizeof(double);
fread ((void *)&a[istart], size, 1, fp);
fclose( fp );
}
sprintf (fname, "output.dat");
fp = fopen (fname, "w");
fwrite ((void *)&a, sizeof(a), 1, fp);

For three-dimensional arrays the corresponding read is:
size = (iend-istart+1)*nn*kk*sizeof(double);
fread ((void *)&a[istart][0][0], size, 1, fp);

This chapter introduces further MPI derived data types, transposing a block distribution (Transposing Block Distribution), and parallelizing loops with data dependence by the two-way recursive and pipeline methods (2 Way Recursive and Pipeline method).

7.1 Further MPI derived data types

Besides the basic data types such as MPI_INT, MPI_FLOAT, MPI_DOUBLE and MPI_CHAR, MPI lets the user build derived data types with MPI_Type_vector, MPI_Type_contiguous, MPI_Type_indexed and MPI_Type_struct.  MPI_Type_vector, already used in chapter 5, describes equally spaced data (constant stride); MPI_Type_contiguous describes a block of contiguous elements; and MPI_Type_struct describes a collection of different data types, much like a C struct.  Take the following C struct as an example:

struct {
float a;
float b;
int
n;
} load;
The corresponding MPI derived data type is built with MPI_Type_struct as follows:
#define count 3
int           length[count];
MPI_Datatype  oldtype[count];
MPI_Aint      disp[count];
MPI_Datatype  newtype;
MPI_Type_struct ( count, length, disp, oldtype, &newtype);
MPI_Type_commit (&newtype);
count     number of members in the struct
length    array of count entries giving the number of elements in each member
disp      array of count entries of type MPI_Aint giving the displacement (in bytes) of each member
oldtype   array of count entries of type MPI_Datatype giving the data type of each member
newtype   the resulting derived data type
The displacement of each member is obtained with MPI_Address:
MPI_Address ( (void *)&data, &address);
data      the variable whose address is wanted
address   the returned address of data

/*

PROGRAM T7STRUCT
C struct and related MPI_Type_struct example

*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define count 3
/*--------- MPI related data ---------*/
int          nproc, myid;
MPI_Comm     comm;
MPI_Status   istat[8];

main ( argc, argv)


int argc;
char **argv;
{
int i, itag;
int length[3] = {1, 1, 1};
MPI_Datatype oldtype[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
MPI_Datatype newtype;
MPI_Aint     disp[3];
struct {
float a;
float b;
int   n;
} new;


MPI_Init (&argc, &argv);


MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm = MPI_COMM_WORLD;
MPI_Address ((void *)&new.a, &disp[0] );
MPI_Address ((void *)&new.b, &disp[1] );
MPI_Address ((void *)&new.n, &disp[2] );
for (i=2; i>=0; i--)
disp[i] -= disp[0];
MPI_Type_struct ( count, length, disp, oldtype, &newtype);
MPI_Type_commit(&newtype);
itag = 10;
if (myid == 0 ) {
scanf ("%f %f %d", &new.a, &new.b, &new.n);
MPI_Send ((void *)&new, 1, newtype, 1, itag, comm);
}
else {
MPI_Recv ((void *)&new, 1, newtype, 0, itag, comm, istat);
printf ("a,b,n=%f\t%f\t%d\n", new.a, new.b, new.n);
}
MPI_Finalize();
return 0;
}
Running T7STRUCT on two CPUs with the input 10.0 20.0 30 gives:
ATTENTION: 0031-408 2 tasks allocated by LoadLeveler, continuing...
a,b,n=10.000000	20.000000	30
This shows how a C struct containing different data types can be sent between CPUs as one message by describing it with MPI_Type_struct.
The next example revisits section 6.2: several arrays are packed into buf1 with MPI_Pack, exchanged with one MPI_Sendrecv into buf2, and unpacked from buf2 with MPI_Unpack:
#define im 160
#define km  20
float        up[im+1][km], vp[im+1][km], wp[im+1][km];
int          bufsize = km*4*8, km2=km*2, itag, ipos, l_nbr, r_nbr;
char         buf1[bufsize], buf2[bufsize];
MPI_Comm     comm;
. . . . . . . .
if (myid > 0) {
ipos=0;

MPI_Pack ((void *)&up[istart][0], km,  MPI_FLOAT, (void *)&buf1, bufsize, &ipos, comm);
MPI_Pack ((void *)&vp[istart][0], km2, MPI_FLOAT, (void *)&buf1, bufsize, &ipos, comm);
MPI_Pack ((void *)&wp[istart][0], km,  MPI_FLOAT, (void *)&buf1, bufsize, &ipos, comm);
}
itag=202;
MPI_Sendrecv((void *)&buf1, bufsize, MPI_CHAR, l_nbr, itag,
(void *)&buf2, bufsize, MPI_CHAR, r_nbr, itag, comm, istat);
if (myid < nproc) {
ipos=0;
MPI_Unpack ((void *)&buf2, bufsize, &ipos,
            (void *)&up[iendp1][0], km,  MPI_FLOAT, comm );
MPI_Unpack ((void *)&buf2, bufsize, &ipos,
            (void *)&vp[iendp1][0], km2, MPI_FLOAT, comm );
MPI_Unpack ((void *)&buf2, bufsize, &ipos,
            (void *)&wp[iendp1][0], km,  MPI_FLOAT, comm );
}
Each group of km contiguous elements can be described by a derived data type cont2d created with MPI_Type_contiguous:
MPI_Datatype cont2d;
MPI_Type_contiguous ( km, MPI_FLOAT, &cont2d );
MPI_Type_commit (&cont2d);

For the transfer above, up contributes km elements, vp contributes km*2 elements and wp contributes km elements, all starting at [istart][0] on the sending side and [iendp1][0] on the receiving side.  Since up, vp and wp have the same dimensions, the three pieces can be combined into one derived data type ipack3 and exchanged with a single MPI_Sendrecv, with no Pack or Unpack copying at all:
int
length[3], ifirst = 1;
MPI_Datatype ipack3, itype[3];
MPI_Aint
disp[3];
. . . . . . .
if (ifirst == 1) {
ifirst = 0;
length[0] = 1;
length[1] = 2;
length[2] = 1;
itype[0] = cont2d;
itype[1] = cont2d;
itype[2] = cont2d;
MPI_Address( (void *)&up[istart][0], &disp[0]);
MPI_Address((void *)&vp[istart][0], &disp[1]);
MPI_Address((void *)&wp[istart][0], &disp[2]);
for (i=2; i>=0; i--)
disp[i] -= disp[0];
MPI_Type_struct( 3, length, disp, itype, &ipack3);
MPI_Type_commit(&ipack3);
}
itag=202;
MPI_Sendrecv( (void *)&up[istart][0], 1, ipack3, l_nbr, itag,
(void *)&up[iendp1][0], 1, ipack3, r_nbr, itag, comm, istat);
This avoids the extra memory copies that MPI_Pack and MPI_Unpack perform.

7.2 Transposing a block distribution

Some algorithms need an array transpose (Array Transpose): an array a(i,j) that is distributed over the processes along one dimension has to be redistributed along the other, as sketched in Figure 7.1.

(Figure 7.1: row_to_col block transpose of a(i,j) among three processes P0, P1, P2; the array is divided into 3x3 numbered blocks, and the transpose moves the off-diagonal blocks between processes so that a distribution over the 2nd dimension becomes a distribution over the 1st dimension.)

Rather than moving single elements, the transpose is done block by block (block transpose): with 3 CPUs there are 9 blocks, with 4 CPUs 16 blocks, and with 5 CPUs 25 blocks.  To switch between the row distribution and the column distribution of Figure 7.1, a derived data type is defined for every block [i][j], as shown in Figure 7.2, so that each off-diagonal block can be sent to and received from the CPU that owns it in the other distribution.

(Figure 7.2: Derived Data Type for the Row_to_Col Transpose; each block of A(I,J) is described by a derived data type itype(i,j), labelled by its block coordinates (0,0) .. (2,2).)

(Figure 7.3: a single block as a derived data type; the block covers rows of length jleng (from jmin to jmax in the 2nd dimension) repeated ileng times along the 1st dimension, and the initial address of the derived data type is the first element of the block.)

Each block is therefore a strided piece of the array, like vector2d in chapter 5, and the derived data types block2d[i][j] for all the blocks of Figures 7.2 and 7.3 are created with MPI_Type_vector:
int          jmin, jmax, ileng, jleng, count, stride;
MPI_Datatype block2d[ip][jp];
stride = jmax - jmin +1;

for (i=0; i<nproc; i++) {


ileng=iendg[i]-istartg[i]+1;
for (j=0; j<nproc; j++) {
jleng=jendg[j]-jstartg[j]+1;
MPI_Type_vector (ileng,jleng,nn,MPI_INT,& block2d[i][j]);
MPI_Type_commit (&block2d[i][j]);
}
}
Using these block2d data types, the row_to_col transpose exchanges every off-diagonal block with nonblocking sends and receives:
itag=10;
k=-1;
for (id = 0; id < nproc; id++) {
if (id != myid ) {
k=k+1;
istart1=istartg[id];
jstart1=jstartg[id];
MPI_Isend( (void *)&a[istart1][jstart], 1, block2d[id][myid], id, itag, comm, &req1[k]);
MPI_Irecv( (void *)&a[istart][jstart1], 1, block2d[myid][id], id, itag, comm, &req2[k]);
}
}
icount=nproc-1;
MPI_Waitall (icount, req1, stat);
MPI_Waitall (icount, req2, stat);
MPI_Waitall waits for a whole set of MPI_Isend/MPI_Irecv operations at once:
MPI_Waitall (count, request, status);
count     number of pending operations to wait for
request   array of count MPI_Request handles returned by MPI_Isend/MPI_Irecv
status    array of count MPI_Status entries for the completed operations
In row_to_col and col_to_row every process id other than myid contributes one send request req1[k] and one receive request req2[k]; the counter k numbers them consecutively so that MPI_Waitall can wait for all nproc-1 of them.

The same exchange could also be written with blocking MPI_Sendrecv calls instead of nonblocking send/recv, but the nonblocking version usually performs better because all the transfers can proceed at the same time:
itag=10;
for (id = 0; id < nproc; id++) {
if (id != myid ) {
istart1=istartg[id];
jstart1=jstartg[id];
MPI_Sendrecv( (void *)&a[istart1][jstart], 1, block2d[id][myid], id, itag,
              (void *)&a[istart][jstart1], 1, block2d[myid][id], id, itag, comm, istat);
}
}

(Figure 7.5: the col_to_row transpose, the reverse of Figure 7.1; the column-distributed blocks of a(i,j) are sent back so that each process again owns a slab of the second dimension.)

The same block2d data types are used for the col_to_row transpose; only the roles of the send and receive blocks are swapped:

itag=20;
k=-1;
for (id = 0; id < nproc; id++) {
if (id != myid ) {
k = k +1;
istart1=istartg[id];
jstart1=jstartg[id];
MPI_Isend( (void *)&a[istart][jstart1], 1, block2d[myid][id], id, itag, comm, &req1[k]);
MPI_Irecv( (void *)&a[istart1][jstart], 1, block2d[id][myid], id, itag, comm, &req2[k]);
}
}
icount=nproc-1;
MPI_Waitall (icount, req1, stat);
MPI_Waitall (icount, req2, stat);
The complete block-transpose program is:
/*
   program transpose
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define np 3
#define mm 9
#define nn 6
main ( argc, argv)
int argc;
char **argv;
{
int          a[mm][nn];
int          istartg[np], iendg[np], jstartg[np], jendg[np];
int          i, j, k, nproc, myid, istart, iend, jstart, jend;
int          iu, id, itag, icount, istart1, jstart1, jleng;
FILE         *fp;
char         string[80], fname[16];
MPI_Datatype block2d[np][np];
MPI_Request  req1[np], req2[np];
MPI_Status   istat[8];
MPI_Comm     comm;

MPI_Init (&argc, &argv);


MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
comm = MPI_COMM_WORLD;
MPI_Barrier( comm );
for (i = 0; i < nproc; i++)
startend( i, nproc, 0, mm-1, &istartg[i],&iendg[i] );
for (j = 0; j < nproc; j++)
startend(j, nproc, 0, nn-1, &jstartg[j], &jendg[j] );
for (i=0; i<nproc; i++) {
icount=iendg[i]-istartg[i]+1;
for (j=0; j<nproc; j++) {
jleng=jendg[j]-jstartg[j]+1;
MPI_Type_vector (icount,jleng,nn,MPI_INT,&block2d[i][j]);
MPI_Type_commit (&block2d[i][j]);
}
}
istart=istartg[myid];
iend=iendg[myid];
jstart=jstartg[myid];
jend=jendg[myid];
printf("myid,istart,iend,jstart,jend=%d %d %d %d %d\n",
myid,istart,iend,jstart,jend);
for (j=jstart; j<=jend; j++) {
for (i=0; i<3; i++)
a[i][j]=1+myid;
for (i=3; i<6; i++)
a[i][j]=4+myid;
for (i=6; i<9; i++)
a[i][j]=7+myid;
}
iu=myid+11;
sprintf( fname,"output.%d", iu);
fp = fopen( fname, "w");
for (j=jstart; j<=jend; j++) {
sprintf(string,"%d %d %d %d %d %d %d %d %d\n",
a[0][j],a[1][j],a[2][j],a[3][j],a[4][j],a[5][j],a[6][j],a[7][j],a[8][j]);
fwrite( (void *)&string, sizeof(string), 1, fp );
}
/*
row_to_col
*/
itag=10;
k=-1;
for (id = 0; id < nproc; id++) {
if (id != myid ) {
k=k+1;
istart1=istartg[id];
jstart1=jstartg[id];
MPI_Isend( (void *)&a[istart1][jstart], 1, block2d[id][myid],
id, itag, comm, &req1[k]);
MPI_Irecv( (void *)&a[istart][jstart1], 1, block2d[myid][id],
id, itag, comm, &req2[k]);
}
}
icount=nproc-1;
MPI_Waitall (icount, req1, istat);
MPI_Waitall (icount, req2, istat);
sprintf( string, "after row_to_col\n");
fwrite( (void *)&string, sizeof(string), 1, fp );
for (j=nn-1; j>=0; j--) {
sprintf(string,"%d %d %d\0\0",
a[istart][j],a[istart+1][j],a[istart+2][j]);
fwrite( (void *)&string, sizeof(string), 1, fp );
}
/*
col_to_row
*/
MPI_Barrier( comm );
itag=20;
k=-1;
for (id = 0; id < nproc; id++) {
if (id != myid ) {
k=k+1;
istart1=istartg[id];
jstart1=jstartg[id];
MPI_Isend( (void *)&a[istart][jstart1], 1, block2d[myid][id],
id, itag, comm, &req1[k]);
MPI_Irecv( (void *)&a[istart1][jstart], 1, block2d[id][myid],
id, itag, comm, &req2[k]);
}
}
icount=nproc-1;
MPI_Waitall (icount, req1, istat);
MPI_Waitall (icount, req2, istat);
for (i=0; i<mm; i++)
for (j=jstart; j<=jend; j++)
a[i][j]=a[i][j]+10;
sprintf( string, "after col_to_row\n");
fwrite( (void *)&string, sizeof(string), 1, fp );
for (j=jstart; j<=jend; j++) {
sprintf(string,"%d %d %d %d %d %d %d %d %d\n",
a[0][j],a[1][j],a[2][j],a[3][j],a[4][j],a[5][j],a[6][j],a[7][j],a[8][j]);
fwrite( (void *)&string, sizeof(string), 1, fp );
}
MPI_Finalize();
return 0;
}
startend(int myid,int nproc,int is1,int is2,int* istart,int* iend)
{
int ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
if(myid < ir) {
*istart=is1+myid*(iblock+1);
*iend=*istart+iblock;
}
else {
*istart=is1+myid*iblock+ir;
*iend=*istart+iblock-1;
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
The three output files written by the three CPUs contain the array before the transpose, after row_to_col, and after col_to_row (each element increased by 10):

output.11 (process 0)
1 1 1 4 4 4 7 7 7
1 1 1 4 4 4 7 7 7
after row_to_col
3 3 3
3 3 3
2 2 2
2 2 2
1 1 1
1 1 1
after col_to_row
11 11 11 14 14 14 17 17 17
11 11 11 14 14 14 17 17 17

output.12 (process 1)
2 2 2 5 5 5 8 8 8
2 2 2 5 5 5 8 8 8
after row_to_col
6 6 6
6 6 6
5 5 5
5 5 5
4 4 4
4 4 4
after col_to_row
12 12 12 15 15 15 18 18 18
12 12 12 15 15 15 18 18 18

output.13 (process 2)
3 3 3 6 6 6 9 9 9
3 3 3 6 6 6 9 9 9
after row_to_col
9 9 9
9 9 9
8 8 8
8 8 8
7 7 7
7 7 7
after col_to_row
13 13 13 16 16 16 19 19 19
13 13 13 16 16 16 19 19 19
7.3 Loops with data dependence: the pipeline method

The for loops parallelized so far had independent iterations.  In the loop below, however, the new value of x[i][j] depends on x[i-1][j] and x[i][j-1], which are themselves updated in the same sweep, so the iterations cannot simply be divided among the processes.  Such recursive loops can be parallelized with the two-way recursive method (2-Way Recursive) or with the pipeline method (Pipeline Method) described in this section.
#define m 128
#define n 128
double x[m+2][n+2];
for (i=1; i<=m; i++)
for (j=1; j<=n; j++)
x[i][j]=x[i][j]+( x[i-1][j]+x[i][j-1] )*0.5;

(Figure 7.2 (a): the array x[i][j] partitioned on the first dimension (i) over processes P0, P1, P2.)

The array x is partitioned on the first dimension as in Figure 7.2(a).  Because x[i][j] needs x[i-1][j], each CPU must wait for the boundary row of the previous CPU before it can start; if the whole row were exchanged at once, the CPUs would execute one after another with no overlap at all.  In the pipeline method the j loop is instead divided into small blocks of columns: as soon as CPU 0 finishes the first block of its last row, it passes that small piece to CPU 1, which can start on the same block of columns while CPU 0 goes on with the next block, and so on down the line.  Figure 7.2(b) shows the resulting timing: after a short start-up delay all CPUs are busy at the same time, each working on a different block of columns.

(Figure 7.2 (b): time diagram of the pipeline over P0, P1, P2; each CPU starts one block later than its predecessor and then proceeds in parallel.)
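A compact sketch of the pipelined loop (the block width iblock and the variable names follow the parallel program pipeline listed later in this section; this fragment is only an outline, not the complete code):

for (jj = 1; jj <= n; jj += iblock) {                 /* one small block of columns at a time */
   iblklen = (iblock < n-jj+1) ? iblock : n-jj+1;
   /* wait for the boundary row of this block from the previous CPU */
   MPI_Recv ((void *)&x[istartm1][jj], iblklen, MPI_DOUBLE, l_nbr, itag, comm, istat);
   for (i = istart; i <= iend; i++)
      for (j = jj; j <= jj+iblklen-1; j++)
         x[i][j] = x[i][j] + ( x[i-1][j] + x[i][j-1] )*0.5;
   /* pass the boundary row of this block on to the next CPU */
   MPI_Send ((void *)&x[iend][jj], iblklen, MPI_DOUBLE, r_nbr, itag, comm);
}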

/*
   program pipeseq
*/
#include <stdio.h>

#include <stdlib.h>
#define m 128
#define n 128


main ()
{
double       x[m+2][n+2], eps, omega, err1, temp, clock;
int          i, j, k, loop, isec1, nsec1, isec2, nsec2;
FILE         *fp;

wtime(&isec1, &nsec1);
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
for (loop=0; loop<36000; loop++) {
err1=0.0;
for (i=1; i<=m; i++) {
for (j=1; j<=n; j++) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
if(err1 <= eps) break;
}
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
wtime(&isec2, &nsec2);
clock=(double) (isec2-isec1) + (double) (nsec2-nsec1)/1.0e9;


printf( " clock time=%f\n", clock);
return 0;
}
#include <sys/time.h>
int wtime(int *isec, int *nsec)
{
struct timestruc_t tb;
int iret;
iret=gettimer(TIMEOFDAY, &tb);
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
PIPESEQ executed on one CPU of the IBM SP2 SMP converged after 10566 iterations and took 10.66 seconds:
loop,err1 = 10566 9.99821e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
clock time=10.663873
/*
   program pipeline
   Parallel on 1st dimension
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
char **argv;
{
double      x[m+2][n+2], eps, omega, err1, gerr1, temp, clock;
int         i, j, k, ip, itag, loop, iblock, iblklen, jj, isrc;
int         nproc, myid, istart, iend, count, istart1, count1,
            istartm1, iendp1, l_nbr, r_nbr, lastp;
int         istartg[32], iendg[32];
MPI_Status  istat[8];
MPI_Comm    comm;
FILE        *fp;
MPI_Init (&argc, &argv);


MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
MPI_Barrier (MPI_COMM_WORLD);
clock=MPI_Wtime();
for (i = 0; i < nproc; i++) {
startend(i, nproc, 1, m, &istartg[i], &iendg[i]);
}
comm=MPI_COMM_WORLD;
istart=istartg[myid];
iend =iendg[myid];
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
lastp =nproc-1;
istartm1=istart-1;
iendp1=iend+1;
l_nbr = myid-1;
r_nbr = myid+1;
if(myid == 0)     l_nbr = MPI_PROC_NULL;
if(myid == lastp) r_nbr = MPI_PROC_NULL;
if( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
count=(m+2)*(n+2);
MPI_Bcast((void *)&x, count, MPI_DOUBLE, 0, comm);
for (ip=0; ip<nproc; ip++)
startend( ip, nproc, 1, m, &istartg[ip], &iendg[ip]);
iblock = 4;
omega = 0.5;
eps = 1.0e-5;
for (loop=1; loop<36000; loop++) {
err1 = 1.0e-15;
itag = 20;
MPI_Sendrecv ((void *)&x[istart][0], n+2, MPI_DOUBLE, l_nbr, itag,
(void *)&x[iendp1][0], n+2, MPI_DOUBLE, r_nbr, itag, comm, istat);
itag = 10;
for (jj=1; jj<=m; jj+=iblock) {
iblklen = min(iblock, n-jj+1);
MPI_Recv( (void *)&x[istartm1][jj], iblklen, MPI_DOUBLE, l_nbr, itag, comm, istat);
for (i=istart; i<=iend; i++) {
for (j=jj; j<=jj+iblklen-1; j++) {
temp = 0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j] = x[i][j]+omega*temp;
if ( temp < 0.0) temp = -temp;
if ( temp > err1) err1 = temp;
}
}
MPI_Send( (void *)&x[iend][jj], iblklen, MPI_DOUBLE, r_nbr, itag, comm);
}
MPI_Allreduce((void *)&err1,(void *)&gerr1,1,MPI_DOUBLE,MPI_MAX, comm);
err1 = gerr1;
if(err1 < eps) break;
}
itag = 110;
if( myid == 0) {
for (isrc=1; isrc<nproc; isrc++) {
istart1=istartg[isrc];
count1=(iendg[isrc]-istart1+1)*(n+2);
MPI_Recv((void *)&x[istart1][0], count1, MPI_DOUBLE, isrc, itag, comm, istat);
}
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
else {
count = (iend-istart+1)*(n+2);
MPI_Send ((void *)&x[istart][0], count, MPI_DOUBLE, 0, itag, comm);
}
clock = MPI_Wtime() - clock;
printf( " myid, clock time= %d %f\n", myid, clock);
MPI_Finalize();
return 0;
}
startend(int myid,int nproc,int is1,int is2,int* istart,int* iend)
{
int ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
if(myid < ir) {
*istart=is1+myid*(iblock+1);
*iend=*istart+iblock;
}
else {
*istart=is1+myid*iblock+ir;
*iend=*istart+iblock-1;
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
min(int i1, int i2)
{
if (i1 < i2) return i1;
else return i2;
}
PIPELINE executed on four CPUs of the IBM SP2 SMP took 13.93 seconds, slower than the 10.66 seconds of the sequential PIPESEQ:
ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4  0   1  32
NPROC,MYID,ISTART,IEND=4  1  33  64
NPROC,MYID,ISTART,IEND=4  2  65  96
NPROC,MYID,ISTART,IEND=4  3  97 128
loop,err1 = 10567 9.99821e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
myid, clock time= 0 13.928672
myid, clock time= 1 13.927630
myid, clock time= 2 13.927630
myid, clock time= 3 13.927758
SOR (Successive Over-Relaxation)
This chapter parallelizes the SOR method in several ways: section 8.1 describes the method itself, section 8.2 uses red-black ordering, section 8.3 zebra ordering and section 8.4 a four-colour ordering.

8.1 SOR
The Successive Over-Relaxation (SOR) method is used here to solve the Laplace equation. In every sweep of the for loop, x[i][j] is corrected by omega times the difference between the average of its four neighbours and its current value, and err1 records the largest correction:
for (i=1; i<=m; i++) {
   for (j=1; j<=n; j++) {
      temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] ) - x[i][j];
      x[i][j]=x[i][j] + omega*temp;
      if (temp < 0.0) temp=-temp;
      if (temp > err1) err1=temp;
   }
}
As Figure 8.1 shows, the update of x[i][j] uses x[i-1][j] and x[i][j-1], which have already been updated in the current sweep, and x[i+1][j] and x[i][j+1], which have not. Because of this dependence the loop cannot be distributed directly; it can be parallelized with the pipeline method of section 7.3 or, as done next, with the red-black SOR method.
Figure 8.1: the SOR update of x[i][j]; x[i-1][j] and x[i][j-1] are already updated, x[i][j] is about to be updated, x[i+1][j] and x[i][j+1] are not updated yet (1st dimension i, 2nd dimension j).
/*
   program sor
   Sequential version of Successive Over-Relaxation Method
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
char **argv;
{
double  x[m+2][n+2], eps, omega, err1, temp, clock;
int     i, j, k, loop, isec1, nsec1, isec2, nsec2;
FILE    *fp;
wtime(&isec1, &nsec1);
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
for (loop=1; loop<36000; loop++) {
err1 = 0.0;
for (i=1; i<=m; i++) {
for (j=1; j<=n; j++) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
if(err1 <= eps) break;
}
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
wtime(&isec2, &nsec2);
clock=(double) (isec2-isec1) + (double) (nsec2-nsec1)/1.0e9;
printf( " clock time=%f\n", clock);
return 0;
}
#include <sys/time.h>
int wtime(int *isec, int *nsec)
{
struct timestruc_t tb;
int iret;
iret=gettimer(TIMEOFDAY, &tb);
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
The input file input.dat is generated by the following program, sordata:
/*
   program sordata
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n 128
double seed = 123456.78;
main ()
{
double  x[m+2][n+2];
int     i, j;
FILE    *fp;
for (i=0; i<=m+1; i++) {
   randnum( n+2, &x[i][0]);
}
for (i=0; i<=m+1; i++)
for (j=0; j<=n+1; j++)
x[i][j]=x[i][j]*10.0;
fp = fopen( "input.dat", "w");
fwrite( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
randnum ( int number, double array[])
{
int    i, ic, ifirst=1;
double a=16807.0, b=2147483647.0, twom01=0.50;
static double twom62,twom31,twom16,twom08,twom04,twom02;
if (ifirst == 1) {
ifirst=0;
twom02=twom01*twom01;
twom04=twom02*twom02;
twom08=twom04*twom04;
twom16=twom08*twom08;
twom31=twom16*twom08*twom04*twom02*twom01;
twom62=twom31*twom31;
}
for (i=0; i<number; i++) {
seed=seed*a;
ic=(int) (seed/b);
seed-=b*(double)ic;
array[i] = seed*twom31+seed*twom62;
}
return 0;
}
SORSEQ executed on one CPU of the IBM SP2 SMP converged after 10567 iterations and took 10.66 seconds:
loop,err1 = 10567 9.99821e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
clock time=10.664137
8.2 Red-black SOR
In the red-black SOR method the grid points are split into two groups by the parity of i+j: the red (white) points, where i+j is even, and the black points, where i+j is odd (Figure 8.2). Every red point has only black neighbours and vice versa, so first all red points are updated using the old black values, then all black points are updated using the new red values. Within each half-sweep the points are independent of one another, which removes the dependence that blocked the parallelization of the plain SOR loop.
Figure 8.2: red-black colouring of the grid x[i][j] (1st dimension i, 2nd dimension j); red (white) elements and black elements alternate like a checkerboard.
/*
   program sorrb
   Sequential version of red-black Successive Over-Relaxation Method
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
char **argv;
{
double  x[m+2][n+2], eps, omega, err1, temp, clock;
int     i, j, k, loop, isec1, nsec1, isec2, nsec2;
FILE    *fp;
wtime(&isec1, &nsec1);
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
for (loop=1; loop<36000; loop++) {
err1 = 0.0;
for (i=1; i<=m; i++) {
/* update red (white) points */
for (j=mod(i+1,2)+1; j<=n; j+=2) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
for (i=1; i<=m; i++) {
/* update black points */
for (j=mod(i,2)+1; j<=n; j+=2) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
if(err1 <= eps) break;
}
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
wtime(&isec2, &nsec2);
clock=(double) (isec2-isec1) + (double) (nsec2-nsec1)/1.0e9;
printf( " clock time=%f\n", clock);
return 0;
}
#include <sys/time.h>
int wtime(int *isec, int *nsec)
{
struct timestruc_t tb;
int iret;
iret=gettimer(TIMEOFDAY, &tb);
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
mod(int i1, int i2)
{
int i3;
i3=i1/i2;
i1=i1-i3*i2;
return i1;
}
SORRB executed on one CPU of the IBM SP2 SMP converged after 10313 iterations and took 4.51 seconds, compared with 10567 iterations and 10.66 seconds for the plain SOR program:
loop,err1 = 10313 9.99917e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
clock time=4.505438
Figure 8.3: the red-black grid partitioned along the 1st dimension among processes P0, P1 and P2 (red (white) elements and black elements as in Figure 8.2).
To parallelize the red-black SOR, the rows are distributed among the CPUs as in Figure 8.3. Each CPU needs the rows just outside its own range, so the boundary rows are exchanged twice per iteration: once before the red half-sweep and once before the black half-sweep.
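Before the full listing, the exchange itself can be shown in isolation. The sketch below is an illustration only (the array size and the simple block partition are invented); it exchanges the two boundary rows with MPI_Sendrecv, and MPI_PROC_NULL turns the communication at the two ends of the processor chain into no-ops, exactly as in sorrbp below.
/*
   halo_sketch.c -- exchange of the boundary rows with MPI_Sendrecv
   (illustration only; the array size and the partition are invented here)
*/
#include <stdio.h>
#include <mpi.h>
#define M 12
#define N 8
int main (int argc, char **argv)
{
   double x[M+2][N+2];
   int    i, j, nproc, myid, istart, iend, l_nbr, r_nbr;
   MPI_Status istat;
   MPI_Init (&argc, &argv);
   MPI_Comm_size (MPI_COMM_WORLD, &nproc);
   MPI_Comm_rank (MPI_COMM_WORLD, &myid);
   istart = 1 + myid*M/nproc;                /* a simple block partition of rows 1..M */
   iend   = (myid+1)*M/nproc;
   l_nbr  = (myid == 0)       ? MPI_PROC_NULL : myid-1;
   r_nbr  = (myid == nproc-1) ? MPI_PROC_NULL : myid+1;
   for (i = 0; i <= M+1; i++)                /* own rows get a recognizable value */
      for (j = 0; j <= N+1; j++)
         x[i][j] = (i >= istart && i <= iend) ? 100.0*myid + i : 0.0;
   /* my first row goes to the left neighbour, the row below my last row comes from the right */
   MPI_Sendrecv (&x[istart][0], N+2, MPI_DOUBLE, l_nbr, 1,
                 &x[iend+1][0], N+2, MPI_DOUBLE, r_nbr, 1, MPI_COMM_WORLD, &istat);
   /* my last row goes to the right neighbour, the row above my first row comes from the left */
   MPI_Sendrecv (&x[iend][0],     N+2, MPI_DOUBLE, r_nbr, 2,
                 &x[istart-1][0], N+2, MPI_DOUBLE, l_nbr, 2, MPI_COMM_WORLD, &istat);
   printf ("rank %d owns rows %d..%d\n", myid, istart, iend);
   MPI_Finalize ();
   return 0;
}
In sorrbp this pair of MPI_Sendrecv calls appears twice per iteration, once with tags 1 and 2 before the red half-sweep and once with tags 3 and 4 before the black half-sweep.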

/*
   program sorrbp -- red-black Successive Over-Relaxation Method
   Parallel on the first dimension
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define m 128
#define n 128
double      x[m+2][n+2];
int         nproc, myid, istart, iend, count, count1, istart1,
            istartm1, iendp1, l_nbr, r_nbr, lastp, itag;
int         istartg[32], iendg[32];
MPI_Status  istat[8];
MPI_Comm    comm;
main ( argc, argv)
int argc;
char **argv;
{
double  eps, omega, err1, gerr1, temp, clock;
int     i, j, n2, loop;
FILE    *fp;
MPI_Init (&argc, &argv);


MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
MPI_Barrier (MPI_COMM_WORLD);
clock=MPI_Wtime();
comm=MPI_COMM_WORLD;
for (i = 0; i < nproc; i++)
startend(i, nproc, 1, m, &istartg[i], &iendg[i]);
istart=istartg[myid];
iend =iendg[myid];
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
lastp =nproc-1;
istartm1=istart-1;
iendp1=iend+1;
l_nbr = myid-1;
r_nbr = myid+1;
if(myid == 0)
l_nbr = MPI_PROC_NULL;
if(myid == lastp) r_nbr = MPI_PROC_NULL;
if( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
count=(m+2)*(n+2);
MPI_Bcast((void *)&x, count, MPI_DOUBLE, 0, comm);
eps=1.0e-5;
omega=0.5;
for (loop=1; loop<12000; loop++) {
err1 = 0.0;
n2 = n+2;
MPI_Sendrecv ((void *)&x[istart][0], n2, MPI_DOUBLE, l_nbr, 1,
(void *)&x[iendp1][0], n2, MPI_DOUBLE, r_nbr, 1, comm, istat);
MPI_Sendrecv ((void *)&x[iend][0],
n2, MPI_DOUBLE, r_nbr, 2,
(void *)&x[istartm1][0],n2, MPI_DOUBLE, l_nbr, 2, comm, istat);
for (i=istart; i<=iend; i++) {
/* red (white) grid */
for (j=mod(i+1,2)+1; j<=n; j+=2) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
MPI_Sendrecv ((void *)&x[istart][0], n2, MPI_DOUBLE, l_nbr, 3,
(void *)&x[iendp1][0], n2, MPI_DOUBLE, r_nbr, 3, comm, istat);
MPI_Sendrecv ((void *)&x[iend][0],
n2, MPI_DOUBLE, r_nbr, 4,
(void *)&x[istartm1][0],n2, MPI_DOUBLE, l_nbr, 4, comm, istat);
for (i=istart; i<=iend; i++) {
/* black grid */
for (j=mod(i,2)+1; j<=n; j+=2) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
MPI_Allreduce((void *)&err1,(void *)&gerr1,1,MPI_DOUBLE,MPI_MAX, comm);
err1 = gerr1;
if(err1 <= eps) break;
}
itag=30;
if (myid == 0) {
for (i=1; i<nproc; i++) {
istart1=istartg[i];
count1 =(iendg[i]-istart1+1)*(n+2);
MPI_Recv ((void *)&x[istart1][0], count1, MPI_DOUBLE, i, itag, comm, istat);
}
}
else {
count=(iend-istart+1)*(n+2);
MPI_Send ((void *)&x[istart][0], count, MPI_DOUBLE, 0, itag, comm);
}
if (myid == 0) {
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
clock=MPI_Wtime() - clock;
printf( " myid,clock time=%d %f\n", myid,clock);
MPI_Finalize();
return 0;
}
mod(int i1, int i2)
{
int i3;
i3 = i1/i2;
i3 = i1-i3*i2;
return i3;
}
startend(int myid,int nproc,int is1,int is2,int* istart,int* iend)
{
int ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
if(myid < ir) {
*istart=is1+myid*(iblock+1);
*iend=*istart+iblock;
}
else {
*istart=is1+myid*iblock+ir;
*iend=*istart+iblock-1;
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
SORRBP executed on four CPUs of the IBM SP2 SMP needed the same 10313 iterations but took 4.72 seconds, against 4.51 seconds for the sequential red-black program, i.e. a speed-up of 4.51/4.72 = 0.96:
ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4  0   1  32
NPROC,MYID,ISTART,IEND=4  1  33  64
NPROC,MYID,ISTART,IEND=4  2  65  96
NPROC,MYID,ISTART,IEND=4  3  97 128
loop,err1 = 10313 9.99917e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
myid,clock time=0 4.718111
myid,clock time=1 4.717445
myid,clock time=2 4.717462
myid,clock time=3 4.717453
With eight CPUs the program still needs 10313 iterations but the time increases to 9.56 seconds:
ATTENTION: 0031-408 8 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=8  0    1   16
NPROC,MYID,ISTART,IEND=8  1   17   32
NPROC,MYID,ISTART,IEND=8  2   33   48
NPROC,MYID,ISTART,IEND=8  3   49   64
NPROC,MYID,ISTART,IEND=8  4   65   80
NPROC,MYID,ISTART,IEND=8  5   81   96
NPROC,MYID,ISTART,IEND=8  6   97  112
NPROC,MYID,ISTART,IEND=8  7  113  128
loop,err1 = 10313 9.99917e-06
myid,clock time=0 9.560302
(clock times of the other ranks, all about 9.558 seconds, omitted)
8.3 Zebra SOR
The zebra SOR method orders the sweep by rows instead of by single points: all odd-numbered rows (the white stripes in Figure 8.4) are updated first, then all even-numbered rows (the black stripes). A point in an odd row uses only its own row and the two neighbouring even rows, so the odd rows are independent of one another and can be distributed; the same holds for the even rows.
Figure 8.4: zebra colouring of the grid x[i][j], j = 1..n; white rows and black rows alternate along the 1st dimension.
/*
   program sorzebra
   Sequential version of zebra SOR (Successive Over-Relaxation Method)
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
char **argv;
{
double  x[m+2][n+2], eps, omega, err1, temp, clock;
int     i, j, k, loop, isec1, nsec1, isec2, nsec2;
FILE    *fp;
wtime(&isec1, &nsec1);
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
for (loop=1; loop<36000; loop++) {
err1 = 0.0;
for (i=1; i<=m; i+=2) {
for (j=1; j<=n; j++) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
for (i=2; i<=m; i+=2) {
for (j=1; j<=n; j++) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
if(err1 <= eps) break;
}
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
wtime(&isec2, &nsec2);
clock=(double) (isec2-isec1) + (double) (nsec2-nsec1)/1.0e9;
printf( " clock time=%f\n", clock);
return 0;
}
#include <sys/time.h>
int wtime(int *isec, int *nsec)
{
struct timestruc_t tb;
int iret;
iret=gettimer(TIMEOFDAY, &tb);
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
sorzebra executed on one CPU of the IBM SP2 SMP converged after 10409 iterations and took 10.51 seconds:
loop,err1 = 10409 9.99896e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
clock time=10.511644
To parallelize the zebra SOR, the rows are again distributed along the 1st dimension, but in complete white/black row pairs, so that every CPU's first row is an odd (white) row (Figure 8.5). The row pairs are distributed with startend and then converted back to row indices:
for (i = 0; i < nproc; i++) {
startend(i, nproc, 1, (m+1)/2, &istart, &iend);
istartg[i] = istart*2-1;
iendg[i]   = min (m, iend*2);
}
istart=istartg[myid];
iend  =iendg[myid];
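The effect of this pairing can be checked without MPI. The small stand-alone program below (an illustration, not part of sor_zebrap) applies the same arithmetic for m = 128 and four processes and prints the ranges 1..32, 33..64, 65..96 and 97..128 that also appear in the run-time output further down.
/*
   pairpart.c -- row ranges produced by distributing row PAIRS with startend
   (stand-alone illustration)
*/
#include <stdio.h>
#define m 128
int min(int i1, int i2)
{
   if (i1 < i2) return i1;
   else return i2;
}
void startend(int myid, int nproc, int is1, int is2, int *istart, int *iend)
{
   int ilength, iblock, ir;
   ilength = is2-is1+1;
   iblock  = ilength/nproc;
   ir      = ilength-iblock*nproc;
   if (myid < ir) { *istart = is1+myid*(iblock+1); *iend = *istart+iblock;   }
   else           { *istart = is1+myid*iblock+ir;  *iend = *istart+iblock-1; }
   if (ilength < 1) { *istart = 1; *iend = 0; }
}
int main(void)
{
   int nproc = 4, i, istart, iend, istartg[32], iendg[32];
   for (i = 0; i < nproc; i++) {
      startend(i, nproc, 1, (m+1)/2, &istart, &iend);   /* distribute the row pairs  */
      istartg[i] = istart*2-1;                          /* convert back to row index */
      iendg[i]   = min(m, iend*2);
      printf("process %d: rows %d..%d\n", i, istartg[i], iendg[i]);
   }
   return 0;
}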

Figure 8.5: the zebra grid partitioned along the 1st dimension among P0, P1 and P2; each CPU owns whole white/black row pairs.
When a CPU updates its first (white) row istart it needs row istart-1, which belongs to the previous CPU, so that row is obtained with MPI_Sendrecv before the white half-sweep; likewise, before the black half-sweep row iend+1 is obtained from the next CPU. The parallel program sor_zebrap follows.
/*
   program zebrap -- Parallel on 1st Dimension of zebra SOR
   (Successive Over-Relaxation Method)
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define m 128
#define n 128
double      x[m+2][n+2];
int         nproc, myid, istart, iend, count, count1, istart1,
            istartm1, iendp1, l_nbr, r_nbr, lastp;
int         istartg[32], iendg[32];
MPI_Status  istat[8];
MPI_Comm    comm;
main ( argc, argv)
int argc;
char **argv;
{
double  eps, omega, err1, gerr1, temp, clock;
int     i, j, k, loop, itag;
FILE    *fp;
MPI_Init (&argc, &argv);


MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
MPI_Barrier (MPI_COMM_WORLD);
clock=MPI_Wtime();
comm=MPI_COMM_WORLD;
for (i = 0; i < nproc; i++) {
startend(i, nproc, 1, (m+1)/2, &istart, &iend);
istartg[i] = istart*2-1;
iendg[i] = min (m, iend*2);
}
istart=istartg[myid];
iend =iendg[myid];
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
lastp =nproc-1;
istartm1=istart-1;
iendp1=iend+1;
l_nbr = myid-1;
r_nbr = myid+1;
if(myid == 0)     l_nbr = MPI_PROC_NULL;
if(myid == lastp) r_nbr = MPI_PROC_NULL;


if( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
}
count=(m+2)*(n+2);
MPI_Bcast((void *)&x, count, MPI_DOUBLE, 0, comm);
eps=1.0e-5;
omega=0.5;
for (loop=1; loop<36000; loop++) {
err1 = 0.0;
itag=10;
MPI_Sendrecv ((void *)&x[iend][0],
n+2, MPI_DOUBLE, r_nbr, itag,
(void *)&x[istartm1][0], n+2, MPI_DOUBLE, l_nbr, itag, comm, istat);
for (i=istart; i<=iend; i+=2) {
for (j=1; j<=n; j++) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
itag=20;
MPI_Sendrecv ((void *)&x[istart][0], n+2, MPI_DOUBLE, l_nbr, itag,
(void *)&x[iendp1][0], n+2, MPI_DOUBLE, r_nbr, itag, comm, istat);
for (i=istart+1; i<=iend; i+=2) {
for (j=1; j<=n; j++) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
MPI_Allreduce((void *)&err1,(void *)&gerr1,1,MPI_DOUBLE,MPI_MAX, comm);
err1 = gerr1;
if(err1 <= eps) break;
}
itag=30;
if (myid == 0) {
for (i=1; i<nproc; i++) {
istart1=istartg[i];
count1 =(iendg[i]-istart1+1)*(n+2);
MPI_Recv ((void *)&x[istart1][0], count1, MPI_DOUBLE, i, itag, comm, istat);
}
}
else {
count=(iend-istart+1)*(n+2);
MPI_Send ((void *)&x[istart][0], count, MPI_DOUBLE, 0, itag, comm);
}
if (myid == 0) {
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
clock=MPI_Wtime() - clock;
printf( " myid,clock time=%d %f\n", myid,clock);
MPI_Finalize();
return 0;
}
startend(int myid,int nproc,int is1,int is2,int* istart,int* iend)
{
int ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
if(myid < ir) {
*istart=is1+myid*(iblock+1);
*iend=*istart+iblock;
}
else {
*istart=is1+myid*iblock+ir;
*iend=*istart+iblock-1;
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
min(int i1, int i2)
{
if (i1 < i2) return i1;
else return i2;
}
SOR_ZEBRAP executed on four CPUs of the IBM SP2 SMP converged after 10409 iterations and took 8.39 seconds, a speed-up of 10.51/8.39 = 1.25:
ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4  0   1  32
NPROC,MYID,ISTART,IEND=4  1  33  64
NPROC,MYID,ISTART,IEND=4  2  65  96
NPROC,MYID,ISTART,IEND=4  3  97 128
loop,err1 = 10409 9.99896e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
myid,clock time=0 8.385459
myid,clock time=1 8.384307
myid,clock time=2 8.384279
myid,clock time=3 8.384790
8.4 Four-colour SOR
When the update of x[i][j] also uses the four diagonal neighbours (Figure 8.6), the inner part of the for loop becomes:
err1 = 0.0;
for (i=1; i<=m; i++) {
for (j=1; j<=n; j++) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] +
x[i-1][j-1]+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}

Figure 8.6: the nine-point stencil of this update; x[i][j] depends on x[i-1][j], x[i+1][j], x[i][j-1], x[i][j+1] and on the diagonal points x[i-1][j-1], x[i-1][j+1], x[i+1][j-1], x[i+1][j+1].
With this stencil two colours are no longer enough, because diagonally adjacent points would share a colour. The grid points are therefore split into four groups by the parity of i and the parity of j, and the groups are updated one after another; within a group the points are independent of each other.
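A quick stand-alone check (not part of the SOR programs, grid size invented) confirms that the four parity groups used below, (odd i, odd j), (odd i, even j), (even i, odd j) and (even i, even j), together visit every grid point exactly once per sweep:
/*
   fourcolour_check.c -- the four parity groups cover the grid exactly once
   (stand-alone illustration; the grid size is invented here)
*/
#include <stdio.h>
#define M 6
#define N 6
int main(void)
{
   int visit[M+1][N+1] = {0}, i, j, ok = 1;
   for (i=1; i<=M; i+=2) for (j=1; j<=N; j+=2) visit[i][j]++;   /* odd  i, odd  j */
   for (i=1; i<=M; i+=2) for (j=2; j<=N; j+=2) visit[i][j]++;   /* odd  i, even j */
   for (i=2; i<=M; i+=2) for (j=1; j<=N; j+=2) visit[i][j]++;   /* even i, odd  j */
   for (i=2; i<=M; i+=2) for (j=2; j<=N; j+=2) visit[i][j]++;   /* even i, even j */
   for (i=1; i<=M; i++)
      for (j=1; j<=N; j++)
         if (visit[i][j] != 1) ok = 0;
   printf("every interior point is updated exactly once per sweep: %s\n", ok ? "yes" : "no");
   return 0;
}
Because no point of a group is a stencil neighbour of another point of the same group (the nine-point stencil only reaches points one step away), each of the four partial sweeps can be distributed over the CPUs.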

/*
   program color_seq
   Sequential version of 4 colour Successive Over-Relaxation Method
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
char **argv;
{
double  x[m+2][n+2], eps, omega, err1, temp, clock;
int     i, j, k, loop, isec1, nsec1, isec2, nsec2;
FILE    *fp;
wtime(&isec1, &nsec1);
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
for (loop=1; loop<36000; loop++) {
err1 = 0.0;
for (i=1; i<=m; i+=2) { /* update circle */
for (j=1; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] +
x[i-1][j-1]+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
for (i=1; i<=m; i+=2) {   /* update triangle */
for (j=2; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] +
x[i-1][j-1]+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
for (i=2; i<=m; i+=2) {   /* update square */
for (j=1; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] +
x[i-1][j-1]+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
for (i=2; i<=m; i+=2) {   /* update <> */
for (j=2; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] +
x[i-1][j-1]+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
if(err1 <= eps) break;
}
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
wtime(&isec2, &nsec2);
clock=(double) (isec2-isec1) + (double) (nsec2-nsec1)/1.0e9;
printf( " clock time=%f\n", clock);
return 0;
}
#include <sys/time.h>
int wtime(int *isec, int *nsec)
{
struct timestruc_t tb;
int iret;
iret=gettimer(TIMEOFDAY, &tb);
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
COLOR_SOR executed on one CPU of the IBM SP2 SMP converged after 8157 iterations and took 5.87 seconds:
loop,err1 = 8157 9.99831e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
clock time=5.869461
To parallelize the four-colour SOR, the rows are again distributed in complete even/odd row pairs, exactly as for the zebra SOR:
for (i = 0; i < nproc; i++) {
startend(i, nproc, 1, (m+1)/2, &istart, &iend);
istartg[i] = istart*2-1;
iendg[i]   = min (m, iend*2);
}
istart=istartg[myid];
iend  =iendg[myid];
The figure below shows the partition of the four-colour grid x[i][j] along the 1st dimension among P0, P1 and P2.
Rows istart-1 and iend+1 again have to be obtained from the neighbouring CPUs before the rows that use them are updated.
/*
   program colorp
   Parallel on 1st dimension of 4 colour Successive Over-Relaxation Method
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
char **argv;
{
double      x[m+2][n+2], eps, omega, err1, gerr1, temp, clock;
int         i, j, ip, itag, loop;
int         nproc, myid, istart, iend, count, count1, istart1,
            istartm1, iendp1, l_nbr, r_nbr, lastp;
int         istartg[32], iendg[32];
MPI_Status  istat[8];
MPI_Comm    comm;
FILE        *fp;
MPI_Init (&argc, &argv);


MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
MPI_Barrier (MPI_COMM_WORLD);
clock=MPI_Wtime();
comm=MPI_COMM_WORLD;
for (i = 0; i < nproc; i++) {
startend(i, nproc, 1, (m+1)/2, &istart, &iend);
istartg[i] = istart*2-1;
iendg[i] = min (m, iend*2);
}
istart=istartg[myid];
iend =iendg[myid];
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
lastp =nproc-1;
istartm1=istart-1;
iendp1=iend+1;
l_nbr = myid-1;
r_nbr = myid+1;
if(myid == 0)
l_nbr = MPI_PROC_NULL;
if(myid == lastp) r_nbr = MPI_PROC_NULL;
if( myid==0) {
fp = fopen( "input.dat", "r");
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
count=(m+2)*(n+2);
MPI_Bcast((void *)&x, count, MPI_DOUBLE, 0, comm);
eps=1.0e-5;
omega=0.5;
for (loop=1; loop<=36000; loop++) {
err1 = 0.0;
itag=10;
MPI_Sendrecv ((void *)&x[iend][0],
n+2, MPI_DOUBLE, r_nbr, itag,
(void *)&x[istartm1][0], n+2, MPI_DOUBLE, l_nbr, itag, comm, istat);
/* for (i=1; i<=m; i+=2) {      update circle */
for (i=istart; i<=iend; i+=2) {
for (j=1; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] + x[i-1][j-1]+
             x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}

}
/* for (i=1; i<=m; i+=2) {      update square */
for (i=istart; i<=iend; i+=2) {
for (j=2; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] + x[i-1][j-1]+
             x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
itag=20;
MPI_Sendrecv ((void *)&x[istart][0],
n+2, MPI_DOUBLE, l_nbr, itag,
(void *)&x[iendp1][0], n+2, MPI_DOUBLE, r_nbr, itag, comm, istat);
/* for (i=2; i<=m; i+=2) {      update triangle */
for (i=istart+1; i<=iend; i+=2) {
for (j=1; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] + x[i-1][j-1]+
             x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
/* for (i=2; i<=m; i+=2) {      update <> */
for (i=istart+1; i<=iend; i+=2) {
for (j=2; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] + x[i-1][j-1]+
             x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
MPI_Allreduce((void *)&err1,(void *)&gerr1,1,MPI_DOUBLE,MPI_MAX, comm);
err1 = gerr1;
if(err1 <= eps) break;
}
itag=30;
if (myid == 0) {
for (i=1; i<nproc; i++) {
istart1=istartg[i];
count1 =(iendg[i]-istart1+1)*(n+2);
MPI_Recv ((void *)&x[istart1][0], count1, MPI_DOUBLE, i, itag, comm, istat);
}
}
else {
count=(iend-istart+1)*(n+2);
MPI_Send ((void *)&x[istart][0], count, MPI_DOUBLE, 0, itag, comm);
}
if (myid == 0) {
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
clock=MPI_Wtime() - clock;
printf( " myid,clock time=%d %f\n", myid,clock);
MPI_Finalize();
return 0;
}
startend(int myid,int nproc,int is1,int is2,int* istart,int* iend)
{
int ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
if(myid < ir) {
*istart=is1+myid*(iblock+1);
*iend=*istart+iblock;
}
else {
*istart=is1+myid*iblock+ir;
*iend=*istart+iblock-1;
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
min(int i1, int i2)
{
if (i1 < i2) return i1;
else return i2;
}
colorp executed on four CPUs of the IBM SP2 SMP converged after 8157 iterations and took 3.24 seconds, a speed-up of 5.87/3.24 = 1.81:
ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4  0   1  32
NPROC,MYID,ISTART,IEND=4  1  33  64
NPROC,MYID,ISTART,IEND=4  2  65  96
NPROC,MYID,ISTART,IEND=4  3  97 128
loop,err1 = 8157 9.99831e-06
x[i][n], i=1; i<=128; i+=8
(sampled values of x[i][n] omitted)
myid,clock time=0 3.240597
myid,clock time=1 3.230388
myid,clock time=2 3.239964
myid,clock time=3 3.240361

The finite element method (FEM)
This chapter turns to the finite element method. Unlike the finite difference method used in the previous chapters, the finite element mesh is unstructured, so the relation between elements and nodes has to be described explicitly with index arrays. The example program uses an explicit scheme, in which the element and nodal values are updated directly from each other.
9.1 The sequential program
/*
   program femseq -- sequential version of finite element explicit method
*/
#include <stdio.h>
#include <stdlib.h>
#define ne 18
#define nn 28
main ( argc, argv)
int argc;
char **argv;
{
double  ve[ne+1], vn[nn+1], clock;
int     index[ne+1][4], i, j, k, ie, in, loop;
int     isec1, nsec1, isec2, nsec2;
wtime(&isec1, &nsec1);
for (i=1; i<=ne; i++) {
scanf("%d %d %d %d\n",&index[i][0],&index[i][1],&index[i][2],&index[i][3]);
}
for (ie=1; ie<=ne; ie++)
ve[ie]=10.0*ie;
for (in=1; in<=nn; in++)
vn[in]=100.0*in;
for (loop=0; loop<10; loop++) {
for (ie=1; ie<=ne; ie++) {
for (j=0; j<4; j++) {
k= index[ie][j];
vn[k]= vn[k] + ve[ie];
}
}
for (in=1; in<=nn; in++)
vn[in] = vn[in] * 0.25;
for (ie=1; ie<=ne; ie++) {
for (j=0; j<4; j++) {
k= index[ie][j];
ve[ie] = ve[ie] + vn[k];
}
}
for (ie=1; ie<=ne; ie++)
ve[ie] = ve[ie] *0.25;
}
printf("result of vn\n");
for (i=1; i<=nn; i+=7)
printf(" %.3f %.3f %.3f %.3f %.3f %.3f %.3f\n",
vn[i],vn[i+1],vn[i+2],vn[i+3],vn[i+4],vn[i+5],vn[i+6]);
printf("result of ve\n");
for (i=1; i<=ne; i+=6)
printf(" %.3f %.3f %.3f %.3f %.3f %.3f\n",
ve[i],ve[i+1],ve[i+2],ve[i+3],ve[i+4],ve[i+5]);
wtime(&isec2, &nsec2);
clock=(double) (isec2-isec1) + (double) (nsec2-nsec1)/1.0e9;
printf( " clock time=%f\n", clock);
return 0;
}
#include <sys/time.h>
int wtime(int *isec, int *nsec)
{
struct timestruc_t tb;
int iret;
iret=gettimer(TIMEOFDAY, &tb);
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
Figure 9.1 shows the mesh used by this program: 18 elements and 28 nodes forming an unstructured grid, together with the node-to-element and element-to-node relations. The array ve holds one value per element and vn one value per node. Inside the time loop, the first pair of for loops adds the value of each element to its four nodes and then scales vn; the second pair adds the values of the four nodes of each element back to the element and then scales ve.
Figure 9.1: the mesh of 18 elements and 28 nodes, drawn once with the node numbers (node -> element) and once with the element numbers (element -> node).
The node numbers of each element are stored in the array index: index[i][j] is the j-th node (j = 1..4 in the figure, 0..3 in the C code) of element i. For the mesh of Figure 9.1 the table reads:
element :  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
node 1  :  1  2  3  5  6  7  9 10 11 13 14 15 17 18 19 21 22 23
node 2  :  2  3  4  6  7  8 10 11 12 14 15 16 18 19 20 22 23 24
node 3  :  6  7  8 10 11 12 14 15 16 18 19 20 22 23 24 26 27 28
node 4  :  5  6  7  9 10 11 13 14 15 17 18 19 21 22 23 25 26 27
The output of FEM_SEQ:
result of vn
303.506 737.138 743.620 309.989 905.479 2197.706 2214.970
922.743 1476.091 3579.268 3602.639 1499.462 1927.588 4670.236
4695.415 1952.767 2066.994 5005.284 5028.654 2090.365 1655.717
4008.155 4025.419 1672.981 642.193 1554.410 1560.892 648.676
result of ve
1281.497 1823.797 1298.016 2526.136 3591.987 2554.243
3617.916 5139.575 3651.343 4262.652 6051.210 4296.079
3991.965 5664.587 4020.072 2474.166 3510.129 2490.686
clock time=0.000746

9.2 The parallel program
Figure 9.2 shows how the 18 elements are distributed over three CPUs, six elements per CPU. A node that is shared by elements belonging to different CPUs is kept on the lower-ranked CPU as a primary node (that CPU is its primary processor); on the higher-ranked CPU the same node is a secondary node (secondary processor).
Figure 9.2: the distribution of the elements and nodes of Figure 9.1 over process 0 (elements 1-6), process 1 (elements 7-12) and process 2 (elements 13-18).
The elements are distributed with startend: ecntg[i] is the number of elements of CPU i and estartg[i], eendg[i] are its first and last element. Each CPU also owns a set of nodes: ncntg[i] is the number of nodes of CPU i and nodeg[i][j] lists their node numbers (Figure 9.3). The suffix g marks global bookkeeping arrays that every CPU keeps for all CPUs.
Figure 9.3: the bookkeeping arrays for the three processes. ecntg = 6, 6, 6; estartg = 1, 7, 13; eendg = 6, 12, 18; ncntg = 12, 8, 8; nodeg lists nodes 1-12 for P0, 13-20 for P1 and 21-28 for P2.
Because the mesh is irregular, some nodes lie on the boundary between two CPUs and are used by elements of both. For every CPU, scnt[i] is the number of its secondary nodes whose primary copy is on CPU i and snode[i][j] (j = 1..scnt[i]) lists their node numbers; pcnt[i] is the number of its primary nodes that are also used by CPU i and pnode[i][j] (j = 1..pcnt[i]) lists them. Figure 9.4 shows these arrays for the mesh of Figure 9.2: process 0 has pcnt[1] = 4 (nodes 9, 10, 11, 12) and no secondary nodes; process 1 has scnt[0] = 4 with snode 9, 10, 11, 12 and pcnt[2] = 4 with pnode 17, 18, 19, 20; process 2 has scnt[1] = 4 (nodes 17, 18, 19, 20) and no shared primary nodes. During the computation each CPU sends the contributions of its secondary nodes to the primary processor and receives the combined values back.
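A stripped-down version of this exchange, for one node shared by two ranks, is sketched below as an illustration only; the numerical values are invented and the scaling step of the real program is left out. The secondary rank sends the partial sum it accumulated for the shared node, the primary rank adds it to its own sum, and the combined value is returned so that both ranks continue with the same number.
/*
   shared_node_sketch.c -- accumulating a node shared by two ranks
   (illustration only; run with exactly 2 MPI tasks, values are invented)
*/
#include <stdio.h>
#include <mpi.h>
int main (int argc, char **argv)
{
   int    myid, nproc;
   double vn, part, sum;
   MPI_Status istat;
   MPI_Init (&argc, &argv);
   MPI_Comm_size (MPI_COMM_WORLD, &nproc);
   MPI_Comm_rank (MPI_COMM_WORLD, &myid);
   if (nproc != 2) {
      if (myid == 0) printf("please run with 2 tasks\n");
      MPI_Finalize ();
      return 0;
   }
   if (myid == 0) {                  /* primary copy of the shared node           */
      vn   = 100.0;                  /* old nodal value, kept only on the primary */
      part = 7.0;                    /* contribution of rank 0's own elements     */
      MPI_Recv (&sum, 1, MPI_DOUBLE, 1, 10, MPI_COMM_WORLD, &istat);
      vn = vn + part + sum;          /* combine both contributions                */
      MPI_Send (&vn, 1, MPI_DOUBLE, 1, 20, MPI_COMM_WORLD);
   }
   else {                            /* secondary copy: cleared, local sum only   */
      part = 5.0;                    /* contribution of rank 1's own elements     */
      MPI_Send (&part, 1, MPI_DOUBLE, 0, 10, MPI_COMM_WORLD);
      MPI_Recv (&vn, 1, MPI_DOUBLE, 0, 20, MPI_COMM_WORLD, &istat);
   }
   printf ("rank %d: value of the shared node = %.1f\n", myid, vn);
   MPI_Finalize ();
   return 0;
}
In femp the same pattern is carried out for whole lists of nodes at once: the values of the snode entries are packed into bufs and sent, the received bufr is added to the pnode entries, and a second send/receive pair returns the updated values.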

The parallel program femp below follows the structure of fem_seq:
/*
   program femp -- parallel version of finite element explicit method
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ne 18
#define nn 28
#define np 3
int         ecntg[np], estartg[np], eendg[np];
int         ncntg[np], nodeg[np][nn], itg[np][nn+1];
main ( argc, argv)
int argc;
char **argv;
{
double      ve[ne+1], vn[nn+1], bufs[np+1][nn], bufr[np+1][nn], clock;
int         index[ne+1][4], i, j, k, ie, in, ii, itime, itag, iflag;
int         nproc, myid, istart, iend, count, icount, iu, irank, is;
int         scnt [np], snode[np][nn];
int         pcnt [np], pnode[np][nn];
MPI_Status  istat[8];
MPI_Comm    comm;
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
MPI_Barrier (MPI_COMM_WORLD);
clock=MPI_Wtime();
comm=MPI_COMM_WORLD;
if (myid == 0) {
for (i=1; i<=ne; i++)
scanf("%d %d %d %d\n",&index[i][0],&index[i][1],&index[i][2],&index[i][3]);
}
icount=(ne+1)*4;
MPI_Bcast ((void *)&index, icount, MPI_INT, 0, comm);
/* clear counters, CPU and node association indicators */
for (irank = 0; irank < nproc; irank++) {
ncntg[irank]=0;
scnt[irank]=0;
pcnt[irank]=0;
for (j=0; j<nn; j++) {
itg [irank][j]=0;
snode[irank][j]=0;
pnode[irank][j]=0;
}
}
The array itg[irank][in] records whether node in is touched by any element of CPU irank. The for loop below runs over the elements assigned to each CPU and sets itg to 1 for the four nodes of every element (compare Figures 9.1 and 9.2). For this mesh itg becomes:
node :  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
P0   :  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
P1   :  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0
P2   :  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1

/* set node association indicator for associated nodes */
for (irank = 0; irank < nproc; irank++) {
startend(irank, nproc, 1,ne, &istart, &iend);
estartg[irank]=istart;
eendg[irank]=iend;
ecntg[irank]=iend-istart+1;
for (ie=istart; ie<=iend; ie++) {
for (j=0; j<4; j++) {
k=index[ie][j];
itg[irank][k]=1;
}
}
if (myid == 0 ) {
printf("itg values for irank= %d\n", irank);
for (j=1; j<=nn; j+=4) {
printf("%d %d %d %d\n",
itg[irank][j],itg[irank][j+1],itg[irank][j+2],itg[irank][j+3]);
}
}
}
istart=estartg[myid];
iend=eendg[myid];
count=ecntg[myid];
printf( "NPROC,MYID,ISTART,IEND=%d %d %d %d\n",nproc,myid,istart,iend);
A node that is used by only one CPU has a 1 in exactly one row of itg. Nodes 9, 10, 11 and 12 are shared by P0 and P1: they are primary nodes of P0 and secondary nodes of P1. Likewise nodes 17, 18, 19 and 20 are primary nodes of P1 and secondary nodes of P2. The following for loop finds these boundary nodes and records them: scnt and snode on the secondary side, pcnt and pnode on the primary side.
/* count and store boundary node code */
for (in=1; in<=nn; in++) {
   iflag=1;
   for (irank=0; irank<nproc; irank++) {
      if (itg[irank][in] == 1) {                  /* node[in] belongs to irank   */
         if (iflag == 1) {                        /* 1st time itg[irank][in]==1  */
            iflag=2;
            ii=irank;
         }
         else {                                   /* 2nd time itg[irank][in]==1  */
            itg[irank][in]=0;
            if (irank == myid) {
               scnt[ii]=scnt[ii]+1;               /* secondary node count        */
               snode[ii][scnt[ii]]=in; }          /* secondary node code         */
            else {
               if (ii == myid) {
                  pcnt[irank]=pcnt[irank]+1;      /* primary node count          */
                  pnode[irank][pcnt[irank]]=in; } /* primary node code           */
            }
         }
      }
   }
}
/* count and store all primary node code which belongs to each CPU */
for (irank=0; irank<nproc; irank++) {
for (in=1; in<=nn; in++) {
if (itg[irank][in] == 1) {
ncntg[irank]=ncntg[irank]+1;
nodeg[irank][ncntg[irank]]=in;
}
}
k=ncntg[irank];
if(myid == 0) {
   printf("nodeg values for irank,k= %d %d\n", irank,k);
   for (j=1; j<=k; j+=4)
      printf("%d %d %d %d\n",
         nodeg[irank][j],nodeg[irank][j+1],nodeg[irank][j+2],nodeg[irank][j+3]);
}
}
/* set initial values */
/* for (ie=1; ie<=ne; ie++) */
for (ie=istart; ie<=iend; ie++)
   ve[ie]=10.0*ie;
/* for (in=1; in<=nn; in++) */
for (ii=1; ii<=ncntg[myid]; ii++) {
   in=nodeg[myid][ii];
   vn[in]=100.0*in;
}

At the start of every pass of the itime loop each CPU clears its secondary copies of the shared nodes, so that they accumulate only the contributions of the local elements; these partial sums are later added to the primary copies.
for (itime=0; itime<10; itime++) {
   for (irank=0; irank<nproc; irank++)
      for (is=1; is<=scnt[irank]; is++)
         vn[ snode[irank][is] ]=0.0;
/* for (ie=1; ie<=ne; ie++) { */
for (ie=istart; ie<=iend; ie++) {
for (j=0; j<4; j++) {
k= index[ie][j];
vn[k]= vn[k] + ve[ie];
}
}
for (irank=0; irank<nproc; irank++)
for (is=1; is<=scnt[irank]; is++)
bufs[irank][is]=vn[ snode[irank][is] ];
itag=10;
for (irank=0; irank<nproc; irank++) {
if (scnt[irank] > 0)
MPI_Send((void *)&bufs[irank][1],scnt[irank], MPI_DOUBLE,
irank, itag, comm);
if (pcnt[irank] > 0 )
MPI_Recv ((void *)&bufr[irank][1],pcnt[irank],MPI_DOUBLE,
irank, itag, comm, istat);
}
for (irank=0; irank<nproc; irank++) {
for (i=1; i<=pcnt[irank]; i++) {
k=pnode[irank][i];
vn[k]=vn[k]+bufr[irank][i];
}
}

After the partial sums of the secondary nodes have been sent to their primary CPUs (the MPI_Send/MPI_Recv pairs above) and added there, every primary CPU holds the complete nodal sums. Next vn is scaled, the updated boundary values are sent back to the secondary CPUs, and then ve is updated from vn:
/* for (in=1; in<=nn; in++) */
for (ii=1; ii<=ncntg[myid]; ii++) {
in=nodeg[myid][ii];
vn[in] = vn[in] * 0.25;
}
for (irank=0; irank<nproc; irank++)
for (i=1; i<=pcnt[irank]; i++)
bufs[irank][i]=vn[ pnode[irank][i] ];
itag=20;
for (irank=0; irank<nproc; irank++) {
if (pcnt[irank] > 0)
MPI_Send ((void *)&bufs[irank][1],pcnt[irank],MPI_DOUBLE,
irank, itag, comm);
if (scnt[irank] > 0 )
MPI_Recv ((void *)&bufr[irank][1],scnt[irank],MPI_DOUBLE,
irank, itag, comm, istat);
}
for (irank=0; irank<nproc; irank++)
for (i=1; i<=scnt[irank]; i++)
vn[ snode[irank][i] ]=bufr[irank][i];
/* for (ie=1; ie<=ne; ie++) { */
for (ie=istart; ie<=iend; ie++) {
for (j=0; j<4; j++) {
k= index[ie][j];
ve[ie] = ve[ie] + vn[k];
}
}

/* for (ie=1; ie<=ne; ie++) */
for (ie=istart; ie<=iend; ie++)
ve[ie] = ve[ie] *0.25;
}

After the itime loop every CPU sends its nodal values and its element values to CPU 0, which assembles and prints the complete result:
MPI_Barrier(comm);
for (i=1; i<=ncntg[myid]; i++)
bufs[myid][i]=vn[ nodeg[myid][i] ];
itag=30;
if (myid == 0)
for (irank=1; irank<nproc; irank++)
MPI_Recv ((void *)&bufr[irank][1], ncntg[irank], MPI_DOUBLE,
irank, itag, comm, istat);
else
MPI_Send ((void *)&bufs[myid][1],ncntg[myid],MPI_DOUBLE, 0, itag, comm);


if (myid == 0)
for (irank=1; irank<nproc; irank++)
for (i=1; i<=ncntg[irank]; i++)
vn[ nodeg[irank][i] ]=bufr[irank][i];
itag=40;
if (myid == 0)
for (irank=1; irank<nproc; irank++)
MPI_Recv ((void *)&ve[ estartg[irank] ], ecntg[irank], MPI_DOUBLE,
irank, itag, comm, istat);
else
MPI_Send ((void *)&ve[istart],count,MPI_DOUBLE, 0, itag, comm);
MPI_Barrier(comm);
if (myid == 0) {
printf("result of vn\n");
for (i=1; i<=nn; i+=7)
printf(" %.3f %.3f %.3f %.3f %.3f %.3f %.3f\n",
vn[i],vn[i+1],vn[i+2],vn[i+3],vn[i+4],vn[i+5],vn[i+6]);
printf("result of ve\n");
for (i=1; i<=ne; i+=6)
printf(" %.3f %.3f %.3f %.3f %.3f %.3f\n",
ve[i],ve[i+1],ve[i+2],ve[i+3],ve[i+4],ve[i+5]);
}
clock=MPI_Wtime() - clock;
printf( " myid,clock time=%d %f\n", myid,clock);
MPI_Finalize();
return 0;
}
startend(int myid,int nproc,int is1,int is2,int* istart,int* iend)
{
int ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
if(myid < ir) {
*istart=is1+myid*(iblock+1);
*iend=*istart+iblock;
}
else {
*istart=is1+myid*iblock+ir;
*iend=*istart+iblock-1;
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
femp executed on three CPUs of the IBM SP2 SMP gives the same vn and ve as fem_seq:
ATTENTION: 0031-408 3 tasks allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=3  0   1   6
NPROC,MYID,ISTART,IEND=3  1   7  12
NPROC,MYID,ISTART,IEND=3  2  13  18
(itg and nodeg diagnostic output omitted)
result of vn
303.506 737.138 743.620 309.989 905.479 2197.706 2214.970
922.743 1476.091 3579.268 3602.639 1499.462 1927.588 4670.236
4695.415 1952.767 2066.994 5005.284 5028.654 2090.365 1655.717
4008.155 4025.419 1672.981 642.193 1554.410 1560.892 648.676
result of ve
1281.497 1823.797 1298.016 2526.136 3591.987 2554.243
3617.916 5139.575 3651.343 4262.652 6051.210 4296.079
3991.965 5664.587 4020.072 2474.166 3510.129 2490.686
myid,clock time=0 0.003736
myid,clock time=1 0.003357
myid,clock time=2 0.003364

1. Tutorial on MPI : The Message-Passing Interface
By William Gropp, Mathematics and Computer Science Division, Argonne National Laboratory,
gropp@mcs.anl.gov
2. MPI in Practice
by William Gropp, Mathematics and Computer Science Division, Argonne National Laboratory,
gropp@mcs.anl.gov
3. A User's Guide to MPI
by Peter S. Pacheco, Department of Mathematics, University of San Francisco, peter@usfca.edu
4. Parallel Programming Using MPI
by J.M.Chuang, Department of Mechanical Engineering, Dalhousie University, Canada
chuangjm@newton.ccs.tuns.ca
5. RS/6000 SP : Practical MPI Programming
IBM International Technical Support Organization,

http://www.redbooks.ibm.com

Parallel Processing of 1-D Arrays without Partition
Every process declares the full arrays float a[200], b[200], c[200], d[200]; P0 works on i = 1..50, P1 on i = 51..100, P2 on i = 101..150 and P3 on i = 151..200. The sequential loop
for (i=0; i<200; i++)
   a[i]=b[i] + c[i]*d[i];
becomes
for (i=istart; i<=iend; i++)
   a[i]=b[i] + c[i]*d[i];

Parallel Processing of 1-D Arrays with Partition
Every process declares only its own part, float a[50], b[50], c[50], d[50]. The global index ranges (1..50), (51..100), (101..150), (151..200) of P0..P3 all map onto the local index range 1..50. The sequential loop
for (i=0; i<200; i++)
   a[i]=b[i] + c[i]*d[i];
again becomes
for (i=istart; i<=iend; i++)
   a[i]=b[i] + c[i]*d[i];
but now istart and iend are local indices.
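The only extra bookkeeping in the partitioned form is the mapping between global and local indices. The stand-alone snippet below illustrates it for the numbers in the slide (200 elements, 4 processes, 50 local elements each); the variable names are invented for the illustration.
/*
   localidx.c -- global/local index mapping for a block-partitioned 1-D array
   (stand-alone illustration of the slide above)
*/
#include <stdio.h>
#define NGLOBAL 200
#define NPROC   4
#define NLOCAL  (NGLOBAL/NPROC)              /* 50 elements per process */
int main(void)
{
   int myid, iglobal, ilocal, istart, iend;
   for (myid = 0; myid < NPROC; myid++) {
      istart  = myid*NLOCAL + 1;             /* first global index owned (1-based) */
      iend    = istart + NLOCAL - 1;         /* last  global index owned           */
      iglobal = istart + 9;                  /* e.g. the 10th element of the block */
      ilocal  = iglobal - istart + 1;        /* global -> local: subtract istart-1 */
      printf("P%d owns global %3d..%3d; global index %3d is local index %2d\n",
             myid, istart, iend, iglobal, ilocal);
   }
   return 0;
}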

Parallel on the 1st Dimension of 2-D Arrays without Partition
Every process declares the full arrays x[200][8], y[200][8], z[200][8]; P0 handles rows i = 1..50, P1 rows 51..100, P2 rows 101..150 and P3 rows 151..200, each for all columns j = 1..8.

Parallel on the 1st Dimension of 2-D Arrays with Partition
The sequential version declares x[200][8], y[200][8], z[200][8]. In the partitioned version every process declares only x[50][8], y[50][8], z[50][8]; the global row ranges (1..50), (51..100), (101..150), (151..200) of P0..P3 map onto the local rows i = 1..50.

Partition on the 1st dimension of 3-D Arrays
The sequential version declares x[200][24][8], y[200][24][8], z[200][24][8]. After partitioning on the first dimension every process declares x[50][24][8], y[50][24][8], z[50][24][8]; the global ranges i = (1..50), (51..100), (101..150), (151..200) map onto the local range i = 1..50, with j = 1..24 and k = 1..8 unchanged.

The loop that writes into a second array y
for (i=0; i<m; i++)
   for (j=0; j<n; j++)
      y[i][j]=0.25*( x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1] ) + h*f[i][j];
has no dependence between its iterations, but the in-place loop
for (i=0; i<m; i++)
   for (j=0; j<n; j++)
      x[i][j]=0.25*( x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1] ) + h*f[i][j];
is DATA RECURSIVE: x[i-1][j] and x[i][j-1] have already been overwritten in the same sweep. If the scheme is IMPLICIT and CONVERGES within each SUBDOMAIN, it is enough to exchange the boundary data after each iteration.
