Mpic 2002 Ebook

C MPI
: 91 1 1
: (03) 5776085 x 305
E-mail : c00tch00@nchc.gov.tw
C MPI ........................................................................................................1
.............................................................................................................................4
1.1 MPI ...................................................................................................5
1.2 ........................................................................6
1.3 IBM SP2 MPI ...................................................................................7
1.3.1 IBM SP2 MPI C ..............................................................7
1.3.2 IBM SP2 Job command file......................................................................7
1.3.3 IBM SP2 ..............................................................9
1.4 PC Cluster MPI...............................................................................11
1.4.1 PC Cluster C MPI ...............................................11

1.4.2 PC Cluster Job command file .............................................................12
1.4.3 PC Cluster ............................................................13
...................................................................................14
2.1
MPI .........................................................................................................15
2.1.1 mpi.h include file ..............................................................................................15
2.1.2 MPI_Init, MPI_Finalize....................................................................................15
2.1.3 MPI_Comm_size, MPI_Comm_rank ...............................................................16
2.1.4 MPI_Send, MPI_Recv ......................................................................................17

2.2 T2SEQ ....................................................................20
2.3 T2CP ...............................................................................22
2.4
MPI_ScatterMPI_GatherMPI_Reduce .............................................................27
2.4.1 MPI_ScatterMPI_Gather ..............................................................................27
2.4.2 MPI_Reduce, MPI_Allreduce ..........................................................................29

2.5 T2DCP ................................................................................31
...............................................................................35
3.1
MPI_Sendrecv, MPI_Bcast.......................................................................................36
3.1.1 MPI_ Sendrecv ..............................................................................................36
3.1.2 MPI_Bcast .....................................................................................................36
3.2 T3SEQ ........................................................................38
3.3 T3CP .......................................................40
3.4 () T3DCP_1 ............................................47
3.5 () T3DCP_2 ............................................52
...................................................................................57
4.1 T4SEQ ....................................................................58
4.2. MPI_ScattervMPI_Gatherv .................................................................................60
4.3 MPI_PackMPI_UnpackMPI_ BarrierMPI_ Wtime......................................62
4.3.1 MPI_PackMPI_Unpack .............................................................................62
4.3.2
MPI_BarrierMPI_Wtime...........................................................................64
2
4.4
T4DCP ................................................................................66
5.1
5.2
5.3
...............................................................................................72
T5SEQ ..................................................................................73
T5CP .................................................................77
T5DCP ......................................................85
5.4 MPI .....................................................................................91

5.4.1 (Cartesian Topology) .....................................................91
5.4.2 MPI MPI_Cart_create..........................................92
MPI_Cart_coordsMPI_Cart_shift..........................................................................92
5.4.3 MPI ..................................................................95
MPI_Type_vectorMPI_Type_commit ...................................................................95
5.5 T5_2D ...............................................................97
MPI .............................................................................................110
6.1
Nonblocking ........................................................................................111
6.2
..................................................................................................120
6.3 ......................................................................124
6.4
.........................................................................................126
6.4.1 .......................................................................................126
6.4.2 .......................................................................132
.....................................................................................................134
7.1
.............................................................................................135
7.2
.....................................................................................................140
7.3
.........................................................................................150
SOR ............................................................................................158
8.1
SOR ....................................................................................159
8.2
SOR ..................................................................................164
8.3
SOR ..........................................................................................173
8.4
SOR ................................................................181
.....................................................................................................191
9.1
.................................................................................192
9.2
.................................................................................196
.........................................................................................................................207
Parallel Processing of 1-D Arrays without Partition........................................................208
Parallel Processing of 1-D Arrays with Partition.............................................................209
Parallel on the 1st Dimension of 2-D Arrays without Partition.......................................210
Parallel on the 1st Dimension of 2-D Arrays with Partition............................................211
Partition on the 1st dimension of 3-D Arrays ..................................................................212

MPI
MPI
MPI
IBM SP2 MPI
PC cluster MPI
1.1
MPI
MPI (Message Passing Interface) Message Passing

FortranCC++ MPI
cluster MPI
MPI1.2 MPI 1998 MPI
2.0 MPI 2.0 Argonne National Lab
MPICH 1.2 MPI 2.0
http://www-unix.mcs.anl.gov/mpi/mpich
anonymous ftp
ftp.mcs.anl.gov
(directory) pub/mpi mpich-1.2.1.tar.Z
MPI
mpich-1.2.1.tar.gz
1.2
IBM SP2IBM SP2 SMPHP SPP2000SGI Origin2000 Fujitsu

VPP300 MPI PC cluster MPICH
PC clusterIBM SP2 IBM SP2 SMP
CPU
CPU IBM SP2 CPU CPU
CPU
( HP SPP2000) CPU CPU
CPU CPU CPU
(time sharing) CPU
HP SPP2000 SGI ORIGIN2000 16 CPU
SP2 VPP300 CPU
SP2 SMP node 4 CPU
42 node SMP lusterSP2 SP2 SMP
(job scheduler) LoadLeveler (batch job)
LoadLeveler job command file llsubmit SP2
SPP2000ORIGIN2000 VPP300 NQS (Network Queue System)
NQS job command file qsub
PC cluster DQS (Distributed Queue System)
NQS
1.3
IBM SP2 MPI
C shell home directory .cshrc

include file (mpif.hmpif90.hmpi.h) (mpxlfmpxlf90mpccmpCC)MPI library
LoadLeveler (llsubmitllqllstatusllcancel)
set lpath=(. ~ /usr/lpp/ppe.poe/include /usr/lpp/ppe.poe/lib)
set lpath=($lpath /usr/lpp/ppe.poe/bin /home/loadl/bin )
set path=($path $lpath)
.cshrc source .cshrc
(logout) (login) source .cshrc
1.3.1 IBM SP2 MPI C

MPI C (compiler) mpicc IBM
SMP mpccmpcc :
SP2 SP2
mpcc -O3 -qarch=auto -qstrict -o file.x file.f

-O3
(level 3 Optimization)
-qarch=auto
-qstrict
-o file.x
file.x
(default) a.out
1.3.2 IBM SP2 Job command file

IBM SP2(ivy) LoadLeveler job command file
job command file jobp4 CPU file.x
#!/bin/csh
#@ executable = /usr/bin/poe
#@ arguments = /your_working_directory/file.x
#@ output
= outp4
#@ error
= outp4
#@ job_type = parallel
euilib us
#@ class
= medium
#@ min_processors = 4
#@ max_processors = 4
#@ requirements = (Adapter == "hps_user")
#@ wall_clock_limit = 20
#@ queue
executable
arguments
output
error
class
= /usr/bin/poe poe Parallel Operating Environment

=
= (stdout)
= (error message)
= SP2 CPU llclass :
short
(CPU 12 10 120MHz CPU)
medium
long
min_processors = CPU
max_processors = CPU
requirements = (Adapter == "hps_user")
wall_clock_limit =
job
queue
CPU short class 4 CPUmedium class 32 CPU

long class 8 CPU MPI 1.2 CPU CPU CPU
min_processors max_processors
wall_clock_limit
IBM SP2 SMP (ivory) LoadLeveler job command

file job command file jobp4 CPU file.x
#!/bin/csh
#@ network.mpi= css0,shared,us
#@ executable = /usr/bin/poe
#@ arguments = /your_working_directory/file.x
#@ output
= outp4
#@ error
= outp4
euilib us
#@ job_type = parallel
#@ class
= medium
#@ tasks_per_node = 4
#@ node = 1
#@ wall_clock_limit = 20
#@ queue
IBM SP2 SMP Node 375MHz CPU 4GB 8GB
class
= SP2 SMP CPU llclass :
short
(CPU 12 3 Node 6 CPU)
tasks_per_node=4

class Node 8GB
Node CPU
node=1
Node CPU
medium
bigmem
CPU medium class 16 Node 64 CPU class
1.3.3 IBM SP2

IBM SP2 SP2 SMP LoadLeveler job command file
llsubmit job command file
job command file jobp4 :
llsubmit
jobp4
llq llq
grep class user id jobp4 medium
:
llq | grep medium
9
llq :
job_id
----------ivy1.1781.0
ivy1.1814.0
user_id
-----------u43ycc00
u50pao00
job_id
user_id
submitted
status
submitted
-------------8/13 11:24
8/13 20:12
status priority
------ -------R
50
R
50
class
---------medium
short
running on
------------ivy39
ivy35
LoadLeveler
login name
/ :
R Running
I Idle (=waiting in queue)
Priority
ST Start execution
NQ Not Queued
Class
Running on
CPU
CPU
llcancel
llcancel
job_id
job_id llq llcancel

llq
10
PC Cluster MPI
1.4
MPICH C shell home directory .cshrc

include file (mpif.hmpi.h) (mpif77mpiccmpiCC)MPI library
DQS PC Cluster
:
setenv PGI /usr/local/pgi
set path = ( . ~ /usr/local/pgi/linux86/bin $path)
set path = ( /home/package/DQS/bin $path)
set path = ( /home/package/mpich/bin $path)
PGI Portland Group Inc. PGI CC++
pgccpgCC DQS
MPICH PGI
1.4.1
PC Cluster C MPI
MPICH C mpicc GNU gcc

gcc :
mpicc -O3 -o file.x file.f
-O3
gcc
-o file.x file.x
a.out
file.c
C
PGI MPI mpicc pgcc
pgcc makefile :
11
OBJ
EXE
= file.o
= file.x
MPI
= /home/package/mpich_PGI
LIB
= $(MPI)/lib/libmpich.a
MPICC = $(MPI)/bin/mpicc
OPT = -O2 -I$(MPI)/include
$(EXE) : $(OBJ)
$(MPICC) $(LFLAG) -o $(EXE) $(OBJ) $(LIB)
.f.o :
$(MPICC) $(OPT) -c $<
makefile make
1.4.2
PC Cluster Job command file
PC cluster DQS
DQS job command file job command file jobp4
CPU hubksp :
#!/bin/csh
#$ -l qty.eq.4,HPCS00
#$ -N HUP4
#$ -A user_id
#$ -cwd
#$ -j y
cat $HOSTS_FILE > MPI_HOST
mpirun -np 4 -machinefile MPI_HOST hubksp >& outp4
#!/bin/csh
C shell script
#$ -l qty.eq.4,HPCS DQS CPUqty (quantity)
HPCS CPU cluster queue class
#$ -N HUP4
(Name) HUP4
#$ -A user_id
(Account)
#$ -cwd
(working directory)
home directory
#$ -j y
$HOST_FILE
-np 4 hubksp
>& outp4
DQS node list

mpirun CPU hubksp
outp4
12
1.4.3
PC Cluster
PC cluster DQS job command file

qsub job command file PC cluster job command
file jobp4 :
qsub
jobp4
qstat cluster
qstat -f cluster node qsub jobp4 qstat
:
c00tch00 HUP4
c00tch00 HUP4
hpcs01
hpcs02
62
62
0:1
0:1
r
r
RUNNING
RUNNING
02/26/99 10:51:23
02/26/99 10:51:23
c00tch00 HUP4
c00tch00 HUP4
hpcs03
hpcs04
62
62
0:1
0:1
r
r
RUNNING
RUNNING
02/26/99 10:51:23
02/26/99 10:51:23
----Pending Jobs -----------------------------------------------------------------------------------------c00tch00 RAD5

70
0:2
QUEUED
02/26/99 19:24:32
user_id CPU DQS
job_id62
0:1 0 0:1 1
r RUNNING
// :: Pending Jobs
RUNNING QUEUED
qdel
qdel
job_id
job_id qstat qdel qstat
13
(sequential program) MPI

2.1 MPI MPI_InitMPI_FinalizeMPI_Comm_size
MPI_Comm_rankMPI_SendMPI_Recv
2.2 T2SEQ
2.3 MPI T2SEQ T2CP
2.4 MPI MPI_ScatterMPI_GatherMPI_Reduce
MPI_Allreduce
2.5 T2SEQ T2DCP
14
MPI
2.1
MPI
MPI_Init, MPI_Finalize,
MPI_Comm_size, MPI_Comm_rank,
MPI_Send, MPI_Recv
2.1.1 mpi.h include file

MPI C include <mpi.h>
(statement)mpi.h MPI MPI MPI (constant)
:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
main ( argc, argv)
int argc;
char **argv;
{
...
...
MPI_Finalize();
return 0;
}
startend(int myid, int nproc, int is1, int is2, int* istart, int* iend)
{
...
return 0;
}
MPI mpi.h MPI
MPI
2.1.2 MPI_Init, MPI_Finalize

MPI MPI_Init CPU
15
MPI_Finalize MPI_Init
MPI_Finalize :
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
main ( argc, argv)
int argc;
char **argv;
{
MPI_Init(&argc, &argv);
...
MPI_Finalize();
return 0;
}
2.1.3 MPI_Comm_size, MPI_Comm_rank

MPI_Init MPI_Comm_size CPU
(nproc) MPI_Comm_rank CPU (myid) CPU 0
CPU myid CPU myid 1 CPU myid
2 CPU
CPU
CPU job command file min_processors
max_processors -np
MPI_Comm_size MPI_Comm_rank :
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myid);
MPI_COMM_WORLD MPI (default) communicator
CPU communicator communicator CPU
MPI 1.2 CPU CPU
MPI
:
16
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int
nproc, myid;
main ( argc, argv)
int argc;
char **argv;
{
MPI_Init(&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD,
MPI_Comm_rank (MPI_COMM_WORLD,
...
...
MPI_Finalize();
&nproc);
&myid);
return 0;
}
2.1.4 MPI_Send, MPI_Recv

CPU '' (point to
point communication) '' (collective communication)
'' MPI_Send MPI_Recv, '' ''
CPU CPU '' CPU

MPI_Send CPU MPI_Recv MPI_Send
MPI_Recv MPI_Send
:
MPI_Send ((void *)&data, icount, DATA_TYPE, idest, itag, MPI_COMM_WORLD);
data
icount
DATA_TYPE
idest
itag
(scalar) (array)
icount data
MPI 1.1
CPU id
17
MPI data type

MPI_CHAR
MPI_SHORT
MPI_INT
MPI_LONG
MPI_UNSIGNED_CHAR
MPI_UNSIGNED_SHORT
MPI_UNSIGNED
MPI_UNSIGNED_LONG
MPI_FLOAT
MPI_DOUBLE
MPI_LONG_DOUBLE
MPI_PACKED
1.1
C data type
signed char
signed short iny
signed int
signed long int
unsigned char
unsigned short int
unsigned int
unsigned long int
float
double
long double
description
1-byte character
2-byte integer
4-byte integer
4-byte integer
1-byte unsigned character
2-byte unsigned integer
4-byte floating point
C MPI
MPI_Recv :
MPI_Recv ((void *)&data, icount, DATA_TYPE, isrc, itag, MPI_COMM_WORLD, istat);
data
icount
DATA_TYPE
isrc
itag
istat
CPU id
MPI_Recv
istat mpi.h MPI_STATUS_SIZE
MPI_Status
istat[MPI_STATUS_SIZE];
mpi.h MPI_STATUS_SIZE
MPI_Status
istat[8];
CPU CPU
MPI_Recv( (void *)&buff, icount, DATA_TYPE, MPI_ANY_SOURCE, itag,

18
MPI_COMM_WORLD, istat);
CPU id STATUS
isrc= istat( MPI_SOURCE );
MPI (MPI_SendMPI_Recv) '' (envelope)
(message)
1. CPU id
2. CPU id
3.
4.
communicator
CPU CPU
19
T2SEQ
2.2
T2SEQ test data generation bcd
for loop loop

for loop for loop
/*
PROGRAM T2SEQ
sequential version of 1-dimensional array operation
#include <stdio.h>
#include <stdlib.h>
#define n
*/
200
main ()
{
double suma, a[n], b[n], c[n], d[n];
int
i, j;
FILE *fp;
/*
test data generation and write out to file
'input.dat'
*/
compute and write out the result
*/
for (i = 0; i < n; i++) {

j=i+1;
b[i] = 3. / (double) j + 1.0;
c[i] = 2. / (double) j + 1.0;
d[i] = 1. / (double) j + 1.0;
}
fp = fopen( "input.dat", "w");
fwrite( (void *)&b, sizeof(b), 1, fp );
fwrite( (void *)&c, sizeof(c), 1, fp );
fwrite( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
/*
read
'input.dat',
20
fp = fopen( "input.dat", "r");

fread( (void *)&b, sizeof(b), 1, fp );
fread( (void *)&c, sizeof(c), 1, fp );
fread( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
suma = 0.;
for (i = 0; i < n; i++) {
a[i] = b[i] + c[i] * d[i];
suma += a[i];
}
for (i = 0; i < n; i+=40) {
printf( "%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n",
a[i],a[i+5],a[i+10],a[i+15],a[i+20],a[i+25],a[i+30],a[i+35]);
}
printf( "sum of A=%f\n",suma);
return 0;
}
T2SEQ :
10.000
2.118
2.060
3.056
2.108
2.057
2.562
2.099
2.054
2.383
2.091
2.052
2.290
2.085
2.050
2.234
2.079
2.048
2.196
2.074
2.046
2.168
2.070
2.044
2.148
2.066
2.043
2.131
2.063
2.041
2.040
2.039
2.037
sum of A=438.548079
2.036
2.035
2.034
2.033
2.032
2.031
2.031
21
2.3
T2CP
(decomposition/partition)
(sequential version)
T2SEQ ? '' 2.5

''
T2SEQ CPU abcd
CPU
startend CPU index CPU0
CPU1 CPU2 2.1 :
computing partition without data partition
cpu0
istart
iend
ntotal
|
|
|

cpu1
istart
iend
ntotal
|
|
|

cpu2
istart
iend
ntotal
|
|
|

cpu3
istart
iend
|
|

array element inside territory
array element outside territory
2.1
2.1 CPU
CPU istart iend
MPI 1.2 Parallel I/O CPU0myid
for loop MPI_Send CPU CPU
22
myid CPU0 CPU

itag MPI_Send MPI_Recv
CPU a CPU0CPU0
for loop CPU CPU0 a
suma a suma
/*
PROGRAM T2CP
computation partition without data partition of 1-dimensional arrays
#include <stdio.h>
#include <stdlib.h>
*/
#include <mpi.h>
#define n 200
main ( argc, argv)
int argc;
char **argv;
{
double
int
FILE
int
suma, a[n], b[n], c[n], d[n];

i, j, k;
*fp;
nproc, myid, istart, iend, icount;
int
itag, isrc, idest, istart1, icount1;
int
gstart[16], gend[16], gcount[16];
MPI_Status
istat[8];
MPI_Comm comm;
MPI_Init (&argc, &argv);
startend( nproc, 0, n - 1, gstart, gend, gcount);
istart=gstart[myid];
iend=gend[myid];
comm=MPI_COMM_WORLD;
printf( "NPROC,MYID,ISTART,IEND=%d\t%d\t%d\t%d\n",nproc,myid,istart,iend);
/*
READ 'input.dat', COMPUTE AND WRITE OUT THE RESULT */

if ( myid==0) {
23

fclose( fp );
for (idest = 1; idest < nproc; idest++) {
istart1=gstart[idest];
icount1=gcount[idest];
itag=10;
MPI_Send ((void *)&b[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
itag=20;
MPI_Send ((void *)&c[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
itag=30;
MPI_Send ((void *)&d[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
}
}
else {
icount=gcount[myid];
isrc=0;
itag=10;
MPI_Recv ((void *)&b[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
itag=20;
MPI_Recv ((void *)&c[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
itag=30;
MPI_Recv ((void *)&d[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
}
/*
compute, collect computed result and write out the result
*/
for (i = istart; i <= iend; i++) {
a[i] = b[i] + c[i] * d[i];
}
itag=110;
if (myid > 0) {
idest=0;
MPI_Send((void *)&a[istart], icount, MPI_DOUBLE, idest, itag, comm);
}
24
else {
for ( isrc=1; isrc < nproc; isrc++ ) {
icount1=gcount[isrc];
istart1=gstart[isrc];
MPI_Recv((void *)&a[istart1], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
}
}
if (myid == 0) {
for (i = 0; i < n; i+=40) {
}
suma=0.0;
for (i = 0; i < n; i++)
suma+=a[i];
printf( "sum of A=%f\n",suma);
}
MPI_Finalize();
return 0;
}
startend(,int nproc,int is1,int is2,int gstart[16],int gend[16], int gcount[16])
{
int
ilength, iblock, ir;
ilength=is2-is1+1;
iblock=ilength/nproc;
ir=ilength-iblock*nproc;
for ( i=0; i < nproc; i++ ) {
if(i < ir) {
gstart[i]=is1+i*(iblock+1);
gend[i]=gstart[i]+iblock;
}
else {
gstart[i]=is1+i*iblock+ir;
gend[i]=gstart[i]+iblock-1;
}
if(ilength < 1) {
gstart[i]=1;
gend[i]=0;
25
}
gcount[i]=gend[i]-gstart[i] + 1;
}
}
T2CP :
ATTENTION: 0031-408 4 nodes allocated by LoadLeveler, continuing...
NPROC,MYID,ISTART,IEND=4
1
50
99
10.000 3.056
2.562
2.383
2.290
0
49
150
199
100
149
2.234
2.196
2.168
2.148
2.074
2.050
2.131
2.070
2.048
0
3
2
2.118
2.066
2.046
2.108
2.063
2.044
2.099
2.060
2.043
2.091
2.057
2.041
2.085
2.054
2.040
2.079
2.052
2.039
2.037
2.036
2.035
sum of A=438.548079
2.034
2.033
2.032
2.031
2.031
SPMD (Single Program Multiple Data) CPU

CPU rank (myid)
(if statement) index for loop
index rank CPU T2CP
CPU :
startend( nproc, 0, n - 1, gstart, gend, gcount);
iend=gend[myid];
CPU myid istart iend
SPMD CPU
CPU0
CPU master CPU slave CPU
26
2.4
MPI_ScatterMPI_GatherMPI_Reduce
MPI_ScatterMPI_GatherMPI_AllgatherMPI_ReduceMPI_Allreduce '
' communicator CPU
CPU
2.4.1 MPI_Scatter
MPI_Gather
MPI_Scatter iroot CPU t nproc (nproc= CPU )
n CPU id CPU ( iroot CPU )
CPU0 CPU1 CPU2 2.2 :
CPU0
t 1t 2t 3t 4
CPU0
t1
CPU1
t2
CPU2
CPU2
t3
CPU3
CPU3
t4
CPU1
Scatter
>
2.2 MPI_Scatter
MPI_Scatter :
iroot = 0
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);
n
CPU
MPI_DOUBLE
b
n b
n
MPI_DOUBLE
iroot
CPU id
MPI_Gather MPI_Scatter idest CPU CPU a

27
CPU id t CPU0 n t
CPU1 n t CPU2 n t
2.3 :
CPU0
t 1t 2t 3t 4
CPU0
t1
CPU1
t2
CPU2
CPU2
t3
CPU3
CPU3
t4
CPU1
Gather
<
2.3 MPI_Gather
MPI_Gather :
idest = 0
MPI_Gather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_ DOUBLE, idest, comm);
MPI_Gather
a
n a
n
MPI_DOUBLE
t
n
MPI_DOUBLE
CPU
idest
CPU id
MPI_Allgather :
MPI_ Allgather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, comm);
MPI_Allgather MPI_Gather MPI_Gather
CPU MPI_ Allgather CPU
28
CPU0
t 1t 2t 3t 4
CPU1
t 1t 2t 3t 4
CPU2
CPU3
CPU0
t1
CPU1
t2
t 1t 2t 3t 4
CPU2
t3
t 1t 2t 3t 4
CPU3
t4
Allgather
2.4 MPI_Allgather
2.4.2 MPI_Reduce, MPI_Allreduce

'' (reduction operation) CPU
(partial sum) CPU MPI_Reduce
CPU (iroot)MPI_Allreduce
CPU
MPI_Reduce 2.5 MPI_Allreduce 2.6
CPU0
suma
0.2 1.5
CPU0
CPU1
suma
0.5 0.6
CPU2
suma
0.3 0.4
Reduce
CPU1
>
(MPI_SUM) CPU2
CPU3
suma
0.7 1.0
CPU3
sumall
1.7 3.5
2.5 MPI_Reduce
CPU0
suma
0.2 1.5
CPU1
suma
0.5 0.6
CPU2
suma
0.3 0.4
CPU3
suma
0.7 1.0
Allreduce
>
(MPI_SUM)
CPU0
sumall
1.7 3.5
CPU1
sumall
1.7 3.5
CPU2
sumall
1.7 3.5
CPU3
sumall
1.7 3.5
2.6 MPI_Allreduce
29
MPI_Reduce MPI_Allreduce :
iroot = 0;
MPI_Reduce ((void *)&suma, (void *)&sumall, count, MPI_DOUBLE, MPI_SUM,
iroot, comm);
MPI_Allreduce((void *)&suma, (void *)&sumall, count, MPI_DOUBLE, MPI_SUM,
comm);
suma
sumall
count
MPI_DOUBLE
()
() ( CPU suma )
()
suma sumall
2.1
CPU_id
MPI_SUM
iroot
sumall CPU suma

MPI
MPI_SUM
MPI_PROD
MPI_MAX
MPI_MIN
MPI_MAXLOC
MPI_MINLOC
MPI_LAND
MPI_LOR
MPI_LXOR
MPI_BAND
MPI_BOR
MPI_BXOR
Operation
sum
product
maximum
minimum
max value and location
min value and location
logical AND
logical OR
logical exclusive OR
binary AND
binary OR
binary exclusive OR
C Data type
MPI_INT, MPI_ FLOAT,
MPI_DOUBLE, MPI_LONG_DOUBLE
MPI_FLOAT_INT, MPI_DOUBLE_INT,
MPI_LONG_INT, MPI_2INT
MPI_SHORT, MPI_LONG, MPI_INT,
MPI_UNSIGNED_SHORT, MPI_UNSIGNED,
MPI_UNSIGNED_LONG
MPI_SHORT, MPI_LONG, MPI_INT,
MPI_UNSIGNED_SHORT, MPI_UNSIGNED,
MPI_UNSIGNED_LONG
2.1 MPI Reduction Function

MPI_MAXLOC MPI_MINLOC C structure
Data type
MPI_FLOAT_INT
MPI_DOUBLE_INT
MPI_LONG_INT
MPI_2INT
Description (C structure)
{MPI_FLOAT, MPI_INT}
{MPI_DOUBLE, MPI_INT}
{MPI_LONG, MPI_INT}
{MPI_INT, MPI_INT}
30
2.5
T2DCP
T2DCP np CPU
abcd ntotal np bcd
ntotal ntotal t bc
d aCPU0 MPI_Scatter
CPU
iroot=0;
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);
CPU MPI_Gather a
CPU0
idest=0;
MPI_Gather ((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);
T2CP T2DCP MPI_ScatterMPI_Gather
CPU MPI_Send
MPI_Recv
dimension CPU np ntotal
n ntotal / np define :
#define ntotal
#define np
#define n
200
4
50
ntotal np dimension ntotal n :

double a[n], b[n], c[n], d[n], t[ntotal];
CPU for loop 0 n-1 suma a ntotal
np (partial sum) MPI_Reduce
CPU suma CPU0 sumall
31
iroot=0;
MPI_Reduce ((void *0&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, iroot,
comm);
T2DCP :
/*
PROGRAM T2DCP */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ntotal 200
#define np 4
#define n
50
main ( argc, argv)
int argc;
char **argv;
{
/*
Data & Computational Partition Using MPI_Scatter, MPI_Gather

value of n must be modified when run on other than 4 processors
*/
int
FILE
i, j, k;
*fp;
double
a[n], b[n], c[n], d[n], t[ntotal], suma, sumall;
int
nproc, myid, istart, iend, iroot, idest;
MPI_Comm
comm;
MPI_Status
istat[8];
extern int
mod;
Comm = MPI_COMM_WORLD;
istart = 0;
iend = n-1;
32
/*
*/
read input data and distribute input data

if (nproc != np) {
printf( "nproc not equal to np= %d\t%d\t",nproc, np);
printf(" program will stop");
MPI_Finalize();
return 0;
}
if (myid == 0) {
fread( (void *)&t, sizeof(t), 1, fp );
}
iroot=0;
MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_ DOUBLE, iroot, comm);
if(myid == 0) {
}
MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&c, n, MPI_ DOUBLE, iroot, comm);
if(myid == 0) {
}
MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&d, n, MPI_DOUBLE, iroot, comm);
/*
compute, gather computed data, and write out the result
*/
suma=0.0;
/* for(i=0; i<ntotal; i++) { */
for(i=istart; i<=iend; i++) {
a[i]=b[i]+c[i]*d[i];
suma=suma+a[i];
}
idest=0;
MPI_Gather((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_ DOUBLE, idest, comm);
MPI_Reduce((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, idest, comm);
if(myid == 0) {
for (i = 0; i < ntotal; i+=40) {
t[i],t[i+5],t[i+10],t[i+15],t[i+20],t[i+25],t[i+30],t[i+35]);
33
}
printf( "sum of A=%f\n",sumall);
}
MPI_Finalize();
return 0;
}
T2DCP :
ATTENTION: 0031-408
10.000 3.056
2.562
2.148
2.131
2.118
2.074
2.070
2.066
2.050
2.048
2.046
2.037
2.036
2.035
sum of A=438.548079
4 nodes allocated by LoadLeveler, continuing...

2.383
2.290
2.234
2.196
2.168
2.108
2.099
2.091
2.085
2.079
2.063
2.060
2.057
2.054
2.052
2.044
2.034
2.043
2.033
2.041
2.032
2.040
2.031
2.039
2.031
34
MPI
3.1 MPI MPI_SendrecvMPI_BcastMPI_ Sendrecv
MPI_Bcast
3.2 T3SEQ
3.3 MPI_ SendrecvMPI_SendMPI_Recv T3SEQ
T3CP_1 MPI_ Bcast T3CP_1 MPI_Send MPI_Recv
T3CP_2T3CP_1 T3CP_2
3.4 T3DCP_1
3.5 T3DCP_2
35
3.1
MPI_Sendrecv, MPI_Bcast
MPI_Sendrecv '' MPI_Bcast ''
3.1.1 MPI_ Sendrecv

CPU CPU CPU MPI_
Sendrecv MPI_Send MPI_Recv
CPU CPU
itag = 110;
MPI_ Sendrecv ((void *)&b[iend],
icount, DATA_TYPE, r_nbr, itag,
(void *)&b[istartm1], icount, DATA_TYPE, l_nbr, itag, comm, istat);
b[iend]
icount
DATA_TYPE
r_nbr
itag
CPU id ()
b[istartm1]
icount
DATA_TYPE
l_nbr
itag
CPU id ()
istat
3.1.2 MPI_Bcast
MPI_Bcast '' Bcast Broadcast
communicator CPU ''
CPU CPU
CPU
MPI_Bcast :
iroot=0;
36
MPI_Bcast ( (void *)&b, icount, DATA_TYPE, iroot, comm);

b
icount
DATA_TYPE
CPU id
iroot
MPI_Bcast 3.1 :
CPU0
CPU0
b1 b2 b3 b4
CPU1
b1 b2 b3 b4
CPU2
CPU2
b1 b2 b3 b4
CPU3
CPU3
b1 b2 b3 b4
CPU1
b1 b2 b3b4
MPI_Bcast
3.1 MPI_Bcast
37
T3SEQ
3.2
T3SEQ for loop a[i] c[i]d[i]

b[i-1]b[i]b[i+1]
CPU (outside territory)
(boundary data exchange) a amax
/*
PROGRAM T3SEQ
Boundary Data Exchange Program - Sequential Version
*/
#include <stdio.h>
#include <stdlib.h>
#define ntotal
200
main ()
{
double
int
FILE
amax, a[ntotal], b[ntotal], c[ntotal], d[ntotal];

i, j;
*fp;
extern double max(double, double);

/*
read
'input.dat',
compute, and write out the result
*/

fclose( fp );
amax = -1.0e12;
for (i = 1; i < ntotal-1; i++) {
a[i]=c[i]*d[i]+(b[i-1]+2.0*b[i]+b[i+1])*0.25;
amax=max(amax,a[i]);
}
for (i = 0; i < ntotal; i+=40) {
38
}
printf( "MAXIMUM VALUE OF A ARRAY is=%f\n",amax);
return 0;
}
double max(double a, double b)
{
if(a >= b)
return a;
else
return b;
}
T3SEQ :
0.000
2.148
3.063
2.131
2.563
2.118
2.383
2.108
2.290
2.099
2.234
2.091
2.196
2.085
2.168
2.079
2.074
2.070
2.066
2.063
2.060
2.057
2.050
2.048
2.046
2.044
2.043
2.041
2.037
2.036
2.035
2.034
2.033
2.032
MAXIMUM VALUE OF A ARRAY is=5.750000
2.054
2.040
2.031
2.052
2.039
2.031
39
T3CP
3.3
T3SEQ ? 3.4
3.5
T3CP_1 startend CPU index
CPU0 CPU1 3.2 :
left
mpi_proc_null
cpu0
cpu1
cpu2
right

| |
istart2
istart
.
iend+1
iend1
iend
|

.
. .
iend+1
iend1
iend
|

| |
| istart
istart2
istart-1
. . . . .
is owned data
ntotal
|
. . .
ntotal
|
. . .
iend
iend1
|
|
|

|
istart
istart2
istart -1
mpi_proc_null
is exchanged data
3.2
3.2 .
CPU
CPU istart iend T3SEQ :
amax=-1.e12;
for (i=1; i<ntotal-1; i++) {
a[i]=c[i]*d[i] + ( b[i-1] + 2.0*b[i] + b[i+1] )*0.25
amax = max(amax, a[i])
}
40
index i 1 <ntotal-1 CPU0 1 CPU istart

istart1 :
istart1=istart;
if (myid == 0) istart1=1;
loop CPU ntotal-2 iend-1 CPU iend
iend1 :
iend1= iend;
if (myid == nproc-1) iend1= iend 1;
a[i] i istart b[istart-1] a[i] i iend b[iend+1]
istartm1 (istart minus 1) iendp1 (iend plus 1) b
index :
istartm1=istart-1;
iendp1=iend+1;
(i-1 i+1 ) for loop MPI_Sendrecv
CPU CPU startend index
CPU0 CPU1 CPU2 CPU
l_nbr CPU CPU id r_nbr CPU CPU id
CPU CPU CPU
MPI_PROC_NULL MPI_PROC_RULL
mpi.h
l_nbr = myid-1;
r_nbr = myid+1;
IF(myid == 0)
l_nbr = MPI_PROC_NULL;
IF(myid == NPROC-1) r_nbr = MPI_PROC_NULL;
b[i-1] b[i+1] MPI_Sendrecv b[i-1]
b[i+1]
b[i-1] 3.2 CPU1 b[iend] "
b[istartm1]" " b[iend]" b[istartm1]
MPI_PROC_NULL CPU
b[istartm1] :
41
itag = 110;
MPI_Sendrecv ((void *)&b[iend],
1, MPI_DOUBLE, r_nbr, itag,
(void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat)
b[i+1] 3.2 CPU1 b[istart] "
b[iendp1]" " b[istart]" b[iendp1]
MPI_PROC_NULL CPU
b[iendp1] :
itag = 120;
MPI_Sendrecv ((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
(void *)&b[iendp1], 1, MPI_DOUBLE, r_nbr, itag, comm, istat);
CPU for loop istart iend amax a ntotal
np
MPI_Allreduce CPU amax gmax (global maximum)
CPU
MPI_Allreduce ( (void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm );
reduce allreduce CPU reduce
MPI_Allreduce CPU reduce
MPI_Reduce MPI_Allreduce
:
/*
PROGRAM T3CP
Boundary data exchange with computing partition without data partition
Using MPI_Send, MPI_Recv to distribute input data
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ntotal 200
main ( argc, argv)
42
int argc;
char **argv;
{
double amax, gmax, a[ntotal], b[ntotal], c[ntotal], d[ntotal];
int
i, j, k;
FILE *fp;
int
nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp;
int
itag, isrc, idest, istart1,icount1, istart2, iend1, istartm1, iendp1;
int
MPI_Status
istat[8];
MPI_Comm comm;
startend (nproc, 0, ntotal-1, gstart, gend, gcount);
iend=gend[myid];
lastp=nproc-1;
istartm1=istart-1;
iendp1=iend+1;
istart2=istart;
if (myid == 0) istart2=istart+1;
iend1=iend;
if(myid == lastp ) iend1=iend-1;
l_nbr = myid - 1;
r_nbr = myid + 1;
if (myid == 0) l_nbr=MPI_PROC_NULL;
if (myid == lastp) r_nbr=MPI_PROC_NULL;
43
/*
READ 'input.dat', and distribute input data
*/
if ( myid==0) {
fclose( fp );
for (idest = 1; idest < nproc; idest++) {
istart1=gstart[idest];
icount1=gcount[idest];
itag=10;
MPI_Send ((void *)&b[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
itag=20;
MPI_Send ((void *)&c[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
itag=30;
MPI_Send ((void *)&d[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
}
}
else {
isrc=0;
itag=10;
MPI_Recv ((void *)&b[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
itag=20;
MPI_Recv ((void *)&c[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
itag=30;
MPI_Recv ((void *)&d[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
}
/*
Exchange data outside the territory
*/
itag=110;
MPI_Sendrecv((void *)&b[iend],
(void *)&b[istartm1],1, MPI_DOUBLE, l_nbr, itag, comm, istat);
itag=120;
MPI_Sendrecv((void *)&b[istart], 1, MPI_DOUBLE, l_nbr, itag,
(void *)&b[iendp1],1, MPI_DOUBLE, r_nbr, itag, comm, istat);
44
/*
Compute, gather and write out the computed result
*/
amax= -1.0e12;
for (i=istart2; i<=iend1; i++) {
a[i]=c[i]*d[i]+(b[i-1]+2.0*b[i]+b[i+1])*0.25;
}
itag=130;
if (myid > 0) {
idest=0;
MPI_Send((void *)&a[istart], icount, MPI_DOUBLE, idest, itag, icomm);
}
else
for (isrc=1; isrc<nproc; isrc++) {

istart1=gstart[isrc];
icount1=gcount[isrc];
MPI_Recv((void *)&a[istart1], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
}
}
MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
amax=gmax;
if( myid == 0) {
for (i = 0; i < ntotal; i+=40) {
}
printf ("MAXIMUM VALUE OF ARRAY A is %f\n", amax);
}
MPI_Finalize();
return 0;
}
{
if(a >= b)
return a;
else
return b; }
45
T3CP_1 :
ATTENTION: 0031-408 4 nodes allocated by LoadLeveler, continuing...
0
0
49
0.000
3.063
2.563
2.383
2.290
50
99
100
149
150
199
2.234
2.196
2.168
2.148
2.074
2.050
2.037
2.099
2.060
2.043
2.033
2.091
2.057
2.041
2.032
2.131
2.070
2.048
2.036
2.118
2.066
2.046
2.035
2.108
2.063
2.044
2.034
1
2
3
2.085
2.054
2.040
2.031
2.079
2.052
2.039
2.031
MAXIMUM VALUE OF ARRAY A is 5.750000

CPU0 MPI_Bcast
CPU
MPI_Bcast T3CP MPI_Send MPI_Recv

if ( myid==0) {
fclose( fp );
}
iroot=0;
MPI_Bcast( (void *)&b, ntotal, MPI_DOUBLE, iroot, comm);
MPI_Bcast( (void *)&c, ntotal, MPI_DOUBLE, iroot, comm);
MPI_Bcast( (void *)&d, ntotal, MPI_DOUBLE, iroot, comm);
46
3.4
() T3DCP_1
np CPU n
ntotal np ntotal NP
n+2 dimension [n+2]
index 1 n :
double a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal], amax,gmax;
3.3 :
left
mpi_proc_null
cpu0
n
+
index 0 1 2 . n 1

index
cpu1
0 1 2 . n

n
+
1
n
+
index 0 1 2
. n 1

cpu2
n
+
index 0 1 2 . n 1

cpu3
mpi_proc_null
right
is owned data
is exchanged data
3.3
CPU for loop 1 N :
istart=1;
iend=n;
CPU for loop 2 CPU for loop 1 CPU for loop
n-1 CPU for loop n :
47
istart2= istart ;
iend1= iend;
if(myid == nproc-1) iend1= iend 1;
CPU CPU iend
CPU istart-1 :
istartm1 = istart 1;
itag=110;
MPI_Sendrecv ((void *)&b[iend],
(void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag, comm, istat);
CPU CPU istart
CPU iend+1 :
iendp1 = iend+1;
itag=120
MPI_Sendrecv ((void *)&b[istart],
1, MPI_DOUBLE, l_nbr, itag,

ntotal bcd t MPI_Scatter CPU b
cd dimension 1 MPI_Scatter
b[1]c[1]d[1]MPI_Gather
iroot=0;
MPI_Scatter (t,
n, MPI_DOUBLE,
MPI_Gather (a[1], n, MPI_DOUBLE,
b[1], n, MPI_DOUBLE, iroot, comm)

t,
n, MPI_DOUBLE, iroot, comm);
T3DCP_1 :
/*
PROGRAM T3DCP_1
Boundary data exchange with data & computing partition
Using MPI_Gather, MPI_Scatter to gather & scatter data
*/
48
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ntotal 200
#define n
50
#define np
main ( argc, argv)

int argc;
char **argv;
{
double amax, gmax, a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal];
int
i, j, k;
FILE
int
int
*fp;
nproc, myid, istart, iend, istart2, iend1, istartm1, iendp1;
r_nbr,l_nbr, lastp, iroot, itag;
MPI_Status
istat[8];
MPI_Comm
comm;
istart=1;
iend=n;
lastp=nproc-1;
istartm1=istart-1;
iendp1=iend+1;
istart2=istart;
if(myid == 0) istart2=2;
iend1=iend;
49
l_nbr = myid - 1;
r_nbr = myid + 1;
if(myid == 0)
l_nbr=MPI_PROC_NULL;
if(myid == lastp) r_nbr=MPI_PROC_NULL;
/*
*/
if( myid==0) {
}
iroot=0;
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b[1], n, MPI_DOUBLE, iroot, comm);
if( myid==0)
MPI_Scatter ((void *)&t, n, MPI_DOUBLE,( void *)&c[1], n, MPI_DOUBLE, iroot, comm);
if( myid==0) {
fclose( fp );
}
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&d[1], n, MPI_DOUBLE, iroot, comm);
/*
*/
itag=110;
1,MPI_DOUBLE, r_nbr, itag,
(void *)&b[istartm1], 1,MPI_DOUBLE, l_nbr, itag, comm, istat);
itag=120;
(void *)&b[iendp1],1, MPI_DOUBLE, r_nbr, itag, comm, istat);
/*
*/
amax= -1.0e12;
a[i]=c[i]*d[i] + ( b[i-1] + 2.0*b[i] + b[i+1] )*0.25;
50
}
MPI_Gather((void *)&a[istart], n, MPI_DOUBLE,(void *)&t, n, MPI_DOUBLE,iroot, comm);
MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX,
amax=gmax;
if( myid == 0) {
for (i = 0; i < ntotal; i+=40) {
comm);
}
}
MPI_Finalize();
return 0;
}
{
if(a >= b)
return a;
else
return b;
}
T3DCP_1 :
ATTENTION: 0031-408
0.000
3.063
2.563
2.383
2.148
2.131
2.118
2.108
2.290
2.099
2.234
2.091
50
50
50
50
2.196
2.168
2.085
2.079
2.074
2.050
2.037
2.060
2.043
2.033
2.057
2.041
2.032
2.054
2.040
2.031
2.070
2.048
2.036
2.066
2.046
2.035
2.063
2.044
2.034
1
3
0
2
1
1
1
1
2.052
2.039
2.031
51
3.5
() T3DCP_2
3.4 T3DCP_1 for loop :

for (i=3; i<=ntotal-2; i++)
a[i]=c[i]*d[i]+( b[i-2] +2.0*b[i-1] +2.0*b[i]+2.0*Bb[i+1] +b[i+2] )*0.125;
T3DCP_1 dimension 4
dimension [n+4] index 2 n+1 3.4 :
double a[n+4], b[n+4], c[n+4], d[n+4], t[ntotal], amax, gmax;
istart = 2;
iend = n+1;
LEFT
mpi_proc_null
cpu0
0 1 2 3 .

index
cpu1
cpu2
n n n
+ + +
1 2 3

0 1 2 3

index
0 1 2 3

index
cpu3
RIGHT
n n n
+ + +
. 1 2 3

.
n n n
+ + +
1 2 3

0 1 2 3

is exchanged data
is owned data
n
+
1
n n
+ +
2 3

mpi_proc_null
3.4
for loop index CPU 3 CPU
ntotal-2 :
istart3=istart;
52
iend2= iend;
if (myid == nproc-1) iend2= iend 2;
CPU
CPU iend-1 CPU
istart-2 :
iendm1=iend-1;
istartm2=istart-2;
itag = 110;
MPI_Sendrecv ((void *)&b[iendm1], 2, MPI_DOUBLE, r_nbr, itag,
1
CPU CPU
istart CPU iend+1 :
iendp1=iend+1;
itag=120;
MPI_Sendrecv ((void *)&b[istart],
2, MPI_DOUBLE, l_nbr, itag,

:
/*
PROGRAM T3CP_2
Two element of boundary data exchange with data & computing partition
Using MPI_Gather, MPI_Scatter to gather & scatter data
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ntotal 200
#define n
50
#define np
main ( argc, argv)

53
int argc;
char **argv;
{
double amax, gmax, a[n+4], b[n+4], c[n+4], d[n+4], t[ntotal];
int
i, j, k;
FILE *fp;
int
nproc, myid, istart, iend, istart3, iend2, istartm2, iendm1, iendp1;
int
r_nbr, l_nbr, lastp, iroot, itag;
MPI_Status
istat[8];
MPI_Comm comm;
istart=2;
iend=n+1;
lastp=nproc-1;
istartm2=istart-2;
iendp1=iend+1;
iendm1=iend-1;
istart3=istart;
iend2=iend;
l_nbr = myid - 1;
r_nbr = myid + 1;
if(myid == 0)
l_nbr=MPI_PROC_NULL;
/*
*/
54
if ( myid==0) {
}
iroot=0;
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b[2], n, MPI_DOUBLE, iroot, comm);
if( myid==0)
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&c[2], n, MPI_DOUBLE, iroot, comm);
If ( myid==0) {
fclose( fp );
}
MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&d[2], n, MPI_DOUBLE, iroot, comm);
/*
*/
itag=110;
MPI_Sendrecv((void *)&b[iendm1], 2, MPI_DOUBLE, r_nbr, itag,
itag=120;
/*
C
*/
amax= -1.0e12;
a[i]=c[i]*d[i] + ( b[i-2] + 2.0*b[i-1] + 2.0*b[i] + 2.0*b[i+1] + b[i+2] )*0.125;
}
MPI_Gather((void *)&a[istart], n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, iroot, comm);
MPI_Allreduce((void *)&amax, (void *)&gmax, 1, MPI_DOUBLE, MPI_MAX, comm);
amax=gmax;
if( myid == 0) {
for (i = 0; i < ntotal; i+=40) {
55
}
}
MPI_Finalize();
return 0;
}
{
if(a >= b)
return a;
else
return b;
}
T3DCP_2 :
ATTENTION: 0031-408
0.000
2.148
2.074
2.050
3.078
2.131
2.070
2.048
2.565
2.118
2.066
2.046
2.384
2.108
2.063
2.044
0
1
3
2
2.291
2.099
2.060
2.043
2
2
2
2
51
51
51
51
2.234
2.091
2.057
2.041
2.196
2.085
2.054
2.040
2.168
2.079
2.052
2.039
2.037
2.036
2.035
2.034
2.033
2.032
2.031
2.031
56

(grid points) CPU
dimension
4.1 T4SEQ dimension 161 7 23
4.2 MPI_Scatterv MPI_Gatherv '' MPI_Scatter
MPI_Gather CPU
4.3 MPI_Pack MPI_Unpack MPI_Barrier
MPI_Wtime
4.4 MPI T4SEQ T4DCP
57
T4SEQ
4.1
T4SEQ dimension 161 7 23

abcd (scalar data) pqr (initial
value)
/*
PROGRAM T4SEQ
Sequential Version of an odd-dimensioned array with -1, +1 access
*/
#include <stdio.h>
#include <stdlib.h>
#define ntotal 161
main ()
{
double
a[ntotal], b[ntotal], c[ntotal], d[ntotal], p, q, r, pqr[3];
int
i,j;
FILE
*fp;
/*
READ 'input.dat', COMPUTE AND WRITE OUT THE RESULT */

for (i = 0; i < ntotal; i++) {
b[i]=3.0/(double)(i+1)+1.0;
c[i]=2.0/(double)(i+1)+1.0;
d[i]=1.0/(double)(i+1)+1.0;
}
p=1.45;
q=2.62;
r=0.5;
pqr[0]=p;
pqr[1]=q;
pqr[2]=r;
fwrite((void *)&b, sizeof(b), 1, fp );
fwrite((void *)&c, sizeof(c), 1, fp );
58
fwrite((void *)&d, sizeof(d), 1, fp );

fwrite((void *)&pqr, sizeof(pqr), 1, fp );
fclose( fp );
fread( (void *)&pqr, sizeof(pqr), 1, fp );
fclose( fp );
p=pqr[0];
q=pqr[1];
r=pqr[2];
for (i = 1; i < ntotal-1; i++) {
a[i]=c[i]*d[i]*p+(b[i-1]+2.0*b[i]+b[i+1])*q+r;
}
for (i = 0; i < ntotal-1; i+=40) {
}
return 0;
}
T4SEQ :
0.000
13.305
12.872
12.726
18.550
13.210
12.847
12.714
15.720
13.133
12.824
12.703
14.682
13.070
12.803
12.693
14.143
13.018
12.785
12.684
13.812
12.973
12.768
12.675
13.588
12.935
12.753
12.667
13.427
12.901
12.739
12.660
59
4.2.
MPI_ScattervMPI_Gatherv
4.1 T4SEQ dimension 161

7 23 2468 CPU 24
68 CPU MPI_Scatter MPI_Gather
CPU MPI_Send MPI_Recv
MPI MPI_Scatterv MPI_ Gatherv
MPI_ Scatterv MPI_ Scatter MPI_ Gatherv MPI_ Gather

MPI_ Scatter CPUMPI_Gather CPU
MPI_ Scatterv MPI_Gatherv
MPI_ Scatterv :
MPI_ Scatterv ( (void *)&t, gcount, gdisp, MPI_DOUBLE,
(void *)&c(1), mycount, MPI_DOUBLE, iroot, comm);
MPI_ Scatterv iroot CPU t CPU id CPU iroot
CPU CPU gcount
CPU gdisp t
:
t
gcount
gdisp
CPU
CPU t
MPI_DOUBLE
c(1)
mycount
MPI_DOUBLE
iroot
CPU id
gcount gdisp startend CPU

START index CPU id gcountgdisp
dimension :
double
a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal];

60
int
nproc, myid, mycount, istart, iend, l_nbr, r_nbr, gcount[np], gdisp[np], gend[np];
MPI_Status
istat[8];
Startend (nproc, 1, ntotal, gstart, gend, gcount);

mycount = gcount[myid];
abcd dimension
MPI_Scatterv b[1]c[1]d[1] t
(displacement offset) startend istart
MPI_Gatherv :
MPI_ Gatherv ((void *)&a[1], mycount, MPI_DOUBLE,
(void *)&t, gcount, gdisp, MPI_DOUBLE, iroot, comm);
MPI_ Gatherv MPI_Scatterv iroot CPU CPU ( iroot CPU)
a CPU id t CPU
gcount CPU gdisp
t :
a[1]
mycount
MPI_DOUBLE
gcount
gdisp
MPI_DOUBLE
iroot
CPU
CPU T
CPU id
61
4.3
MPI_PackMPI_UnpackMPI_ BarrierMPI_ Wtime
T4SEQ abcd pqr

MPI_Scatterv MPI_Gatherv
CPU PQR
MPI_BCAST CPU (2001 ) CPU
CPU
MPI MPI_Pack
(noncontiguous data) (contiguous memory
locations) (buffer area) (character array)
MPI_Unpack
'' MPI_Barrier CPU MPI

(Fortran function) MPI_Wtime (wall clock time)
4.3.1 MPI_Pack
MPI_Unpack
MPI_Pack T4SEQ pqr
CPU (pack)
MPI_Unpack
MPI_PackMPI_Unpack
pqr 4 12
buf1 12 :
#define bufsize 12
char buf1[bufsize];
MPI_Pack :
MPI_Pack ((void *)&p, 1, MPI_FLOAT, (void *)&buf1, bufsize, &ipos, comm);
p
buf1
62
buf1
MPI_ FLOAT
buf1
buf1
bufsize
buf1
ipos
buf1
ipos
CPU0 pqr buf1 :
if (myid == 0) {
scanf (%f %f %f, &p, &q, &r);
ipos = 0;
MPI_Pack ((void *)&p, 1, MPI_ FLOAT, (void *)&buf1, bufsize, &ipos, comm);
MPI_Pack ((void *)&q, 1, MPI_ FLOAT, (void *)&buf1, bufsize, &ipos, comm);
MPI_Pack ((void *)&r, 1, MPI_ FLOAT, (void *)&buf1, bufsize, &ipos, comm);
}
MPI_Bcast buf1 CPU :
iroot=0
MPI_Bcast ((void *)&buf1, bufsize, MPI_CHAR, iroot, comm);
MPI_Unpack :
MPI_Unpack ((void *)&buf1, bufsize, ipos, (void *)&p 1, MPI_FLOAT, comm);
buf1
bufsize
ipos
p
1
MPI_FLOAT
buf1
buf1
buf1
buf1
buf1
ipos
CPU bcast buf1 buf1 pqr
:
63
if (myid > 0)
ipos=0;
MPI_Unpack ((void *)&buf1, bufsize, ipos, (void *)&p, 1, MPI_FLOAT, comm);
MPI_ Unpack ((void *)&buf1, bufsize, ipos, (void *)&q, 1, MPI_ FLOAT, comm);
MPI_ Unpack ((void *)&buf1, bufsize, ipos, (void *)&r, 1, MPI_ FLOAT, comm);
}
Pack
float
p, q, r, buf1[3];
if (myid == 0)
scanf (%f
buf1(1)=p
buf1(2)=q
{
%f
%f,
&p, &q, &r);
buf1(3)=r
}
iroot=0;
MPI_Bcast (buf1, 3, MPI_FLOAT, iroot, comm);
If (myid > 0) {
p = buf1(1);
q = buf1(2);
r = buf1(3);
}
4.3.2 MPI_Barrier
MPI_Wtime
MPI_Barrier '' communicator CPU '
' (synchronized) CPU MPI_Barrier CPU
MPI_Barrier MPI_Barrier
MPI_Barrier :
MPI_Barrier (MPI_COMM_WORLD);
(wall clock time) MPI MPI_Wtime
64
'' :
time1=MPI_Wtime();
time1 double
CPU
MPI_Init MPI_Wtime MPI_Finalize
MPI_Wtime :
MPI_Init();
time1=MPI_Wtime()
...
time2=MPI_Wtime() time1;
printf (myid, clock time= %f\t%f\n, myid,time2);
MPI_Finalize();
Return 0;
time 1 MPI_Barrier ? Job Scheduler
(executable file) CPU CPU
CPU
CPU
CPU CPU time 1 MPI_ Barrier
65
T4DCP
4.4
T4DCP T4SEQ ab
cd ntotal 161 4 CPU startend CPU
41404040 np CPU
n ntotal / np + 1 define :
#define ntotal
#define np
161
4
#define n
41
n+2 demension
(n+2) index 1 n :
double
a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal];
ntotal np CPU n(=41) n-1(=40) CPU

startend gcount t
gdisp :
int
gcount[np], gstart[np], gend[np];
startend (nproc, 1, ntotal, gstart, gend, gcount);

CPU mycount index istartiend :
mycount=gcount[myid];
istart=1;
iend=mycount;
CPU 4.1 :
66
mpi_proc_null
iend
iend+1
iend
istart-1
istart
istart-1
istart
istart-1
is owned data
iend+1
iend
CPU0
iend+1

istart
CPU1
CPU2
mpi_proc_null
is exchanged data
4.1
for loop :
for (i=1; i<ntotal-1; i++)
a[i]=c[i]*d[i] + ( b[i-1] + 2.0*b[i] + b[i+1]
)*0.25;
for loop index :

istart2= istart;
iend1= iend;
if (myid == nproc-1) iend 1= iend 1;
T4DCP :
/*
PROGRAM T4DCP
Boundary data exchange with data & computing partition
Using MPI_Gatherv, MPI_Scatterv to gather & scatter data
*/
#include <stdio.h>
67
#include <stdlib.h>
#include <mpi.h>
#define ntotal 161
#define n 41
#define np 4
main ( argc, argv)
int argc;
char **argv;
{
double
int
FILE
int
p, q, r, pqr[3], a[n+2], b[n+2], c[n+2], d[n+2], t[ntotal], clock;

i, j, k;
*fp;
nproc, myid, istart, iend, istart2, iend1, istartm1, iendp1;
int
r_nbr, l_nbr, lastp, iroot, itag, icount;
int
gstart[16], gend[16], gcount[16], gdisp[16];
MPI_Status
istat[8];
MPI_Comm
comm;

MPI_Barrier(MPI_COMM_WORLD);
clock=MPI_Wtime();
startend (nproc, 1, ntotal, gstart, gend, gcount);
for (i = 0; i < nproc; i++) {
gdisp [i] = gstart[i]-1;
}
istart=1;
iend=gend[myid];
lastp=nproc-1;
istartm1=istart-1;
iendp1=iend+1;
68
istart2=istart;
iend1=iend;
l_nbr = myid - 1;
r_nbr = myid + 1;
if(myid == 0) l_nbr=MPI_PROC_NULL;
/*
*/
if( myid==0) {
}
iroot=0;
MPI_Scatterv ((void *)&t, gcount, gdisp, MPI_DOUBLE,
(void *)&b[1], icount,MPI_DOUBLE, iroot, comm);
if( myid==0)
(void *)&c[1], icount, MPI_DOUBLE, iroot, comm);
if( myid==0) {
fread( (void *)&pqr, sizeof(pqr), 1, fp );
fclose( fp );
}
(void *)&d[1], icount, MPI_DOUBLE, iroot, comm);
MPI_Bcast ((void *)&pqr, 3, MPI_DOUBLE, 0, comm);
p=pqr[0];
q=pqr[1];
r=pqr[2];
/*
*/
69
itag=110;
(void *)&b[istartm1], 1, MPI_DOUBLE, l_nbr, itag,
comm, istat);
itag=120;
MPI_Sendrecv((void *)&b[istart], 1,MPI_DOUBLE, l_nbr, itag,
(void *)&b[iendp1],1,MPI_DOUBLE, r_nbr, itag, comm, istat);
/*
C
*/

a[i]=c[i]*d[i]*p + ( b[i-1] + 2.0*b[i] + b[i+1]
)*q + r;
}
MPI_Gatherv ((void *)&a[istart], icount, MPI_DOUBLE,
(void *)&t, gcount, gdisp, MPI_DOUBLE, iroot, comm);
if( myid == 0) {
for (i = 0; i < ntotal-1; i+=40) {
}
}
clock=MPI_Wtime() - clock;
printf( "myid, clock time= %d\t%.3f\n", myid, clock);
MPI_Finalize();
return 0;
}
startend(,int nproc,int is1,int is2,int gstart[16],int gend[16], int gcount[16])
{
int
ilength, iblock, ir;
ilength=is2-is1+1;
for ( i=0; i < nproc; i++ ) {
if(i < ir) {
gstart[i]=is1+i*(iblock+1);
gend[i]=gstart[i]+iblock;
}
else {
gstart[i]=is1+i*iblock+ir;
70
gend[i]=gstart[i]+iblock-1;
}
if(ilength < 1) {
gstart[i]=1;
gend[i]=0;
}
gcount[i]=gend[i]-gstart[i] + 1;
}
}
T4DCP :
ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing...
0
1
41
0.000
13.305
12.872
12.726
18.550
13.210
12.847
12.714
15.720
13.133
12.824
12.703
myid, clock time= 0

myid, clock time= 1
myid, clock time= 2
myid, clock time= 3
14.682
13.070
12.803
12.693
1
2
3
14.143
13.018
12.785
12.684
1
1
1
13.812
12.973
12.768
12.675
40
40
40
13.588
12.935
12.753
12.667
13.427
12.901
12.739
12.660
0.002
0.002
0.002
0.002
CPU CPU
71
5.1 T5SEQ
5.2 T5CP
5.3 T5DCP
5.4 MPI
5.5 T5_2D
72
5.1 T5SEQ
T5SEQ (global variables)
(local variables) (test data generation)
/*
PROGRAM T5SEQ
Sequential version of multiple dimensional array with -1,+1 data access
*/
#include <stdio.h>
#include <stdlib.h>
#define kk 20
#define km 3
#define mm 160
#define nn
120
double f1[mm][nn][km], f2[mm][nn][km], hxu[mm][nn], hxv[mm][nn],
hmmx[mm][nn], hmmy[mm][nn];
double vecinv[kk][kk], am7[kk];
main ()
{
double u1[mm][nn][kk], v1[mm][nn][kk], ps1[mm][nn];
double d7[mm][nn], d8[mm][nn], d00[mm][nn][kk];
double clock, sumf1, sumf2;
int
i, j, k, ka, isec1, isec2, nsec1, nsec2;
/*
Test data generation

*/
wtime(&isec1, &nsec1);
for (i=0; i<mm; i++)
for (j=0; j<nn; j++)
for (k=0; k<kk; k++)
u1[i][j][k]=1.0/(double) (i+1) + 1.0/(double) (j+1) + 1.0/(double) (k+1);
73
v1[i][j][k]=2.0/(double) (i+1) + 1.0/(double) (j+1) + 1.0/(double) (k+1);

for (i=0; i<mm; i++) {
for (j=0; j<nn; j++) {
ps1[i][j] = 1.0/(double)(i+1) + 1.0/(double)(j+1);
hxu[i][j] = 2.0/(double)(i+1) + 1.0/(double)(j+1);
hxv[i][j] = 1.0/(double)(i+1) + 2.0/(double)(j+1);
hmmx[i][j] = 2.0/(double)(i+1) + 1.0/(double)(j+1);
hmmy[i][j] = 1.0/(double)(i+1) + 2.0/(double)(j+1);
}
}
for (k=0; k<kk; k++) {
am7[k]=1.0/(double) (k+1);
for (ka=0; ka<kk; ka++) {
vecinv[k][ka]=1.0/(double) (ka+1) + 1.0/(double) (k+1);
}
}
/*
Start the computation
*/
for (j=0; j<nn; j++) {

for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}
}
for (i=0; i<mm-1; i++)
for (j=1; j<nn-1; j++)
d7[i][j] = ( ps1[i+1][j]+ps1[i][j] )*0.50*hxu[i][j];
for (i=1; i<mm-1; i++)
for (j=0; j<nn-1; j++)
d8[i][j] = ( ps1[i][j+1]+ps1[i][j] )*0.50*hxv[i][j];
for (i=1; i<mm-1; i++)
74
for (j=1; j<nn-1; j++)

d00[i][j][k]=(d7[i][j]*u1[i][j][k]-d7[i-1][j]*u1[i-1][j][k])*hmmx[i][j]
+(d8[i][j]*v1[i][j][k]-d8[i][j-1]*v1[i][j-1][k])*hmmy[i][j];
for (i=1; i<mm-1; i++)
for (ka=0; ka<kk; ka++)
for (j=1; j<nn-1; j++)
for (k=0; k<km; k++)
f1[i][j][k]=f1[i][j][k]-vecinv[ka][k]*d00[i][j][ka];
sumf1=0.0;
sumf2=0.0;
for (i=1; i<mm-1; i++) {
for (j=1; j<nn-1; j++) {
for (k=0; k<km; k++) {
f2[i][j][k]=-am7[k]*ps1[i][j];
sumf1 +=f1[i][j][k];
}
}
}
/*
Output data for validation
*/
printf( "SUMF1,SUMF2= %.5f\t%.5f\n", sumf1, sumf2 );

printf( " F2[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
f2[i][1][1],f2[i+5][1][1],f2[i+10][1][1],f2[i+15][1][1],
f2[i+20][1][1],f2[i+25][1][1],f2[i+30][1][1],f2[i+35][1][1]);
}
clock=(double) (isec2-isec1) + (double) (nsec2-nsec1)/1.0e9;
printf( " clock time = %f\n", clock);
return 0;
}
#include <sys/time.h>
int wtime(int *isec, int *nsec)
75
{
struct timestruc_t tb;
int iret;
iret=gettimer(TIMEOFDAY, &tb);
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
T5SEQ IBM SP2 SMP :
SUMF1,SUMF2= 26172.46054
F2[i][1][1],i=0,159,5
0.000
-0.333 -0.295 -0.281
-0.274
-0.269
-0.266
-0.264
-0.262
-0.256
-0.254
-0.258
-0.255
-0.254
-0.258
-0.255
-0.253
-0.257
-0.255
-0.253
-0.257
-0.254
-0.253
-0.261
-0.256
-0.254
-0.260
-0.255
-0.254
-0.259
-0.255
-0.254
-2268.89180
clock time = 0.090299
76
5.2 T5CP
C least
dimension index index
index +1 index -1
istart-1
| istart
| |
istart-1
| istart
| |
|
|
istart-1
| istart
| |
| |
| iend+1
iend
|
| |
istart
|
nn |
. |
.
j=1
P0
5.1
#define kk
| |
| iend+1
iend
P1
|
iend
| |
| iend+1
iend
P2
ps1(i,j)
P3
ps1(mm,nn)
20
#define km 3
#define mm 160
#define nn
120
double
u1[mm][nn][kk], v1[mm][nn][kk], ps1[mm][nn];

77
itag = 20;
MPI_Sendrecv ((void *)&ps1[istart][0],
nn, MPI_DOUBLE, l_nbr, itag,
(void *)&ps1[iendp1][0], nn, MPI_DOUBLE, r_nbr, itag, icomm, istat);
k=kk
k=1
nn
.
.
.
j=1
i=1 . . . . . m
P0
u1[i][j][k]
m = mm / np
P1
5.2
P2
P3
u1(mm,nn,kk)
nnkk = nn*kk;
itag = 10;
MPI_Sendrecv ((void *)&u1[iend][0][0],
nnkk, MPI_DOUBLE, r_nbr, itag,
(void *)&u1[istartm1][0][0], nnkk, MPI_DOUBLE, l_nbr, itag, icomm, istat);

MPI
nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp, iroot; itag, isrc, idest, istart1, icount1, istart2,
iend1, istartm1, iendp1 MPI
Fortran (COMMON area) Fortran
78
T5CP :
/*
PROGRAM T5CP
Computing partition on the first dimension of multiple dimensional
array with -1,+1 data exchange without data partition
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define kk 20
#define km 3
#define mm 160
#define nn
120
main ( argc, argv)
int argc;
char **argv;
{
double clock, sumf1, sumf2, gsumf1, gsumf2;
int
i, j, k, ka, nnkk;
int
int
nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp, iroot;

itag, isrc, idest, istart1, icount1, istart2, iend1, istartm1, iendp1;
int
MPI_Status
istat[8];
MPI_Comm
comm;
79
MPI_Barrier(comm);
clock=MPI_Wtime();
startend (nproc, 0, mm-1, gstart, gend, gcount);
iend=gend[myid];
lastp=nproc-1;
istartm1 = istart-1;
iendp1 = iend+1;
istart2 = istart;
if (myid == 0) istart2 = 1;
iend1 = iend;
if (myid == lastp ) iend1 = iend-1;
l_nbr = myid - 1;
r_nbr = myid + 1;
if (myid == 0) l_nbr = MPI_PROC_NULL;
if (myid == lastp) r_nbr = MPI_PROC_NULL;
/*
*/
/* for (i=0; i<mm; i++) */

for (i=istart; i<=iend; i++)
u1[i][j][k]=1.0/(double) (i+1) + 1.0/(double) (j+1) + 1.0/(double) (k+1);
/* for (i=0; i<mm; i++) */
v1[i][j][k]=2.0/(double) (i+1) + 1.0/(double) (j+1) + 1.0/(double) (k+1);
{
80
for (j=0; j<nn; j++) {

ps1[i][j] = 1.0/(double)(i+1) + 1.0/(double)(j+1);
hxu[i][j] = 2.0/(double)(i+1) + 1.0/(double)(j+1);
hxv[i][j] = 1.0/(double)(i+1) + 2.0/(double)(j+1);
hmmx[i][j] = 2.0/(double)(i+1) + 1.0/(double)(j+1);
hmmy[i][j] = 1.0/(double)(i+1) + 2.0/(double)(j+1);
}
}
for (k=0; k<kk; k++) {
am7[k] = 1.0/(double) (k+1);
vecinv[k][ka] = 1.0/(double) (ka+1) + 1.0/(double) (k+1);
}
}
/*

nnkk = nn*kk;
itag = 10;
*/

(void *)&u1[istartm1][0][0], nnkk, MPI_DOUBLE, l_nbr, itag, comm, istat);
itag = 20;
MPI_Sendrecv ((void *)&ps1[istart][0], nn, MPI_DOUBLE, l_nbr, itag,
(void *)&ps1[iendp1][0], nn, MPI_DOUBLE, r_nbr, itag, comm, istat);
/*
for (i=0; i<mm; i++) { */

for (i=istart; i<=iend; i++) {
for (j=0; j<nn; j++) {
for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}
}
/*
for (i=0; i<mm-1; i++) */

for (i=istart; i<=iend1; i++)
for (j=1; j<nn-1; j++)
81
/*
for (i=1; i<mm-1; i++) */

for (i=istart2; i<=iend1; i++)
for (j=0; j<nn-1; j++)
itag=30;
MPI_Sendrecv ((void *)&d7[iend][0],
nn,MPI_DOUBLE,r_nbr,itag,
(void *)&d7[istartm1][0],nn,MPI_DOUBLE,l_nbr,itag,
comm, istat);
/* for (i=1; i<mm-1; i++) */

for (j=1; j<nn-1; j++)
/* for (i=1; i<mm-1; i++) */
for (j=1; j<nn-1; j++)
sumf1=0.0;
sumf2=0.0;
/* for (i=1; i<mm-1; i++) { */
for (j=1; j<nn-1; j++) {
for (k=0; k<km; k++) {
f2[i][j][k]=-am7[k]*ps1[i][j];
}
}
}
/*

iroot=0;
*/
82
MPI_Reduce ((void *)&sumf1,(void *)&gsumf1, 1, MPI_DOUBLE,

MPI_SUM, iroot, comm);
MPI_Reduce ((void *)&sumf2,(void *)&gsumf2, 1, MPI_DOUBLE,
MPI_SUM, iroot, comm);
itag=40;
if (myid != 0) {
icount1 = icount*nn*km;
MPI_Send ((void *)&f2[istart][0][0], icount1, MPI_DOUBLE, iroot, itag, comm);
}
else {
istart1 = gstart[isrc];
icount1 = gcount[isrc]*nn*km;
MPI_Recv ((void *)&f2[istart1][0][0], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
}
}
if (myid == 0) {
printf( "SUMF1,SUMF2= %.5f\t%.5f\n", gsumf1, gsumf2 );
printf( " F2[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
f2[i][1][1],f2[i+5][1][1],f2[i+10][1][1],f2[i+15][1][1],
f2[i+20][1][1],f2[i+25][1][1],f2[i+30][1][1],f2[i+35][1][1]);
}
}
printf( " myid, clock time = %d\t%.5f\n", myid, clock);
MPI_Finalize();
return 0;
}
83
T5CP :
0
1
2
3
0
40
80
120
39
79
119
159
SUMF1,SUMF2= 26172.46054
-2268.89180
F2[i][1][1],i=0,159,5
0.000
-0.333 -0.295 -0.281 -0.274 -0.269 -0.266 -0.264
-0.262 -0.261 -0.260 -0.259 -0.258 -0.258 -0.257 -0.257
-0.256 -0.256 -0.255
-0.254 -0.254 -0.254
myid, clock time=0
myid, clock time=1
myid, clock time=2
myid, clock time=3
-0.255 -0.255
-0.254 -0.254
0.03366
0.03054
-0.255
-0.253
-0.255
-0.253
-0.254
-0.253
0.03195
0.03338
T5SEQ 0.090 CPU T5CP 0.033

(parallel speed up) 0.090/0.033= 2.73
84
5.3 T5DCP
T5DCP T5SEQ
mm 160 np CPU mm np
m mm/npmm np m
mm/np+1 :
#define kk
#define km
20
3
#define mm 160
#define nn
120
#define m
40
dimension (m+2) istart-1 iend+1
double f1[m+2][nn][km], f2[m+2][nn][km], hxu[m+2][nn], hxv[m+2][nn],
hmmx[m+2][nn], hmmy[m+2][nn];
double u1[m+2][nn][kk], v1[m+2][nn][kk], ps1[m+2][nn];
double d7[m+2][nn], d8[m+2][nn], d00[m+2][nn][kk], tt[mm][nn][km];
mm np f1f2 MPI_Gather
MPI_Gather :
iroot=0;
icount1= m*nn*km;
MPI_Gather((void *)&f2[istart][0][0], icount1, MPI_DOUBLE,
(void *)&tt,
icount1, MPI_DOUBLE, iroot, icomm);
:
/*
PROGRAM
T5DCP
Computing & data partition on the first dimension of multiple

dimensional arrays with -1,+1 data exchange
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
85
#define kk 20
#define km 3
#define mm 160
#define nn
120
#define m
40
main ( argc, argv)
int argc;
char **argv;
{
int
int
int
i, j, k, ka, ii, nnkk;

nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp, iroot, istartg;
itag, icount1, istart2, iend1, istartm1, iendp1;
int
MPI_Status
istat[8];
MPI_Comm
comm;
MPI_Barrier(icomm);
clock=MPI_Wtime();
startend( nproc, 1, mm, gstart, gend, gcount);
istart = 1;
iend = m;
lastp = nproc-1;
istartg = gstart[myid];
printf( "NPROC,MYID,ISTART,IEND,istartg=%d\t%d\t%d\t%d\t%d\n",
86
nproc,myid,istart,iend,istartg);
iendp1 = iend+1;
istart2 = istart;
iend1 = iend;
l_nbr = myid - 1;
r_nbr = myid + 1;
if (myid == 0)
/* for (i=0; i<mm; i++) */
ii = i + istartg -1;
u1[i][j][k]=1.0/(double) ii + 1.0/(double) (j+1) + 1.0/(double) (k+1);
}
/* for (i=0; i<mm; i++) */
v1[i][j][k]=2.0/(double) ii + 1.0/(double) (j+1) + 1.0/(double) (k+1);
}
for (j=0; j<nn; j++) {
ps1[i][j] = 1.0/(double) ii + 1.0/(double)(j+1);
hxu[i][j] = 2.0/(double) ii + 1.0/(double)(j+1);
hxv[i][j] = 1.0/(double) ii + 2.0/(double)(j+1);
hmmx[i][j] = 2.0/(double) ii + 1.0/(double)(j+1);
87
hmmy[i][j] = 1.0/(double) ii + 2.0/(double)(j+1);

}
}
for (k=0; k<kk; k++) {
am7[k] = 1.0/(double) (k+1);
}
}
/*
*/
nnkk = nn*kk;
itag = 10;
(void *)&u1[istartm1][0][0], nnkk, MPI_DOUBLE, l_nbr, itag, comm, istat);

itag = 20;
MPI_Sendrecv ((void *)&ps1[istart][0],
nn, MPI_DOUBLE, l_nbr, itag,

/*
for (i=0; i<mm; i++) { */

for (j=0; j<nn; j++) {
for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}
}
/*
for (i=0; i<mm-1; i++)
*/

for (j=1; j<nn-1; j++)
/*
for (i=1; i<mm-1; i++) */

for (j=0; j<nn-1; j++)
88
itag=30;
nn, MPI_DOUBLE, r_nbr, itag,
(void *)&d7[istartm1][0], nn, MPI_DOUBLE, l_nbr, itag, comm, istat);
/*
for (i=1; i<mm-1; i++) */

for (j=1; j<nn-1; j++)
/*
for (i=1; i<mm-1; i++) */

for (j=1; j<nn-1; j++)
sumf1=0.0;
sumf2=0.0;
/*
for (i=1; i<mm-1; i++) { */

for (j=1; j<nn-1; j++) {
for (k=0; k<km; k++) {
f2[i][j][k]=-am7[k]*ps1[i][j];
}
}
}
/*
*/

MPI_Allreduce ((void *)&sumf1, (void *)&gsumf1, 1, MPI_DOUBLE, MPI_SUM, comm);
MPI_Allreduce ((void *)&sumf2, (void *)&gsumf2, 1, MPI_DOUBLE, MPI_SUM, comm);
icount1 = m*nn*km;
iroot=0;
MPI_Gather((void *)&f2[istart][0][0], icount1, MPI_DOUBLE,
89
(void *)&tt,
if (myid == 0) {
icount1, MPI_DOUBLE, iroot, comm);

printf( " tt[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
tt[i][1][1],tt[i+5][1][1],tt[i+10][1][1],tt[i+15][1][1],
tt[i+20][1][1],tt[i+25][1][1],tt[i+30][1][1],tt[i+35][1][1]);
}
printf( " myid, clock time= %d\t%.5f\n", myid, clock);
}
MPI_Finalize();
return 0;
}
T5DCP :
NPROC,MYID,ISTART,IEND,istartg=4
0
1
40
2
1
40
1
1
40
3
1
40
SUMF1,SUMF2= 26172.46054
-2268.89180
tt[i][1][1],i=0,159,5
0.000
-0.333 -0.295 -0.281 -0.274 -0.269 -0.266 -0.264
-0.262 -0.261 -0.260 -0.259 -0.258
-0.256 -0.256 -0.255 -0.255 -0.255
-0.254 -0.254 -0.254 -0.254 -0.254
myid, clock time= 0
0.03041
myid, clock time= 1
0.02975
myid, clock time= 2
0.02992
myid, clock time= 3
-0.258
-0.255
-0.253
-0.257
-0.255
-0.253
1
81
41
121
-0.257
-0.254
-0.253
0.02978
T5SEQ 0.090 CPU T5DCP 0.030

(parallel speed up) 0.090/0.030 = 3.00 CPU
0.030 0.033
gather send/recv
90
5.4 MPI
5.3
MPI
MPI_Cart_createMPI_Cart_coordsMPI_Cart_shiftMPI_Type_vectorMPI_Type_commit
5.4.1 (Cartesian Topology)
(up)
(j)
u
p
d
o
w
n
CPU2
(0,2)
CPU5
(1,2)
CPU8
(2,2)
CPU11
(3,2)
CPU1
(0,1)
CPU4
(1,1)
CPU7
(2,1)
CPU10
(3,1)
CPU0
(0,0)
CPU3
(1,0)
CPU6
(2,0)
CPU9
(3,0)
(sideways)
A(i,j)
(i) (right)
5.3
a(mm,nn)
m mm/4 n
nn/3 mm nn 200 150
define mn
dimension :
#define
#define
#define
#define
mm 200
nn
150
m
50
n
50
91
#define ip
#define jp
4
3
double a(m+2, n+2)

a 4x3 CPU CPU CPU
5.3 a :
CPU CPU 5.3 X
(i) Y (j)
5.4.2 MPI MPI_Cart_create
MPI_Cart_coords
MPI_Cart_shift
MPI_Comm_size nproc MPI MPI_Csrt_create
5.3 :
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ip 4
#define jp 3
#define ndim 2
int
nproc, myid, r_nbr, l_nbr, t_nbr, b_nbr, comm2d, my_coord[ndim];
int
ipart[ndim], periods[ndim], sideways, updown, right, up, reorder;
MPI_Status istat[8];
MPI_Comm comm;
main ( argc, argv)
int argc;
char **argv;
{
ipart[0]=ip;
ipart[1]=jp;
92
periods[0]=0;
periods[1]=0;
reorder=1;
MPI_Cart_create(MPI_COMM_WORLD, ndim, ipart, periods, reorder, &comm2d);
.....
return 0;
}
MPI_COMM_WORLD
communicator
ndim
ipart
ipart (1)
ipart (2)
5.3 2
ndim
5.3 4
5.3 3
periods
periods (1)
ndim
1
0 5.3 0
periods (2)
1
0 5.3 0
CPU
1
reorder
comm2d
communicator
4x3 communicator comm2d

MPI_Cart_create MPI_Comm_rank CPU
MPI_Cart_create reorder 0
CPU CPU
communicator comm2d myid

MPI_Comm_rank communicator comm2d CPU idMPI_Comm_ank
:
MPI_Comm_rank (comm2d, &myid)
comm2d
myid
communicator
communicator comm2d CPU id
myid 5.3 CPU0CPU1CPU2 MPI_Cart_coords

93
CPU CPU my_coord

MPI_Cart_coords :
MPI_Cart_coords (comm2d, myid, ndim, my_coord)
communicator
communicator comm2d CPU id
5.3 2
ndim myid CPU CPU
comm2d
myid
ndim
my_coord
my_coord(1)
my_coord (2)
CPU my_coord 5.3 CPU0CPU1CPU2 CPU id
5.3 my_coord (0) CPU CPU my_coord
(0) ip-1 CPU CPU my_coord (1) CPU
CPU () my_coord (1) jp-1 CPU CPU
()
MPI_Cart_shift CPU CPU id
MPI_ Cart_shift :
int
sideways, updown, right, up
sideways=0;
updown=1;
right=1;
up=1;
MPI_ Cart_shift (comm2d, sideways, right, &l_nbr, &r_nbr);
comm2d
sideways
right
l_nbr
r_nbr
communicator
( i )

CPU CPU id
CPU CPU id
MPI_ Cart_shift (comm2d, updown, up, &b_nbr, &t_nbr);
updown
( j )
94
up
b_nbr
t_nbr
CPU CPU id
CPU CPU id
l_nbr, r_nbr, b_nbr, t_nbr left_neighborright_neighborbotton_neighbor

top_neighbor
5.4.3 MPI
MPI_Type_vector
MPI_Type_commit
(up)
(j)
n CPU2
. (0,2)
1xxxxx
u
p
d
o
w
n
n
.
1
CPU1
(0,1)
n
y
. CPU0 y
1 (0,0) y
1....m
CPU5
(1,2)
xxxxx
CPU8
(2,2)
xxxxx
CPU11
(3,2)
xxxxx
CPU4
(1,1)
CPU7
(2,1)
CPU10
(3,1)
y
CPU3 y
(1,0) y
1....m
(sideways)
y
CPU6 y
(2,0) y
1....m
a(i,j)
y
CPU9 y
(3,0) y
1....m
(i) (right)
5.4
a(mm,nn) C
a index j index i 5.4
CPU i=1 j 1 ni=2 j 1 n i=m j 1 n
i-1 i+1 5.4 y y y y
j-1 j+1 5.4 x x x x dimension
95
a(m,n) x n
CPU x x x x MPI_Type_vector
MPI_Type_commit
MPI_ Type_vector MPI_ Type_commit :
MPI_ Type_vector (count, blocklen, stride, oldtype, &newtype);
MPI_ Type_commit (&newtype);
count
blocklen
stride
oldtype
newtype
5.4 x x x x :
MPI_ Type_vector (m, 1, n, MPI_REAL, &vector2d);
MPI_ Type_commit (&vector2d);
x x x x vector2d
96
5.5 T5_2D
T5SEQ :
#define kk
20
#define km
3
#define mm 160
#define nn
120
main ()
{
double clock, sumf1, sumf2;
5.4 mm
nn 5.5 mm
nn mm nn ip 4 jp 2 m=mm/ip
n=nn/jp 5.5 :
#include <stdlib.h>
#include <mpi.h>
#define
#define
#define
#define
#define
kk
20
km
3
mm 160
nn
120
m
40
#define n
#define ip
#define jp
60
4
2
double f1[m+2][n+2][km], f2[m+2][n+2][km], hxu[m+2][n+2],

hxv[m+2][n+2], hmmx[m+2][n+2], hmmy[m+2][n+2];
97

int
nproc, myid, r_nbr, l_nbr, t_nbr, b_nbr, comm2d, my_coord[2];
int
ipart[2], periods[2], reorder, sideways, updown, right, up, icomm;
main ( argc, argv)
int argc;
char **argv;
{
double u1[m+2][n+2][kk], v1[m+2][n+2][kk], ps1[m+2][n+2];
double d7[m+2][n+2], d8[m+2][n+2], d00[m+2][n+2][kk], tt[mm][nn][km];
dimension [m+2][ n+2]
up
Index j
d8(m+2, n+2),
ps1(m+2, n+2)
u1(m+2, n+2, kk),

n
.
1
cpu1
(0,1)
xxxxxxx
n
.
1
cpu0
(0,0)
xxxxxxx
1.......m
v1(m+2, n+2, kk)

cpu3
(1,1)
xxxxxxx
cpu2
(1,0)
xxxxxxx
1.......m
cpu5
(2,1)
xxxxxxx
cpu7
(3,1)
xxxxxxx
cpu4
(2,0)
xxxxxxx
1.......m
cpu6
(3,0)
xxxxxxx
1.......m
Index i
5.5
(sideways) (right)
5.4 MPI 4x2 :
98
nbr2d()
ipart[0]=ip;
ipart[1]=jp;
periods[0]=0;
periods[1]=0;
reorder=1;
sideways=0;
updown=1;
right=1;
up=1;
MPI_Cart_create(MPI_COMM_WORLD, 2, ipart, periods, reorder, &comm2d);
MPI_Comm_rank( comm2d,&myid);
MPI_Cart_coords( comm2d, myid, 2, my_coord);
MPI_Cart_shift( comm2d, sideways, right, &l_nbr, &r_nbr);
MPI_Cart_shift( comm2d, updown, up, &b_nbr, &t_nbr);
printf(" myid,coord,l,r,t,b_nbr=%d\t%d\t%d\t%d\t%d\t%d\t%d\n",
myid,my_coord[0],my_coord[1],l_nbr,r_nbr,t_nbr,b_nbr);
}
CPU 5.5 x x x x
m vector2d n+2 1
vector3D kk*(n+2) kk
MPI :
n2=n+2;
MPI_Type_vector(m, 1, n2, MPI_DOUBLE, &vector2d);
MPI_Type_commit (&vector2d);
n2kk=n2*kk;
MPI_ Type_vector (m, kk, n2kk, MPI_DOUBLE, &vector3d);
MPI_ Type_commit (&vector3d);
x x x x vector2d
vector3d 1j for loop 1 n
jstart=1;
jend=n;
jstartm1=jstart-1;
jendp1=jend+1;
99
CPU 5.6 CPU ( x )

jstartm1 (x ) MPI_Sendrecv :
MPI_Sendrecv((void *)& ps1[istart][jend], 1, vector2d, t_nbr, itag,
(void *)&ps1[istart][jstartm1], 1, vector2d, b_nbr, itag, comm2d, istat);
MPI_Sendrecv ((void *)&v1[istart][jend][0], 1, vector3d, t_nbr, itag,
(void *)&v1[istart][jstartm1][0], 1, vector3d, b_nbr, itag, comm2d, istat);
up
Index j
d8(m+2, n+2),
ps1(m+2, n+2)
u1(m+2, n+2, kk),
jend
.
jstart
jend
.
jstart
v1(m+2, n+2, kk)
yyyyyyy
xxxxxxx
cpu1
yyyyyyy
xxxxxxx
yyyyyyy
xxxxxxx
cpu3
yyyyyyy
xxxxxxx
yyyyyyy
xxxxxxx
cpu5
yyyyyyy
xxxxxxx
yyyyyyy
xxxxxxx
cpu7
yyyyyyy
xxxxxxx
yyyyyyy
yyyyyyy
yyyyyyy
yyyyyyy
xxxxxxx
cpu0
yyyyyyy
xxxxxxx
xxxxxxx
cpu2
yyyyyyy
xxxxxxx
xxxxxxx
cpu4
yyyyyyy
xxxxxxx
xxxxxxx
cpu6
yyyyyyy
xxxxxxx
1.......m
1.......m
1.......m
1.......m
Index i
5.6
(sideways) (right)
CPU (y ) jendp1
(y ) MPI_Sendrecv :
MPI_ Sendrecv ((void *)&ps1[istart][ jstart], 1, vector2d, b_nbr, itag,
(void *)&ps1[istart][ jendp1], 1, vector2d, t_nbr, itag, comm2d, istat);
MPI_ Sendrecv ((void *)&v1[istart][jstart][0],
(void *)&v1[istart][jendp1][0],
1, vector3d, b_nbr, itag,

1, vector3d, t_nbr, itag, comm2d, istat);
100
i for loop 1 m
istart=1;
iend=m;
istartm1=istart-1
iendp1=iend+1
CPU 5.3 T5DCP
CPU iendp1
MPI_Sendrecv :
MPI_ Sendrecv ((void *)&ps1[istart][ jstart], n, MPI_DOUBLE, l_nbr, itag,
(void *)&ps1[iendp1][ jstart], n, MPI_DOUBLE, r_nbr, itag, comm2d, istat);
CPU istartm1
MPI_ Sendrecv :
n2kk=(n+2)*kk
MPI_ Sendrecv ((void *)&u1[iend][jstart][0],
n2kk, MPI_DOUBLE, r_nbr, itag,
(void *)&u1[istartm1][jstart][0], n2kk, MPI_DOUBLE, l_nbr, itag,
comm2d, istat);
tt dimension f1f2 dimension MPI_Gather MPI_Gatherv
MPI_SendMPI_Recv CPU
copy1 f2 tt
double
double
tt[mm][nn][km];
f1[m+2][n+2][km], f2[m+2][n+2][km];
T5_2D :
/*
PROGRAM T5_2D
Computing & data partition on the first 2 dimensions of multiple
dimensional arrays with -1,+1 data exchange */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
101
#define kk
#define km
#define
#define
#define
#define
20
3
mm 160
nn
120
m
40
n
60
#define ip
#define jp
#define np
4
2
8
double f1[m+2][n+2][km], f2[m+2][n+2][km], hxu[m+2][n+2],

hxv[m+2][n+2], hmmx[m+2][n+2], hmmy[m+2][n+2];
int
int
nproc, myid, myid_i, myid_j, lastp, lastp_i, lastp_j,

r_nbr, l_nbr, t_nbr, b_nbr, my_coord[2], g_coord[np][2];
ipart[2], periods[2], reorder, sideways, updown, right, up;
int
istart,iend, istart2, iend1, istartm1, iendp1;
int
jstart,jend, jstart2, jend1, jstartm1, jendp1;
int
istartg[16], iendg[16], jstartg[16], jendg[16];
MPI_Comm
comm2d;
MPI_Status
istat[8];
MPI_Datatype vector2d, vector3d;
main ( argc, argv)
int argc;
char **argv;
{
double u1[m+2][n+2][kk], v1[m+2][n+2][kk], ps1[m+2][n+2];
double d7[m+2][n+2], d8[m+2][n+2], d00[m+2][n+2][kk];
double clock, sumf1, sumf2, gsumf1, gsumf2, tt[mm][nn][km];
int
i, j, k, ka, ii, jj, n2, n2kk;
int
itag, isrc, idest, iroot, ig, jg, count;
if (nproc != np) {
102
if (myid == 0)
printf(" nproc not equal to np=%d ", np, "program will stop");
MPI_Finalize();
return 0;
}
nbr2d();
MPI_Barrier(comm2d);
clock=MPI_Wtime();
MPI_Gather ((void *)&my_coord, 2, MPI_INTEGER,
(void *)&g_coord, 2, MPI_INTEGER, 0, comm2d);
startend( ip, 1, mm, istartg, iendg);
startend( jp, 1, nn, jstartg, jendg);
istart = 1;
iend = m;
jstart = 1;
jend = n;
myid_i=my_coord[0];
myid_j=my_coord[1];
ig = istartg[myid_i];
jg = jstartg[myid_j];
lastp_i=ip-1;
lastp_j=jp-1;
printf( "NPROC,MYID,ISTART,IEND,ig,jg=%d\t%d\t%d\t%d\t%d\t%d\n",
nproc, myid, istart, iend, ig, jg);
iendp1 = iend+1;
jstartm1 = jstart-1;
jendp1 = jend+1;
istart2 = istart;
if (myid_i == 0) istart2 = 2;
jstart2 = jstart;
if (myid_j == 0) jstart2 = 2;
iend1 = iend;
if (myid_i == lastp_i ) iend1 = iend-1;
jend1 = jend;
103
if (myid_j == lastp_j ) jend1 = jend-1;

/*
*/
/* for (i=0; i<mm; i++) */
ii = i + ig -1;
*/
for (j=jstart; j<=jend; j++) {
jj = j + jg -1;
/*

u1[i][j][k]=1.0/(double) ii + 1.0/(double) jj + 1.0/(double) (k+1);
}
}
/* for (i=0; i<mm; i++) */
ii = i + ig -1;
*/
jj = j + jg -1;
/*

v1[i][j][k]=2.0/(double) ii + 1.0/(double) jj + 1.0/(double) (k+1);
}
}
ii = i + ig -1;
/*
for (j=0; j<nn; j++) {
*/
jj = j + jg -1;
ps1[i][j] = 1.0/(double) ii + 1.0/(double) jj;
hxu[i][j] = 2.0/(double) ii + 1.0/(double) jj;
hxv[i][j] = 1.0/(double) ii + 2.0/(double) jj;
hmmx[i][j] = 2.0/(double) ii + 1.0/(double) jj;
hmmy[i][j] = 1.0/(double) ii + 2.0/(double) jj;
}
}
104
for (k=0; k<kk; k++) {

am7[k] = 1.0/(double) (k+1);
}
}
/*

*/
n2
= n+2;
n2kk = n2*kk;
MPI_Type_vector(m, kk, n2kk, MPI_DOUBLE, &vector3d );
MPI_Type_commit(&vector3d );
MPI_Type_vector(m, 1, n2, MPI_DOUBLE, &vector2d );
MPI_Type_commit(&vector2d );
itag = 10;
MPI_Sendrecv ((void *)&u1[iend][jstart][0],
n2kk, MPI_DOUBLE, r_nbr, itag,
(void *)&u1[istartm1][jstart][0],n2kk, MPI_DOUBLE, l_nbr, itag,
comm2d, istat);
itag = 20;
MPI_Sendrecv ((void *)&ps1[istart][jstart], n, MPI_DOUBLE, l_nbr, itag,
(void *)&ps1[iendp1][jstart], n, MPI_DOUBLE, r_nbr, itag,
comm2d, istat);
itag = 30;
MPI_Sendrecv ((void *)&v1[istart][jend][0],
1, vector3d, t_nbr, itag,
(void *)&v1[istart][jstartm1][0], 1, vector3d, b_nbr, itag,
comm2d, istat);
for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}
}
/* for (i=0; i<mm-1; i++)
*/
/* for (j=0; j<nn-1; j++)
*/
for (j=jstart; j<=jend1; j++)
105

/* for (i=1; i<mm-1; i++)
*/
/* for (j=0; j<nn-1; j++)
*/
for (j=jstart; j<=jend1; j++)
itag=50;
MPI_Sendrecv ((void *)&d7[iend][jstart],
n, MPI_DOUBLE, r_nbr, itag,
(void *)&d7[istartm1][jstart],n, MPI_DOUBLE, l_nbr, itag,

comm2d, istat);
itag=60;
MPI_Sendrecv((void *)&d8[istart][jend],
1, vector2d, t_nbr, itag,
(void *)&d8[istart][jstartm1],1, vector2d, b_nbr, itag,

comm2d, istat);
/* for (i=1; i<mm-1; i++) */
/* for (j=1; j<nn-1; j++) */
for (j=jstart2; j<=jend1; j++)
/* for (i=1; i<mm-1; i++)
*/

/* for (j=1; j<nn-1; j++) */
for (j=jstart2; j<=jend1; j++)
sumf1=0.0;
sumf2=0.0;
/* for (i=1; i<mm-1; i++) { */
/*
for (j=1; j<nn-1; j++) {
*/
for (j=jstart2; j<=jend1; j++) {
106
for (k=0; k<km; k++) {

f2[i][j][k]=-am7[k]*ps1[i][j];
}
}
}
MPI_Allreduce ((void *)&sumf1, (void *)&gsumf1, 1, MPI_DOUBLE, MPI_SUM, comm2d);
MPI_Allreduce ((void *)&sumf2, (void *)&gsumf2, 1, MPI_DOUBLE, MPI_SUM, comm2d);
count=km*(n+2)*(m+2);
iroot = 0;
itag = 70;
if (myid > 0)
MPI_Send ( (void *)&f2, count, MPI_DOUBLE, iroot, itag, comm2d);
else {
copy1(myid, tt);
for (isrc=1; isrc < nproc; isrc++) {
MPI_Recv ((void *)&f2, count, MPI_DOUBLE, isrc, itag, comm2d, istat);
copy1 (isrc, tt);
}
}
if (myid == 0) {
printf( "sumf1,sumf2= %.5f\t%.5f\n", gsumf1, gsumf2 );
printf( " tt[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
tt[i][1][1],tt[i+5][1][1],tt[i+10][1][1],tt[i+15][1][1],
tt[i+20][1][1],tt[i+25][1][1],tt[i+30][1][1],tt[i+35][1][1]);
}
printf( " myid, clock time= %d\t%.5f\n", myid, clock);
}
MPI_Finalize();
return 0;
}
107
nbr2d()
{
ipart[0]=ip;
ipart[1]=jp;
periods[0]=0;
periods[1]=0;
reorder=1;
sideways=0;
updown=1;
right=1;
up=1;
MPI_Cart_create(MPI_COMM_WORLD, 2, ipart, periods, reorder, &comm2d);
MPI_Comm_rank( comm2d,&myid);
MPI_Cart_coords( comm2d, myid, 2, my_coord);
MPI_Cart_shift( comm2d, sideways, right, &l_nbr, &r_nbr);
MPI_Cart_shift( comm2d, updown, up, &b_nbr, &t_nbr);
printf(" myid,coord,l,r,t,b_nbr=%d\t%d\t%d\t%d\t%d\t%d\t%d\n",
myid,my_coord[0],my_coord[1],l_nbr,r_nbr,t_nbr,b_nbr);
return 0;
}
copy1(int id, double tt[mm][nn][km])
{
/*
copy partitioned array f2 to global array tt

int
i, j, k, ii, jj, ig, jg;
ii=g_coord[id][0];
*/
jj=g_coord[id][1];
for (i=1; i<=m; i++) {
ig=istartg[ii]+i-2;
for (j=1; j<=n; j++) {
jg=jstartg[jj]+j-2;
tt[ig][jg][k] = f2[i][j][k];
}
}
return 0;
}
108
T5_2D :
myid,coord,l,r,t,b_nbr=0
0
0
-3 2
1
-3
0
1
-3 3
-3 0
1
0
0
4
3
-3
1
2
2
3
1
0
1
0
1
2
3
4
5
6
7
-3
-3
5
-3
7
2
-3
4
-3
3
1
5
-3 -3 6
sumf1,sumf2 = 26172.46985
-2268.89180
tt[i][1][1],i=0,159,5
0.000
-0.333 -0.295 -0.281 -0.274 -0.269
-0.266
-0.264
-0.262
-0.256
-0.254
-0.257
-0.255
-0.253
-0.257
-0.254
-0.253
-0.261
-0.256
-0.254
-0.260
-0.255
-0.254
-0.259
-0.255
-0.254
myid, clock time= 3

myid, clock time= 2
myid, clock time= 0
myid, clock time= 1
0.02869
0.02974
0.03281
0.02929
myid, clock time= 7

myid, clock time= 4
myid, clock time= 5
myid, clock time= 6
0.01998
0.02046
0.02030
0.02039
-0.258
-0.255
-0.254
-0.258
-0.255
-0.253
t5seq 0.090 CPU t5_2d 0.033

(parallel speed up) 0.090/0.033 = 2.7 CPU
0.030 CPU 0.033
109
MPI
Nonblocking blocking
110
6.1
Nonblocking
MPI_SendMPI_Recv Blocking
MPI_Send CPU Buffer is empty MPI_Send
MPI_Recv CPU
Buffer is full MPI_Recv
6.1 Blocking Send/Recv
Processor 0
User mode
kernel mode
MPI_Send
sendbuf
sysbuf
CPU idled
copy
sendbuf
to
sysbuf
Now sendbuf can be reused
Processor 1
User mode
kernel mode
MPI_Recv
recvbuf
CPU idled
sysbuf
copy
sysbuf
to
recvbuf
Now recvbuf contains valid data
6.1 Blocking Send/Recv

Nonblocking MPI_IsendMPI_Irecv
MPI_Wait
MPI_Isend MPI_Irecv MPI_Wait
6.2 Nonblocking Send/Recv
111
Processor 0
User mode
kernel mode
MPI_Isend
computation
sendbuf
sysbuf
copy
sendbuf
to
sysbuf
Now sendbuf can be reused

MPI_Wait
Processor 1
User mode
kernel mode
MPI_Irecv
computation
recvbuf
sysbuf
copy
sysbuf
to
recvbuf
Now recvbuf contains valid data

MPI_Wait
6.2 Nonblocking Send/Recv

MPI_Isend
MPI_Isend ((void *)&data, count, DATATYPE, dest, tag, MPI_COMM_WORLD, request);
data
count
DATATYPE
dest
CPU id
tag
MPI_COMM_WORLD communicator
request
112
MPI_Irecv
MPI_Irecv ((void *)&data, count, DATATYPE, src, tag, MPI_COMM_WORLD, request);
data
count
DATATYPE
dest
tag
MPI_COMM_WORLD
request
CPU id
communicator
MPI_Wait
MPI_Wait (request, istat);
request
Istat
MPI_Isend MPI_Irecv
request
5.3 T5DCP MPI_Sendrecv MPI_Isend

MPI_Irecv ps1 u1 ps1 for loop
MPI_Wait ps1
ps1 for loop MPI_Wait
u1 MPI_Wait u1
d7 for loop MPI_Sendrecv
Nonblocking T5DCP T6DCP
/*
PROGRAM
T6DCP
Computing & data partition on the first dimension of multiple

dimensional arrays with -1,+1 boundary data exchange using non-blocking send/recv
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
113
#define kk
#define km
#define
#define
#define
#define
20
3
mm 160
nn
120
m
40
np
4

main ( argc, argv)
int argc;
char **argv;
{
int
i, j, k, ka, ii, nnkk;
int
nproc, myid, istart, iend, icount, r_nbr, l_nbr, lastp, iroot, istartg;
int
itag, istart2, iend1, istartm1, iendp1;
int
gstart[16],gend[16],gcount[16];
MPI_Status
istat[8];
MPI_Comm
comm;
MPI_Request
requ1, reqps1;

MPI_Barrier(comm);
clock=MPI_Wtime();
startend( nproc, 1, mm, gstart, gend, gcount);
istart = 1;
iend = m;
icount = m;
114
lastp = nproc-1;
istartg = gstart[myid];
printf( "NPROC,MYID,ISTART,IEND,istartg=%d\t%d\t%d\t%d\t%d\n",
nproc, myid, istart, iend, istartg);
iendp1 = iend+1;
istart2 = istart;
iend1 = iend;
l_nbr = myid - 1;
r_nbr = myid + 1;
if (myid == 0) l_nbr = MPI_PROC_NULL;
/*
*/
/* for (i=0; i<mm; i++) */
u1[i][j][k]=1.0/(double) ii + 1.0/(double) (j+1) + 1.0/(double) (k+1);
}
/* for (i=0; i<mm; i++) */
v1[i][j][k]=2.0/(double) ii + 1.0/(double) (j+1) + 1.0/(double) (k+1);
}
for (j=0; j<nn; j++) {
115
ps1[i][j] = 1.0/(double) ii + 1.0/(double)(j+1);

hxu[i][j] = 2.0/(double) ii + 1.0/(double)(j+1);
hxv[i][j] = 1.0/(double) ii + 2.0/(double)(j+1);
hmmx[i][j] = 2.0/(double) ii + 1.0/(double)(j+1);
hmmy[i][j] = 1.0/(double) ii + 2.0/(double)(j+1);
}
}
for (k=0; k<kk; k++) {
am7[k] = 1.0/(double) (k+1);
}
}
/*
*/
nnkk = nn*kk;
itag = 10;
/*
MPI_Sendrecv ((void *)&u1[iend][0][0], nnkk, MPI_DOUBLE, r_nbr,itag,

(void *)&u1[istartm1][0][0],nnkk, MPI_DOUBLE, l_nbr, itag, comm, istat);
*/
MPI_Isend ((void *)&u1[iend][0] [0],
nnkk, MPI_DOUBLE, r_nbr, itag, comm, &requ1)
MPI_Irecv ((void *)&u1[istartm1][0] [0], nnkk, MPI_DOUBLE, l_nbr, itag, comm, &requ1);
itag = 20;
/* MPI_Sendrecv ((void *)&ps1[istart][0], nn, MPI_DOUBLE, l_nbr, itag,
*/
MPI_Isend ((void *)&ps1[istart][0], nn, MPI_DOUBLE, l_nbr, itag, comm, &reqps1)
MPI_Irecv ((void *)&ps1[iendp1][0], nn, MPI_DOUBLE, r_nbr, itag, comm, &reqps1);
/* for (i=0; i<mm; i++) { */
for (j=0; j<nn; j++) {
for (k=0; k<km; k++) {
f1[i][j][k]=0.0;
f2[i][j][k]=0.0;
}
}
}
116
/*
for (i=1; i<mm-1; i++) */

for (j=0; j<nn-1; j++)
MPI_Wait (&reqps1, istat);
/*
for (i=0; i<mm-1; i++) */

for (j=1; j<nn-1; j++)
MPI_Wait (&requ1, istat);
itag=30;

/* for (i=1; i<mm-1; i++)
*/

for (j=1; j<nn-1; j++)
/* for (i=1; i<mm-1; i++) */
for (j=1; j<nn-1; j++)
sumf1=0.0;
sumf2=0.0;
/* for (i=1; i<mm-1; i++)
{ */

for (j=1; j<nn-1; j++) {
for (k=0; k<km; k++) {
f2[i][j][k]=-am7[k]*ps1[i][j];
117
}
}
}
/*
*/
MPI_Allreduce ((void *)&sumf1,(void *)&gsumf1, 1, MPI_DOUBLE, MPI_SUM, comm);
MPI_Allreduce ((void *)&sumf2,(void *)&gsumf2, 1, MPI_DOUBLE, MPI_SUM, comm);
icount1 = m*nn*km;
iroot=0;
MPI_Gather((void *)&f2[istart][0][0],icount1,MPI_DOUBLE,
(void *)&tt,
icount1,MPI_DOUBLE, iroot, comm);
if (myid == 0) {
printf( " tt[i][1][1],i=0,159,5\n");
for (i = 0; i < mm; i+=40) {
tt[i][1][1],tt[i+5][1][1],tt[i+10][1][1],tt[i+15][1][1],
tt[i+20][1][1],tt[i+25][1][1],tt[i+30][1][1],tt[i+35][1][1]);
}
printf( " myid, clocktime= %d\t%.5f\n", myid, clock);
}
MPI_Finalize();
return 0;
}
T6DCP CPU :
ATTENTION: 0031-408
4 tasks allocated by LoadLeveler, continuing...
3
0
2
1
SUMF1,SUMF2= 26172.46054
-2268.89180
F2[i][1][1],i=1,160,5
1
1
1
40
40
40
121
1
81
40
41
118
0.000
-0.262
-0.333 -0.295 -0.281 -0.274 -0.269 -0.266 -0.264

-0.261 -0.260 -0.259 -0.258 -0.258 -0.257 -0.257
-0.256 -0.256 -0.255 -0.255 -0.255

-0.254 -0.254 -0.254 -0.254 -0.254
myid, clock time= 0
0.02943
myid, clock time= 1
0.02882
myid, clock time= 2
myid, clock time= 3
-0.255
-0.253
-0.255
-0.253
-0.254
-0.253
0.02873
0.02895
CPU MPI_Sendrecv T5DCP 0.030

CPU MPI_IsendMPI_Irecv MPI_Wait T6DCP 0.029
Nonblocking Blocking
119
6.2
CPU
CPU MPI_Pack buffer
CPU buffer MPI_Unpack
buffer
MPI_Sendrecv iend
istartm1 :
itag=110
MPI_Sendrecv ((void *)&ps1[iend][0],
(void *)&ps1[istartm1][0], n, MPI_DOUBLE, l_nbr, itag, comm, istat);
itag=120
MPI_Sendrecv ((void *)&ps2[iend][0],
(void *)&PS2[istartm1][0], n, MPI_DOUBLE, l_nbr, itag, comm, istat);
MPI_Pack iend buf1
MPI_Sendrecv buf1 r_nbr buf2 MPI_Unpack
buf2 istartm1
#define n 120
#define bufsize n*2*8
char
buf1[bufsize], buf2[bufsize];
int
ipos, itag, icount, l_nbr, r_nbr;
MPI_Comm
MPI_Status
comm;
istat[8];
MPI_Barrier (comm);
ipos=0
MPI_Pack ( (void *)&ps1[iend][0], n, MPI_DOUBLE, (void *)&buf1, bufsize, &ipos, comm);
MPI_Pack ( (void *)&ps2[iend][0], n, MPI_DOUBLE, (void *)&buf1, bufsize, &ipos, comm);
itag=120;
MPI_Sendrecv ((void *)&buf1, bufsize, MPI_CHAR, r_nbr, itag,
(void *)&buf2, bufsize, MPI_CHAR, l_nbr, itag, comm, istat);
if (myid > 0) {
ipos=0;
120
MPI_Unpack ( (void *)&buf2, bufsize, &ipos,

(void *)&ps1[istartm1][0], n, MPI_DOUBLE, comm);
MPI_Unpack ( (void *)&buf2, bufsize, &ipos,
(void *)&ps2[istartm1][0], n, MPI_DOUBLE, comm);
}
packunpack
double buf1[2][n], buf2[2][n];
for (j=0; j<n; j++) {
buf1[0][j]=ps1[iend][j];
buf1[1][j]=ps2[iend][j];
}
icount=n*2;
itag=120;
MPI_Sendrecv ((void *)&buf1, icount, MPI_DOUBLE, r_nbr, itag,
(void *)&buf2, icount, MPI_DOUBLE, l_nbr, itag, comm, istat);
if (myid > 0) {
for (j=0; j<n; j++) {
ps1[istartm1][j]=buf2[0][j];
ps2[istartm1][j]=buf2[1][j];
}
}
MPI_Sendrecv MPI_Sendrecv
n 6.3 6.4 IBM
SP2IBM SP2 SMPHP SPP2200 Fujitsu VPP300 CPU
1 MB
1 MB
MPI_Type_struct
7.1
121
point-to-point message passing test on IBM

SP2
35
30
25
Mbytes/s
IBM SP2_160 us
IBM SP2_120 us
IBM SP2_160
IBM SP2_120
20
15
10
16M
8M
4M
2M
1M
512K
256K
128K
64K
32K
16K
8K
4K
2K
1K
512
256
128
64
32
16
messge length, bytes

6.3
IBM SP2 CPU
122
point-to-point communication test on different

computers
1000
900
800
600
500
Fujistu VPP300
HP SPP2000
IBM SP2_375
IBM SP2_160
400
300
200
16M
8M
4M
2M
1M
512K
messge length, bytes
256K
128K
64K
32K
16K
8K
4K
2K
1K
512
256
128
64
32
16
100
8
Mbytes/s
700
6.4 CPU
123
6.3
CPU CPU
ps2 u2 i-1i+1
ps2 d1 i-1
d1 i-1
for (i=istart; i<=iend1; i++) {
for (j=0; j<jend1; j++) {
d1[i][j]=(ps2[i+1][j]+ps2[i][j])*HXU[i][j]*0.50;
d2[i][j]=(ps2[i][j+1]+ps2[i][j])*HXV[i][j]*0.50;
}
}
MPI_Sendrecv((void *)&d1[iend][0],
for (j=1; j<n1; j++)
d11[i][j][k]= (d1[i][j]*u2[i][j][k]-d1[i-1][j]*u2[i-1][j][k])*hmmx[i][j]
+ (d2[i][j]*v2[i][j][k]-d2[i][j-1]*v2[i][j-1][k])*hmmy[i][j];
ps2 u2 i-1i+1 i-1

d1
for (i=istartm1; i<=iend1; i++) {
for (j=0; j<jend1; j++) {
d1[i][j]=(ps2[i+1][j]+ps2[i][j])*HXU[i][j]*0.50;
d2[i][j]=(ps2[i][j+1]+ps2[i][j])*HXV[i][j]*0.50;
}
}
124

for (j=1; j<n1; j++)
d11[i][j][k]= (d1[i][j]*u2[i][j][k]-d1[i-1][j]*u2[i-1][j][k])*hmmx[i][j]
+ (d2[i][j]*v2[i][j][k]-d2[i][j-1]*v2[i][j-1][k])*hmmy[i][j];
RFS CHEF u1v1t1q1ps1u3v3

t3q3ps3wp1 MPI_PackMPI_Unpack
125
6.4
MPI CPU
MPI_ScatterMPI_Scatterv MPI_Bcast CPU
CPU
CPU
MPI_GatherMPI_Ggatherv CPU
CPU CPU
6.4.1
bcd np input.11
input.12input.13 . . .
/*
PROGRAM PIOSEQ
#include <stdio.h>
*/
#include <stdlib.h>
#define mm
200
#define np
4
#define m
50
main ()
{
double
int
FILE
char
/*
suma, a[mm], b[mm], c[mm], d[mm];

i,j, iu, size, ip, istart, iend;
*fp;
string[10];
test data generation and write to file
'input.dat' */
for (i = 0; i < mm; i++) {

j=i+1;
b[i] = 3. / (double) j + 1.0;
c[i] = 2. / (double) j + 1.0;
d[i] = 1. / (double) j + 1.0;
}
126
fwrite( (void *)&b, sizeof(b), 1, fp );

fwrite( (void *)&c, sizeof(c), 1, fp );
fwrite( (void *)&d, sizeof(d), 1, fp );
fclose( fp );
/*
prepare parallel input data for np process
*/
for (ip=0; ip<np; ip++) {
iu=11+ip;
sprintf(string, "input.%d", iu);
fp = fopen(string, "w");
startend ( ip, np, 0, mm-1, &istart, &iend);
size = (iend-istart+1)*sizeof(double);
fwrite ((void *)&b[istart], size, 1, fp);
fwrite ((void *)&c[istart], size, 1, fp);
fwrite ((void *)&d[istart], size, 1, fp);
fclose( fp );
}
/*
sequential processing
*/
fclose( fp );
suma = 0.;
for (i = 0; i < mm; i++) {
a[i] = b[i] + c[i] * d[i];
suma += a[i];
}
for (i = 0; i < mm; i+=40) {
}
printf( "sum of array A=%f\n", suma);
exit(0);
127
}
startend(int myid,int nproc,int is1,int is2,int* istart,int* iend)
{
int ilength, iblock, ir;
ilength=is2-is1+1;
if(myid < ir) {
*istart=is1+myid*(iblock+1);
*iend=*istart+iblock;
}
else {
*istart=is1+myid*iblock+ir;
*iend=*istart+iblock-1;
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
double a[mm][nn][kk], b[mm][nn][kk], c[mm][nn][kk], d[mm][nn][kk];

size = (iend-istart+1)*nn*kk*sizeof(double);
fwrite ((void *)&b[istart][0][0], size, 1, fp);
fwrite ((void *)&c[istart] [0][0], size, 1, fp);
fwrite ((void *)&d[istart] [0][0], size, 1, fp);
CPU
/*
PROGRAM PIODCP
Each processor read its own data from individual file
*/
#include <stdio.h>
#include <stdlib.h>
128
#include <mpi.h>
#define mm 200
#define np
#define m
4
50
main ( argc, argv)

int argc;
char **argv;
{
char
int
FILE
double
string[10];
i, j, k, iu, size;
*fp;
a[m], b[m], c[m], d[m], t[mm], suma, sumall;
int
nproc, myid, istart, iend, iroot, idest;
MPI_Comm
comm;
istart=0;
iend=m-1;
/*
READ INPUT DATA and DISTRIBUTE INPUT DATA
*/
if(nproc != np) {
printf( "nproc not equal to np= %d\t%d\t",nproc, np);
printf(" program will stop");
MPI_Finalize();
return 0;
}
iu=11+myid;
sprintf(string, "input.%d", iu);
fp = fopen(string, "r");
129
size = m*sizeof(double);
fread ((void *)&b[istart], size, 1, fp);
fread ((void *)&c[istart], size, 1, fp);
fread ((void *)&d[istart], size, 1, fp);
fclose( fp );
/*
COMPUTE, GATHER COMPUTED DATA, and WRITE OUT the RESULT
*/
suma=0.0;
/* for(i=0; i<ntotal; i++) {
*/
for(i=0; i<m; i++) {

a[i]=b[i]+c[i]*d[i];
suma=suma+a[i];
}
idest=0;
MPI_Gather((void *)&a,m,MPI_DOUBLE, (void *)&t, m, MPI_DOUBLE, idest, comm);
MPI_Reduce((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, idest, comm);
If (myid == 0) {
for (i = 0; i < mm; i+=40) {
}
printf( "sum of array A=%f\n",sumall);
}
sprintf(string, "output.%d", iu);
fp = fopen(string, "w");
fwrite ((void *)&b[istart], size, 1, fp);
fclose( fp );
MPI_Finalize();
return 0;
}
size = m*nn*kk*sizeof(double);
fread ((void *)&b[istart][0][0], size, 1, fp);
fread ((void *)&c[istart] [0][0], size, 1, fp);
fread ((void *)&d[istart] [0][0], size, 1, fp);
130

CPU CPU (load
balance) CPU
CPU (local
disk) CPU CPU
CPU
system CPU input.xx
CPU /var/tmp CPU /var/tmp input.xx
#define
#define
#define
#include
mm 200
np
4
m
50
<mpi.h>
double
char
int
a[m+2], b[m+2], c[m+2], d[m+2], tt[mm];

fname[30], cmd[30];
nproc, myid, istart, iend, i, iu;
MPI_Init();
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
. . . . . .
iu=11+myid;
sprintf(cmd, cp input.%d, iu, /var/tmp);
system (cmd);
sprintf(fname, /var/tmp/input.%d, iu);
fp = fopen(fname, "r");
fread ((void *)&b[1], size, 1, fp);
fread ((void *)&c[1], size, 1, fp);
fread ((void *)&d[1], size, 1, fp);
fclose( fp );
131
6.4.2
CPU CPU
A output.xx
#define
#define
#define
#include
mm 200
np
4
m
50
<mpi.h>
double
char
int
a[m], b[m], c[m], d[m], tt[mm];

fname[10];
nproc, myid, istart, iend, i, iu, size;
MPI_Init();
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
. . . . . .
iu=11+myid;
sprintf(fname, output.%d, iu);
fp = fopen(fname, "r");
fwrite ((void *)&b, size, 1, fp);
fwrite ((void *)&c, size, 1, fp);
fwrite ((void *)&d, size, 1, fp);
fclose( fp );
double
a[m][nn][kk], b[m] [nn][kk], c[m] [nn][kk], d[m] [nn][kk], tt[mm] [nn][kk];
size = m*nn*kk*sizeof(double);
fwrite ((void *)&b, size, 1, fp);
fwrite ((void *)&c, size, 1, fp);
fwrite ((void *)&d, size, 1, fp);
132
output.11output.12output.13 . . . np
#define np
4
#define mm 200
char
fname[10];
double a[mm], b[mm], c[mm], d[mm];
int
i, iu, size
for (i=0; i<np; i++) {
iu=11+I;
sprintf (fname, output.%d, iu);
fp = fopen (fname, r);
startend (i, np, 0, mm-1, &istart, &iend);
size = (iend-istart+1)*sizeof(double);
fread ((void *)&a[istart], size, 1, fp);
fclose( fp );
}
sprintf (fname, output.dat);
fp = fopen (fname, w);
fwrite ((void *)&a, sizeof(a), 1, fp);
size = (iend-istart+1)*nn*kk*sizeof(double);
fread ((void *)&a[istart][0][0], size, 1, fp);
133

MPI (Transposing Block Distribution)
(2 Way Recursive and Pipeline method)

134
7.1
MPI MPI_INTMPI_FLOATMPI_DOUBLE
MPI_CHAR (derived data type)
MPI_Type_vectorMPI_Type_contiguousMPI_Type_indexed
MPI_Type_struct MPI_Type_vector
(Constant Stride)MPI_ Type_contiguous
MPI_ Type_struct
C struct
, C struct
struct {
float a;
float b;
int
n;
} load;
MPI MPI_ Type_struct :
#define count 3
int
length[count];
MPI_Datatype oldtype[count];
MPI_Aint
disp[count];
MPI_Datatype newtype;
MPI_ Type_struct ( count, length, disp, oldtype, &newtpye);
MPI_Type_commit(&newtpye);
count
length
disp
() count
count MPI_Aint
oldtype
newtype
count MPI_Datatype
MPI_Datatype
MPI_Address
135
(Displacement) :
MPI_Address ( (void *)&data, &address);
data
adress
data
/*
PROGRAM T7STRUCT
C struct and related MPI_Type_struct example
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define count
3
/*--------- MPI related data ---------*/
int
nproc, myid;
MPI_Comm
comm;
MPI_Status
istat[8];
main ( argc, argv)

int argc;
char **argv;
{
int i, itag;
int length[3] = {1, 1, 1};
MPI_Datatype oldtype[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
MPI_Datatype
MPI_Aint
struct {
newtype;
disp[3];
float a;
float b;
int
n;
} new;
136

comm = MPI_COMM_WORLD;
MPI_Address ((void *)&new.a, &disp[0] );
MPI_Address ((void *)&new.b, &disp[1] );
MPI_Address ((void *)&new.n, &disp[2] );
for (i=2; i>=0; i--)
disp[i] -= disp[0];
MPI_Type_struct ( count, length, disp, oldtype, &newtype);
MPI_Type_commit(&newtype);
itag = 10;
if (myid == 0 ) {
scanf ("%f %f %d", &new.a, &new.b, &new.n);
MPI_Send ((void *)&new, 1, newtype, 1, itag, comm);
}
else {
MPI_Recv ((void *)&new, 1, newtype, 0, itag, comm, istat);
printf ("a,b,n=%f\t%f\t%d\n", new.a, new.b, new.n);
}
MPI_Finalize();
return 0;
}
CPU
a,b,n=10.000000 20.000000
30
C CPU
C struct MPI_Type_struct
CPU
MPI_Sendrecv MPI_Pack
buf1 MPI_ Sendrecv buf 1 buf 2
MPI_Unpack buf 2
137
#define
im=160;
#define
float
int
char
km=20;
up[im+1][km], vp[im+1][km], wp[im+1][km];
bufsize = km*4*8, km2=km*2, itag, ipos, l_nbr, r_nbr;
buf1[bufsize], buf2[bufsize];
MPI_Comm comm;
. . . . . . . .
if (myid > 0) {
ipos=0;
MPI_Pack ( (void *)&up[istart][0], km, MPI_FLOAT, (void *)&buf1, bufsize, ipos, comm);
MPI_Pack ((void *)&vp[istart][0], km2, MPI_FLOAT, (void *)&buf1, bufsize, ipos, comm);
MPI_Pack ((void *)&wp[istart][0], km, MPI_FLOAT, (void *)&buf1, bufsize, ipos, comm);
}
itag=202;
MPI_Sendrecv((void *)&buf1, bufsize, MPI_CHAR, l_nbr, itag,
(void *)&buf2, bufsize, MPI_CHAR, r_nbr, itag, comm, istat);
if (myid < nproc)
{
ipos=0;
MPI_Unpack ((void *)& buf 2, bufsize, ipos,
(void *)&up[iendp1][0], km, MPI_FLOAT, comm );
MPI_ Unpack ((void *)& buf 2, bufsize, ipos,
(void *)&vp[iendp1][0], km2, MPI_FLOAT, comm );
MPI_ Unpack ((void *)& buf 2, bufsize, ipos,
(void *)&wp[iendp1][0], km, MPI_FLOAT, comm );
}
km MPI_Type_Contiguous
cont2d :
MPI_Datatype cont2d;
MPI_Type_ Contiguous ( km, MPI_DOUBLE, &cont2d );
MPI_Type_commit (&cont2d);
C
138

up km vp km*2 wp
km [istart][0][iendp1][0]
upvpwp dimension
ipack3 MPI_Sendrecv
Pack Unpack
int
length[3], ifirst = 1;
MPI_Datatype ipack3, itype[3];
MPI_Aint
disp[3];
. . . . . . .
if (ifirst == 1) {
ifirst = 0;
length[0] = 1;
length[1] = 2;
length[2] = 1;
itype[0] = cont2d;
itype[1] = cont2d;
itype[2] = cont2d;
MPI_Address( (void *)&up[istart][0], &disp[0]);
MPI_Address((void *)&vp[istart][0], &disp[1]);
MPI_Address((void *)&wp[istart][0], &disp[2]);
For (i=2; i>=0; i--)
disp[i] -= disp[0];
MPI_Type_struct( 3, length, disp, itype, &ipack3);
MPI_Type_commit(&ipack3);
}
itag=202;
MPI_Sendrecv( (void *)&up[istart][0], 1, ipack3, l_nbr, itag,
(void *)&up[iendp1][0], 1, ipack3, r_nbr, itag, comm, istat);
Pack/Unpack
139
7.2
(Array Transpose) 7.1
2nd dimension
a(i,j)
P0
P1
n
.
.
.
P2
3333666999
3333666999
2222555888
2222555888
1111444777
1111444777
j=1
i=1
n3333
3333
2222
2222
1111
1111
j=1
i=1 . .
666
666
555
555
444
444
.
999
999
888
888
777
777
m
1st dimension
7.1
row_to_col
block transpose
7.1 CPU 4 CPU 16 5 CPU 25
CPU 7.1
row distribution
column distribution CPU
derived data type 7.2 [i][j]
CPU
140
2nd dimension
P0
P1
P2
A(I,J)
itype(i,j)
N
. (0,2)
.
. (0,1)
. 1111444777
. (0,0) (1,0) (2,0)
J=1
I=1 . . . . M
itype(i,j)
itype(i,j)
3333666999
(0,2) (1,2) (2,2)
(1,2)
2222555888
(0,1) (1,1) (2,1)
(2,1)
(1,0)
(2,0)
1st dimension
7.2 Derived Data Type Row_to_Col
Transpose
Initial address of the derived data type

Data represented by the derived data type
2nd dimension
jmax
jleng
ileng
1st dimension
jmin
7.3
vector2d
derived data type
7.2 7.3
block2d
int
jmin, jmax, ileng, jleng, count, stride;
MPI_Datatype block2d[ip][jp];
stride = jmax - jmin +1;
141
for (i=0; i<nproc; i++) {

ileng=iendg[i]-istartg[i]+1;
for (j=0; j<nproc; j++) {
jleng=jendg[j]-jstartg[j]+1;
MPI_Type_vector (ileng,jleng,nn,MPI_INT,& block2d[i][j]);
MPI_Type_commit (&block2d[i][j]);
}
}
block2d
row_to_col
transpose
itag=10;
k=-1;
for (id = 0; id < nproc; id++) {
if (id != myid ) {
k=k+1;
istart1=istartg[id];
jstart1=jstartg[id];
MPI_Isend( (void *)&a[istart1][jstart], 1, block2d[id][myid], id, itag, comm, &req1[k]);
MPI_Irecv( (void *)&a[istart][jstart1], 1, block2d[myid][id], id, itag, comm, &req2[k]);
}
}
icount=nproc-1;
MPI_Waitall (icount, req1, stat);
MPI_IsendMPI_Irecv MPI_Waitall :
MPI_Waitall (count, request, status);
count
request
MPI_Isend MPI_Irecv
count MPI_Request
status
count MPI_Status
row_to_col col_to_row id myid req1(id) req2(id)

MPI_Waitall k
142
request
non-blocking send/recv blocked sendrecv
non-blocking send/recv
itag=10;
if (id != myid ) {
MPI_Sendrecv( (void *)&a[istart1][jstart], 1, vector[id][myid], id, itag,
(void *)&a[istart][jstart1], 1, vector[myid][id], id, itag, comm, istat);
}
}
P0
a(i,j)
n3333
(0,2)
2222
(0,1)
1111
(0,0)
j=1
i=1 . .
P1
666
(1,2)
555
(1,1)
444
(1,0)
.
P2
999
(2,2)
888
(2,1)
777
(2,0)
3333666999
3333666999
2222555888
2222555888
1111444777
1111444777
7.5
block2d
col_to_row transpose
col_to_row
143
itag=20;
k=-1;
if (id != myid ) {
k = k +1;
MPI_Isend( (void *)&a[istart][jstart1], 1, block2d[myid][id], id, itag, comm, &req1[k]);
MPI_Irecv( (void *)&a[istart1][jstart], 1, block2d[id][myid], id, itag, comm, &req2[k]);
}
}
icount=nproc-1;
block2d
/*
program transpose
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
*/
#define np 3
#define mm 9
#define nn 6
main ( argc, argv)
int argc;
char **argv;
{
int
a[mm][nn];
int
int
int
istartg[np], iendg[np], jstartg[np], jendg[np];

i, j, k, nproc, myid, istart, iend, jstart, jend;
iu, id, itag, icount, istart1, jstart1, jleng;
FILE
*fp;
char
string[80], fname[16];
MPI_Datatype vector[np][np];
MPI_Request
req1[np], req2[np];
MPI_Status
istat[8];
144
MPI_Comm
comm;

comm = MPI_COMM_WORLD;
MPI_Barrier( comm );
for (i = 0; i < nproc; i++)
startend( i, nproc, 0, mm-1, &istartg[i],&iendg[i] );
for (j = 0; j < nproc; j++)
startend(j, nproc, 0, nn-1, &jstartg[j], &jendg[j] );
icount=iendg[i]-istartg[i]+1;
for (j=0; j<nproc; j++) {
jleng=jendg[j]-jstartg[j]+1;
MPI_Type_vector (icount,jleng,nn,MPI_INT,&block2d[i][j]);
MPI_Type_commit (&block2d[i][j]);
}
}
istart=istartg[myid];
iend=iendg[myid];
jstart=jstartg[myid];
jend=jendg[myid];
printf("myid,istart,iend,jstart,jend=%d %d %d %d %d\n",
myid,istart,iend,jstart,jend);
for (i=0; i<3; i++)
a[i][j]=1+myid;
for (i=3; i<6; i++)
a[i][j]=4+myid;
for (i=6; i<9; i++)
a[i][j]=7+myid;
}
iu=myid+11;
sprintf( fname,"output.%d", iu);
fp = fopen( fname, "w");
145
sprintf(string,"%d %d %d %d %d %d %d %d %d\n",
a[0][j],a[1][j],a[2][j],a[3][j],a[4][j],a[5][j],a[6][j],a[7][j],a[8][j]);
fwrite( (void *)&string, sizeof(string), 1, fp );
}
/*
row_to_col
*/
itag=10;
k=-1;
if (id != myid ) {
k=k+1;
MPI_Isend( (void *)&a[istart1][jstart], 1, block2d[id][myid],
id, itag, comm, &req1[k]);
MPI_Irecv( (void *)&a[istart][jstart1], 1, block2d[myid][id],
}
}
icount=nproc-1;
MPI_Waitall (icount, req1, istat);
sprintf( string, "after row_to_col\n");
for (j=nn-1; j>=0; j--) {
sprintf(string,"%d %d %d\0\0",
a[istart][j],a[istart+1][j],a[istart+2][j]);
}
/*
col_to_row
*/
MPI_Barrier( comm );
itag=20;
k=-1;
146
if (id != myid ) {
k=k+1;
MPI_Isend( (void *)&a[istart][jstart1], 1, block2d[myid][id],
MPI_Irecv( (void *)&a[istart1][jstart], 1, block2d[id][myid],
}
}
icount=nproc-1;
for (j=jstart; j<=jend; j++)
a[i][j]=a[i][j]+10;
sprintf( string, "after col_to_row\n");
sprintf(string,"%d %d %d %d %d %d %d %d %d\n",
a[0][j],a[1][j],a[2][j],a[3][j],a[4][j],a[5][j],a[6][j],a[7][j],a[8][j]);
}
MPI_Finalize();
return 0;
}
{
ilength=is2-is1+1;
if(myid < ir) {
}
147
else {
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
CPU fort.11
1.
1.
1.
1.
1.
1.
4.
4.
4.
4.
4.
4.
7.
7.
7.
7.
7.
7.
after col_to_row
11. 11. 11. 14.
11. 11. 11. 14.
14.
14.
14.
14.
17.
17.
17.
17.
17.
17.
after row_to_col
1.
1.
1.
1.
1.
1.
2.
2.
3.
3.
2.
2.
3.
3.
2.
2.
3.
3.
fort.12
2.
2.
2.
2.
2.
2.
after row_to_col
4.
4.
4.
5.
5.
5.
5.
5.
5.
8.
8.
8.
8.
8.
8.
6.
6.
6.
6.
6.
6.
after col_to_row
12. 12. 12. 15.
15.
15.
18.
18.
18.
4.
5.
5.
4.
5.
5.
4.
5.
5.
148
12.
12.
12.
15.
15.
15.
18.
18.
18.
6.
6.
6.
6.
6.
6.
9.
9.
9.
9.
9.
9.
16.
16.
16.
16.
16.
16.
19.
19.
19.
19.
19.
19.
fort.13
3.
3.
3.
3.
3.
3.
after row_to_col
7.
7.
7.
7.
7.
7.
8.
8.
8.
8.
8.
8.
9.
9.
9.
9.
9.
9.
after col_to_row
13.
13.
13.
13.
13.
13.
149
7.3
for loop
for loop X
index [i][j] index [i][j][i-1][j][i][j-1]i j
(Recursive) (2-Way Recursive )
(Pipeline Method)
#define m
128
#define m 128
double x[m+2][n+2];
for (i=1; i<=m; i++)
for (j=1; j<=n; j++)
x[i][j]=x[i][j]+( x[i-1][j]+x[i][j-1] )*0.5;
2nd Dimension
j
P0
P1
P2
x[i][j]
3
i
1st Dimension
7.2 (a)
150
x 7.2(a)
CPU
CPU j CPU
CPU CPU
CPU CPU 7.2(b)
345 CPU 26 CPU
17 CPU j
j
P0
P1
P2
time
7
6
7.2 (b)
/*
program pipeseq
#include <stdio.h>
*/
#include <stdlib.h>
#define m 128
#define n 128
151
main ()
{
double
int
FILE
x[m+2][n+2], eps, omega, err1, temp, clock;

i, j, k, loop, isec1, nsec1, isec2, nsec2;
*fp;
fread( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
for (loop=0; loop<36000; loop++) {
err1=0.0;
for (i=1; i<=m; i++) {
for (j=1; j<=n; j++) {
temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j]+=omega*temp;
if(temp < 0) temp=-temp;
if(temp > err1) err1=temp;
}
}
if(err1 <= eps) break;
}
printf( "loop,err1 = %d %.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
152

printf( " clock time=%f\n", clock);
return 0;
}
{
int iret;
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
PIPESEQ IBM SP2 SMP
4.349
6.040
4.860
7.116
4.919
0.318
2.283
3.704
2.286
4.340
loop,err1 = 10566 9.99821e-06
x[i][n], i=1; i<=128; i+=8
3.700
6.900
2.232
6.741
8.797
1.727
7.457
3.974
4.557
4.171
7.183
3.670
clock time=10.663873
6.317
5.158
6.057
6.431
4.818
3.384
6.752
5.198
5.458
5.561
/*
program pipeline
Parallel on 1st dimension
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define m 128
#define n
128
main ( argc, argv)

int argc;
153
char **argv;
{
double
int
int
x[m+2][n+2], eps, omega, err1, gerr1, temp, clock;

i, j, k, ip, itag, loop, iblock, iblklen, jj, isrc;
nproc, myid, istart, iend, count, istart1, count1,
istartm1, iendp1, l_nbr, r_nbr, lastp;

int
istartg[32], iendg[32];
MPI_Status
istat[8];
MPI_Comm comm;
FILE
*fp;

clock=MPI_Wtime();
for (i = 0; i < nproc; i++) {
startend(i, nproc, 1, m, &istartg[i], &iendg[i]);
}
iend =iendg[myid];
lastp =nproc-1;
istartm1=istart-1;
iendp1=iend+1;
l_nbr = myid-1;
r_nbr = myid+1;
if(myid == 0)
if(myid == lastp) r_nbr = MPI_PROC_NULL;

if( myid==0) {
154
fclose( fp );
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
count=(m+2)*(n+2);
MPI_Bcast((void *)&x, count, MPI_DOUBLE, 0, comm);
for (ip=0; ip<nproc; ip++)
startend( ip, nproc, 1, m, &istartg[i], &iendg[i]);
iblock = 4;
omega = 0.5;
eps = 1.0e-5;
err1 = 1.0e-15;
itag = 20;
MPI_Sendrecv ((void *)&x[istart][0], n+2, MPI_DOUBLE, l_nbr, itag,
(void *)&x[iendp1][0], n+2, MPI_DOUBLE, r_nbr, itag, comm, istat);
itag = 10;
for (jj=1; jj<=m; jj+=iblock) {
iblklen = min(iblock, n-jj+1);
MPI_Recv( (void *)&x[istartm1][jj], iblklen, MPI_DOUBLE, l_nbr, itag, comm, istat);
for (j=jj; j<=jj+iblklen-1; j++) {
temp = 0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] )-x[i][j];
x[i][j] = x[i][j]+omega*temp;
if ( temp < 0.0) temp = -temp;
if(temp >
err1) err1 = temp;
}
}
MPI_Send( (void *)&x[iend][jj], iblklen, MPI_DOUBLE, r_nbr, itag, comm);
}
MPI_Allreduce((void *)&err1,(void *)&gerr1,1,MPI_DOUBLE,MPI_MAX, comm);
err1 = gerr1;
if(err1 < eps) break;
155
}
itag = 110;
if( myid == 0) {
istart1=istartg[isrc];
count1=(iendg[isrc]-istart1+1)*(n+2);
MPI_Recv((void *)&x[istart1][0], count1, MPI_DOUBLE, isrc, itag, comm, istat);
}
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
else {
count = (iend-istart+1)*(n+2);
MPI_Send ((void *)&x[istart][0], count, MPI_DOUBLE, 0, itag, comm);
}
clock = MPI_Wtime() - clock;
printf( " myid, clock time= %d
MPI_Finalize();
return 0;
%f\n", myid, clock);
}
{
ilength=is2-is1+1;
if(myid < ir) {
}
else {
156
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
min(int i1, int i2)
{
if (i1 < i2) return i1;
else return i2;
}
PIPELINE IBM SP2 SMP CPU 13.93
10.66
ATTENTION: 0031-408
0
1
2
3
4.349
6.040
4.860
7.116
4.919
0.318
2.283
3.704
2.286
4.340
myid, clock time= 2 13.927630
loop,err1 = 10567 9.99821e-06
x[i][n], i=1; i<=128; i+=8
7.457
3.974
4.557
6.752
4.171
7.183
3.670
5.198
myid, clock time= 0
5.458
5.561
1
33
65
97
32
64
96
128
3.700
6.900
2.232
6.741
8.797
1.727
6.317
5.158
6.057
6.431
4.818
3.384
13.928672
157
SOR
SOR(Successive
Over-Relaxation)
SOR
SOR
SOR
SOR
158
8.1 SOR
Successive Over-Relaxation (SOR) method Laplace
for-loop x for-loop x
omega
for-loop
for (i=1; i<=m; i++)
for (j=1; j<=n; j++) {

temp=0.25*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] ) x[i][j];
x[i][j]=x[i][j] + omega*temp;
if (temp < 0.0) temp=-temp;
if (temp > err1)
err1=temp;
}
}
x[i][j] 8.1
x[i-1][j] x[i][j-1] x[i+1][j] x[i][j+1]
(pipeline method)
(red-black SOR method)
2nd Dimension
X(I,J)
not updated yet
about to be updated
already updated
x[i][j+1]
x[i-1][j]
j=1
x[i][j]
x[i+1][j]
x[i][j-1]
i=1
8.1
1st Dimension
SOR
159
/*
program sor
Sequential version of Successive Over-Relaxation Method
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n
128
main ( argc, argv)

int argc;
char **argv;
{
double x[m+2][n+2], eps, omega, err1, temp, clock;
int
FILE
*fp;
fclose( fp );
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
err1 = 0.0;
for (i=1; i<=m; i++) {
for (j=1; j<=n; j++) {
}
}
160
for (i=1; i<=m; i++) {

for (j=mod(i,2)+1; j<=n; j+=2) {
}
}
}
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
return 0;
}
{
int iret;
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
input.dat
/*
program sordata
*/
#include <stdio.h>
161
#include <stdlib.h>
#define m 128
#define n
double
main ()
{
double
int
FILE
128
seed = 123456.78;
x[m+2][n+2];
i, j;
*fp;
for (i=0; i<=m+1; i++) {

randnum( n+2, &x[i][0]);
}
for (i=0; i<=m+1; i++)
for (j=0; j<=n+1; j++)
x[i][j]=x[i][j]*10.0;
fwrite( (void *)&x, sizeof(x), 1, fp );
fclose( fp );
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
randnum ( int number, double array[])
{
int
i, ic, ifirst=1;
double a=16807.0, b=2147483647.0, twom01=0.50;
static double twom62,twom31,twom16,twom08,twom04,twom02;
if (ifirst == 1) {
ifirst=0;
twom02=twom01*twom01;
162
twom31=twom16*twom08*twom04*twom02*twom01;
}
for (i=0; i<number; i++) {
seed=seed*a;
ic=(int) (seed/b);
seed-=b*(double)ic;
array[i] = seed*twom31+seed*twom62;
}
return 0;
}
SORSEQ IBM SP2 SMP loop 10567
eps 10.66
4.349
6.040
4.860
7.116
0.318
2.283
3.704
2.286
loop,err1 = 10567 9.99821e-06
x[i][n], i=1; i<=128; i+=8
7.457
3.974
4.557
6.752
4.171
7.183
3.670
5.198
4.919
3.700
2.232
8.797
4.340
6.900
6.741
1.727
5.458
6.317
6.057
4.818
5.561
5.158
6.431
3.384
163
8.2 SOR
(red-black SOR method)
8.2
i+j
i+j
SOR
2nd Dimension
x[i][j]
j=n
red (white) element
black element
3
2
j=1
i=1
8.2
1st Dimension
SOR
j+i
j+i
/*
program sorrb
Sequential version of red-black Successive Over-Relaxation Method
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
164
char **argv;
{
double
int
FILE

*fp;
fclose( fp );
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
err1 = 0.0;
for (i=1; i<=m; i++) {
/*
*/
for (j=mod(i+1,2)+1; j<=n; j+=2) {

}
}
for (i=1; i<=m; i++) {
/*
*/
for (j=mod(i,2)+1; j<=n; j+=2) {

}
}
}
165

printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
return 0;
}
{
int iret;
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
mod(int i1, int i2)
{
int
i3;
i3=i1/i2;
i1=i1-i3*i2;
return i1;
}
SORRB IBM SP2 SMP loop 10313
eps 4.51 SOR 10567
SOR 10.66
4.349
6.040
4.860
7.116
0.318
2.283
3.704
2.286
loop,err1 = 10313 9.99917e-06
x[i][n], i=1; i<=128; i+=8
4.919
3.700
2.232
8.797
4.340
6.900
6.741
1.727
166
7.457
4.171
3.974
7.183
4.557
3.670
6.752
5.198
5.458
5.561
6.317
5.158
6.057
6.431
4.818
3.384
clock time=4.505438
2nd Dimension
P0
P1
P2
x[i][j]
j=n
red (white) element
black element
3
2
j=1
i=1
1st Dimension
8.3 red_black SOR method

Red-Black SOR COMMU(0)
COMMU(1)
/*
program sorrbp -- red-black Successive Over-Relaxation Method

Parallel on the first dimension
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define m 128
#define n 128
double
int
x[m+2][n+2];
nproc, myid, istart, iend, count, count1, istart1,
istartm1, iendp1, l_nbr, r_nbr, lastp, itag;

int
MPI_Status
istat[8];
MPI_Comm
comm;
167
main ( argc, argv)

int argc;
char **argv;
{
double
int
FILE
eps, omega, err1, gerr1, temp, clock;

i, j, n2, loop;
*fp;

clock=MPI_Wtime();
for (i = 0; i < nproc; i++)
startend(i, nproc, 1, m, &istartg[i], &iendg[i]);
iend =iendg[myid];
lastp =nproc-1;
istartm1=istart-1;
iendp1=iend+1;
l_nbr = myid-1;
r_nbr = myid+1;
if(myid == 0)
if( myid==0) {
fclose( fp );
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
168
}
}
count=(m+2)*(n+2);
eps=1.0e-5;
omega=0.5;
err1 = 0.0;
n2 = n+2;
MPI_Sendrecv ((void *)&x[istart][0], n2, MPI_DOUBLE, l_nbr, 1,
(void *)&x[iendp1][0], n2, MPI_DOUBLE, r_nbr, 1, comm, istat);
MPI_Sendrecv ((void *)&x[iend][0],
n2, MPI_DOUBLE, r_nbr, 2,
(void *)&x[istartm1][0],n2, MPI_DOUBLE, l_nbr, 2, comm, istat);
/*
red(white) grid) */
for (j=mod(i+1,2)+1; j<=n; j+=2) {
}
}
MPI_Sendrecv ((void *)&x[istart][0], n2, MPI_DOUBLE, l_nbr, 3,
(void *)&x[iendp1][0], n2, MPI_DOUBLE, r_nbr, 3, comm, istat);
n2, MPI_DOUBLE, r_nbr, 4,
(void *)&x[istartm1][0],n2, MPI_DOUBLE, l_nbr, 4, comm, istat);
/*
black grid
*/
for (j=mod(i,2)+1; j<=n; j+=2) {
}
}
err1 = gerr1;
}
169
itag=30;
if (myid == 0) {
istart1=istartg[i];
count1 =(iendg[i]-istart1+1)*(n+2);
MPI_Recv ((void *)&x[istart1][0], count1, MPI_DOUBLE, i, itag, comm, istat);
}
}
else {
count=(iend-istart+1)*(n+2);
}
if (myid == 0) {
printf( "loop,err1
= %d
%.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");

for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
printf( " myid,clock time=%d
MPI_Finalize();
return 0;
%f\n", myid,clock);
}
mod(int i1, int i2)
{
int i3;
i3 = i1/i2;
i3 = i1-i3*i2;
return i3;
}
{
ilength=is2-is1+1;
170
if(myid < ir) {
}
else {
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
SORRBP IBM SP2 SMP CPU ITER 10313
eps 4.72 4.51
Red-Black SOR 10313 (speed up) =4.51/4.72=0.96
ATTENTION: 0031-408
0
1
2
3
4.349
6.040
4.860
7.116
4.919
0.318
2.283
3.704
2.286
4.340
myid,clock time=1 4.717445
loop,err1 = 10313 9.99917e-06
x[i][n], i=1; i<=128; i+=8
7.457
3.974
4.557
6.752
4.171
7.183
3.670
5.198
1
33
65
97
32
64
96
128
3.700
6.900
2.232
6.741
8.797
1.727
5.458
6.317
6.057
4.818
5.561
5.158
6.431
3.384
IBM SP2 SMP CPU loop 10313

eps 9.56 CPU
171
ATTENTION: 0031-408
0
1
2
3
1
17
33
49
16
32
48
64
5
6
7
4
81
97
113
65
96
112
128
80
4.349
6.040
4.860
7.116
0.318
2.283
3.704
2.286
4.919
4.340
3.700
6.900
2.232
6.741
8.797
1.727
5.458
5.561
6.317
5.158
6.057
6.431
4.818
3.384

loop,err1 = 10313 9.99917e-06
x[i][n], i=1; i<=128; i+=8
7.457
3.974
4.171
7.183
myid,clock time=0
4.557
6.752
3.670
5.198
9.560302
172
8.3 SOR
SOR
8.4
1 n
2nd
Dimension
x[i][j]
j=n
white element
black element
3
2
j=1
i=1
1st Dimension
8.4 SOR
SOR
/*
program sorzebra
Sequential version of zebra SOR
(Successive Over-Relaxation Method)
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
char **argv;
{
double x[m+2][n+2], eps, omega, err1, temp, clock;
int
FILE

*fp;
173
fclose( fp );
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
err1 = 0.0;
for (i=1; i<=m; i+=2) {
for (j=1; j<=n; j++) {
}
}
for (i=2; i<=m; i+=2) {
for (j=1; j<=n; j++) {
}
}
}
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
174
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
return 0;
}
{
int iret;
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
sorzebra IBM SP2 SMP loop 10409
eps 10.51
4.349
6.040
4.860
7.116
4.919
0.318
2.283
3.704
2.286
4.340
loop,err1 = 10409 9.99896e-06
x[i][n], i=1; i<=128; i+=8
7.457
3.974
4.557
4.171
7.183
3.670
6.752
5.198
5.458
5.561
3.700
6.900
2.232
6.741
8.797
1.727
6.317
5.158
6.057
6.431
4.818
3.384
8.5 CPU
CPU CPU index
for (i = 0; i < nproc; i++) {
startend(i, nproc, 1, (m+1)/2, &istart, &iend);
istartg[i] = istart*2-1;
iendg[i] = min (m, iend*2);
iend =iendg[myid];
175
2nd
Dimension
P0
P1
P2
j=n
x[i][j]
white element
black element
3
2
j=1
i=1
1st Dimension
8.5 SOR
CPU index i istart istart-1
CPU MPI_Sendrecv istart-1
i+1 CPU iend+1
MPI_ Sendrecv iend+1 i-1
CPU sor_zebrap
/*
program zebrap -- Parallel on 1st Dimension

(Successive Over-Relaxation Method)
of zebra SOR
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define m 128
#define n 128
double
int
x[m+2][n+2];
int
MPI_Status
MPI_Comm
istat[8];
comm;
main ( argc, argv)

176
int argc;
char **argv;
{
double
int
FILE
eps, omega, err1, gerr1, temp, clock;

i, j, k, loop, itag;
*fp;

clock=MPI_Wtime();
for (i = 0; i < nproc; i++) {
}
iend =iendg[myid];
lastp =nproc-1;
istartm1=istart-1;
iendp1=iend+1;
l_nbr = myid-1;
r_nbr = myid+1;
if(myid == 0)

if( myid==0) {
fclose( fp );
}
count=(m+2)*(n+2);
177

eps=1.0e-5;
omega=0.5;
err1 = 0.0;
itag=10;
n+2, MPI_DOUBLE, r_nbr, itag,
(void *)&x[istartm1][0], n+2, MPI_DOUBLE, l_nbr, itag, comm, istat);
for (i=istart; i<=iend; i+=2) {
for (j=1; j<=n; j++) {
}
}
itag=20;
MPI_Sendrecv ((void *)&x[istart][0], n+2, MPI_DOUBLE, l_nbr, itag,
for (i=istart+1; i<=m; i+=2) {
for (j=1; j<=n; j++) {
}
}
err1 = gerr1;
}
itag=30;
if (myid == 0) {
istart1=istartg[i];
178
}
}
else {
}
if (myid == 0) {
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
%f\n", myid,clock);
MPI_Finalize();
return 0;
}
{
ilength=is2-is1+1;
if(myid < ir) {
}
else {
}
if(ilength < 1) {
*istart=1;
*iend=0;
179
}
}
min(int i1, int i2)
{
else return i2;
}
SOR_ZEBRAP IBM SP2 SMP CPU loop
10409 eps 8.39 (speed up) =10.51/8.39=1.25
1
33
64
2
65
96
3
0
97
1
128
32

loop,err1 = 10409 9.99896e-06
x[i][n], i=1; i<=128; i+=8
7.457
3.974
4.557
6.752
4.171
7.183
3.670
5.198
5.458
5.561
6.317
5.158
6.057
6.431
4.818
3.384
180
8.4 SOR
x[i][j]
8.6 for loop :
err1 = 0.0;
for (i=1; i<=m; i++) {
for (j=1; j<=n; j++) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] +
x[i-1][j-1]+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
}
}
2nd Dimension
x[i][j]
n
.
x(i-1,j+1) x(i,j+1) x(i+1,j+1)
3
x(i-1,j)
x(i,j)
x(i+1,j)
x(i-1,j-1)
x(i,j-1) x(i+1,j-1)
2
j=1
I=1
1st Dimension
8.6 SOR
() 8.6
181
/*
program color_seq
Sequential version of 4 colour Successive Over-Relaxation Method
*/
#include <stdio.h>
#include <stdlib.h>
#define m 128
#define n
128
main ( argc, argv)

int argc;
char **argv;
{
double
int
FILE

*fp;
fclose( fp );
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
eps=1.0e-5;
omega=0.5;
err1 = 0.0;
for (i=1; i<=m; i+=2) { /* update circle */
for (j=1; j<=n; j+=2) {
}
182
}
for (i=1; i<=m; i+=2) {
/*
update triangle */
for (j=2; j<=n; j+=2) {

}
}
for (i=2; i<=m; i+=2) { /* update square
*/
for (j=1; j<=n; j+=2) {
}
}
for (i=2; i<=m; i+=2) { /*
for (j=2; j<=n; j+=2) {
update <>
*/
}
}
}
printf( " x[i][n], i=1; i<=128; i+=8\n");
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
183

return 0;
}
{
int iret;
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
COLOR_SOR IBM SP2 SMP loop 8157
eps 5.87
4.349
6.040
4.860
7.116
0.318
2.283
3.704
2.286
loop,err1 = 8157 9.99831e-06
x[i][n], i=1; i<=128; i+=8
6.584
4.322
5.045
7.129
4.587
7.141
4.051
4.837
clock time=5.869461
4.919
4.340
3.700
6.900
2.232
6.741
8.797
1.727
4.290
5.531
5.596
5.200
5.881
5.883
4.026
3.643
CPU
CPU CPU index
for (i = 0; i < nproc; i++) {

}
iend =iendg[myid];
184
2nd Dimension
P0
P1
P2
x[i][j]
n
.
3
2
j=1
i=1
1st Dimension
8.6 SOR
istart-1 istart-1
iend+1 iend+1
CPU
/*
program colorp
Parallel on 1st dimension of 4 colour Successive Over-Relaxation Method
*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define m 128
#define n 128
main ( argc, argv)
int argc;
char **argv;
{
double
int
int
x[m+2][n+2], eps, omega, err1, gerr1, temp, clock;

i, j, ip, itag, loop;
int
MPI_Status
istat[8];
185
MPI_Comm
FILE
comm;
*fp;

MPI_Barrier
(MPI_COMM_WORLD);
clock=MPI_Wtime();
for (i = 0; i < nproc; i++) {
}
iend =iendg[myid];
lastp =nproc-1;
istartm1=istart-1;
iendp1=iend+1;
l_nbr = myid-1;
r_nbr = myid+1;
if(myid == 0)
if( myid==0) {
fclose( fp );
for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
}
}
186
count=(m+2)*(n+2);
eps=1.0e-5;
omega=0.5;
for (loop=1; loop<=36000; loop++) {
err1 = 0.0;
itag=10;
n+2, MPI_DOUBLE, r_nbr, itag,
(void *)&x[istartm1][0], n+2, MPI_DOUBLE, l_nbr, itag, comm, istat);
/*
for (i=1; i<=m; i+=2) {

update circle */
for (j=1; j<=n; j+=2) {
temp=0.125*( x[i-1][j]+x[i+1][j]+x[i][j-1]+x[i][j+1] + x[i-1][j-1]+
+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
}
/*
}
for (i=1; i<=m; i+=2) {
update square */

for (j=2; j<=n; j+=2) {
+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
}
}
itag=20;
MPI_Sendrecv ((void *)&x[istart][0],
n+2, MPI_DOUBLE, l_nbr, itag,
/*
for (i=2; i<=m; i+=2) {

update triangle
*/
for (i=istart+1; i<=iend; i+=2) {
for (j=1; j<=n; j+=2) {
+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
187
}
/*
}
for (i=2; i<=m; i+=2) {
update <>
*/
for (i=istart+1; i<=iend; i+=2) {

for (j=2; j<=n; j+=2) {
+x[i-1][j+1]+x[i+1][j-1]+x[i+1][j+1] )-x[i][j];
}
}
err1 = gerr1;
}
itag=30;
if (myid == 0) {
istart1=istartg[i];
}
}
else {
}
if (myid == 0) {
printf( "loop,err1
= %d
%.5e\n", loop, err1);
printf( " x[i][n], i=1; i<=128; i+=8\n");

for (i = 1; i <= m; i+=64) {
x[i][n],x[i+8][n],x[i+16][n],x[i+24][n],
x[i+32][n],x[i+40][n],x[i+48][n],x[i+56][n]);
188
}
}
MPI_Finalize();
return 0;
%f\n", myid,clock);
}
{
ilength=is2-is1+1;
if(myid < ir) {
}
else {
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
min(int i1, int i2)
{
else return i2;
}
colorp IBM SP2 SMP CPU loop 8157
eps 3.24 (speed up) =5.87/3.24=1.81
189

0
1
32
4.349
6.040
4.860
7.116
0.318
2.283
3.704
2.286
loop,err1 = 8157 9.99831e-06
x[i][n], i=1; i<=128; i+=8
6.584
4.322
5.045
7.129
4.587
7.141
4.051
4.837
myid,clock time=0
1
2
3
4.919
33
64
65
96
97
128
3.700
2.232
8.797
4.340
6.900
6.741
1.727
4.290
5.531
5.596
5.200
5.881
5.883
4.026
3.643
3.240597
190

(finite element method - FEM)
(finite difference method)
(implicit method)
(explicit method)

191
9.1
/*
program femseq -- sequential version of finite element explicit method
*/
#include <stdio.h>
#include <stdlib.h>
#define ne 18
#define nn 28
main ( argc, argv)
int argc;
char **argv;
{
double
int
ve[ne+1], vn[nn+1], clock;

index[ne][4], i, j, k, ie, in, loop;
int
isec1, nsec1, isec2, nsec2;
for (i=1; i<=ne; i++) {
scanf("%d %d %d %d\n",&index[i][0],&index[i][1],&index[i][2],&index[i][3]);
}
for (ie=1; ie<=ne; ie++)
ve[ie]=10.0*ie;
for (in=1; in<=nn; in++)
vn[in]=100.0*in;
for (ie=1; ie<=ne; ie++) {
for (j=0; j<4; j++) {
k= index[ie][j];
vn[k]= vn[k] + ve[ie];
}
}
for (in=1; in<=nn; in++)
vn[in] = vn[in] * 0.25;
for (ie=1; ie<=ne; ie++) {
for (j=0; j<4; j++) {
192
k= index[ie][j];
ve[ie] = ve[ie] + vn[k];
}
}
for (ie=1; ie<=ne; ie++)
ve[ie] = ve[ie] *0.25;
}
printf("result of vn\n");
for (i=1; i<=nn; i+=7)
printf(" %.3f %.3f %.3f %.3f %.3f %.3f %.3f\n",
vn[i],vn[i+1],vn[i+2],vn[i+3],vn[i+4],vn[i+5],vn[i+6]);
printf("result of ve\n");
for (i=1; i<=ne; i+=6)
printf(" %.3f %.3f %.3f %.3f %.3f %.3f\n",
ve[i],ve[i+1],ve[i+2],ve[i+3],ve[i+4],ve[i+5]);
return 0;
}
{
int iret;
*isec=tb.tv_sec;
*nsec=tb.tv_nsec;
return 0;
}
Element node 9.1 18 element () element node
() (unstructured grid)
ve vn element node loop for loop
ve vn loop ve vn loop
vn loop vn ve
loop ve
193
8
3
12
6
7
2
20
9
11
5
6
16
12
15
8
10
4
15
19
11
23
18
10
27
17
node -> element
17
22
13
13
28
18
14
14
7
24
26t
16
21
element
25
node
3
3
12
6
7
2
2
9
11
5
6
15
8
10
4
16
15
11
14
18
27
element -> node
17
22
13
17
28
23
18
10
13
24
12
19
14
7
20
26
16
21
25
9.1 element node

element node index Index[i][j] i
element node j 1 4
1
2
INDEX= 6
5
2
3
7
6
3
5
6
7
9 10 11 13 14 15 17 18 19 21 22 23
4
6
7
8 10 11 12 14 15 16 18 19 20 22 23 24
8 10 11 12 14 15 16 18 19 20 22 23 24 26 27 28
7
9 10 11 13 14 15 17 18 19 21 22 23 25 26 27
FEM_SEQ
result of vn
303.506 737.138 743.620 309.989 905.479 2197.706 2214.970
922.743 1476.091 3579.268 3602.639 1499.462 1927.588 4670.236
4695.415 1952.767 2066.994 5005.284 5028.654 2090.365 1655.717
4008.155 4025.419 1672.981 642.193 1554.410 1560.892 648.676
194
result of ve
1281.497 1823.797 1298.016 2526.136 3591.987 2554.243
3617.916 5139.575 3651.343 4262.652 6051.210 4296.079
3991.965 5664.587 4020.072 2474.166 3510.129 2490.686
clock time=0.000746
195
9.2
element 9.218 element
CPU CPU 6 element node CPU
node rank CPU (primary processor) rank CPU
(secondary processor)
25
26
27
28
node
16
17
18
element
21
23
23
24
process 2
13
18
17
17
18
10
14
13
14
19
19
11
15
15
20
20
12
16
process 1
7
10
11
9
12
10
11
12
4
5
5
7
6
8
process 0
1
2
2
3
3
4
9.2
element CPU element
iecntg[i] CPU i element iestartg[i] CPU i element
CPU node incntg[i] CPU i node
inodeg[i][j] CPU i node 9.3 g global
196
ecntg
P0
P1
P2
12
P0
8
8
estartg
eendg
12
13
18
P1
13
14
15
16
17
18
19
20
P2
21
22
23
24
25
26
27
28
ncntg
10
11
12
nodeg
9.3
element node
CPU node CPU

( irregular mesh) CPU
CPU CPU CPU
CPU CPU I node scnt[i]
snode[i][j]j=1, scnt[i] CPU CPU i node pcnt[i]
pnode[i][j]j=1,pcnt[i] 9.4
Associated with
scnt
Process 0
P0
P1 P2
0
Process 2
P0 P1 P2
9
10
11
12
snode
Associated with P0
pcnt
0
pnode
Process 1
P0
P1
P2
P1
4
9
10
11
12
9.4
P2
0
P0
0
17
18
19
20
P1
0
P2
4
P0 P1
0
0
P2
0
17
18
19
20
CPU node CPU

197
fem_seq
/*
program femp -- parallel version of finite element explicit method */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define ne 18
#define nn 28
#define np
3
int
int
ecntg[np], estartg[np], eendg[np];

ncntg[np], nodeg[np][nn], itg[np][nn+1];
main ( argc, argv)

int argc;
char **argv;
{
double
int
int
int
ve[ne+1], vn[nn+1], bufs[np+1][nn], bufr[np+1][nn], clock;

index[ne+1][4], i, j, k, ie, in, ii, itime, itag, iflag;
nproc, myid, istart, iend, count, icount, iu, irank, is;
scnt [np], snode[np][nn];
int
pcnt [np], pnode[np][nn];
MPI_Status
istat[8];
MPI_Comm comm;
clock=MPI_Wtime();
if (myid == 0) {
for (i=1; i<=ne; i++)
scanf("%d %d %d %d\n",&index[i][0],&index[i][1],&index[i][2],&index[i][3]);
}
icount=(ne+1)*4;
MPI_Bcast ((void *)&index, icount, MPI_INT, 0, comm);
198
/*
clear counters, CPU and node association indicators
*/
for (irank = 0; irank < nproc; irank++) {
ncntg[irank]=0;
scnt[irank]=0;
pcnt[irank]=0;
for (j=0; j<nn; j++) {
itg [irank][j]=0;
snode[irank][j]=0;
pnode[irank][j]=0;
}
}
itg CPU node for loop CPU
nn node for loop CPU node
1 9.2 9.1
j
P0
P1
P2
1
1
0
0
2
1
0
0
3
1
0
0
4
1
0
0
5
1
0
0
6
1
0
0
9.1
7
1
0
0
8
1
0
0
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
itg
/*set node association indicator for associated nodes

*/
for (irank = 0; irank < nproc; irank++) {
startend(irank, nproc, 1,ne, &istart, &iend);
estartg[irank]=istart;
eendg[irank]=iend;
ecntg[irank]=iend-istart+1;
for (ie=istart; ie<=iend; ie++) {
for (j=0; j<4; j++) {
k=index[ie][j];
itg[irank][k]=1;
}
}
if (myid == 0 ) {
199
printf("itg values for irank= %d\n", irank);

for (j=1; j<=nn; j+=4) {
printf("%d %d %d %d\n",
itg[irank][j],itg[irank][j+1],itg[irank][j+2],itg[irank][j+3]);
}
}
}
istart=estartg[myid];
iend=eendg[myid];
count=ecntg[myid];
printf( "NPROC,MYID,ISTART,IEND=%d %d %d %d\n",nproc,myid,istart,iend);
9.1 node () 1 CPU
node P0 P1 node 9101112 node P0 primary node
P1 secondary nodeP1 P2 node 17181920 node P1 primary node
P2 secondary node for loop node
scntpcnt node snodepnode s
secondaryp primary
/*count and store boundary node code
*/
for (in=1; in<=nn; in++) {
iflag=1;
for (irank=0; irank<nproc; irank++) {
if (itg[irank][in] == 1) {
if (iflag == 1) {
iflag=2;
ii=irank;
}
else {
itg[irank][in]=0;
if (irank == myid) {
scnt[ii]=scnt[ii]+1;
snode[ii][scnt[ii]]=in; }
/* node[in] belong to irank */

/* 1st time itg[irank][in]==1
/* 2nd time itg[irank][in]==1
/* secondary node count

/* secondary node code
else {
if (ii == myid) {
pcnt[irank]=pcnt[irank]+1;
pnode[irank][pcnt[irank]]=in; }
}
*/
*/
*/
*/
/* primary node count */

/* primary node code */
200
}
}
}
}
/*
*/
count and store all primary node code which belongs to each CPU
for (in=1; in<=nn; in++) {
if (itg[irank][in] == 1) {
ncntg[irank]=ncntg[irank]+1;
nodeg[irank][ncntg[irank]]=in;
}
}
k=ncntg[irank];
if(myid == 0) {
printf("nodeg values for irank,k= %d
for (j=1; j<=k; j+=4)
%d\n", irank,k);
printf("%d %d %d %d\n",
nodeg[irank][j],nodeg[irank][j+1],nodeg[irank][j+2],nodeg[irank][j+3]);
}
}
/*
/*
set initial values

*/
for (ie=1; ie<=ne; ie++) */
for (ie=istart; ie<=iend; ie++)
ve[ie]=10.0*ie;
/*
for (in=1; in<=nn; in++) */

for (ii=1; ii<=ncntg[myid]; ii++) {
in=nodeg[myid][ii];
vn[in]=100.0*in;
}
itime loop vn node vn node

vn node node node
for (itime=0; itime<10; itime++) {

for (irank=0; irank<nproc; irank++)
201
for (is=1; is<=scnt[irank]; is++)

vn[ snode[irank][is] ]=0.0;
/*
for (ie=1; ie<=ne; ie++) { */

for (j=0; j<4; j++) {
k= index[ie][j];
vn[k]= vn[k] + ve[ie];
}
}
for (is=1; is<=scnt[irank]; is++)
bufs[irank][is]=vn[ snode[irank][is] ];
itag=10;
if (scnt[irank] > 0)
MPI_Send((void *)&bufs[irank][1],scnt[irank], MPI_DOUBLE,
irank, itag, comm);
if (pcnt[irank] > 0 )
MPI_Recv ((void *)&bufr[irank][1],pcnt[irank],MPI_DOUBLE,
irank, itag, comm, istat);
}
for (i=1; i<=pcnt[irank]; i++) {
k=pnode[irank][i];
vn[k]=vn[k]+bufr[irank][i];
}
}
vn CPU CPU CPU

vn ve ve
/*
for (in=1; in<=nn; in++) */

for (ii=1; ii<=ncntg[myid]; ii++) {
in=nodeg[myid][ii];
vn[in] = vn[in] * 0.25;
}
202
for (i=1; i<=pcnt[irank]; i++)

bufs[irank][i]=vn[ pnode[irank][i] ];
itag=20;
if (pcnt[irank] > 0)
MPI_Send ((void *)&bufs[irank][1],pcnt[irank],MPI_DOUBLE,
irank, itag, comm);
if (scnt[irank] > 0 )
MPI_Recv ((void *)&bufr[irank][1],scnt[irank],MPI_DOUBLE,
}
for (i=1; i<=scnt[irank]; i++)
vn[ snode[irank][i] ]=bufr[irank][i];
/*
for (ie=1; ie<=ne; ie++) { */

for (j=0; j<4; j++) {
k= index[ie][j];
ve[ie] = ve[ie] + vn[k];
}
}
/*
for (ie=1; ie<=ne; ie++) */

for (ie=istart; ie<=iend; ie++)
ve[ie] = ve[ie] *0.25;
}
itime loop CPU CPU 0 CPU 0
MPI_Barrier(comm);
for (i=1; i<=ncntg[myid]; i++)
bufs[myid][i]=vn[ nodeg[myid][i] ];
itag=30;
if (myid == 0)
MPI_Recv ((void *)&bufr[irank][1], ncntg[irank], MPI_DOUBLE,
else
203
MPI_Send ((void *)&bufs[myid][1],ncntg[myid],MPI_DOUBLE, 0, itag, comm);

if (myid == 0)
for (i=1; i<=ncntg[irank]; i++)
vn[ nodeg[irank][i] ]=bufr[irank][i];
itag=40;
if (myid == 0)
MPI_Recv ((void *)&ve[ estartg[irank] ], ecntg[irank], MPI_DOUBLE,
else
MPI_Send ((void *)&ve[istart],count,MPI_DOUBLE, 0, itag, comm);
MPI_Barrier(comm);
if (myid == 0) {
printf("result of vn\n");
for (i=1; i<=nn; i+=7)
printf(" %.3f %.3f %.3f %.3f %.3f %.3f %.3f\n",
vn[i],vn[i+1],vn[i+2],vn[i+3],vn[i+4],vn[i+5],vn[i+6]);
printf("result of ve\n");
for (i=1; i<=ne; i+=6)
printf(" %.3f %.3f %.3f %.3f %.3f %.3f\n",
ve[i],ve[i+1],ve[i+2],ve[i+3],ve[i+4],ve[i+5]);
}
MPI_Finalize();
return 0;
%f\n", myid,clock);
}
{
ilength=is2-is1+1;
if(myid < ir) {
204
}
else {
}
if(ilength < 1) {
*istart=1;
*iend=0;
}
}
femp IBM SP2 SMP CPU fem_seq
ATTENTION: 0031-408
itg values for irank= 0

1111
1
2
7 12
13 18
1111
1111
0000
0000
0000
0000
0000
0000
1111
1111
1111
0000
0000
0000
0000
205
0000
0000
1111
1111
1111
nodeg values for irank,k= 0 12

1234
5678
9 10 11 12
nodeg values for irank,k= 1
13 14 15 16
17 18 19 20
nodeg values for irank,k= 2
21 22 23 24
25 26 27 28
result of vn
303.506 737.138 743.620 309.989 905.479 2197.706 2214.970
922.743 1476.091 3579.268 3602.639 1499.462 1927.588 4670.236
4695.415 1952.767 2066.994 5005.284 5028.654 2090.365 1655.717
4008.155 4025.419 1672.981 642.193 1554.410 1560.892 648.676
result of ve
1281.497 1823.797 1298.016 2526.136 3591.987 2554.243
3617.916 5139.575 3651.343 4262.652 6051.210 4296.079
3991.965 5664.587 4020.072 2474.166 3510.129 2490.686
206

1. Tutorial on MPI : The Message-Passing Interface
By William Gropp, Mathematics and Computer Science Division, Argonne National Laboratory,
gropp@mcs.anl.gov
2. MPI in Practice
by William Gropp, Mathematics and Computer Science Division, Argonne National Laboratory,
gropp@mcs.anl.gov
3. A Users Guide to MPI
by Peters S. Pacheco, Department of Mathematics, University of San Francisco, peter@usfca.edu
4. Parallel Programming Using MPI
by J.M.Chuang, Department of Mechanical Engineering, Dalhousie University, Canada
chuangjm@newton.ccs.tuns.ca
5. RS/6000 SP : Practical MPI Programming
IBM International Technical Support Organization,
http://www.redbooks.ibm.com
207
Parallel Processing of 1-D Arrays without Partition

i=1 . . . . . . 50
float a[200], b[200], c[200], d[200]
P0
float a[200], b[200], c[200], d[200]
P1
float a[200], b[200], c[200], d[200]
P2
float a[200], b[200], c[200], d[200]
P3
i=51 . . . . 100
i=101 . . . 150
i=151 . . . 200
for (i=0; i<200; i++)

a[i]=b[i] + c[i]*d[i];

a[i]=b[i] + c[i]*d[i];
208
Parallel Processing of 1-D Arrays with Partition
i=1 . . . . . . 50
i=(1 . . . . . 50)
1 . . . . . . 50
(51 . . . 100)
1 . . . . . . 50
(101 . . . 150)
1 . . . . . . 50
Local index
(151 . . 200)
float a[50]
Global index
float b[50]
float c[50]
float d[50]
P0
P1
for (i=0; i<200; i++)

a[i]=b[i] + c[i]*d[i];
P2
P3

a[i]=b[i] + c[i]*d[i];
209
Parallel on the 1st Dimension of 2-D Arrays without Partition
i=151 . . . . . . . . .200
i=101 . . . . . . . . . 150
i=51 . . . . . . . . . . 100
j=8
.
j=1
i=1 . . . . . . . . . . . .50
x[i][j]
x[200][8]
x[200][8]
x[200][8]
x[200][8]
y[200][8]
z[200][8]
y[200][8]
z[200][8]
y[200][8]
z[200][8]
y[200][8]
z[200][8]
P0
P1
P2
P3
210
Parallel on the 1st Dimension of 2-D Arrays with Partition

global index
i=(1 . . . . . . . . . . . .50
i=1 . . . . . . . . . . . . 50
51
100
i=1 . . . . . . . . . . . . 50
101
150
i=1 . . . . . . . . . . . . 50
151
200)
i=1 . . . . . . . . . . . . 50
local index
x[50] [8]
y[50] [8]
z[50] [8]
P0
x[50] [8]
y[50] [8]
z[50] [8]
P1
x[50] [8]
y[50] [8]
z[50] [8]
P2
x[50] [8]
y[50] [8]
z[50] [8]
P3
Sequential Version
X[200][8], y[200] [8], z[200] [8]
211
Partition on the 1st dimension of 3-D Arrays

i=50
i=50(100)
i=1 (51)
i=1
i=50(150)
i=1 (101)
i=50(200)
i=1 (151)
j=24
j=24
j=24
j=24
.
.
j=1
k=1 . . . . . . 8
.
.
j=1
k=1 . . . . . . . 8
.
.
j=1
k=1 . . . . . . .8
.
.
j=1
k=1 . . . . . . 8
Global index
x[i][j][k]
x[50][24][8]
y[50][24][8]
x[50][24][8]
y[50][24][8]
x[50][24][8]
y[50][24][8]
x[50][24][8]
y[50][24][8]
z[50][24][8]
z[50][24][8]
z[50][24][8]
z[50][24][8]
P0
P1
P2
P3
Sequential Version :
x[200][24][8], y[200][24][8], z[200][24][8]
212
for (i=0; i<m; i++)

for (j=0; j<n; j++)
y[i][j]=0.25*( x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1] ) + h*f[i][j];
for (i=0; i<m; i++)

for (j=0; j<n; j++)
x[i][j]=0.25*( x[i-1][j] + x[i+1][j] + x[i][j-1] +x [i][j+1]) + h*f[i][j];
This kind of loops is called DATA RECURSIVE

If it is IMPLICIT scheme and CONVERGES within SUBDOMAIN,
Exchange the boundary data after each Iteration
213

Mpic 2002 Ebook

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Mpic 2002 Ebook

Загружено:

Авторское право:

Доступные форматы

C MPI

: (03) 5776085 x 305

1.4.1 PC Cluster C MPI ...............................................11

2.1.4 MPI_Send, MPI_Recv ......................................................................................17

2.4.2 MPI_Reduce, MPI_Allreduce ..........................................................................29

5.4 MPI .....................................................................................91

IBM SP2 MPI

MPI (Message Passing Interface) Message Passing

IBM SP2IBM SP2 SMPHP SPP2000SGI Origin2000 Fujitsu

IBM SP2 MPI

C shell home directory .cshrc

1.3.1 IBM SP2 MPI C

mpcc -O3 -qarch=auto -qstrict -o file.x file.f

1.3.2 IBM SP2 Job command file

= /usr/bin/poe poe Parallel Operating Environment

CPU short class 4 CPUmedium class 32 CPU

IBM SP2 SMP (ivory) LoadLeveler job command

(CPU 24 32 Node 128 CPU)

CPU medium class 16 Node 64 CPU class

1.3.3 IBM SP2

job_id llq llcancel

MPICH C shell home directory .cshrc

MPICH C mpicc GNU gcc

PC Cluster Job command file

DQS node list

PC cluster DQS job command file

----Pending Jobs -----------------------------------------------------------------------------------------c00tch00 RAD5

job_id qstat qdel qstat

(sequential program) MPI

2.1.1 mpi.h include file

2.1.2 MPI_Init, MPI_Finalize

2.1.3 MPI_Comm_size, MPI_Comm_rank

2.1.4 MPI_Send, MPI_Recv

CPU CPU '' CPU

MPI data type

istat mpi.h MPI_STATUS_SIZE

MPI_Recv( (void *)&buff, icount, DATA_TYPE, MPI_ANY_SOURCE, itag,

T2SEQ test data generation bcd

for loop loop

test data generation and write out to file

compute and write out the result

for (i = 0; i < n; i++) {

fp = fopen( "input.dat", "r");

T2SEQ ? '' 2.5

array element inside territory

array element outside territory

myid CPU0 CPU

suma, a[n], b[n], c[n], d[n];

READ 'input.dat', COMPUTE AND WRITE OUT THE RESULT */

fp = fopen( "input.dat", "r");

SPMD (Single Program Multiple Data) CPU

MPI_Gather MPI_Scatter idest CPU CPU a

2.4.2 MPI_Reduce, MPI_Allreduce

sumall CPU suma

2.1 MPI Reduction Function

ntotal np dimension ntotal n :

Data & Computational Partition Using MPI_Scatter, MPI_Gather

read input data and distribute input data

4 nodes allocated by LoadLeveler, continuing...

MPI_Sendrecv '' MPI_Bcast ''

3.1.1 MPI_ Sendrecv

MPI_Bcast ( (void *)&b, icount, DATA_TYPE, iroot, comm);

T3SEQ for loop a[i] c[i]d[i]