
Lecture 5

Multi-GPU computing with CUDA and MPI
Tobias Brandvik

The story so far

Getting started (Pullan)
An introduction to CUDA for science (Pullan)
Developing kernels I (Gratton)
Developing kernels II (Gratton)
CUDA with multiple GPUs (Brandvik)
Medical imaging registration (Ansorge)

Agenda

MPI overview
The MPI programming model
Heat conduction example (CPU)
MPI and CUDA
Heat conduction example (GPU)
Performance measurements

MPI overview

MPI is a specification of a Message Passing Interface
The specification is a set of functions with prescribed behaviour
It is not a library: there are multiple competing implementations of the specification
Two popular open-source implementations are Open-MPI and MPICH2
Most MPI implementations from vendors are customized versions of these

Why use MPI?

Performance
Scalability
Stability

What hardware does MPI run on?

Distributed memory clusters
MPI's popularity is in large part due to the rise of cheap clusters with commodity x86 nodes over the last 15 years
Ethernet or InfiniBand interconnects
Shared memory
Some MPI implementations are also suitable for multi-core shared memory machines (e.g. high-end desktops)

MPI programming model

An MPI program consists of several processes
Each process can execute different instructions
Each process has its own memory space
Processes can only communicate by sending messages to each other

MPI programming model

[Diagram: a communicator containing Rank 0 and Rank 1, each process with its own memory, each running on a separate CPU]

Rank: a unique integer identifier for a process
Communicator: the collection of processes which may communicate with each other


A simple example in pseudo-code

We want to copy an array from one processor to another


rank 0:
float a[10]; float b[10];
recv(b, 10, float, 1, 200)
send(a, 10, float, 1, 300)
wait()

rank 1:
float a[10]; float b[10];
recv(b, 10, float, 0, 300)
send(a, 10, float, 0, 200)
wait()

A simple example

In the pseudo-code above, each recv and send call takes, in order:
  the memory location of the buffer
  the message length (number of elements)
  the datatype of the elements
  the rank of the other process (the sending rank for recv, the destination rank for send)
  the message tag

The only 7 MPI functions you'll ever need

MPI-1 has more than 100 functions
But most applications only use a small subset of these
In fact, you can write production code using only 7 MPI functions
But you'll probably use a few more

The only 7 MPI functions you'll ever need

MPI_Init
MPI_Comm_size
MPI_Comm_rank
MPI_Isend
MPI_Irecv
MPI_Waitall
MPI_Finalize

The MPI specification is defined for C, C++ and Fortran; we'll consider the C function prototypes

A closer look at the functions

int MPI_Init(int *argc, char ***argv)
Initialises the MPI execution environment

int MPI_Comm_size(MPI_Comm comm, int *size)
Determines the size of the group associated with a communicator

int MPI_Comm_rank(MPI_Comm comm, int *rank)
Determines the rank of the calling process in the communicator

int MPI_Finalize()
Terminates the MPI execution environment

A minimal sketch using these four calls is shown below.
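As an illustration only, here is a minimal sketch of a program that uses just these four calls; the printed message is not part of the lecture material.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int mpi_rank, mpi_size;

      /* Initialise the MPI execution environment */
      MPI_Init(&argc, &argv);

      /* Find out how many processes there are and which one we are */
      MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
      MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

      printf("Process %d of %d\n", mpi_rank, mpi_size);

      /* Terminate the MPI execution environment */
      MPI_Finalize();
      return 0;
  }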

A closer look at the functions

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
buf: memory location for the message
count: number of elements in the message
datatype: type of elements in the message (e.g. MPI_FLOAT)
source: rank of the source
tag: message tag
comm: communicator
request: communication request (used for checking message status)

A closer look at the functions

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
buf: memory location for the message
count: number of elements in the message
datatype: type of elements in the message (e.g. MPI_FLOAT)
dest: rank of the destination
tag: message tag
comm: communicator
request: communication request (used for checking message status)

The structure of an MPI program


Startup
  MPI_Init
  MPI_Comm_size / MPI_Comm_rank
  Read in and initialise data based on the process rank

Inner loop
  Post all receives: MPI_Irecv
  Post all sends: MPI_Isend
  Wait for message passing to finish: MPI_Waitall
  Perform computation

End
  Write out data
  MPI_Finalize

An actual MPI program


#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Request req_in, req_out;
    MPI_Status stat_in, stat_out;
    float a[10], b[10];
    int mpi_rank, mpi_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    if (mpi_rank == 0) {
        MPI_Irecv(b, 10, MPI_FLOAT, 1, 200, MPI_COMM_WORLD, &req_in);
        MPI_Isend(a, 10, MPI_FLOAT, 1, 300, MPI_COMM_WORLD, &req_out);
    }
    if (mpi_rank == 1) {
        MPI_Irecv(b, 10, MPI_FLOAT, 0, 300, MPI_COMM_WORLD, &req_in);
        MPI_Isend(a, 10, MPI_FLOAT, 0, 200, MPI_COMM_WORLD, &req_out);
    }

    /* Wait for both the receive and the send to complete */
    MPI_Waitall(1, &req_in, &stat_in);
    MPI_Waitall(1, &req_out, &stat_out);

    MPI_Finalize();
    return 0;
}


Compiling and running MPI programs

MPI implementations provide wrappers for popular compilers
These are normally named mpicc/mpicxx/mpif77 etc.
An MPI program is normally run through mpirun: mpirun -np N ./a.out
So, for the previous example:
  mpicc mpi_example.c
  mpirun -np 2 ./a.out
These commands are for Open-MPI; others may differ slightly

Heat conduction example (CPU)

We'll modify the heat conduction example from earlier to work with multiple CPUs

2D heat conduction

In 2D:

$$\frac{\partial T}{\partial t} = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}$$

For which a possible finite difference approximation is:

$$\frac{\Delta T}{\Delta t} = \frac{T_{i+1,j} - 2T_{i,j} + T_{i-1,j}}{\Delta x^2} + \frac{T_{i,j+1} - 2T_{i,j} + T_{i,j-1}}{\Delta y^2}$$


where ΔT is the temperature change over a time step Δt and i, j are indices into a uniform structured grid (see next slide)
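Rearranging for the updated temperature gives the forward Euler step that a kernel would apply at each interior grid point (writing T^new for the value after the step):

$$T^{\mathrm{new}}_{i,j} = T_{i,j} + \Delta t \left( \frac{T_{i+1,j} - 2T_{i,j} + T_{i-1,j}}{\Delta x^2} + \frac{T_{i,j+1} - 2T_{i,j} + T_{i,j-1}}{\Delta y^2} \right)$$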

Stencil

Update red point using data from blue points (and red point)
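To make the stencil concrete, here is a minimal sketch of what a serial step_kernel might look like. The names temp_in/temp_out, the grid dimensions ni/nj and the I2D indexing macro are assumptions for illustration, not the code used later in the lecture.

  /* Assumed row-major indexing macro: element (i, j) of an ni x nj grid */
  #define I2D(ni, i, j) ((j) * (ni) + (i))

  /* One forward Euler step of the 2D heat equation over the interior points */
  void step_kernel(int ni, int nj, float dt, float dx, float dy,
                   const float *temp_in, float *temp_out)
  {
      for (int j = 1; j < nj - 1; j++) {
          for (int i = 1; i < ni - 1; i++) {
              float d2tdx2 = (temp_in[I2D(ni, i + 1, j)] - 2.0f * temp_in[I2D(ni, i, j)]
                              + temp_in[I2D(ni, i - 1, j)]) / (dx * dx);
              float d2tdy2 = (temp_in[I2D(ni, i, j + 1)] - 2.0f * temp_in[I2D(ni, i, j)]
                              + temp_in[I2D(ni, i, j - 1)]) / (dy * dy);
              temp_out[I2D(ni, i, j)] = temp_in[I2D(ni, i, j)] + dt * (d2tdx2 + d2tdy2);
          }
      }
  }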

Finding more parallelism

In the previous lectures, we have tried to find enough parallelism in the problems for 1000s of threads
This is fine-grained parallelism
For MPI, we need another level of parallelism on top of this
This is coarse-grained parallelism

Domain decomposition and halos

The fictitious boundary nodes are called halos

Message passing pattern

The left-most rank sends data to the right
The inner ranks send data to both the left and the right
The right-most rank sends data to the left

[Diagram: three subdomains side by side, Rank 0, Rank 1 and Rank 2, exchanging halo data with their neighbours]
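One common way to express this pattern without special-casing the end ranks (an alternative to the three-way case split shown later) is to address non-existent neighbours as MPI_PROC_NULL, to which sends and receives complete immediately and do nothing. A minimal sketch, assuming mpi_rank, mpi_size and a buffer length N are defined as before:

  /* Assumed halo buffers of length N on each rank */
  float out_left[N], out_right[N], in_left[N], in_right[N];
  MPI_Request req[4];
  MPI_Status stats[4];

  /* Determine left and right neighbours; MPI_PROC_NULL means "no neighbour" */
  int left  = (mpi_rank == 0)            ? MPI_PROC_NULL : mpi_rank - 1;
  int right = (mpi_rank == mpi_size - 1) ? MPI_PROC_NULL : mpi_rank + 1;

  /* Every rank posts the same calls; those addressed to MPI_PROC_NULL are no-ops */
  MPI_Irecv(in_left,   N, MPI_FLOAT, left,  100, MPI_COMM_WORLD, &req[0]);
  MPI_Irecv(in_right,  N, MPI_FLOAT, right, 101, MPI_COMM_WORLD, &req[1]);
  MPI_Isend(out_left,  N, MPI_FLOAT, left,  101, MPI_COMM_WORLD, &req[2]);
  MPI_Isend(out_right, N, MPI_FLOAT, right, 100, MPI_COMM_WORLD, &req[3]);
  MPI_Waitall(4, req, stats);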

Message buffers

MPI can read and write directly from 2D arrays using an advanced feature called datatypes (but this is complicated and doesn't work for GPUs)
Instead, we use 1D incoming and outgoing buffers
The message-passing strategy is then:
  Fill the outgoing buffers (2D -> 1D)
  Send from the outgoing buffers, receive into the incoming buffers
  Wait
  Fill the arrays from the incoming buffers (1D -> 2D)
A sketch of the packing and unpacking steps is given below.
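As an illustration, here is a minimal sketch of packing and unpacking helpers for the right-hand edge of a rank's subdomain, using the I2D macro from the earlier sketch and assuming the global domain is split along the i direction with one halo column on each side of the local array. The function and array names are assumptions, not the lecture's actual code.

  /* Copy the right-most interior column (i = ni - 2) into a 1D outgoing buffer */
  void fill_out_buffer_right(int ni, int nj, const float *temp, float *out_right)
  {
      for (int j = 0; j < nj; j++)
          out_right[j] = temp[I2D(ni, ni - 2, j)];
  }

  /* Copy a 1D incoming buffer into the right-hand halo column (i = ni - 1) */
  void empty_in_buffer_right(int ni, int nj, float *temp, const float *in_right)
  {
      for (int j = 0; j < nj; j++)
          temp[I2D(ni, ni - 1, j)] = in_right[j];
  }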

Heat conduction example (single CPU)


for (i = 0; i < nstep; i++) {
    step_kernel();
}

Heat conduction example (multi-CPU)


for (i = 0; i < nstep; i++) {
    fill_out_buffers();
    if (mpi_rank == 0) {                             // left
        receive_right(); send_right();
    }
    if (mpi_rank > 0 && mpi_rank < mpi_size-1) {     // inner
        receive_left(); receive_right();
        send_left(); send_right();
    }
    if (mpi_rank == mpi_size-1) {                    // right
        receive_left(); send_left();
    }
    wait_all();
    empty_in_buffers();
    step_kernel();
}

Heat conduction example (multi-GPU)

How does all this work when we use GPUs?
Just like with CPUs, except we need buffers on both the CPU and the GPU
Use one MPI process per GPU (a device-selection sketch follows)
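One common way to pair each MPI process with a GPU is to select a device based on the process rank. This is a minimal sketch under the assumption that the ranks on a node share the devices evenly; the modulo mapping is an illustration, not part of the lecture.

  #include <cuda_runtime.h>

  /* Give each MPI process its own GPU, cycling through the devices on the node */
  void select_device(int mpi_rank)
  {
      int num_devices = 0;
      cudaGetDeviceCount(&num_devices);
      if (num_devices > 0)
          cudaSetDevice(mpi_rank % num_devices);
  }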

Message buffers with GPUs

The message-passing strategy with GPUs is:
  Fill the outgoing buffers on the GPU using a kernel (2D -> 1D)
  Copy the buffers to the CPU - cudaMemcpy(DeviceToHost)
  Send from the outgoing buffers, receive into the incoming buffers
  Wait
  Copy the buffers to the GPU - cudaMemcpy(HostToDevice)
  Fill the arrays from the incoming buffers on the GPU using a kernel (1D -> 2D)
A sketch of the GPU-side packing step is given below.
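As an illustration, a minimal .cu sketch of the GPU-side packing step and the copy back to the host, mirroring the CPU helper above; the kernel name, array names and block size are assumptions for illustration.

  /* Pack the right-most interior column into a 1D outgoing buffer on the device */
  __global__ void fill_out_buffer_right_gpu(int ni, int nj,
                                            const float *temp_d, float *out_right_d)
  {
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (j < nj)
          out_right_d[j] = temp_d[j * ni + (ni - 2)];
  }

  /* Host side: launch the packing kernel, then copy the buffer to the CPU for MPI */
  void pack_and_copy_right(int ni, int nj, const float *temp_d,
                           float *out_right_d, float *out_right_h)
  {
      int threads = 256;
      int blocks = (nj + threads - 1) / threads;
      fill_out_buffer_right_gpu<<<blocks, threads>>>(ni, nj, temp_d, out_right_d);
      cudaMemcpy(out_right_h, out_right_d, nj * sizeof(float), cudaMemcpyDeviceToHost);
  }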

Heat conduction example (multi-GPU)


for (i = 0; i < nstep; i++) {
    fill_out_buffers_cpu();
    recv(); send(); wait();
    empty_in_buffers_cpu();
    step_kernel_cpu();
}

Heat conduction example (multi-GPU)


for (i = 0; i < nstep; i++) {
    fill_out_buffers_gpu();       // (2D -> 1D)
    cudaMemcpy(DeviceToHost);
    recv(); send(); wait();
    cudaMemcpy(HostToDevice);
    empty_in_buffers_gpu();       // (1D -> 2D)
    step_kernel_gpu();
}

Compiling code with CUDA and MPI

Can use a .cu file and use nvcc like before, but need to include the MPI headers and library:
  nvcc mpi_example.cu -I $HOME/open-mpi/include -L $HOME/open-mpi/lib -lmpi
Or, compile the C code with mpicc and the CUDA code with nvcc and link the results together into an executable (see the sketch below)
For simple examples, the first approach is fine, but for complicated applications the second approach is cleaner
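As an illustration of the second approach, one possible compile-and-link sequence; the file names, the CUDA library path and the use of mpicc as the linker are assumptions, and some installations may also require extra libraries such as -lstdc++:

  nvcc -c gpu_kernels.cu -o gpu_kernels.o
  mpicc -c main.c -o main.o
  mpicc main.o gpu_kernels.o -L/usr/local/cuda/lib64 -lcudart -o heat_mpi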

Scaling performance

When benchmarking MPI applications, we look at two issues:
Strong scaling: how well does the application scale with multiple processors for a fixed total problem size?
Weak scaling: how well does the application scale with multiple processors for a fixed problem size per processor?

GPU scaling issues

Achieving good scaling is more difficult with GPUs for two reasons:
1. There is an extra memory copy (cudaMemcpy) involved for every message
2. The kernels are much faster, so the MPI communication becomes a larger fraction of the overall runtime

Typical scaling experience


[Figure: two plots, strong scaling and weak scaling, showing Performance against Procs for Ideal, CPU and GPU]


Summary

MPI is a good approach to parallelism on distributed memory machines
It uses an explicit message-passing model
Grid problems can be solved in parallel by using halo nodes
You don't need to change your kernels to use MPI, but you will need to add the message-passing logic
Using MPI and CUDA together can be done by using both host and device message buffers
Achieving good scaling is more difficult since the kernels are faster on the GPU
