
INTRODUCTION TO PARALLEL

COMPUTING

Aamir Shafi
Khizar Hussain

OUTLINE
• Who are we?
• Why are we giving this talk?
• Parallel Computer Organization
  • SMP Technologies
  • Distributed Memory Technologies
• Optional Exercise


OBJECTIVES

• Provide a general overview of programming frameworks for parallel computing:
  • Shared Memory
  • Distributed Memory
• Introduce MPI (Message Passing Interface) and MPJ Express as an implementation of the MPI standard.
• Get the user up to speed with the fundamentals of parallel programming with MPI.
• Run some MPI code.

OUTLINE
• Who are we?
• Why are we giving this talk?
• Parallel Computer Organization
  • SMP Technologies
  • Distributed Memory Technologies





PARALLEL PROGRAMMING TECHNOLOGIES
Parallel Computer Organization
• Traditional dichotomy in parallel computing:
  • Shared Memory vs. Distributed Memory
  • Symmetric Multiprocessing vs. Clustered Computing

[Figure: shared memory – processors 0–3 attached to a single system memory; distributed memory – individual nodes connected by a fast interconnect.]


SMP Programming Techniques / Frameworks

• Programs are parallelized using threads or processes running on multiple cores through:
  o Explicit threading – POSIX, C++11, Java threads, etc.
  o Compiler directives / pragmas – OpenMP
  o Language extensions – Cilk Plus, Threading Building Blocks (Intel)
Threads
• Traditional languages like C/C++ use threading libraries such as POSIX threads (pthreads) or the native C++11 threads to achieve concurrency.

• Higher-level languages like Java provide native threading classes and interfaces. C# offers native threads as well as higher-level abstractions such as Task and BackgroundWorker to make things simpler.

• On multicore systems, the scheduling of threads onto different cores is handled automatically, enabling parallel execution (see the Java sketch below).
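
For example, a minimal sketch of explicit threading in Java (class and variable names here are illustrative, not part of the lab material):

public class ThreadExample {
    public static void main(String[] args) throws InterruptedException {
        int numThreads = 4;
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            // Each thread runs the same task; on a multicore machine the JVM/OS
            // schedules the threads onto different cores automatically.
            threads[t] = new Thread(() -> System.out.println("Hello from thread " + id));
            threads[t].start();
        }
        // Wait for all threads to finish before main() returns.
        for (Thread th : threads) th.join();
    }
}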
OpenMP
• OpenMP is a set of compiler directives along with runtime libraries for exploiting shared memory computers.

• Bindings are available for C/C++ and Fortran.

• Compiler directives (pragmas) mark the segments of code to be parallelized; the low-level threading is handled automatically by the compiler.

• No need to create threads explicitly.


Extensions
• Cilk Plus, a set of C/C++ linguistic extensions:
  • Adds support for fork-join parallelism.
  • Also features vector parallelism.
  • Provides a runtime library and global variables for parallelization.
  • Utilities such as the Cilkscreen race detector and the Cilkview scalability analyzer.

• Threading Building Blocks (Intel), a C++ template library:
  • Breaks down computation into tasks.
  • Features a work-stealing scheduler.
  • Manages and schedules threads to execute each task as required.
Limits of Shared Memory
• Shared Memory programming grew in popularity due to the rise of
multicore processors in recent years.

• Greatest drawback is limited scalability:


o Only a single node is responsible for processing.
o Fixed number of processor-cores depending on node.
o Limited common memory divided among cores.

• The search for scalability paved the path for Distributed Memory
systems.
Distributed Memory Systems
• A number of nodes connected to each other through a high
throughput interconnect.
• Each node has its own local memory and, optionally, auxiliary coprocessors / accelerators.
• Nodes are connected in different configurations.
• Nodes communicate with each other through message passing.
A Simple Cluster

[Figure: four similar workstations (P0–P3), each with its own CPU and memory, connected by a LAN.]

Total memory available is four times that of an individual workstation, but it is distributed over the group.
A Continuum
• A range of systems and configurations is available for distributed-memory parallel computation (in increasing scale):
  • Ad hoc groupings of workstations on a TCP/IP LAN in a lab
  • Clusters of general-purpose computers dedicated to parallel computing (Beowulf clusters)
  • Supercomputers with thousands of nodes on custom interconnects
Transition from Threads to Processes
• Move from cooperating threads to cooperating processes.

• Each process has private memory – no more sharing of memory.

• The change is dictated by the physical hardware: each node has its own local memory.

• Analogous to the threads vs. processes distinction within an individual operating system.
Cluster Programming Frameworks
• Cooperating processes communicate using:
  o MPI – Message Passing Interface (and older systems such as PVM)
  o Co-array communication – Co-array Fortran, UPC, etc.
  o Global Arrays Toolkit
• We will stick to MPI for all our intents and purposes.
Cluster Programming Drawbacks

• Programming cooperating processes is harder.

• Explicit inter-process communication between nodes is unavoidable.

• Message passing is required for inter-node communication.



OUTLINE
• Who are we?
• Why are we giving this talk?
• Parallel Computer Organization
  • SMP Technologies
  • Distributed Memory Technologies





QUICK RECAP OF JAVA
The Java Programming Language

• Object-oriented programming language.

• C-like syntax.

• Portable – “Write Once, Run Anywhere” philosophy.

• Developed by James Gosling – released in 1995.

• One of the most used programming languages according to GitHub.


Hello World in Java
public class HelloWorld {

    public static void main(String[] args) {

        int x = 10;
        int[] arr = new int[x];

        for (int i = 0; i < arr.length; i++) {
            arr[i] = i;
            System.out.println(arr[i]);
        }

        System.out.println("Hello world.");
    }
}
TASK#1:
EXECUTE HELLO WORLD IN JAVA

• The code for Hello World is provided in the file HelloWorld.java
• Compile the code using the following command:
  javac HelloWorld.java
• Run the program using the following command:
  java HelloWorld

Expected Output











MESSAGE PASSING INTERFACE
SPMD Programming
• Single Program, Multiple Data.
• Participating nodes (processes) run the same program.
• Each operates on its own local memory contents.
• They communicate using MPI.
The Message Passing Interface
• MPI – Message Passing Interface:
  a programming interface for sending and receiving messages between programs.

• API defined for C/C++ and Fortran.
  (Unofficial bindings for Java and Python are available.)

• Most popular framework for SPMD programming.


MPI PROGRAMMING MODEL
MPI Programming Model
• A fixed number of processes (say P) participate in a parallel program.

• “The World”
  – The set of all participating processes.
  – Often (but not always) equal to the number of available physical nodes.

• Uses the SPMD model.


MPI Execution Model

[Figure: six processes with ranks 0–5.]
• Rank varies from 0 to 5.
• Size is 6.
MPJ EXPRESS AS AN MPI LIBRARY
MPJ Express
• Message Passing Interface for Java.

• Unofficial Java API for MPI.

• MPJ and MPJ Express are slightly different:
  • We shall refer to both as MPJ.
  • The labs will use MPJ Express.

• The MPJ API adheres strictly to the official MPI standard unless explicitly noted otherwise for practical reasons.
MPJ Fundamentals
• Every MPJ program is a Java application.
• Execution begins from the main() method.

• MPJ programs must begin with a call to init():
  • MPI.Init(args)
  • args: the command-line arguments passed to main().

• MPJ programs terminate with a call to finalize():
  • MPI.Finalize()
  • Takes no arguments.
MPJ Hello World
import mpi.* ;
public class HelloMPJ {
public static void main(String args[]) throws Exception {
MPI.Init(args) ;
int P = MPI.COMM_WORLD.Size() ;
int me = MPI.COMM_WORLD.Rank() ;
System.out.println("Hi from <" + me + ">") ;
MPI.Finalize() ;
}
}
MPJ World

• A Java object accessed using MPI.COMM_WORLD:
  • The MPI communicator that spans the whole world (all processes).
  • An instance of the Comm class.
  • Used to call most of the useful communication methods:
    • Send, Recv, Sendrecv, Bcast, etc.
MPJ Process Rank

• Each MPI process (processor/node) is distinguished by a unique rank:
  • Equivalent to a numeric identifier, such as a thread id.
• Assuming the size of the world is P:
  • 0 <= rank < P, i.e. 0 <= rank <= P-1.
  • Ranks essentially run from 0 to P-1.
• Strictly, a rank is relative to the world:
  • Ranks within a subgroup can also be defined.
World Size and Process Rank

• Size of the world (# of processes/nodes) not defined by program.


• Specified in the command used to run the program.
• Interrogate size of the world by calling
• MPI.COMM_WORLD.Size()
• Interrogate current process rank by calling
• MPI.COMM_WORLD.Rank()
MPI Program in Execution
• A pool of available host computers, each running an MPI service or MPI daemon.
• A client program, e.g. mpirun, connects to P daemons and asks them to start processes.

[Figure: mpirun contacts the MPI service on each host; a Hello World process starts on each of the P hosts.]


Running MPJ Programs
• We shall illustrate this using the Hello World program exhibited on the previous slides.
• At the command prompt on the client computer, you might run:
  > mpjrun -np 4 HelloMPJ
  Hi from <1>
  Hi from <3>
  Hi from <2>
  Hi from <0>

• Of course, there is no particular order to the output from the 4 processes.

• Instructions on how MPJ Express and the JDK are set up will be provided separately.
TASK#2:
EXECUTE MPJ HELLO WORLD

• The code is provided in the file HelloMPJ.java
• Compile the code using the following command:
  javac -cp .;%MPJ_HOME%/lib/mpj.jar HelloMPJ.java
• Run the program using the following command:
  mpjrun -np 4 HelloMPJ

Expected Output


MPJ Express Configurations
MPJ Express Configurations

[Figure: MPJ Express configurations – Cluster mode (niodev, mxdev, hybdev, native) and Multicore mode (smpdev).]
MPJ Express Configurations
• Cluster mode:
  • Used for parallelisation over multiple nodes.
  • Requires a machines file.
  • Run command: mpjrun.sh -np 2 -dev niodev HelloWorld

• Multicore mode:
  • Used for parallelisation on multiple cores of a single node.
  • Run command: mpjrun.sh -np 2 HelloWorld











PEER TO PEER COMMUNICATION USING MPI
Send, Recv and Sendrecv

[Figure: six processes with ranks 0–5; size is 6.]

• What if node 1 wants to send something to node 2?
  OR
• How can nodes 4 and 5 exchange data concurrently without blocking?

• The solution to such problems is peer-to-peer communication.
• MPJ provides several methods, such as:
  • Send
  • Recv
  • Sendrecv
Sending a Message
• Simplest method for sending a message:
  Send(buffer, offset, count, type, dest, tag)
  • buffer: array in the user’s program (a Java array in this case)
  • offset: index of the first element to be sent
  • count: number of elements to be sent, starting at the offset index
  • type: the type of the elements in the buffer (int, float, etc.*)
  • dest: destination, i.e. the rank of the process that should receive
  • tag: user-defined code for purpose identification (use 0 for now)
* MPI implementations usually provide an extensive list of compatible types, such as MPI.INT, that need to be used.
Sending a Message

[Figure: a 24-element buffer array (indices 0–23); with offset = 8 and count = 12, elements 8–19 are the ones actually sent.]

• Note that offset is often 0.
• count may take the value 1 to send a single element.
Receiving a Message
• Simplest method for receiving a message:
  Recv(buffer, offset, count, type, src, tag)
  • buffer: array in which the contents of the incoming message will be stored
  • offset: index in the array at which to start storing received elements
  • count: maximum message size that can be received
  • type: the type of the elements expected to be received (int, float, etc.*)
  • src: rank of the process we expect the message to come from
  • tag: user-defined code for purpose identification (use 0 for now)
* MPI implementations usually provide an extensive list of compatible types, such as MPI.INT, that need to be used.
Receiving a Message

[Figure: a 24-element buffer array (indices 0–23) with offset = 8 and count = 12; only the elements actually written are highlighted. This example assumes the actual message received contained fewer elements than the maximum specified by count in the Recv call.]
Simple Send-Recv Program
import mpi.*;

public class SimpleSendRecv {

    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int[] buffer = new int[10];
        int tag = 10;
        int peer;
        int rank = MPI.COMM_WORLD.Rank();

        if (rank == 0) {
            for (int i = 0; i < buffer.length; i++) buffer[i] = i * i;
            peer = 1;
            MPI.COMM_WORLD.Send(buffer, 0, buffer.length, MPI.INT, peer, tag);
        } else if (rank == 1) {
            peer = 0;
            MPI.COMM_WORLD.Recv(buffer, 0, buffer.length, MPI.INT, peer, tag);
            for (int i = 0; i < buffer.length; i++) System.out.println(buffer[i]);
        }
        MPI.Finalize();
    }
}
TASK#3:
PERFORM A SEND AND RECEIVE

• The code is provided in the file SimpleSendRecv.java
• Compile the code using the following command:
  javac -cp .;%MPJ_HOME%/lib/mpj.jar SimpleSendRecv.java
• Run the program using the following command:
  mpjrun -np 4 SimpleSendRecv

Expected Output

OPTIONAL

• Try sending a string from one process to another.


Exchanging Data Using Sendrecv
• A process cannot execute both Send and Recv at the same time.
• This can cause deadlocks in some parallel programs.
• The workaround is the Sendrecv method, which sends and receives simultaneously:
  Sendrecv(sendBuffer, sendOffset, sendCount, sendType, dest, sendTag,
           recvBuffer, recvOffset, recvCount, recvType, src, recvTag)

• The parameters serve the same purposes as in the Send and Recv methods (see the ring-exchange sketch below).
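
As a hedged illustration (not one of the provided lab files), the sketch below uses Sendrecv for a ring exchange: every process sends its rank to its right neighbour and receives from its left neighbour in a single call:

import mpi.*;

public class RingSendrecv {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        int[] sendBuf = new int[] { rank };
        int[] recvBuf = new int[1];

        // Send to the right neighbour and receive from the left neighbour
        // in one call, without risking deadlock.
        MPI.COMM_WORLD.Sendrecv(sendBuf, 0, 1, MPI.INT, right, 0,
                                recvBuf, 0, 1, MPI.INT, left, 0);

        System.out.println("Process " + rank + " received " + recvBuf[0]);
        MPI.Finalize();
    }
}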


Probe for Incoming Messages
• The Probe method is a blocking test for incoming messages.
• It returns a Status object which gives us useful information about the pending message (see the sketch below).
• Probe(source, tag)
  • source: rank of the process expected to send the message
  • tag: the tag associated with the expected message
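
A minimal, illustrative sketch of how Probe might be used (class name is ours, and it assumes the mpiJava-style Status.Get_count(Datatype) method): the receiver probes first, sizes its buffer from the Status, then receives.

import mpi.*;

public class ProbeExample {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int tag = 0;
        if (rank == 0) {
            int[] data = new int[] { 1, 2, 3, 4, 5 };
            MPI.COMM_WORLD.Send(data, 0, data.length, MPI.INT, 1, tag);
        } else if (rank == 1) {
            // Block until a message from rank 0 with this tag is pending.
            Status status = MPI.COMM_WORLD.Probe(0, tag);
            // Ask the Status how many MPI.INT elements the message carries
            // (assumed mpiJava-style Get_count method).
            int count = status.Get_count(MPI.INT);
            int[] buffer = new int[count];
            MPI.COMM_WORLD.Recv(buffer, 0, count, MPI.INT, 0, tag);
            System.out.println("Received " + count + " elements");
        }
        MPI.Finalize();
    }
}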
Behavior of Recv
• When Recv is called, the MPI system looks for any message with matching src and tag values that has already arrived at this node.
• If one is found, the data from the matching message that arrived first is immediately copied into buffer, and Recv completes.
  • Errors may occur if the amount or type of data in that message disagrees with the Recv arguments.
• Otherwise Recv blocks until a matching message arrives from src.
  • Arriving messages with different tag values are ignored.
Ordering of Recv Completions
• MPI delivers messages from a given source to a given destination in the same order that Send was called at the source process.

• But if Recv is posted with a different tag, messages may effectively “overtake” – an earlier message may be ignored until another Recv is posted with a matching tag.
Wildcards
• The src argument can take the special value MPI.ANY_SOURCE, in which case the Recv accepts a message with a matching tag from any source.
• Similarly, the tag argument can take the special value MPI.ANY_TAG, in which case the Recv accepts a message with any tag from the specified source.
• These two wildcards can be combined to accept any incoming message (on the current communicator – see later).
Status Object

• Many MPI calls return a Status object.

• It contains source and tag information.
  (Useful when wildcards are used – see the sketch below.)

• It has methods used to determine how much data was received, etc.
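
A hedged sketch combining the wildcards with the Status object (class name is illustrative, and it assumes the mpiJava-style public source and tag fields on Status):

import mpi.*;

public class WildcardRecv {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int[] buffer = new int[1];
        if (rank == 0) {
            // Accept one message from every other process, in arrival order.
            for (int i = 1; i < MPI.COMM_WORLD.Size(); i++) {
                Status status = MPI.COMM_WORLD.Recv(buffer, 0, 1, MPI.INT,
                                                    MPI.ANY_SOURCE, MPI.ANY_TAG);
                // The Status tells us who actually sent the message (assumed fields).
                System.out.println("Got " + buffer[0] + " from rank " + status.source
                                   + " with tag " + status.tag);
            }
        } else {
            buffer[0] = rank * 100;
            MPI.COMM_WORLD.Send(buffer, 0, 1, MPI.INT, 0, rank);
        }
        MPI.Finalize();
    }
}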
MINI-TASK#1:
SENDRECV

• Incomplete code is provided in the file SendRecv.java
• Complete the code where indicated to get the expected output.
• Compile and run the program using the commands you have learnt.

Expected Output

OPTIONAL
• Try sending a string from one process to another.


DIFFERENT SEND MODES

• Modes of send:
  • Standard
  • Synchronous
  • Ready
  • Buffered

• Eager and Rendezvous protocols are used to implement the various modes.

[Figure: eager protocol – the sender transmits a control message and the actual data to the receiver straight away; rendezvous protocol – the sender first sends a control message to the receiver, waits for an acknowledgement, and only then sends the actual data.]
Behavior of Send
• A Recv call will block if the matching message has not yet been sent.

• In general, send operations may also block if:
  • the corresponding Recv has not yet been initiated (“posted”), and
  • the local MPI system does not have enough memory available to buffer the sent message internally.
Standard Mode Send
• The MPI standard says that the basic “standard mode” Send may or may not block (it depends on the implementer of MPI).
• In typical implementations:
  • short messages are sent immediately;
  • longer messages wait until a Recv is posted at the destination, leading to blocking.
• In other implementations, all calls to Send block until a Recv is posted.
• Moral:
  • all MPI programs should be written with the “pessimistic” assumption that calls to Send block;
  • it is essentially unpredictable whether they will or not.
Buffered Mode Send
• Bsend(buffer, offset, count, type, dest, tag)

• Send should not block in this mode.

• However, it depends on the memory allocated for buffering.

• The buffer is attached using MPI.Buffer_attach.

• Generally an unattractive option.
Other Modes
• Synchronous send: always blocks until matching recv called.
• Ssend(buffer, offset, count, type, dest, tag)

• Ready send: may only be used if program guarantees matching Recv


called ahead of time.
• Rsend(buffer, offset, count, type, dest, tag)
IMMEDIATE COMMUNICATION
NON-BLOCKING AKA IMMEDIATE COMMUNICATION

[Figure: with blocking communication, the sender’s CPU waits inside send() and the receiver’s CPU waits inside recv(). With non-blocking communication, isend() and irecv() return immediately, both CPUs perform other tasks, and each side later calls a wait operation – where the CPU may block – to complete the communication.]
Non-Blocking / Immediate Communication
• Mechanism that separates the initiation of communications from
the stage of waiting for their completion.
• Non-blocking communication, because the initiation operation never
blocks a process.
• Slightly misleading, because all communications must also be
completed, and that stage may or may not block (depending amongst
other things on the mode).
Non-Blocking Send Example
Request req =
    MPI.COMM_WORLD.Isend(buffer, offset, count, type, dest, tag) ;
...
req.Wait() ;

• Immediate communication methods like Isend() return immediately with a Request object.
• To wait for completion, execute the Wait() method on that object.
• The effect of Isend/Wait above is identical to Send, but other work can be done (...) between initiation and waiting.
Non-Blocking Methods
Normal method   Immediate version
Send            Isend
Recv            Irecv
Bsend           Ibsend
Ssend           Issend
Rsend           Irsend

• Each of the basic communication methods has a corresponding immediate method that returns a Request object.
The Wait() Method
• You must call Wait() (or an equivalent) on the request to guarantee completion of the communication.
• Wait() returns a Status object:
  • For immediate send operations, the object can be ignored.
  • For immediate Irecv operations, the object is the same as the Status returned by Recv().
• This is different from the java.lang.Object wait() method used in threading.
The Test() Method
• A non-blocking version of the Wait() method is the Test() method:
  Status status = req.Test();

• Test() returns a Status object:
  • The Status is valid if the communication has completed.
  • The Status is null if the communication has not completed.
Various Completion Methods
public static Status Request.Waitany(Request[] r) ;
public static Status[] Request.Waitsome(Request[] r) ;
public static Status[] Request.Waitall(Request[] r) ;
• Many applications require completing any, some, or all
pending (immediate) communication.
• There are also test variants for these methods:

public static Status Request.Testany(Request[] r) ;


public static Status[] Request.Testsome(Request[] r) ;
public static Status[] Request.Testall(Request[] r) ;
Waitany()
• A static method in the Request class:
  • Input parameter: an array of Request objects.
• Waits for the completion of one communication in the request array – so, for example, you can wait on several possible receives until the first arrives, then act on that before dealing with the other communications later (see the sketch below).
COLLECTIVE COMMUNICATION
USING MPI
Collective Communication
• An important paradigm (supported by MPI) involves all
processes working together to move data between the
memories of the processors.

• This is called collective communication


The Broadcast
• One process needs to send a particular data item (as an array) to every other process in the program.

• Used, for example, to provide input data to the different processes at the start of execution.

• Could be implemented using Send/Recv:
  • Overhead of breaking the broadcast down into individual Send/Recv calls.
  • Less efficient – O(M×P) time complexity.

• We will instead exhibit the broadcast using a collective operation, the Bcast method.


Bcast Method
• Bcast(buffer, offset, count, type, root)
  • buffer + offset + count: the source/destination array, its starting index and the number of elements
  • type: type of the data being communicated
  • root: rank of the process that acts as the source of the broadcast

• Bcast must be called by all participating processes at the same point with consistent arguments (see the sketch below):
  • A barrier can be used to synchronize all processes.
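
A minimal, illustrative Bcast sketch (not the lab’s broadcast.java): rank 0 fills an array and, after the call, every process holds the same contents.

import mpi.*;

public class BcastExample {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int[] data = new int[4];
        if (rank == 0) {
            // Only the root provides the input values.
            for (int i = 0; i < data.length; i++) data[i] = i + 1;
        }
        // Every process calls Bcast with the same arguments; root = 0 is the source.
        MPI.COMM_WORLD.Bcast(data, 0, data.length, MPI.INT, 0);
        System.out.println("Rank " + rank + " has data[3] = " + data[3]);
        MPI.Finalize();
    }
}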
MINI-TASK#2:
BROADCAST

• Incomplete code is provided in the file broadcast.java
• Complete the code where indicated to get the expected output.
• Compile and run the program using the commands you have learnt.

Expected Output
Reduction
• The opposite operation to broadcast.

• Values taken from all processes are reduced to a single value.

• The reduction depends on a combining operation, such as sum, min, max, prod, etc.

[Figure: values 10, 1, 33 and 7 from four processes are combined with MPI.SUM to give 51.]
Reduce Method
Reduce(sendbuf, sendoffset, recvbuf, recvoffset, count, type, op, root)
• If count > 1, the sendbuf arrays from the P processes are combined element by element to produce an array of count results in recvbuf.

• op is the combining operation (see the sketch below):

  MPI.SUM    MPI.PROD    MPI.MAX
  MPI.MIN    MPI.LAND    MPI.BAND
  MPI.LOR    MPI.BOR     MPI.LXOR
  MPI.BXOR   MPI.MINLOC  MPI.MAXLOC
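
A minimal, illustrative Reduce sketch (not the lab’s reduce.java): each process contributes its rank and rank 0 receives the sum.

import mpi.*;

public class ReduceExample {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int[] sendBuf = new int[] { rank };
        int[] recvBuf = new int[1];
        // Element-wise combination with MPI.SUM; only the root's recvBuf holds the result.
        MPI.COMM_WORLD.Reduce(sendBuf, 0, recvBuf, 0, 1, MPI.INT, MPI.SUM, 0);
        if (rank == 0)
            System.out.println("Sum of ranks = " + recvBuf[0]);
        MPI.Finalize();
    }
}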
MINI-TASK#3:
REDUCTION

• Incomplete code is provided in the file reduce.java
• Complete the code where indicated to get the expected output.
• Compile and run the program using the commands you have learnt.

Expected Output
SYNCHRONIZATION IN MPI
Global Synchronization
• Two types of synchronous computations:

• Fully synchronous:
  • All processes are synchronized at regular points, exchange data, etc.
  • All processes involved in the computation are synchronized.

• Locally synchronous:
  • Processes synchronize with a set of nearby processes with logical dependencies.
  • Not all processes involved in the computation need to be synchronized.
Fully Synchronous: Barrier
• A basic mechanism for
synchronizing processes -
inserted at the point in each
process where it must wait.

• All processes can continue


from this point when all the
processes have reached it
(or, in some implementations,
when a stated number of
processes have reached this
point).
MPI.COMM_WORLD.Barrier()
• Called by each process in the world communicator, blocking until all members of the group have reached the barrier call, and only returning then.

• Good barrier implementations must take into account that a barrier might be used more than once in a process.

• It might be possible for a process to enter the barrier for a second time before previous processes have left the barrier for the first time.
Local Synchronization

• Suppose a process Pi needs to synchronize and exchange data with process Pi-1. One could consider using:
  • synchronous send()s, if synchronization as well as data transfer is required; or
  • the deadlock-free blocking Sendrecv().











LAPLACE EQUATION SOLVER
2D-Laplace Equation
• The two-dimensional Laplace equation crops up a lot in mathematics and physics:

  ∇²u = ∂²u/∂x² + ∂²u/∂y² = 0

• We don’t need to worry about where it arises for this simulation.
• We are interested in the “discrete” form:
  • the Laplace equation on a two-dimensional grid of points.
• We shall use the iterative numerical method of relaxation to approximate the values of the unknowns.
Relaxation Method

[Figure: an N×N grid of squares spanning 0 ≤ x ≤ 1, one unknown per square; interior indices run from 1 to N-2. Legend: phi = 1 on the boundary squares shown in red, phi = 0 elsewhere.]

• Imagine an N×N grid of points (a 2D plane).

• Each point on the grid holds the value of an unknown variable, say phi(i,j), where i and j are rectangular coordinates.

• We can store those unknowns in a 2D array with some initial value.
Relaxation Method
• The equation that approximates the values of the unknowns is:

  phi[i][j] = 0.25 * (phi[i-1][j] + phi[i+1][j] + phi[i][j+1] + phi[i][j-1])

• Each unknown is the average of its immediate neighbours.

• The values converge after a large number of iterations (see the serial sketch below).

[Figure: N×N grid showing how updating one unknown depends on its four immediate neighbours; phi = 1 and phi = 0 mark the fixed boundary values.]
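
A hedged serial sketch of the relaxation sweep (the names N, niters and phi are illustrative, not taken from MPJLaplace.java):

public class SerialLaplace {
    public static void main(String[] args) {
        int N = 64, niters = 1000;
        double[][] phi = new double[N][N];
        // Fix one boundary at phi = 1; all other values start at 0.
        for (int i = 0; i < N; i++) phi[i][0] = 1.0;

        double[][] newPhi = new double[N][N];
        for (int iter = 0; iter < niters; iter++) {
            // Each interior unknown becomes the average of its four neighbours.
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    newPhi[i][j] = 0.25 * (phi[i - 1][j] + phi[i + 1][j]
                                         + phi[i][j - 1] + phi[i][j + 1]);
            // Copy the interior back, keeping boundary values unchanged.
            for (int i = 1; i < N - 1; i++)
                System.arraycopy(newPhi[i], 1, phi[i], 1, N - 2);
        }
        System.out.println("phi at centre = " + phi[N / 2][N / 2]);
    }
}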
Decomposition
• Red borders represent elements where phi is set to fixed boundary values.

• The blue area represents the elements that will get updated by the iterative algorithm.

• The black borders are ghost regions, which are used to share otherwise inaccessible regions between processes.

• Ghost regions are used as part of the calculation but contribute no results of their own.

• The numbers on the sides represent the ranges and indexes used in the loops.

• B = N / (number of processes available), so each process holds B interior columns plus two ghost columns (B + 2 in total).

[Figure: the N×N grid split column-wise across Proc 0 – Proc 3, each block B columns wide and N rows tall, plus ghost columns.]
Edge Exchange Using Sendrecv()

[Figure: Proc 0 – Proc 3 exchange their edge columns with neighbouring processes using Sendrecv at each of the three internal boundaries; after the exchange, displayPhi() repaints the display.]

A hedged sketch of this exchange follows below.
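
The sketch below shows how the ghost-column exchange could be written with Sendrecv (the names B, N and phi follow the decomposition slide but are illustrative, not copied from MPJLaplace.java; it assumes the number of processes divides N):

import mpi.*;

public class GhostExchange {
    static void exchangeEdges(double[][] phi, int B, int N, int rank, int size) throws Exception {
        int left = rank - 1, right = rank + 1;
        if (right < size) {
            // Send my rightmost interior column, receive the neighbour's edge
            // into my right ghost column.
            MPI.COMM_WORLD.Sendrecv(phi[B], 0, N, MPI.DOUBLE, right, 0,
                                    phi[B + 1], 0, N, MPI.DOUBLE, right, 0);
        }
        if (left >= 0) {
            // Send my leftmost interior column, receive the neighbour's edge
            // into my left ghost column.
            MPI.COMM_WORLD.Sendrecv(phi[1], 0, N, MPI.DOUBLE, left, 0,
                                    phi[0], 0, N, MPI.DOUBLE, left, 0);
        }
    }

    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank(), size = MPI.COMM_WORLD.Size();
        int N = 16, B = N / size;                 // assumes size divides N
        double[][] phi = new double[B + 2][N];    // B interior columns plus two ghost columns
        exchangeEdges(phi, B, N, rank, size);
        MPI.Finalize();
    }
}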
TASK#4:
COMPILE AND RUN THE SIMULATION

• The code is provided in the file MPJLaplace.java
• Compile the code using the following command:
  javac -cp .;%MPJ_HOME%/lib/mpj.jar MPJLaplace.java
• Run the program using the following command:
  mpjrun -np 4 MPJLaplace

Expected Output


OPTIONAL PRACTICE EXERCISE
OPTIONAL TASK

• The code is provided in the file MPJLaplace.java
• Modify it such that all borders are red.
• Run the simulation and display your results.
• HINT: Add a few lines of code where the array is being initialized.

Expected Output

1. Try replacing N with different values and see how it affects the simulation.
2. Try initializing the grid differently:
   1. Make all borders red.
   2. Make only two opposite borders red, etc.
3. Use different values for niter (the number of iterations).
FURTHER MATERIAL

• MPJ Express:
• API:
• http://mpj-express.org/docs/javadocs/index.html
• READMEs for Win & Linux:
• http://mpjexpress.org/docs/readme/README
• http://mpjexpress.org/docs/readme/README-win.txt
• User Guides for Win & Linux:
• http://mpjexpress.org/docs/guides/linuxguide.pdf
• http://mpjexpress.org/docs/guides/windowsguide.pdf
