
PARALLEL COMPUTING:

Models and Algorithms


Course for Undergraduate Students in the 4th year
(Major in Computer Science - Software)
Instructors:
Mihai L. Mocanu, Ph.D., Professor
Cristian M. Mihăescu, Ph.D., Lecturer
Cosmin M. Poteraș, Ph.D. Student, Assistant
E-mail: mmocanu@software.ucv.ro
Office: Room 303  Office hours: Thursday 12:00-14:00
Course page: http://software.ucv.ro/~mocanu_mihai
(ask for the password and use the appropriate entry)
Course objectives
Understanding of basic concepts of parallel computing
understand various approaches to parallel hardware
architectures and their strong/weak points
become familiar with typical software/programming
approaches
learn basic parallel algorithms and algorithmic techniques
learn the jargon so you understand what people are talking
about
be able to apply this knowledge
Course objectives (cont.)
Familiarity with Parallel Concepts and Techniques
drastically flattening the learning curve in a parallel environment
Broad Understanding of Parallel Architectures and Programming
Techniques
be able to quickly adapt to any parallel programming environment
Flexibility
Textbooks and Working Materials
Textbooks:
1. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis - Introduction to
Parallel Computing, Benjamin/Cummings, 2003 (2nd Edition, ISBN 0-201-64865-2)
or Benjamin/Cummings, 1994 (1st Edition, ISBN 0-8053-3170-0)
2. Behrooz Parhami - Introduction to Parallel Processing: Algorithms and
Architectures, Kluwer Academic Publ, 2002
3. Dan Grigoras - Parallel Computing. From Systems to Applications,
Computer Libris Agora, 2000, ISBN 973-97534-6-9
4. Mihai Mocanu - Algorithms and Languages for Parallel Processing, Publ.
University of Craiova, 1995
Laboratory and Projects:
1. Mihai Mocanu, Alexandru Patriciu - Parallel Computing in C for Unix and
Windows NT Networks, Publ. University of Craiova, 1998
2. Christopher H. Nevison et al. - Laboratories for Parallel Computing, Jones and
Bartlett, 1994
Other resources are on the web page
Topics Covered (overview)
Fundamental Models (C 1..5)
Introduction
Parallel Programming Platforms
Principles of Parallel Algorithm Design
Basic Communication Operations
Analytical Modeling of Parallel Programs
Parallel Programming (C 6 & a part of C 7)
Programming using Message Passing Paradigm
Parallel Algorithms (C 8, 9, 10 & a part of C 11)
Dense Matrix Algorithms
Sorting
Graph Algorithms
Topics in Detail I
1. Parallel Programming Platforms & Parallel Models
logical and physical organization
interconnection networks for parallel machines
communication costs in parallel machines
process - processors mappings, graph embeddings
Why?
It is better to be aware of the physical and economical constraints and tradeoffs
of the parallel system you are designing for, than to be sorry later.
Topics in Detail II
2. Quick Introduction to PVM (Parallel Virtual Machine) and MPI
(Message Passing Interface)
semantics and syntax of basic communication operations
setting up your PVM/MPI environment, compiling and running PVM or MPI
programs
Why?
You can start writing simple parallel programs early on.
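To show how little code is needed to get started, here is a minimal MPI "hello world" sketch (an illustrative example, not taken from the course materials); a PVM version would be structurally similar:

    /* Compile with mpicc hello.c -o hello; run with mpirun -np 4 ./hello */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                  /* start the MPI runtime      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* id of this process         */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes  */
        printf("Hello from process %d of %d\n", rank, size);
        MPI_Finalize();                          /* shut the runtime down      */
        return 0;
    }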
Topics in Detail III
3. Principles of Parallel Algorithm Design
decomposition techniques
load balancing
techniques for reducing communication overhead
parallel algorithm models
Why?
These are fundamental issues that appear in and apply to every parallel program.
You really should learn this material by heart.
Topics in Detail IV
4. Implementation and Cost of Basic Communication Operations
broadcast, reduction, scatter, gather, parallel prefix, etc.
Why?
These are fundamental primitives you will often use, and you should know
them well: not only what they do, but also how much they cost and when
and how to use them.
Going through the details of implementation allows us to see how the
principles from the previous topic are applied to relatively simple problems.
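As an illustration of two of these primitives (an assumed MPI usage example, not an excerpt from the course), the sketch below broadcasts a value from process 0 and then reduces the partial values back to it:

    /* Minimal MPI broadcast + reduction sketch; compile with mpicc, run with mpirun. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int value = (rank == 0) ? 42 : 0;
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);                 /* root 0 -> all */

        int sum = 0;
        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); /* all -> root 0 */
        if (rank == 0)
            printf("sum over %d processes = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }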
Topics in Detail V
5. Analytical Modeling of Parallel Programs
sources of overhead
execution time, speedup, efficiency, cost, Amdahl's law
Why?
Parallel programming is done to increase performance.
Debugging and profiling are extremely difficult in a parallel setting, so it is better to
understand from the beginning what performance to expect from a given parallel
program and, more generally, how to design parallel programs with low execution
time. It is also important to know the limits of what can and cannot be done.
Topics in Detail VI
6. Parallel Dense Matrix Algorithms
matrix-vector multiplication
matrix-matrix multiplication
solving systems of linear equations
7. Parallel Sorting
odd-even transposition sort
sorting networks, bitonic sort
parallel quicksort
bucket and sample sort
Why?
Classical problems with lots of applications, many interesting and useful
techniques exposed.
Topics in Detail VII
8. Parallel Graph Algorithms
minimum spanning tree
single-source shortest paths
all-pairs shortest paths
connected components
algorithms for sparse graphs
9. Search Algorithms for Discrete Optimization Problems
search overhead factor, speedup anomalies
parallel depth-first search
Why?
As before, plus many examples of hard-to-parallelize problems.
Grading (tentative)
20% continuous test quizzes (T)
20% continuous practical laboratory assignments (L)
20% cont. practical evaluation through projects (P)
40% final written exam (E)
You have to get at least 50% on every continuous evaluation
form (T, L and P) in order to be allowed to take the
final exam during the session.
You have to get at least 50% on the final exam (E) to pass
and obtain a mark greater than 5. All the grades obtained
go with the specified weight into the computation of the
final mark.
Assignments and evaluations
assignments from your project for a total of 20 points
mostly programming in C, C++ with threads or multiple
processes, PVM or MPI, etc., implementing (relatively) simple
algorithms and load balancing techniques, so make sure to check
the lab info as soon as possible
continuous evaluation based on some theoretical questions
thrown in to prepare you better for the final exam
If you have problems with setting up your working environment
and/or running your programs, ask the TA for help/advice. He is
there to help you with that.
Use his help, but do not abuse it with ordinary programming bugs.
Project (tentative)
Project:
may be individual or done in groups of 2-3
intermediary reports or presentations weigh 30% of the
final grade
required: programs + written documentation + final and
intermediary presentations (2 by the end of the semester)
three main types
report on interesting non-covered algorithms
report on interesting parallel applications
not-so-trivial programming project
final written report and presentation due date: end of Jan.
Introduction
Background
Speedup. Amdahl's Law
The Context and Difficulties of Actual Parallel
Computing
Demand for computational speed. Grand
challenge problems
Global weather forecasting
N-body problem: modeling motion of
astronomical bodies
Background
Parallel Computing: using more than one computer, or a
computer with more than one processor, to solve a task
Parallel computers (computers with more than one
processor), and their way of programming - parallel
programming - have been around for more than 40
years! Motives:
Usually faster computation - the very simple idea that n
computers operating simultaneously can achieve a result n
times faster (it will not be n times faster, for various reasons).
Other motives include: fault tolerance, larger amount of
memory available, ...
... There is therefore nothing new in the idea of parallel
programming, but its application to computers. The author
cannot believe that there will be any insuperable difficulty in
extending it to computers. It is not to be expected that the
necessary programming techniques will be worked out
overnight. Much experimenting remains to be done. After all,
the techniques that are commonly used in programming today
were only won at the cost of considerable toil several years
ago. In fact the advent of parallel programming may do
something to revive the pioneering spirit in programming
which seems at the present to be degenerating into a rather
dull and routine occupation ...
Gill, S. (1958), Parallel Programming, The Computer Journal, vol. 1, April 1958, pp. 2-10.
Speedup Factor

S(p) = Execution time using one processor (best sequential algorithm)
       / Execution time using a multiprocessor with p processors
     = t_s / t_p

The speedup factor can also be cast in terms of computational steps:

S(p) = Number of computational steps using one processor
       / Number of parallel computational steps with p processors

S(p) gives the increase in speed obtained by using a multiprocessor.
Hints:
Use the best sequential algorithm for the single-processor time
The underlying algorithm for the parallel implementation might be (and
usually is) different
Maximum Speedup
The maximum speedup is usually p with p processors (linear speedup).
If f is the fraction of the computation that must be performed serially,
the speedup factor is given by:

S(p) = t_s / ( f*t_s + (1-f)*t_s/p ) = p / ( 1 + (p-1)*f )

This equation is known as Amdahl's law.
Remark: It is possible, but unusual, to get superlinear speedup (greater
than p), due to a specific reason such as:
extra memory in the multiprocessor system
a nondeterministic algorithm
Maximum Speedup
Amdahl's law, illustrated:
[Figure: (a) one processor - a serial section f*t_s plus parallelizable
sections (1-f)*t_s, total time t_s; (b) p processors - the serial section
f*t_s plus the parallelized sections (1-f)*t_s/p, total time t_p.
Plot: speedup against the number of processors.]
Even with an infinite number of processors, the maximum speedup is
limited to 1/f.
Ex: With only 5% of the computation being serial, the maximum speedup is 20.
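A minimal sketch, assuming nothing beyond the formula above, that evaluates Amdahl's law for a serial fraction f and several processor counts (the 5% example approaches its limit of 20):

    #include <stdio.h>

    /* S(p) = p / (1 + (p-1)f) */
    double amdahl(double f, int p) {
        return (double)p / (1.0 + (p - 1) * f);
    }

    int main(void) {
        double f = 0.05;                       /* 5% serial, as in the example above */
        int procs[] = {4, 16, 64, 256, 1024};
        for (int i = 0; i < 5; i++)
            printf("p = %4d   S(p) = %.2f\n", procs[i], amdahl(f, procs[i]));
        printf("limit as p -> infinity: %.1f\n", 1.0 / f);   /* 20 */
        return 0;
    }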
Superlinear Speedup - Searching
[Figure: (a) searching each sub-space sequentially - the search space is
split into p sub-spaces, each taking t_s/p to scan; the solution is found
after x sub-space searches (x indeterminate) plus a final Δt.
(b) searching each sub-space in parallel - the solution is found after
time Δt.]
Speedup is given by:

S(p) = ( x * t_s/p + Δt ) / Δt

The worst case for the sequential search is when the solution is found in
the last sub-space search. Then the parallel version offers the greatest
benefit, i.e.

S(p) = ( ((p-1)/p) * t_s + Δt ) / Δt  ->  infinity, as Δt tends to zero

The least advantage for the parallel version is when the solution is found
in the first sub-space search of the sequential search, i.e.

S(p) = Δt / Δt = 1
The Context of Parallel Processing
Facts:
The explosive growth of digital computer architectures
The need for:
a better understanding of various forms/ degrees of
concurrency
user-friendliness, compactness and simplicity of code
high performance but low cost, low power consumption a.o.
High-performance uniprocessors are increasingly complex and
expensive, and they have high power-consumption
They may also be under-utilized - mainly due to the lack of
appropriate software.
Possible trade-offs to achieve efficiency
What's better?
The use of one or a small number of such complex processors,
at one extreme, OR
A moderate to very large number of simpler processors, at the
other
The answer may seem simple, but there is a catch, forcing us to
answer another question first: how good is the communication
between processors?
So:
When combined with a high-bandwidth, but logically simple,
inter-processor communication facility, the latter approach may
lead to significant increase in efficiency, not only at the execution
but also in earlier stages (i.e. in the design process)
The Difficulties of Parallel Processing
Two are the major problems that have prevented over the years
the immediate and widespread adoption of such (moderately to)
massively parallel architectures:
the inter-processor communication bottleneck
the difficulty, and thus high cost, of algorithmic/software
development
How were these problems overcome?
At very high clock rates, the link between the processor and
memory becomes very critical
integrated processor/memory design optimization
emergence of multiple-processor microchips
The emergence of standard programming and communication
models has removed some of the concerns with compatibility
and software design issues in parallel processing
Demand for Computational Speed
Continuous demand for greater computational speed
from a computer system than is usually possible
Areas requiring great computational speed include
numerical modeling and simulation, scientific and
engineering problems etc.
Remember: Computations must not only be completed,
but completed within a reasonable time period
Grand Challenge Problems
One that cannot be solved in a reasonable amount of
time with today's computers. Obviously, an
execution time of 2 months is always unreasonable
Examples
Modeling large DNA structures
Global weather forecasting
Modeling motion of astronomical bodies.
Global Weather Forecasting
Atmosphere modeled by dividing it into 3-dimensional cells
Computations in each cell are repeated many times to model time passing
Suppose the whole global atmosphere is divided into cells of size
1 mile x 1 mile x 1 mile, to a height of 10 miles (10 cells high) -
about 5 x 10^8 cells
Suppose each calculation requires 200 floating point operations.
In one time step, 10^11 floating point operations are necessary.
To forecast the weather over 7 days using 1-minute intervals, a
computer operating at 1 Gflops (10^9 flops) takes about 10^6 s - more than 10 days
To perform the calculation in 5 minutes requires a computer
operating at 3.4 Tflops (3.4 x 10^12 flops).
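A minimal sketch that re-derives these estimates; the cell count, the 200 flops per cell and the 1-minute step over 7 days are the slide's assumptions, not measured values:

    #include <stdio.h>

    int main(void) {
        double cells          = 5e8;              /* ~1-mile cells, 10 miles high   */
        double flops_per_cell = 200.0;            /* assumed work per cell per step */
        double steps          = 7.0 * 24 * 60;    /* 7 days, 1-minute time steps    */
        double total_flops    = cells * flops_per_cell * steps;         /* ~10^15   */

        printf("total flop count        : %.2e\n", total_flops);
        printf("time at 1 Gflops        : %.2e s\n", total_flops / 1e9);        /* ~10^6 s  */
        printf("rate to finish in 5 min : %.2e flops\n", total_flops / (5 * 60)); /* ~3.4e12 */
        return 0;
    }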
Modeling Motion of Astronomical Bodies
Bodies are attracted to each other by gravitational forces
The movement of each body is predicted by calculating the total
force on each body
With N bodies, there are N-1 forces to calculate for each body, or
approx. N^2 calculations (N log2 N for an efficient approximate
algorithm); after determining the new positions of the bodies, the
calculations are repeated
If a galaxy has, say, 10^11 stars, even if each calculation is done
in 1 ms (an extremely optimistic figure), it takes 10^9 years for one
iteration using the N^2 algorithm and almost a year for one iteration
using an efficient N log2 N approximate algorithm.
Astrophysical N-body simulation screen snapshot
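For concreteness, a minimal sketch (not from the course materials) of one O(N^2) time step of a direct-summation N-body code; the 2D treatment, the array sizes and the softening constant EPS are illustrative choices:

    #include <math.h>
    #include <stdio.h>

    #define N   1000
    #define G   6.674e-11
    #define EPS 1e-9        /* softening term, avoids division by zero */
    #define DT  1.0         /* time step */

    static double x[N], y[N], vx[N], vy[N], m[N];

    void step(void) {
        for (int i = 0; i < N; i++) {            /* N-1 force contributions per body */
            double ax = 0.0, ay = 0.0;
            for (int j = 0; j < N; j++) {
                if (j == i) continue;
                double dx = x[j] - x[i], dy = y[j] - y[i];
                double r2 = dx * dx + dy * dy + EPS;
                double f  = G * m[j] / (r2 * sqrt(r2));
                ax += f * dx;
                ay += f * dy;
            }
            vx[i] += ax * DT;                    /* update velocities ...            */
            vy[i] += ay * DT;
        }
        for (int i = 0; i < N; i++) {            /* ... then positions               */
            x[i] += vx[i] * DT;
            y[i] += vy[i] * DT;
        }
    }

    int main(void) {
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 0.0; m[i] = 1e24; }
        step();
        printf("body 0 after one step: (%g, %g)\n", x[0], y[0]);
        return 0;
    }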
PARALLEL COMPUTING: Models
and Algorithms
A course for the 4th year students
(Major in Computer Science - Software)
Parallel Programming Platforms
Contents
Parallel Computing : definitions and terminology
Historical evolution
A taxonomy of parallel solutions
Pipelining
Functional parallelism
Vector parallelism
Multi-processing
Multi-computing
Von Neumann constraints
[Figure: the von Neumann model - a CPU containing the Control Unit and
the Arithmetic-Logic Unit, connected to Internal Memory, External Memory,
an Input Unit and an Output Unit.]
Parallel Computing - What is it? (here, from the
platform point of view)
Try a simple definition, fit for our purposes
A historical overview: How did parallel
platforms evolve?
WHAT IS PARALLEL
COMPUTING?
From the platform point of view, it is:
Use of several processors/ execution units in parallel to
collectively solve a problem
Ability to employ different processors/ computers/
machines to execute concurrently different parts of a
single program
Questions:
How big are the parts? (grain of parallelism) Can be
instruction, statement, procedure, or other size.
Parallelism in this way is loosely defined, with plenty of
overlap with distributed computing
PARALLEL COMPUTING AND
PROGRAMMING PLATFORMS
Definition for our purposes:
We will mainly focus on relatively coarse grain
Main goal: shorter running time!
The processors are contributing to the solution of the
same problem
In distributed systems the problem is often that of
coordination (e.g. leader election, commit, termination
detection)
In parallel computing a problem involves lots of data and
computation (e.g. matrix multiplication, sorting),
communication is to be kept to an optimum
Terminology
Distributed System: A collection of multiple autonomous
computers, communicating through a computer network, that
interact with each other in order to achieve a common goal.
Parallel System: An optimized collection of processors, dedicated
to the execution of complex tasks; each processor executes in a
semi-independent manner a subtask, and co-ordination may be
needed from time to time. The primary goal of parallel processing
is a significant increase in performance.
Remark. Parallel processing in distributed environments is not
only possible, but also a cost-effective, attractive alternative.
Do we need powerful computer platforms?
Yes, to solve much bigger problems much faster!
Coarse-grain parallelism is mainly applicable to long-
running, scientific programs
Performance
- there are problems which can use any amount of computing
(e.g. simulation)
Capability
- to solve previously unsolvable problems (such as prime
number factorization): too big data sizes, real time constraints
Capacity
-to handle a lot of processing much faster, perform more
precise computer simulations (e.g. weather prediction)
Measures of Performance
To computer scientists: speedup, execution time.
To applications people: size of problem, accuracy of
solution, etc.
Speedup of algorithm
= sequential execution time/execution time on p
processors (with the same data set).
Speedup on problem
= sequential execution time of best known sequential
algorithm / execution time on p processors.
A more honest measure of performance.
Avoids picking an easily parallelizable algorithm with
poor sequential execution time.
How did parallel platforms evolve?
Execution Speed
With a 10^2 times increase of (floating point) execution
speed every 10 years
Communication Technology
A factor which is critical to the performance of
parallel computing platforms
1985-1990: in spite of an average 20x increase in
processor performance, the communication speed remained
constant
Parallel Computing - How platforms evolved
[Figure: time (s) per floating-point instruction over the years]
Motto: "I think there is a world
market for maybe five computers"
(Thomas Watson, IBM Chairman, 1943)
Towards Parallel Computing The 5 ERAs
Why are powerful computers parallel?
From Transistors to FLOPS
by Moore's law the number of transistors per unit area doubles every 18
months
how to make use of these transistors?
more execution units, graphical pipelines, etc.
more processors
So, technology is not the only key, computer structure (architecture)
and organization are also important!
Inhibitors of parallelism:
Dependencies
Why are powerful computers parallel? (cont.)
The Data Communication Argument
for huge data it is cheaper and more feasible to move
computation towards data
The Memory/Disk Speed Argument
parallel platforms typically yield better memory system
performance, because they have
larger aggregate caches
higher aggregate bandwidth to memory system
Explicit Parallel Programming Platforms
physical organization - hardware view
communication network
logical organization - programmer's view of the
platform
process-processors mappings
A bit of historical perspective
Parallel computing has been here since the early days of computing.
Traditionally: custom HW, custom SW, high prices
The doom of Moore's law:
- custom HW has a hard time catching up with commodity processors
Current trend: use commodity HW components, standardize SW
Parallelism sneaking into commodity computers:
Instruction Level Parallelism - wide issue, pipelining, OOO execution
Data Level Parallelism - 3DNow!, AltiVec
Thread Level Parallelism - Hyper-Threading in Pentium IV
Transistor budgets allow for multiple processor cores on a chip.
A bit of historical perspective (cont.)
Most applications would benefit from being parallelized and
executed on a parallel computer.
even PC applications, especially the most demanding ones - games,
multimedia
Chicken & Egg Problem:
1. Why build parallel computers when the applications are sequential?
2. Why parallelize applications when there are no parallel commodity
computers?
Answers:
1. What else to do with all those transistors?
2. Applications already are a bit parallel (wide issue, multimedia
instructions, hyper-threading), and this bit is growing.
Parallel Solutions: A Taxonomy
Pipelining
- instructions are decomposed into elementary operations; different operations
belonging to several instructions may be at a given moment in execution
Functional parallelism
- independent units are provided to execute specialized functions
Vector parallelism
- identical units are provided to execute under unique control the same operation on
different data items
Multi-processing
- several tightly coupled processors execute independent instructions,
communicating through a common shared memory
Multi-computing
- several loosely coupled processors (each with its own memory) execute
independent instructions, and usually communicate with each other by sending messages
Pipelining (often complemented by functional/vector parallelism)
Ex. IBM 360/195, CDC 6600/7600, Cray 1
Vector Processors
Early parallel computers used vector processors; their
design was MISD, their programming was SIMD (see Flynn's
taxonomy next)
Most significant representatives of this class:
CDC Cyber 205, CDC 6600
Cray-1, Cray-2, Cray X-MP, Cray Y-MP etc.
IBM 3090 Vector
Innovative aspects:
Superior organization
Use of performant technologies (not CMOS), e.g. cooling
Use of peripheral processors (minicomputers)
Generally, do not rely on usual techniques for paging/
segmentation, that slow down computations
Cray XMP/4
Cray 2
Flynn's Taxonomy

                      Data stream
Instruction stream    Single    Multiple
Single                SISD      SIMD
Multiple              MISD      MIMD
SIMD (Single Instruction stream, Multiple Data stream)
[Figure: a global control unit drives an array of processing elements (PE)
connected by an interconnection network.]
Ex: early parallel machines
Illiac IV, MPP, CM-2, MasPar MP-1
Modern settings
multimedia extensions - MMX, SSE
DSP chips
SIMD (cont.)
Positives:
less hardware needed (compared to MIMD computers, they
have only one global control unit)
less memory needed (must store only a copy of the program)
less startup time to communicate with neighboring processors
easy to understand and reason about
Negatives:
proprietary hardware needed - fast obsolescence, high
development costs/time
rigid structure - suitable only for highly structured problems
inherent inefficiency due to selective turn-off
SIMD and Data-Parallelism
SIMD computers are naturally suited for data-parallel
programs
programs in which the same set of instructions are
executed on a large data set
Example:
for (i=0; i<1000; i++) pardo
c[i] = a[i]+b[i];
Processor k executes c[k] = a[k]+b[k]
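The same data-parallel loop, written for a shared-memory machine; a minimal sketch using OpenMP (an assumed tool choice - the course uses the generic pardo notation):

    #include <stdio.h>
    #define N 1000

    int main(void) {
        int a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        #pragma omp parallel for       /* each thread handles a chunk of iterations */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[10] = %d\n", c[10]);
        return 0;
    }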
SIMD inefficiency example (1)
Example:
for (i=0; i<10; i++)
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;
Different processors cannot execute distinct instructions in the same clock cycle.
[Figure: one array element per processor p0 ... p9;
 a[] = 4 1 7 2 9 3 3 0 6 7
 b[] = 5 3 4 1 4 5 3 1 4 8
 c[] = not yet computed]
SIMD inefficiency example (2)
Example:
for (i=0; i<10; i++) pardo
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;
[Figure: every processor p0 ... p9 evaluates the condition a[i]<b[i] for its element;
 a[] = 4 1 7 2 9 3 3 0 6 7
 b[] = 5 3 4 1 4 5 3 1 4 8
 c[] = not yet written]
SIMD inefficiency example (3)
(same code as above)
[Figure: in the first step, only the processors where a[i]<b[i] holds execute
c[i] = a[i]+b[i]; the others are idle;
 a[] = 4 1 7 2 9 3 3 0 6 7
 b[] = 5 3 4 1 4 5 3 1 4 8
 c[] = 9 4 . . . 8 . 1 . 15]
SIMD inefficiency example (4)
(same code as above)
[Figure: in the second step, the remaining processors execute c[i] = 0 while
the first group is idle;
 a[] = 4 1 7 2 9 3 3 0 6 7
 b[] = 5 3 4 1 4 5 3 1 4 8
 c[] = 9 4 0 0 0 8 0 1 0 15]
MIMD (Multiple Instruction stream, Multiple Data stream)
[Figure: multiple processing elements, each with its own control unit,
connected by an interconnection network.]
Single Program, Multiple Data (SPMD)
a popular way to program MIMD computers
simplifies code maintenance/program distribution
equivalent to MIMD (big switch at the beginning)
MIMD (cont)
Positives:
can be easily/fast/cheaply built from existing microprocessors
very flexible (suitable for irregular problems)
can have extra hardware to provide fast synchronization,
which enables them to operate in SIMD mode (ex. CM5)
Negatives:
more complex (each processor has its own control unit)
requires more resources (duplicated program, OS, etc.)
more difficult to reason about/design correct programs
Address-Space Organization
Aka Bell's Taxonomy (only for MIMD computers)
Multiprocessors
(single address space, communication uses common memory)
Scalable (distributed memory)
Not scalable (centralized memory)
Multicomputers
(multiple address space, communication uses transfer of messages)
Distributed
Centralized
Vector Parallelism
Is based on primary high-level, efficient operations,
able to process in one step whole linear arrays (vectors)
It may be extended to matrix processing etc.
Multiprocessors
Ex. Compaq SystemPro, Sequent Symmetry 2000
Multicomputers
Ex. nCube, Intel iPSC/860
Multiprocessor Architectures
Typical examples are the Connection Machines
CM2
CM5
Organization
[Figure: Host Computer -> Microcontroller -> CM Processors and Memories]
Host sends commands/ data to a microcontroller
The microcontroller broadcasts control signals and
data back to the processor network
It also collects data from the network
CM* Processors and Memory
Bit dimension (this means the memory is
addressable at bit level)
Operations are bit serialized
Data organization in fields is arbitrary (may
include any number of bits, starts anywhere)
A set of contextual bits (flags) in all processors
determines their activation
Programming
PARIS - PArallel Instruction Set, similar to an
assembly language
*LISP - Common Lisp extension that includes
explicit parallel operations
C* - C extension with explicit parallel data
and implicit parallel operations
CM-Fortran - the implemented dialect of
Fortran 90
CM2 Architecture
[Figure: a Front End connects through the Nexus to four sequencers (0-3),
each driving its own group of Connection Machine processors.]
Interconnection Network
of CM2 Processors
Any node in the network is a cluster (chip), with:
16 data processors on a chip
Memory
Routing node
Nodes are connected in a 12-dimensional hypercube
There are 4096 nodes; each has direct links to 12 other nodes
The maximal dimension of a CM is thus 16 x 4096, or 64K
processors
CM5
Starting with CM-5, the Thinking Machines Co. went
(in 1991) from a hypercube architecture of simple
processors to a completely new one, MIMD, based on a
fat tree of RISC processors (SPARC)
A few years later the CM-5E replaced the SPARC processors
with faster SuperSPARCs
Levels of parallelism
Implicit Parallelism in Modern Microprocessors
pipelining, superscalar execution, VLIW
Hardware parallelism
- as given by machine architecture and hardware multiplicity (Hwang)
- reflects a model of resource utilization by operations with a potential of
simultaneous execution, or refers to the resources' peak performance
Software parallelism
- acts at job, program, instruction or even bit (arithmetic) level
Limitations of Memory System Performance
Problem: high latency of memory vs. speed of computing
Solutions: caches, latency hiding using multithreading and
prefetching
Granularity
- is a measure for the amount of computation within a process
- usually described as coarse, medium and fine
Latency
- opposed to granularity, measures the overhead due to communication
between fragments of code
PARALLEL COMPUTING:
Models and Algorithms
A course for the 4th year students
(Major in Computer Science - Software)
Communication in Parallel Systems
Contents
Role of communication in parallel systems
Types of interconnection networks
General topologies: clique, star, linear array, ring,
tree & fat tree, 2D & 3D mesh/torus, hypercube,
butterfly
Evaluating interconnection networks: diameter,
connectivity, bandwidth, cost
Communication
plays a major role, for both:
Shared Address Space Platforms (multiprocessors)
Uniform Memory Access multiprocessors
Non-Uniform Memory Access multiprocessors
cache coherence issues
Message Passing Platforms
network characteristics are important
mapping between parallel processes and processors is
critical
Sequential Programming Paradigm
Message-Passing Programming Paradigm
Shared Address Space Platforms
[Figure: (left) processors P connected through an interconnection network to
shared memory modules M - shared memory, UMA; (right) nodes, each with a
processor P, cache C and local memory M, connected by an interconnection
network - distributed memory, NUMA.]
Interconnection Networks for Parallel Computers
Static networks
point-to-point communication links among processing nodes
also called direct networks
Dynamic networks
communication links are connected dynamically by switches to
create paths between processing nodes and memory banks/other
processing nodes
also called indirect networks
Quasi-static/ Pseudo-dynamic networks
to be introduced later
Interconnection Networks
[Figure: in a static/direct network, processing nodes (with their network
interfaces) are connected point-to-point; in a dynamic/indirect network,
processing nodes are connected through switching elements.]
Static Interconnection Networks
Just the most usual topologies:
Complete network (clique)
Star network
Linear array
Ring
Tree
2D & 3D mesh/torus
Hypercube
Butterfly
Fat tree
Clique, Star, Linear Array, Ring, Tree
[Figure: node diagrams of these topologies over processors p0, p1, p2, ..., pn-1.]
Clique, Star, Linear Array, Ring, Tree
- important logical topologies, as many common communication patterns
correspond to these topologies:
- clique: all-to-all broadcast
- star: master-slave, broadcast
- line, ring: pipelined execution
- tree: hierarchical decomposition
- none of them is very practical
- clique: cost
- star, line, ring, tree: low bisection width
- line, ring: high diameter
- actual execution is performed on the embedding into the physical network
2D & 3D Array & Torus
- good match for discrete simulation and matrix operations
- easy to manufacture and extend
Examples: Cray T3D (3D torus), Intel Paragon (2D mesh)
Hypercube
- good graph-theoretic properties (low diameter, high bisection width)
- nice recursive structure
- good for simulating other topologies (they can be efficiently embedded into
a hypercube)
- degree log(n), diameter log(n), bisection width n/2
- costly/difficult to manufacture for high n, not so popular nowadays
[Figure: a 3-dimensional hypercube with nodes labeled 000 ... 111;
neighbouring nodes differ in exactly one bit.]
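A minimal sketch (illustrative, not from the course) of the labeling rule behind the figure: in a d-dimensional hypercube, node i is connected to the d nodes obtained by flipping exactly one bit of i:

    #include <stdio.h>

    void print_neighbours(unsigned node, unsigned d) {
        printf("node %u:", node);
        for (unsigned k = 0; k < d; k++)
            printf(" %u", node ^ (1u << k));     /* flip bit k */
        printf("\n");
    }

    int main(void) {
        unsigned d = 3;                          /* the 8-node cube from the figure */
        for (unsigned node = 0; node < (1u << d); node++)
            print_neighbours(node, d);
        return 0;
    }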
Butterfly
- a hypercube-derived network of log(n) diameter and constant degree
- a perfect match for some complex algorithms (like the Fast Fourier Transform)
- there are other hypercube-related networks (Cube Connected Cycles, Shuffle-
Exchange, De Bruijn and Beneš networks)
[Figure: a butterfly network B_{n+1} built recursively from two copies of B_n.]
Fat Tree
Main idea: exponentially increase the multiplicity of links as the distance
from the bottom increases
- keeps nice properties of the binary tree (low diameter)
- solves the low bisection and bottleneck at the top levels
Example: CM5
Dynamic Interconnection Networks
BUS Based Interconnection Networks
processors and the memory modules are connected to a shared bus
Advantages:
simple, low cost
Disadvantages:
only one processor can access memory at a given time
bandwidth does not scale with the number of processors/memory
modules
Example:
quad Pentium Xeon
Crossbar
Advantages:
non blocking network
Disadvantages:
cost O(pm)
Evaluating Interconnection Networks
diameter
the longest distance (number of hops) between any two nodes
gives lower bound on time for algorithms communicating only with
direct neighbours
connectivity
multiplicity of paths between any two nodes
high connectivity lowers contention for communication resources
bisection width (bisection bandwidth)
the minimal number of links (resp. their aggregate bandwidth) that
must be removed to partition the network into two equal halves
provides a lower bound on time when data must be shuffled from
one half of the network to the other half
VLSI area/volume: O(w^2) in 2D, O(w^(3/2)) in 3D (where w is the bisection width)
Evaluating Interconnection Networks

Network              Diameter            Bisection   Arc           Cost
                                         Width       Connectivity  (# of links)
clique               1                   p^2/4       p-1           p(p-1)/2
star                 2                   1           1             p-1
complete binary tree 2 log((p+1)/2)      1           1             p-1
linear array         p-1                 1           1             p-1
2D mesh              2(sqrt(p)-1)        sqrt(p)     2             2(p-sqrt(p))
2D torus             2*floor(sqrt(p)/2)  2*sqrt(p)   4             2p
hypercube            log p               p/2         log p         (p log p)/2
So, the Logical View of PP Platform:
Control Structure - how to express parallel tasks
Single Instruction stream, Multiple Data stream
Multiple Instruction stream, Multiple Data stream
Single Program Multiple Data
Communication Model - how to specify interactions between tasks
Shared Address Space Platforms (multiprocessors)
Uniform Memory Access multiprocessors
Non-Uniform Memory Access multiprocessors
Cache-Only Memory Access multiprocessors (+ cache coherence issues)
Message Passing Platforms (multicomputers)
PARALLEL COMPUTING:
Models and Algorithms
A course for the 4th year students
(Major in Computer Science - Software)
Parallel Programming Models
Contents
The ideal parallel computer: PRAM
Categories of PRAMs
PRAM algorithm examples
Algorithmic Models
Data-Parallel Model
Task Graph Model
Work Pool Model
Master-Slave Model
Pipeline (Producer-Consumer) Model
Parallel Algorithm Design
Performance Models
Decomposition Techniques
Explicit Parallel Programming
Platforms & physical organization - hardware view
Communication network
Logical organization - programmer's view of the
platform
Process-processors mappings
The Ideal Parallel Computer
PRAM - Parallel Random Access Machine
consists of:
p processors, working in lock-step, synchronous
manner on the same program instructions
each with its local memory
each connected to an unbounded shared memory
the access time to shared memory costs one step
PRAM abstracts away communication, allows to
focus on the parallel tasks
Why PRAM is an Ideal Parallel Computer?
PRAM is a natural extension of the sequential model of
computation (RAM), it provides a means of interaction
between processors at no cost
it is not feasible to manufacture PRAMs:
the real cost of connecting p processors to m memory
cells such that their accesses do not interfere is on the order of p*m,
which is huge for any practical values of m
an algorithm for PRAM might lead to a good algorithm for a
real machine
if something cannot be efficiently solved on PRAM, it cannot
be efficiently done on any practical machine (based on
current technology)
Categories of PRAMs
Restrictions may be imposed for simultaneous read/write
operations in the common memory
There are 4 main classes, depending on how simultaneous
accesses are handled:
Exclusive read, exclusive write - EREW PRAM
Concurrent read, exclusive write - CREW PRAM
Exclusive read, concurrent write - ERCW PRAM (for
completeness)
Concurrent read, concurrent write - CRCW PRAM
Allowing concurrent read access does not create semantic
discrepancies in the program
Concurrent write access to the same memory location
requires arbitration
Resolving concurrent writes
Ways of resolving concurrent writes:
Common - all writes must write the same value
Arbitrary - an arbitrary write succeeds
Priority - the write with the highest priority succeeds
Sum - the sum of the written values is stored
PRAM Algorithm Example 1
Problem (parallel prefix): use an EREW PRAM to sum numbers stored at
m_0, m_1, ..., m_{n-1}, where n = 2^k for some k. The result should be
stored at m_0.

Algorithm for processor p_i:

for (j=0; j<k; j++)
    if (i % 2^(j+1) == 0) {
        a = read(m_i);
        b = read(m_{i+2^j});
        write(a+b, m_i);
    }

Example for k=3:
initial values:   1  8  3  2  7  3  1  4
after round 1:    9  8  5  2 10  3  5  4   (p0, p2, p4, p6 active)
after round 2:   14  8  5  2 15  3  5  4   (p0, p4 active)
after round 3:   29  8  5  2 15  3  5  4   (p0 active)
PRAM Example Notes
the program is written in SIMD (and SPMD) format
the inefficiency caused by idling processors is clearly visible
can be easily extended for n not power of 2
takes log2(n) rounds to execute
Important!
using a similar approach to parallel prefix (+ some other ideas) it
can be shown that:
Any CRCW PRAM can be simulated by an EREW PRAM with a
slowdown factor of O(log n)
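A minimal sequential sketch (illustrative, not part of the course materials) that simulates the EREW PRAM summation above: in round j, every "processor" i with i % 2^(j+1) == 0 adds m[i + 2^j] into m[i]:

    #include <stdio.h>

    #define K 3
    #define N (1 << K)                           /* n = 2^k */

    int main(void) {
        int m[N] = {1, 8, 3, 2, 7, 3, 1, 4};     /* the example data for k = 3 */

        for (int j = 0; j < K; j++)              /* k rounds */
            for (int i = 0; i < N; i += 1 << (j + 1))
                m[i] += m[i + (1 << j)];         /* p_i: read m[i+2^j], write m[i] */

        printf("sum = %d\n", m[0]);              /* 29 for the example data */
        return 0;
    }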
PRAM Algorithm Example 2
Problem: use a Sum-CRCW PRAM with n^2 processors to sort
n numbers stored at x_0, x_1, ..., x_{n-1}.
CRCW condition: processors can write concurrently 0s and
1s into a location; the sum of the values will actually be written
Question: How many steps would it take?
1. O(n log n)
2. O(n)
3. O(log n)
4. O(1)
5. less than (n log n)/n^2
PRAM Example 2
Note: we mark the processors p_{i,j}, for 0 <= i,j < n

Algorithm for processor p_{i,j}:

a = read(x_i);
b = read(x_j);
if ((a>b) || ((a==b) && (i>j)))
    write(1, m_i);
if (j==0) {
    b = read(m_i);
    write(a, x_b);
}

Example:
x[] = 1 7 3 9 3 0

Values written concurrently into m_0 ... m_5 (one column per memory cell;
the Sum-CRCW rule stores each column sum):
0 1 1 1 1 0
0 0 0 1 0 0
0 1 0 1 0 0
0 0 0 0 0 0
0 1 0 1 1 0
1 1 1 1 1 0

m[] = 1 4 2 5 3 0
x[] = 0 1 3 3 7 9   (after the final write(a, x_b))

An O(1) sorting algorithm! (Chaudhuri, p. 90-91)
Find the small error in the matrix!
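For reference, a minimal sequential sketch (illustrative) of the rank/enumeration sort that the Sum-CRCW PRAM performs in O(1) parallel steps; each iteration of the double loop corresponds to the work of one processor p_{i,j}:

    #include <stdio.h>
    #define N 6

    int main(void) {
        int x[N] = {1, 7, 3, 9, 3, 0};           /* the example data above          */
        int m[N] = {0}, out[N];

        for (int i = 0; i < N; i++)              /* p_{i,j}: compare x_i with x_j   */
            for (int j = 0; j < N; j++)
                if (x[i] > x[j] || (x[i] == x[j] && i > j))
                    m[i]++;                      /* Sum-CRCW write of 1 into m_i    */

        for (int i = 0; i < N; i++)              /* p_{i,0}: place x_i at rank m_i  */
            out[m[i]] = x[i];

        for (int i = 0; i < N; i++)
            printf("%d ", out[i]);               /* prints 0 1 3 3 7 9              */
        printf("\n");
        return 0;
    }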
The beauty and challenge of parallel algorithms
Problems that are trivial in a sequential setting can be quite
interesting and challenging to parallelize.
Homework: Compute sum of n numbers
How would you do it in parallel?
using n processors
using p processors
when communication is cheap
when communication is expensive
Algorithmic Models
try to offer a common base for the development,
expression and comparison of parallel algorithms
generally, they use the architectural model of
shared memory parallel machine (multi-processor)
shared memory is a useful abstraction from the
programmer's point of view, especially for the early
phases of algorithm design
communication is kept as simple as possible
usual causes of inefficiency are eliminated
Parallel Algorithmic Models
Data-Parallel Model
Task Graph Model
Work Pool Model
Master-Slave Model
Pipeline (Producer-Consumer) Model
Data Parallel Model
Working principle
divide data up amongst processors
process different data segments in parallel
communicate boundary information, if necessary
Features
includes loop parallelism
well suited for SIMD machines
communication is often implicit
Task Graph Model
decompose algorithm into different sections
assign sections to different processors
often uses fork()/join()/spawn()
usually does not lend itself to a high level of parallelism
Work Pool Model
dynamic mapping of tasks to processes
typically small amount of data per task
the pool of tasks (priority queue, hash table, tree) can be
centralized or distributed
[Figure: processes P0 ... P3 each repeatedly get a task from the work pool
(t0, t2, t3, t7, t8, ...), process it, and possibly add new tasks to the pool.]
Master-Slave Model
master generates and allocates tasks
can be also hierarchical/multilayer
master potentially a bottleneck
overlapping communication and computation at the
master is often useful
Pipelining
a sequence of tasks whose execution can overlap
sequential processor must execute them sequentially,
without overlap
parallel computer can overlap the tasks, increasing
throughput (but not decreasing latency)
Parallel Algorithms Performance
Granularity
fine grained: large number of small tasks
coarse grained: small number of large tasks
Degree of Concurrency
the maximal number of tasks that can be executed simultaneously
Critical Path
the costliest directed path between any pair of start and finish
nodes in the task dependency graph
the cost of the path is the sum of the weights of the nodes
Task Interaction Graph
tasks correspond to nodes and an edge connects two tasks if they
communicate/interact with each other
Task Dependency Graphs
directed acyclic graphs capturing the causal dependencies between tasks
a task corresponding to a node can be executed only after all tasks at
the other ends of its incoming edges have already been executed
[Figure: example task dependency graphs - sequential summation and
traversal; binary summation and merge sort.]
5-step Guide to Parallelization
Identify computational hotspots
find what is worth parallelizing
Partition the problem into smaller semi-independent tasks
find/create parallelism
Identify Communication requirements between these tasks
realize the constraints communication puts on parallelism
Agglomerate smaller tasks into larger tasks
group the basic tasks together so that the communication is
minimized, while still allowing good load balancing properties
Translate (map) tasks/data to actual processors
balance the load of processors, while trying to minimize
communication
Parallel Algorithm Design
Involves all of the following:
1. identifying the portions of the work that can be
performed concurrently
2. mapping the concurrent pieces of work onto multiple
processes running in parallel
3. distributing the input, output and intermediate data
associated with the program
4. managing access to data shared by multiple processes
5. synchronizing the processes in various stages of parallel
program execution
Optimal choices depend on the parallel architecture
Platform dependency example
Problem:
process each element of an array, with interaction
between neighbouring elements
1st setting: message-passing computer
Solution: distribute the array into blocks of size n/p
2nd setting: shared-memory computer with a shared cache
Solution: striped partitioning
[Figure: the array distributed among processors p0 ... p3.]
Decomposition techniques
Recursive Decomposition
Data Decomposition
Task Decomposition
Exploratory Decomposition
Speculative Decomposition
Recursive Decomposition
Divide and conquer leads to natural concurrency.
quick sort:
6 2 5 8 9 5 1 7 3 4 3 0
2 5 5 1 3 4 3 0 6 8 9 7
1 0 2 5 5 3 4 3 6 7 8 9
...
finding the minimum recursively:
rMin(A[0..n-1]) = min(rMin(A[0..n/2-1]), rMin(A[n/2..n-1]))
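A minimal runnable sketch of the recursive-minimum decomposition (written sequentially here; the two recursive calls are independent and could be executed in parallel):

    #include <stdio.h>

    int rmin(const int a[], int lo, int hi) {    /* minimum of a[lo..hi-1] */
        if (hi - lo == 1)
            return a[lo];
        int mid   = lo + (hi - lo) / 2;
        int left  = rmin(a, lo, mid);            /* independent subproblem */
        int right = rmin(a, mid, hi);            /* independent subproblem */
        return left < right ? left : right;
    }

    int main(void) {
        int a[] = {6, 2, 5, 8, 9, 5, 1, 7, 3, 4, 3, 0};
        printf("min = %d\n", rmin(a, 0, 12));    /* prints 0 */
        return 0;
    }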
Data Decomposition
Begin by focusing on the largest data structures, or the ones that
are accessed most frequently
Divide the data into small pieces, if possible of similar size
Strive for more aggressive partitioning than your target computer
will allow
Use data partitioning as a guideline for partitioning the computation
into separate tasks; associate some computation with each data
element
Take communication requirements into account when partitioning
data
Data Decomposition (cont.)
Partitioning according to
input data (e.g. find minimum, sorting)
output data (e.g. matrix multiplication)
intermediate data (bucket sort)
Associate tasks with the data
do as much as you can with the data before further
communication
owner computes rule
Partition in a way that minimizes communication
costs
Task Decomposition
Partition the computation into many small tasks of
approximately uniform computational requirements
Associate data with each task
Common in problems where data structures are highly
unstructured, or no obvious data structures to partition exist
Exploratory Decomposition
commonly used in search space exploration
unlike the data decomposition, the search space is not known
beforehand
computation can terminate as soon as a solution is found
the work amount can be more or less than in sequential case
[Figure: 15-puzzle example - an initial configuration and the configurations
generated by exploring the possible moves in parallel.]
Speculative Decomposition
Example:
Discrete event simulation state-space vertical partitioning
execute branches concurrently, assuming certain
restrictions are met (i.e. lcc), then keep the executions
that are correct, re-executing the others under the new conditions
the total amount of work is always more than in the
sequential case, but the execution time can be less
Task Characteristics
Task generation: static vs dynamic
Task sizes: uniform, non-uniform, known,
unknown
Size of Data Associated with Tasks:
influences mapping decisions, input/output sizes
InterTask Communication Characteristics
static vs dynamic
regular vs irregular
read-only vs read-write
one way vs two way
Load Balancing
Efficiency is adversely affected by an uneven
workload:
[Figure: Gantt-style chart of processors P0 ... P4 - the total execution
time is set by the most heavily loaded processor, while lightly loaded
processors sit idle (wasted time).]
Load Balancing (cont.)
Load balancing: shifting work from heavily loaded
processors to lightly loaded ones.
[Figure: the same chart after load balancing - work moved from heavily
loaded processors to lightly loaded ones reduces idle time and saves
execution time.]
Static load balancing - before execution
Dynamic load balancing - during execution
Static Load Balancing
Map data and tasks into processors prior to execution
the tasks must be known beforehand (static task generation)
usually task sizes need to be known in order to work well
even if the sizes are known (but non-uniform), the problem
of optimal mapping is NP-hard (but there are reasonable
approximation schemes)
1D Array Partitioning
[Figure: the array distributed among processors p0 ... p4 using block
partitioning (contiguous chunks), cyclic (striped) partitioning (elements
assigned round-robin), and block-cyclic partitioning (blocks assigned
round-robin).]
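A minimal sketch (illustrative, not from the course) of which processor owns element i of an n-element array under the three partitionings; the block size b in the block-cyclic case is a free parameter:

    #include <stdio.h>

    int owner_block(int i, int n, int p)        { return i / ((n + p - 1) / p); }
    int owner_cyclic(int i, int p)              { return i % p; }
    int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }

    int main(void) {
        int n = 20, p = 5, b = 2;
        for (int i = 0; i < n; i++)
            printf("i=%2d  block:%d  cyclic:%d  block-cyclic:%d\n",
                   i, owner_block(i, n, p), owner_cyclic(i, p),
                   owner_block_cyclic(i, b, p));
        return 0;
    }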
2D Array Partitioning
[Figure: a 2D array distributed among processors p0 ... p5 by row blocks,
by 2D blocks, and by 2D block-cyclic tiles.]
Example: Geometric Operations
Image filtering, geometric transformations, etc.
Trivial observation:
The workload is directly proportional to the number of objects.
If dealing with pixels, the workload is proportional to the area.
Load balancing is achieved by assigning to processors blocks
of the same area.
Dynamic Load Balancing
Centralized Schemes
master-slave:
master generates tasks and distributes workload
easy to program, prone to master becoming bottleneck
self-scheduling
take a task from the work pool when you are ready
chunk scheduling
self-scheduling that takes a single task at a time can be costly
take a chunk of tasks at once
when there are few tasks left, the chunk size
decreases
Dynamic Load Balancing
Distributed Schemes
Distributively share workload with other processors.
Issues:
how to pair sending and receiving processors
transfer of workload initiated by sender or receiver?
how much work to transfer?
when to decide to transfer?
Example: Computing the Mandelbrot Set
The colour of each pixel c is defined
solely by its coordinates:

int getColour(Complex c) {
    int colour = 0;
    Complex z = (0,0);
    while ((|z| < 2) && (colour < max)) {
        z = z*z + c;
        colour++;
    }
    return colour;
}

[Figure: the Mandelbrot set plotted over the region -2 <= real <= +1,
-1.5 <= imaginary <= +1.5.]
Mandelbrot Set Example (cont.)
Possible partitioning strategies:
Partition by individual output pixels; most aggressive partitioning.
Partition by rows.
Partition by columns.
Partition by 2-D blocks.
Assignment: Evaluate Mandelbrot set partitioning strategies
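As a starting point for the assignment, a minimal C sketch (an illustration, not the reference solution) of the getColour() loop above plus a row-cyclic partitioning: with p processes, the process with rank r computes rows r, r+p, r+2p, ...:

    #include <complex.h>
    #include <stdio.h>

    #define WIDTH  80
    #define HEIGHT 40
    #define MAXIT  255

    int get_colour(double complex c) {
        double complex z = 0;
        int colour = 0;
        while (cabs(z) < 2.0 && colour < MAXIT) {   /* same loop as on the slide */
            z = z * z + c;
            colour++;
        }
        return colour;
    }

    /* Row-cyclic partitioning: rank r of p computes rows r, r+p, r+2p, ... */
    void compute_rows(int rank, int p) {
        for (int row = rank; row < HEIGHT; row += p) {
            for (int col = 0; col < WIDTH; col++) {
                double re = -2.0 + 3.0 * col / WIDTH;    /* real part: -2 .. +1         */
                double im = -1.5 + 3.0 * row / HEIGHT;   /* imaginary part: -1.5 .. 1.5 */
                putchar(get_colour(re + im * I) == MAXIT ? '#' : '.');
            }
            putchar('\n');
        }
    }

    int main(void) {
        compute_rows(0, 1);    /* sequential run; with p processes each rank r
                                  would call compute_rows(r, p) and the rows
                                  would be gathered afterwards */
        return 0;
    }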
A course for the 4th year students
(Major in Computer Science - Software)
PARALLEL COMPUTING:
Models and Algorithms
Parallel Performance: System and
Software Measures
Remember:
The Development of Parallel Programs
Involves ALL of the following:
1. identifying the portions of the work that can be
performed concurrently
2. mapping the concurrent pieces of work onto multiple
processes running in parallel
3. distributing the input, output and intermediate data
associated with the program
4. managing access to data shared by multiple processes
5. synchronizing the processes in various stages of
parallel program execution
The goal is to attain good performance in all stages.
Predicting and Measuring
Parallel Performance
Building parallel versions of software can enable applications:
to run a given data set in significantly less time
to run multiple data sets in a fixed amount of time
or to run large-scale data sets that are prohibitive with
sequential software
OK, these are the visible cases, but how do we measure the
performance of a parallel system in the other cases?
Traditional measures like MIPS and MFLOPS really
don't capture parallel computation performance
E.g., clusters, which can have very high FLOPS, may
be poorer at accessing all the data in the cluster
Metrics for Parallel Systems and
Algorithms Performance
Systematic ways to measure parallel performance are
needed:
Execution Time
Speedup
Efficiency
System Throughput
Cost-effectiveness
Utilization
Data access speed etc.
Execution time and overhead
The response time measures the interval between the
submission of a request and the moment the first response
is produced
Execution Time
Parallel runtime, T_p
Sequential runtime, T_s
Total Parallel Overhead: T_o = p*T_p - T_s
The overhead is any combination of excess or
indirect computation time, memory, bandwidth,
etc.
Minimizing Overhead in Parallel
Computing
Sources of overhead:
Invoking a function incurs the overhead of
branching and modifying the stack pointer
regardless of what that function does
Recursion
When we can choose among several algorithms,
each of which has known characteristics, their
overheads differ
Overhead can influence the decision whether or
not to parallelize a piece of code!
Speedup
Speedup is the most commonly used measure of parallel
performance
If T_s is the best possible serial time and T_p is the time
taken by a parallel algorithm on p processors, then

    s = T_s / T_p

Linear speedup, occurring when T_p = T_s/p, is
considered ideal
Superlinear speedup can happen in some cases
Speedup Definition Variability (1)
Exactly what is meant by T_s (i.e. the time
taken to run the fastest serial algorithm on
one processor)?
One processor of the parallel computer?
The fastest serial machine available?
A parallel algorithm run on a single processor?
Is the serial algorithm the best one?
To keep things fair, T_s should be the best
possible time in the serial world
Speedup Definition Variability (2)
A slightly different definition of speedup:
The time taken by the parallel algorithm on one
processor divided by the time taken by the parallel
algorithm on N processors
However this is misleading, since many parallel
algorithms contain extra operations to accommodate
the parallelism (e.g. the communication)
Result: T_s is increased, thus exaggerating the speedup
Factors That Limit Speedup
Computational (Software) Overhead
Even with a completely equivalent algorithm, software
overhead arises in the concurrent implementation
Poor Load Balancing
Speedup is generally limited by the speed of the slowest
node. So an important consideration is to ensure that
each node performs the same amount of work
Communication Overhead
Assuming that communication and calculation cannot
be overlapped, then any time spent communicating the
data between processors directly degrades the speedup
Linear Speedup
Whichever definition is used, the ideal is to
produce linear speedup (N, using N cores)
However, in practice the speedup is reduced
from its ideal value of N
For applications that scale well, the speedup
should increase at or close to the same rate as the
increase in the number of processors (threads)
Superlinear speedup results from:
unfair values used for T_s
differences in the nature of the hardware used
Speedup Curves
[Figure: speedup vs. number of processors - superlinear, linear and
typical speedup curves.]
Efficiency
Speedup does not measure how efficiently the
processors are being used
Is it worth using 100 processors to get a speedup of 2?
Efficiency is defined as the ratio of the speedup and the
number of processors required to achieve it:

    e = s / p

The efficiency is bounded from above by 1, measuring the
fraction of time during which a processor is usefully employed
In the ideal case, s = p, so e = 1
Amdahls Law
Used to compute an upper bound of speedup
A parallel algorithm has 2 types of operations:
Those which must be executed in serial
Those which can be executed in parallel
The speedup of a parallel algorithm is limited
by the percentage of operations which must be
performed sequentially
Amdahl's Law assumes a fixed data set size,
and the same % of overall serial execution time
Let the time taken to do the serial calculations
be some fraction α of the total time (0 < α ≤ 1)
The parallelizable portion is 1-α of the total
Amdahl's Law
Assuming linear speedup:
T_serial = α * T_1
T_parallel = (1-α) * T_1 / N
By substitution:

    Speedup ≤ 1 / ( α + (1-α)/N )
Consequences of Amdahls Law
Say we have a program containing 100
operations, each of which takes 1 time unit.
Suppose α = 0.2, using 80 processors
Speedup = 100 / (20 + 80/80) = 100 / 21 < 5
A speedup of only 5 is possible no matter how
many processors are available
So why bother with parallel computing?...
Just wait for a faster processor
Limitations of Amdahls Law
To avoid the limitations of Amdahl's law:
Concentrate on parallel algorithms with small serial
components
Amdahl's Law has been criticized for ignoring real-
world overheads such as communication,
synchronization, thread management, as well as the
assumption of infinite-core processors
It is not complete in that it does not take into account
problem size
As the no. of processors increases, the amount of data
handled is likely to increase as well
Gustafson's Law
If a parallel application using 32 processors is able to
compute a data set 32 times the size of the original,
does the execution time of the serial portion increase?
It does not grow in the same proportion as the data set
Real-world data suggests that the serial execution time will
remain almost constant
Gustafson's Law, aka scaled speedup, considers an
increase in the data size in proportion to the increase in
the number of processors, computing the (upper bound)
speedup of the application, as if the larger data set could
be executed in serial
Gustafson's Law Formula
Speedup ≤ p + (1-p)*s
where:
p is the number of processors
s is the percentage of serial execution time in the
parallel application for a given data set size
Since the % of serial time within the parallel execution
must be known, a typical usage for this formula is to
compute the speedup of the scaled parallel execution
(larger data sets as the number of processors increases)
to the serial execution of the same sized problem
Comparative Results
E.g., if 1% of execution time on 32 cores will be spent
in serial execution, the speedup of this application
over the same data set being run on a single core with
a single thread (assuming that to be possible) is:
Speedup ≤ 32 + (1-32)·0.01 = 32 - 0.31 = 31.69
Assuming the serial execution percentage to be 1%,
the equation for Amdahl's Law yields:
Speedup ≤ 1/(0.01 + (0.99/32)) ≈ 24.43
This is a false computation, however, since the given
% of serial time is relative to the 32-core execution
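A short C sketch that reproduces the two numbers above; it only evaluates the two formulas for p = 32 and a 1% serial fraction (an illustration, not part of the course materials):

#include <stdio.h>

int main(void) {
    double p = 32.0, s = 0.01;                 /* processors, serial fraction */

    /* Gustafson (scaled speedup): p + (1 - p) * s */
    double gustafson = p + (1.0 - p) * s;

    /* Amdahl: 1 / (s + (1 - s) / p) */
    double amdahl = 1.0 / (s + (1.0 - s) / p);

    printf("Gustafson: %.2f   Amdahl: %.2f\n", gustafson, amdahl);  /* 31.69  24.43 */
    return 0;
}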
Redundancy
Hardware redundancy: more processors are employed
for a single application, at least one acting as standby
Very costly, but often very effective, solution
Redundancy can be planned at a finer grain
Individual servers can be replicated
Redundant hardware can be used for non-critical activities
when no faults are present
Software redundancy: software must be designed so
that the state of permanent data can be recovered or
rolled back when a fault is detected
Granularity of Parallelism
Given by the average size of a sequential component
in a parallel computation
Independent parallelism: independent processes. No
need to synchronize.
Coarse-grained parallelism: relatively independent
processes with occasional synchronization.
Medium-grained parallelism. E.g. multi-threads
which synchronize frequently.
Fine-grained parallelism: synchronization every few
instructions.
Degree of Parallelism
Is given by the number of operations which can be
scheduled for simultaneous (parallel) execution
For pipeline parallelism, where the data is vector
shaped, the degree coincides with the vector size
(length)
It may be constant throughout the steps of an
algorithm, but most often it varies
It is best illustrated by the representation of parallel
computations that uses DAGs
In parallel programming there is a large gap:
Problem Structure <---> Solution Structure
We may try an intermediate step:
Problem ---> Directed Acyclic Graph (DAG) ---> Solution
DAGs (Directed Acyclic Graphs)
- very simple, yet powerful tools -
(Particular DAGs are the so-called Task Graphs)
Problem ---> DAG:
split problem into tasks
DAG ---> Solution:
map tasks to parallel architecture
What Is A Task Graph?
A graph which has:
1 root, 1 leaf, no cycles and all nodes connected
(Figure: four example graphs A, B, C and D - they are graphs, but they are not task graphs.)
How to go from Problem ---> Task Graph?
The standard algorithm to create task graphs:
1. Divide the problem into a set of n tasks
2. Every task becomes a node in the task graph
3. If task(x) cannot start before task(y) has finished,
then draw an edge from node(y) to node(x)
4. Identify (or create) starting and finishing tasks
The process (execution) flows through the task
graph like pipelining in a single processor system
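A minimal C sketch of this construction, representing the task graph as a dependency matrix; the example DAG and the "ready" test are illustrative assumptions:

#include <stdio.h>

#define NTASKS 5

/* dep[y][x] = 1 means: task x cannot start before task y has finished
   (an edge from node y to node x in the task graph). */
static int dep[NTASKS][NTASKS];

static void add_edge(int y, int x) { dep[y][x] = 1; }

/* A task is ready when all of its predecessors have finished. */
static int ready(int x, const int done[NTASKS]) {
    for (int y = 0; y < NTASKS; y++)
        if (dep[y][x] && !done[y]) return 0;
    return 1;
}

int main(void) {
    int done[NTASKS] = {0};
    /* Example DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 4 */
    add_edge(0, 1); add_edge(0, 2);
    add_edge(1, 3); add_edge(2, 3);
    add_edge(3, 4);

    /* The tasks that are ready now could be scheduled in parallel. */
    for (int x = 0; x < NTASKS; x++)
        if (!done[x] && ready(x, done)) printf("task %d is ready\n", x);
    return 0;
}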
Memory Performance
Capacity: How many bytes can be held
Bandwidth: How many bytes can be
transferred per second
Latency: How much time is needed to fetch
a word
Routing delays: when data must be gathered
from different parts of memory
Contention: resolved by memory blocking
A course for the 4th year students
(Major in Computer Science - Software)
PARALLEL COMPUTING:
Models and Algorithms
Principles of Parallel Algorithms
Design
Contents
Combinational Circuits (CCs): Metrics
Parallel Design for List Operations Using CCs
SPLIT(list1,property) --- O(size(list1))
MERGE(list1,list2) --- O (max(size(list1),size(list2)))
SORT (list1) ---- O(size(list1)^2)
SEARCH(key,directory) --- O(size(directory))
Metrics for Communication Networks (CNs):
clique, mesh, torus, linear array, ring
hypercube, shuffle exchange
Designing a Connection Network
Remember our Base (Ideal) Platform:
Parallel Random Access Machine
consists of:
p processors, working in lock-step, synchronous
manner on the same program instructions
each has local memory
each is connected to unbounded shared memory
access time to shared memory costing only one step
an algorithm for PRAM might lead to a good algorithm
for a real machine
if something cannot be efficiently solved on PRAM, it
cannot be efficiently done on any practical machine
(based on current technology)
Parallel Algorithms Modeling
Can be done in different ways
We have already seen some models for parallel
computations:
Directed Acyclic Graphs (DAGs)
Task Graphs
Will examine next modeling by means of Combinational
Circuits (CCs)
Modeling computations is not enough
We must also put into model the communication needs of the
parallel algorithm
Modeling Parallel Computations: using
Combinational Circuits (CCs)
CCs - a family of models of computation, consisting of:
A number of inputs at one end
A number of outputs at the other end
A number of interconnected components (internally) arranged in
columns called stages
Each component can be viewed as a single (logical) processor
with constant fan-in and constant fan-out.
Components synchronise their computations (input to output) in a
constant time unit (independent of input values) like PRAMs!
Computations are usually simple logical operations (directly
implementable in hardware for speed!), but they may be used for
more complex operations as well
There must be no feedback
Metrics for Combinational Circuits
Width
The maximum number of components in any one stage; shows
the most efficient use of parallel resources during execution
Depth
The number of stages; measures the (time) complexity of the
parallel algorithm implemented using a CC
Size
The total number of components; measures the constructive
complexity of the CC, which is equivalent to the total number
of fundamental operations
It may be an indicator of the total number of operations in the
algorithm
List Processing using Combinational
Circuits
Imagine we have direct hardware implementation of some list
processing functions
Fundamental operations of these hardware computers correspond to
fundamental components in our CCs
Processing tasks which are non-fundamental on a standard single
processor architecture can be parallelised (to reduce complexity)
Classic processing examples: searching, sorting, permuting, ...
Implementing them on a different parallel machine may be done
using a number of components set up in a combinational circuit.
Question: what components are useful for implementation in a CC?
Answering this will help us to reveal some fundamental principles in
parallel algorithm design
Example: consider the following fundamental operations:
(BI)PARTITION(list1) --- constant time (no need to parallelise)
APPEND(list1,list2) --- constant time (no need to parallelise)
and the following non-fundamental operations:
SPLIT(list1,property) --- O(size(list1))
Parallel Design for List Operations
MERGE(list1,list2) --- O(max(size(list1),size(list2)))
SORT(list1) ---- O(size(list1)^2)
SEARCH(key,directory) --- O(size(directory))
What can we do here to attack the complexity?
Compositional Analysis - use the analysis of each component to construct the design,
with further analysis of speedup and efficiency
Advantage - re-use of already done analysis
Requires - complexity analysis for each component.
Parallel Design - the SPLIT operation
Consider the component SPLIT(property), with input list L = [L1, ..., Ln]
and output lists M = [M1, ..., Mp] and N = [N1, ..., Nq], where SPLIT partitions L into M and N such that:
Forall Mx, Property(Mx)
Forall Ny, Not(Property(Ny))
Append(M,N) is a permutation of L
Question: Can we use the property structure to help parallelise the
design? EXAMPLE: A ^ B, A v B, any boolean expression
Example: Splitting on structured
property A ^ B
(Circuit diagram: a BIP (bipartition) component divides the input list; SPLIT(A) and SPLIT(B) components then work in parallel on the parts; append (app) components recombine the sublists.)
Example: Splitting on property A ^ B
(cont.)
A probabilistic analysis is needed; typical gain when P(A) = 0.5 and P(B) = 0.5:
(Annotated circuit: the input of size n is bipartitioned (BIP) into two halves of size n/2; each half goes through a SPLIT(A), whose outputs of expected size n/4 go through SPLIT(B) components; append (app) components recombine the results.)
Depth of the circuit is 1 + (n/2) + (n/4) + 1 + 1 = 3 + (3n/4)
Example: Splitting on an unstructured
property (integer list into evens and odds)
(Circuit: a BIP component, SPLIT components working in parallel on the halves, and append (app) components.)
Question: what is the average speedup for the design?
Example: Splitting on an unstructured
property (cont.)
(Annotated circuit: the input of size n is bipartitioned (BIP) into two halves of size n/2; each half goes through a SPLIT (even/odd) of size n/2; append (app) components recombine the results.)
Answer: Doing a probabilistic analysis as before, Depth = 2 + n/2
Parallel Design - the MERGE operation
Merging is applied on 2 sorted sequences; let them in the average,
typical case, be of equal length m = 2^n.
Recursive implementation:
Base case, n=0 => m = 1.
Precondition is met: a list with 1 element is already sorted!
The component required is actually a comparison operator:
Merge(1) = Compare
M1 is the component C (or CAE - compare and exchange):
inputs X = [x1] and Y = [y1]; outputs [min(x1,y1)] and [max(x1,y1)]
MERGE - the first recursive composition
QUESTION:
Using only component M1 (the comparison C), how can we construct
a circuit for merging lists of length 2 (M2)?
Useful Measures: Width, Depth, Size (all = 1 for the base case)
ANALYSIS:
How many M1s (the size) are needed in total?
What is the complexity (based on the depth)?
What is the most efficient use of parallel resources (based on
width) during execution?
MERGE - building M2 recursively
from a number of M1s
We may use 2 M1s to initially merge the odd and the even input elements
Then another M1 is used to compare the middle values
(Circuit M2: inputs X = [x1,x2] and Y = [y1,y2] feed two M1 (C) components; a third C compares the middle values; outputs are [z1,z2,z3,z4].)
Width = 2  Depth = 2  Size = 3
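A sequential C sketch of the M1 (compare-and-exchange) component and of the M2 circuit built from three of them, only to make the construction concrete; variable names are illustrative:

#include <stdio.h>

/* M1 / C (compare-and-exchange): outputs the pair (min, max). */
static void cae(int a, int b, int *lo, int *hi) {
    *lo = a < b ? a : b;
    *hi = a < b ? b : a;
}

/* M2: merges sorted X = [x1,x2] and Y = [y1,y2] into z[0..3]
   using 3 CAEs (size 3); the first two are independent (width 2, depth 2). */
static void m2(const int x[2], const int y[2], int z[4]) {
    int a, b, c, d;
    cae(x[0], y[0], &a, &b);      /* merge the odd input elements  */
    cae(x[1], y[1], &c, &d);      /* merge the even input elements */
    z[0] = a;
    z[3] = d;
    cae(b, c, &z[1], &z[2]);      /* compare the middle values     */
}

int main(void) {
    int x[2] = {1, 5}, y[2] = {2, 7}, z[4];
    m2(x, y, z);
    printf("%d %d %d %d\n", z[0], z[1], z[2], z[3]);   /* 1 2 5 7 */
    return 0;
}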
MERGE - proving M2 is correct
Validation is based on testing the CC with different input values
for X and Y
This does not prove that the CC is correct for all possible cases
Clearly, there are equivalence classes of tests
These must be identified and correctness must be proved only for
the classes
Here we have 3 equivalence classes (use symmetry to swap X,Y)
DISJOINT:    x1 <= x2 <= y1 <= y2
OVERLAP:     x1 <= y1 <= x2 <= y2
CONTAINMENT: x1 <= y1 <= y2 <= x2
MERGE - The next recursive step: M4
A 2-layer architecture can be used for constructing M4 from
M2s and M1s (Cs)
Consequently we can say M4 is constructed just from M1s!
(Circuit M4: inputs X and Y feed two M2 circuits, followed by a layer of 3 comparators C.)
Questions: 1. how can you prove the validity of the construction?
2. what are the size, width and depth (in terms of M1s)?
MERGE Measures for recursive step
on M4
Depth (M4) = Depth (M2) +1
Width (M4) = Max (2*Width(M2), 3)
Size (M4) = 2*Size(M2) + 3
that gives: Depth = 3, Width = 4, Size = 9
MERGE - The general recursive
construction
Now we consider the general case:
Given any number of Mm's, how do we construct an M2m?
(Circuit M2m: inputs x1..x2m and y1..y2m feed two Mm circuits, followed by a final layer of 2m-1 comparators C.)
MERGE - Measures and recursive
analysis on the general merge circuit Mm
Width: width(Mm) = 2 * width(Mm/2) = ... = m
Depth: Let d(2m) = depth(M2m),
then d(2m) = 1 + d(m), for m >= 1, and d(1) = 1
=> d(m) = 1 + log(m)
Size: Let s(2m) = size(M2m),
now s(2m) = 2s(m) + (2m-1), for m >= 1, and s(1) = 1
=> s(m) = 1 + m log(m)
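The recursive construction above is essentially Batcher's odd-even merge. A sequential C sketch that simulates the circuit Mm (it assumes both inputs are sorted and have the same power-of-two length, and is only an illustration, not the course's reference code):

#include <stdio.h>

/* The C component: compare-and-exchange in place. */
static void cae(int *a, int *b) {
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* Simulates M_m: merges sorted a[0..m-1] and b[0..m-1] into c[0..2m-1],
   m a power of two.  Two M_{m/2} circuits work on the odd- and even-indexed
   elements; a final layer of m-1 Cs fixes the interleaved result. */
static void merge_circuit(const int *a, const int *b, int *c, int m) {
    if (m == 1) {                        /* base case: a single C */
        c[0] = a[0]; c[1] = b[0];
        cae(&c[0], &c[1]);
        return;
    }
    int ao[m/2], ae[m/2], bo[m/2], be[m/2], d[m], e[m];
    for (int i = 0; i < m / 2; i++) {
        ao[i] = a[2*i]; ae[i] = a[2*i + 1];    /* odd / even elements */
        bo[i] = b[2*i]; be[i] = b[2*i + 1];
    }
    merge_circuit(ao, bo, d, m / 2);           /* first  M_{m/2} */
    merge_circuit(ae, be, e, m / 2);           /* second M_{m/2} */
    c[0] = d[0];
    c[2*m - 1] = e[m - 1];
    for (int i = 1; i < m; i++) {              /* final layer of Cs */
        c[2*i - 1] = d[i];
        c[2*i]     = e[i - 1];
        cae(&c[2*i - 1], &c[2*i]);
    }
}

int main(void) {
    int a[4] = {1, 4, 6, 9}, b[4] = {2, 3, 7, 8}, c[8];
    merge_circuit(a, b, c, 4);
    for (int i = 0; i < 8; i++) printf("%d ", c[i]);   /* 1 2 3 4 6 7 8 9 */
    printf("\n");
    return 0;
}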
Parallel Design - the SORT operation
Sorting can be done in a lot of manners choosing to sort by
merging exhibits two important advantages:
this is a method with great potential of parallelism
we may use the parallel implementation of another non-
fundamental operation (MERGE)
Also a good example of recursively constructing CCs; the same
technique can be applied to all CCs synthesis and analysis
This requires understanding of standard non-parallel (sequential)
algorithm and shows that some sequential algorithms are better
suited to parallel implementation than others
Well suited to formal reasoning (preconditions, invariants,
induction, ...)
Sorting by Merging
We can use the merge circuits to sort arrays - for example,
sorting an array of 8 numbers (the circuit S8):
(Circuit S8: a first layer of four M1s, a second layer of two M2s, and a final M4.)
Sorting by Merging the Analysis
Analyse the base case for sorting a 2 integer list (S2)
Synthesise and analyse S4
What are the width, depth and size of Sn?
What about cases when n is not a power of 2?
Question: is there a more efficient means of sorting using the
merge components? If so, why?
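One way to answer the measure questions numerically is to evaluate the recurrences derived above. The sketch below assumes that Sn is built from two Sn/2 circuits followed by an Mn/2 (consistent with the S8 figure); it is an illustration, not part of the course code:

#include <stdio.h>

/* Merge circuit M_m (merges two sorted lists of length m):
   depth(1) = size(1) = 1
   depth(m) = depth(m/2) + 1
   size(m)  = 2*size(m/2) + (m-1)            */
static int m_depth(int m) { return m == 1 ? 1 : m_depth(m / 2) + 1; }
static int m_size (int m) { return m == 1 ? 1 : 2 * m_size(m / 2) + (m - 1); }

/* Sorting circuit S_n = two S_{n/2} followed by M_{n/2}; S_1 is empty. */
static int s_depth(int n) { return n == 1 ? 0 : s_depth(n / 2) + m_depth(n / 2); }
static int s_size (int n) { return n == 1 ? 0 : 2 * s_size(n / 2) + m_size(n / 2); }

int main(void) {
    printf("  n | M_n depth  M_n size | S_n depth  S_n size\n");
    for (int n = 2; n <= 16; n *= 2)
        printf("%3d | %9d %9d | %9d %9d\n",
               n, m_depth(n), m_size(n), s_depth(n), s_size(n));
    return 0;                /* e.g. S_8: depth 6, size 19 = 4*1 + 2*3 + 9 */
}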
Parallel Design - the SEARCH operation
Searching is fundamentally different from all the other components:
the structure to be used for parallelisation is found in the component
(directory), not in the input data
we need to be able to cut up state, not just communication channels
also, we need some sort of synchronisation mechanism
(Figure: a SEARCH(directory) component, taking a key as input and producing data as output, decomposed into several SEARCH components over parts of the directory, whose partial results must be combined.)
Homework on Parallel Design using CCs
Try to prove the correctness for sorting by merging, first for the
given case n=8, and then in the general case
Look for information on parallel sorting on the web, and apply CCs
in your own parallel sorting method; at least two different methods
for sorting should be implemented, if possible in different manner:
one recursive
the other non-recursive
Give a recursive implementation for SEARCH parallel operation
using CCs, like we did for MERGE and SORT
Perform a recursive analysis on all the recursive methods designed
to compute the parallel complexity measures width, depth and size
Remember: Our Methodology
(Parallelization Guide)
Identify computational hotspots
find what is worth parallelizing
Partition the problem into small semi-independent tasks
find/create parallelism
Identify Communication requirements between tasks
realize the constraints communication puts on parallelism
Agglomerate smaller tasks into larger tasks
group the basic tasks together so that the communication is
minimized, while still allowing good load balancing properties
Translate (map) tasks/data to actual processors
balance the load of processors, while trying to minimize
communication
Parallel Algorithm Design and
Communication Networks (CNs)
We can afford to ignore the CN only in the early phases of the
parallel design of logical processes (LPs)
The communication network and the parallel architecture for which
a parallel algorithm is destined also play an important role in its
selection, i.e.:
Matrix algorithms are best suited for meshes
Divide and conquer, recursive approaches and others are appropriate for trees
Greater flexibility in the algorithm can benefit from hypercube
topologies
Engineering choices and compromises, but also the correct
estimation of metrics, play a great role during the late phases of parallel design
Metrics for Communication Networks
Degree:
The degree of a LP (CN node) is its number of (direct) neighbours
in the CN graph
The degree of the whole algorithm (CN graph) is the maximum of
all processor degrees in the network
A high degree has theoretical power, a low degree is more practical
Connectivity:
Since a network node and/or link may fail, the network should still
continue to function with reduced capacity
The node connectivity is the minimum number of nodes that must
be removed in order to partition (divide) the network
The link connectivity is the minimum number of links that must be
removed in order to partition the network
Metrics for CNs (cont.)
Diameter:
Is the maximum distance between two nodes - that is, the maximum
number of nodes that must be traversed to send a message to any node
along a shortest path
A lower diameter implies a shorter time to send messages across the network
Narrowness:
This is a measure of (potential) congestion, defined as below
We partition the CN into 2 groups of LPs (let's say A and B)
In each group the number of processors is denoted as Na and Nb (Nb<=Na)
We count the number of interconnections between A and B (call this I)
The maximum value of Nb/I for all possible partitions is the narrowness.
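A brute-force C sketch for computing the narrowness of a small CN given as an adjacency matrix (it enumerates every 2-partition, so it is only usable for small examples; the fully connected 5-node network below is an assumed test case):

#include <stdio.h>

#define N 5                                   /* number of LPs (nodes) */

static int adj[N][N];

int main(void) {
    /* assumed test network: fully connected on N nodes */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            adj[i][j] = (i != j);

    double narrowness = 0.0;
    for (int mask = 1; mask < (1 << N) - 1; mask++) {    /* every partition (A,B) */
        int nb = 0, links = 0;
        for (int i = 0; i < N; i++) if (mask & (1 << i)) nb++;
        if (nb > N - nb) continue;                       /* keep Nb <= Na */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if ((mask & (1 << i)) && !(mask & (1 << j)) && adj[i][j])
                    links++;                             /* interconnections I */
        if (links > 0 && (double)nb / links > narrowness)
            narrowness = (double)nb / links;
    }
    printf("narrowness = %.3f\n", narrowness);           /* 0.333 for K5 */
    return 0;
}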
Metrics for CNs (cont.)
Expansion increments:
This is a measure of (potential) expansion
A network should be expandable that is, it should be possible to
create larger systems (of the same topology) by simply adding new
nodes
It is better to have the option of small increments (why?)
Fully Connected Networks
A common topology: each node is
connected (directly) to all other
nodes (by 2-way links)
Example: with n = 5 nodes we have 10 links
Question: how many links are there with n nodes?
(Figure: the complete graph on 5 nodes.)
Metrics for n = 5
Degree = 4
Diameter = 1
Node Connectivity = 4, Link connectivity = 4
Narrowness = 1/3: for the partition A = {3 nodes}, B = {2 nodes}, Nb/I = 2/6 = 1/3; for A = {4 nodes}, B = {1 node}, Nb/I = 1/4
Expansion Increment = 1
General case (we still may have to differentiate between n even and n odd)
Fully Connected Networks (cont.)
If n is even:
Degree = n-1
Connectivity = n-1
Diameter = 1
Narrowness = 2/n
Expansion Increment = 1
If n is odd:
?
Mesh and Torus
In a mesh, the nodes are arranged in a k-dimensional lattice of width w,
giving a total of w^k nodes; we may have in particular:
k =1 giving a linear array, or
k =2 giving a 2-dimensional matrix
Communication allowed only between neighbours (no diagonal connections)
A mesh with wraparound is called a torus
The Linear Array and the Ring
A simple ring
Question: what are the metrics
for n = 6
in the general case?
A chordal ring
Hypercube Connections (Binary n-Cubes)
(Figures: a 1-D hypercube (2 nodes, labelled 0 and 1), a 2-D hypercube (4 nodes, 00..11), a 3-D hypercube (8 nodes, 000..111) and a 4-D hypercube.)
The networks consist of N = 2^k nodes arranged in a k-dimensional hypercube.
The nodes are numbered 0, 1, ..., 2^k - 1 and two nodes are connected if their
binary labels differ by exactly 1 bit
Question: what are the metrics of an n-dimensional hypercube?
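A small C sketch answering part of the question: the neighbours of a node in a k-dimensional hypercube are obtained by flipping each of the k bits of its label, so the degree is k (and the diameter, the maximum Hamming distance between labels, is also k):

#include <stdio.h>

/* Print the neighbours of every node of a k-dimensional hypercube. */
static void print_neighbours(unsigned id, int k) {
    printf("node %u:", id);
    for (int bit = 0; bit < k; bit++)
        printf(" %u", id ^ (1u << bit));     /* flip one bit = one neighbour */
    printf("\n");
}

int main(void) {
    int k = 3;                               /* 3-D hypercube, 8 nodes */
    for (unsigned id = 0; id < (1u << k); id++)
        print_neighbours(id, k);
    return 0;
}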
Shuffle Exchange
(Figure: processors p0..p7 with their shuffle connections.)
A 1-way communication line links PI to PJ, where:
J = 2I for 0 <= I <= N/2 - 1,
J = 2I + 1 - N for N/2 <= I <= N - 1 (here N = 8)
2-way links may be added between every even processor and its successor
Shuffle Exchange from another view
(Figure: the same processors p0..p7, redrawn to emphasise the shuffle and exchange links.)
Question: what are the metrics for
the case n = 8
the general case, for any n which is a power of 2?
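A small C sketch of the perfect-shuffle mapping given above, for any N that is a power of two (for N = 8 it reproduces the connections in the figure):

#include <stdio.h>

/* Shuffle destination for processor I in a network of N = 2^k processors:
   J = 2I          for 0 <= I <= N/2 - 1
   J = 2I + 1 - N  for N/2 <= I <= N - 1
   (equivalent to a left rotation of the k-bit label of I). */
static int shuffle(int i, int n) {
    return (i < n / 2) ? 2 * i : 2 * i + 1 - n;
}

int main(void) {
    int n = 8;
    for (int i = 0; i < n; i++)
        printf("P%d -> P%d\n", i, shuffle(i, n));
    return 0;
}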
Designing a Connection Network
Typically, requirements are specified as bounds on a subset
of metrics:
min_nodes < number_nodes < max_nodes
min_links < number_links < max_links
connectivity > c_min
diameter < d_max
narrowness < n_max
Normally, experience might tell if a classic CN fits.
Otherwise a CN which is close to meeting the requirements
must be refined; or 2 (or more) CNs must be combined in a
complementary fashion (if possible!)
A course for the 4th year students
(Major in Computer Science - Software)
PARALLEL COMPUTING:
Models and Algorithms
Parallel Virtual Computing
Environments
Contents
Historical Background
HPC-VCE Architectures
HPC-VCE Programming Model
Parallel Execution Issues in a VCE
PVM
MPI
Historical Background (1)
Complex scientific research has always been
looking for immense computing resources
Supercomputers have been used traditionally
to provide processing capability (1960s - 1990s)
In recent years, it has been more feasible to
use commodity computers, that is, to build
supercomputers by connecting 100s of cheap
workstations to get high processing capability
Example: the Beowulf system, created from
desktop PCs linked by a high-speed network
Historical Background (2)
High-speed networks enabled the integration
of resources, geographically distributed and
managed at different domains.
In the late 1990s, Foster & Kesselman proposed a
"plug in the wall" approach: Grid Computing
It was aimed to make globally dispersed computer
power as easy to access as an electric power grid
In the next decade, Cloud Computing was
introduced it refers to a technology providing
virtualized distributed resources over Internet
HPC-VCE Architectures/
Programming Paradigms
There are several widely spread types of systems for
parallel processing, which are fundamentally different
from the programming point of view:
SMP (Symmetric Multi-Processing)
MPP (Massively Parallel Processing)
(Computer) Clusters
Grids and Clouds
NOW (Network of Workstations)
Symmetric Multi-Processing
An architecture in which multiple CPUs,
residing in one cabinet, are driven from a
single O/S image
A Pool of Resources
Each processor is a peer (one is not favored more
than another)
Shared bus
Shared memory address space
Common I/O channels and disks
Separate caches per processor, synchronized via
various techniques
But if one CPU fails, the entire SMP system is down
Clusters of two or more SMP systems can be used
to provide high availability (fault resilience)
Scalability of SMPs
Is limited (2-32), reduced by several factors, such as:
Inter-processor communication
Bus contention among CPUs and serialization points
Kernel serialization
Most vendors have SMP models on the market:
Sequent, Pyramid, Encore pioneered SMP on Unix
platforms
IBM, HP, NCR, Unisys also provide SMP servers
Many versions of Unix, Windows NT, NetWare
and OS/2 have been designed or adapted for SMP
Speedup and Efficiency of SMPs
SMPs help with overall throughput, not a single job,
speeding up whatever processes can be overlapped
In a desktop computer, it would speed up the
running of multiple applications simultaneously
If an application is multithreaded, it will improve
the performance of that single application
The OS controls all CPUs, executing simultaneously,
either processing data or in an idle loop waiting to do
something
CPUs are assigned to the next available task or
thread that can run concurrently
Massively Parallel Processing
Architecture in which each available processing
node runs a separate copy of the O/S
Distributed Resources
Each CPU is a subsystem with its own memory
and copy of the OS and application
Each subsystem communicates with the others
via a high-speed interconnect
There are independent cache/memory/and I/O
subsystems per node
Data is shared via function, from node-to-node
generally
Sometimes this is referred to as a shared-nothing
architecture (example: IBM SP2)
Integrated MPP and SMP
Is possible: the Reliant computer from Pyramid Technology combined
both MPP and SMP processing.
Speedup and Efficiency
Nodes communicate by passing messages, using
standards such as MPI
Nearly all supercomputers as of 2005 are massively
parallel, and may have 100,000+ CPUs
The cumulative output of the many constituent CPUs
can result in large total peak FLOPS
The true amount of computation accomplished depends on
the nature of the computational task and its implementation
Some problems are more intrinsically able to be separated
into parallel computational tasks than others
Single chip implementations of massively parallel
architectures are becoming cost effective
Further Comments
To use MPP effectively, a problem must be breakable
into pieces that can all be solved simultaneously
It is the case of scientific environments: simulations or
mathematical problems can be split apart and each part
processed at the same time
In the business world: a parallel data query (PDQ) can
divide a large database into pieces (parallel groups)
In contrast: applications that support parallel operations
(multithreading) may immediately take advantage of
SMPs - and performance gains are available to all
applications simply because there are more processors
Computer Clusters
Composed of multiple computing nodes working
together closely so that in many respects they form
a single computer to process computational jobs
Clusters are increasingly built by assembling the
same or similar type of commodity machines that
have one or several CPUs and CPU cores
Clusters are used typically when the tasks of a job
are relatively independent of each other so that they
can be farmed out to different nodes of the cluster
In some cases, tasks of a job may still need to be
processed in a parallel manner, i.e. tasks may be
required to interact with each other during execution
Computer vs. Data Clusters
Computer clusters should not be confused with data
clusters, which refer to the allocation of storage for files/directories
They are loosely coupled sets of independent
processors functioning as a single system to provide:
Higher Availability (remember: clusters of 2+ SMP
systems are used to provide fault resilience)
Performance and Load Balancing
Maintainability
Examples: RS/6000 (up to 8 nodes), DEC Open
VMS Cluster (up to 16 nodes), IBM Sysplex (up to
32 nodes), Sun SparcCluster
Clustering Issues
(valid for both computing and data)
A cluster of servers may
provide fault tolerance
and/or load balancing
If one server fails, one or
more additional servers
are still available
Load balancing is used to
distribute the workload
over multiple systems
How It Works
The allocation of jobs to individual nodes of a cluster
is handled by a Distributed Resource Manager (DRM)
The DRM allocates a task to a node using the resource
allocation policies that may consider node availability,
user priority, job waiting time, etc.
Typically, DRMs also provide submission and monitor
interface, enabling users to specify jobs to be executed
and keep track of the progress of execution
Examples of popular resource managers are: Condor,
the Sun Grid Engine (SGE) and the Portable Batch
Queuing System (PBS)
Types of Clusters
The primary distinction within computer clusters is how
tightly-coupled the individual nodes are:
The Beowulf Cluster Design: densely located, sharing
a dedicated network, probably has homogeneous nodes
"Grid" Computing: when a compute task uses one or
few nodes, and needs little inter-node communication
Middleware such as MPI (Message Passing Interface) or
PVM (Parallel Virtual Machine) allows well designed
programs to be portable to a wide variety of clusters
Speedup and Efficiency
The TOP500 list includes many clusters
Tightly-coupled computer clusters are often designed
for "supercomputing"
The central concept of a Beowulf cluster is the use of
commercial off-the-shelf (COTS) computers to produce a
cost-effective alternative to a traditional supercomputer
But clusters, which can reach very high FLOPS, may be
poorer in accessing all data in the cluster
They are excellent for parallel computation, but inferior to
traditional supercomputers at non-parallel computation
Grids
A computational grid is a hardware and software
infrastructure providing dependable, consistent,
pervasive and cheap access to high-end computational
capabilities. (Foster, Kesselman, The Grid: Blueprint
for a New Computing Infrastructure, 1998)
The key concept in a grid is the ability to negotiate
resource-sharing arrangements among a set of
resources from participating parties and then to use the
resulting resource pool for some purpose
The ancestor of the Grid is Metacomputing, which tried
to interconnect supercomputer centers with the purpose
to obtain superior processing resources
A Grid Checklist
1) Coordination of resources that are not
subject to centralized control
2) Use of standard, open, general-purpose
protocols and interfaces
3) Delivery of nontrivial qualities of service
(response time, throughput, availability, security)
Grid computing is concerned with coordinated resource sharing
and problem solving in dynamic, multi-institutional virtual
organizations (Foster, Tuecke, The Anatomy of the Grid,
2000), and/or co-allocation of multiple resource types to meet
complex user demands, so that the utility of the combined system
is significantly greater than that of the sum of its parts
How Grids Work
A typical grid computing architecture includes a meta-
scheduler connecting a number of geographically
distributed clusters that are managed by local DRMs
The meta-scheduler (e.g. GridWay, GridSAM) aims
to optimize computational workloads by combining an
organization's multiple DRMs into an aggregated
single view, allowing jobs to be directed to the best
location (cluster) for execution
It integrates computational resources into a global
infrastructure, so that users no longer need to be aware
of which resources are used for their jobs
Grid Middleware
Grid computing tried to introduce common interfaces
and standards that eliminate the heterogeneity from
the resource access in different domains
Therefore, several grid middleware systems have been
developed to resolve the differences that exist between
submission, monitoring and query interfaces of DRMs
The Globus Toolkit provides: a platform-independent job
submission interface, GRAM (Globus Resource Allocation
Manager), which cooperates with the underlying DRMs to integrate
the job submission method; a security framework, GSI (the
Grid Security Infrastructure), and a resource information
mechanism, MDS (the Monitoring and Discovery Service)
The Globus Toolkit
Interactions with its components are mapped to local
management system specific calls; support is provided
for many DRMs, including Condor, SGE and PBS
GridSAM provides a common job submission/monitoring
interface to multiple underlying DRMs
As a Web Service based submission service, it implements
the Job Submission Description Language (JSDL) and a
collection of DRM plug-ins that map JSDL requests and
monitoring calls to system-specific calls
In addition, a set of new open standards and protocols
like OGSA (Open Grid Services Architecture), WSRF
(Web Services Resource Framework) are introduced
to facilitate mapping between independent systems
Grid Challenges
Computing in grid environments may be difficult due to:
Resource heterogeneity: results in differing capability
of processing jobs, making the execution performance
difficult to assess
Resource dynamic behavior: it exists in both the
networks and computational resources
Resource co-allocation: the required resources must be
offered at the same time, or the computation cannot go
Resource access security: important things need to be
managed, i.e. access policy (what is shared? to whom?
when?), authentication (how do users/resources identify themselves?),
authorization (are operations consistent with the rules?)
Clouds
Cloud computing uses the Web server facilities of a 3rd
party provider on the Internet to store, deploy and run
applications
It takes two main forms:
1. Infrastructure as a Service (IaaS): only hardware/
software infrastructure (OS, databases) are offered
Includes Utility Computing, DeskTop Virtualization
2. "Software as a Service" (SaaS), which includes the
business applications as well
Regardless whether the cloud is infrastructure only or
includes applications, major features are self service,
scalability and speed
Speedup and Performance
Customers log into the cloud and run their applications
as desired; although a representative of the provider
may be involved in setting up the service, customers
make all configuration changes from their browsers
In most cases, everything is handled online from start
to finish by the customer
The cloud provides virtually unlimited computing
capacity and supports extra workloads on demand
Cloud providers may be connected to multiple Tier 1
Internet backbones for fast response times/ availability
Infrastructure Only (IaaS/PaaS)
Using the cloud for computing power only can be
cheap to support new projects or seasonal increases
When constructing a new datacenter, there are very big
security, environmental and management issues, not to
mention hardware/software maintenance forever after
In addition, commercial cloud facilities may be able to
withstand natural disasters
Infrastructure-only cloud computing is also named
infrastructure as a service (IaaS), platform as a service
(PaaS), cloud hosting, utility computing, grid hosting
Infrastructure & Applications (SaaS)
More often, cloud computing refers to application
service providers (ASPs) that offer everything: the
infrastructure as outlined below and the applications,
relieving the organization of virtually all maintenance
Google Apps and Salesforce.com's CRM products are
examples of this "software-as-a-service" model (SaaS)
This is a paradigm shift because company data are
stored externally; even if data are duplicated in-house,
copies "in the cloud" create security and privacy issues
Companies may create private clouds within their own
datacenters, or use hybrid clouds (both private/public)
Networks of Workstations (NOW)
Uses a network-based architecture (possibly even the Internet),
even when working in a massively parallel model
More appropriate to distributed computing, therefore
they are seen sometimes as distributed computers
NOW formed the hardware/ software foundation
used by the Inktomi search engine (Inktomi was
acquired by Yahoo! in 2002)
This led to a multi-tier architecture for Internet
services based on distributed systems, in use today
NOW Working in Parallel
Application partitions task into manageable subtasks
Application asks participating nodes to post available
resources and computational burdens
Network bandwidth
Available RAM
Processing power available
Nodes respond and application parses out subtasks to
nodes with less computational burden and most
available resources
Application must parse out subtasks and synchronize
answer
HPC-VCE Programming Model
High Performance Computing environments (HPCe)
have to deliver a tremendous amount of power over a
short period of time
A Virtual Computing Environment (VCE) :
Uses existing software to build a programming model on top
of which rapid parallel/ distributed applications development
is made possible
Provides tools to create, debug, and execute applications on
heterogeneous hardware
Lets the software map high-level descriptions of the problems
to available hardware
Low-level issues are no longer a concern of the programmer
Parallel Computing/ Programming in
Distributed Environments
The Bad News
Too many architectures
Existing architectures are too specific
Programs too closely tied to architecture
Software was developed using an obsolete mentality
The Good News
Centralized systems are a thing of the past
Computing was evolving towards cycle servers
Each user has his/her own computer
Workstations are networked
Typical LAN speeds are 100 Mb/s
Workstation Users in VCEs
All VCE configurations include workstations
Workstations are chronically underutilized
Workstation users can be classified as:
Casual Users
Sporadic Users
Frustrated Users
The VCE must help frustrated users without
hurting casual and sporadic users
Other Considerations
The VCE must be cost effective
Use existing tools like NFS, PVM, MPI
whenever possible
Must not require tremendous amounts of
processor power
The VCE must coexist with other software
Non-VCE applications should not be impacted
by the VCE
The VCE must avoid kernel modes
A VCE Minimal Configuration
(Diagram: Problem Specification -> Design Stage -> Coding Level, supported by the SDM;
Coding Level -> Compilation Manager -> Runtime Manager, supported by the SEM.)
The SDM (software development module) provides tools to build the application task graph
The SEM (software execution module) compiles applications and dispatches tasks
Parallel Execution Issues in a VCE
Compilation Issues
Executables must be prepared to maximize scheduling
flexibility
Compilations must be scheduled to maximize
application performance and hardware utilization
Runtime Issues
Task Placement: the criteria for automatically selecting
machines to host tasks must consider both hardware
utilization and application throughput
Programmers may improve task placement decisions
Processor Utilization and Task
Migration
Free parallelism: parallel applications with low
efficiency benefit when run on idle machines
Load balancing: a central issue in the execution
module
Various migration strategies are possible
Redundant execution
Check-pointing
Dump and migrate
Recompilation
Byte coded tasks
Parallel VCE systems in a decade
P4
Chameleon
Parmacs
PVM
MPI
CHIMP
NX (Intel i860, Paragon)
...
PVM : What is it?
Heterogeneous Virtual Machine support for:
Resource Management
add/delete hosts from a virtual machine (VM)
Process Control
spawn/kill tasks dynamically
Communication using Message Passing
blocking send, blocking and non-blocking receive, mcast
Dynamic Task Groups
task can join or leave a group at any time
Fault Tolerance
VM automatically detects faults and adjusts
Popular PVM Uses
Poor man's Supercomputer
PC clusters, Linux, Solaris, NT
Cobble together whatever resources you can get
Metacomputer linking multiple (super)computers
ultimate performance: e.g. combining 1000s of processors
and up to 50 supercomputers
Education Tool
teaching parallel programming
academic and thesis research
PVM In a Nutshell
PVM is set on top of different architectures running
different operating systems (Hosts)
Each host runs a PVM daemon (PVMD)
A collection of PVMDs defines the VM
Once configured, tasks can be started (spawned),
killed, signaled from a console
Communicate using basic message passing
Performance is good
API Semantics limit optimizations
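A hedged, minimal PVM master sketch in C, only to show the spawn / pack / send / receive flavour of the API; the worker executable name "worker", the message tags and the absence of error handling are all simplifying assumptions:

#include <stdio.h>
#include "pvm3.h"

int main(void) {
    int mytid = pvm_mytid();                  /* enroll in the virtual machine */
    int tids[4], n, i, result;

    printf("master tid = %d\n", mytid);
    n = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);

    for (i = 0; i < n; i++) {                 /* send each worker its number */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&i, 1, 1);
        pvm_send(tids[i], 1);                 /* message tag 1 (arbitrary)   */
    }
    for (i = 0; i < n; i++) {                 /* collect one int from each   */
        pvm_recv(-1, 2);                      /* any sender, tag 2           */
        pvm_upkint(&result, 1, 1);
        printf("got %d\n", result);
    }
    pvm_exit();                               /* leave the virtual machine   */
    return 0;
}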
Inside View of PVM
Every process has a unique, virtual-machine-wide,
identifier called a task ID (TID)
A single master PVMD disseminates the current
virtual machine configuration and holds the so-
called PVM mailbox.
The VM can grow and shrink around the master
(if the master dies, the machine falls apart)
Dynamic configuration is used whenever practical
PVM Design
host (one per IP address); pvmd - one PVM daemon per host; libpvm - each task is linked to the PVM library
(Diagram: tasks on the same host exchange messages with their pvmd over Unix domain sockets; tasks on different hosts can connect directly over TCP; the pvmds are fully connected using UDP over the OS network interface. On a shared memory multiprocessor, tasks communicate through shared memory; on a distributed memory MPP, through the internal interconnect.)
Multiple Transports of PVM Tasks
PVM uses sockets mostly
Unix-domain on host
TCP between tasks on different hosts
UDP between Daemons (custom reliability)
SysV Shared Memory Transport for SMPs
Tasks still use pvm_send(), pvm_recv()
Native MPP
PVM can ride atop a native MPI implementation
PVM Addressing
PVM uses the tid to identify pvmds, tasks and groups
It fits into a 32-bit integer
Task ID (tid) layout: S bit | G bit | host ID (12 bits) | local part (18 bits)
The S bit addresses a pvmd, the G bit forms a multicast (mcast) address
The local part is defined by each pvmd, e.g. split into process and node ID
fields (11 bits and 7 bits), giving 4096 hosts with 2048 nodes each
Strengths/Weaknesses
Addresses contain routing information by
virtue of the host part
Transport selection at runtime is simplified:
Bit-mask + table lookup
Moving a PVM task is very difficult
Group/multicast bit makes it straightforward
to implement multicast within point-to-point
infrastructure
MPI : Design Goals
Make it go faster than PVM (as fast as possible)
Operate in a serverless (daemonless) environment
Specify portability but not interoperability
Standardize best practices of parallel VCEs
Encourage competing implementations
Enable the building of safe libraries
Make it the "assembly language" of Message
Passing
MPI Implementations
MPICH (Mississippi-Argonne) open source
A top-quality reference implementation
http://www-unix.mcs.anl.gov/mpi/mpich/
High Performance Cluster MPIs
AM-MPI, FM-MPI, PM-MPI, GM-MPI, BIP-
MPI
10us latency, 100MB/sec on Myrinet
Vendor supported MPI
SGI, Cray, IBM, Fujitsu, Sun, Hitachi, ...
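For comparison, a minimal MPI example in C using only the core point-to-point calls; the tag value and the payload are arbitrary choices for illustration:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        value = 42;                                      /* arbitrary payload */
        for (int dest = 1; dest < size; dest++)
            MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d\n", rank, value);
    }
    MPI_Finalize();
    return 0;
}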
PVM vs. MPI
Each API has its unique strengths
PVM: easy to use, interoperable, fault tolerant, heterogeneity support,
resource control, dynamic model, good for experiments
- Best for Distributed Computing
MPI: is a standard, widely supported, MPP performance, many comm. methods,
topology support, static model (SPMD), good to build scalable products
- Best for a Large Multiprocessor
Evaluate the needs of your application, then choose