
Introduction

Introduction
What is Parallel Architecture?
Why Parallel Architecture?
Evolution and Convergence of Parallel Architectures
Fundamental Design Issues

What is Parallel Architecture?


A parallel computer is a collection of processing elements
that cooperate to solve large problems fast
Some broad issues:

Resource Allocation:
  how large a collection?
  how powerful are the elements?
  how much memory?

Data access, Communication and Synchronization:
  how do the elements cooperate and communicate?
  how are data transmitted between processors?
  what are the abstractions and primitives for cooperation?

Performance and Scalability:
  how does it all translate into performance?
  how does it scale?
3

Why Study Parallel Architecture?


Role of a computer architect:
To design and engineer the various levels of a computer system
to maximize performance and programmability within limits of
technology and cost.

Parallelism:
Provides alternative to faster clock for performance
Applies at all levels of system design
Is a fascinating perspective from which to view architecture
Is increasingly central in information processing

Why Study it Today?


History: diverse and innovative organizational structures, often
tied to novel programming models
Rapidly maturing under strong technological constraints
The killer micro is ubiquitous
Laptops and supercomputers are fundamentally similar!
Technological trends cause diverse approaches to converge

Technological trends make parallel computing inevitable

In the mainstream

Need to understand fundamental principles and design tradeoffs, not just taxonomies

  Naming, Ordering, Replication, Communication performance


5

Inevitability of Parallel Computing


Application demands: Our insatiable need for computing cycles

Scientific computing: CFD, Biology, Chemistry, Physics, ...


General-purpose computing: Video, Graphics, CAD, Databases, TP...

Technology Trends

Number of transistors on chip growing rapidly


Clock rates expected to go up only slowly

Architecture Trends

Instruction-level parallelism valuable but limited


Coarser-level parallelism, as in MPs, the most viable approach

Economics
Current trends:

Today's microprocessors have multiprocessor support


Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!...
Tomorrow's microprocessors are multiprocessors
6

Application Trends
Demand for cycles fuels advances in hardware, and vice-versa

Cycle drives exponential increase in microprocessor performance

Drives parallel architecture harder: most demanding applications

Range of performance demands

Need range of system performance with progressively increasing cost

Platform pyramid

Goal of applications in using parallel machines: Speedup

  Speedup(p processors) = Performance(p processors) / Performance(1 processor)

For a fixed problem size (input data set), performance = 1/time

  Speedup(fixed problem, p processors) = Time(1 processor) / Time(p processors)
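A minimal sketch of the fixed-problem speedup calculation; the timings below are made up purely for illustration.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical measurements for one fixed input data set. */
        double time_1p  = 120.0;  /* seconds on 1 processor   */
        double time_16p = 10.0;   /* seconds on 16 processors */

        /* Speedup(fixed problem, p) = Time(1 processor) / Time(p processors) */
        printf("speedup on 16 processors: %.1fx\n", time_1p / time_16p);  /* 12.0x */
        return 0;
    }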
7

Scientific Computing Demand

Engineering Computing Demand


Large parallel machines a mainstay in many industries

Petroleum (reservoir analysis)

Automotive (crash simulation, drag analysis, combustion efficiency),

Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism),

Computer-aided design

Pharmaceuticals (molecular modeling)

Visualization

in all of the above


entertainment (films like Toy Story)
architecture (walk-throughs and rendering)

Financial modeling (yield and derivative analysis)

etc.
9

Applications: Speech and Image Processing


(figure: processing requirements of speech and image processing applications, from 1 MIPS to 10 GIPS, plotted against year from 1980 to 1995 — sub-band speech coding, CELP speech coding, speaker verification, telephone number recognition, 200-word isolated speech recognition, 1,000-word and 5,000-word continuous speech recognition, CIF video, ISDN-CD stereo receiver, and HDTV receiver)

Also CAD, Databases, . . .


100 processors gets you 10 years, 1000 gets you 20 !
10

Learning Curve for Parallel Applications

AMBER molecular dynamics simulation program


Starting point was vector code for Cray-1
145 MFLOPS on Cray C90; 406 MFLOPS for final version on 128-processor Paragon,
891 MFLOPS on 128-processor Cray T3D

11

Commercial Computing
Also relies on parallelism for high end

Scale not so large, but use much more wide-spread

Computational power determines scale of business that can be handled

Databases, online transaction processing, decision support, data mining, data warehousing ...
TPC benchmarks (TPC-C order entry, TPC-D decision support)

Explicit scaling criteria provided

Size of enterprise scales with size of system


Problem size no longer fixed as p increases, so throughput is used as a
performance measure (transactions per minute or tpm)

12

TPC-C Results for March 1996


(figure: TPC-C throughput in tpmC, up to about 25,000, versus number of processors, up to about 120, for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and other systems)

Parallelism is pervasive
Small to moderate scale parallelism very important
Difficult to obtain snapshot to compare across vendor platforms
13

Summary of Application Trends


Transition to parallel computing has occurred for scientific and
engineering computing
In rapid progress in commercial computing

Database and transactions as well as financial

Usually smaller-scale, but large-scale systems also used

Desktop also uses multithreaded programs, which are a lot like parallel programs
Demand for improving throughput on sequential workloads

Greatest use of small-scale multiprocessors

Solid application demand exists and will increase

14

Technology Trends

(figure: relative performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995, on a log scale from 0.1 to 100 — microprocessors improving fastest)

The natural building block for multiprocessors is now also about the fastest!
15

General Technology Trends


Microprocessor performance increases 50% - 100% per year
Transistor count doubles every 3 years
DRAM size quadruples every 3 years
Huge investment per generation is carried by huge commodity market
(figure: integer and floating-point benchmark performance, 1987-1992, for the Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha, on a scale of roughly 0 to 180)

Not that single-processor performance is plateauing, but that parallelism is a natural way to improve it.
16

Technology: A Closer Look


Basic advance is decreasing feature size (λ)

  Circuits become either faster or lower in power

Die size is growing too

Clock rate improves roughly proportional to improvement in λ

Number of transistors improves like λ² (or faster)

Performance > 100x per decade; clock rate 10x, rest transistor count
How to use more transistors?

  Parallelism in processing: multiple operations per cycle reduces CPI

  Locality in data access: avoids latency and reduces CPI; also improves processor utilization

  Both need resources, so tradeoff

Fundamental issue is resource distribution, as in uniprocessors


17

Clock Frequency Growth Rate

(figure: microprocessor clock rate in MHz, on a log scale from below 1 to 1,000, for the i4004, i8008, i8080, i8086, i80286, i80386, Pentium 100, and R10000, 1970-2005 — growing roughly 30% per year)


18

Transistor Count Growth Rate


(figure: transistor count per chip, on a log scale from 1,000 to 100,000,000, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, 1970-2005)

100 million transistors on chip by early 2000s A.D.


Transistor count grows much faster than clock rate
- 40% per year, order of magnitude more contribution in 2 decades
19

Similar Story for Storage


Divergence between memory capacity and speed more pronounced

Capacity increased by 1000x from 1980-95, speed only 2x

Gigabit DRAM by c. 2000, but gap with processor speed much greater

Larger memories are slower, while processors get faster


Need to transfer more data in parallel
Need deeper cache hierarchies
How to organize caches?

Parallelism increases effective size of each level of hierarchy, without increasing access time
Parallelism and locality within memory systems too
New designs fetch many bits within memory chip; follow with fast
pipelined transfer across narrower interface
Buffer caches most recently accessed data

Disks too: Parallel disks plus caching


20

Architectural Trends
Architecture translates technology's gifts to performance and capability
Resolves the tradeoff between parallelism and locality

Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect

Tradeoffs may change with scale and technology advances

Understanding microprocessor architectural trends

Helps build intuition about design issues for parallel machines

Shows fundamental role of parallelism even in sequential computers

Four generations of architectural history: tube, transistor, IC, VLSI

Here focus only on VLSI generation

Greatest delineation in VLSI has been in type of parallelism exploited


21

Architectural Trends
Greatest trend in VLSI generation is increase in parallelism

Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit

Mid 80s to mid 90s: instruction level parallelism

slows after 32 bit


adoption of 64-bit now under way, 128-bit far (not performance issue)
great inflection point when 32-bit micro and cache fit on a chip
pipelining and simple instruction sets, + compiler advances (RISC)
on-chip caches and functional units => superscalar execution
greater sophistication: out of order execution, speculation, prediction
to deal with control transfer and latency problems

Next step: thread level parallelism

22

Phases in VLSI Generation


(figure: transistor counts of microprocessors from the i4004 to the Pentium and R10000, 1970-2005, annotated with three phases — bit-level parallelism up to the mid-80s, instruction-level parallelism through the mid-90s, and thread-level parallelism (?) beyond)

How good is instruction-level parallelism?


Thread-level needed in microprocessors?
23

Architectural Trends: ILP


Reported speedups for superscalar processors

Horst, Harris, and Jardine [1990] ..................... 1.37
Wang and Wu [1988] .................................... 1.70
Smith, Johnson, and Horowitz [1989] ................... 2.30
Murakami et al. [1989] ................................ 2.55
Chang et al. [1991] ................................... 2.90
Jouppi and Wall [1989] ................................ 3.20
Lee, Kwok, and Briggs [1991] .......................... 3.50
Wall [1991] ...........................................
Melvin and Patt [1991] ................................
Butler et al. [1991] .................................. 17+

Large variance due to difference in

application domain investigated (numerical versus non-numerical)


capabilities of processor modeled
24

ILP Ideal Potential


(figure: ideal ILP potential — left, fraction of total cycles (%) by number of instructions issued per cycle (0 to 6+); right, speedup, up to about 3, versus number of instructions issued per cycle (0 to 15))

Infinite resources and fetch bandwidth, perfect branch prediction and renaming; real caches and non-zero miss latencies
25

Results of ILP Studies

Concentrate on parallelism for 4-issue machines


(figure: speedups for 4-issue machines, roughly 1x to 4x, with perfect branch prediction versus 1 branch unit/real prediction, for the Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, and Melvin_91 studies)

Realistic studies show only 2-fold speedup


Recent studies show that more ILP needs to look across threads
26

Architectural Trends: Bus-based MPs


Micro on a chip makes it natural to connect many to shared memory
dominates server and enterprise market, moving down to desktop

Faster processors began to saturate bus, then bus technology advanced


today, range of sizes for bus-based systems, desktop to large servers
(figure: number of processors in fully configured bus-based shared-memory systems, 1984-1998 — from the Sequent B8000 and B2100, Symmetry21/81, and SGI Power/PowerSeries, through the SGI Challenge and PowerChallenge/XL, Sun SS690MP, SS1000/SS1000E, SC2000/SC2000E, SE10-SE70, AS2100/AS8400, HP K400, P-Pro, CRAY CS6400, and Sun E6000, up to the 64-processor Sun E10000)

No. of processors in fully configured commercial shared-memory systems

27

Bus Bandwidth
(figure: shared bus bandwidth, on a log scale from about 10 MB/s to 100,000 MB/s, 1984-1998, for the same bus-based systems — from the Sequent B8000 and B2100 up through the SGI Challenge and PowerChallenge, Sun SC2000/E6000, and Sun E10000)

28

Economics
Commodity microprocessors not only fast but CHEAP
Development cost is tens of millions of dollars ($5M to $100M typical)
BUT, many more are sold compared to supercomputers

Crucial to take advantage of the investment, and use the


commodity building block

Exotic parallel architectures no more than special-purpose

Multiprocessors being pushed by software vendors (e.g. database)


as well as hardware vendors
Standardization by Intel makes small, bus-based SMPs commodity
Desktop: few smaller processors versus one larger one?

Multiprocessor on a chip

29

Consider Scientific Supercomputing


Proving ground and driver for innovative architecture and techniques

Market smaller relative to commercial as MPs become mainstream

Dominated by vector machines starting in 70s

Microprocessors have made huge gains in floating-point performance

high clock rates


pipelined floating point units (e.g., multiply-add every cycle)
instruction-level parallelism
effective use of caches (e.g., automatic blocking)

Plus economics

Large-scale multiprocessors replace vector supercomputers

Well under way already


30

Raw Uniprocessor Performance: LINPACK


(figure: LINPACK uniprocessor performance in MFLOPS, 1975-2000, for CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) and microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, HP 9000/735, DEC Alpha, DEC Alpha AXP, MIPS R4400, IBM Power2/990, DEC 8200), each at n = 100 and n = 1,000)

31

Raw Parallel Performance: LINPACK


(figure: LINPACK parallel performance in GFLOPS, 1985-1996 — CRAY peak (Xmp/416(4), Ymp/832(8), C90(16), T932(32)) versus MPP peak (iPSC/860, nCUBE/2(1024), CM-2, CM-200, Delta, Paragon XP/S, CM-5, T3D, Paragon XP/S MP(1024), Paragon XP/S MP(6768), ASCI Red))

Even vector Crays became parallel: X-MP (2-4) Y-MP (8), C-90 (16), T94 (32)
Since 1993, Cray produces MPPs too (T3D, T3E)
32

500 Fastest Computers


(figure: number of MPP, PVP, and SMP systems among the 500 fastest computers at each list from 11/93 to 11/96)

33

Summary: Why Parallel Architecture?


Increasingly attractive

Economics, technology, architecture, application demand

Increasingly central and mainstream


Parallelism exploited at many levels
Instruction-level parallelism
Multiprocessor servers
Large-scale multiprocessors (MPPs)

Focus of this class: multiprocessor level of parallelism


Same story from memory system perspective

Increase bandwidth, reduce average latency with many local memories

Wide range of parallel architectures make sense

Different cost, performance and scalability


34

Convergence of Parallel Architectures

History
Historically, parallel architectures tied to programming models
Divergent architectures, with no predictable pattern of growth.

(figure: application software and system software sitting atop divergent architectures — systolic arrays, SIMD, message passing, dataflow, and shared memory)

Uncertainty of direction paralyzed parallel software development!


36

Today
Extension of computer architecture to support communication
and cooperation
OLD: Instruction Set Architecture
NEW: Communication Architecture

Defines

Critical abstractions, boundaries, and primitives (interfaces)

Organizational structures that implement interfaces (hw or sw)

Compilers, libraries and OS are important bridges today

37

Modern Layered Framework

(figure: layered framework — parallel applications (CAD, database, scientific modeling, multiprogramming) on top; programming models (shared address, message passing, data parallel) below them; the communication abstraction at the user/system boundary, realized by compilation or library and by operating systems support; communication hardware below the hardware/software boundary; the physical communication medium at the bottom)

38

Programming Model
What programmer uses in coding applications
Specifies communication and synchronization
Examples:
Multiprogramming: no communication or synch. at program level
Shared address space: like bulletin board
Message passing: like letters or phone calls, explicit point to point

Data parallel: more regimented, global actions on data

Implemented with shared address space or message passing

39

Communication Abstraction
User level communication primitives provided

Realizes the programming model

Mapping exists between language primitives of programming model and these primitives

Supported directly by hw, or via OS, or via user sw


Lot of debate about what to support in sw and gap between layers
Today:

Hw/sw interface tends to be flat, i.e. complexity roughly uniform

Compilers and software play important roles as bridges today

Technology trends exert strong influence

Result is convergence in organizational structure

Relatively simple, general purpose communication primitives


40

Communication Architecture
= User/System Interface + Implementation
User/System Interface:

Comm. primitives exposed to user-level by hw and system-level sw

Implementation:

Organizational structures that implement the primitives: hw or OS


How optimized are they? How integrated into processing node?
Structure of network

Goals:

Performance
Broad applicability
Programmability
Scalability
Low Cost

41

Evolution of Architectural Models


Historically machines tailored to programming models

Prog. model, comm. abstraction, and machine organization lumped together as the architecture

Evolution helps understand convergence

Identify core concepts

Shared Address Space


Message Passing
Data Parallel
Others:

Dataflow
Systolic Arrays

Examine programming model, motivation, intended applications, and contributions to convergence
42

Shared Address Space Architectures


Any processor can directly reference any memory location

Communication occurs implicitly as result of loads and stores

Convenient:

Location transparency
Similar programming model to time-sharing on uniprocessors
Except processes run on different processors
Good throughput on multiprogrammed workloads

Naturally provided on wide range of platforms

History dates at least to precursors of mainframes in early 60s

Wide range of scale: few to hundreds of processors

Popularly known as shared memory machines or model

Ambiguous: memory may be physically distributed among processors


43

Shared Address Space Model


Process: virtual address space plus one or more threads of control
Portions of address spaces of processes are shared
(figure: virtual address spaces for a collection of processes communicating via shared addresses — each process P0..Pn has a private portion and a shared portion of its address space; the shared portions map to common physical addresses in the machine physical address space, so a store by one process (e.g. P0) can be loaded by another (e.g. P1))

Writes to shared address visible to other threads (in other processes too)
Natural extension of uniprocessor model: conventional memory operations for comm.; special atomic operations for synchronization
OS uses shared memory to coordinate processes
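A minimal POSIX-threads sketch of this model (not from the slides): threads communicate through ordinary loads and stores to a shared variable, and a mutex provides the special atomic operation used for synchronization.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                      /* shared data                    */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);            /* atomic op for synchronization  */
            counter++;                            /* communication via load/store   */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);       /* 400000 */
        return 0;
    }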

44

Communication Hardware
Also natural extension of uniprocessor
Already have processor, one or more memory modules and I/O
controllers connected by hardware interconnect of some sort
(figure: a uniprocessor organization — memory modules (Mem), I/O controllers with attached I/O devices, and a processor, all connected by an interconnect; the multiprocessor simply attaches more processors to the same interconnect)

Memory capacity increased by adding modules, I/O by controllers


Add processors for processing!
For higher-throughput multiprogramming, or parallel programs
45

History
Mainframe approach

  Motivated by multiprogramming
  Extends crossbar used for mem bw and I/O
  Originally processor cost limited to small; later, cost of crossbar
  Bandwidth scales with p
  High incremental cost; use multistage instead
  (figure: processors, memories, and I/O connected by a crossbar)

Minicomputer approach

  Almost all microprocessor systems have bus
  Motivated by multiprogramming, TP
  Used heavily for parallel computing
  Called symmetric multiprocessor (SMP)
  Latency larger than for uniprocessor
  Bus is bandwidth bottleneck
    caching is key: coherence problem
  Low incremental cost
  (figure: processors with caches, memory, and I/O sharing a single bus)


46

Example: Intel Pentium Pro Quad


(figure: Intel Pentium Pro Quad — four P-Pro modules, each containing a CPU, 256-KB L2 $, interrupt controller, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller with MIU drives 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges provide PCI buses for I/O cards)

All coherence and multiprocessing glue in processor module
Highly integrated, targeted at high volume
Low latency and bandwidth

47

Example: SUN Enterprise


(figure: Sun Enterprise — CPU/mem cards carrying two processors (each with $ and $2) and a memory controller, and I/O cards with SBUS slots, 2 FiberChannel, 100bT, and SCSI, all attached through bus interfaces/switches to the Gigaplane bus (256 data, 41 address, 83 MHz))

16 cards of either type: processors + memory, or I/O


All memory accessed over bus, so symmetric
Higher bandwidth, higher latency bus

48

Scaling Up
(figure: two ways to scale a shared address space machine — a "dance hall" organization with memory modules on one side of a scalable network and processor-cache nodes on the other, versus distributed memory with a memory module attached to each processor-cache node)

Problem is interconnect: cost (crossbar) or bandwidth (bus)

Dance-hall: bandwidth still scalable, but lower cost than crossbar

  latencies to memory uniform, but uniformly large

Distributed memory or non-uniform memory access (NUMA)

  Construct shared address space out of simple message transactions
  across a general-purpose network (e.g. read-request, read-response)

Caching shared (particularly nonlocal) data?


49

Example: Cray T3E


(figure: Cray T3E node — processor with $, local memory, and a combined memory controller and network interface, connected through a switch into the interconnect, plus external I/O)

Scale up to 1024 processors, 480MB/s links


Memory controller generates comm. request for nonlocal references
No hardware mechanism for coherence (SGI Origin etc. provide this)

50

Message Passing Architectures


Complete computer as building block, including I/O

Communication via explicit I/O operations

Programming model: directly access only private address space (local memory), comm. via explicit messages (send/receive)
High-level block diagram similar to distributed-memory SAS

But comm. integrated at IO level, needn't be into memory system


Like networks of workstations (clusters), but tighter integration
Easier to build than scalable SAS

Programming model more removed from basic hardware operations

Library or OS intervention

51

Message-Passing Abstraction
(figure: message-passing abstraction — process P executes Send X, Q, t and process Q executes Receive Y, P, t; the data at address X in P's local address space is matched by process and tag and copied into address Y in Q's local address space)

Send specifies buffer to be transmitted and receiving process


Recv specifies sending process and application storage to receive into
Memory to memory copy, but need to name processes
Optional tag on send and matching rule on receive
User process names local data and entities in process/tag space too
In simplest form, the send/recv match achieves pairwise synch event

Other variants too

Many overheads: copying, buffer management, protection
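A minimal MPI sketch of the send/receive pairing described above (MPI is used here just as a familiar message-passing library; the slides do not prescribe it):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 42;
            /* send names the destination process (1) and a tag (7) */
            MPI_Send(&x, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive names the sender (0), the matching tag, and local storage */
            MPI_Recv(&x, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", x);   /* pairwise synch achieved */
        }
        MPI_Finalize();
        return 0;
    }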


52

Evolution of Message-Passing Machines


Early machines: FIFO on each link

(figure: a 3-D hypercube interconnect with nodes labeled 000 through 111)

Hw close to prog. model; synchronous ops


Replaced by DMA, enabling non-blocking ops

Buffered by system at destination until recv

Diminishing role of topology

Store&forward routing: topology important


Introduction of pipelined routing made it less so
Cost is in node-network interface
Simplifies programming
53

Example: IBM SP-2


(figure: IBM SP-2 node — Power 2 CPU with L2 $ and 4-way interleaved DRAM behind a memory controller on the memory bus; the NIC, containing its own i860 processor, DMA engine, NI, and DRAM, sits on the MicroChannel I/O bus; nodes are connected by a general interconnection network formed from 8-port switches)

Made out of essentially complete RS6000 workstations


Network interface integrated in I/O bus (bw limited by I/O bus)

54

Example Intel Paragon


(figure: Intel Paragon node — two i860 processors with L1 $ on a 64-bit, 50 MHz memory bus, with memory controller, 4-way interleaved DRAM, DMA, driver, and NI; nodes attach to every switch of a 2-D grid network with 8-bit, 175 MHz, bidirectional links; Sandia's Intel Paragon XP/S-based supercomputer shown)

55

Toward Architectural Convergence


Evolution and role of software have blurred boundary

Send/recv supported on SAS machines via buffers

Can construct global address space on MP using hashing

Page-based (or finer-grained) shared virtual memory

Hardware organization converging too

Tighter NI integration even for MP (low-latency, high-bandwidth)

At lower level, even hardware SAS passes hardware messages

Even clusters of workstations/SMPs are parallel systems

Emergence of fast system area networks (SAN)

Programming models distinct, but organizations converging

Nodes connected by general network and communication assists

Implementations also converging, at least in high-end machines


56

Data Parallel Systems


Programming model

Operations performed in parallel on each element of data structure

Logically single thread of control, performs sequential or parallel steps

Conceptually, a processor associated with each data element

Architectural model

Array of many simple, cheap processors with little memory each

Processors don't sequence through instructions

Attached to a control processor that issues instructions

Specialized and general communication, cheap global synchronization

Original motivations
Matches simple differential equation solvers
Centralize high cost of instruction fetch/sequencing

(figure: a control processor broadcasting instructions to an array of PEs)

57

Application of Data Parallelism


Each PE contains an employee record with his/her salary
If salary > 100K then
    salary = salary * 1.05
else
    salary = salary * 1.10

Logically, the whole operation is a single step

Some processors enabled for arithmetic operation, others disabled
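A plain-C rendering of the same step (illustrative only): on a SIMD machine every PE holds one record and the conditional simply enables or disables the PE for each multiply, so the whole loop is logically one operation.

    #define N 1024   /* one employee record per PE, hypothetical size */

    void adjust_salaries(float salary[N]) {
        for (int i = 0; i < N; i++) {
            if (salary[i] > 100000.0f)
                salary[i] *= 1.05f;   /* PEs with salary > 100K enabled here    */
            else
                salary[i] *= 1.10f;   /* the remaining PEs enabled here instead */
        }
    }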

Other examples:
Finite differences, linear algebra, ...
Document searching, graphics, image processing, ...
Some recent machines:

Thinking Machines CM-1, CM-2 (and CM-5)


Maspar MP-1 and MP-2,
58

Evolution and Convergence


Rigid control structure (SIMD in Flynn taxonomy)

SISD = uniprocessor, MIMD = multiprocessor

Popular when cost savings of centralized sequencer high

60s when CPU was a cabinet


Replaced by vectors in mid-70s

More flexible w.r.t. memory layout and easier to manage

Revived in mid-80s when 32-bit datapath slices just fit on chip


No longer true with modern microprocessors

Other reasons for demise

Simple, regular applications have good locality, can do well anyway


Loss of applicability due to hardwiring data parallelism

MIMD machines as effective for data parallelism and more general

Prog. model converges with SPMD (single program multiple data)

Contributes need for fast global synchronization


Structured global address space, implemented with either SAS or MP

59

Dataflow Architectures
Represent computation as a graph of essential dependences

Logical processor at each node, activated by availability of operands


Message (tokens) carrying tag of next instruction sent to next processor
Tag compared with others in matching store; match fires execution
a = (b + 1) * (b - c)
d = c * e
f = a * d

(figure: the dataflow graph for these expressions, and a dataflow processor — a token store and program store feed a waiting/matching stage, which fires instruction fetch, execute, and form-token stages; result tokens go to a token queue and out over the network to other processors)

60

Evolution and Convergence


Key characteristics

Ability to name operations, synchronization, dynamic scheduling

Problems

Operations have locality across them, useful to group together


Handling complex data structures like arrays
Complexity of matching store and memory units
Expose too much parallelism (?)

Converged to use conventional processors and memory

Support for large, dynamic set of threads to map to processors


Typically shared address space as well
But separation of progr. model from hardware (like data-parallel)

Lasting contributions:

Integration of communication with thread (handler) generation


Tightly integrated communication and fine-grained synchronization
Remained useful concept for software (compilers etc.)
61

Systolic Architectures

Replace single processor with array of regular processing elements

Orchestrate data flow for high throughput with less memory access
(figure: memory feeding a single PE, versus memory feeding an array of PEs through which data is pumped)

Different from pipelining

Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory

Different from SIMD: each PE may do something different


Initial motivation: VLSI enables inexpensive special-purpose chips
Represent algorithms directly by chips connected in regular pattern

62

Systolic Arrays (contd.)


Example: Systolic array for 1-D convolution

  y(i) = w1 x(i) + w2 x(i+1) + w3 x(i+2) + w4 x(i+3)

(figure: inputs x1..x8 stream through four cells holding weights w1..w4 while partial results y1, y2, y3 accumulate and move past them; each cell has inputs xin, yin and outputs xout, yout and performs, per step:
  xout = x
  x = xin
  yout = yin + w * xin)

Practical realizations (e.g. iWARP) use quite general processors

  Enable variety of algorithms on same hardware

But dedicated interconnect channels

  Data transfer directly from register to register across channel

Specialized, and same problems as SIMD

  General purpose systems work well for same algorithms (locality etc.)
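A reference (non-systolic) computation of the same convolution, to make concrete what the array produces; the array obtains the same y values by streaming the x inputs through cells that each apply the per-step update shown above.

    #define NX 8   /* number of inputs x(1)..x(8), as in the figure */
    #define NW 4   /* four weights w1..w4                           */

    /* y[i] = w[0]*x[i] + w[1]*x[i+1] + w[2]*x[i+2] + w[3]*x[i+3] */
    void convolve_1d(const float x[NX], const float w[NW], float y[NX - NW + 1]) {
        for (int i = 0; i <= NX - NW; i++) {
            float acc = 0.0f;
            for (int k = 0; k < NW; k++)
                acc += w[k] * x[i + k];
            y[i] = acc;
        }
    }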
63

Convergence: Generic Parallel Architecture


A generic modern multiprocessor
(figure: a generic multiprocessor node — processor (P), cache ($), and memory (Mem) connected through a communication assist (CA) to a scalable network)

Node: processor(s), memory system, plus communication assist


Network interface and communication controller

Scalable network
Convergence allows lots of innovation, now within framework
Integration of assist with node, what operations, how efficiently...
64

Fundamental Design Issues

Understanding Parallel Architecture


Traditional taxonomies not very useful
Programming models not enough, nor hardware structures

Same one can be supported by radically different architectures

Architectural distinctions that affect software

Compilers, libraries, programs

Design of user/system and hardware/software interface

Constrained from above by progr. models and below by technology

Guiding principles provided by layers

What primitives are provided at communication abstraction

How programming models map to these

How they are mapped to hardware


66

Fundamental Design Issues


At any layer, interface (contract) aspect and performance aspects

Naming: How are logically shared data and/or processes referenced?

Operations: What operations are provided on these data

Ordering: How are accesses to data ordered and coordinated?

Replication: How are data replicated to reduce communication?

Communication Cost: Latency, bandwidth, overhead, occupancy

Understand at programming model first, since that sets requirements


Other issues
Node Granularity: How to split between processors and memory?
...
67

Sequential Programming Model


Contract

Naming: Can name any variable in virtual address space

Hardware (and perhaps compilers) does translation to physical addresses

Operations: Loads and Stores

Ordering: Sequential program order

Performance

Rely on dependences on single location (mostly): dependence order

Compilers and hardware violate other orders without getting caught

Compiler: reordering and register allocation

Hardware: out of order, pipeline bypassing, write buffers

Transparent replication in caches


68

SAS Programming Model


Naming: Any process can name any variable in shared space
Operations: loads and stores, plus those needed for ordering
Simplest Ordering Model:

Within a process/thread: sequential program order

Across threads: some interleaving (as in time-sharing)

Additional orders through synchronization

Again, compilers/hardware can violate orders without getting caught

Different, more subtle ordering models also possible (discussed later)

69

Synchronization
Mutual exclusion (locks)

Ensure certain operations on certain data can be performed by only one process at a time

Room that only one person can enter at a time

No ordering guarantees

Event synchronization

Ordering of events to preserve dependences

e.g. producer -> consumer of data

3 main types:

point-to-point
global
group
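A small pthreads sketch of both kinds (illustrative, not from the slides): a mutex for mutual exclusion, and a condition variable for point-to-point event synchronization between a producer and a consumer.

    #include <pthread.h>

    /* Mutual exclusion: only one thread at a time in the critical section. */
    static pthread_mutex_t bal_lock = PTHREAD_MUTEX_INITIALIZER;
    static long balance = 0;

    void deposit(long amount) {
        pthread_mutex_lock(&bal_lock);
        balance += amount;                 /* critical section */
        pthread_mutex_unlock(&bal_lock);
    }

    /* Point-to-point event synchronization: consumer waits for producer. */
    static pthread_mutex_t item_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  item_ready = PTHREAD_COND_INITIALIZER;
    static int item, ready = 0;

    void produce(int v) {
        pthread_mutex_lock(&item_lock);
        item = v;                          /* write before signalling */
        ready = 1;
        pthread_cond_signal(&item_ready);
        pthread_mutex_unlock(&item_lock);
    }

    int consume(void) {
        pthread_mutex_lock(&item_lock);
        while (!ready)
            pthread_cond_wait(&item_ready, &item_lock);
        int v = item;                      /* dependence preserved */
        pthread_mutex_unlock(&item_lock);
        return v;
    }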

70

Message Passing Programming Model


Naming: Processes can name private data directly.

No shared address space

Operations: Explicit communication through send and receive


Send transfers data from private address space to another process
Receive copies data from process to private address space
Must be able to name processes

Ordering:
Program order within a process
Send and receive can provide pt to pt synch between processes
Mutual exclusion inherent

Can construct global address space:


Process number + address within process address space
But no direct operations on these names
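A sketch of such a constructed global name (hypothetical type, not from the slides): a process number plus an offset in that process's private space; dereferencing it still requires an explicit message to the owning process rather than a direct load or store.

    #include <stddef.h>

    /* A "global address" built on top of message passing. */
    typedef struct {
        int    proc;     /* owning process                      */
        size_t offset;   /* address within that process's space */
    } global_ptr_t;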

71

Design Issues Apply at All Layers


Prog. models position provides constraints/goals for system
In fact, each interface between layers supports or takes a position on:

Naming model
Set of operations on names
Ordering model
Replication
Communication performance

Any set of positions can be mapped to any other by software


Let's see issues across layers
How lower layers can support contracts of programming models
Performance issues

72

Naming and Operations


Naming and operations in programming model can be directly
supported by lower levels, or translated by compiler, libraries or OS
Example: Shared virtual address space in programming model
Hardware interface supports shared physical address space

Direct support by hardware through v-to-p mappings, no software layers

Hardware supports independent physical address spaces

Can provide SAS through OS, so in system/user interface

v-to-p mappings only for data that are local


remote data accesses incur page faults; brought in via page fault handlers
same programming model, different hardware requirements and cost model

Or through compilers or runtime, so above sys/user interface

shared objects, instrumentation of shared accesses, compiler support


73

Naming and Operations (contd)


Example: Implementing Message Passing
Direct support at hardware interface

But match and buffering benefit from more flexibility

Support at sys/user interface or above in software (almost always)

Hardware interface provides basic data transport (well suited)

Send/receive built in sw for flexibility (protection, buffering)

Choices at user/system interface:

OS each time: expensive


OS sets up once/infrequently, then little sw involvement each time

Or lower interfaces provide SAS, and send/receive built on top


with buffers and loads/stores

Need to examine the issues and tradeoffs at every layer

Frequencies and types of operations, costs


74

Ordering
Message passing: no assumptions on orders across processes except
those imposed by send/receive pairs
SAS: How processes see the order of other processes' references defines semantics of SAS

Ordering very important and subtle

Uniprocessors play tricks with orders to gain parallelism or locality

These are more important in multiprocessors

Need to understand which old tricks are valid, and learn new ones

How programs behave, what they rely on, and hardware implications

75

Replication
Very important for reducing data transfer/communication
Again, depends on naming model
Uniprocessor: caches do it automatically

Reduce communication with memory

Message Passing naming model at an interface

A receive replicates, giving a new name; subsequently use new name

Replication is explicit in software above that interface

SAS naming model at an interface

A load brings in data transparently, so can replicate transparently

Hardware caches do this, e.g. in shared physical address space

OS can do it at page level in shared virtual address space, or objects


No explicit renaming, many copies for same name: coherence problem

in uniprocessors, coherence of copies is natural in memory hierarchy


76

Communication Performance
Performance characteristics determine usage of operations at a layer

Programmer, compilers etc make choices based on this

Fundamentally, three characteristics:

Latency: time taken for an operation

Bandwidth: rate of performing operations

Cost: impact on execution time of program

If processor does one thing at a time: bandwidth ~ 1/latency

But actually more complex in modern systems

Characteristics apply to overall operations, as well as individual


components of a system, however small
We'll focus on communication or data transfer across nodes

77

Simple Example
Component performs an operation in 100ns
Simple bandwidth: 10 Mops
Internally pipeline depth 10 => bandwidth 100 Mops

Rate determined by slowest stage of pipeline, not overall latency

Delivered bandwidth on application depends on initiation frequency


Suppose application performs 100 M operations. What is cost?

op count * op latency gives 10 sec (upper bound)

op count / peak op rate gives 1 sec (lower bound)

assumes full overlap of latency with useful work, so just issue cost

if application can do 50 ns of useful work before depending on result of op, cost to application is the other 50ns of latency
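The arithmetic above, spelled out in a small C sketch (numbers taken directly from the example):

    #include <stdio.h>

    int main(void) {
        double op_latency = 100e-9;                   /* 100 ns per operation        */
        double peak_rate  = 10 / op_latency;          /* depth-10 pipeline: 100 Mops */
        double n_ops      = 100e6;                    /* application does 100 M ops  */

        printf("upper bound (no overlap):   %.1f s\n", n_ops * op_latency);   /* 10 s */
        printf("lower bound (full overlap): %.1f s\n", n_ops / peak_rate);    /*  1 s */

        /* 50 ns of independent work overlapped per op leaves 50 ns exposed. */
        printf("with 50 ns overlap per op:  %.1f s\n", n_ops * (op_latency - 50e-9));
        return 0;
    }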

78

Linear Model of Data Transfer Latency


Transfer time T(n) = T0 + n/B

useful for message passing, memory access, vector ops etc

As n increases, bandwidth approaches asymptotic rate B


How quickly it approaches depends on T0
Size needed for half bandwidth (half-power point):
n1/2 = T0 * B
But linear model not enough

When can next transfer be initiated? Can cost be overlapped?

Need to know how transfer is performed
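A small sketch of the model with made-up numbers for T0 and B, printing delivered bandwidth as n grows and the half-power point n1/2 = T0 * B:

    #include <stdio.h>

    int main(void) {
        double T0 = 20e-6;    /* 20 us start-up cost (illustrative) */
        double B  = 100e6;    /* 100 MB/s asymptotic bandwidth      */

        printf("half-power point: %.0f bytes\n", T0 * B);    /* 2000 bytes */

        for (double n = 256; n <= 65536; n *= 4) {
            double bw = n / (T0 + n / B);                    /* delivered bandwidth */
            printf("n = %6.0f B  delivered = %5.1f MB/s\n", n, bw / 1e6);
        }
        return 0;
    }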

79

Communication Cost Model


Comm Time per message = Overhead + Assist Occupancy + Network Delay + Size/Bandwidth + Contention
                      = ov + oc + l + n/B + Tc
Overhead and assist occupancy may be f(n) or not
Each component along the way has occupancy and delay

Overall delay is sum of delays

Overall occupancy (1/bandwidth) is biggest of occupancies

Comm Cost = frequency * (Comm time - overlap)


General model for data transfer: applies to cache misses too
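A sketch of the model with illustrative (made-up) parameter values, computing per-message time and the communication cost charged to the program:

    #include <stdio.h>

    int main(void) {
        double ov = 1.0e-6;    /* processor overhead             */
        double oc = 0.5e-6;    /* communication assist occupancy */
        double l  = 2.0e-6;    /* network delay                  */
        double B  = 200e6;     /* bandwidth in bytes/s           */
        double Tc = 0.5e-6;    /* contention                     */
        double n  = 4096;      /* message size in bytes          */

        double comm_time = ov + oc + l + n / B + Tc;
        double freq      = 1e4;     /* messages per second of execution   */
        double overlap   = 1e-6;    /* portion hidden by independent work */

        printf("comm time per message = %.2f us\n", comm_time * 1e6);
        printf("comm cost = %.2f ms per second of execution\n",
               freq * (comm_time - overlap) * 1e3);
        return 0;
    }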
80

Summary of Design Issues


Functional and performance issues apply at all layers
Functional: Naming, operations and ordering
Performance: Organization, latency, bandwidth, overhead, occupancy
Replication and communication are deeply related

Management depends on naming model

Goal of architects: design against frequency and type of operations that occur at communication abstraction, constrained by tradeoffs from above or below

Hardware/software tradeoffs

81

Recap
Parallel architecture is important thread in evolution of architecture

At all levels

Multiple processor level now in mainstream of computing

Exotic designs have contributed much, but given way to convergence

Push of technology, cost and application performance

Basic processor-memory architecture is the same

Key architectural issue is in communication architecture

How communication is integrated into memory and I/O system on node

Fundamental design issues

Functional: naming, operations, ordering

Performance: organization, replication, performance characteristics

Design decisions driven by workload-driven evaluation

Integral part of the engineering focus


82

Outline for Rest of Class


Understanding parallel programs as workloads

Much more variation, less consensus and greater impact than in sequential

What they look like in major programming models (Ch. 2)

Programming for performance: interactions with architecture (Ch. 3)

Methodologies for workload-driven architectural evaluation (Ch. 4)

Cache-coherent multiprocessors with centralized shared memory

Basic logical design, tradeoffs, implications for software (Ch 5)

Physical design, deeper logical design issues, case studies (Ch 6)

Scalable systems

Design for scalability and realizing programming models (Ch 7)

Hardware cache coherence with distributed memory (Ch 8)

Hardware-software tradeoffs for scalable coherent SAS (Ch 9)


83

Outline (contd.)
Interconnection networks (Ch 10)
Latency tolerance (Ch 11)
Future directions (Ch 12)

Overall: conceptual foundations and engineering issues across broad range of scales of design, all of which are important

84
