Introduction
What is Parallel Architecture?
Why Parallel Architecture?
Evolution and Convergence of Parallel Architectures
Fundamental Design Issues
Resource Allocation:
Parallelism:
Provides alternative to faster clock for performance
Applies at all levels of system design
Is a fascinating perspective from which to view architecture
Is increasingly central in information processing
In the mainstream
Technology Trends
Architecture Trends
Economics
Current trends:
Application Trends
Demand for cycles fuels advances in hardware, and vice-versa
Platform pyramid
Speedup(p processors) = Performance(p processors) / Performance(1 processor)
For a fixed problem: Speedup(p processors) = Time(1 processor) / Time(p processors)
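This speedup definition can be written as a one-line helper (a minimal sketch; the example timings are hypothetical):

```python
def speedup(time_1proc: float, time_pproc: float) -> float:
    """Speedup(p) = Time(1 processor) / Time(p processors), for a fixed problem."""
    return time_1proc / time_pproc

# Hypothetical timings: a job taking 60 s on 1 processor and 20 s on p processors.
print(speedup(60.0, 20.0))  # → 3.0
```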
Computer-aided design
Visualization
etc.
[Figure: processing-power requirements of signal-processing applications, 1980–1995 (1 MIPS to 1 GIPS): sub-band speech coding, speaker verification, telephone number recognition, CELP speech coding, ISDN-CD stereo receiver, 200-word isolated speech recognition, 1,000-word and 5,000-word continuous speech recognition, CIF video, HDTV receiver]
Commercial Computing
Also relies on parallelism for high end
[Figure: TPC-C throughput (tpmC, up to ~20,000) vs. number of processors (up to ~120) for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and others]
Parallelism is pervasive
Small to moderate scale parallelism very important
Difficult to obtain snapshot to compare across vendor platforms
Technology Trends
[Figure: relative performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965–1995 (log scale, 0.1–100)]
The natural building block for multiprocessors is now also about the fastest!
[Figure: integer and floating-point MIPS (0–120) of leading microprocessors, 1987–1992 — Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750]
Performance grew > 100x per decade; clock rate contributed ~10x, the rest came from transistor count
How to use more transistors?
Parallelism in processing
[Figures: clock frequency and transistor count (1,000 to 10,000,000) of microprocessors from the i4004, i8008, i8080, i8086, i80286, and i80386 through the R2000, R3000, Pentium, and R10000, 1970–2005]
Gigabit DRAM by c. 2000, but gap with processor speed much greater
Architectural Trends
Architecture translates technology's gifts into performance and capability
Resolves the tradeoff between parallelism and locality
Architectural Trends
Greatest trend in VLSI generation is increase in parallelism
[Figure: transistor counts of microprocessors, 1970–2005 (i4004 through Pentium and R10000), annotated with the instruction-level and, prospectively, thread-level (?) parallelism eras]
[Figures: simulated speedup of an ideal superscalar vs. instructions issued per cycle (1.37, 1.70, 2.30, 2.55, 2.90, 3.20, 3.50), and distribution of available ILP across studies (Jouppi_89, Butler_91, Melvin_91) — assuming infinite resources and fetch bandwidth, perfect branch prediction and renaming, but real caches and non-zero miss latencies]
[Figure: number of processors in fully configured bus-based shared-memory systems, 1984–1998 — Sequent B8000/B2100, Symmetry21/81, Power, SGI PowerSeries, SS690MP 120/140, SGI Challenge and PowerChallenge/XL, Sun SC2000/SC2000E, SS1000/SS1000E, SS10/SS20, SE10–SE70, AS2100, HP K400, Sun E6000, AS8400, P-Pro, up to the Sun E10000]
27
Bus Bandwidth
[Figure: shared-bus bandwidth (log scale, 10 to 100,000), 1984–1998, for the same systems — from the Sequent B8000 through the Sun E10000]
Economics
Commodity microprocessors not only fast but CHEAP
Development cost is tens of millions of dollars ($5M–$100M typical)
BUT, many more are sold compared to supercomputers
Multiprocessor on a chip
Plus economics
[Figure: uniprocessor LINPACK performance (MFLOPS, log scale 1–1,000), 1975–2000, at n = 100 and n = 1,000 — CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) vs. microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, MIPS R4400, HP9000/735, DEC 8200, IBM Power2/990, DEC Alpha)]
[Figure: parallel-machine LINPACK performance (GFLOPS, log scale 0.1–1,000), 1985–1996 — Xmp/416(4), CM-2, CM-200, Ymp/832(8), iPSC/860, nCUBE/2(1024), Delta, C90(16), CM-5, Paragon XP/S, T3D, T932(32), Paragon XP/S MP (1024 and 6768), ASCI Red]
Even vector Crays became parallel: X-MP (2–4), Y-MP (8), C-90 (16), T94 (32)
Since 1993, Cray produces MPPs too (T3D, T3E)
[Figure: number of MPP, PVP, and SMP systems among the 500 fastest computers, 11/93–11/96]
History
Historically, parallel architectures tied to programming models
Divergent architectures, with no predictable pattern of growth.
[Figure: application software and system software written to divergent architectures — systolic arrays, SIMD, message passing, dataflow, shared memory]
Today
Extension of computer architecture to support communication and cooperation
OLD: Instruction Set Architecture
NEW: Communication Architecture
Defines
[Figure: layers of abstraction —
  Parallel applications: CAD, database, scientific modeling
  Programming models: multiprogramming, shared address, message passing, data parallel
  Communication abstraction: compilation or library; operating systems support (user/system boundary)
  Communication hardware (hardware/software boundary)]
Programming Model
What programmer uses in coding applications
Specifies communication and synchronization
Examples:
Multiprogramming: no communication or synch. at program level
Shared address space: like bulletin board
Message passing: like letters or phone calls, explicit point to point
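The bulletin-board vs. letters contrast can be sketched in a few lines of Python (an illustration only: threads stand in for processors, and the names board and mailbox are invented):

```python
import queue
import threading

# "Bulletin board" (shared address space): any thread may read or write it.
board = {}
board_lock = threading.Lock()

def post(key, value):
    with board_lock:            # synchronization is a separate, explicit operation
        board[key] = value

# "Letters" (message passing): explicit point-to-point send/receive.
mailbox = queue.Queue()
received = []

def sender():
    mailbox.put("hello")        # send

def receiver():
    received.append(mailbox.get())  # receive blocks until a message arrives

post("x", 42)
t_recv = threading.Thread(target=receiver)
t_send = threading.Thread(target=sender)
t_recv.start(); t_send.start()
t_send.join(); t_recv.join()
print(board["x"], received[0])  # → 42 hello
```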
Communication Abstraction
User level communication primitives provided
Communication Architecture
= User/System Interface + Implementation
User/System Interface:
Implementation:
Goals:
Performance
Broad applicability
Programmability
Scalability
Low Cost
Dataflow
Systolic Arrays
Convenient:
Location transparency
Similar programming model to time-sharing on uniprocessors
Except processes run on different processors
Good throughput on multiprogrammed workloads
[Figure: shared address space model — the private portion of each process's (P0, P1, P2, ..., Pn) address space maps to distinct physical addresses, while the shared portion maps to common physical addresses, so stores by one processor are visible to loads by others (writes to shared addresses communicate)]
Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
OS uses shared memory to coordinate processes
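A minimal Python sketch of this model (threads stand in for processors; the shared list and lock are illustrative stand-ins for shared memory and an atomic synchronization primitive, not any particular machine's operations):

```python
import threading

counter = [0]              # shared location: ordinary reads and writes communicate
lock = threading.Lock()    # special atomic operation, used only for synchronization

def worker(n):
    for _ in range(n):
        with lock:         # mutual exclusion around the read-modify-write
            counter[0] += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # → 4000
```

Without the lock, the four read-modify-write sequences could interleave and lose updates; the data transfer itself still happens through ordinary loads and stores.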
Communication Hardware
Also natural extension of uniprocessor
Already have processor, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort
[Figure: a processor, memory modules, and I/O controllers on an interconnect; adding processors and memory modules to the same interconnect yields a multiprocessor]
History
Mainframe approach
Motivated by multiprogramming
Extends crossbar used for mem bw and I/O
Originally, processor cost limited systems to small scale
[Figure: crossbar connecting processors (P) and I/O channels (C) to memory modules (M)]
Minicomputer approach
[Figure: single shared bus connecting processors, memory, and I/O channels]
[Figure: Intel Pentium Pro quad — four P-Pro modules (each with CPU, 256-KB L2 cache, interrupt controller, and bus interface) on a shared bus, with two PCI bridges to PCI buses and PCI I/O cards, and a memory controller plus MIU driving 1-, 2-, or 4-way interleaved DRAM]
[Figure: Sun Enterprise — CPU/mem cards (two processors, each with L1 ($) and L2 ($2) caches, plus a memory controller) and I/O cards (SBUS slots, 2 FiberChannel, 100bT, SCSI), each attached through a bus interface/switch]
Scaling Up
[Figures: "dance hall" organization — processors with caches on one side of a scalable network, memory modules (M) on the other — vs. distributed memory, where each node couples processor, memory, memory controller, and network interface (NI) to a switch of the network]
Library or OS intervention
Message-Passing Abstraction
[Figure: process P executes send X, Q, t (local address X, destination Q, tag t); process Q executes receive Y, P, t (local address Y, source P, tag t); when the send and receive match, data is copied from P's local address space at X to Q's local address space at Y]
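A toy sketch of this matching rule in Python (threads stand in for processes; the send/receive helpers and the per-(destination, tag) mailboxes are invented for illustration, not a real message-passing library):

```python
import queue
import threading

# One mailbox per (destination, tag): a receive with tag t matches a send with tag t.
mailboxes = {}

def send(data, dest, tag):
    mailboxes.setdefault((dest, tag), queue.Queue()).put(data)

def receive(dest, tag):
    return mailboxes.setdefault((dest, tag), queue.Queue()).get()  # blocks until matched

result = {}

def process_p():
    X = [1, 2, 3]               # P's local address space
    send(list(X), "Q", 7)       # send X, Q, t with t = 7

def process_q():
    result["Y"] = receive("Q", 7)   # receive Y, P, t: Q's local Y gets a copy of X

tq = threading.Thread(target=process_q)
tp = threading.Thread(target=process_p)
tq.start(); tp.start()
tp.join(); tq.join()
print(result["Y"])  # → [1, 2, 3]
```

Note that the data is copied: after the transfer, P and Q each hold independent values in their own address spaces.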
[Figure: IBM SP-2 — node with processor and L2 cache on a memory bus, memory controller, and 4-way interleaved DRAM; a network interface card (i860, NI, DMA, DRAM, NIC) sits on the MicroChannel bus; nodes are joined by a general interconnection network formed from 8-port switches]
[Figure: Intel Paragon — node with two i860 processors (each with L1 cache), memory controller, 4-way interleaved DRAM, and a DMA/driver/NI network interface; a processing node is attached to every switch of a 2D grid network with 8-bit, 175-MHz, bidirectional links]
Architectural model
Original motivations
Matches simple differential equation solvers
Centralize high cost of instruction fetch/sequencing
[Figure: SIMD architectural model — a control processor broadcasting instructions to a grid of processing elements (PEs)]
Other examples:
Finite differences, linear algebra, ...
Document searching, graphics, image processing, ...
Some recent machines:
Dataflow Architectures
Represent computation as a graph of essential dependences
a = (b + 1) × (b − c)
d = c × e
f = a × d
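The firing rule behind these dependences can be sketched in a few lines (a naive interpreter, not any real dataflow machine; the intermediate node names t1 and t2 are invented):

```python
# Each node fires as soon as all of its input tokens have arrived.
def run_dataflow(b, c, e):
    ops = {
        "t1": (lambda: b + 1, []),
        "t2": (lambda: b - c, []),
        "a":  (lambda t1, t2: t1 * t2, ["t1", "t2"]),
        "d":  (lambda: c * e, []),
        "f":  (lambda a, d: a * d, ["a", "d"]),
    }
    tokens = {}
    while len(tokens) < len(ops):
        for name, (fn, deps) in ops.items():
            if name not in tokens and all(d in tokens for d in deps):
                tokens[name] = fn(*(tokens[d] for d in deps))  # node fires
    return tokens["f"]

# b=3, c=1, e=2: a = (3+1)*(3-1) = 8, d = 1*2 = 2, f = 8*2 = 16
print(run_dataflow(3, 1, 2))  # → 16
```

Only the essential dependences constrain the order: t1, t2, and d may fire in any order, or in parallel.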
[Figure: the corresponding dataflow graph, and a dataflow processor — token store and program store feed a pipeline of waiting-matching, instruction-fetch, execute, and form-token stages, with a token queue and network ports]
Problems
Lasting contributions:
Systolic Architectures
Orchestrate data flow for high throughput with less memory access
[Figure: memory (M) feeding a linear array of processing elements (PEs)]
[Figure: systolic 1-D convolution — weights w1–w4 stay resident in the cells while inputs x1–x7 stream through and partial sums y1–y3 accumulate; each cell computes: xout = x; x = xin; yout = yin + w · xin]
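The cell equations above can be simulated directly (a sketch with one x latch per cell per cycle; the y path is treated as rippling through the whole array within a cycle):

```python
# Weights stay in the cells; x values stream through, advancing one cell per cycle.
# Each cell implements: xout = x; x = xin; yout = yin + w * xin.
def systolic_convolve(weights, xs):
    x_reg = [0] * len(weights)          # one x register per cell
    ys = []
    for xi in xs:
        # shift: cell 0 latches the new input, cell k latches cell k-1's old x
        x_reg = [xi] + x_reg[:-1]
        # yout = yin + w * xin at every cell, accumulated across the array
        ys.append(sum(w * x for w, x in zip(weights, x_reg)))
    return ys

# Feeding an impulse reproduces the weights: y_t = sum_k w_k * x_(t-k)
print(systolic_convolve([1, 2, 3], [1, 0, 0, 0, 0]))  # → [1, 2, 3, 0, 0]
```

Each input is fetched from memory once and then reused by every cell as it flows through, which is the "high throughput with less memory access" point.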
General-purpose systems work well for the same algorithms (locality etc.)
[Figure: generic parallel architecture — nodes, each with processor (P), cache ($), memory, and a communication assist (CA), connected by a scalable network]
Convergence allows lots of innovation, now within framework
Integration of assist with node, what operations, how efficiently...
Performance
Synchronization
Mutual exclusion (locks)
No ordering guarantees
Event synchronization
3 main types:
point-to-point
global
group
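Global event synchronization can be sketched with Python's threading.Barrier (illustrative only; the phase names are invented):

```python
import threading

# Global event synchronization: no thread passes the barrier until all N arrive.
N = 4
barrier = threading.Barrier(N)
log = []
log_lock = threading.Lock()

def phase_worker(i):
    with log_lock:
        log.append(("phase1", i))
    barrier.wait()          # all threads finish phase 1 before any starts phase 2
    with log_lock:
        log.append(("phase2", i))

threads = [threading.Thread(target=phase_worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# every phase1 entry precedes every phase2 entry
print(all(p == "phase1" for p, _ in log[:N]))  # → True
```

Point-to-point event synchronization would instead pair one producer with one consumer (e.g. a flag or a condition variable); mutual exclusion, by contrast, makes no ordering guarantee at all.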
Ordering:
Program order within a process
Send and receive can provide point-to-point synchronization between processes
Mutual exclusion inherent
Naming model
Set of operations on names
Ordering model
Replication
Communication performance
Ordering
Message passing: no assumptions on orders across processes except those imposed by send/receive pairs
SAS: how processes see the order of other processes' references defines the semantics of SAS
Need to understand which old tricks are valid, and learn new ones
How programs behave, what they rely on, and hardware implications
Replication
Very important for reducing data transfer/communication
Again, depends on naming model
Uniprocessor: caches do it automatically
Communication Performance
Performance characteristics determine usage of operations at a layer
Simple Example
Component performs an operation in 100ns
Simple bandwidth: 10 Mops
Internal pipeline of depth 10 => bandwidth 100 Mops
Assumes full overlap of latency with useful work, so only the issue cost matters
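The arithmetic, spelled out (values from the example above; variable names are mine):

```python
# One operation takes 100 ns end to end.
latency_ns = 100
unpipelined_rate = 1e9 / latency_ns      # one op per 100 ns = 10 Mops
# With an internal pipeline of depth 10, a new op can issue every 10 ns,
# provided the 100 ns latency is fully overlapped with useful work.
depth = 10
issue_ns = latency_ns / depth
pipelined_rate = 1e9 / issue_ns          # one op per 10 ns = 100 Mops
print(unpipelined_rate / 1e6, pipelined_rate / 1e6)  # → 10.0 100.0
```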
Hardware/software tradeoffs
Recap
Parallel architecture is an important thread in the evolution of architecture
At all levels
Much more variation, less consensus and greater impact than in sequential
Scalable systems
Outline (contd.)
Interconnection networks (Ch 10)
Latency tolerance (Ch 11)
Future directions (Ch 12)