Manufacturer/Year     AMD/'05   Intel/'06   IBM/'04   Sun/'05
Processors/chip          2          2          2         8
Threads/Processor        1          2          2         4
Threads/chip             2          4          4        32
Other Factors → Multiprocessors
Growth in data-intensive applications
   Data bases, file servers, ...
Flynn's Taxonomy
Classifies machines by instruction and data streams: SISD, SIMD, MISD, MIMD
Back to Basics
A parallel computer is a collection of
processing elements that cooperate and
communicate to solve large problems fast.
Parallel Architecture = Computer Architecture + Communication Architecture
2 classes of multiprocessors WRT memory:
1. Centralized Memory Multiprocessor
   < few dozen processor chips (and < 100 cores) in 2006
   Small enough to share single, centralized memory
2. Physically Distributed-Memory Multiprocessor
   Larger number of chips and cores; memory distributed among the processors
[Figure: two memory organizations. Centralized memory: processors P1..Pn with caches share memory banks through one interconnection network. Distributed memory: each processor has its own local memory module, and the nodes communicate over the interconnection network]
Centralized Memory
Multiprocessor
Also called symmetric multiprocessors (SMPs) because single main memory has a symmetric relationship to all processors
Large caches ⇒ single memory can satisfy memory demands of small number of processors
Can scale to a few dozen processors by using a switch and by using many memory banks
Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing centralized memory increases
Distributed Memory
Multiprocessor
Pro: Cost-effective way to scale memory
bandwidth
If most accesses are to local memory
Pro: Reduces latency of local memory
accesses
Con: Communicating data between
processors more complex
Con: Must change software to take advantage
of increased memory BW
Challenges of Parallel
Processing
First challenge is % of program inherently sequential
Suppose 80X speedup from 100 processors. What fraction of original program can be sequential?
a. 10%
b. 5%
c. 1%
d. <1%
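A worked check via Amdahl's Law (a sketch of the solution, which is not in this extraction):

   Speedup = 1 / ((1 - F_par) + F_par/100) = 80
   => 80 - 79.2 × F_par = 1
   => F_par = 79 / 79.2 ≈ 0.9975

So at most about 0.25% of the original program can be sequential: answer d.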
Challenges of Parallel
Processing
Second challenge is long latency to remote memory
Suppose 32 CPU MP, 2 GHz, 200 ns remote memory, all local accesses hit memory hierarchy and base CPI is 0.5 (Remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles)
What is performance impact if 0.2% instructions involve remote access?
a. 1.5X
b. 2.0X
c. 2.5X
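A worked check (the solution slide is not in this extraction):

   CPI = Base CPI + Remote request rate × Remote access cost
       = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3

so the machine is 1.3 / 0.5 = 2.6 times slower than if all accesses were local, closest to answer c.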
Challenges of Parallel
Processing
1. Application parallelism ⇒ primarily via new algorithms that have better parallel performance
2. Long remote latency impact ⇒ addressed both by the architect and by the programmer
For example, reduce frequency of remote accesses either by:
   Caching shared data (HW)
   Restructuring the data layout to make more accesses local (SW)
Symmetric Shared-Memory
Architectures
From multiple boards on a shared bus to multiple processors inside a single chip
Caches hold both:
   Private data, used by a single processor
   Shared data, used by multiple processors
[Figure: the cache coherence problem. u = 5 in memory; P1 and P3 read u into their caches; P3 then writes u = 7; P1 and P2 may still see the stale value 5]
Example
/* Assume initial value of A and flag is 0 */
P1:                        P2:
   A = 1;                     while (flag == 0);  /* spin */
   flag = 1;                  print A;

[Figure: conceptual picture of processors P1..Pn sharing one memory]
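The same handoff in C11 atomics, as a sketch (not from the slides; it assumes C11 <stdatomic.h> and <threads.h>). The release/acquire pair guarantees that once P2 sees flag == 1 it also sees A == 1:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

int A = 0;
atomic_int flag = 0;

int p1(void *arg) {                    /* producer: write A, then flag */
    (void)arg;
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
    return 0;
}

int p2(void *arg) {                    /* consumer: spin on flag, read A */
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                              /* spin until P1's write is visible */
    printf("%d\n", A);                 /* prints 1 */
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}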
[Figure: memory hierarchy with stale copies; address 100 holds 35 in the L2 cache but 34 in memory and on disk]
Reading an address
should return the last
value written to that
address
Easy in uniprocessors,
except for I/O
Write Consistency
For now assume:
1. A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write
2. The processor does not change the order of any write with respect to any other memory access
⇒ if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
These restrictions allow the processor to reorder reads, but force the processor to finish writes in program order
Snoopy Cache-Coherence
Protocols
[Figure: snooping caches. Each cache line carries state, address (tag), and data; the caches of P1..Pn sit on a shared bus with memory and I/O devices; each controller serves processor cache-memory transactions and snoops the bus]
Example: Write-thru Invalidate
[Figure: P1 and P2 hold cached copies of u = 5; P3 writes u = 7, which is written through to memory while the snooping caches invalidate the stale copies; P1's next read of u misses and returns 7]
Write-back harder
Most recent copy can be in a cache instead of in memory
Example Protocol
Snooping coherence protocol is usually
implemented by incorporating a finite-state
controller in each node
Logically, think of a separate controller
associated with each cache block
That is, snooping operations or cache requests for
different blocks can proceed independently
Write-through Invalidate
Protocol
2 states per block in each cache (valid, invalid), as in uniprocessor
state of a block is a p-vector of states
Hardware state bits associated with blocks that are in the cache; other blocks can be seen as being in invalid (not-present) state in that cache

State machine per cache block (Pr = processor-initiated, Bus = snooped):
   Valid: PrRd / --  (read hit)
   Valid: PrWr / BusWr  (write through to memory)
   Valid → Invalid: BusWr / --  (another cache's write is snooped)
   Invalid → Valid: PrRd / BusRd  (read miss fetches the block)
   Invalid: PrWr / BusWr  (write through; block not allocated)

[Figure: processors P1..Pn with single-level caches on a shared bus with memory and I/O devices]
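A minimal software sketch of this 2-state protocol (an illustration, not slide code: it assumes an atomic bus and a one-line direct-mapped cache per processor):

#include <stdbool.h>
#include <stdio.h>

#define NPROC 3

typedef struct { bool valid; int addr; int data; } Line;
static Line cache[NPROC];              /* one cache line per processor */
static int  memory[16];                /* shared memory */

static int pr_read(int p, int addr) {
    if (!(cache[p].valid && cache[p].addr == addr))       /* PrRd miss */
        cache[p] = (Line){ true, addr, memory[addr] };    /* BusRd     */
    return cache[p].data;
}

static void pr_write(int p, int addr, int data) {
    memory[addr] = data;               /* PrWr / BusWr: write through  */
    for (int q = 0; q < NPROC; q++)    /* others snoop the BusWr...    */
        if (q != p && cache[q].valid && cache[q].addr == addr)
            cache[q].valid = false;    /* ...and invalidate            */
    if (cache[p].valid && cache[p].addr == addr)
        cache[p].data = data;          /* update own copy if present   */
    /* no write-allocate: an invalid block stays invalid on PrWr */
}

int main(void) {
    memory[0] = 5;                     /* u = 5                        */
    pr_read(0, 0);                     /* P1 caches u = 5              */
    pr_write(2, 0, 7);                 /* P3 writes u = 7              */
    printf("%d\n", pr_read(0, 0));     /* P1's copy was invalidated: 7 */
    return 0;
}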
Is 2-state Protocol
Coherent?
Processor only observes state of memory system by
issuing memory operations
Assume bus transactions and memory operations are
atomic and a one-level cache
all phases of one bus transaction complete before next one starts
processor waits for memory operation to complete before issuing next
with one-level cache, assume invalidations applied during bus
transaction
Ordering
[Figure: a bus-serialized interleaving of reads and writes from P0, P1, and P2; the bus imposes a total order on writes that all processors observe]
Write-Back State Machine - CPU
State machine for CPU requests, for each cache block (non-resident blocks are invalid):
   Invalid → Shared (read/only): CPU read; place read miss on bus
   Invalid → Exclusive (read/write): CPU write; place write miss on bus
   Shared → Exclusive (read/write): CPU write; place write miss on bus
Write-Back State Machine - Bus Requests
State machine for bus requests, for each cache block:
   Shared (read/only) → Invalid: write miss for this block
   Exclusive (read/write) → Invalid: write miss for this block; write back block (abort memory access)
   Exclusive (read/write) → Shared: read miss for this block; write back block (abort memory access)
Block-replacement
State machine for CPU requests, for each cache block (previous diagram plus hits and replacement):
   Shared: CPU read hit (no bus action)
   Shared → Shared: CPU read miss; place read miss on bus (old block replaced)
   Invalid → Shared (read/only): CPU read; place read miss on bus
   Invalid, Shared → Exclusive (read/write): CPU write; place write miss on bus
Write-back State Machine-III
State machine for CPU requests and for bus requests, for each cache block:

CPU requests:
   Invalid → Shared (read/only): CPU read; place read miss on bus
   Invalid → Exclusive (read/write): CPU write; place write miss on bus
   Shared: CPU read hit (no action)
   Shared → Shared: CPU read miss; place read miss on bus
   Shared → Exclusive: CPU write; place write miss on bus
   Exclusive: CPU read hit, CPU write hit (no action)
   Exclusive → Shared: CPU read miss; write back block, place read miss on bus
   Exclusive → Exclusive: CPU write miss; write back cache block, place write miss on bus

Bus requests:
   Shared → Invalid: write miss for this block
   Exclusive → Invalid: write miss for this block; write back block (abort memory access)
   Exclusive → Shared: read miss for this block; write back block (abort memory access)
Example
(P1/P2 columns show state, address, value; A1 and A2 map to the same cache block; both caches start invalid)

step                 P1               P2               Bus                                         Memory
P1: Write 10 to A1   Excl. A1 10      --               WrMs P1 A1                                  --
P1: Read A1          Excl. A1 10      --               (read hit, no bus action)                   --
P2: Read A1          Shar. A1 10      Shar. A1 10      RdMs P2 A1; WrBk P1 A1 10; RdDa P2 A1 10    A1 = 10
P2: Write 20 to A1   Inv.             Excl. A1 20      WrMs P2 A1                                  A1 = 10
P2: Write 40 to A2   Inv.             Excl. A2 40      WrMs P2 A2; WrBk P2 A1 20                   A1 = 20
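A compact C sketch of the 3-state write-back protocol that replays the trace above (an illustration, not slide code: it assumes a single atomic bus and one line per cache):

#include <stdio.h>

enum State { INVALID, SHARED, EXCLUSIVE };
#define NPROC 2

typedef struct { enum State st; int addr; int data; } Line;
static Line cache[NPROC];
static int  memory[16];

static void snoop(int requester, int addr, int is_write) {
    for (int q = 0; q < NPROC; q++) {      /* other caches watch the bus */
        Line *l = &cache[q];
        if (q == requester || l->st == INVALID || l->addr != addr) continue;
        if (l->st == EXCLUSIVE)
            memory[addr] = l->data;        /* write back the dirty block */
        if (is_write) l->st = INVALID;     /* write miss: invalidate     */
        else if (l->st == EXCLUSIVE) l->st = SHARED; /* read miss: demote */
    }
}

static int cpu_read(int p, int addr) {
    Line *l = &cache[p];
    if (l->st == INVALID || l->addr != addr) {             /* read miss  */
        if (l->st == EXCLUSIVE) memory[l->addr] = l->data; /* replace    */
        snoop(p, addr, 0);                 /* place read miss on bus     */
        *l = (Line){ SHARED, addr, memory[addr] };
    }
    return l->data;
}

static void cpu_write(int p, int addr, int data) {
    Line *l = &cache[p];
    if (l->st != EXCLUSIVE || l->addr != addr) {           /* write miss */
        if (l->st == EXCLUSIVE) memory[l->addr] = l->data; /* replace    */
        snoop(p, addr, 1);                 /* place write miss on bus    */
        l->st = EXCLUSIVE; l->addr = addr;
    }
    l->data = data;                        /* write only in the cache    */
}

int main(void) {
    cpu_write(0, 1, 10);    /* P1: Write 10 to A1 -> P1 Excl.            */
    cpu_read(0, 1);         /* P1: Read A1 -> hit                        */
    cpu_read(1, 1);         /* P2: Read A1 -> WrBk by P1, both Shar.     */
    cpu_write(1, 1, 20);    /* P2: Write 20 to A1 -> P1 Inv., P2 Excl.   */
    cpu_write(1, 2, 40);    /* P2: Write 40 to A2 -> WrBk A1 = 20        */
    printf("memory[A1] = %d\n", memory[1]);   /* 20, as in the table     */
    return 0;
}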
Implementation Complications
Write Races: cannot update cache until bus is obtained; otherwise another processor may get the bus first and write the same cache block
Implementing Snooping
Caches
Multiple processors must be on bus, with access to both addresses and data
Add a few new commands to perform coherency, in addition to read and write
Processors continuously snoop on address bus
   If address matches tag, either invalidate or update
Coherency Misses
1. True sharing misses arise from the communication of data through the cache coherence mechanism
2. False sharing misses arise when a block is invalidated because some word in the block, other than the one being read, is written into
[Table: five-step true/false sharing example; P1 and P2 access words x1 and x2 that reside in the same cache block, e.g. P1 writes x1 and P2's later read of x2 misses due to false sharing]
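A sketch of how false sharing arises in code and how padding removes it (an assumption-laden illustration: 64-byte cache blocks; the names x1 and x2 follow the example above):

#include <stdalign.h>

struct hot  { int x1; int x2; };   /* x1 and x2 share one cache block, so
                                      P1 writing x1 invalidates the copy
                                      P2 uses only for x2: false sharing */
struct cold {
    alignas(64) int x1;            /* force x1 and x2 onto different     */
    alignas(64) int x2;            /* blocks: the invalidations go away  */
};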
MP Performance 4 Processor
Commercial Workload: OLTP, Decision Support (Database), Search Engine
Uniprocessor cache misses improve with cache size increase (Instruction, Capacity/Conflict, Compulsory)
True sharing and false sharing unchanged going from 1 MB to 8 MB (L3 cache)
[Figure: memory cycles per instruction (0 to 3.25) vs. L3 cache size (1, 2, 4, 8 MB), broken down into Instruction, Capacity/Conflict, Cold, False Sharing, and True Sharing components]
[Figure: memory cycles per instruction vs. processor count for the same workload; the true sharing component grows as processors are added]
Bus-based Coherence
All of (a), (b), (c) done through broadcast on bus
   faulting processor sends out a search
   others respond to the search probe and take necessary action
Scalable coherence:
   can have same cache states and state transition diagram
   different mechanisms to manage protocol
Scalable Approach: Directories
Every memory block has associated directory information; on a miss, the directory is consulted and messages go only to the nodes that have copies

Basic Operation of Directory
k processors; with each cache block in memory: k presence bits, 1 dirty bit
[Figure: k processor-plus-cache nodes on an interconnection network; memory keeps a directory entry (presence bits, dirty bit) per block]
Directory Protocol
Similar to Snoopy Protocol: Three states
   Shared: ≥ 1 processors have data, memory up-to-date
   Uncached: no processor has it; not valid in any cache
   Exclusive: 1 processor (owner) has data; memory out-of-date
Directory Protocol
No bus and don't want to broadcast:
   interconnect no longer single arbitration point
   all messages have explicit responses
Directory Protocol Messages (Fig 4.22)

Message type        Source           Destination      Msg content
Read miss           Local cache      Home directory   P, A
Write miss          Local cache      Home directory   P, A
Invalidate          Home directory   Remote caches    A
Fetch               Home directory   Remote cache     A
Fetch/Invalidate    Home directory   Remote cache     A
Data value reply    Home directory   Local cache      Data
Data write back     Remote cache     Home directory   A, Data
State machine for CPU requests, for each cache block, in a directory-based system (h.d. = home directory):
   Invalid → Shared (read/only): CPU read; send read miss message to h.d.
   Invalid → Exclusive (read/write): CPU write; send write miss message to h.d.
   Shared → Invalid: Invalidate message from h.d.
   Shared → Shared: CPU read hit; CPU read miss sends read miss message to h.d.
   Shared → Exclusive: CPU write; send write miss message to h.d.
   Exclusive: CPU read hit, CPU write hit (no messages)
   Exclusive → Shared: Fetch from h.d.; send data write back message to h.d. (likewise on CPU read miss: send data write back and read miss to h.d.)
   Exclusive → Invalid: Fetch/Invalidate from h.d.; send data write back message to h.d.
   Exclusive → Exclusive: CPU write miss; send data write back message and write miss to h.d.
State machine for Directory requests, for each memory block (Uncached means memory holds the only copy):
   Uncached → Shared (read only): read miss; Sharers = {P}; send Data Value Reply
   Uncached → Exclusive (read/write): write miss; Sharers = {P}; send Data Value Reply
   Shared → Shared: read miss; Sharers += {P}; send Data Value Reply
   Shared → Exclusive: write miss; send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply
   Exclusive → Shared: read miss; send Fetch to owner; data written back to memory; Sharers += {P}; send Data Value Reply
   Exclusive → Uncached: data write back (owner replaces the block); Sharers = {}; write back block
   Exclusive → Exclusive: write miss; send Fetch/Invalidate to owner; data written back to memory; Sharers = {P}; send Data Value Reply
Example Directory
Protocol
Message sent to directory causes two actions:
Update the directory
More messages to satisfy request
Example
(P1/P2 columns show state, address, value; A1 and A2 map to the same cache block)

step                 P1               P2               Interconnect                                Directory                           Memory
P1: Write 10 to A1   Excl. A1 10      --               WrMs P1 A1; DaRp P1 A1 0                    A1: Ex {P1}                         --
P1: Read A1          Excl. A1 10      --               (read hit, no messages)                     --                                  --
P2: Read A1          Shar. A1 10      Shar. A1 10      RdMs P2 A1; Ftch P1 A1 10; DaRp P2 A1 10    A1: Shar. {P1,P2} 10                A1 = 10
P2: Write 20 to A1   Inv.             Excl. A1 20      WrMs P2 A1; Inval. P1 A1                    A1: Excl. {P2} 10                   A1 = 10
P2: Write 40 to A2   Inv.             Excl. A2 40      WrMs P2 A2; WrBk P2 A1 20; DaRp P2 A2 0     A1: Unca. {} 20; A2: Excl. {P2} 0   A1 = 20
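A C sketch of the directory side of this protocol, replaying the A1 trace above (an illustration, not slide code; message sends are stubbed out as prints):

#include <stdio.h>

enum DirState { UNCACHED, SHARED, EXCLUSIVE };

typedef struct {
    enum DirState state;
    unsigned sharers;       /* presence bits: bit p set if Pp has a copy */
} DirEntry;

static void send(const char *msg, int p, const char *addr) {
    printf("  -> %s to P%d for %s\n", msg, p, addr);
}

static int owner_of(const DirEntry *d) {   /* only valid in Exclusive */
    int p = 0;
    while (!((d->sharers >> p) & 1u)) p++;
    return p;
}

static void read_miss(DirEntry *d, int p, const char *addr) {
    if (d->state == EXCLUSIVE)             /* owner must write data back */
        send("Fetch", owner_of(d), addr);
    d->sharers |= 1u << p;                 /* Sharers += {P}             */
    d->state = SHARED;
    send("Data Value Reply", p, addr);
}

static void write_miss(DirEntry *d, int p, const char *addr) {
    if (d->state == SHARED) {
        for (int q = 0; q < 32; q++)       /* invalidate every sharer    */
            if (((d->sharers >> q) & 1u) && q != p)
                send("Invalidate", q, addr);
    } else if (d->state == EXCLUSIVE) {
        send("Fetch/Invalidate", owner_of(d), addr);
    }
    d->sharers = 1u << p;                  /* Sharers = {P}              */
    d->state = EXCLUSIVE;
    send("Data Value Reply", p, addr);
}

int main(void) {
    DirEntry a1 = { UNCACHED, 0 };
    write_miss(&a1, 1, "A1");   /* P1 writes A1: Excl. {P1}              */
    read_miss(&a1, 2, "A1");    /* P2 reads A1: Fetch P1, Shar. {P1,P2}  */
    write_miss(&a1, 2, "A1");   /* P2 writes A1: Inval. P1, Excl. {P2}   */
    return 0;
}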
Implementing a Directory
We assume operations atomic, but they are not; reality is much harder; must avoid deadlock when running out of buffers in the network (see Appendix E)
Optimizations:
   read miss or write miss in Exclusive: send data directly to requestor from owner vs. first to memory and then from memory to requestor
[Figure: directory protocol message flows.
Read miss to a block in exclusive state: (1) requestor sends a read request to the directory node for the block; (2) directory replies with the owner's identity; (3) requestor sends a read request to the owner; (4a) owner sends a data reply to the requestor; (4b) owner sends a revision message to the directory.
Write miss (RdEx request) to a block with sharers: (1) requestor sends an RdEx request to the directory; (2) directory replies with the sharers' identities; (3a, 3b) requestor sends invalidation requests to the sharers; (4a, 4b) sharers send invalidation acks]
Example Directory Protocol (Read Share)
[Figure: P1 executes ld vA → rd pA, sending a read request (R/req) to the directory controller, which replies (R/reply); P2 then does the same; both caches end up holding pA in the read (shared) state]
Example Directory Protocol (Write)
[Figure: P1 executes st vA → wr pA, sending a write request (W/req E) to the directory; the directory issues Read_to_update pA / Invalidate pA to the sharer, collects the Inv ACK, and replies with the data (reply xD(pA)); P1 ends with write (exclusive) permission and P2's copy is invalidated (Inv/_)]
Example Directory Protocol (Wr to Ex)
[Figure: P1 executes st vA → wr pA to a block currently held exclusively elsewhere; the directory sends Inv pA / Read_toUpdate pA to the owner, which writes the block back (Write_back pA); the directory then replies (Reply xD(pA)) and P1 ends holding pA exclusively (W/_)]
A Popular Middle Ground
Two-level hierarchy
Individual nodes are multiprocessors, connected non-hierarchically
   e.g. mesh of SMPs
Synchronization
Why Synchronize? Need to know when it
is safe for different processes to use
shared data
Issues for Synchronization:
Uninterruptable instruction to fetch and
update memory (atomic operation);
User level synchronization operation using this
primitive;
For large scale MPs, synchronization can be a
bottleneck; techniques to reduce contention
and latency of synchronization
Uninterruptable Instruction to Fetch and Update Memory

Atomic exchange using load linked / store conditional:

try:   mov   R3,R4      ; mov exchange value
       ll    R2,0(R1)   ; load linked
       sc    R3,0(R1)   ; store conditional
       beqz  R3,try     ; branch store fails (R3 = 0)
       mov   R4,R2      ; put load value in R4

Fetch-and-increment:

try:   ll    R2,0(R1)   ; load linked
       addi  R2,R2,#1   ; increment (OK if reg-reg)
       sc    R2,0(R1)   ; store conditional
       beqz  R2,try     ; branch store fails (R2 = 0)
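For comparison, a C11 sketch of the same fetch-and-increment, emulating the ll/sc retry loop with compare-and-swap (not from the slides; assumes C11 <stdatomic.h>):

#include <stdatomic.h>

int fetch_and_increment(atomic_int *addr) {
    int old = atomic_load(addr);           /* plays the role of ll    */
    while (!atomic_compare_exchange_weak(addr, &old, old + 1))
        ;      /* sc failed: old has been reloaded, so retry          */
    return old;
}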
Synchronization Operation Using this Primitive

Spin locks: processor continuously tries to acquire, spinning around a loop trying to get the lock

        li    R2,#1
lockit: exch  R2,0(R1)   ; atomic exchange
        bnez  R2,lockit  ; already locked?

With cache coherency, spin on a local cached copy first (test-and-test-and-set) so the spinning hits in the cache:

lockit: lw    R2,0(R1)   ; load lock variable
        bnez  R2,lockit  ; not free, so spin
        li    R2,#1
        exch  R2,0(R1)   ; atomic exchange
        bnez  R2,lockit  ; branch if lock was not free
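A C11 equivalent of the test-and-test-and-set spin lock, as a sketch (not slide code; assumes C11 <stdatomic.h>):

#include <stdatomic.h>

void acquire(atomic_int *lock) {           /* 0 = free, 1 = held         */
    do {
        while (atomic_load(lock) != 0)
            ;                              /* spin on the cached copy    */
    } while (atomic_exchange(lock, 1));    /* exch: try to grab the lock */
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);
}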
Another MP Issue: Memory Consistency Models

What is consistency? When must a processor see the new value? e.g., seems that

P1:  A = 0;              P2:  B = 0;
     .....                    .....
     A = 1;                   B = 1;
L1:  if (B == 0) ...     L2:  if (A == 0) ...

Impossible for both if statements L1 and L2 to be true?
   What if write invalidate is delayed and the processor continues?
Memory Consistency Model
Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved
Several schemes allow faster execution than sequential consistency
Cautionary Tale
Key to the success of ILP in the 1980s and 1990s was software: optimizing compilers that could exploit the available ILP
Similarly, successful exploitation of TLP
will depend as much on the development
of suitable software systems as it will on
the contributions of computer architects
Given the slow progress on parallel
software in the past 30+ years, it is likely
that exploiting TLP broadly will remain
challenging for years to come
T1 (Niagara)
Target: Commercial server
applications
High thread level parallelism (TLP)
Large numbers of parallel client requests
T1 Architecture
Also ships with 6 or 4 processors
T1 pipeline
[Figure: T1 pipeline with L1 and L2 caches, TLBs, execution units, and per-thread pipe registers]
Hazards:
   Data
   Structural
T1 Fine-Grained
Multithreading
Each core supports four threads and has its own level-one caches (16 KB for instructions and 8 KB for data)
Switching to a new thread on each clock cycle
Idle threads are bypassed in the scheduling
   Waiting due to a pipeline delay or cache miss
Processor is idle only when all 4 threads are idle or stalled
L1 data cache is write through: allocate on LD, no-allocate on ST
CPI Breakdown of Performance

Benchmark     Per-thread CPI   Per-core CPI   Effective CPI for 8 cores   Effective IPC for 8 cores
TPC-C              7.20            1.80                0.23                        4.4
SPECJBB            5.60            1.40                0.18                        5.7
SPECWeb99          6.60            1.65                0.21                        4.8
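The columns are related by simple arithmetic: each core runs 4 threads, so per-core CPI = per-thread CPI / 4; with 8 cores, effective CPI = per-core CPI / 8, and effective IPC is its reciprocal. For TPC-C: 7.20 / 4 = 1.80, 1.80 / 8 ≈ 0.23, and 1 / 0.23 ≈ 4.4.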
[Table, partially lost: benchmark results comparing the Sun T1 with a Dell PowerEdge; surviving values: 63,378 / 61,789; 14,001 / 7,881; 16,061 / 14,740]
HP marketing view of T1 Niagara
1. Sun's radical UltraSPARC T1 chip is made up of individual ...
   http://h71028.www7.hp.com/ERC/cache/280124-0-0-0-121.html
Microprocessor Comparison

Processor                       SUN T1         Opteron      Pentium D      IBM Power 5
Cores                             8               2             2              2
Instruction issues/clock/core     1               3             3              4
Multithreading                Fine-grained        No           SMT            SMT
L1 I/D in KB per core            16/8            64/64      12K uops/16       64/32
L2 per core/shared            3 MB shared      1 MB/core     1 MB/core     1.9 MB shared
Clock rate (GHz)                 1.2              2.4           3.2            1.9
Transistor count (M)             300              233           230            276
Die size (mm2)                   379              199           206            389
Power (W)                         79              110           130            125
Performance Relative to Pentium D
[Figure: benchmark performance of the T1, Opteron, and Power 5 normalized to the Pentium D]
Performance/mm2, Performance/Watt
[Figure: performance per unit die area and per watt for the T1, Opteron, Pentium D, and Power 5]
Niagara 2
Improve performance by increasing
threads supported per chip from 32 to 64
8 cores * 8 threads per core
Parallel Programmer Productivity