Remember!
141L projects due next Wednesday at midnight. 141 Final next Tuesday (3-6 pm).
The final is comprehensive, with the same general format as the Midterm. Review the slides, the homeworks, and the quizzes.
Key Points
What does wide issue mean? How does it affect performance? How does it affect pipeline design? What is the basic idea behind out-of-order execution? What is the difference between a true and a false dependence? How do OOO processors remove false dependences? What is Simultaneous Multithreading?
Parallelism
ET = IC * CPI * CT. CPI is more or less fixed, and so is IC. We have shrunk cycle time as far as we can, and we have achieved a CPI of 1. Can we get faster? We can reduce our CPI to less than 1, but then the processor must do multiple operations at once. This is called Instruction Level Parallelism (ILP).
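The iron law above can be sanity-checked numerically. A minimal sketch, with made-up instruction counts and cycle times chosen purely for illustration:

```python
# Iron law of performance: ET = IC * CPI * CT.
# The numbers below are illustrative assumptions, not measurements.

def execution_time(ic, cpi, ct_ns):
    """Execution time in ns = instructions * cycles/instruction * ns/cycle."""
    return ic * cpi * ct_ns

base = execution_time(ic=1_000_000, cpi=1.0, ct_ns=0.5)   # scalar, CPI = 1
wide = execution_time(ic=1_000_000, cpi=0.5, ct_ns=0.5)   # ideal 2-wide

print(base / wide)  # halving CPI with IC and CT fixed gives speedup 2.0
```

With IC and CT pinned, CPI is the only lever left, which is exactly the motivation for issuing multiple instructions per cycle.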
Dual issue
Process two instructions at once instead of 1: one odd-PC instruction and one even-PC instruction. This often keeps the instruction fetch logic simpler. This is a 2-wide, in-order, superscalar processor. Potential problems?
Dual issue: Forwarding
[Figure: pipeline diagrams for the sequence below. Forwarding resolves the sub -> ld dependence, and the load-use pair ld -> add stalls one cycle in Decode (F D D E M W).]
sub $s2,$s4,$s5
ld  $s3, 0($s2)
add $t1, $s3, $s3
CPI == 0.5!
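The CPI == 0.5 claim on the slide above is just issue width arithmetic. A small sketch, assuming an in-order machine where stalls add whole cycles on top of the ideal issue schedule:

```python
import math

# CPI of a k-wide in-order superscalar pipeline.
# Assumes the pipeline is full; stall cycles are simply added on top.

def cpi(num_insts, width, stall_cycles=0):
    issue_cycles = math.ceil(num_insts / width)
    return (issue_cycles + stall_cycles) / num_insts

print(cpi(8, 2))     # 0.5  -- the ideal dual-issue CPI from the slide
print(cpi(8, 2, 2))  # 0.75 -- two stall cycles (e.g., load-use hazards)
```

Even a couple of stall cycles noticeably erodes the ideal CPI, which is why hazards matter more as issue width grows.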
Dual issue: Structural Hazards
We might not replicate everything: perhaps only one multiplier, one shifter, and one load/store unit. What if the instruction is in the wrong place?
[Figure: dual-issue pipeline -- Fetch 2 inst, two Decode stages (4 register values read); upper pipeline with EX and Mem; lower pipeline with a multiplier (*), EX, and a shifter (<<).]
If an upper instruction needs the lower pipeline, squash the lower instruction.
[Figure: fetch animation -- the two fetch slots (F F) advance through PC = 0, PC = 8, and PC = 12, with PC = 12 repeating when fetch stalls.]
[Figure: the same dual-issue pipeline (EX, Mem, *, <<), now extended with a branch unit.]
Change the ISA and build a smart compiler: VLIW. Keep the same ISA and build a smart processor: Out-of-order.
[Figure: dependence-graph animation -- instructions 1 and 2 feed 4: add $t3,$t1,$t2.]
Data dependences
In general, if there is no dependence between two instructions, we can execute them in either order.
2: sub $t1,$s3,$s4
False Dependence #1
Also called Write-after-Write (WAW) dependences: they occur when two instructions write to the same register. The dependence is false because no data flows between the instructions -- they just produce an output with the same name.
Beware again!
Is there a dependence here?
1: add $t1,$s2,$s3
2: sub $s2,$s3,$s4
False Dependence #2
This is a Write-after-Read (WAR) dependence. Again, it is false because no data flows between the instructions.
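All three dependence kinds can be detected mechanically from each instruction's destination and source registers. A minimal sketch, assuming a made-up encoding where an instruction is a dict with a "dst" register and a set of "srcs":

```python
# Classify the dependences from an earlier instruction to a later one.
# The dict encoding ("dst", "srcs") is an assumption for illustration.

def classify(first, second):
    """Return the set of dependences from `first` (earlier) to `second` (later)."""
    deps = set()
    if first["dst"] in second["srcs"]:
        deps.add("RAW")                  # true dependence: data flows
    if first["dst"] == second["dst"]:
        deps.add("WAW")                  # false: same output name
    if second["dst"] in first["srcs"]:
        deps.add("WAR")                  # false: later write, earlier read
    return deps

i1 = {"dst": "$t1", "srcs": {"$s2", "$s3"}}  # add $t1, $s2, $s3
i2 = {"dst": "$s2", "srcs": {"$s3", "$s4"}}  # sub $s2, $s3, $s4
print(classify(i1, i2))  # {'WAR'}: i2 writes $s2, which i1 reads
```

The example pair is the one from the "Beware again!" slide: no data flows from 1 to 2, yet their order still matters because of the register name $s2.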
Out-of-Order Execution
Any sequence of instructions has a set of RAW, WAW, and WAR hazards that constrain its execution. Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?
Example
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t3,$t1,$t2
4: add $t5,$t1,$t2
...
8: add $t3,$t5,$t1
[Figure: animation frames building the RAW, WAW, and WAR dependence graph and scheduling the instructions out of order.]
8 instructions in 5 cycles
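The scheduling idea in the example above can be sketched as a toy dataflow scheduler: each instruction issues as soon as all of its RAW predecessors have finished. This assumes 1-cycle latency and unlimited issue width, which is an idealization, not the slide's actual machine:

```python
# Toy dataflow scheduler: issue each instruction one cycle after its
# latest RAW predecessor. Assumes 1-cycle latency, unlimited width.

def schedule(deps, n):
    """deps[i] = set of earlier instructions that i truly depends on (RAW)."""
    cycle = {}
    for i in range(1, n + 1):
        cycle[i] = 1 + max((cycle[j] for j in deps.get(i, ())), default=0)
    return cycle

# RAW edges for the slide's first four instructions:
# 3 (or $t3,$t1,$t2) and 4 (add $t5,$t1,$t2) both read $t1 (from 1) and $t2 (from 2).
raw = {3: {1, 2}, 4: {1, 2}}
print(schedule(raw, 4))  # {1: 1, 2: 1, 3: 2, 4: 2} -- four insts in two cycles
```

Instructions 1 and 2 are independent and issue together; 3 and 4 follow one cycle later, which is the parallelism an in-order machine would have to serialize.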
[Figure: the out-of-order pipeline -- Fetch, Decode/Rename, Schedule, EX, Mem, Write back.]
[Figure: an instruction queue entry -- opcode etc., vrs, vrt, rs_value + valid bit, rt_value + valid bit.]
[Figure: schedule/execute -- instructions in the queue arbitrate for ALU0 and ALU1.]
Branches are every 4-5 instructions. This means that the processor must predict 6-8 consecutive branches correctly to keep the window full. On a mispredict, you flush the pipeline, which includes emptying the window (the cost is considerable).
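A quick back-of-the-envelope check on the claim above: if each branch is predicted independently with accuracy p, then predicting n in a row is p^n. The 95% accuracy below is a hypothetical figure chosen for illustration:

```python
# Probability of keeping the instruction window full, i.e., predicting
# n consecutive branches correctly at per-branch accuracy p.
# The 0.95 accuracy is an assumed value, not a measured one.

def window_full_prob(accuracy, n_branches):
    return accuracy ** n_branches

print(round(window_full_prob(0.95, 6), 3))  # 0.735
print(round(window_full_prob(0.95, 8), 3))  # 0.663
```

Even with a quite good predictor, the window is only full about two-thirds to three-quarters of the time, which is why prediction accuracy matters so much for wide OOO machines.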
Not much, in the presence of WAW and WAR dependences. These arise because we have a limited number of registers, and we must reuse them freely. How can we get rid of them?
Register renaming
If WAW and WAR dependences arise because we have too few registers, why not use more? But we can't! The Architecture only gives us 32 (why? we only use 5 bits). Solution: Define a set of internal physical registers that is as large as the number of instructions that can be in flight -- 128 in a recent Intel chip. Every instruction in the pipeline gets a register. Maintain a register mapping table that determines which physical register currently holds the value for each required architectural register.
Register mapping table after each instruction (architectural -> physical):

         0:   1:   2:   3:   4:   5:
    r1:  p1   p1   p1   p6   p6   p6
    r2:  p2   p2   p5   p5   p7   p8
    r3:  p3   p4   p4   p4   p4   p4

Before renaming, instructions 1-5 have RAW, WAW, and WAR dependences. After renaming, only the RAW dependences remain: the WAW and WAR edges disappear because every write gets a fresh physical register.
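The renaming algorithm behind the mapping table above is short. A minimal sketch: each destination write pops a fresh physical register from a free pool, and sources read the current mapping. The destination sequence below matches the table; the source operands are assumed for illustration, since the table only reveals the writes:

```python
# Register renaming with a mapping table. Each destination write
# allocates a fresh physical register, killing WAW and WAR dependences.

def rename(insts, initial_map, free_regs):
    table = dict(initial_map)     # architectural -> physical
    free = list(free_regs)        # pool of unused physical registers
    renamed = []
    for dst, srcs in insts:
        phys_srcs = [table[s] for s in srcs]  # read current mappings
        table[dst] = free.pop(0)              # fresh name for the write
        renamed.append((table[dst], phys_srcs))
    return renamed, table

insts = [("r3", ["r1", "r2"]),   # 1: writes r3  (sources assumed)
         ("r2", ["r1", "r3"]),   # 2: writes r2
         ("r1", ["r2", "r3"]),   # 3: writes r1
         ("r2", ["r1", "r3"]),   # 4: writes r2
         ("r2", ["r1", "r3"])]   # 5: writes r2
out, final = rename(insts, {"r1": "p1", "r2": "p2", "r3": "p3"},
                    ["p4", "p5", "p6", "p7", "p8"])
print(final)  # {'r1': 'p6', 'r2': 'p8', 'r3': 'p4'} -- the table's last column
```

Note that instructions 4 and 5 both write r2, yet they end up writing distinct physical registers (p7 and p8), so their WAW dependence is gone.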
Non-dependent instructions can keep executing during cache misses. This is so-called memory-level parallelism. It is enormously important. CPU performance is (almost) all about memory performance nowadays (remember the memory wall graphs!)
Single threads usually have poor branch prediction performance and little memory parallelism. Observation: on many cycles, many ALUs and instruction queue slots sit empty.
Simultaneous Multithreading
AKA HyperThreading on Intel machines. Run multiple threads at the same time: just throw all the instructions into the pipeline. Keep some separate data for each thread:
Renaming table
TLB entries
PCs
But the rest of the hardware is shared. It is surprisingly simple (but still quite complicated).
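One concrete piece of the sharing story is fetch arbitration: several threads take turns using one fetch port. A minimal sketch, assuming a simple round-robin policy (real SMT fetch policies, such as ICOUNT, are more sophisticated):

```python
# Round-robin fetch arbitration across SMT threads: on each cycle,
# one thread gets the shared fetch port. This is an illustrative
# policy, not the one a real HyperThreading core uses.

def fetch_schedule(num_threads, cycles):
    """Return which thread fetches on each cycle, round-robin."""
    return [cycle % num_threads for cycle in range(cycles)]

print(fetch_schedule(4, 8))  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Past fetch, instructions from all threads mix freely in the shared rename, schedule, and execute stages; per-thread state is limited to the items listed above.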
[Figure: SMT pipeline -- four fetch units (Fetch T1, Fetch T2, Fetch T3, Fetch T4) feed shared back-end pipelines: Decode, Rename, Schedule, EX, Mem, Write back.]
SMT Advantages
Exploit the ILP of multiple threads at once. Less dependence on branch prediction (fewer correct predictions required per thread). Less idle hardware (increased power efficiency). Much higher IPC -- up to 4 (in simulation).
Disadvantages: threads can fight over resources and slow each other down.
Historical footnote: invented, in part, by our own Dean Tullsen when he was at UW.