
Parallelism I: Inside the Core

Remember!

141L projects are due next Wednesday at midnight. The 141 final is next Tuesday (3-6 pm).

The final

Comprehensive. Same general format as the Midterm. Review the homeworks, quizzes, and the slides.

Key Points
What does wide issue mean? How does it affect performance? How does it affect pipeline design? What is the basic idea behind out-of-order execution? What is the difference between a true and a false dependence? How do OOO processors remove false dependences? What is Simultaneous Multithreading?

Parallelism

ET = IC * CPI * CT. IC is more or less fixed, and we have shrunk cycle time about as far as we can. We have achieved a CPI of 1. Can we get faster? Yes, if we can reduce our CPI to less than 1. The processor must do multiple operations at once. This is called Instruction Level Parallelism (ILP).
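A quick worked example of the execution-time equation ET = IC * CPI * CT (the instruction count and clock rate below are made up purely for illustration):

    # ET = IC * CPI * CT, with hypothetical values
    IC = 1_000_000_000      # instructions in the program
    CT = 1e-9               # cycle time: 1 ns (a 1 GHz clock)
    print(IC * 1.0 * CT)    # CPI = 1.0 (scalar pipeline)    -> 1.0 seconds
    print(IC * 0.5 * CT)    # CPI = 0.5 (ideal 2-wide issue) -> 0.5 seconds

Halving CPI halves execution time, which is exactly what wide issue is after.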

Approach 1: Widen the pipeline


[Pipeline diagram: fetch PC and PC+4; two decode units (decode 2 instructions, fetch 4 register values); two EX units; two memory ops; write back 2 values]

Process two instructions at once instead of 1: one even-PC instruction and one odd-PC instruction. This often keeps the instruction fetch logic simpler. This is a 2-wide, in-order, superscalar processor. Potential problems?

Single issue refresher

                    0  1  2  3  4  5  6  7  8
add $s1,$s2,$s3     F  D  E  M  W
sub $s2,$s4,$s5        F  D  E  M  W
ld  $s3, 0($s2)           F  D  E  M  W           (forwarding from the sub)
add $t1, $s3, $s3            F  D  D  E  M  W     (load-use stall, then forwarding from the ld)

Dual issue: Ideal Case


                    0  1  2  3  4  5  6  7  8
add $s1,$s2,$s3     F  D  E  M  W
sub $s2,$s4,$s5     F  D  E  M  W
ld  $s3, 0($s2)        F  D  E  M  W
add $t1, $s3, $s3      F  D  E  M  W
...                       F  D  E  M  W
...                       F  D  E  M  W
...                          F  D  E  M  W
...                          F  D  E  M  W
...                             F  D  E  M  W
...                             F  D  E  M  W

CPI == 0.5!

Dual issue: Structural Hazards

We might not replicate everything. Perhaps there is only one multiplier, one shifter, and one load/store unit. What if the instruction is in the wrong place?

[Pipeline diagram: the 2-wide pipeline, but with a single multiplier (*) in one EX pipe, a single shifter (<<) in the other, and one memory port]

If an upper instruction needs the lower pipeline, squash the lower instruction.


Dual issue: dealing with hazards

            0  1  2  3  4  5  6  7
add         F  D  E  M  W              <- fetch PC = 0
sub         F  D  E  M  W
Mul            F  D  E  M  W           <- fetch PC = 8
Shift          F  D  E  M  W
Shift             F  D  E  M  W        <- fetch PC = 12: the Shift moves to the lower pipe,
Ld                F  x  x  x  x           so the load is squashed
Shift                x  x  x  x  x     <- fetch PC = 12 again: the load now uses the lower
Ld                   F  D  E  M  W        pipe and the Shift slot becomes a noop

Dual issue: Data Hazards


The lower instruction may need a value produced by the upper instruction. Forwarding cannot help us -- we must stall.

[Pipeline diagram: the 2-wide pipeline, as before]

Dual issue: dealing with hazards


Forwarding is essential! Both pipes stall.

                    0  1  2  3  4  5  6  7
add $s1, $s3, #4    F  D  E  M  W
sub $s4, $s1, #4    F  D  D  E  M  W         (stalls: needs $s1 from the add)
add ...                F  F  D  E  M  W
sub ...                F  F  D  E  M  W
and ...                      F  D  E  M  W
or  ...                      F  D  E  M  W

Dual issue: Control Hazards



The upper instruction might be a branch; the lower instruction might then be on the wrong path. Solution 1: Require branches to execute in the lower pipeline -- see structural hazards. What about consecutive branches? -- Exercise for the reader. What about branches to odd addresses? -- Squash the upper pipe.

[Pipeline diagram: the 2-wide pipeline, with the branch unit in the lower pipe]

Beyond Dual Issue


Wider pipelines are possible. There is often a separate floating point pipeline. Wide issue leads to hardware complexity; compiling gets harder, too. In practice, processors use one of two options if they want more ILP:

Change the ISA and build a smart compiler: VLIW. Keep the same ISA and build a smarter processor: out-of-order.

Going Out of Order: Data dependence refresher

1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t5,$t1,$t2
4: add $t3,$t1,$t2

[Dependence graph: 3 and 4 each depend on 1 and 2]

There is parallelism!! We can execute 1 & 2 at once and 3 & 4 at once.

We can parallelize instructions that do not have a read-after-write (RAW) dependence.

Data dependences

In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously. But beware:

1: add $t1,$s2,$s3
2: sub $t1,$s3,$s4

Is there a dependence here? Can we reorder the instructions? Is the result the same?

No! The final value of $t1 is different.

False Dependence #1
These are called Write-after-Write (WAW) dependences. They occur when two instructions write to the same register. The dependence is false because no data flows between the instructions -- they just produce an output with the same name.

Beware again!

1: add $t1,$s2,$s3
2: sub $s2,$s3,$s4

Is there a dependence here? Can we reorder the instructions? Is the result the same?

No! The value in $s2 that instruction 1 needs will be destroyed.

False Dependence #2

This is a Write-after-Read (WAR) dependence. Again, it is false because no data flows between the instructions.
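To make the three dependence types concrete, here is a small sketch that classifies the dependences between two instructions from their read and write sets. The function and the (destination, sources) tuple format are illustrative, not from the slides:

    def dependences(first, second):
        w1, r1 = first            # (written register, list of read registers)
        w2, r2 = second
        deps = []
        if w1 in r2:
            deps.append("RAW")    # true dependence: second reads what first wrote
        if w2 in r1:
            deps.append("WAR")    # false: second overwrites a register first still reads
        if w2 == w1:
            deps.append("WAW")    # false: both write the same register
        return deps

    # 1: add $t1,$s2,$s3   2: sub $s2,$s3,$s4   ->   ['WAR'], as in the example above
    print(dependences(("$t1", ["$s2", "$s3"]), ("$s2", ["$s3", "$s4"])))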

Out-of-Order Execution
Any sequence of instructions has a set of RAW, WAW, and WAR hazards that constrain its execution. Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?

The Central OOO Idea


1. Fetch a bunch of instructions.
2. Build the dependence graph.
3. Find all instructions with no unmet dependences.
4. Execute them.
5. Repeat.
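A rough software analogue of this idea, assuming we only track true (RAW) dependences; the function name and the dependence-map format are made up for illustration:

    def schedule(deps):
        # deps maps an instruction number to the set of earlier instructions
        # whose results it needs (its RAW dependences).
        done, cycles = set(), []
        while len(done) < len(deps):
            ready = [i for i in deps if i not in done and deps[i] <= done]
            cycles.append(ready)          # every ready instruction issues together
            done |= set(ready)
        return cycles

    # The 4-instruction refresher: 3 and 4 each need 1 and 2.
    print(schedule({1: set(), 2: set(), 3: {1, 2}, 4: {1, 2}}))
    # -> [[1, 2], [3, 4]]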

Example

1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t3,$t1,$t2
4: add $t5,$t1,$t2
5: or  $t4,$s1,$s3
6: mul $t2,$t3,$s5
7: sl  $t3,$t4,$t2
8: add $t3,$t5,$t1

[The slides build up the dependence graph for these instructions, marking each edge as RAW, WAW, or WAR]

8 instructions in 5 cycles.

Simplified OOO Pipeline

A new schedule stage manages the Instruction Window. The window holds the set of instructions the processor examines. Fetch and decode fill the window; the execute stage drains it. Typically, OOO pipelines are also wide, but it is not necessary. Impacts: more forwarding, more stalls, longer branch resolution -- fundamentally more work per instruction.

[Pipeline: Fetch -> Decode -> Schedule -> EX -> Mem -> Write back]

The Instruction Window


The Instruction Window is the set of instructions the processor examines. The larger the window, the more parallelism the processor can find, but... keeping the window filled is a challenge.

Fetch and decode fill the window; the execute stage drains it.

The Issue Window


[Issue window entry: decoded instruction data (opcode, vrs, vrt, rs_value with a valid bit, rt_value with a valid bit); comparators (=) match the broadcast ALU destinations alu_out_dst_0/alu_out_dst_1 against vrs and vrt to capture alu_out_value_0/alu_out_value_1 and set the valid bits; when both operands are valid the entry is Ready]
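A minimal sketch of the wakeup behavior this entry implies: when an ALU broadcasts its destination tag and value, every waiting entry compares the tag against its own source tags, captures the value, and marks that operand valid. The field names follow the slide; the class itself is illustrative:

    class WindowEntry:
        def __init__(self, opcode, vrs, vrt):
            self.opcode = opcode
            self.vrs, self.rs_value, self.rs_valid = vrs, None, False
            self.vrt, self.rt_value, self.rt_valid = vrt, None, False

        def wakeup(self, alu_out_dst, alu_out_value):
            # the '=' comparators in the figure, one call per ALU broadcast
            if alu_out_dst == self.vrs:
                self.rs_value, self.rs_valid = alu_out_value, True
            if alu_out_dst == self.vrt:
                self.rt_value, self.rt_valid = alu_out_value, True

        def ready(self):
            return self.rs_valid and self.rt_valid

    e = WindowEntry("add", vrs="p4", vrt="p5")
    e.wakeup("p4", 7)
    e.wakeup("p5", 3)
    print(e.ready())   # True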

The Issue Window

[Figure: instructions in the issue window arbitrate for ALU0 and ALU1 -- schedule, then execute]

Keeping the Window Filled


Keeping the instruction window filled is key! Instruction windows are about 32 instructions (size is limited by their complexity, which is considerable).

Branches occur every 4-5 instructions. This means the processor must predict 6-8 consecutive branches correctly to keep the window full. On a mispredict, you flush the pipeline, which includes emptying the window.
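The arithmetic behind the "6-8 consecutive branches" claim, with illustrative predictor accuracies (not figures from the slides):

    # A ~32-entry window with a branch every 4-5 instructions holds roughly
    # 7 branches, so the whole window is only useful when all of them are
    # predicted correctly.
    for accuracy in (0.90, 0.95, 0.99):
        print(accuracy, round(accuracy ** 7, 2))
    # 0.9 -> 0.48, 0.95 -> 0.7, 0.99 -> 0.93  (chance all 7 predictions are right)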

How Much Parallelism is There?

Not much, in the presence of WAW and WAR dependences. These arise because there are a limited number of registers, and we must reuse them freely. How can we get rid of them?

Removing False Dependences

If WAW and WAR dependences arise because we have too few registers, let's add more!

But we can't! The architecture only gives us 32 (why? we only use 5 bits to name a register).

Solution: Define a set of internal physical registers that is as large as the number of instructions that can be in flight -- 128 in a recent Intel chip. Every instruction in the pipeline gets a register. Maintain a register mapping table that determines which physical register currently holds the value for each required architectural register.

This is called Register Renaming.

Alpha 21264: Renaming

Original code:              Renamed code:
1: Add  r3, r2, r3          1: Add  p4, p2, p3
2: Sub  r2, r1, r3          2: Sub  p5, p1, p4
3: Mult r1, r3, r1          3: Mult p6, p4, p1
4: Add  r2, r3, r1          4: Add  p7, p4, p6
5: Add  r2, r1, r3          5: Add  p8, p6, p4

Register map table (each entry names the physical register that currently holds the value of that architectural register; row 0 is the initial state, row i is the state after renaming instruction i):

      r1   r2   r3
0:    p1   p2   p3
1:    p1   p2   p4
2:    p1   p5   p4
3:    p6   p5   p4
4:    p6   p7   p4
5:    p6   p8   p4

The original code has RAW, WAW, and WAR dependences among instructions 1-5; after renaming, only the RAW dependences remain.
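A minimal sketch of the renaming loop that produces the table above. The free-list handling and function name are illustrative; the instruction sequence, initial map, and results are the ones from the Alpha 21264 slides:

    def rename(instructions, initial_map, free_regs):
        rmap = dict(initial_map)
        renamed = []
        for dst, src1, src2 in instructions:
            s1, s2 = rmap[src1], rmap[src2]   # read sources through the current map
            rmap[dst] = free_regs.pop(0)      # give the destination a fresh physical register
            renamed.append((rmap[dst], s1, s2))
        return renamed, rmap

    code = [("r3", "r2", "r3"), ("r2", "r1", "r3"), ("r1", "r3", "r1"),
            ("r2", "r3", "r1"), ("r2", "r1", "r3")]
    renamed, final_map = rename(code, {"r1": "p1", "r2": "p2", "r3": "p3"},
                                ["p4", "p5", "p6", "p7", "p8"])
    print(renamed)     # [('p4', 'p2', 'p3'), ('p5', 'p1', 'p4'), ('p6', 'p4', 'p1'),
                       #  ('p7', 'p4', 'p6'), ('p8', 'p6', 'p4')]
    print(final_map)   # {'r1': 'p6', 'r2': 'p8', 'r3': 'p4'}

Because each destination gets a fresh register, the WAW and WAR edges disappear and only the RAW edges survive.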

New OOO Pipeline


The register file is larger (to hold the physical registers). The pipeline is longer: more forwarding and a longer branch delay. The payoff had better be significant (and it is).

[Pipeline: Fetch -> Decode -> Rename -> Schedule -> EX -> Mem -> Write back]

Modern OOO Processors


The fastest machines in the world are OOO superscalars. AMD Barcelona: 6-wide issue, 106 instructions in flight at once. Intel Nehalem: 5-way issue to 12 ALUs, more than 128 instructions in flight.

OOO provides the most benefit for memory operations.

Non-dependent instructions can keep executing during cache misses. This is so-called memory-level parallelism, and it is enormously important. CPU performance is (almost) all about memory performance nowadays (remember the memory-wall graphs!).

The Problem with OOO


Even the fastest OOO machines only get an IPC of about 1-2, even though they are 4-5 wide. Problems: insufficient ILP within applications (usually 1-2 per thread), poor branch prediction performance, and single threads also have little memory parallelism. Observation: on many cycles, many ALUs and instruction queue slots sit empty.

Simultaneous Multithreading

AKA HyperThreading in Intel machines. Run multiple threads at the same time: just throw all the instructions into the pipeline. Keep some separate data for each thread:

Renaming table, TLB entries, PCs.

But the rest of the hardware is shared. It is surprisingly simple (but still quite complicated).

[Pipeline: separate Fetch for T1/T2/T3/T4 feeding a shared Decode -> Rename -> Schedule -> EX -> Mem -> Write back]
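A minimal sketch of which state is per-thread and which is shared, mirroring the bullets above; the class and field names are illustrative:

    class ThreadContext:
        # duplicated per SMT thread
        def __init__(self):
            self.pc = 0
            self.rename_map = {}     # architectural -> physical register map
            self.tlb_entries = {}

    class Core:
        # shared by all threads on the core
        def __init__(self, num_threads):
            self.threads = [ThreadContext() for _ in range(num_threads)]
            self.physical_registers = {}
            self.issue_window = []
            self.alus = []

Scheduling simply mixes ready instructions from all thread contexts into the one shared issue window.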

SMT Advantages
Exploit the ILP of multiple threads at once. Less dependence on branch prediction (fewer correct predictions required per thread). Less idle hardware (increased power efficiency). Much higher IPC -- up to 4 (in simulation). Disadvantages: threads can fight over resources and slow each other down. Historical footnote: SMT was invented, in part, by our own Dean Tullsen when he was at UW.
