
Parallelism I: Inside the Core

Remember!

141L projects are due next Wednesday at midnight. The 141 final is next Tuesday (3-6 pm).

The final

Comprehensive. Same general format as the Midterm. Review the homeworks, quizzes, and the slides.

Key Points
What does wide issue mean? How does it affect performance? How does it affect pipeline design? What is the basic idea behind out-of-order execution? What is the difference between a true and a false dependence? How do OOO processors remove false dependences? What is Simultaneous Multithreading?

Parallelism

ET = IC * CPI * CT. IC is more or less fixed, and we have shrunk cycle time about as far as we can. We have achieved a CPI of 1. Can we get faster? Yes, if we can reduce our CPI to less than 1. The processor must do multiple operations at once. This is called Instruction Level Parallelism (ILP).
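A quick worked example of the execution-time equation ET = IC * CPI * CT (the instruction count and clock rate below are made up purely for illustration):

    # ET = IC * CPI * CT, with hypothetical values
    IC = 1_000_000_000      # instructions in the program
    CT = 1e-9               # cycle time: 1 ns (a 1 GHz clock)
    print(IC * 1.0 * CT)    # CPI = 1.0 (scalar pipeline)    -> 1.0 seconds
    print(IC * 0.5 * CT)    # CPI = 0.5 (ideal 2-wide issue) -> 0.5 seconds

Halving CPI halves execution time, which is exactly what wide issue is after.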

Approach 1: Widen the pipeline


[Pipeline diagram: fetch PC and PC+4; two decode units (decode 2 instructions, fetch 4 register values); two EX units; two memory ops; write back 2 values]

Process two instructions at once instead of 1: one even-PC instruction and one odd-PC instruction. This often keeps the instruction fetch logic simpler. This is a 2-wide, in-order, superscalar processor. Potential problems?

Single issue refresher

                    0  1  2  3  4  5  6  7  8
add $s1,$s2,$s3     F  D  E  M  W
sub $s2,$s4,$s5        F  D  E  M  W
ld  $s3, 0($s2)           F  D  E  M  W           (forwarding from the sub)
add $t1, $s3, $s3            F  D  D  E  M  W     (load-use stall, then forwarding from the ld)

Dual issue: Ideal Case


                    0  1  2  3  4  5  6  7  8
add $s1,$s2,$s3     F  D  E  M  W
sub $s2,$s4,$s5     F  D  E  M  W
ld  $s3, 0($s2)        F  D  E  M  W
add $t1, $s3, $s3      F  D  E  M  W
...                       F  D  E  M  W
...                       F  D  E  M  W
...                          F  D  E  M  W
...                          F  D  E  M  W
...                             F  D  E  M  W
...                             F  D  E  M  W

CPI == 0.5!

Dual issue: Structural Hazards

We might not replicate everything. Perhaps there is only one multiplier, one shifter, and one load/store unit. What if the instruction is in the wrong place?

[Pipeline diagram: the 2-wide pipeline, but with a single multiplier (*) in one EX pipe, a single shifter (<<) in the other, and one memory port]

If an upper instruction needs the lower pipeline, squash the lower instruction.


Dual issue: dealing with hazards

            0  1  2  3  4  5  6  7
add         F  D  E  M  W              <- fetch PC = 0
sub         F  D  E  M  W
Mul            F  D  E  M  W           <- fetch PC = 8
Shift          F  D  E  M  W
Shift             F  D  E  M  W        <- fetch PC = 12: the Shift moves to the lower pipe,
Ld                F  x  x  x  x           so the load is squashed
Shift                x  x  x  x  x     <- fetch PC = 12 again: the load now uses the lower
Ld                   F  D  E  M  W        pipe and the Shift slot becomes a noop

Dual issue: Data Hazards


The lower instruction may need a value produced by the upper instruction. Forwarding cannot help us -- we must stall.

[Pipeline diagram: the 2-wide pipeline, as before]

Dual issue: dealing with hazards


Forwarding is essential! Both pipes stall.

                    0  1  2  3  4  5  6  7
add $s1, $s3, #4    F  D  E  M  W
sub $s4, $s1, #4    F  D  D  E  M  W         (stalls: needs $s1 from the add)
add ...                F  F  D  E  M  W
sub ...                F  F  D  E  M  W
and ...                      F  D  E  M  W
or  ...                      F  D  E  M  W

Dual issue: Control Hazards



The upper instruction might be a branch; the lower instruction might then be on the wrong path. Solution 1: Require branches to execute in the lower pipeline -- see structural hazards. What about consecutive branches? -- Exercise for the reader. What about branches to odd addresses? -- Squash the upper pipe.

[Pipeline diagram: the 2-wide pipeline, with the branch unit in the lower pipe]

Beyond Dual Issue


Wider pipelines are possible. There is often a separate floating point pipeline. Wide issue leads to hardware complexity; compiling gets harder, too. In practice, processors use one of two options if they want more ILP:

Change the ISA and build a smart compiler: VLIW. Keep the same ISA and build a smarter processor: out-of-order.

Going Out of Order: Data dependence refresher

1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t5,$t1,$t2
4: add $t3,$t1,$t2

[Dependence graph: 3 and 4 each depend on 1 and 2]

There is parallelism!! We can execute 1 & 2 at once and 3 & 4 at once.

We can parallelize instructions that do not have a read-after-write (RAW) dependence.

Data dependences

In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously. But beware:

1: add $t1,$s2,$s3
2: sub $t1,$s3,$s4

Is there a dependence here? Can we reorder the instructions? Is the result the same?

No! The final value of $t1 is different.

False Dependence #1
These are called Write-after-Write (WAW) dependences. They occur when two instructions write to the same register. The dependence is false because no data flows between the instructions -- they just produce an output with the same name.

Beware again!

1: add $t1,$s2,$s3
2: sub $s2,$s3,$s4

Is there a dependence here? Can we reorder the instructions? Is the result the same?

No! The value in $s2 that instruction 1 needs will be destroyed.

False Dependence #2

This is a Write-after-Read (WAR) dependence. Again, it is false because no data flows between the instructions.
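To make the three dependence types concrete, here is a small sketch that classifies the dependences between two instructions from their read and write sets. The function and the (destination, sources) tuple format are illustrative, not from the slides:

    def dependences(first, second):
        w1, r1 = first            # (written register, list of read registers)
        w2, r2 = second
        deps = []
        if w1 in r2:
            deps.append("RAW")    # true dependence: second reads what first wrote
        if w2 in r1:
            deps.append("WAR")    # false: second overwrites a register first still reads
        if w2 == w1:
            deps.append("WAW")    # false: both write the same register
        return deps

    # 1: add $t1,$s2,$s3   2: sub $s2,$s3,$s4   ->   ['WAR'], as in the example above
    print(dependences(("$t1", ["$s2", "$s3"]), ("$s2", ["$s3", "$s4"])))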

Out-of-Order Execution
Any sequence of instructions has a set of RAW, WAW, and WAR hazards that constrain its execution. Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?

The Central OOO Idea


1. Fetch a bunch of instructions.
2. Build the dependence graph.
3. Find all instructions with no unmet dependences.
4. Execute them.
5. Repeat.
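A rough software analogue of this idea, assuming we only track true (RAW) dependences; the function name and the dependence-map format are made up for illustration:

    def schedule(deps):
        # deps maps an instruction number to the set of earlier instructions
        # whose results it needs (its RAW dependences).
        done, cycles = set(), []
        while len(done) < len(deps):
            ready = [i for i in deps if i not in done and deps[i] <= done]
            cycles.append(ready)          # every ready instruction issues together
            done |= set(ready)
        return cycles

    # The 4-instruction refresher: 3 and 4 each need 1 and 2.
    print(schedule({1: set(), 2: set(), 3: {1, 2}, 4: {1, 2}}))
    # -> [[1, 2], [3, 4]]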

Example

1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t3,$t1,$t2
4: add $t5,$t1,$t2
5: or  $t4,$s1,$s3
6: mul $t2,$t3,$s5
7: sl  $t3,$t4,$t2
8: add $t3,$t5,$t1

[The slides build up the dependence graph for these instructions, marking each edge as RAW, WAW, or WAR]

8 instructions in 5 cycles.

Simplified OOO Pipeline

A new schedule stage manages the Instruction Window. The window holds the set of instructions the processor examines. Fetch and decode fill the window; the execute stage drains it. Typically, OOO pipelines are also wide, but it is not necessary. Impacts: more forwarding, more stalls, longer branch resolution -- fundamentally more work per instruction.

[Pipeline: Fetch -> Decode -> Schedule -> EX -> Mem -> Write back]

The Instruction Window


The Instruction Window is the set of instructions the processor examines. The larger the window, the more parallelism the processor can find, but... keeping the window filled is a challenge.

Fetch and decode fill the window; the execute stage drains it.

The Issue Window


[Issue window entry: decoded instruction data (opcode, vrs, vrt, rs_value with a valid bit, rt_value with a valid bit); comparators (=) match the broadcast ALU destinations alu_out_dst_0/alu_out_dst_1 against vrs and vrt to capture alu_out_value_0/alu_out_value_1 and set the valid bits; when both operands are valid the entry is Ready]
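A minimal sketch of the wakeup behavior this entry implies: when an ALU broadcasts its destination tag and value, every waiting entry compares the tag against its own source tags, captures the value, and marks that operand valid. The field names follow the slide; the class itself is illustrative:

    class WindowEntry:
        def __init__(self, opcode, vrs, vrt):
            self.opcode = opcode
            self.vrs, self.rs_value, self.rs_valid = vrs, None, False
            self.vrt, self.rt_value, self.rt_valid = vrt, None, False

        def wakeup(self, alu_out_dst, alu_out_value):
            # the '=' comparators in the figure, one call per ALU broadcast
            if alu_out_dst == self.vrs:
                self.rs_value, self.rs_valid = alu_out_value, True
            if alu_out_dst == self.vrt:
                self.rt_value, self.rt_valid = alu_out_value, True

        def ready(self):
            return self.rs_valid and self.rt_valid

    e = WindowEntry("add", vrs="p4", vrt="p5")
    e.wakeup("p4", 7)
    e.wakeup("p5", 3)
    print(e.ready())   # True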

The Issue Window

[Figure: instructions in the issue window arbitrate for ALU0 and ALU1 -- schedule, then execute]

Keeping the Window Filled


Keeping the instruction window filled is key! Instruction windows are about 32 instructions (size is limited by their complexity, which is considerable).

Branches occur every 4-5 instructions. This means the processor must predict 6-8 consecutive branches correctly to keep the window full. On a mispredict, you flush the pipeline, which includes emptying the window.
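The arithmetic behind the "6-8 consecutive branches" claim, with illustrative predictor accuracies (not figures from the slides):

    # A ~32-entry window with a branch every 4-5 instructions holds roughly
    # 7 branches, so the whole window is only useful when all of them are
    # predicted correctly.
    for accuracy in (0.90, 0.95, 0.99):
        print(accuracy, round(accuracy ** 7, 2))
    # 0.9 -> 0.48, 0.95 -> 0.7, 0.99 -> 0.93  (chance all 7 predictions are right)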

How Much Parallelism is There?

Not much, in the presence of WAW and WAR dependences. These arise because there are a limited number of registers, and we must reuse them freely. How can we get rid of them?

Removing False Dependences

If WAW and WAR dependences arise because we have too few registers, let's add more!

But we can't! The architecture only gives us 32 (why? we only use 5 bits to name a register).

Solution: Define a set of internal physical registers that is as large as the number of instructions that can be in flight -- 128 in a recent Intel chip. Every instruction in the pipeline gets a register. Maintain a register mapping table that determines which physical register currently holds the value for each required architectural register.

This is called Register Renaming.

Alpha 21264: Renaming

Original code:              Renamed code:
1: Add  r3, r2, r3          1: Add  p4, p2, p3
2: Sub  r2, r1, r3          2: Sub  p5, p1, p4
3: Mult r1, r3, r1          3: Mult p6, p4, p1
4: Add  r2, r3, r1          4: Add  p7, p4, p6
5: Add  r2, r1, r3          5: Add  p8, p6, p4

Register map table (each entry names the physical register that currently holds the value of that architectural register; row 0 is the initial state, row i is the state after renaming instruction i):

      r1   r2   r3
0:    p1   p2   p3
1:    p1   p2   p4
2:    p1   p5   p4
3:    p6   p5   p4
4:    p6   p7   p4
5:    p6   p8   p4

The original code has RAW, WAW, and WAR dependences among instructions 1-5; after renaming, only the RAW dependences remain.
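A minimal sketch of the renaming loop that produces the table above. The free-list handling and function name are illustrative; the instruction sequence, initial map, and results are the ones from the Alpha 21264 slides:

    def rename(instructions, initial_map, free_regs):
        rmap = dict(initial_map)
        renamed = []
        for dst, src1, src2 in instructions:
            s1, s2 = rmap[src1], rmap[src2]   # read sources through the current map
            rmap[dst] = free_regs.pop(0)      # give the destination a fresh physical register
            renamed.append((rmap[dst], s1, s2))
        return renamed, rmap

    code = [("r3", "r2", "r3"), ("r2", "r1", "r3"), ("r1", "r3", "r1"),
            ("r2", "r3", "r1"), ("r2", "r1", "r3")]
    renamed, final_map = rename(code, {"r1": "p1", "r2": "p2", "r3": "p3"},
                                ["p4", "p5", "p6", "p7", "p8"])
    print(renamed)     # [('p4', 'p2', 'p3'), ('p5', 'p1', 'p4'), ('p6', 'p4', 'p1'),
                       #  ('p7', 'p4', 'p6'), ('p8', 'p6', 'p4')]
    print(final_map)   # {'r1': 'p6', 'r2': 'p8', 'r3': 'p4'}

Because each destination gets a fresh register, the WAW and WAR edges disappear and only the RAW edges survive.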

New OOO Pipeline


The register file is larger (to hold the physical registers). The pipeline is longer: more forwarding and a longer branch delay. The payoff had better be significant (and it is).

[Pipeline: Fetch -> Decode -> Rename -> Schedule -> EX -> Mem -> Write back]

Modern OOO Processors


The fastest machines in the world are OOO superscalars. AMD Barcelona: 6-wide issue, 106 instructions in flight at once. Intel Nehalem: 5-way issue to 12 ALUs, more than 128 instructions in flight.

OOO provides the most benefit for memory operations.

Non-dependent instructions can keep executing during cache misses. This is so-called memory-level parallelism, and it is enormously important. CPU performance is (almost) all about memory performance nowadays (remember the memory-wall graphs!).

The Problem with OOO


Even the fastest OOO machines only get an IPC of about 1-2, even though they are 4-5 wide. Problems: insufficient ILP within applications (usually 1-2 per thread), poor branch prediction performance, and single threads also have little memory parallelism. Observation: on many cycles, many ALUs and instruction queue slots sit empty.

Simultaneous Multithreading

AKA HyperThreading in Intel machines. Run multiple threads at the same time: just throw all the instructions into the pipeline. Keep some separate data for each thread:

Renaming table, TLB entries, PCs.

But the rest of the hardware is shared. It is surprisingly simple (but still quite complicated).

[Pipeline: separate Fetch for T1/T2/T3/T4 feeding a shared Decode -> Rename -> Schedule -> EX -> Mem -> Write back]
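A minimal sketch of which state is per-thread and which is shared, mirroring the bullets above; the class and field names are illustrative:

    class ThreadContext:
        # duplicated per SMT thread
        def __init__(self):
            self.pc = 0
            self.rename_map = {}     # architectural -> physical register map
            self.tlb_entries = {}

    class Core:
        # shared by all threads on the core
        def __init__(self, num_threads):
            self.threads = [ThreadContext() for _ in range(num_threads)]
            self.physical_registers = {}
            self.issue_window = []
            self.alus = []

Scheduling simply mixes ready instructions from all thread contexts into the one shared issue window.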

SMT Advantages
Exploit the ILP of multiple threads at once. Less dependence on branch prediction (fewer correct predictions required per thread). Less idle hardware (increased power efficiency). Much higher IPC -- up to 4 (in simulation). Disadvantages: threads can fight over resources and slow each other down. Historical footnote: SMT was invented, in part, by our own Dean Tullsen when he was at UW.
