Unit 5: Pipelining
Slides developed by Joe Devietti, Milo Martin & Amir Roth at UPenn, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood
• Single-cycle & multi-cycle datapaths
• Latency vs throughput & performance
• Basic pipelining
• Data hazards
  • Bypassing
  • Load-use stalling
Readings
• Chapter 2.1 of MA:FSPTCM
In-Class Exercise
• You have a washer, dryer, and folding robot
  • Each takes 30 minutes per load
  • How long for one load in total?
  • How long for two loads of laundry?
  • How long for 100 loads of laundry?
• Now assume:
  • Washing takes 30 minutes, drying 60 minutes, and folding 15 min
  • How long for one load in total? 105 minutes
  • How long for two loads of laundry? 105 + 60 = 165 minutes
  • How long for 100 loads of laundry? 105 + 60*99 = 6045 min
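The laundry arithmetic can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the function name `finish_time` is made up for illustration, and it assumes every load is ready at time 0 and that a finished load can wait between machines until the next one is free.

```python
def finish_time(stage_minutes, loads):
    """Minutes until the last load leaves the last stage."""
    free_at = [0] * len(stage_minutes)      # when each machine next becomes free
    done = 0
    for _ in range(loads):
        t = 0                               # assumption: every load is ready at time 0
        for stage, minutes in enumerate(stage_minutes):
            start = max(t, free_at[stage])  # wait for both the load and the machine
            t = start + minutes
            free_at[stage] = t
        done = t
    return done


for stages in ([30, 30, 30], [30, 60, 15]):
    print(stages, [finish_time(stages, n) for n in (1, 2, 100)])
# [30, 30, 30] [90, 120, 3060]
# [30, 60, 15] [105, 165, 6045]
```

The first load pays the full latency (the sum of the stages); every later load adds only the slowest stage, which is why the dryer sets the throughput in the second scenario.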
Datapath Background
• Processor logically executes loop at left
  • Atomic: insn finishes before next insn starts
• Implementations can break this constraint physically
  • But must maintain illusion to preserve correctness
CIS 501: Comp. Arch. | Prof. Joe Devietti | Pipelining 8
Single-Cycle Datapath
[Figure: single-cycle datapath (PC, Insn Mem, Register File, Data Mem) with clock period T_singlecycle]
• Fetch, decode, execute one complete instruction every cycle
  + Takes 1 cycle to execute any instruction by definition (CPI is 1)
  - Long clock period: to accommodate slowest instruction (worst-case delay through circuit, must wait this long every time)
Multi-Cycle Datapath
[Figure: multi-cycle datapath with latches (IR, A, B, O, D); each cycle covers one component delay: T_insn-mem, T_regfile, T_ALU, T_data-mem, T_regfile]
• Fetch, decode, execute one complete insn over multiple cycles
  • Allows insns to take different number of cycles
  + Opposite of single-cycle: short clock period (less work per cycle)
  - Multiple cycles per instruction (higher CPI)
[Figure: timelines of insn0 (fetch, dec, exec) on the single-cycle and multi-cycle datapaths]
• Single-cycle datapath:
  • Fetch, decode, execute one complete instruction every cycle
  + Low CPI: 1 by definition
  - Long clock period: to accommodate slowest instruction
• Multi-cycle
  • Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
  • Clock period = 11ns, CPI = (20%*3)+(20%*5)+(60%*4) = 4
  • Why is clock period 11ns and not 10ns? Overheads
  • Performance = 44ns/insn
501 News
• Paper review #2 not actually graded yet :-(
• HW2: question 4/5 revised
Pipelined Datapath
[Figure: timelines of insn0 (fetch, dec, exec) under single-cycle, multi-cycle, and pipelined execution]
[Figure: the multi-cycle datapath (PC, Insn Mem, IR, Register File, A, B, O, Data Mem) turned into a pipelined datapath, with PC, IR, A, and B latches carried between stages]
Instruction Convention
• Different ISAs use inconsistent register orders
• Some ISAs (for example MIPS)
  • Instruction destination (i.e., output) on the left
  • add $1, $2, $3 means $1 <- $2+$3
• Other ISAs
  • Instruction destination (i.e., output) on the right
    add r1,r2,r3 means r1+r2 -> r3
    ld 8(r5),r4 means mem[r5+8] -> r4
    st r4,8(r5) means r4 -> mem[r5+8]
[Figure sequence: add $3<-$2,$1, lw $4,8($5), and sw $6,4($7) enter the five-stage pipelined datapath (latches D, X, M, W) on successive cycles and advance one stage per cycle until all 3 instructions have left W]
Pipeline Diagram
• Pipeline diagram: shorthand for what we just saw
  • Across: cycles
  • Down: insns
  • Convention: X means lw $4,8($5) finishes eXecute stage and writes into M latch at end of cycle 4

                 1  2  3  4  5  6  7
add $3<-$2,$1    F  D  X  M  W
lw $4,8($5)         F  D  X  M  W
sw $6,4($7)            F  D  X  M  W
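A pipeline diagram for a stall-free sequence follows a simple rule: insn i (counting from 0) occupies stage s in cycle i + s + 1. The sketch below is illustrative only; the helper names `pipeline_diagram` and `render` are not from the slides, and it assumes no hazards of any kind.

```python
STAGES = ["F", "D", "X", "M", "W"]


def pipeline_diagram(insns):
    """Return (insn, cells) rows for a stall-free five-stage pipeline."""
    cycles = len(insns) + len(STAGES) - 1
    rows = []
    for i, insn in enumerate(insns):
        cells = [""] * cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage            # insn i is in stage s during cycle i + s + 1
        rows.append((insn, cells))
    return rows


def render(rows):
    """Format the rows as text: cycles across, insns down."""
    width = max(len(insn) for insn, _ in rows)
    cycles = len(rows[0][1])
    lines = [" " * width + "".join(f"{c:>3}" for c in range(1, cycles + 1))]
    for insn, cells in rows:
        lines.append(insn.ljust(width) + "".join(f"{c:>3}" for c in cells))
    return "\n".join(lines)


rows = pipeline_diagram(["add $3<-$2,$1", "lw $4,8($5)", "sw $6,4($7)"])
print(render(rows))
```

Reading column 4 of the output gives M, X, D from top to bottom, which matches the convention that lw finishes eXecute at the end of cycle 4.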
• Multi-cycle
  • Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
  • Clock period = 11ns, CPI = (20%*3)+(20%*5)+(60%*4) = 4
  • Performance = 44ns/insn
• 5-stage pipeline
  • Clock period = 12ns approx. (50ns / 5 stages) + overheads
  + CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
  + Performance = 12ns/insn
  • Well actually, CPI = 1 + some penalty for pipelining (next)
  • CPI = 1.5 (on average insn completes every 1.5 cycles)
  • Performance = 18ns/insn
  • Much higher performance than single-cycle or multi-cycle
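The comparison above is clock period times CPI in each case. The following sketch reproduces the slide's numbers; `ns_per_insn` and the dictionary layout are illustrative choices, not part of the course material.

```python
def ns_per_insn(clock_ns, cpi):
    """Average time per instruction = clock period x cycles per instruction."""
    return clock_ns * cpi


# Multi-cycle instruction mix from the slide: (fraction of insns, cycles each).
mix = {"branch": (0.20, 3), "load": (0.20, 5), "alu": (0.60, 4)}
multi_cycle_cpi = sum(frac * cycles for frac, cycles in mix.values())

designs = {
    "multi-cycle": ns_per_insn(11, multi_cycle_cpi),
    "pipeline, ideal CPI": ns_per_insn(12, 1.0),
    "pipeline, CPI 1.5": ns_per_insn(12, 1.5),
}
for name, ns in designs.items():
    print(f"{name}: {ns:.0f} ns/insn")
# multi-cycle: 44 ns/insn
# pipeline, ideal CPI: 12 ns/insn
# pipeline, CPI 1.5: 18 ns/insn
```

Even with the pipelining penalty folded into the CPI, the pipeline's short clock keeps it well ahead of the multi-cycle design.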
Data Hazards
[Figure: pipelined datapath with add $3<-$2,$1, lw $4,8($5), and sw $6,4($7) in flight]
• Let's forget about branches and the control for a while
• The three insn sequence we saw earlier executed fine
  • But it wasn't a real program
  • Real programs have data dependences
  • They pass values via registers and memory
Dependent Operations
• Independent operations
    add $3<-$2,$1
    add $6<-$5,$4
Data Hazards
[Figure: pipelined datapath with add $3<-$2,$1 followed by insns that read $3: lw $4,8($3), addi $6<-1,$3, and sw $3,4($7)]
Observation!
[Figure: pipelined datapath with add $3<-$2,$1 immediately followed by lw $4,8($3)]
Bypassing
[Figure: pipelined datapath with a bypass path, executing add $3<-$2,$1 followed by lw $4,8($3)]
• Bypassing
  • Reading a value from an intermediate (microarchitectural) source
  • Not waiting until it is available from primary source
  • Here, we are bypassing the register file
  • Also called forwarding
WX Bypassing
[Figure: WX bypass path, executing add $3<-$2,$1 and add $4<-$3,$2]
ALUinB Bypassing
[Figure: bypass into ALU input B, executing add $3<-$2,$1 and add $4<-$2,$3]
WM Bypassing?
[Figure: candidate WM bypass path, executing lw $3,8($2) and sw $3,4($4)]
Bypass Logic
[Figure: pipelined datapath with bypass control logic]

• Example: WX bypass
                 1  2  3  4  5  6  7
add r1<-r2,r3    F  D  X  M  W
ld r5,[r7+4]        F  D  X  M  W
sub r2<-r1,r4          F  D  X  M  W
• Example: WM bypass
add r1<-r2,r3
?
[Figure: lw $3,8($2) followed by add $4<-$2,$3; the datapath gains stall logic that can write a nop into X]
• No. Consider a load followed by a dependent add insn
• Bypassing alone isn't sufficient!
• Hardware solution: detect this situation and inject a stall cycle
• Software solution: ensure compiler doesn't generate such code
[Figure: stall asserted while add $4<-$2,$3 is in D and lw $3,8($2) is in X]
• Prevent D insn from advancing this cycle
  • Write nop into X.IR (effectively, insert nop in hardware)
  • Keep same D insn, same PC next cycle
• Re-evaluate situation next cycle
[Figure: add $4<-$2,$3 in D and lw $3,8($2) in X, with the stall condition evaluated in D]
Stall = (X.IR.Operation == LOAD) &&
        ( (D.IR.RegSrc1 == X.IR.RegDest) ||
          ((D.IR.RegSrc2 == X.IR.RegDest) && (D.IR.Op != STORE)) )
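The stall condition can be written out as a predicate and tried on the running example. This is a sketch under stated assumptions: the `Insn` record and its field names are hypothetical, and it reads the `!= STORE` clause as meaning that a store's RegSrc2 is its data register (which can be forwarded later) while RegSrc1 is its address base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Insn:
    op: str                       # "LOAD", "STORE", "ALU", ...
    dest: Optional[int] = None    # destination register number, if any
    src1: Optional[int] = None    # first source (address base for loads/stores)
    src2: Optional[int] = None    # second source (assumed: data register for stores)


def load_use_stall(d: Insn, x: Insn) -> bool:
    """True when the insn in D must wait one cycle for the load in X."""
    if x.op != "LOAD":
        return False
    src1_hit = d.src1 == x.dest
    src2_hit = d.src2 == x.dest and d.op != "STORE"
    return src1_hit or src2_hit


lw = Insn("LOAD", dest=3, src1=2)                 # lw $3,8($2)
add = Insn("ALU", dest=4, src1=2, src2=3)         # add $4<-$2,$3
sw_data = Insn("STORE", src1=4, src2=3)           # sw $3,4($4): $3 is store data
sw_addr = Insn("STORE", src1=3, src2=6)           # sw $6,4($3): $3 is the address base

print(load_use_stall(add, lw))      # True
print(load_use_stall(sw_data, lw))  # False
print(load_use_stall(sw_addr, lw))  # True
```

The store-data case needs no stall because the loaded value is not needed until the store reaches M, whereas an address base is needed in X, one cycle too early for the load to supply it.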
[Figure sequence: over the following cycles the stall bubble (nop) follows lw $3,8($2) through M and W while add $4<-$2,$3 moves into X behind it; the same stall condition is re-evaluated each cycle]
• Calculate CPI
  • CPI = 1 + (1 * 20% * 50%) = 1.1
[Pipeline diagrams: four insns with a one-cycle data-hazard stall (d*) in cycle 5, and four insns flowing with no stall]
[Figure: pipelined datapath with sw $5,8($1) followed by lw $4,8($1), a store and a load to the same address]
Structural Hazards
• Structural hazards
  • Two insns trying to use same circuit at same time
  • E.g., structural hazard on register file write port
[Figure: pipelined datapath with add $3<-$2,$1 and lw $4,8($5) in flight]
Multi-Cycle Operations
[Figure: datapath with a multi-cycle multiplier stage (P) and its control (Xctrl) beside the X stage]
501 News
• Paper review #4 due 9 Oct at midnight
A Pipelined Multiplier
[Figure: multiplier pipelined into four stages (P0, P1, P2, P3), each with its own IR latch, in parallel with X and M]
[Pipeline diagrams: mul followed by an independent insn; two back-to-back mul insns]

                 1  2  3  4  5  6  7  8  9
mul $4<-$3,$5    F  D  P0 P1 P2 P3 W
addi $6<-$4,1       F  D  d* d* d* X  M  W
Stall = (OldStallLogic) ||
        (D.IR.RegSrc1 == P0.IR.RegDest) || (D.IR.RegSrc2 == P0.IR.RegDest) ||
        (D.IR.RegSrc1 == P1.IR.RegDest) || (D.IR.RegSrc2 == P1.IR.RegDest) ||
        (D.IR.RegSrc1 == P2.IR.RegDest) || (D.IR.RegSrc2 == P2.IR.RegDest)
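Stepping the multiplier terms of this condition cycle by cycle shows where the three d* cycles for mul $4<-$3,$5 followed by addi $6<-$4,1 come from. The sketch below is illustrative: the function names are invented, OldStallLogic is left out, and the multiplier is modeled as a plain list of destination registers for P0 through P3.

```python
def mul_stall(d_srcs, p_dests):
    """Multiplier part of the stall condition.

    p_dests holds the destination registers of the insns in P0, P1, P2
    (None for an empty stage). P3 is not checked.
    """
    return any(dest is not None and dest in d_srcs for dest in p_dests)


def stall_cycles_after_mul(mul_dest, d_srcs):
    """Cycles an insn waits in D when it directly follows a mul."""
    pipe = [mul_dest, None, None, None]   # P0..P3; the mul has just entered P0
    stalls = 0
    while mul_stall(d_srcs, pipe[:3]):
        stalls += 1
        pipe = [None] + pipe[:3]          # the mul advances one P stage per cycle
    return stalls


print(stall_cycles_after_mul(4, {4}))   # addi $6<-$4,1 after mul $4<-$3,$5 -> 3
print(stall_cycles_after_mul(4, {1}))   # an independent insn -> 0
```

The dependent insn is released once the mul reaches P3, so it enters X in the same cycle the mul is in W.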
63
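• Must prevent (mul and add both reach W in cycle 7):
                  1  2  3  4  5  6  7
mul $4<-$3,$5     F  D  P0 P1 P2 P3 W
addi $6<-$1,1        F  D  X  M  W
add $5<-$6,$10          F  D  X  M  W
• With add held one cycle in D (d*):
                  1  2  3  4  5  6  7  8
mul $4<-$3,$5     F  D  P0 P1 P2 P3 W
addi $6<-$1,1        F  D  X  M  W
add $5<-$6,$10          F  D  d* X  M  W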
[Figure: pipelined multiplier datapath]
[Pipeline diagrams: an insn following mul reaches W in cycle 6, one cycle before mul's W in cycle 7; with two stall cycles (d*) it reaches W in cycle 8, after mul]
[Pipeline diagrams: multi-cycle operations (E*, E/) overlapping in execute, with structural-hazard stalls (s*)]
• s* = structural hazard, two insns need same structure
  • ISAs and pipelines designed to have few of these
  • Canonical example: all insns forced to go through M stage
[Figure: pipelined datapath with branch target computation (<< 2, added to PC)]
• Branch speculation
  • Could just stall to wait for branch outcome (two-cycle penalty)
  • Fetch past branch insns before branch outcome is known
  • Default: assume not-taken (at fetch, can't tell it's a branch)
Branch Recovery
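[Figure: branch recovery datapath; on a mis-speculated branch, nops replace the insns fetched behind it and fetch is redirected]
[Pipeline diagrams: correct speculation, where the speculative insns continue without delay; mis-speculation, where the speculative insns are flushed (--) and fetch restarts]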
Branch Performance
• Back of the envelope calculation
  • Branch: 20%, load: 20%, store: 10%, other: 50%
  • Say, 75% of branches are taken
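One way to finish this estimate is sketched below. The assumptions are mine rather than stated on this slide: the pipeline predicts not-taken by default, so every taken branch is mis-speculated, and each mis-speculation costs the two-cycle penalty mentioned for branch stalls. The helper name `cpi_with_penalty` is hypothetical.

```python
def cpi_with_penalty(base_cpi, events):
    """events: iterable of (fraction_of_insns, penalty_cycles) pairs."""
    return base_cpi + sum(frac * penalty for frac, penalty in events)


branch_frac = 0.20   # fraction of insns that are branches (from the slide)
taken_frac = 0.75    # fraction of branches that are taken (from the slide)
penalty = 2          # assumed: two cycles lost per mis-speculated branch

cpi = cpi_with_penalty(1.0, [(branch_frac * taken_frac, penalty)])
print(round(cpi, 2))   # 1.3
```

Under these assumptions branches alone add 0.3 cycles to every instruction on average, which is the motivation for predicting branches better than "always not-taken".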
[Figure: datapath with a branch predictor (BP) at fetch supplying a predicted target (TG) PC; nops flush mis-predicted insns]
[Figure: branch history table (BHT) inside the BP, giving a T or NT prediction]
• What about aliasing?
  • Two PCs with the same lower bits?
  • No problem, just a prediction!
One 1-bit entry (state = last outcome):

Time  State  Prediction  Outcome  Result?
1     N      N           T        Wrong
2     T      T           T        Correct
3     T      T           T        Correct
4     T      T           N        Wrong
5     N      N           T        Wrong
6     T      T           T        Correct
7     T      T           T        Correct
8     T      T           N        Wrong
9     N      N           T        Wrong
10    T      T           T        Correct
11    T      T           T        Correct
12    T      T           N        Wrong

Two-bit saturating counter (states N, n, t, T):

Time  State  Prediction  Outcome  Result?
1     N      N           T        Wrong
2     n      N           T        Wrong
3     t      T           T        Correct
4     T      T           N        Wrong
5     t      T           T        Correct
6     T      T           T        Correct
7     T      T           T        Correct
8     T      T           N        Wrong
9     t      T           T        Correct
10    T      T           T        Correct
11    T      T           T        Correct
12    T      T           N        Wrong
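Both tables can be reproduced with one saturating counter of configurable width. This is a minimal sketch for a single branch: `simulate_counter` is an illustrative name, the counter starts at its lowest (strongly not-taken) value, and values in the upper half predict taken.

```python
def simulate_counter(outcomes, bits):
    """Run one saturating counter of the given width over a T/N outcome string."""
    top = (1 << bits) - 1
    counter = 0                               # start at strongly not-taken
    results = []
    for outcome in outcomes:
        predict_taken = counter > top // 2    # upper half of the range predicts taken
        taken = outcome == "T"
        results.append("Correct" if predict_taken == taken else "Wrong")
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return results


pattern = "TTTN" * 3
for bits in (1, 2):
    results = simulate_counter(pattern, bits)
    print(bits, "bit:", results.count("Wrong"), "wrong of", len(results))
# 1 bit: 6 wrong of 12
# 2 bit: 5 wrong of 12
```

With one bit, every not-taken outcome costs two mispredictions (the N itself and the T that follows); the second bit of hysteresis removes the second one once the counter has warmed up.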
Correlated Predictor
• Correlated (two-level) predictor [Patt 1991]
  • Exploits observation that branch outcomes are correlated
  • Maintains separate prediction per (PC, BHR) pairs
  • Branch history register (BHR): recent branch outcomes
• Simple working example: assume program has one branch
  • BHT: one 1-bit DIRP entry
  • BHT+2BHR: 2^2 = four 1-bit DIRP entries
• Why didn't we do better?
  • BHR not long enough to capture pattern
With a 2-bit BHR (four 1-bit entries, one per pattern NN, NT, TN, TT):

Time  Pattern  State (NN NT TN TT)  Prediction  Outcome  Result?
1     NN       N  N  N  N           N           T        Wrong
2     NT       T  N  N  N           N           T        Wrong
3     TT       T  T  N  N           N           T        Wrong
4     TT       T  T  N  T           T           N        Wrong
5     TN       T  T  N  N           N           T        Wrong
6     NT       T  T  T  N           T           T        Correct
7     TT       T  T  T  N           N           T        Wrong
8     TT       T  T  T  T           T           N        Wrong
9     TN       T  T  T  N           T           T        Correct
10    NT       T  T  T  N           T           T        Correct
11    TT       T  T  T  N           N           T        Wrong
12    TT       T  T  T  T           T           N        Wrong

With a 3-bit BHR:

Time  Pattern  Prediction  Outcome  Result?
1     NNN      N           T        Wrong
2     NNT      N           T        Wrong
3     NTT      N           T        Wrong
4     TTT      N           N        Correct
5     TTN      N           T        Wrong
6     TNT      N           T        Wrong
7     NTT      T           T        Correct
8     TTT      N           N        Correct
9     TTN      T           T        Correct
10    TNT      T           T        Correct
11    NTT      T           T        Correct
12    TTT      N           N        Correct
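The two correlated-predictor traces can be regenerated with a few lines. This sketch models only the single-branch example: `simulate_correlated` is an illustrative name, each pattern gets a 1-bit entry that remembers the last outcome seen after that pattern, and both the entries and the BHR start as not-taken.

```python
def simulate_correlated(outcomes, history_bits):
    """One branch, 1-bit entries selected by the last `history_bits` outcomes."""
    table = {}                        # pattern -> last outcome; a missing entry means "N"
    history = "N" * history_bits      # BHR starts as all not-taken
    results = []
    for outcome in outcomes:
        prediction = table.get(history, "N")
        results.append("Correct" if prediction == outcome else "Wrong")
        table[history] = outcome                  # 1-bit entry remembers the last outcome
        history = (history + outcome)[1:]         # shift the newest outcome into the BHR
    return results


pattern = "TTTN" * 3
for bits in (2, 3):
    results = simulate_correlated(pattern, bits)
    print(bits, "history bits:", results.count("Wrong"), "wrong of", len(results))
# 2 history bits: 9 wrong of 12
# 3 history bits: 5 wrong of 12
```

With two bits of history the pattern TT is followed sometimes by T and sometimes by N, so that entry keeps flipping; with three bits every pattern has a unique successor and the predictor stops missing once each entry has been trained.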
• Consider:
    for (i=0; i<1000000; i++) {
      if (i % 3 == 0) {
        // whatever
      }
      if (random() % 2 == 0) {
        if (i % 3 >= 1) {
          // whatever
        }
      }
    }
Hybrid Predictor
• Hybrid (tournament) predictor [McFarling 1993]
  • Attacks correlated predictor BHT capacity problem
  • Idea: combine two predictors
    • Simple BHT predicts history independent branches
    • Correlated predictor predicts only branches that need history
    • Chooser assigns branches to one predictor or the other
    • Branches start in simple BHT, move once mis-predictions exceed a threshold
  + Correlated predictor can be made smaller, handles fewer branches
  + 90-95% accuracy
[Figure: chooser, indexed by PC, selecting between a simple BHT and a BHR-indexed BHT]
501 News
• If submitting HW2 late, email me
• WX bypassing on slide 63
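[Pipeline diagram: one insn flows F D X M W in cycles 1 to 5; the next is not fetched until cycle 3]
[Figure: processor block diagram with I$, branch predictor (BP), and D$]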
• Guess the future PC based on past behavior
  • "Last time the branch X was taken, it went to address Y"
  • "So, in the future, if address X is fetched, fetch address Y next"
• Operation
  • A small RAM: address = PC, data = target-PC
  • Access at Fetch in parallel with instruction memory
    • predicted-target = BTB[hash(PC)]
  • Updated at X whenever target != predicted-target
    • BTB[hash(PC)] = target
  • Hash function is typically just extracting lower bits (as before)
  • Aliasing? No problem, this is only a prediction
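The operation described above maps onto a small table with a predict and an update method. This is a sketch, not the course's reference design: the class name, the 4-bit index, the assumption of 4-byte instructions, and the fallback to PC + 4 for an empty entry are all illustrative choices.

```python
class BranchTargetBuffer:
    """Untagged BTB: a small table indexed by the low bits of the PC."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.targets = [None] * (1 << index_bits)

    def _index(self, pc):
        # hash(PC): drop the byte offset (assumes 4-byte insns), keep the low bits
        return (pc >> 2) & self.mask

    def predict(self, pc):
        """Fetch stage: predicted next PC (assumed fallback: PC + 4)."""
        target = self.targets[self._index(pc)]
        return target if target is not None else pc + 4

    def update(self, pc, target):
        """X stage: overwrite the entry when the real target differs."""
        if self.predict(pc) != target:
            self.targets[self._index(pc)] = target


btb = BranchTargetBuffer()
print(hex(btb.predict(0x100)))   # 0x104: nothing learned yet
btb.update(0x100, 0x200)         # branch at 0x100 went to 0x200
print(hex(btb.predict(0x100)))   # 0x200
print(hex(btb.predict(0x140)))   # 0x200: 0x140 aliases to the same entry
```

The last line shows aliasing in action: the wrong PC gets a target it never produced, which costs at most a misprediction because the real target is still checked at X.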
[Figure sequence: fetch-stage next-PC selection among PC + 4, the BTB's predicted target, and a return address stack (RAS), with the BHT supplying taken/not-taken]
Pipeline Depth
[Figure: pipeline depth (vertical axis, 0 to 35) by year (horizontal axis, 1985 to 2010)]
Summary
[Figure: system stack with Apps, System software, and hardware (Mem, CPU, I/O)]
• Single-cycle & multi-cycle datapaths
• Latency vs throughput & performance
• Basic pipelining
• Data hazards
  • Bypassing
  • Load-use stalling