Pipelining

Computer Organization and Architecture (AT70.01)
Comp. Sc. and Inf. Mgmt. Asian Institute of Technology
Instructor: Dr. Sumanta Guha
Slide Sources: Patterson & Hennessy COD book site (copyright Morgan Kaufmann), adapted and supplemented
COD Ch. 6 Enhancing Performance with Pipelining
Pipelining
Start work ASAP!! Do not waste time!

[Figure: laundry tasks A, B, C, D in task order against a time axis from 6 PM to 2 AM, not pipelined]
Assume 30 min. for each task (wash, dry, fold, store) and that separate tasks use separate hardware and so can be overlapped
[Figure: the same tasks A, B, C, D against the same time axis, pipelined]
Pipelined vs. Single-Cycle Instruction Execution: the Plan

[Figure: single-cycle execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) on a time axis from 2 to 18 ns. Each instruction passes through instruction fetch, Reg, ALU, data access, Reg and takes 8 ns before the next one starts]
Assume 2 ns for memory access and ALU operation, and 1 ns for register access: therefore, the single-cycle clock is 8 ns and the pipelined clock cycle is 2 ns.
[Figure: pipelined execution of the same three lw instructions on a time axis up to 14 ns. Each instruction starts 2 ns after the previous one]
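The timing comparison can be checked with a short calculation. This is a minimal sketch, assuming the slide's figures (8 ns single-cycle clock, 2 ns pipelined clock, 5 stages) and an ideal pipeline with no stalls; the function names are illustrative only.

```python
def single_cycle_time_ns(n_instructions, clock_ns=8):
    """Total time when every instruction takes one long clock cycle."""
    return n_instructions * clock_ns


def pipelined_time_ns(n_instructions, stages=5, clock_ns=2):
    """Ideal pipeline: fill it once, then one instruction finishes per cycle."""
    if n_instructions == 0:
        return 0
    return (stages + n_instructions - 1) * clock_ns


for n in (3, 1000):
    s = single_cycle_time_ns(n)
    p = pipelined_time_ns(n)
    print(f"{n} instructions: single-cycle {s} ns, pipelined {p} ns, "
          f"speedup {s / p:.2f}")
```

For the three lw instructions this gives 24 ns against 14 ns. For long programs the ratio approaches 8/2 = 4 rather than 5, because the 1 ns register stages are padded to the 2 ns clock.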
Pipelining: Keep in Mind

Pipelining does not reduce the latency of a single task; it increases the throughput of the entire workload
Pipeline rate is limited by the longest stage
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it, when there is slack in the pipeline, reduce speedup
Example Problem
Problem: for the laundry, fill in the following table when
1. the stage lengths are 30, 30, 30, 30 min., resp.
2. the stage lengths are 20, 20, 60, 20 min., resp.

Person   Unpipelined finish time   Pipeline 1 finish time   Ratio unpipelined to pipeline 1   Pipeline 2 finish time   Ratio unpipelined to pipeline 2
1
2
3
4

Come up with a formula for pipeline speed-up!
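One way to fill in the table is sketched below. It assumes a clocked pipeline, as in a processor: every stage slot lasts as long as the slowest stage, so person i finishes after (k + i - 1) slots for k stages. This is an assumption about the intended model; a looser model in which each stage starts as soon as it is free would give different numbers for case 2. The function names are illustrative.

```python
def unpipelined_finish(stage_minutes, n_people):
    """Finish times when the loads are done strictly one after another."""
    total = sum(stage_minutes)
    return [total * i for i in range(1, n_people + 1)]


def pipelined_finish(stage_minutes, n_people):
    """Finish times in a clocked pipeline: each slot lasts as long as the slowest stage."""
    slot = max(stage_minutes)
    k = len(stage_minutes)
    return [slot * (k + i - 1) for i in range(1, n_people + 1)]


def table(stage_minutes, n_people=4):
    """Rows of (person, unpipelined finish, pipelined finish, ratio)."""
    unpiped = unpipelined_finish(stage_minutes, n_people)
    piped = pipelined_finish(stage_minutes, n_people)
    return [(i + 1, u, p, round(u / p, 2))
            for i, (u, p) in enumerate(zip(unpiped, piped))]


for stages in ([30, 30, 30, 30], [20, 20, 60, 20]):
    print(stages)
    for row in table(stages):
        print("  person %d: unpipelined %d min, pipelined %d min, ratio %.2f" % row)
```

Under this model the speed-up for n tasks is n * sum(stages) / ((k + n - 1) * max(stages)). With balanced stages it reduces to n * k / (k + n - 1), which approaches k as n grows.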
Pipelining MIPS
What makes it easy with MIPS?
all instructions are the same length
  so fetch and decode stages are similar for all instructions
just a few instruction formats
  simplifies instruction decode and makes it possible in one stage
memory operands appear only in load/stores
  so memory access can be deferred to exactly one later stage
operands are aligned in memory
  one data transfer instruction requires one memory access stage
Pipelining MIPS
What makes it hard?
structural hazards: different instructions, at different stages in the pipeline, want to use the same hardware resource
control hazards: the succeeding instruction to put into the pipeline depends on the outcome of a previous branch instruction already in the pipeline
data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline
Before actually building the pipelined datapath and control, we first briefly examine these potential hazards individually.
Structural Hazards
Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline in the same clock cycle
E.g., suppose the pipeline below has a single memory with one read port, rather than separate instruction and data memories: then there is a structural hazard between the first and fourth lw instructions
[Figure: pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0); lw $4, 400($0), each starting 2 ns after the previous one. Hazard if single memory: the data access of the first lw and the instruction fetch of the fourth lw fall in the same cycle]
MIPS was designed to be pipelined: structural hazards are easy to avoid!
Control Hazards
Control hazard: need to make a decision based on the result of a previous instruction still executing in the pipeline
Solution 1: Stall the pipeline
[Figure: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) on a time axis from 2 to 16 ns. beq starts 2 ns after add; a bubble (pipeline stall) then delays lw so that it starts 4 ns after beq]
Note that the branch outcome is computed in the ID stage with added hardware (later)
Control Hazards
Solution 2: Predict the branch outcome, e.g., predict branch-not-taken
[Figure: prediction success. add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) start 2 ns apart with no bubble]
[Figure: prediction failure. After add $4, $5, $6 and beq $1, $2, 40, the lw is undone (flushed) and becomes a bubble in every stage; or $7, $8, $9 starts 4 ns after beq]
Control Hazards
Solution 3: Delayed branch: always execute the sequentially next statement, with the branch executing after a one-instruction delay. It is the compiler's job to find a statement that can be put in the slot and that is independent of the branch outcome
MIPS does this, but it is an option in SPIM (Simulator -> Settings)
[Figure: beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), each starting 2 ns after the previous one]
Delayed branch: beq is followed by an add that is independent of the branch outcome
Data Hazards
Data hazard: an instruction needs data from the result of a previous instruction still executing in the pipeline
Solution: Forward data if possible
[Figure: instruction pipeline diagram for add $s0, $t0, $t1 through IF, ID, EX, MEM, WB; shading indicates use: left = write, right = read]
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM, WB]
Without forwarding (blue line), data has to go back in time; with forwarding (red line), data is available in time
Data Hazards
Forwarding may not be enough,
e.g., if an R-type instruction following a load uses the result of the load: this is called a load-use data hazard
[Figure: lw $s0, 20($t1) followed immediately by sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM, WB]
Without a stall it is impossible to provide input to the sub instruction in time
[Figure: the same two instructions with a bubble delaying sub by one stage]
With a one-stage stall, forwarding can get the data to the sub instruction in time
Reordering Code to Avoid Pipeline Stall (Software Solution)

Example:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)    # data hazard: $t2 is used right after it is loaded
sw $t0, 4($t1)
Reordered code (the two sw instructions are interchanged):
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
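The effect of the reordering can be checked mechanically. The sketch below is only an illustration, not a compiler pass: it assumes a load-use hazard arises exactly when the instruction immediately after a lw reads the loaded register, and it parses only lw and sw. The helper names are hypothetical.

```python
import re


def parse(instr):
    """Return (opcode, destination register or None, source registers) for lw/sw."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "lw":   # lw rt, offset(base): writes rt, reads base
        return op, regs[0], [regs[1]]
    if op == "sw":   # sw rt, offset(base): reads rt and base, writes nothing
        return op, None, regs
    raise ValueError("this sketch only handles lw and sw: " + instr)


def load_use_hazards(program):
    """Index pairs (i, i+1) where instruction i+1 reads the register loaded at i."""
    parsed = [parse(line) for line in program]
    hazards = []
    for i in range(len(parsed) - 1):
        op, dest, _ = parsed[i]
        sources = parsed[i + 1][2]
        if op == "lw" and dest in sources:
            hazards.append((i, i + 1))
    return hazards


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

print(load_use_hazards(original))   # [(1, 2)]
print(load_use_hazards(reordered))  # []
```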
Pipelined Datapath
We now move to actually building a pipelined datapath. First recall the 5 steps in instruction execution:
1. Instruction Fetch & PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: single-cycle processor

all 5 steps done in a single clock cycle
dedicated hardware required for each step
What happens if we break the execution into multiple cycles, but keep the extra hardware?
Review - Single-Cycle Datapath Steps

[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders and multiplexors) divided into the five steps: Instruction Fetch (IF), Instruction Decode (ID), Execute/Address Calc. (EX), Memory Access (MEM), Write Back (WB)]
Pipelined Datapath Key Idea
What happens if we break the execution into multiple cycles, but keep the extra hardware?
Answer: We may be able to start executing a new instruction at each clock cycle: this is pipelining
but we shall need extra registers to hold data between cycles: these are the pipeline registers
Pipelined Datapath
Pipeline registers are wide enough to hold the data coming in
[Figure: pipelined datapath with pipeline registers IF/ID (64 bits), ID/EX (128 bits), EX/MEM (97 bits), MEM/WB (64 bits)]
Only data flowing right to left may cause hazard, why?
Bug in the Datapath

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Write register number comes from another, later instruction!
Corrected Datapath
[Figure: corrected pipelined datapath; pipeline register widths: IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits]
Destination register number is also passed through ID/EX, EX/MEM and MEM/WB registers, which are now wider by 5 bits
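The register widths can be tallied field by field. The field breakdown below is an assumption, chosen to be consistent with the totals shown on the slides (64, 128, 97, 64 bits before the fix); only the totals and the extra 5 bits come from the slides.

```python
# Fields latched in each pipeline register of the original datapath (widths in bits).
# ASSUMPTION: this breakdown is inferred so that the sums match the slide totals.
ORIGINAL = {
    "IF/ID":  {"PC+4": 32, "instruction": 32},
    "ID/EX":  {"PC+4": 32, "read data 1": 32, "read data 2": 32,
               "sign-extended immediate": 32},
    "EX/MEM": {"branch target": 32, "Zero": 1, "ALU result": 32, "read data 2": 32},
    "MEM/WB": {"memory read data": 32, "ALU result": 32},
}

WRITE_REG_BITS = 5  # a register number selects one of 32 registers


def widths(carry_write_register):
    """Total width of each pipeline register, with or without the write register number."""
    result = {}
    for name, fields in ORIGINAL.items():
        total = sum(fields.values())
        # The destination register number is only known from ID onward.
        if carry_write_register and name != "IF/ID":
            total += WRITE_REG_BITS
        result[name] = total
    return result


print(widths(False))  # original datapath
print(widths(True))   # corrected datapath
```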
Pipelined Example
Consider the following instruction sequence:
lw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
sub $t8, $t9, $t10

Single-Clock-Cycle Diagrams: Clock Cycles 1 to 8
Clock cycle   IF    ID    EX    MEM   WB
1             LW
2             SW    LW
3             ADD   SW    LW
4             SUB   ADD   SW    LW
5                   SUB   ADD   SW    LW
6                         SUB   ADD   SW
7                               SUB   ADD
8                                     SUB
Alternative View Multiple-Clock-Cycle Diagram

Time axis             CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw $t0, 10($t1)       IM    REG   ALU   DM    REG
sw $t3, 20($t4)             IM    REG   ALU   DM    REG
add $t5, $t6, $t7                 IM    REG   ALU   DM    REG
sub $t8, $t9, $t10                      IM    REG   ALU   DM    REG
Notes
One significant difference in the execution of an R-type instruction between multicycle and pipelined implementations:
register write-back for the R-type instruction is the 5th (the last, write-back) pipeline stage vs. the 4th stage for the multicycle implementation. Why? Think of structural hazards when writing to the register file
Worth repeating: the essential difference between the pipeline and multicycle implementations is the insertion of pipeline registers to decouple the 5 stages
The CPI of an ideal pipeline (no stalls) is 1. Why?
The RaVi Architecture Visualization Project of Dortmund U. has pipeline simulations: see the link in our Additional Resources page
As we develop control for the pipeline, keep in mind that the text does not consider jump: it should not be too hard to implement!
Recall Single-Cycle Control: the Datapath

[Figure: single-cycle datapath with control. The Control unit decodes Instruction [31-26] and drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; ALU control decodes Instruction [5-0]; PCSrc selects between PC + 4 and the branch target]
Recall Single-Cycle ALU Control

Instruction opcode   ALUOp   Instruction operation   Funct field   Desired ALU action   ALU control input
LW                   00      load word               xxxxxx        add                  010
SW                   00      store word              xxxxxx        add                  010
Branch eq            01      branch eq               xxxxxx        subtract             110
R-type               10      add                     100000        add                  010
R-type               10      subtract                100010        subtract             110
R-type               10      AND                     100100        and                  000
R-type               10      OR                      100101        or                   001
R-type               10      set on less             101010        set on less          111

Truth table for ALU control bits
ALUOp1 ALUOp0   F5 F4 F3 F2 F1 F0   Operation
0      0        X  X  X  X  X  X    010
0      1        X  X  X  X  X  X    110
1      X        X  X  0  0  0  0    010
1      X        X  X  0  0  1  0    110
1      X        X  X  0  1  0  0    000
1      X        X  X  0  1  0  1    001
1      X        X  X  1  0  1  0    111
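The truth table for the ALU control bits can be written as a small function. This is a sketch of the table only, not of any particular hardware implementation; the function name and the use of bit strings for the result are choices made here for readability.

```python
def alu_control(alu_op, funct=0):
    """Return the 3-bit ALU control value as a string.

    alu_op is the 2-bit ALUOp from the main control unit;
    funct is the 6-bit funct field (only used for R-type, ALUOp1 = 1).
    """
    alu_op1, alu_op0 = (alu_op >> 1) & 1, alu_op & 1
    if alu_op1 == 0:
        # lw/sw add to form the address; beq subtracts to compare.
        return "110" if alu_op0 else "010"
    # R-type: F5 and F4 are don't-cares, so only F3..F0 are examined.
    by_low_bits = {0b0000: "010", 0b0010: "110", 0b0100: "000",
                   0b0101: "001", 0b1010: "111"}
    low4 = funct & 0b1111
    if low4 not in by_low_bits:
        raise ValueError("funct value not covered by the truth table")
    return by_low_bits[low4]


print(alu_control(0b00))            # lw / sw
print(alu_control(0b01))            # beq
print(alu_control(0b10, 0b101010))  # slt
```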
Recall Single-Cycle Control Signals

Effect of control bits
Signal name: effect when deasserted / effect when asserted
RegDst: The register destination number for the Write register comes from the rt field (bits 20-16) / The register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite: None / The register on the Write register input is written with the value on the Write data input
ALUSrc: The second ALU operand comes from the second register file output (Read data 2) / The second ALU operand is the sign-extended, lower 16 bits of the instruction
PCSrc: The PC is replaced by the output of the adder that computes the value of PC + 4 / The PC is replaced by the output of the adder that computes the branch target
MemRead: None / Data memory contents designated by the address input are put on the Read data output
MemWrite: None / Data memory contents designated by the address input are replaced by the value of the Write data input
MemtoReg: The value fed to the register Write data input comes from the ALU / The value fed to the register Write data input comes from the data memory

Determining control bits
Instruction   RegDst   ALUSrc   MemtoReg   RegWrite   MemRead   MemWrite   Branch   ALUOp1   ALUOp0
R-format      1        0        0          1          0         0          0        1        0
lw            0        1        1          1          1         0          0        0        0
sw            X        1        X          0          0         1          0        0        0
beq           X        0        X          0          0         0          1        0        1
Pipeline Control

Initial design motivated by single-cycle datapath control: use the same control signals
Observe:

No separate write signal for the PC as it is written every cycle
No separate write signals for the pipeline registers as they are written every cycle
(The two points above will be modified by the hazard detection unit!!)
No separate read signal for instruction memory as it is read every clock cycle
No separate read signal for register file as it is read every clock cycle
Need to set control signals during each pipeline stage
Since control signals are associated with components active during a single pipeline stage, can group control lines into five groups according to pipeline stage
Pipelined Datapath with Control I

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg]
Same control signals as the single-cycle datapath
Pipeline Control Signals
There are five stages in the pipeline:
instruction fetch / PC increment: nothing to control as instruction memory read and PC write are always enabled
instruction decode / register fetch
execution / address calculation
memory access
write back

              Execution/Address Calculation        Memory access                  Write-back
              stage control lines                  stage control lines            stage control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc       Branch  MemRead  MemWrite      RegWrite  MemtoReg
R-format      1       1       0       0            0       0        0             1         0
lw            0       0       0       1            0       1        0             1         1
sw            X       0       0       1            0       0        1             0         X
beq           X       0       1       0            1       0        0             0         X
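The grouping of control lines by stage can be sketched as a lookup table plus a function that drops each group once its stage has used it. The bit strings follow the table above; the dictionary layout and function names are illustrative assumptions, not part of the textbook design.

```python
# Control values per instruction class, grouped by the stage that uses them.
# "X" marks a don't-care. Field order within each group:
#   EX: RegDst, ALUOp1, ALUOp0, ALUSrc
#   M:  Branch, MemRead, MemWrite
#   WB: RegWrite, MemtoReg
CONTROL = {
    "R-format": {"EX": "1100", "M": "000", "WB": "10"},
    "lw":       {"EX": "0001", "M": "010", "WB": "11"},
    "sw":       {"EX": "X001", "M": "001", "WB": "0X"},
    "beq":      {"EX": "X010", "M": "100", "WB": "0X"},
}


def id_ex_control(instruction_class):
    """Control groups written into ID/EX when the instruction is decoded."""
    return dict(CONTROL[instruction_class])


def advance(control_groups):
    """Drop the group consumed by the stage just completed (EX, then M, then WB)."""
    for used in ("EX", "M", "WB"):
        if used in control_groups:
            return {k: v for k, v in control_groups.items() if k != used}
    return {}


id_ex = id_ex_control("lw")
ex_mem = advance(id_ex)
mem_wb = advance(ex_mem)
print(id_ex, ex_mem, mem_wb, sep="\n")
```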
Pipeline Control Implementation
Pass control signals along just like the data: extend each pipeline register to hold the needed control bits for succeeding stages
[Figure: Control generates the EX, M and WB control fields from the instruction in IF/ID; ID/EX holds WB, M and EX; EX/MEM holds WB and M; MEM/WB holds WB]
Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6 least significant bits of the immediate field, which is sign-extended and passed from the IF/ID register to the ID/EX register
Pipelined Datapath with Control II

[Figure: pipelined datapath with the Control unit in the ID stage and the WB, M and EX control fields carried in ID/EX, EX/MEM and MEM/WB]
Control signals emanate from the control portions of the pipeline registers
Pipelined Execution and Control
Instruction sequence:
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9

[Figures: pipelined datapath with control, one diagram per clock cycle, clock cycles 1 to 9. The stage contents in each diagram are:]
Clock cycle   IF                 ID                 EX               MEM              WB
1             lw $10, 20($1)     before<1>          before<2>        before<3>        before<4>
2             sub $11, $2, $3    lw $10, 20($1)     before<1>        before<2>        before<3>
3             and $12, $4, $5    sub $11, $2, $3    lw $10, ...      before<1>        before<2>
4             or $13, $6, $7     and $12, $4, $5    sub $11, ...     lw $10, ...      before<1>
5             add $14, $8, $9    or $13, $6, $7     and $12, ...     sub $11, ...     lw $10, ...
6             after<1>           add $14, $8, $9    or $13, ...      and $12, ...     sub $11, ...
7             after<2>           after<1>           add $14, ...     or $13, ...      and $12, ...
8             after<3>           after<2>           after<1>         add $14, ...     or $13, ...
9             after<4>           after<3>           after<2>         after<1>         add $14, ...

Label before<i> means i th instruction before lw
Label after<i> means i th instruction after add
In clock cycle 2 the control fields generated for lw are WB = 11, M = 010, EX = 0001; in clock cycle 3 those generated for sub are WB = 10, M = 000, EX = 1100
Revisiting Hazards
So far our datapath and control have ignored hazards. We shall revisit data hazards and control hazards and enhance our datapath and control to handle them in hardware.
Data Hazards and Forwarding
Problem with starting an instruction before previous ones are finished:
data dependencies that go backward in time, called data hazards

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

$2 = 10 before sub; $2 = -20 after sub
[Figure: multiple-clock-cycle diagram of the five instructions, CC 1 to CC 9, each passing through IM, Reg, ALU, DM, Reg. Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9]
Software Solution
Have the compiler guarantee that there are never any data hazards,
by rearranging instructions to insert independent instructions between instructions that would otherwise have a data hazard between them, or, if such rearrangement is not possible, by inserting nops

With independent instructions inserted:
sub $2, $1, $3
lw $10, 40($3)
slt $5, $6, $7
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

or, with nops inserted:
sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Such compiler solutions may not always be possible, and nops slow the machine down
MIPS: nop = no operation = 00...0 (32 bits) = sll $0, $0, 0
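That the all-zero word decodes as sll $0, $0, 0 can be seen by packing the R-format fields. This is a small sketch using the standard 6/5/5/5/5/6-bit R-format layout; the helper name is illustrative.

```python
def encode_r_type(opcode, rs, rt, rd, shamt, funct):
    """Pack the six R-format fields (6, 5, 5, 5, 5, 6 bits) into a 32-bit word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# sll has opcode 0 and funct 0; with every register $0 and shift amount 0,
# all six fields are 0, so the whole word is 0.
nop = encode_r_type(opcode=0, rs=0, rt=0, rd=0, shamt=0, funct=0)
print(f"{nop:032b}")
```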
Hardware Solution: Forwarding
Idea: use intermediate data, do not wait for the result to be finally written to the destination register. Two steps:
1. Detect data hazard
2. Forward intermediate data to resolve hazard
Pipelined Datapath with Control II (as before)

[Figure: pipelined datapath with control, repeated from the earlier slide of the same title]
Hazard Detection
Hazard conditions:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
E.g., in the earlier example, the first hazard, between sub $2, $1, $3 and and $12, $2, $5, is detected when the and is in the EX stage and the sub is in the MEM stage because
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)
Whether to forward also depends on (here "later instruction" means the one further along in the pipeline):
if the later instruction is going to write a register: if not, no need to forward, even if there is a register number match as in the conditions above
if the destination register of the later instruction is $0: in which case there is no need to forward the value ($0 is always 0 and never overwritten)
Data Forwarding Plan:

allow inputs to the ALU not just from ID/EX, but also from later pipeline registers, and use multiplexors and control signals to choose the appropriate inputs to the ALU

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

[Figure: multiple-clock-cycle diagram of the five instructions, CC 1 to CC 9. Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9. Value of EX/MEM: -20 in CC 4, X otherwise. Value of MEM/WB: -20 in CC 5, X otherwise]
Dependencies between pipeline registers move forward in time
Forwarding Hardware
[Figure a: No forwarding. Datapath before adding forwarding hardware: registers, ALU and data memory between ID/EX, EX/MEM and MEM/WB]
[Figure b: With forwarding. Datapath after adding forwarding hardware: multiplexors controlled by ForwardA and ForwardB select the ALU inputs; the forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd]
Forwarding Hardware: Multiplexor Control

Mux control     Source    Explanation
ForwardA = 00   ID/EX     The first ALU operand comes from the register file
ForwardA = 10   EX/MEM    The first ALU operand is forwarded from the prior ALU result
ForwardA = 01   MEM/WB    The first ALU operand is forwarded from data memory or an earlier ALU result (*)
ForwardB = 00   ID/EX     The second ALU operand comes from the register file
ForwardB = 10   EX/MEM    The second ALU operand is forwarded from the prior ALU result
ForwardB = 01   MEM/WB    The second ALU operand is forwarded from data memory or an earlier ALU result (*)
(*) Depending on the selection in the rightmost multiplexor (see datapath with control diagram)
Data Hazard: Detection and Forwarding
Forwarding unit determines multiplexor control according to the following rules:

1. EX hazard
if ( EX/MEM.RegWrite                               // if there is a write
and ( EX/MEM.RegisterRd ≠ 0 )                      // to a non-$0 register
and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) )     // which matches, then
    ForwardA = 10
if ( EX/MEM.RegWrite                               // if there is a write
and ( EX/MEM.RegisterRd ≠ 0 )                      // to a non-$0 register
and ( EX/MEM.RegisterRd = ID/EX.RegisterRt ) )     // which matches, then
    ForwardB = 10
Data Hazard: Detection and Forwarding

2. MEM hazard
if ( MEM/WB.RegWrite                               // if there is a write
and ( MEM/WB.RegisterRd ≠ 0 )                      // to a non-$0 register
and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRs )       // and not already a register match with earlier pipeline register
and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) )     // but match with later pipeline register, then
    ForwardA = 01
if ( MEM/WB.RegWrite                               // if there is a write
and ( MEM/WB.RegisterRd ≠ 0 )                      // to a non-$0 register
and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRt )       // and not already a register match with earlier pipeline register
and ( MEM/WB.RegisterRd = ID/EX.RegisterRt ) )     // but match with later pipeline register, then
    ForwardB = 01
The check against EX/MEM.RegisterRd is necessary, e.g., for sequences such as add $1, $1, $2; add $1, $1, $3; add $1, $1, $4; (array summing?), where an earlier pipeline (EX/MEM) register has more recent data
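The EX and MEM hazard rules translate directly into a function. The sketch below follows the conditions exactly as stated on these slides; register numbers are plain integers and the result is returned as a pair of 2-bit strings. The function and parameter names are illustrative.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd):
    """Return (ForwardA, ForwardB) following the EX and MEM hazard rules."""

    def select(source_reg):
        # EX hazard: the instruction one ahead writes the register we need.
        if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == source_reg:
            return "10"
        # MEM hazard: the instruction two ahead writes it, and EX/MEM
        # does not hold a more recent value for the same register.
        if (mem_wb_reg_write and mem_wb_rd != 0
                and ex_mem_rd != source_reg and mem_wb_rd == source_reg):
            return "01"
        return "00"  # no forwarding: use the value read from the register file

    return select(id_ex_rs), select(id_ex_rt)


# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM
print(forwarding_unit(2, 5, True, 2, False, 0))
# or $13, $6, $2 in EX while and $12 is in MEM and sub $2 is in WB
print(forwarding_unit(6, 2, True, 12, True, 2))
# third add of add $1,$1,$2; add $1,$1,$3; add $1,$1,$4: EX/MEM is more recent
print(forwarding_unit(1, 4, True, 1, True, 1))
```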
Forwarding Hardware with Control

Called forwarding unit, not hazard detection unit, because once data is forwarded there is no hazard!
[Figure: datapath with forwarding hardware and control wires. The forwarding unit receives Rs and Rt (IF/ID.RegisterRs and IF/ID.RegisterRt carried through ID/EX), EX/MEM.RegisterRd and MEM/WB.RegisterRd. Certain details, e.g., branching hardware, are omitted to simplify the drawing]
Note: so far we have only handled forwarding to R-type instructions!
Forwarding
Execution example:
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2

[Figures: datapath with forwarding unit, one diagram per clock cycle, clock cycles 3 to 6. The stage contents in each diagram are:]
Clock cycle   IF                ID                EX                MEM            WB
3             or $4, $4, $2     and $4, $2, $5    sub $2, $1, $3    before<1>      before<2>
4             add $9, $4, $2    or $4, $4, $2     and $4, ...       sub $2, ...    before<1>
5             after<1>          add $9, $4, $2    or $4, ...        and $4, ...    sub $2, ...
6             after<2>          after<1>          add $9, ...       or $4, ...     and $4, ...
Data Hazards and Stalls
Load word can still cause a hazard:
an instruction tries to read a register following a load instruction that writes to the same register

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

[Figure: multiple-clock-cycle diagram of the five instructions, CC 1 to CC 9, each passing through IM, Reg, ALU, DM, Reg]
As even a pipeline dependency goes backward in time, forwarding will not solve the hazard;
therefore, we need a hazard detection unit to stall the pipeline after the load instruction
Pipelined Datapath with Control II (as before)

[Figure: pipelined datapath with control, repeated from the earlier slide of the same title]
Hazard Detection Logic to Stall
Hazard detection unit implements the following check to decide whether to stall:
if ( ID/EX.MemRead                                  // if the instruction in the EX stage is a load
and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs )       // and the destination register
or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) )      // matches either source register of the instruction in the ID stage, then
    stall the pipeline
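The stall check is a single boolean expression. A minimal sketch, with register numbers as integers and an illustrative function name:

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in EX is a load whose destination (rt)
    is a source register of the instruction in ID."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2, 20($1) in EX, and $4, $2, $5 in ID: rs = $2 matches the load destination
print(must_stall(True, 2, 2, 5))
# sub $2, $1, $3 in EX is not a load, so forwarding alone is enough
print(must_stall(False, 2, 2, 5))
```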
Mechanics of Stalling

If the stall check succeeds, then the pipeline needs to stall only 1 clock cycle after the load, as after that the forwarding unit can resolve the dependency
What the hardware does to stall the pipeline 1 cycle:
does not let the IF/ID register change (disable write!): this will cause the instruction in the ID stage to repeat, i.e., stall
therefore, the instruction just behind, in the IF stage, must be stalled as well, so hardware does not let the PC change (disable write!): this will cause the instruction in the IF stage to repeat, i.e., stall
changes all the EX, MEM and WB control fields in the ID/EX pipeline register to 0, so effectively the instruction just behind the load becomes a nop: a bubble is said to have been inserted into the pipeline
note that we cannot turn that instruction into a nop by 0ing all the bits in the instruction itself (recall nop = 00...0 (32 bits)) because it has already been decoded and control signals generated
Hazard Detection Unit

[Figure: datapath with forwarding hardware, the hazard detection unit and control wires. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite and a multiplexor that selects 0 in place of the control values. Certain details, e.g., branching hardware, are omitted to simplify the drawing]
Stalling Resolves a Hazard
Same instruction sequence as before, for which forwarding by itself could not resolve the hazard:

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

[Figure: multiple-clock-cycle diagram, CC 1 to CC 10. The and instruction repeats its Reg stage and the or instruction repeats its IM stage, so a bubble follows lw through the pipeline]
Hazard detection unit inserts a 1-cycle bubble in the pipeline, after which all pipeline register dependencies go forward, so the forwarding unit can then handle them and there are no more hazards
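The stall can be traced with a small stage-occupancy simulation. This is a sketch under simplifying assumptions: only the load-use check is modeled (forwarding is assumed to resolve every other dependency), and each instruction is described by hand as a tuple of text, destination register, source registers and a load flag. All names are illustrative.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "text dest sources is_load")
BUBBLE = Instr("bubble", None, (), False)


def simulate(program):
    """Return one row per clock cycle listing the contents of IF, ID, EX, MEM, WB."""
    stages = [None] * 5   # None marks an empty stage
    next_fetch = 0
    rows = []
    while True:
        in_if, in_id, in_ex, in_mem, _ = stages
        stall = (in_ex is not None and in_ex.is_load
                 and in_id is not None and in_ex.dest in in_id.sources)
        if stall:
            # PC and IF/ID are not written; zeroed control enters EX as a bubble.
            stages = [in_if, in_id, BUBBLE, in_ex, in_mem]
        else:
            fetched = program[next_fetch] if next_fetch < len(program) else None
            if fetched is not None:
                next_fetch += 1
            stages = [fetched, in_if, in_id, in_ex, in_mem]
        if all(s is None or s is BUBBLE for s in stages):
            break
        rows.append([s.text if s is not None else "-" for s in stages])
    return rows


program = [
    Instr("lw $2, 20($1)", 2, (1,), True),
    Instr("and $4, $2, $5", 4, (2, 5), False),
    Instr("or $8, $2, $6", 8, (2, 6), False),
    Instr("add $9, $4, $2", 9, (4, 2), False),
    Instr("slt $1, $6, $7", 1, (6, 7), False),
]

rows = simulate(program)
for cycle, row in enumerate(rows, start=1):
    print(f"CC {cycle:2d}: " + " | ".join(row))
```

The run takes 10 clock cycles, one more than without the stall. In CC 4 the and and or instructions repeat their stages while the bubble occupies EX.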
Stalling
Execution example:
lw $2, 20($1)
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2

[Figures: datapath with forwarding unit and hazard detection unit, one diagram per clock cycle, clock cycles 2 to 7. The stage contents in each diagram are:]
Clock cycle   IF                ID                EX                MEM            WB
2             and $4, $2, $5    lw $2, 20($1)     before<1>         before<2>      before<3>
3             or $4, $4, $2     and $4, $2, $5    lw $2, 20($1)     before<1>      before<2>
4             or $4, $4, $2     and $4, $2, $5    bubble            lw $2, ...     before<1>
5             add $9, $4, $2    or $4, $4, $2     and $4, $2, $5    bubble         lw $2, ...
6             after<1>          add $9, $4, $2    or $4, $4, $2     and $4, ...    bubble
7             after<2>          after<1>          add $9, $4, $2    or $4, ...     and $4, ...
Control (or Branch) Hazards
The problem with branches in the pipeline we have so far is that the branch decision is not made until the MEM stage, so which instructions, if any, should we insert into the pipeline following the branch instruction?
Possible solution: stall the pipeline until the branch decision is known
not efficient: slows the pipeline significantly!
Another solution: predict the branch outcome
e.g., always predict branch-not-taken and continue with the next sequential instructions
if the prediction is wrong, flush the pipeline behind the branch: discard instructions already fetched or decoded and continue execution at the branch target
Predicting Branch-not-taken: Misprediction delay
Program execution order (in instructions):
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
[Figure: multiple-clock-cycle pipeline diagram, CC 1 to CC 9, with stages IM, Reg, ALU, DM, Reg for each instruction.]
The branch outcome, taken (so the prediction was wrong), is decided only when beq is in the MEM stage, so the three sequential instructions that followed it into the pipeline have to be flushed, and execution resumes at lw
Optimizing the Pipeline to Reduce Branch Delay
Move the branch decision from the MEM stage (as in our current pipeline) earlier to the ID stage
calculating the branch target address involves moving the branch adder from the MEM stage to the ID stage
the inputs to this adder, the PC value and the immediate field, are already available in the IF/ID pipeline register
calculating the branch decision is done efficiently, e.g., for an equality test, by XORing respective bits, ORing all the results, and inverting, rather than using the ALU to subtract and then test for zero (which incurs a carry delay)
with the more efficient equality test we can put it in the ID stage without significantly lengthening this stage; remember that an objective of pipeline design is to keep pipeline stages balanced
we must correspondingly make additions to the forwarding and hazard detection units to forward to, or stall, the branch at the ID stage in case the branch decision depends on an earlier result
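The XOR, OR, invert equality test described above can be modelled bit by bit. This is an illustrative sketch only: the function name `regs_equal` and the default 32-bit width are assumptions, and real hardware evaluates all bits in parallel rather than in a loop.

```python
def regs_equal(a, b, width=32):
    """Model the ID-stage equality test: XOR bits, OR the results, invert."""
    mask = (1 << width) - 1
    diff = (a ^ b) & mask            # XOR: a 1 wherever the operands differ
    any_diff = 0
    for i in range(width):           # OR all the XOR outputs together
        any_diff |= (diff >> i) & 1
    return any_diff ^ 1              # invert: 1 means the registers are equal


print(regs_equal(5, 5))   # operands match
print(regs_equal(5, 7))   # operands differ in one bit
```

No carry has to propagate across bit positions here, which is why this test is cheaper than subtracting in the ALU and checking for zero.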
Flushing on Misprediction
Same strategy as for stalling on a load-use data hazard
zero out all the control values (or the instruction itself) in the pipeline registers for the instructions following the branch that are already in the pipeline, effectively turning them into nops, so they are flushed
in the optimized pipeline, with the branch decision made in the ID stage, we have to flush only one instruction, the one in the IF stage; the branch delay penalty is then only one clock cycle
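The relation between the stage where the branch is decided and the number of instructions to flush can be written down directly. This is a small sketch assuming the five-stage pipeline of these slides and one instruction fetched per cycle behind the branch; `flushed_on_mispredict` is a hypothetical helper name.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def flushed_on_mispredict(decision_stage):
    """Number of younger instructions already in the pipeline when the
    branch is resolved in decision_stage: one per stage behind the branch."""
    return STAGES.index(decision_stage)


print(flushed_on_mispredict("MEM"))  # original pipeline
print(flushed_on_mispredict("ID"))   # optimized pipeline
```

Under these assumptions the count is 3 for a decision in MEM and 1 for a decision in ID, matching the penalties stated on the slides.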
Optimized Datapath for Branch
[Figure: optimized datapath for branch, showing the IF.Flush control, hazard detection unit, branch adder with shift left 2 and sign extend in the ID stage, registers, ALU, and data memory.]
The IF.Flush control zeros out the instruction in the IF/ID pipeline register (the instruction that follows the branch)
The branch decision is moved from the MEM stage to the ID stage; the drawing is simplified and does not show the enhancements to the forwarding and hazard detection units
Pipelined Branch
Execution example:
36 sub $10, $4, $8
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $2, $6
52 add $14, $4, $2
56 slt $15, $6, $7
...
72 lw $4, 50($7)
[Figures: datapath snapshots for clock cycles 3 and 4. In clock cycle 3 beq is in ID, the branch adder computes the target 44 + 28 = 72, and IF.Flush is asserted. The instruction occupying each stage is summarized below.]
Clock  IF               ID             EX               MEM              WB
3      and $12, $2, $5  beq $1, $3, 7  sub $10, $4, $8  before<1>        before<2>
4      lw $4, 50($7)    bubble (nop)   beq $1, $3, 7    sub $10, . . .   before<1>
Optimized pipeline with only one bubble as a result of the taken branch
Simple Example: Comparing Performance
Compare performance for single-cycle, multicycle, and pipelined datapaths using the gcc instruction mix
assume 2 ns for memory access, 2 ns for ALU operation, 1 ns for register read or write
assume gcc instruction mix: 23% loads, 13% stores, 19% branches, 2% jumps, 43% ALU
for pipelined execution assume
50% of the loads are followed immediately by an instruction that uses the result of the load
25% of branches are mispredicted
branch delay on misprediction is 1 clock cycle
jumps always incur 1 clock cycle delay, so their average time is 2 clock cycles
Simple Example: Comparing Performance

Single-cycle (p. 373): average instruction time 8 ns
Multicycle (p. 397): average instruction time 8.04 ns
Pipelined:
loads use 1 cc (clock cycle) when there is no load-use dependency and 2 cc when there is; given that 50% of loads are followed by a dependent instruction, the average cc per load is 1.5
stores use 1 cc each
branches use 1 cc when predicted correctly and 2 cc when not; given 25% misprediction, the average cc per branch is 1.25
jumps use 2 cc each
ALU instructions use 1 cc each
therefore, average CPI is 1.5 × 23% + 1 × 13% + 1.25 × 19% + 2 × 2% + 1 × 43% = 1.18
therefore, average instruction time is 1.18 × 2 = 2.36 ns
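The pipelined CPI arithmetic above can be checked with a short script. It is a sketch that uses only the instruction mix, stall fractions, and 2 ns clock stated on these slides; the variable names are illustrative.

```python
# gcc instruction mix from the slides (fractions sum to 1.0)
mix = {"load": 0.23, "store": 0.13, "branch": 0.19, "jump": 0.02, "alu": 0.43}

# Average clock cycles per instruction class in the pipelined datapath
cycles = {
    "load": 1 + 0.50 * 1,    # 50% of loads pay a 1-cycle load-use stall
    "store": 1,
    "branch": 1 + 0.25 * 1,  # 25% of branches pay a 1-cycle misprediction delay
    "jump": 2,               # jumps always pay a 1-cycle delay
    "alu": 1,
}

cpi = sum(mix[k] * cycles[k] for k in mix)
clock_ns = 2
avg_ns = cpi * clock_ns

print(f"average CPI = {cpi:.4f}")
print(f"average instruction time = {avg_ns:.3f} ns")
```

The unrounded values are 1.1825 for the CPI and 2.365 ns for the average instruction time; the slide rounds the CPI to 1.18 before multiplying, which gives 2.36 ns.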
