
CMSC 611-101
Advanced Computer Architecture

Lecture 7
Pipeline Hazards

September 28, 2009

www.csee.umbc.edu/~younis/CMSC611/CMSC611.htm

Mohamed Younis CMCS 611, Advanced Computer Architecture 1


Lecture’s Overview
■ Previous Lecture:
➔ Pipelined hazards
• Pipelining concept is natural
• Start handling of next instruction while current one is in progress

➔ Pipeline performance
• Performance improvement by increasing instruction throughput
• Ideal and upper bound for speedup is number of stages in pipeline

■ This Lecture:
• Structural, data and control hazards
• Data Hazard resolution techniques
• Pipelined control
Stages of Instruction Execution
        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       WB

■ The load instruction is the longest
■ All instructions follow at most the following five steps:
➔ Ifetch: Instruction Fetch
• Fetch the instruction from the Instruction Memory
➔ Reg/Dec: Registers Fetch and Instruction Decode
➔ Exec: Calculate the memory address
➔ Mem: Read the data from the Data Memory
➔ WB: Write the data back to the register file
* Slide is courtesy of Dave Patterson



Instruction Pipelining
■ Start handling of next instruction while the current instruction is in progress
■ Pipelining is feasible when different devices are used at different stages of
instruction execution

Time →                                   (program flow ↓)
IFetch Dec    Exec   Mem    WB
       IFetch Dec    Exec   Mem    WB
              IFetch Dec    Exec   Mem    WB
                     IFetch Dec    Exec   Mem    WB
                            IFetch Dec    Exec   Mem    WB
                                   IFetch Dec    Exec   Mem    WB

\[
\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}
\]
Pipelining improves performance by increasing instruction throughput
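
As a rough illustration of the relation above, the following Python sketch counts clock cycles for a run of instructions with and without pipelining. It is not part of the original slides: the function names are made up, and it assumes one clock cycle per stage, balanced stages and no hazards.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction fills the pipeline; each later one completes one cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    stages = 5  # Ifetch, Reg/Dec, Exec, Mem, WB
    for n in (1, 6, 1000):
        slow = unpipelined_cycles(n, stages)
        fast = pipelined_cycles(n, stages)
        print(f"{n:5d} instructions: {slow} vs {fast} cycles, speedup {slow / fast:.2f}")
```

For long instruction streams the ratio approaches the number of pipe stages, which is the ideal bound; for short streams the fill-up time keeps it lower.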
Pipeline Datapath
Data Stationary

➢ Every stage must be completed in one clock cycle to avoid stalls
➢ Values must be latched to ensure correct execution of instructions
➢ The PC multiplexer has moved to the IF stage to prevent two instructions
from updating the PC simultaneously (in case of branch instruction)
Pipeline Hazards
■ Pipeline hazards are cases that affect instruction execution
semantics and thus need to be detected and corrected
■ Hazard types
Structural hazard: attempt to use a resource in two different ways at the same time
➔ E.g., a combined washer/dryer would be a structural hazard, or the folder is busy
doing something else (watching TV)
➔ Single memory for instructions and data
Data hazard: attempt to use an item before it is ready
➔ E.g., one sock of a pair is in the dryer and one in the washer; can’t fold until the sock
gets from the washer through the dryer
➔ An instruction depends on the result of a prior instruction still in the pipeline
Control hazard: attempt to make a decision before the condition is evaluated
➔ E.g., washing football uniforms and need to get the proper detergent level;
need to see the result after the dryer before putting the next load in
➔ Branch instructions
■ Hazards can always be resolved by waiting
Single Memory is a Structural Hazard
[Figure: pipeline diagram over time (clock cycles) for Load, Instr 1, Instr 2, Instr 3 and Instr 4 in instruction order; each instruction passes through Mem, Reg, ALU, Mem, Reg]
■ Can be easily detected
■ Resolved by inserting idle cycles
* Slide is courtesy of Dave Patterson



Stalls & Pipeline Performance
\[
\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}
= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\]
■ Ideally the CPI of the pipeline execution is 1 (after fill-up), thus
➔ CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction
                = 1 + Pipeline stall cycles per instruction
\[
\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\]
■ Assuming all pipeline stages are balanced, then
\[
\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}
\]
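
To make the last formula concrete, here is a small Python sketch of the balanced-stage speedup. The function name and the sample stall rate are illustrative assumptions, not values from the lecture.

```python
def pipeline_speedup(depth: int, stall_cycles_per_instr: float) -> float:
    """Speedup = pipeline depth / (1 + stall cycles per instruction), balanced stages assumed."""
    if depth < 1:
        raise ValueError("pipeline depth must be at least 1")
    if stall_cycles_per_instr < 0:
        raise ValueError("stall cycles per instruction cannot be negative")
    return depth / (1.0 + stall_cycles_per_instr)


if __name__ == "__main__":
    # A 5-stage pipeline with no stalls reaches the ideal speedup of 5.
    print(pipeline_speedup(5, 0.0))
    # A hypothetical 0.25 stall cycles per instruction lowers it to 4.
    print(pipeline_speedup(5, 0.25))
```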
Control Hazard
■ Stall: wait until decision is clear
➔ It is possible to move up decision to 2nd stage by adding hardware to
check registers as being read
[Figure: pipeline diagram over time (clock cycles) for Add, Beq and Load in instruction order; the Load after the Beq is stalled]
■ Impact: 2 clock cycles per branch instruction ⇒ slow


* Slide is courtesy of Dave Patterson



Control Hazard Solution
■ Predict: guess one direction then back up if wrong
➔ Predict not taken
[Figure: pipeline diagram over time (clock cycles) for Add, Beq and Load in instruction order]
■ Impact: 1 clock cycle per branch instruction if right, 2 if wrong
(right 50% of time)
■ More dynamic scheme: history of 1 branch (right 90% of time)
* Slide is courtesy of Dave Patterson
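
The per-branch costs quoted for stalling and for prediction can be compared with a short expected-value calculation. The Python sketch below is only an illustration: the function names are invented here, and the 20% branch fraction used for the CPI column is a hypothetical figure, not one from the slides.

```python
def cycles_per_branch(p_right: float, right_cost: float = 1.0, wrong_cost: float = 2.0) -> float:
    """Expected clock cycles per branch for a given prediction accuracy."""
    if not 0.0 <= p_right <= 1.0:
        raise ValueError("p_right must be a probability")
    return p_right * right_cost + (1.0 - p_right) * wrong_cost


def cpi_with_branches(branch_fraction: float, branch_cycles: float) -> float:
    """Average CPI when every non-branch instruction takes 1 cycle and nothing else stalls."""
    return (1.0 - branch_fraction) * 1.0 + branch_fraction * branch_cycles


if __name__ == "__main__":
    schemes = {
        "always stall": cycles_per_branch(0.0),       # every branch costs 2 cycles
        "predict not taken": cycles_per_branch(0.5),  # right 50% of the time
        "1-branch history": cycles_per_branch(0.9),   # right 90% of the time
    }
    for name, cycles in schemes.items():
        # 0.2 is a hypothetical branch fraction, used only to show the effect on CPI
        print(f"{name:18s} {cycles:.2f} cycles/branch, CPI {cpi_with_branches(0.2, cycles):.2f}")
```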



Control Hazard Solution
■ Redefine branch behavior (takes place after next instruction)
“delayed branch”
[Figure: pipeline diagram over time (clock cycles) for Add, Beq, Misc and Load in instruction order]
■ Impact: 0 clock cycles per branch instruction if an instruction can be found
to put in the “slot” (50% of time)
* Slide is courtesy of Dave Patterson



Data Hazard
Time (clock cycles) →                                       (instruction order ↓)
                  IF     ID/RF  EX     MEM    WB
add r1,r2,r3      Im     Reg    ALU    Dm     Reg
sub r4,r1,r3             Im     Reg    ALU    Dm     Reg
and r6,r1,r7                    Im     Reg    ALU    Dm     Reg
or  r8,r1,r9                           Im     Reg    ALU    Dm     Reg
xor r10,r1,r11                                Im     Reg    ALU    Dm     Reg

Dependencies backwards in time are hazards


* Slide is courtesy of Dave Patterson
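
The "backwards in time" dependencies in this instruction sequence can be found mechanically by comparing the cycle in which a register is written with the cycle in which a later instruction reads it. The Python sketch below is a simplified model, not the lecture's own method: it assumes a five-stage pipeline without forwarding, one instruction issued per cycle, registers read in stage 2 and written in stage 5, and a register file that writes in the first half of a cycle and reads in the second half. The names `Instr` and `raw_hazards` are invented for the example.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


PROGRAM = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r1", "r3")),
    Instr("and", "r6", ("r1", "r7")),
    Instr("or", "r8", ("r1", "r9")),
    Instr("xor", "r10", ("r1", "r11")),
]

ID_STAGE = 2  # stage in which source registers are read
WB_STAGE = 5  # stage in which the destination register is written


def raw_hazards(program: List[Instr]) -> List[Tuple[int, int, str]]:
    """Return (producer index, consumer index, register) for reads that come before the write."""
    hazards = []
    for i, producer in enumerate(program):
        write_cycle = i + WB_STAGE  # instruction i occupies stage s in cycle i + s
        for j in range(i + 1, len(program)):
            consumer = program[j]
            read_cycle = j + ID_STAGE
            # A read in the same cycle as the write is safe (write first half, read second half).
            if producer.dest in consumer.srcs and read_cycle < write_cycle:
                hazards.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                break  # later readers depend on the newer writer instead
    return hazards


if __name__ == "__main__":
    for i, j, reg in raw_hazards(PROGRAM):
        print(f"{PROGRAM[j].op} reads {reg} before {PROGRAM[i].op} writes it")
```

Under these assumptions only `sub` and `and` read `r1` too early; `or` reads it in the cycle of the write and `xor` reads it afterwards.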



Data Hazard Solution
Time (clock cycles) →                                       (instruction order ↓)
                  IF     ID/RF  EX     MEM    WB
add r1,r2,r3      Im     Reg    ALU    Dm     Reg
sub r4,r1,r3             Im     Reg    ALU    Dm     Reg
and r6,r1,r7                    Im     Reg    ALU    Dm     Reg
or  r8,r1,r9                           Im     Reg    ALU    Dm     Reg
xor r10,r1,r11                                Im     Reg    ALU    Dm     Reg
“Forward” result from one stage to another
* Slide is courtesy of Dave Patterson



Implementing Data Forwarding

➢ The ALU result from the EX/MEM register is fed back and kept in the next stages
➢ If a data hazard is detected, the forwarded values will be used
Example

ADD R1, R2, R3
LW  R4, 0(R1)
SW  12(R1), R4

➢ The ALU result from the EX/MEM register is forwarded to MEM/WB



Forwarding Datapath

➢ Three additional inputs are added to the ALU multiplexers, each
corresponding to a bypass (forwarded data)



Data Hazards Classification
■ A data hazard can happen because of a dependence among a pair of instructions
writing and reading the same register or memory location
■ By stalling the pipeline on cache misses, data hazards caused by memory
access are avoided
■ Data hazard types (instruction I precedes J)
RAW (read after write): J attempts to read an operand before I writes it
➔ Most common type of hazard and typically handled by forwarding
WAW (write after write): J attempts to write an operand before I writes it
➔ Can happen when writing is done in more than one pipeline stage
➔ For the MIPS pipeline, a WAW hazard can happen if WB is performed during
the MEM stage and the memory is slow so that the MEM stage takes two cycles

LW  R1, 0(R2)     IF    ID    EX    MEM1  MEM2  WB
ADD R1, R2, R3          IF    ID    EX    WB

WAR (write after read): J attempts to write an operand before I reads it
➔ Happens when there are instructions that write early in the pipeline while
others read in a late stage
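
The three definitions can be restated as set intersections over the operands each instruction reads and writes. The Python sketch below is only an illustration with an invented function name; it reports potential hazards (dependences) between I and J, and whether one becomes an actual hazard depends on the pipeline timing described above.

```python
from typing import Dict, Iterable, Set


def classify_dependences(
    i_reads: Iterable[str],
    i_writes: Iterable[str],
    j_reads: Iterable[str],
    j_writes: Iterable[str],
) -> Dict[str, Set[str]]:
    """Instruction I precedes J. Map each hazard kind to the operands that could cause it."""
    kinds = {
        "RAW": set(i_writes) & set(j_reads),   # J reads what I writes
        "WAW": set(i_writes) & set(j_writes),  # J writes what I writes
        "WAR": set(i_reads) & set(j_writes),   # J writes what I reads
    }
    return {kind: regs for kind, regs in kinds.items() if regs}


if __name__ == "__main__":
    # I: LW R1, 0(R2)    J: ADD R1, R2, R3   (the WAW example above)
    print(classify_dependences({"R2"}, {"R1"}, {"R2", "R3"}, {"R1"}))
    # I: ADD R1, R2, R3  J: SUB R4, R1, R3   (a RAW dependence on R1)
    print(classify_dependences({"R2", "R3"}, {"R1"}, {"R1", "R3"}, {"R4"}))
```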
Data Hazards for Load Instructions

➢ Dependencies are backwards in time but cannot be solved with forwarding
➢ Must delay/stall the instruction that depends on the load
Solving Hazard by Pipeline Interlock



Compiler Scheduling for Data Hazards
■ The compiler usually performs instruction scheduling to avoid causing data
hazards, such as:
➢ Avoid generating an LW immediately followed by an instruction
that uses its destination register
➢ Change order of instructions in the basic block

[Figure: bar chart of the fraction of loads that cause a stall, per benchmark]

Example: compile the following:
a = b + c; d = e - f;

Original code               Scheduled code
LW  Rb, b                   LW  Rb, b
LW  Rc, c                   LW  Rc, c
ADD Ra, Rb, Rc   (stall)    LW  Re, e         (swapped)
LW  Re, e                   ADD Ra, Rb, Rc    (swapped)
SW  a, Ra                   LW  Rf, f         (swapped)
LW  Rf, f                   SW  a, Ra         (swapped)
SUB Rd, Re, Rf   (stall)    SUB Rd, Re, Rf
SW  d, Rd                   SW  d, Rd
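
The effect of the reordering can be checked by counting load-use pairs in each instruction order. The Python sketch below is a simplified model rather than a real scheduler: it assumes forwarding is in place, so the only stalls are one cycle for an instruction that uses the result of the LW immediately before it. The two sequences are transcribed from the example as read off the slide, and the names `Instr` and `load_use_stalls` are invented for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]     # register written, or None for a store
    srcs: Tuple[str, ...]   # registers read


def load_use_stalls(code: List[Instr]) -> int:
    """Count instructions that read the destination of the LW directly before them."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev.op == "LW" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


LW_B = Instr("LW", "Rb", ())
LW_C = Instr("LW", "Rc", ())
LW_E = Instr("LW", "Re", ())
LW_F = Instr("LW", "Rf", ())
ADD_A = Instr("ADD", "Ra", ("Rb", "Rc"))
SUB_D = Instr("SUB", "Rd", ("Re", "Rf"))
SW_A = Instr("SW", None, ("Ra",))
SW_D = Instr("SW", None, ("Rd",))

ORIGINAL = [LW_B, LW_C, ADD_A, LW_E, SW_A, LW_F, SUB_D, SW_D]
SCHEDULED = [LW_B, LW_C, LW_E, ADD_A, LW_F, SW_A, SUB_D, SW_D]

if __name__ == "__main__":
    print("original stalls: ", load_use_stalls(ORIGINAL))
    print("scheduled stalls:", load_use_stalls(SCHEDULED))
```

Both orders execute the same eight instructions; only the two adjacent swaps differ, which is enough to separate each load from its first use.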
Data Hazards Detection
■ Detecting hazards early in the pipeline reduces hardware complexity since
the machine state will not get erroneously changed
■ For the MIPS integer pipeline, all data hazards can be checked in ID stage

Situation             Example code sequence   Action
No dependence         LW  R1, 45(R2)          No hazard possible because no dependence
                      ADD R5, R6, R7          exists on R1 in the immediately following
                      SUB R8, R6, R7          three instructions
                      OR  R9, R6, R7
Dependence            LW  R1, 45(R2)          Comparators detect the use of R1 in the ADD
requiring stall       ADD R5, R1, R7          and stall the ADD (and SUB and OR) before
                      SUB R8, R6, R7          the ADD begins EX
                      OR  R9, R6, R7
Dependence            LW  R1, 45(R2)          Comparators detect the use of R1 in the SUB
overcome by           ADD R5, R6, R7          and forward result of load to ALU in time for
forwarding            SUB R8, R1, R7          SUB to begin EX
                      OR  R9, R6, R7
Dependence            LW  R1, 45(R2)          No action required because the read of R1 by
with accesses         ADD R5, R6, R7          OR occurs in the second half of the ID phase,
in order              SUB R8, R6, R7          while the write of the loaded data occurred in
                      OR  R9, R1, R7          the first half.

[Slide callout: Load interlock detection]



Load Interlock Detection
■ A pipeline stall is needed when a load instruction is followed by an
instruction that reads the yet-to-be-loaded register
■ The load interlock conditions for RAW hazards are:

Opcode field of ID/EX   Opcode field of IF/ID               Matching operand fields
(ID/EX.IR 0..5)         (IF/ID.IR 0..5)
Load                    Register-register ALU               ID/EX.IR 11..15 == IF/ID.IR 6..10
Load                    Register-register ALU               ID/EX.IR 11..15 == IF/ID.IR 11..15
Load                    Load, store, ALU imm., or branch    ID/EX.IR 11..15 == IF/ID.IR 6..10

■ Control logic is a simple combinational circuit with input from ID/EX and IF/ID
■ Once the hazard is detected, the control unit must insert the pipeline stall and
prevent the instructions in the IF and ID stages from advancing
■ Since all control logic is data stationary, stalling the pipeline
is done simply by setting the ID/EX portion to zero (matching the NOP instruction)
■ In case of a stall, the contents of the IF/ID registers will be re-circulated to
hold the stalled instruction
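
The three interlock conditions amount to a small combinational check. The Python sketch below models it on already-decoded instructions instead of raw bit fields. This is an assumption made for readability: `rs` stands for IR bits 6..10, `rt` for IR bits 11..15, and the `kind` labels and the names `IR` and `needs_load_interlock` are invented here.

```python
from typing import NamedTuple


class IR(NamedTuple):
    kind: str  # "load", "store", "alu_rr", "alu_imm", "branch" or "nop"
    rs: int    # IR bits 6..10
    rt: int    # IR bits 11..15


def needs_load_interlock(id_ex: IR, if_id: IR) -> bool:
    """True when the instruction in IF/ID must stall behind the load in ID/EX."""
    if id_ex.kind != "load":
        return False
    loaded_reg = id_ex.rt  # a load writes the register named in bits 11..15
    if if_id.kind == "alu_rr":
        # register-register ALU ops read both operand fields
        return loaded_reg in (if_id.rs, if_id.rt)
    if if_id.kind in ("load", "store", "alu_imm", "branch"):
        # these are checked on bits 6..10 only, as in the third condition
        return loaded_reg == if_id.rs
    return False


if __name__ == "__main__":
    lw = IR("load", rs=2, rt=1)             # LW  R1, 45(R2)
    add_dep = IR("alu_rr", rs=1, rt=7)      # ADD R5, R1, R7
    add_indep = IR("alu_rr", rs=6, rt=7)    # ADD R5, R6, R7
    print(needs_load_interlock(lw, add_dep))    # stall needed
    print(needs_load_interlock(lw, add_indep))  # no stall
```

When the check fires, the stall itself would follow the steps listed above: zero the ID/EX portion and hold IF/ID.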



Data Forwarding Logic
Source     Opcode of source        Destination   Opcode of destination instruction                            Destination of     Comparison
register   instruction             register                                                                   forwarded result   (if equal then forward)
EX/MEM     Register-register ALU   ID/EX         Register-register ALU, ALU immediate, load, store, branch    Top ALU input      EX/MEM.IR 16..20 == ID/EX.IR 6..10
EX/MEM     Register-register ALU   ID/EX         Register-register ALU                                        Bottom ALU input   EX/MEM.IR 16..20 == ID/EX.IR 11..15
MEM/WB     Register-register ALU   ID/EX         Register-register ALU, ALU immediate, load, store, branch    Top ALU input      MEM/WB.IR 16..20 == ID/EX.IR 6..10
MEM/WB     Register-register ALU   ID/EX         Register-register ALU                                        Bottom ALU input   MEM/WB.IR 16..20 == ID/EX.IR 11..15
EX/MEM     ALU immediate           ID/EX         Register-register ALU, ALU immediate, load, store, branch    Top ALU input      EX/MEM.IR 11..15 == ID/EX.IR 6..10
EX/MEM     ALU immediate           ID/EX         Register-register ALU                                        Bottom ALU input   EX/MEM.IR 11..15 == ID/EX.IR 11..15
MEM/WB     ALU immediate           ID/EX         Register-register ALU, ALU immediate, load, store, branch    Top ALU input      MEM/WB.IR 11..15 == ID/EX.IR 6..10
MEM/WB     ALU immediate           ID/EX         Register-register ALU                                        Bottom ALU input   MEM/WB.IR 11..15 == ID/EX.IR 11..15
MEM/WB     Load                    ID/EX         Register-register ALU, ALU immediate, load, store, branch    Top ALU input      MEM/WB.IR 11..15 == ID/EX.IR 6..10
MEM/WB     Load                    ID/EX         Register-register ALU                                        Bottom ALU input   MEM/WB.IR 11..15 == ID/EX.IR 11..15
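
The forwarding table can be read as the selection logic for the two ALU input multiplexers. The Python sketch below is one possible rendering under stated assumptions, not the lecture's implementation: instructions are modeled as decoded records (`rs` = IR 6..10, `rt` = IR 11..15, `rd` = IR 16..20), the destination is taken from `rd` for register-register ALU operations and from `rt` for ALU-immediate and load instructions as in the standard MIPS encoding, and when both pipeline registers match, the newer result in EX/MEM is preferred. That priority rule and all names are assumptions made for the example.

```python
from typing import Dict, NamedTuple, Optional


class IR(NamedTuple):
    kind: str  # "alu_rr", "alu_imm", "load", "store", "branch" or "nop"
    rs: int = 0
    rt: int = 0
    rd: int = 0


def dest_reg(ir: IR) -> Optional[int]:
    """Register written by the instruction, or None if it writes no register."""
    if ir.kind == "alu_rr":
        return ir.rd
    if ir.kind in ("alu_imm", "load"):
        return ir.rt
    return None


def forward_sources(id_ex: IR, ex_mem: IR, mem_wb: IR) -> Dict[str, str]:
    """Pick the source of each ALU input: 'EX/MEM', 'MEM/WB' or 'REG' (register file)."""
    choice = {"top": "REG", "bottom": "REG"}
    # Check the older result first so that a newer match in EX/MEM overrides it.
    for name, ir in (("MEM/WB", mem_wb), ("EX/MEM", ex_mem)):
        if name == "EX/MEM" and ir.kind == "load":
            continue  # loaded data is not available from EX/MEM
        dest = dest_reg(ir)
        if dest is None:
            continue
        if id_ex.kind in ("alu_rr", "alu_imm", "load", "store", "branch") and dest == id_ex.rs:
            choice["top"] = name
        if id_ex.kind == "alu_rr" and dest == id_ex.rt:
            choice["bottom"] = name
    return choice


if __name__ == "__main__":
    add = IR("alu_rr", rs=2, rt=3, rd=1)   # add r1,r2,r3
    sub = IR("alu_rr", rs=1, rt=3, rd=4)   # sub r4,r1,r3
    and_ = IR("alu_rr", rs=1, rt=7, rd=6)  # and r6,r1,r7
    nop = IR("nop")
    print(forward_sources(sub, add, nop))   # sub in EX, add one stage ahead
    print(forward_sources(and_, sub, add))  # and in EX, add two stages ahead
```

A load sitting in EX/MEM yields no forwarding here, which is the case the load interlock has to cover with a stall.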



Conclusion
■ Summary
➔ Pipeline Hazards
• Structural, data and control hazards
➔ Data Hazards
• Forwarding techniques for simple data hazards resolution
• Data hazards classifications and detection logic
• Load-caused pipeline stalls and how to limit their scope
• Compiler-based instruction scheduling to avoid pipeline stalls
• Implementation of data hazard detection and forwarding logic

■ Next Lecture
➔ Pipeline control hazards
➔ Pipelining and exception handling
Reading assignment includes Appendix A.2 & A.3 in the textbook