Processor Pipeline

Pipelined Processor
CHAPTER 4
Introduction
Single cycle data path
Run one instruction at a time, sequentially, in the datapath
Each instruction takes one cycle
Slow. Why?
How can we make it faster?
Run multiple instructions at the same time?
Pipelining
Agenda
What is Pipelining?
Improvement over Single Cycle
Pipelined Datapath
Pipelined Control
Hazards and Dependencies
Structural Hazards
Data Hazards
Control Hazards
Exceptions and Interrupts

Advanced Pipelining
Pipelining Analogy: Laundry Example

Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 30 minutes
Folder takes 30 minutes
Stacker takes 30 minutes to put clothes into drawers
Sequential Laundry
Sequential laundry takes 8 hours for 4 loads
[Figure: timeline from 6 PM to 2 AM in 30-minute slots; loads A, B, C, D in task order, each load finishing all four stages before the next load starts]
CSUSM
CS 331
Pipelined Laundry: Start work ASAP

Overlapping execution (Parallelism improves performance)
Pipelined laundry takes 3.5 hours for 4 loads!
[Figure: timeline starting at 6 PM in seven 30-minute slots; loads A, B, C, D in task order, each load starting one slot after the previous one]
What is the speedup for n tasks?

Sequential time for n tasks
Each task requires k balanced stages -> nk
Pipeline time for n tasks
n + fill time = n + k - 1
Speedup
nk / (n + k - 1) -> k for large n
What about unbalanced pipe stages?
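The speedup formula above can be checked numerically. The following is a minimal sketch, assuming balanced stages that each take one time unit; the function name `pipeline_speedup` is made up for illustration and does not come from the slides.

```python
def pipeline_speedup(n_tasks: int, k_stages: int) -> float:
    """Speedup of a balanced k-stage pipeline over sequential execution.

    Assumes every stage takes exactly one time unit (balanced stages).
    """
    if n_tasks < 1 or k_stages < 1:
        raise ValueError("need at least one task and one stage")
    sequential_time = n_tasks * k_stages      # nk
    pipelined_time = n_tasks + k_stages - 1   # n + fill time
    return sequential_time / pipelined_time


# Laundry example: 4 loads, 4 stages -> 16 slots vs. 7 slots (8 h vs. 3.5 h).
print(round(pipeline_speedup(4, 4), 2))
# For large n the speedup approaches the number of stages k.
print(round(pipeline_speedup(1_000_000, 4), 2))
```

With unbalanced stages the slot length would be set by the slowest stage, so this sketch overestimates the speedup in that case.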
Pipelining Lessons
Pipelining does not help latency of a single task; it helps throughput of the entire workload
Multiple tasks operate simultaneously using different resources
Potential max speedup = number of pipe stages
Pipeline rate is limited by the slowest pipeline stage
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
Stall for Dependences
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
Pipelined vs. Single Cycle Performance

Compare average time between instructions
Assume time for stages is
100ps for register read or write
200ps for other stages
Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200ps         100ps           200ps    200ps           100ps            800ps
sw         200ps         100ps           200ps    200ps           -                700ps
R-format   200ps         100ps           200ps    -               100ps            600ps
beq        200ps         100ps           200ps    -               -                500ps
Pipeline Performance
Single-cycle
(Tc= 800ps)
Pipelined
(Tc= 200ps)
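The two clock periods follow directly from the stage-time table. This is a small sketch of that arithmetic; the dictionary layout and the function names are assumptions made for illustration, and unused stages are entered as 0.

```python
# Stage times in ps: [instr fetch, register read, ALU op, memory access, register write]
STAGE_PS = {
    "lw":       [200, 100, 200, 200, 100],
    "sw":       [200, 100, 200, 200, 0],
    "R-format": [200, 100, 200, 0, 100],
    "beq":      [200, 100, 200, 0, 0],
}


def single_cycle_period(stage_ps):
    """The clock must fit the slowest whole instruction."""
    return max(sum(stages) for stages in stage_ps.values())


def pipelined_period(stage_ps):
    """The clock must fit the slowest single stage."""
    return max(max(stages) for stages in stage_ps.values())


print(single_cycle_period(STAGE_PS))  # 800
print(pipelined_period(STAGE_PS))     # 200
```

The ratio is 4 rather than 5 because the stages are not balanced: the 100ps register stages still occupy a full 200ps cycle.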
Pipeline Performance
Pipelining improves performance by increasing instruction throughput
Executes multiple instructions in parallel
Each instruction has the same latency
Subject to hazards that prevent starting the next instruction in the next cycle
Structure hazard: a required resource is busy
Data hazard: need to wait for previous instruction to complete its data read/write
Control hazard: deciding on control action depends on previous instruction
Pipelined Datapath - Basic Idea

Divide a single-cycle datapath into 5 stages
Single Clock Cycle Diagram

Pipeline registers to chop the data path into separate stages
To hold information produced in previous cycle
Pipeline Operation
Cycle-by-cycle flow of instructions through the pipelined datapath
Single-clock-cycle pipeline diagram
Shows pipeline usage in a single cycle
Highlight resources used
Multi-clock-cycle diagram
We'll look at single-clock-cycle diagrams for load & store
IF for Load, Store,
ID for Load, Store,
EX for Load
MEM for Load
WB for Load
Did someone spot the error?
Wrong register number
Corrected Datapath for Load

Shift write register index through all subsequent pipeline stages
EX for Store
MEM for Store
WB for Store
Multi-Cycle Pipeline Diagram

Form showing resource usage
Can help answer:

o How many cycles to execute this code?
o What is the ALU doing during cycle 4?
Multi-Cycle Pipeline Diagram

Traditional form
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Time to fill the pipeline
Single-Cycle Pipeline Diagram

State of pipeline in a given cycle
Question
Some instructions do nothing during some stages.
Should we force every instruction to go through all 5 stages? Can we have R-type take 4 cycles instead of 5?
Selection   Yes/No   Reason (Choose BEST answer)
A           Yes      Decreasing R-type to 4 cycles improves instruction throughput
B           Yes      Decreasing R-type to 4 cycles improves instruction latency
C           No       Decreasing R-type to 4 cycles causes hazards
D           No       Decreasing R-type to 4 cycles causes hazards and doesn't impact throughput
E           No       Decreasing R-type to 4 cycles causes hazards and doesn't impact latency
Pipelined Control
No information transfer from one pipeline stage to another is possible except through the pipeline register
Everything that happened in any previous stage will be overwritten
All data belonging to one instruction must be kept within the stage
Control information must travel with the instruction along pipeline stages just like data
Pipelined Control (Simplified)
Pipeline Control
Recall what needs to be controlled in each stage:
Instruction Fetch and PC Increment: Identical for all instructions
Instruction Decode / Register Fetch: Identical for all instructions
Execution: RegDst, ALUOp, ALUSrc
Memory Stage: Branch, MemRead, MemWrite
Write Back: MemToReg, RegWrite
              Execution/Address Calculation      Memory access stage          Write-back stage
              stage control lines                control lines                control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc     Branch  MemRead  MemWrite    RegWrite  MemToReg
R-format      1       1       0       0          0       0        0           1         0
lw            0       0       0       1          0       1        0           1         1
sw            X       0       0       1          0       0        1           0         X
beq           X       0       1       0          1       0        0           0         X
Pipelined Control
Control signals derived from instruction
As in single-cycle implementation
Example 1
Instruction sequence
lw  $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or  $13, $6, $7
add $14, $8, $9
Show data flow and control through the pipeline
Example Pipeline - 1
Example 2
Instruction sequence
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
Can you spot the problem?
Data hazard: a result is needed in the pipeline before it is available
May result in stalls
sub: Does not write until 5th stage (cycle 5)
and: Reads operands in 2nd stage (cycle 3)
Read after write (cycle 6) -> 3 stalls
Design registers so read/write happen in the same cycle -> 2 stalls
Example 2: Dependencies
Hazards Continued
What happens when...
add $3, $10, $11
lw $8, 1000($3)
sub $11, $8, $7
MIPS Pipelining
Pipelining improves performance
Achieve high throughput without reducing instruction latency

Exploit instruction level parallelism (ILP)
Use combinational logic/registers to generate/propagate control signals
ISA design affects complexity of pipeline implementation

What makes it easy
MIPS ISA designed for pipelining

All instructions are the same length
Just a few instruction formats
Memory operands appear only in loads and stores
MIPS Pipelining
Hazards prevent the next instruction from starting in the next cycle
Structural hazards
Attempt to use the same resource two different ways at the same time
e.g. the memory
Data hazards
Attempt to use item before it is ready

For example: an instruction depends on a previous instruction
Control hazards
Attempt to make a decision before condition is evaluated

For example: branch instructions
Structural Hazards
What happens if we had unified instruction and data memory?
Structural Hazard
Structural Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
Load/store requires data access
Instruction fetch would have to stall for that cycle
Would cause a pipeline bubble
Pipelined datapaths require separate instruction/data memories
Or separate instruction/data caches
Data Hazards (and Dependencies)

RAW: read after write
True data dependency -> Data Hazard
add $3, $10, $11
sub $12, $3, $7
Sub uses the old value of $3 instead of new one
WAW: write after write
Write dependency: in complex pipelining with out of order execution
add $3, $10, $11
sub $3, $7, $3
Add is delayed so $3 final value is that of add not sub
WAR: write after read
Anti-dependency
add $3, $10, $11
sub $10, $2, $7
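The three dependence types above can be told apart mechanically from the destination and source registers of two instructions. The following is an illustrative sketch; the `(dest, sources)` tuple encoding and the name `classify_dependences` are assumptions for this example, not part of the slides.

```python
def classify_dependences(first, second):
    """Return the set of dependences of `second` on the earlier `first`.

    Each instruction is a (dest, sources) tuple; dest is None for
    instructions that write no register (for example sw or beq).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 is not None and dest1 in srcs2:
        found.add("RAW")   # second reads what first writes
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")   # both write the same register
    if dest2 is not None and dest2 in srcs1:
        found.add("WAR")   # second overwrites what first reads
    return found


add = ("$3", ("$10", "$11"))                            # add $3, $10, $11
print(classify_dependences(add, ("$12", ("$3", "$7"))))  # sub $12, $3, $7
print(classify_dependences(add, ("$3", ("$7", "$3"))))   # sub $3, $7, $3
print(classify_dependences(add, ("$10", ("$2", "$7"))))  # sub $10, $2, $7
```

Under this encoding the WAW example also reports RAW, because `sub $3, $7, $3` reads `$3` as well as writing it.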
How to work around true data dependencies?
1. Hardware Stalling
2. Software Solution (nops)
3. Hardware Forwarding
4. Code Scheduling
Solution 1: Pipeline Stall

An instruction depends on completion of data access by a previous instruction
add $s0, $t0, $t1
sub $t2, $s0, $t3
2 stalls
Solution 2: Software Solution

Have compiler guarantee no hazards. How?
Where do we insert the nops ?
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
Solution 2: Software Solution

Have compiler guarantee no hazards
Where do we insert the nops ?
sub $2, $1, $3
nop
nop
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
Problem: this really slows us down!
Solution 3: Forwarding (aka Bypassing)

Use result when it is computed
Don't wait for it to be stored in a register
Forward it from one stage to the other
Requires extra connections in the datapath
Solution 3: Forwarding (aka Bypassing)
How do we detect when to forward?
Detecting the Need to Forward

Pass register numbers along pipeline
e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
ALU operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt
Data hazards when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
(1a, 1b: Fwd from EX/MEM pipeline reg)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
(2a, 2b: Fwd from MEM/WB pipeline reg)
Detecting the Need to Forward

But only if forwarding instruction will write to a register!
EX/MEM.RegWrite, MEM/WB.RegWrite
And only if Rd for that instruction is not $zero
EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0
Hardware Solution: Detection and Forward

Time (in clock cycles)      CC 1  CC 2  CC 3  CC 4  CC 5   CC 6  CC 7  CC 8  CC 9
Value of register $2 :      10    10    10    10    10/20  20    20    20    20
Value of EX/MEM :           X     X     X     20    X      X     X     X     X
Value of MEM/WB :           X     X     X     X     20     X     X     X     X
Program execution order (in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure: multi-clock-cycle pipeline diagram (IM, Reg, DM, Reg) for the five instructions]
Hazard detectable in EX stage
Prior instruction in MEM stage
1a: EX/MEM.RegisterRd = ID/EX.RegisterRs = $2
Hardware Solution: Detection and Forward

[Figure: same register-value table and pipeline diagram for sub, and, or, add, sw as on the previous slide]
What type of hazard occurs between sub and or?
2b: MEM/WB.RegisterRd = ID/EX.RegisterRt = $2
Forwarding Paths
Resolve Data Hazards Through Forwarding
EX hazard
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
Forwarding Conditions
EX hazard
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
MEM hazard
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
Double Data Hazard

Consider the sequence:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
What data hazard type does the third instruction have?
Both hazards occur
Want to use the most recent
Revise MEM hazard condition
Only fwd if EX hazard condition is not true
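The revised rule (forward from MEM/WB only when the EX hazard condition is not true) amounts to a priority check per ALU operand. Below is a minimal software sketch of that selection logic; the function name and argument names are made up for illustration, and register numbers are plain integers with 0 standing for $zero.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd):
    """Return (ForwardA, ForwardB) as 2-bit mux selects.

    0b10 = forward from EX/MEM, 0b01 = forward from MEM/WB,
    0b00 = use the value read from the register file.
    """
    def select(src):
        # EX hazard is checked first, so the most recent result wins.
        if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src:
            return 0b10
        # MEM hazard applies only if the EX hazard condition was not true.
        if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src:
            return 0b01
        return 0b00

    return select(id_ex_rs), select(id_ex_rt)


# Third instruction of: add $1,$1,$2 / add $1,$1,$3 / add $1,$1,$4
# Both earlier adds write $1; the EX/MEM copy is the most recent one.
print(forwarding_unit(1, 4, True, 1, True, 1))  # (2, 0)
```

Checking the EX/MEM condition first is one simple way to express the revised MEM hazard condition; real hardware evaluates both comparisons in parallel and combines them.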
Data Hazards
Data hazards
Attempt to use item before it is ready
Example: an instruction depends on a previous instruction
Working around Data Hazards
Hardware Stalling
Software Solution (nops)
Hardware Forwarding
Code Scheduling
Datapath with Forwarding
Review: Data Hazard Example

Consider this code on the 5-stage pipeline processor
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
How many cycles are needed without forwarding?
14
How many cycles with forwarding?
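Cycle counts like these can be cross-checked with a small timing model. The sketch below is a simplification under stated assumptions: a 5-stage in-order pipeline, a register file that writes in the first half of a cycle and reads in the second, no branches, and no structural hazards. The function `count_cycles` and the `(dest, sources, is_load)` encoding are invented for this example.

```python
def count_cycles(program, forwarding):
    """Total cycles for a straight-line program on a 5-stage pipeline.

    program: list of (dest, sources, is_load); dest may be None.
    Tracks the cycle in which each instruction occupies ID for the last time.
    """
    last_writer = {}   # register -> (ID cycle of most recent writer, is_load)
    id_cycle = 1       # so the first instruction gets IF in cycle 1, ID in cycle 2
    for dest, sources, is_load in program:
        earliest = id_cycle + 1            # at best, one cycle behind predecessor
        for reg in sources:
            if reg in last_writer:
                producer_id, producer_is_load = last_writer[reg]
                if not forwarding:
                    gap = 3                # wait for the producer's WB
                elif producer_is_load:
                    gap = 2                # load-use: one bubble, then forward
                else:
                    gap = 1                # ALU result forwarded, no stall
                earliest = max(earliest, producer_id + gap)
        id_cycle = earliest
        if dest is not None:
            last_writer[dest] = (id_cycle, is_load)
    return id_cycle + 3                    # EX, MEM, WB follow the last ID


review = [
    ("$2", ("$1", "$3"), False),   # sub $2, $1, $3
    ("$4", ("$2", "$5"), False),   # and $4, $2, $5
    ("$4", ("$4", "$2"), False),   # or  $4, $4, $2
    ("$9", ("$4", "$2"), False),   # add $9, $4, $2
]
print(count_cycles(review, forwarding=False))  # 14
print(count_cycles(review, forwarding=True))   # 8
```

Replacing the first instruction with a load (`is_load=True`) adds one bubble under forwarding, which gives 9 cycles in this model.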
Example with Forwarding
What about Load?

Cannot always avoid stalls by forwarding
If value not computed when needed
Cannot forward backward in time!
Stall/Bubble in the Pipeline
Stall inserted here
Stall/Bubble in the Pipeline
Or, more accurately
Load-Use Hazard Detection

Check when ALU instruction is being decoded in ID stage
ALU operand register numbers in ID stage are given by
IF/ID.RegisterRs, IF/ID.RegisterRt
Load-use hazard when
ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
If detected, stall and insert bubble
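The load-use condition above translates directly into a boolean check. This is a minimal sketch; the function name and the use of plain integers for register numbers are assumptions for illustration only.

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load in EX."""
    return bool(id_ex_mem_read and
                (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# lw $2, 20($1) is in EX (its Rt is 2); and $4, $2, $5 is in ID.
print(load_use_hazard(True, 2, 2, 5))   # True  -> stall one cycle
# An ALU instruction in EX never triggers the check (MemRead = 0).
print(load_use_hazard(False, 2, 2, 5))  # False
```

The check is conservative in the same way as the condition on the slide: it compares against IF/ID.RegisterRt even for instructions that do not actually read Rt.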
Stalling the Pipeline

Once you detect hazard in ID how do you insert the nop and stall?
1. Flush all instructions in the pipeline (set control signals to 0)
2. Set all control signals going to ID/EX register to zero
3. Set PCWrite = 0
4. Set IF/ID regWrite = 0
Selection   Changes
A           1, 3, 4
B           1, 2, 3
C           2, 3, 4
D           None of the above
How to Stall the Pipeline

Turn the instruction in EX stage to a bubble
Force control values in ID/EX register to 0
EX, MEM and WB do nop (no-operation)
Prevent update of PC and IF/ID register
Using instruction is decoded again
Following instruction is fetched again
Keep the same instructions in ID and IF stages
Set PCWrite to 0 and set IF/ID Write to 0
1-cycle stall allows MEM to read data for lw
Can subsequently forward to EX stage
Datapath with Hazard Detection
Example with Stall

Code sequence
lw $2, 20($1)
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
How many cycles required?
9 cycles
How many values require hardware forwarding?
3 forwards
Another Example with Stall

Code sequence
add $3, $2, $1
lw $4, 100($3)
and $6, $4, $3
sub $7, $6, $2
add $9, $3, $6
How many stalls occur?
1 stall after lw
How many values require hardware forwarding support to avoid stalling for our MIPS 5-stage pipeline?
4 forwards: add-lw, lw-and, and-sub, and-add
Stalls and Performance

Stalls reduce performance
But are required to get correct results
Compiler can arrange code to avoid hazards and stalls
Requires knowledge of the pipeline structure
Code Scheduling to Avoid Stalls

C code for A = B + E; C = B + F;
Find the hazards and reorder instruction to avoid stalls
Reorder code to avoid use of load result in next instruction
Original order: 13 cycles
lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2    # stall before this add
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4    # stall before this add
sw  $t5, 16($t0)
Reordered: 11 cycles
lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
Example: Code Scheduling

Consider the following MIPS code segment
sw  $t2, 20($t0)
add $t3, $t1, $t4
lw  $t1, 4($t2)
and $t2, $t1, $t3
How many cycles required?
9 cycles
What is ALU doing during cycle 4?
Adding $t1 and $t4
Can we rearrange the code to minimize the number of cycles?
Move sw behind lw
Example
Suppose EX is the longest (in time) pipeline stage. To reduce CT, we split it in half. So the pipeline becomes:
IF ID EX1 EX2 M WB
Assume the input data must be available at the start of EX1 and the ALU output is available after EX2
How many hardware stalls are required in the following code (assuming hardware forwarding wherever possible)?
lw r1, 0(r3)
add r2, r1, r3
Dealing with Data Hazards

As an ISA designer, you can deal with hazards in software or hardware. Which statement(s) is (are) True?
_____ Compilers have a large window of instructions available to do reordering to eliminate hazards
_____ Detecting data hazards in hardware can be difficult and expensive
_____ Hardware knows at runtime the actual dependencies and can exploit that knowledge for better reordering
_____ Exposing the number of required stalls violates the abstraction between hardware and software
Pipelining Recap
Pipelining improves performance by increasing instruction throughput
Executes multiple instructions in parallel
Subject to hazards that prevent starting the next instruction in the next cycle
Structure hazard: a required resource is busy
Data hazard: need to wait for previous instruction to complete its data read/write
Control hazard: deciding on control action depends on previous instruction
Example: Code Scheduling

Consider the following MIPS code segment
sw  $t2, 20($t0)
add $t3, $t1, $t4
lw  $t1, 4($t2)
and $t2, $t1, $t3
How many cycles required if processor supports forwarding?
9 cycles
What is ALU doing during cycle 4?
Adding $t1 and $t4
Can we rearrange the code to minimize the number of cycles?
Move sw behind lw
Control Hazards
Current design for branch instruction
Decision (Taken or Not Taken) occurs in MEM stage
Don't know which is the next instruction until the decision is made
Wait until branch outcome determined before fetching next instruction
How many cycles will we lose per branch if we stall until we know the branch outcome?
Control Hazards
MIPS 5 stage pipeline:
Stall 3 cycles until the branch outcome is known
beq $4, $0, there
and $12, $2, $5
[Figure: multi-clock-cycle diagram, CC1 to CC8; beq is followed by three bubbles before the next instruction is fetched]
With longer pipelines, stall penalty becomes unacceptable
Control Hazards
Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching
Control hazards are detected in hardware
We can reduce the impact of control hazards through
1. Static Branch Prediction
2. Reducing the Delay/Cost of Branch Hazard
3. Delayed Branch
4. Dynamic Branch Prediction
Solution 1: Static Branch Prediction

Predict outcome of branch
For example, predict branch not taken

Fetch instruction after branch, with no delay
Works pretty well when prediction is right
Only stall if prediction is wrong
Add hardware to flush instructions if prediction is wrong
Same performance as stalling when wrong
Solution 1: Static Prediction Not Taken

Fetch instruction after branch, with no delay
Works pretty well if prediction is correct
beq $4, $0, Else
and $12, $2, $5
or ...
add ...
sw ...
[Figure: multi-clock-cycle diagram, CC1 to CC8; the instructions after beq enter the pipeline one per cycle with no bubbles]
Solution 1: Static Prediction and Flushing

Flush or discard instructions when wrong
Change the original control values to 0s

Change the 3 instructions in IF, ID, and EX when branch is in MEM
beq $4, $0, Else
and $12, $2, $5
or ...
add ...
Else: sub $12, $4, $2
[Figure: multi-clock-cycle diagram, CC1 to CC8; and, or, add are marked Flush]
Flush these instructions if prediction is wrong
Solution 2: Reducing Cost of Branch Hazard

Move decision to earlier stages
Move up to EXE stage, and save ____ cycle per branch

Move up to ID stage, and save ____ cycles per branch
Must add hardware to check registers as they are read
Move hardware to determine branch outcome in ID stage
Branch or Target address adder

Register comparator
Exclusive-or (XOR) of the bits, and OR of the results
Need to copy the forwarding and hazard detection hardware (why?)
add $t1, $s1, $s2

beq $t1, $0, Loop
Data Hazards for Branches

Do we need to stall if a comparison register is a destination of 2nd or 3rd preceding ALU instruction?
Can resolve using forwarding
add $1, $2, $3        IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
...                           IF  ID  EX  MEM WB
beq $1, $4, target                IF  ID  EX  MEM WB

Do we need to stall if a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction?
Need 1 stall cycle
lw  $1, addr          IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
beq stalled                   IF  ID
beq $1, $4, target                ID  EX  MEM WB

Do we need to stall if a comparison register is a destination of immediately preceding load instruction?
Need 2 stall cycles
lw  $1, addr          IF  ID  EX  MEM WB
beq stalled               IF  ID
beq stalled                   ID
beq $1, $4, target                ID  EX  MEM WB
Impact of Reducing Cost of Branch Hazard

Branch detection in ID stage
Predict: guess one direction then back up if wrong
0 lost cycles per branch instruction if right, 1 if wrong

Need to flush and restart following instruction if wrong
Clear instruction field in IF/ID pipeline -> creates a NOP
Assume, we are right 50% of time
CPI of branch = (1 × 0.5 + 2 × 0.5) = 1.5
If 20% of instructions are branches and CPI of other instructions = 1
Total CPI = 1.5 × 0.2 + 1 × 0.8 = 1.1
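The CPI arithmetic above generalizes to other prediction accuracies and branch frequencies. The following sketch assumes a 1-cycle misprediction penalty (branch resolved in ID) and a base CPI of 1 for everything else; the function names are chosen for this example.

```python
def branch_cpi(accuracy, penalty_cycles=1):
    """Average CPI of a branch: 1 cycle plus the penalty when mispredicted."""
    return 1 + (1 - accuracy) * penalty_cycles


def total_cpi(branch_fraction, cpi_branch, cpi_other=1.0):
    """Weighted average CPI over branches and all other instructions."""
    return branch_fraction * cpi_branch + (1 - branch_fraction) * cpi_other


cpi_b = branch_cpi(accuracy=0.5)
print(cpi_b)                              # 1.5
print(round(total_cpi(0.2, cpi_b), 2))    # 1.1
```

Raising `accuracy` toward 1.0 shows how quickly better prediction pulls the total CPI back toward 1.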

Datapath with Branch Prediction

[Figure: pipelined datapath with IF.Flush, hazard detection unit, control, and forwarding unit; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Example
Branch is taken but predicted not-taken
36  sub $10, $4, $8
40  beq $1, $3, 7
44  and $12, $2, $5
48  or  $13, $2, $6
52  add $14, $4, $2
56  slt $15, $6, $7
...
72  lw  $4, 50($7)    # 40 + 4 + 7 × 4 = 72
Example: Branch Taken
Solution 3: Delayed Branch

Fill slot after branch with an instruction that executes independent of the branch decision
Impact: 0 cycles per branch instruction if can find instruction to put in slot
Harder to fill for pipelines with more stages
Filling the Branch Delay Slot

add $5, $3, $7
add $9, $1, $3
sub $6, $1, $4
and $7, $8, $2
beq $6, $7, there
nop /* branch delay slot */
add $9, $1, $2
sub $2, $9, $5
...
there:
mult $2, $10, $11
Which instruction can be used to fill the delay slot?
add $9, $1, $3
Solution 4: More-Realistic Branch Prediction

Static branch prediction
Based on typical branch behavior
Example: loop and if-statement branches
Predict backward branches taken

Predict forward branches not taken
Dynamic branch prediction

Hardware measures actual branch behavior
e.g., Record recent history of each branch
Assume future behavior will continue the trend
When wrong, stall while re-fetching, and update history
Dynamic Branch Prediction

Analysis of the branch history -> Branch Prediction Buffer
Keep a list of the recent branch instructions' outcomes (taken/not taken)
Indexed by the low order bits of the branch instruction address
To execute a branch
Check table, expect the same outcome
Start fetching from fall-through or target
If wrong, flush pipeline and flip prediction
Limited precision, but it's only a prediction
1-Bit Predictor: Shortcoming

Assume 1st prediction is taken
How many times will the prediction be wrong for the inner loop?
Inner loop branches mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around
for (j = 0; j < 100; j++)
{
for (i = 0; i < 9; i++)
{
outer:
inner:
beq ..., ..., inner
beq ..., ..., outer
Improvement: 2-Bit Predictor

Only change prediction on 2 successive wrong predictions
Example
Assume the following 3 sequences of branch patterns
Assume initial predict taken for 1-bit predictor and strongly taken for 2-bit predictor
What is the accuracy of each?
Pattern   1-bit   2-bit
TTTTN
TTNTN
NTNTN
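One way to work through these patterns is to simulate both predictors. The sketch below assumes the 2-bit predictor is a saturating counter (states 0 to 3, predict taken when the state is 2 or 3), which is one common realization and may differ in detail from the state diagram on the slide; the function names are made up for this example.

```python
def one_bit_accuracy(pattern, predict_taken=True):
    """1-bit predictor: always predict whatever the branch did last time."""
    correct = 0
    for outcome in pattern:
        taken = outcome == "T"
        correct += predict_taken == taken
        predict_taken = taken
    return correct / len(pattern)


def two_bit_accuracy(pattern, state=3):
    """2-bit saturating counter; state 3 is 'strongly taken' (assumption)."""
    correct = 0
    for outcome in pattern:
        taken = outcome == "T"
        correct += (state >= 2) == taken
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(pattern)


for pattern in ("TTTTN", "TTNTN", "NTNTN"):
    print(pattern, one_bit_accuracy(pattern), two_bit_accuracy(pattern))
```

Under these assumptions the 2-bit predictor is never worse than the 1-bit predictor on the three patterns, because a single not-taken outcome does not flip its prediction.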
Branch Prediction
Latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and even AI techniques
Use patterns of branches (local history) and recent other branch history (global history) to make predictions
Pipelining -- Recap
Pipelining focuses on improving instruction throughput, not individual instruction latency
Data hazards can be handled by hardware or software but most modern processors have hardware support for stalling and forwarding
Control hazards can be handled by hardware or software but most modern processors use Branch Target Buffers and advanced dynamic branch prediction to reduce hazards
Exec Time = IC × CPI × CT
Example
Consider the following times per stage for MIPS 5-stage pipeline processor:
IF = 200ps, ID = 100ps, EX = 100ps, M = 200ps, WB = 100ps
Consider splitting IF and M into 2 stages each
IF1 IF2 and M1 M2
Most frequent code run (assume branch taken most of the time):
Loop: lw r1, 0(r2)
add r2, r1, r4
sub r5, r1, r2
beq r5, $zero, Loop
Assume pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID. What is the impact of 7-stage pipeline vs. 5-stage MIPS pipeline on CPI and CT?
Exceptions and Interrupts

Unexpected events that change the normal flow of
instruction execution
Different ISAs use the terms differently
Exception
Arises within the CPU
e.g., undefined opcode, overflow,
Interrupt
From an external I/O controller
Dealing with them without sacrificing performance is hard
Detecting and taking appropriate action is often on critical path

Handling Exceptions
In MIPS, exceptions are managed by a System Control Coprocessor (CP0)
Save the address of offending (or interrupted) instruction
In MIPS: Exception Program Counter (EPC)
Save indication of the problem
In MIPS: Cause register
We'll assume 1-bit
0 for undefined opcode, 1 for overflow
Jump to exception handler at 8000 0180
An Alternate Mechanism
Vectored Interrupts
Handler address determined by the cause of exception
Example:
Undefined opcode: C000 0000
Overflow: C000 0020
...: C000 0040
Instructions either
Deal with the interrupt, or
Jump to real handler
Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
Take corrective action
Use EPC to return to program
Otherwise
Terminate program
Report error using EPC, cause,
Pipeline with Exceptions
Exception Properties
Restartable exceptions
Pipeline flushes the offending instruction
What about instructions before and after it?
Handler executes, then returns to the instruction
Refetched and executed from scratch
PC saved in EPC register

Identifies causing instruction
Actually PC + 4 is saved
Handler must adjust
Exception Example
Exception on add in
40  sub $11, $2, $4
44  and $12, $2, $5
48  or  $13, $2, $6
4C  add $1, $2, $1
50  slt $15, $6, $7
54  lw  $16, 50($7)
Handler
80000180  sw $25, 1000($0)
80000184  sw $26, 1004($0)
Multiple Exceptions
Pipelining overlaps multiple instructions
Could have multiple exceptions at once
Simple approach: deal with exception from earliest instruction
Flush subsequent instructions
In complex pipelines
Multiple instructions issued per cycle
Out-of-order completion
Maintaining precise exceptions is difficult!
Final Datapath
Basic pipelined architecture
Forwarding
Hazard detection unit
Branch handling
Exception handling
Final Pipelined Datapath

[Figure: final pipelined datapath; labels include IF.Flush, ID.Flush, EX.Flush, hazard detection unit, control, forwarding unit, Cause and Except PC registers, the constant 40000040, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Advanced Pipelining
Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions in parallel
To increase amount of ILP
Deeper pipeline
Less work per stage -> shorter clock cycle
Multiple issue
Replicate pipeline stages -> multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)
E.g., 4GHz 4-way multiple-issue
16 BIPS, peak CPI = 0.25, peak IPC = 4
But dependencies reduce this in practice
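The peak figures above are simple products and ratios. This sketch restates them; the function names are illustrative, and `achieved_ipc` is just instructions divided by cycles for a schedule that actually ran.

```python
def peak_rates(clock_hz, issue_width):
    """Return (peak BIPS, peak CPI, peak IPC) for a multiple-issue processor."""
    peak_ipc = issue_width
    peak_cpi = 1 / issue_width
    peak_bips = clock_hz * issue_width / 1e9   # billions of instructions per second
    return peak_bips, peak_cpi, peak_ipc


def achieved_ipc(instructions, cycles):
    """IPC actually obtained by a given schedule."""
    return instructions / cycles


print(peak_rates(4e9, 4))    # (16.0, 0.25, 4)
print(achieved_ipc(5, 4))    # 1.25, e.g. 5 instructions in 4 cycles on a 2-issue machine
```

Comparing `achieved_ipc` with the peak IPC gives a quick measure of how much dependencies and unfilled issue slots cost in practice.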

Super Pipelining
Five-stage pipeline is a good start
Many designs include pipelines as long as 7, 10, or 20 stages
Intel Pentium III: 10 stages
Pentium 4: 20 stages
Prescott Pentium 4: 31 stages
Deeper pipelines let the processor clock run faster, but become less attractive as the added benefit shrinks
Pipeline register overheads play a role
Thermal wall/Power wall
Cannot increase clock rate
Multiple Issue
Static multiple issue
Compiler groups instructions to be issued together
Packages them into issue slots
Compiler detects and avoids hazards
Dynamic multiple issue
CPU examines instruction stream and chooses instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards using advanced techniques at runtime
Static Multiple Issue

Compiler groups instructions into issue packets
Group of instructions that can be issued on a single cycle
Determined by pipeline resources required
Think of an issue packet as a very long instruction
Specifies multiple concurrent operations
Very Long Instruction Word (VLIW)
Compiler must remove some/all hazards
Reorder instructions into issue packets
No dependencies within a packet
Pad with nop if necessary
Which Instructions Can This Do in Parallel?
Any two instructions?
Arithmetic and memory instruction?
Any instruction and memory instruction?
MIPS with Static Dual Issue

Two-issue packets
One ALU/branch instruction
One load/store instruction
64-bit aligned
ALU/branch, then load/store

Pad an unused instruction with nop
Address   Instruction type   Pipeline Stages
n         ALU/branch         IF  ID  EX  MEM WB
n + 4     Load/store         IF  ID  EX  MEM WB
n + 8     ALU/branch             IF  ID  EX  MEM WB
n + 12    Load/store             IF  ID  EX  MEM WB
n + 16    ALU/branch                 IF  ID  EX  MEM WB
n + 20    Load/store                 IF  ID  EX  MEM WB
Hazards in the Dual-Issue MIPS

More instructions executing in parallel
EX data hazard
Forwarding avoided stalls with single-issue
Now can't use ALU result in load/store in same packet
add $t0, $s0, $s1

load $s2, 0($t0)
Split into two packets, effectively a stall
Load-use hazard
Still one cycle use latency, but now two instructions
More aggressive scheduling required
Scheduling Example
Schedule on a static dual-issue pipeline for MIPS
Loop: lw   $t0, 0($s1)        # $t0=array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch $s1!=0
        ALU/branch               Load/store         cycle
Loop:   nop                      lw $t0, 0($s1)     1
        addi $s1, $s1, -4        nop                2
        addu $t0, $t0, $s2       nop                3
        bne  $s1, $zero, Loop    sw $t0, 4($s1)     4
IPC = 5/4 = 1.25 (cf. peak IPC = 2)
Loop Unrolling
Basic block: straight-line code sequence with no branches in except at entry and no branches out except at exit
Average 4 to 7 instructions execute between a pair of branches
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
Replicate loop body to expose more parallelism among different iterations
Reduce loop-control overhead
Loop Unrolling
During unrolling, compiler introduces additional registers or WAR dependences (name dependences)
2 instructions use same register or memory location, called a name, but no flow of data between the instructions associated with that name
Repeated instances but completely independent sequences despite using $t0
lw   $t0,0($s1)
addu $t0,$t0,$s2
sw   $t0,0($s1)
lw   $t0,4($s1)
addu $t0,$t0,$s2
sw   .....
Register renaming: use different registers per replication
Avoid loop-carried anti-dependencies
Store followed by a load of the same register
Loop Unrolling Example

lp: addi $s1,$s1,-16     # decrement pointer
    lw   $t0,0($s1)      # $t0=array element
    lw   $t1,4($s1)      # $t1=array element
    lw   $t2,8($s1)      # $t2=array element
    lw   $t3,12($s1)     # $t3=array element
    addu $t0,$t0,$s2     # add scalar in $s2
    addu $t1,$t1,$s2     # add scalar in $s2
    addu $t2,$t2,$s2     # add scalar in $s2
    addu $t3,$t3,$s2     # add scalar in $s2
    sw   $t0,0($s1)      # store result
    sw   $t1,4($s1)      # store result
    sw   $t2,8($s1)      # store result
    sw   $t3,12($s1)     # store result
    bne  $s1,$0,lp       # branch if $s1 != 0
Loop Unrolling Example

IPC = 14/8 = 1.75
Closer to 2, but at cost of registers and code size
        ALU/branch               Load/store          cycle
Loop:   addi $s1, $s1, -16       lw $t0, 0($s1)      1
        nop                      lw $t1, 12($s1)     2
        addu $t0, $t0, $s2       lw $t2, 8($s1)      3
        addu $t1, $t1, $s2       lw $t3, 4($s1)      4
        addu $t2, $t2, $s2       sw $t0, 16($s1)     5
        addu $t3, $t3, $s2       sw $t1, 12($s1)     6
        nop                      sw $t2, 8($s1)      7
        bne  $s1, $zero, Loop    sw $t3, 4($s1)      8
Dynamic Multiple Issue

Superscalar processors
CPU decides whether to issue 0, 1, 2, or more instructions each cycle
Avoiding structural and data hazards
Avoids the need for compiler scheduling
Though it may still help
Code semantics ensured by the CPU
Dynamic Pipeline Scheduling

Allow the CPU to execute instructions out of order to avoid stalls
But commit results to registers in order
Example
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
Can start sub while addu is waiting for lw
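The reasoning behind this example can be sketched as a readiness check: an instruction may start only if none of its source registers is still waiting on an earlier, unfinished instruction. The `ready_now` helper and the tuple layout below are hypothetical, and the model deliberately ignores WAR/WAW hazards and functional-unit limits.

```python
def ready_now(program, in_flight_dests):
    """Return the instructions whose source operands are all available.

    program: list of (text, dest, srcs) in program order, none started yet.
    in_flight_dests: registers still being produced by earlier instructions.
    """
    pending = set(in_flight_dests)
    ready = []
    for text, dest, srcs in program:
        if not pending.intersection(srcs):
            ready.append(text)
        # Whether or not it can start, its result is not available yet.
        pending.add(dest)
    return ready


waiting = [
    ("addu $t1, $t0, $t2", "$t1", ("$t0", "$t2")),
    ("sub $s4, $s4, $t3", "$s4", ("$s4", "$t3")),
    ("slti $t5, $s4, 20", "$t5", ("$s4",)),
]

# While lw is still loading $t0, only sub can start.
while_lw_pending = ready_now(waiting, {"$t0"})
```

In this model `addu` waits for `$t0` from the load and `slti` waits for `$s4` from `sub`, so `sub` is the only instruction that can begin early.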
Dynamically Scheduled CPU

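Figure: block diagram of a dynamically scheduled CPU, annotated as follows
In-order issue preserves dependencies
Reservation stations hold pending operands
Results are also sent to any waiting reservation stations
Reorder buffer for register writes
Can supply operands for issued instructions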
Dynamic Scheduling
Exploit instruction-level parallelism
To make programs behave as if they were running on a
simple in-order pipeline:
Issue instructions in order, which allows dependences to be tracked

Hardware chooses which instructions to execute next
Execute instructions out of order as long as correct flow of data is ensured
Commit in order
Often extended by speculating on branches to keep the
pipeline full
May need to roll back if the prediction is incorrect
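The "execute out of order, commit in order" rule can be illustrated with a toy reorder buffer. This is a minimal sketch under simplifying assumptions: the `ReorderBuffer` class and its method names are hypothetical, instructions are identified by plain string tags, and no register values are tracked.

```python
from collections import deque


class ReorderBuffer:
    """Toy model: entries are kept in issue order, commit only from the head."""

    def __init__(self):
        self._entries = deque()  # oldest instruction at the left
        self._done = set()       # tags that have finished executing

    def issue(self, tag):
        self._entries.append(tag)

    def complete(self, tag):
        self._done.add(tag)      # may happen in any order

    def commit(self):
        """Retire finished instructions, stopping at the first unfinished one."""
        committed = []
        while self._entries and self._entries[0] in self._done:
            tag = self._entries.popleft()
            self._done.discard(tag)
            committed.append(tag)
        return committed


rob = ReorderBuffer()
for tag in ["lw", "addu", "sub", "slti"]:
    rob.issue(tag)

rob.complete("sub")           # sub finishes first, out of order
after_sub = rob.commit()      # nothing retires: lw is still at the head
rob.complete("lw")
after_lw = rob.commit()       # lw retires; addu still blocks sub
rob.complete("addu")
after_addu = rob.commit()     # addu and the already-finished sub retire together
```

Even though `sub` finishes first, it cannot retire until `lw` and `addu` ahead of it have retired, so register updates appear in program order.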
Speculation
Guess what to do with an instruction
Start operation as soon as possible
Check whether guess was right
If so, complete the operation
If not, roll back and do the right thing
Examples
Speculate on branch outcome
Roll back if path taken is different
Speculate that a store preceding a load refers to a different address
Roll back if location is updated
Speculation
Common to static and dynamic multiple issue
Compiler can reorder instructions
e.g., move load before branch
Can include fix-up instructions to recover from incorrect guess
Hardware can look ahead for instructions to execute
Buffer results until it determines they are actually needed
Flush buffers on incorrect speculation
Out-of-order execution
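The hardware side of speculation, buffering results and flushing them on a wrong guess, can be sketched in a few lines. The `resolve_speculation` function and the dictionary model of the register file are assumptions for illustration only; real hardware tracks this per instruction in structures such as the reorder buffer.

```python
def resolve_speculation(registers, buffered_writes, guess_was_right):
    """Return the architectural register state once the guess is checked.

    registers: committed state before the speculative instructions ran.
    buffered_writes: results produced speculatively, not yet visible.
    """
    committed = dict(registers)
    if guess_was_right:
        committed.update(buffered_writes)  # speculative results become visible
    # On a wrong guess the buffered results are simply dropped (flushed).
    return committed


state = {"$t0": 1, "$t1": 2}
speculative = {"$t1": 99}
```

Because the speculative results never touch the committed state until the guess is confirmed, recovering from a wrong guess only requires discarding the buffer.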
Static or Dynamic Scheduling?

Why not just let the compiler schedule code?
Not all stalls are predictable, and some dependences are unknown at compile time
e.g., cache misses
Can't always schedule around branches
Branch outcome is dynamically determined
Different implementations of an ISA have different latencies and hazards
Dynamic scheduling allows code that was compiled for one pipeline to run
efficiently on a different pipeline
Power Efficiency
Microprocessor   Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
i486             1989  25MHz                                     No                               5W
Pentium          1993  66MHz                                     No                               10W
Pentium Pro      1997  200MHz      10                            Yes                              29W
P4 Willamette    2001  2000MHz     22                            Yes                              75W
P4 Prescott      2004  3600MHz     31                            Yes                              103W
Core             2006  2930MHz     14                            Yes                              75W
UltraSparc III   2003  1950MHz     14                            No                               90W
UltraSparc T1    2005  1200MHz                                   No                               70W
- Complexity of dynamic scheduling and speculation requires power

- Multiple simpler cores per chip may be better than deeper,
aggressively speculated ones
Fallacies
Pipelining is easy (!)
The basic idea is easy
The devil is in the details
e.g., Detecting data hazards
Pipelining is independent of technology

So why haven't we always done pipelining?
More transistors make more advanced techniques feasible
As transistor budgets continued to double and logic became much
faster than memory, multiple functional units and dynamic
pipelining made more sense
Today, power concerns are leading to less aggressive designs
Concluding Remarks
ISA influences design of datapath and control & vice versa
Pipelining improves throughput using parallelism
More instructions completed per second
Latency for each instruction not reduced
Limited by structural, data, and control hazards
Multiple issue and dynamic scheduling (ILP)
Dependencies limit achievable parallelism
Complexity leads to the power wall
Pipelining in today's most advanced processors is not
fundamentally different from the techniques we discussed
This class has given you the background you need to learn more!
Concluding Remarks
What does each technique help reduce: data hazard stalls,
control stalls, and/or CPI?

Technique                      Reduces
Dynamic scheduling             Data hazard stalls
Branch prediction              Control stalls
Multiple issue                 CPI
Speculation                    Data and control stalls
Loop unrolling                 Control hazard stalls
Compiler pipeline scheduling   Data hazard stalls