CHAPTER 4
Introduction
Single cycle data path
Run one instruction at a time, sequentially, in the datapath
Each instruction takes one cycle
Slow. Why?
How can we make it faster?
Run multiple instructions at the same time?
Pipelining
Agenda
What is Pipelining?
Pipelined Datapath
Pipelined Control
Hazards and Dependencies
Structural Hazards
Data Hazards
Control Hazards
Sequential Laundry
Sequential laundry takes 8 hours for 4 loads
[Figure: sequential laundry timeline from 6 PM to 2 AM; loads A, B, C, D run one after another in sixteen 30-minute slots; axes: Time, Task order]
CSUSM
CS 331
[Figure: laundry timeline starting at 6 PM with loads A, B, C, D overlapped in seven 30-minute slots; axes: Time, Task order]
Speedup
Pipelining Lessons
Pipelining does not reduce the latency of a single task; it improves throughput by letting multiple tasks use different resources at the same time
Potential max speedup = Number pipe stages
Pipeline rate limited by slowest pipeline stage
Unbalanced lengths of pipe stages reduce speedup
Time to fill pipeline or time to drain it reduces speedup
Stall for Dependences
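The lessons above can be checked with a little arithmetic. The sketch below assumes the laundry numbers from the earlier slide (4 loads, 8 hours sequentially, so 4 stages of 30 minutes each); the function names are illustrative only.

```python
def sequential_time(n_tasks, n_stages, stage_time):
    """Each task runs all of its stages before the next task starts."""
    return n_tasks * n_stages * stage_time


def pipelined_time(n_tasks, n_stages, stage_time):
    """The first task needs n_stages slots to fill the pipeline;
    every later task finishes one slot after the previous one."""
    return (n_stages + n_tasks - 1) * stage_time


def speedup(n_tasks, n_stages, stage_time=30):
    return (sequential_time(n_tasks, n_stages, stage_time)
            / pipelined_time(n_tasks, n_stages, stage_time))


if __name__ == "__main__":
    # 4 loads, 4 stages, 30 minutes per stage (assumed from the laundry slide)
    print(sequential_time(4, 4, 30), "min sequential")
    print(pipelined_time(4, 4, 30), "min pipelined")
    for n in (4, 100, 10_000):
        print(n, "loads: speedup", round(speedup(n, 4), 3))
```

With only 4 loads the fill and drain time keeps the speedup well below 4; as the number of loads grows, the speedup approaches the number of stages.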
MIPS Pipeline
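Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode and register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register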
Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200 ps        100 ps          200 ps   200 ps          100 ps           800 ps
sw         200 ps        100 ps          200 ps   200 ps                           700 ps
R-format   200 ps        100 ps          200 ps                   100 ps           600 ps
beq        200 ps        100 ps          200 ps                                    500 ps
Pipeline Performance
Single-cycle
(Tc= 800ps)
Pipelined
(Tc= 200ps)
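The two clock periods follow from the stage latencies: the single-cycle clock must fit the slowest instruction (lw, which uses every stage), while the pipelined clock must fit the slowest stage. The sketch below assumes the per-stage latencies 200, 100, 200, 200, 100 ps and uses illustrative names.

```python
# Stage latencies in picoseconds (assumed mapping onto the five stages)
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_period(stage_ps):
    # One long cycle must cover every stage of the slowest instruction.
    return sum(stage_ps.values())


def pipelined_period(stage_ps):
    # Every stage gets one cycle, so the slowest stage sets the clock.
    return max(stage_ps.values())


def single_cycle_time(n_instr, stage_ps):
    return n_instr * single_cycle_period(stage_ps)


def pipelined_time(n_instr, stage_ps):
    # len(stage_ps) cycles for the first instruction, one more per extra one.
    return (len(stage_ps) + n_instr - 1) * pipelined_period(stage_ps)


if __name__ == "__main__":
    for n in (3, 1_000_000):
        sc = single_cycle_time(n, STAGE_PS)
        pl = pipelined_time(n, STAGE_PS)
        print(n, "instructions:", sc, "ps vs", pl, "ps, speedup", round(sc / pl, 2))
```

Because the stages are unbalanced, the speedup for a long instruction stream approaches 800/200 = 4 rather than the stage count of 5.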
Pipeline Performance
Pipelining improves performance by increasing instruction
throughput
Pipeline Operation
Cycle-by-cycle flow of instructions through the pipelined
datapath
Multi-clock-cycle diagram
EX for Load
WB for Load
Did someone spot the error??
Wrong
register
number
EX for Store
WB for Store
Question
Yes/No
Yes
Yes
No
No
No
Pipelined Control
No information transfer from one pipeline stage to another
Pipeline Control
Recall what needs to be controlled in each stage:
Instruction Fetch and PC Increment: Identical for all instructions
Instruction Decode / Register Fetch: Identical for all instructions
Execution: RegDst, ALUOp, ALUSrc
Memory Stage: Branch, MemRead, MemWrite
Write Back: MemToReg, RegWrite
Write-back stage control lines
Instruction   RegWrite   MemtoReg
R-format      1          0
lw            1          1
sw            0          X
beq           0          X
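One way to picture the per-stage grouping is as a lookup table whose entries are peeled off as an instruction moves down the pipeline. In the sketch below the write-back values match the table above; the EX and MEM values are taken from the standard textbook MIPS control table rather than from these slides, and `remaining_control` is a hypothetical helper name.

```python
# Control lines grouped by the stage that consumes them.
# None stands for "X" (don't care).
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq":      {"EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
                 "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
}

STAGE_ORDER = ["EX", "MEM", "WB"]


def remaining_control(instr, stage):
    """Control groups still carried by the pipeline register feeding `stage`.

    ID/EX carries EX+MEM+WB, EX/MEM carries MEM+WB, MEM/WB carries WB only.
    """
    start = STAGE_ORDER.index(stage)
    return {s: CONTROL[instr][s] for s in STAGE_ORDER[start:]}


if __name__ == "__main__":
    for stage in STAGE_ORDER:
        print("lw entering", stage, "->", remaining_control("lw", stage))
```

All control bits are generated once in ID; each later stage uses its own group and passes the rest along.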
Pipelined Control
Control signals derived from instruction
As in single-cycle implementation
Pipelined Control
Example 1
Instruction sequence
lw   $10, 20($1)
sub  $11, $2, $3
and  $12, $4, $5
or   $13, $6, $7
add  $14, $8, $9
Example Pipeline - 1
Example Pipeline - 2
Example Pipeline - 3
Example Pipeline - 4
and $12, $4, $5
Example Pipeline - 5
Example Pipeline - 6
Example Pipeline - 7
Example Pipeline - 8
Example Pipeline - 9
Example 2
Instruction sequence
sub  $2, $1, $3
and  $12, $2, $5
or   $13, $6, $2
add  $14, $2, $2
sw   $15, 100($2)
Example 2: Dependencies
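The dependencies in Example 2 can be found mechanically by looking for a later instruction that reads a register an earlier one writes. The sketch below encodes the five instructions as (opcode, destination, sources) tuples; the encoding and function names are illustrative. It also assumes the register file writes in the first half of a cycle and reads in the second half, so only consumers one or two instructions behind the producer are true hazards.

```python
# Example 2 sequence as (opcode, destination register, source registers)
PROGRAM = [
    ("sub", "$2",  ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or",  "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw",  None,  ["$15", "$2"]),   # sw reads $15 (data) and $2 (base)
]


def raw_dependencies(program):
    """Return (producer index, consumer index, register) triples."""
    deps = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None or dest == "$0":
            continue
        for j in range(i + 1, len(program)):
            _, dest_j, srcs_j = program[j]
            if dest in srcs_j:
                deps.append((i, j, dest))
            if dest_j == dest:       # a later write ends this value's lifetime
                break
    return deps


def hazards_without_forwarding(deps, max_distance=2):
    # Assumption: register file write-then-read in the same cycle, so a
    # consumer three or more instructions later sees the new value.
    return [d for d in deps if d[1] - d[0] <= max_distance]


if __name__ == "__main__":
    deps = raw_dependencies(PROGRAM)
    print("dependencies:", deps)
    print("hazards:", hazards_without_forwarding(deps))
```

All four later instructions depend on $2, but under the stated assumption only `and` and `or` read it too early.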
Hazards Continued
What happens when...
add $3, $10, $11
lw $8, 1000($3)
sub $11, $8, $7
MIPS Pipelining
Pipelining improves performance
Hazards prevent the next instruction from starting in the next cycle
Structural hazards
Attempt to use the same resource two different ways at the same time
e.g. the memory
Data hazards
Control hazards
Structural Hazards
What happens if we
had unified instruction
and data memory?
Structural Hazard
Structural Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
Load/store requires data access
Instruction fetch would have to stall for that cycle
Hence, pipelined datapaths require separate instruction and data memories
previous instruction
add
sub
2 stalls
sub  $2, $1, $3
nop
nop
and  $12, $2, $5
or   $13, $6, $2
add  $14, $2, $2
sw   $15, 100($2)
How do we detect
when to forward?
[Figure: multi-clock-cycle pipeline diagram, CC 2 to CC 9, for sub $2, $1, $3; or $13, $6, $2; sw $15, 100($2), with stages IM, Reg, DM, Reg and rows tracking the value of register $2 and of the pipeline registers in each cycle]
Forwarding Paths
Forwarding Conditions
EX hazard
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
MEM hazard
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
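The conditions above translate almost line for line into code. The following is a minimal sketch, not the course's reference implementation: `PipeReg` and `forwarding_unit` are hypothetical names, and the EX hazard is checked last so that it overrides the MEM hazard when both match (the more recent result wins), a refinement the conditions as written on the slide do not spell out.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """The two fields of EX/MEM or MEM/WB that the forwarding unit looks at."""
    reg_write: bool = False
    rd: int = 0


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) as 2-bit mux selects."""
    forward_a = forward_b = 0b00

    # MEM hazard: result is sitting in MEM/WB
    if mem_wb.reg_write and mem_wb.rd != 0:
        if mem_wb.rd == id_ex_rs:
            forward_a = 0b01
        if mem_wb.rd == id_ex_rt:
            forward_b = 0b01

    # EX hazard: result is sitting in EX/MEM (assumed to take priority)
    if ex_mem.reg_write and ex_mem.rd != 0:
        if ex_mem.rd == id_ex_rs:
            forward_a = 0b10
        if ex_mem.rd == id_ex_rt:
            forward_b = 0b10

    return forward_a, forward_b


if __name__ == "__main__":
    # and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM
    print(forwarding_unit(2, 5, PipeReg(True, 2), PipeReg()))
    # or $13, $6, $2 in EX, and ($12) in MEM, sub ($2) in WB
    print(forwarding_unit(6, 2, PipeReg(True, 12), PipeReg(True, 2)))
```

The `rd != 0` test keeps writes to $0 from being forwarded, since $0 always reads as zero.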
Data Hazards
Data hazards
Attempt to use item before it is ready
Example: an instruction depends on a previous instruction
Working around Data Hazards
Hardware Stalling
Software Solution (nops)
Hardware Forwarding
Code Scheduling
Stall inserted
here
Or, more
accurately
Selection Changes
A
1, 3, 4
1, 2, 3
2, 3, 4
None of the
above
9 cycles
3 forwards
lw   $t1, 0($t0)
lw   $t2, 4($t0)
stall
add  $t3, $t1, $t2
sw   $t3, 12($t0)
lw   $t4, 8($t0)
stall
add  $t5, $t1, $t4
sw   $t5, 16($t0)
13 cycles

lw   $t1, 0($t0)
lw   $t2, 4($t0)
lw   $t4, 8($t0)
add  $t3, $t1, $t2
sw   $t3, 12($t0)
add  $t5, $t1, $t4
sw   $t5, 16($t0)
11 cycles
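The 13-cycle and 11-cycle totals can be reproduced with a simple count: five cycles for the first instruction, one more for each following instruction, plus one stall for every load whose result is used by the very next instruction. The sketch below assumes a five-stage pipeline with full forwarding; the tuple encoding and function name are illustrative.

```python
def count_cycles(program, n_stages=5):
    """Return (total cycles, load-use stalls) for a straight-line sequence.

    Assumes full forwarding, so the only stalls are load-use stalls:
    an lw immediately followed by an instruction that reads its result.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return n_stages + len(program) - 1 + stalls, stalls


# (opcode, destination, sources); sw has no destination register
ORIGINAL = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw",  None,  ["$t5", "$t0"]),
]

# Same instructions with the third load hoisted above the first add
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[4],
             ORIGINAL[2], ORIGINAL[3], ORIGINAL[5], ORIGINAL[6]]

if __name__ == "__main__":
    print("original: ", count_cycles(ORIGINAL))
    print("reordered:", count_cycles(REORDERED))
```

Moving the third load earlier puts an independent instruction between each load and its first use, which removes both stalls.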
$t2, 20($t0)
$t1, 4($t2)
9 cycles
Move sw behind lw
Example
Suppose EX is the longest (in time) pipeline stage. To
and expensive
Pipelining Recap
Pipelining improves performance by increasing instruction
throughput
Control Hazards
Current design for branch instruction
is made
instruction
Control Hazards
MIPS 5-stage pipeline:
[Figure: multi-clock-cycle diagram, CC2 to CC8, with stages IM, Reg, DM, Reg, in which three bubbles are inserted]
Control Hazards
Control (or branch) hazards arise because we must fetch
[Figure: multi-clock-cycle diagram, CC1 to CC8, of successive instructions (add ..., sw ...) flowing through IM, Reg, DM, Reg]
[Figure: multi-clock-cycle diagram, CC1 to CC8, starting with add ...; the instructions fetched after it are marked Flush. Flush these instructions if prediction is wrong]
[Figures: pipeline stage diagrams (IF, ID, EX, MEM, WB) for a branch that depends on earlier instructions. When beq $1, $4, target follows lw $1, addr, the beq is shown stalled in ID until the loaded value is available]
[Figure: pipelined datapath showing PC, instruction memory, registers, sign extend, shift left 2, ALU, data memory, control, the IF/ID, EX/MEM and MEM/WB pipeline registers, and the forwarding unit]
Example
Branch is taken but predicted not-taken
36  sub  $10, $4, $8
40  beq  $1, $3, 7
44  and  $12, $2, $5
48  or   $13, $2, $6
52  add  $14, $4, $2
56  slt  $15, $6, $7
    ...
72  lw   $4, 50($7)     # 40 + 4 + 7 × 4 = 72
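The target address in this example is the address of the instruction after the branch plus the word offset scaled to bytes. A minimal sketch, with `branch_target` as an illustrative name:

```python
def branch_target(branch_pc, word_offset):
    """MIPS PC-relative branch target: (PC + 4) + offset * 4."""
    return branch_pc + 4 + (word_offset << 2)


if __name__ == "__main__":
    # beq at address 40 with an offset field of 7
    print(branch_target(40, 7))
```

The shift by 2 corresponds to the "shift left 2" block in the datapath: the offset counts instructions, not bytes.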
more stages
Which instruction
can be used to fill
the delay slot?
add $9,$1,$3
taken
mispredicted twice!
outer:
inner:
beq , , inner
beq , , outer
Example
Assume the following 3 sequences of branch patterns
Assume initial predict taken for 1-bit predictor and strongly
1-bit
2-bit
TTTTN
TTNTN
NTNTN
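A short simulation makes it easy to count mispredictions for the three patterns. This is a sketch under stated assumptions: the 2-bit predictor is modelled as a saturating counter (the course's state diagram may differ), and because the slide's initial 2-bit state is cut off, the starting counter is a parameter that defaults to strongly taken. Function names are illustrative.

```python
def mispredictions_1bit(pattern, initial_taken=True):
    """1-bit predictor: predict whatever the branch did last time."""
    state = initial_taken
    misses = 0
    for outcome in pattern:
        taken = outcome == "T"
        if taken != state:
            misses += 1
        state = taken
    return misses


def mispredictions_2bit(pattern, initial_counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken.

    initial_counter=3 (strongly taken) is an assumption.
    """
    counter = initial_counter
    misses = 0
    for outcome in pattern:
        taken = outcome == "T"
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


if __name__ == "__main__":
    for pattern in ("TTTTN", "TTNTN", "NTNTN"):
        print(pattern,
              "1-bit:", mispredictions_1bit(pattern),
              "2-bit:", mispredictions_2bit(pattern))
```

Under these assumptions the 2-bit predictor never does worse than the 1-bit predictor on the three patterns, because a single not-taken outcome does not flip its prediction.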
Branch Prediction
Latest branch predictors are significantly more
Pipelining -- Recap
Pipelining focuses on improving instruction throughput,
Example
Consider the following times per stage for MIPS 5-stage pipeline
processor:
not taken, and resolves branches in ID. What is the impact of a 7-stage pipeline vs. the 5-stage MIPS pipeline on CPI and CT?
instruction execution
Exception
Interrupt
Handling Exceptions
In MIPS, exceptions managed by a System Control
Coprocessor (CP0)
Save the address of offending (or interrupted) instruction
An Alternate Mechanism
Vectored Interrupts
Handler address determined by the cause of exception
Example:
Undefined opcode: C000 0000
Overflow: C000 0020
…: C000 0040
Instructions either
Deal with the interrupt, or
Jump to real handler
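With vectored interrupts the cause selects the handler address directly. The sketch below uses only the two named entries from the example; the dictionary keys and function name are illustrative, not part of any real MIPS toolchain.

```python
# Vector table from the example (addresses in hex)
HANDLER_VECTORS = {
    "undefined_opcode": 0xC0000000,
    "overflow":         0xC0000020,
}

# Consecutive vectors are 0x20 bytes apart; with 4-byte MIPS instructions
# that leaves room for 8 instructions per slot.
SLOT_BYTES = HANDLER_VECTORS["overflow"] - HANDLER_VECTORS["undefined_opcode"]
SLOT_INSTRUCTIONS = SLOT_BYTES // 4


def handler_address(cause):
    """Return the address the PC is set to for the given cause."""
    return HANDLER_VECTORS[cause]


if __name__ == "__main__":
    print(hex(handler_address("overflow")))
    print(SLOT_INSTRUCTIONS, "instructions fit in each slot")
```

The small slot size is why the code at each vector either handles the event in a few instructions or jumps to the real handler.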
Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
Take corrective action
Use EPC to return to program
Otherwise
Terminate program
Report error using EPC, cause,
Exception Properties
Restartable exceptions
Pipeline flushes the offending instruction
Exception Example
Exception on add in
40        sub  $11, $2, $4
44        and  $12, $2, $5
48        or   $13, $2, $6
4C        add  $1, $2, $1
50        slt  $15, $6, $7
54        lw   $16, 50($7)
Handler
80000180  sw   $25, 1000($0)
80000184  sw   $26, 1004($0)
Exception Example
Exception Example
Multiple Exceptions
Pipelining overlaps multiple instructions
instruction
In complex pipelines
Out-of-order completion
Final Datapath
Basic pipelined architecture
Forwarding
Hazard detection unit
Branch handling
Exception handling
[Figure: final pipelined datapath, including the hazard detection unit, forwarding unit, IF.Flush, ID.Flush and EX.Flush signals, the Cause and Except PC registers, and the constant 40000040]
Advanced Pipelining
Deeper pipeline
Multiple issue
Super Pipelining
Five-stage pipeline is a good start
Many designs include pipelines as long as 7, 10, or 20 stages
Pentium 4: 20 stages
Deeper pipelines let the processor clock run faster, but may
Multiple Issue
Static multiple issue
Compiler groups instructions
to be issued together
Packages them into issue
slots
Compiler detects and avoids
hazards
Address   Instruction type   Pipeline stages
n         ALU/branch         IF  ID  EX  MEM  WB
n + 4     Load/store         IF  ID  EX  MEM  WB
n + 8     ALU/branch             IF  ID  EX  MEM  WB
n + 12    Load/store             IF  ID  EX  MEM  WB
n + 16    ALU/branch                 IF  ID  EX  MEM  WB
n + 20    Load/store                 IF  ID  EX  MEM  WB
Load-use hazard
Still one cycle use latency, but now two instructions
More aggressive scheduling required
Scheduling Example
Schedule on a static dual-issue pipeline for MIPS
Loop: lw   $t0, 0($s1)         # $t0=array element
      addu $t0, $t0, $s2       # add scalar in $s2
      sw   $t0, 0($s1)         # store result
      addi $s1, $s1, -4        # decrement pointer
      bne  $s1, $zero, Loop    # branch $s1!=0

        ALU/branch               Load/store          cycle
Loop:   nop                      lw  $t0, 0($s1)     1
        addi $s1, $s1, -4        nop                 2
        addu $t0, $t0, $s2       nop                 3
        bne  $s1, $zero, Loop    sw  $t0, 4($s1)     4
Loop Unrolling
Basic block: straight-line code sequence with no branches
Loop Unrolling
During unrolling, compiler introduces additional registers
lp:   addi $s1,$s1,-16      # decrement pointer
      lw   $t0,0($s1)       # $t0=array element
      lw   $t1,4($s1)       # $t1=array element
      lw   $t2,8($s1)       # $t2=array element
      lw   $t3,12($s1)      # $t3=array element
      addu $t0,$t0,$s2      # add scalar in $s2
      addu $t1,$t1,$s2      # add scalar in $s2
      addu $t2,$t2,$s2      # add scalar in $s2
      addu $t3,$t3,$s2      # add scalar in $s2
      sw   $t0,0($s1)       # store result
      sw   $t1,4($s1)       # store result
      sw   $t2,8($s1)       # store result
      sw   $t3,12($s1)      # store result
      bne  $s1,$0,lp        # branch if $s1 != 0
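The same transformation can be shown at a higher level. The sketch below is an illustration in Python rather than MIPS, with hypothetical function names: the unrolled version handles four elements per iteration in separate temporaries (the counterpart of $t0 to $t3), so the four additions are independent and the loop overhead is paid once per four elements.

```python
def add_scalar_rolled(values, scalar):
    """One element per iteration, walking from the end of the array."""
    out = list(values)
    i = len(out)
    while i != 0:
        i -= 1
        out[i] += scalar
    return out


def add_scalar_unrolled4(values, scalar):
    """Four elements per iteration; length must be a multiple of 4."""
    out = list(values)
    if len(out) % 4 != 0:
        raise ValueError("length must be a multiple of 4 for this sketch")
    i = len(out)
    while i != 0:
        i -= 4                                   # one pointer update per 4 elements
        t0, t1, t2, t3 = out[i], out[i + 1], out[i + 2], out[i + 3]
        t0 += scalar                             # four independent adds,
        t1 += scalar                             # each in its own temporary
        t2 += scalar
        t3 += scalar
        out[i], out[i + 1], out[i + 2], out[i + 3] = t0, t1, t2, t3
    return out


if __name__ == "__main__":
    data = list(range(8))
    print(add_scalar_rolled(data, 10))
    print(add_scalar_unrolled4(data, 10))
```

Using distinct temporaries mirrors the register renaming the compiler performs during unrolling: reusing a single temporary would create name dependences between the four copies.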
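        ALU/branch               Load/store           cycle
Loop:   addi $s1, $s1, -16       lw  $t0, 0($s1)      1
        nop                      lw  $t1, 12($s1)     2
        addu $t0, $t0, $s2       lw  $t2, 8($s1)      3
        addu $t1, $t1, $s2       lw  $t3, 4($s1)      4
        addu $t2, $t2, $s2       sw  $t0, 16($s1)     5
        addu $t3, $t3, $s2       sw  $t1, 12($s1)     6
        nop                      sw  $t2, 8($s1)      7
        bne  $s1, $zero, Loop    sw  $t3, 4($s1)      8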
stalls
Example
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
Can start sub while addu is waiting for lw
Hold pending
operands
Can supply
operands for
issued instructions
Dynamic Scheduling
Exploit instruction-level parallelism
To make programs behave as if they were running on
Commit in order
pipeline full
Speculation
Guess what to do with an instruction
Start operation as soon as possible
Check whether guess was right
Examples
Speculate on branch outcome
Speculation
Common to static and dynamic multiple issue
Compiler can reorder instructions
e.g., move load before branch
Can include fix-up instructions to recover from incorrect guess
Hardware can look ahead for instructions to execute
Buffer results until it determines they are actually needed
Flush buffers on incorrect speculation
Out-of-order execution
Not all stalls are predictable; some dependences are unknown at compile time
Dynamic scheduling allows code that was compiled for one pipeline to run efficiently on a different pipeline
Power Efficiency
Microprocessor   Year   Clock Rate   Pipeline Stages   Issue width   Out-of-order/Speculation   Cores   Power
i486             1989   25MHz                                        No                                 5W
Pentium          1993   66MHz                                        No                                 10W
Pentium Pro      1997   200MHz       10                              Yes                                29W
P4 Willamette    2001   2000MHz      22                              Yes                                75W
P4 Prescott      2004   3600MHz      31                              Yes                                103W
Core             2006   2930MHz      14                              Yes                                75W
UltraSparc III   2003   1950MHz      14                              No                                 90W
UltraSparc T1    2005   1200MHz                                      No                                 70W
Fallacies
Pipelining is easy (!)
The basic idea is easy
The devil is in the details
Concluding Remarks
ISA influences design of datapath and control & vice versa
Pipelining improves throughput using parallelism
More instructions completed per second
Latency for each instruction not reduced
Limited by structural, data, and control hazards
Multiple issue and dynamic scheduling (ILP)
Dependencies limit achievable parallelism
Complexity leads to the power wall
Pipelining in today's most advanced processors is not
This class has given you the background you need to learn more!
Concluding Remarks
What does every technique help reduce: data hazard stalls,
Reduces
Dynamic scheduling
Branch prediction
Control stalls
Multiple Issue
CPI
Speculation
Loop unrolling