Review
Kai Bu
kaibu@zju.edu.cn
http://list.zju.edu.cn/kaibu/comparch
Appendix C
Lectures 4-6
Pipelining
Outline
What's Pipelining
How Pipelining Works
Pipeline Hazards
Pipeline with Multicycle FP Operations
Laundry Example
Ann, Brian, Cathy, Dave
Each has one load of clothes to wash, dry, and fold.
Washer: 30 mins
Dryer: 40 mins
Folder: 20 mins
Sequential Laundry
Total: 6 hours
Time (mins): 30 40 20 | 30 40 20 | 30 40 20 | 30 40 20
Task order: A, B, C, D
What would you do?
Pipelined Laundry
Total: 3.5 hours
Time (mins): 30 40 40 40 40 20
Task order: A, B, C, D
Observations
No speedup for an individual task;
e.g., A still takes 30 + 40 + 20 = 90 mins.
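The two totals can be checked with a short calculation. The sketch below is illustrative only: `STAGE_MINUTES`, `sequential_minutes`, and `pipelined_minutes` are names made up for this example, and it assumes each machine handles one load at a time and a load may wait between machines.

```python
# Stage times from the laundry example, in minutes (washer, dryer, folder).
STAGE_MINUTES = [30, 40, 20]


def sequential_minutes(num_loads, stage_minutes=STAGE_MINUTES):
    """Each load finishes every stage before the next load starts."""
    return num_loads * sum(stage_minutes)


def pipelined_minutes(num_loads, stage_minutes=STAGE_MINUTES):
    """A load enters a stage once it has left the previous stage and the
    machine is free (assumption: loads may wait between stages)."""
    stage_free_at = [0] * len(stage_minutes)  # when each machine is next free
    finish = 0
    for _ in range(num_loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            ready = start + minutes
            stage_free_at[s] = ready
        finish = ready
    return finish


print(sequential_minutes(4) / 60)  # hours for four loads done one after another
print(pipelined_minutes(4) / 60)   # hours when the loads overlap
```

With one load the two functions agree (90 mins), which matches the observation that pipelining does not speed up an individual task.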
Assembly Line
Cola
Auto
Pipelining
An implementation technique whereby
multiple instructions are overlapped in
execution.
e.g., B washes while A dries
Essence: Start executing one
instruction before completing the
previous one.
Significance: Make fast CPUs.
Balanced Pipeline
Equal-length pipe stages
e.g., wash, dry, and fold each take 40 mins;
unpipelined time per load of laundry = 40 x 3 mins;
3 pipe stages: wash, dry, fold.
[Figure: loads A to D moving through the 40-min stages in time slots T1 to T4]
Balanced Pipeline
Equal-length pipe stages
Performance: one task/instruction completes every 40 mins
[Figure: loads A to D moving through the 40-min stages in time slots T1 to T4]
Speedup from pipelining = number of pipe stages
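The speedup claim is an ideal limit, which a short calculation makes visible. This is a sketch under stated assumptions: the stages are perfectly balanced, there are no stalls, and the function names are invented for this example.

```python
def unpipelined_time(num_tasks, num_stages, stage_time):
    """Every task runs all stages before the next task starts."""
    return num_tasks * num_stages * stage_time


def pipelined_time(num_tasks, num_stages, stage_time):
    """The first task needs num_stages steps to drain through the pipe;
    each later task completes one step after the task before it."""
    return (num_stages + num_tasks - 1) * stage_time


def speedup(num_tasks, num_stages, stage_time):
    return (unpipelined_time(num_tasks, num_stages, stage_time)
            / pipelined_time(num_tasks, num_stages, stage_time))


print(speedup(4, 3, 40))     # four loads of laundry, three 40-min stages
print(speedup(1000, 3, 40))  # many tasks: approaches the number of stages
```

For a small number of tasks the fill time of the pipeline dominates, so the speedup stays below the number of stages; it approaches that number only as the task count grows.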
Pipelining Terminology
Latency: the time for an instruction to
complete.
Throughput of a CPU: the number of
instructions completed per second.
Clock cycle: everything in CPU moves in
lockstep; synchronized by the clock.
Processor Cycle: time required between
moving an instruction one step down the
pipeline;
= time required to complete a pipe stage;
= max(times for completing all stages);
= one or two clock cycles, but rarely more.
CPI: clock cycles per instruction
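These terms can be related numerically. The sketch below reuses the laundry stage times as stand-ins for pipe-stage delays and assumes every stage is stretched to the processor cycle so that the pipeline moves in lockstep; the function names are hypothetical.

```python
def processor_cycle(stage_times):
    """The slowest stage sets the pace for every stage."""
    return max(stage_times)


def pipelined_latency(stage_times):
    """Time for one instruction to pass through all stages in lockstep."""
    return len(stage_times) * processor_cycle(stage_times)


def throughput(stage_times):
    """Instructions completed per unit time once the pipeline is full."""
    return 1 / processor_cycle(stage_times)


stages = [30, 40, 20]
print(processor_cycle(stages))    # slowest stage
print(pipelined_latency(stages))  # latency under lockstep
print(throughput(stages))         # completions per time unit
```

Under the lockstep assumption the latency of a single instruction can be longer than the sum of its raw stage times; pipelining improves throughput, not latency.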
Outline
What's Pipelining
How Pipelining Works
Pipeline Hazards
Pipeline with Multicycle FP Operations
[Figure: five-stage pipeline. IF: instruction fetch; ID: register read; MEM: data memory access; WB: register write. Within one clock cycle, the register write happens before the register read.]
RISC:
3 classes of instructions - 1
ALU (Arithmetic Logic Unit) instructions
operate on two regs or a reg + a sign-extended immediate;
store the result into a third reg;
e.g., add (DADD), subtract (DSUB)
logical operations AND, OR
RISC:
3 classes of instructions - 2
Load (LD) and store (SD) instructions
operands: base register + offset;
the sum (called effective address) is used as
a memory address;
Load: use a second reg operand as the
destination for the data loaded from
memory;
Store: use a second reg operand as the
source of the data stored into memory.
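The effective-address computation can be sketched in a few lines. This is an illustration, not a reference model: it assumes a 16-bit offset field (as in the MIPS encoding used later in these slides) and 64-bit address arithmetic, and the function names are made up for the example.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base, offset16):
    """Base register contents plus sign-extended offset.
    The 64-bit wrap-around is an assumption of this sketch."""
    return (base + sign_extend_16(offset16)) & 0xFFFFFFFFFFFFFFFF


print(hex(effective_address(0x1000, 12)))      # positive offset
print(hex(effective_address(0x1000, 0xFFFC)))  # 0xFFFC encodes -4
```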
RISC:
3 classes of instructions - 3
Branches and jumps
conditional transfers of control;
Branch:
specify the branch condition with a set of
condition bits or comparisons between two
regs or between a reg and zero;
decide the branch destination by adding a
sign-extended offset to the current PC
(program counter);
MIPS Instruction
at most 5 clock cycles per instruction
IF ID EX MEM WB
MIPS Instruction
IF
IR ← Mem[PC];
NPC ← PC + 4;
MIPS Instruction
ID
A ← Regs[rs];
B ← Regs[rt];
Imm ← sign-extended immediate field of IR (lower 16 bits)
MIPS Instruction
EX
Memory reference: ALUOutput ← A + Imm;
Register-register ALU: ALUOutput ← A func B;
Register-immediate ALU: ALUOutput ← A op Imm;
Branch: ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0);
MIPS Instruction
MEM
Load: LMD ← Mem[ALUOutput];
Store: Mem[ALUOutput] ← B;
Branch: if (cond) PC ← ALUOutput;
MIPS Instruction
WB
Register-register ALU: Regs[rd] ← ALUOutput;
Register-immediate ALU: Regs[rt] ← ALUOutput;
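The five stages can be traced with a small unpipelined interpreter. This is a sketch, not the lecture's implementation: instructions are passed in already decoded as dictionaries (a hypothetical format with keys `kind`, `rs`, `rt`, `rd`, `imm`, `func`), memory is a plain dictionary, and the load write-back `Regs[rt] ← LMD` is filled in as the usual behavior for this pipeline even though the stage description does not list it.

```python
def sign_extend_16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_instruction(instr, pc, regs, mem):
    """Run one decoded instruction through IF, ID, EX, MEM, WB.
    Returns the next PC. The dict format is an assumption of this sketch."""
    # IF: the instruction is passed in decoded, so only NPC is computed here.
    npc = pc + 4

    # ID: read both register operands and sign-extend the immediate.
    a = regs[instr.get("rs", 0)]
    b = regs[instr.get("rt", 0)]
    imm = sign_extend_16(instr.get("imm", 0))

    # EX: one ALU operation, chosen by instruction class.
    kind = instr["kind"]
    cond = False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "alu_rr":
        alu_output = instr["func"](a, b)
    elif kind == "alu_ri":
        alu_output = instr["func"](a, imm)
    elif kind == "branch":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")

    # MEM: data memory access, or PC update for a taken branch.
    lmd = None
    if kind == "load":
        lmd = mem[alu_output]
    elif kind == "store":
        mem[alu_output] = b
    next_pc = alu_output if (kind == "branch" and cond) else npc

    # WB: write the result into the register file.
    if kind == "alu_rr":
        regs[instr["rd"]] = alu_output
    elif kind == "alu_ri":
        regs[instr["rt"]] = alu_output
    elif kind == "load":
        regs[instr["rt"]] = lmd  # not listed in the stage description
    return next_pc


regs = [0] * 32
regs[2] = 100
mem = {108: 7}
pc = run_instruction({"kind": "load", "rs": 2, "rt": 1, "imm": 8}, 0, regs, mem)
print(pc, regs[1])
```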
[Figure slides: Load, Store, Register-Register ALU, Register-Immediate ALU, Branch]
Outline
What's Pipelining
How Pipelining Works
Pipeline Hazards
Pipeline with Multicycle FP Operations
LD R1, 0(R2)
Structural Hazard
Example: 1 mem port → mem conflict
Load in MEM (data access) vs. Instr i+3 in IF (instr fetch)
[Figure: Load, Instr i+1, Instr i+2, Instr i+3 in successive clock cycles]
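The memory-port conflict can be found mechanically by lining up stages against clock cycles. The sketch below assumes the five-stage pipeline with one instruction issued per cycle and a single memory port shared by instruction fetch and data access; the tuple format and function names are invented for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_in_cycle(issue_cycle, cycle):
    """Stage occupied in the given cycle, or None if not in the pipeline."""
    index = cycle - issue_cycle
    return STAGES[index] if 0 <= index < len(STAGES) else None


def memory_conflicts(instrs):
    """instrs: list of (name, issue_cycle, accesses_data_memory).
    Returns (cycle, instr_in_MEM, instr_in_IF) for every clash on the port."""
    conflicts = []
    last_cycle = max(issue for _, issue, _ in instrs) + len(STAGES)
    for cycle in range(1, last_cycle):
        in_mem = [name for name, issue, uses_mem in instrs
                  if uses_mem and stage_in_cycle(issue, cycle) == "MEM"]
        in_if = [name for name, issue, _ in instrs
                 if stage_in_cycle(issue, cycle) == "IF"]
        conflicts.extend((cycle, m, f) for m in in_mem for f in in_if)
    return conflicts


program = [
    ("Load", 1, True),
    ("Instr i+1", 2, False),
    ("Instr i+2", 3, False),
    ("Instr i+3", 4, False),
]
print(memory_conflicts(program))
```

The load reaches MEM in cycle 4, the same cycle in which Instr i+3 needs the memory port for its fetch.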
Data Hazard
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9 (no hazard)
XOR ..., R1, ...
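Which instructions see a stale R1 can be worked out from their distance to the producer. A minimal sketch, assuming no forwarding and a register file that is written in the first half of a cycle and read in the second half, so that only the two instructions right behind the producer are affected. The tuple format is hypothetical, and the XOR is left out because its operands are not fully legible in the slide.

```python
def raw_hazards(program):
    """program: list of (opcode, dest, sources) in program order.
    Returns (producer, consumer, register) for each read-after-write hazard.
    Assumes no forwarding and write-before-read within a cycle, so a consumer
    three or more slots behind the producer reads the updated value."""
    hazards = []
    for i, (producer, dest, _) in enumerate(program):
        for j in range(i + 1, min(i + 3, len(program))):
            consumer, _, sources = program[j]
            if dest in sources:
                hazards.append((producer, consumer, dest))
    return hazards


program = [
    ("DADD", "R1", ("R2", "R3")),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
]
print(raw_hazards(program))
```

DSUB and AND read R1 before DADD has written it back; OR reads R1 in the same cycle as the write and is safe under the write-before-read assumption.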
Data Hazard
Solution: forwarding
feed the results held in the EX/MEM and MEM/WB pipeline registers directly back to the ALU inputs;
if the forwarding hardware detects that a previous ALU operation has written the register that is a source for the current ALU operation, the control logic selects the forwarded result as the ALU input.
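The selection made by the control logic amounts to a priority check on the two pipeline registers. The sketch below is a simplified model, not real hardware: each pipeline register is a hypothetical dictionary with a `dest` register name and a `value`, and the newer result in EX/MEM takes priority over the older one in MEM/WB.

```python
def select_alu_input(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the value for one ALU input.
    ex_mem, mem_wb: dicts {"dest": register name or None, "value": int},
    a made-up stand-in for the pipeline registers.
    Returns (value, where it came from)."""
    if ex_mem["dest"] is not None and ex_mem["dest"] == src_reg:
        return ex_mem["value"], "EX/MEM"    # result of the previous instruction
    if mem_wb["dest"] is not None and mem_wb["dest"] == src_reg:
        return mem_wb["value"], "MEM/WB"    # result from two instructions back
    return regfile_value, "register file"   # no pending write to this register


ex_mem = {"dest": "R1", "value": 42}
mem_wb = {"dest": "R6", "value": 7}
print(select_alu_input("R1", 0, ex_mem, mem_wb))
print(select_alu_input("R6", 0, ex_mem, mem_wb))
print(select_alu_input("R5", 13, ex_mem, mem_wb))
```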
[Figure: forwarding for DADD R1, R2, R3; DSUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR. Results are forwarded from the EX/MEM and MEM/WB pipeline registers.]
[Figure: forwarding for LD R4, 0(R1) followed by SD R4, 12(R1)]
Data Hazard
Sometimes stall is necessary
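LD R1, 0(R2)
[Figure: the loaded value is not available until the MEM/WB pipeline register, too late to forward to an immediately following instruction that needs R1 in EX]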
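The case where forwarding is not enough can be detected from adjacent instruction pairs. A minimal sketch, assuming full forwarding in the five-stage pipeline so that the only remaining stall is a load followed immediately by a consumer of the loaded register. The DSUB in the example is an assumed follow-up instruction, and the tuple format is hypothetical.

```python
def load_use_stalls(program):
    """program: list of (opcode, dest, sources) in program order.
    Returns the number of stall cycles inserted before each instruction,
    assuming full forwarding: only a load followed directly by a user of
    its destination register costs one cycle."""
    stalls = []
    for index, (_, _, sources) in enumerate(program):
        previous = program[index - 1] if index > 0 else None
        load_use = (previous is not None
                    and previous[0] == "LD"
                    and previous[1] in sources)
        stalls.append(1 if load_use else 0)
    return stalls


program = [
    ("LD", "R1", ("R2",)),
    ("DSUB", "R4", ("R1", "R5")),  # assumed consumer of R1
    ("AND", "R6", ("R1", "R7")),
]
print(load_use_stalls(program))
```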
Branch Hazard
Redo IF
essentially a stall
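The cost of redoing IF can be estimated by counting cycles. This sketch assumes the five-stage pipeline, one lost cycle per branch, and no other stalls; the function name and the example numbers are made up for illustration.

```python
def cycles_with_branch_stalls(num_instructions, num_branches,
                              stall_per_branch=1, pipeline_depth=5):
    """Total cycles to run a program when every branch repeats IF.
    Assumes one instruction completes per cycle once the pipeline is full
    and that branches are the only source of stalls."""
    fill_cycles = pipeline_depth - 1
    return fill_cycles + num_instructions + num_branches * stall_per_branch


print(cycles_with_branch_stalls(100, 0))   # no branches
print(cycles_with_branch_stalls(100, 20))  # one in five instructions is a branch
```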
Outline
What's Pipelining
How Pipelining Works
Pipeline Hazards
Pipeline with Multicycle FP Operations
Multicycle FP Operation
FP pipeline
allow for a longer latency for op;
two changes over integer pipeline:
repeat EX;
use multiple FP functional units;
FP Pipeline
Functional units:
integer unit: loads and stores, integer ALU operations, branches
FP and integer multiplier
FP adder: FP add, FP subtract, FP conversion
Generalized FP Pipeline
EX is pipelined (except for FP divider)
Additional pipeline registers
e.g., ID/A1
FP divider: 24 CCs
Generalized FP Pipeline
Example
italics: stage where data is needed
bold: stage where a result is available
Hazard
Divider is not fully pipelined → structural hazard
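How long a second divide must wait can be computed from the time the divider stays busy. The sketch below takes the 24 clock cycles quoted for the FP divider as its busy time and assumes the unit accepts a new operation only when idle; the function name and request cycles are invented for this example.

```python
DIVIDE_CYCLES = 24  # from the slide: "FP divider: 24 CCs"


def schedule_divides(request_cycles, busy_cycles=DIVIDE_CYCLES):
    """request_cycles: cycles (in increasing order) at which divides want to
    start. Returns (start_cycle, stall_cycles) per divide, assuming the
    unpipelined divider handles one operation at a time."""
    schedule = []
    free_at = 0  # first cycle in which the divider is idle again
    for requested in request_cycles:
        start = max(requested, free_at)
        schedule.append((start, start - requested))
        free_at = start + busy_cycles
    return schedule


print(schedule_divides([1, 5, 40]))
```

A divide that arrives while the unit is busy stalls until the earlier one finishes, and that delay can push back later divides as well.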
Hazard
Instructions have varying running times, so more than one register write may occur in a cycle → structural hazard
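Collisions at the register write port follow from the different EX lengths. This is a sketch under assumptions not stated in the slide: a single write port, every instruction writing a register, and EX lengths of 1 cycle for integer operations, 4 for FP add, and 7 for FP multiply. The instruction names and issue cycles are illustrative.

```python
from collections import defaultdict

# Assumed number of EX cycles per functional unit (not given in the slide).
EX_CYCLES = {"int": 1, "fp_add": 4, "fp_mul": 7}


def wb_cycle(issue_cycle, unit):
    """Cycle in which the instruction is in WB.
    IF takes the issue cycle, ID the next, then the EX cycles, MEM, and WB."""
    return issue_cycle + 3 + EX_CYCLES[unit]


def write_port_conflicts(program):
    """program: list of (name, issue_cycle, unit).
    Returns {cycle: [names]} for cycles with more than one register write."""
    writers = defaultdict(list)
    for name, issue_cycle, unit in program:
        writers[wb_cycle(issue_cycle, unit)].append(name)
    return {cycle: names for cycle, names in writers.items() if len(names) > 1}


program = [
    ("MUL.D", 1, "fp_mul"),
    ("ADD.D", 4, "fp_add"),
    ("L.D", 7, "int"),
]
print(write_port_conflicts(program))
```

Although the three instructions are issued three cycles apart, their different EX lengths bring all of them to WB in the same cycle.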
Hazard
Instructions no longer reach WB in order → write after write (WAW) hazard
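A WAW hazard shows up when a later instruction writes the same register earlier than an older one. The sketch assumes EX lengths of 1 cycle for integer operations (including loads), 4 for FP add, and 7 for FP multiply, none of which are stated in the slide; the instruction pair and register name are illustrative.

```python
# Assumed number of EX cycles per functional unit (not given in the slide).
EX_CYCLES = {"int": 1, "fp_add": 4, "fp_mul": 7}


def wb_cycle(issue_cycle, unit):
    """IF in the issue cycle, then ID, the EX cycles, MEM, and WB."""
    return issue_cycle + 3 + EX_CYCLES[unit]


def waw_hazards(program):
    """program: list of (name, issue_cycle, unit, dest) in program order.
    Returns (older, younger, register) whenever the younger instruction
    would write the shared destination before the older one."""
    hazards = []
    for i, (older, issue1, unit1, dest1) in enumerate(program):
        for younger, issue2, unit2, dest2 in program[i + 1:]:
            if dest1 == dest2 and wb_cycle(issue2, unit2) < wb_cycle(issue1, unit1):
                hazards.append((older, younger, dest1))
    return hazards


program = [
    ("ADD.D", 1, "fp_add", "F2"),
    ("L.D", 2, "int", "F2"),
]
print(waw_hazards(program))
```

Left unchecked, the load would write F2 first and the older add would then overwrite it, leaving the wrong final value in the register.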
Hazard
Instructions may complete in a different order than they were issued → problems with exceptions
Hazard
Longer latency of operations → more frequent stalls for RAW hazards
RAW Hazards
Structural Hazards
WAW Hazards
MIPS R4000
RF:
instruction decode and register fetch;
hazard checking;
instruction cache hit detection;
MIPS R4000
EX: execution
effective address calculation;
ALU operation;
branch-target computation and condition
evaluation;
MIPS R4000
2-cycle load delay
MIPS R4000
3-cycle branch delay
MIPS R4000
FP unit with eight different stages
MIPS R4000
FP operations: latency and initiation
interval
MIPS R4000
FP operations Example 1
FP multiply + FP add
MIPS R4000
FP operations Example 2
FP add + FP multiply
MIPS R4000
FP operations Example 3: divide + add
MIPS R4000
FP operations Example 4
FP add + FP divide