Overview
Introduction
Pipeline concepts
Basics of RISC instruction set
Classic 5-stage pipeline
Pipeline Hazards
Stalls, structural hazards, data hazards
Branch hazards
Pipeline Implementation
Simple MIPS pipeline
Pipelining
Similar to an assembly line
Widget Definition
Add partA, then B, then C, then D
Widget Definition
Add partA
Add partB
Add partC
Add partD
CPU Pipelining
Unpipelined: each instruction runs to completion before the next one starts (fetch, then decode, then execute, then access memory if needed, then write results if needed), taking multiple clock cycles or one LONG clock cycle
Pipelined: five stages (Fetch, Decode, Execute, Memory, Write Results), 1 cycle each, with a different instruction in each stage
[Figure: Instructions 1 to 6 flowing through the Fetch, Decode, Execute, Memory and Write Results stages]
Then
Execution time for pipelined version = T / N (ideal case), where T is the unpipelined execution time and N is the number of pipeline stages
Throughput increase is N
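The ideal-case claim can be checked with a small calculation. This is a minimal sketch, assuming an ideal N-stage pipeline with perfectly balanced stages and no hazards or register overhead; the function names and the sample values (T = 5, N = 5) are illustrative, not from the slides.

```python
def unpipelined_time(num_instructions: int, t: float) -> float:
    """Total time when every instruction takes the full time t."""
    return num_instructions * t


def pipelined_time(num_instructions: int, t: float, n_stages: int) -> float:
    """Total time on an ideal n-stage pipeline with cycle time t / n_stages.

    The first instruction needs n_stages cycles to pass through the pipeline;
    each later instruction completes one cycle after the previous one.
    """
    cycle = t / n_stages
    return (n_stages + num_instructions - 1) * cycle


if __name__ == "__main__":
    t, n = 5.0, 5  # assumed values for illustration
    for k in (1, 10, 1000):
        speedup = unpipelined_time(k, t) / pipelined_time(k, t, n)
        print(f"{k} instructions: speedup = {speedup:.2f}")
```

For a single instruction there is no gain; as the instruction count grows, the speedup approaches N.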
Advantages of Pipelining
Significant speedup without much additional hardware.
Invisible to the programmer
Non-pipelined Implementation
Multi-cycle implementation
Simplified to better understand the transition to the pipelined version
Not the most efficient implementation
Datapath
Control
Simplified Datapath
Program Counter
Instruction Register
Branch Target
Register File
ALU
Multiplexors not shown
Control signals not shown
Sign extend and shift modules not shown
[Figure: control state diagram; labels: lw/sw, AddrCal, Immed, branch, ImmExec, Brcomplete, LWmem, SWmem, Rfinish, ImmFinish, LWwrite]
Multi-cycle Implementation
At most 5 cycles to implement an instruction
Branch: 3 cycles
Load: 5 cycles
Others: 4 cycles
Pipelined Version
Each of the 5 clock cycles becomes a pipe stage
IF, ID, EX, MEM, WB
Stages
IF: use the PC to address the current instruction from memory; update the PC
ID: decode the instruction and read registers from the register file; do the equality test on the registers; sign-extend the offset field; compute the possible branch target
EX: the ALU operates on the operands (memory address calculation, register-register operation, or register-immediate operation)
MEM: if a load, read memory; if a store, write memory
WB: for a register-register operation or a load, write the result back to the register file
Pipeline Registers
Register file (just one)
Read register after fetch
Write register after data memory access
Pipeline Execution
Stages: IF, ID, EX, Mem, WB
Pipeline registers between stages: IF/ID, ID/EX, EX/Mem, Mem/WB
Some Issues
Register file is used in two stages (ID and WB): two register reads (two operands) and one register write during a single clock cycle
PC is needed in the IF stage and must be updated on every clock cycle
An adder is needed in ID to compute the branch target for branch/jump instructions
A branch does not change the PC until the ID stage; the next instruction has already been fetched at that point
Instruction Timing
Throughput is increased approximately by 5
Execution time of an individual instruction INCREASES due to pipelining overhead:
Pipeline register delay
Clock skew
T = TCL + Tsu + Treg + Tskew
Important to balance pipeline stages, since the clock is matched to the slowest stage (TCL)
Example
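Unpipelined: 1 GHz clock (T = 1 ns)
Instruction mix:
ALU        4 cycles   40%
Branches   4 cycles   20%
Memory     5 cycles   40%
If pipelined, T increases by Tskew + Tsu + Treg = 0.2 ns
How much speedup from a 5-stage pipeline?
CPI = (0.4 * 4) + (0.2 * 4) + (0.4 * 5) = 4.4
Unpipelined execution time per instruction: E.Time_u = T * CPI = 1 ns * 4.4 = 4.4 ns
Pipelined execution time per instruction: E.Time_p = 1 ns + 0.2 ns = 1.2 ns (ideal CPI = 1)
Speedup = E.Time_u / E.Time_p = 4.4 / 1.2 = 3.7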
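The arithmetic of this example can be reproduced in a few lines. This is a sketch only; the variable and function names are illustrative, and the pipelined machine is assumed to reach the ideal CPI of 1.

```python
def average_cpi(mix):
    """Weighted average CPI for a list of (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


# Instruction mix from the example: ALU, branches, memory.
mix = [(0.4, 4), (0.2, 4), (0.4, 5)]
clock_ns = 1.0      # unpipelined clock period (1 GHz)
overhead_ns = 0.2   # Tskew + Tsu + Treg added by pipelining

cpi = average_cpi(mix)
time_unpipelined = clock_ns * cpi          # ns per instruction
time_pipelined = clock_ns + overhead_ns    # ns per instruction, assuming CPI = 1
speedup = time_unpipelined / time_pipelined

if __name__ == "__main__":
    print(f"CPI = {cpi:.1f}, speedup = {speedup:.2f}")
```

The result is about 3.67, which the example rounds to 3.7; the overhead keeps it well short of the ideal factor of 5.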
Pipeline Hazards
Structural hazards: resource conflicts when more than one instruction needs a resource
Data hazards: an instruction depends on a result from a previous instruction that is not yet available
Control hazards: conflicts from branches and jumps that change the PC
Pipeline Stall
stall
If Data Memory and Instruction Memory are implemented with a single memory, then this can cause a structural hazard
Hazards
AND   R6, R1, R7
OR    R8, R1, R9
No hazards
XOR   R10, R1, R11
Forwarding
Solution for hazards
Also called bypassing or short-circuiting
Create a potential datapath from where the result is calculated to where it is needed by another instruction
Detect the hazard to route the result
Example:
DADD  R1, R2, R3
DSUB  R4, R1, R5    (forwarding path from DADD)
AND   R6, R1, R7
OR    R8, R1, R9
XOR   R10, R1, R11
More Forwarding
Remaining Stalls
Some data hazards cannot be resolved by forwarding:
LD    R1, 0(R2)
DSUB  R4, R1, R5
AND   R6, R1, R7
OR    R8, R1, R9

AND   R6, R1, R7    IF   stall  ID   EX   MEM  WB
OR    R8, R1, R9         stall  IF   ID   EX   MEM  WB
Branch Hazards
      BEQZ R1, Name
      Instr. 1        (next instruction if branch not taken)
      Instr. 2
      Instr. 3
Name: Instr. 4        (next instruction if branch taken)
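Four schemes for handling branch hazards:
1. Pipeline freeze
2. Predicted-not-taken
3. Predicted-taken
4. Delayed branch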
Pipeline Freeze
Hold or delete all instructions after a branch until the target address is known.
Simple to implement
Results in a 1-cycle stall for MIPS
Longer stalls for other pipeline architectures
Predicted-not-taken
Execute successor instructions in sequence
Squash instructions in the pipeline if the branch is actually taken
Must be careful not to alter the state of registers until the actual branch target is known
Only slightly more complicated to implement than pipeline freeze
Compiler can modify loops to favor branches not taken
Predicted-taken
Treat every branch as taken
As soon as the branch is decoded and the target address is computed, begin fetching at the target
No advantage for MIPS, because the target address is not known any earlier than the branch outcome
Only makes sense for machines that compute the target address before determining the branch outcome
Delayed Branch
Execute the instruction after the branch no matter what
Fetch the subsequent instruction depending on the branch outcome
Instruction order:
branch
sequential successor instruction
branch target if taken
Compiler's job is to put a useful instruction as the sequential successor instruction
Otherwise a NOP is used
For the schemes just mentioned, penalty is at most 1 cycle. Penalty is more for deeper pipelines
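The effect of a 1-cycle branch penalty on CPI can be estimated as follows. This is a rough sketch: the 20% branch frequency is borrowed from the earlier speedup example, and the 60% taken fraction is a made-up value for illustration only.

```python
def effective_cpi(branch_freq: float, penalty_cycles: float,
                  penalized_fraction: float = 1.0) -> float:
    """CPI of an otherwise ideal pipeline (CPI = 1) with branch stalls.

    penalized_fraction is the share of branches that actually pay the
    penalty (1.0 for pipeline freeze, the taken fraction for
    predicted-not-taken).
    """
    return 1.0 + branch_freq * penalized_fraction * penalty_cycles


if __name__ == "__main__":
    branch_freq = 0.2      # assumed: 20% of instructions are branches
    taken_fraction = 0.6   # hypothetical share of taken branches
    print("pipeline freeze:     ", effective_cpi(branch_freq, 1))
    print("predicted-not-taken: ", effective_cpi(branch_freq, 1, taken_fraction))
```

With a deeper pipeline, only `penalty_cycles` changes, which is why the penalty grows with pipeline depth.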
Pipeline Implementation
Details of pipeline implementation
So that other issues can be explored
Multi-cycle Implementation
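1. Instruction fetch cycle (IF)
2. Instruction decode/register fetch cycle (ID)
3. Execution/effective address cycle (EX)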
Multi-cycle Implementation
4. Memory access/branch completion cycle (MEM)
LMD ← Mem[ALUOutput] (load) OR Mem[ALUOutput] ← B (store)
Multicycle Datapath
[Figure: multicycle datapath annotated with CYCLE 1 through CYCLE 5; Imm = 55]
Pipeline Control
Control signals needed for:
MUXs
Register (write)
ALU (function)
Data memory (read/write)
Control Complications
Instruction issue: the point at which an instruction moves from the ID stage into the EX stage
Data hazards are checked in the ID stage
If a stall is required, the instruction is stalled before it is issued
If forwarding is needed, the controls are set
LD    R1, 45(R2)
DADD  R5, R6, R7
DSUB  R8, R1, R7    (requires forwarding)
Comparators detect the use of R1 in DSUB and forward the result of the load to the ALU in time for DSUB to begin EX.
Load Interlocks
Recall that the following code requires a stall, or load interlock, to prevent a Read After Write (RAW) hazard:
LD    R1, 0(R2)
DSUB  R4, R1, R5
AND   R6, R1, R7
OR    R8, R1, R9
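The load interlock check can be sketched as a comparison between the instruction currently in EX and the one in ID. This is a simplified model, not the actual MIPS control logic; the `Instr` record and the function name are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                  # mnemonic, e.g. "LD"
    dest: Optional[str]      # destination register, if any
    srcs: Tuple[str, ...]    # source registers


def needs_load_interlock(in_ex: Instr, in_id: Instr) -> bool:
    """True if the instruction in ID must stall one cycle.

    A load produces its value at the end of MEM, so an instruction that
    directly follows the load and reads the loaded register cannot get
    the value by forwarding in time for its EX stage.
    """
    return in_ex.op == "LD" and in_ex.dest is not None and in_ex.dest in in_id.srcs


ld = Instr("LD", "R1", ("R2",))
dsub = Instr("DSUB", "R4", ("R1", "R5"))
and_ = Instr("AND", "R6", ("R1", "R7"))

if __name__ == "__main__":
    print(needs_load_interlock(ld, dsub))    # load followed by a use of R1
    print(needs_load_interlock(dsub, and_))  # producer is not a load
```

Only the LD/DSUB pair needs the interlock; AND and OR are far enough behind the load to receive R1 by forwarding or from the register file.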
Forwarding Logic
Detection is similar to detecting RAW hazards, but with more cases
All forwarding values originate at the ALU or data memory output
They terminate at the ALU input, data memory input, or zero detection unit
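One of the forwarding cases, selecting the source of an ALU input, can be sketched as below. This is a minimal illustration that assumes the ALU result is held in the EX/MEM pipeline register and the ALU or load result in MEM/WB; the function name and return labels are hypothetical, and forwarding to the data memory input and the zero detection unit is not modeled.

```python
from typing import Optional


def forward_source(src_reg: str,
                   ex_mem_dest: Optional[str],
                   mem_wb_dest: Optional[str]) -> str:
    """Pick where an ALU operand should come from.

    The most recent producer (EX/MEM) takes priority over the older
    one (MEM/WB); otherwise the register file value is used.
    """
    if src_reg == "R0":
        return "REGFILE"   # R0 is hardwired to zero in MIPS, never forwarded
    if ex_mem_dest == src_reg:
        return "EX/MEM"
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # DADD R1,R2,R3 followed by DSUB R4,R1,R5: DADD is in MEM while DSUB is in EX.
    print(forward_source("R1", ex_mem_dest="R1", mem_wb_dest=None))
    # AND R6,R1,R7 two instructions after DADD: DADD is in WB while AND is in EX.
    print(forward_source("R1", ex_mem_dest="R4", mem_wb_dest="R1"))
```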
Branches in Pipeline
Consider only BEQZ and BNEZ (branch if equal to zero, branch if not equal to zero)
For these it is possible to move the test to the ID stage
To take advantage of the early decision, the target address must also be computed early
Must add another adder for computing the target address in ID
Result is a 1-cycle stall on branches
A branch on a register produced by the previous ALU operation will result in a data hazard stall
Zero test
Resume vs terminate
Terminating: program execution always stops after the interrupt
Resuming: program execution continues after the interrupt is handled
Resuming exceptions are harder to handle
After the exception is handled, return from the exception by reloading the PCs and restarting the instruction stream
Precise exceptions: the pipeline can always be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted
Floating point instructions tend to take many cycles, which makes it difficult to have precise exceptions
Example
LD     IF   ID   EX   MEM  WB
DADD        IF   ID   EX   MEM  WB
Arithmetic exception (DADD)
1. Deal with the page fault, redo the DADD
2. Deal with the DADD arithmetic exception that will occur again
But: exceptions can occur out of order
Alternate solution:
Hardware posts all exceptions in a status vector
Control signals that write data are turned off
When an instruction enters WB, its exception status vector is checked
Exceptions of the earliest instructions are handled first
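The difference between the order in which exceptions are raised and the order in which they are handled can be sketched as follows. This is a toy model of the 5-stage pipeline with no stalls; the instruction tuples, the function names, and the instruction page fault on DADD are assumptions made for illustration.

```python
# Offset of each stage from the cycle in which the instruction is fetched.
STAGE_OFFSET = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}


def stage_cycle(if_cycle: int, stage: str) -> int:
    """Clock cycle in which an instruction fetched at if_cycle is in stage."""
    return if_cycle + STAGE_OFFSET[stage]


def raised_first(instrs):
    """Instruction whose exception is posted earliest in time."""
    return min(instrs, key=lambda i: stage_cycle(i[1], i[2]))


def handled_first(instrs):
    """Instruction whose status vector is checked first, i.e. first to reach WB."""
    return min(instrs, key=lambda i: stage_cycle(i[1], "WB"))


# (name, IF cycle, stage that raises the exception, exception) in program order.
program = [
    ("LD", 1, "MEM", "data page fault"),
    ("DADD", 2, "IF", "instruction page fault"),  # hypothetical fault
]

if __name__ == "__main__":
    print("raised first: ", raised_first(program)[0])
    print("handled first:", handled_first(program)[0])
```

DADD's exception is posted in cycle 2, before LD's in cycle 4, yet LD reaches WB first, so its exception is the one handled first.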
[Figure: pipeline with stages IF, ID, EX, MEM, WB]
One Approach
4 separate function units for the EX stage
Integer unit takes 1 clock cycle
FP units take multiple cycles
Instruction issue: allowing an instruction to move from the ID to the EX phase
Definitions
Latency: the number of cycles between when an instruction produces a result and when the next instruction can use the result.
Integer ALU: latency = 0
Loads: latency = 1
FP Mult: latency = 6
Definitions
Initiation Interval: the number of cycles that must elapse between issuing two operations of a given type
Integer ALU, Loads, FP Add, FP Mult: Initiation Interval = 1
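The latency definition translates directly into a stall count for a dependent instruction. This is a minimal sketch assuming full forwarding and no other hazards; the `LATENCY` table holds only the values listed above, and the key and function names are illustrative.

```python
# Latencies listed in the definitions above (in cycles).
LATENCY = {"int_alu": 0, "load": 1, "fp_mult": 6}


def stall_cycles(producer_kind: str, distance: int) -> int:
    """Stalls needed by a consumer issued `distance` instructions after its producer.

    distance = 1 means the consumer immediately follows the producer.
    Each independent instruction in between hides one cycle of latency.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return max(0, LATENCY[producer_kind] - (distance - 1))


if __name__ == "__main__":
    print(stall_cycles("int_alu", 1))  # back-to-back ALU operations
    print(stall_cycles("load", 1))     # load followed directly by its use
    print(stall_cycles("fp_mult", 4))  # three independent instructions in between
```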
MUL.D   IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
ADD.D        IF   ID   A1   A2   A3   A4   MEM  WB
L.D               IF   ID   EX   MEM  WB
S.D                    IF   ID   EX   MEM  WB
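Instruction   1    2    3    4    5    6    7    8    9    10   11
MUL.D         IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
...                IF   ID   EX   MEM  WB
...                     IF   ID   EX   MEM  WB
ADD.D                        IF   ID   A1   A2   A3   A4   MEM  WB
...                               IF   ID   EX   MEM  WB
...                                    IF   ID   EX   MEM  WB
L.D                                         IF   ID   EX   MEM  WB
MUL.D, ADD.D and L.D all reach WB in cycle 11, so all three need the register write port in the same cycle.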
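The write-port conflict can be found mechanically by computing the WB cycle of each instruction. This is a sketch under stated assumptions: one instruction is fetched per cycle with no stalls, the execute lengths are 7 (MUL.D), 4 (ADD.D) and 1 (L.D and the unspecified "..." instructions), and the "..." instructions, modeled here as a hypothetical `INT` filler, do not use the FP register write port.

```python
from collections import defaultdict

# Number of execute stages per operation (M1..M7, A1..A4, EX).
EX_CYCLES = {"MUL.D": 7, "ADD.D": 4, "L.D": 1, "INT": 1}
# Operations assumed to write the FP register file.
FP_WRITERS = {"MUL.D", "ADD.D", "L.D"}


def wb_cycle(if_cycle: int, op: str) -> int:
    """Cycle of WB: IF, ID, the execute stages, MEM, then WB."""
    return if_cycle + 2 + EX_CYCLES[op] + 1


def write_port_conflicts(program):
    """Map each cycle in which more than one FP write occurs to the writers."""
    by_cycle = defaultdict(list)
    for if_cycle, op in enumerate(program, start=1):
        if op in FP_WRITERS:
            by_cycle[wb_cycle(if_cycle, op)].append(op)
    return {c: ops for c, ops in by_cycle.items() if len(ops) > 1}


program = ["MUL.D", "INT", "INT", "ADD.D", "INT", "INT", "L.D"]

if __name__ == "__main__":
    print(write_port_conflicts(program))
```

An interlock in ID could use the same calculation to stall an instruction whose WB cycle is already claimed.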
Possible Solutions
Add write ports: probably not a good idea, because it is not a common scenario
Detect the structural hazard and implement an interlock:
Track scheduled write ports in ID and stall there, OR
Stall the conflicting instruction in the MEM or WB stage
WAW Hazards
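Clock Cycle Number
Instruction        1    2    3    4    5    6    7    8    9    10   11
MUL.D              IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
...                     IF   ID   EX   MEM  WB
...                          IF   ID   EX   MEM  WB
ADD.D F2, F4, F6                  IF   ID   A1   A2   A3   A4   MEM  WB
...                                    IF   ID   EX   MEM  WB
L.D F2, 0(R2)                               IF   ID   EX   MEM  WB
...                                              IF   ID   EX   MEM  WB
L.D writes F2 in cycle 10, one cycle before ADD.D writes F2 in cycle 11, so the writes occur out of order.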
Review of Hazards
Caused by different lengths of execution unit pipelines.
Structural hazards: multiple instructions need the same function unit at the same time
RAW data hazards: an instruction needs to read a value that has not been written yet
WAW data hazards: writes occur out of order
Handling Hazards
Structural Hazards
Wait to issue instructions if divider is busy, or if the register write port will not be available.
RAW Hazards
Check source registers against pending destinations, stall issue if necessary.
WAW Hazards
Determine whether any instruction in the MULT, ADD, or DIV pipeline has the same destination as the instruction being issued; stall issue if necessary.
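The three issue-time checks can be combined into one function. This is a deliberately conservative sketch, not the actual MIPS issue logic: it stalls on any pending write to a source or destination register, whereas real hardware only stalls when the timing actually conflicts. The `Pending` record and all names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Pending(NamedTuple):
    dest: str       # destination register of an in-flight instruction
    wb_cycle: int   # cycle in which it will use the register write port


def issue_stall_reason(srcs: Tuple[str, ...], dest: str, my_wb_cycle: int,
                       pending: List[Pending],
                       uses_divider: bool = False,
                       divider_busy: bool = False) -> Optional[str]:
    """Return why the instruction in ID cannot issue, or None if it can."""
    if uses_divider and divider_busy:
        return "structural: divider busy"
    if any(p.wb_cycle == my_wb_cycle for p in pending):
        return "structural: write port"
    if any(p.dest in srcs for p in pending):
        return "RAW"
    if any(p.dest == dest for p in pending):
        return "WAW"
    return None


pending = [Pending("F2", 11)]  # e.g. an ADD.D that will write F2 in cycle 11

if __name__ == "__main__":
    print(issue_stall_reason(("R2",), "F2", 10, pending))       # same destination
    print(issue_stall_reason(("F2", "F4"), "F8", 12, pending))  # reads pending F2
    print(issue_stall_reason(("F4",), "F6", 9, pending))        # no conflict
```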
Precise Exceptions
Out of order completion makes precise exceptions difficult
Instruction              Completion time (starting at 0, 1, 2)
DIV.D F0, F3, F5         cycle 28
ADD.D F9, F9, F7         cycle 9
SUB.D F10, F10, F14      cycle 10
No data hazards, so no stalls
If SUB.D causes an exception, ADD.D is already done, but DIV.D is NOT complete
Saving the PC and starting over at SUB.D will not work
Solution Options
Buffer results until all previous instructions are complete.
OK as long as the difference in completion times is reasonable. (Lots of storage otherwise)
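The cost of buffering can be estimated from completion times. This is a simplified sketch that assumes results are released in program order and that several buffered results may be released in the same cycle; the function names are illustrative, and the numbers are the completion cycles of the DIV.D/ADD.D/SUB.D example.

```python
from itertools import accumulate


def commit_cycles(completion):
    """Earliest cycle each result may be written, in program order.

    A result is held until every earlier instruction has completed, so
    the commit cycle is the running maximum of the completion cycles.
    """
    names = [name for name, _ in completion]
    commits = accumulate((cycle for _, cycle in completion), max)
    return list(zip(names, commits))


def buffered_cycles(completion):
    """How long each result waits in the buffer before it is written."""
    return [(name, commit - done)
            for (name, done), (_, commit) in zip(completion, commit_cycles(completion))]


completion = [("DIV.D", 28), ("ADD.D", 9), ("SUB.D", 10)]

if __name__ == "__main__":
    print(commit_cycles(completion))
    print(buffered_cycles(completion))
```

ADD.D and SUB.D wait 19 and 18 cycles behind DIV.D, which illustrates why a large spread in completion times needs a lot of buffer storage.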
MIPS FP Pipeline
Stalls to avoid structural and RAW hazards
Stalls per FP operation
Number of stalls depends on latency
Number of stalls also depends on how many cycles pass before results are used
Divide frequency is low, but the number of stalls needed is high due to latency
Average for add/sub/conv = 1.7 (56% of latency)
Average for mult = 2.8 (46%)
Average for div = 14.2 (59%)
Stages:
IF: first half of instruction fetch
IS: second half of instruction fetch
RF: instruction decode and register fetch, hazard checking, instruction cache hit detection
EX: execution (address calculation, ALU operation, condition evaluation)
DF: data fetch, first half of data cache access
DS: second half of data fetch, completion of cache access
TC: tag check, determine whether the data cache access hit
WB: write back for loads and register-register operations
8 stages, used 0 or many times, in different orders, by different instructions
Large range of completion times (2-112 cycles)
FP Pipeline Stages
Stage   Functional Unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers
Branch stalls from the longer pipeline are substantial
FP structural stalls are sometimes masked by result stalls
Appendix A Summary
For an ideal N-stage pipeline, throughput increase is N over a non-pipelined architecture
Ideal pipelined CPU has CPI = 1
Pipelining has the advantages of:
significant speedup with moderate hardware costs
invisible to the programmer
Appendix A Summary
Solutions include:
Stalls
Forwarding
Buffering state (for exceptions)
Branch delay slots
Branch prediction
Several multi-cycle execution units for FP