Instructor: Prof. Bhagi Narahari Dept. of Computer Science Course URL: www.seas.gwu.edu/~narahari/cs211/
CS211 1
Appendix A of Textbook
Also parts of Chapter 2
Instruction Pipeline
The instruction execution process lends itself naturally to pipelining: overlap the subtasks of instruction fetch, decode, and execute
Synchronous Pipeline
All transfers occur simultaneously; one task or operation enters the pipeline per cycle. The reservation table of such a pipeline is diagonal.
[Figure: reservation table of a 3-stage synchronous pipeline — tasks T1..T4 occupy stages S1..S3 along a diagonal, one stage per cycle.]
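The diagonal reservation table above can be generated with a small simulation. This is a sketch (the function and variable names are mine, not from the slides): one task enters the k-stage linear pipeline per cycle, and task t occupies stage s at cycle t+s.

```python
# Sketch: simulate a k-stage synchronous linear pipeline where one task
# enters per cycle; the printed reservation table is diagonal.

def reservation_table(num_stages, num_tasks):
    """Return a dict mapping stage index -> list of occupants per cycle."""
    cycles = num_stages + num_tasks - 1
    table = {s: ["." for _ in range(cycles)] for s in range(num_stages)}
    for t in range(num_tasks):
        for s in range(num_stages):
            table[s][t + s] = f"T{t + 1}"   # task t is in stage s at cycle t+s
    return table

table = reservation_table(num_stages=3, num_tasks=4)
for stage in range(2, -1, -1):              # print S3 at the top, as on the slide
    print(f"S{stage + 1}: " + " ".join(table[stage]))
```

Running it prints the same diagonal pattern as the slide's figure: T1 moves from S1 to S3 while T2, T3, T4 follow one cycle behind.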
Asynchronous Pipeline
Transfers are performed when the individual processors are ready, coordinated by a handshaking protocol between stages. Mainly used in multiprocessor systems with message passing.
Pipeline clock period: τ = max{ τi } + d, where τi is the delay of stage Si and d is the latch delay.
Pipeline throughput (the number of tasks per unit time; note the equivalence to IPC):
Hk = n / [ (k + (n-1)) · τ ] = n·f / (k + (n-1)),
where τ is the clock period and f = 1/τ.
Speedup for the above case is 280/100 = 2.8. Pipeline time for 1000 tasks = (1000 + 4 - 1) cycles = 1003 × 100 ns. Sequential time = 1000 × 280 ns. Throughput = 1000/(1003 × 100 ns). What is the problem here? How can performance be improved?
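The slide's numbers can be checked directly. This sketch assumes, as the slide does, a 4-stage pipeline clocked at 100 ns (the slowest stage) versus a 280 ns sequential unit, for 1000 tasks:

```python
# Check the slide's example: 4-stage pipeline, 100 ns clock (set by the
# slowest stage), vs. a 280 ns sequential unit, over 1000 tasks.

k, clock_ns = 4, 100          # stages, cycle time
seq_ns, n = 280, 1000         # sequential time per task, number of tasks

pipe_time = (k + (n - 1)) * clock_ns      # 1003 cycles * 100 ns = 100300 ns
seq_time = n * seq_ns                     # 280000 ns
speedup = seq_time / pipe_time            # ~2.79
throughput = n / pipe_time                # tasks per ns

print(f"pipeline time = {pipe_time} ns, speedup = {speedup:.2f}")
```

The speedup saturates near 2.8, not k = 4: the 100 ns clock is dictated by the slowest stage, while the stage delays sum to only 280 ns. Unbalanced stages are the problem the slide asks about.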
Reservation Tables
Reservation table: displays the time-space flow of data through the pipeline; analogous to an opcode for the pipeline. It is not diagonal, as in linear pipelines, because of feedforward and feedback connections. The number of columns in the reservation table is the evaluation time of a given function.
Latency Analysis
Latency: the number of clock cycles between two initiations of the pipeline. Collision: an attempt by two initiations to use the same pipeline stage at the same time. Some latencies cause collisions; some do not.
Collisions (Example)
[Figure: reservation tables showing successive initiations x1-x4; with latency 2, later initiations (e.g., x1x2, x2x3x4) attempt the same stage in the same cycle — a collision — while other latencies avoid it.]
Latency Cycle
[Figure: initiation timeline over cycles 1-18 — initiations x1, x2, x3 launched in a repeating pattern that avoids collisions.]
Latency cycle: a sequence of initiations that has a repetitive subsequence and no collisions. Latency sequence length: the number of time intervals within the cycle. Average latency: the sum of all latencies divided by the number of latencies along the cycle.
Goal: to find the shortest average latency. Bounds: for a reservation table with n columns, the maximum forbidden latency is m <= n - 1, and any permissible latency p satisfies 1 <= p <= m - 1. Ideal case: p = 1 (static pipeline). Collision vector: C = (Cm Cm-1 . . . C2 C1), where Ci = 1 if latency i causes a collision and Ci = 0 for permissible latencies.
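The collision vector can be computed mechanically from the reservation table: a latency d is forbidden whenever some stage is used at two times exactly d apart. This sketch (function name and the example table are mine, chosen for illustration) does exactly that:

```python
# Sketch: derive forbidden latencies and the collision vector
# C = (Cm ... C1) from a reservation table, per the definition above.

def collision_vector(table):
    """table: dict mapping stage name -> set of cycle numbers it is used in."""
    forbidden = set()
    for cycles in table.values():
        for a in cycles:
            for b in cycles:
                if a < b:
                    forbidden.add(b - a)   # latency b-a would reuse this stage
    m = max(forbidden)
    # Bit Ci = 1 iff latency i causes a collision; emitted as Cm ... C1.
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))

# Hypothetical 3-stage table with 6 columns (cycles 0..5):
table = {"S1": {0, 5}, "S2": {1, 4}, "S3": {2, 3}}
print(collision_vector(table))   # forbidden latencies {1, 3, 5} -> "10101"
```

For this table the permissible latencies are 2 and 4, consistent with the bound 1 <= p <= m - 1 above.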
Collision Vector
Reservation Table
[Figure: example reservation table with two initiations X1 and X2; from the stage-usage pattern, the collision vector C = (? ? . . . ? ?) is to be filled in.]
Go back and examine your datapath and control diagram: associate resources with states; ensure that flows do not conflict, or figure out how to resolve conflicts; assert control in the appropriate stage.
[Figure: DLX datapath — instruction memory, register file, sign extend, an adder for PC+4, ALU with zero detect, data memory (LMD), and MUXes selecting the write-back data.]
[Figure: pipelined DLX datapath — the same components separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, with RD carried down the pipe to the write-back stage.]
Visualizing Pipelining
[Figure: pipelining diagram over cycles 1-7 — successive instructions in program order each pass through Ifetch, Reg, ALU, DMem, Reg, offset by one clock cycle.]
[Figure: clock timing over cycles 1-10.
Multiple-cycle implementation: Load takes Ifetch, Reg, Exec, Mem, Wr; then Store takes Ifetch, Reg, Exec, Mem; then R-type begins with Ifetch.
Pipeline implementation: Load, Store, and R-type start on successive cycles, each proceeding through Ifetch, Reg, Exec, Mem, Wr.]
Reg/Dec: register fetch and instruction decode. Exec: calculate the memory address. Mem: read the data from data memory. Wr: write the data back to the register file.
Pipelining the R-type and Load Instructions
[Figure: over cycles 1-9, R-type and Load instructions enter the pipeline on successive cycles (Ifetch, Reg/Dec, Exec, . . ., Wr); because Load has an extra Mem stage, two instructions can reach Wr in the same cycle.]
Important Observation
Each functional unit can only be used once per instruction, and each functional unit must be used at the same stage for all instructions:
Load uses the register file's write port during its 5th stage: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr.
R-type uses the register file's write port during its 4th stage: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Wr.
Solution 1: Insert a Bubble into the Pipeline
[Figure: over cycles 1-9, a bubble is inserted so that a following R-type's Wr does not coincide with the Load's Wr.]
Insert a bubble into the pipeline to prevent two writes in the same cycle. The control logic can be complex, and we lose an instruction fetch and issue opportunity.
Solution 2: Delay R-type's Write by One Cycle
Delay the R-type's register write by one cycle: now R-type instructions also use the register file's write port at stage 5, and their Mem stage is a NOOP stage in which nothing is done: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem (NOOP), 5 Wr.
Why Pipeline?
Multicycle machine:
10 ns/cycle × 4.6 CPI (due to instruction mix) × 100 instructions = 4600 ns
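For comparison, the pipelined time can be estimated the same way. The pipelined numbers below are my assumption (the slide gives only the multicycle figures): a 5-stage pipeline at the same 10 ns cycle with an ideal CPI of 1 and no stalls.

```python
# Sketch: multicycle vs. (assumed) pipelined execution time for 100
# instructions. The multicycle numbers are from the slide; the 5-stage,
# ideal-CPI pipeline is my assumption for illustration.

cycle_ns, n_inst = 10, 100

multicycle_ns = cycle_ns * 4.6 * n_inst              # ~4600 ns, as on the slide

pipe_depth = 5                                       # assumed 5-stage pipeline
pipelined_ns = cycle_ns * (n_inst + pipe_depth - 1)  # fill time + 1 inst/cycle

print(multicycle_ns, pipelined_ns)
```

Under these assumptions the pipeline finishes in 1040 ns, a speedup of roughly 4.4× over the multicycle machine.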
Why pipeline? Because the resources are there!
[Figure: five instructions in flight over successive clock cycles — in any one cycle, each instruction uses a different resource: Im (instruction memory), Reg, ALU, Dm (data memory), Reg write-back.]
We can always resolve hazards by stalling, but more stall cycles mean more CPU time and less performance.
Increasing performance means decreasing stall cycles.
Speedup = [ (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI) ] × (Cycle Time_unpipelined / Cycle Time_pipelined)
[Figure: pipeline diagram of successive instructions, each passing through Ifetch, Reg, ALU, DMem, Reg.]
[Figure: the same diagram with bubbles inserted — the conflicting instruction and those behind it are delayed by one cycle.]
Example
Machine A: dual-ported memory (Harvard architecture). Machine B: single-ported memory, but its pipelined implementation has a 1.05× faster clock rate. Ideal CPI = 1 for both. Loads are 40% of the instructions executed.
SpeedupA = Pipe. depth / (1 + 0) × (clock_unpipe / clock_pipe) = Pipeline depth
SpeedupB = Pipe. depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05))
  = (Pipe. depth / 1.4) × 1.05 = 0.75 × Pipe. depth
SpeedupA / SpeedupB = Pipe. depth / (0.75 × Pipe. depth) = 1.33
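The comparison above can be reproduced numerically; the speedup formula is the one from the summary slide, and the pipeline depth (5 here) is an assumption that cancels out of the final ratio:

```python
# Sketch: machine A (dual-ported memory, no load stalls) vs. machine B
# (single-ported memory, 1-cycle stall on the 40% of instructions that
# are loads, but a 1.05x faster clock).

def speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = depth / (1 + stall CPI) * (clock_unpipe / clock_pipe)."""
    return depth / (1.0 + stall_cpi) * clock_ratio

depth = 5                                  # assumed; the ratio is depth-independent
a = speedup(depth, stall_cpi=0.0)
b = speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)
print(a / b)   # ~1.33: A wins despite B's faster clock
```

The point carries over from the slide: avoiding the structural hazard buys more than B's 5% clock advantage.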
Data Dependencies
True dependencies and false dependencies:
a false dependency implies we can remove the dependency; a true dependency implies we are stuck with it!
Three types of data dependencies, defined in terms of how the succeeding instruction depends on the preceding instruction:
RAW: read after write, or flow dependency; WAR: write after read, or anti-dependency; WAW: write after write, or output dependency.
RAW Dependency
Example program (a) with two instructions
i1: load r1, a; i2: add r2, r1,r1;
In both cases we cannot read r1 in i2 until i1 has completed writing the result.
In (a) this is due to a load-use dependency; in (b) it is due to a define-use dependency.
Assume all instructions take 5 stages, reads always occur in stage 2, and writes always occur in stage 5.
For each instruction i, construct the set of registers it reads and the set it writes:
Rregs(i) is the set of registers that instruction i reads from; Wregs(i) is the set of registers that instruction i writes to. Use these to define the three types of data hazards.
With instruction i preceding instruction j:
a RAW hazard exists on a register if Wregs(i) ∩ Rregs(j) ≠ ∅;
a WAW hazard exists on a register if Wregs(i) ∩ Wregs(j) ≠ ∅;
a WAR hazard exists on a register if Rregs(i) ∩ Wregs(j) ≠ ∅.
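These set definitions translate directly into a hazard check. A sketch (the helper name is mine) classifying the hazards between an earlier instruction i and a later instruction j:

```python
# Sketch: classify data hazards between an earlier instruction i and a
# later instruction j from their Rregs/Wregs sets, per the definitions above.

def hazards(wregs_i, rregs_i, wregs_j, rregs_j):
    found = []
    if wregs_i & rregs_j:
        found.append("RAW")   # j reads what i writes (true/flow dependency)
    if rregs_i & wregs_j:
        found.append("WAR")   # j writes what i reads (anti-dependency)
    if wregs_i & wregs_j:
        found.append("WAW")   # both write the same register (output dependency)
    return found

# i1: load r1, a ; i2: add r2, r1, r1  -> RAW on r1, as in the earlier example
print(hazards(wregs_i={"r1"}, rregs_i=set(),
              wregs_j={"r2"}, rregs_j={"r1"}))   # ['RAW']
```

Swapping the write sets (both instructions writing r1) would report WAW instead, matching the definitions.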
Data Hazard on R1
[Figure: pipeline diagram over clock cycles (IF, ID/RF, EX, MEM, WB) — an instruction writes r1, and the following instructions (. . ., or r8,r1,r9; xor r10,r1,r11) read r1 before it has been written back.]
[Figure: pipeline diagram of the same r1-dependent sequence (. . ., xor r10,r1,r11), used in discussing how the hazard is resolved.]
[Figures: data flow through a memory access unit for instruction pairs —
(a) STO M,R1 then LD R2,M; (b) STO M,R1 then MOVE R2,R1;
(c) LD R1,M then LD R2,M; (d) LD R1,M then MOVE R2,R1;
(e) STO M,R1 then STO M,R2.]
[Figure: pipelined datapath detail — NextPC, the immediate, and the register operands feed the ALU through MUXes; the ID/EX, EX/MEM, and MEM/WR latches and the data memory supply the MUX inputs.]
What about memory operations? If instructions are initiated in order and memory operations always occur in the same stage, there can be no hazards between memory operations!
What does delaying WB on arithmetic operations cost? Cycles? Hardware?
What about data dependence on loads?
R1 <- R4 + R5
R2 <- Mem[ R2 + I ]
R3 <- R2 + R1
Delayed loads: we can recognize this in the decode stage and introduce a bubble while stalling the fetch stage.
Tricky situation:
R1 <- Mem[ R2 + I ]
Mem[ R3 + 34 ] <- R1
Handle with a bypass in the memory stage!
[Figure: the instruction's fields (op, Rd, Ra, Rb) travel down the pipe with the instruction; Rd is carried along past the data memory stage to direct the result back to the register file.]
[Figure: pipeline diagram of four successive instructions (Ifetch, Reg, ALU, DMem, Reg).]
[Figure: the same diagram with bubbles inserted — the dependent instructions are delayed while the bubble propagates down the pipe.]
[Figure: pipeline diagram of four successive instructions (Ifetch, Reg, ALU, DMem, Reg), used in the discussion of branches that follows.]
A 1-slot delay allows a proper branch decision and branch target address computation in a 5-stage pipeline; DLX uses this.
Delayed Branch
Where do we get instructions to fill the branch delay slot?
From before the branch instruction; from the target address (only valuable when the branch is taken); or from the fall-through path (only valuable when the branch is not taken). Cancelling branches allow more slots to be filled.
Downside of delayed branches: 7-8 stage pipelines and multiple instructions issued per clock (superscalar) make the slots much harder to fill.
Scheduling scheme     Branch penalty
Stall pipeline        3
Predict taken         1
Predict not taken     1
Delayed branch        0.5
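The penalties above feed straight into an effective CPI. The branch frequency (20%) and ideal CPI (1) below are my illustrative assumptions, not the slide's:

```python
# Sketch: effective CPI per scheme from the penalty table, assuming
# (my numbers) 20% branches and an ideal CPI of 1.

branch_freq, ideal_cpi = 0.20, 1.0
penalties = {"stall pipeline": 3, "predict taken": 1,
             "predict not taken": 1, "delayed branch": 0.5}

cpi = {scheme: ideal_cpi + branch_freq * p for scheme, p in penalties.items()}
for scheme, c in cpi.items():
    print(f"{scheme}: CPI = {c:.2f}")
```

Under these assumptions, stalling costs CPI 1.6 while a delayed branch costs only 1.1, which is why the cheaper schemes matter.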
Example: let the predictor's initial value be T, and let the actual outcomes of the branches be NT, T, T, NT, T, T.
The predictions are then: T, NT, T, T, NT, T. Comparing against the outcomes, only 2 of the 6 predictions are correct (about 33% accuracy), even though the branch is taken 4 of 6 times.
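A 1-bit predictor simply predicts the last outcome. This sketch replays the example's outcome sequence and counts correct predictions:

```python
# Sketch: a 1-bit branch predictor (predict the previous outcome) run on
# the example's outcome sequence, starting from an initial prediction of T.

def one_bit_predictor(outcomes, initial="T"):
    state, preds = initial, []
    for actual in outcomes:
        preds.append(state)
        state = actual          # 1-bit state: remember only the last outcome
    return preds

outcomes = ["NT", "T", "T", "NT", "T", "T"]
preds = one_bit_predictor(outcomes)
correct = sum(p == a for p, a in zip(preds, outcomes))
print(preds, f"{correct}/6 correct")
```

Each NT outcome costs two mispredictions (one on the NT itself, one on the following T), which is the weakness the 2-bit and k-bit predictors mentioned below address.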
In general, we can have k-bit predictors; more on this when we cover superscalar processors.
Speedup = [ Pipeline depth / (1 + Pipeline stall CPI) ] × (Cycle Time_unpipelined / Cycle Time_pipelined)
Hazards limit the performance of pipelined computers:
Structural: need more hardware resources. Data (RAW, WAR, WAW): need forwarding and compiler scheduling. Control: delayed branches, prediction.
Pipelines pass control information down the pipe just as data moves down the pipe. Forwarding and stalls are handled by local control. Exceptions stop the pipeline.
Introduction to ILP
What is ILP?
Processor and Compiler design techniques that speed up execution by causing individual machine operations to execute in parallel
[Figure: ILP architecture taxonomy — Superscalar, Dataflow, and Independence architectures (VLIW, TTA), distinguished by whether hardware or the compiler determines dependences/independences and by who schedules execution.]
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallelism: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.