PIPELINING
Outline
What is pipelining?
The basic pipeline for a RISC instruction set
The major hurdle of pipelining – pipeline hazards
Data hazards
Control hazards
Performance
What is pipelining?
Pipelining is an implementation technique whereby
multiple instructions are overlapped in execution.
Not visible to the programmer
Each step in the pipeline completes a part of an
instruction.
Each step is completing different parts of different
instructions in parallel
Each of these steps is called a pipe stage or a pipe
segment.
What is pipelining?
Like an assembly line in a factory
The time required to move an instruction one step down the pipeline is a machine cycle.
All the stages must be ready to proceed at the same time
The slowest pipe stage dominates
A machine cycle is usually one clock cycle (sometimes two, rarely more)
The pipeline designer's goal is to balance the length of each pipeline stage.
[Figure 8.1 The basic idea of instruction pipelining: (a) sequential execution, where each instruction's fetch (F) and execute (E) steps finish before the next begins; (b) hardware organization, with an interstage buffer B1 between the instruction fetch unit and the execution unit; (c) pipelined execution, where F2 overlaps E1, and so on]
Pipeline Stages
Every stage is completed within one CPU clock
F (Fetch)
D (Decode)
E (Execute)
W (Write)
[Figure 8.2 A 4-stage pipeline: (a) instruction execution divided into four steps, each taking one clock cycle; (b) hardware organization with interstage buffers B1, B2, B3 between the stages. F: fetch instruction; D: decode instruction and fetch operands; E: execute operation; W: write results]
How about the clock of IF?
Because all the stages must complete in the same time, we cannot let IF access main memory
Thus IF must access the L1 cache
When the cache is built into the CPU, its access time is almost the same as the time to execute other operations within the CPU
The slowest pipe stage dominates
Performance of Pipeline
If the stages are perfectly balanced, then the time per instruction on the pipelined machine, assuming ideal conditions, is equal to

    Time per instruction on unpipelined machine / Number of pipe stages
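A quick numeric check of this formula. The function name and the 8 ns / 4-stage figures below are illustrative assumptions, not values from the slides:

```python
def pipelined_time_per_instruction(unpipelined_time_ns, num_stages):
    """Ideal time per instruction with perfectly balanced stages
    and no hazards: the unpipelined time divided by the stage count."""
    return unpipelined_time_ns / num_stages

# A 4-stage pipeline (F, D, E, W): if an instruction takes 8 ns
# unpipelined, the pipeline ideally completes one every 2 ns,
# a 4x throughput improvement.
ideal = pipelined_time_per_instruction(8.0, 4)
print(ideal)  # 2.0
```

Note this is a throughput figure: each individual instruction still takes the full time (or slightly more) to travel through all stages.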
Pipeline Hazards
Structural hazards
Caused by resource conflicts
Possible to avoid by adding resources, but that may be too costly
Data hazards
An instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
Can be mitigated somewhat by a smart compiler
Control hazards
Occur when the PC does not simply get incremented
Branches and jumps; not too bad
Problems a pipeline may encounter (data hazard)
[Figure 8.3 Effect of an execution operation taking more than one clock cycle: I2 (a divide) occupies the E stage for three cycles, stalling the pipeline; I3's decode is held and the following instructions are delayed]
I3 must wait for I2's result before it can use the ALU
Problems a pipeline may encounter (control hazard or instruction hazard)
[(a) Instruction execution steps in successive clock cycles, with no disruption; (b) function performed by each processor stage in successive clock cycles: I2's fetch (F2) stretches across cycles 2-5, so the D, E, and W stages sit idle, creating bubbles (stalls) in the pipeline]
Problems a pipeline may encounter (structural hazard)
[Timing: I2 (a Load) needs an extra memory-access step M2, so two instructions contend for the same pipeline resource in the same cycle and the instructions behind it are delayed by one cycle]
A pipeline cannot reduce the execution time of an individual instruction; in fact it slightly lengthens it, because of the latches added between stages
But pipelining improves throughput
It shortens the average time between instruction completions
Data Hazard
Add R2,R3,R4
Sub R5,R4,R6
One instruction's source register is the destination register of the previous one
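This dependency can be detected mechanically. A minimal sketch, assuming a (opcode, src1, src2, dest) tuple encoding with the destination written last, matching the slides' Add R2,R3,R4:

```python
def raw_hazard(producer, consumer):
    """True if the consumer reads the register the producer writes
    (a read-after-write dependency)."""
    _, _, _, dest = producer
    _, src1, src2, _ = consumer
    return dest in (src1, src2)

add = ("Add", "R2", "R3", "R4")   # writes R4
sub = ("Sub", "R5", "R4", "R6")   # reads R4 -> hazard
print(raw_hazard(add, sub))       # True
```

A real pipeline performs this comparison in hardware, between the destination fields of instructions in flight and the source fields of the instruction being decoded.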
Data Hazard
[Figure 8.6 Pipeline stall caused by the data dependency between D2 and W1: I2 (Sub) spends extra cycles in decode (D2, D2A) waiting until I1 (Add) writes its result in W1]
Forwarding (also called bypassing, shorting, or short-circuiting)
[Timing with forwarding: I2 (Sub) receives I1's ALU result directly, so I1 through I4 proceed with no stall cycles]
Forwarding
The key is to keep the ALU result around
ADD produces the R4 value at the ALU output
SUB needs it again at the ALU input
Add R2,R3,R4
Sub R5,R4,R6
How do we handle this in general?
The forwarded value can come from the ALU output or the Mem stage output
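One way to sketch that selection logic. The stage names and the (register, value) pair representation are my own; this is not any particular machine's forwarding unit:

```python
def select_operand(reg, regfile, alu_out, mem_out):
    """Pick the freshest value for a source register.
    alu_out / mem_out are (dest_reg, value) pairs for results still
    in flight, or None; the newest in-flight result wins."""
    if alu_out is not None and alu_out[0] == reg:
        return alu_out[1]        # forwarded from the ALU output
    if mem_out is not None and mem_out[0] == reg:
        return mem_out[1]        # forwarded from the Mem stage output
    return regfile[reg]          # no hazard: read the register file

# Add R2,R3,R4 just produced R4 = 3 at the ALU output; the register
# file still holds a stale value, but Sub receives the forwarded one.
regfile = {"R2": 1, "R3": 2, "R4": 99, "R5": 7}
value = select_operand("R4", regfile, ("R4", 3), None)
print(value)  # 3
```

The hardware equivalent is a pair of multiplexers in front of the ALU inputs, steered by comparators on the in-flight destination registers.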
[Figure 8.7 Operand forwarding in a pipelined processor: (a) the datapath, with source registers SRC1 and SRC2 feeding the ALU from the register file and the result going to RSLT; (b) position of the source and result registers in the processor pipeline, with a forwarding path from the E (ALU) stage output back to its inputs, ahead of the W (register file) stage]
Instruction side effects
An implicit data hazard
Autoincrement/autodecrement addressing modes (push, pop)
The carry flag:
Add R1,R3
AddWithCarry R2,R4
R4 ← [R2] + [R4] + carry
Control hazards (instruction hazards)
[Figure 8.8 Idle cycles caused by a branch instruction: I3 is fetched in cycle 3 but discarded once the branch I2 resolves in E2; the execution unit sits idle (the branch penalty) until the target instruction Ik is fetched]
[Figure 8.9 Branch timing: (top) when the branch decision is made in the E stage, the two instructions fetched after the branch (I3, I4) are discarded before the target Ik is fetched; (bottom) when the branch decision is made in the D stage, only I3 is discarded and the penalty is one cycle]
The instruction queue is used mainly for detecting and dealing with branches
[Hardware organization: the instruction fetch unit (F: fetch instruction) feeds an instruction queue, followed by the D: dispatch/decode unit, E: execution unit, and W: write results]
Branch Instructions
About 20% of dynamically executed instructions are branches; that is, one of every five instructions executed is a branch
Branches have a huge impact on the pipeline, so modern CPUs use various strategies to handle this control hazard, such as delayed branches, branch prediction, and branch folding
Delayed branch
(make the stall cycle useful)
LOOP Shift_left R1
     Decrement R2
     Branch=0 LOOP
NEXT Add R1,R3
(a) Original program loop
LOOP Decrement R2
     Branch=0 LOOP
     Shift_left R1    (branch delay slot)
NEXT Add R1,R3
(b) Reordered instructions
Figure 8.12 Instructions reordered for a delayed branch
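The compiler's reordering in Figure 8.12 can be sketched as a tiny pass that moves the last branch-independent instruction of the loop body into the delay slot. The tuple encoding of register reads and writes is my own simplification:

```python
def fill_delay_slot(body, branch):
    """body: list of (text, registers_written); branch: (text, registers_read).
    Move one instruction the branch does not depend on into the slot
    after the branch; if none is safe, the slot is left for a NOP."""
    _, reads = branch
    for i in range(len(body) - 1, -1, -1):
        text, writes = body[i]
        if not set(writes) & set(reads):       # branch-independent
            return body[:i] + body[i + 1:] + [branch, body[i]]
    return body + [branch]                     # nothing safe to move

body = [("Shift_left R1", ["R1"]), ("Decrement R2", ["R2"])]
branch = ("Branch=0 LOOP", ["R2"])             # the branch tests R2
for text, _ in fill_delay_slot(body, branch):
    print(text)
# Decrement R2, then Branch=0 LOOP, then Shift_left R1 in the delay
# slot -- the same order as Figure 8.12(b)
```

Decrement cannot fill the slot because the branch tests R2; Shift_left can, because its result is needed regardless of the branch outcome.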
[Timing of the reordered loop: in each iteration, Decrement, Branch, and Shift (in the delay slot) are fetched and executed back to back; because the instruction in the delay slot is always executed, no cycle is wasted whether or not the branch is taken]
Branch prediction
(Speculative execution)
Static branch prediction
Fixed prediction for every branch
Branch taken
Branch not taken
Dynamic branch prediction
branch-prediction buffer
Branch target buffer (BTB)
Branch-prediction buffer
[Figure 8.15 State-machine representation of the branch-prediction algorithm: states LNT (likely not taken) and LT (likely taken) in the middle, with SNT (strongly not taken) and ST (strongly taken) at the ends; each branch-taken (BT) outcome moves one state toward ST, and each branch-not-taken (BNT) outcome moves one state toward SNT]
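The four-state machine of Figure 8.15 is the classic 2-bit saturating counter. A minimal sketch; the initial state is an arbitrary choice of mine:

```python
class TwoBitPredictor:
    """States 0..3 = SNT, LNT, LT, ST; predict taken in LT and ST.
    Each outcome moves one state toward the matching end, so a single
    mispredict does not flip a strongly-held prediction."""
    def __init__(self, state=1):          # start at LNT (an assumption)
        self.state = state

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
p.update(True); p.update(True)    # a loop branch taken twice -> ST
print(p.predict())                # True
p.update(False)                   # one not-taken outcome (loop exit)
print(p.predict())                # still True: ST only dropped to LT
```

This hysteresis is why the 2-bit scheme mispredicts a steadily-taken loop branch only once per loop exit, rather than twice as a 1-bit predictor would.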
Branch target buffer (BTB)
The Target Buffer
Send the PC of the current instruction to the target buffer in the IF stage
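As a sketch, a BTB can be modeled as a small lookup table keyed by the branch's PC. The dictionary form and the 4-byte instruction size are assumptions for illustration:

```python
btb = {}   # branch PC -> predicted target address

def next_fetch_pc(pc):
    """In the IF stage: on a BTB hit, fetch from the stored target
    next cycle instead of falling through to pc + 4."""
    return btb.get(pc, pc + 4)

btb[0x1000] = 0x2000               # a branch at 0x1000 was taken before
print(hex(next_fetch_pc(0x1000)))  # 0x2000 (predicted target)
print(hex(next_fetch_pc(0x1004)))  # 0x1008 (fall through)
```

A hardware BTB is a small associative memory rather than a full dictionary, and its entries also carry prediction state, but the lookup-by-PC idea is the same.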
Branch folding
The instruction queue is used mainly for detecting and dealing with
branches. The 601's branch unit scans the bottom four entries of the
queue, identifying branch instructions and determining what type
they are (conditional, unconditional, etc.). In cases where the branch
unit has enough information to resolve the branch right then and
there (e.g. in the case of an unconditional branch, or a conditional
branch whose condition is dependent on information that's already
in the condition register) then the branch instruction is simply
deleted from the instruction queue and replaced with the instruction
located at the branch target.
This branch-elimination technique, called branch folding, speeds
performance in two ways. First, it eliminates an instruction (the
branch) from the code stream, which frees up dispatch bandwidth
for other instructions. Second, it eliminates the single-cycle pipeline
bubble that usually occurs immediately after a branch.
Pipeline and addressing modes
A complex addressing mode does not save any clock cycles over an equivalent sequence of simple addressing modes
Load (X(R1)),R2      complex addressing mode
Add #X,R1,R2
Load (R2),R2         equivalent sequence using simple addressing modes
Load (R2),R2
[Figure 8.16 Equivalent operations using complex and simple addressing modes: (a) the complex-mode Load occupies the E stage for several cycles (computing X + [R1], then performing the memory accesses), delaying the next instruction; (b) with simple modes, the Add computes X + [R1] and each Load takes one E cycle, with forwarding between instructions; the total cycle count is the same]
Addressing-mode features of modern processors
(the part RISC emphasizes first)
Accessing an operand does not require more than one memory access
Only load and store instructions access memory operands
The addressing modes used have no side effects
Register addressing mode (no EA computation needed)
Register indirect addressing mode (no EA computation needed)
Indexed addressing mode (EA computed in one clock)
Add R1,R2
Compare R3,R4
Branch=0 ...
(a) Program fragment
Compare R3,R4
Add R1,R2
Branch=0 ...
(b) Reordered instructions
Figure 8.17 Instruction reordering
The dependencies introduced by condition-code flags reduce the compiler's flexibility to reorder instructions
Pipeline Datapath
[Datapath diagram: the register file drives buses A and B into the ALU input latches A and B, and the ALU result R returns on bus C; a PC with an incrementer and the IMAR (memory address for instruction fetch) support fetching; an instruction decoder feeds a pipeline of control signals; an instruction queue and a data cache complete the path]
[Figure 8.19 A processor with two execution units: the F (instruction fetch) unit feeds an instruction queue; a dispatch unit issues instructions to a floating-point unit and an integer unit, whose results go to W (write results)]
[Timing with two execution units: I2 (Add) goes through the integer unit in one E cycle, while I3 (Fsub) occupies the floating-point unit for three E cycles; I4 (Sub) writes its result before I3 does, i.e. instructions complete out of order]
Out-of-order execution
In-order issue
To avoid deadlock
Out-of-order execution
Reorder buffer
In-order retire
Register renaming
Temporary registers hold results until they commit
Imprecise exceptions
Instructions after the excepting instruction are allowed to complete
Precise exceptions
Instructions after the excepting instruction are aborted
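Register renaming can be sketched as a mapping from architectural registers to temporary (physical) registers: every new destination gets a fresh name, so false name dependencies disappear while true data flow is preserved. The (dest, src1, src2) encoding and the P-register names are my own:

```python
def rename(instructions):
    """instructions: (dest, src1, src2) architectural-register tuples.
    Returns the same program with a fresh physical register per write."""
    mapping = {}                        # architectural -> physical
    renamed = []
    for n, (dest, s1, s2) in enumerate(instructions):
        s1p = mapping.get(s1, s1)       # read through the current mapping
        s2p = mapping.get(s2, s2)
        mapping[dest] = f"P{n}"         # fresh temporary, committed later
        renamed.append((mapping[dest], s1p, s2p))
    return renamed

# The two writes to R1 get distinct names (P0 and P2), so the last two
# instructions need not wait behind the first two.
prog = [("R1", "R2", "R3"), ("R4", "R1", "R5"),
        ("R1", "R6", "R7"), ("R8", "R1", "R9")]
print(rename(prog))
```

Only the true read-after-write chains (P0 into the second instruction, P2 into the fourth) remain after renaming; the write-after-write and write-after-read conflicts on R1 are gone.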
[Processor block diagram: a system interconnect bus and an external cache unit; a prefetch and dispatch unit with an instruction buffer; a memory management unit with iTLB and dTLB; an I-cache and a D-cache with load and store queues; floating-point registers feeding a floating-point unit; integer registers feeding an integer execution unit]
[Two integer pipelines, each with stages F (fetch), D (decode), G (group), E (execute), C (cache), N1-N3 (delay), and W (write/check); two floating-point pipelines with stages R (register), X1-X3 (execute), N3, and W; all fed from a shared instruction buffer]
Performance considerations
T = (N × S) / R
where N is the number of instructions executed, S the average number of clock cycles per instruction, and R the clock rate
Instruction throughput (instructions executed per second):
Ps = R / S = N / T
With only an L1 cache:
Dmiss = ((1 − hi) + d(1 − hd)) × Mp
Dmiss = (0.05 + 0.03) × 17 = 1.36 cycles
With L1 and L2 caches:
Dmiss = ((1 − hi) + d(1 − hd)) × (hs × Ms + (1 − hs) × Mp) = 0.46 cycles
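The L1-only arithmetic above can be reproduced directly. Only the 0.05, 0.03, and 17-cycle figures from the slide are used; the L2 hit rate hs and penalty Ms are left as parameters because the slide does not give them:

```python
def miss_penalty_l1(pi, pd, Mp):
    """pi = (1 - hi): instruction-miss fraction per instruction;
    pd = d * (1 - hd): data-miss fraction per instruction;
    Mp: main-memory access penalty in cycles."""
    return (pi + pd) * Mp

def miss_penalty_l2(pi, pd, hs, Ms, Mp):
    """Same miss fractions, but a fraction hs of misses now hits
    in the L2 cache at the smaller cost Ms."""
    return (pi + pd) * (hs * Ms + (1 - hs) * Mp)

d1 = miss_penalty_l1(0.05, 0.03, 17)
print(round(d1, 2))  # 1.36 cycles per instruction, matching the slide
```

Because Ms is much smaller than Mp, even a moderate L2 hit rate cuts the average stall per instruction substantially, which is the point of the 0.46-cycle figure.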
Number of pipeline stages
Instructions farther apart can form dependencies in the pipeline
A longer pipeline makes the branch penalty more severe
Cost increases
UltraSPARC II: 9 stages
Intel Pentium Pro: 12 stages
Pentium 4: 20 stages
Conclusion
Pipeline design considerations:
The processor's instruction set
The design of the pipeline hardware
The design of the compiler
The three interact strongly