Scoreboarding Merit
Enables out-of-order execution and reduces stall
cycles caused by RAW dependences
Tomasulo Algorithm
History:
1966: scoreboarding in the CDC 6600, implementing
limited dynamic scheduling
Three years later: Tomasulo's algorithm in the IBM 360/91,
introducing register renaming and reservation stations
Now appears, in different forms, in today's DEC Alpha, SGI MIPS, Sun
UltraSPARC, Intel Pentium, IBM PowerPC and others
Tomasulo Organization
Renaming Implementation
[Figure: renaming stage. The register fields Rd, Rs, Rt of an instruction are mapped to Pd, Ps, Pt. The rename table lists F0, F2, F4 with producer tags such as Mult1 and LD1.]
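The renaming step can be illustrated with a minimal sketch. It assumes a rename table that maps each architectural register to the tag of the reservation station or load buffer that will produce it; the function name `rename`, the tuple layout, and the extra load tagged `LD2` are hypothetical and only serve the illustration. Timing is ignored: a producer is treated as still pending when a later instruction issues.

```python
def rename(instructions):
    """Return (renamed source operands per instruction, final rename table)."""
    table = {}      # architectural register -> tag of the pending producer
    renamed = []
    for tag, dest, sources in instructions:
        # A source reads the producer's tag if one is pending, else the register itself.
        operands = tuple(table.get(reg, reg) for reg in sources)
        renamed.append((tag, operands))
        # Later readers of dest must wait for this instruction's tag.
        table[dest] = tag
    return renamed, table

# (tag, destination, sources); the third entry is a hypothetical extra load.
program = [
    ("LD1", "F2", ()),
    ("Mult1", "F0", ("F2", "F4")),
    ("LD2", "F2", ()),
]
renamed, table = rename(program)
print(renamed)
print(table)
```

In this sketch Mult1 keeps waiting on LD1 even after F2 is remapped to LD2, which is how renaming removes the WAR and WAW constraints on F2.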
Code Example
LD    F6,34(R2)
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2
[Figure: dependence graph with nodes LD1, LD2, SUBD, MULTD, ADDD, DIVD]
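The dependences in this code sequence can be listed mechanically. The sketch below is an assumption-laden illustration, not part of the slides: the function name `find_hazards` and the tuple encoding are hypothetical, the opcode spellings MULTD and ADDD follow the reservation-station tables, and the check is a simple pairwise comparison that does not filter out dependences hidden by an intervening write (none occur in this sequence).

```python
def find_hazards(program):
    """Classify register dependences between every earlier/later instruction pair."""
    hazards = []
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_i in srcs_j:
                hazards.append(("RAW", i, j, dest_i))  # j reads what i writes
            if dest_j in srcs_i:
                hazards.append(("WAR", i, j, dest_j))  # j overwrites what i reads
            if dest_j == dest_i:
                hazards.append(("WAW", i, j, dest_j))  # both write the same register
    return hazards

# (opcode, destination, sources) for the six instructions of the example
program = [
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]
hazards = find_hazards(program)
for kind, i, j, reg in hazards:
    print(f"{kind}: {program[i][0]} (#{i + 1}) -> {program[j][0]} (#{j + 1}) on {reg}")
```

The WAR and WAW entries all involve F6, which ADDD overwrites after SUBD and DIVD read it; these are the dependences that renaming removes.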
What to Observe
1. Whether some instructions can be issued
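Table layout used in the cycle-by-cycle example:
Instruction status: Issue, Execution complete, Write Result
Reservation stations: Busy, Op, Vj (S1), Vk (S2), Qj (RS for j), Qk (RS for k)
Load buffers: Load1, Load2, Load3, each with Busy and Address
Register result status: FU for F0, F2, F4, F6, F8, ..., F30
Initial state: Load1, Load2 and Load3 are not busy.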
Clock 1
Load buffers: Load1 busy, Address = 34+R2; Load2 and Load3 not busy.
Register result status: F6 = Load1.
Clock 2
Load buffers: Load1 busy (34+R2), Load2 busy (45+R3), Load3 not busy.
Register result status: F2 = Load2, F6 = Load1.
Clock 3
Load buffers: Load1 busy (34+R2), Load2 busy (45+R3), Load3 not busy.
Reservation stations: Mult1 holds MULTD with Vk = R(F4), Qj = Load2.
Register result status: F0 = Mult1, F2 = Load2, F6 = Load1.
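Clock 4
Instruction status: the first load writes its result in cycle 4.
Load buffers: Load1 not busy, Load2 busy (45+R3), Load3 not busy.
Reservation stations: Mult1 has Vk = R(F4), Qj = Load2; Add1 has Vj = M(34+R2), Qk = Load2.
Register result status: F0 = Mult1, F2 = Load2, F6 = M(34+R2), F8 = Add1.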
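Clock 5
Instruction status: the two loads have written their results in cycles 4 and 5.
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Mult1 has Vj = M(45+R3), Vk = R(F4); Add1 has Vj = M(34+R2), Vk = M(45+R3); Mult2 has Vk = M(34+R2), Qj = Mult1.
Register result status: F0 = Mult1, F2 = M(45+R3), F6 = M(34+R2), F8 = Add1, F10 = Mult2.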
Clock 6
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Add2 holds ADDD with Vk = M(45+R3), Qj = Add1; Mult1 has Vj = M(45+R3), Vk = R(F4); Mult2 has Vk = M(34+R2), Qj = Mult1.
Register result status: F0 = Mult1, F2 = M(45+R3), F6 = Add2, F8 = Add1, F10 = Mult2.
Clock 7
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: as at clock 6; Add2 still waits on Add1 and Mult2 on Mult1.
Register result status: F0 = Mult1, F2 = M(45+R3), F6 = Add2, F8 = Add1, F10 = Mult2.
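Clock 8
Instruction status: Issue = 1 to 6 for the six instructions; the loads complete execution in cycles 3 and 4 and write results in cycles 4 and 5; SUBD completes execution in cycle 7 and writes its result in cycle 8.
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Add2 holds ADDD with Vj = M()-M(), Vk = M(45+R3); Mult1 holds MULTD with Vj = M(45+R3), Vk = R(F4); Mult2 holds DIVD with Vk = M(34+R2), Qj = Mult1.
Register result status: F0 = Mult1, F2 = M(45+R3), F6 = Add2, F8 = M()-M(), F10 = Mult2.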
Clock 9
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Add2 (ADDD) and Mult1 (MULTD) have both operands; Mult2 (DIVD) waits on Mult1.
Register result status: F0 = Mult1, F2 = M(45+R3), F6 = Add2, F8 = M()-M(), F10 = Mult2.
Clock 10
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Add2 (ADDD) and Mult1 (MULTD) have both operands; Mult2 (DIVD) waits on Mult1.
Register result status: F0 = Mult1, F2 = M(45+R3), F6 = Add2, F8 = M()-M(), F10 = Mult2.
Clock 11
Instruction status: Issue = 1 to 6; ADDD completes execution in cycle 10 and writes its result in cycle 11.
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Mult1 holds MULTD with Vj = M(45+R3), Vk = R(F4); Mult2 holds DIVD with Vk = M(34+R2), Qj = Mult1.
Register result status: F0 = Mult1, F2 = M(45+R3), F6 = (M-M)+M(), F8 = M()-M(), F10 = Mult2.
Clock 12 to 15
Instruction status: Write Result = 4, 5, 8 and 11 for the two loads, SUBD and ADDD; MULTD and DIVD are still outstanding.
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Mult1 holds MULTD with Vj = M(45+R3), Vk = R(F4); Mult2 holds DIVD with Vk = M(34+R2), Qj = Mult1.
Register result status: F0 = Mult1, F2 = M(45+R3), F6 = (M-M)+M(), F8 = M()-M(), F10 = Mult2.
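Clock 16
Instruction status: Execution complete = 3, 4, 15, 7 and 10 for LD, LD, MULTD, SUBD and ADDD; Write Result = 4, 5, 16, 8 and 11. DIVD is still outstanding.
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Mult2 holds DIVD with Vj = M*F4, Vk = M(34+R2).
Register result status: F0 = M*F4, F2 = M(45+R3).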
Clock 55
Instruction status: unchanged since clock 16; DIVD is still executing.
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Mult2 holds DIVD with Vj = M*F4, Vk = M(34+R2).
Register result status: F0 = M*F4, F2 = M(45+R3).
Clock 56
Instruction status: DIVD completes execution in cycle 56.
Load buffers: Load1, Load2 and Load3 not busy.
Reservation stations: Mult2 holds DIVD with Vj = M*F4, Vk = M(34+R2).
Register result status: F0 = M*F4, F2 = M(45+R3).
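Clock 57
DIVD writes its result; all load buffers are free.
Final instruction status:
Instruction        Issue   Execution complete   Write Result
LD    F6,34(R2)      1             3                 4
LD    F2,45(R3)      2             4                 5
MULTD F0,F2,F4       3            15                16
SUBD  F8,F6,F2       4             7                 8
DIVD  F10,F0,F6      5            56                57
ADDD  F6,F8,F2       6            10                11
Register result status: F0 = M*F4, F2 = M(45+R3).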
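The timings in the example (Execution complete = 3, 4, 15, 7, 56, 10 and Write Result = 4, 5, 16, 8, 57, 11) can be reproduced with a small timing model. This is a sketch under stated assumptions, not the slides' own method: the latencies in `LATENCY` (2 cycles for loads and add/subtract, 10 for multiply, 40 for divide) are assumed values chosen to be consistent with the table, one instruction issues per cycle, and structural hazards on functional units and the CDB are ignored. The name `schedule` is hypothetical.

```python
# Assumed execution latencies in cycles (not stated in the slides).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

def schedule(program):
    """Return (opcode, dest, issue, exec_complete, write_result) per instruction."""
    ready_at = {}  # register -> cycle in which its pending value is broadcast
    rows = []
    for issue, (op, dest, srcs) in enumerate(program, start=1):
        # Operands are captured from the most recent earlier writer (renaming).
        operands_ready = max((ready_at.get(r, 0) for r in srcs), default=0)
        # Execution starts the cycle after both issue and operand broadcast.
        start = max(issue, operands_ready) + 1
        complete = start + LATENCY[op] - 1
        write = complete + 1
        rows.append((op, dest, issue, complete, write))
        ready_at[dest] = write
    return rows

program = [
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]
rows = schedule(program)
for op, dest, issue, complete, write in rows:
    print(f"{op:6s}{dest:4s} issue={issue} complete={complete} write={write}")
```

Because `ready_at["F6"]` is only updated when ADDD is processed, DIVD picks up F6 from the first load, which mirrors how renaming lets ADDD finish long before DIVD without corrupting its operand.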
Machine Correctness
E(D,P) = E(S,P) if
Tomasulo Summary
Reservation stations:
Increase the effective number of registers
Distribute the scheduling logic
Register renaming: avoids WAR and WAW
hazards
Tag + data broadcasting wakes up dependent (child)
instructions
Pros: can be effectively combined with
speculative execution
Cons: CDB broadcasting adds one-cycle delay
(addressed in modern instruction scheduling)
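The tag + data broadcast on the CDB can be sketched as follows. The class `Station`, the function `broadcast`, and the numeric values are hypothetical names and data chosen for illustration; the starting state loosely follows clock 4 of the example, where Mult1 and Add1 both wait on Load2.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    name: str
    vj: Optional[float] = None   # operand values, once available
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the producers still being waited on
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.qj is None and self.qk is None

def broadcast(tag, value, stations, reg_status, reg_file):
    """Deliver (tag, value) to every waiting station and to the register file."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:          # register still expects this producer
            reg_file[reg] = value
            del reg_status[reg]

# Illustrative state: 4.0 stands in for R(F4), 1.5 for M(34+R2), 2.5 for M(45+R3).
mult1 = Station("Mult1", vk=4.0, qj="Load2")
add1 = Station("Add1", vj=1.5, qk="Load2")
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
reg_file = {"F4": 4.0, "F6": 1.5}
broadcast("Load2", 2.5, [mult1, add1], reg_status, reg_file)
print(mult1.ready(), add1.ready(), reg_file["F2"])
```

A single broadcast wakes up both waiting stations and updates F2, while F0 and F8 keep their pending tags until Mult1 and Add1 broadcast their own results.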