Off-Chip Metadata
Hao Wu (UT Austin), Krishnendra Nathella (Arm),
Joseph Pusdesris (Arm), Dam Sunwoo (Arm),
Akanksha Jain (UT Austin), Calvin Lin (UT Austin)
The Problem: Irregular Accesses
A X B D C Y E
Design Space For Irregular Prefetching
[Figure: scatter plot, built up over several slides. X-axis: Speedup, 0% to 40% (higher is better). Y-axis: Traffic, up to 600% (lower is better). Points: Best Offset, STMS, MISB, Triage.]
• State-of-the-art regular prefetchers (Best Offset) have low performance
• Temporal prefetchers: STMS
• Our previous work [MICRO’13][ISCA’19] increases performance
• Our solution (Triage) provides a new design point
Temporal Prefetchers
A X B D C Y E
• Memorize correlations
A X B C D E Y …
10~20 MB of metadata
• Replay memorized accesses
A X B C D E Y …
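The memorize-and-replay idea above can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple pairwise table that maps each address to the address that followed it last time. The class name `TemporalPrefetcher`, its `access` method, and the `degree` parameter are hypothetical and are not taken from STMS or any other evaluated design.

```python
class TemporalPrefetcher:
    """Toy pairwise temporal prefetcher (illustrative sketch only)."""

    def __init__(self, degree=2):
        self.next_addr = {}   # metadata: address -> address that followed it
        self.last = None      # previous address in the access stream
        self.degree = degree  # how many successors to replay per access

    def access(self, addr):
        """Replay memorized successors of addr, then memorize (last -> addr)."""
        # Replay: follow the memorized chain starting at addr.
        prefetches = []
        cur = addr
        for _ in range(self.degree):
            cur = self.next_addr.get(cur)
            if cur is None:
                break
            prefetches.append(cur)
        # Memorize: the previous address is now followed by addr.
        if self.last is not None:
            self.next_addr[self.last] = addr
        self.last = addr
        return prefetches


stream = ["A", "X", "B", "D", "C", "Y", "E"]
pf = TemporalPrefetcher(degree=2)
first_pass = [pf.access(a) for a in stream]   # nothing memorized yet
second_pass = [pf.access(a) for a in stream]  # replays the memorized stream
```

On the first pass the table is empty, so no prefetches are issued. On the second pass each access returns the addresses that followed it earlier. The table grows with the number of distinct addresses, which is the source of the metadata overhead.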
Temporal Prefetchers
• High metadata overhead (10~20 MB)
• Too large to fit on-chip
• Metadata stored off-chip
• ~480% traffic overhead
[Diagram: Cache, Demand Accesses, Metadata Traffic, DRAM, Metadata.]
Managed ISB (MISB) [ISCA’19]
• Caching metadata on-chip
– Prefetch metadata
• ~150% traffic overhead
– Reduces performance in bandwidth-constrained environments
– Complicates hardware design
[Diagram: Cache, Metadata cache, Metadata Prefetcher, Demand Accesses, DRAM, Metadata; labeled ~150% traffic overhead.]
Our Solution: Triage
• Completely removes off-chip metadata
• How can we fit metadata on-chip?
[Diagram: Cache, Metadata cache, Metadata Prefetcher, Demand Accesses, DRAM, Metadata; labeled ~150% traffic overhead.]
Insight 1: Not All Metadata Is Useful
Only 15% matters!
Insight 2: Metadata more useful than data
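• For irregular benchmarks, the benefit of prefetching exceeds the benefit of additional data cache, so the capacity is better used to store metadata
[Figure: bar chart of speedup (%), comparing the benefit of 1MB additional data cache with the benefit of 1MB prefetcher metadata.]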
Metadata stored in last level cache
Original LLC
Data
How to find out useful metadata?
• This is a cache replacement problem!
• We use a state-of-the-art cache replacement policy, Hawkeye [ISCA’16], to figure it out
How large should the metadata store be?
[Diagram: addresses (Addr) feed a Sampler, which chooses one of three LLC configurations: no metadata (Data only), 512KB metadata (Data plus Metadata), or 1MB metadata (Data plus Metadata).]
Evaluation Methodology
• Industrial Simulator
– ARMv8 AArch64
– OoO Core
– Bandwidth: 32GB/s
• Multicore – ChampSim
– Similar trends as the industrial simulator
• Benchmark
– SPEC2006 Irregular subset
– CloudSuite
• Prefetchers
– On chip metadata: BO, SMS
– Off chip metadata: STMS, Domino, MISB
Performance – Single Core
[Figure: bar chart of speedup (%), -10% to 50%, for gcc, mcf, soplex, omnetpp, astar, sphinx3, xalancbmk, and Geomean; series: BO, SMS, Triage.]
[Figure: scatter plot of Traffic (0% to 200%) against Speedup (0% to 40%); points: Best Offset, Triage, MISB.]
Performance – Multi Core
[Figure: bar chart of speedup (%) for 2-core, 4-core, 8-core, and 16-core systems; series: MISB, Triage.]
Performance – 4-Core SPEC Irregular
[Figure: speedup (%) for 30 four-core mixes, MIX1 to MIX30; series: bo, triage, bo_triage.]
Performance – 4-Core All SPEC
[Figure: speedup (%), -10 to 60; series: bo, triage, bo_triage.]
Performance - CloudSuite
[Figure: bar chart of speedup (%), -40 to 80, for cassandra, classification, cloud9, nutch, stream, and Average; series: BO, Triage-Dyn, BO+Triage-Dyn.]
Conclusion
• Triage provides a new design point for temporal
prefetchers
• Triage improves performance with low traffic overhead
– 23.5% speedup vs. 5.8% for BO
– 59.3% traffic overhead vs. 156.4% for MISB
– Better performance than MISB in 16-core systems
• For more information, please read our MICRO paper
Thank you!
Regular Prefetching
A B C D E F G
a1 a2 a3
• PC-localization: Segregate the global stream by the load instruction’s PC
• PC-localized streams are more predictable!
[Diagram: addresses F, B, G, A, C, D, E.]
Metadata Representation
F B A D C E ….
a1 a2 a3 ….
Address   Neighbor
F         B
B         A
A         D
…         …
a1        a2
a2        a3
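The Address/Neighbor table above can be built from a PC-localized stream as in the following sketch. It assumes each access is a `(pc, address)` pair and that only the most recent neighbor per address is kept. The function name `build_metadata`, the PC labels `pc1` and `pc2`, and the particular interleaving of the two streams are hypothetical choices made for illustration.

```python
def build_metadata(accesses):
    """Build an address -> neighbor table from (pc, address) pairs.

    Correlations are formed only between consecutive accesses issued
    by the same PC (PC-localization).
    """
    last_by_pc = {}  # pc -> most recent address issued by that pc
    neighbor = {}    # metadata table: address -> next address in its PC stream
    for pc, addr in accesses:
        prev = last_by_pc.get(pc)
        if prev is not None:
            neighbor[prev] = addr
        last_by_pc[pc] = addr
    return neighbor


# Hypothetical interleaving of the two PC-localized streams shown above:
# pc1 issues F B A D C E, pc2 issues a1 a2 a3.
accesses = [
    ("pc1", "F"), ("pc2", "a1"), ("pc1", "B"), ("pc1", "A"), ("pc2", "a2"),
    ("pc1", "D"), ("pc2", "a3"), ("pc1", "C"), ("pc1", "E"),
]
table = build_metadata(accesses)
```

Because correlations are tracked per PC, the interleaving of the two streams in the global order does not affect the resulting table.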
Metadata Replacement
• Hawkeye [Jain et al., ISCA’16]
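[Diagram: Hawkeye applied to metadata. Metadata accesses feed OPTgen, which computes the past OPT solution. The OPT decision and the PC train the Hawkeye Predictor, which uses the past OPT solution to predict the future. The predictor drives the insertion policy of the metadata storage.]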
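The OPTgen step, which computes what the optimal policy would have done on past accesses, can be sketched with an occupancy vector. This is a simplified model that assumes a fully associative store and unbounded history. The function name `optgen_hits` is hypothetical, and the PC-indexed predictor that Hawkeye trains from these decisions is not modeled here.

```python
def optgen_hits(trace, capacity):
    """Return, per access, whether OPT would have hit (fully associative model)."""
    occupancy = []   # occupancy[t]: number of lines OPT keeps live across time t
    last_seen = {}   # address -> time of its previous access
    decisions = []
    for t, addr in enumerate(trace):
        occupancy.append(0)
        prev = last_seen.get(addr)
        hit = False
        if prev is not None:
            interval = range(prev, t)
            # The line could be kept only if the cache was never full
            # at any point between its previous use and now.
            if all(occupancy[i] < capacity for i in interval):
                hit = True
                for i in interval:
                    occupancy[i] += 1
        decisions.append(hit)
        last_seen[addr] = t
    return decisions


decisions = optgen_hits(list("ABCABC"), capacity=2)
```

With a capacity of 2, the second accesses to A and B are hits under OPT, while the second access to C is a miss because the store is already full during part of its reuse interval.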
Metadata Storage
• We use the last level cache (LLC) to store metadata
• For irregular workloads, the benefit from prefetching is higher than the performance loss from reducing the LLC size
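Dynamic Allocation
[Diagram: addresses (Addr) feed two OPTgen instances, one modeling a 512K metadata store and one modeling a 1M metadata store. Their outcomes select the LLC partition between Data and Metadata.]
• OPTgen (512K) ×, OPTgen (1M) ×: No metadata
• OPTgen (512K) ×: 1MB metadata
• OPTgen (512K) √: 512KB metadata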
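The three cases above can be read as a small decision procedure. The sketch below is an assumption-heavy illustration: the slides do not state what makes an OPTgen outcome a √ or a ×, so the code treats it as the sampled hit rate meeting a threshold. The function name `choose_metadata_size`, its parameters, and the default threshold of 0.5 are all hypothetical.

```python
KB = 1024


def choose_metadata_size(hit_rate_512k, hit_rate_1m, threshold=0.5):
    """Pick a metadata partition size in bytes: 0, 512 KB, or 1 MB.

    Assumption: an OPTgen instance "passes" when its estimated hit rate
    reaches the threshold. The actual criterion is not given in the slides.
    """
    if hit_rate_512k >= threshold:
        return 512 * KB      # the smaller store is already sufficient
    if hit_rate_1m >= threshold:
        return 1024 * KB     # only the larger store is worthwhile
    return 0                 # metadata is not useful: give all space to data


no_metadata = choose_metadata_size(0.2, 0.3)
large_store = choose_metadata_size(0.2, 0.8)
small_store = choose_metadata_size(0.7, 0.9)
```

Checking the smaller size first mirrors the slide order, where 512KB is chosen whenever its OPTgen instance passes, which leaves more of the LLC for data.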
Background: ISB
A X B D C Y E
• Assign a structural address for each access in a stream
• Convert irregular access streams to sequential streams
• Prefetch the next address in structural address space
Metadata
Physical  Structural
A         71
X         72
B         73
…         …
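The structural-address mapping above can be illustrated with two dictionaries. This is a minimal sketch, assuming a single stream and no remapping. The class name `StructuralMap`, its methods, and the starting value 71 (chosen only to mirror the table) are hypothetical, and the PC-localization and chunked allocation of a real ISB are omitted.

```python
class StructuralMap:
    """Toy physical <-> structural address mapping (illustrative sketch)."""

    def __init__(self, start=71):
        self.phys_to_struct = {}
        self.struct_to_phys = {}
        self.next_free = start

    def train(self, phys):
        """Give a physical address the next structural address on first sight."""
        if phys not in self.phys_to_struct:
            self.phys_to_struct[phys] = self.next_free
            self.struct_to_phys[self.next_free] = phys
            self.next_free += 1

    def prefetch(self, phys):
        """Return the physical address at the next structural address, if any."""
        s = self.phys_to_struct.get(phys)
        if s is None:
            return None
        return self.struct_to_phys.get(s + 1)


isb = StructuralMap()
for addr in ["A", "X", "B"]:
    isb.train(addr)
candidate = isb.prefetch("A")
```

After training on A, X, B, an access to A maps to structural address 71, so the prefetch candidate is whatever sits at 72, which is X. The irregular physical stream becomes a sequential walk in structural space.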
[Diagram: the Metadata cache holds the on-chip metadata (A→71, X→72), a subset of the off-chip metadata stored in DRAM (A→71, X→72, B→73, …). Also shown: Cache, Demand Accesses.]
VA_page PA_page
Components of MISB
• Manage metadata at a fine granularity
• Prefetch metadata
• Filter out unnecessary accesses
MISB Operation (on-chip metadata miss)
• A demand access to A looks up the on-chip metadata and misses (A=?)
• The request goes to the off-chip metadata in DRAM, which holds A→71, X→72, B→73
• The off-chip metadata returns A=71, and the on-chip metadata now holds A→71
MISB Operation (metadata prefetching)
• The on-chip metadata holds A→71
• The metadata prefetcher requests the next structural addresses from the off-chip metadata (72=?, 73=?)
• The off-chip metadata returns 72=X, 73=B
• The on-chip metadata now holds A→71, X→72, B→73
MISB Operation (filtering with a Bloom filter)
• A demand access to M misses in the on-chip metadata (M=?)
• M has no entry in the off-chip metadata (A→71, X→72, B→73), so an off-chip lookup is useless traffic
• A Bloom filter is checked before going off-chip
• The Bloom filter reports that M is absent (M=×), so the off-chip access is filtered out
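The filtering step above can be sketched with a small Bloom filter placed in front of the off-chip lookup. This is an illustrative model only. The class `BloomFilter`, the function `lookup`, the filter size, and the use of SHA-256 as the hash are all assumptions and do not describe the MISB hardware.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))


def lookup(addr, on_chip, off_chip, bloom):
    """Return (structural address or None, whether off-chip metadata was read)."""
    if addr in on_chip:
        return on_chip[addr], False
    if not bloom.might_contain(addr):
        return None, False        # filtered: no useless off-chip request
    value = off_chip.get(addr)    # off-chip read (may still be a false positive)
    if value is not None:
        on_chip[addr] = value
    return value, True


off_chip = {"A": 71, "X": 72, "B": 73}
on_chip = {}
bloom = BloomFilter()
for key in off_chip:
    bloom.add(key)

first = lookup("A", on_chip, off_chip, bloom)    # misses on-chip, reads off-chip
second = lookup("A", on_chip, off_chip, bloom)   # now served on-chip
missing = lookup("M", on_chip, off_chip, bloom)  # M was never stored off-chip
```

A Bloom filter never reports a stored key as absent, so filtering cannot drop a useful metadata request. It can occasionally report an absent key as present, in which case the off-chip read still happens and returns nothing.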
Evaluation Methodology
• Industrial Simulator
– ARMv8 AArch64
– OoO Core
– 2-level TLB
– Bandwidth: 32GB/s
• Multicore – ChampSim
– Similar trends as the industrial simulator
• SPEC2006
– Irregular Subset
• CloudSuite
Evaluated Prefetchers
STMS & Domino
• Global correlation
• Metadata not cacheable
Idealized ISB
• PC localization
• Metadata cacheable
• Syncs metadata with TLB
MISB
• PC localization
• Metadata cacheable
• Prefetches metadata
Performance
[Figure: bar chart of speedup (%), 0% to 50%, built up over three slides, for 403.gcc, 429.mcf, 450.soplex, 471.omnetpp, 473.astar, 482.sphinx3, 483.xalancbmk, and GEOMEAN.]
Traffic Overhead
[Figure: bar chart of traffic overhead (%), 0% to 700%, built up over three slides, for 403.gcc, 429.mcf, 450.soplex, 471.omnetpp, 473.astar, 482.sphinx3, 483.xalancbmk, and GEOMEAN. Value labels shown: 1316% and 70%.]
Performance - SPEC2006 on 8 Cores
[Figure: speedup (%), 0 to 30; series: STMS, Domino, MISB.]
Performance – CloudSuite on 4 cores
[Figure: bar chart of speedup (%), 0 to 20, for cassandra, classification, cloud9, nutch, streaming, and AVERAGE; series: STMS, Domino, MISB.]
Traffic – CloudSuite on 4 Cores
[Figure: bar chart of traffic overhead (%), 0 to 3000, for cassandra, classification, cloud9, nutch, streaming, and AVERAGE; series: STMS, Domino, MISB.]