
Temporal Prefetching Without the Off-Chip Metadata
Hao Wu (UT Austin), Krishnendra Nathella (Arm),
Joseph Pusdesris (Arm), Dam Sunwoo (Arm),
Akanksha Jain (UT Austin), Calvin Lin (UT Austin)
The Problem: Irregular Accesses
A X B D C Y E

• Common in many programs


• Hard to prefetch

2
Design Space For Irregular Prefetching
• State-of-the-art regular prefetchers have low performance
[Figure: Speedup, 0% to 40% (higher is better); data point: Best Offset]

Design Space For Irregular Prefetching
• Temporal prefetchers like STMS [HPCA’09] improve performance
[Figure: Speedup, 0% to 40% (higher is better); data points: Best Offset, STMS]

Design Space For Irregular Prefetching
600%
500% Temporal
STMS
Lower is better

400% prefetchers have


Traffic

300% high traffic


200% overhead.
100%
0%
Best Offset
0% 5% 10% 15% 20% 25% 30% 35% 40%
Speedup

Higher is better
5
Design Space For Irregular Prefetching
• Our previous work [MICRO’13][ISCA’19] increases performance and reduces traffic
[Figure: Traffic, 0% to 600% (lower is better) vs. Speedup, 0% to 40% (higher is better); data points: Best Offset, STMS, MISB]

Design Space For Irregular Prefetching
• If we zoom in, the traffic overhead is still 150%
[Figure: Traffic, 0% to 200% vs. Speedup, 0% to 40%; data points: Best Offset, MISB]

Design Space For Irregular Prefetching
• Our solution provides a new design point.
[Figure: Traffic, 0% to 200% vs. Speedup, 0% to 40%; data points: Best Offset, MISB, Triage]

Temporal Prefetchers
A X B D C Y E
• Memorize correlations
A X B C D E Y …
• Replay memorized accesses
A X B C D E Y …
• 10~20 MB of metadata

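The memorize-and-replay idea can be sketched in a few lines. This is a toy model, not the design of STMS or any other prefetcher in this talk: the class name, the `degree` parameter, and the unbounded history list are assumptions made for illustration.

```python
class TemporalPrefetcher:
    """Toy temporal prefetcher: memorize the access stream, replay it on a repeat."""

    def __init__(self, degree=2):
        self.degree = degree   # hypothetical knob: successors replayed per trigger
        self.history = []      # memorized access stream, oldest first
        self.last_pos = {}     # address -> index of its latest occurrence

    def access(self, addr):
        """Record addr and return the addresses predicted to follow it."""
        prefetches = []
        pos = self.last_pos.get(addr)
        if pos is not None:
            # Replay what followed addr the last time it was seen.
            prefetches = self.history[pos + 1 : pos + 1 + self.degree]
        self.last_pos[addr] = len(self.history)
        self.history.append(addr)
        return prefetches


pf = TemporalPrefetcher(degree=2)
first_pass = [pf.access(a) for a in ["A", "X", "B", "D", "C", "Y", "E"]]
replayed = pf.access("A")   # second visit to A replays its old successors
```

On the first pass nothing is predicted; the second access to A returns the two addresses that followed it before. The history and index grow with the footprint of the program, which is the source of the metadata overhead.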
Temporal Prefetchers
• High metadata overhead (10~20 MB)
• Too large to fit on-chip
• Metadata stored off-chip
• ~480% traffic overhead
[Diagram labels: Cache, Demand Accesses, Traffic, DRAM, Metadata]

Managed ISB (MISB) [ISCA’19]
• Caching metadata on-chip
– Prefetch metadata
• ~150% traffic overhead
– Reduces performance in bandwidth-constrained environments
– Complicates hardware design
[Diagram labels: Cache, Metadata Cache, Metadata Prefetcher, Demand Accesses, DRAM, Metadata, ~150% traffic overhead]

Our Solution: Triage
• Completely removes off-chip metadata
• How can we fit metadata on-chip?
[Diagram labels: Cache, Metadata Cache, Metadata Prefetcher, Demand Accesses, DRAM, Metadata, ~150% traffic overhead]

Insight 1: Not All Metadata Is Useful
• Only 15% matters!

Insight 2: Metadata more useful than data
• For irregular benchmarks, benefit of prefetching > benefit of last level cache (LLC)
• We use part of LLC to store metadata
[Figure: Speedup (%), 0% to 25%; bars: Benefit of 1MB additional data cache, Benefit of 1MB prefetcher metadata]

Metadata stored in last level cache
Original LLC
Data

Part of LLC is used to store metadata


Data Metadata

15
How to find out useful metadata?
• This is a cache replacement problem!
• We use a state-of-the-art cache replacement policy, Hawkeye [ISCA’16], to figure it out

16
How large should the metadata store be?
• A sampler observes accessed addresses (Addr) and chooses one LLC configuration:
– No metadata (Data only)
– 512KB metadata (Data + Metadata)
– 1MB metadata (Data + Metadata)

Evaluation Methodology
• Industrial Simulator
– ARMv8 AArch64
– OoO Core
– Bandwidth: 32GB/s
• Multicore – ChampSim
– Similar trends as the industrial simulator
• Benchmark
– SPEC2006 Irregular subset
– CloudSuite
• Prefetchers
– On chip metadata: BO, SMS
– Off chip metadata: STMS, Domino, MISB

Performance – Single Core
[Figure: Speedup (%), -10% to 50%; benchmarks: gcc, mcf, soplex, omnetpp, astar, sphinx3, xalancbmk, Geomean; series: BO, SMS, Triage]

Performance – Single Core
[Figure: Traffic, 0% to 200% vs. Speedup, 0% to 40%; data points: Best Offset, MISB, Triage]

Performance – Multi Core
[Figure: Speedup (%), 0 to 20; configurations: 2core, 4core, 8core, 16core; series: MISB, Triage]

Performance – 4-Core SPEC Irregular
[Figure: Speedup (%), 0 to 35, for MIX1 through MIX30; series: bo, triage, bo_triage]

Performance – 4-Core All SPEC
[Figure: Speedup (%), -10 to 60; series: bo, triage, bo_triage]

Performance - CloudSuite
[Figure: Speedup (%), -40 to 80; workloads: cassandra, classification, cloud9, nutch, stream, Average; series: BO, Triage-Dyn, BO+Triage-Dyn]

Conclusion
• Triage provides a new design point for temporal
prefetchers
• Triage has good performance
– 23.5% speedup vs. 5.8% for BO
– 59.3% traffic vs. 156.4% for MISB
– Better performance than MISB in 16-core systems
• For more information please read our MICRO paper
27
Thank you!

28
Regular Prefetching
A B C D E F G

• Some programs access memory sequentially


– e.g. MPEG player
• Regular prefetchers are effective and widely used
– e.g. Best offset prefetcher
29
Global vs. PC-Localization
while ( ! end ) {
  read tree->next;
  if (condition)
    read linked_list->next;
}
Global: F B a1 A a2 D C E a3 ….
PC localization:
  F B A D C E ….
  a1 a2 a3 ….
• PC-localization: Segregate the global stream by the load instruction’s PC
• PC-localized streams are more predictable!
[Diagram: tree with nodes F, B, G, A, C, D, E; linked list a1, a2, a3]

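PC-localization amounts to grouping the global stream by the PC of the load that issued each access. A minimal sketch, using the stream from this slide; the PC values `TREE_PC` and `LIST_PC` and the function name are hypothetical:

```python
from collections import defaultdict


def pc_localize(accesses):
    """Split a global stream of (pc, addr) pairs into one address stream per PC."""
    streams = defaultdict(list)
    for pc, addr in accesses:
        streams[pc].append(addr)
    return dict(streams)


# Hypothetical PCs for the two loads in the loop.
TREE_PC, LIST_PC = 0x400, 0x408

# Global stream from the slide: F B a1 A a2 D C E a3
global_stream = [
    (TREE_PC, "F"), (TREE_PC, "B"), (LIST_PC, "a1"), (TREE_PC, "A"),
    (LIST_PC, "a2"), (TREE_PC, "D"), (TREE_PC, "C"), (TREE_PC, "E"),
    (LIST_PC, "a3"),
]
streams = pc_localize(global_stream)
```

Each resulting stream contains only the accesses of one data structure, so its successor relation is not disturbed by interleaving with the other load.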
Metadata Representation
PC-localized streams:
  F B A D C E ….
  a1 a2 a3 ….

Address   Neighbor
F         B
B         A
A         D
…         …
a1        a2
a2        a3

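The Address/Neighbor table can be built by recording, for each address in a PC-localized stream, the address that followed it. The sketch below is illustrative only: the function names and the chained lookup with a `degree` limit are assumptions, not details given on the slide.

```python
def build_neighbor_table(streams):
    """Map each address to its successor within its PC-localized stream."""
    table = {}
    for stream in streams:
        for addr, nxt in zip(stream, stream[1:]):
            table[addr] = nxt   # the most recent successor wins
    return table


def prefetch_chain(table, addr, degree=1):
    """Follow up to `degree` neighbor links starting from addr."""
    out = []
    while len(out) < degree and addr in table:
        addr = table[addr]
        out.append(addr)
    return out


table = build_neighbor_table([
    ["F", "B", "A", "D", "C", "E"],
    ["a1", "a2", "a3"],
])
```

With this table, an access to F with `degree=3` yields B, A, D, while an access to a2 yields only a3 because the list stream ends there.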
Metadata Replacement
• Hawkeye [Jain et al, ISCA’16]

– OPTgen learns from Belady’s OPT policy


– PC-based predictor predicts whether an access is
hit or miss using OPTgen
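The way OPTgen reconstructs Belady's decisions for past accesses can be sketched with an occupancy vector: a reuse is an OPT hit if the cache had spare capacity at every time step of the reuse interval. This is a simplified model under stated assumptions: one fully associative set, an unbounded history, and bypassing allowed. The class and helper names are hypothetical, and set sampling and the PC-based predictor are omitted.

```python
class OPTgen:
    """Simplified OPTgen: decide whether OPT would have hit on each access."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.occupancy = []    # occupancy[t]: lines OPT keeps live across time t
        self.last_seen = {}    # address -> time of its previous access
        self.hits = 0
        self.accesses = 0

    def access(self, addr):
        now = len(self.occupancy)
        self.occupancy.append(0)
        self.accesses += 1
        prev = self.last_seen.get(addr)
        self.last_seen[addr] = now
        if prev is None:
            return False       # first touch is always a miss
        interval = range(prev, now)
        if all(self.occupancy[t] < self.capacity for t in interval):
            # OPT would have kept the line for the whole reuse interval.
            for t in interval:
                self.occupancy[t] += 1
            self.hits += 1
            return True
        return False           # the cache was full at some point in the interval


def opt_hits(stream, capacity):
    """Run OPTgen over a stream and return the hit/miss decision per access."""
    gen = OPTgen(capacity)
    return [gen.access(a) for a in stream]
```

For the stream A B A B, a capacity of 1 gives a single OPT hit (the second A), and a capacity of 2 gives two hits. A predictor can then be trained on which PCs tend to produce OPT hits.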

32
Metadata Replacement
[Diagram: inputs Metadata Access and PC; OPTgen (computes past OPT solution) → Last OPT decision → Hawkeye Predictor (uses past OPT solution to predict future) → Insertion Policy → Metadata Storage]

Metadata Storage
• We use last level cache (LLC) to store
metadata
• For irregular workloads, the benefit from
prefetching is higher than performance
reduction from reducing LLC size
34
Dynamic Allocation
Two OPTgen instances, sized for 512K and 1M of metadata, observe the address stream (Addr) and select the LLC configuration:
• OPTgen (512K) ×, OPTgen (1M) ×: No metadata (Data only)
• OPTgen (512K) ×: 1MB metadata (Data + Metadata)
• OPTgen (512K) √: 512KB metadata (Data + Metadata)

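The selection among the three configurations can be written as a small decision function. The slides only show check and cross marks, so this sketch assumes a mark means "the OPTgen hit rate for that size clears a threshold"; the threshold value, the function name, and the parameter names are all hypothetical.

```python
def choose_metadata_kb(hit_rate_512k, hit_rate_1m, threshold=0.05):
    """Pick the metadata store size in KB from two OPTgen hit-rate estimates.

    Assumption: a size "passes" when its estimated metadata hit rate is at
    least `threshold`. The smaller size is preferred when it passes, so that
    more of the LLC is left for data.
    """
    if hit_rate_512k >= threshold:
        return 512      # 512KB metadata
    if hit_rate_1m >= threshold:
        return 1024     # 1MB metadata
    return 0            # no metadata: the whole LLC holds data
```

For example, `choose_metadata_kb(0.0, 0.0)` keeps the whole LLC for data, while `choose_metadata_kb(0.01, 0.2)` allocates 1MB because only the larger store is estimated to be useful.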
Background: ISB
A X B D C Y E
• Assign a structural address for each access in a stream
• Convert irregular access streams to sequential streams
• Prefetch the next address in structural address space

Metadata
Physical   Structural
A          71
X          72
B          73
…          …

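The structural-address idea can be sketched as two maps: physical to structural for training, and structural to physical for prediction. This is a simplified illustration that assumes a single stream and unbounded tables; the class name, the `degree` parameter, and the starting value 71 (taken from the example table) are illustrative choices.

```python
class StructuralMap:
    """Toy ISB-style mapping between physical and structural addresses."""

    def __init__(self, first_structural=71):
        self.next_free = first_structural
        self.phys_to_struct = {}
        self.struct_to_phys = {}

    def train(self, phys):
        """Assign the next consecutive structural address on first sight."""
        if phys not in self.phys_to_struct:
            s = self.next_free
            self.next_free += 1
            self.phys_to_struct[phys] = s
            self.struct_to_phys[s] = phys
        return self.phys_to_struct[phys]

    def predict(self, phys, degree=1):
        """Return the physical addresses at the next structural addresses."""
        s = self.phys_to_struct.get(phys)
        if s is None:
            return []
        out = []
        for k in range(1, degree + 1):
            nxt = self.struct_to_phys.get(s + k)
            if nxt is not None:
                out.append(nxt)
        return out


isb = StructuralMap()
assigned = [isb.train(p) for p in ["A", "X", "B"]]
```

After training on A, X, B the irregular stream occupies structural addresses 71, 72, 73, so prefetching for A reduces to looking up 72 and 73.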
Background: ISB
• ISB’s metadata cache is synchronized with the TLB
• Off-chip metadata in DRAM holds the full Physical → Structural table: A → 71, X → 72, B → 73, …
• On-chip metadata cache holds the entries for pages resident in the TLB
[Diagram sequence, TLB columns VA_page → PA_page:
– TLB holds V1 → A and V2 → X; on-chip metadata holds A → 71 and X → 72
– TLB replaces V1 → A with V4 → B; on-chip metadata replaces A → 71 with B → 73]

Deficiencies of ISB
On-Chip Metadata Size Required = TLB Size * Cache Lines Per Page
[Diagram: TLB entries V4 → B and V2 → X; on-chip metadata rows B → 73, X → 72, and B+1 … B+63 → INVALID]
• Metadata is managed at coarse granularity
– ~90% traffic is useless due to lack of spatial locality
• Metadata size is proportional to page size
– ISB does not scale to large pages
• Metadata size is proportional to TLB size
– ISB does not work for two-level TLBs

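The size formula is easy to evaluate for concrete configurations. The numbers below are illustrative assumptions, not figures from the talk: a 64-entry TLB, 64-byte cache lines (consistent with the B+1 … B+63 rows in the diagram for a 4KB page), and 4KB versus 2MB pages.

```python
def isb_onchip_entries(tlb_entries, page_bytes, line_bytes=64):
    """On-chip metadata entries required = TLB size * cache lines per page."""
    lines_per_page = page_bytes // line_bytes
    return tlb_entries * lines_per_page


# Hypothetical 64-entry TLB.
small_pages = isb_onchip_entries(tlb_entries=64, page_bytes=4 * 1024)
large_pages = isb_onchip_entries(tlb_entries=64, page_bytes=2 * 1024 * 1024)
```

Under these assumptions the requirement grows from 4,096 entries with 4KB pages to 2,097,152 entries with 2MB pages, which illustrates why the design does not scale to large pages or large TLBs.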
Components of MISB
• Manage metadata at a fine granularity
• Prefetch metadata
• Filter out unnecessary accesses

48
MISB Operation: Manage metadata at a fine granularity
Off-chip metadata (Physical → Structural): A → 71, X → 72, B → 73
• A demand access to A misses in the on-chip metadata (A → ?)
• The request A=? is sent to the off-chip metadata
• The reply A=71 fills a single on-chip entry (A → 71)

MISB Operation: Prefetch metadata
Off-chip metadata (Physical → Structural): A → 71, X → 72, B → 73
• With A → 71 on-chip, the metadata prefetcher requests the next structural addresses (72=?, 73=?)
• The reply 72=X, 73=B arrives from the off-chip metadata
• The on-chip metadata now holds A → 71, X → 72, B → 73

MISB Operation: Filter out unnecessary accesses
Off-chip metadata (Physical → Structural): A → 71, X → 72, B → 73
• A demand access to M misses in the on-chip metadata (M → ?)
• Without filtering, the request M=? goes off-chip even though M has no entry there: useless traffic!
• With a Bloom filter, the request M=? is checked on-chip and dropped (M=×)

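The filtering step can be sketched with a small Bloom filter that records which physical addresses have off-chip metadata. This is an illustration only: the filter size, the number of hashes, the SHA-256-based hashing, and the `lookup_offchip` helper are assumptions, not MISB's hardware design.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))


def lookup_offchip(addr, bloom, offchip, stats):
    """Go off-chip only when the filter says the entry may exist."""
    if not bloom.might_contain(addr):
        stats["filtered"] += 1            # request dropped on-chip
        return None
    stats["offchip_reads"] += 1           # real or false-positive access
    return offchip.get(addr)


offchip = {"A": 71, "X": 72, "B": 73}
bloom = BloomFilter()
for phys in offchip:
    bloom.add(phys)
stats = {"filtered": 0, "offchip_reads": 0}
a_result = lookup_offchip("A", bloom, offchip, stats)
m_result = lookup_offchip("M", bloom, offchip, stats)
```

A lookup for A always reaches the off-chip table because a Bloom filter has no false negatives. A lookup for M is dropped on-chip unless it happens to be a false positive, in which case it costs one wasted off-chip read and still returns nothing.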
Components of MISB
• Manage metadata at a fine granularity
• Prefetch metadata
• Filter out unnecessary accesses

62
Evaluation Methodology
• Industrial Simulator
– ARMv8 AArch64
– OoO Core
– 2-level TLB
– Bandwidth: 32GB/s
• Multicore – ChampSim
– Similar trends as the industrial simulator
• SPEC2006
– Irregular Subset
• CloudSuite

Evaluated Prefetchers
STMS & Domino
• Global correlation
• Metadata not cacheable
Idealized ISB
• PC localization
• Metadata cacheable
• Syncs metadata with TLB
MISB
• PC localization
• Metadata cacheable
• Prefetches metadata

Performance
[Figure: Speedup (%), 0.0% to 50.0%; benchmarks: 403.gcc, 429.mcf, 450.soplex, 471.omnetpp, 473.astar, 482.sphinx3, 483.xalancbmk, GEOMEAN; series: STMS, Domino, ISB, MISB]

Traffic Overhead
[Figure: Traffic Overhead (%), 0% to 700%; benchmarks: 403.gcc, 429.mcf, 450.soplex, 471.omnetpp, 473.astar, 482.sphinx3, 483.xalancbmk, GEOMEAN; series: STMS, Domino, ISB, MISB; annotations: 1316%, 70%]

Performance - SPEC2006 on 8 Cores
[Figure: Speedup (%), 0 to 30; series: STMS, Domino, MISB]

Performance – CloudSuite on 4 Cores
[Figure: Speedup (%), 0 to 20; workloads: cassandra, classification, cloud9, nutch, streaming, AVERAGE; series: STMS, Domino, MISB]

Traffic – CloudSuite on 4 Cores
[Figure: Traffic Overhead (%), 0 to 3000; workloads: cassandra, classification, cloud9, nutch, streaming, AVERAGE; series: STMS, Domino, MISB]