Characterizing Multi-threaded Applications based on Shared-Resource Contention
Tanima Dey
Wei Wang
Jack W. Davidson
I. INTRODUCTION
Large-scale multicore processors, known as chip multiprocessors (CMPs), have several advantages over complex uniprocessor systems, including support for both thread-level and instruction-level parallelism. As such, CMPs offer higher overall computational power, which is useful for many types of applications, including parallel and server applications. Consequently, these machines have become the new mainstream architecture and are now widely used in servers, laptops, and desktops. This shift to CMPs has created many challenges for fully utilizing the power of multiple execution cores. One of these challenges is the management of contention for shared resources. Because each core has its own separate hardware execution modules, contention among applications for per-core computing resources has been reduced. However, there are several resources in CMPs that are shared by some or all of the processing cores, including on-chip shared and last-level caches (LLC), hardware prefetchers, the front-side bus (FSB), and the memory bus.
[Figure 1 block diagrams: (a) Yorkfield — cores C0–C3, two shared L2-caches, one FSB; (b) Harpertown — cores C0–C7, four shared L2-caches, two FSBs. Each core has a private L1-cache with its own hardware prefetcher, and each L2-cache has an L2 hardware prefetcher and an FSB interface connecting through the Memory Controller Hub (Northbridge) over the memory bus to memory.]
Fig. 1. Experimental platforms (CX stands for processor core; L1 HW-PF and L2 HW-PF stand for the hardware prefetchers for the L1- and L2-caches, respectively; and FSB and MB stand for Front Side Bus and Memory Bus, respectively).
A. Experimental platforms
[Figure 2 diagrams: two thread-to-core mappings of a two-threaded application on Yorkfield — in one configuration both threads run on a single core and share its L1-cache; in the other, the threads run on two cores that share the same L2-cache, each thread with a private L1-cache.]
… (shown in Figure 2(b)) or C1. As we measure contention for the L1-cache only, we keep the effect of L2-cache contention the same by mapping threads to cores that share the same L2-cache. Furthermore, to make sure that contention for the FSB remains unchanged, we choose Yorkfield, which has one FSB, for this experiment. The only difference between these two configurations is the way the L1-cache is shared between the threads, so we are able to measure how L1-cache contention affects the benchmarks' performance.
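The thread placement described above can be sketched as a small helper. The topology constants and function names below are our own illustrative assumptions (not from the paper), and the pinning call is Linux-specific:

```python
import os
import threading

# Assumed Yorkfield topology from Fig. 1: four cores with private
# L1-caches; cores 0/1 share one L2-cache, cores 2/3 share the other.
L2_GROUPS = [{0, 1}, {2, 3}]

def placement(share_l1: bool) -> list:
    """Pick one core per thread for a two-threaded benchmark.

    share_l1=True : both threads on the same core, so they share its
                    L1-cache (the contention configuration).
    share_l1=False: two cores in the same L2 group, so each thread has
                    a private L1-cache while L2-cache and FSB contention
                    match the other configuration (the baseline).
    """
    c0, c1 = sorted(L2_GROUPS[0])
    return [c0, c0] if share_l1 else [c0, c1]

def pin_current_thread(core: int) -> None:
    """Pin the calling thread to one core; on Linux, a thread id is
    accepted wherever a pid is expected."""
    os.sched_setaffinity(threading.get_native_id(), {core})
```

Each benchmark thread would call `pin_current_thread` on startup with the core chosen by `placement`, so the two runs differ only in L1-cache sharing.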
[Figure diagrams: corresponding thread-to-core mappings for the intra-application L2-cache and FSB experiments on Yorkfield and Harpertown, and the Figure 5 configurations for the inter-application L1-cache experiment — (a) each two-threaded benchmark on its own core, (b) the threads of both benchmarks spread across two cores that share an L2-cache.]
L1-cache: To measure the effect of inter-application contention for the L1-cache on performance, we run pairs of PARSEC benchmarks, each with two threads, in two configurations. In the baseline configuration, the two threads of each benchmark get exclusive L1-cache access. There is no inter-application contention for the L1-cache between them because the L1-cache is not shared with the co-runner's threads. The two threads of one benchmark are mapped onto one core, e.g., C0, and the two threads of the co-running benchmark are mapped onto the other core, e.g., C1 (shown in Figure 5(a)). In the contention configuration, the threads of both benchmarks share the L1-caches, and there is potential contention for the L1-cache among them. Here, two threads from both benchmarks are mapped onto the two cores that share the same L2-cache, e.g., C0 and C1 (shown in Figure 5(b)). As we measure contention only for the L1-caches, we keep the effect of L2-cache contention the same by using the same L2-cache, and we choose Yorkfield, which has one FSB, to make sure that contention for the FSB remains unchanged.
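As a sketch, the two inter-application configurations can be written down as affinity assignments; the dictionary keys and core numbers below are illustrative, following the C0/C1 example above:

```python
def inter_app_placement(contention: bool) -> dict:
    """Core assignments for two co-running two-threaded benchmarks on
    Yorkfield cores 0 and 1, which share an L2-cache.

    baseline  : each benchmark keeps one core to itself, so its
                L1-cache is never shared with the co-runner.
    contention: each benchmark places one thread on each core, so both
                L1-caches are shared between the applications while
                L2-cache and FSB usage stay the same.
    """
    if contention:
        return {"app1": [0, 1], "app2": [0, 1]}
    return {"app1": [0, 0], "app2": [1, 1]}
```

In practice, each thread would then be pinned with `sched_setaffinity`, or each benchmark launched under `taskset -c` with the listed cores.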
[Figure diagrams: the baseline and contention mappings of the two co-running benchmarks' threads (Application1 and Application2) on Yorkfield and Harpertown.]
[Figures 8 and 9 (captions truncated, "…tention"): bar charts over the PARSEC benchmarks (BS, BT, CN, DD, FA, FE, FL, RT, SC, SW, VP, X2), plus 0–100% stacked charts dividing each benchmark's results into Improvement and Degradation; values for off-scale bars (e.g., 39.0, 105.5, 108.1, 326.8) are printed numerically above them.]
…improvements of 2% and 0.15%, respectively, and do not show intra-application contention for the FSB.
TABLE I
PERFORMANCE SENSITIVITY OF THE PARSEC BENCHMARKS DUE TO CONTENTION IN THE MEMORY HIERARCHY RESOURCES

Benchmark     | L1-cache | L2-cache     | FSB
--------------+----------+--------------+-------------
blackscholes  | none     | none         | none
bodytrack     | inter    | inter        | intra
canneal       | intra    | inter        | intra
dedup         | inter    | intra, inter | intra, inter
facesim       | inter    | inter        | intra
ferret        | intra    | intra, inter | intra
fluidanimate  | inter    | inter        | intra
raytrace      | none     | none         | intra
streamcluster | inter    | inter        | intra
swaptions     | none     | none         | none
vips          | intra    | inter        | inter
x264          | inter    | intra, inter | intra
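The classifications of Table I can be captured as a lookup for downstream use (e.g., by a contention-aware scheduler); the encoding below is our own, not part of the paper:

```python
# Table I encoded as: benchmark -> resource -> set of contention kinds
# ("intra" = intra-application, "inter" = inter-application).
SENSITIVITY = {
    "blackscholes":  {"L1": set(),      "L2": set(),              "FSB": set()},
    "bodytrack":     {"L1": {"inter"},  "L2": {"inter"},          "FSB": {"intra"}},
    "canneal":       {"L1": {"intra"},  "L2": {"inter"},          "FSB": {"intra"}},
    "dedup":         {"L1": {"inter"},  "L2": {"intra", "inter"}, "FSB": {"intra", "inter"}},
    "facesim":       {"L1": {"inter"},  "L2": {"inter"},          "FSB": {"intra"}},
    "ferret":        {"L1": {"intra"},  "L2": {"intra", "inter"}, "FSB": {"intra"}},
    "fluidanimate":  {"L1": {"inter"},  "L2": {"inter"},          "FSB": {"intra"}},
    "raytrace":      {"L1": set(),      "L2": set(),              "FSB": {"intra"}},
    "streamcluster": {"L1": {"inter"},  "L2": {"inter"},          "FSB": {"intra"}},
    "swaptions":     {"L1": set(),      "L2": set(),              "FSB": set()},
    "vips":          {"L1": {"intra"},  "L2": {"inter"},          "FSB": {"inter"}},
    "x264":          {"L1": {"inter"},  "L2": {"intra", "inter"}, "FSB": {"intra"}},
}

def sensitive_resources(benchmark: str) -> list:
    """Resources in which a benchmark shows any contention sensitivity."""
    return [r for r, kinds in SENSITIVITY[benchmark].items() if kinds]
```

For example, `sensitive_resources("swaptions")` is empty, matching the table's observation that swaptions is insensitive to contention in all three resources.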