
2010 Second International Conference on Computer Engineering and Applications

Pthreads Performance Characteristics on Shared Cache CMP, Private Cache CMP and SMP

Ian K. T. Tan
Faculty of Information Technology
Multimedia University
Cyberjaya 63100 Malaysia
e-mail: ian@mmu.edu.my

Ian Chai
Faculty of Engineering
Multimedia University
Cyberjaya 63100 Malaysia
e-mail: ianchai@mmu.edu.my

Poo Kuan Hoong
Faculty of Information Technology
Multimedia University
Cyberjaya 63100 Malaysia
e-mail: khpoo@mmu.edu.my

Abstract: With the wide availability of chip multi-processing (CMP), software developers are now facing the task of effectively parallelizing their software code. Once they have identified the areas for parallelization, they need to know the level of code granularity required to ensure profitable execution. Furthermore, this problem multiplies itself with the different hardware available. In this paper, we present a novel approach for a fair comparison of hardware configurations, simulated by configuring a pair of quad-core processors. The simulated configurations represent shared cache CMP, private cache CMP and symmetrical multiprocessor (SMP) environments. We then present a modified lmbench micro-benchmark suite to measure the cost of threading on these different hardware configurations. In our empirical studies, we observe that shared cache CMP exhibits better performance when the operating system's load balancer is highly active. However, the measurements also indicate that thread size is an important consideration, as potential cache thrashing can occur when a cache is shared between processing cores. Private cache CMP and SMP do not exhibit a significant difference in our measurements. The techniques presented can be incorporated into integrated development environments, compilers and potentially even other run-time environments.

Keywords: pthreads, cache, CMP, lmbench

I. INTRODUCTION

The emergence of multi-core microprocessors, also known as Chip Multi-Processing (CMP) processors, was foreseen as far back as 1989 [1][2], when Hennessy very accurately pin-pointed that a single chip microprocessor [2] would be introduced in the year 2000. In the year 2000, IBM did introduce a microprocessor with two processing cores on a single die, the POWER4 microprocessor [3]. Newer introductions by chip manufacturers such as Intel and Tilera have this CMP feature [4][5][6]. In the coming years, CMP will be the prevalent approach, and even if an alternative architecture arises, it will take several years before it can be implemented [7]. In support of this claim, Borkar of Intel has also spoken about the challenges of a 1,000-core chip [8], which is a strong indication that CMP is the direction that will be taken for the next few years.
However, one of the major hindrances to reaping the benefits of these new CMP microprocessors is software applications. Software applications are generally designed and programmed using imperative programming languages that are sequential in nature and hence are not able to capitalize effectively on the availability of CMP processors [3][7][9][10][11]. The current pool of programmers is generally skilled in imperative languages and, to preserve this, incremental support for concurrency is suggested [9] instead of introducing a complete paradigm shift in the programmers' mindset. McGrath [7] and Knight [10] state that software is the key to the future of CMP, and Sutter and Larus [9] clearly state that there is a lack of programming tools, as software development currently depends greatly on programmers to create software that can benefit from the CMP architecture.
Intel, one of the largest chip manufacturers shipping CMP processors, understands this need to provide tools that assist in this paradigm shift and has introduced several tools, such as the Intel Threading Building Blocks, a C++ template-based library [11]. The library essentially provides ready-made multi-threaded algorithms for developers to use.
In addition to parallel algorithms, CMP raises the issue of fine-grained parallelism. With multiple processors on the same die, the latency cost of parallelization is reduced and hence a much finer grain of parallelism can be achieved. Programmers have no way of knowing at what level of granularity to code other than to empirically try different levels of granularity until parallel execution becomes profitable.
In this paper, we present an analysis of the differences in the cost of threading for shared and private cache CMP using the portable POSIX thread library (pthreads), with a comparison to symmetrical multi-processors (SMP). The configurations comprise processing cores with a shared cache, processing cores with private, independent caches, and symmetrical multi-processing. With the technique employed in this paper, software developers will have a tool to determine the level of granularity suitable for their application.
The rest of this paper is organized as follows. We present the background on our evaluation method, including related work, then empirical results obtained using the tools we developed, and finally we conclude.


II. AN OVERVIEW OF OUR APPROACH

For the simulation platform, we used a DELL PowerEdge 1900 with two Intel Quad Core E5335 processors, on which we are able to achieve the necessary configurations for a fair analysis of the shared cache CMP and private cache CMP as well as the SMP environment. We ran SuSE Linux Enterprise Server 10.2, which uses Linux kernel version 2.6.16.21-0.8-smp, and augmented the lmbench micro-benchmarking suite to conduct the measurements.

A. Hardware Configuration
Each Intel Quad Core E5335 processor is architected with 2 cores sharing a 4 MB L2 cache, as depicted in Figure 1.

Figure 1. Intel Quad Core E5335.

Using the Linux operating system, we are able to disable individual processing cores by setting the necessary values in the file /sys/devices/system/cpu/cpuz/online, where z is the core identity.
Figure 2 illustrates the shared cache configuration, obtained by disabling cores 1, 3, 4, 5, 6 and 7 (setting the value 0 in the respective online files); Figure 3 illustrates the private cache configuration, obtained by disabling cores 1, 2, 3, 5, 6 and 7; whilst Figure 4 illustrates the SMP configuration, obtained by disabling cores 2, 3, 4, 5, 6 and 7.

Figure 2. Shared Cache CMP.

Figure 3. Private Cache CMP.

Figure 4. Symmetrical Multi-Processing.

This novel approach of disabling processing cores achieves a fair comparison for an analysis of the shared cache CMP, private cache CMP and SMP configurations, with the exception that in the shared cache CMP configuration the total amount of cache available is 4 MB instead of the 8 MB available in the other two configurations.
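As an illustration of this mechanism (not part of the authors' tooling), the short C sketch below writes 0 or 1 to the sysfs online file of a given core. The helper name set_core_online() and the choice of cores are ours; the program must be run with root privileges, and core 0 typically cannot be taken offline on Linux.

#include <stdio.h>
#include <stdlib.h>

/* Write 1 or 0 to /sys/devices/system/cpu/cpu<core>/online.
 * Requires root; core 0 typically cannot be taken offline. */
static int set_core_online(int core, int online)
{
    char path[64];
    FILE *fp;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", core);
    fp = fopen(path, "w");
    if (fp == NULL) {
        perror(path);
        return -1;
    }
    fprintf(fp, "%d\n", online ? 1 : 0);
    fclose(fp);
    return 0;
}

int main(void)
{
    /* Shared cache CMP configuration of Figure 2: keep cores 0 and 2,
     * take cores 1, 3, 4, 5, 6 and 7 offline. */
    int disable[] = { 1, 3, 4, 5, 6, 7 };
    for (size_t i = 0; i < sizeof(disable) / sizeof(disable[0]); i++)
        set_core_online(disable[i], 0);
    return 0;
}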

B. lmbench Micro-benchmark Suite
The POSIX thread standard began to gain wide acceptance as far back as the early 1990s, and the performance of POSIX threads has long been of interest; in 1999, de Supinski and May [12] presented the performance of POSIX threading on several SMP systems.
With the evolution of faster CMP processors, performance measurement tools face a new challenge: measuring costs with microsecond and even nanosecond accuracy.
McVoy and Staelin [15][16][17] presented a comprehensive operating system benchmarking suite, which they called the lmbench micro-benchmarking suite. It has been widely used by hardware vendors such as Hewlett-Packard and Sun Microsystems. Their method of empirically obtaining accurate measurements is well accepted; as recently as 2007, Li et al. [14] used a method similar to lmbench's, benchmarking two processes communicating via two pipes, to measure operating system context-switching cost.
We augment the lmbench micro-benchmarking suite to provide an accurate measurement of the cost of threading using the POSIX thread (pthread) library for shared and private cache CMP, and compare the results to a symmetrical multiprocessor configuration.
The engine for our measurements is lmbench's timing harness. The micro-benchmarking suite has been improved over the years, especially after being critiqued by [18], and is now in version 3. The main contribution of version 3 over its predecessors is the capability to measure multi-processor scalability.
The timing harness of lmbench ensures that the error due to the timing window is minimized to less than 1%. It does this by ensuring that the timing duration is long enough; by default, it creates sufficient loop iterations to fill 5000 milliseconds. Since the coarsest timing resolution expected from the gettimeofday() clock is 10 milliseconds, this ensures a timing error of less than 1%. We note that in most modern implementations, through either gettimeofday() or clock_gettime(), it is possible to obtain timing at microsecond or even nanosecond resolution.
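For example, the monotonic clock on Linux reports nanosecond granularity; the fragment below (our illustration, not lmbench code) shows the usual pattern of differencing two clock_gettime() readings around the region of interest.

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... region of interest ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL +
                   (end.tv_nsec - start.tv_nsec);
    printf("elapsed: %lld ns\n", ns);
    return 0;
}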
The timing harness also allows for benchmark initialization and cleanup. The former allows timing to be made with various initialization variables or routines, so that the section of interest can be timed independently of its initialization; the latter allows for the release of any resources that must be freed manually.
Prior to invoking the benchmark routine, lmbench can execute a similar, optional routine whose sole purpose is to measure the overheads. A typical overhead is the looping overhead necessary to ensure that the timing interval is sufficiently large.
For both the overheads routine and the benchmarking routine, lmbench takes the median of 11 runs by default, although it also allows the number of runs to be specified by the user. The median is used, as opposed to the mean, to remove any skewed data that may arise, following [19].
We chose to augment lmbench as it is used by system vendors and by the Linux kernel developers as a means to measure the performance of their systems or code. In addition, lmbench provides an interface for us to add our own measurement routines. The following sections describe our performance measurement code, which utilizes the lmbench programming interface to the timing harness.
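We do not reproduce the lmbench harness interface here; instead, the stand-alone C sketch below illustrates the pattern our modules rely on, under our own simplified interface: time an overheads-only routine and the benchmark routine over the same number of iterations, subtract the two, and take the median of 11 runs. The routine names and the gettimeofday() example workload are ours.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define RUNS 11

typedef void (*routine_t)(long iterations, void *cookie);

static double elapsed_us(routine_t fn, long iterations, void *cookie)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(iterations, cookie);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
}

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Median cost per iteration of 'bench', with the cost of 'overhead'
 * (typically an empty loop) subtracted, in microseconds. */
static double measure(routine_t bench, routine_t overhead,
                      long iterations, void *cookie)
{
    double samples[RUNS];
    for (int r = 0; r < RUNS; r++)
        samples[r] = (elapsed_us(bench, iterations, cookie) -
                      elapsed_us(overhead, iterations, cookie)) / iterations;
    qsort(samples, RUNS, sizeof(double), cmp_double);
    return samples[RUNS / 2];
}

static void empty_loop(long iterations, void *cookie)
{
    (void)cookie;
    for (long i = 0; i < iterations; i++)
        __asm__ __volatile__ ("");  /* keep the empty loop from being optimized away */
}

static void bench_gettimeofday(long iterations, void *cookie)
{
    struct timeval tv;
    (void)cookie;
    for (long i = 0; i < iterations; i++)
        gettimeofday(&tv, NULL);
}

int main(void)
{
    printf("gettimeofday(): %.4f us per call\n",
           measure(bench_gettimeofday, empty_loop, 1000000, NULL));
    return 0;
}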
C. Thread Creation
The lmbench suite provides for measuring process creation using fork(): it can measure simple process creation, process creation followed by loading a new program (fork() and execve()), and process creation that spawns a shell to execute a command.
We developed a new measurement module to measure the cost of creating a POSIX thread. The lmbench timing harness determines the number of iterations sufficient to ensure that the timing accuracy is within 1%. Our thread creation module contains a loop which runs for the number of iterations determined by the timing harness; the body of the loop is the following code:
pthread_create(&thr, NULL, process_dummy, (void *)cookie);
pthread_join(thr, NULL);
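Put together, a stand-alone version of this measurement might look like the following sketch. It is our reconstruction rather than the actual module: process_dummy() is the trivial worker implied above, and the hard-coded iteration count stands in for the value normally chosen by the lmbench timing harness.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Trivial worker: the thread exits immediately, so the loop below
 * times thread creation plus join rather than any useful work. */
static void *process_dummy(void *cookie)
{
    (void)cookie;
    return NULL;
}

int main(void)
{
    const long iterations = 100000;  /* supplied by the timing harness in the real module */
    void *cookie = NULL;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iterations; i++) {
        pthread_t thr;
        if (pthread_create(&thr, NULL, process_dummy, cookie) != 0) {
            perror("pthread_create");
            exit(EXIT_FAILURE);
        }
        pthread_join(thr, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = ((t1.tv_sec - t0.tv_sec) * 1e6 +
                 (t1.tv_nsec - t0.tv_nsec) / 1e3) / iterations;
    printf("create+join: %.2f microseconds per thread\n", us);
    return 0;
}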
The method employed by de Supinski and May [12] measures the time to recursively create a large number of threads. It starts by setting a counter to zero in the main thread; each newly created thread then increments the counter until it equals a large predetermined number, and the elapsed time is divided by the number of iterations to obtain the measurement. Their method therefore includes the time it takes to increment and compare the counter (the loop), whilst our method measures the loop iterations as overheads and deducts them from our measurements, resulting in a better representation of the actual timing.
The method used by [13] is similar to that of [12] in that it also measures the looping constructs, although it does not measure the value comparison since the number of iterations is provided; their method [13] therefore also measures the looping overheads.
D. Synchronization
Measurement of synchronization has been a challenge; the work done by [12] and at Sun Microsystems [20] utilized the ping-pong method to effectively measure the cost of synchronization using condition variables. Their methods require the waiting and signalling processes (for the condition variables) to be bound to different CPUs; otherwise they would merely measure the time slice of the operating system.
Our main area of interest is the cost of threading for performance improvement in computation-intensive applications. In such a situation, synchronization is centered on shared variables, for either reading or writing. Our measurement method utilizes the ping-pong method, but with an improvement over [12] and [13] in that we are able to place the communicating threads onto different processing cores. In addition to disabling specific cores as described in Section II (A), we used the sched_setaffinity() function to place the communicating threads onto the respective processing cores.
The basic sequence of our measurement for synchronization is:
1) Calculate the overheads: the initial acquisition of locks, the time required for the threads to signal each other to start, and other overheads such as the empty loops used for iteration. Because many compilers will optimize empty loops away by removing them, we included a volatile inline assembly statement that does nothing, __asm__ __volatile__ ("");, in the loop body.
2) Execute the ping-pong method as per Figure 5. The ping-pong method creates 2 threads, where one thread locks mutexes and signals the other thread to unlock them. This process is interleaved in the reverse direction, and the total time taken is divided by the number of mutexes. The timing measured is the cost of locking and unlocking a single shared variable (the synchronization variable) between the 2 processing cores. Figure 5 illustrates the process for 4 mutexes.

Figure 5. Ping Pong method for 4 mutexes.
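For illustration, the sketch below shows the overall shape of such a ping-pong measurement with both threads pinned to specific cores via sched_setaffinity(). It is not the code behind Figure 5: to stay within defined POSIX behaviour it ping-pongs on a pair of semaphores rather than having one thread unlock mutexes locked by the other, and the core numbers, ROUNDS value and omission of the separate overheads run are our simplifications.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000L

static sem_t ping, pong;

/* Pin the calling thread to one core; on Linux, pid 0 means "this thread". */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

static void *side_a(void *arg)
{
    pin_to_core(*(int *)arg);
    for (long i = 0; i < ROUNDS; i++) {
        sem_post(&ping);   /* signal the thread on the other core */
        sem_wait(&pong);   /* wait for its reply */
    }
    return NULL;
}

static void *side_b(void *arg)
{
    pin_to_core(*(int *)arg);
    for (long i = 0; i < ROUNDS; i++) {
        sem_wait(&ping);
        sem_post(&pong);
    }
    return NULL;
}

int main(void)
{
    int core_a = 0, core_b = 2;   /* e.g. the two cores left online in Figure 2 */
    pthread_t ta, tb;
    struct timespec t0, t1;

    sem_init(&ping, 0, 0);
    sem_init(&pong, 0, 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, side_a, &core_a);
    pthread_create(&tb, NULL, side_b, &core_b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = ((t1.tv_sec - t0.tv_sec) * 1e6 +
                 (t1.tv_nsec - t0.tv_nsec) / 1e3) / ROUNDS;
    printf("%.3f microseconds per cross-core round trip\n", us);
    return 0;
}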


E. Context Switching
One of the most important tasks for a multi-tasking kernel scheduler is to context switch between tasks. A task can be either a process, which is a program in execution, or a thread, which has its own CPU state stored within a process; a CPU context is therefore either a process or a thread. A context switch occurs when a process or thread is switched out of a CPU at the end of its allocated time slice or when it is interrupted. During this event, the state of the previous context is saved and the state of the next context is loaded. This is an area of interest to measure because the system effectively does no useful work during a context switch.
Previous work on measuring pthreads performance takes context switching into consideration, because each new thread created will incur context switching overheads. Our interest is in the cost of threading in order to improve the performance of compute-intensive applications, and our counter-argument is that creating more threads keeps the effective number of context switches essentially constant.
Assuming that the threads are 100% computational and will not be pre-empted by other processes, the total execution time of all the threads (tn) will be equal to the execution time of the single-threaded version of the application (ta):

tn = ta                                  (1)

If every thread has the same priority and is assigned the same quantum for its time slices, then the total number of time slices for all the threads (sn) will be at most the number of time slices for the single-threaded version (sa) plus n, assuming that the last time slice quantum is not fully consumed:

sn ≤ sa + n                              (2)

The number of time slices directly determines the number of context switches. As the number of context switches is a few orders of magnitude larger than the number of threads n, the context switch counts of the threaded (cn) and single-threaded (ca) versions satisfy:

cn ≈ ca                                  (3)

However, context switching on CMP processors will affect performance if tasks are switched to other processing cores, since this requires caches to be invalidated.
F. Task Migration
Due to the differences in the cache layout of the various multiprocessing configurations, we introduce a new measurement technique to determine the difference in migrating tasks of various sizes from one processing core to another. Using the setups illustrated in Figures 2, 3 and 4, we measured the difference in timing between the private cache CMP, shared cache CMP and SMP configurations.
The pseudo code in Figure 6 illustrates our measurement. We also measure a control set where we set the core affinity to the same processing core and do not write a new set of values. We call this measurement the overheads and use it as a reference point for our measurements.
Create thread
    malloc() required memory space
    Fill up memory space using memset()
    Repeat x number of times:
        Switch core: sched_setaffinity()
        Simulate memory read: memcpy()
        Use memset() to write new values
    free() memory space
Leave thread
Time = total timing / x
Figure 6. Pseudo Code for Measurement of Task Migration.
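A runnable approximation of Figure 6 is sketched below; it is our reconstruction, not the authors' code. The task size, iteration count and core numbers are placeholders, and the read is simulated by touching every cache line through a volatile pointer rather than by the memcpy() with a NULL destination described in the following paragraph, which avoids undefined behaviour while still pulling the task's data into the new core's cache. The separate overheads run (same core, no read or write) is omitted for brevity.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TASK_SIZE  (256 * 1024)   /* working-set ("task") size; varied in the measurements */
#define ITERATIONS 10000
#define LINE       64             /* assumed cache line size */

static void switch_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* affects the calling thread only */
}

int main(void)
{
    volatile char *task = malloc(TASK_SIZE);
    struct timespec t0, t1;
    int cores[2] = { 0, 2 };      /* the two cores left online, e.g. Figure 2 */
    unsigned long sink = 0;

    memset((void *)task, 1, TASK_SIZE);        /* fill the memory space */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++) {
        switch_core(cores[i & 1]);             /* migrate to the other core */
        for (size_t off = 0; off < TASK_SIZE; off += LINE)
            sink += task[off];                 /* simulate the read: touch every line */
        memset((void *)task, i, TASK_SIZE);    /* write new values */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = ((t1.tv_sec - t0.tv_sec) * 1e6 +
                 (t1.tv_nsec - t0.tv_nsec) / 1e3) / ITERATIONS;
    printf("%.2f us per migration for a %d byte task (sink=%lu)\n",
           us, TASK_SIZE, sink);

    free((void *)task);
    return 0;
}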


We use memcpy() to simulate a read, where the source is the space that was allocated using malloc() and the destination is NULL. By reading the whole memory space, the cache of the processing core is loaded, and by writing to it, the cache of the previously running core is then invalidated.

III. RESULTS

The results were measured by creating 2 threads, each running on a different processing core. This is achieved by using the function sched_setaffinity().

A. Thread Creation
Using the augmented benchmark described in Section II (C) above, we compared the results of measuring thread creation against the lmbench measurement for fork(). Figure 7 illustrates that the cost of threading is significantly less than the cost of creating a new process.

Figure 7. Process and thread creation times.

The cache arrangement of the system does not affect the threading measurement. This is because, unlike creating a whole new process, thread creation does not need to make a copy of a whole new memory space; threads share the same memory space. Furthermore, on the Intel x86 based servers, a new process or thread is always initially assigned to the first processing core, processor 0. This measurement shows that the timing harness was exploited correctly, where the cost of thread creation is an order of magnitude less than the cost of process creation: 9.41 microseconds for thread creation as compared to 125.25 microseconds for process creation.

B. Synchronization
The synchronization mutex is a shared variable, and any acquisition of the mutex will cause a write to main memory. The timing is the median timing obtained from 11 runs (as described in Section II (B) of this paper), where the total number of mutexes in the ping-pong arrangement is 1,000,000.

Figure 8. Cost of synchronizing for a single mutex variable.

The timing was divided by 1,000,000 to reflect the cost of locking and unlocking a single mutex. The results yielded similar values for all three types of cache arrangement, as illustrated by Figure 8. As the shared cache CMP shows no advantage over the SMP and private cache CMP even after locking and unlocking 1,000,000 mutexes, it confirms that the locking and unlocking mechanism of POSIX threads is implemented without using intra-cache write buffers.

C. Task Migration Cost
To minimize transient processes from affecting the measurements, we conducted them in single-user mode with no network connectivity (/sbin/init 1). We applied the same technique to disable processing cores and ran the measurements for various task sizes.
Our measured results indicate that the cost of task migration is cheaper for the shared cache configuration when the task size is small. As the task size increases, cache contention becomes an issue.

Figure 9. Comparison of cost of task migration for various task sizes.


The measurements were made through 10,000 iterations, where in each iteration a task's specified memory space is read and re-written and the task is then migrated to another processing core. The overheads for the measurement are computed by executing the same process as the measurement process but without the reading and writing steps and with sched_setaffinity() set to the same processing core, that is, without task migration. The measured result is therefore the measured timing for the process, as illustrated in Figure 6, minus the overheads measurement.
As depicted in Figure 9, the shared cache CMP exhibits quicker task migration, which is due to the data being in the same cache, that is, the data is cache-hot. However, when the task size increases, the shared cache CMP is disadvantaged by cache thrashing.
IV. CONCLUSION

We have presented a novel method to fairly compare shared cache, private cache and symmetrical multi-processing platforms.
Using the timing harness of the industry-accepted lmbench micro-benchmarking suite, we developed our own measurement tools, which we used to analyze the characteristics of threading cost, and hence parallelization cost, for desktop microprocessors. It is shown that cache sharing between two cores can result in a lower total cost in situations where a significant amount of load balancing occurs, whilst the private cache arrangement does not exhibit any difference from an SMP arrangement. We note that the latter statement is limited to systems where the processors communicate using a bus architecture.
The method used in this study can be extended and repeated for a higher number of cores sharing the last level cache (LLC), such as the AMD Phenom microprocessors. This study is limited to synchronization using shared memory variables and to compute-intensive applications only. Further work will be conducted to empirically illustrate the accuracy of the measurement by using the tool to predict the level of parallel granularity and implementing it for several applications.

REFERENCES

[1] P. Gelsinger, P. Gargini, G. Parker, and A. Yu, "Microprocessors circa 2000," IEEE Spectrum, 26(10), 1989, pp. 43-47.
[2] I. Young, ISSCC 93 Evening Discussion Session, IEEE International Solid State Circuits Conference, San Francisco, California, United States, 1993.
[3] J. S. Gardner, "Multicore Everywhere: The x86 and Beyond," ExtremeTech, January 2006 [online]. Available: http://www.extremetech.com/article2/0,1697,1909158,00.asp, August 2007.
[4] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, "Larrabee: A Many-Core x86 Architecture for Visual Computing," ACM Trans. Graph., 27(3), Article 18, August 2008.
[5] T. P. Morgan, "Tilera pushes to 100 core with mesh processor," The Register, October 2009 [online]. Available: http://www.theregister.co.uk/2009/10/26/tilera_third_gen_mesh_chips/, October 2009.
[6] J. Markoff, "Intel Prototype May Herald A New Age of Processing," New York Times, 12 February 2007.
[7] D. McGrath, "Software Not Keeping Up With Multicore Advances: Panelist," EE Times, March 2006.
[8] S. Borkar, "Thousand-Core Chips: A Technology Perspective," Proc. 44th Design Automation Conference, San Diego, California, United States, 2007, pp. 746-749.
[9] H. Sutter and J. Larus, "Software and the Concurrency Revolution," ACM Queue, 3(7), 2005, pp. 54-62.
[10] W. Knight, "Two Heads Are Better Than One," IEE Review, 51(9), 2005, pp. 32-35.
[11] Intel Software Insight, September 2006.
[12] B. R. de Supinski and J. May, "Benchmarking Pthreads Performance," Proc. International Conference on Parallel and Distributed Processing Techniques, Las Vegas, Nevada, United States, 1999.
[13] B. Kothari and M. Claypool, "PThreads Performance," Technical Report WPI-CS-TR-99-11, Computer Science Department, Worcester Polytechnic Institute, January 1999.
[14] C. Li, C. Ding, and K. Shen, "Quantifying the Cost of Context Switch," Proc. 2007 Workshop on Experimental Computer Science, San Diego, California, United States, 2007, Article No. 2.
[15] L. McVoy and C. Staelin, "lmbench: Portable tools for performance analysis," Proc. of the 1996 USENIX Technical Conference, San Diego, California, United States, 1996, pp. 279-295.
[16] C. Staelin, "lmbench: an extensible micro-benchmark suite," Software: Practice and Experience, 35(11), 2005, pp. 1079-1105.
[17] C. Staelin, "lmbench3: measuring scalability," HP Laboratories Israel Technical Report HPL-2002-213, 2002.
[18] A. B. Brown and M. I. Seltzer, "Operating System Benchmarking in the Wake of Lmbench: A Case Study on the Performance of NetBSD on the Intel x86 Architecture," Proc. 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Seattle, Washington, United States, 1997, pp. 214-224.
[19] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation and Modeling, John Wiley & Sons, 2001.
[20] Multithreading in the Solaris Operating Environment - A Technical Whitepaper, Sun Microsystems Inc., 2001, pp. 25-33.

