Ian Chai
Faculty of Engineering
Multimedia University
Cyberjaya 63100 Malaysia
e-mail: ianchai@mmu.edu.my
I. INTRODUCTION
Authorized licensed use limited to: UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE. Downloaded on June 07,2010 at 11:12:29 UTC from IEEE Xplore. Restrictions apply.
II.
[Figure: block diagrams of cores 0 to 7 and their L2 caches; captions not recoverable.]
B. lmbench Micro-benchmark Suite
POSIX threads is a standard that began to gain wide
acceptance in the early 1990s. The performance of POSIX
threads has long been of interest; in 1999, de Supinski and
May [12] presented the performance of POSIX threading on
several SMP systems.
With the evolution of faster CMP processors,
performance measurement tools face a new challenge:
measuring performance with timing accuracy down to
microseconds and even nanoseconds.
McVoy and Staelin [15][16][17] presented a
comprehensive operating system benchmarking suite, which
they called the lmbench micro-benchmarking suite. It has
been widely used by hardware vendors such as
Hewlett-Packard and Sun Microsystems.
Their method for empirically obtaining accurate
measurements is well accepted. As recently as 2007, Li et
al. [14] used a method similar to that of lmbench, timing two
processes communicating via two pipes, to measure the
operating system's context switching cost.
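The two-pipe technique can be illustrated with a minimal sketch, assuming a POSIX system. It is not the code of lmbench or of Li et al. [14]; the function name `pipe_ping_pong` and its parameter are hypothetical. A parent and a child process pass a one-byte token back and forth, so each round trip forces each process to block and wake once.

```python
import os
import time


def pipe_ping_pong(round_trips=1000):
    """Return the mean time in seconds of one parent-child round trip.

    Hypothetical helper for illustration; requires a POSIX system (os.fork).
    """
    parent_to_child_r, parent_to_child_w = os.pipe()
    child_to_parent_r, child_to_parent_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: close the pipe ends it does not use, then echo each token.
        os.close(parent_to_child_w)
        os.close(child_to_parent_r)
        for _ in range(round_trips):
            token = os.read(parent_to_child_r, 1)
            os.write(child_to_parent_w, token)
        os._exit(0)

    # Parent: close the pipe ends it does not use, then time the exchange.
    os.close(parent_to_child_r)
    os.close(child_to_parent_w)
    start = time.perf_counter()
    for _ in range(round_trips):
        os.write(parent_to_child_w, b"x")
        os.read(child_to_parent_r, 1)
    elapsed = time.perf_counter() - start

    os.close(parent_to_child_w)
    os.close(child_to_parent_r)
    os.waitpid(pid, 0)
    return elapsed / round_trips
```

The measured value includes the cost of the pipe reads and writes, so it is an upper bound on the switching cost rather than the cost itself. On a multi-core machine the two processes may also run on different cores unless they are pinned to one.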
We will augment the lmbench micro-benchmarking suite
to provide an accurate measurement of the cost of threading
using the POSIX thread (pthread) libraries on shared-cache
and private-cache CMP configurations, and compare the
results to a symmetric multiprocessor configuration.
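As a rough illustration of what "cost of threading" means here, the sketch below times thread creation and joining. It is an assumption-laden stand-in, not the augmented lmbench benchmark: it uses Python's `threading` module rather than the pthread C API, and the function name `thread_create_join_cost` is hypothetical.

```python
import threading
import time


def thread_create_join_cost(n_threads=100):
    """Return the mean time in seconds to create, start, and join one thread.

    Hypothetical helper for illustration; the workers do no work, so the
    result approximates pure threading overhead in this runtime.
    """
    def worker():
        pass  # empty body: only the threading overhead is measured

    start = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    return elapsed / n_threads
```

The figure includes interpreter overhead, so it is only indicative of the kind of quantity being measured.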
The engine for our measurement comes from lmbench's
timing harness. The micro-benchmarking suite has been
improved over the years, especially after being critiqued
by [18], and is now in version 3. The main contribution of
version 3 over its predecessors is the capability to
measure multi-processor scalability.
The timing harness of lmbench keeps the error due to the
timing window below 1%. It does this by making the timed
interval long enough: by default, it runs enough loop
iterations to fill 5000 milliseconds. Since the worst known
timing accuracy of the gettimeofday() clock is 10
milliseconds, this ensures a timing error of less than 1%.
We note that in most modern implementations, either through
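The arithmetic behind the error bound is simple: a 10 millisecond clock resolution over a 5000 millisecond window gives 10 / 5000 = 0.2%, which is below 1%. The sketch below shows the idea under stated assumptions. It is not lmbench's harness: the function names are hypothetical, and the iteration count is simply doubled until the window is filled, which is one plausible strategy among several.

```python
import time


def relative_timing_error(clock_resolution_ms, window_ms):
    """Worst-case relative error from clock granularity over one timing window."""
    return clock_resolution_ms / window_ms


def time_per_iteration(func, min_window_s=5.0, clock=time.perf_counter):
    """Time func in a loop long enough to fill min_window_s.

    Hypothetical sketch: doubles the iteration count until the timed loop
    lasts at least min_window_s, then returns (seconds per call, iterations).
    """
    iterations = 1
    while True:
        start = clock()
        for _ in range(iterations):
            func()
        elapsed = clock() - start
        if elapsed >= min_window_s:
            return elapsed / iterations, iterations
        iterations *= 2
```

For example, `relative_timing_error(10, 5000)` evaluates the bound quoted above, and `time_per_iteration(lambda: None, min_window_s=0.01)` runs the loop with a much shorter window for a quick trial.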
A. Hardware Configuration
In each Intel Quad Core E5335 processor, each pair of cores
shares a 4 MB L2 cache, as depicted in Figure 1.
application (ta).
(1)
If every thread has the same priority and is assigned the
same quantum for its time slices, then the total number of
time slices for all the threads will be at most the total
number of time slices for the single-threaded version plus
n, assuming that the last time slice quantum is not fully
consumed.
(2)
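The bound can be checked numerically with a small sketch. The function names and the assumption that the work is split evenly across threads are illustrative, not taken from the paper. Each thread may leave its last quantum partly unused, which costs at most one extra time slice per thread.

```python
import math


def time_slices(work_ms, quantum_ms):
    """Number of time slices needed to run work_ms of computation."""
    return math.ceil(work_ms / quantum_ms)


def threaded_time_slices(work_ms, quantum_ms, n_threads):
    """Total time slices when work_ms is split evenly across n_threads.

    Assumption for illustration: every thread gets an equal share of the work.
    """
    per_thread_ms = work_ms / n_threads
    return n_threads * time_slices(per_thread_ms, quantum_ms)
```

With 1000 ms of work and a 100 ms quantum, the single-threaded version needs 10 slices. Split across 3 threads, each thread needs 4 slices, for 12 in total, which is within the bound of 10 + 3 = 13.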
The number of time slices directly determines the
number of context switches. Because the number of context
switches is several orders of magnitude larger than the
number of threads n, we have:
(3)
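The three relations described in this argument can be written compactly as follows. This is one possible reading of the prose, not the paper's own typesetting. The symbols t_n and t_a come from the text; s_a and s_n (number of time slices for the single-threaded and n-threaded versions) and c_a and c_n (the corresponding numbers of context switches) are assumptions introduced here for illustration.

```latex
\begin{align}
t_n &= t_a \tag{1} \\
s_n &\le s_a + n \tag{2} \\
c_n &\approx c_a \quad \text{since } c_a \gg n \tag{3}
\end{align}
```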
E. Context Switching
One of the most important tasks of a multi-tasking kernel
scheduler is to context switch between tasks. A task can be
either a process, which is a program in execution, or a
thread, which has its own CPU state stored within a process.
A CPU context is either a process or a thread. A context
switch occurs when a process or thread running on a CPU is
switched out at the end of its allocated time slice or when
it is interrupted. During a context switch, the state of the
previous context is saved and the state of the next context
is loaded. This is an area of interest to measure because
the system effectively does no useful work during this event.
Previous work on measuring Pthread performance takes
context switching into consideration, on the grounds that
each new thread created incurs context switching overhead.
Our interest is in the cost of threading in order to improve
the performance of compute-intensive applications, and our
counter-argument is that when more threads are created, the
effective number of context switches remains constant.
Assuming that the threads are 100% computational and
will not be pre-empted by other processes, the total
execution time of all the threads (tn) will be equal to the
execution time of the single threaded version of the
RESULTS
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
CONCLUSION
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]