
Data and Thread Affinity in OpenMP Programs

Christian Terboven, Dieter an Mey, Dirk Schmidl, Thomas Reichstein
Center for Computing and Communication, RWTH Aachen University, 52074 Aachen, Germany
terboven@rz.rwth-aachen.de, anmey@rz.rwth-aachen.de, schmidl@rz.rwth-aachen.de, reichstein@rz.rwth-aachen.de

Henry Jin
NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000
hjin@nas.nasa.gov

ABSTRACT
The slogan of last year's International Workshop on OpenMP was "A Practical Programming Model for the Multi-Core Era", although OpenMP is still fully hardware architecture agnostic. As a consequence, the programmer is left alone with bad performance if threads and data happen to live apart. In this work we examine the programmer's possibilities to improve data and thread affinity in OpenMP programs for several toy applications and present how to apply the lessons learned to larger application codes. We also filled a gap by implementing explicit data migration on Linux, providing a next touch mechanism.

Categories and Subject Descriptors


D.1.3 [Concurrent Programming]: Parallel programming

General Terms
Performance

Keywords
OpenMP, ccNUMA, Affinity, Binding, Migration

1. INTRODUCTION

OpenMP is an Application Programming Interface (API) for a portable, scalable programming model for developing shared-memory parallel applications in Fortran, C and C++. So far, OpenMP has predominantly been employed on large shared memory machines. With the growing number of cores on all kinds of processor chips and with additional OpenMP implementations (e.g. GNU and Visual Studio), OpenMP is available for use by a rapidly growing community. Upcoming multicore architectures make the playground for OpenMP programs even more diverse. The memory hierarchy will grow, with more caches on the processor chips. Whereas applying OpenMP on machines with a flat memory is straightforward in many cases, there are quite some pitfalls when using OpenMP on ccNUMA and multicore architectures. The increasing diversity of multicore processor architectures further introduces more aspects to be considered in order to obtain good scalability.

OpenMP as a programming model has no notion of the hardware a program is running on. Approaches to improve the support for controlling the affinity between data and threads within upcoming OpenMP specifications are still under discussion, but they have not been included in the OpenMP 3.0 draft [1]. Operating system calls and compiler-dependent environment variables to pin threads to processor cores and to control page allocation have to be employed to improve the scalability of OpenMP applications on ccNUMA and multicore architectures.

This paper is organized as follows: In Section 2 we take a look at affinity on three major platforms, namely Linux, Windows and Solaris, and describe our implementation of data migration on Linux. Performance experiments with some toy applications, described in Section 3, reveal the implications of ccNUMA effects when affinity is not carefully taken into account. In Section 4 we take a look at performance experiences with application packages, which stresses the importance of affinity for production scenarios. Finally we conclude with a short summary of our findings and an outlook on the remaining open issues.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MAW'08, May 5, 2008, Ischia, Italy. Copyright 2008 ACM 978-1-60558-091-3/08/05 ...$5.00.

2. OPERATING SYSTEM SUPPORT

Today, most operating systems already have smart algorithms to allocate threads and data on ccNUMA architectures. They model the structure of the underlying hardware in order to describe the distance between processors and different blocks of memory. Nevertheless, there is no standardized terminology for groups of processors which have the same distance to all memory blocks, or which form an equivalence class in that respect. A term frequently used in the context of Linux and Windows is numa node, describing a region of memory in which every byte has the same distance from each CPU [2]. In the context of Solaris, the term locality group or lgroup [3] is defined as a subset of a machine in which all components can access one another within a bounded latency interval. Throughout this paper we will use the first term, which is more popular.

In order to explicitly optimize an application for a given multicore and/or ccNUMA architecture, two aspects are relevant:

Memory placement: The nearness of threads to their data is crucial for performance. This is determined by the operating system's placement strategies. Neither OpenMP nor the programming languages offer any explicit support for influencing memory placement.

Thread binding: The operating system might decide to move a thread away from its initial location. This may cause a performance penalty, as the data in the cache(s) as well as memory locality is lost.

2.1 Thread binding

To avoid the operating system moving threads around, it is possible to bind a thread to a given core or to a given subset of a system's cores. Solaris offers the system call processor_bind() to bind one thread or a set of threads to a specified core. The Linux system call sched_setaffinity() and the SetThreadAffinityMask() function on Windows both bind a specified thread by means of a bit mask representing the cores allowed for execution. The Sun and Intel compilers allow influencing thread binding without explicit system calls via an environment variable: the variable SUNW_MP_PROCBIND is respected by the Sun compilers on Solaris and Linux, and the variable KMP_AFFINITY is used by the Intel compilers on Linux and Windows, although only on Intel architectures. The third approach to enforce thread binding is via operating system specific commands, e.g. by using the taskset command or the numactl tools on Linux, or by invoking a program with start /affinity on Windows. Using those tools, a process can be restricted to a given subset of processor cores, and any thread created by this process will obey these restrictions. In summary, there are several well-suited solutions to control thread binding. Nevertheless, it is desirable to have a consistent and portable mechanism for all operating systems, which is not yet provided by OpenMP.
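For illustration, the following minimal sketch (our own example, assuming a Linux system with a GNU-compatible OpenMP compiler, not code from this paper) binds each OpenMP thread to one core with sched_setaffinity(); the identity mapping of thread numbers to core ids is purely illustrative.

#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(omp_get_thread_num(), &mask);   // thread i -> core i (illustrative mapping)
        // pid 0 means "the calling thread" for sched_setaffinity on Linux
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
            std::perror("sched_setaffinity");
    }
    return 0;
}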

2.2 Memory placement

Solaris, Linux and Windows employ the so-called first touch strategy by default: a page is placed in the memory next to the processor from which the first access to the page occurs. This strategy can be exploited in an application by initializing the data in parallel in the same pattern as it is accessed during the computation. Of course this strategy is only successful if the threads stay on their initial cores and if the access pattern remains constant.
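The following sketch (a generic illustration, not code from this paper) shows the first touch idiom: the arrays are allocated without touching them, initialized in parallel with a static schedule, and later accessed with the identical schedule, so every page stays on the numa node of the thread that uses it.

#include <cstdlib>   // std::malloc, std::free

int main() {
    const long n = 50000000;   // illustrative size
    // Raw allocation does not touch the pages yet; placement happens at first access.
    // (std::vector or std::valarray would zero-initialize serially, see Section 3.2.)
    double *a = static_cast<double *>(std::malloc(n * sizeof(double)));
    double *b = static_cast<double *>(std::malloc(n * sizeof(double)));

    // First touch: initialize with the same static schedule as the compute loop.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }

    // Compute phase (daxpy-like): the identical partitioning yields local accesses only.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] += 2.0 * b[i];

    std::free(a);
    std::free(b);
    return 0;
}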

Solaris: Solaris offers madvise() via the MPO (memory placement optimization) facility, taking an address range and an access pattern advice as arguments. The advice MADV_ACCESS_LWP instructs the kernel that the next thread to touch the specified address range will access it most heavily, and to move the pages to the accessing thread's core or numa node, respectively. This strategy is called the next touch strategy and is important for applications with dynamic data access patterns. Besides first touch and next touch, Solaris also allows for random placement, which can improve the performance of applications with unstructured memory access patterns.

Linux: With kernel version 2.6.18, the additional system call sys_move_pages() has been made available to offer page migration facilities. Using this function, one can manually move pages from one processor to another, but the application is still required to know which set of pages has to be moved to which processor. Below we describe our implementation of the next touch strategy on Linux. In addition to that system call, the numactl tool can be used to allow memory allocation only from a specified set of numa nodes, or to set a memory interleave policy. Memory will then be allocated using a round-robin strategy, which can improve the performance of applications with random memory access patterns.

Windows: Although the documentation states "The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used", we found that this is not always the case when the system gets loaded. The current ccNUMA support on Windows is limited to thread binding, as explained above, and explicit memory allocation from a given numa node.
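As a usage sketch of the Solaris interface (platform-specific and hedged: the call shape follows the description above, everything else is our own illustration), an address range can be marked for next touch and then touched by the worker threads so that its pages follow them:

#include <sys/types.h>   // caddr_t
#include <sys/mman.h>    // madvise, MADV_ACCESS_LWP (Solaris MPO)
#include <cstddef>

// Ask the kernel to migrate the pages of [data, data + n) to whichever
// thread touches them next (error handling omitted in this sketch).
void mark_for_next_touch(double *data, std::size_t n) {
    madvise(reinterpret_cast<caddr_t>(data), n * sizeof(double), MADV_ACCESS_LWP);
}

// The "next touch" itself: each thread accesses its part of the array once,
// e.g. as the first compute iteration, so the pages move to its numa node.
void touch_in_parallel(double *data, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] += 0.0;
}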

2.2.1 Implementing next touch on Linux

Using the sys_move_pages() function to migrate data is useful in many cases, but it lacks support for the next touch strategy described above. We implemented that functionality on top of this function, using the following approach to achieve the same behavior as on Solaris. Our function, named next_touch(), is called with an address range as argument. Inside our function, mprotect() is called to protect the pages, so that subsequent reads and writes cause a segmentation fault. We provide a new signal handler such that, when the data is accessed by the program, our handler catches the SEGFAULT, moves the specific page with the sys_move_pages() system call to the processor which requested it, and then removes the page protection. Using this technique, we provide the same functionality on Linux as is available on Solaris. We are aware that this approach does not provide optimal performance, but as long as the Linux kernel does not provide equivalent functionality itself in a more efficient way, we consider this approach the easiest way for the programmer to exploit scalability on Linux ccNUMA architectures with dynamic memory access patterns. The overhead of our implementation is compared to the mechanism provided by Solaris in Section 3.4.

Our intent was to provide a library approach that is portable between different Linux distributions (so we were prohibited from changing the kernel) and that supports general multi-threading, not only OpenMP. This led to some overhead which could have been avoided by dropping selected requirements. For example, we have to parse the /proc filesystem every time to query on which core the accessing thread is running, which is significantly more expensive than assuming an invariable distribution of the threads. If the same functionality had been implemented in kernel mode, the required information would have been available at negligible cost by querying the scheduler.
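A condensed sketch of this mechanism is shown below. It is not our actual library code: for brevity it determines the accessing thread's numa node with sched_getcpu() and libnuma's numa_node_of_cpu() instead of parsing /proc, it installs a single process-wide handler, and it ignores alignment, error handling and signal-safety issues (compile with -lnuma).

#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sched.h>       // sched_getcpu
#include <numa.h>        // numa_node_of_cpu
#include <numaif.h>      // move_pages, MPOL_MF_MOVE
#include <cstdint>
#include <cstddef>

static long page_size;

static void segv_handler(int, siginfo_t *info, void *) {
    // Page that the faulting access hit.
    void *page = reinterpret_cast<void *>(
        reinterpret_cast<std::uintptr_t>(info->si_addr) & ~(std::uintptr_t)(page_size - 1));
    // Node of the core the accessing thread is currently running on.
    int node = numa_node_of_cpu(sched_getcpu());
    int status = 0;
    // Migrate this single page to that node (pid 0 = calling process).
    move_pages(0, 1, &page, &node, &status, MPOL_MF_MOVE);
    // Restore access so the interrupted read or write can be restarted.
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

// Mark the page-aligned range [addr, addr + len) for next-touch migration.
void next_touch(void *addr, std::size_t len) {
    page_size = sysconf(_SC_PAGESIZE);
    struct sigaction sa = {};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
    // Revoke access: the next read or write from any thread faults into the handler.
    mprotect(addr, len, PROT_NONE);
}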

3. KERNELS

3.1 Stream

McCalpin's Stream kernel [4] is very popular for measuring a machine's memory bandwidth. Here we use the OpenMP version to study ccNUMA effects on a quad-socket single-core Opteron 848 (2.2 GHz) based Sun Fire V40z machine running OpenSolaris. Table 1 shows the achievable memory bandwidth of the OpenMP Stream benchmark (daxpy) for different initialization strategies. If the arrays are initialized by the initial thread, the data resides only on the master's numa node, which prohibits any speedup. If each thread initializes the chunks of the arrays which it later uses during the computation (first touch), the speedup is almost perfect. The same performance can be achieved by employing the next touch strategy with the Solaris madvise() system call, that is, by re-distributing the pages that were initialized by the master. The one-time overhead for the migration is negligible.

3.2 Stream with the std::valarray STL container

All elements of a variable of type std::valarray are guaranteed to be initialized with zero. This leads to bad data placement on ccNUMA architectures, as the initialization with zero touches the data and thereby leads to page placement by the operating system. A typical solution to this problem is a parallel initialization with the same memory access pattern as in the computation. With std::valarray, however, the initialization is done inside the data type implementation, and a later parallel access does not lead to page redistribution. The same problem arises for std::vector. We considered several approaches to utilize ccNUMA architectures with these data types; a sketch of the second approach follows this list.

1. A modification of std::valarray or std::vector is possible so that the initialization is done in parallel, respecting the first touch policy. The problem with this approach is that it is tied to a given compiler, as typically every compiler provides its own STL implementation. Therefore this approach is not portable.

2. If std::vector is used instead of std::valarray, a custom allocator can be specified. We implemented an allocator that uses malloc() and free() for memory allocation and initializes the memory with zero in a loop parallelized with OpenMP, whose schedule and chunksize parameters are specified as template parameters. From a C++ programmer's perspective we found this approach to be the most elegant one.

3. The third approach is to use the page migration functionality provided by the operating system, e.g. madvise() on Solaris or sys_move_pages() on Linux. Again this approach is not portable, as this functionality is not available on all current operating systems.
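A minimal sketch of the second approach is given below. It fixes a static schedule instead of the template-parameter schedule and chunksize described above, is meant for trivial numeric element types, and is an illustration rather than our actual allocator.

#include <cstdlib>
#include <vector>

template <typename T>
struct FirstTouchAllocator {
    typedef T value_type;
    FirstTouchAllocator() {}
    template <typename U> FirstTouchAllocator(const FirstTouchAllocator<U> &) {}

    T *allocate(std::size_t n) {
        T *p = static_cast<T *>(std::malloc(n * sizeof(T)));
        // The first touch happens here, in parallel, with the compute loop's schedule,
        // so the pages are distributed before the container zero-initializes serially.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < static_cast<long>(n); ++i)
            p[i] = T();      // fine for trivial numeric types
        return p;
    }
    void deallocate(T *p, std::size_t) { std::free(p); }
};
template <typename T, typename U>
bool operator==(const FirstTouchAllocator<T> &, const FirstTouchAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const FirstTouchAllocator<T> &, const FirstTouchAllocator<U> &) { return false; }

int main() {
    // The pages of x are already placed by the parallel loop in allocate(); the
    // subsequent serial zeroing by the vector constructor does not move them.
    std::vector<double, FirstTouchAllocator<double> > x(100000000);
    return 0;
}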

3.3 Sparse Matrix-Vector Multiplication

A sparse matrix-vector multiplication (SMXV) is typically the most time consuming part of iterative solvers. We examined ccNUMA effects with the SMXV benchmark kernel of DROPS [5], a 3D CFD package for simulating two-phase flows, using a matrix of some 300 MB with about 19,600,000 nonzero values. Two parallelization strategies are compared: in the rows-strategy the parallel loop runs over the number of rows and a dynamic loop schedule is used for load balancing, while in the nonzeros-strategy the number of nonzeros is statically partitioned into blocks of approximately equal size, one block for each thread. The performance is shown for two different architectures: a quad-socket single-core Opteron 848 (2.2 GHz) based Sun Fire V40z machine (ccNUMA) and a dual-socket dual-core Xeon 5160 (3.0 GHz) based Dell PowerEdge 1950 machine (UMA).

                     rows                     nonzeros
            1 thread    4 threads     1 thread    4 threads
UMA            561.9        960          561.5        978.1
ccNUMA         326.3        793.9        324.5       1147.6

Table 2: Performance [MFLOP/s] of SMXV.

Table 2 shows that the nonzeros-strategy clearly outperforms the rows-strategy on the ccNUMA architecture. While the dynamic loop scheduling in the rows-strategy successfully provides good load balance, the memory locality is not optimal. The nonzeros-strategy shows a negligible load imbalance for the given dataset, but its advantage is that each thread works on local data. Both strategies perform about the same on the UMA architecture for the given test case.
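Both strategies can be sketched as follows for a matrix in CSR format (illustrative data structures, names and chunk size; not the DROPS code). The nonzeros-strategy assumes that a partition of the rows with approximately equal nonzero counts per thread has been computed beforehand.

#include <vector>
#include <omp.h>

struct CsrMatrix {
    std::vector<int>    row_ptr;   // size nrows + 1
    std::vector<int>    col;       // column index per nonzero
    std::vector<double> val;       // value per nonzero
};

// rows-strategy: parallel loop over the rows, dynamic schedule for load balancing.
void spmv_rows(const CsrMatrix &A, const std::vector<double> &x, std::vector<double> &y) {
    const int nrows = static_cast<int>(A.row_ptr.size()) - 1;
    #pragma omp parallel for schedule(dynamic, 64)
    for (int r = 0; r < nrows; ++r) {
        double sum = 0.0;
        for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            sum += A.val[k] * x[A.col[k]];
        y[r] = sum;
    }
}

// nonzeros-strategy: one static block of rows per thread, chosen such that each block
// contains approximately the same number of nonzeros (first_row has nthreads + 1
// entries and comes from that precomputed partition).
void spmv_nonzeros(const CsrMatrix &A, const std::vector<double> &x,
                   std::vector<double> &y, const std::vector<int> &first_row) {
    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        for (int r = first_row[t]; r < first_row[t + 1]; ++r) {
            double sum = 0.0;
            for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
                sum += A.val[k] * x[A.col[k]];
            y[r] = sum;
        }
    }
}

Combined with a first-touch initialization of row_ptr, col and val in the same static partitioning, the nonzeros-strategy keeps each thread's matrix block on its local numa node, which is consistent with the ccNUMA results in Table 2.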

3.4 Jacobi Solver

We studied the effect and the overhead of migration using a modified version of the simple Jacobi solver taken from the OpenMP ARB web site, on a quad-socket dual-core Opteron 875 (2.2 GHz) based Sun Fire V40z machine running OpenSolaris. The matrix size is 5000x5000, which leads to a memory footprint of the whole application of about 600 MB. When starting the program with 8 threads without binding and with the default first touch memory placement strategy, all data is allocated on the master thread's numa node, as the initialization is done in a serial program part. The execution time per iteration was between 0.422 and 0.424 seconds, corresponding to about 768 MFLOP/s. Then we explicitly bound all threads to consecutive processor ids. Still the execution time was between 0.423 and 0.424 seconds per iteration, with no noticeable overhead of the binding itself. After some more iterations we migrated all data with the Solaris MPO madvise() system call using the next touch mechanism. The timing of the following iteration included the overhead of the migration activities and took 0.629 seconds, which also includes some additional inquiry functions to observe the memory placement activities. After this iteration all three data arrays are equally distributed among the four numa nodes. Now that the placement is nearly optimal, the following iterations only took about 0.094 seconds, corresponding to 3.457 GFLOP/s, so the performance increased by a factor of 4.5 and the migration paid off already after one iteration.


#threads   data initialized by master   first touch strategy   next touch strategy
1          3266 MB/s                    3270 MB/s              3246 MB/s
2          3330 MB/s                    6472 MB/s              6377 MB/s
4          3135 MB/s                    12397 MB/s             12277 MB/s

Table 1: STREAM (daxpy) memory bandwidth with different memory placement strategies.

We also timed the overhead of migrating the necessary 40000 pages of 4 KB each. It took about 0.17 seconds, corresponding to a rate of 3.4 GB/s, which is about half of the peak bandwidth of the Opteron processor's HyperTransport links. Our next touch implementation on Linux is clearly slower: on a quad-socket single-core Opteron 848 based Sun Fire V40z running Scientific Linux 5.0, the execution time drops by a factor of 3.55, from 0.48 seconds to 0.135 seconds, through migration, but the overhead of the migration is about 9 seconds, so it pays off only after 26 iterations. Calling the sys_move_pages() function directly, we measured the overhead of just the page migration to be 1.05 seconds; the rest is caused by our mechanism invoking the SEGFAULT handler and by other overhead such as querying the /proc filesystem.

3.5 Modified Stream with Nested OpenMP

In order to investigate ccNUMA effects in the context of twofold nested OpenMP parallelization, we modified the daxpy kernel of the Stream benchmark. Basically, a new driver program contains a parallel region, and each thread of this outer team executes a Stream benchmark synchronously on private data arrays. We experimented on the quad-socket dual-core Opteron 875 based Sun Fire V40z running OpenSolaris and tailored our experiment to the target machine by starting the outer parallel region with 4 threads and all inner parallel regions with 2 threads. We varied the way the data arrays are initialized and the sequence in which the threads are forked for the first time, and we tested the new thread affinity flag (SUNW_MP_THR_AFFINITY) of the latest Sun Studio compiler suite (version 12). Setting the environment variable SUNW_MP_PROCBIND to TRUE causes all threads to be bound to all processor cores in ascending order. Solaris enumerates the cores such that cores 0 and 1, 2 and 3, 4 and 5, and 6 and 7 belong to the locality groups (numa nodes) 0, 1, 2, and 3, respectively.

The implementation of nested OpenMP parallelization in the Sun Studio compilers uses a pool of threads (like other compilers do as well). Because these threads are dynamically assigned whenever an inner team is forked, they easily lose their data affinity. With the new affinity flag, the mapping between the threads of the pool and the members of the inner teams is maintained. Solaris typically places all threads of the outer team close to each other with respect to the machine's memory architecture, such that they can access shared data efficiently. When the inner teams' slave threads are created later on, they will most likely be located apart from their masters, causing unfavorable memory accesses for all data which is shared with respect to the inner parallel regions. This can be resolved by a little trick: the threads of the inner teams can be bound to the two processor cores of a processor chip in pairs, with a combination of this new affinity flag and a simple set-up loop to fork all threads for the first time:

do i = 1, 4
!$omp parallel num_threads(i)
!$omp parallel num_threads(2)
      continue
!$omp end parallel
!$omp end parallel
end do

Table 3 shows how efficiently this technique can be applied. For comparison, we measured the memory bandwidth obtained with an outer team of only one thread and an inner team with 8 threads (test cases 1 and 2). If the initial thread initializes all data, the bandwidth is about 4.8 times lower than if all threads of the inner team initialize their data. For the other test cases (3 to 10) we employ 4 threads in the outer team and 2 threads in all inner teams. As long as all inner threads initialize their data themselves (test cases 8, 9, 10), the optimal accumulated memory bandwidth can be achieved: between 9.7 and 11.6 GB/s. If the initial thread initializes all data (test cases 3, 4), the performance is as poor as in the first case. If the threads of the outer team, which then become the masters of the inner teams, do the initialization (test cases 5, 6, 7), the optimal bandwidth can only be obtained if the affinity support is enabled and the threads are properly bound to the two processor cores of a numa node in pairs by the set-up loop listed above.

4. APPLICATIONS

4.1 ThermoFlow60

The OpenMP parallelization of the Finite Element code ThermoFlow60, which solves the Navier-Stokes equations to simulate the heat distribution in a rocket combustion chamber [6], led to a speed-up of over 40 with 68 threads on a 72-way Sun Fire 15K server. Equipped with 18 boards, each carrying 4 UltraSPARC processor chips and physically local memory, this machine has a ccNUMA architecture, which was not respected by the Solaris 8 operating system, so data was randomly distributed across the whole machine. After upgrading to Solaris 9 we were caught by surprise: with up to 4 threads scheduled within one board, the program ran about 30 percent faster than previously, because the threads were able to access all data locally. But with a higher thread count, the performance dropped considerably (by up to a factor of 2.6 with 68 threads). As the initialization of the main data areas took place during the initial sequential phase of the program, all data was allocated close to the master thread by the first touch policy introduced with Solaris 9. During the compute phase all threads were accessing memory located on the master thread's board, which led to a severe bottleneck. After carefully initializing all data areas in additional parallel regions, the original performance could be recaptured.


Test case   #threads (outer x inner)   First touch          Affinity     min MB/s    max MB/s
1           1x8                        Initial thread       n.a.         2408        2445
2           1x8                        All inner threads    n.a.         11666       11866
3           4x2                        Initial thread       no           4 x 556     4 x 607
4           4x2                        Initial thread       yes          4 x 570     4 x 610
5           4x2                        Inner master         no           4 x 988     4 x 1325
6           4x2                        Inner master         yes          4 x 1196    4 x 1312
7           4x2                        Inner master         yes + sort   4 x 2696    4 x 2857
8           4x2                        All inner threads    no           4 x 2400    4 x 2838
9           4x2                        All inner threads    yes          4 x 2660    4 x 2896
10          4x2                        All inner threads    yes + sort   4 x 2843    4 x 2887

Table 3: Stream (daxpy) with nested OpenMP, different initialization strategies.

4.2 FLOWer

The obvious approach of using MPI on a higher level and OpenMP underneath, requiring only the lowest level of multi-threading support from the MPI implementation, offers interesting opportunities on SMP clusters with fat nodes. If a hybrid application suffers from load imbalances on the MPI level, the number of threads can be increased in order to speed up busy MPI processes, or decreased to slow down idle MPI processes, provided these processes reside on the same SMP node. It may not always be easy to find the optimal assignment of threads to the MPI processes, and the optimal distribution may change during the runtime of an application. Therefore, we developed a dynamic thread balancing (DTB) library which performs the thread adjustment automatically.

Sponsored by the German Research Council (DFG), scientists of the Laboratory of Mechanics of RWTH Aachen University simulated a small scale prototype of a space launch vehicle designed to take off horizontally and glide back to earth after placing its cargo in orbit. The corresponding Navier-Stokes equations were solved with FLOWer, a flow solver developed at the German Aerospace Center (DLR). The performance of the hybrid-parallel FLOWer simulation was improved with DTB on a Sun Fire 15K [7]. Because of a severe load imbalance and because of a limited number of blocks, the number of MPI tasks was limited, whereas loop-level parallelization added to the scalability of the code. Varying the number of threads with the DTB library resulted in substantial improvements.

The DTB mechanism is sensitive to ccNUMA effects: if the number of threads running within the same processor board (ccNUMA node) is increased for a busy MPI process, an additional thread might be started on a different board, thus slowing down the whole MPI process, because this additional thread has to access all its data remotely. As FLOWer uses a static mesh of grid points, our strategy was to turn off the first touch mechanism and use random placement instead for the first phase of the computation. Once the distribution of threads had stabilized, we employed Solaris MPO's next touch feature to migrate all data to the very board where they were accessed the next time. Figure 1 demonstrates the success of this strategy. In [7], 23 MPI processes were started with 2 threads each, and the number of L2 cache misses satisfied by local and remote memory accesses was measured with hardware performance counters, which nicely reflected the success of the migration: the number of local misses went up from less than 5 million to well over 100 million misses per second, and the number of remote misses dropped from over 100 million to some 20 million misses per second. The accumulated number of misses increased, because more data was required for a faster computation after the migration took place. Once the memory was mainly accessed locally, the speed increased by about 20 percent, which nicely corresponds to the observations in [8] on a similar architecture.

4.3 TFS

The multi-block Navier-Stokes solver TFS, written in Fortran 90, is developed by the Institute of Aerodynamics of RWTH Aachen University and used to simulate the human nasal flow. OpenMP is employed on two levels, on the block level and on the loop level. This application puts a high burden on the memory system and thus is quite sensitive to ccNUMA effects [9]. In order to improve memory locality, threads were bound to processors, and pages were also migrated to where they are used (next touch mechanism) with the Solaris madvise() system call after a warm-up phase of the program. Surprisingly, this was only profitable when applied to a single level of parallelism; applying these techniques to the nested parallel version was not profitable at all. As described in Section 3.5, the implementation of nested OpenMP parallelization in the Sun Studio compilers employs a pool of threads, and because these threads are dynamically assigned whenever an inner team is forked, they frequently lose their data affinity. Employing the affinity features of the latest Sun Studio compiler improved the thread affinity by maintaining the mapping between the threads of the pool and the members of the inner teams, and thus improved scalability by about 25 percent. The combination of thread affinity, processor binding and explicit data migration finally led to a speedup of 25 for 64 threads on the Sun Fire E25K, a satisfying result for this code taking Amdahl's law into account.

4.4 NPB BT-MZ

The NPB-MZ multi-zone application benchmarks were derived from the well-known NAS Parallel Benchmarks (NPB) suite and involve flow solvers on collections of loosely coupled discretization meshes. The BT-MZ code was designed to have uneven-sized zones, which allows various load balancing strategies to be evaluated. The nested OpenMP implementation is based on a two-level approach: the outer level exploits coarse-grained parallelism between zones, while on the inner level loop parallelism within each zone is employed, with num_threads clauses specifying the number of threads for each inner parallel region. Load balancing in BT-MZ is based on a bin-packing algorithm with an additional adjustment of the number of OpenMP threads, as described in [10]. Multiple zones are clustered into zone groups among which the computational workload is evenly distributed, and each zone group is statically assigned to one thread of the outer parallel region.

Comparing the performance of NPB BT-MZ on the SGI Altix 4700 and the Sun Fire E25K in [11] reveals the importance of suitable affinity support by the underlying operating environment. Figure 2 plots the runtime and the speedup relative to the single-CPU timing on each system for a wide range of combinations of thread counts on the outer and on the inner level. Because of the slow clock speed of the Sun Fire system, its single-CPU performance is only about 1/4 of the Altix. However, at large CPU counts the Sun Fire shows its strength for nested OpenMP and outperforms the Altix in many cases. In fact, the best nested OpenMP speedup on the Sun Fire is 68 (17.49 seconds), with 16 threads on the outer and 8 threads on the inner level, which interestingly matches closely with the hybrid result on the Altix. The best nested OpenMP result on the Altix is only 15.6 (19.47 seconds), with 32 times 2 threads. The improved performance of nested OpenMP on the Sun Fire is due to the support of thread affinity with thread reuse by the Sun Studio compiler, which is currently missing in the Intel compiler on the Altix.

The initial implementation of the nested OpenMP version of NPB BT-MZ respects the first touch mechanism of the Linux and Solaris operating systems, which together with Sun's affinity support leads to the nice scalability reported above. In additional experiments on the Sun Fire we slightly modified the source code to either let the initial thread initialize all of the data, or to let the threads of the outer team initialize the data for their inner teams. We also optionally started the threads of both levels in an additional set-up loop such that the threads of the inner teams are scheduled on the same boards, as described in Section 3.5. Table 4 shows that if the data cannot be initialized by all inner threads themselves, the set-up loop improves the performance in both cases. If all the threads are running on the same board, it is sufficient to let the threads of the outer team initialize the data to obtain the optimal speed-up (17.76 seconds).

                                          Runtime w/ set-up loop   Runtime wo/ set-up loop
Data initialized by the master thread     37.80 sec                47.42 sec
Data initialized by the outer team        17.76 sec                30.51 sec

Table 4: NPB BT-MZ with modified data initialization, 16x8 threads.

5. CONCLUSIONS

For OpenMP programs which are not sufficiently cache-friendly, affinity between the threads and their most frequently used data is a necessity for good performance. We described the current situation on the most popular platforms and looked at some simple kernels as well as at some large production codes which are affected by ccNUMA architectures.

Current operating systems typically apply the first touch policy and allocate data where it is initialized. In extreme cases like the Stream benchmark with all data allocated by the master thread, no speedup may be obtainable at all. A fixed allocation of data and threads cannot solve all problems; efficient adaptive algorithms rely on migration. As OpenMP does not provide any means to control affinity, and as automatic migration is still not available in any current production environment (to our knowledge), programmers have to use the utilities to control placement provided by their operating environment. While this may help for OpenMP parallelization on a single level, the situation is worse for nested parallelization. Currently only the Sun compiler provides the means to maintain the affinity between data and threads of inner parallel regions.

There have been many discussions about the necessity of data placement control in OpenMP in the past. From our application experience we found that providing a next touch automatism is easiest for the programmer, which is why we implemented that functionality on Linux. Automatic migration would be the ultimate goal, as long as the overhead is acceptable. In [12] it was demonstrated with a research compiler and runtime environment that if migration works, there is no need for explicit data placement. Today, to the best of our knowledge, there is no commercial solution available. We think that hints to the runtime system for binding, memory placement, migration and probably prefetching would be a welcome enhancement of a future OpenMP standard, superseding the need for the programmer to employ proprietary low-level system calls.

6. REFERENCES

[1] OpenMP Application Program Interface, Draft 3.0 for Public Comment, 2007.
[2] Linux NUMA FAQ list on the Linux Scalability Effort homepage: http://lse.sourceforge.net/numa/faq/index.html
[3] Sun Technical White Paper: Solaris Memory Placement Optimization and Sun Fire Servers.
[4] J. D. McCalpin: STREAM: Sustainable Memory Bandwidth in High Performance Computers.
[5] C. Terboven, A. Spiegel, D. an Mey, S. Gross, V. Reichelt: Parallelization of the C++ Navier-Stokes Solver DROPS with OpenMP. International Conference on Parallel Computing (ParCo 2005), Malaga, Spain, 2005.
[6] D. an Mey, T. Haarmann: Pushing Loop-Level Parallelization to the Limit. Fourth European Workshop on OpenMP (EWOMP 2002), Rome, Italy, 2002.
[7] A. Spiegel, D. an Mey: Hybrid Parallelization with Dynamic Thread Balancing on a ccNUMA System. Sixth European Workshop on OpenMP (EWOMP 2004), Stockholm, Sweden, 2004.


[8] M. Norden, H. Löf, J. Rantakokko, S. Holmgren: Dynamic Data Migration for Structured AMR Solvers. International Journal of Parallel Programming, Vol. 35, No. 5, 2007.
[9] D. an Mey, S. Sarholz, C. Terboven: Nested Parallelization with OpenMP. International Journal of Parallel Programming, Vol. 35, No. 5, 2007.
[10] H. Jin, R. F. Van der Wijngaart: Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks. Journal of Parallel and Distributed Computing, Vol. 66, No. 5, 2006.

[11] H. Jin, B. Chapman, L. Huang, D. an Mey, T. Reichstein: Performance Evaluation of a Multi-Zone Application in Different OpenMP Approaches. Submitted to International Journal of Parallel Programming.
[12] D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta, E. Ayguade: Is Data Distribution Necessary in OpenMP? Proceedings of the 2000 ACM/IEEE Conference on Supercomputing.


Figure 1: FLOWer on one Sun Fire 15K: random memory placement at the beginning and then explicit migration after about 700 seconds (local and remote cache misses and GFLOP/s).

Figure 2: Timing and speedup of the BT-MZ benchmark on SGI Altix 4700 and Sun Fire E25K.

