
Programming Research Group

PRACTICAL BARRIER SYNCHRONISATION

Jonathan M.D. Hill, Oxford University Computing Laboratory, Oxford, U.K. (Jonathan.Hill@comlab.ox.ac.uk)
D.B. Skillicorn, Department of Computing and Information Science, Queen's University, Kingston, Canada (skill@qucis.queensu.ca)

PRG-TR-16-96

Oxford University Computing Laboratory
Wolfson Building, Parks Road, Oxford OX1 3QD

Abstract
We investigate the performance of barrier synchronisation on both shared-memory and distributed-memory architectures, using a wide range of techniques. The results show that distributed-memory architectures behave predictably, although their performance for barrier synchronisation is relatively poor. For shared-memory architectures, a much larger range of implementation techniques is available. We show that asymptotic analysis is useless here, and that a detailed understanding of the underlying hardware is required to design an effective barrier implementation. We show that a technique exploiting cache coherence is more effective than semaphore- or lock-based techniques, and is competitive with specialised barrier-synchronisation hardware.

1 Introduction
Barrier synchronisation is an important collective-communication operation in several of today's parallel programming models. It is a simple concept to understand: all processes in some group must reach a certain point in the execution of their code before any is allowed to proceed past that point. Getting past the barrier allows a process to deduce, locally, that the whole group has reached a common, global state. As such, barriers are an important way to reduce state-space explosion and hence the complexity of programming. Barrier synchronisation is at the heart of SPMD programming, allowing it to be more flexible than a purely-SIMD style. Thus barriers are common in data-parallel languages, particularly new Fortran languages such as HPF and MPP Fortran [12, 15], where every distributed loop body involving shared arrays ends with a barrier. Directives such as MASTER also contain hidden barriers. Barrier synchronisation is also integral

to the BSP programming model [11, 13, 16], in which computations proceed in supersteps that are separated by barriers. Barriers also appear in less-structured parallel computation models such as PVM [2, 3, 8] (pvm_barrier) and MPI [7, 9, 14] (MPI_Barrier). Manufacturers have responded by including mechanisms for barrier synchronisation in today's parallel architectures, usually in software, but sometimes in hardware. One of the contributions of this paper is to show that such implementations should be treated with caution: their performance is often poor. The second contribution is to show that naive cost analysis may lead to poor implementations of barrier synchronisation. For example, a barrier synchronisation using a reduction has asymptotic logarithmic complexity, while a total exchange has asymptotic linear complexity. However, the constants dominate in practice. Techniques that look unattractive from a complexity perspective are often most effective in real implementations. The third contribution of this paper is to present the performance of an algorithm for low-latency barrier synchronisation on shared-memory architectures that uses cache coherency to provide the mutual exclusion required. Our results show that this technique is as efficient as hardware solutions to barrier synchronisation. Moreover, on some machines, the technique is two orders of magnitude faster than alternative software solutions. In Section 2, we describe several techniques for implementing barriers on shared-memory architectures, and compare their performance. In Section 3, we do the same for message-passing. Section 4 compares the performance of dedicated hardware to the earlier software solutions.

2 Shared-Memory Barriers
In this section we show the performance of various implementations of barrier synchronisation on a Silicon Graphics Power Challenge (75MHz R8000), a Sun Sparcstation-20 multiprocessor (100MHz hyperSPARC), and a Digital 8400 (300MHz Alpha EV5), using shared-memory implementation techniques. As a baseline, we first measure the performance of each architecture's basic tools for pairwise synchronisation. We do this using three benchmarks:

LOCK All processes loop locking and immediately unlocking a single lock (the SUN does not support locks).

SEMLOCK All processes loop entering a single critical section protected by a semaphore. This is semantically equivalent to the test above, as a 0-1 semaphore is used to represent a lock.

PWSEM Two processes loop, each one locking one semaphore and unlocking another, i.e., s1=0;s2=0; {p(s1);v(s2) || v(s1);p(s2)}+.

The performance of these simple code segments shows the cost of the basic building blocks from which barrier synchronisations are built. We then measure the performance of the following techniques for implementing barriers (using 1 process per processor):

MAN Manufacturer-supplied barrier system call (for the SGI).

LOCKTREE A tree-based dissemination [1] barrier using locks (on the SGI). This algorithm uses a logarithmic number of phases in which each process synchronises with its immediate neighbour in a notional linear ordering, then with a process two steps away, then with one four steps away, and so on.

SEMTREE The same as the previous implementation except that semaphores are used for pairwise synchronisation instead of locks.

PWAYSPIN A centralised counter barrier in which each process entering the barrier increments a counter protected by a lock. The process then spins on the counter until it reaches p, the number of processes participating in the barrier.

PWAYSLEEP A similar barrier to the one above except that, rather than spinning on a counter, the first p - 1 processes to enter the barrier wait on a semaphore, and are awakened by the last process that enters the barrier.

PWAYCACHE A barrier in which each process writes a value to a unique location, and all of these locations are placed on the same cache line. After writing its value, each process spins reading the entire set of locations. This technique uses the architecture's cache coherence mechanism to handle mutual exclusion, since only one processor can hold the line in write mode at any given time.
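The pairing pattern used by the dissemination barriers (LOCKTREE, and later MESSTREE) can be made concrete with a short sketch. The function names num_phases and partner are ours, not the paper's; we assume p processes labelled 0..p-1 in the notional linear ordering, with wraparound.

```c
#include <assert.h>

/* Number of phases a dissemination barrier needs for p processes:
 * the smallest n with 2^n >= p, i.e. ceil(log2(p)). */
static int num_phases(int p) {
    int phases = 0;
    for (int span = 1; span < p; span <<= 1)
        phases++;
    return phases;
}

/* The process that `me` synchronises with in a given phase: first its
 * immediate neighbour (2^0 away), then a process 2^1 away, then 2^2,
 * and so on, wrapping around the notional linear ordering. */
static int partner(int me, int phase, int p) {
    return (me + (1 << phase)) % p;
}
```

After num_phases(p) rounds of pairwise synchronisation along this schedule, information from every process has reached every other process, which is what lets each process conclude that the whole group has arrived.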

The performance of these implementations, expressed in terms of the minimum barrier latency, i.e., the time between the first process entering the barrier and the last one exiting, is shown in Figure 1. The times are measured by repeatedly barrier-synchronising the processors and taking a mid-stream sample. The manufacturer's barrier on the SGI performs rather poorly: three other implementations outperform it. This appears to be characteristic of today's machines, and suggests that implementations carefully crafted by users are to be preferred. It is clear from this data that locks, if available, are much faster than semaphores, which is not entirely unexpected. What is surprising is how poor the semaphore implementation on the SGI is: almost twice as slow as semaphores on the SUN, even correcting for the clock-rate difference. This makes barrier implementation techniques that rely on semaphores perform very poorly on the SGI.

                SGI      SUN      DEC
  LOCK          2.5       -        -
  SEMLOCK     637.2    315.6     37.8
  PWSEM       320.8    129.3     97.9
  MAN          33.5       -        -
  LOCKTREE     12.9       -        -
  SEMTREE     335.6    167.6     89.7
  PWAYSPIN     18.1    320.4    188.6
  PWAYSLEEP   355.4    288.7    179.6
  PWAYCACHE     6.6      2.2      3.8

Figure 1: Comparing Shared-Memory Implementations of Barriers. Times in microseconds based on 10^6 repeated barriers, using 4 processors.

The improvement in performance from the centralised barrier PWAYSPIN, through the tree-based scheme LOCKTREE, to the PWAYCACHE barrier can only be effectively analysed by considering the number of shared-memory read and write bus transactions issued by each process. To do such analysis on lock-based barriers we first consider the implementation of low-latency locks. The common technique [10] of implementing a lock is to use an atomic swap operation in conjunction with the cache coherence of the multiprocessor to minimise the amount of traffic on the shared-memory bus. A lock can be represented by a word in shared memory, and unsetting a lock can be implemented by writing a default value into the lock. Setting the lock involves:

1. Reading the value of the lock from the shared variable into the processor's local cache.

2. If the lock is not set, the processor attempts to set the lock using an atomic swap operation. If the set succeeds, the process continues. Otherwise (i.e., another processor actually set the lock) a repeated attempt is made to set the lock. Typically an exponential back-off is incorporated to minimise contention on the lock.

3. If the lock is set (i.e., by another processor), the processor "spin-waits" until it becomes unlocked. As the lock will be held in local cache, spinning does not cause any traffic on the shared-memory bus. When another process unsets the lock, a write invalidate will be issued to all spinning processors by the cache coherence mechanism, and the processors will update their local cached copy of the lock. Each process then stops spinning and attempts to set the lock.

A limitation of centralised barrier schemes such as PWAYSPIN is that there can be excessive contention on the shared-memory bus as each process tries to set the lock using an atomic swap (step 2 in the lock implementation above). In fact the contention is exaggerated in a centralised barrier scheme because each process may enter the barrier at about the same time, which results in a race for the centralised lock. As only one process can enter the critical region and set the lock, the other processors will start spinning and then race with each other to set the lock when the first process exits the critical region. Therefore, although each processor issues only 1 write request and 1 read request onto the shared bus when incrementing the shared counter (we ignore the cost of spinning locally on the counter as it will be in cache), the cost of setting/unsetting the lock associated with the critical region of code that increments the shared counter will be p^2 read requests and p^2/2 writes per processor (counting an atomic swap as a read and a write). This is solely due to the contention as each process attempts the atomic swap in step (2) of the lock implementation above.

The standard solution to overcoming this problem is to use multiple locks, so that small groups of processors synchronise on each lock. For example, each process in the tree-based barrier LOCKTREE performs log p set and unset operations. As each of these lock operations is performed on a distinct lock, the number of bus transactions is 2 log p read requests and log p writes.

Distributing locks is not the only way of improving the performance of centralised barrier schemes. The weakness of lock-based barriers is the need for an atomic swap on a single, shared lock. Our solution is to eliminate the swaps by using a barrier lock that consists of p shared words, each initialised to a default value. As each process i enters the barrier, it sets the ith word of the barrier lock. However, unlike the implementation of a single lock, there is no need for an atomic swap operation, which may fail and introduce contention into the algorithm, as each process sets a separate entry in the barrier lock. After setting the lock, each process spins on the other p - 1 values until they become set. As with the lock implementation, the spinning does not induce traffic on the shared-memory bus until write invalidates are issued by each process reaching the barrier. In this algorithm, each processor will issue 1 write request to shared memory, and p - 1 reads. Moreover, we expect the number of read requests to be much smaller than this upper bound if the barrier lock is arranged to occupy consecutive words in cache. In that case, the number of reads is reduced to (p - 1)/k, where k is the number of words in a cache line. For example, as a cache line on the SGI holds 128 bytes (i.e., k = 32), the number of reads may be reduced to 1. However, experiments show that placing the p words of the barrier lock in different cache lines has a marginal effect on performance.

                        SGI                       SUN                     Digital
  Processors     1     2     3     4  |    1     2     3     4  |    1     2     3     4
  LOCK          0.2   0.8   1.3   2.5      -     -     -     -      -     -     -     -
  SEMLOCK       1.6  99.2 393.3 637.2   14.9  63.2 210.3 315.5   11.7  21.2  33.8  37.8
  PWSEM          -  303.7 313.6 320.8      - 125.3 124.5 129.3      -  59.6  70.6  97.9
  MAN           1.1  11.0  23.9  33.5      -     -     -     -      -     -     -     -
  LOCKTREE      0.3   6.1  12.7  12.9      -     -     -     -      -     -     -     -
  SEMTREE       0.3 163.4 252.0 335.6      -  44.7 120.0 167.6      -  21.6  67.8  89.7
  PWAYSPIN      1.3   7.4  12.7  18.1   17.3  88.5 191.1 320.4   12.1  41.6 101.9 188.6
  PWAYSLEEP     0.8 160.8 249.4 355.4   15.7  93.5 209.1 288.7   12.1  56.2 135.1 179.6
  PWAYCACHE     0.4   3.2   4.5   6.6   0.43   1.0   1.5   2.2    0.1   1.5   2.5   3.8

Figure 2: Shared-Memory Barriers. Times in microseconds based on 10^6 repeated barriers.
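A minimal sketch of this PWAYCACHE-style scheme, using C11 atomics and POSIX threads. The thread ids, the fixed NPROC, and the single-use restriction (a reusable barrier would need sense reversal or flag clearing) are our simplifications, and no attempt is made here to pack the flags into a single cache line.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NPROC 4

/* One word per process; the paper places all NPROC words on the same
 * cache line so that one write-invalidate wakes every spinner. */
static atomic_int flag[NPROC];
static atomic_int not_ready; /* counts post-barrier visibility failures */

/* Single-use cache-coherence barrier: publish our own flag with a plain
 * store (no atomic swap is needed, since each process owns its word),
 * then spin reading the whole set until all NPROC flags are set. */
static void cache_barrier(int me) {
    atomic_store(&flag[me], 1);
    int done;
    do {
        done = 1;
        for (int i = 0; i < NPROC; i++)
            if (atomic_load(&flag[i]) == 0)
                done = 0;
    } while (!done);
}

static void *worker(void *arg) {
    cache_barrier((int)(long)arg);
    /* Past the barrier, every flag must be visible as set. */
    for (int i = 0; i < NPROC; i++)
        if (atomic_load(&flag[i]) != 1)
            atomic_fetch_add(&not_ready, 1);
    return 0;
}

/* Drive NPROC threads through one barrier; returns the number of
 * visibility violations observed (0 if the barrier worked). */
static int run_once(void) {
    pthread_t t[NPROC];
    for (long i = 0; i < NPROC; i++)
        pthread_create(&t[i], 0, worker, (void *)i);
    for (int i = 0; i < NPROC; i++)
        pthread_join(t[i], 0);
    return atomic_load(&not_ready);
}
```

Because each word has exactly one writer, there are no failed swaps and no retries; the only bus traffic is the one write per process plus the invalidation-driven reads, which is the source of the low latency reported above.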
This is probably because, for these architectures, cache misses are satisfied from the secondary cache, so the miss penalty is small. In multiprocessors with no secondary cache, the improvement due to placing the entire barrier lock on a single cache line might be more marked. We conclude that it is cache coherence, rather than cache speed, that is providing the high performance, so we can expect the technique to scale independently of cache size. On the SGI the PWAYCACHE technique is faster than using a lock-based tree by a factor of two; on the SUN it is the best choice by two orders of magnitude.

Figure 2 allows us to consider the scalability of these techniques as well. It is clear that both lock and semaphore overhead grow faster than linearly in the number of processors sharing a single one. So do the manufacturer-provided barrier on the SGI and the PWAYSPIN and PWAYSLEEP implementations. Even the tree-based semaphore implementations scale worse than linearly. The only scalable implementations are LOCKTREE, whose performance decreases at powers of two, and PWAYCACHE, whose cost is linear in p (but with a very small constant).

These results are evidence that standard techniques for managing heavyweight processes, inherited from multiprogrammed operating systems, are too cumbersome for efficient parallel programming. Implementations involving context switching perform poorly because of the poor performance of the underlying management of processes waiting on semaphores. The results also show that non-intuitive implementations can outperform apparently-sensible ones. This is because asymptotic performance analysis is overshadowed by the size of the constants involved, and these, as we have seen, depend on details of the operating system and hardware.

3 Distributed-Memory Barriers

Message-passing systems with non-blocking sends but blocking receives make each message self-synchronising. Thus constructing a barrier means building a sequence of messages in which delivery at every processor depends on receipt from every other processor. Two obvious strategies are to use logarithmic-depth parallel-prefix structures, or to use total exchanges in which every process exchanges a single message with every other process. Message passing may be used as an implementation technique even for shared-memory architectures, so we also compare the performance of message-based barriers with shared-memory techniques.

As a baseline, we again measure the performance of basic pairwise messaging.

PINGPONG A pair of processes exchange empty messages as quickly as possible, i.e., {(nonblocksend;recv) || (nonblocksend;recv)}+.

We then measure the performance of the following techniques for implementing barriers:

LIBCALL A call to an appropriate barrier synchronisation routine (either MPI_Barrier or IBM's MPL barrier routine).

MESSTREE An implementation of the algorithm used in LOCKTREE, but using messages instead of locks.

PWAY An implementation in which each process sends a message to a single designated process, and waits for a receipt message. The designated process accepts all p - 1 messages from the other processes before sending p - 1 receipt messages.

TOTXCH A total exchange: each processor sends p - 1 messages and expects p - 1.

The performance of these implementations is shown in Figures 3 and 4. The clock rate of the SP2 is 66 MHz. Once again, the manufacturer-supplied barrier LIBCALL is almost never the best choice. As expected, MESSTREE is the method of choice. Its performance degrades past each power of two, but is overall better than the alternatives. The comparison of PWAY and TOTXCH depends on the relationship between total communication capacity and the critical path. Using PWAY, the designated process executes the same code as every process does under TOTXCH, but it must then broadcast the receipt messages to all of the other processes. Thus it can only win if it

receives its p - 1 incoming messages earlier because there are p messages in the communication system rather than p^2. This is the case for the SGI MPI implementation and for the SP2 using the Ethernet interconnect. PWAY is worse than TOTXCH for small numbers of processors on the SP2 using the switch interconnect, but the two implementations perform similarly for larger numbers of processors. Note that the performance of TOTXCH is much better than the results obtained by Bokhari [4] using multiphase total exchange, which requires approximately 3 ms on the SP2. This shows clearly that clever variants of total exchange are important only as a means to control congestion in communication networks. The performance of barriers using message passing is much more predictable than that of barriers using shared memory. This is partly because the supplied non-blocking send and blocking receive operations are typically much higher level than the primitives available in a shared-memory setting. There is thus less opportunity to use the architecture effectively. It is also clear that implementations of MPI have confused the message-passing interface presented to the programmer with the underlying implementation techniques. Notice that even the SGI MPI implementation of a barrier performs worse than a call to the built-in barrier routine provided by SGI. The message overhead means that a tree-based barrier using locks is much faster than one using messages. Also note that the Argonne implementation of MPI on SGIs is worse than the SGI implementation by a factor of between two and three. On all of the systems considered, the performance of barriers is very close to the theoretical results (Figures 4, 5, and 6): the cost of ceil(log p) single messages for tree-based barriers, and of p - 1 single messages for total-exchange barriers.
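The trade-off between these barrier styles can be summarised with a toy cost model. The model (uniform per-message latency, independent messages fully overlapped, sends and receives at a single process serialised) is our own simplification for illustration, not the paper's analysis.

```c
/* Critical-path length of each barrier, measured in single-message
 * latencies, and total messages injected into the network. */

/* Phases of the dissemination tree: ceil(log2(p)). */
static int ceil_log2(int p) {
    int n = 0;
    for (int span = 1; span < p; span <<= 1) n++;
    return n;
}

static int messtree_path(int p) { return ceil_log2(p); }   /* log p phases */
static int totxch_path(int p)   { return p - 1; }          /* p-1 serialised sends */
static int pway_path(int p)     { return 2 * (p - 1); }    /* gather, then release */

static int totxch_msgs(int p)   { return p * (p - 1); }    /* roughly p^2 in flight */
static int pway_msgs(int p)     { return 2 * (p - 1); }    /* roughly 2p in flight */
```

On this model PWAY has about twice the critical path of TOTXCH but injects roughly p/2 times fewer messages, which is why PWAY wins exactly when total network capacity, rather than the critical path, is the bottleneck, as observed on Ethernet.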

                 SGI          SGI        SP2
               SGI MPI   Argonne MPI   IBM mpl
  PINGPONG      36.1        108.9        53.0
  LIBCALL       73.5        217.7       197.9
  MESSTREE      74.1        165.9        95.6
  PWAY          92.0        234.0       156.2
  TOTXCH       105.7        224.9       124.3

Figure 3: Comparing Message-Passing Implementations of Barriers. Times in microseconds based on 10^6 repeated barriers, using 4 processors.

                       SGI MPI                   Argonne MPI
  Processors     1      2      3      4      1       2      3      4
  PINGPONG       -    36.06    -      -      -    108.86    -      -
  LIBCALL       0.5   35.8   72.5   73.5    0.5    99.6  197.9  217.7
  MESSTREE      0.9   37.2   73.8   74.1    0.7    97.2  151.0  165.9
  PWAY          0.9   47.5   53.9   91.8    0.7   147.9  182.1  234.8
  TOTXCH        0.9   37.4   72.6  105.7    0.7    94.2  157.7  224.9

Figure 4: Barriers Using Message Passing over Shared Memory. Times in microseconds based on 10^6 repeated barriers.

                              IBM SP2 (switch)
  Processors     1      2      3      4      5      6      7      8
  PINGPONG       -    52.97    -      -      -      -      -      -
  LIBCALL       5.8  101.6  126.8  197.9  208.7  216.6  230.2  289.2
  MESSTREE      3.2   63.4   91.9   93.6  125.4  126.0  129.5  139.3
  PWAY          0.4  114.9  134.4  155.7  177.6  199.3  235.9  279.3
  TOTXCH        3.2   63.3   92.0  124.3  162.0  199.8  235.7  278.4

Figure 5: Barriers Using Message Passing. Times in microseconds based on 10^6 repeated barriers.

                  IBM SP2 (Ethernet)
  Processors      2       4       6       8
  PINGPONG     740.4      -       -       -
  LIBCALL     1038.3  2141.3  2377.5  3216.2
  MESSTREE     738.7  1244.0  2588.9  3745.4
  PWAY        1064.0  1601.1  2803.7  3475.9
  TOTXCH       739.5  1756.1  4580.8  9557.3

Figure 6: Barriers Using Message Passing and Ethernet. Times in microseconds based on 10^6 repeated barriers.

4 Hardware Barriers

The CRAY T3D provides a hardware barrier synchronisation mechanism. Its performance is shown in Figure 7. It provides fast barrier synchronisation for submachines whose size is a power of two, but is significantly slower for submachines of other sizes.

                                CRAY T3D
  Processors    1    2    4    8     9    16    25    32    64   128   256
  HWBARR       0.9  1.9  1.9  1.9  17.3   1.9  17.8   1.9   2.0   2.0   2.0

Figure 7: Hardware Barriers on the T3D. Times in microseconds based on 10^6 repeated barriers.

A project at Purdue University [5, 6], the PAPERS project, involves constructing fast synchronisation hardware from off-the-shelf components, connecting a wide range of hardware configurations. It is based on an AND-tree. They report barrier synchronisation times of 2.5 microseconds for four processors, and expect this value to grow only extremely slowly for larger machine sizes. Note that the PWAYCACHE implementation of barrier synchronisation on the SUN performs as well as hardware mechanisms on some other multiprocessors (on four processors).

5 Conclusions

From the performance measurements we have given, we can draw the following conclusions.

First, the performance of barriers on distributed-memory machines is predictable, although not good. This is partly because such architectures provide message-passing abstractions that are far from the hardware.

Second, the performance of barriers on shared-memory machines is much better, but is not nearly so predictable. The increased performance is possible because of access to the machine at a level very close to the hardware. However, this increased access requires a much more detailed understanding of the hardware and operating system to design good implementation techniques for barriers. We have illustrated the following pitfalls: Semaphores require process queueing and dequeueing that is expensive on today's architectures. A lock-based mechanism using a single lock requires each process to issue a linear number of reads and writes, and hence a quadratic number of attempts are made on the lock. A lock-based mechanism using a tree structure still requires a logarithmic number of attempts on each lock. Manufacturers' built-in barrier synchronisation operations are unlikely to be particularly good.

Third, the use of what is essentially a unary counter that exploits cache coherency provides much higher performance than any other technique, and is competitive with the special-purpose barrier hardware provided in supercomputers. This technique scales well, and is the implementation of choice for shared-memory architectures.

References

[1] G.R. Andrews. Concurrent Programming: Principles and Practice. Benjamin Cummings, 1991.

[2] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, and V. Sunderam. Recent enhancements to PVM. International Journal of Supercomputing Applications and High Performance Computing, 1995.

[3] A. Beguelin, J.J. Dongarra, G.A. Geist, R. Manchek, and V.S. Sunderam. PVM software system and documentation. Email to netlib@ornl.gov.

[4] S.H. Bokhari. Multiphase complete exchange on Paragon, SP2, and CS-2. IEEE Parallel and Distributed Technology, 4(3):45-59, Fall 1996.

[5] H.G. Dietz, T.M. Chung, and T.I. Mattox. A parallel processing support library based on synchronized aggregate communication. School of Electrical Engineering, Purdue University, April 1995.

[6] H.G. Dietz, T. Muhammad, J.B. Sponaugle, and T. Mattox. PAPERS: Purdue's adapter for parallel execution and rapid synchronization. Technical Report TR-EE-94-11, Purdue School of Electrical Engineering, March 1994.

[7] J. Dongarra, S.W. Otto, M. Snir, and D. Walker. An introduction to the MPI standard. Technical Report CS-95-274, University of Tennessee, http://www.netlib.org/tennessee/ut-cs-95-274.ps, January 1995.

[8] G.A. Geist. PVM3: Beyond network computing. In J. Volkert, editor, Parallel Computation, Lecture Notes in Computer Science 734, pages 194-203. Springer, 1993.

[9] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming. MIT Press, Cambridge MA, 1994.

[10] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, second edition, 1996.

[11] Jonathan M.D. Hill. The Oxford BSP toolset users' guide and reference manual. Oxford Parallel, December 1995.

[12] C.H. Koelbel, D.B. Loveman, R.S. Schreiber, G.L. Steele Jr., and M.E. Zosel. The High Performance Fortran Handbook. MIT Press, Cambridge MA, 1994.

[13] W.F. McColl. Scalable computing. In J. van Leeuwen, editor, Computer Science Today: Recent Trends and Developments, volume 1000 of Lecture Notes in Computer Science, pages 46-61. Springer-Verlag, 1995.

[14] Message Passing Interface Forum. MPI: A message passing interface. In Proceedings of Supercomputing '93, pages 878-883. IEEE Computer Society, 1993.

[15] Douglas M. Pase, Tom MacDonald, and Andrew Meltzer. MPP Fortran programming model. Technical report, Cray Research, Inc., October 1993.

[16] D.B. Skillicorn, J.M.D. Hill, and W.F. McColl. Questions and answers about BSP. Technical Report TR-15-96, Oxford University Computing Laboratory, August 1996.
