Marco Oey
Koen Langendoen
Henri E. Bal
Abstract
Most distributed systems contain protocols for reliable
communication, which are implemented either in the microkernel or in user space. In the latter case, the microkernel provides only low-level, unreliable primitives and the
higher-level protocols are implemented as a library in user
space. This approach is more flexible but potentially less
efficient.
We study the impact on performance of this choice
for RPC and group communication protocols on Amoeba.
An important goal in this paper is to look at overall
system performance. For this purpose, we use several
(communication-intensive) parallel applications written in
Orca.
We look at two implementations of Orca on Amoeba, one
using Amoeba's kernel-space protocols and one using user-space protocols built on top of Amoeba's low-level FLIP
protocol. The results show that comparable performance
can be obtained with user-space protocols.
1 Introduction
Most modern operating systems are based on a microkernel, which provides only the basic functionality of the
system. All other services are implemented in servers that
run in user space. Mechanisms provided by a microkernel
usually include low-level memory management, process
creation, I/O, and communication.
An important design issue for microkernels is which
communication primitives to provide. Some systems provide primitives of a high abstraction level, such as reliable
message passing, Remote Procedure Call [5], or reliable
group communication [10]. An alternative approach, in the
spirit of microkernels, is to provide only low-level, unreliable send and receive primitives, and to build higher-level protocols as a library in user space. The trade-offs are
similar to those of other design choices for microkernels:
implementing functionality in user space is more flexible
but potentially less efficient.
Several researchers have implemented protocols like
TCP/IP and UDP/IP in user space [12, 16]. They describe
many advantages of this approach. For example, it eases
maintenance and debugging, allows the co-existence of
* This research was supported in part by a PIONIER grant from the Netherlands Organization for Scientific Research (N.W.O.).
Figure 1: The structure of Panda: an interface layer (threads, RPC, group communication) on top of a system layer (threads, communication), which runs on the operating system.
The system layer provides threads and low-level communication primitives (which are not necessarily reliable). The interface layer uses these primitives to provide a higher-level interface to the RTS.
An important issue is in which layer the reliability and
ordering semantics should be implemented. Panda has been
designed to be flexible and allows both layers to implement
any of these semantics. For example, on most multicomputers the hardware communication primitives are reliable,
so it is most efficient to have the system layer provide reliable communication. In this case the interface layer will
be simple. If the communication primitives provided by
the system layer are unreliable, the interface layer uses
protocols to make point-to-point communication reliable.
Likewise, Panda's interface layer has a protocol for making
group communication reliable and totally ordered.
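As a concrete illustration of this division of labor, the following sketch (in C, with invented names such as sys_layer and pan_send; this is not the actual Panda code) shows an interface-layer send that falls back to a reliability protocol only when the system layer is unreliable:

    /* Minimal sketch of the Panda layering described above; all names are
       our own illustration, not the actual Panda interfaces. */
    #include <stdbool.h>
    #include <stdio.h>

    struct sys_layer {
        bool reliable;                          /* e.g., true on most multicomputers */
        void (*send)(const void *buf, int len); /* low-level send primitive */
    };

    static void raw_send(const void *buf, int len)
    {
        (void)buf;
        printf("system layer: sent %d bytes\n", len);
    }

    /* Interface-layer send: a pass-through on reliable hardware; otherwise the
       message would be tagged with a sequence number and retransmitted on
       timeout (the retransmission machinery is omitted here). */
    static void pan_send(struct sys_layer *sys, const void *buf, int len)
    {
        if (!sys->reliable) {
            /* attach sequence number, start retransmit timer, ... */
        }
        sys->send(buf, len);
    }

    int main(void)
    {
        struct sys_layer amoeba_flip = { false, raw_send };  /* unreliable FLIP */
        pan_send(&amoeba_flip, "hello", 5);
        return 0;
    }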
The Panda RPC protocol is a 2-way stop-and-wait protocol. The client sends a request message to the server.
The server executes the request and sends back a reply
message, which also acts as an implicit acknowledgement
for the request. Finally, the client needs to send back an
acknowledgement for the reply message. If possible, the
Panda RPC protocol piggybacks this acknowledgement on
a later request message, and sends an explicit acknowledgement message only after a certain timeout. This optimization is the major
difference from Amoeba's 3-way RPC protocol, which always sends back an explicit acknowledgement message.
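The client side of this acknowledgement scheme can be sketched as follows (with invented names and flags; this is not the actual Panda code): an acknowledgement owed for the last reply is piggybacked on the next request, and sent explicitly only when a timeout expires first.

    /* Sketch of the client side of Panda's 2-way RPC ack handling. */
    #include <stdbool.h>
    #include <stdio.h>

    enum { FLAG_ACK = 1 };        /* "this message also acks your last reply" */

    static bool ack_owed = false; /* reply received but not yet acknowledged */

    static void send_msg(int flags, const char *what)
    {
        printf("send %s%s\n", what,
               (flags & FLAG_ACK) ? " [+piggybacked ack]" : "");
    }

    static void do_rpc(const char *req)
    {
        int flags = 0;
        if (ack_owed) {           /* piggyback the ack for the previous reply */
            flags |= FLAG_ACK;
            ack_owed = false;
        }
        send_msg(flags, req);     /* the reply implicitly acks this request */
        /* ... block until the reply arrives ... */
        ack_owed = true;          /* we now owe an ack for this reply */
    }

    static void on_timeout(void)  /* no new request was issued in time */
    {
        if (ack_owed) {
            send_msg(FLAG_ACK, "explicit ack");
            ack_owed = false;
        }
    }

    int main(void)
    {
        do_rpc("request 1");
        do_rpc("request 2");      /* carries the acknowledgement for reply 1 */
        on_timeout();             /* forces an explicit ack for reply 2 */
        return 0;
    }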
The protocol for totally ordered group communication
is similar to the Amoeba protocol of Kaashoek [10]. It uses
a sequencer to order all messages. To broadcast a message,
the sender passes the message to the sequencer, which tags
it with the next sequence number and then does the actual
broadcast. The receivers use the sequence numbers to determine if they have missed a message, in which case they
ask the sequencer to send the message again. For this purpose, the sequencer keeps a history of messages that may
not have been delivered at all machines yet. The protocol
has several mechanisms to prevent overflow of the history
buffer. Also, for large messages, both Amoeba and Panda
use a more efficient protocol, in which the senders broadcast
messages themselves and the sequencer broadcasts (small)
acknowledgement messages [10].
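The receiver side of the sequencer protocol can be sketched as follows (again with invented names; this is not the actual Amoeba or Panda code). A receiver delivers messages strictly in sequence-number order; a gap triggers a retransmission request, which the sequencer serves from its history buffer.

    /* Sketch of gap detection in sequencer-ordered broadcast. */
    #include <stdio.h>

    static unsigned expected = 0;            /* next sequence number to deliver */

    static void deliver(unsigned seq) { printf("deliver #%u\n", seq); }

    static void request_retransmit(unsigned from)
    {
        /* the sequencer keeps a history of messages that may not yet have
           been delivered everywhere and replays them starting at 'from' */
        printf("ask sequencer to resend from #%u\n", from);
    }

    static void on_broadcast(unsigned seq)
    {
        if (seq != expected) {               /* duplicate or gap */
            if (seq > expected)
                request_retransmit(expected);/* we missed one or more messages */
            return;                          /* drop; wait for in-order resend */
        }
        deliver(seq);
        expected = seq + 1;                  /* total order: strictly sequential */
    }

    int main(void)
    {
        on_broadcast(0);
        on_broadcast(1);
        on_broadcast(3);   /* #2 was lost: request a resend, hold back #3 */
        on_broadcast(2);   /* resent by the sequencer */
        on_broadcast(3);   /* now deliverable in order */
        return 0;
    }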
The Orca/Panda system was originally developed on top
of SunOS running on a collection of Sun-4 workstations [4].
The system has been ported to several operating systems
(including Amoeba, Solaris, Horus, and Parix) and machines (including the CM-5 and Parsytec transputer grid).
In this paper, we will only look at the Amoeba implementations.
Figure 2: The two implementations. In the kernel-space implementation, Panda's system layer consists of wrapper routines around the RPC, group communication, and FLIP protocols inside the Amoeba microkernel. In the user-space implementation, the system layer contains Panda's own RPC and group communication protocols, built on the kernel's unreliable FLIP communication.
                                    message size
                        0 Kb     1 Kb     2 Kb     3 Kb     4 Kb
    unicast     user    0.53     1.50     2.50     3.72     4.18
    multicast   user    0.62     1.58     2.55     3.74     4.23
    RPC         user    1.56     2.53     3.60     4.77     5.27
                kernel  1.27     2.23     3.40     4.48     5.06
    group       user    1.67     3.59     3.67     4.84     5.35
                kernel  1.44     3.38     3.44     4.56     5.25

Table 1: Latency [ms] of the user-space primitives and of the RPC and group protocols for various message sizes.
             user-space    kernel-space
    RPC         825            897
    group       941            941

Table 2: Throughput [Kb/s] of the RPC and group protocols.
Hence, two context switches are needed to process the reply message. With the kernel-space protocol, on the other hand, Amoeba immediately delivers the reply message to the blocked client thread; no context switches are needed, since no other thread was scheduled between sending the request and receiving the reply. Measurements inside the Amoeba kernel show that the total overhead of the two context switches is about 140 µs, which already explains half the difference in latency between the user-space and kernel-space implementations (1.56 ms versus 1.27 ms for a null RPC).
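The structure that causes these context switches can be sketched with POSIX threads (our own illustration; Amoeba provides its own kernel-level threads, not pthreads): a daemon thread receives all incoming messages and must explicitly wake the client thread that is blocked waiting for its reply.

    /* Why user-space RPC pays two context switches on reply delivery:
       the kernel must first run the daemon thread, which then wakes the
       blocked client. All names here are illustrative. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  reply_arrived = PTHREAD_COND_INITIALIZER;
    static int have_reply = 0;

    static void *daemon_thread(void *arg)
    {
        (void)arg;
        /* ... block in a receive call until the reply packet arrives ... */
        pthread_mutex_lock(&lock);
        have_reply = 1;
        pthread_cond_signal(&reply_arrived);  /* wake the blocked client */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)                            /* plays the RPC client thread */
    {
        pthread_t d;
        pthread_create(&d, NULL, daemon_thread, NULL);
        /* ... send the request message ... */
        pthread_mutex_lock(&lock);
        while (!have_reply)
            pthread_cond_wait(&reply_arrived, &lock);  /* block for reply */
        pthread_mutex_unlock(&lock);
        printf("client: reply delivered\n");
        pthread_join(d, NULL);
        return 0;
    }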
Thread handling in Amoeba is expensive, since only kernel-level threads are provided, so each operation that involves threads (e.g., signaling) potentially has to go through
the kernel. Consequently, the user-space implementation
needs four additional crossings between kernel and user
address spaces for each RPC: two for waking up the client
thread, and two for switching from the daemon thread to
the client thread. The costs of an address space crossing
are not fixed but depend on the depth of the call stack, because the kernel has to save and restore register windows.
Our SPARC processors use six register windows of fixed
size. A new window is allocated during each procedure
call. When the user invokes a system call, the Amoeba
kernel first saves all register windows in use, performs the
system call, and then restores the single topmost register
window, before returning to user space. When the user
program continues, the register windows deeper in the call
stack are faulted in through underflow traps. These traps
are handled in software by the operating system, hence they
are rather expensive: about 6 µs per trap.
The user-space implementation suffers from the Amoeba policy of restoring only the topmost register window. When the daemon thread enters the kernel to signal the client thread about the arrival of its reply message, the daemon thread's stack is using all register windows. Consequently, the daemon suffers six additional underflow traps when returning down the call stack after the system call has finished. The combined overhead of crossing the address space boundary and the underflow traps is about 50 µs (the six traps alone account for roughly 36 µs). At the server machine, both the user-space and kernel-space implementations
                                                           number of processors
    Orca Application              Implementation            1      8     16     32   max. speedup
    Travelling Salesman Problem   Kernel-space            790     87     44     23       34.2
                                  User-space              783     92     46     24       32.2
    All-Pairs Shortest Paths      Kernel-space            213     30     17     11       18.4
                                  User-space              216     31     18     11       19.0
    Alpha-Beta Search             Kernel-space            565    106     78     60        9.3
                                  User-space              567    106     78     59        9.5
    Region Labeling               Kernel-space            759    132    115    114        6.6
                                  User-space              767    133    119    108        7.1
    Successive Overrelaxation     Kernel-space            118     20     14     13        9.0
                                  User-space              118     19     13     11       10.3
    Linear Equation Solver        Kernel-space            521    102     91    127        5.7
                                  User-space              527    113    112    164        4.7
                                  User-space-dedicated    527    116     94    128        5.6

Table 3: Performance of Orca applications; execution times [sec] on 1, 8, 16, and 32 processors, and maximum speedup.
The Linear Equation Solver (LEQ) is the only application that shows a clear advantage for the kernel-space protocol. The poor performance of the user-space implementation is due to the sequencer's machine. This processor handles broadcast requests from all other machines, but it must also process all incoming update messages and run an Orca process (as all machines do). With 32 processors, this machine becomes overloaded and slows down the iterative application. The solution is to dedicate one processor to just sequencing; this not only reduces the latency of a group message by 50 µs (see Section 3.2), but also avoids the overhead of preempting the Orca process at the sequencer for each incoming message. The kernel-space implementation does not suffer from the overhead of context switching between two threads in user space, because the sequencer is run as part of the interrupt handler inside the kernel.
The row labeled User-space-dedicated lists the timings for the user-space implementation that sacrifices one processor to run only the sequencer part of the group protocol. On eight processors the benefits do not outweigh the loss of one worker process, but at 16 and 32 processors the dedicated sequencer clearly pays off. For example, on 16 processors the 15 workers now solve the linear equations in 94 seconds instead of 112 seconds. Note that the
execution times increase when going from 16 processors
to 32 for all three implementations. This is a consequence
of the parallel algorithm that now sends twice the number
of group messages of half the size. The decrease in computation time does not outweigh the increase in message
handling overhead.
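A back-of-the-envelope model makes this concrete. From the user-space group latencies in Table 1, a group message costs about 1.67 ms fixed plus roughly 0.92 ms per Kbyte (the slope between the 0 Kb and 4 Kb entries). The message counts and sizes below are made up for illustration; the point is that doubling the number of messages while halving their size doubles the fixed per-message cost.

    /* Rough cost model for the LEQ behavior above; parameters are derived
       from Table 1, message counts and sizes are hypothetical. */
    #include <stdio.h>

    static double group_time_ms(int nmsgs, double kbytes)
    {
        return nmsgs * (1.67 + 0.92 * kbytes);   /* fixed + per-byte cost */
    }

    int main(void)
    {
        /* one iteration: n messages of 4 Kb vs. 2n messages of 2 Kb */
        printf("16 CPUs: %.1f ms\n", group_time_ms(100, 4.0));  /* 535.0 ms */
        printf("32 CPUs: %.1f ms\n", group_time_ms(200, 2.0));  /* 702.0 ms */
        return 0;
    }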
For the coarse-grained applications (TSP, ASP, and AB), the performance figures listed in Table 3 show no significant advantage for either of the two protocols. Apparently, the poorer results of the user-space implementation on the latency tests reported in Section 4 do not influence these Orca applications. For the finer-grained applications (RL, SOR, LEQ), the performance results show two effects. First, the kernel-space implementation does better when a lot of
6 Discussion
The paper has shown that user-space communication
protocols can be implemented on top of Amoeba with an
acceptable performance loss. The communication performance is somewhat lower than that of kernel-space protocols.
The great advantage of user-space protocols is increased
flexibility. For example, we have incorporated continuations [7] in our Orca RTS to reduce the context switching
overhead inside the Panda RPC protocol and to save on
stack space for blocking Orca operations. The Amoeba
RPC protocol does not support the asynchronous transmission of the reply message, which leads to an additional
context switch when handling blocking Orca operations.
As another example, we intend to add nonblocking broadcast messages to our system, where the sending thread
does not have to wait for its message to come back from
the sequencer. For some write operations, nonblocking
broadcasts can be used without violating Orca's sequential
consistency semantics. With the Amoeba broadcast protocol this optimization would require modifications to the
kernel.
Our work is related to several other systems. Implementations of user-space protocols are also studied in [12, 16].
These studies use a reliable byte-stream protocol (TCP/IP)
and an unreliable datagram protocol (UDP/IP). Our work is
based on the same motivations. A significant contribution
of our work is that we have written several parallel applications on top of the kernel-space and user-space protocols.
We have used them as example applications to study the
impact on overall performance.
The Mach system also has a user-space protocol for communication between different machines, but this protocol
is implemented by separate servers and consequently has a
high overhead [13]. Protocol decomposition is studied in
the x-kernel [13], which allows protocols to be connected in
a graph, depending on the needs of the application. User-space group communication protocols are studied in the
Horus project [18]. Horus uses a multicast transport service (MUTS) that can run in kernel space or in user space.
Application-specific protocols in user space are supported
by Tempest, which eases the implementation of multiple
coherence protocols for distributed shared memory [9].
The performance of our user-space implementation
could be improved significantly if user-level access to the
network were allowed, since such access would eliminate many system calls. An alternative optimization is
to have the kernel execute user code. Since Orca is a
type-secure language, it is possible to have the kernel execute operations on shared objects in a secure way. Several
groups are working on this idea [17, 8].
Acknowledgements
Raoul Bhoedjang implemented parts of the Orca runtime
system and Panda RPC. Tim Ruhl also implemented parts
of Panda RPC and the Panda system layer. Rutger Hofman implemented Panda's broadcast protocol and helped
in analyzing its performance characteristics. Ceriel Jacobs
wrote the Orca compiler. Frans Kaashoek was one of the
main designers of the Panda system. Kees Verstoep provided valuable assistance when tracking down time spent
inside the Amoeba kernel. We would like to thank Raoul
Bhoedjang, Leendert van Doorn, Saniya Ben Hassen, Rutger Hofman, Ceriel Jacobs, Frans Kaashoek, Tim Ruhl, and
the anonymous referees for their useful comments on the
paper.
References
[1] H.E. Bal. Programming Distributed Systems. Prentice
Hall Int'l, Hemel Hempstead, UK, 1991.
[2] H.E. Bal and M.F. Kaashoek. Object distribution in
Orca using compile-time and run-time techniques. In
Conference on Object-Oriented Programming Systems, Languages and Applications, pages 162–177,
Washington D.C., September 1993.
[3] H.E. Bal, M.F. Kaashoek, and A.S. Tanenbaum. Orca:
A language for parallel programming of distributed
systems. IEEE Transactions on Software Engineering, 18(3):190–205, March 1992.
[4] R. Bhoedjang, T. Ruhl, R. Hofman, K. Langendoen, H.E. Bal, and M.F. Kaashoek. Panda: A
portable platform to support parallel programming
languages. In Symposium on Experiences with Distributed and Multiprocessor Systems, pages 213–226,
September 1993.
[5] A. D. Birrell and B. J. Nelson. Implementing remote
procedure calls. ACM Transactions on Computer Systems, 2(1):39–59, February 1984.
[6] F. Douglis, J.K. Ousterhout, M.F. Kaashoek, and
A.S. Tanenbaum. A comparison of two distributed systems: Amoeba and Sprite. Computing Systems, 4(4), Fall 1991.