
Comparing Kernel-Space and User-Space Communication Protocols on Amoeba

Marco Oey

Koen Langendoen

Henri E. Bal

Dept. of Mathematics and Computer Science


Vrije Universiteit
Amsterdam, The Netherlands

Abstract
Most distributed systems contain protocols for reliable
communication, which are implemented either in the microkernel or in user space. In the latter case, the microkernel provides only low-level, unreliable primitives and the
higher-level protocols are implemented as a library in user
space. This approach is more flexible but potentially less
efficient.
We study the impact on performance of this choice
for RPC and group communication protocols on Amoeba.
An important goal in this paper is to look at overall
system performance. For this purpose, we use several
(communication-intensive) parallel applications written in
Orca.
We look at two implementations of Orca on Amoeba, one
using Amoeba's kernel-space protocols and one using user-space protocols built on top of Amoeba's low-level FLIP
protocol. The results show that comparable performance
can be obtained with user-space protocols.

1 Introduction
Most modern operating systems are based on a microkernel, which provides only the basic functionality of the
system. All other services are implemented in servers that
run in user space. Mechanisms provided by a microkernel
usually include low-level memory management, process
creation, I/O, and communication.
An important design issue for microkernels is which
communication primitives to provide. Some systems provide primitives of a high abstraction level, such as reliable
message passing, Remote Procedure Call [5], or reliable
group communication [10]. An alternative approach, in the
spirit of microkernels, is to provide only low-level, unreliable send and receive primitives, and to build higher-level protocols as a library in user space. The trade-offs are
similar to those of other design choices for microkernels:
implementing functionality in user space is more flexible
but potentially less efficient.
Several researchers have implemented protocols like
TCP/IP and UDP/IP in user space [12, 16]. They describe
many advantages of this approach. For example, it eases
maintenance and debugging, allows the co-existence of
multiple protocols, and makes it possible to use application-specific protocols.

* This research was supported in part by a PIONIER grant from the
Netherlands Organization for Scientific Research (N.W.O.).

In this paper we compare kernel-space and user-space
protocols for the Amoeba distributed operating system [15].
We consider the RPC and group communication protocols
Amoeba provides. One of the main goals of this paper is
to study the impact of user-space protocols on overall system performance. In general, implementing functionality
outside the kernel entails some overhead, but in practice
other factors often dominate and the overhead may well
become negligible [6]. Therefore we study not only the
performance of the protocols, but also that of applications.
Our study is based on parallel applications written in
the Orca programming language [3]. We have built two
implementations of Orca on Amoeba. The first one uses
Amoeba's high-level communication protocols. The second implementation makes calls to Amoeba's low-level
datagram protocol, FLIP [11], and uses its own communication protocols in user space. We have done extensive
performance measurements at different levels of both systems, including the low-level primitives, the high-level protocols, and many Orca applications. These measurements
were done on a large-scale Ethernet-based distributed system. Since most parallel Orca programs do a significant
amount of communication, Orca is a good platform for a
meaningful comparison.
The outline of the rest of the paper is as follows. In Section 2 we present the general structure of our Orca system.
In Section 3, we describe the two Orca implementations
on Amoeba. In Section 4, we discuss the performance of
the communication primitives. In Section 5, we look at
the performance of Orca applications. Finally, Section 6
contains a discussion and conclusions.

2 The Orca/Panda system


Orca is a language for writing parallel applications that
run on distributed-memory systems (collections of workstations and multicomputers) [3]. Its programming model
is based on shared data-objects, which are instances of
abstract data types that can be shared among different processes. The shared data encapsulated in an object can only
be accessed through the operations defined by the abstract
data type. Each operation is applied to a single object and
is guaranteed to be executed indivisibly. The model can
be regarded as an object-based distributed shared memory
with sequential consistency semantics [14].
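
To make the object model concrete, the following C sketch (not taken from the Orca sources; the names int_object and obj_inc are hypothetical) shows the single-node essence of a shared data-object: the encapsulated state is reachable only through an operation that executes indivisibly. The distributed aspects, replication and remote invocation, are handled by the runtime system described below.

    /* Illustrative sketch only: a C analogue of a shared data-object with one
     * operation that must execute indivisibly. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;   /* serializes operations, giving indivisibility */
        int value;              /* the encapsulated shared data */
    } int_object;

    /* An "operation" in the Orca sense: the object is touched only through
     * this function, and the operation appears atomic to all sharers. */
    int obj_inc(int_object *obj, int amount)
    {
        pthread_mutex_lock(&obj->lock);
        obj->value += amount;
        int result = obj->value;
        pthread_mutex_unlock(&obj->lock);
        return result;
    }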

[Figure 1: Structure of the Orca/Panda system. The Orca runtime system runs on Panda, which consists of an interface layer (threads, RPC, group communication) on top of a system layer (threads, communication), which in turn runs on the operating system.]

The Orca runtime system (RTS) takes care of object
management. To reduce communication overhead, it may
decide to store an object on one processor (preferably the
one that accesses the object most frequently) or it may replicate the object and store copies of it on multiple processors.
The decision of which strategy to use is made by the RTS,
based on heuristic information provided by the compiler
[2]. Different strategies may be used for different objects.
In general, the RTS will replicate objects that are expected
to be read frequently (based on the compiler-generated information). Objects with a low (expected) read/write ratio
are stored on a single processor.
The Orca RTS is implemented on top of an abstract machine called Panda [4]. Panda provides threads, Remote
Procedure Call, and group communication. Group communication is totally ordered, which means that all messages
will be received by all processors in the same total order.
The Orca RTS is fully independent of the operating system and architecture. It uses Panda RPC to implement
remote invocations on nonreplicated objects. Read-only
operations on replicated objects are executed locally, without doing any communication. Write operations on replicated objects are implemented by broadcasting the operation and its parameters to each processor holding a copy
of the object, using Panda group communication. Each
processor applies the operation to its local copy. Since the
group communication is totally ordered, all copies remain
consistent.
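
The dispatch logic just described can be summarized in a short sketch. This is not the actual RTS code; the helper names (apply_locally, rts_rpc_invoke, rts_group_broadcast) are placeholders for a local operation, Panda RPC, and Panda totally ordered group communication, respectively.

    /* Hypothetical sketch of how an Orca-style RTS could dispatch an operation,
     * following the placement and replication strategy described above. */
    typedef enum { OBJ_LOCAL, OBJ_REMOTE, OBJ_REPLICATED } obj_state;

    typedef struct {
        obj_state state;
        int       owner;        /* processor holding the single copy, if any */
        void     *local_copy;   /* valid if OBJ_LOCAL or OBJ_REPLICATED      */
    } object_desc;

    extern void apply_locally(void *copy, int op, const void *args, void *result);
    extern void rts_group_broadcast(object_desc *obj, int op, const void *args);
    extern void rts_rpc_invoke(int owner, object_desc *obj, int op,
                               const void *args, void *result);

    void rts_invoke(object_desc *obj, int op, const void *args, void *result,
                    int is_write)
    {
        if (obj->state == OBJ_REPLICATED) {
            if (!is_write) {
                apply_locally(obj->local_copy, op, args, result); /* no communication */
            } else {
                /* Every replica applies the operation in the same total order,
                 * so all copies stay consistent. */
                rts_group_broadcast(obj, op, args);
            }
        } else if (obj->state == OBJ_REMOTE) {
            rts_rpc_invoke(obj->owner, obj, op, args, result);    /* Panda RPC */
        } else {                                                  /* OBJ_LOCAL */
            apply_locally(obj->local_copy, op, args, result);
        }
    }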
Recently, the Orca RTS has been improved to handle
condition synchronization on nonreplicated shared objects
more efficiently. Orca operations may specify that a certain
condition, called a guard, should hold before the operation
is started. If an operation on a remote object blocks, the
Orca RTS no longer blocks the RPC server thread, but
queues a continuation [7] at the object. Later, when another operation modifies the state of the object, the guard
of the blocked operation is checked. If it evaluates to true,
the blocked operation is retrieved from the continuation and
resumed. This optimization reduces the number of blocked
threads and saves a context switch since the thread that
modifies the state sends back the reply of the blocked operation itself instead of waking up a blocked server thread. In
the next section it is shown that only the flexible user-space
protocols can fully exploit this improvement in the Orca
RTS.
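
A minimal sketch of this continuation technique is given below, with hypothetical names; the real RTS also saves the operation's marshalled arguments and the RPC reply handle in the continuation state.

    /* Sketch: instead of blocking a server thread, the RTS records the pending
     * guarded operation at the object and resumes it later from the thread
     * that makes the guard true. */
    #include <stdlib.h>

    typedef struct continuation {
        int  (*guard)(void *obj);                /* nonzero when the operation may run     */
        void (*resume)(void *obj, void *state);  /* performs the operation, sends the reply */
        void  *state;                            /* saved arguments and reply handle        */
        struct continuation *next;
    } continuation;

    typedef struct { continuation *blocked; /* ... object data ... */ } object_t;

    /* Called when a guarded operation cannot proceed: queue it and return,
     * freeing the (pop-up) server thread immediately. */
    void cont_queue(object_t *obj, continuation *c) {
        c->next = obj->blocked;
        obj->blocked = c;
    }

    /* Called after any operation that modifies the object: re-evaluate the
     * guards and resume operations whose guard now holds.  The reply is sent
     * from this thread, saving a context switch. */
    void cont_check(object_t *obj) {
        continuation **p = &obj->blocked;
        while (*p != NULL) {
            continuation *c = *p;
            if (c->guard(obj)) {
                *p = c->next;
                c->resume(obj, c->state);
                free(c);                 /* assumes the continuation was malloc'd */
            } else {
                p = &c->next;
            }
        }
    }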
Internally, Panda consists of two layers (see Figure 1).
The system layer is partly operating system dependent and
provides point-to-point communication and multicast (not

necessarily reliable). The interface layer uses these primitives to provide a higher level interface to the RTS.
An important issue is in which layer the reliability and
ordering semantics should be implemented. Panda has been
designed to be flexible and allows both layers to implement
any of these semantics. For example, on most multicomputers the hardware communication primitives are reliable,
so it is most efficient to have the system layer provide reliable communication. In this case the interface layer will
be simple. If the communication primitives provided by
the system layer are unreliable, the interface layer uses
protocols to make point-to-point communication reliable.
Likewise, Panda's interface layer has a protocol for making
group communication reliable and totally ordered.
The Panda RPC protocol is a 2-way stop-and-wait protocol. The client sends a request message to the server.
The server executes the request and sends back a reply
message, which also acts as an implicit acknowledgement
for the request. Finally, the client needs to send back an
acknowledgement for the reply message. If possible, the
Panda RPC protocol piggybacks this acknowledgement on
another request message, and sends an explicit acknowledgement only after a certain timeout. This optimization is the major
difference from Amoeba's 3-way RPC protocol, which always sends back an explicit acknowledgement message.
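
The following client-side sketch illustrates the 2-way protocol with piggybacked acknowledgements. It is an assumption-laden outline, not Panda's actual code: net_send, net_recv, the ack timer, and the header layout are invented for the example.

    #include <stddef.h>

    enum { FLAG_ACK_PIGGYBACK = 0x1 };

    typedef struct { int seqno; int flags; } rpc_header;

    extern int  next_seqno(void);
    extern void net_send(int dst, const rpc_header *h, const void *data, size_t len);
    extern int  net_recv(int src, rpc_header *h, void *buf, size_t max, int timeout_ms);
    extern void start_ack_timer(int dst, int seqno);   /* sends an explicit ack on expiry */
    extern void cancel_ack_timer(int dst, int seqno);

    static int unacked_reply_seqno = -1;   /* reply we still owe an acknowledgement for */

    void rpc_call(int server, const void *req, size_t req_len,
                  void *rep, size_t rep_max)
    {
        rpc_header h = { next_seqno(), 0 };

        /* Piggyback the acknowledgement of the previous reply on this request,
         * so no separate ack message is needed. */
        if (unacked_reply_seqno >= 0) {
            h.flags |= FLAG_ACK_PIGGYBACK;
            cancel_ack_timer(server, unacked_reply_seqno);
            unacked_reply_seqno = -1;
        }
        net_send(server, &h, req, req_len);

        /* The reply implicitly acknowledges the request; retransmit on timeout. */
        rpc_header rh;
        while (net_recv(server, &rh, rep, rep_max, 100 /* ms */) <= 0)
            net_send(server, &h, req, req_len);

        /* We now owe the server an ack for this reply; it is sent explicitly
         * only if no further request leaves before the timer fires. */
        unacked_reply_seqno = rh.seqno;
        start_ack_timer(server, rh.seqno);
    }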
The protocol for totally ordered group communication
is similar to the Amoeba protocol of Kaashoek [10]. It uses
a sequencer to order all messages. To broadcast a message,
the sender passes the message to the sequencer, which tags
it with the next sequence number and then does the actual
broadcast. The receivers use the sequence numbers to determine if they have missed a message, in which case they
ask the sequencer to send the message again. For this purpose, the sequencer keeps a history of messages that may
not have been delivered at all machines yet. The protocol
has several mechanisms to prevent overflow of the history
buffer. Also, for large messages, both Amoeba and Panda
use a more efficient protocol, in which the senders broadcast
messages themselves and the sequencer broadcasts (small)
acknowledgement messages [10].
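
The core of the sequencer scheme can be sketched as follows. The interface (seq_recv_request, seq_multicast, and so on) is hypothetical, and the real Amoeba and Panda protocols add history flow control and the large-message variant mentioned above.

    #define HISTORY 128

    typedef struct { int seqno; int len; char data[1500]; } bcast_msg;

    extern int  seq_recv_request(bcast_msg *m);            /* point-to-point from a member  */
    extern void seq_multicast(const bcast_msg *m);         /* unreliable hardware multicast */
    extern void request_retransmission(int member, int from, int upto);
    extern void deliver_in_order(const bcast_msg *m);

    void sequencer_loop(void)
    {
        static bcast_msg history[HISTORY];  /* kept for retransmission requests; the      */
        int next_seqno = 0;                 /* real protocol guards against overflow here */

        for (;;) {
            bcast_msg m;
            if (seq_recv_request(&m) <= 0)
                continue;
            m.seqno = next_seqno++;              /* tag with the global order    */
            history[m.seqno % HISTORY] = m;      /* remember for possible resends */
            seq_multicast(&m);                   /* the actual broadcast          */
        }
    }

    /* Receiver side: gaps in the sequence numbers reveal missed messages. */
    void receiver_handle(const bcast_msg *m, int *expected, int my_id)
    {
        if (m->seqno == *expected) {
            deliver_in_order(m);
            (*expected)++;
        } else if (m->seqno > *expected) {
            request_retransmission(my_id, *expected, m->seqno);  /* ask the sequencer */
        }
        /* messages with seqno < *expected are duplicates and are simply dropped */
    }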
The Orca/Panda system was originally developed on top
of SunOS running on a collection of Sun-4 workstations [4].
The system has been ported to several operating systems
(including Amoeba, Solaris, Horus, and Parix) and machines (including the CM-5 and Parsytec transputer grid).
In this paper, we will only look at the Amoeba implementations.

3 Kernel-space and user-space communication protocols


Panda has been implemented in two different ways on
Amoeba. The first implementation uses Amoeba's kernel-space RPC and group communication protocols. The second implementation uses Amoeba's low-level, unreliable
communication primitives and runs the Panda protocols on
top of them in user space. Both implementations use
Amoeba threads, which are created and scheduled (preemptively) by the kernel. Amoeba provides mutexes for synchronization between the threads of one process; Panda
uses them to implement its own mutexes and condition variables,
which are provided to the Panda user (i.e., the Orca RTS).
This section describes both implementations of Panda.

[Figure 2: Two implementations of Panda on Amoeba. In the kernel-space implementation, the interface layer consists of wrapper routines on top of the Amoeba microkernel's RPC, group, and FLIP protocols, and the system layer contains no communication code. In the user-space implementation, the Panda RPC and group communication protocols form the interface layer, on top of a system layer providing unreliable communication via the kernel's FLIP protocol.]

3.1 Kernel-space protocols


Our first implementation of Panda on Amoeba uses the
Amoeba RPC and totally ordered group communication
protocols to implement the corresponding Panda primitives.
The goal of this implementation is to wrap the Amoeba
primitives to make them directly usable by the Orca RTS.
The communication primitives provided by Amoeba differ slightly from the Panda primitives. Amoeba expects a
server thread to block waiting for a request. Server threads
in Amoeba therefore repeatedly call get_request to wait for
the next request. Panda, on the other hand, is based on
implicit message receipt, in which a new (pop-up) thread
is created for each incoming RPC request. The difference
between the two models, however, is easy to overcome.
The Panda model is implemented on top of the Amoeba
model by using daemon threads. An RPC daemon waits
for an incoming RPC request (using get_request) and then
does an upcall to a procedure that handles the request and
computes the reply. A restriction of Amoeba's RPC primitives is that the reply message has to be sent back by the
same thread that issued the get_request primitive. As explained in Section 2, the Orca RTS may send back a reply
message from a different thread when handling guarded operations. The Panda RPC interface function pan_rpc_reply
is implemented on top of the Amoeba kernel by signaling
the original thread so it can send back the message using
the put_reply system call. Note that this solution, which
works around the inflexible kernel RPC, undoes the Orca
RTS optimizations and re-introduces an additional context
switch and increased memory usage because of the blocked
server thread.
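
The sketch below outlines this wrapper approach. The Amoeba primitives are shown with simplified, assumed signatures (the real get_request and put_reply take capability and header arguments), and POSIX threads stand in for Amoeba's kernel threads; the rest of the names are hypothetical.

    #include <pthread.h>

    typedef struct { char buf[8192]; int len; } message;

    extern int  get_request(message *req);        /* Amoeba: block for next request */
    extern void put_reply(const message *rep);    /* Amoeba: must be same thread    */
    extern void rts_handle_request(const message *req);   /* Orca RTS upcall        */

    static pthread_mutex_t reply_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  reply_ready = PTHREAD_COND_INITIALIZER;
    static message pending_reply;
    static int     have_reply;

    /* RPC daemon: emulates Panda's implicit message receipt on top of Amoeba's
     * explicit get_request/put_reply model. */
    void *rpc_daemon(void *arg)
    {
        (void)arg;
        message req;
        for (;;) {
            if (get_request(&req) < 0)
                continue;
            rts_handle_request(&req);          /* upcall; may block on a guard */

            /* Wait until some thread supplies the reply via pan_rpc_reply(); only
             * this thread may call put_reply, hence the extra context switch. */
            pthread_mutex_lock(&reply_lock);
            while (!have_reply)
                pthread_cond_wait(&reply_ready, &reply_lock);
            have_reply = 0;
            pthread_mutex_unlock(&reply_lock);

            put_reply(&pending_reply);
        }
    }

    /* Called by whichever thread completes the (possibly guarded) operation. */
    void pan_rpc_reply(const message *rep)
    {
        pthread_mutex_lock(&reply_lock);
        pending_reply = *rep;
        have_reply = 1;
        pthread_cond_signal(&reply_ready);     /* wake the daemon to send the reply */
        pthread_mutex_unlock(&reply_lock);
    }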
The structure of this implementation is shown in the left
part of Figure 2. The Panda system layer does not contain
any code related to communication. The interface layer
mainly consists of wrapper routines that make the Amoeba
primitives look like the Panda primitives. The overhead
introduced by this is small. We refer to this system as the
kernel-space implementation.

3.2 User-space protocols


A radically different implementation of Panda is to bypass the Amoeba protocols and directly access the low-level
communication primitives (see the right part of Figure 2).
These primitives are provided by FLIP (Fast Local Internet Protocol), which is a network layer protocol. FLIP
differs from the more familiar Internet Protocol (IP) in
many important ways. In particular, FLIP supports location transparency, group communication, large messages,
and security [11]. FLIP offers unreliable communication
primitives for sending a message to one process (unicast)
or a group of processes (multicast).
For this implementation, we have used Panda's RPC
and group-communication protocols described in Section 2.
These protocols were initially developed on UNIX (see
Section 2). They remained unchanged during the port to
Amoeba. We only had to implement Panda's system-level
primitives on top of FLIP. All modifications to the system
thus were localized in the system-dependent part of Panda
(see Figure 2). In summary, this implementation treats
Amoeba as a system providing unreliable and unordered
communication. We refer to this system as the user-space
implementation.
The system layer uses one daemon thread to receive
FLIP packets. FLIP fragments large messages, so the daemon reassembles packets into a complete message, checks
if the message is to be delivered to the group or RPC module, and makes an upcall to the corresponding handler in
the interface layer. These handlers run their protocols (e.g.,
to order group messages) and make upcalls to the message-handler procedures registered by the Panda user. Panda requires these upcall procedures to run to completion quickly
such that messages can be processed completely by the system-level daemon without context switching to intermediate
threads. The Orca RTS uses continuations to handle long-term blocking of operation invocations (see Section 2), so
its message handlers fulfill this requirement (since Orca
operations normally take little time).
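
A sketch of such a receive daemon is shown below. The FLIP receive call and the packet header are simplified assumptions; the point is that a complete message is dispatched and handled by upcalls from this single thread.

    #include <stddef.h>

    enum { MOD_RPC, MOD_GROUP };

    typedef struct { int module; int msg_id; int frag; int last; } pkt_header;

    extern int  flip_recv(pkt_header *h, void *buf, size_t max);        /* one packet     */
    extern int  reassemble(const pkt_header *h, const void *frag, int len,
                           void **msg, size_t *msg_len);                /* 1 when complete */
    extern void rpc_upcall(const void *msg, size_t len);   /* interface-layer handlers:    */
    extern void grp_upcall(const void *msg, size_t len);   /* run protocol, then user upcall */

    void *receive_daemon(void *arg)
    {
        (void)arg;
        char frag[1500];
        for (;;) {
            pkt_header h;
            int len = flip_recv(&h, frag, sizeof frag);
            if (len <= 0)
                continue;

            void *msg; size_t msg_len;
            if (!reassemble(&h, frag, len, &msg, &msg_len))
                continue;                      /* wait for the remaining fragments */

            /* Handlers must run to completion quickly, so the whole message is
             * processed here without switching to an intermediate thread. */
            if (h.module == MOD_RPC)
                rpc_upcall(msg, msg_len);
            else
                grp_upcall(msg, msg_len);
        }
    }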
Unlike the kernel-space RPC protocol, no workarounds
are needed to support the asynchronous pan_rpc_reply call.
The flexible user-space protocol can take full advantage
of the Orca RTS optimizations. In comparison to a previous Panda version that included daemon threads at the
interface layer to support blocking Orca message handlers,
the latency of RPC and group messages has dropped by
300 µs. The Amoeba high-level RPC and group protocols
could not take advantage of the change in the Orca RTS and
still incur a context switch when handling blocked object
invocations.

4 Performance of the communication protocols


We measured the performance of the protocols on
Amoeba version 5.2, running on SPARC processors. Each
processor runs at 50 MHz and is on a single board (a
Tsunami board, equivalent to the Sun Classic), which contains 32 MB of local memory. The boards are connected by
a 10 Mbit/s Ethernet. Below, we describe the performance
of the Panda system-layer primitives and of the RPC and
group communication protocols. All reported measurements are average values of 10 runs with little variation:
less than 1%.

4.1 Performance of the system-layer primitives


Table 1 shows the latency for the unicast and multicast
primitives that are provided by the Panda system layer in
the user-space implementation (see Figure 2). These library
routines do system calls to invoke the corresponding FLIP
primitives. All measurements are from user process to user
process.

message    unicast    multicast    RPC                    group
size       (user)     (user)       user       kernel      user       kernel
0 Kb       0.53 ms    0.62 ms      1.56 ms    1.27 ms     1.67 ms    1.44 ms
1 Kb       1.50 ms    1.58 ms      2.53 ms    2.23 ms     3.59 ms    3.38 ms
2 Kb       2.50 ms    2.55 ms      3.60 ms    3.40 ms     3.67 ms    3.44 ms
3 Kb       3.72 ms    3.74 ms      4.77 ms    4.48 ms     4.84 ms    4.56 ms
4 Kb       4.18 ms    4.23 ms      5.27 ms    5.06 ms     5.35 ms    5.25 ms

Table 1: Communication Latencies.


The two primitives are almost equally expensive, because Ethernet provides multicast in hardware. Unicast
latency was measured with a simple pingpong program,
in which two machines repeatedly send messages of various
sizes to each other. Multicast latency was measured with a
similar program in which two machines repeatedly multicast a message to the group on receipt of a message from
the other machine. Since the replies are sent directly from
within the upcall issued by the Panda layer, the latency figures do not include any context switching overhead. This
holds for both the unicast and multicast latency.
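
For reference, a ping-pong latency test of this kind can be sketched as follows; the uni_send/uni_recv interface is hypothetical and merely stands in for the Panda system-layer unicast used in the measurements.

    #include <sys/time.h>

    extern void uni_send(int peer, const void *buf, int len);
    extern int  uni_recv(void *buf, int max);          /* blocking receive */

    double pingpong_latency_ms(int peer, int msg_len, int rounds)
    {
        static char buf[4096];
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < rounds; i++) {
            uni_send(peer, buf, msg_len);              /* the peer echoes every message */
            uni_recv(buf, sizeof buf);
        }
        gettimeofday(&t1, NULL);

        double elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                            (t1.tv_usec - t0.tv_usec) / 1e3;
        return elapsed_ms / (2.0 * rounds);            /* one-way time per message */
    }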
The nonlinear relation between latency and message
length is due to the fragmentation performed by the low-level FLIP primitives in the Amoeba kernel. Messages
are broken down into maximum-length Ethernet packets
of 1500 bytes, which are reassembled in user space at the
receiving side. Receiving a packet incurs some overhead
cost, as can be seen from the difference in latency between
2 Kb, 3 Kb, and 4 Kb messages. A 2 Kb message can be
transmitted in two packets, while both 3 Kb and 4 Kb messages take three packets, hence the relatively small latency
difference between sending 3 Kb and 4 Kb messages.
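
A sender-side fragmentation sketch makes these packet counts explicit. It assumes, for simplicity, 1500-byte fragments and a hypothetical eth_send call; the real FLIP header reduces the payload per packet slightly.

    #include <stddef.h>

    #define FRAG_SIZE 1500

    typedef struct { int msg_id; int frag; int last; } frag_header;

    extern void eth_send(int dst, const frag_header *h, const void *data, size_t len);

    void send_fragmented(int dst, int msg_id, const char *msg, size_t len)
    {
        size_t nfrags = (len + FRAG_SIZE - 1) / FRAG_SIZE;  /* e.g. 2 Kb -> 2, 3 Kb and 4 Kb -> 3 */
        if (nfrags == 0)
            nfrags = 1;                                     /* an empty message still sends one packet */

        for (size_t i = 0; i < nfrags; i++) {
            size_t off   = i * FRAG_SIZE;
            size_t chunk = (len - off > FRAG_SIZE) ? FRAG_SIZE : len - off;
            frag_header h = { msg_id, (int)i, i == nfrags - 1 };
            eth_send(dst, &h, msg + off, chunk);
        }
    }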

4.2 Performance of the Remote Procedure Call protocols

RPC latency (see Table 1) was measured using RPC
requests of various sizes and empty reply messages. The
RPC throughput (see Table 2) was measured by sending
requests of 8000 bytes each and sending back empty replies.
As can be seen, the kernel-space implementation has a
lower RPC latency and, consequently, a higher throughput
than the user-space implementation. For null messages,
for example, the kernel-space protocols are 0.3 ms faster
(1.27 ms versus 1.57 ms). We think it is very important
to determine how much of this difference is due to fundamental limits of a user-space protocol and how much is
due to implementation differences between Amoeba and
Panda. Therefore, we give an analysis of this difference.
The Amoeba and Panda protocols are both tuned fairly well,
although more effort was spent on optimizing the Amoeba
protocol.
The most important difference between the two implementations is that the user-space protocol incurs additional
context switches, even though Panda makes upcalls directly
from within the system-level receive daemon. The user-space protocol needs extra context switches at the client
side when handling the reply message. First the system
level daemon has to be scheduled when the reply arrives,
and then the reply has to be passed on to the client thread.

         user-space    kernel-space
RPC      825 Kb/s      897 Kb/s
group    941 Kb/s      941 Kb/s

Table 2: Communication Throughputs.

Hence, two context switches are needed to process the reply message. With the kernel-space protocol, on the other
hand, Amoeba immediately delivers the reply message to
the blocked client thread; no context switches are needed
since no other thread was scheduled between sending the
request and receiving the reply. We measured inside the
Amoeba kernel that the total overhead of the two context
switches is about 140 µs, which already explains half the
difference in latency between the user-space and kernel-space implementations.
Thread handling in Amoeba is expensive since only kernel-level threads are provided, so each operation that involves threads (e.g., signaling) potentially has to go through
the kernel. Consequently, the user-space implementation
needs four additional crossings between kernel and user
address spaces for each RPC: two for waking up the client
thread, and two for switching from the daemon thread to
the client thread. The costs of an address space crossing
are not fixed but depend on the depth of the call stack, because the kernel has to save and restore register windows.
Our SPARC processors use six register windows of fixed
size. A new window is allocated during each procedure
call. When the user invokes a system call, the Amoeba
kernel first saves all register windows in use, performs the
system call, and then restores the single topmost register
window, before returning to user space. When the user
program continues, the register windows deeper in the call
stack are faulted in through underflow traps. These traps
are handled in software by the operating system, hence they
are rather expensive: about 6 µs per trap.
The user-space implementation suffers from the
Amoeba policy of restoring only the topmost register window. When the daemon thread enters the kernel to signal
the client thread about the arrival of its reply message, the
daemon thread's stack is using all register windows. Consequently, the daemon suffers six additional underflow traps
when returning down the call stack after the system call has
finished. The combined overhead of crossing the address
space boundary and underflow traps is about 50 µs. At the
server machine, both the user-space and kernel-space implementations cause one context switch and two address space
crossings.
A related difference that has little to do with running protocols in user space or kernel space is due to the fact that (for
software engineering reasons) procedure calls in Panda are
more deeply nested than in Amoeba. This not only results in
more function call overhead for the user-space implementation, but also causes extra register window overflow and
underflow traps. An important case where the user-space
implementation suffers from additional overflow traps is
at the server side of the RPC latency test. When sending
back the reply, the user-space implementation goes down
the protocol layers by stacking additional procedure invocations, hence generating overflow traps. In contrast, the
kernel-space implementation just records a pointer to the
reply message, returns from the protocol stack because it
has finished processing the request, and finally at the bottom layer sends back the reply by invoking the mandatory
put_reply system call.
Another difference is that the user-space implementation
suffers from more locking overhead. Profiling data shows
that it does seven times more lock() calls than the kernel-space implementation. The kernel-space protocol rarely
needs to lock shared data, because internal kernel threads
are scheduled nonpreemptively in Amoeba. Fortunately,
acquiring and releasing locks in user space can be done
cheaply if no other thread is holding the lock or waiting for
it. Therefore the overhead is negligible in comparison to
context switching and trapping costs.
Two more differences have a noticeable impact on the
latency. First, the user-space implementation uses slightly
larger headers (64 bytes vs. 56 bytes), which amounts to
2 × 8 = 16 µs latency on the Ethernet. Second, the user-space implementation includes portable fragmentation code
to handle large messages that are broken up into small fragments. The kernel-space implementation relies on the FLIP
layer in the Amoeba kernel to do fragmentation. Note that
the user-space implementation therefore includes two layers that provide fragmentation, which results in an overhead
of about 20 µs per message, i.e., 40 µs per RPC. At the moment it is impossible to leave out the RPC fragmentation
code without completely changing the software structure
of Panda.
In summary, the difference in performance has as its
only essential component two context switches, accounting
for 140 µs. The fact that Amoeba currently provides only
kernel threads causes an additional overhead of 50 µs for
register window traps and address space crossings. Increased functionality (fragmentation) takes 40 µs. The
larger header size of the user-space RPC protocol is an implementation detail; it accounts for 16 µs. This leaves a
gap of 54 µs in performance difference. This can largely
be attributed to the fact that the Amoeba extensions that
make the low-level FLIP interface available to the user-space implementation have not yet been optimized: for
instance, user-to-kernel address translation can be sped up
considerably.

4.3 Performance of the group communication protocols

Group communication latency is measured by creating
a group of two members and having one of them send
group messages. The sending member waits until it gets
its own message back from the sequencer (which is on
the other processor). We used messages of different sizes
for measuring the latency, see Table 1. The difference
between user-space and kernel-space latency is about 0.23
ms. Throughput is measured by creating a group of multiple
members sending messages of 8000 bytes in parallel. This
effectively saturates the Ethernet, so the user-space and kernel-space implementations achieve equal throughput.
In contrast to the RPC protocol, the group protocol in
user space does not incur an additional context switch at
the client machine. Both the kernel-space and user-space
implementations receive the ordered message in a daemon
thread, which signals the client thread that it can send the
next message. Therefore both implementations have a context switch in the critical path at the machine of the sending
member. There is, however, a subtle difference that causes
40 µs overhead in the case of the user-space implementation. When the ordered message from the sequencer arrives,
the client thread is sleeping on a condition variable and has
to be notified by the daemon thread. This requires a system
call and causes a number of underflow traps when returning from the kernel. In the kernel-space implementation the
grp_send call is blocking in the sense that the calling thread
is suspended until the message has returned from the sequencer. Hence, unblocking the sleeping client thread does
not require an expensive address space crossing.
Like the RPC protocol, the user-space group communication protocol includes code to fragment and reassemble
large messages even though the FLIP interface of Amoeba
is already capable of doing so. Note that double fragmentation occurs only at the sending member, because the sequencer is written to order group messages at the fragment
level. Therefore the user-space group protocol incurs only
a 20 µs overhead, whereas the RPC implementation lost
40 µs to fragment both the request and reply message. In
total, the user-space group protocol incurs 60 µs overhead
at the sending member machine; the remaining overhead is
incurred at the sequencer machine.
In the case of the kernel-space implementation, the sequencer runs entirely inside the Amoeba kernel, so no time
is wasted in crossing the user-kernel address space boundary. In the case of the user-space implementation, however,
each message has to cross this boundary twice. The sequencer issues two system calls: one to fetch a message
from the network and another to multicast the message
including the sequence number describing the global ordering. These additional system calls increase the latency
by about 40 µs, because of address space crossings and
additional underflow traps. Another source of overhead
is that both the unordered incoming message and ordered
outgoing message have to be transferred from one address
space to the other. This requires additional user-to-kernel
address translations and data copying in comparison to the
kernel-space implementation. The extra overhead accounts
for 30 µs.
Another major difference between the user-space and
kernel-space group protocols is that the Amoeba group code
is invoked from within the (software) interrupt handler,
whereas the sequencer in the user-space implementation is
a separate thread. Consequently, the user-space implementation does an additional thread switch, which takes about
110 µs. This switch is so expensive because the interrupt
handler first runs to completion, then the scheduler is invoked, and finally the context of the current thread can be
saved, so the sequencer thread can be resumed. At the
sequencer node, an extra thread runs to deliver the group
message to the user. Since this thread has run last to deliver
the previous message, a full context switch is needed. The
performance of the user-space protocol can be improved
by running the sequencer at a dedicated machine such that
no member thread needs to run to process group messages.
This effectively reduces the context switch time to 60 µs,
since the sequencer context is still loaded when a message
arrives.
The user-space implementation performs better when
considering the Ethernet network latency because it works
with small headers of 40 bytes, whereas the kernel-space
implementation prepends each data message with a 52 byte
header. At this point the user-space implementation saves
about 2 × 12 = 24 µs per sequenced message.
In summary, the difference in performance of the group
protocols has as its essential components one context
switch, accounting for 110 µs, and one address space crossing, accounting for 40 µs. Thus the total essential overhead
for the user-space implementation is 150 µs. The fact that
Amoeba only provides kernel threads causes an additional
overhead of 50 µs for register window traps and address
space crossings at the sending member machine. Increased
functionality (fragmentation) takes another 20 µs, but the
smaller header size yields an improvement of 24 µs. In
total, this leaves a gap of about 30 µs in performance difference. This can again be attributed to the untuned part of
the Amoeba kernel making the FLIP interface available to
user programs.

5 Performance of parallel applications


In this section we give performance measurements of six
parallel Orca applications on Amoeba. The measurements
were done on a processor pool of SPARC processors as
described in the previous section. The pool consists of
several Ethernet segments connected by an Ethernet switch.
Each segment connects eight processors by a 10 Mbit/sec
Ethernet.
The six Orca applications selected for the measurements
together solve a wide range of problems with different characteristics. Some of the applications are described in [1].
The applications were run on an Orca implementation using kernel-space communication protocols, and on an Orca
implementation using user-space protocols. The two implementations use exactly the same runtime system and
compiler.
The absolute execution times of the applications and
maximum speedup (relative to the single-processor case)
are given in Table 3. (Most applications achieve higher
speedups for larger input problems, but this is not relevant
for the current comparison.) For each application and for
both protocols we give the elapsed time on different numbers of processors. The reported measurements are the
mean values obtained in 10 runs. The variation was less
than 0.2%.
To make a fair comparison between both implementations, we have gone through considerable effort to eliminate
caching effects. Caching effects played a significant role
in earlier measurements since the SPARC processors are
equipped with small direct-mapped caches: a 4 Kb instruction cache and a 2 Kb data cache. We have observed a factor
of two difference in performance due to conflicts in the
instruction cache. By linking the code fragments common
to both Orca implementations (e.g., the runtime system)
at fixed positions, we succeeded in achieving almost equal
execution times for both implementations when running on
a single processor. The measured differences are typically
quite small: less than 2%.
For the Travelling Salesman Problem (TSP), the kernel-space implementation marginally outperforms the user-space implementation. Both implementations achieve superlinear speedups because of a different search order and
a reduction in cache misses in the data cache. The coarse-grained nature of this application limits the advantage of the
kernel-space implementation's communication primitives.
The frequently accessed data object holding the shortest
path is replicated by the Orca RTS, so it can be read locally.
The only communication that takes place is needed for operations to fetch jobs from a central queue object, but the
number of jobs is small: 2184.
The figures for the All-Pairs Shortest Paths program
(ASP) also show only a marginal difference between the
user-space and kernel-space protocols. Again this is a consequence of the infrequent usage of communication; the
program sends 768 group messages to coordinate an iterative process. The moderate speedup is caused by the
relatively high latency that each group message of 3200
bytes incurs: about 5 ms per message, which sums to a
delay of almost 4 seconds. The summed latency difference
between the user-space and kernel-space implementations
is 200 ms, which is reflected in the slightly better speedup
attained by the kernel-space implementation.
The Alpha-Beta Search program (AB) has also been
written in a coarse-grained style and does not communicate
a lot. The poor speedups are caused by the search overhead
the parallel algorithm incurs; efficient pruning in parallel
α-β search is a known hard problem.
The Region Labeling (RL) and Successive Overrelaxation (SOR) programs are both finite-element methods that
iteratively update all elements of a matrix. At the end
of each iteration, processors exchange boundary elements
with their neighbors by means of shared buffer objects.
This exchange causes the kernel-space implementation to
perform worse than the user-space implementation if the grain
size of the application is small. For example, Region Labeling on the kernel-space implementation takes six seconds longer to run to completion on 32 processors than
its user-space counterpart. The kernel-space implementation suffers from an additional context switch per remote
guarded BufGet operation that blocks until the buffer is
filled by its owning processor. Likewise the BufPut operation blocks if the buffer is full. The kernel-space implementation needs an additional context switch because
the Amoeba RPC implementation demands that a matching get_request and put_reply are issued by the same thread
(see Section 3.1), while the operation that sets the guard to
true is necessarily executed by another thread.
The execution times for the Region Labeling and Successive Overrelaxation programs show that the performance
of both implementations flattens out at 16 processors. This
behavior is caused by the limited bandwidth of the Ethernet; apparently, on both implementations the programs
cause network saturation for a large number of processors
and therefore achieve poor speedups.

Orca Application               Implementation          1     8    16    32   max. speedup
Travelling Salesman Problem    Kernel-space          790    87    44    23       34.2
                               User-space            783    92    46    24       32.2
All-Pairs Shortest Paths       Kernel-space          213    30    17    11       18.4
                               User-space            216    31    18    11       19.0
Alpha-Beta Search              Kernel-space          565   106    78    60        9.3
                               User-space            567   106    78    59        9.5
Region Labeling                Kernel-space          759   132   115   114        6.6
                               User-space            767   133   119   108        7.1
Successive Overrelaxation      Kernel-space          118    20    14    13        9.0
                               User-space            118    19    13    11       10.3
Linear Equation Solver         Kernel-space          521   102    91   127        5.7
                               User-space            527   113   112   164        4.7
                               User-space-dedicated  527   116    94   128        5.6

Table 3: Performance of Orca applications; execution times [sec] and max. speedup.

The Linear Equation Solver (LEQ) is the only application that shows a clear advantage for the kernel-space
protocol. The poor performance on the user-space implementation is due to the sequencer's machine. This processor handles broadcast requests from all other machines,
but it must also process all incoming update messages and
run an Orca process (as all machines do). With 32 processors, this machine becomes overloaded and slows down the
iterative application. The solution is to dedicate one processor to just sequencing; this not only reduces the latency
of a group message by 50 µs (see Section 4.3), but also
avoids the overhead of preempting the Orca process at the
sequencer for each incoming message. The kernel-space
implementation does not suffer from the overhead of context switching between two threads in user space, because the sequencer is
run as part of the interrupt handler inside the kernel.
The row labeled "User-space-dedicated" lists the timings for the user-space implementation that sacrifices one
processor to run the sequencer part of the group protocol
only. On eight processors the loss of one worker process
does not outweigh the benefits, but at 16 and 32 processors
the dedicated sequencer clearly pays off. For example, on
16 processors the 15 workers now solve the linear equation in 94 seconds instead of 112 seconds. Note that the
execution times increase when going from 16 processors
to 32 for all three implementations. This is a consequence
of the parallel algorithm that now sends twice the number
of group messages of half the size. The decrease in computation time does not outweigh the increase in message
handling overhead.
For the coarse-grained applications (TSP, ASP, and AB),
the performance figures listed in Table 3 show no significant advantage for either of the two protocols. Apparently, the
poorer results for the user-space implementation on the latency tests reported in Section 4 do not influence these Orca
applications. For the finer-grained applications (RL, SOR,
LEQ), the performance results show two effects. First,
the kernel-space implementation does better when a lot of
group communication occurs, because the sequencer in the
user-space implementation is overloaded due to additional
context switches. Second, the user-space implementation
does better when a lot of remote object invocations block
in guarded operations, because the kernel-space implementation needs an additional context switch in this case to
conform to Amoeba's RPC requirements.

6 Discussion
The paper has shown that user-space communication
protocols can be implemented on top of Amoeba with an
acceptable performance loss. The communication performance is somewhat lower than that of kernel-space protocols.
The great advantage of user-space protocols is increased
flexibility. For example, we have incorporated continuations [7] in our Orca RTS to reduce the context switching
overhead inside the Panda RPC protocol and to save on
stack space for blocking Orca operations. The Amoeba
RPC protocol does not support the asynchronous transmission of the reply message, which leads to an additional
context switch when handling blocking Orca operations.
As another example, we intend to add nonblocking broadcast messages to our system, where the sending thread
does not have to wait for its message to come back from
the sequencer. For some write-operations, nonblocking
broadcasts can be used without violating Orca's sequential
consistency semantics. With the Amoeba broadcast protocol this optimization would require modifications to the
kernel.
Our work is related to several other systems. Implementations of user-space protocols are also studied in [12, 16].
These studies use a reliable byte-stream protocol (TCP/IP)
and an unreliable datagram protocol (UDP/IP). Our work is
based on the same motivations. A significant contribution
of our work is that we have written several parallel applications on top of the kernel-space and user-space protocols.
We have used them as example applications to study the
impact on overall performance.

The Mach system also has a user-space protocol for communication between different machines, but this protocol
is implemented by separate servers and consequently has a
high overhead [13]. Protocol decomposition is studied in
the x-kernel [13], which allows protocols to be connected in
a graph, depending on the needs of the application. User-space group communication protocols are studied in the
Horus project [18]. Horus uses a multicast transport service (MUTS) that can run in kernel space or in user space.
Application-specific protocols in user-space are supported
by Tempest, which eases the implementation of multiple
coherence protocols for distributed shared memory [9].
The performance of our user-space implementation
could be improved significantly if user-level access to the
network were allowed, since such access would eliminate many system calls. An alternative optimization is
to have the kernel execute user code. Since Orca is a
type-secure language, it is possible to have the kernel execute operations on shared objects in a secure way. Several
groups are working on this idea [17, 8].

Acknowledgements
Raoul Bhoedjang implemented parts of the Orca runtime
system and Panda RPC. Tim Ruhl also implemented parts
of Panda RPC and the Panda system layer. Rutger Hofman implemented Panda's broadcast protocol, and helped
in analyzing its performance characteristics. Ceriel Jacobs
wrote the Orca compiler. Frans Kaashoek was one of the
main designers of the Panda system. Kees Verstoep provided valuable assistance when tracking down time spent
inside the Amoeba kernel. We would like to thank Raoul
Bhoedjang, Leendert van Doorn, Saniya Ben Hassen, Rutger Hofman, Ceriel Jacobs, Frans Kaashoek, Tim Ruhl, and
the anonymous referees for their useful comments on the
paper.

References
[1] H.E. Bal. Programming Distributed Systems. Prentice Hall Int'l, Hemel Hempstead, UK, 1991.

[2] H.E. Bal and M.F. Kaashoek. Object distribution in Orca using compile-time and run-time techniques. In Conference on Object-Oriented Programming Systems, Languages and Applications, pages 162-177, Washington D.C., September 1993.

[3] H.E. Bal, M.F. Kaashoek, and A.S. Tanenbaum. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering, 18(3):190-205, March 1992.

[4] R. Bhoedjang, T. Ruhl, R. Hofman, K. Langendoen, H.E. Bal, and M.F. Kaashoek. Panda: A portable platform to support parallel programming languages. In Symposium on Experiences with Distributed and Multiprocessor Systems, pages 213-226, September 1993.

[5] A.D. Birrell and B.J. Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems, 2(1):39-59, February 1984.

[6] F. Douglis, J.K. Ousterhout, M.F. Kaashoek, and A.S. Tanenbaum. A comparison of two distributed systems: Amoeba and Sprite. Computing Systems, 4(4):353-384, 1991.

[7] R.P. Draves, B.N. Bershad, R.F. Rashid, and R.W. Dean. Using continuations to implement thread management and communication in operating systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 122-136. ACM SIGOPS, October 1991.

[8] D. Engler, M.F. Kaashoek, and J. O'Toole. The operating system kernel as a secure programmable machine. In Proceedings of the 6th SIGOPS European Workshop, Wadern, Germany, September 1994. ACM SIGOPS.

[9] B. Falsafi, A.R. LeBeck, S.K. Reinhardt, I. Schoinas, M.D. Hill, J.R. Larus, A. Rogers, and D.A. Wood. Application-specific protocols for user-level shared memory. In Supercomputing '94, pages 380-389, November 1994.

[10] M.F. Kaashoek. Group Communication in Distributed Computer Systems. PhD thesis, Vrije Universiteit, Amsterdam, December 1992.

[11] M.F. Kaashoek, R. van Renesse, H. van Staveren, and A.S. Tanenbaum. FLIP: An internet protocol for supporting distributed systems. ACM Transactions on Computer Systems, 11(1):73-106, January 1993.

[12] C. Maeda and B.N. Bershad. Protocol service decomposition for high-performance networking. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, pages 244-255. ACM SIGOPS, December 1993.

[13] L.L. Peterson, N. Hutchinson, S. O'Malley, and H. Rao. The x-kernel: A platform for accessing internet resources. IEEE Computer, 23(5):23-33, May 1990.

[14] A.S. Tanenbaum, M.F. Kaashoek, and H.E. Bal. Parallel programming using shared objects and broadcasting. IEEE Computer, 25(8):10-19, August 1992.

[15] A.S. Tanenbaum, R. van Renesse, H. van Staveren, G.J. Sharp, S.J. Mullender, A.J. Jansen, and G. van Rossum. Experiences with the Amoeba distributed operating system. Communications of the ACM, 33(12):46-63, December 1990.

[16] C.A. Thekkath, T.D. Nguyen, E. Moy, and E.D. Lazowska. Implementing network protocols at user level. In Proceedings of the SIGCOMM '93 Symposium, September 1993.

[17] L. van Doorn and A.S. Tanenbaum. Using active messages to support shared objects. In Proceedings of the 6th SIGOPS European Workshop, Wadern, Germany, September 1994. ACM SIGOPS.

[18] R. van Renesse, K. Birman, R. Cooper, B. Glade, and P. Stephenson. The Horus system. In K.P. Birman and R. van Renesse, editors, Reliable Distributed Computing with the Isis Toolkit, pages 133-147. IEEE Computer Society Press, September 1993.
