
Using the Stream Control Transmission Protocol and Multi-Core Processors to Improve the Performance of Web Servers

Vlad Olaru, Mugurel Andreica, Nicolae Tapus
Computer Science Department, Politehnica University of Bucharest, Romania
{vlad, mugurel.andreica, nicolae.tapus}@cs.pub.ro

Abstract
This paper presents the design of a Web server using multi-core processors and the Stream Control Transmission Protocol (SCTP) as a transport-level protocol for HTTP. The multi-threaded server design takes advantage of the underlying multi-core architecture by defining stream scheduling policies that attempt to improve the performance of the servicing threads. The server has been implemented by modifying an existing, simple Web server called NullHttpd [19]. The paper presents the performance evaluation of the server and underlines the advantages of using SCTP as a transport protocol and of defining SCTP stream scheduling policies. The reported results show that SCTP outperforms TCP as an HTTP transport protocol for a Web server running on multi-core processors.

1 Introduction
The use of multi-core processors on a large scale is about to change the landscape of computer science, as advocated by many researchers [2]. Writing parallel programs has always been a challenging task, and now every computer shipped out of the factory is a small parallel computer that needs parallel (or rather multi-core aware) code. Writing such programs is further complicated by recent shifts in multi-core processor design that propose radical changes to the current, widely used SMP (UMA) model. For instance, ccNUMA processors are already on the market, but few programs take advantage of this new CPU memory architecture beyond the default support offered by the underlying operating system. Performance-asymmetric multi-core architectures [12] will also have an important impact on software development. Such processors accommodate on the same chip several cores with the same instruction set but different hardware characteristics (different clock speeds, different instruction issue widths, in- or out-of-order execution, etc.). Therefore, there is a clear need to redesign large classes of software to account for these changes in CPU design. In this context, we present in this paper some changes

that could be done to improve the performance of multi-core based servers. Servers have traditionally been multi-threaded, but they have not necessarily been run on parallel computers. With multi-core servers, one has to think about how to extend the potential for parallelism of modern CPUs to the entire server system. In this paper, we address this issue by presenting the design of a multi-core Web server that attempts to extend the parallel processing capabilities to the communication subsystem as well. However, such a decision implies the need to reconsider the use of the traditional communication protocol for servers in general, namely TCP, as a transport protocol for HTTP. Its single-stream nature and its inner algorithms make TCP hardly suited for a parallel communication subsystem. A better choice is the Stream Control Transmission Protocol (SCTP) [5], as we advocate in this paper. SCTP is a connection- and message-oriented protocol multiplexing several streams of boundary-preserving messages onto the same connection. The use of multiple streams supports better communication between hosts based on multi-core processors. So far, SCTP has mostly been used in areas other than Web servers (or servers in general). However, we believe that it can successfully replace TCP as a transport protocol for Web servers. This paper presents the design of a multi-core Web server that uses SCTP for the communication with its clients and evaluation figures that support our claim.

2 Stream Control Transmission Protocol


SCTP is a connection-oriented, transport-level protocol that operates on top of a datagram transfer protocol such as IP. SCTP offers reliable data transfer and, optionally, partially reliable data transfer. It has been developed with the intention of providing a general-purpose transport protocol for message-oriented applications, in the context of telephony over IP for Signalling System 7. The original standard was published in October 2000 by the IETF SIGTRAN working group in RFC 2960 [5]. The SCTP design solves two key problems of TCP usage: head-of-line blocking and multihoming.

Head-of-line blocking delays the delivery of a TCP packet within a receiver's transport-layer buffers if earlier packets have been lost and must be retransmitted. This is a serious issue for applications that use independent messages. SCTP solves the problem by using multiple, independent logical streams of boundary-preserving messages within a connection, which SCTP calls an association. The transmission mechanism keeps the order of messages within the same stream. This order is maintained by assigning independent sequence numbers to user datagrams. SCTP allows bypassing the sequence number assignment in order to deliver messages immediately to the application, if need be. When an association is initialized, the communicating parties agree on the number of streams they will use. Preserving message boundaries is an important feature of SCTP as well. As opposed to TCP, which transmits data as a stream of bytes, SCTP behaves much like UDP, in the sense that it sends messages with a well-defined size. Unlike UDP, SCTP guarantees reliable delivery of the sent messages. In other words, if an SCTP host sends 1 KB of data in one operation, the other endpoint will read 1 KB of data in exactly one operation. This property represents an advantage over TCP, as one can avoid the typical methods used to delimit received messages with TCP. SCTP offers multihoming support at both ends of an association. Computers that have several IP addresses associated with several physical network interfaces can use SCTP to direct the data traffic over different paths. Thus, SCTP provides fault tolerance and automatically takes over the communication of a failing path. This automatic fault-tolerance mechanism uses an SCTP instance that monitors all the data paths to the other end by means of HEARTBEAT messages. These messages are sent over all of the unused paths and can be acknowledged by a HEARTBEAT ACK according to the state of the path: active or inactive. A path is active if it was recently used to transmit an SCTP datagram confirmed by the other end. If transmitting messages fails repeatedly along a given path, that path is considered inactive. An SCTP association is terminated forcefully (and the other end considered unreachable) either when the number of unacknowledged HEARTBEAT messages within a given time frame or the number of retransmission events exceeds a given threshold. TCP can emulate SCTP behavior by setting up a connection for each SCTP stream, but there are two other drawbacks beyond those already pointed out: additional resource consumption and overhead, plus a loss of programming productivity (in terms of additional development effort).
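The stream and message-boundary semantics described above are exposed directly by the SCTP sockets API. The fragment below is a minimal sketch, assuming the Linux lksctp extensions [15] (link with -lsctp): it requests a number of outbound streams when the association is initialized and then sends one boundary-preserved message on an explicitly chosen stream. The address, port and stream count are placeholder values for illustration only.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

int main(void)
{
    /* One-to-one style SCTP socket, analogous to a TCP socket. */
    int sd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
    if (sd < 0) { perror("socket"); return 1; }

    /* Ask for 8 outbound streams at association setup; the peer may
     * negotiate this number down. */
    struct sctp_initmsg init = { .sinit_num_ostreams  = 8,
                                 .sinit_max_instreams = 8,
                                 .sinit_max_attempts  = 4 };
    setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init));

    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port   = htons(8080) };  /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);           /* placeholder address */
    if (connect(sd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect"); return 1;
    }

    /* Send one message on stream 3; the receiver gets it, with its exact
     * size, in exactly one sctp_recvmsg call. */
    const char req[] = "GET /small.png HTTP/1.1\r\nHost: example\r\n\r\n";
    sctp_sendmsg(sd, req, strlen(req), NULL, 0,
                 0 /* ppid */, 0 /* flags */, 3 /* stream */, 0, 0);

    close(sd);
    return 0;
}
```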

3 Server design and implementation


3.1 Motivation for a new server design
Currently, there is no extensive experience with SCTP and Web servers (see Section 5). Therefore, an investigation of the performance of SCTP as a transport protocol for multi-core based Web servers needs to start almost from scratch, namely by modifying an existing Web server to support communication over SCTP. Afterwards, one has to combine the SCTP support with the multi-core capabilities of modern processors for improved Web server performance. Under these circumstances, our choice was to modify a simple Web server, and we chose NullHttpd [19]. The importance of a new server design comes from the fact that the multiple streams of an SCTP association naturally match a multi-threaded service of the client HTTP requests. In the case of multi-core processors, this multi-threaded service can really be executed in parallel (true parallelism as opposed to the pseudo-parallelism typical of uniprocessor, time-shared systems). However, there are several aspects of the problem that need special attention. First of all, a parallel handling of SCTP requests considers an additional level of parallelism, which we call intra-association parallelism. In a TCP server, client requests sent on separate TCP connections can be handled in parallel using separate threads on separate server cores. This can be true for SCTP requests as well; however, an SCTP client also has the option to send several independent requests on different streams of the same association. At the server side, it is therefore possible to handle in parallel independent requests sent by the same client on the same SCTP association. This type of intra-association parallelism improves the per-client response time, while the other type of parallelism strives to improve the overall performance for all the clients. Second, the SCTP API imposes a particular handling of the client requests. The SCTP receive routine, sctp_recvmsg, cannot be used to wait for incoming SCTP messages on a given stream of the association. Only when a message is received does the programmer know the number of the stream the data came on. As a consequence, one needs a central point of request distribution per SCTP association that routes messages for particular SCTP streams to the corresponding servicing threads (see the sketch below). Finally, there is also the question of how to handle requests sent over the same stream. Handling requests for TCP connections is simple: the connection is assigned to a given thread and all the requests sent on that connection will be handled by that thread. With SCTP, there are several choices. For instance, one can allocate a servicing thread for any request, regardless of the stream it came on. Another choice would be to use stream affinity, a mechanism that binds an SCTP stream of a given association to a given servicing thread. In general, with SCTP one can talk about stream scheduling within the same association.
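To make the API constraint concrete, the sketch below shows how the stream number becomes known only after sctp_recvmsg returns, through the sctp_sndrcvinfo structure filled in by the kernel; this is what forces a per-association dispatcher in our design. The dispatch_to_worker helper is a hypothetical placeholder for the routing step, not a function of NullHttpd.SCTP.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

/* Hypothetical hook: hand the message to the thread serving this stream. */
void dispatch_to_worker(unsigned stream, const char *msg, size_t len);

/* Receive loop run by an association thread. We cannot ask for "the next
 * message on stream k"; the stream is learned only after a message arrives.
 * Assumes the sctp_data_io_event notification has been enabled via the
 * SCTP_EVENTS socket option so that sinfo is populated. */
void association_loop(int assoc_fd)
{
    char buf[4096];
    struct sctp_sndrcvinfo sinfo;
    int flags;

    for (;;) {
        memset(&sinfo, 0, sizeof(sinfo));
        flags = 0;
        ssize_t n = sctp_recvmsg(assoc_fd, buf, sizeof(buf),
                                 NULL, NULL, &sinfo, &flags);
        if (n <= 0)
            break;                      /* association closed or error */

        /* sinfo.sinfo_stream now identifies the stream the request used,
         * so the message can be routed to the corresponding worker. */
        dispatch_to_worker(sinfo.sinfo_stream, buf, (size_t)n);
    }
}
```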

3.2 Server design


Our modified NullHttpd server, which we call NullHttpd.SCTP, has been designed to maximize the degree of parallelism of the client request handling when using

SCTP as a transport protocol for HTTP. This choice implies not only multi-threading the server, but also multiplexing its communication paths so that we can achieve a pseudo-parallel communication system. A truly parallel system would require multiple network interfaces and a solution like Concurrent Multipath Transfer [9]. As of now, our server doesn't support such a solution. As a client opens up an SCTP association with the server, the server allocates a so-called association thread that will dispatch client requests for service. This thread can be drawn from a special thread pool reserved for this type of service. Each SCTP client association will have its own dispatch thread. Its main job is to wait for incoming client messages by using a blocking sctp_recvmsg call. When a message arrives, the call unblocks and the message is dispatched, based on its stream number, to a worker thread drawn from a thread pool separate from the association thread pool. When an incoming client message needs to be dispatched by the association thread, the latter has to choose a worker thread that will actually handle the request. At this point, the association thread implements what we call a stream scheduling policy, which might just as well be called a message dispatch policy. In our current design, we have two such policies. The first one simply chooses any available thread from the worker thread pool and passes it the message. This policy assumes that the client sends its request in one SCTP message, otherwise it would be wrong. However, this is not a restricting assumption, since HTTP requests are not complicated, have small sizes and are usually sent within one message. Remember from Section 2 that SCTP messages sent in one operation are read at the other end in exactly one operation (and with the exact message size). Therefore, we don't run the risk of reading the received request in several fragments, as might very well happen in the case of a TCP byte stream. A second policy defines a stream affinity for the worker threads. Each worker thread gets assigned to the service of one SCTP stream, and the association thread will always dispatch messages for that stream to that worker thread. Compared to the first policy, this stream affinity policy tends to favor locality of service. We refer the reader to Section 4 for a comparison between the stream affinity policy and the previously described stream scheduling policy. It is worth noting that, currently, the stream scheduling policies are non-informed. That is, an incoming SCTP message is not inspected in order to gather application-level protocol (HTTP, in our case) information that could be used to drive the choice of a worker thread. However, it is possible to think of informed policies that exploit such information to improve the request service. For instance, on a multiple cc-NUMA processor server like the one used in our experiments (see Section 4), one could think of a locality-aware scheduling policy that assigns requests for the same

Web document to the same worker thread, which can be pinned down to a given multi-core chip by means of a CPU affinity mask. Thus, the requested document would stay loaded in the local memory of that CPU chip, improving the speed of its delivery. To summarize, the design of the SCTP association service uses an association thread and worker threads to form a particular variant of the well-known producer-consumer problem: single producer, multiple consumers. The association thread is the producer that takes client data from an SCTP stream and feeds it to one of the consumers, a worker thread; a sketch of this dispatch scheme is given below. Naturally, many SCTP associations can be serviced in parallel, since the server uses a separate thread pool for the association threads.
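The following fragment sketches the stream affinity policy (Policy 2) in the single producer, multiple consumer form described above: the association thread pushes each received message into a per-stream queue, and one worker thread consumes exactly one queue. It is an illustrative reconstruction under our own naming; the struct, function names and sizes are not taken from NullHttpd.SCTP.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STREAMS 64
#define MAX_MSG     4096

struct request {               /* one queued client message */
    char   data[MAX_MSG];
    size_t len;
    struct request *next;
};

struct stream_queue {          /* one "pipe" per SCTP stream */
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    struct request *head, *tail;
};

static struct stream_queue queues[MAX_STREAMS];

static void queues_init(void)
{
    for (int i = 0; i < MAX_STREAMS; i++) {
        pthread_mutex_init(&queues[i].lock, NULL);
        pthread_cond_init(&queues[i].nonempty, NULL);
        queues[i].head = queues[i].tail = NULL;
    }
}

/* Producer side (association thread): route a received message to the
 * queue of its stream, i.e. to the worker bound to that stream. */
static void dispatch(unsigned stream, const char *msg, size_t len)
{
    struct request *r = malloc(sizeof *r);
    r->len = len < MAX_MSG ? len : MAX_MSG;
    memcpy(r->data, msg, r->len);
    r->next = NULL;

    struct stream_queue *q = &queues[stream % MAX_STREAMS];
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: each worker thread services exactly one stream queue,
 * e.g. started with pthread_create(&tid, NULL, worker, &queues[s]). */
static void *worker(void *arg)
{
    struct stream_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (!q->head)
            pthread_cond_wait(&q->nonempty, &q->lock);
        struct request *r = q->head;
        q->head = r->next;
        if (!q->head) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);

        /* handle_http_request(r->data, r->len);  parse and send the reply */
        free(r);
    }
    return NULL;
}
```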

3.3 Aspects of the server implementation


The modifications we have made to NullHttpd [19] have been quite extensive. First of all, NullHttpd was not designed for persistent HTTP connections (i.e., HTTP/1.1). This implied adding persistency support, because we were interested in comparing the TCP version with the SCTP version (i.e., NullHttpd vs. NullHttpd.SCTP). Since testing the latter implied sending multiple requests over several streams of an association, the TCP version must support persistent HTTP as well in order to allow for a fair comparison. Second, NullHttpd is multi-threaded but doesn't have thread pools (it creates a new thread as each new TCP connection is set up). This aspect of the server is suboptimal for multi-core architectures, but it gave us the opportunity to implement our own multi-threaded service according to the guidelines of the design presented in the previous section. As already mentioned, NullHttpd.SCTP has two thread pools, one for association threads and one for worker threads. The logic of each of these thread pools is similar. The threads start by looking for service and they block on the lock of a particular queue for their respective jobs. As soon as a new client SCTP association is set up (this is done by the main server thread, the one that executes the accept system call), an association thread from the association pool grabs the lock of the queue for associations and takes over the handling of that association. Similarly, as soon as an association thread places jobs (service requests) in an association queue, one of the worker threads wakes up, grabs the queue lock and starts handling the request. The single producer, multiple consumer scheme of cooperation between the association thread and the worker threads is based on communication pipes. We call them that by analogy with Unix pipes, but they are actually user-space circular buffers. Each worker thread is linked to its association thread by means of such a pipe (in other words, there is one pipe per association stream). The association thread reads data from an SCTP association by means of an sctp_recvmsg call, identifies the stream the data belongs to,

and writes the data to the pipe associated with that stream. The worker thread handling that stream will read the data from the corresponding pipe. The double-buffering problem in the above cooperation scheme could have been avoided by passing the receive buffer directly to a worker thread. However, our design also considers HTTP pipelining, a technique that improves page loading times by sending a burst of requests on a stream before waiting for an answer. Although the client we used for the performance evaluation (see Section 4) doesn't behave like that, we designed the server as if clients could do so. HTTP pipelining poses a problem to the stream affinity policy. That policy uses a single servicing thread per stream and therefore maintains the order of the request service, but it needs a place to accumulate the burst of requests and uses the pipes for that purpose.

stream has not been reached. When all the requests sent over all of the streams of the association have completed, the client finishes. The received files are not stored. The client discards the received bytes to prevent slowing down the rate at which requests can be issued. The time between sending the first request and the last completion of a request is reported as the latency of the test. The tests have used requests for five files: small.png (roughly 7 KB), image.jpg (roughly 65 KB), england.jpg (roughly 121 KB), dragonball.jpg (roughly 705 KB) and ball.jpg (roughly 1810 KB). The first three files have typical sizes for small, medium and large Web documents. The latter two files are included in the evaluation to assess the performance of SCTP in general, not only for the Web.

4.2.1 TCP vs. SCTP

A first fair question that arises is how well SCTP performs with a single stream per association when compared to TCP. Intuition suggests that TCP should fare better, as SCTP adds some overhead related to the management of multiple streams of the same association. Figure 1 shows the latency, measured in seconds, of requesting 8000 files over TCP and SCTP, respectively. For SCTP two figures are reported, one for each of the stream scheduling policies we defined (Policy 1 denotes the default policy, while Policy 2 refers to the stream affinity policy). The figure reports values for all of the files in the test file set.

4 Performance evaluation
4.1 Experimental setup
Our experiments used several HS22 [8] server blades of an IBM cluster. Each of the server blades is equipped with two quad-core Intel(R) Xeon(R) E5520 processors running at 2.26 GHz. These are cc-NUMA processors from the Intel Nehalem family. During the experiments, the Hyper-Threading (Simultaneous Multi-Threading) capability of the processors has been disabled. Thus, only 8 cores (8 hardware threads) have been available for the tests. Each blade has 48 GB of RAM and 1 TB of disk space. The set of files used for the experiments resides on a DS3000 [4] external storage unit connected to the cluster by means of a Fibre Channel interconnect and is exported to the nodes by means of NFS. The cluster nodes are interconnected with Gigabit Ethernet and run Red Hat Enterprise Linux Server.


4.2 Single client performance


This section presents the performance evaluation of a single SCTP client that opens up an association with the NullHttpd.SCTP Web server. The client sends the same number of requests along each of the streams of the association. All the requests are for the same Web document (in fact, we use requests for image files). Requesting the same file repeatedly means that the client in fact tests only the processing and communication capabilities of the Web server and not the performance of its disk I/O subsystem, which is exactly what we are looking for. Indeed, after the first load from the disk, the contents of the requested document stay in the main memory of the server for the rest of the test and the disk I/O costs are negligible. The client sends requests over all the streams of the association and waits for replies. As soon as a request is completely fulfilled (i.e., the whole file has been received), the client immediately sends another request on that stream, provided that the maximum number of requests to be sent per


Figure 1. TCP vs. one SCTP stream per association

Counterintuitively, TCP performs worse than SCTP for typical Web-size files and outperforms SCTP only for large files. We do not have an explanation for this behavior. However, it should be said that for TCP we are using the same modified Web server (NullHttpd.SCTP) in which the specific SCTP send and receive calls are replaced with normal TCP send and receive calls. It is therefore conceivable that a simpler and more efficient implementation of the TCP version of the server would outperform the SCTP server servicing single-stream associations.

Anyway, the differences between TCP and SCTP with one stream are not significant. Therefore, for the rest of this section we will skip the comparison with TCP. Instead, we assume that the figures for SCTP associations with one stream approximate the TCP performance reasonably well.

4.2.2 Stream scheduling policies comparison

We also compared our default stream scheduling policy (which routes an incoming request to any of the available servicing threads) with the stream affinity policy (which associates with each stream of an association a handling thread that services all the requests sent on that stream). In the experiments we call the default policy Policy 1, while Policy 2 denotes the stream affinity policy.

significantly greater than the values shown (one order of magnitude difference) and would have made the graph difficult to read. The rest of the tests for which we report results use the stream affinity policy, unless otherwise specified. As the number of streams grows, the request latencies of both policies initially decrease up to some point and then start to increase slowly. That optimum point is less obvious for small and medium files in the case of the stream affinity policy, for which the performance seems unaffected even for large numbers of streams per association (32 and 64). For the rest of the cases, the optimum point seems to be around 8 streams per association, which is probably explained by the fact that the server used 8 cores for the test. Overall, Figure 2 shows that the performance of SCTP-based Web servers can be increased by increasing the number of streams (given a multi-core architecture; the effect of the number of cores is presented in the next section). If one also looks at Figure 1, it becomes clear that sending requests over several streams can improve the client-perceived performance compared to the case when just one stream (either TCP or SCTP) is being used.

4.2.3 The impact of the number of cores on the server performance

This section shows how the performance of the SCTP server is affected by varying the number of cores used for service. In order to do that, we used the Linux taskset [13] command and launched the SCTP server on 2, 4 and 8 cores, respectively. The clients requested 32000 files over one SCTP association.


Figure 2. Stream scheduling comparison

The tests used an 8-core server and the client sent 8000 requests. Figure 2 shows the latency of servicing all the requests when varying the number of streams per association. Only figures for the documents of typical Web size have been reported. Note that the stream affinity policy improves the server performance. The improvement is especially important for small and medium-sized files (small.png and image.jpg). For instance, for 64 streams, the latency of using the stream affinity policy (Policy 2) for small.png is three times smaller than that of the default policy (Policy 1), i.e., 3.1 vs. 9.19 seconds. For image.jpg, the corresponding latency is 1.61 times smaller, 6.08 vs. 9.79 seconds. These results are particularly important because these types of files account for most of the Web traffic (many studies [3, 7, 23] have shown that the distribution of Web document requests follows a Zipf-like law [24] in which most of the requests are for small and medium-sized files). Note that the latencies for the default policy (Policy 1) in the case of small.png requests beyond 4 streams per association are greater than the values for a larger file (image.jpg) fetched using the second policy. This is another argument in favor of our stream affinity policy. The stream affinity policy outperforms the default policy for large files (ball.jpg and dragonball.jpg) as well, but the actual figures are not reported because they are


Figure 3. The variation of the request latency for large files with the number of cores

Figure 3 shows the performance for large files (ball.jpg and dragonball.jpg). While it is not surprising that the server running on two cores underperforms, it is interesting that the differences between 4 and 8 cores are not that large, especially when the number of streams per SCTP association grows. The best server performance, regardless of the number of cores, favors a choice of 4 to 8 streams per association. Remember that the stream affinity policy binds a stream to a thread; therefore, as the number of streams

per association grows, so does the contention for CPU resources, and this contention offsets the advantage of further increasing the number of streams per association.


client issues. For this test, the client sent requests for the ball.jpg image (the largest in our file set) 8000, 16000 and 32000 times. As a result, the server had to deliver approximately 14.4, 28.8 and 57.6 GB of data, respectively. The tests were run on 2, 4 and 8 cores by appropriately setting the CPU affinity mask of the application server using the Linux taskset [13] command. The request latencies for these loads when using various numbers of streams per SCTP association are reported in Figures 6, 7 and 8.



Figure 4. The variation of the request latency for england.jpg with the number of cores


Figure 6. The influence of the load on a server with 8 cores when modifying the number of requests made by a client on an SCTP association


Figure 5. The variation of the request latency for image.jpg and small.png with the number of cores

In the case of the files that have typical sizes for the Web, one can see from Figures 4 and 5 that the performance of retrieving static data from the SCTP Web server seems to favor a configuration of the server with 8 cores and 4 to 8 streams per SCTP association. For small files (small.png), the difference between the performance of the server running on 2 cores only and the other two cases is not as visible as in the case of the larger files. The reason for that is the short service the server threads have to fulfill, short enough not to saturate the two CPU cores. For medium and large Web-like files (image.jpg and england.jpg), the difference between using 2 cores and 4/8 cores is significant. However, the difference between using 4 and 8 cores becomes even less important than in the case of the large files (ball.jpg and dragonball.jpg).

4.2.4 The load impact on the server performance

We also assessed the impact that the client load has on the server performance by varying the number of requests our


Figure 7. The influence of the load on a server with 4 cores when modifying the number of requests made by a client on an SCTP association

Increasing the load causes the server to saturate rather quickly. For instance, in Figure 8 one can see that even for 8000 client requests (roughly 14.4 GB of data) the server running on two cores cannot cope with the load and saturates already at two streams per association. In fact, the number of cores seems to correlate quite well with the number of streams at which the server starts to saturate. Indeed, the server running on 4 cores seems to start saturating at 4 streams per association, while the server running on 8 cores starts to degrade its performance beyond 8 streams per association. There are two exceptions: for 32000 requests, the server running on two cores has the

lowest latency for 4 streams, while the server running on 4 cores records its best performance for 8 streams.


for 4 cores, but become visible for 2 cores. For 4 cores, choosing cores on the same chip improves the request latency by 8-11%, while in the case of 2 cores the improvement varies between 5% and 16%.


Figure 8. The influence of the load on a server with 2 cores when modifying the number of requests made by a client on an SCTP association

Overall, it can be concluded that the request latency grows with the number of requests and that the saturation point of the server in terms of number of streams is quite low. Therefore, one should not use associations with many streams (significantly more than the available number of cores) when servicing large loads.

4.2.5 The influence of CPU affinity on the server performance

In the previous experiments, when we tested the server performance on 2 and 4 cores, respectively, we chose cores residing on the same chip. In this section we present the performance differences that may arise when changing the CPU affinity of the server so that it spans both of the system chips. This was the default case when using 8 cores, because we had no choice, but for 2 or 4 cores one can think of different affinity sets for the server application. The matter is important because our processors are cc-NUMA (from the Nehalem family of Intel processors). As a result, the CPU chips have local memory of their own, which is accessible faster than the local memory of another CPU. Therefore, it makes a difference whether or not the CPU affinity mask of an application spans several processor chips. Our experiments used the Linux taskset [13] command to place the SCTP Web server on 2 or 4 cores either belonging to the same chip or equally divided between the two chips of the system. For two cores we used CPU affinity masks of 0xc0 (cores 3 and 4 of the second chip) and 0x88 (core 4 of the second chip and core 4 of the first chip). Similarly, for 4 cores, the CPU affinity masks were 0xf0 and 0xcc. The latencies of retrieving 16000 times the dragonball.jpg file from the server using the above CPU affinity masks are shown in Figure 9. Note that beyond 4 streams per association, the CPU affinity favors a placement that uses cores of the same chip. The differences are less obvious

Figure 9. Latency for requesting 16000 times dragonball.jpg from a server running on cores of the same chip vs. cores from different chips


Figure 10. Latency for requesting 16000 times image.jpg from a server running on cores of the same chip vs. cores from different chips

In Figure 10 we have the results for retrieving image.jpg 16000 times from the server. Here the results are much clearer because the size of the image is significantly smaller (65 KB vs. 705 KB) and the load on the processors is reduced accordingly. There are a few things to notice in this graph. First, both for 2 and 4 cores, when the server ran on two different processor chips it came closest to the performance of the server running on cores of the same processor chip at about 8 streams per association. This is some sort of optimal point for the server spanning both chips. When the server runs on the same chip, its performance reaches the optimal point very quickly (at about 4 streams per association) and stays fairly constant as the number of streams increases, whereas in the other case, after reaching the optimal point, the performance of the server degrades. Overall, the difference between the server running on two cores of the same chip and that running on two cores

of different chips ranges from 15% to 32%. For 4 cores, the improvement ranges from 14% to 26%.
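For reference, the effect of a mask such as 0xcc or 0xf0 can also be obtained programmatically. The snippet below is an illustrative equivalent of launching the server under taskset, using the Linux sched_setaffinity interface; the mask value is just an example taken from the discussion above, and the core-to-chip mapping is the one assumed in our setup.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to the cores selected by a hexadecimal CPU mask,
 * an illustrative equivalent of "taskset 0xcc <server>". */
static int pin_to_mask(unsigned long mask)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < (int)(8 * sizeof(mask)); cpu++)
        if (mask & (1UL << cpu))
            CPU_SET(cpu, &set);             /* select each core whose bit is set */
    return sched_setaffinity(0 /* this process */, sizeof(set), &set);
}

int main(void)
{
    /* 0xcc selects cores 2, 3, 6 and 7, i.e. two cores on each chip
     * under the numbering assumed above. */
    if (pin_to_mask(0xcc) != 0)
        perror("sched_setaffinity");
    /* ... start the server threads here ... */
    return 0;
}
```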


Figure 12 shows that increasing the packet loss rate naturally causes a degradation of the server performance, perceived by the client as an increase of the request latency. The 0.1% packet loss rate can be taken as a good reference for the magnitude of this degradation, since in that case only one packet in 1000 gets lost, so we can consider that practically no packets are lost. The figures are indeed practically indistinguishable from those previously reported for england.jpg. Note also that the degradation varies smoothly with the number of streams per association, for each packet loss rate taken separately. That means that even in the presence of network losses, associations with more streams perform consistently better than those with few streams (1 or 2).


Figure 11. The impact of the send/receive buffer size on the server performance


4.2.6 The impact of the send/receive buffer size on the server performance

As already mentioned in Section 2, an SCTP message sent in one operation will be received at the other end in exactly one operation, with exactly the message size. Under these circumstances, we explored the sensitivity of SCTP to the size of the send/receive buffers. We modified the client and the server code to match their send/receive buffer sizes for different values. The modifications refer to the server send buffer and the client receive buffer. The client requested 32000 times the england.jpg file from a server running on 8 cores. The server buffer size has been varied from 1 KB to 8 KB. The impact on the server performance is reported in Figure 11. Note that the larger the server send buffer is, the better the performance. The largest differences in request latency are recorded beyond 4 streams per association. Beyond 4 streams per association, the request latency decreases by 13-15% between 1 KB and 2 KB, by 30-35% between 1 KB and 4 KB, and by 23-41% between 1 KB and 8 KB.

4.2.7 The impact of the network packet loss on the server performance

So far, our experiments have not considered real network behavior, because both the server and the client machines are server blades in the same cluster and are interconnected by means of a high-speed network (Gigabit Ethernet). This section aims to investigate the impact of the packet loss rate on the performance of our SCTP Web server. Figure 12 reports the request latencies for clients that request 32000 copies of the england.jpg file over 8 SCTP streams. The server runs on 8 cores and uses an 8 KB send buffer. We decided to choose loss rates taken from various reports about typical loss rates in the Internet. In order to induce the loss rate, we used the tc [14] Linux tool, which allowed us to simulate packet loss.


Figure 12. The impact of the network packet loss rate on the server performance

4.3 Multiple clients performance


In this subsection we describe the behavior of the SCTP Web server when requests come from multiple clients, either located on the same machine or on different machines. To test the performance of the server we chose its most efficient configuration, as it appears from the previous experiments. Namely, we used all of the 8 cores of the node hosting the server and the server was using a send buffer of size 8 KB. The clients sent 32000 requests for england.jpg over 8 SCTP streams for any given association (i.e., 4000 requests per stream). We varied the number of clients from 1 to 32 and the results are reported in Figures 13 and 14. All these clients ran on the same (client) machine. Figure 13 shows the average latency of a request as perceived by a client. As throughout the experimental section, we consider a request to consist of all of the 32000 individual requests sent by one client over its streams. The average value reported on the graph is computed over all the latencies of the client requests. The degradation of the latency is proportional to the number of clients issuing requests (the graph may be a bit misleading, since the scale of the X axis is actually exponential). The variation of the number of cores also affects the performance of the server, as one would naturally expect. Note, however, that the degradation of the latency from 4 to 8 cores is significantly smaller than that from 2

to 4 cores. One explanation can be found when looking at Figure 14. Indeed, the server saturates rather quickly as the number of cores goes from 4 to 8, and increasing the number of clients doesn't improve the performance any further.


Clients   Latency (s)   Throughput (MB/s)   Maximum throughput (MB/s)
8         299.14        100.77              102.24
4+4       298.41        100.19              102.07
16        597.16        101.12              102.71
8+8       597.03        100.78              102.55
32        1199.71       100.62              102.65
16+16     1181.35       102.15              104.07

Table 1. One vs. two client nodes performance



5 Related work
Although SCTP was proposed more than a decade ago, it has failed to replace TCP beyond the domain it was initially designed for (IP telephony). Therefore, this section discusses several interesting projects that use SCTP in other areas as well and highlights the benefits over using TCP. TCP is still the dominant transport protocol of the Internet partly because it took a while before the popular operating systems in use decided to support the protocol in their main distributions. As of now, however, many operating systems support SCTP (FreeBSD, Linux, Solaris, AIX, HP-UX, etc.). Other solutions include user-space implementations of the protocol [20]. Also, the programming support tends to extend beyond the C libraries that have been used so far [15]. Since the beginning of 2010, Java 1.7 includes an experimental SCTP package [1], while Windows has started to use SCTP for its .NET since 2011 [21]. SCTP has been used for FTP and HTTP servers before. Using multiple-stream SCTP associations has been found to facilitate system backups or mirror site downloads that rely on FTP to do their job [11]. Natarajan et al. [18, 17] modified an Apache server and a Firefox browser to support SCTP and showed improved performance when using SCTP as a transport protocol for HTTP. The authors also submitted an IETF draft for server and browser SCTP support [16]. Another interesting work, Concurrent Multipath Transfer [9], improves the parallel transfer of data over multihomed SCTP associations. We believe that our work complements theirs by exploring the use of SCTP in the context of servers running on multi-core processors. As this paper has tried to show, changes in the processor paradigm imply software changes. Thus, we not only adapted an existing server so that it can use SCTP, but we provided a new server design attempting to match the potential for parallel computation of the multi-core processors with the potential for communication over parallel, independent streams offered by SCTP. Another interesting project that shares the above goal with ours but operates in a different direction is the use of SCTP as a transport protocol for MPI (the Message Passing Interface) for clusters of computers [10]. Although the context is a bit different, the goal is similar in


Figure 13. Average latency for england.jpg when using several clients running in parallel on the same node

When looking at Figure 14, a natural concern is to find out whether the server really saturates or whether the limitation is imposed by the capacity of the client machine to absorb the data. We ran some additional experiments to settle the issue. Namely, we started the same number of clients, 8, 16 and 32, respectively, evenly spread across two client machines (two blades in the cluster). The results show that the individual performance figures for the clients do not change. As a consequence, we can conclude that the throughput limitation is due to the server. A comparison between the tests with 1 client node and 2 client nodes is shown in Table 1. The figures for the tests using 2 client nodes are marked 4+4, 8+8 and 16+16, respectively. These tests also used 8 cores for the server, and the clients used 8 streams over which england.jpg has been requested 32000 times (4000 times per stream).


Figure 14. Average and maximum throughput values for england.jpg when using several clients run in parallel on the same client node

the sense that one tries to parallelize the communication paths among parallel computing resources. SCTP is not the only communication protocol attempting to multiplex several data streams onto the same network connection. The Structured Stream Transport (SST) project [6] has defined and developed a protocol similar to SCTP. To our knowledge, there are no attempts to use that protocol as a transport protocol for HTTP.

6 Conclusions
In this paper we have presented the design and implementation of a multi-core Web server that uses the Stream Control Transmission Protocol as a transport protocol for HTTP. The server design attempts to match the multi-core computing capabilities of modern processors with the potential for parallel communication offered by multi-stream SCTP associations. The server is capable of taking advantage of multiple independent communication paths of the same SCTP association by using stream scheduling policies in order to improve the performance of the servicing threads. We have shown that by defining a stream affinity policy it is possible to improve the performance of a multi-threaded Web server that has been implemented by modifying an existing server, NullHttpd, to support HTTP persistency, SCTP and extended multi-threading capabilities. The evaluation results show that using SCTP associations with multiple streams outperforms single-stream connections such as TCP in the case of multi-core Web servers. The TCP performance is similar to that of an SCTP association with one stream. The performance evaluation considered both highly reliable networks (such as cluster interconnects) and real-world scenarios that include network packet loss. It has also been shown that the SCTP-based Web server is sensitive to CPU affinity settings.

7 Acknowledgment
The work presented in this paper has been partially supported by CNCS-UEFISCDI under the research grant ID 1679/2008 (contract no. 736/2009), PN II - IDEI program.

References
[1] JDK 1.7. https://jdk7.dev.java.net/.
[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 2006.
[3] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of IEEE Infocom '99, March 1999.

[4] IBM System Storage DS3000. http://www03.ibm.com/systems/storage/disk/ds3000/.
[5] R. Stewart et al. RFC 2960: Stream Control Transmission Protocol, 2000.
[6] B. Ford. Structured Streams: a New Transport Abstraction. In Proc. of ACM SIGCOMM 2007, August 2007.
[7] S. Glassman. A Caching Relay for the World Wide Web. In First International World Wide Web Conference, 1994.
[8] HS22 Blade Servers, IBM BladeCenter. http://www03.ibm.com/systems/bladecenter/hardware/servers/hs22/index.htm
[9] J. Iyengar, P. D. Amer, and R. Stewart. Concurrent multipath transfer using SCTP multihoming over independent end-to-end paths. IEEE/ACM Transactions on Networking, 14(5), October 2006.
[10] H. Kamal, B. Penoff, and A. Wagner. SCTP vs. TCP for MPI. In Proc. of Supercomputing 2005 (SC2005), November 2005.
[11] S. Ladha and P. D. Amer. Improving multiple file transfers using SCTP multistreaming. In Proc. of the IEEE International Conference on Performance, Computing and Communications, 2004.
[12] Tong Li, D. Baumberger, D. A. Koufaty, and S. Hahn. Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC07), November 2007.
[13] Linux. taskset. man page.
[14] Linux. tc. man page.
[15] LKSCTP. http://lksctp.sourceforge.net.
[16] P. Natarajan. Using SCTP as a Transport Layer Protocol for HTTP. http://tools.ietf.org/html/draft-natarajan-http-over-sctp-02, 2009.
[17] P. Natarajan, P. D. Amer, and R. Stewart. Multistreamed Web Transport for Developing Regions. In Proc. of the ACM SIGCOMM Workshop on Networked Systems for Developing Regions, August 2008.
[18] P. Natarajan, J. Iyengar, P. D. Amer, and R. Stewart. SCTP: An Innovative Transport Layer Protocol for the Web. In Proc. of the 15th International Conference on World Wide Web, May 2006.
[19] NullHttpd. http://sourceforge.net/projects/nullhttpd/.
[20] SCTPLIB. http://www.sctp.de/sctp.html.
[21] SCTP.NET. http://www.spcommunication.co.uk/.
[22] The Standard Performance Evaluation Corporation (SPEC). http://www.spec.org/web2009/.
[23] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, A. Karlin, and H. Levy. On the scale and performance of cooperative Web proxy caching. In Proceedings of the 17th ACM Symposium on Operating System Principles, December 1999.
[24] G. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
