
Web Servers:

Implementation and Performance

Erich Nahum

IBM T.J. Watson Research Center


www.research.ibm.com/people/n/nahum
nahum@us.ibm.com

Web Servers: Erich Nahum 1


Contents of This Tutorial
Introduction to HTTP
HTTP Servers:
Outline of an HTTP Server Transaction
Server Models: Processes, Threads, Events
Event Notification: Asynchronous I/O
HTTP Server Workloads:
Workload Characteristics
Workload Generation
Server TCP Issues:
Introduction to TCP
Server TCP Dynamics
Server TCP Implementation Issues

Web Servers: Erich Nahum 2


Things Not Covered in Tutorial

Clusters
Client-side issues: DNS, HTML rendering
Proxies: some similarities, many differences
Dynamic Content: CGI, PHP, JSP, etc.
QoS for Web Servers
SSL/TLS and HTTPS
Content Distribution Networks (CDNs)
Security and Denial of Service

Web Servers: Erich Nahum 3


Assumptions and Expectations

Some familiarity with WWW as a user


(Has anyone here not used a browser?)
Some familiarity with networking concepts
(e.g., unreliability, reordering, race conditions)
Familiarity with systems programming
(e.g., know what sockets, hashing, caching are)
Examples will be based on C & Unix
taken from BSD, Linux, AIX, and real servers
(sorry, Java and Windows fans)

Web Servers: Erich Nahum 4


Objectives and Takeaways

After this tutorial, hopefully we will all know:

Basics of server implementation & performance


Pros and cons of various server architectures
Difficulties in workload generation
Interactions between HTTP and TCP
Design loop of implement, measure, profile, debug,
and fix

Many lessons should be applicable to any networked server, e.g., files, mail, news, DNS, LDAP, etc.

Web Servers: Erich Nahum 5


Timeline
Section Min
Introduction to HTTP 20
Outline of an HTTP Server Transaction 25
Server Models: Processes, Threads, Events 20
Event Notification: Asynchronous I/O 20
Workload Characteristics 35
Break 15
Workload Generation 40
Introduction to TCP 35
Server TCP Dynamics 20
Server TCP Implementation 25

Web Servers: Erich Nahum 6


Acknowledgements
Many people contributed comments and
suggestions to this tutorial, including:

Abhishek Chandra, Suresh Chari, Mark Crovella, Peter Druschel,
Jim Kurose, Balachander Krishnamurthy, Vivek Pai, Jennifer Rexford,
Anees Shaikh

Errors are all mine, of course.

Web Servers: Erich Nahum 7


Chapter 1: Introduction to HTTP

Web Servers: Erich Nahum 8


Introduction to HTTP

[Diagram: a laptop running Netscape and a desktop running Explorer each send http requests to, and receive http responses from, a server running Apache]

HTTP: Hypertext Transfer Protocol


Communication protocol between clients and servers
Application layer protocol for WWW
Client/Server model:
Client: browser that requests, receives, displays object
Server: receives requests and responds to them
Protocol consists of various operations
Few for HTTP 1.0 (RFC 1945, 1996)
Many more in HTTP 1.1 (RFC 2616, 1999)

Web Servers: Erich Nahum 9


How are Requests Generated?
User clicks on something
Uniform Resource Locator (URL):
http://www.nytimes.com
https://www.paymybills.com
ftp://ftp.kernel.org
news://news.deja.com
telnet://gaia.cs.umass.edu
mailto:nahum@us.ibm.com
Different URL schemes map to different services
Hostname is converted from a name to a 32-bit IP address (DNS resolution)
Connection is established to server

Most browser requests are HTTP requests.

Web Servers: Erich Nahum 10


What Happens Then?
Client downloads HTML document
Sometimes called "container page"
Typically in text format (ASCII)
Contains instructions for rendering
(e.g., background color, frames)
Links to other pages
Many have embedded objects:
Images: GIF, JPG (logos, banner ads)
Usually automatically retrieved
i.e., without user involvement
can control sometimes
(e.g., browser options, junkbusters)

Sample HTML file:

<html>
<head>
<meta name=Author content=Erich Nahum>
<title> Linux Web Server Performance </title>
</head>
<body text=#00000>
<img width=31 height=11 src=ibmlogo.gif>
<img src=images/new.gif>
<h1>Hi There!</h1>
Here's lots of cool linux stuff!
<a href=more.html>Click here</a> for more!
</body>
</html>

Web Servers: Erich Nahum 11


So What's a Web Server Do?
Respond to client requests, typically a browser
Can be a proxy, which aggregates client requests (e.g., AOL)
Could be search engine spider or custom (e.g., Keynote)
May have work to do on client's behalf:
Is the client's cached copy still good?
Is client authorized to get this document?
Is client a proxy on someone else's behalf?
Run an arbitrary program (e.g., stock trade)
Hundreds or thousands of simultaneous clients
Hard to predict how many will show up on some day
Many requests are in progress concurrently

Server capacity planning is non-trivial.

Web Servers: Erich Nahum 12


What do HTTP Requests Look Like?
GET /images/penguin.gif HTTP/1.0
User-Agent: Mozilla/0.9.4 (Linux 2.2.19)
Host: www.kernel.org
Accept: text/html, image/gif, image/jpeg
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
Cookie: B=xh203jfsf; Y=3sdkfjej
<cr><lf>

Messages are in ASCII (human-readable)


Carriage-return and line-feed indicate end of headers
Headers may communicate private information
(e.g., browser, OS, cookie information, etc.)

Web Servers: Erich Nahum 13


What Kind of Requests are there?

Called Methods:
GET: retrieve a file (95% of requests)
HEAD: just get meta-data (e.g., mod time)
POST: submitting a form to a server
PUT: store enclosed document at the given URI
DELETE: remove named resource
LINK/UNLINK: in 1.0, gone in 1.1
TRACE: http echo for debugging (added in 1.1)
CONNECT: used by proxies for tunneling (1.1)
OPTIONS: request for server/proxy options (1.1)

Web Servers: Erich Nahum 14


What Do Responses Look Like?
HTTP/1.0 200 OK
Server: Tux 2.0
Content-Type: image/gif
Content-Length: 43
Last-Modified: Fri, 15 Apr 1994 02:36:21 GMT
Expires: Wed, 20 Feb 2002 18:54:46 GMT
Date: Mon, 12 Nov 2001 14:29:48 GMT
Cache-Control: no-cache
Pragma: no-cache
Connection: close
Set-Cookie: PA=wefj2we0-jfjf
<cr><lf>
<data follows>

Similar format to requests (i.e., ASCII)

Web Servers: Erich Nahum 15


What Responses are There?
1XX: Informational (defined in 1.0, used in 1.1)
100 Continue, 101 Switching Protocols
2XX: Success
200 OK, 206 Partial Content
3XX: Redirection
301 Moved Permanently, 304 Not Modified
4XX: Client error
400 Bad Request, 403 Forbidden, 404 Not Found
5XX: Server error
500 Internal Server Error, 503 Service Unavailable,
505 HTTP Version Not Supported

Web Servers: Erich Nahum 16


What are all these Headers?
Specify capabilities and properties:
General:
Connection, Date
Request:
Accept-Encoding, User-Agent
Response:
Location, Server type
Entity:
Content-Encoding, Last-Modified
Hop-by-hop:
Proxy-Authenticate, Transfer-Encoding

Server must pay attention to respond properly.

Web Servers: Erich Nahum 17


Summary: Introduction to HTTP

The major application on the Internet


Majority of traffic is HTTP (or HTTP-related)
Client/server model:
Clients make requests, servers respond to them
Done mostly in ASCII text (helps debugging!)
Various headers and commands
Too many to go into detail here
We'll focus on common server ones
Many web books/tutorials exist (e.g., Krishnamurthy &
Rexford 2001)

Web Servers: Erich Nahum 18


Chapter 2: Outline of a Typical
HTTP Transaction

Web Servers: Erich Nahum 19


Outline of an HTTP Transaction
In this section we go over the basics of servicing an HTTP GET request from user space
For this example, we'll assume a single process running in user space, similar to Apache 1.3
At each stage see what the costs/problems can be
Also try to think of where costs can be optimized
We'll describe relevant socket operations as we go

The server in a nutshell:

initialize;
forever do {
    get request;
    process;
    send response;
    log request;
}

Web Servers: Erich Nahum 20


Readying a Server
s = socket();            /* allocate listen socket */
bind(s, 80);             /* bind to TCP port 80 */
listen(s);               /* indicate willingness to accept */
while (1) {
    newconn = accept(s); /* accept new connection */

First thing a server does is notify the OS it is interested in WWW server requests; these are typically on TCP port 80.
Other services use different ports (e.g., SSL is on 443)
Allocate a socket and bind() it to the address (port 80)
Server calls listen() on the socket to indicate willingness to receive requests
Calls accept() to wait for a request to come in (and blocks)
When accept() returns, we have a new socket which represents a new connection to a client
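
A minimal runnable sketch of the same sequence, assuming IPv4 and a Unix-like system; error handling is abbreviated for space:

#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);      /* allocate listen socket */

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(80);                    /* TCP port 80 (needs root) */
    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    listen(s, 128);                               /* backlog of 128 is an arbitrary choice */

    while (1) {
        int newconn = accept(s, NULL, NULL);      /* blocks until a client connects */
        /* ... read request, send response, log ... */
        close(newconn);
    }
}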

Web Servers: Erich Nahum 21


Processing a Request
remoteIP = getpeername(newconn);
remoteHost = gethostbyaddr(remoteIP);
gettimeofday(currentTime);
read(newconn, reqBuffer, sizeof(reqBuffer));
reqInfo = serverParse(reqBuffer);

getpeername() called to get the address of the remote end
for logging purposes (optional, but done by most)
gethostbyaddr() called to get name of other end
again for logging purposes
gettimeofday() is called to get time of request
both for Date header and for logging
read() is called on new socket to retrieve request
request is determined by parsing the data
GET /images/jul4/flag.gif

Web Servers: Erich Nahum 22


Processing a Request (cont)
fileName = parseOutFileName(requestBuffer);
fileAttr = stat(fileName);
serverCheckFileStuff(fileName, fileAttr);
open(fileName);

stat() called to test file path
to see if file exists/is accessible
may not be there, may only be available to certain people
"/microsoft/top-secret/plans-for-world-domination.html"
stat() also used for file meta-data
e.g., size of file, last modified time
("Have plans changed since last time I checked?")
might have to stat() multiple files just to get to the end of the path
e.g., 4 stats in the "bill g" example above
assuming all is OK, open() called to open the file

Web Servers: Erich Nahum 23


Responding to a Request
read(fileName, fileBuffer);
headerBuffer = serverFigureHeaders(fileName, reqInfo);
write(newSock, headerBuffer);
write(newSock, fileBuffer);
close(newSock);
close(fileName);
write(logFile, requestInfo);

read() called to read the file into user space


write() is called to send HTTP headers on socket
(early servers called write() for each header!)
write() is called to write the file on the socket
close() is called to close the socket
close() is called to close the open file descriptor
write() is called on the log file

Web Servers: Erich Nahum 24


Optimizing the Basic Structure

As we will see, a great deal of locality exists in web requests and web traffic.
Much of the work described above doesn't really need to be performed each time.
Optimizations fall under 2 categories: caching and custom OS primitives.

Web Servers: Erich Nahum 25


Optimizations: Caching
Idea is to exploit locality in client requests. Many files are requested over and over (e.g., index.html).

Why open and close files over and over again? Instead, cache open file FDs, manage them LRU.
Why stat them again and again? Cache path name access characteristics.
Again, cache HTTP header info on a per-URL basis, rather than re-generating info over and over.

fileDescriptor = lookInFDCache(fileName);
metaInfo = lookInMetaInfoCache(fileName);
headerBuffer = lookInHTTPHeaderCache(fileName);

Web Servers: Erich Nahum 26


Optimizations: Caching (cont)

Instead of reading and writing the data, cache data,


as well as meta-data, in user space

Even better, mmap() fileData =


the file so that two lookInFileDataCache(fileName);
copies dont exist in fileData =
lookInMMapCache(fileName);
both user and kernel remoteHostName =
space lookRemoteHostCache(fileName);

Since we see the same clients over and over, cache


the reverse name lookups (or better yet, don't do
resolves at all, log only IP addresses)

Web Servers: Erich Nahum 27


Optimizations: OS Primitives
Rather than call accept(), getpeername() & read(), add a new primitive, acceptExtended(), which combines the 3 primitives
Instead of calling gettimeofday(), use a memory-mapped counter that is cheap to access (a few instructions rather than a system call)
Instead of calling write() many times, use writev()

acceptExtended(listenSock, &newSock, readBuffer, &remoteInfo);
currentTime = *mappedTimePointer;

buffer[0] = firstHTTPHeader;
buffer[1] = secondHTTPHeader;
buffer[2] = fileDataBuffer;
writev(newSock, buffer, 3);
Web Servers: Erich Nahum 28


OS Primitives (cont)
Rather than calling read() & write(), or write() with an mmap()'ed file, use a new primitive called sendfile() (or transmitfile()). Bytes stay in the kernel.
While we're at it, add a header option to sendfile() so that we don't have to call write() at all.
Also add an option to close the connection so that we don't have to call close() explicitly.

httpInfo = cacheLookup(reqBuffer);
sendfile(newConn, httpInfo->headers,
         httpInfo->fileDescriptor,
         OPT_CLOSE_WHEN_DONE);

All this assumes proper OS support. Most have it these days.
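
For comparison, a sketch using the Linux sendfile(2) interface, which takes no header or close options (those belong to the extended primitive described above), so headers still go out via write() and close() stays explicit:

#include <unistd.h>
#include <sys/types.h>
#include <sys/sendfile.h>

void sendStaticResponse(int newConn, int fileFd, size_t fileLen,
                        char *headerBuffer, size_t headerLen)
{
    off_t off = 0;
    write(newConn, headerBuffer, headerLen);   /* headers via ordinary write() */
    sendfile(newConn, fileFd, &off, fileLen);  /* file bytes never cross into user space */
    close(newConn);                            /* still an explicit close() */
}

(FreeBSD's sendfile() does accept header/trailer vectors, and Windows' TransmitFile() has a disconnect option, much as sketched on this slide.)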

Web Servers: Erich Nahum 29


An Accelerated Server Example
acceptex(socket, newConn, reqBuffer, remoteHostInfo);
httpInfo = cacheLookup(reqBuffer);
sendfile(newConn, httpInfo->headers,
httpInfo->fileDescriptor, OPT_CLOSE_WHEN_DONE);
write(logFile, requestInfo);

acceptex() is called
gets new socket, request, remote host IP address
string match in hash table is done to parse request
hash table entry contains relevant meta-data, including modification times, file
descriptors, permissions, etc.
sendfile() is called
pre-computed header, file descriptor, and close option
log written back asynchronously (buffered write()).

That's it!

Web Servers: Erich Nahum 30


Complications

Custom APIs have problems:


Additional test coverage is required
API may not be sufficiently general to be worth it
Take, for example, acceptex():
Work has shown it doesn't make a big difference
Some server applications write before reading (e.g., FTP)
Result is no OSs outside of MS have it
Counter-example is sendfile():
Useful for many types of servers (Web, FTP, SMB, NFS)
Result is available on virtually every OS

Web Servers: Erich Nahum 31


Complications (cont)

Much of this assumes sharing is easy:


but, this is dependent on the server architectural model
if multiple processes are being used, as in Apache, it is
difficult to share data structures.
Take, for example, mmap():
mmap() maps a file into the address space of a process.
a file mmap'ed in one address space can't be re-used for
a request for the same file served by another process.
Apache 1.3 does use mmap() instead of read().
in this case, mmap() eliminates one data copy versus a
separate read() & write() combination, but process will
still need to open() and close() the file.
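
A sketch of the mmap()-based path described above, assuming the file has already been opened and stat()'ed for its length; this stands in for Apache 1.3's approach rather than reproducing its code:

#include <unistd.h>
#include <sys/mman.h>

void sendFileBody(int newSock, int fd, size_t fileLen)
{
    /* map the file read-only; no read() into a user buffer needed */
    void *data = mmap(NULL, fileLen, PROT_READ, MAP_PRIVATE, fd, 0);
    write(newSock, data, fileLen);  /* one copy (to socket) instead of two */
    munmap(data, fileLen);
}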

Web Servers: Erich Nahum 32


Complications (cont)
Similarly, meta-data info needs to be shared:
e.g., file size, access permissions, last modified time, etc.
While locality is high, cache misses can and do happen
sometimes:
if previously unseen file requested, process can block waiting for
disk.
OS can impose other restrictions:
e.g., limits on number of open file descriptors.
e.g., sockets typically allow buffering about 64 KB of data. If a
process tries to write() a 1 MB file, it will block until other end
receives the data.
Need to be able to cope with the misses without slowing
down the hits

Web Servers: Erich Nahum 33


Summary: Outline of a Typical
HTTP Transaction
A server can perform many steps in the process of
servicing a request
Different actions depending on many factors:
e.g., 304 not modified if client's cached copy is good
e.g., 404 not found, 401 unauthorized
Most requests are for small subset of data:
we'll see more about this in the Workload section
we can leverage that fact for performance
Architectural model affects possible optimizations
we'll go into this in more detail in the next section

Web Servers: Erich Nahum 34


Chapter 3:
Server Architectural Models

Web Servers: Erich Nahum 35


Server Architectural Models

Several approaches to server structure:


Process based: Apache, NCSA
Thread-based: JAWS, IIS
Event-based: Flash, Zeus
Kernel-based: Tux, AFPA, ExoKernel

We will describe the advantages and disadvantages of each.
Fundamental tradeoffs exist between performance, protection, sharing, robustness, extensibility, etc.

Web Servers: Erich Nahum 36


Process Model (ex: Apache)

Process created to handle each new request:


Process can block on appropriate actions,
(e.g., socket read, file read, socket write)
Concurrency handled via multiple processes
Quickly becomes unwieldy:
Process creation is expensive.
Instead, pre-forked pool is created.
Upper limit on # of processes is enforced
First by the server, eventually by the operating system.
Concurrency is limited by upper bound

Web Servers: Erich Nahum 37


Process Model: Pros and Cons

Advantages:
Most importantly, consistent with programmer's way of
thinking. Most programmers think in terms of linear series of
steps to accomplish task.
Processes are protected from one another; can't nuke data in
some other address space. Similarly, if one crashes, others
unaffected.
Disadvantages:
Slow. Forking is expensive, allocating stack, VM data structures
for each process adds up and puts pressure on the memory
system.
Difficulty in sharing info across processes.
Have to use locking.
No control over scheduling decisions.

Web Servers: Erich Nahum 38


Thread Model (Ex: JAWS)

Use threads instead of processes. Threads


consume fewer resources than processes (e.g.,
stack, VM allocation).
Forking and deleting threads is cheaper than
processes.
Similarly, pre-forked thread pool is created. May
be limits to numbers but hopefully less of an issue
than with processes since fewer resources
required.

Web Servers: Erich Nahum 39


Thread Model: Pros and Cons
Advantages:
Faster than processes. Creating/destroying cheaper.
Maintains programmer's way of thinking.
Sharing is enabled by default.
Disadvantages:
Less robust. Threads not protected from each other.
Requires proper OS support, otherwise, if one thread blocks
on a file read, will block all the address space.
Can still run out of threads if servicing many clients
concurrently.
Can exhaust certain per-process limits not encountered with
processes (e.g., number of open file descriptors).
Limited or no control over scheduling decisions.

Web Servers: Erich Nahum 40


Event Model (Ex: Flash)
while (1) {
accept new connections until none remaining;
call select() on all active file descriptors;
for each FD:
if (fd ready for reading) call read();
if (fd ready for writing) call write();
}

Use a single process and deal with requests in an event-driven manner, like a giant switchboard.
Use non-blocking option (O_NDELAY) on sockets, do
everything asynchronously, never block on anything,
and have OS notify us when something is ready.
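
A sketch of the event loop this describes, using select(); the conns[] array and the accept/handle helpers are illustrative stand-ins for a server's per-connection state machine, not Flash's actual code:

#include <sys/select.h>

struct conn { int fd; int wantRead; int wantWrite; };
extern struct conn conns[];
extern int numConns, maxFd, listenSock;
void acceptNewConnections(void);
void handleRead(struct conn *c);
void handleWrite(struct conn *c);

void eventLoop(void)
{
    fd_set rfds, wfds;
    int i;
    for (;;) {
        FD_ZERO(&rfds); FD_ZERO(&wfds);
        FD_SET(listenSock, &rfds);
        for (i = 0; i < numConns; i++) {
            if (conns[i].wantRead)  FD_SET(conns[i].fd, &rfds);
            if (conns[i].wantWrite) FD_SET(conns[i].fd, &wfds);
        }
        select(maxFd + 1, &rfds, &wfds, NULL, NULL);  /* block until something is ready */
        if (FD_ISSET(listenSock, &rfds))
            acceptNewConnections();                   /* accept until none remaining */
        for (i = 0; i < numConns; i++) {
            if (FD_ISSET(conns[i].fd, &rfds)) handleRead(&conns[i]);
            if (FD_ISSET(conns[i].fd, &wfds)) handleWrite(&conns[i]);
        }
    }
}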

Web Servers: Erich Nahum 41


Event-Driven: Pros and Cons
Advantages:
Very fast.
Sharing is inherent, since there's only one process.
Don't even need locks as in thread models.
Can maximize concurrency in request stream easily.
No context-switch costs or extra memory consumption.
Complete control over scheduling decisions.

Disadvantages:
Less robust. Failure can halt whole server.
Pushes per-process resource limits (like file descriptors).
Not every OS has full asynchronous I/O, so can still block on a file
read. Flash uses helper processes to deal with this (AMPED
architecture).

Web Servers: Erich Nahum 42


In-Kernel Model (Ex: Tux)

[Diagram: user-space vs. kernel-space server stacks. User-space server: HTTP sits above the socket layer, with the user/kernel boundary between them and TCP/IP/Ethernet below. Kernel-space server: HTTP sits below the user/kernel boundary, directly on top of TCP/IP/Ethernet.]

Dedicated kernel thread for HTTP requests:


One option: put whole server in kernel.
More likely, just deal with static GET requests in kernel to
capture majority of requests.
Punt dynamic requests to full-scale server in user space,
such as Apache.

Web Servers: Erich Nahum 43


In-Kernel Model: Pros and Cons

In-Kernel Event Model:


Avoids transitions to user space, copies across u-k boundary, etc.
Leverages already existing asynchronous primitives in the kernel
(kernel doesn't block on a file read, etc.).
Advantages:
Extremely fast. Tight integration with kernel.
Small component without full server optimizes common case.
Disadvantages:
Less robust. Bugs can crash whole machine, not just server.
Harder to debug and extend, since kernel programming required,
which is not as well-known as sockets.
Similarly, harder to deploy. APIs are OS-specific (Linux, BSD, NT),
whereas sockets & threads are (mostly) standardized.
HTTP evolving over time, have to modify kernel code in response.

Web Servers: Erich Nahum 44


So What's the Performance?

Graph shows server throughput for Tux, Flash, and Apache.


Experiments done on 400 MHz P/II, gigabit Ethernet, Linux 2.4.16, 8 client machines, WaspClient
workload generator
Tux is fastest, but Flash close behind

Web Servers: Erich Nahum 45


Summary: Server Architectures
Many ways to code up a server
Tradeoffs in speed, safety, robustness, ease of programming and
extensibility, etc.
Multiple servers exist for each kind of model
Not clear that a consensus exists.
Better case for in-kernel servers as devices
e.g. reverse proxy accelerator, Akamai CDN node
User-space servers have a role:
OS should provide proper primitives for efficiency
Leave HTTP-protocol related actions in user-space
In this case, event-driven model is attractive
Key pieces to a fast event-driven server:
Minimize copying
Efficient event notification mechanism

Web Servers: Erich Nahum 46


Chapter 4: Event Notification

Web Servers: Erich Nahum 47


Event Notification Mechanisms
Recall how Flash works:
One process, many FD's, calling select() on all active socket
descriptors.
All sockets are set using O_NDELAY flag (non-blocking)
Single address space aids sharing for performance
File reads and writes don't have non-blocking support, thus helper
processes (AMPED architecture)
Point is to exploit concurrency/parallelism:
Can read one socket while waiting to write on another
Event notification:
Mechanism for kernel and application to notify each other of
interesting/important events
E.g., connection arrivals, socket closes, data available to read, space
available for writing

Web Servers: Erich Nahum 48


State-Based: Select & Poll
select() and poll():
State-based: Is socket ready for reading/writing?
select() interface has FD_SET bitmasks turned on/off based on
interest
poll() uses a simple array: a larger structure, but a simpler implementation
Performance costs:
Kernel scans O(N) descriptors to set bits
User application scans O(N) descriptors
select() bit manipulation can be expensive
Problems:
Traffic is bursty, connections not active all at once
# (active connections) << # (open connections).
Costs are O(total connections), not O(active connections)
Application keeps specifying interest set repeatedly

Web Servers: Erich Nahum 49


Event-Based Notification
Banga, Mogul & Druschel (USENIX 99)
Propose an event based approach, rather than state-
based:
"Something just happened on socket X", rather than "socket X is ready for reading or writing"
Server takes event as indication socket might be ready
Multiple events can happen on a single socket (e.g., packets
draining (implying writeable) or accumulating (readable))
API has following:
Application notifies kernel by calling declare_interest() once
per file descriptor (e.g., after accept()), rather than multiple
times like in select()/poll()
Kernel queues events internally
Application calls get_next_event() to see changes

Web Servers: Erich Nahum 50


Event-Based Notification (cont)
Problems:
Kernel has to allocate storage for event queue. Little's law says it
needs to be proportional to the event rate
Bursty applications could overflow queue
Can address multiple events by coalescing based on FD
Results in storage O(total connections).
Application has to change the way it thinks:
Respond to events, instead of checking state.
If events are missed, connections might get stuck.
Evaluation shows it scales nicely:
cost is O(active) not O(total)
Windows NT has something similar:
called IO completion ports

Web Servers: Erich Nahum 51


Notification in the Real World
POSIX Real-Time Signals:
Different concept: Unix signals are invoked when something is
ready on a file descriptor.
Signals are expensive and difficult to control (e.g., no ordering),
so applications can suppress signals and then retrieve them via
sigwaitinfo()
If signal queue fills up, events will be dropped. A separate signal
is raised to notify application about signal queue overflow.
Problems:
If signal queue overflows, then app must fall back on state-
based approach. Chandra and Mosberger propose signal-per-fd
(coalescing events per file descriptor).
Only one event is retrieved at a time: Provos and Lever propose
sigtimedwait4() to retrieve multiple signals at once

Web Servers: Erich Nahum 52


Notification in the Real World
Sun's /dev/poll:
App notifies kernel by writing to special file /dev/poll to
express interest
App does IOCTL on /dev/poll for list of ready FD's
App and kernel are still both state based
Kernel still pays O(total connections) to create FD list
Libenzi's /dev/epoll (patch for Linux 2.4):
Uses /dev/epoll as interface, rather than /dev/poll
Application writes interest to /dev/epoll and IOCTL's to get
events
Events are coalesced on a per-FD basis
Semantically identical to RT signals with sig-per-fd &
sigtimedwait4().

Web Servers: Erich Nahum 53


Real File Asynchronous I/O

Like setting O_NDELAY (non-blocking) on file descriptors:
Application can queue reads and writes on FDs and pick them up
later (like dry cleaning)
Requires support in the file system (e.g., callbacks)
Currently doesn't exist on many OS's:
POSIX specification exists
Solaris has non-standard version
Linux has it slated for 2.5 kernel
Two current candidates on Linux:
SGI's /dev/kaio and Ben LaHaise's /dev/aio
Proper implementation would allow Flash to eliminate
helpers
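
A sketch of the POSIX specification mentioned above (aio.h), assuming fd and buffer are already set up; a real server would take a completion signal rather than polling as shown:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

ssize_t asyncRead(int fd, char *buffer, size_t len)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buffer;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    aio_read(&cb);                        /* queue the read; returns immediately */
    /* ... do other useful work here ... */
    while (aio_error(&cb) == EINPROGRESS)
        ;                                 /* busy-wait only for illustration */
    return aio_return(&cb);               /* pick up the result, like dry cleaning */
}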

Web Servers: Erich Nahum 54


Summary: Event Notification
Goal is to exploit concurrency
Concurrency in user workloads means host CPU can overlap
multiple events to maximize parallelism
Keep network, disk busy; never block
Event notification changes applications:
state-based to event-based
requires a change in thinking
Goal is to minimize costs:
user/kernel crossings and testing idle socket descriptors
Event-based notification not yet fully deployed:
Most mechanisms only support network I/O, not file I/O
Full deployment of Asynchronous I/O spec should fix this

Web Servers: Erich Nahum 55


Chapter 5:
Workload Characterization

Web Servers: Erich Nahum 56


Workload Characterization

Why Characterize Workloads?


Gives an idea about traffic behavior
("Which documents are users interested in?")
Aids in capacity planning
("Is the number of clients increasing over time?")
Aids in implementation
("Does caching help?")

How do we capture them ?


Through server logs (typically enabled)
Through packet traces (harder to obtain and to process)

Web Servers: Erich Nahum 57


Factors to Consider

[Diagram: possible trace collection points — client? proxy? server?]

Where do I get logs from?


Client logs give us an idea, but not necessarily the same
Same for proxy logs
What we care about is the workload at the server
Is trace representative?
Corporate POP vs. News vs. Shopping site
What kind of time resolution?
e.g., second, millisecond, microsecond
Does trace/log capture all the traffic?
e.g., incoming link only, or one node out of a cluster
Web Servers: Erich Nahum 58
Probability Refresher
Lots of variability in workloads
Use probability distributions to express
Want to consider many factors

Some terminology/jargon:
Mean: average of samples
Median : half are bigger, half are smaller
Percentiles: dump samples into N bins
(median is 50th percentile number)

Heavy-tailed:
Pr[X > x] ~ c x^{-a}  as x → ∞

Web Servers: Erich Nahum 59


Important Distributions
Some Frequently-Seen Distributions:

Normal (mean μ, variance σ²):
f(x) = e^{-(x-μ)²/(2σ²)} / (σ √(2π))

Lognormal (x ≥ 0; σ > 0):
f(x) = e^{-(ln(x)-μ)²/(2σ²)} / (x σ √(2π))

Exponential (x ≥ 0; rate λ):
f(x) = λ e^{-λx}

Pareto (x ≥ k; shape a, scale k):
f(x) = a k^a / x^{a+1}

Web Servers: Erich Nahum 60


More Probability

Graph shows 3 distributions with average = 2.
Note average ≠ median in all cases!
Different distributions have different weight in the tail.

Web Servers: Erich Nahum 61


What Info is Useful?

Request methods
GET, POST, HEAD, etc.
Response codes
success, failure, not-modified, etc.
Size of requested files
Size of transferred objects
Popularity of requested files
Numbers of embedded objects
Inter-arrival time between requests
Protocol support (1.0 vs. 1.1)

Web Servers: Erich Nahum 62


Sample Logs for Illustration
Name:         Chess 1997    Olympics 1998   IBM 1998     IBM 2001
Description:  Kasparov-     Nagano 1998     Corporate    Corporate
              Deep Blue     Olympics        Presence     Presence
              Event Site    Event Site
Period:       2 weeks in    2 days in       1 day in     1 day in
              May 1997      Feb 1998        June 1998    Feb 2001
Hits:         1,586,667     11,485,600      5,800,000    12,445,739
Bytes:        14,171,711    54,697,108      10,515,507   28,804,852
Clients:      256,382       86,021          80,921       319,698
URLs:         2,293         15,788          30,465       42,874

We'll use statistics generated from these logs as examples.

Web Servers: Erich Nahum 63


Request Methods

          Chess 1997   Olympics 1998   IBM 1998    IBM 2001
GET       96%          99.6%           99.3%       97%
HEAD      04%          00.3%           00.08%      02%
POST      00.007%      00.04%          00.02%      00.2%
Others:   noise        noise           noise       noise

KR01: "overwhelming majority" are GETs, few POSTs
IBM2001 trace starts seeing a few 1.1 methods (CONNECT, OPTIONS, LINK), but still very small (10^-5 %)

Web Servers: Erich Nahum 64


Response Codes
Code Meaning Chess Olympics IBM IBM
1997 1998 1998 2001
200 OK 85.32 76.02 75.28 67.72
204 NO_CONTENT --.-- --.-- 00.00001 --.--
206 PARTIAL_CONTENT 00.25 --.-- --.-- --.--
301 MOVED_PERMANENTLY 00.05 --.-- --.-- --.--
302 MOVED_TEMPORARILY 00.05 00.05 01.18 15.11
304 NOT_MODIFIED 13.73 23.24 22.84 16.26
400 BAD_REQUEST 00.001 00.0001 00.003 00.001
401 UNAUTHORIZED --.- 00.001 00.0001 00.001
403 FORBIDDEN 00.01 00.02 00.01 00.009
404 NOT_FOUND 00.55 00.64 00.65 00.79
407 PROXY_AUTH --.-- --.-- --.-- 00.002
500 SERVER_ERROR --.-- 00.003 00.006 00.07
501 NOT_IMPLEMENTED --.-- 00.0001 00.0005 00.006
503 SERVICE_UNAVAIL --.-- --.-- 00.0001 00.0003
??? UNKNOWN 00.0003 00.00004 00.005 00.0004

Table shows percentage of responses.


Majority are OK and NOT_MODIFIED.
Consistent with numbers from AW96, KR01.

Web Servers: Erich Nahum 65


Resource (File) Sizes

Shows file/memory usage (not weighted by frequency!)


Lognormal body, consistent with results from AW96, CB96, KR01.
AW96, CB96: sizes have Pareto tail; Downey01: Sizes are lognormal.

Web Servers: Erich Nahum 66


Tails from the File Size

Shows the complementary CDF (CCDF) of file sizes.


Haven't done the curve fitting, but it looks Pareto-ish.

Web Servers: Erich Nahum 67


Response (Transfer) Sizes

Shows network usage (weighted by frequency of requests)


Lognormal body, pareto tail, consistent with CBC95,
AW96, CB96, KR01

Web Servers: Erich Nahum 68


Tails of Transfer Size

Shows the complementary CDF (CCDF) of transfer sizes.
Looks somewhat Pareto-like; certainly some big transfers.

Web Servers: Erich Nahum 69


Resource Popularity

Follows a Zipf model: p(r) ~ r^{-alpha}
(alpha = 1: true Zipf; other values: "Zipf-like")
Consistent with CBC95, AW96, CB96, PQ00, KR01
Shows that caching popular documents is very effective
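
A sketch of how one might sample document ranks from such a Zipf-like distribution (e.g., inside a workload generator); N and alpha are illustrative parameters, not values from these logs:

#include <math.h>
#include <stdlib.h>

#define N 10000                 /* illustrative number of documents */
static double cdf[N];

void buildZipfCDF(double alpha)
{
    double norm = 0.0, sum = 0.0;
    int r;
    for (r = 1; r <= N; r++) norm += pow(r, -alpha);
    for (r = 1; r <= N; r++) {
        sum += pow(r, -alpha) / norm;
        cdf[r - 1] = sum;       /* cumulative popularity by rank */
    }
}

int pickDocument(void)          /* returns a rank; 0 = most popular */
{
    double u = drand48();
    int r = 0;
    while (r < N - 1 && cdf[r] < u) r++;  /* linear scan for clarity */
    return r;
}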

Web Servers: Erich Nahum 70


Number of Embedded Objects

Mah97: avg 3, 90% are 5 or less


BC98: pareto distr, median 0.8, mean 1.7
Arlitt98 World Cup study: median 15 objects, 90%
are 20 or less
MW00: median 7-17, mean 11-18, 90% 40 or less
STA00: median 5,30 (2 traces), 90% 50 or less
Mah97, BC98, SCJO01: embedded objects tend to
be smaller than container objects
KR01: median is 8-20, pareto distribution

Trend seems to be that number is increasing over time.

Web Servers: Erich Nahum 71


Session Inter-Arrivals

Inter-arrival time between successive requests


"Think time"
difference between user requests vs. ALL requests
partly depends on definition of boundary
CB96: variability across multiple timescales, "self-
similarity", average load very different from peak
or heavy load
SCJO01: log-normal, 90% less than 1 minute.
AW96: independent and exponentially distributed
KR01: pareto with a=1.5, session arrivals follow
poisson distribution, but requests follow pareto

Web Servers: Erich Nahum 72


Protocol Support

IBM.com 2001 logs:


Show roughly 53% of client requests are 1.1
KA01 study:
92% of servers claim to support 1.1 (as of Sep 00)
Only 31% actually do; most fail to comply with spec
SCJO01 show:
Avg 6.5 requests per persistent connection
65% have 2 connections per page, rest more.
40-50% of objects downloaded by persistent connections

Appears that we are in the middle of a slow transition to 1.1

Web Servers: Erich Nahum 73


Summary: Workload
Characterization

Traffic is variable:
Responses vary across multiple orders of magnitude
Traffic is bursty:
Peak loads much larger than average loads
Certain files more popular than others
Zipf-like distribution captures this well
Two-sided aspect of transfers:
Most responses are small (zero pretty common)
Most of the bytes are from large transfers
Controversy over Pareto/log-normal distribution
Non-trivial for workload generators to replicate

Web Servers: Erich Nahum 74


Chapter 6: Workload Generators

Web Servers: Erich Nahum 75


Why Workload Generators?
Allows stress-testing and bug-finding
Gives us some idea of server capacity
Allows us a scientific process to compare approaches
(e.g., server models, gigabit adaptors, OS implementations)
Assumption is that difference in testbed translates to some difference in real-world
Allows the performance debugging cycle

[Diagram: the Performance Debugging Cycle — measure, reproduce, find problem, fix and/or improve, and repeat]

Web Servers: Erich Nahum 76


Problems with Workload
Generators

Only as good as our understanding of the traffic


Traffic may change over time
generators must too
May not be representative
e.g., are file size distributions from IBM.com similar to mine?
May be ignoring important factors
e.g., browser behavior, WAN conditions, modem connectivity
Still, useful for diagnosing and treating problems

Web Servers: Erich Nahum 77


How does W. Generation Work?
Many clients, one server
match asymmetry of Internet
Server is populated with some kind
of synthetic content
Simulated clients produce requests
for server
Master process to control clients,
aggregate results
Goal is to measure server
not the client or network
[Diagram: many simulated clients send requests to the server under test and receive responses]
Must be robust to conditions
e.g., if server keeps sending 404 not
found, will clients notice?

Web Servers: Erich Nahum 78


Evolution: WebStone
The original workload generator from SGI in 1995
Process based workload generator, implemented in C
Clients talk to master via sockets
Configurable: # client machines, # client processes, run time
Measured several metrics: avg + max connect time, response
time, throughput rate (bits/sec), # pages, # files
1.0 only does GETs, CGI support added in 2.0
Static requests, 5 different file sizes:
Percentage   Size
35.00        500 B
50.00        5 KB
14.00        50 KB
00.90        500 KB
00.10        5 MB

www.mindcraft.com/webstone

Web Servers: Erich Nahum 79


Evolution: SPECWeb96
Developed by SPEC
Systems Performance Evaluation Consortium
Non-profit group with many benchmarks (CPU, FS)
Attempt to get more representative
Based on logs from NCSA, HP, Hal Computers
4 classes of files:

Percentage   Size
35.00        0-1 KB
50.00        1-10 KB
14.00        10-100 KB
01.00        100 KB-1 MB

Poisson distribution within each class

Web Servers: Erich Nahum 80


SPECWeb96 (cont)

Notion of scaling versus load:


number of directories in data set doubles as expected throughput quadruples: directories = sqrt(throughput/5) * 10 (e.g., at 500 ops/sec, sqrt(100) * 10 = 100 directories)
requests spread evenly across all application directories
Process based WG
Clients talk to master via RPC's (less robust)
Still only does GETS, no keep-alive

www.spec.org/osg/web96

Web Servers: Erich Nahum 81


Evolution: SURGE
Scalable URL Reference GEnerator
Barford & Crovella at Boston University CS Dept.
Much more worried about representativeness,
captures:
server file size distributions,
request size distribution,
relative file popularity
embedded file references
temporal locality of reference
idle periods ("think times") of users
Process/thread based WG

Web Servers: Erich Nahum 82


SURGE (cont)

Notion of user-equivalent:
statistical model of a user
active off time (between URLS),
inactive off time (between pages)
Captures various levels of burstiness
Not validated, shows that load generated is
different than SpecWeb96 and has more
burstiness in terms of CPU and # active
connections

www.cs.wisc.edu/~pb

Web Servers: Erich Nahum 83


Evolution: S-client
Almost all workload generators are closed-loop:
client submits a request, waits for server, maybe thinks for some
time, repeat as necessary
Problem with the closed-loop approach:
client can't generate requests faster than the server can respond
limits the generated load to the capacity of the server
in the real world, arrivals don't depend on server state
i.e., real users have no idea about load on the server when they click on a site, although successive clicks may have this property
in particular, can't overload the server
s-client tries to be open-loop:
by generating connections at a particular rate
independent of server load/capacity

Web Servers: Erich Nahum 84


S-Client (cont)
How is s-client open-loop?
connecting asynchronously at a particular rate
using non-blocking connect() socket call
Connect complete within a particular time?
if yes, continue normally.
if not, socket is closed and new connect initiated.
Other details:
uses single-address space event-driven model like Flash
calls select() on large numbers of file descriptors
can generate large loads
Problems:
client capacity is still limited by active FD's
arrival is a TCP connect, not an HTTP request

www.cs.rice.edu/CS/Systems/Web-measurement
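
A sketch of the rate-paced, non-blocking connect at the heart of s-client; the connectTimedOut() and startNewConnect() helpers are hypothetical stand-ins for its timer bookkeeping:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int openConnect(struct sockaddr_in *srv)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(s, F_SETFL, O_NONBLOCK);              /* never block in connect() */
    if (connect(s, (struct sockaddr *)srv, sizeof(*srv)) < 0 &&
        errno != EINPROGRESS) {
        close(s);                               /* immediate failure */
        return -1;
    }
    return s;  /* completion detected later, e.g., via select() writability */
}

/* later, when this connection's deadline expires:
   if (connectTimedOut(s)) { close(s); startNewConnect(); }   */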

Web Servers: Erich Nahum 85


Evolution: SPECWeb99
In response to people "gaming" benchmark, now includes rules:
IP maximum segment lifetime (MSL) must be at least 60 seconds
(more on this later!)
Link-layer maximum transmission unit (MTU) must not be larger than
1460 bytes (Ethernet frame size)
Dynamic content may not be cached
not clear that this is followed
Servers must log requests.
W3C common log format is sufficient but not mandatory.
Resulting workload must be within 10% of target.
Error rate must be below 1%.
Metric has changed:
now "number of simultaneous conforming connections": rate of a connection must be greater than 320 Kbps

Web Servers: Erich Nahum 86


SPECWeb99 (cont)
Directory size has changed:
(25 + (400000/122000) * simultaneous conns) / 5.0
Improved HTTP 1.0/1.1 support:
Keep-alive requests (client closes after N requests)
Cookies
Back-end notion of user demographics
Used for ad rotation
Request includes user_id and last_ad
Request breakdown:
70.00 % static GET
12.45 % dynamic GET
12.60 % dynamic GET with custom ad rotation
04.80 % dynamic POST
00.15 % dynamic GET calling CGI code

Web Servers: Erich Nahum 87


SPECWeb99 (cont)
Other breakdowns:
30 % HTTP 1.0 with no keep-alive or persistence
70 % HTTP 1.0 with keep-alive to "model" persistence
still has 4 classes of file size with Poisson distribution
supports Zipf popularity
Client implementation details:
Master-client communication now uses sockets
Code includes sample Perl code for CGI
Client configurable to use threads or processes
Much more info on setup, debugging, tuning
All results posted to web page,
including configuration & back end code

www.spec.org/osg/web99

Web Servers: Erich Nahum 88


So how realistic is SPECWeb99?

We'll compare a few characteristics:
File size distribution (body)
File size distribution (tail)
Transfer size distribution (body)
Transfer size distribution (tail)
Document popularity
Visual comparison only
No curve-fitting, r-squared plots, etc.
Point is to give a feel for accuracy

Web Servers: Erich Nahum 89


SpecWeb99 vs. File Sizes

SpecWeb99: In the ballpark, but not very smooth

Web Servers: Erich Nahum 90


SpecWeb99 vs. File Size Tail

SpecWeb99 tail isn't as long as real logs (900 KB max)

Web Servers: Erich Nahum 91


SpecWeb99 vs.Transfer Sizes

Doesn't capture 304 (not modified) responses
Coarser distribution than real logs (i.e., not smooth)

Web Servers: Erich Nahum 92


Spec99 vs.Transfer Size Tails

SpecWeb99 does OK, although tail drops off rapidly (and in fact, no file is greater than 1 MB in SpecWeb99!).

Web Servers: Erich Nahum 93


Spec99 vs. Resource Popularity

SpecWeb99 seems to do a good job, although the tail isn't long enough

Web Servers: Erich Nahum 94


Evolution: TPC-W
Transaction Processing Performance Council (TPC)
More known for database workloads like TPC-D
Metrics include dollars/transaction (unlike SPEC)
Provides specification, not source
Meant to capture a large e-commerce site
Models online bookstore
web serving, searching, browsing, shopping carts
online transaction processing (OLTP)
decision support (DSS)
secure purchasing (SSL), best sellers, new products
customer registration, administrative updates
Has notion of scaling per user
5 MB of DB tables per user
1 KB per shopping item, 25 KB per item in static images

Web Servers: Erich Nahum 95


TPC-W (cont)
Remote browser emulator (RBE)
emulates a single user
send HTTP request, parse response, think for a while, repeat
Metrics:
WIPS: shopping
WIPSb: browsing
WIPSo: ordering
Setups tend to be very large:
multiple image servers, application servers, load balancer
DB back end (typically SMP)
Example: IBM 12-way SMP w/DB2, 9 PCs w/IIS: $1M

www.tpc.org/tpcw

Web Servers: Erich Nahum 96


Summary: Workload Generators
Only the beginning. Many other workload generators:
httperf from HP
WAGON from IBM
WaspClient from IBM
Others?
Both workloads and generators change over time:
Both started simple, got more complex
As workload changes, so must generators
No one single "good" generator
SpecWeb99 seems the favorite (2002 rumored in the works)
Implementation issues similar to servers:
They are networked-based request producers
(i.e., produce GET's instead of 200 OK's).
Implementation affects capacity planning of clients!
(want to make sure clients are not bottleneck)

Web Servers: Erich Nahum 97


Chapter 7: Introduction to TCP

Web Servers: Erich Nahum 98


Introduction to TCP
Layering is a common principle in network protocol design
TCP is the major transport protocol in the Internet
Since HTTP runs on top of TCP, much interaction between the two
Asymmetry in client-server model puts strain on server-side TCP implementations
Thus, major issue in web servers is TCP implementation and behavior

[Diagram: the protocol stack — application, transport, network, link, physical]

Web Servers: Erich Nahum 99


The TCP Protocol

Connection-oriented, point-to-point protocol:


Connection establishment and teardown phases
Phone-like circuit abstraction
One sender, one receiver
Originally optimized for certain kinds of transfer:
Telnet (interactive remote login)
FTP (long, slow transfers)
Web is like neither of these
Lots of work on TCP, beyond scope of this tutorial
e.g., know of 3 separate TCP tutorials!

Web Servers: Erich Nahum 100


TCP Protocol (cont)
[Diagram: the sending application writes data into the TCP send buffer through the socket layer; data segments flow to the receiver's TCP receive buffer, where the application reads them; ACK segments flow back to the sender]

Provides a reliable, in-order, byte stream abstraction:


Recover lost packets and detect/drop duplicates
Detect and drop bad packets
Preserve order in byte stream, no message boundaries
Full-duplex: bi-directional data flow in same connection
Flow and congestion controlled:
Flow control: sender will not overwhelm receiver
Congestion control: sender will not overwhelm network!
Send and receive buffers
Congestion and flow control windows

Web Servers: Erich Nahum 101


The TCP Header
[Diagram: TCP header layout, 32 bits wide — source port #, dest port #, sequence number, acknowledgement number, header length, flags (URG, ACK, PSH, RST, SYN, FIN), receiver window size, checksum, urgent data pointer, options (variable length), application data (variable length)]

Fields enable the following:
Uniquely identifying a connection (4-tuple of client/server IP address and port numbers)
Identifying a byte range within that connection
Checksum value to detect corruption
Identifying protocol transitions (SYN, FIN)
Informing other side of your state (ACK)

Web Servers: Erich Nahum 102


Establishing a TCP Connection
Client sends SYN with initial sequence number (ISN)
Server responds with its own SYN w/seq number and ACK of client (ISN+1) (next expected byte)
Client ACKs server's ISN+1
The 3-way handshake
All modulo 32-bit arithmetic

[Diagram: client calls connect() and sends SYN(X); the server, having called listen() on port 80, replies SYN(Y) + ACK(X+1); the client replies ACK(Y+1), and the server's accept() and read() proceed]

Web Servers: Erich Nahum 103


Sending Data
[Diagram: same sender/receiver buffer picture as before — data segments one way, ACK segments back]

Sender puts data on the wire:
Holds copy in case of loss
Sender must observe receiver flow control window
Sender can discard data when ACK is received
Receiver sends acknowledgments (ACKs)
ACKs can be piggybacked on data going the other way
Protocol says receiver should ACK every other packet in attempt to reduce ACK traffic (delayed ACKs)
Delay should not be more than 500 ms. (typically 200)
We'll see how this causes problems later

Web Servers: Erich Nahum 104


Preventing Congestion
Sender may not only overrun receiver, but may also overrun
intermediate routers:
No way to explicitly know router buffer occupancy,
so we need to infer it from packet losses
Assumption is that losses stem from congestion, namely, that
intermediate routers have no available buffers
Sender maintains a congestion window:
Never have more than CW of un-acknowledged data outstanding (or
RWIN data; min of the two)
Successive ACKs from receiver cause CW to grow.
How CW grows based on which of 2 phases:
Slow-start: initial state.
Congestion avoidance: steady-state.
Switch between the two when CW > slow-start threshold

Web Servers: Erich Nahum 105


Congestion Control Principles

Lack of congestion control would lead to congestion


collapse (Jacobson 88).
Idea is to be a good network citizen.
Would like to transmit as fast as possible without loss.
Probe network to find available bandwidth.
In steady-state: linear increase in CW per RTT.
After loss event: CW is halved.
This is called additive increase /multiplicative decrease
(AIMD).
Various papers on why AIMD leads to network stability.

Web Servers: Erich Nahum 106


Slow Start
Initial CW = 1.
After each ACK, CW += 1;
Continue until:
loss occurs OR
CW > slow start threshold
Then switch to congestion avoidance
If we detect loss, cut CW in half
Exponential increase in window size per RTT (1, 2, 4, 8, ...)

[Diagram: sender transmits one segment; after its ACK (one RTT) it sends two segments; after those ACKs, four segments, doubling each RTT]

Web Servers: Erich Nahum 107


Congestion Avoidance

Until (loss) {
    after CW packets ACKed:
        CW += 1;
}
ssthresh = CW/2;
Depending on loss type:
    SACK/Fast Retransmit: CW /= 2; continue;
    Coarse-grained timeout: CW = 1; go to slow start.

(This is for TCP Reno/SACK; TCP Tahoe always sets CW=1 after a loss)

Web Servers: Erich Nahum 108


How are losses recovered?
Say packet is lost (data or ACK!)
Coarse-grained Timeout:
Sender does not receive ACK after some period of time
Event is called a retransmission timeout (RTO)
RTO value is based on estimated round-trip time (RTT)
RTT is adjusted over time using an exponential weighted moving average:
RTT = (1-x)*RTT + (x)*sample
(x is typically 0.1)
First done in TCP Tahoe

[Diagram: lost ACK scenario — sender transmits Seq=92, 8 bytes of data; the receiver's ACK=100 is lost; the sender's timeout fires and it retransmits Seq=92, which is then ACKed]
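
The estimator above in a few lines of C; the gain x = 0.1 follows the slide, while the initial estimate is an illustrative assumption:

/* exponentially weighted moving average of RTT samples */
static double estRTT = 0.5;            /* illustrative starting estimate, seconds */

void onRTTSample(double sample)
{
    const double x = 0.1;
    estRTT = (1.0 - x) * estRTT + x * sample;
    /* the RTO is then derived from estRTT (and, post-Jacobson, its variance) */
}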

Web Servers: Erich Nahum 109


Fast Retransmit
Receiver expects N, gets N+1:
Immediately sends ACK(N)
This is called a duplicate ACK
Does NOT delay ACKs here!
Continue sending dup ACKs for each subsequent packet (not N)
Sender gets 3 duplicate ACKs:
Infers N is lost and resends
3 chosen so out-of-order packets don't trigger Fast Retransmit accidentally
Called fast since we don't need to wait for a full RTT
Introduced in TCP Reno

[Diagram: receiver ACKs 3000; sender sends SEQ=3000 (lost), then 4000, 5000, 6000; each of the latter three triggers a duplicate ACK 3000; on the third duplicate the sender retransmits SEQ=3000]

Web Servers: Erich Nahum 110


Other loss recovery methods

Selective Acknowledgements (SACK):


Returned ACKs contain option w/SACK block
Block says, "got up to N-1 AND got N+1 through N+3"
A single ACK can generate a retransmission
New Reno partial ACKs:
New ACK during fast retransmit may not ACK all
outstanding data. Ex:
Have ACK of 1, waiting for 2-6, get 3 dup acks of 1
Retransmit 2, get ACK of 3, can now infer 4 lost as well
Other schemes exist (e.g., Vegas)
Reno has been prevalent; SACK now catching on

Web Servers: Erich Nahum 111


How about Connection Teardown?
Either side may terminate a connection. (In fact, connection can stay half-closed.) Let's say the server closes (typical in WWW)
Server sends FIN with seq number (SN+1) (i.e., FIN is a byte in sequence)
Client ACKs the FIN with SN+2 ("next expected")
Client sends its own FIN when ready
Server ACKs client FIN as well with SN+1.

[Diagram: server close() sends FIN(X); client replies ACK(X+1) and later close() sends its own FIN(Y); server replies ACK(Y+1); the side that closed first then sits in a timed-wait period before fully closing]

Web Servers: Erich Nahum 112


The TCP State Machine
TCP uses a Finite State Machine, kept by each side of a connection, to keep track of what
state a connection is in.
State transitions reflect inherent races that can happen in the network, e.g., two FIN's
passing each other in the network.
Certain things can go wrong along the way, i.e., packets can be dropped or corrupted. In
fact, machine is not perfect; certain problems can arise not anticipated in the original
RFC.
This is where timers will come in, which we will discuss more later.

Web Servers: Erich Nahum 113


TCP State Machine:
Connection Establishment
CLOSED: more implied than actual, i.e., no connection
LISTEN: willing to receive connections (accept call)
SYN-SENT: sent a SYN, waiting for SYN-ACK
SYN-RECEIVED: received a SYN, waiting for an ACK of our SYN
ESTABLISHED: connection ready for data transfer

[Diagram: client side — CLOSED → (application calls connect(), send SYN) → SYN_SENT → (receive SYN + ACK, send ACK) → ESTABLISHED. Server side — CLOSED → (application calls listen()) → LISTEN → (receive SYN, send SYN + ACK) → SYN_RCVD → (receive ACK) → ESTABLISHED]

Web Servers: Erich Nahum 114


TCP State Machine:
Connection Teardown
FIN-WAIT-1: we closed first, waiting for ACK of our FIN (active close)
FIN-WAIT-2: we closed first, other side has ACKed our FIN, but not yet FIN'ed
CLOSING: other side closed before it received our FIN
TIME-WAIT: we closed, other side closed, got ACK of our FIN
CLOSE-WAIT: other side sent FIN first, not us (passive close)
LAST-ACK: other side sent FIN, then we did, now waiting for ACK

[Diagram: active close — ESTABLISHED → (close() called, send FIN) → FIN_WAIT_1, then either (receive ACK) → FIN_WAIT_2 → (receive FIN, send ACK) → TIME_WAIT, or (receive FIN, send ACK of FIN) → CLOSING → (receive ACK) → TIME_WAIT; TIME_WAIT → (wait 2*MSL, 240 seconds) → CLOSED. Passive close — ESTABLISHED → (receive FIN, send ACK) → CLOSE_WAIT → (close() called, send FIN) → LAST_ACK → (receive ACK) → CLOSED]

Web Servers: Erich Nahum 115


Summary: TCP Protocol

Protocol provides reliability in face of complex


network behavior
Tries to trade off efficiency with being "good
network citizen"
Vast majority of bytes transferred on Internet
today are TCP-based:
Web
Mail
News
Peer-to-peer (Napster, Gnutella, FreeNet, KaZaa)

Web Servers: Erich Nahum 116


Chapter 8: TCP Dynamics

Web Servers: Erich Nahum 117


TCP Dynamics
In this section we'll describe some of the problems you
can run into as a WWW server interacting with TCP.
Most of these affect the response as seen by the
client, not the throughput generated by the server.
Ideally, a server developer shouldn't have to worry
about this stuff, but in practice, we'll see that's not
the case.
Examples we'll look at include:
The initial window size
The delayed ACK problem
Nagle and its interaction with delayed ack
Small receive windows interfering with loss recovery

Web Servers: Erich Nahum 118


TCP's Initial Window Problem

Recall congestion control:
sender's initial congestion window is set to 1
Recall delayed ACKs:
ack every other packet
set 200 ms. delayed ack timer
Short-term deadlock:
sender is waiting for ACK since it sent 1 segment
receiver is waiting for 2nd segment before ACKing
Problem worse than it seems:
multiple objects per web page
IE does not do pipelining!

[Diagram: client sends GET /index.html; server sends the 1st segment and stalls; only after the receiver's 200 ms delayed-ACK timer fires does the ACK of the 1st segment arrive, letting the 2nd and 3rd segments go out and be ACKed together]
Web Servers: Erich Nahum 119


Solving the IW Problem
Solution: set IW = 2-4
RFC 2414
Didn't affect many BSD systems since they (incorrectly) counted the connection setup in congestion window calculation
Delayed ACK still happens, but now out of critical path of response time for download

[Diagram: client sends GET /index.html; server sends 1st and 2nd segments back-to-back; the client ACKs both after one RTT, the 3rd segment follows, and the 200 ms delayed ACK of the 3rd segment falls after the download completes]

Web Servers: Erich Nahum 120


Receive Window Size Problem
Recall Fast Retransmit:
Amount of data in flight: MIN(cong win, recv win)
can't ever have more than that outstanding
In order for FR to work:
enough data has to be in flight
after lost packet, 3 more segments must arrive
4.5 KB of receive-side buffer space must be available.
note many web documents are less than 4.5 KB!

[Diagram: SEQ=3000 is lost; segments 4000, 5000, and 6000 must still fit in the window so that three duplicate ACKs of 3000 come back and trigger the retransmission of SEQ=3000]

Web Servers: Erich Nahum 121


Receive Window Size (cont)
Previous discussion assumes large enough receive windows!
Early versions of MS Windows had 16 KB default recv. window
Balakrishnan et al. 1998:
Study server TCP traces from 1996 Olympic Web Server
show over 50% of clients have receive window < 10K
Many suffer coarse-grained retransmission timeouts (RTOs)
Even SACK would not have helped!

[Diagram: with RWIN = 2000, only two segments fit in flight; when SEQ=3000 is lost, it is illegal for the sender to send more, so no duplicate ACKs are generated and the sender must wait out an RTO before retransmitting SEQ=3000]

Web Servers: Erich Nahum 122


Fixing Receive Window Problem
Balakrishnan et al. 98: "right-edge recovery"
Also proposed by Lin & Kung 98
Now an RFC (3042)
How does it work?
Arrival of dup ack means a segment has left the network
When dup ACK is received, send next segment (not retransmission)
Continue with 2nd and 3rd dup acks
Idea is to "keep the ACK clock flowing" by forcing more duplicate acks to be generated
Claim is that it would have avoided 25% of coarse-grained timeouts in 96 Olympics trace

[Diagram: RWIN = 2000; SEQ=3000 is lost; SEQ=4000 arrives and triggers dup ACK 3000 (RWIN now 1000); the sender responds with new segment SEQ=5000, whose dup ACK allows SEQ=6000; the third dup ACK finally triggers the retransmission of SEQ=3000]

Web Servers: Erich Nahum 123


The Nagle Algorithm

Different types of TCP traffic exist:


Some apps (e.g., telnet) send one byte of data, then wait
for ACK
Others (e.g., FTP) use full-size segments
Recall server can write() to a socket at any time
Once written, should host stack send? Or should we wait
and hope to get more data?
May send many small packets, which is bad for 2
reasons:
Uses more network bandwidth (raises ratio of headers to
content)
Uses more CPU (many costs are per-packet, not per-byte)

Web Servers: Erich Nahum 124


The Nagle Algorithm

Solution is the Nagle algorithm:

If full-size segment of data is available, just send


If small segment available, and there is no unacknowledged data
outstanding, send
Otherwise, wait until either:
More data arrives from above (can coalesce packet), or
ACK arrives acknowledging outstanding data
Idea is have at most one small packet outstanding

Web Servers: Erich Nahum 125


Interaction of Nagle & Delayed ACK
Nagle and delayed ACKs cause (temporary) deadlock:
Sender wants to send 1.5 segments, sends first full one
Nagle prevents second from being sent (since not full size, and now we have unacked data outstanding)
Sender waits for delayed ACK from receiver
Receiver is waiting for 2nd segment before sending ACK
Similar to IW=1 problem earlier
Result: Many disable Nagle via setsockopt() call

[Diagram: client sends GET /index.html; the server write()s 1.5 segments and the first (full-size) one goes out; Nagle forbids sending the half-size remainder; after the 200 ms delayed-ACK timer fires, the ACK of the 1st segment arrives and the 2nd (half-size) segment finally goes out]
Web Servers: Erich Nahum 126


Interaction of Nagle & Delayed ACK

For example, in WWW servers:


original NCSA server issued a write() for every header
Apache does its own buffering to do a single write() call
other servers use writev() (e.g., Flash)
if not careful you can flood the network with packets
More of an issue when using persistent connections:
closing the connection forces data out with the FIN bit
but persistent connections or 1.0 keep-alives affected
Mogul and Minshall 2001 evaluate a number of modifications to Nagle to deal with this
Linux has similar "TCP_CORK" option
suppresses any non-full segment
application has to remember to disable TCP_CORK when finished.
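A minimal sketch of the cork/uncork pattern; send_response() and its buffers are hypothetical, but TCP_CORK itself is the real Linux option:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <string.h>
    #include <unistd.h>

    void send_response(int sock, const char *hdr, const char *body)
    {
        int on = 1, off = 0;

        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
        write(sock, hdr, strlen(hdr));   /* held back and coalesced */
        write(sock, body, strlen(body)); /* into full-size segments */
        /* uncorking flushes the final partial segment; forgetting
           this leaves the tail of the response stuck in the kernel */
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
    }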

Web Servers: Erich Nahum 127


Summary: TCP Dynamics

Many ways in which an HTTP transfer can interact


with TCP
Interaction of factors can cause delays in
response time as seen by clients
Hard to shield server developers from having to
understand these issues
Mistakes can cause problems such as a flood of
small packets

Web Servers: Erich Nahum 128


Chapter 9: TCP Implementation

Web Servers: Erich Nahum 129


Server TCP Implementation

In this section we look at ways in which the host


TCP implementation is stressed under large web
server workloads. Most of these techniques deal
with large numbers of connections:
Looking up arriving TCP segments with large numbers of
connections
Dealing with the TIME-WAIT state caused by closing
large numbers of connections
Managing large numbers of timers to support connections
Dealing with memory consumption of connection state
Removing data-touching operations
byte copying and checksums

Web Servers: Erich Nahum 130


In the beginning… BSD 4.3
Recall how demultiplexing works:
  given a packet, want to find the connection state (PCB in BSD)
  keyed on the 4-tuple of source & destination ports and IP addresses
Original BSD:
  used a one-behind cache with linear search to match the 4-tuple
  assumption was "the next segment is very likely from the same connection"
  assumed a solitary, long-lived, FTP-like transfer
  average miss time is O(N/2) (N = length of the PCB list; a sketch of this lookup follows)
[diagram: a packet arriving for IP 10.1.1.2, port 5194 is checked against the one-behind cache, then walked down the PCB list (IP 192.123.168.40 port 23; IP 1.2.3.4 port 45981; IP 9.2.16.1 port 873; IP 118.23.48.3 port 65383) until the matching entry at the end]
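A sketch of that lookup in C; the PCB layout is illustrative, not the real BSD definition:

    #include <stdint.h>
    #include <stddef.h>

    struct pcb {
        uint32_t laddr, faddr;   /* local/foreign IP addresses */
        uint16_t lport, fport;   /* local/foreign ports */
        struct pcb *next;
    };

    static struct pcb *pcb_cache;    /* one-behind cache */

    struct pcb *pcb_lookup(struct pcb *head,
                           uint32_t laddr, uint16_t lport,
                           uint32_t faddr, uint16_t fport)
    {
        struct pcb *p = pcb_cache;

        /* fast path: same connection as the previous segment */
        if (p != NULL && p->laddr == laddr && p->lport == lport &&
            p->faddr == faddr && p->fport == fport)
            return p;

        /* slow path: O(N) walk of the entire PCB list */
        for (p = head; p != NULL; p = p->next) {
            if (p->laddr == laddr && p->lport == lport &&
                p->faddr == faddr && p->fport == fport) {
                pcb_cache = p;   /* remember for the next segment */
                return p;
            }
        }
        return NULL;
    }

With thousands of concurrent web connections, the fast path almost never hits, and every miss pays the linear walk.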

Web Servers: Erich Nahum 131


PCB Hash Tables
McKenney & Dove SigComm 92:
  linear search with a one-behind cache doesn't work well for transaction workloads
  hashing does much better
  hash based on the 4-tuple
  cost: O(1) (constant time; a sketch follows)
BSD adds a hash table in the 90's
  other BSD Unixes (such as AIX) quickly followed
Algorithmic work on hash tables:
  e.g., CLR book, perfect hash tables
  none specific to web workloads
  hash table sizing problematic
[diagram: an arriving packet for IP 10.1.1.2, port 5194 hashes straight to its bucket in a PCB hash table with buckets 0 … (N-1)]
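A sketch of the hashed lookup; the hash function here is illustrative, not the one any particular BSD uses:

    #include <stdint.h>
    #include <stddef.h>

    #define PCB_TABLE_SIZE 4096          /* sizing is the hard part */

    struct pcb {
        uint32_t laddr, faddr;
        uint16_t lport, fport;
        struct pcb *next;                /* bucket chain */
    };

    static struct pcb *pcb_table[PCB_TABLE_SIZE];

    static unsigned pcb_hash(uint32_t laddr, uint16_t lport,
                             uint32_t faddr, uint16_t fport)
    {
        uint32_t h = laddr ^ faddr ^ ((uint32_t)lport << 16) ^ fport;
        return h % PCB_TABLE_SIZE;
    }

    struct pcb *pcb_lookup(uint32_t laddr, uint16_t lport,
                           uint32_t faddr, uint16_t fport)
    {
        unsigned slot = pcb_hash(laddr, lport, faddr, fport);
        struct pcb *p;

        /* O(1) expected: only this bucket's chain is searched */
        for (p = pcb_table[slot]; p != NULL; p = p->next)
            if (p->laddr == laddr && p->lport == lport &&
                p->faddr == faddr && p->fport == fport)
                return p;
        return NULL;
    }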

Web Servers: Erich Nahum 132


Problem of Old Duplicates
Recall in the Internet:
  packets may be arbitrarily duplicated, delayed, and reordered
  while rare, this case must be accounted for
Consider the following:
  two hosts connect, transfer data, close
  client starts a new connection using the same 4-tuple
  a duplicate packet arrives from the first connection
  the connection has been closed, so the state is gone
  how can you distinguish?
[diagram: a full connection (SYN (X); SYN (Y) ACK(X+1); GET /index.html; content + ACK + FIN exchange; final ACK) completes; afterwards a delayed duplicate SYN (X) arrives — old or new?]

Web Servers: Erich Nahum 133


Role of the TIME-WAIT State
Solution: don't do that!
  prevent the same 4-tuple from being reused
  one side must remember the 4-tuple for a period of time to reject old packets
  spec says whoever closes the connection must do this (in the TIME-WAIT state)
Period is 2 times the maximum segment lifetime (MSL), after which it is assumed no packet from the previous conversation will still be alive
MSL is defined as 2 minutes in RFC 1122
[diagram: the same exchange, but the closing side holds the connection in TIME-WAIT for 2 * MSL after the final ACK, so the late SYN (Z) is rejected]

Web Servers: Erich Nahum 134


TIME-WAIT Problem in Servers
Recall that in a WWW server, the server closes the connection!
  asymmetry of the client/server model means the server holds state for many clients
  each PCB sticks around for 2*MSL units of time
Mogul 1995 CA Election server study:
  shows large numbers (90%) of PCBs in TIME-WAIT
  would have been 97% if it had followed the proper MSL!
Example: doing 1000 connections/sec
  assume MSL is 120 seconds, and a request takes 1 second
  have 1000 connections in ESTABLISHED state
  240,000 connections (1000/sec * 2 * 120 sec) in TIME-WAIT state!
FTY99 propose & evaluate 3 schemes:
  require the client to close (requires changing HTTP)
  have the client use a new "client close" TCP option (requires changing TCP)
  do a client reset (requires changing the browser; MS did this for a while)
  claim 50% improvement in throughput, 85% in memory use

Web Servers: Erich Nahum 135


Dealing with TIME-WAIT
Sorting hash table entries (Aron & Druschel 99):
  demultiplexing requires that all PCBs (in some hash bucket) be examined before you can give up and say the connection was not found
  most lookups are for existing connections, which are in the ESTABLISHED state rather than TIME-WAIT
  can sort each PCB chain such that TIME-WAIT entries are at the end
  thus, ESTABLISHED entries are at the front of the chain
[diagram: a hash bucket chain with 192.123.168.40: ESTABLISHED at the front, followed by 128.119.82.37, 9.2.16.145, 178.23.48.3, and 10.1.1.2, all in TIME_WAIT]

Web Servers: Erich Nahum 136


Server Timer Management
Each TCP connection can have up to 5 timers associated with it:
  delayed ACK, retransmission, persist, keep-alive, time-wait
Original BSD:
  linear linked list of PCBs
  fast timer (200 ms): walk all PCBs for the delayed ACK timer
  slow timer (500 ms): walk all PCBs for all other timers
  time kept in relative form, so must subtract time (500 ms) from each PCB for the 4 larger timers
  costs: O(#PCBs), not O(#active timers)
[diagram: a PCB list whose entries carry pending timers: RTO in 2 secs; TIME-WAIT in 30 secs; delayed ACK in 100 ms; keep-alive in 1 sec; persist in 10 secs]

Web Servers: Erich Nahum 137


Server Timer Management
Can again exploit the semantics of the TIME-WAIT state:
  if PCBs are sorted by state, the delayed ACK timer walk can stop once it encounters a PCB in TIME-WAIT, since ACKs are not delayed for connections in the TIME-WAIT state (a sketch follows)
  Aron and Druschel show a 25 percent HTTP throughput improvement using this technique
  they attribute most of the win to reduced timer processing, but it probably helps PCB lookup as well
[diagram: the same sorted hash bucket chain: one ESTABLISHED entry at the front, TIME_WAIT entries behind it]
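A sketch of the early-terminating timer walk over one sorted bucket chain; the names are illustrative:

    #include <stddef.h>

    enum tcp_state { ESTABLISHED, TIME_WAIT /* ... */ };

    struct pcb {
        enum tcp_state state;
        int delack_pending;          /* delayed ACK armed? */
        struct pcb *next;
    };

    extern void send_delayed_ack(struct pcb *p);   /* assumed helper */

    void delack_timer_walk(struct pcb *chain)
    {
        struct pcb *p;

        for (p = chain; p != NULL; p = p->next) {
            if (p->state == TIME_WAIT)
                break;   /* sorted chain: no delayed ACKs past here */
            if (p->delack_pending)
                send_delayed_ack(p);
        }
    }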

Web Servers: Erich Nahum 138


Customized PCB Tables
Maintain 2 sets of PCBs: normal and TIME-WAIT
  first done in BSDI in 96
  still must search both tables
Aron & Druschel 99:
  can compress TIME-WAIT PCBs, since only the ports and sequence numbers are needed
  the normal table still has full PCB state
  show you can save a lot of kernel pinned RAM (from 31 MB to 5 MB, an 82% reduction)
  results in more RAM available for the disk cache, which leads to better performance
[diagram: a regular PCB hash table holding full entries (e.g., 192.123.168.40: ESTABLISHED) alongside a separate TIME-WAIT PCB hash table holding compressed entries (9.2.16.145; 128.119.72.4; 10.1.1.2)]

Web Servers: Erich Nahum 139


Scalable Timers: Timing Wheels

Varghese SOSP 1987:
  use a hash-table-like structure called a timing wheel
  events are ordered by relative time in the future
  given an event at future time T, put it in slot (T mod N)
  each slot's list is sorted by time (scheme 5)
Each clock tick:
  wheel pointer turns one slot (mod N)
  look at the first item in the chain: if ready, fire it, check the next
  if the slot is empty or the item is not ready to fire, all done
  continue until a non-ready item is encountered (or end of list)
Ex: current time = 12, N = 10
[diagram: a timing wheel with slots 0 … (N-1); the wheel pointer indexes the current slot, whose sorted chain is Expire: 12 → Expire: 22 → Expire: 42, so only the first entry fires this revolution]

Web Servers: Erich Nahum 140


Timing Wheels (cont)
Variant (scheme 6 in paper):
  just insert into the wheel slot, don't bother to sort
  check all timers in the slot on each tick (sketched below)
Original SOSP 1987 paper:
  premise was more for large-scale simulations
  which have lots of events happening "in the future"
Algorithmic costs (assuming a good hash function):
  O(1) average time for basic dictionary operations
  insertion, cancellation, per-tick bookkeeping
  O(N) (N = number of timers) worst-case for scheme 6
  O(log N) worst-case for scheme 5
Deployment:
Used in FreeBSD as of release 3.4 (scheme 6)
Variant in Linux 2.4 (hierarchy of timers with cascade)
Aron claims "about the same perf" as his approach
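A minimal timing wheel in the style of scheme 6, as a C sketch (slot count, names, and the lack of cancellation are all simplifications, not FreeBSD's actual code):

    #include <stddef.h>

    #define WHEEL_SLOTS 256

    struct timer {
        unsigned long expires;            /* absolute tick of expiry */
        void (*fire)(struct timer *);     /* callback to run */
        struct timer *next;
    };

    static struct timer *wheel[WHEEL_SLOTS];
    static unsigned long now;             /* current tick */

    void timer_add(struct timer *t)       /* assumes t->expires > now */
    {
        unsigned slot = t->expires % WHEEL_SLOTS;
        t->next = wheel[slot];            /* O(1), unsorted insertion */
        wheel[slot] = t;
    }

    void wheel_tick(void)
    {
        struct timer **pp;

        now++;
        pp = &wheel[now % WHEEL_SLOTS];
        while (*pp != NULL) {             /* check every timer in slot */
            struct timer *t = *pp;
            if (t->expires <= now) {
                *pp = t->next;            /* unlink, then fire */
                t->fire(t);
            } else {
                pp = &t->next;            /* due a later revolution */
            }
        }
    }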

Web Servers: Erich Nahum 141


Data-Touching Operations
Lots of research in the high-speed networking community about how touching data is bad
  especially as CPU speeds increase relative to memory speeds
Several ways to avoid data copying:
  use mmap() as described earlier to cut down to one copy
  use IO-Lite primitives (a new API) to move buffers around
  use the sendfile() API combined with an integrated zero-copy I/O system in the kernel (see the sketch below)
Also a cost to reading the data to compute checksums:
  Jacobson showed how checksumming can be folded into the copy for free, with some complexity on the receive side
  IO-Lite / exokernel use checksum caches
  advanced network cards do the checksum for you
    originally on the SGI FDDI card (1995)
    now on all gigabit adaptors, some 100baseT adaptors
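A sketch of the sendfile() path on Linux (signature per sendfile(2); error handling abbreviated, and a real server would also handle EAGAIN on a non-blocking socket):

    #include <sys/sendfile.h>
    #include <sys/types.h>

    /* transmit 'size' bytes of an open file to a connected socket;
       the kernel moves file pages to the socket with no user-space
       copy of the data */
    ssize_t send_file_body(int sock, int fd, off_t size)
    {
        off_t off = 0;

        while (off < size) {
            ssize_t n = sendfile(sock, fd, &off, size - off);
            if (n <= 0)
                return -1;    /* error; off records progress so far */
        }
        return off;
    }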

Web Servers: Erich Nahum 142


Summary: Implementation Issues

Scaling problems happen in large WWW Servers:


Asymmetry of client/server model
Large numbers of connections
Large amounts of data transferred
Approaches fall into one or more categories:
Hashing
Caching
Exploiting common-case behavior
Exploiting semantic information
Not touching the data
Most OSes have added support for these techniques over the
last 3 years

Web Servers: Erich Nahum 143
