
Web Servers:

Implementation and Performance

Erich Nahum

IBM T.J. Watson Research Center


www.research.ibm.com/people/n/nahum
nahum@us.ibm.com

Web Servers: Erich Nahum 1


Contents of This Tutorial
Introduction to HTTP
HTTP Servers:
Outline of an HTTP Server Transaction
Server Models: Processes, Threads, Events
Event Notification: Asynchronous I/O
HTTP Server Workloads:
Workload Characteristics
Workload Generation
Server TCP Issues:
Introduction to TCP
Server TCP Dynamics
Server TCP Implementation Issues

Web Servers: Erich Nahum 2


Things Not Covered in Tutorial

Clusters
Client-side issues: DNS, HTML rendering
Proxies: some similarities, many differences
Dynamic Content: CGI, PHP, JSP, etc.
QoS for Web Servers
SSL/TLS and HTTPS
Content Distribution Networks (CDNs)
Security and Denial of Service

Web Servers: Erich Nahum 3


Assumptions and Expectations

Some familiarity with WWW as a user


(Has anyone here not used a browser?)
Some familiarity with networking concepts
(e.g., unreliability, reordering, race conditions)
Familiarity with systems programming
(e.g., know what sockets, hashing, caching are)
Examples will be based on C & Unix
taken from BSD, Linux, AIX, and real servers
(sorry, Java and Windows fans)

Web Servers: Erich Nahum 4


Objectives and Takeaways

After this tutorial, hopefully we will all know:

Basics of server implementation & performance


Pros and cons of various server architectures
Difficulties in workload generation
Interactions between HTTP and TCP
Design loop of implement, measure, profile, debug,
and fix

Many lessons should be applicable to any networked server, e.g., files, mail, news, DNS, LDAP, etc.

Web Servers: Erich Nahum 5


Timeline
Section Min
Introduction to HTTP 20
Outline of an HTTP Server Transaction 25
Server Models: Processes, Threads, Events 20
Event Notification: Asynchronous I/O 20
Workload Characteristics 35
Break 15
Workload Generation 40
Introduction to TCP 35
Server TCP Dynamics 20
Server TCP Implementation 25

Web Servers: Erich Nahum 6


Acknowledgements
Many people contributed comments and
suggestions to this tutorial, including:

Abhishek Chandra, Suresh Chari, Mark Crovella, Peter Druschel,
Jim Kurose, Balachander Krishnamurthy, Vivek Pai, Jennifer Rexford,
Anees Shaikh

Errors are all mine, of course.

Web Servers: Erich Nahum 7


Chapter 1: Introduction to HTTP

Web Servers: Erich Nahum 8


Introduction to HTTP

[Diagram: a laptop running Netscape and a desktop running Explorer each send http requests to, and receive http responses from, a server running Apache]

HTTP: Hypertext Transfer Protocol


Communication protocol between clients and servers
Application layer protocol for WWW
Client/Server model:
Client: browser that requests, receives, displays object
Server: receives requests and responds to them
Protocol consists of various operations
Few for HTTP 1.0 (RFC 1945, 1996)
Many more in HTTP 1.1 (RFC 2616, 1999)

Web Servers: Erich Nahum 9


How are Requests Generated?
User clicks on something
Uniform Resource Locator (URL):
http://www.nytimes.com
https://www.paymybills.com
ftp://ftp.kernel.org
news://news.deja.com
telnet://gaia.cs.umass.edu
mailto:nahum@us.ibm.com
Different URL schemes map to different services
Hostname is converted from a name to a 32-bit IP address (DNS resolution)
Connection is established to server

Most browser requests are HTTP requests.

Web Servers: Erich Nahum 10


What Happens Then?
Client downloads HTML document
Sometimes called "container page"
Typically in text format (ASCII)
Contains instructions for rendering
(e.g., background color, frames)
Links to other pages
Many have embedded objects:
Images: GIF, JPG (logos, banner ads)
Usually automatically retrieved
i.e., without user involvement
can control sometimes
(e.g., browser options, junkbusters)

Sample HTML file:

<html>
<head>
<meta name=Author content=Erich Nahum>
<title> Linux Web Server Performance </title>
</head>
<body text=#00000>
<img width=31 height=11 src=ibmlogo.gif>
<img src=images/new.gif>
<h1>Hi There!</h1>
Here's lots of cool linux stuff!
<a href=more.html>Click here</a> for more!
</body>
</html>

Web Servers: Erich Nahum 11


So What's a Web Server Do?
Respond to client requests, typically a browser
Can be a proxy, which aggregates client requests (e.g., AOL)
Could be search engine spider or custom (e.g., Keynote)
May have work to do on client's behalf:
Is the client's cached copy still good?
Is client authorized to get this document?
Is client a proxy on someone else's behalf?
Run an arbitrary program (e.g., stock trade)
Hundreds or thousands of simultaneous clients
Hard to predict how many will show up on some day
Many requests are in progress concurrently

Server capacity planning is non-trivial.

Web Servers: Erich Nahum 12


What do HTTP Requests Look Like?
GET /images/penguin.gif HTTP/1.0
User-Agent: Mozilla/0.9.4 (Linux 2.2.19)
Host: www.kernel.org
Accept: text/html, image/gif, image/jpeg
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
Cookie: B=xh203jfsf; Y=3sdkfjej
<cr><lf>

Messages are in ASCII (human-readable)


Carriage-return and line-feed indicate end of headers
Headers may communicate private information
(e.g., browser, OS, cookie information, etc.)

Web Servers: Erich Nahum 13


What Kind of Requests are there?

Called Methods:
GET: retrieve a file (95% of requests)
HEAD: just get meta-data (e.g., mod time)
POST: submitting a form to a server
PUT: store enclosed document at the given URI
DELETE: remove named resource
LINK/UNLINK: in 1.0, gone in 1.1
TRACE: http echo for debugging (added in 1.1)
CONNECT: used by proxies for tunneling (1.1)
OPTIONS: request for server/proxy options (1.1)

Web Servers: Erich Nahum 14


What Do Responses Look Like?
HTTP/1.0 200 OK
Server: Tux 2.0
Content-Type: image/gif
Content-Length: 43
Last-Modified: Fri, 15 Apr 1994 02:36:21 GMT
Expires: Wed, 20 Feb 2002 18:54:46 GMT
Date: Mon, 12 Nov 2001 14:29:48 GMT
Cache-Control: no-cache
Pragma: no-cache
Connection: close
Set-Cookie: PA=wefj2we0-jfjf
<cr><lf>
<data follows>

Similar format to requests (i.e., ASCII)

Web Servers: Erich Nahum 15


What Responses are There?
1XX: Informational (defined in 1.0, used in 1.1)
100 Continue, 101 Switching Protocols
2XX: Success
200 OK, 206 Partial Content
3XX: Redirection
301 Moved Permanently, 304 Not Modified
4XX: Client error
400 Bad Request, 403 Forbidden, 404 Not Found
5XX: Server error
500 Internal Server Error, 503 Service Unavailable,
505 HTTP Version Not Supported

Web Servers: Erich Nahum 16


What are all these Headers?
Specify capabilities and properties:
General:
Connection, Date
Request:
Accept-Encoding, User-Agent
Response:
Location, Server type
Entity:
Content-Encoding, Last-Modified
Hop-by-hop:
Proxy-Authenticate, Transfer-Encoding

Server must pay attention to respond properly.

Web Servers: Erich Nahum 17


Summary: Introduction to HTTP

The major application on the Internet


Majority of traffic is HTTP (or HTTP-related)
Client/server model:
Clients make requests, servers respond to them
Done mostly in ASCII text (helps debugging!)
Various headers and commands
Too many to go into detail here
We'll focus on common server ones
Many web books/tutorials exist (e.g., Krishnamurthy &
Rexford 2001)

Web Servers: Erich Nahum 18


Chapter 2: Outline of a Typical
HTTP Transaction

Web Servers: Erich Nahum 19


Outline of an HTTP Transaction
In this section we go over the basics of servicing an HTTP GET request from user space
For this example, we'll assume a single process running in user space, similar to Apache 1.3
At each stage see what the costs/problems can be
Also try to think of where costs can be optimized
We'll describe relevant socket operations as we go

The server in a nutshell:

initialize;
forever do {
    get request;
    process;
    send response;
    log request;
}

Web Servers: Erich Nahum 20


Readying a Server
s = socket();            /* allocate listen socket */
bind(s, 80);             /* bind to TCP port 80 */
listen(s);               /* indicate willingness to accept */
while (1) {
    newconn = accept(s); /* accept new connection */

First thing a server does is notify the OS it is interested in WWW server requests; these are typically on TCP port 80.
Other services use different ports (e.g., SSL is on 443)
Allocate a socket and bind() it to the address (port 80)
Server calls listen() on the socket to indicate willingness to receive requests
Calls accept() to wait for a request to come in (and blocks)
When accept() returns, we have a new socket which represents a new connection to a client
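
A minimal runnable sketch of the same sequence, assuming IPv4 and a Unix-like system; error handling is abbreviated for space:

#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);      /* allocate listen socket */

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(80);                    /* TCP port 80 (needs root) */
    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    listen(s, 128);                               /* backlog of 128 is an arbitrary choice */

    while (1) {
        int newconn = accept(s, NULL, NULL);      /* blocks until a client connects */
        /* ... read request, send response, log ... */
        close(newconn);
    }
}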

Web Servers: Erich Nahum 21


Processing a Request
remoteIP = getpeername(newconn);
remoteHost = gethostbyaddr(remoteIP);
gettimeofday(currentTime);
read(newconn, reqBuffer, sizeof(reqBuffer));
reqInfo = serverParse(reqBuffer);

getpeername() called to get the address of the remote end
for logging purposes (optional, but done by most)
gethostbyaddr() called to get name of other end
again for logging purposes
gettimeofday() is called to get time of request
both for Date header and for logging
read() is called on new socket to retrieve request
request is determined by parsing the data
GET /images/jul4/flag.gif

Web Servers: Erich Nahum 22


Processing a Request (cont)
fileName = parseOutFileName(requestBuffer);
fileAttr = stat(fileName);
serverCheckFileStuff(fileName, fileAttr);
open(fileName);

stat() called to test file path
to see if file exists/is accessible
may not be there, may only be available to certain people
"/microsoft/top-secret/plans-for-world-domination.html"
stat() also used for file meta-data
e.g., size of file, last modified time
("Have plans changed since last time I checked?")
might have to stat() multiple files just to get to the end of the path
e.g., 4 stats in the "bill g" example above
assuming all is OK, open() called to open the file

Web Servers: Erich Nahum 23


Responding to a Request
read(fileName, fileBuffer);
headerBuffer = serverFigureHeaders(fileName, reqInfo);
write(newSock, headerBuffer);
write(newSock, fileBuffer);
close(newSock);
close(fileName);
write(logFile, requestInfo);

read() called to read the file into user space


write() is called to send HTTP headers on socket
(early servers called write() for each header!)
write() is called to write the file on the socket
close() is called to close the socket
close() is called to close the open file descriptor
write() is called on the log file

Web Servers: Erich Nahum 24


Optimizing the Basic Structure

As we will see, a great deal of locality exists in web requests and web traffic.
Much of the work described above doesn't really need to be performed each time.
Optimizations fall under 2 categories: caching and custom OS primitives.

Web Servers: Erich Nahum 25


Optimizations: Caching
Idea is to exploit locality in client requests. Many files are requested over and over (e.g., index.html).

Why open and close files over and over again? Instead, cache open file FDs, manage them LRU.
Why stat them again and again? Cache path name access characteristics.
Again, cache HTTP header info on a per-URL basis, rather than re-generating info over and over.

fileDescriptor = lookInFDCache(fileName);
metaInfo = lookInMetaInfoCache(fileName);
headerBuffer = lookInHTTPHeaderCache(fileName);

Web Servers: Erich Nahum 26


Optimizations: Caching (cont)

Instead of reading and writing the data, cache data,


as well as meta-data, in user space

Even better, mmap() fileData =


the file so that two lookInFileDataCache(fileName);
copies dont exist in fileData =
lookInMMapCache(fileName);
both user and kernel remoteHostName =
space lookRemoteHostCache(fileName);

Since we see the same clients over and over, cache


the reverse name lookups (or better yet, don't do
resolves at all, log only IP addresses)

Web Servers: Erich Nahum 27


Optimizations: OS Primitives
Rather than call accept(), getpeername() & read(), add a new primitive, acceptExtended(), which combines the 3 primitives
Instead of calling gettimeofday(), use a memory-mapped counter that is cheap to access (a few instructions rather than a system call)
Instead of calling write() many times, use writev()

acceptExtended(listenSock, &newSock, readBuffer, &remoteInfo);
currentTime = *mappedTimePointer;

buffer[0] = firstHTTPHeader;
buffer[1] = secondHTTPHeader;
buffer[2] = fileDataBuffer;
writev(newSock, buffer, 3);
Web Servers: Erich Nahum 28


OS Primitives (cont)
Rather than calling read() & write(), or write() with an mmap()'ed file, use a new primitive called sendfile() (or transmitfile()). Bytes stay in the kernel.
While we're at it, add a header option to sendfile() so that we don't have to call write() at all.
Also add an option to close the connection so that we don't have to call close() explicitly.

httpInfo = cacheLookup(reqBuffer);
sendfile(newConn, httpInfo->headers,
         httpInfo->fileDescriptor,
         OPT_CLOSE_WHEN_DONE);

All this assumes proper OS support. Most have it these days.
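
For comparison, a sketch using the Linux sendfile(2) interface, which takes no header or close options (those belong to the extended primitive described above), so headers still go out via write() and close() stays explicit:

#include <unistd.h>
#include <sys/types.h>
#include <sys/sendfile.h>

void sendStaticResponse(int newConn, int fileFd, size_t fileLen,
                        char *headerBuffer, size_t headerLen)
{
    off_t off = 0;
    write(newConn, headerBuffer, headerLen);   /* headers via ordinary write() */
    sendfile(newConn, fileFd, &off, fileLen);  /* file bytes never cross into user space */
    close(newConn);                            /* still an explicit close() */
}

(FreeBSD's sendfile() does accept header/trailer vectors, and Windows' TransmitFile() has a disconnect option, much as sketched on this slide.)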

Web Servers: Erich Nahum 29


An Accelerated Server Example
acceptex(socket, newConn, reqBuffer, remoteHostInfo);
httpInfo = cacheLookup(reqBuffer);
sendfile(newConn, httpInfo->headers,
httpInfo->fileDescriptor, OPT_CLOSE_WHEN_DONE);
write(logFile, requestInfo);

acceptex() is called
gets new socket, request, remote host IP address
string match in hash table is done to parse request
hash table entry contains relevant meta-data, including modification times, file
descriptors, permissions, etc.
sendfile() is called
pre-computed header, file descriptor, and close option
log written back asynchronously (buffered write()).

That's it!

Web Servers: Erich Nahum 30


Complications

Custom APIs have problems:


Additional test coverage is required
API may not be sufficiently general to be worth it
Take, for example, acceptex():
Work has shown it doesn't make a big difference
Some server applications write before reading (e.g., FTP)
Result is no OSs outside of MS have it
Counter-example is sendfile():
Useful for many types of servers (Web, FTP, SMB, NFS)
Result is available on virtually every OS

Web Servers: Erich Nahum 31


Complications (cont)

Much of this assumes sharing is easy:


but, this is dependent on the server architectural model
if multiple processes are being used, as in Apache, it is
difficult to share data structures.
Take, for example, mmap():
mmap() maps a file into the address space of a process.
a file mmap'ed in one address space can't be re-used for
a request for the same file served by another process.
Apache 1.3 does use mmap() instead of read().
in this case, mmap() eliminates one data copy versus a
separate read() & write() combination, but process will
still need to open() and close() the file.
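
A sketch of the mmap()-based path described above, assuming the file has already been opened and stat()'ed for its length; this stands in for Apache 1.3's approach rather than reproducing its code:

#include <unistd.h>
#include <sys/mman.h>

void sendFileBody(int newSock, int fd, size_t fileLen)
{
    /* map the file read-only; no read() into a user buffer needed */
    void *data = mmap(NULL, fileLen, PROT_READ, MAP_PRIVATE, fd, 0);
    write(newSock, data, fileLen);  /* one copy (to socket) instead of two */
    munmap(data, fileLen);
}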

Web Servers: Erich Nahum 32


Complications (cont)
Similarly, meta-data info needs to be shared:
e.g., file size, access permissions, last modified time, etc.
While locality is high, cache misses can and do happen
sometimes:
if previously unseen file requested, process can block waiting for
disk.
OS can impose other restrictions:
e.g., limits on number of open file descriptors.
e.g., sockets typically allow buffering about 64 KB of data. If a
process tries to write() a 1 MB file, it will block until other end
receives the data.
Need to be able to cope with the misses without slowing
down the hits

Web Servers: Erich Nahum 33


Summary: Outline of a Typical
HTTP Transaction
A server can perform many steps in the process of
servicing a request
Different actions depending on many factors:
e.g., 304 not modified if client's cached copy is good
e.g., 404 not found, 401 unauthorized
Most requests are for small subset of data:
we'll see more about this in the Workload section
we can leverage that fact for performance
Architectural model affects possible optimizations
we'll go into this in more detail in the next section

Web Servers: Erich Nahum 34


Chapter 3:
Server Architectural Models

Web Servers: Erich Nahum 35


Server Architectural Models

Several approaches to server structure:


Process based: Apache, NCSA
Thread-based: JAWS, IIS
Event-based: Flash, Zeus
Kernel-based: Tux, AFPA, ExoKernel

We will describe the advantages and disadvantages of each.
Fundamental tradeoffs exist between performance, protection, sharing, robustness, extensibility, etc.

Web Servers: Erich Nahum 36


Process Model (ex: Apache)

Process created to handle each new request:


Process can block on appropriate actions,
(e.g., socket read, file read, socket write)
Concurrency handled via multiple processes
Quickly becomes unwieldy:
Process creation is expensive.
Instead, pre-forked pool is created.
Upper limit on # of processes is enforced
First by the server, eventually by the operating system.
Concurrency is limited by upper bound

Web Servers: Erich Nahum 37


Process Model: Pros and Cons

Advantages:
Most importantly, consistent with programmer's way of
thinking. Most programmers think in terms of linear series of
steps to accomplish task.
Processes are protected from one another; can't nuke data in
some other address space. Similarly, if one crashes, others
unaffected.
Disadvantages:
Slow. Forking is expensive, allocating stack, VM data structures
for each process adds up and puts pressure on the memory
system.
Difficulty in sharing info across processes.
Have to use locking.
No control over scheduling decisions.

Web Servers: Erich Nahum 38


Thread Model (Ex: JAWS)

Use threads instead of processes. Threads


consume fewer resources than processes (e.g.,
stack, VM allocation).
Forking and deleting threads is cheaper than
processes.
Similarly, pre-forked thread pool is created. May
be limits to numbers but hopefully less of an issue
than with processes since fewer resources
required.

Web Servers: Erich Nahum 39


Thread Model: Pros and Cons
Advantages:
Faster than processes. Creating/destroying cheaper.
Maintains programmer's way of thinking.
Sharing is enabled by default.
Disadvantages:
Less robust. Threads not protected from each other.
Requires proper OS support, otherwise, if one thread blocks
on a file read, will block all the address space.
Can still run out of threads if servicing many clients
concurrently.
Can exhaust certain per-process limits not encountered with
processes (e.g., number of open file descriptors).
Limited or no control over scheduling decisions.

Web Servers: Erich Nahum 40


Event Model (Ex: Flash)
while (1) {
accept new connections until none remaining;
call select() on all active file descriptors;
for each FD:
if (fd ready for reading) call read();
if (fd ready for writing) call write();
}

Use a single process and deal with requests in an event-driven manner, like a giant switchboard.
Use non-blocking option (O_NDELAY) on sockets, do
everything asynchronously, never block on anything,
and have OS notify us when something is ready.
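
A sketch of the event loop this describes, using select(); the conns[] array and the accept/handle helpers are illustrative stand-ins for a server's per-connection state machine, not Flash's actual code:

#include <sys/select.h>

struct conn { int fd; int wantRead; int wantWrite; };
extern struct conn conns[];
extern int numConns, maxFd, listenSock;
void acceptNewConnections(void);
void handleRead(struct conn *c);
void handleWrite(struct conn *c);

void eventLoop(void)
{
    fd_set rfds, wfds;
    int i;
    for (;;) {
        FD_ZERO(&rfds); FD_ZERO(&wfds);
        FD_SET(listenSock, &rfds);
        for (i = 0; i < numConns; i++) {
            if (conns[i].wantRead)  FD_SET(conns[i].fd, &rfds);
            if (conns[i].wantWrite) FD_SET(conns[i].fd, &wfds);
        }
        select(maxFd + 1, &rfds, &wfds, NULL, NULL);  /* block until something is ready */
        if (FD_ISSET(listenSock, &rfds))
            acceptNewConnections();                   /* accept until none remaining */
        for (i = 0; i < numConns; i++) {
            if (FD_ISSET(conns[i].fd, &rfds)) handleRead(&conns[i]);
            if (FD_ISSET(conns[i].fd, &wfds)) handleWrite(&conns[i]);
        }
    }
}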

Web Servers: Erich Nahum 41


Event-Driven: Pros and Cons
Advantages:
Very fast.
Sharing is inherent, since there's only one process.
Don't even need locks as in thread models.
Can maximize concurrency in request stream easily.
No context-switch costs or extra memory consumption.
Complete control over scheduling decisions.

Disadvantages:
Less robust. Failure can halt whole server.
Pushes per-process resource limits (like file descriptors).
Not every OS has full asynchronous I/O, so can still block on a file
read. Flash uses helper processes to deal with this (AMPED
architecture).

Web Servers: Erich Nahum 42


In-Kernel Model (Ex: Tux)

[Diagram: user-space vs. kernel-space server stacks. User-space server: HTTP sits above the socket layer, with the user/kernel boundary between them and TCP/IP/Ethernet below. Kernel-space server: HTTP sits below the user/kernel boundary, directly on top of TCP/IP/Ethernet.]

Dedicated kernel thread for HTTP requests:


One option: put whole server in kernel.
More likely, just deal with static GET requests in kernel to
capture majority of requests.
Punt dynamic requests to full-scale server in user space,
such as Apache.

Web Servers: Erich Nahum 43


In-Kernel Model: Pros and Cons

In-Kernel Event Model:


Avoids transitions to user space, copies across u-k boundary, etc.
Leverages already existing asynchronous primitives in the kernel
(kernel doesn't block on a file read, etc.).
Advantages:
Extremely fast. Tight integration with kernel.
Small component without full server optimizes common case.
Disadvantages:
Less robust. Bugs can crash whole machine, not just server.
Harder to debug and extend, since kernel programming required,
which is not as well-known as sockets.
Similarly, harder to deploy. APIs are OS-specific (Linux, BSD, NT),
whereas sockets & threads are (mostly) standardized.
HTTP evolving over time, have to modify kernel code in response.

Web Servers: Erich Nahum 44


So What's the Performance?

Graph shows server throughput for Tux, Flash, and Apache.


Experiments done on 400 MHz P/II, gigabit Ethernet, Linux 2.4.16, 8 client machines, WaspClient
workload generator
Tux is fastest, but Flash close behind

Web Servers: Erich Nahum 45


Summary: Server Architectures
Many ways to code up a server
Tradeoffs in speed, safety, robustness, ease of programming and
extensibility, etc.
Multiple servers exist for each kind of model
Not clear that a consensus exists.
Better case for in-kernel servers as devices
e.g. reverse proxy accelerator, Akamai CDN node
User-space servers have a role:
OS should provide proper primitives for efficiency
Leave HTTP-protocol related actions in user-space
In this case, event-driven model is attractive
Key pieces to a fast event-driven server:
Minimize copying
Efficient event notification mechanism

Web Servers: Erich Nahum 46


Chapter 4: Event Notification

Web Servers: Erich Nahum 47


Event Notification Mechanisms
Recall how Flash works:
One process, many FD's, calling select() on all active socket
descriptors.
All sockets are set using O_NDELAY flag (non-blocking)
Single address space aids sharing for performance
File reads and writes don't have non-blocking support, thus helper
processes (AMPED architecture)
Point is to exploit concurrency/parallelism:
Can read one socket while waiting to write on another
Event notification:
Mechanism for kernel and application to notify each other of
interesting/important events
E.g., connection arrivals, socket closes, data available to read, space
available for writing

Web Servers: Erich Nahum 48


State-Based: Select & Poll
select() and poll():
State-based: Is socket ready for reading/writing?
select() interface has FD_SET bitmasks turned on/off based on
interest
poll() uses a simple array: a larger structure, but a simpler implementation
Performance costs:
Kernel scans O(N) descriptors to set bits
User application scans O(N) descriptors
select() bit manipulation can be expensive
Problems:
Traffic is bursty, connections not active all at once
# (active connections) << # (open connections).
Costs are O(total connections), not O(active connections)
Application keeps specifying interest set repeatedly

Web Servers: Erich Nahum 49


Event-Based Notification
Banga, Mogul & Druschel (USENIX 99)
Propose an event based approach, rather than state-
based:
"Something just happened on socket X", rather than "socket X is ready for reading or writing"
Server takes event as indication socket might be ready
Multiple events can happen on a single socket (e.g., packets
draining (implying writeable) or accumulating (readable))
API has following:
Application notifies kernel by calling declare_interest() once
per file descriptor (e.g., after accept()), rather than multiple
times like in select()/poll()
Kernel queues events internally
Application calls get_next_event() to see changes

Web Servers: Erich Nahum 50


Event-Based Notification (cont)
Problems:
Kernel has to allocate storage for event queue. Little's law says it
needs to be proportional to the event rate
Bursty applications could overflow queue
Can address multiple events by coalescing based on FD
Results in storage O(total connections).
Application has to change the way it thinks:
Respond to events, instead of checking state.
If events are missed, connections might get stuck.
Evaluation shows it scales nicely:
cost is O(active) not O(total)
Windows NT has something similar:
called IO completion ports

Web Servers: Erich Nahum 51


Notification in the Real World
POSIX Real-Time Signals:
Different concept: Unix signals are invoked when something is
ready on a file descriptor.
Signals are expensive and difficult to control (e.g., no ordering),
so applications can suppress signals and then retrieve them via
sigwaitinfo()
If signal queue fills up, events will be dropped. A separate signal
is raised to notify application about signal queue overflow.
Problems:
If signal queue overflows, then app must fall back on state-
based approach. Chandra and Mosberger propose signal-per-fd
(coalescing events per file descriptor).
Only one event is retrieved at a time: Provos and Lever propose
sigtimedwait4() to retrieve multiple signals at once

Web Servers: Erich Nahum 52


Notification in the Real World
Sun's /dev/poll:
App notifies kernel by writing to special file /dev/poll to
express interest
App does IOCTL on /dev/poll for list of ready FD's
App and kernel are still both state based
Kernel still pays O(total connections) to create FD list
Libenzi's /dev/epoll (patch for Linux 2.4):
Uses /dev/epoll as interface, rather than /dev/poll
Application writes interest to /dev/epoll and IOCTL's to get
events
Events are coalesced on a per-FD basis
Semantically identical to RT signals with sig-per-fd &
sigtimedwait4().

Web Servers: Erich Nahum 53


Real File Asynchronous I/O

Like setting O_NDELAY (non-blocking) on file descriptors:
Application can queue reads and writes on FDs and pick them up
later (like dry cleaning)
Requires support in the file system (e.g., callbacks)
Currently doesn't exist on many OS's:
POSIX specification exists
Solaris has non-standard version
Linux has it slated for 2.5 kernel
Two current candidates on Linux:
SGI's /dev/kaio and Ben LaHaise's /dev/aio
Proper implementation would allow Flash to eliminate
helpers
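
A sketch of the POSIX specification mentioned above (aio.h), assuming fd and buffer are already set up; a real server would take a completion signal rather than polling as shown:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

ssize_t asyncRead(int fd, char *buffer, size_t len)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buffer;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    aio_read(&cb);                        /* queue the read; returns immediately */
    /* ... do other useful work here ... */
    while (aio_error(&cb) == EINPROGRESS)
        ;                                 /* busy-wait only for illustration */
    return aio_return(&cb);               /* pick up the result, like dry cleaning */
}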

Web Servers: Erich Nahum 54


Summary: Event Notification
Goal is to exploit concurrency
Concurrency in user workloads means host CPU can overlap
multiple events to maximize parallelism
Keep network, disk busy; never block
Event notification changes applications:
state-based to event-based
requires a change in thinking
Goal is to minimize costs:
user/kernel crossings and testing idle socket descriptors
Event-based notification not yet fully deployed:
Most mechanisms only support network I/O, not file I/O
Full deployment of Asynchronous I/O spec should fix this

Web Servers: Erich Nahum 55


Chapter 5:
Workload Characterization

Web Servers: Erich Nahum 56


Workload Characterization

Why Characterize Workloads?


Gives an idea about traffic behavior
("Which documents are users interested in?")
Aids in capacity planning
("Is the number of clients increasing over time?")
Aids in implementation
("Does caching help?")

How do we capture them ?


Through server logs (typically enabled)
Through packet traces (harder to obtain and to process)

Web Servers: Erich Nahum 57


Factors to Consider

[Diagram: possible trace collection points — client? proxy? server?]

Where do I get logs from?


Client logs give us an idea, but not necessarily the same
Same for proxy logs
What we care about is the workload at the server
Is trace representative?
Corporate POP vs. News vs. Shopping site
What kind of time resolution?
e.g., second, millisecond, microsecond
Does trace/log capture all the traffic?
e.g., incoming link only, or one node out of a cluster
Web Servers: Erich Nahum 58
Probability Refresher
Lots of variability in workloads
Use probability distributions to express
Want to consider many factors

Some terminology/jargon:
Mean: average of samples
Median : half are bigger, half are smaller
Percentiles: dump samples into N bins
(median is 50th percentile number)

Heavy-tailed:
Pr[X > x] ~ c x^{-a}  as x → ∞

Web Servers: Erich Nahum 59


Important Distributions
Some Frequently-Seen Distributions:

Normal (mean μ, variance σ²):
f(x) = e^{-(x-μ)²/(2σ²)} / (σ √(2π))

Lognormal (x ≥ 0; σ > 0):
f(x) = e^{-(ln(x)-μ)²/(2σ²)} / (x σ √(2π))

Exponential (x ≥ 0; rate λ):
f(x) = λ e^{-λx}

Pareto (x ≥ k; shape a, scale k):
f(x) = a k^a / x^{a+1}

Web Servers: Erich Nahum 60


More Probability

Graph shows 3 distributions with average = 2.
Note average ≠ median in all cases!
Different distributions have different weight in the tail.

Web Servers: Erich Nahum 61


What Info is Useful?

Request methods
GET, POST, HEAD, etc.
Response codes
success, failure, not-modified, etc.
Size of requested files
Size of transferred objects
Popularity of requested files
Numbers of embedded objects
Inter-arrival time between requests
Protocol support (1.0 vs. 1.1)

Web Servers: Erich Nahum 62


Sample Logs for Illustration
Name:         Chess 1997    Olympics 1998   IBM 1998     IBM 2001
Description:  Kasparov-     Nagano 1998     Corporate    Corporate
              Deep Blue     Olympics        Presence     Presence
              Event Site    Event Site
Period:       2 weeks in    2 days in       1 day in     1 day in
              May 1997      Feb 1998        June 1998    Feb 2001
Hits:         1,586,667     11,485,600      5,800,000    12,445,739
Bytes:        14,171,711    54,697,108      10,515,507   28,804,852
Clients:      256,382       86,021          80,921       319,698
URLs:         2,293         15,788          30,465       42,874

We'll use statistics generated from these logs as examples.

Web Servers: Erich Nahum 63


Request Methods

          Chess 1997   Olympics 1998   IBM 1998    IBM 2001
GET       96%          99.6%           99.3%       97%
HEAD      04%          00.3%           00.08%      02%
POST      00.007%      00.04%          00.02%      00.2%
Others:   noise        noise           noise       noise

KR01: "overwhelming majority" are GETs, few POSTs
IBM2001 trace starts seeing a few 1.1 methods (CONNECT, OPTIONS, LINK), but still very small (10^-5 %)

Web Servers: Erich Nahum 64


Response Codes
Code Meaning Chess Olympics IBM IBM
1997 1998 1998 2001
200 OK 85.32 76.02 75.28 67.72
204 NO_CONTENT --.-- --.-- 00.00001 --.--
206 PARTIAL_CONTENT 00.25 --.-- --.-- --.--
301 MOVED_PERMANENTLY 00.05 --.-- --.-- --.--
302 MOVED_TEMPORARILY 00.05 00.05 01.18 15.11
304 NOT_MODIFIED 13.73 23.24 22.84 16.26
400 BAD_REQUEST 00.001 00.0001 00.003 00.001
401 UNAUTHORIZED --.- 00.001 00.0001 00.001
403 FORBIDDEN 00.01 00.02 00.01 00.009
404 NOT_FOUND 00.55 00.64 00.65 00.79
407 PROXY_AUTH --.-- --.-- --.-- 00.002
500 SERVER_ERROR --.-- 00.003 00.006 00.07
501 NOT_IMPLEMENTED --.-- 00.0001 00.0005 00.006
503 SERVICE_UNAVAIL --.-- --.-- 00.0001 00.0003
??? UNKNOWN 00.0003 00.00004 00.005 00.0004

Table shows percentage of responses.


Majority are OK and NOT_MODIFIED.
Consistent with numbers from AW96, KR01.

Web Servers: Erich Nahum 65


Resource (File) Sizes

Shows file/memory usage (not weighted by frequency!)


Lognormal body, consistent with results from AW96, CB96, KR01.
AW96, CB96: sizes have Pareto tail; Downey01: Sizes are lognormal.

Web Servers: Erich Nahum 66


Tails from the File Size

Shows the complementary CDF (CCDF) of file sizes.


Haven't done the curve fitting, but it looks Pareto-ish.

Web Servers: Erich Nahum 67


Response (Transfer) Sizes

Shows network usage (weighted by frequency of requests)


Lognormal body, pareto tail, consistent with CBC95,
AW96, CB96, KR01

Web Servers: Erich Nahum 68


Tails of Transfer Size

Shows the complementary CDF (CCDF) of transfer sizes.
Looks somewhat Pareto-like; certainly some big transfers.

Web Servers: Erich Nahum 69


Resource Popularity

Follows a Zipf model: p(r) ~ r^{-alpha}
(alpha = 1: true Zipf; other values: "Zipf-like")
Consistent with CBC95, AW96, CB96, PQ00, KR01
Shows that caching popular documents is very effective
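
A sketch of how one might sample document ranks from such a Zipf-like distribution (e.g., inside a workload generator); N and alpha are illustrative parameters, not values from these logs:

#include <math.h>
#include <stdlib.h>

#define N 10000                 /* illustrative number of documents */
static double cdf[N];

void buildZipfCDF(double alpha)
{
    double norm = 0.0, sum = 0.0;
    int r;
    for (r = 1; r <= N; r++) norm += pow(r, -alpha);
    for (r = 1; r <= N; r++) {
        sum += pow(r, -alpha) / norm;
        cdf[r - 1] = sum;       /* cumulative popularity by rank */
    }
}

int pickDocument(void)          /* returns a rank; 0 = most popular */
{
    double u = drand48();
    int r = 0;
    while (r < N - 1 && cdf[r] < u) r++;  /* linear scan for clarity */
    return r;
}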

Web Servers: Erich Nahum 70


Number of Embedded Objects

Mah97: avg 3, 90% are 5 or less


BC98: pareto distr, median 0.8, mean 1.7
Arlitt98 World Cup study: median 15 objects, 90%
are 20 or less
MW00: median 7-17, mean 11-18, 90% 40 or less
STA00: median 5,30 (2 traces), 90% 50 or less
Mah97, BC98, SCJO01: embedded objects tend to
be smaller than container objects
KR01: median is 8-20, pareto distribution

Trend seems to be that number is increasing over time.

Web Servers: Erich Nahum 71


Session Inter-Arrivals

Inter-arrival time between successive requests


"Think time"
difference between user requests vs. ALL requests
partly depends on definition of boundary
CB96: variability across multiple timescales, "self-
similarity", average load very different from peak
or heavy load
SCJO01: log-normal, 90% less than 1 minute.
AW96: independent and exponentially distributed
KR01: pareto with a=1.5, session arrivals follow
poisson distribution, but requests follow pareto

Web Servers: Erich Nahum 72


Protocol Support

IBM.com 2001 logs:


Show roughly 53% of client requests are 1.1
KA01 study:
92% of servers claim to support 1.1 (as of Sep 00)
Only 31% actually do; most fail to comply with spec
SCJO01 show:
Avg 6.5 requests per persistent connection
65% have 2 connections per page, rest more.
40-50% of objects downloaded by persistent connections

Appears that we are in the middle of a slow transition to 1.1

Web Servers: Erich Nahum 73


Summary: Workload
Characterization

Traffic is variable:
Responses vary across multiple orders of magnitude
Traffic is bursty:
Peak loads much larger than average loads
Certain files more popular than others
Zipf-like distribution captures this well
Two-sided aspect of transfers:
Most responses are small (zero pretty common)
Most of the bytes are from large transfers
Controversy over Pareto/log-normal distribution
Non-trivial for workload generators to replicate

Web Servers: Erich Nahum 74


Chapter 6: Workload Generators

Web Servers: Erich Nahum 75


Why Workload Generators?
Allows stress-testing and bug-finding
Gives us some idea of server capacity
Allows us a scientific process to compare approaches
(e.g., server models, gigabit adaptors, OS implementations)
Assumption is that difference in testbed translates to some difference in real-world
Allows the performance debugging cycle

[Diagram: the Performance Debugging Cycle — measure, reproduce, find problem, fix and/or improve, and repeat]

Web Servers: Erich Nahum 76


Problems with Workload
Generators

Only as good as our understanding of the traffic


Traffic may change over time
generators must too
May not be representative
e.g., are file size distributions from IBM.com similar to mine?
May be ignoring important factors
e.g., browser behavior, WAN conditions, modem connectivity
Still, useful for diagnosing and treating problems

Web Servers: Erich Nahum 77


How does W. Generation Work?
Many clients, one server
match asymmetry of Internet
Server is populated with some kind
of synthetic content
Simulated clients produce requests
for server
Master process to control clients,
aggregate results
Goal is to measure server
not the client or network
[Diagram: many simulated clients send requests to the server under test and receive responses]
Must be robust to conditions
e.g., if server keeps sending 404 not
found, will clients notice?

Web Servers: Erich Nahum 78


Evolution: WebStone
The original workload generator from SGI in 1995
Process based workload generator, implemented in C
Clients talk to master via sockets
Configurable: # client machines, # client processes, run time
Measured several metrics: avg + max connect time, response
time, throughput rate (bits/sec), # pages, # files
1.0 only does GETs, CGI support added in 2.0
Static requests, 5 different file sizes:
Percentage   Size
35.00        500 B
50.00        5 KB
14.00        50 KB
00.90        500 KB
00.10        5 MB

www.mindcraft.com/webstone

Web Servers: Erich Nahum 79


Evolution: SPECWeb96
Developed by SPEC
Systems Performance Evaluation Consortium
Non-profit group with many benchmarks (CPU, FS)
Attempt to get more representative
Based on logs from NCSA, HP, Hal Computers
4 classes of files:

Percentage   Size
35.00        0-1 KB
50.00        1-10 KB
14.00        10-100 KB
01.00        100 KB-1 MB

Poisson distribution within each class

Web Servers: Erich Nahum 80


SPECWeb96 (cont)

Notion of scaling versus load:


number of directories in data set doubles as expected throughput quadruples: directories = sqrt(throughput/5) * 10 (e.g., at 500 ops/sec, sqrt(100) * 10 = 100 directories)
requests spread evenly across all application directories
Process based WG
Clients talk to master via RPC's (less robust)
Still only does GETS, no keep-alive

www.spec.org/osg/web96

Web Servers: Erich Nahum 81


Evolution: SURGE
Scalable URL Reference GEnerator
Barford & Crovella at Boston University CS Dept.
Much more worried about representativeness,
captures:
server file size distributions,
request size distribution,
relative file popularity
embedded file references
temporal locality of reference
idle periods ("think times") of users
Process/thread based WG

Web Servers: Erich Nahum 82


SURGE (cont)

Notion of user-equivalent:
statistical model of a user
active off time (between URLS),
inactive off time (between pages)
Captures various levels of burstiness
Not validated, shows that load generated is
different than SpecWeb96 and has more
burstiness in terms of CPU and # active
connections

www.cs.wisc.edu/~pb

Web Servers: Erich Nahum 83


Evolution: S-client
Almost all workload generators are closed-loop:
client submits a request, waits for server, maybe thinks for some
time, repeat as necessary
Problem with the closed-loop approach:
client can't generate requests faster than the server can respond
limits the generated load to the capacity of the server
in the real world, arrivals don't depend on server state
i.e., real users have no idea about load on the server when they click on a site, although successive clicks may have this property
in particular, can't overload the server
s-client tries to be open-loop:
by generating connections at a particular rate
independent of server load/capacity

Web Servers: Erich Nahum 84


S-Client (cont)
How is s-client open-loop?
connecting asynchronously at a particular rate
using non-blocking connect() socket call
Connect complete within a particular time?
if yes, continue normally.
if not, socket is closed and new connect initiated.
Other details:
uses single-address space event-driven model like Flash
calls select() on large numbers of file descriptors
can generate large loads
Problems:
client capacity is still limited by active FD's
arrival is a TCP connect, not an HTTP request

www.cs.rice.edu/CS/Systems/Web-measurement
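
A sketch of the rate-paced, non-blocking connect at the heart of s-client; the connectTimedOut() and startNewConnect() helpers are hypothetical stand-ins for its timer bookkeeping:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int openConnect(struct sockaddr_in *srv)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(s, F_SETFL, O_NONBLOCK);              /* never block in connect() */
    if (connect(s, (struct sockaddr *)srv, sizeof(*srv)) < 0 &&
        errno != EINPROGRESS) {
        close(s);                               /* immediate failure */
        return -1;
    }
    return s;  /* completion detected later, e.g., via select() writability */
}

/* later, when this connection's deadline expires:
   if (connectTimedOut(s)) { close(s); startNewConnect(); }   */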

Web Servers: Erich Nahum 85


Evolution: SPECWeb99
In response to people "gaming" benchmark, now includes rules:
IP maximum segment lifetime (MSL) must be at least 60 seconds
(more on this later!)
Link-layer maximum transmission unit (MTU) must not be larger than
1460 bytes (Ethernet frame size)
Dynamic content may not be cached
not clear that this is followed
Servers must log requests.
W3C common log format is sufficient but not mandatory.
Resulting workload must be within 10% of target.
Error rate must be below 1%.
Metric has changed:
now "number of simultaneous conforming connections": rate of a connection must be greater than 320 Kbps

Web Servers: Erich Nahum 86


SPECWeb99 (cont)
Directory size has changed:
(25 + (400000/122000) * simultaneous conns) / 5.0
Improved HTTP 1.0/1.1 support:
Keep-alive requests (client closes after N requests)
Cookies
Back-end notion of user demographics
Used for ad rotation
Request includes user_id and last_ad
Request breakdown:
70.00 % static GET
12.45 % dynamic GET
12.60 % dynamic GET with custom ad rotation
04.80 % dynamic POST
00.15 % dynamic GET calling CGI code

Web Servers: Erich Nahum 87


SPECWeb99 (cont)
Other breakdowns:
30 % HTTP 1.0 with no keep-alive or persistence
70 % HTTP 1.0 with keep-alive to "model" persistence
still has 4 classes of file size with Poisson distribution
supports Zipf popularity
Client implementation details:
Master-client communication now uses sockets
Code includes sample Perl code for CGI
Client configurable to use threads or processes
Much more info on setup, debugging, tuning
All results posted to web page,
including configuration & back end code

www.spec.org/osg/web99

Web Servers: Erich Nahum 88


So how realistic is SPECWeb99?

We'll compare a few characteristics:
File size distribution (body)
File size distribution (tail)
Transfer size distribution (body)
Transfer size distribution (tail)
Document popularity
Visual comparison only
No curve-fitting, r-squared plots, etc.
Point is to give a feel for accuracy

Web Servers: Erich Nahum 89


SpecWeb99 vs. File Sizes

SpecWeb99: In the ballpark, but not very smooth

Web Servers: Erich Nahum 90


SpecWeb99 vs. File Size Tail

SpecWeb99 tail isn't as long as real logs (900 KB max)

Web Servers: Erich Nahum 91


SpecWeb99 vs.Transfer Sizes

Doesn't capture 304 (not modified) responses
Coarser distribution than real logs (i.e., not smooth)

Web Servers: Erich Nahum 92


Spec99 vs.Transfer Size Tails

SpecWeb99 does OK, although tail drops off rapidly (and in fact, no file is greater than 1 MB in SpecWeb99!).

Web Servers: Erich Nahum 93


Spec99 vs. Resource Popularity

SpecWeb99 seems to do a good job, although the tail isn't long enough

Web Servers: Erich Nahum 94


Evolution: TPC-W
Transaction Processing Performance Council (TPC)
More known for database workloads like TPC-D
Metrics include dollars/transaction (unlike SPEC)
Provides specification, not source
Meant to capture a large e-commerce site
Models online bookstore
web serving, searching, browsing, shopping carts
online transaction processing (OLTP)
decision support (DSS)
secure purchasing (SSL), best sellers, new products
customer registration, administrative updates
Has notion of scaling per user
5 MB of DB tables per user
1 KB per shopping item, 25 KB per item in static images

Web Servers: Erich Nahum 95


TPC-W (cont)
Remote browser emulator (RBE)
emulates a single user
send HTTP request, parse response, think for a while, repeat
Metrics:
WIPS: shopping
WIPSb: browsing
WIPSo: ordering
Setups tend to be very large:
multiple image servers, application servers, load balancer
DB back end (typically SMP)
Example: IBM 12-way SMP w/DB2, 9 PCs w/IIS: $1M

www.tpc.org/tpcw

Web Servers: Erich Nahum 96


Summary: Workload Generators
Only the beginning. Many other workload generators:
httperf from HP
WAGON from IBM
WaspClient from IBM
Others?
Both workloads and generators change over time:
Both started simple, got more complex
As workload changes, so must generators
No one single "good" generator
SpecWeb99 seems the favorite (2002 rumored in the works)
Implementation issues similar to servers:
They are networked-based request producers
(i.e., produce GET's instead of 200 OK's).
Implementation affects capacity planning of clients!
(want to make sure clients are not bottleneck)

Web Servers: Erich Nahum 97


Chapter 7: Introduction to TCP

Web Servers: Erich Nahum 98


Introduction to TCP
Layering is a common principle in network protocol design
TCP is the major transport protocol in the Internet
Since HTTP runs on top of TCP, much interaction between the two
Asymmetry in client-server model puts strain on server-side TCP implementations
Thus, major issue in web servers is TCP implementation and behavior

[Diagram: the protocol stack — application, transport, network, link, physical]

Web Servers: Erich Nahum 99


The TCP Protocol

Connection-oriented, point-to-point protocol:


Connection establishment and teardown phases
Phone-like circuit abstraction
One sender, one receiver
Originally optimized for certain kinds of transfer:
Telnet (interactive remote login)
FTP (long, slow transfers)
Web is like neither of these
Lots of work on TCP, beyond scope of this tutorial
e.g., know of 3 separate TCP tutorials!

Web Servers: Erich Nahum 100


TCP Protocol (cont)
[Diagram: the sending application writes data into the TCP send buffer through the socket layer; data segments flow to the receiver's TCP receive buffer, where the application reads them; ACK segments flow back to the sender]

Provides a reliable, in-order, byte stream abstraction:


Recover lost packets and detect/drop duplicates
Detect and drop bad packets
Preserve order in byte stream, no message boundaries
Full-duplex: bi-directional data flow in same connection
Flow and congestion controlled:
Flow control: sender will not overwhelm receiver
Congestion control: sender will not overwhelm network!
Send and receive buffers
Congestion and flow control windows

Web Servers: Erich Nahum 101


The TCP Header
[Diagram: TCP header layout, 32 bits wide — source port #, dest port #, sequence number, acknowledgement number, header length, flags (URG, ACK, PSH, RST, SYN, FIN), receiver window size, checksum, urgent data pointer, options (variable length), application data (variable length)]

Fields enable the following:
Uniquely identifying a connection (4-tuple of client/server IP address and port numbers)
Identifying a byte range within that connection
Checksum value to detect corruption
Identifying protocol transitions (SYN, FIN)
Informing other side of your state (ACK)

Web Servers: Erich Nahum 102


Establishing a TCP Connection
Client sends SYN with initial sequence number (ISN)
Server responds with its own SYN w/seq number and ACK of client (ISN+1) (next expected byte)
Client ACKs server's ISN+1
The 3-way handshake
All modulo 32-bit arithmetic

[Diagram: client calls connect() and sends SYN(X); the server, having called listen() on port 80, replies SYN(Y) + ACK(X+1); the client replies ACK(Y+1), and the server's accept() and read() proceed]

Web Servers: Erich Nahum 103


Sending Data
[Diagram: same sender/receiver buffer picture as before — data segments one way, ACK segments back]

Sender puts data on the wire:
Holds copy in case of loss
Sender must observe receiver flow control window
Sender can discard data when ACK is received
Receiver sends acknowledgments (ACKs)
ACKs can be piggybacked on data going the other way
Protocol says receiver should ACK every other packet in attempt to reduce ACK traffic (delayed ACKs)
Delay should not be more than 500 ms. (typically 200)
We'll see how this causes problems later

Web Servers: Erich Nahum 104


Preventing Congestion
Sender may not only overrun receiver, but may also overrun
intermediate routers:
No way to explicitly know router buffer occupancy,
so we need to infer it from packet losses
Assumption is that losses stem from congestion, namely, that
intermediate routers have no available buffers
Sender maintains a congestion window:
Never have more than CW of un-acknowledged data outstanding (or
RWIN data; min of the two)
Successive ACKs from receiver cause CW to grow.
How CW grows based on which of 2 phases:
Slow-start: initial state.
Congestion avoidance: steady-state.
Switch between the two when CW > slow-start threshold

Web Servers: Erich Nahum 105


Congestion Control Principles

Lack of congestion control would lead to congestion


collapse (Jacobson 88).
Idea is to be a good network citizen.
Would like to transmit as fast as possible without loss.
Probe network to find available bandwidth.
In steady-state: linear increase in CW per RTT.
After loss event: CW is halved.
This is called additive increase /multiplicative decrease
(AIMD).
Various papers on why AIMD leads to network stability.

Web Servers: Erich Nahum 106


Slow Start
Initial CW = 1.
After each ACK, CW += 1;
Continue until:
loss occurs OR
CW > slow start threshold
Then switch to congestion avoidance
If we detect loss, cut CW in half
Exponential increase in window size per RTT (1, 2, 4, 8, ...)

[Diagram: sender transmits one segment; after its ACK (one RTT) it sends two segments; after those ACKs, four segments, doubling each RTT]

Web Servers: Erich Nahum 107


Congestion Avoidance

Until (loss) {
    after CW packets ACKed:
        CW += 1;
}
ssthresh = CW/2;
Depending on loss type:
    SACK/Fast Retransmit: CW /= 2; continue;
    Coarse-grained timeout: CW = 1; go to slow start.

(This is for TCP Reno/SACK; TCP Tahoe always sets CW=1 after a loss)

Web Servers: Erich Nahum 108


How are losses recovered?
Say packet is lost (data or ACK!)
Coarse-grained Timeout:
Sender does not receive ACK after some period of time
Event is called a retransmission timeout (RTO)
RTO value is based on estimated round-trip time (RTT)
RTT is adjusted over time using an exponential weighted moving average:
RTT = (1-x)*RTT + (x)*sample
(x is typically 0.1)
First done in TCP Tahoe

[Diagram: lost ACK scenario — sender transmits Seq=92, 8 bytes of data; the receiver's ACK=100 is lost; the sender's timeout fires and it retransmits Seq=92, which is then ACKed]
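
The estimator above in a few lines of C; the gain x = 0.1 follows the slide, while the initial estimate is an illustrative assumption:

/* exponentially weighted moving average of RTT samples */
static double estRTT = 0.5;            /* illustrative starting estimate, seconds */

void onRTTSample(double sample)
{
    const double x = 0.1;
    estRTT = (1.0 - x) * estRTT + x * sample;
    /* the RTO is then derived from estRTT (and, post-Jacobson, its variance) */
}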

Web Servers: Erich Nahum 109


Fast Retransmit
Receiver expects N, gets N+1:
Immediately sends ACK(N)
This is called a duplicate ACK
Does NOT delay ACKs here!
Continue sending dup ACKs for each subsequent packet (not N)
Sender gets 3 duplicate ACKs:
Infers N is lost and resends
3 chosen so out-of-order packets don't trigger Fast Retransmit accidentally
Called fast since we don't need to wait for a full RTT
Introduced in TCP Reno

[Diagram: receiver ACKs 3000; sender sends SEQ=3000 (lost), then 4000, 5000, 6000; each of the latter three triggers a duplicate ACK 3000; on the third duplicate the sender retransmits SEQ=3000]

Web Servers: Erich Nahum 110


Other loss recovery methods

Selective Acknowledgements (SACK):


Returned ACKs contain option w/SACK block
Block says, "got up to N-1 AND got N+1 through N+3"
A single ACK can generate a retransmission
New Reno partial ACKs:
New ACK during fast retransmit may not ACK all
outstanding data. Ex:
Have ACK of 1, waiting for 2-6, get 3 dup acks of 1
Retransmit 2, get ACK of 3, can now infer 4 lost as well
Other schemes exist (e.g., Vegas)
Reno has been prevalent; SACK now catching on

Web Servers: Erich Nahum 111


How about Connection Teardown?
Either side may terminate a connection. (In fact, connection can stay half-closed.) Let's say the server closes (typical in WWW)
Server sends FIN with seq number (SN+1) (i.e., FIN is a byte in sequence)
Client ACKs the FIN with SN+2 ("next expected")
Client sends its own FIN when ready
Server ACKs client FIN as well with SN+1.

[Diagram: server close() sends FIN(X); client replies ACK(X+1) and later close() sends its own FIN(Y); server replies ACK(Y+1); the side that closed first then sits in a timed-wait period before fully closing]

Web Servers: Erich Nahum 112


The TCP State Machine
TCP uses a Finite State Machine, kept by each side of a connection, to keep track of what
state a connection is in.
State transitions reflect inherent races that can happen in the network, e.g., two FIN's
passing each other in the network.
Certain things can go wrong along the way, i.e., packets can be dropped or corrupted. In
fact, machine is not perfect; certain problems can arise not anticipated in the original
RFC.
This is where timers will come in, which we will discuss more later.

Web Servers: Erich Nahum 113


TCP State Machine:
Connection Establishment
CLOSED: more implied than actual, i.e., no connection
LISTEN: willing to receive connections (accept call)
SYN-SENT: sent a SYN, waiting for SYN-ACK
SYN-RECEIVED: received a SYN, waiting for an ACK of our SYN
ESTABLISHED: connection ready for data transfer

[Diagram: client side — CLOSED → (application calls connect(), send SYN) → SYN_SENT → (receive SYN + ACK, send ACK) → ESTABLISHED. Server side — CLOSED → (application calls listen()) → LISTEN → (receive SYN, send SYN + ACK) → SYN_RCVD → (receive ACK) → ESTABLISHED]

Web Servers: Erich Nahum 114


TCP State Machine:
Connection Teardown
FIN-WAIT-1: we closed first, waiting for ACK of our FIN (active close)
FIN-WAIT-2: we closed first, other side has ACKed our FIN, but not yet FIN'ed
CLOSING: other side closed before it received our FIN
TIME-WAIT: we closed, other side closed, got ACK of our FIN
CLOSE-WAIT: other side sent FIN first, not us (passive close)
LAST-ACK: other side sent FIN, then we did, now waiting for ACK

[Diagram: active close — ESTABLISHED → (close() called, send FIN) → FIN_WAIT_1, then either (receive ACK) → FIN_WAIT_2 → (receive FIN, send ACK) → TIME_WAIT, or (receive FIN, send ACK of FIN) → CLOSING → (receive ACK) → TIME_WAIT; TIME_WAIT → (wait 2*MSL, 240 seconds) → CLOSED. Passive close — ESTABLISHED → (receive FIN, send ACK) → CLOSE_WAIT → (close() called, send FIN) → LAST_ACK → (receive ACK) → CLOSED]

Web Servers: Erich Nahum 115


Summary: TCP Protocol

Protocol provides reliability in face of complex


network behavior
Tries to trade off efficiency with being "good
network citizen"
Vast majority of bytes transferred on Internet
today are TCP-based:
Web
Mail
News
Peer-to-peer (Napster, Gnutella, FreeNet, KaZaa)

Web Servers: Erich Nahum 116


Chapter 8: TCP Dynamics

Web Servers: Erich Nahum 117


TCP Dynamics
In this section we'll describe some of the problems you
can run into as a WWW server interacting with TCP.
Most of these affect the response as seen by the
client, not the throughput generated by the server.
Ideally, a server developer shouldn't have to worry
about this stuff, but in practice, we'll see that's not
the case.
Examples we'll look at include:
The initial window size
The delayed ACK problem
Nagle and its interaction with delayed ack
Small receive windows interfering with loss recovery

Web Servers: Erich Nahum 118


TCP's Initial Window Problem

Recall congestion control:
sender's initial congestion window is set to 1
Recall delayed ACKs:
ack every other packet
set 200 ms. delayed ack timer
Short-term deadlock:
sender is waiting for ACK since it sent 1 segment
receiver is waiting for 2nd segment before ACKing
Problem worse than it seems:
multiple objects per web page
IE does not do pipelining!

[Diagram: client sends GET /index.html; server sends the 1st segment and stalls; only after the receiver's 200 ms delayed-ACK timer fires does the ACK of the 1st segment arrive, letting the 2nd and 3rd segments go out and be ACKed together]
Web Servers: Erich Nahum 119


Solving the IW Problem
Solution: set IW = 2-4
RFC 2414
Didn't affect many BSD systems since they (incorrectly) counted the connection setup in congestion window calculation
Delayed ACK still happens, but now out of critical path of response time for download

[Diagram: client sends GET /index.html; server sends 1st and 2nd segments back-to-back; the client ACKs both after one RTT, the 3rd segment follows, and the 200 ms delayed ACK of the 3rd segment falls after the download completes]

Web Servers: Erich Nahum 120


Receive Window Size Problem
Recall Fast Retransmit:
Amount of data in flight: MIN(cong win, recv win)
can't ever have more than that outstanding
In order for FR to work:
enough data has to be in flight
after lost packet, 3 more segments must arrive
4.5 KB of receive-side buffer space must be available.
note many web documents are less than 4.5 KB!

[Diagram: SEQ=3000 is lost; segments 4000, 5000, and 6000 must still fit in the window so that three duplicate ACKs of 3000 come back and trigger the retransmission of SEQ=3000]

Web Servers: Erich Nahum 121


Receive Window Size (cont)
Previous discussion assumes large enough receive windows!
Early versions of MS Windows had 16 KB default recv. window
Balakrishnan et al. 1998:
Study server TCP traces from 1996 Olympic Web Server
show over 50% of clients have receive window < 10K
Many suffer coarse-grained retransmission timeouts (RTOs)
Even SACK would not have helped!

[Diagram: with RWIN = 2000, only two segments fit in flight; when SEQ=3000 is lost, it is illegal for the sender to send more, so no duplicate ACKs are generated and the sender must wait out an RTO before retransmitting SEQ=3000]

Web Servers: Erich Nahum 122


Fixing Receive Window Problem
Balakrishnan et al. 98: "right-edge recovery"
Also proposed by Lin & Kung 98
Now an RFC (3042)
How does it work?
Arrival of dup ack means a segment has left the network
When dup ACK is received, send next segment (not retransmission)
Continue with 2nd and 3rd dup acks
Idea is to "keep the ACK clock flowing" by forcing more duplicate acks to be generated
Claim is that it would have avoided 25% of coarse-grained timeouts in 96 Olympics trace

[Diagram: RWIN = 2000; SEQ=3000 is lost; SEQ=4000 arrives and triggers dup ACK 3000 (RWIN now 1000); the sender responds with new segment SEQ=5000, whose dup ACK allows SEQ=6000; the third dup ACK finally triggers the retransmission of SEQ=3000]

Web Servers: Erich Nahum 123


The Nagle Algorithm

Different types of TCP traffic exist:


Some apps (e.g., telnet) send one byte of data, then wait
for ACK
Others (e.g., FTP) use full-size segments
Recall server can write() to a socket at any time
Once written, should host stack send? Or should we wait
and hope to get more data?
May send many small packets, which is bad for 2
reasons:
Uses more network bandwidth (raises ratio of headers to
content)
Uses more CPU (many costs are per-packet, not per-byte)

Web Servers: Erich Nahum 124


The Nagle Algorithm

Solution is the Nagle algorithm:

If full-size segment of data is available, just send


If small segment available, and there is no unacknowledged data
outstanding, send
Otherwise, wait until either:
More data arrives from above (can coalesce packet), or
ACK arrives acknowledging outstanding data
Idea is have at most one small packet outstanding

Web Servers: Erich Nahum 125


Interaction of Nagle & Delayed ACK
Nagle and delayed ACKs cause (temporary) deadlock:
Sender wants to send 1.5 segments, sends first full one
Nagle prevents second from being sent (since not full size, and now we have unacked data outstanding)
Sender waits for delayed ACK from receiver
Receiver is waiting for 2nd segment before sending ACK
Similar to IW=1 problem earlier
Result: Many disable Nagle via setsockopt() call

[Diagram: client sends GET /index.html; the server write()s 1.5 segments and the first (full-size) one goes out; Nagle forbids sending the half-size remainder; after the 200 ms delayed-ACK timer fires, the ACK of the 1st segment arrives and the 2nd (half-size) segment finally goes out]
Web Servers: Erich Nahum 126


Interaction of Nagle & Delayed ACK

For example, in WWW servers:


original NCSA server issued a write() for every header
Apache does its own buffering to do a single write() call
other servers use writev() (e.g., Flash)
if not careful you can flood the network with packets
More of an issue when using persistent connections:
closing the connection forces data out with the FIN bit
but persistent connections or 1.0 keep-alives affected
Mogul and Minshall 2001 evaluate a number of modifications to Nagle to deal with this
Linux has similar "TCP_CORK" option
suppresses any non-full segment
application has to remember to disable TCP_CORK when finished.
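A minimal sketch of the cork/uncork pattern; send_response() and its buffers are hypothetical, but TCP_CORK itself is the real Linux option:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <string.h>
    #include <unistd.h>

    void send_response(int sock, const char *hdr, const char *body)
    {
        int on = 1, off = 0;

        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
        write(sock, hdr, strlen(hdr));   /* held back and coalesced */
        write(sock, body, strlen(body)); /* into full-size segments */
        /* uncorking flushes the final partial segment; forgetting
           this leaves the tail of the response stuck in the kernel */
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
    }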

Web Servers: Erich Nahum 127


Summary: TCP Dynamics

Many ways in which an HTTP transfer can interact


with TCP
Interaction of factors can cause delays in
response time as seen by clients
Hard to shield server developers from having to
understand these issues
Mistakes can cause problems such as a flood of
small packets

Web Servers: Erich Nahum 128


Chapter 9: TCP Implementation

Web Servers: Erich Nahum 129


Server TCP Implementation

In this section we look at ways in which the host


TCP implementation is stressed under large web
server workloads. Most of these techniques deal
with large numbers of connections:
Looking up arriving TCP segments with large numbers of
connections
Dealing with the TIME-WAIT state caused by closing
large numbers of connections
Managing large numbers of timers to support connections
Dealing with memory consumption of connection state
Removing data-touching operations
byte copying and checksums

Web Servers: Erich Nahum 130


In the beginning… BSD 4.3
Recall how demultiplexing works:
  given a packet, want to find the connection state (PCB in BSD)
  keyed on the 4-tuple of source & destination ports and IP addresses
Original BSD:
  used a one-behind cache with linear search to match the 4-tuple
  assumption was "the next segment is very likely from the same connection"
  assumed a solitary, long-lived, FTP-like transfer
  average miss time is O(N/2) (N = length of the PCB list; a sketch of this lookup follows)
[diagram: a packet arriving for IP 10.1.1.2, port 5194 is checked against the one-behind cache, then walked down the PCB list (IP 192.123.168.40 port 23; IP 1.2.3.4 port 45981; IP 9.2.16.1 port 873; IP 118.23.48.3 port 65383) until the matching entry at the end]
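A sketch of that lookup in C; the PCB layout is illustrative, not the real BSD definition:

    #include <stdint.h>
    #include <stddef.h>

    struct pcb {
        uint32_t laddr, faddr;   /* local/foreign IP addresses */
        uint16_t lport, fport;   /* local/foreign ports */
        struct pcb *next;
    };

    static struct pcb *pcb_cache;    /* one-behind cache */

    struct pcb *pcb_lookup(struct pcb *head,
                           uint32_t laddr, uint16_t lport,
                           uint32_t faddr, uint16_t fport)
    {
        struct pcb *p = pcb_cache;

        /* fast path: same connection as the previous segment */
        if (p != NULL && p->laddr == laddr && p->lport == lport &&
            p->faddr == faddr && p->fport == fport)
            return p;

        /* slow path: O(N) walk of the entire PCB list */
        for (p = head; p != NULL; p = p->next) {
            if (p->laddr == laddr && p->lport == lport &&
                p->faddr == faddr && p->fport == fport) {
                pcb_cache = p;   /* remember for the next segment */
                return p;
            }
        }
        return NULL;
    }

With thousands of concurrent web connections, the fast path almost never hits, and every miss pays the linear walk.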

Web Servers: Erich Nahum 131


PCB Hash Tables
McKenney & Dove SigComm 92:
  linear search with a one-behind cache doesn't work well for transaction workloads
  hashing does much better
  hash based on the 4-tuple
  cost: O(1) (constant time; a sketch follows)
BSD adds a hash table in the 90's
  other BSD Unixes (such as AIX) quickly followed
Algorithmic work on hash tables:
  e.g., CLR book, perfect hash tables
  none specific to web workloads
  hash table sizing problematic
[diagram: an arriving packet for IP 10.1.1.2, port 5194 hashes straight to its bucket in a PCB hash table with buckets 0 … (N-1)]
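A sketch of the hashed lookup; the hash function here is illustrative, not the one any particular BSD uses:

    #include <stdint.h>
    #include <stddef.h>

    #define PCB_TABLE_SIZE 4096          /* sizing is the hard part */

    struct pcb {
        uint32_t laddr, faddr;
        uint16_t lport, fport;
        struct pcb *next;                /* bucket chain */
    };

    static struct pcb *pcb_table[PCB_TABLE_SIZE];

    static unsigned pcb_hash(uint32_t laddr, uint16_t lport,
                             uint32_t faddr, uint16_t fport)
    {
        uint32_t h = laddr ^ faddr ^ ((uint32_t)lport << 16) ^ fport;
        return h % PCB_TABLE_SIZE;
    }

    struct pcb *pcb_lookup(uint32_t laddr, uint16_t lport,
                           uint32_t faddr, uint16_t fport)
    {
        unsigned slot = pcb_hash(laddr, lport, faddr, fport);
        struct pcb *p;

        /* O(1) expected: only this bucket's chain is searched */
        for (p = pcb_table[slot]; p != NULL; p = p->next)
            if (p->laddr == laddr && p->lport == lport &&
                p->faddr == faddr && p->fport == fport)
                return p;
        return NULL;
    }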

Web Servers: Erich Nahum 132


Problem of Old Duplicates
Recall in the Internet:
  packets may be arbitrarily duplicated, delayed, and reordered
  while rare, this case must be accounted for
Consider the following:
  two hosts connect, transfer data, close
  client starts a new connection using the same 4-tuple
  a duplicate packet arrives from the first connection
  the connection has been closed, so the state is gone
  how can you distinguish?
[diagram: a full connection (SYN (X); SYN (Y) ACK(X+1); GET /index.html; content + ACK + FIN exchange; final ACK) completes; afterwards a delayed duplicate SYN (X) arrives — old or new?]

Web Servers: Erich Nahum 133


Role of the TIME-WAIT State
Solution: don't do that!
  prevent the same 4-tuple from being reused
  one side must remember the 4-tuple for a period of time to reject old packets
  spec says whoever closes the connection must do this (in the TIME-WAIT state)
Period is 2 times the maximum segment lifetime (MSL), after which it is assumed no packet from the previous conversation will still be alive
MSL is defined as 2 minutes in RFC 1122
[diagram: the same exchange, but the closing side holds the connection in TIME-WAIT for 2 * MSL after the final ACK, so the late SYN (Z) is rejected]

Web Servers: Erich Nahum 134


TIME-WAIT Problem in Servers
Recall that in a WWW server, the server closes the connection!
  asymmetry of the client/server model means the server holds state for many clients
  each PCB sticks around for 2*MSL units of time
Mogul 1995 CA Election server study:
  shows large numbers (90%) of PCBs in TIME-WAIT
  would have been 97% if it had followed the proper MSL!
Example: doing 1000 connections/sec
  assume MSL is 120 seconds, and a request takes 1 second
  have 1000 connections in ESTABLISHED state
  240,000 connections (1000/sec * 2 * 120 sec) in TIME-WAIT state!
FTY99 propose & evaluate 3 schemes:
  require the client to close (requires changing HTTP)
  have the client use a new "client close" TCP option (requires changing TCP)
  do a client reset (requires changing the browser; MS did this for a while)
  claim 50% improvement in throughput, 85% in memory use

Web Servers: Erich Nahum 135


Dealing with TIME-WAIT
Sorting hash table entries (Aron & Druschel 99):
  demultiplexing requires that all PCBs (in some hash bucket) be examined before you can give up and say the connection was not found
  most lookups are for existing connections, which are in the ESTABLISHED state rather than TIME-WAIT
  can sort each PCB chain such that TIME-WAIT entries are at the end
  thus, ESTABLISHED entries are at the front of the chain
[diagram: a hash bucket chain with 192.123.168.40: ESTABLISHED at the front, followed by 128.119.82.37, 9.2.16.145, 178.23.48.3, and 10.1.1.2, all in TIME_WAIT]

Web Servers: Erich Nahum 136


Server Timer Management
Each TCP connection can have up to 5 timers associated with it:
  delayed ACK, retransmission, persist, keep-alive, time-wait
Original BSD:
  linear linked list of PCBs
  fast timer (200 ms): walk all PCBs for the delayed ACK timer
  slow timer (500 ms): walk all PCBs for all other timers
  time kept in relative form, so must subtract time (500 ms) from each PCB for the 4 larger timers
  costs: O(#PCBs), not O(#active timers)
[diagram: a PCB list whose entries carry pending timers: RTO in 2 secs; TIME-WAIT in 30 secs; delayed ACK in 100 ms; keep-alive in 1 sec; persist in 10 secs]

Web Servers: Erich Nahum 137


Server Timer Management
Can again exploit the semantics of the TIME-WAIT state:
  if PCBs are sorted by state, the delayed ACK timer walk can stop once it encounters a PCB in TIME-WAIT, since ACKs are not delayed for connections in the TIME-WAIT state (a sketch follows)
  Aron and Druschel show a 25 percent HTTP throughput improvement using this technique
  they attribute most of the win to reduced timer processing, but it probably helps PCB lookup as well
[diagram: the same sorted hash bucket chain: one ESTABLISHED entry at the front, TIME_WAIT entries behind it]
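A sketch of the early-terminating timer walk over one sorted bucket chain; the names are illustrative:

    #include <stddef.h>

    enum tcp_state { ESTABLISHED, TIME_WAIT /* ... */ };

    struct pcb {
        enum tcp_state state;
        int delack_pending;          /* delayed ACK armed? */
        struct pcb *next;
    };

    extern void send_delayed_ack(struct pcb *p);   /* assumed helper */

    void delack_timer_walk(struct pcb *chain)
    {
        struct pcb *p;

        for (p = chain; p != NULL; p = p->next) {
            if (p->state == TIME_WAIT)
                break;   /* sorted chain: no delayed ACKs past here */
            if (p->delack_pending)
                send_delayed_ack(p);
        }
    }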

Web Servers: Erich Nahum 138


Customized PCB Tables
Maintain 2 sets of PCBs: normal and TIME-WAIT
  first done in BSDI in 96
  still must search both tables
Aron & Druschel 99:
  can compress TIME-WAIT PCBs, since only the ports and sequence numbers are needed
  the normal table still has full PCB state
  show you can save a lot of kernel pinned RAM (from 31 MB to 5 MB, an 82% reduction)
  results in more RAM available for the disk cache, which leads to better performance
[diagram: a regular PCB hash table holding full entries (e.g., 192.123.168.40: ESTABLISHED) alongside a separate TIME-WAIT PCB hash table holding compressed entries (9.2.16.145; 128.119.72.4; 10.1.1.2)]

Web Servers: Erich Nahum 139


Scalable Timers: Timing Wheels

Varghese SOSP 1987:
  use a hash-table-like structure called a timing wheel
  events are ordered by relative time in the future
  given an event at future time T, put it in slot (T mod N)
  each slot's list is sorted by time (scheme 5)
Each clock tick:
  wheel pointer turns one slot (mod N)
  look at the first item in the chain: if ready, fire it, check the next
  if the slot is empty or the item is not ready to fire, all done
  continue until a non-ready item is encountered (or end of list)
Ex: current time = 12, N = 10
[diagram: a timing wheel with slots 0 … (N-1); the wheel pointer indexes the current slot, whose sorted chain is Expire: 12 → Expire: 22 → Expire: 42, so only the first entry fires this revolution]

Web Servers: Erich Nahum 140


Timing Wheels (cont)
Variant (scheme 6 in paper):
  just insert into the wheel slot, don't bother to sort
  check all timers in the slot on each tick (sketched below)
Original SOSP 1987 paper:
  premise was more for large-scale simulations
  which have lots of events happening "in the future"
Algorithmic costs (assuming a good hash function):
  O(1) average time for basic dictionary operations
  insertion, cancellation, per-tick bookkeeping
  O(N) (N = number of timers) worst-case for scheme 6
  O(log N) worst-case for scheme 5
Deployment:
Used in FreeBSD as of release 3.4 (scheme 6)
Variant in Linux 2.4 (hierarchy of timers with cascade)
Aron claims "about the same perf" as his approach
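A minimal timing wheel in the style of scheme 6, as a C sketch (slot count, names, and the lack of cancellation are all simplifications, not FreeBSD's actual code):

    #include <stddef.h>

    #define WHEEL_SLOTS 256

    struct timer {
        unsigned long expires;            /* absolute tick of expiry */
        void (*fire)(struct timer *);     /* callback to run */
        struct timer *next;
    };

    static struct timer *wheel[WHEEL_SLOTS];
    static unsigned long now;             /* current tick */

    void timer_add(struct timer *t)       /* assumes t->expires > now */
    {
        unsigned slot = t->expires % WHEEL_SLOTS;
        t->next = wheel[slot];            /* O(1), unsorted insertion */
        wheel[slot] = t;
    }

    void wheel_tick(void)
    {
        struct timer **pp;

        now++;
        pp = &wheel[now % WHEEL_SLOTS];
        while (*pp != NULL) {             /* check every timer in slot */
            struct timer *t = *pp;
            if (t->expires <= now) {
                *pp = t->next;            /* unlink, then fire */
                t->fire(t);
            } else {
                pp = &t->next;            /* due a later revolution */
            }
        }
    }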

Web Servers: Erich Nahum 141


Data-Touching Operations
Lots of research in the high-speed networking community about how touching data is bad
  especially as CPU speeds increase relative to memory speeds
Several ways to avoid data copying:
  use mmap() as described earlier to cut down to one copy
  use IO-Lite primitives (a new API) to move buffers around
  use the sendfile() API combined with an integrated zero-copy I/O system in the kernel (see the sketch below)
Also a cost to reading the data to compute checksums:
  Jacobson showed how checksumming can be folded into the copy for free, with some complexity on the receive side
  IO-Lite / exokernel use checksum caches
  advanced network cards do the checksum for you
    originally on the SGI FDDI card (1995)
    now on all gigabit adaptors, some 100baseT adaptors
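A sketch of the sendfile() path on Linux (signature per sendfile(2); error handling abbreviated, and a real server would also handle EAGAIN on a non-blocking socket):

    #include <sys/sendfile.h>
    #include <sys/types.h>

    /* transmit 'size' bytes of an open file to a connected socket;
       the kernel moves file pages to the socket with no user-space
       copy of the data */
    ssize_t send_file_body(int sock, int fd, off_t size)
    {
        off_t off = 0;

        while (off < size) {
            ssize_t n = sendfile(sock, fd, &off, size - off);
            if (n <= 0)
                return -1;    /* error; off records progress so far */
        }
        return off;
    }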

Web Servers: Erich Nahum 142


Summary: Implementation Issues

Scaling problems happen in large WWW Servers:


Asymmetry of client/server model
Large numbers of connections
Large amounts of data transferred
Approaches fall into one or more categories:
Hashing
Caching
Exploiting common-case behavior
Exploiting semantic information
Not touching the data
Most OSes have added support for these techniques over the
last 3 years

Web Servers: Erich Nahum 143
