
Making the Box Transparent:
System Call Performance as a First-class Result

Yaoping Ruan, Vivek Pai
Princeton University

Outline
- Motivation
- Design & implementation
- Case study
- More results

Motivation

Beginning of the Story
- The Flash Web Server on the SPECWeb99 benchmark yields a very low score: 200
  - 220: standard Apache
  - 400: customized Apache
  - 575: best score (Tux) on comparable hardware
- However, the Flash Web Server is known to be fast

Performance Debugging of OS-Intensive Applications

Web servers (+ full SPECWeb99):
- High throughput -> CPU/network
- Large working sets -> disk activity
- Dynamic content -> multiple programs
- QoS requirements -> latency-sensitive
- Workload scaling -> overhead-sensitive

How do we debug such a system?

Current Tradeoffs

  Statistical sampling (DCPI, VTune, OProfile): fast, but lacks completeness
  Call-graph profiling (gprof):                 detailed, but high overhead (> 40%)
  Measurement calls (getrusage(), gettimeofday()): online, but guesswork and inaccuracy

Example of Current Tradeoffs

[Chart: wall-clock time of a non-blocking system call, as reported by gettimeofday() versus in-kernel measurement]

Design

Design Goals
- Correlate kernel information with application-level information
- Low overhead on useless data
- High detail on useful data
- Allow the application to control profiling and programmatically react to the information

DeBox: Splitting Measurement Policy & Mechanism
- Add profiling primitives into the kernel
- Return feedback with each syscall
  - A first-class result, like errno
- Allow programs to interactively profile
  - Profile by call site and invocation
  - Pick which processes to profile
  - Selectively store, discard, or utilize data

DeBox Architecture

  res = read(fd, buffer, count)

[Figure: the application issues a system call; in the kernel, read() starts profiling on entry and ends profiling before returning. Along with the normal results (buffer contents, errno), DeBox Info is returned to the application, where it can be consumed online or stored for offline analysis]

Implementation

DeBox Data Structure

  DeBoxControl(DeBoxInfo *resultBuf, int maxSleeps, int maxTrace)

- Basic DeBox Info: performance summary
- Sleep Info: blocking information for the call
- Trace Info: full call trace & timing

Sample Output: Copying a 10MB Mapped File

Basic DeBox Info:
  4               system call # (write())
  3591064         call time (usec)
  989             # of page faults
  1               # of PerSleepInfo used
  5               # of CallTrace used

PerSleepInfo[0]:
  1270            # of occurrences
  723903          time blocked (usec)
  biowr           resource label
  kern/vfs_bio.c  file where blocked
  2727            line where blocked
  1               # of processes on entry
  0               # of processes on exit

CallTrace (depth, time in usec):
  0  3591064  write()
  1  25       holdfp()
  2  5        fhold()
  1  3590326  dofilewrite()
  2  3586472  fo_write()

In-kernel Implementation
- ~600 lines of code in FreeBSD 4.6.2
- Monitor the trap handler
  - System call entry, exit, page fault, etc.
- Instrument the scheduler
  - Time, location, and reason for blocking
  - Identify dynamic resource contention
- Full call path tracing + timings
  - More detail than gprof's call arc counts

General DeBox Overheads

[Table: general DeBox overheads]

Case Study: Using DeBox on the Flash Web Server

SPECWeb99 Results
- Experimental setup:
  - 933MHz PIII, 1GB memory
  - Netgear GA621 gigabit NIC
  - Ten PII 300MHz clients

[Chart: SPECWeb99 scores (100-900) for the orig, VM patch, and sendfile versions, over dataset sizes of 0.12-2.64 GB]

Example 1: Finding Blocking System Calls

  open()/190       biord - 162   inode - 28
  readlink()/84    inode - 84
  read()/12        inode - 9     biord - 3
  stat()/6         inode - 6
  unlink()/1       biord - 1

Blocking system calls, the resources they blocked on, and their counts

Example 1 Results & Solution
- Mostly metadata locking
- Direct all metadata calls to name-convert helpers
- Pass open FDs using sendmsg()

[Chart: SPECWeb99 scores (100-900) with FD passing added alongside orig, VM patch, and sendfile, over dataset sizes of 0.12-2.64 GB]

Example 2: Capturing Rare Anomaly Paths

  FunctionX() { open(); }
  FunctionY() { open(); open(); }

When blocking happens: abort() (or fork + abort), then record the call path. This reveals:
1. Which call caused the problem
2. Why this path is executed (application + kernel call trace)

Example 2 Results & Solution
- Cold-cache miss path
- Allow read helpers to open cache-missed files
- More benefit in latency

[Chart: SPECWeb99 scores (0-900) for the orig, VM patch, sendfile, and FD passing versions, over dataset sizes of 0.12-2.64 GB]

Example 3: Tracking Call History & Call Trace

The call trace indicates:
- File descriptor copying - fd_copy()
- VM map entry copying - vm_copy()

[Chart: call time of fork() as a function of invocation]

Example 3 Results & Solution
- Similar phenomenon on mmap()
- Fork helper
- Eliminate mmap()

[Chart: SPECWeb99 scores (0-900) adding the fork helper and no-mmap() versions, over dataset sizes of 0.12-2.64 GB]

Sendfile Modifications (Now Adopted by FreeBSD)
- Cache pmap/sfbuf entries
- Return a special error for cache misses
- Pack header + data into one packet

SPECWeb99 Scores

[Chart: SPECWeb99 scores (0-900) for all versions - orig, VM patch, sendfile, FD passing, fork helper, no mmap(), new CGI interface, new sendfile() - over dataset sizes of 0.12-2.95 GB]

New Flash Summary
- Application-level changes
  - FD-passing helpers
  - Move fork into a helper process
  - Eliminate the mmap cache
  - New CGI interface
- Kernel sendfile changes
  - Reduce pmap/TLB operations
  - New flag to return if data is missing
  - Send fewer network packets for small files

Other Results

Throughput on SPECWeb Static Workload

[Chart: throughput (Mb/s, 0-700) of Apache, Flash, and New-Flash at dataset sizes of 500MB, 1.5GB, and 3.3GB]

Latencies on 3.3GB Static Workload

[Chart: latency distributions, showing improvements of 44X and 47X]
Mean latency improvement: 12X

Throughput Portability (on Linux)

[Chart: throughput (Mb/s, 0-1400) of Haboob, Apache, Flash, and New-Flash at dataset sizes of 500MB, 1.5GB, and 3.3GB]
(Server: 3.0GHz P4, 1GB memory)

Latency Portability (3.3GB Static Workload)

[Chart: latency distributions, showing improvements of 112X and 3.7X]
Mean latency improvement: 3.6X

Summary
- DeBox is effective on OS-intensive applications and complex workloads
  - Low overhead on real workloads
  - Fine detail on real bottlenecks
  - Flexibility for application programmers
- Case study
  - SPECWeb99 score quadrupled
    - Even with a dataset 3x the size of physical memory
  - Up to 36% throughput gain on static workloads
  - Up to 112x latency improvement
  - Results are portable

www.cs.princeton.edu/~yruan/debox yruan@cs.princeton.edu vivek@cs.princeton.edu

Thank you

SPECWeb99 Scores
- Standard Flash: 200
- Standard Apache: 220
- Apache + special module: 420
- Highest 1GB/1GHz score: 575
- Improved Flash: 820
- Flash + dynamic request module: 1050

SPECWeb99 on Linux
- Standard Flash: 600
- Improved Flash: 1000
- Flash + dynamic request module: 1350
- (3.0GHz P4 with 1GB memory)

New Flash Architecture

[Figure: new Flash architecture]
