
Making the Box Transparent:
System Call Performance as a First-class Result

Yaoping Ruan, Vivek Pai
Princeton University

Outline
- Motivation
- Design & implementation
- Case study
- More results

Motivation

Beginning of the Story
- The Flash Web Server on the SPECWeb99 benchmark yields a very low score: 200
  - 220: standard Apache
  - 400: customized Apache
  - 575: best score (Tux) on comparable hardware
- However, the Flash Web Server is known to be fast

Performance Debugging of OS-Intensive Applications

Web servers (+ full SPECWeb99):
- High throughput -> CPU/network
- Large working sets -> disk activity
- Dynamic content -> multiple programs
- QoS requirements -> latency-sensitive
- Workload scaling -> overhead-sensitive

How do we debug such a system?

Current Tradeoffs

  Statistical sampling (DCPI, VTune, OProfile): fast, but lacks completeness
  Call-graph profiling (gprof):                 detailed, but high overhead (> 40%)
  Measurement calls (getrusage(), gettimeofday()): online, but guesswork and inaccuracy

Example of Current Tradeoffs

[Chart: wall-clock time of a non-blocking system call, as reported by gettimeofday() versus in-kernel measurement]

Design

Design Goals
- Correlate kernel information with application-level information
- Low overhead on useless data
- High detail on useful data
- Allow the application to control profiling and programmatically react to the information

DeBox: Splitting Measurement Policy & Mechanism
- Add profiling primitives into the kernel
- Return feedback with each syscall
  - A first-class result, like errno
- Allow programs to interactively profile
  - Profile by call site and invocation
  - Pick which processes to profile
  - Selectively store, discard, or utilize data

DeBox Architecture

  res = read(fd, buffer, count)

[Figure: the application issues a system call; in the kernel, read() starts profiling on entry and ends profiling before returning. Along with the normal results (buffer contents, errno), DeBox Info is returned to the application, where it can be consumed online or stored for offline analysis]

Implementation

DeBox Data Structure

  DeBoxControl(DeBoxInfo *resultBuf, int maxSleeps, int maxTrace)

- Basic DeBox Info: performance summary
- Sleep Info: blocking information for the call
- Trace Info: full call trace & timing

Sample Output: Copying a 10MB Mapped File

Basic DeBox Info:
  4               system call # (write())
  3591064         call time (usec)
  989             # of page faults
  1               # of PerSleepInfo used
  5               # of CallTrace used

PerSleepInfo[0]:
  1270            # of occurrences
  723903          time blocked (usec)
  biowr           resource label
  kern/vfs_bio.c  file where blocked
  2727            line where blocked
  1               # of processes on entry
  0               # of processes on exit

CallTrace (depth, time in usec):
  0  3591064  write()
  1  25       holdfp()
  2  5        fhold()
  1  3590326  dofilewrite()
  2  3586472  fo_write()

In-kernel Implementation
- ~600 lines of code in FreeBSD 4.6.2
- Monitor the trap handler
  - System call entry, exit, page fault, etc.
- Instrument the scheduler
  - Time, location, and reason for blocking
  - Identify dynamic resource contention
- Full call path tracing + timings
  - More detail than gprof's call arc counts

General DeBox Overheads

[Table: general DeBox overheads]

Case Study: Using DeBox on the Flash Web Server

SPECWeb99 Results
- Experimental setup:
  - 933MHz PIII, 1GB memory
  - Netgear GA621 gigabit NIC
  - Ten PII 300MHz clients

[Chart: SPECWeb99 scores (100-900) for the orig, VM patch, and sendfile versions, over dataset sizes of 0.12-2.64 GB]

Example 1: Finding Blocking System Calls

  open()/190       biord - 162   inode - 28
  readlink()/84    inode - 84
  read()/12        inode - 9     biord - 3
  stat()/6         inode - 6
  unlink()/1       biord - 1

Blocking system calls, the resources they blocked on, and their counts

Example 1 Results & Solution
- Mostly metadata locking
- Direct all metadata calls to name-convert helpers
- Pass open FDs using sendmsg()

[Chart: SPECWeb99 scores (100-900) with FD passing added alongside orig, VM patch, and sendfile, over dataset sizes of 0.12-2.64 GB]

Example 2: Capturing Rare Anomaly Paths

  FunctionX() { open(); }
  FunctionY() { open(); open(); }

When blocking happens: abort() (or fork + abort), then record the call path. This reveals:
1. Which call caused the problem
2. Why this path is executed (application + kernel call trace)

Example 2 Results & Solution
- Cold-cache miss path
- Allow read helpers to open cache-missed files
- More benefit in latency

[Chart: SPECWeb99 scores (0-900) for the orig, VM patch, sendfile, and FD passing versions, over dataset sizes of 0.12-2.64 GB]

Example 3: Tracking Call History & Call Trace

The call trace indicates:
- File descriptor copying - fd_copy()
- VM map entry copying - vm_copy()

[Chart: call time of fork() as a function of invocation]

Example 3 Results & Solution
- Similar phenomenon on mmap()
- Fork helper
- Eliminate mmap()

[Chart: SPECWeb99 scores (0-900) adding the fork helper and no-mmap() versions, over dataset sizes of 0.12-2.64 GB]

Sendfile Modifications (Now Adopted by FreeBSD)
- Cache pmap/sfbuf entries
- Return a special error for cache misses
- Pack header + data into one packet

SPECWeb99 Scores

[Chart: SPECWeb99 scores (0-900) for all versions - orig, VM patch, sendfile, FD passing, fork helper, no mmap(), new CGI interface, new sendfile() - over dataset sizes of 0.12-2.95 GB]

New Flash Summary
- Application-level changes
  - FD-passing helpers
  - Move fork into a helper process
  - Eliminate the mmap cache
  - New CGI interface
- Kernel sendfile changes
  - Reduce pmap/TLB operations
  - New flag to return if data is missing
  - Send fewer network packets for small files

Other Results

Throughput on SPECWeb Static Workload

[Chart: throughput (Mb/s, 0-700) of Apache, Flash, and New-Flash at dataset sizes of 500MB, 1.5GB, and 3.3GB]

Latencies on 3.3GB Static Workload

[Chart: latency distributions, showing improvements of 44X and 47X]
Mean latency improvement: 12X

Throughput Portability (on Linux)

[Chart: throughput (Mb/s, 0-1400) of Haboob, Apache, Flash, and New-Flash at dataset sizes of 500MB, 1.5GB, and 3.3GB]
(Server: 3.0GHz P4, 1GB memory)

Latency Portability (3.3GB Static Workload)

[Chart: latency distributions, showing improvements of 112X and 3.7X]
Mean latency improvement: 3.6X

Summary
- DeBox is effective on OS-intensive applications and complex workloads
  - Low overhead on real workloads
  - Fine detail on real bottlenecks
  - Flexibility for application programmers
- Case study
  - SPECWeb99 score quadrupled
    - Even with a dataset 3x the size of physical memory
  - Up to 36% throughput gain on static workloads
  - Up to 112x latency improvement
  - Results are portable

www.cs.princeton.edu/~yruan/debox yruan@cs.princeton.edu vivek@cs.princeton.edu

Thank you

SPECWeb99 Scores
- Standard Flash: 200
- Standard Apache: 220
- Apache + special module: 420
- Highest 1GB/1GHz score: 575
- Improved Flash: 820
- Flash + dynamic request module: 1050

SPECWeb99 on Linux
- Standard Flash: 600
- Improved Flash: 1000
- Flash + dynamic request module: 1350
- (3.0GHz P4 with 1GB memory)

New Flash Architecture

[Figure: new Flash architecture]
