
Kernel Optimization / Tuning

Dr. Gerald Pfeifer
Project Manager Enterprise Linux Coordination, Novell Inc.
(with lots of support by Chris Mason and Ralf Flaxa)
Mission

• Optimize SUSE Linux Enterprise kernels without losing vendor support or certification.
- A priori – preferable
- A posteriori – common

• Applicable to SLES9/NLD9/OES and SLE10.

• No unicorns (nor magic wands) required!

2 © Novell Inc, Confidential & Proprietary


Agenda

• General Considerations (Hardware,...)


• Identifying Problems
• File Systems
• /proc and /sys – An Interlude
• Block Layer
• Memory Management (VM)
• Miscellaneous (Scheduler, Network)
• Application Interplay
• Wrapping Up



General Considerations
(Hardware, Configuration,...)
Hardware and Configuration

• Ultimately, hardware and its configuration set the upper limits for our tuning efforts.
• Are we starting with the best/minimum needed
hardware platform and components?
> CPU speed only critical for compute-intense tasks
> RAM (amount and speed) and interconnects do matter
> Bottleneck I/O: network bandwidth, disk latency,...

• Is the hardware configuration appropriate?

• The weakest link limits performance!



Hardware Platforms

• SUSE LINUX Enterprise Server 9 supports


- x86 (32-bit; Intel / AMD)
- x86-64 (AMD64 / Intel EM64T)
- Intel Itanium (ia64)
- IBM POWER (ppc, ppc64)
- IBM S390 (32-bit)
- IBM zSeries

• x86 is limited to 32 to 48 GB (and that may be painful).
• 64-bit platforms eliminate legacy limits (up to 512 CPUs and several TB of RAM in production). Performance benefits.
• SLES offers compatible 32-bit mode and userland.
(Hardware) Configuration

• Optimize storage configuration


- Distribute data across controllers/disks.
- Swap to extra disk.
- Use RAID with striping.
• Tune hardware setup (BIOS, EFI,...)
- Only enable/probe what you have.
- Tune for fast reboot vs startup checks (if desired).
- Carefully review all settings.
• Disable unneeded services
- # rc<SERVICE> stop



Hardware: Example FTP Server

• CPU (variant, speed, cache,...) should hardly matter.

• Depending on load and access patterns, get as much RAM as possible and/or fast storage.

• An appropriate network card for the uplink is a must.

• Consider traffic shaping with “wondershaper”.



Identifying Problems
Where has all my memory gone!?

• Slab Cache:
- structures of much less than one page in size
- generic slabs of predefined sizes (32, 64) plus slabs for
specific data structures
• Page Cache
- pages with actual contents of files (or block devices)
- usually the largest, by far
• Buffer Cache
- file system metadata



Identifying Problems

• Start by locating the bottleneck: I/O, disk, memory,...

• iostat to identify overloaded I/O devices


- package sysstat
- # iostat -x 1

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    1.85    1.03    0.10    0.00   97.02

Device:     tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda       40.59       144.51      2068.57    1007841   14427048
sdb        0.01         0.62         0.00       4304          0
:
sdx        0.01         0.42         0.00       2937          0
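
With many devices, picking out the busiest one by eye gets tedious. A minimal sketch, assuming the basic report layout shown above (a "Device" header line that contains a "tps" column); the function name busiest_device is made up for this example:

```shell
# Print the device with the highest tps value from iostat output on stdin.
busiest_device() {
    awk '
        # A new CPU report starts: forget the column position until the next header.
        $1 == "avg-cpu:" { c = 0; next }
        # Device header: locate the "tps" column by name.
        $1 ~ /^Device/ { c = 0; for (i = 1; i <= NF; i++) if ($i == "tps") c = i; next }
        # Data line: keep the largest tps value seen so far.
        c && NF >= c && $c ~ /^[0-9.]+$/ && $c + 0 > max { max = $c + 0; dev = $1 }
        END { if (dev != "") printf "%s %.2f\n", dev, max }
    '
}
```

Typical use would be something like "iostat 1 5 | busiest_device". Keep in mind that the first iostat report covers the time since boot, so later reports say more about the current load.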



Identifying Problems (ctd.)

• slabtop for slab cache usage


- package procps
- # slabtop

• vmstat for basic system usage


- package procps
- # vmstat 1

 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  76804   8268  14996 167396    1    1    36    64  132  197  4  1 92  3
 0  0  76804   8268  14996 167396    0    0     0     0 1023  879  3  0 97  0
 0  0  76804   8300  14996 167396    0    0     0     0 1158 1134  2  0 98  0
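
The columns that most often point at a memory or I/O bottleneck are si/so (swap in/out) and wa (I/O wait). A minimal sketch that condenses a vmstat run into the maxima of these columns, assuming the header line names them si, so and wa as in the sample above; the function name vmstat_summary is hypothetical:

```shell
# Summarize swap and I/O-wait columns from vmstat output read on stdin.
vmstat_summary() {
    awk '
        # Header line: remember which field holds each named column.
        $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data lines (only once a header has been seen): track the maxima.
        ("si" in col) && $1 ~ /^[0-9]+$/ {
            if ($(col["si"]) + 0 > si) si = $(col["si"]) + 0
            if ($(col["so"]) + 0 > so) so = $(col["so"]) + 0
            if ($(col["wa"]) + 0 > wa) wa = $(col["wa"]) + 0
            n++
        }
        END { printf "samples=%d max_si=%d max_so=%d max_wa=%d\n", n, si, so, wa }
    '
}
```

For example, "vmstat 1 10 | vmstat_summary". Persistently non-zero si/so values suggest the machine is short on RAM; a high wa value points towards the I/O subsystem.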



File Systems
Picking a File System

• Pick the right file system for the task


- File sizes
- Number of files
- Workloads (database, mail server,...)
- Indexed metadata
- Number of CPUs



File Systems: ReiserFS

• Applications that use many small files:


- mail servers
- NFS servers
- database servers
• Other applications that use synchronous I/O.



File Systems: Ext3

• Anywhere a direct upgrade from Ext2 is needed


• Machines with > 4 CPUs
(ReiserFS does not scale that well there)
• Synchronous I/O applications where ReiserFS is
not an option



File Systems: Ext3

• When using Ext3 with many files in one directory, consider enabling btree support:
- # mkfs.ext3 -O dir_index

• When using Ext3 with multiple threads appending to files in the same directory, consider turning preallocation on:
- # mount -o reservation



File Systems: XFS

• Best suited for large configs


- very large machines (>8 CPUs)
- very large file systems (>1 TB)
- large files / many files
• Low latencies (streaming multimedia)



File Systems: OCFS2

• Best suited for


- ${ORACLE_HOME} (home for Oracle RAC)

• SLES9 SP2: x86, x86-64, and ia64


• SLES9 SP3: ppc, s390, s390x as well
• SLE10: all architectures



Barriers

• SLES9 defaults to maximum data integrity by enforcing so-called “barriers” so that reordering of file system journal writes cannot happen.
• Costs some performance, especially over NFS.
• Tunable via mount option barrier=<X>
• ReiserFS:
- enable with barrier=flush (default)
- disable with barrier=none
• Ext3:
- enable with barrier=1 (default)
- disable with barrier=0
File Systems: Logging Modes

• Journaling file systems offer different modes to write the actual data. (Metadata is always journaled.)
• For ReiserFS and Ext3, mount option data=<X>
- data=ordered: use barriers for data (default)
forces data out prior to metadata commit
- data=writeback: no barriers for data
fastest in many workloads, risk exposing old data
- data=journal: use journal for data
generally slow, but can improve mail server workloads

• By default, SLES9 ensures data integrity at the cost of some performance.



Dedicated Logging Devices

• ReiserFS
- mkreiserfs -j /dev/xxx -s 8193 /dev/xxy
- reiserfstune --journal-new-device /dev/xxx -s 8193 /dev/xxy
• Ext3
- mke2fs -O journal_dev /dev/xxx
- mke2fs -j -J device=/dev/xxx,size=8193 /dev/xxy
- tune2fs -J device=/dev/xxx,size=8193 /dev/xxy



File System Tuning

• Split file systems based on data access patterns


- Keep commit heavy data away from data that does not
have to be synchronous.
- Keep streaming writes and reads on different spindles
than random I/O.
- Consider disabling atime updates on files and
directories:
> # mount -o noatime,nodiratime
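
Before changing mount options it helps to know which file systems still update access times. A minimal sketch that reads /proc/mounts-style lines and prints the mount points lacking noatime; the function name missing_noatime and the list of file system types checked (those covered in this talk) are choices made for this example:

```shell
# Print mount points of ext3/reiserfs/xfs file systems mounted without noatime.
# Input format: device mountpoint fstype options dump pass (as in /proc/mounts).
missing_noatime() {
    awk '
        $3 ~ /^(ext3|reiserfs|xfs)$/ {
            n = split($4, opt, ",")
            found = 0
            for (i = 1; i <= n; i++) if (opt[i] == "noatime") found = 1
            if (!found) print $2
        }
    '
}

# Check the running system if /proc/mounts is available.
if [ -r /proc/mounts ]; then
    missing_noatime < /proc/mounts
fi
```

Each mount point reported is a candidate for "mount -o remount,noatime,nodiratime <MOUNT_POINT>", provided no application relies on access times.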



File System Tuning

• Optimize directory layout for the file system


- Keep data that will be accessed together in the same
subdirectories.
- Spread data out into different subdirectories to increase
large file concurrency.
- Different file systems order directories differently.



/proc and /sys: An Interlude
/proc and /sys

• Special file systems which expose important system information (and allow modification):
- # hostname
rana
- # echo BrainShareDemo > /proc/sys/kernel/hostname
- # hostname
BrainShareDemo
• So, let's tune our CPU! ;-)
- # cat /proc/cpuinfo
...Intel(R) Pentium(R) M processor 1700MHz...
- # echo "Intel(R) Pentium(R) M processor 3700MHz" \
> /proc/cpuinfo
- # cat /proc/cpuinfo
Block Layer
I/O Scheduler

• Flexible, pluggable I/O scheduler


• Selectable via boot parameter elevator=<X>
- noop
- deadline
- as (anticipatory)
- cfq (default in SLES)
• Optimize latency versus throughput
http://www.finux.org/Reprints/Reprint-Pratt-OLS2004.pdf



I/O Scheduler: Noop

• No reordering, just merging requests


- light-weight: minimal memory and CPU overhead
• Best for storage with extensive caching and
scheduling of its own
- large RAID systems with many individual logical devices
• Boot parameter elevator=noop



I/O Scheduler: Deadline

• Per-request service deadline


- caps maximum latency per request
- maintains good disk throughput

• Best for disk-intensive database applications


- (But fails to observe natural dependencies that often exist
between synchronous reads from one process.)
• Boot parameter elevator=deadline



I/O Scheduler: Anticipatory

• Delays a few ms after every request to keep a window of opportunity for further requests from the same process (and puts these in front of the queue).
- maximizes throughput

- at the cost of increasing latency

• Best for file servers and desktop workloads with single IDE/SATA disks.
- Disk systems with large internal queues potentially destroy any
anticipatory accounting.
• Default in mainline kernels (as of SLES9)

• Boot parameter elevator=as



I/O Scheduler: CFQ

• Completely Fair Queuing


• Treat all competing processes equally by keeping
a unique request queue for each (and each pdflush
kernel thread) and giving equal bandwidth to each
queue.
- good compromise between throughput and latency
- minimal worst case latency on all reads and writes
• Suitable for a wide variety of applications
• Default in SLES9 and SLE10
• Boot parameter elevator=cfq



I/O Scheduler Changes at Runtime

• SLE10 allows changing the elevator at runtime, on a per-block device basis:
- # echo "noop" > /sys/block/<BLOCK_DEVICE>/queue/scheduler

• View applicable elevators and current setting:


- # cat /sys/block/<BLOCK_DEVICE>/queue/scheduler
noop anticipatory deadline [cfq]
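
The active elevator is the entry in square brackets. A minimal sketch that extracts it and lists it for every block device exposing the sysfs file; the helper name active_elevator is made up for this example:

```shell
# Print the active elevator from a line such as "noop anticipatory deadline [cfq]".
active_elevator() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# List the active elevator for each block device that has a scheduler file.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_elevator < "$f")"
done
```

Devices whose scheduler file contains no bracketed entry are listed with an empty value.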



Block Layer Tuning

• Spreading the load across controllers


- Per-target locking for SCSI

- Software RAID bandwidth

• Battery backed (write) caching



Block Layer Tunables

• Block read ahead buffer


- /sys/block/<sdX/hdX>/queue/read_ahead_kb
- Default is the hardware request size:
128KB for IDE, 512KB for SCSI.
- Increase to 512 or more for fast storage.
- May speed up streaming reads a lot.
• Number of requests
- /sys/block/<sdX/hdX>/queue/nr_requests
- Default is 128. Increase to 256 with CFQ scheduler
for fast storage.
- Increases throughput at minor latency expense.



Memory Management (VM)
VM: Buffer Flush Daemon

• pdflush kernel threads take care of writing dirty pages to disk.
• This can be tuned by
- /proc/sys/vm/dirty_ratio (40%)
Generator of dirty data starts writeback.
- /proc/sys/vm/dirty_background_ratio (10%)
pdflush starts writeback in the background.
- /proc/sys/vm/dirty_expire_centisecs (3000)
How long may dirty pages remain dirty?
- /proc/sys/vm/dirty_writeback_centisecs (500)
How often does pdflush wake up?
- Defaults are pretty high which is good for databases (but may
result in lots of unreclaimable pagecache).
- For other workloads (HPC) you may want to lower these.
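
Before lowering any of these values, record what the system currently uses. A minimal sketch that prints the four tunables with their units (1 centisecond = 1/100 s, so 3000 means 30 s); the function name show_vm_tunables and its optional directory argument are hypothetical conveniences for this example:

```shell
# Print the writeback tunables found in the given directory (default: /proc/sys/vm).
show_vm_tunables() {
    dir=${1:-/proc/sys/vm}
    for t in dirty_ratio dirty_background_ratio \
             dirty_expire_centisecs dirty_writeback_centisecs; do
        [ -r "$dir/$t" ] || continue
        v=$(cat "$dir/$t")
        case $t in
            # Time values are stored in centiseconds; show seconds as well.
            *_centisecs) printf '%s = %s (%s s)\n' "$t" "$v" "$(( v / 100 ))" ;;
            # Ratios are percentages of memory.
            *)           printf '%s = %s %%\n' "$t" "$v" ;;
        esac
    done
}

show_vm_tunables
```

A new value is then written as root, for example "echo 10 > /proc/sys/vm/dirty_ratio"; the value 10 is only an illustration, not a recommendation.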
VM: Swappiness

• The threshold at which processes are swapped out can be tuned via
- /proc/sys/vm/swappiness

• Default is 60, which works well if you want to swap out daemons or programs which have not done a lot lately.

• Higher values will provide more buffer/page cache, lower values will wait longer to swap out idle processes.



NUMA

• NUMA = Non-Uniform Memory Access
• SLES9 and SLE10 detect and use the NUMA topology and automatically
- prefer memory that is local to a node;
- evenly balance system data across nodes;
- gracefully handle CPU-less nodes; etc.
• Applications can (and should) optimize for NUMA
topology as well!



NUMA (ctd.)

• The NUMA system can be tuned via numactl <CMD>; the settings then apply to <CMD> and all of its children.
• Some options:
- preferred=255
- membind=!0-1
- cpubind=2-5
- localalloc (always allocate from current node)

• Node 0 may be the most contended, so avoid it.



NUMA: freeing page caches

• Sometimes, especially for workloads with “symmetric” load distribution (like HPC), you may want as much local memory on specific nodes as possible.
• SLES9-specific knob:
- # echo 3-5,15 > /proc/sys/vm/toss_page_cache_nodes
• SLE10 and going forward:
- # sync ; echo 1 > /proc/sys/vm/drop_caches



Miscellaneous (Scheduler, Network)
CPU Scheduler: Timeslices

• The timeslices of the CPU scheduler can be set (in µs) via
- /proc/sys/kernel/min-timeslice
- /proc/sys/kernel/max-timeslice
• Defaults are 10 and 300ms, which gives processes
with default priority about 150ms before preemption.
• Smaller timeslices may improve interactive response
(desktop).
• For long running, (number) crunching workloads, you
may want to raise this.



Binding Processes/Interrupts to CPUs

• Problem: context switching costs


• CPU Affinity: binding a process to specific CPUs can improve performance
- taskset 0x3 <command>, or taskset -p 0x3 <pid> for a running process
- In this example, 0x3 is a bitmask selecting the first two CPUs (CPU 0 and 1); 0x6 would select the second and third (CPU 1 and 2).
• Bind Interrupts to CPUs
- cat /proc/interrupts
- echo 0x3 > /proc/irq/0/smp_affinity
- Example: distribute NICs among CPUs.
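
Both taskset and smp_affinity take a hexadecimal bitmask in which bit n stands for CPU n, counting from 0. A minimal sketch for building such a mask from a list of CPU numbers; the function name cpus_to_mask is made up for this example:

```shell
# Build a hexadecimal affinity mask from CPU numbers (CPU 0 is the lowest bit).
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '0x%x\n' "$mask"
}

cpus_to_mask 0 1    # prints 0x3
cpus_to_mask 1 2    # prints 0x6
```

The result can be passed on directly, for example "taskset $(cpus_to_mask 2 3) <command>".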



Network Improvements

• Gigabit Ethernet
- Significant interrupt overhead reduction
- Consider Jumbo Frames (larger than 1500 bytes)
- # ifconfig <DEV> mtu 9000
• NFS modes
- TCP (default) vs UDP
- NFSv3 (default) vs NFSv2
- rsize=<X>/wsize=<X>
> read/write in chunks of <X> bytes
> default is 1024, use 8192 for higher throughput



Application Interplay
mlock() – Locking Memory Down

• mlock() allows an application to prevent swapping of specified memory areas (real-time, security,... use)
• In order to avoid a regular user triggering out-of-
memory (OOM), SLES9 defaults to only allow mlock()
for root.
• This can be changed via
- # echo 1 > /proc/sys/vm/disable_cap_mlock
• With SLE10, normal users can mlock() up to a system limit set via
- setrlimit(RLIMIT_MEMLOCK,...)
- Default is 32KB.
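
From a shell, the same limit is visible through the ulimit builtin, which reports it in KB. A minimal sketch; the function name memlock_limit is hypothetical:

```shell
# Report the soft RLIMIT_MEMLOCK of the current shell ("ulimit -l" prints KB).
memlock_limit() {
    l=$(ulimit -l)
    if [ "$l" = "unlimited" ]; then
        echo "unlimited"
    else
        echo "${l} KB"
    fi
}

echo "Max locked memory (soft limit): $(memlock_limit)"
```

Processes started from this shell inherit the limit, so this is a quick way to check what an application launched from a given login will be allowed to lock.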



Async I/O, O_DIRECT

• Asynchronous I/O
- New model for concurrency
- Heavily used by databases
• Direct I/O (O_DIRECT) on block devices or files
- Databases like to use raw disks. Historically /dev/raw was
used, but O_DIRECT is more performant.
- Files should be preallocated (no holes, no appending); the
system falls back to buffered I/O otherwise!
- In both cases: less page cache pollution
- Not specific to database workloads!



Wrapping Up
Pointers, pointers, pointers

• Package “orarun” to setup/tune for Oracle.


Package “sapinit” to setup/tune for SAP.

• Package “bootcycle” for fast/failsafe reboot/shutdown

• Read SLES9/SLE10 Admin Manual (CD1/docu).


Check docs under /usr/share/doc/<PACKAGE>.
Check man pages.

• You've got the source, Luke:


- Check /etc/sysconfig/<SERVICE> (special variables,...).

- Read /usr/src/linux/Documentation.
Conclusions

• Novell/SUSE provides excellent baseline performance based on

- Novell/SUSE R&D
- Extensive QA (including benchmarks)
- QA and benchmarks by and with partners

• Configured and tuned for your workload, SUSE Linux Enterprise is a leader in performance (and not just there).



Enterprises are Shifting to Open

• Today's Enterprise: Multiple platforms, proprietary technologies, management and security challenges
• Tomorrow's Open Enterprise: Performance, reliability, and world-class support on standards-based, open platforms

Proprietary Systems
• Spiralling costs
• Interoperability challenges
• Growing security threats
• Lack of vendor choice

Open Systems
• Increased leverage of IT skills and assets
• Improved security and manageability
• Vendor flexibility
• Dramatic reductions in operating costs



Introducing:
SUSE Linux Enterprise 10
• The platform for the open enterprise.
- Built-in application security
- Virtualization
- Integrated systems management
- Supported on the full range of hardware architectures.



Novell's Comprehensive Ecosystem

[Diagram of the ecosystem around SUSE Linux Enterprise 10. Labels: User Communities, Developer Communities, Independent Hardware Vendors, Independent Software Vendors, Worldwide Consulting Services, Worldwide Technical Support, Training, Education, User Interface, Applications, Database, Operating Systems, Resource Management, Security & Identity, Data Center, Desktop, Workgroup.]



Questions & Answers

Profiling

oprofile
• Powerful tool for profiling the entire system
• Can identify CPU hogs both in the kernel and userspace
• opcontrol --vmlinux=/boot/vmlinux-2.6.5-7.139-smp
• opcontrol --start
• <run test>
• opreport --symbols > output



Huge Pages

• Huge pages (2MB each on x86) allow more RAM to be referenced at a time.
- Reduces CPU and memory overhead.
- Heavily used by database management systems.
- Boot parameter hugepages=<N>
- Later allocation via /proc/sys/vm/nr_hugepages possible,
but likely to fail after serious memory fragmentation.
- mount -t hugetlbfs none /mnt_point
- Applications can then use shmget(SHM_HUGETLB)
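
The value for hugepages=<N> follows from the amount of memory to be backed by huge pages and the huge page size. A minimal sketch of that arithmetic, rounding up to whole pages; the function name hugepages_needed is made up, and the 1024 MB target is only an illustration:

```shell
# Number of huge pages needed to cover <size in MB> with pages of <page size in kB>.
hugepages_needed() {
    want_kb=$(( $1 * 1024 ))
    page_kb=$2
    echo $(( (want_kb + page_kb - 1) / page_kb ))
}

# Use the huge page size reported by the running kernel, if it reports one.
page_kb=$(awk '/^Hugepagesize:/ { print $2 }' /proc/meminfo 2>/dev/null)
if [ -n "$page_kb" ]; then
    echo "1024 MB needs $(hugepages_needed 1024 "$page_kb") huge pages of ${page_kb} kB"
fi
```

With 2 MB (2048 kB) pages, 1024 MB works out to 512 pages, which is the number to pass as hugepages=512 or to write to /proc/sys/vm/nr_hugepages.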



Huge Pages and Shared Memory

• Special facility in SLES9 to allow applications to explicitly leverage huge pages:
- System V IPC shared memory can use hugetlb pages instead
of normal ones without the process using SHM_HUGETLB.
- Saves low memory (on x86) without any need to change the
application.
- # echo 1 > /proc/sys/kernel/shm_use_hugepages
• Relevant applications converted, so this is no longer
needed for SLE10.



