Seminar: Linux Performance Tuning Knobs

Xuekun Hu PRC Scalability Lab

SSG/PRC Scalability Lab

Intel Confidential Slide 1

My experience of building Linux knowledge:
- man pages
- src/Document/*
- Google
- Ask people
- Summarize and write down


Agenda
- Know your system and workloads
- Basic performance tuning knobs
  - System tuning knobs for limitations
  - File system tuning
  - TCP/IP stack tuning
  - Memory system tuning
- Small workloads


Know your system and workloads
- /proc/cpuinfo : CPU info
- /proc/meminfo : memory info
- dmesg : system info
- fdisk -l : disk info
- sar : OS, network, and CPU related counters
    sar -bBcrwqWuR -P ALL -n DEV [interval] [count]
- iostat : disk related counters
    iostat -d -k -t -x /dev/sd* [interval]
- vmstat : virtual memory
    vmstat -n [interval]
- netstat : network statistics
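A minimal wrapper can sample these collectors together. The script below is a sketch: the interval, count, and /tmp/perfdata output directory are illustrative choices, and each tool is skipped if it is not installed.

```shell
#!/bin/sh
# collect.sh - sample the counters above in one synchronized pass.
# INTERVAL/COUNT and the output directory are illustrative, not prescriptive.
INTERVAL=1
COUNT=2
OUT=/tmp/perfdata
mkdir -p "$OUT"

# Launch each collector in the background so the samples line up in time;
# skip any tool that is not present on this system.
command -v sar    >/dev/null && sar -bBcrwqWuR -P ALL -n DEV "$INTERVAL" "$COUNT" > "$OUT/sar.txt" &
command -v iostat >/dev/null && iostat -d -k -t -x "$INTERVAL" "$COUNT" > "$OUT/iostat.txt" &
command -v vmstat >/dev/null && vmstat -n "$INTERVAL" "$COUNT" > "$OUT/vmstat.txt" &
wait
echo "collection finished: $OUT"
```

In a real run the interval and count would be raised (e.g. 5-second samples for the length of the benchmark) so the data covers the whole measurement phase.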



Most Useful Counters - Memory

pswpin/s
  What it counts: O/S swapping
  When to be concerned: values > 200, or significantly higher than baseline
  What it could mean: need more memory

kbmemfree
  What it counts: last observed amount of memory available, in KB
  When to be concerned: below 500 MB or above 1 GB
  What it could mean: if below 500 MB, might need to add memory; if above
  1 GB, the app might be able to utilize more memory
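Both counters can also be read directly from /proc between sar samples. A Linux-only sketch (note that /proc/vmstat's pswpin is a cumulative count, unlike sar's per-second rate):

```shell
#!/bin/sh
# Read free memory and the cumulative swap-in counter straight from /proc.
# On a non-Linux system the files are absent and "unavailable" is printed.
free_kb=$(awk '$1 == "MemFree:" {print $2}' /proc/meminfo 2>/dev/null)
swapins=$(awk '$1 == "pswpin"   {print $2}' /proc/vmstat 2>/dev/null)
echo "kbmemfree: ${free_kb:-unavailable} KB"
echo "pswpin (cumulative pages swapped in): ${swapins:-unavailable}"
```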


Most Useful Counters - Network Interface

rxbyt/s, txbyt/s
  What it counts: measure of bandwidth
  When to be concerned: confirm whether the network is bottlenecked
  What it could mean: if > 100 MB/s, the system is approaching GbE network
  saturation

rxdrop/s, txdrop/s
  What it counts: packets discarded / network errors
  When to be concerned: values above 0
  What it could mean: packets dropped per second because of a lack of space
  in Linux buffers

NIC interrupts/s
  What it counts: NIC interrupts per second
  When to be concerned: values above 10,000
  What it could mean: the NIC interrupt rate is too high; set the interrupt
  moderation rate


Most Useful Counters - Physical Disk

await
  What it counts: latency of the disk subsystem
  When to be concerned: counter value is significantly higher than baseline
  What it could mean: underperforming disk(s); storage subsystem not
  optimally configured

avgqu-sz
  What it counts: number of disk reads/writes waiting to be issued
  When to be concerned: greater than 2, or significantly greater than
  baseline
  What it could mean: need to add more storage


Most Useful Counters - Processor

1 - %idle
  What it counts: CPU utilization
  When to be concerned: increases from a baseline (e.g. more CPUs, higher
  frequency) without an increase in performance; less than 99% if
  benchmarking
  What it could mean: inefficient use of the CPU(s); workload needs to be
  optimized

%system
  What it counts: time the CPU spends in kernel mode (processing interrupts)
  When to be concerned: significantly higher than baseline, or higher
  than 40%
  What it could mean: increase in I/O (possibly inefficient); malfunctioning
  device

%user
  What it counts: time the CPU spends in user mode (running applications)
  When to be concerned: --
  What it could mean: decrease in I/O


Most Useful Counters - System

Context switches/sec
  What it counts: how often the CPU switches from user to kernel mode, or
  between threads
  When to be concerned: above 10,000
  What it could mean: too many interrupts (inefficient I/O, malfunctioning
  device); too many threads competing for resources
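A quick way to estimate this rate without sar is to difference the cumulative ctxt counter in /proc/stat; a Linux-only sketch:

```shell
#!/bin/sh
# Context switches per second, computed from two /proc/stat samples 1s apart.
a=$(awk '$1 == "ctxt" {print $2}' /proc/stat 2>/dev/null)
if [ -n "$a" ]; then
    sleep 1
    b=$(awk '$1 == "ctxt" {print $2}' /proc/stat)
    rate=$((b - a))
else
    rate=unavailable   # /proc/stat is Linux-only
fi
echo "context switches/sec: $rate"
```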


Tuning Knobs


System limits

# ulimit -n
110490
  Max number of file descriptors allowed per process.
  Raise it to fix "unable to open file descriptor" errors.

/proc/sys/fs/file-max = 254108
  Max number of file descriptors allowed system-wide.

/proc/sys/kernel/pid_max = 65536
  Max number of processes/threads.

/proc/sys/vm/max_map_count = 131072
  Max number of memory map areas a process may have.
  Increase it when you want to create more than 32k threads in one process.

/proc/sys/kernel/shmmax = 268435456
/proc/sys/kernel/shmall = 4194304
  Matter when an application (like Oracle or DB2) uses the Sys V shared
  memory system calls (shm*).

# ulimit -a   (lists all per-process limits)
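The current values can be inspected in one pass; a small sketch (values differ per system, and the /proc/sys entries exist only on Linux):

```shell
#!/bin/sh
# Print the per-process fd limit and the system-wide knobs discussed above.
echo "per-process fd limit (ulimit -n): $(ulimit -n)"
for knob in fs/file-max kernel/pid_max vm/max_map_count \
            kernel/shmmax kernel/shmall; do
    # "n/a" is printed where the /proc entry does not exist.
    printf '%s = %s\n' "$knob" "$(cat /proc/sys/$knob 2>/dev/null || echo n/a)"
done
```

Persistent changes typically go in /etc/security/limits.conf for the ulimit values and /etc/sysctl.conf for the /proc/sys knobs.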

File system tuning

# mount -t ext3 -o noatime,nodiratime,noacl /dev/sda1 /mnt

I/O scheduler:
- cfq (completely fair queuing)
- deadline
- noop
- as (anticipatory)

ramdisk / ramfs / tmpfs
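Put together, the steps above look like the following. The commands are printed rather than executed, since they need root and /dev/sda is only an example device:

```shell
#!/bin/sh
# Sketch of the file system tuning steps: remount without atime updates,
# switch the I/O scheduler, and mount a tmpfs ramdisk. Printed, not run.
cmds=$(cat <<'EOF'
mount -o remount,noatime,nodiratime /dev/sda1 /mnt
cat /sys/block/sda/queue/scheduler             # show available schedulers
echo deadline > /sys/block/sda/queue/scheduler
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
EOF
)
printf '%s\n' "$cmds"
```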


Network Tuning

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1800
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.ipv4.tcp_rmem / net.ipv4.tcp_wmem
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_max_tw_buckets = 450000
net.core.somaxconn = 20480
net.core.netdev_max_backlog = 30000
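These knobs are applied at runtime with sysctl -w, or persisted in /etc/sysctl.conf and reloaded with sysctl -p. A sketch, printed rather than executed since changing them requires root; the two knobs shown stand in for the whole list:

```shell
#!/bin/sh
# Sketch: applying the TCP settings above at runtime, then reloading the
# persistent configuration file. Printed, not run (needs root).
cmds=$(cat <<'EOF'
sysctl -w net.ipv4.tcp_fin_timeout=30
sysctl -w net.core.somaxconn=20480
sysctl -p    # reload /etc/sysctl.conf after editing it
EOF
)
printf '%s\n' "$cmds"
```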

Network Tuning (cont.)

Ensure the latest driver version is being used.

If all CPUs have high privileged time, try affinitizing the NIC to one CPU:
  # echo 1 > /proc/irq/26/smp_affinity
or
  # echo 0xff > /proc/irq/26/smp_affinity
Or, if one CPU is saturated, try using irqbalance to spread interrupts.

Verify or tweak the following NIC properties:

Transmit and receive descriptors
  Should generally be around 256; the sender's transmit descriptors should
  equal the receiver's receive descriptors.
  # modprobe e1000 RxDescriptors=4096 TxDescriptors=4096

Offloads: receive IP checksum / receive TCP checksum / TCP segmentation /
transmit IP checksum / transmit TCP checksum
  These are usually enabled by default, and in most cases that is the best
  setting.
  # ethtool -K eth0 tso on

Interrupt moderation rate
  Using this feature can reduce CPU utilization; the best setting is
  workload-dependent.
  # modprobe e1000 InterruptThrottleRate=3000

Transmit queue length
  # ifconfig eth0 txqueuelen 40000
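On drivers that support it, descriptor rings and offloads can also be changed at runtime with ethtool instead of module reload parameters. A sketch; eth0 is an example, and the commands are printed rather than executed since they need root and real hardware:

```shell
#!/bin/sh
# Sketch: runtime equivalents of the modprobe parameters above, for NIC
# drivers that support them. Printed, not run.
cmds=$(cat <<'EOF'
ethtool -g eth0                  # show current ring (descriptor) sizes
ethtool -G eth0 rx 4096 tx 4096  # grow the descriptor rings
ethtool -k eth0                  # show offload settings
ethtool -K eth0 tso on           # enable TCP segmentation offload
EOF
)
printf '%s\n' "$cmds"
```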


Memory Tuning

vm.min_free_kbytes
  Forces the Linux VM to keep a minimum number of kilobytes free.

vm.nr_hugepages
  Much server software, like Oracle, DB2, and JRockit, supports huge pages.

If working with a NUMA system, balance memory across nodes and use software
flags to take advantage of NUMA:
  numactl

Or simply add more memory.
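A typical sequence combining the two knobs; printed rather than executed, since the page count, node number, and ./myapp are illustrative and the steps need root:

```shell
#!/bin/sh
# Sketch: reserve huge pages, verify, then bind a process to one NUMA node
# so its CPUs and memory stay local. Printed, not run.
cmds=$(cat <<'EOF'
echo 512 > /proc/sys/vm/nr_hugepages          # reserve 512 huge pages
grep Huge /proc/meminfo                       # verify the reservation
numactl --cpunodebind=0 --membind=0 ./myapp   # keep CPU and memory on node 0
EOF
)
printf '%s\n' "$cmds"
```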



Application & General Tuning

- Ensure the application is built using the optimal compiler and flags
- Research application-specific startup flags or configurables
- Try enabling large pages if the application supports them
- If clients are being used to drive the workload, consider adding
  additional clients when their CPU utilization is over 50%


Small workloads


Workload Categories

1 Business Processing (ERP, CRM, OLTP, E-commerce)
  SAP, SPECjbb2005, Oracle Apps, SPECjAppServer2004, Volanomark,
  TPC-C, Sysbench-MySQL, DBT2-MySQL, Dell DVD Store
2 IT Infrastructure (File & Print, Networking, Proxy Caching, Security,
  Systems Management)
  Dbench, NetBench, LLCBench, Netperf, Tbench, Httperf, Jmeter
3 Decision Support (Data Warehousing/Data Mart, Data Analysis/Data Mining)
  TPC-H
4 Collaborative (Email, Workgroup)
  MMB3, Lotus NotesBench
5 Application Development
  SPECint, SPECfp, SPECint_rate, SPECfp_rate
6 Web Infrastructure (Streaming Media, Web Serving)
  WMLS, SPECweb2005, WebBench, SPECweb99_SSL
7 Technical (OS, Scientific/Engineering)
  Linpack, HPC Suite*, SPECint, SPECfp, Stream, Matrix Multiply,
  Adobe After Effects, blastp, GunGard
8 System Component (File system I/O)
  LmBench, Aim, Aiostress, Iozone, Iometer, TIOBench


SysBench-MySQL

Application
  SysBench AT (OLTP) on MySQL. SysBench AT is an open-source set of several
  MySQL workloads. The AT case is a set of mixed read and write queries,
  dominated by reads. SysBench AT is multi-threaded, capable of generating
  up to 96 concurrent threads; the number of threads is controlled by a
  command-line parameter. The performance metric is transactions per second.
  Database size is ~10 million rows, about 2.4 GB of disk space. Source
  code can be downloaded from http://sysbench.sourceforge.net (SysBench
  version 0.4.7).

Demonstrates
  Platform capability with an open-source database; benefit of
  Hyper-Threading Technology; performance benefit from larger cache
  (>10% going from 1M to 2M).

Workload
  Database workload dominated by reads, 1 insert per transaction. A warm-up
  phase ensures that data and metadata are virtually all memory resident
  (cached) for the measurement phase.

OS
  Linux 64-bit SMP

Platform
  Workstations or servers


Linpack

Application
  Long-standing, industry-recognized benchmark from
  http://www.netlib.org/performance/html/PDStop.html. Solves systems of
  linear equations in a dense matrix; adheres to the Linpack standard.
  Uses block matrix algorithms and the Intel Math Kernel Library (MKL).
  Built-in parallel processing capability with OpenMP. Measures
  floating-point performance in FLOPS (floating point operations per
  second). Intel-created binary (only), version 2.1.3.

Demonstrates
  Floating-point computation performance potential; achieves a very high
  percentage of tpp1; near-linear scaling with clock speed and number of
  CPUs; minimal benefit from larger cache (>1M); use of memory >4GB via
  run-time parameters (a 15K matrix needs ~2GB of memory).

Workload
  Entirely compute intensive; runs best with Hyper-Threading Technology off.

OS
  Linux & Windows

Platform
  Workstations, servers


IOMeter

Application
  I/O subsystem measurement and characterization tool. Open source:
  http://www.iometer.org/

Demonstrates
  Can measure performance of disk and network controllers; bandwidth and
  latency capabilities of buses; network throughput to attached drives;
  shared bus performance; system-level hard drive performance; system-level
  network performance.

OS
  Windows, Linux, Netware

Platform
  One server


IOZone

Application
  Open-source file system benchmark, similar to bonnie++
  (http://www.coker.com.au/bonnie++/). Download from http://www.iozone.org

Demonstrates
  Comparing the ability of a computer to complete single tasks

Workload
  Measures operations such as read, write, fwrite, pread, and so on.

OS
  Windows, Linux, Solaris

Platform
  One server


LMbench

Application
  A simple and portable benchmark suite used to test bandwidth and latency,
  including bandwidth benchmarks, latency benchmarks, and miscellaneous
  benchmarks. Workload authors: Larry McVoy and Carl Staelin. GPL.
  Download from http://www.bitmover.com/lmbench/

Demonstrates
  Tests for bandwidth and latency; potential issues with timing and clock
  resolution

Workload
  Measures system behavior: memory bandwidth, cache bandwidth, system
  calls, context switches

OS
  Linux

Platform
  One machine


Netperf

Application
  Netperf is a benchmark that can be used to measure the performance of
  many different types of networking, including TCP and UDP, DLPI, Unix
  domain sockets, and SCTP. Not GPL (HP copyright) but freely
  distributable. Download from
  http://www.netperf.org/netperf/NetperfPage.html

Demonstrates
  Tests for both unidirectional throughput and end-to-end latency

OS
  Linux

Platform
  2 machines: server and client
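A typical two-machine run looks like the following; the hostname is an example, and the commands are printed rather than executed since they need netperf installed on both ends:

```shell
#!/bin/sh
# Sketch: netserver runs on the server machine, netperf drives it from the
# client. Printed, not run.
cmds=$(cat <<'EOF'
netserver                                   # start on the server machine
netperf -H server-host -t TCP_STREAM -l 60  # bulk throughput, 60 s
netperf -H server-host -t TCP_RR -l 60      # request/response latency
EOF
)
printf '%s\n' "$cmds"
```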


Httperf

Application
  A tool for measuring web server performance, with a flexible facility
  for generating various HTTP workloads

Workload
  Generates HTTP GET requests

Usage
  Not a benchmark but a tool; open source (Linux/Windows)
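A representative invocation; the host, rate, and connection counts are examples only, printed rather than executed:

```shell
#!/bin/sh
# Sketch: drive a web server with 150 new connections/sec, one GET each.
# Printed, not run (needs httperf installed and a reachable server).
cmds=$(cat <<'EOF'
httperf --server www.example.com --port 80 --uri /index.html \
        --rate 150 --num-conns 27000 --num-calls 1
EOF
)
printf '%s\n' "$cmds"
```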


Discussion
- Anything else related to Linux you want to talk about?
- Which workloads are you using?
- Scripts to collect performance data?
