
Redbooks Paper

Dominique A. Heger
Sandra K. Johnson
Mala Anand
Mark Peloquin
Mike Sullivan
Andrew Theurer
Peter Wai Yee Wong

An Application Centric Performance Evaluation of the Linux 2.6 Operating System
Abstract
Over the last several years, the Linux operating system has gained rapid
acceptance in many scientific and commercial environments as the operating
system of choice. Linux is utilized in a wide range of distinct server
environments, including mail, print, and gateway based solutions, and is today
regarded as a reliable, efficient, and effective de facto solution for Web,
database, and application servers. The performance of the Linux operating
system (as executed on these types of machines) is regarded as very
competitive, particularly on smaller systems with up to four processors.
Recently, there has been an increased emphasis on the performance of Linux
on mid- to high-end enterprise-class systems, where the number of processors
may range up to 64 (sometimes even higher). Therefore, the emphasis of this
paper is on the scalability and performance related aspects surrounding actual
applications and the Linux 2.6 framework.

Copyright IBM Corp. 2004. All rights reserved.

ibm.com/redbooks

Introduction
This study was initiated to describe the performance-related enhancements
embedded into the 2.6 Linux kernel, and to discuss the actual performance
behavior encountered by certain applications while executing on enterprise-class
machines. To reemphasize, the objective of the study was to outline the
performance improvements with an emphasis on scalability. The actual
components discussed in this study include the Linux 2.6 CPU scheduler, the
virtual memory, and the I/O subsystem. The actual performance discussion
revolves around a diverse set of application workload scenarios. The presented
results underscore the significant progress obtained by working with the Linux
open source community to address and resolve several issues that in the past
limited the performance and scalability of Linux. The performance results
obtained for this study show that the Linux operating system is competitive on
IBM eServer enterprise machines.
The study was conducted focusing on the performance and scalability behavior
of the Linux 2.6 operating system under different workload conditions. Hence,
"The virtual memory subsystem" discusses the virtual memory (VM)
subsystem, whereas "CPU scheduler" elaborates on the Linux O(1) CPU
scheduler. "I/O scheduling and the BIO layer" introduces some of the new
features that were incorporated into the Linux 2.6 I/O subsystem.
"Performance results" presents the actual performance results obtained on
IBM eServer enterprise machines. The study concludes in "Conclusion and
future work" with some remarks on potential future work and a summary of the
project's accomplishments.

The virtual memory subsystem


In general, the VM subsystem can be considered as the core of any UNIX
operating system, as the implementation of the VM system affects and impacts
almost every other subsystem in the framework. The VM subsystem enables a
process to view linear ranges of bytes in its address space, regardless of the
physical layout or the fragmentation encountered in physical memory. It basically
allows processes to execute in a virtual environment that appears as if the entire
address space of a CPU is available to the processing entity. This supporting
execution framework provides a process with a large programming model. In
such a scenario, a virtual view of memory storage, referenced as the
address space, is presented to an application. In such an environment, the virtual
memory subsystem transparently manages the virtual storage infrastructure,
which incorporates the physical memory as well as any (potential) secondary
storage subsystem [14].


As the physical memory subsystem has a lower latency than the disk subsystem,
one of the challenges faced by the VM subsystem is to keep the most frequently
referenced portions of memory in the faster primary storage. In the event of a
physical memory shortage, the VM subsystem is required to surrender
some portion of physical memory by transferring infrequently utilized memory
pages out to the backing store. Therefore, the VM subsystem provides (next to
the large address space) resource virtualization, in the sense that a process
does not have to manage the details of physical memory allocation. Further, the
VM provides information and fault isolation, as each process executes in its own
address space. In most circumstances, hardware facilities in the memory
management unit (MMU) perform the memory protection functionality by
preventing a process from accessing memory outside its legal address space
(except of course for memory regions that are explicitly shared among
processes). The virtual address space of a process is defined as the range of
memory addresses that are presented to the process as its environment. At any
time in the life cycle of a process, some addresses may be mapped to physical
addresses and some may not.

Large page support


Most modern computer architectures support more than one page size. As an
example, the IA-32 architecture (not the operating system) supports either 4 KB
or 4 MB pages. The Linux operating system used to only utilize large pages for
mapping the actual kernel image. In general, large page usage is primarily
intended to provide performance improvements for high performance computing
(HPC) applications, as well as database applications that (usually) have large
working sets. However, any memory access intensive application that utilizes
large amounts of virtual memory may obtain performance improvements by
using large pages. Linux utilizes 2 MB or 4 MB large pages, AIX uses 16 MB
large pages, whereas Solaris large pages are 4 MB in size. The large page
performance improvements are attributable to reduced translation lookaside
buffer (TLB) misses. This is due to the TLB being able to map a larger virtual
memory range, therefore increasing the reach of the TLB. Large pages further
improve the process of memory prefetching, by eliminating the necessity to
restart prefetch operations on 4 KB boundaries [10].
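
To make the TLB-reach argument concrete, consider a hypothetical 64-entry TLB (the entry count is purely illustrative, not tied to a specific processor):

\[
\text{TLB reach} = \text{TLB entries} \times \text{page size}
\]
\[
64 \times 4\,\text{KB} = 256\,\text{KB} \qquad \text{versus} \qquad 64 \times 2\,\text{MB} = 128\,\text{MB}
\]

With 2 MB pages, the same number of TLB entries covers a working set 512 times larger before misses begin.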
In Linux 2.6, the actual number of available large pages (sometimes referred to
as huge pages) can be configured through the proc
(/proc/sys/vm/nr_hugepages) interface. As the actual allocation of large pages
depends on the availability of physically contiguous memory regions, the
recommendation is to launch the allocation at system startup. The core of the
large page implementation in Linux 2.6 is referred to as the hugetlbfs, a pseudo
file system (implemented in fs/hugetlbfs/inode.c) based on ramfs. The basic idea
behind the implementation is that large pages are used to back any file
that exists in this file system. The actual pseudo file system is initialized at system
startup, and registered as an internal file system. A process may access large
pages either through the shmget() interface (to set up a shared region that is
backed by large pages) or by utilizing the mmap() call on a file that has been
opened in the huge-page file system [2].
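
The following sketch illustrates both access paths described above. The mount point /mnt/huge, the file name, and the prior reservation of huge pages via /proc/sys/vm/nr_hugepages are assumptions about the system setup; error handling is abbreviated.

```c
/* Minimal sketch of the two large-page access paths described above.
 * Assumptions: hugetlbfs is mounted at /mnt/huge (any mount point
 * works), and huge pages were reserved via /proc/sys/vm/nr_hugepages. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000            /* value from <linux/shm.h> */
#endif

#define LENGTH (4UL * 1024 * 1024)   /* e.g., two 2 MB huge pages */

int main(void)
{
    /* Path 1: mmap() a file created in the hugetlbfs mount. */
    int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    /* Path 2: a System V shared segment backed by large pages. */
    int shmid = shmget(IPC_PRIVATE, LENGTH,
                       SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
    if (shmid < 0) perror("shmget");

    munmap(addr, LENGTH);
    close(fd);
    if (shmid >= 0) shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```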

Page allocation and replacement


Physical page allocation in Linux 2.6 is based on a (relaxed) buddy algorithm.
The mechanism itself resembles a lazy buddy scheme that defers the coalescing
part. In general terms, the buffer release activity of the coalescing portion
involves a two-step process, where the first step consists of pushing a buffer
back onto the free list, and hence makes the buffer available to other requests.
The second step involves referencing the buffer as free (normally in a bitmap)
and coalescing the buffer (if feasible) with adjacent buffers. A regular buddy
system implementation performs both steps on each release, whereas the lazy
buddy scheme performs the first step, but defers the second step and only
executes it depending on the state of the buffer class. The Linux 2.6
implementation of the discussed lazy buddy concept utilizes per CPU page lists.
These pagesets contain two lists of pages (marked as hot and cold), where the
hot pages have been used recently, and the expectation is that they are still
present in the CPU cache. On allocation, the pageset for the running CPU is
consulted first to determine if pages are available, and if that is the case, the
allocation proceeds accordingly. In order to determine when the pageset
should be released or allocated, a low and high watermark-based approach is
being used. As the low watermark is reached, a batch of pages will be allocated
and placed on the list. As the high watermark is reached, a batch of pages will be
freed simultaneously. As a result of the new and improved implementation, the
spinlock overhead (with regard to protecting the buddy lists) is significantly
reduced [2].
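
A userspace model can make the hot/cold watermark bookkeeping concrete; all structure fields, watermark values, and function names here are illustrative, not the kernel's.

```c
/* Userspace model of the per-CPU page-list watermarks described above;
 * names and values are illustrative, not taken from the kernel. */
#include <stdio.h>

struct pageset {
    int count;   /* pages currently on this CPU-local list      */
    int low;     /* refill from the buddy lists below this mark */
    int high;    /* drain back to the buddy lists above this    */
    int batch;   /* pages transferred per refill or drain       */
};

/* Allocation path: refilling in one batch means one buddy-lock
 * acquisition amortized over 'batch' subsequent allocations. */
static void pageset_alloc(struct pageset *p)
{
    if (p->count <= p->low) {
        p->count += p->batch;        /* rmqueue_bulk() in the kernel */
        printf("refill: +%d pages\n", p->batch);
    }
    p->count--;
}

/* Free path: drain one batch once the high watermark is exceeded. */
static void pageset_free(struct pageset *p)
{
    if (++p->count >= p->high) {
        p->count -= p->batch;        /* free_pages_bulk() in the kernel */
        printf("drain: -%d pages\n", p->batch);
    }
}

int main(void)
{
    struct pageset p = { .count = 4, .low = 2, .high = 8, .batch = 4 };
    for (int i = 0; i < 6; i++)  pageset_alloc(&p);
    for (int i = 0; i < 10; i++) pageset_free(&p);
    return 0;
}
```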
In the 2.4 Linux kernel, a global daemon (kswapd) acted as the page
replacement activator. The kswapd was responsible for keeping a certain
amount of memory available, and under memory pressure would initiate the
process of freeing pages. In Linux 2.6, a kswapd daemon is available for each
node in the system. The general idea behind this design is to avoid situations
where a kswapd would have to free pages on behalf of a remote node. In Linux
2.4, a spinlock has to be acquired before removing pages from the least recently
used (LRU) list. In Linux 2.6, operations involving the LRU list take place via
pagevec structures, which allow adding or removing pages from the LRU
lists in sets of up to PAGEVEC_SIZE pages. In the case of removing pages, the
zone lru_lock is acquired and the pages are placed on a temporary list. Once
the list of pages to be removed is assembled, a call to shrink_list() is initiated.
The actual process of freeing pages can now be performed without acquiring the
zone lru_lock spinlock. In the case of adding pages, a new page vector struct is
initialized via pagevec_init(). Pages are added to the vector through
pagevec_add(), and then committed to being placed onto the LRU list via
pagevec_release().
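
As a sketch (kernel context assumed; simplified rather than a verbatim excerpt), the batched sequence described above takes roughly this shape:

```c
/* Sketch of the batched LRU sequence described in the text; kernel
 * context is assumed and error paths are omitted. */
#include <linux/pagevec.h>

static void lru_add_batch(struct page **pages, int nr)
{
    struct pagevec pvec;
    int i;

    pagevec_init(&pvec, 0);                 /* second argument: cold flag  */
    for (i = 0; i < nr; i++)
        if (!pagevec_add(&pvec, pages[i]))  /* returns 0 once vector full  */
            pagevec_release(&pvec);         /* commit one full batch       */
    pagevec_release(&pvec);                 /* commit the remainder        */
}
```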

CPU scheduler
In general, the thread scheduler represents the subsystem responsible for
distributing the available CPU resources among the runnable threads [11].
The behavior and decision making process of the scheduler has a direct
correlation to thread fairness and scheduling latency. Fairness can be described
as the ability of all threads to not only encounter forward progress, but to do so in
a relatively even manner. The opposite of fairness is starvation, a scenario where
a given thread encounters no forward progress. Fairness is frequently hard to
justify, as it represents an exercise in compromises between global and localized
performance. Scheduling latency describes the actual delay between a thread
entering the runnable state (TSRUN) and actually running on a processor
(TSONPROC). An inefficient scheduling latency behavior results (as an
example) in a perceptible delay in application response time. Hence, an efficient
and effective CPU scheduler is paramount to the operation of the computing
platform. It decides which task to run at what time and for how long. In general,
real-time and time-shared jobs are distinguished, each class revealing different
objectives. Both are implemented through different scheduling disciplines
embedded into the scheduler [13].
Linux assigns a static priority to each task that can be modified through the nice()
interface. Linux has a range of priority classes, distinguishing between real-time
and time-sharing tasks. The lower the priority value, the higher the logical priority
of a task, or in other words its general importance. In this context, the discussion
always references the logical priority when elaborating on priority increases and
decreases. Real-time tasks always have higher priority than time-sharing tasks.
The Linux 2.6 scheduler (referred to as the O(1) scheduler) is a multi-queue
scheduler that assigns an actual run-queue to each CPU, promoting a CPU local
scheduling approach. The O(1) label basically describes the asymptotic upper
bound time complexity for retrieval. In other words, it refers to the property that
no matter the workload on the system, the next task to run will be chosen in a
constant amount of time. The previous incarnation of the Linux scheduler utilized
the concept of goodness to determine which thread to execute next. All runnable
tasks were kept on a single run-queue that represented a linked list of threads in
a TSRUN (TASK_RUNNABLE) state. In Linux 2.6, the single run-queue lock was
replaced with a per-CPU lock, ensuring better scalability on SMP systems. The
per-CPU run-queue scheme adopted by the O(1) scheduler decomposes the
run-queue into a number of buckets (in priority order) and utilizes a bitmap to
identify the buckets that hold runnable tasks. Locating the next task to execute
requires a read from the bitmap to identify the first bucket with runnable tasks,
and choosing the first task in that bucket's run-queue. The per-CPU run-queue
consists of two vectors of task lists, labeled as the active and the expired vector.
Each vector index represents a list of runnable tasks (each at its respective
priority level). After executing for a period of time (time slice dependent), a task
moves from the active list to the expired list to ensure that all runnable tasks get
an opportunity to execute. As the active array becomes empty, the expired and
active vectors are swapped by modifying the pointers. Occasionally, a
load-balancing algorithm is invoked to rebalance the run-queues to ensure that a
similar number of tasks are available per CPU. As already discussed, the
scheduler has to decide which task to run next and for how long. Time quanta in
the Linux kernel are defined as multiples of a system tick. A tick is defined as the
fixed delta (1/HZ) between two consecutive timer interrupts. In Linux 2.6, the HZ
parameter is set to 1000, indicating that the interrupt routine scheduler_tick() is
invoked once every millisecond, at which time the currently executing task is
charged with one tick. Setting the HZ parameter to 1000 does not improve the
responsiveness of the system, as the actual scheduler timeslice is not affected
by this setting. For systems that primarily execute in a number crunching mode,
the HZ=1000 setting may not be appropriate. In such an environment, the
aggregate system performance could benefit from setting the HZ parameter to
a lower value (around 100) [2].
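
The constant-time lookup described at the beginning of this section condenses to just a few lines. The sketch below follows the shape of the 2.6 dispatch path in kernel/sched.c, with locking and bookkeeping omitted, so treat it as an illustration rather than a verbatim excerpt:

```c
/* Condensed sketch of the O(1) dispatch decision (cf. schedule()
 * in the 2.6 kernel/sched.c); locking and statistics omitted. */
static task_t *pick_next_task(struct prio_array *array)
{
    /* the first set bit marks the highest-priority non-empty bucket */
    int idx = sched_find_first_bit(array->bitmap);
    struct list_head *queue = array->queue + idx;

    /* the next task to run is simply the head of that bucket's list */
    return list_entry(queue->next, task_t, run_list);
}
```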
Besides the static priority (static_prio), each task maintains an effective priority
(prio). The distinction is being made to account for certain priority bonuses or
penalties based on the recent sleep average (sleep_avg) of a given task. The
sleep average accounts for the number of ticks a task was recently descheduled
for. The effective priority of a task determines its location in the priority list (for the
active as well as the expired array) of the run queue. A task is declared
interactive when its effective priority exceeds its static priority by a certain level (a
scenario that can only be based on the task accumulating sleep average ticks).
The interactive estimator framework embedded into Linux operates automatically
and transparently. In Linux 2.6, high priority tasks reach the interactivity state
with a much smaller sleep average than do lower priority tasks. It is worthwhile to
reemphasize that as the interactivity of a task is estimated via the sleep average,
I/O-bound tasks are potential candidates to reach the interactivity status,
whereas CPU-bound tasks are normally not perceived as interactive tasks. The
actual timeslice can be defined as the maximum time a task can execute without
yielding the CPU (voluntarily) to another task, and is simply a linear function of
the static priority of normal tasks.
The priority itself is projected into a range of MIN_TIMESLICE to
MAX_TIMESLICE, where the default values are set to 10 and 200 milliseconds.
The higher the priority of a task, the greater the task's timeslice. For every timer
tick, the running task's timeslice is decremented. If decremented to 0, the
scheduler replenishes the timeslice, recomputes the effective priority, and
re-enqueues the task into either the active vector (if the task is classified as
interactive) or into the expired vector (if the task is considered as being
non-interactive). At this point, the system will dispatch another task from the
active array. This scenario ensures that tasks in the active array will be executed
first, before any expired task will have the opportunity to run again. If a task is
descheduled, its timeslice will not be replenished at wakeup time; however, it has
to be pointed out that its effective priority might have changed due to any
accumulated sleep time. If all the runnable tasks have exhausted their timeslice,
and hence have been moved to the expired list, the expired and the active vector
are swapped. This technique ensures that the O(1) scheduler does not have to
traverse a (potentially large) list of tasks, as was required in the Linux 2.4
scheduler. The crux of the issue with the new design is that, due to any potential
interactivity, scenarios may arise where the active queue continues to have
runnable tasks that are not migrated to the expired list. The ramification is that
there may be tasks starving for CPU time. To circumvent the starvation issue, in
the case that the task that first migrated to the expired list is older than the
STARVATION_LIMIT (which is set to 10 seconds), the active and the expired
arrays are switched.
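
The linear priority-to-timeslice mapping can be rendered as follows; this is a simplified restatement of the shape the formula takes in early 2.6 kernels, with the constants expressed in milliseconds rather than jiffies:

```c
/* Simplified restatement of the early-2.6 priority-to-timeslice
 * mapping; constants in milliseconds (the kernel works in jiffies). */
#define MIN_TIMESLICE   10    /* ms */
#define MAX_TIMESLICE  200    /* ms */
#define MAX_PRIO       140    /* 100 real-time levels + 40 nice levels */
#define MAX_USER_PRIO   40

static unsigned int task_timeslice(int static_prio)   /* 100..139 */
{
    /* lower value = higher logical priority = longer timeslice */
    return MIN_TIMESLICE + (MAX_TIMESLICE - MIN_TIMESLICE) *
           (MAX_PRIO - 1 - static_prio) / (MAX_USER_PRIO - 1);
}
```

With these constants, the highest-priority user task (static priority 100) receives the full 200 ms, and the lowest (139) receives the 10 ms minimum.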
While discussing performance enhancement techniques in the kernel, it has to
be pointed out that the Linux 2.6 environment provides a Non Uniform Memory
Access (NUMA) aware extension to the O(1) scheduler. The focus is on
increasing the likelihood that memory references are local rather than remote on
NUMA systems. The NUMA-aware extension augments the existing CPU
scheduler implementation via a node-balancing framework. Further, it is
imperative to point out that next to the preemptible kernel support in Linux 2.6,
the Native POSIX Threading Library (NPTL) [16],[21] represents the next
generation POSIX threading solution for Linux, and hence has received a lot of
attention from the performance community. The new threading implementation in
Linux 2.6 has several major advantages, such as in-kernel POSIX signal
handling. In a well-designed multi-threaded application domain, fast user space
synchronization (futexes) can be utilized to implement the pthread_mutex object.
In contrast to the Linux 2.4 sched_yield construct, the futex framework avoids a
scheduling collapse during heavy lock contention among different threads. The
NPTL library is fully POSIX compliant.
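
To illustrate why futexes avoid kernel entry on the fast path, here is a deliberately simplified two-state lock. Real NPTL mutexes use a three-state protocol so that the FUTEX_WAKE syscall can be skipped when no thread is waiting (see [22]); GCC atomic builtins are used here for brevity.

```c
/* Simplified two-state futex lock: the uncontended path is a single
 * atomic operation in user space and never enters the kernel. */
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static int futex_word = 0;    /* 0 = unlocked, 1 = locked */

static void lock(void)
{
    /* fast path: one atomic exchange; no kernel entry on success */
    while (__sync_lock_test_and_set(&futex_word, 1)) {
        /* slow path: sleep until the word is no longer 1 */
        syscall(SYS_futex, &futex_word, FUTEX_WAIT, 1, NULL, NULL, 0);
    }
}

static void unlock(void)
{
    __sync_lock_release(&futex_word);                      /* word = 0 */
    syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```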

I/O scheduling and the BIO layer


The I/O scheduler in Linux forms the interface between the generic block layer
and the low-level device drivers [6], [7]. The block layer provides functions that
are utilized by the file systems and the virtual memory manager to submit I/O
requests to block devices. These requests are transformed by the I/O scheduler
and made available to the low-level device drivers. The device drivers consume
the transformed requests and forward them (by using device-specific protocols)
to the actual device controllers that perform the I/O operations [15]. As prioritized
resource management seeks to regulate the use of a disk subsystem by an
application, the I/O scheduler is considered an important kernel component in
the I/O path. It is further possible to regulate the disk usage in the kernel layers
above and below the I/O scheduler. Adjusting the I/O pattern generated by the
file system or the virtual memory manager (VMM) is considered as an option.
Another option is to adjust the way specific device drivers or device controllers
consume and manipulate the I/O requests.
The various Linux 2.6 I/O schedulers can be abstracted into a rather generic I/O
model. The I/O requests are generated by the block layer on behalf of threads
that are accessing various file systems, threads that are performing raw I/O, or
are generated by virtual memory management (VMM) components of the kernel,
such as the kswapd or the pdflush threads. The producers of I/O requests initiate
a call to __make_request(), which invokes various I/O scheduler functions such
as elevator_merge_fn. The enqueuing functions in the I/O framework generally
intend to merge the newly submitted block I/O unit (a bio in 2.6 or a buffer_head
in 2.4) with previously submitted requests, and to sort (or sometimes just insert)
the request into one or more internal I/O queues. As a unit, the internal queues
form a single logical queue that is associated with each block device. At a later
stage, the low-level device driver calls the generic kernel function
elv_next_request() to obtain the next request from the logical queue. The
elv_next_request() call interacts with the I/O scheduler's dequeue function
elevator_next_req_fn, and the latter has an opportunity to select the appropriate
request from one of the internal queues. The device driver processes the request
by converting the I/O submission into potential scatter-gather lists as well as
protocol-specific commands that are submitted to the device controller. From an
I/O scheduler perspective, the block layer is considered as the producer of I/O
requests and the device drivers are labeled as the actual consumers.
From a generic perspective, every read or write request launched by an
application results in either utilizing the respective I/O system calls or in memory
mapping (mmap) the file into a process's address space. I/O operations normally
result in allocating PAGE_SIZE units of physical memory. These pages are
indexed, as this enables the system to later locate the page in the buffer
cache. Any cache subsystem only improves performance if the data in the cache
is being reused. Further, the read cache abstraction allows the system to
implement (file system dependent) read-ahead functionalities, as well as to
construct large contiguous (SCSI) I/O commands that can be served via a single
DMA operation. In circumstances where the cache represents pure (memory
bus) overhead, I/O features such as direct I/O should be explored (especially in
situations where the system is CPU bound).
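
A sketch of the direct I/O path mentioned above follows; the file name is hypothetical, and the essential detail is that O_DIRECT requires the buffer address and transfer size to be suitably aligned (4 KB covers typical requirements).

```c
/* Sketch of direct I/O: O_DIRECT bypasses the page cache, so the
 * read lands straight in the (aligned) user buffer via DMA. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, 4096, 4096))      /* aligned buffer */
        return 1;

    int fd = open("datafile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, buf, 4096);           /* bypasses page cache */
    if (n < 0) perror("read");

    close(fd);
    free(buf);
    return 0;
}
```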
In a general write() scenario, the system is not necessarily concerned with the
previous content of a file, as a write() operation normally results in overwriting the
contents in the first place. The write cache emphasizes other aspects such as
asynchronous updates, as well as the possibility of omitting some write requests
in the case where multiple write() calls into the cache subsystem result in a single
I/O operation to a physical disk. Such a scenario may occur in an environment
where updates to the same (or a similar) inode offset are being processed within
a rather short time-span. The block layer in Linux 2.4 was organized around the
buffer_head data structure. The issue with this implementation is that it is a
rather daunting task to create a truly effective and performance-oriented block I/O
subsystem if the underlying buffer_head structures force each I/O request to be
decomposed into 4 KB chunks. The new representation of the block I/O layer in
2.6 encourages large I/O operations. The block I/O layer now tracks data buffers
by using struct page pointers. Linux 2.4 systems were prone to lose sight of the
logical form of the writeback cache when flushing the cache subsystem. Linux
2.6 utilizes logical pages attached to inodes to flush dirty data. This allows
multiple pages (that belong to the same inode) to be coalesced into a single bio
that can be submitted to the I/O layer (a process that works well if the file is not
fragmented on disk).

The 2.4 Linux I/O scheduler


The default 2.4 Linux I/O scheduler primarily manages the disk utilization. The
scheduler has a single internal queue, and for each new request, the I/O
scheduler determines if the request can be merged with an existing request. If
that is not the case, a new request is placed in the internal queue, which is sorted
by the starting device block number of the request. This approach focuses on
minimizing disk seek times. An aging mechanism implemented into the I/O
scheduler limits the number of times an existing I/O request in the queue can be
omitted and basically passed over by a newer request, which prevents any
potential starvation scenarios. The dequeue function in the I/O framework
represents a simple removal of requests from the head of the internal queue.

The 2.6 deadline I/O scheduler


The deadline I/O scheduler available in Linux 2.6 incorporates a per-request
expiration-based approach, and operates on five I/O queues [4]. The basic idea
behind the implementation is to aggressively reorder requests to improve I/O
performance while simultaneously ensuring that no I/O request is being starved.
More specifically, the scheduler introduces the notion of a per-request deadline,
which is used to give read requests a higher preference than write requests. As
already mentioned, the scheduler maintains five I/O queues. During the enqueue
phase, each I/O request gets associated with a deadline, and is being inserted
into I/O queues that are either organized by starting block (a sorted list) or by the
deadline factor (a FIFO list). The scheduler incorporates separate sort and FIFO
lists for read and write requests, respectively. The fifth I/O queue contains the
requests that are to be handed off to the device driver. During a dequeue
operation, in the case that the dispatch queue is empty, requests are moved from
one of the four I/O lists (sort or FIFO) in batches. The next step consists of
passing the head request on the dispatch queue to the device driver (this
scenario also holds true in the case that the dispatch-queue is not empty). The
logic behind moving the I/O requests from either the sort or the FIFO lists is
based on the scheduler's goal to ensure that each read request is processed by
its effective deadline without actually starving the queued-up write requests. In
this design, the goal of economizing on the disk seek time is accomplished by
moving a larger batch of requests from the sort list (sector sorted) and balancing
it with a controlled number of requests from the FIFO list. Hence, the ramification
is that the deadline I/O scheduler effectively emphasizes average read request
response time over disk utilization and total average I/O request response time.
To summarize, the basic idea behind the deadline scheduler is that all read
requests are satisfied within a specified time period. On the other hand, write
requests do not have any specific deadlines associated. As the block device
driver is ready to launch another disk I/O request, the core algorithm of the
deadline scheduler is invoked. In a simplified form, the first action being taken is
to identify if there are I/O requests waiting in the dispatch queue, and if yes, there
is no additional decision to be made on what to execute next. Otherwise, it is
necessary to move a new set of I/O requests to the dispatch queue.
The scheduler searches for work in the following places, but will only migrate
requests from the first source that results in a hit (see the sketch after this list):
1. If there are pending write I/O requests, and the scheduler has not selected
   any write requests for a certain amount of time, a set of write requests is
   selected.
2. If there are expired read requests in the read_fifo list, the system will move a
   set of these requests to the dispatch queue.
3. If there are pending read requests in the sort list, the system will migrate
   some of these requests to the dispatch queue.
4. If there are any pending write I/O operations, the dispatch queue is populated
   with requests from the sorted write list.
In general, the definition of a certain amount of time for write request starvation
is normally two iterations of the scheduler algorithm (this is tunable) [3]. After
moving two sets of read requests, the scheduler will migrate some write requests
to the dispatch queue. A set of requests (by default) can represent as many as
64 contiguous requests, but a request that requires a disk seek operation counts
the same as 16 contiguous requests (this is tunable as well).
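
Rendered as code, this four-step search looks roughly as follows; every predicate and helper name below is invented for clarity (the corresponding logic lives in the 2.6 deadline scheduler source):

```c
/* Illustrative rendering of the dispatch decision described above;
 * all helper names are hypothetical, not taken from the kernel. */
struct deadline_data;                  /* per-queue scheduler state */

static void deadline_select_batch(struct deadline_data *dd)
{
    if (!dispatch_queue_empty(dd))
        return;                        /* work is already staged */

    if (pending_writes(dd) && writes_starved_too_long(dd))
        move_write_batch(dd);          /* (1) anti-starvation for writes */
    else if (read_deadline_expired(dd))
        move_expired_read_batch(dd);   /* (2) reads past their deadline  */
    else if (pending_reads(dd))
        move_sorted_read_batch(dd);    /* (3) seek-sorted reads          */
    else if (pending_writes(dd))
        move_write_batch(dd);          /* (4) remaining sorted writes    */
}
```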

The 2.6 anticipatory I/O scheduler


The anticipatory I/O scheduler's design attempts to reduce the per-thread read
response time. It introduces a controlled delay component into the dispatching
equation. The delay is being invoked on any new request to the device driver,
thereby allowing a thread that just finished its I/O request to submit a new
request. This basically enhances the chances (based on locality) that this
scheduling behavior will result in smaller seek operations. The tradeoff between
reduced seeks and decreased disk utilization (due to the additional delay factor
in dispatching a request) is managed by utilizing an actual cost-benefit
calculation method [5], [8].
Overall, the 2.6 implementation of the anticipatory I/O scheduler is similar to, and
may be considered as, an extension to the deadline scheduler. In general, the
scheduler follows the basic idea that if the disk drive just operated on a read
request, the assumption can be made that there is another read request in the
pipeline, and hence it is worthwhile to wait. The I/O scheduler starts a timer, and
at this point, there are no more I/O requests passed down to the device driver. If
a (close) read request arrives during the wait time, it is serviced immediately, and
in the process, the actual distance that the kernel considers as close grows as
time passes (the adaptive part of the heuristic). Eventually the close requests will
dry out and the scheduler will hence decide to submit some of the write requests,
basically converging back to what is considered as a normal I/O request
dispatching scenario. In general, the anticipatory I/O scheduler is considered as
a rather complex implementation, as the actual cost-benefit analysis that is
necessary to provide adequate performance has to be conducted in as efficient
a manner as possible.

Completely Fair Queuing scheduler


The Completely Fair Queuing (CFQ) I/O scheduler can be considered as
representing an extension to the better known Stochastic Fair Queuing (SFQ)
scheduler implementation [12]. The focus of both implementations is on the
concept of fair allocation of I/O bandwidth among all the initiators of I/O requests.
A SFQ-based scheduler design was initially proposed for some network
subsystems. The goal to be accomplished is to distribute the available I/O
bandwidth as equally as possible among the I/O requests. The actual
implementation utilizes n (normally 64) internal I/O queues, as well as a single
I/O dispatch queue. During an enqueue operation, the PID of the currently
running process (the actual I/O request producer) is utilized to select one of the
internal queues (normally hash based) and, hence, the request is basically
inserted into one of the queues (in FIFO order). During dequeue, the SFQ design
calls for a round-robin based scan through the non-empty I/O queues, and
basically selects requests from the head of the queues. To avoid encountering
too many seek operations, an entire round of requests is first collected, and
secondly sorted, and ultimately merged into the dispatch queue. Next, the head
request in the dispatch queue is passed to the actual device driver.
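
The PID-keyed queue selection can be pictured as a simple hash; the constant, queue count, and function name below are illustrative, not the kernel's:

```c
/* Map the I/O-producing process onto one of the internal queues;
 * a multiplicative hash spreads consecutive PIDs across queues. */
#include <sys/types.h>

#define NR_QUEUES 64    /* typical number of internal FIFO queues */

static unsigned int sfq_queue_index(pid_t producer_pid)
{
    return ((unsigned int)producer_pid * 2654435761u) % NR_QUEUES;
}
```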
A standard CFQ implementation, on the other hand, does not utilize a hash
function. Hence, each I/O process receives an internal queue assigned to it,
which implies that the number of I/O processes determines the number of active
internal queues. From an aggregate performance perspective, the CFQ
scheduling concept works efficiently under the circumstance that all I/O
processes generate requests at basically the same rate. In Linux 2.6.5, the CFQ
I/O scheduler utilizes a hash function (and a certain number of request queues),
and therefore resembles an SFQ implementation. The CFQ, as well as the SFQ
implementations, strives to manage per-process I/O bandwidth, and provide
fairness at the level of process granularity.

The 2.6 noop I/O scheduler


The Linux 2.6 noop I/O scheduler can be considered as a rather minimal I/O
scheduler that performs (as well as provides) basic merging and sorting
functionalities. The main usage of the noop scheduler revolves around
non-disk-based block devices (such as memory devices), as well as specialized
software or hardware environments that incorporate their own I/O scheduling and
(large) caching functionality, and hence require only minimal assistance from the
kernel. Hence, in large-scale I/O configurations that incorporate RAID controllers
and a vast number of contemporary physical (TCQ capable) disk drives, the
noop scheduler has the potential to outperform the other three I/O schedulers
(workload dependent).

Performance results
Presented in this section is a quantification of the performance improvements
(demonstrated by some of the 2.6 enhanced subsystems) that are discussed in
this paper. The main focus is on the CPU scheduler and the VM subsystem. In
addition, presented are the performance improvements made to the 2.4 kernel
that are now available in the 2.6 framework. The data was collected from the
execution of benchmarks representing a diverse set of workloads.
Several benchmarks were used in this study, including SPECjbb2000 [17],
SPECjAppServer2002 [18], VolanoMark [19], as well as an additional
database workload. SPECjbb2000 was developed by the Standard Performance
Evaluation Corporation [20]. It reflects a Java business benchmark that
represents server-side Java applications by modeling a 3-tier server
environment. Drivers emulate the clients, and a binary object tree emulates the
database. Therefore, the primary component being modeled is the middle tier,
which includes the business logic, as well as the manipulation of objects.
SPECjAppServer2002 is also a SPEC benchmark that was designed to model
application servers. It emulates certain enterprise-level business activities such
as order-entry, supply-chain management, and manufacturing. Both the
SPECjbb2000 and SPECjAppServer2002 benchmark work was conducted for
research purposes only, and the results were non-compliant (see the Legal
Statement). For the SPECjAppServer2002 benchmark, we used a dual-mode
(3-tier) setup, utilizing IBM WebSphere 5.x Application Server in the middle tier,
and IBM DB2 UDB 8.x for our backend database setup. The VolanoMark
benchmark models a chat room server by creating 10 chat rooms of 20 clients
each. Each chat room sends messages from one client to the other 19 clients in
the room. This benchmark is typically utilized to quantify raw server performance,
as well as to analyze network scalability. The database workload used in this
study emulates enterprise business intelligence operations.

Performance improvements
In this section we present data to quantify the performance improvements in the
2.6 kernel, including those in the CPU scheduler, memory management, large
page support, and NUMA memory allocation. Many of the systems utilized in the
benchmarks were NUMA machines. One of the performance issues discovered
was the inadequate load balancing of processes when hyperthreading was
turned on. In this case, the scheduler tended to assign interrupt requests (IRQ) to
CPU 0. This scheduling behavior represented a potential to create a bottleneck
in the system. However, CPU 0 by itself represents a (complete) physical
processor when hyperthreading is disabled. The bottleneck is amplified when
hyperthreading is turned on, as CPU 0 becomes a virtual processor, and its
physical hardware is shared with its sibling. A patch was developed to address
this issue by balancing the IRQ processing among the processors when the
enablement of hyperthreading is detected. Presented in Table 1 are the
performance improvements obtained by utilizing the patch, which is available in
the 2.6 kernel. As illustrated, double-digit performance improvements were
achieved via this feature.
Table 1  Performance improvements for 2.6 scheduler with hyperthreading enabled

  Benchmark      Configuration   % increase
  SPECjbb2000    8-way x440      10%
  Database       8-way x440      15%

"Large page support" describes the large page support that is
available in the 2.6 kernel. Table 2 outlines the performance improvements
obtained from utilizing large pages. The results show moderate to good
performance improvements.
Table 2  Performance improvements obtained by large page support in the 2.6 kernel

  Benchmark            Configuration   % increase
  SPECjbb2000          8-way x440      8%
  Database             8-way x440      2%
  SPECjAppServer2002   8-way x440      3.3%

In the 2.4 kernel, we observed performance degradations for some workloads
that were attributed to a lack of knowledge of the NUMA topologies by the kernel.
In such instances, processes executing on a CPU in a specific NUMA node had
their associated memory allocated on a different node, which resulted in
extended memory-access latency scenarios. In other instances, a process may
have been associated with a CPU, and hence had its memory allocated on that
particular node. However, over time the process may have migrated to another
node, which basically generated the same memory latency problem. This issue
was addressed in the 2.6 kernel by incorporating knowledge about the NUMA
technology into the memory allocation process. Basically, the NUMA patch
allows a process to receive its memory from the local node. By affinitizing the
process to the CPUs of the local node, the cross-node memory access
operations can be significantly reduced. Presented in Table 3 are the
performance improvements exhibited by this feature on the database and the
SPECjAppServer2002 workloads.
Table 3  Performance improvement obtained via the NUMA memory patch

  Benchmark            Configuration   % increase
  Database             8-way x440      8%
  SPECjAppServer2002   8-way x440      8.7%
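
A sketch of the affinitization step described above, using the sched_setaffinity() interface as exposed by glibc; the assumption that CPUs 0-3 form node 0 is purely illustrative.

```c
/* Pin the current process to the CPUs of one node so that its pages
 * are allocated and referenced locally. CPUs 0-3 = "node 0" is an
 * illustrative assumption about the topology. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < 4; cpu++)      /* CPUs of the local node */
        CPU_SET(cpu, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* memory allocated from here on comes from the local node */
    return 0;
}
```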

Workload improvements for 2.6


Illustrated in Figure 1 is a quantification of the performance
enhancements made to the Linux kernel over time for the SPECjbb2000
benchmark. (All graphs presented in this paper show a performance
improvement that is relative to the initial result, that is, in January, which is set to
100). The results demonstrate significant improvements in the execution of this
benchmark over the course of four months. The improvements were the result of
a combination of changes and enhancements. The changes included the use of
a new version of the JVM and JVM tuning. The 2.6 kernel enhancements
included support for large pages and NUMA (see "Performance improvements"
for details).
Presented in Figure 2 is a similar quantification for the
SPECjAppServer2002 benchmark, tracking the improvements over the course of
a year. The improvements obtained for this workload are attributed to Linux
kernel performance enhancements, improvements in the middleware
components, and the addition of new hardware. For example, in the month of
May, a new machine with a larger cache was utilized, providing a significant
performance boost for this benchmark. Nonetheless, the 2.6 Linux kernel
optimizations for this workload were significant. They resulted from the inclusion
of support for large pages and NUMA (see "Performance improvements" for
details).

Figure 1  Performance improvements for 2.6 kernel, Java Business Benchmark workload (SPECjbb2000, 8-way x440); y-axis shows normalized improvement, May through August. Note: May = 1.4.0 JVM, June = 1.4.1 JVM, July = 3-tier tuning, August = large pages.

Figure 2  Performance improvements for 2.6 kernel, application server workload (application server: 8-way x440; DB server: 4-way x350); y-axis shows normalized improvement, January through August.

Shown in Figure 3 are the results for the performance improvements
for the Linux 2.6 kernel over the course of several months. The 2.6 kernel
features that resulted in these performance improvements include support for
hyperthreading, large pages, and NUMA (see "Performance improvements").
In addition, database optimizations contributed to the performance
gains over this period of time.

Conclusion and future work


The Linux 2.6 kernel represents another evolutionary step forward, and builds
upon its predecessors to boost (application) performance through enhancements
to the VM subsystem, the CPU scheduler, and the I/O scheduler. In addition, this
new version of the kernel delivers important functional enhancements in security,
scalability, and networking. The results presented in this paper highlight some of
the performance optimizations obtained from a sampling of the new features in
the Linux 2.6 kernel. However, there are many additional features that provide a
diverse range of performance enhancements. For example, NPTL and user level
locking primitives (futexes) [22] can provide good performance improvements.
Applications that utilize futexes do not branch into the kernel (ring 0) code if
the lock is not contended. Moreover, NPTL also eliminates some of the limitations
that were imposed by the old threading library. These, and other features, are the
subject of future investigations into how to further improve the performance,
stability, and scalability of the Linux kernel.

Figure 3  Performance improvements for 2.6 kernel, database query workload (8-way x440); y-axis shows normalized improvement, February through October.

Legal statement
This work represents the view of the authors and does not necessarily represent
the view of IBM. IBM, AIX, and eServer are trademarks or registered trademarks
of International Business Machines Corporation in the United States, other
countries, or both. UNIX is a registered trademark of The Open Group in the
United States and other countries. Java and all Java-based trademarks are
trademarks of Sun Microsystems, Inc., in the United States, other countries, or
both. Other company, product, and service names may be trademarks or service
marks of others. SPEC and the benchmark names SPECjbb2000 and
SPECjAppServer2002 are registered trademarks of the Standard
Performance Evaluation Corporation. The benchmarking was conducted for
research purposes only. For the latest SPECjbb2000 and SPECjAppServer2002
benchmark results, visit:
http://www.spec.org

References
1. The Linux source code.
2. Arcangeli, A., "Evolution of Linux Towards Clustering," EFD R&D Clamart, 2003.
3. Axboe, J., "Deadline I/O Scheduler Tunables," SuSE.
4. Corbet, J., "A new deadline I/O scheduler," http://lwn.net/Articles/10874
5. Corbet, J., "Anticipatory I/O scheduling," http://lwn.net/Articles/21274
6. Corbet, J., "The Continuing Development of I/O Scheduling," http://lwn.net/Articles/21274
7. Corbet, J., "Porting drivers to the 2.5 kernel," LSO, 2003.
8. Iyer, S., Druschel, P., "Anticipatory Scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O," Rice University.
9. Lee Irwin III, W., "A 2.5 Page Clustering Implementation," Linux Symposium, Ottawa, 2003.
10. Patterson, D.A., Hennessy, J.L., Computer Architecture: A Quantitative Approach, 3rd Edition, 2002.
11. Nagar, S., Franke, H., Choi, J., Seetharaman, C., Kaplan, S., Singhvi, N., Kashyap, V., Kravetz, M., "Class-Based Prioritized Resource Control in Linux," Linux Symposium, 2003.
12. McKenney, P., "Stochastic Fairness Queueing," INFOCOM, 1990.
13. Molnar, I., "Goals, Design and Implementation of the new ultra-scalable O(1) scheduler" (sched-design.txt).
14. Mosberger, D., Eranian, S., IA-64 Linux Kernel: Design and Implementation, Prentice Hall, NJ, 2002.
15. Shriver, E., Merchant, A., Wilkes, J., "An Analytic Behavior Model with Readahead Caches and Request Reordering," Bell Labs, 1998.
16. Wienand, I., "An analysis of Next Generation Threads on IA64," HP, 2003.
17. SPECjbb2000, http://www.spec.org/jbb2000
18. SPECjAppServer2002, http://www.spec.org/jAppServer2002
19. VolanoMark, http://www.volano.com/benchmarks
20. Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
21. Drepper, U., Molnar, I., "The Native POSIX Thread Library for Linux," 2003.
22. Franke, H., Russell, R., Kirkwood, M., "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux," OLS, 2003.


Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer
of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Any performance data contained herein was determined in a controlled environment. Therefore, the results
obtained in other operating environments may vary significantly. Some measurements may have been made
on development-level systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurement may have been estimated through
extrapolation. Actual results may vary. Users of this document should verify the applicable data for their
specific environment.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm
the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on
the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all conditions. IBM,
therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy,
modify, and distribute these sample programs in any form without payment to IBM for the purposes of
developing, using, marketing, or distributing application programs conforming to IBM's application
programming interfaces.
This document was created or updated on July 23, 2004.
Send us your comments in one of the following ways:
- Use the online "Contact us" review redbook form found at:
  ibm.com/redbooks
- Send your comments in an email to:
  redbook@us.ibm.com
- Mail your comments to:
  IBM Corporation, International Technical Support Organization
  Dept. JN9B, Building 905, Internal Mail Drop 9053D005, 11501 Burnet Road
  Austin, Texas 78758-3493 U.S.A.

Trademarks
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Approach
AIX
DB2
eServer
IBM
ibm.com
Redbooks
Redbooks (logo)
WebSphere

The following terms are trademarks of other companies:


Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun
Microsystems, Inc. in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, and service names may be trademarks or service marks of others.
SPEC and the benchmark names SPECjbb2000 and SPECjAppServer2002 are registered
trademarks of the Standard Performance Evaluation Corporation.
