Student Guide
D79440GC10
Edition 1.0
May 2013
D81811
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.

Author
Editor
Vijayalakshmi Narasimhan
Graphic Designer
Rajiv Chandrabhanu
Publishers
Nita Brozowski
Jayanthy Keshavamurthy
Sujatha Nagendra
Srividya Rameshkumar
Contents
1 Introduction
Course Goals 1-2
Schedule 1-3
Objectives 1-4
Virtualization with Oracle VM Server 1-5
Oracle VM Server in the Classroom 1-6
3 Control Groups (cgroups)
Objectives 3-2
Control Groups: Introduction 3-3
Cgroup Subsystems (Resource Controllers) 3-5
Cgroup Subsystem Parameters 3-6
Cgroup Hierarchy 3-7
Cgroup Configuration Rules and Constraints 3-8
Cgroup Configuration 3-9
/etc/cgconfig.conf “mount” Section 3-10
lssubsys Utility 3-12
Additional Linux Container Utilities 4-22
Using Oracle VM Template as a Base Environment 4-23
Quiz 4-25
Summary 4-27
Practice 4: Overview 4-28
Practice 5: Overview 5-54
7 Performance Diagnostics
Objectives 7-2
Rescue Mode 7-3
Frequently Encountered Performance Problems 7-5
Performance Tuning Kernel Parameters 7-7
System Resource Reporting Utilities 7-8
CPU Reporting Utilities 7-10
Memory Reporting Utilities 7-11
Using the dstat Utility 7-13
Using vmstat and free Utilities 7-14
Using the top Utility 7-15
Reporting Top Consumers of System Resources 7-17
Network Reporting Utilities 7-18
I/O Reporting Utilities 7-19
Using the iotop Utility 7-20
System Monitor GUI 7-21
OS Watcher Black Box (OSWbb) 7-22
OS Watcher Black Box Analyzer (OSWbba) 7-24
Analyze OSWbb Archive Files 7-27
Monitoring System Latency: latencytop 7-29
strace Utility 7-30
Quiz 7-32
Summary 7-33
Practice 7: Overview 7-34
Built-In D Variables 9-21
trace() Built-in Function 9-22
DTrace Examples 9-23
D Scripts 9-27
Using Predicates in a D Script 9-29
D Script Example 9-31
Quiz 9-35
Summary 9-39
Practice 9: Overview 9-40
Introduction
Session    Module
Day 1      Lesson 1: Introduction
           Lesson 2: Btrfs File System
           Lesson 3: Control Groups (cgroups)
           Lesson 4: Linux Containers (LXC)
Day 2      Lesson 5: Advanced Storage Administration
[Slide figure: Oracle VM Server on host03, showing the Server Pool Master, Utility Server, and Virtual Machine roles, with the Oracle VM agent running on the hypervisor.]
Virtualization
Virtualization allows you to use one server and its computing resources to run one or more
guest operating system and application images concurrently, sharing those resources among
the guests.
Supervisor Versus Hypervisor
The sharing of the host server’s resources can be accomplished through two different
approaches:
• A supervisory application such as VirtualBox that runs on the host operating system and
in turn runs the guest virtual machines
• A hypervisor such as Oracle VM Server or VMware ESX that provides a small footprint
host operating system and exposes the server’s resources to the guest virtual machines
that run directly on top of the hypervisor
A hypervisor removes the intermediary represented by the supervisory application, thereby
freeing up more resources for the guest virtual machines to share.
Oracle VM Server Domains
Oracle VM Server guests are also referred to as domains. Dom0 is always present, providing
management services for the other domains running on the same server.
[Slide figure: classroom networking, showing a classroom PC running Oracle VM Server (hypervisor) and the network between dom0 and the VM guests.]
Network Configuration
This slide shows the network configuration of each classroom system, as well as the disk
devices for each VM guest. Dom0 on each system has a single physical network interface,
eth0. Three network bridges are configured on Dom0:
• xenbr0: Communication to the classroom network is through this bridge. The IP
address of xenbr0 is the same as the IP address assigned to eth0.
• virbr0: This is a Xen bridge interface used by the VM guests on the 192.0.2 public
subnet. The IP address of virbr0 is 192.0.2.1.
• virbr1: This is a Xen bridge interface used by the VM guests on the 192.168.1
private subnet. This private subnet is used in the OCFS2 practice. The IP address of
virbr1 is 192.168.1.1.
The host01 and host02 VM guests have the same network configuration. These VM guests
have two virtual network interfaces.
• eth0: This network interface is on the 192.0.2 subnet. On host01, eth0 has an IP
address of 192.0.2.101. On host02, eth0 has an IP address of 192.0.2.103.
• eth1: This network interface is on the 192.168.1 subnet. On host01, eth1 has an IP
address of 192.168.1.101. On host02, eth1 has an IP address of 192.168.1.103.
A local Yum repository exists on dom0. You use the yum command to install and upgrade
software packages on your VMs from this local Yum repository. A vm.repo file in the
/etc/yum.repos.d directory points to this local Yum repository.
# ls /etc/yum.repos.d
public-yum-ol6.repo vm.repo
All repositories are disabled in the public-yum-ol6.repo file. Only vm.repo has an
enabled repository, which points to the local Yum Repository on dom0.
On dom0, the Oracle Linux 6.3 ISO files exist in the following directory:
# ls /OVS/OL6U3REPO
EFI Packages ResilientStorage
EULA README-en RPM-GPG-KEY
...
The web server URL points to /OVS/OL6U3REPO:
# ls -l /var/www/html/repo/OracleLinux/OL6/3/base
lrwxrwxrwx ... x86_64 -> /OVS/OL6U3REPO
Btrfs is an open-source, general-purpose file system for Linux. The name derives from the
use of B-trees to store internal file system structures. Different names are used for the file
system, including “Butter F S” and “B-tree F S.” Development of Btrfs began at Oracle in
2007, and now a number of companies (including Red Hat, Fujitsu, Intel, SUSE, and many
others) are contributing to the development effort. Btrfs is included in the mainline Linux
kernel.
Oracle Linux 6.3 with the Unbreakable Enterprise Kernel (UEK) R2 is the first release that
officially supports Btrfs. The standard installation media does not have support for creating a
Btrfs root file system on the initial installation. Oracle does provide an alternative boot ISO
image that allows you to create a Btrfs root file system. The boot ISO installs Oracle Linux 6.3
with the UEK. It requires a network installation source, accessible via FTP, HTTP, or NFS,
that provides the actual RPM packages.
Btrfs provides extent-based file storage with a maximum file size of 16 exabytes. All data and
metadata are copy-on-write; this means that blocks of data are not changed in place on disk. Btrfs
copies the blocks and then writes out the copies to a different location. Not updating the
original location eliminates the risk of a partial update or data corruption during a power
failure. The copy-on-write nature of Btrfs also facilitates file system features such as
replication, migration, backup, and restoration of data.
To create a Btrfs file system on your system, install the btrfs-progs software package:
# yum install btrfs-progs
Load the btrfs kernel module using the modprobe command. Use the lsmod command to
ensure that the kernel module is loaded.
# modprobe btrfs
# lsmod | grep btrfs
Use the mkfs.btrfs command to create a Btrfs file system. The syntax is:
mkfs.btrfs [options] block_device [block_device ...]
You can create a Btrfs file system on a single device or on multiple devices. Devices can be
disk partitions, loopback devices (disk images in memory), multipath devices, or LUNs that
implement RAID in hardware.
The available options for the mkfs.btrfs command are:
• -A offset – Specify the offset from the start of the device for the file system. The
default is 0, which is the start of the device.
• -b size – Specify the size of the file system. The default is all the available storage.
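For illustration, the following sketch creates a small Btrfs file system on a loopback device. The image file path and the assumption that /dev/loop0 is free are illustrative:
# dd if=/dev/zero of=/tmp/btrfs.img bs=1M count=512
# losetup /dev/loop0 /tmp/btrfs.img
# mkfs.btrfs /dev/loop0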
Use the btrfs command to manage and display information about a Btrfs file system. The
command requires a subcommand. Enter btrfs without any arguments to list the
subcommands:
# btrfs
Usage: btrfs [--help] [--version] <group> [<group>...] <command>
btrfs subvolume create [<dest>/]<name>
Create a subvolume
btrfs subvolume delete <name>
Delete a subvolume
...
btrfs filesystem df <path>
Show space usage information for a mount point
btrfs filesystem show [--all-devices] [<uuid>|<label>]
Show the structure of a filesystem
...
[Slide figure: a Btrfs file system hierarchy in which the root subvolume contains subvolumes (SV1, SV2, SV22), a directory (D1), and files (F1, F2, F3).]
This slide illustrates a Btrfs file system hierarchy that consists of subvolumes, directories, and
files. Btrfs subvolumes are named B-trees that hold files and directories. Subvolumes can
also contain subvolumes, which are themselves named B-trees that can also hold files and
directories. The top level of a Btrfs file system is also a subvolume, and is known as the root
subvolume.
The root subvolume is mounted by default, and Btrfs subvolumes appear as regular directories
within the file system. However, a subvolume can also be mounted directly, in which case only
the files and directories in that subvolume are accessible. The following example lists the hierarchy displayed in the
slide, with the default root subvolume mounted on /btrfs:
# ls -l /btrfs
drwx------ ... SV1
drwx------ ... D1
-rwxr-xr-x ... F1
drwx------ ... SV2
Mounting the SV1 subvolume or the SV2 subvolume on /btrfs allows access only to the
files and directories within the respective subvolume. Remount the root subvolume to regain
access to the entire hierarchy.
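For example, a subvolume can be mounted directly with the subvol mount option. This sketch assumes the same /dev/sdb device used elsewhere in this lesson:
# umount /btrfs
# mount -o subvol=SV1 /dev/sdb /btrfs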
Use the btrfs subvolume create command to create a subvolume. The following
example creates a subvolume named SV1 on a Btrfs file system mounted on /btrfs:
# btrfs subvolume create /btrfs/SV1
Create subvolume '/btrfs/SV1'
The subvolume appears as a regular directory although the permissions are different. The
following example creates a regular directory in /btrfs and then displays the content:
# mkdir /btrfs/D1
# ls -l /btrfs
drwxr-xr-x ... D1
drwx------ ... SV1
Use the btrfs subvolume list command to view only the subvolumes in a Btrfs file
system, as in this example:
# btrfs subvolume list /btrfs
ID 258 gen 12 top level 5 path SV1
This command also displays the subvolume ID (258), the generation of the B-tree (12),
and the top-level ID (5). These fields are described later in this lesson.
Btrfs subvolumes can be snapshotted and cloned, which creates additional B-trees. A
snapshot starts as a copy of a subvolume taken at a point in time. You can make a snapshot
writable and use it as an evolving clone of the original subvolume. Or you can use the
snapshot as a stable image of a subvolume for backup purposes or for migration to other
systems. Snapshots can be created quickly and they initially consume very little disk space.
Use the btrfs subvolume snapshot command to create a writable snapshot of a
subvolume. The following example creates a snapshot of the SV1 subvolume:
# btrfs subvolume snapshot /btrfs/SV1 /btrfs/SV1-snap
Create a snapshot of '/btrfs/SV1' in '/btrfs/SV1-snap'
Use the btrfs subvolume snapshot -r option to create a read-only snapshot:
# btrfs subvolume snapshot -r /btrfs/SV1 /btrfs/SV1-rosnap
Create a readonly snapshot of '/btrfs/SV1' in '/btrfs/SV1-rosnap'
The snapshots appear as regular directories when the ls command is used. Snapshots also
appear in the output of the btrfs subvolume list command.
You can use the cp --reflink command to take a snapshot of a file. With this option, the
file system does not create a new link pointing to an existing inode, but instead creates a new
inode that shares the same disk blocks as the original copy. The new file appears to be a
copy of the original file but the data blocks are not duplicated. This allows the copy to be
almost instantaneous and also saves disk space. As the file’s content diverges over time, its
amount of required storage grows. One restriction is that this operation can work only within
the boundaries of the same file system and within the same subvolume.
The following example copies a file using the cp --reflink command. The space used is
given both before and after the copy operation. Note that the space used does not increase.
# df -h /btrfs
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 16G 8.2M 14G 1% /btrfs
# cp --reflink /btrfs/SV1/vmlinuz* /btrfs/SV1/copy_of_vmlinuz
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 16G 8.2M 14G 1% /btrfs
By default, Linux mounts the parent Btrfs volume, which has an ID of 0. In this example, the
following mount command was issued before creating any subvolumes and snapshots:
# mount /dev/sdb /btrfs
The subvolume SV1 was created in /btrfs. The ls command shows the subvolume:
# ls -l /btrfs
drwx------ ... SV1
The following example copies files into SV1, creates a snapshot of SV1, and verifies that both
the subvolume and the snapshot contain the same files:
# cp /boot/vmlinuz-2.6.39* /btrfs/SV1
# btrfs sub snapshot /btrfs/SV1 /btrfs/SV1-snap
# ls /btrfs/SV1*
/btrfs/SV1:
vmlinuz-2.6.39-200.32.1.el6uek.x86_64
/btrfs/SV1-snap:
vmlinuz-2.6.39-200.32.1.el6uek.x86_64
Use the btrfs filesystem command to manage and report on Btrfs file systems. A partial
list of the available commands is as follows:
# btrfs filesystem
Usage: btrfs filesystem [<group>] <command> [<args>]
btrfs filesystem df <path>
Show space usage information for a mount point
btrfs filesystem show [--all-devices] [<uuid>|<label>]
Show the structure of a filesystem
btrfs filesystem sync <path>
Force a sync on a filesystem
btrfs filesystem defragment [options] <file>|<dir> [...]
Defragment a file or a directory
btrfs filesystem resize [+/-]<newsize>[gkm]|max <path>
Resize a filesystem
...
Some information is presented when you create a Btrfs file system. The following example
creates a Btrfs file system with two 8 GB devices (/dev/sdb and /dev/sdc) and mirrors
both the data and the metadata (metadata is mirrored by default):
# mkfs.btrfs -L Btrfs -d raid1 /dev/sdb /dev/sdc
...adding device /dev/sdc id 2
fs created label Btrfs on /dev/sdb
nodesize 4096 leafsize 4096 sectorsize 4096 size 16.00GB
...
The output above shows that the block size is 4 KB with a total of 16 GB of space. But
because the array is RAID1, you can fit only 8 GB of data on this file system. You actually
have less than 8 GB because space is needed for the metadata as well. The example
continues with creating a mount point and mounting the file system:
# mkdir /btrfs
# mount /dev/sdb /btrfs
As previously discussed, you can mount by referencing either device in the array, the LABEL,
or the UUID.
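For example, any of the following commands mounts the same file system; the UUID placeholder stands for the value reported by btrfs filesystem show:
# mount /dev/sdc /btrfs
# mount LABEL=Btrfs /btrfs
# mount UUID=<uuid> /btrfs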
Use the btrfs filesystem show command to display the structure of a file system. The
syntax follows:
btrfs filesystem show [--all-devices] [<uuid>|<label>]
If you omit the optional uuid and label, the command shows information about all the Btrfs
file systems. If you provide the --all-devices argument, all devices under /dev are
scanned. Otherwise, the device list is obtained from /proc/partitions.
The following example displays the structure of the Btrfs file system with a label of Btrfs:
# btrfs filesystem show Btrfs
Label: 'Btrfs' uuid: 8bb5...
Total devices 2 FS bytes used 4.05MB
devid 2 size 8.00GB used 2.01GB path /dev/sdc
devid 1 size 8.00GB used 2.03GB path /dev/sdb
Use the btrfs filesystem sync command to force a sync for the file system. The file
system must be mounted. To force a sync of the file system mounted on /btrfs:
# btrfs filesystem sync /btrfs
FSSync '/btrfs'
Btrfs provides online defragmentation of a file system, file, or directory. The online
defragmentation facility reorganizes data into contiguous chunks wherever possible to create
larger sections of available disk space and to improve read and write performance. Use the
btrfs filesystem defragment command to defragment a file or a directory.
btrfs filesystem defragment [options] <file>|<dir> [...]
The available options include the following:
• -v – Verbose
• -c – Compress file contents while defragmenting.
• -f – Flush file system after defragmenting.
• -s start – Defragment only from byte start onward.
• -l len – Defragment only up to len bytes.
• -t size – Defragment only files that are at least size bytes.
You can set up automatic defragmentation by specifying the -o autodefrag option when
you mount the file system. Do not defragment with kernels up to version 2.6.37 if you have
created snapshots or made snapshots of files using the cp --reflink option. Btrfs in these
earlier kernels unlinks the copy-on-write copies of data.
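For example, to mount the file system from the earlier examples with automatic defragmentation enabled:
# mount -o autodefrag /dev/sdb /btrfs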
Btrfs provides online resizing of a file system. Use the btrfs filesystem resize
command to resize a file system. You must have space available to accommodate the
resizing because the command has no effect on the underlying devices. The syntax is as
follows:
btrfs filesystem resize [+/-]<newsize>[gkm]|max <path>
Descriptions of the parameters:
• + newsize – Increases the file system size by newsize amount
• - newsize – Decreases the file system size by newsize amount
• newsize – Specifies the newsize amount
• g, k, or m – Specifies the unit of newsize as GB, KB, or MB. If no unit is
specified, the parameter defaults to bytes.
• max – Specifies that the file system occupy all available space
For example, to reduce the size of the file system by 2 GB:
# btrfs filesystem resize -2G /btrfs
Resize '/btrfs/' of '-2G'
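Similarly, to grow the file system to occupy all available space on the underlying devices:
# btrfs filesystem resize max /btrfs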
Use the btrfs device command to manage devices on Btrfs file systems. A list of the
available commands is as follows:
# btrfs device
Usage: btrfs device <command> [<args>]
btrfs device add <device> [<device>...] <path>
Add a device to a filesystem
btrfs device delete <device> [<device>...] <path>
Remove a device from a filesystem
btrfs device scan [--all-devices|<device> [<device>...]]
Scan devices for a btrfs filesystem
The btrfs device scan command scans physical devices looking for members of a Btrfs
volume. This command allows a multiple-device Btrfs file system to be mounted without
specifying all the devices on the mount command.
You do not need to run btrfs device scan from the command line because Oracle has
modified udev on Oracle Linux 6 to automatically run a btrfs device scan at boot.
Use the btrfs device add command to add a device to a file system. In this example, the
current file system structure is as follows:
# btrfs filesystem show Btrfs
Label: 'Btrfs' uuid: 8bb5...
Total devices 2 FS bytes used 4.05MB
devid 2 size 8.00GB used 2.01GB path /dev/sdc
devid 1 size 8.00GB used 2.03GB path /dev/sdb
The btrfs filesystem df command shows:
# btrfs filesystem df /btrfs
Data, RAID1: total=1.00GB, used=4.02MB
Data, total=8.00MB, used=0.00
System, RAID1: total=8.00MB, used=4.00KB
System, total=4.00MB, used=0.00
Metadata, RAID1: total=1.00GB, used=32.00KB
Metadata, total=8.00MB, used=0.00
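As a sketch, a third device could then be added to this file system; the /dev/sdd device is illustrative, and the balance step redistributes existing block groups across all devices:
# btrfs device add /dev/sdd /btrfs
# btrfs filesystem balance /btrfs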
You can initiate a check of the entire file system by triggering a file system scrub job. The
scrub job runs in the background by default and scans the entire file system for integrity. It
automatically attempts to report and repair any bad blocks that it finds along the way. Instead
of going through the entire disk drive, the scrub job deals only with data that is actually
allocated. Depending on the allocated disk space, this is much faster than performing an
entire surface scan of the disk.
Scrubbing involves reading all the data from all the disks and verifying checksums. If any
values are not correct, the data can be corrected by reading a good copy of the block from
another drive. The scrubbing code also scans on read automatically. It is recommended that
you scrub high-usage file systems once a week and all other file systems once a month.
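One way to follow this recommendation is a scheduled job. This crontab sketch assumes btrfs is installed in /sbin; the schedule and mount point are illustrative:
# run a scrub of /btrfs every Sunday at 03:00
0 3 * * 0 /sbin/btrfs scrub start /btrfs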
The following is a partial list of the available btrfs scrub commands:
# btrfs scrub
Usage: btrfs scrub <command> [options] <path>|<device>
btrfs scrub start [-Bdqr] <path>|<device>
Start a new scrub
btrfs scrub status [-dR] <path>|<device>
Show status of running or finished scrub
Use the btrfs scrub start command to start a scrub on all the devices of a file system or
on a single device. The syntax is as follows:
btrfs scrub start [-Bdqr] <path>|<device>
Description of options:
• -B – Do not run in the background and print statistics when finished.
• -d – Print separate statistics for each device of the file system. This option is used in
conjunction with the -B option.
• -q – Run in quiet mode, omitting error messages and statistics.
• -r – Run in read-only mode, not correcting any errors.
The following example starts a scrub on the Btrfs file system that is mounted on /btrfs.
# btrfs scrub start -B /btrfs
scrub started on /btrfs, fsid ... (pid=...)
Use the btrfs scrub status command to get the status on a scrub job. Two options are
available:
• -d – Print separate statistics for each device of the file system.
• -R – Print detailed scrub statistics.
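For example, to display the status of the scrub on the file system mounted on /btrfs:
# btrfs scrub status /btrfs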
Btrfs supports the conversion of ext2, ext3, and ext4 file systems to Btrfs file systems. The
original ext file system metadata is stored in a snapshot named ext#_saved so that the
conversion can be reversed if necessary.
Use the btrfs-convert utility to convert an ext file system. Always make a backup copy
before converting a file system. To convert an ext file system other than the root file system,
perform the steps listed in the slide.
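As a sketch of those steps for a non-root ext4 file system (the /dev/sdc1 device and /mnt mount point are illustrative): unmount the file system, check it, run the conversion, and mount the result:
# umount /mnt
# fsck -f /dev/sdc1
# btrfs-convert /dev/sdc1
# mount /dev/sdc1 /mnt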
Refer to the Oracle Linux Administrator’s Solutions Guide for details about converting the
root file system:
http://www.oracle.com/technetwork/server-storage/linux/documentation/index.html
Oracle provides an alternative installation image (UEK Boot ISO) that supports the installation
of Oracle Linux 6 update 3 using the Unbreakable Enterprise Kernel (UEK) as the installation
kernel. This installation method allows you to create a Btrfs root file system. This ISO image
contains only the bootable installation image. It requires a network installation source that
provides the actual RPM packages. Perform the steps listed in the slide to install from the
UEK Boot ISO.
In step 4, you replace the contents of the images directory in the installation source created
in step 2 with the contents of the images directory from the UEK Boot ISO. The default
Oracle Linux 6.3 Media follows Red Hat Enterprise Linux (RHEL) Boot Media to maintain
compatibility. The UEK Boot ISO is available only with Oracle Linux and contains images to
allow creation of a Btrfs root file system. If you use the default Media images directory, an
ext4 root file system is created.
Control groups (cgroups) provide a mechanism to put Linux processes (tasks) into groups to
ensure that critical workloads get the system resources (CPU, memory, and I/O) that they
need. You can allocate system resources, track usage, and impose limits on the cgroups.
Cgroups provide more fine-grained control of CPU, I/O, and memory resources. You can
associate a set of CPU cores and memory nodes with a group of processes that make up an
application or a group of applications. This enables the subsetting of larger systems, more
fine-grained control over memory, CPUs, and devices, and the isolation of applications.
For example, with very large NUMA systems, you make the best use of system resources by
compartmentalizing. Cgroups give you a great deal of control over how to set up a system,
which memory to give, and which CPUs to give to an individual task. You can pin processes
to the same NUMA node and use NUMA-local memory. Cgroups facilitate database
consolidation on large NUMA servers, I/O throttling support, and device whitelisting. Cgroups
work inside virtual guests as well.
Some capabilities (specifically the blkio subsystem which provides I/O throttling) require the
Oracle Linux Unbreakable Enterprise Kernel release 2 (2.6.39) or the Red Hat compatible
kernel in Oracle Linux 6.1 or later.
A sample implementation is shown on the following page.
A subsystem is a kernel resource controller that applies limits or acts on a group of Linux
processes. A subsystem represents a single kernel resource, such as memory or CPU cores. A
cgroup associates a group of processes or tasks with a set of parameters for one or more
subsystems. Processes assigned to each group are subject to the subsystem parameters.
Here are brief descriptions of the cgroup subsystems:
• cpuset – Assigns individual CPUs and memory nodes (for systems with NUMA
architectures) to cgroup tasks
• cpu – Schedules CPU access (according to relative shares or for real-time processes)
to cgroup tasks
• cpuacct – Reports the total CPU time used by cgroup tasks
• memory – Reports or limits memory use of cgroup tasks
• devices – Grants or denies cgroup tasks access to devices
• freezer – Suspends or resumes cgroup tasks
• net_cls – Tags outgoing network packets with an identifier. The Linux traffic controller
(tc) can be configured to assign different priorities to packets from different cgroups.
• blkio – Reports or controls I/O bandwidth for block devices. You can assign
proportional weights to specific cgroups or set an upper limit for the number of I/O
operations performed by a device.
Each subsystem has specific parameters that enable resource control and reporting
mechanisms. The subsystem parameters are the heart of cgroup resource controls. They
allow you to set limits, restrict access, or define allocations for each subsystem.
Understanding the specific parameters helps you to understand the possibilities for controlling
resources with cgroups.
In addition to the subsystem parameters, each subsystem directory contains the following
files:
• cgroup.clone_children
• cgroup.event_control
• cgroup.procs
• notify_on_release
• release_agent
• tasks
The tasks file keeps track of the processes associated with the cgroup and the associated
subsystem parameter settings. After the setup is complete, the tasks file contains all the
process IDs (PIDs) assigned to the cgroup.
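For example, to place the current shell into a cgroup, write its PID into the cgroup's tasks file; the /cgroup/cpu/mygrp path is illustrative:
# echo $$ > /cgroup/cpu/mygrp/tasks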
Refer to the appendix titled “Cgroup Subsystem Parameters” for a description of each
subsystem parameter.
[Slide figure: the default cgroup hierarchy, with each subsystem attached to its own root cgroup: /cgroup/cpuset (cpuset), /cgroup/cpu (cpu), /cgroup/cpuacct (cpuacct), /cgroup/memory (memory), /cgroup/net_cls (net_cls), and /cgroup/blkio (blkio).]
This slide illustrates the default cgroup hierarchy. In the default hierarchy, each of the
subsystems is attached to separate hierarchies, /cgroup/<subsystem>. Each hierarchy
has an associated cgroup which is known as the root cgroup. All the processes, or tasks, on
the system are initially members of the root cgroup.
Cgroups are implemented using a file system–based model. You can traverse the /cgroup
hierarchy to view current control group hierarchies, parameter assignments, and associated
tasks. In this file system hierarchy, children inherit characteristics from their parents.
A hierarchy is a set of subsystems and cgroups arranged in a tree, so that every system
process is in exactly one of the cgroups in the hierarchy. Groups can be hierarchical, where
each group inherits characteristics from its parent group.
Many different hierarchies of cgroups can exist simultaneously on a system. Whereas the
Linux process model is a single tree of processes (all processes are child processes of a
common parent: the init process), the cgroup model is one or more separate, unconnected
trees of tasks (that is, processes). Multiple separate hierarchies of cgroups are necessary
because each hierarchy is attached to one or more subsystems.
A single hierarchy can have one or more subsystems attached to it. For example, both the
cpu and memory subsystems can be attached to a single hierarchy, /cgroup/cpu-mem.
These two subsystems (cpu and memory) cannot be attached to any other hierarchy that has
other subsystems already attached to it.
A subsystem cannot be attached to a second hierarchy if the second hierarchy has a different
subsystem already attached to it. For example, if the cpu subsystem is attached to the
/cgroup/cpu hierarchy and the memory subsystem is attached to the /cgroup/memory
hierarchy, an attempt to attach the cpu subsystem to the /cgroup/memory hierarchy fails. A
single subsystem can be attached to two hierarchies if both those hierarchies have only that
subsystem attached.
For any single hierarchy you create, each task on the system can be a member of exactly one
cgroup in that hierarchy. A single task can be in multiple cgroups, as long as each of those
cgroups is in a different hierarchy. If you assign a task as a member of a second cgroup in the
same hierarchy, it is removed from the first cgroup in that hierarchy. At no time is a task ever
in two different cgroups in the same hierarchy. For example, a running sshd process can be a
member of any one cgroup in the /cgroup/cpu hierarchy and can be a member of any one
cgroup in the /cgroup/memory hierarchy. The process cannot be a member of two different
cgroups in the same hierarchy, such as /cgroup/cpu or /cgroup/memory.
To enable the cgroup services on your system, install the libcgroup software package:
# yum install libcgroup
To ensure that the cgconfig service starts at boot time, enter the following command to
enable the service for run levels 2, 3, 4, and 5:
# chkconfig cgconfig on
The main configuration file for cgroups is /etc/cgconfig.conf. When you start the
cgconfig service, it reads this configuration file. Restart the cgconfig service after making
any configuration changes:
# service cgconfig restart
The /etc/cgconfig.conf file is used to define control groups, their parameters, and
mount points. The file contains two types of definitions.
• mount: Defines the virtual file systems that you use to mount resource subsystems
before you attach them to cgroups. The configuration file can contain only one mount
definition.
• group: Defines a cgroup, its access permissions, the resource subsystems that it uses,
and the parameter values for these subsystems. The configuration file can contain more
than one group definition.
mount {
cpuset = /cgroup/cpuset;
cpu = /cgroup/cpu;
cpuacct = /cgroup/cpuacct;
memory = /cgroup/memory;
devices = /cgroup/devices;
freezer = /cgroup/freezer;
net_cls = /cgroup/net_cls;
blkio = /cgroup/blkio;
}
/cgroup/blkio:
blkio.weight blkio.weight_device blkio.time
...
/cgroup/cpu:
cpu.shares cpu.rt_period_us cpu.rt_runtime_us
/cgroup/cpuacct:
cpuacct.usage cpuacct.stat cpuacct.usage_percpu
...
/cgroup/cpuset:
cpuset.cpus cpuset.mems cpuset.cpu_exclusive
...
/cgroup/devices:
devices.allow devices.deny devices.list
...
/cgroup/freezer:
freezer.state
...
/cgroup/memory:
memory.limit_in_bytes memory.soft_limit_in_bytes
...
/cgroup/net_cls:
net_cls.classid
...
Use the lssubsys command to list the hierarchies. The following example lists all (-a)
hierarchies, and includes the mount points (-m):
# lssubsys -am
cpuset /cgroup/cpuset
cpu /cgroup/cpu
cpuacct /cgroup/cpuacct
memory /cgroup/memory
devices /cgroup/devices
freezer /cgroup/freezer
net_cls /cgroup/net_cls
blkio /cgroup/blkio
Include the subsystem(s) arguments to list specific hierarchies, as in this example:
# lssubsys -m blkio cpu
blkio /cgroup/blkio
cpu /cgroup/cpu
The following sample content of the /etc/cgconfig.conf file creates the cpu_ram
hierarchy, and attaches the cpu, cpuset, and memory subsystems:
mount {
cpuset = /cgroup/cpu_ram;
cpu = /cgroup/cpu_ram;
memory = /cgroup/cpu_ram;
}
Restart the cgconfig service to read the /etc/cgconfig.conf file and create the
/cgroup/cpu_ram hierarchy:
# service cgconfig restart
Command-Line Method
Alternatively, you can create a mount point for the hierarchy, and then attach the appropriate
subsystems. The following commands create the same configuration as the preceding one:
# mkdir /cgroup/cpu_ram
# mount -t cgroup -o cpu,cpuset,memory cpu_ram /cgroup/cpu_ram
group <cgroup_name> {
    [perm {
        task {
            uid = <task_user>;
            gid = <task_group>;
        }
        admin {
            uid = <admin_user>;
            gid = <admin_group>;
        }
    }]
    <subsystem> {
        <parameter> = <value>;
        ...
    }
}
The group section defines a cgroup, its access permissions, the resource subsystems that it
uses, and the parameter values for these subsystems. You can define multiple groups in the
/etc/cgconfig.conf file. The syntax is shown in the slide.
The cgroup_name argument defines the name of the cgroup. The perm (permissions) section
is optional but allows you to define a task section and an admin section:
• task – Defines the user and group combination that can add tasks to the cgroup
• admin – Defines the user and group combination that can modify subsystem
parameters and create subgroups
The root user always has permission to add tasks, modify subsystem parameters, and
create subgroups.
The subsystem section allows you to define the parameter settings for the cgroup. You can
have multiple subsystem sections and provide values for specific parameter settings within
each section. If several subsystems are grouped in the same hierarchy, include definitions for
all the subsystems in the section. For example, if the /cgroup/cpu_ram hierarchy includes
the cpu, cpuset, and memory subsystems, include definitions for all these subsystems.
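As an illustrative sketch, a group definition for the dbgrp cgroup in the /cgroup/cpu_ram hierarchy might look like the following. The user and group names, the cpuset values, and the memory limit are assumptions; the cpu values match those shown on the next page:
group dbgrp {
    perm {
        task {
            uid = oracle;
            gid = dba;
        }
        admin {
            uid = root;
            gid = root;
        }
    }
    cpu {
        cpu.rt_period_us = 5000000;
        cpu.rt_runtime_us = 4000000;
    }
    cpuset {
        cpuset.cpus = 0;
        cpuset.mems = 0;
    }
    memory {
        memory.limit_in_bytes = 2147483648;
    }
}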
[Slide figure: the /cgroup/cpu_ram cgroup hierarchy, with the dbgrp child cgroup under the parent.]
This slide illustrates the hierarchy created by the configuration on the previous page. Three
subsystems—cpuset, cpu, memory—are attached to a single hierarchy,
/cgroup/cpu_ram. This parent cgroup has one child group, dbgrp.
The child inherits all characteristics from the parent with the exception of those explicitly set in
the configuration file. For example, the parent group contains default values:
# cat /cgroup/cpu_ram/cpu.rt_period_us
1000000
# cat /cgroup/cpu_ram/cpu.rt_runtime_us
950000
The child group contains the values set in the configuration file:
# cat /cgroup/cpu_ram/dbgrp/cpu.rt_period_us
5000000
# cat /cgroup/cpu_ram/dbgrp/cpu.rt_runtime_us
4000000
The child does not inherit the processes. Moving processes to a child control group is covered
later in this lesson.
Alternatively, you can use the cgcreate command to create a cgroup. The syntax for
cgcreate is:
cgcreate [-t uid:gid] [-a uid:gid] -g subsystems:path
The options are described as follows:
• -t <uid:gid> – Specifies the user and group IDs of those allowed to add tasks to the
cgroup. The default values are the IDs of the parent cgroup.
• -a <uid:gid> – Specifies the user and group IDs of those allowed to modify
subsystem parameters and create subgroups. The default values are the IDs of the
parent cgroup.
• -g <subsystems:cgroup-path> – Specifies the subsystems to add and the relative
path to the subsystems. You can specify this option multiple times.
For example, to create a new cgroup named group2 in the cpu subsystem hierarchy:
# cgcreate -g cpu:group2
Use the cgdelete command to remove a cgroup:
# cgdelete cpu:group2
Alternatively, use the cgset command to set the parameters of cgroups from the command
line. The syntax is as follows:
cgset [-r <name=value>] <cgroup-path>
cgset --copy-from <source_cgroup-path> <cgroup-path>
Options:
• -r <name=value> – Specifies the name of the file to set and the value that is written
to that file. You can use this parameter multiple times.
• --copy-from <source_cgroup-path> – Specifies the cgroup whose parameters
are copied to the input cgroup
The following example assumes that the cpuset subsystem is attached to the group1-web
and group2-db cgroups. It sets parameters for the cgroups, allocating CPU and memory
nodes:
# cgset -r cpuset.cpus='0,2,4,6,8,10,12,14' group1-web
# cgset -r cpuset.cpus='1,3,5,7,9,11,13,15' group2-db
# cgset -r cpuset.mems='0' group1-web
# cgset -r cpuset.mems='1' group2-db
Use the cgroup rules definition file, /etc/cgrules.conf, to define the control groups to
which the kernel assigns processes when they are created. Use the following syntax to define
a cgroup and associated subsystems for a user:
user_name [:command_name] subsystem_name [,...] cgroup_name
The optional command_name specifies the name or absolute path name of a command. The
following example assigns the tasks run by the oracle user to the cpu, cpuset, and
memory subsystems in the dbgrp cgroup:
oracle cpu,cpuset,memory dbgrp
You can also use an asterisk (*) for user_name (for all users) and for subsystem_name to
associate all subsystems for user_name in the cgroup. The following example specifies all
users and all subsystems for the allgrp cgroup:
* * allgrp
Use the @group_name argument to define a cgroup and subsystems for all users in a group.
The following example assigns the tasks run by users in the guest group to the devgrp
cgroup:
@guest devices devgrp
Use the service command to start the cgred service after updating /etc/cgrules.conf.
You can configure PAM to use the rules that you define in the /etc/cgrules.conf file to
associate processes with cgroups. You must first install the libcgroup-pam software
package:
# yum install libcgroup-pam
Installing this package installs the pam_cgroup.so module in /lib64/security on 64-bit
systems or in /lib/security on 32-bit systems.
Edit the /etc/pam.d/su configuration file and add the following line:
session optional pam_cgroup.so
For a service that has a configuration file in /etc/sysconfig, add the following line to start
the service in a specified cgroup:
CGROUP_DAEMON="*:cgroup_name"
Use the following command to determine which process belongs to which cgroup:
# ps -O cgroup
PID CGROUP ... COMMAND
2692 memory,cpu:/ ... su -
2700 memory,cpu:/ ... -bash
...
If the process ID (PID) is known, you can use the following command as well:
# cat /proc/PID/cgroup
For example, for PID 2692:
# cat /proc/2692/cgroup
1:memory,cpu:/
You can also view the /proc/cgroups file to view information about the subsystems:
# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpu 1 3 1
...
You can use the tree command (if installed) to view all the hierarchies and their cgroups:
# tree /cgroup
/cgroup/
|---blkio
|---cpu
|---cpu-n-ram
| |---cgroup
| | |---cpu.shares
| | |---cpu.rt_period_us
...
11 directories, 73 files
Use the lscgroup command to list the cgroups on a system:
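For example (the output here is illustrative and depends on the configured hierarchies):
# lscgroup
cpu:/
cpu:/group2
memory:/
...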
You can use the cgget command to view the value of a subsystem parameter:
cgget -r cgroup_file cgroups_list
The cgroup_file argument is the virtual file that contains the values for a subsystem. The
cgroups_list argument is a list of cgroups separated with spaces. For example, to view
the memory statistics for the cgroup dbgrp:
# cgget -r memory.stat dbgrp
You can also use the cgget command to list the values of all the subsystem parameters. With
the -g option, you specify the subsystem (controller) whose parameters are listed:
cgget -g subsystem /
For example, to list all the memory subsystem parameter values:
# cgget -g memory /
Linux Containers (LXC) are the next step up from cgroups for using system resources more
efficiently. Whereas cgroups allow you to isolate system resources, containers provide
application and operating system isolation. Containers allow you to run multiple userspace
versions of Linux on the same host without the need of a hypervisor. You can isolate
environments and control how resources are allocated without the virtualization overhead.
Linux Containers are similar to Oracle Solaris Zones in that they are virtualization at the
application level, above the kernel. There is one operating system kernel that is shared by
many zones or containers. Because the kernel is shared, you are limited to the modules and
drivers that it has loaded. The difference between zones and containers is more at the
implementation level and in the way each is integrated into the operating system.
Containers rely on the cgroups functionality but also rely on namespace isolation, similar to
chroot. Within each container, processes can have their own private view of the operating
system with its own process ID space, file system structure, and network interfaces.
Containers can be useful for:
• Running different copies of application configurations on the same server
• Running multiple versions of Oracle Linux on the same server
• Creating sandbox environments for testing and development
• Controlling the resources allocated to user environments
You can also use Btrfs subvolumes as a way to quickly create containers.
Containers provide resource management through control groups (cgroups) and resource
isolation through namespaces. They use these functionalities to provide a userspace
container object that can then provide full resource isolation and resource control for an
application or a system.
Containers rely on a set of kernel functionalities, such as namespaces, control groups,
networking, and file capabilities, to be active. Beginning with kernel 2.6.29, LXC is
fully functional. Running with an older kernel causes LXC to work with restricted
functionality, or can even cause LXC to fail. You can use the lxc-checkconfig
command to get information about your kernel configuration. This utility reads the
/proc/config.gz file if it is found, or reads the /boot/config* file for your active kernel
version and displays the status (enabled/disabled) of kernel functionalities.
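For example (output abbreviated; the exact items reported depend on the kernel):
# lxc-checkconfig
Namespaces: enabled
Cgroup: enabled
...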
Before running an application in a container, identify the resources to isolate. By default, the
process IDs, the SysV IPCs, and the mount points are isolated. With the default configuration,
you can run a simple shell command within a container. When running an application, for
example sshd, provide a new network stack and a new host name. To avoid container
conflicts, specify a root file system for the container. Running a full system in a container is
simpler than running a single application because you do not need to choose specific resources
to isolate: everything is isolated when a system runs in a container.
The Linux Containers feature takes the cgroups resource management facilities as its basis
and depends on the cgroup file system being mounted. This requires that you install the
libcgroup software package, configure the cgconfig service to start at boot time, and
start the cgconfig service. Use the following commands to perform these tasks:
# yum install libcgroup
# chkconfig cgconfig on
# service cgconfig start
The LXC packages are also required to create and use Linux Containers. Install these
packages as follows:
# yum install lxc lxc-libs
The Btrfs file system can be used for a container repository. New instances can then be
cloned and spawned quickly, without requiring significant additional disk space. Btrfs allows
you to create a subvolume that contains the base template for the containers, and to create
containers from writable snapshots of the template. Install the btrfs-progs package to use
Btrfs as follows:
# yum install btrfs-progs
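As a sketch, you might place the container object directory on Btrfs so that container clones can be created as Btrfs snapshots; the /dev/sdb device is illustrative:
# mkfs.btrfs /dev/sdb
# mount /dev/sdb /container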
Template scripts define the settings for the different system resources that are assigned to a
running container. Each template script defines different resources but examples include the
container host name, network configuration, mount points, the root file system, number of
available ttys, and other settings. The following is a list of the current template scripts:
# ls /usr/share/lxc/templates
lxc-altlinux lxc-archlinux lxc-busybox lxc-debian lxc-fedora
lxc-lenny lxc-opensuse lxc-oracle lxc-sshd lxc-ubuntu
lxc-ubuntu-cloud
Each template script is used to generate a specific type of Linux Container. The
lxc-altlinux script generates an ALT Linux container, the lxc-busybox script
generates a BusyBox container, the lxc-oracle script generates an Oracle Linux
container, and so on.
The following example uses the lxc-busybox template to create a BusyBox container
named bb_cont. The -n option provides the container name and the -t option specifies the
template to use. Omit the lxc- portion of the template script:
# lxc-create -n bb_cont -t busybox
Use the lxc-create command to create a container. This command creates a system
object directory in /container, which stores configuration and user information. The object
provides a definition of the different resources that an application can use or can see. The
syntax is:
lxc-create -n <container_name> [-f <config_file>] [-t <template>]
[-B <backing-store>] [-- template-options]
The options are described as follows:
• -n <container_name> – Specifies the name of the container. This name becomes
the name of the system object directory, /container/<container_name>.
• -f <config_file> – Specifies the file used to configure the virtualization and
isolation functionality of the container. If a configuration file is not specified, the container
is created with the default isolation: processes, SysV IPC, and mount points.
• -t <template> – Specifies the short name of an existing template script. Template
scripts are located in /usr/share/lxc/templates. Configuration information is
provided in these template scripts.
• -B <backing-store> – Specifies either none, btrfs, or lvm as the backing-store
• -- template-options – Specifies arguments to pass to the template
The lxc-oracle template currently supports creating an Oracle Linux 5.x container and an
Oracle Linux 6.x container using packages from public-yum.oracle.com to create a new
rootfs. In addition, this template can create a container based on an existing rootfs using an
Oracle VM template, which was an option with the lxc-0.7.5 RPM. The template is
accepted upstream for inclusion in lxc-0.9. Oracle includes this template in lxc-0.8.
The following example uses the lxc-oracle template to create an Oracle Linux 6.3
container named ol63-32 from public-yum with i386 packages:
# lxc-create -n ol63-32 -t oracle -- -a i386 -R 6.3
To use the template and to create the latest OL container from public-yum with x86_64
packages:
# lxc-create -n ol6L-64 -t oracle -- -a x86_64 -R 6.latest
The following example creates the latest OL container from an existing rootfs. The container
rootfs is a Btrfs snapshot if /path/to/rootfs is a Btrfs subvolume on the same Btrfs file
system.
# lxc-create -n ol6L-64 -t oracle -- -t /path/to/rootfs
If no template options are specified, the defaults are the same architecture and Oracle Linux
version as the host.
System containers emulate an entire Linux OS booting from scratch. They provide their own
init program, so lxc-init is not needed. Use lxc-start for system containers and use
lxc-execute for running a single program, which generally shares the rootfs with the host
system. The complete syntax of lxc-start is as follows:
lxc-start -n <name> [-f <config_file>] [-c <console_file>]
[-d] [-s <KEY=VAL>] [<command>]
This command runs the specified [<command>] inside the container, which is identified by
the -n <name> argument. If no [<command>] is specified, lxc-start uses the default
/sbin/init to run a system container.
If the container does not exist, that is, it was not already created by the lxc-create
command, the container is set up as defined in the <config_file>. If no configuration is
defined, the default isolation (processes, SysV IPC, and mount points) is used.
The remaining options are described as follows:
• -d – Specifies to run the container as a daemon. If an error occurs, nothing is displayed.
• -c <console_file> – Specifies a file to output the container console. By default,
output is written to the terminal if the -d option is not specified.
• -s <KEY=VAL> – Assigns a value (VAL) to a configuration key (KEY). This overrides
assignments in the configuration file.
Use the lxc-ls command to display the containers that the host system is running.
# lxc-ls
ol6ctrl
The command accepts the options of the ls command.
Use the lxc-info command to display the state of a container.
# lxc-info -n ol6ctrl
state: STOPPED
pid: -1
A container can be in one of the following states: ABORTING, RUNNING, STARTING,
STOPPED, STOPPING, or FROZEN. Start the container, and then display the state, which
shows that it is RUNNING.
# lxc-start -n ol6ctrl -d
# lxc-info -n ol6ctrl
state: RUNNING
pid: 2677
If the container is in RUNNING state and is configured with ttys (that is, /sbin/mingetty
processes have started running in the container), you can use the lxc-console command
to access the container through the ttys. The syntax of lxc-console is as follows:
lxc-console -n <name> [-t <ttynum>]
You can optionally specify the tty number to connect. If it is not specified, a tty number is
automatically selected by the container. You are prompted for a login and a password:
# lxc-console -n ol6ctrl
ol6ctrl login: root
password:
Use the following configuration entry to define the number of ttys available:
lxc.tty = #
To exit the lxc-console session, type Ctrl + A followed by Q.
You can also log in using the ssh command if the container has an IP address assigned to
the virtual network interface and the container has the /usr/sbin/sshd process running.
Use the lxc-freeze command to stop all processes that belong to a container, for example,
to accommodate job scheduling. This command puts all the processes in an uninterruptible
state. The state changes to FROZEN.
# lxc-freeze -n ol6ctrl
# lxc-info -n ol6ctrl
state: FROZEN
pid: 2677
Use the lxc-unfreeze command to resume all processes.
# lxc-unfreeze -n ol6ctrl
# lxc-info -n ol6ctrl
state: RUNNING
pid: 2677
This feature is enabled only if the cgroup freezer subsystem is enabled in the kernel.
When a container is started, a control group (cgroup) is created and associated with the
container. Use the lxc-cgroup command to view and modify the cgroup subsystem
parameters set in the container. To display the current value of a cgroup subsystem
parameter, provide the parameter as an argument.
# lxc-cgroup -n ol6ctrl cpu.shares
1024
To set the value of a cgroup subsystem parameter, pass the parameter and the value as
arguments.
# lxc-cgroup -n ol6ctrl cpu.shares 500
# lxc-cgroup -n ol6ctrl cpu.shares
500
To make the cgroup subsystem values persistent, add the settings to the container’s
configuration file or add the settings to the template script before creating the container.
lxc.cgroup.cpu.shares = 500
lxc.cgroup.devices.allow = c 1:3 rwm
Perform the following steps to use an Oracle VM template as a base environment for the
lxc-oracle script:
1. Download an Oracle VM template from http://edelivery.oracle.com/linux into a temporary
directory. This temporary directory needs to be large enough to accommodate a 12 GB
System.img file.
2. The downloaded file name begins with a V and has a .zip extension. Use the unzip
utility to unzip this file. The example OVM template file name is V35123-01.zip.
# unzip V35123-01.zip
Archive: V35123-01.zip
inflating: OVM_OL6U3_x86_64_PVM.ova
3. In this example, extracting the zip file results in OVM_OL6U3_x86_64_PVM.ova. Use
the tar utility to untar the VM template:
# tar xvf OVM_OL6U3_x86_64_PVM.ova
OVM_OL6U3_x86_64_PVM.ovf
OVM_OL6U3_x86_64_PVM.mf
System.img
[Slide figure: an iSCSI target server and several initiator clients connected over an Ethernet/IP network.]
Internet Small Computer System Interface (iSCSI) is an IP-based standard for connecting
storage devices. iSCSI uses IP networks to encapsulate SCSI commands, allowing data to be
transferred over long distances. iSCSI provides shared storage among a number of client
systems. Storage devices are attached to servers (targets) and client systems (initiators)
access the remote storage devices over IP networks. To the client systems, the storage
devices appear to be locally attached. iSCSI uses the existing IP infrastructure and does not
require the additional cabling that Fibre Channel (FC) storage area networks do.
This slide shows a simple Ethernet network with storage attached to an iSCSI server (target)
and several client systems (initiators) able to access the shared storage over the network.
The client initiators send SCSI commands over the IP network to the iSCSI target.
The iSCSI target can be a dedicated network-connected storage device but can also be a
general-purpose computer, as is the case in this slide. Software to provide iSCSI target
functionality is available for Oracle Linux and other operating systems. There are also
specific-purpose operating systems that implement iSCSI target support. Two examples of
open source network storage solutions are FreeNAS and Openfiler. The iSCSI part of this
course focuses on the implementation of iSCSI targets and initiators in Oracle Linux.
The configuration file for defining targets and LUNs is /etc/tgt/targets.conf. This file
uses an HTML-like structure with tags to define targets and LUNs. The file contains several
sample configurations, all of which are commented out by default. An example is as follows:
<target iqn.2012-01.com.example.host01:target1>
direct-store /dev/sdb # LUN 1
direct-store /dev/sdc # LUN 2
</target>
In this example, the target is identified by an “iqn” identifier. This is an iSCSI Qualified Name
(IQN), which uniquely identifies a target. IQN format addresses are most commonly used to
identify a target. This address consists of the following fields:
• Literal iqn
• Date (in yyyy-mm format) that the naming authority took ownership of the domain
• Reversed domain name of the authority
• Optional ":" that prefixes a storage target name specified by the naming authority
Other address types include Extended Unique Identifier (EUI) and T11 Network Address
Authority (NAA). The sample configuration file uses only IQN target addresses.
The tgtadm utility is a binary executable file that is used to monitor and configure iSCSI
targets. The following is a partial list of options for the tgtadm utility:
• -L|--lld <driver> – Specify a low-level driver. Valid values are iscsi and iser.
The default is iscsi.
• -m|--mode <entity> – Specify the entity to perform an action on. Valid values are
target, logicalunit, and account.
• -o|--op <action> – Specify the operation or action to perform. Valid values are new,
delete, show, update, bind, and unbind.
• -t|--tid <id> – Specify the target ID.
• -T|--targetname <name> – Specify the target name.
• -F|--force – Use with the delete action to delete an active I_T nexus [a
relationship between an initiator (I) and a target (T)].
• -n|--name <param>, --value <value> – Use with the update action to change
the value <value> of a parameter <param>.
• -l|--lun <lun> – Specify a logical unit number.
• -I|--initiator-address <address> – Specify an initiator address for access
control.
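For example, a new target could be created and then displayed with commands like the following sketch; the target ID is illustrative and the target name reuses the targets.conf example above:
# tgtadm --lld iscsi --mode target --op new --tid 1 \
  --targetname iqn.2012-01.com.example.host01:target1
# tgtadm --lld iscsi --mode target --op show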
The tgt-admin Perl script uses tgtadm commands to create, delete, and show target
information. For example, the following tgt-admin command executes the subsequent
tgtadm command:
# tgt-admin --show
# tgtadm --op show --mode target
The following is a partial list of options for the tgt-admin script:
• -e|--execute – Read /etc/tgt/targets.conf and execute tgtadm commands.
• -d|--delete <value> – Delete all or selected targets.
• --offline <value> – Put all or selected targets in offline state.
• --ready <value> – Put all or selected targets in ready state.
• --update <value> – Update configuration for all or selected targets.
• -s|--show – Show all the targets.
• -c|--conf <conf file> – Specify an alternative configuration file.
• -h|--help – Display script usage.
• Example:
# tgt-setup-lun -d /OVS/sharedDisk/physDisk6.img -n new_tgt 192.0.2.101 192.0.2.102
Creating the new target (iqn.2001-04.com.<hostname>-new_tgt)
...
The tgt-setup-lun Bash script creates a target, adds a device to the target, and defines
initiators that can connect to the target. The tgtd daemon must be running to use this script.
The syntax is as follows:
tgt-setup-lun -d device -n target_name [initiator_IP1...]
Provide a simple target name as the -n argument. The script appends an IQN format
address, including the host name, to the target name you provide. If IP addresses are defined,
they are added to the access list of allowed initiators for that target. If no IP addresses are
specified, the target accepts any initiator.
The following example creates a target that uses /OVS/sharedDisk/physDisk6.img and
allows connections from only IP addresses 192.0.2.101 and 192.0.2.102:
# tgt-setup-lun -d /OVS/sharedDisk/physDisk6.img -n new_tgt 192.0.2.101 192.0.2.102
Creating the new target (iqn.2001-04.com.<hostname>-new_tgt)
Adding a logical unit (/OVS/sharedDisk/physDisk6.img) to the
target
Accepting connections only from 192.0.2.101 192.0.2.102
An iSCSI client functions as an SCSI initiator to access target devices on an iSCSI server. An
iSCSI initiator sends SCSI commands over an IP network. There are two types of iSCSI
initiators.
• Software initiator: A kernel-resident device driver uses the existing network interface
card (NIC) and network stack to emulate SCSI devices. Some network cards offer
TCP/IP Offload Engines (TOE), which perform much of the IP processing that is
normally performed by server resources. These TOE cards have a built-in network chip,
which creates the TCP frames. The Linux kernel does not support TOE directly; so the
card vendors write drivers for the OS.
• Hardware initiator: This uses a dedicated iSCSI Host Bus Adaptor (HBA) to implement
iSCSI. Hardware initiators provide better performance because the processing is
performed by the HBA rather than the server. You can also boot a server from iSCSI
storage, which you cannot do with software initiators. The downside is that the cost of an
iSCSI HBA is much higher than an Ethernet NIC.
Oracle Linux and most operating systems provide software initiator functionality.
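On Oracle Linux, the software initiator is provided by the iscsi-initiator-utils package, which supplies the iscsiadm utility used in the examples that follow. A typical setup sketch:
# yum install iscsi-initiator-utils
# service iscsi start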
Discovery is the process that makes the targets known to an initiator. It defines the method by
which the iSCSI targets are found. The following discovery methods are available:
• SendTargets: This is a native iSCSI protocol that allows an iSCSI server to send a list
of available targets to the initiator.
• Service Location Protocol (SLP): Servers use the SLP to announce available targets.
The initiator can implement SLP queries to get information about these targets.
• Internet Storage Name Service (iSNS): Targets are discovered by interacting with one
or more iSNS servers. iSNS servers record information about available targets. Initiators
pass the address and an optional port of the iSNS server to discover targets.
• Static: The static target address is specified.
The following example uses the SendTargets discovery method to discover targets on IP
address 192.0.2.1. This command also starts the iscsid daemon if needed.
# iscsiadm -m discovery --type sendtargets -p 192.0.2.1
Starting iscsid: [ OK ]
192.0.2.1:3260,1 iqn.2011-12.com.example.mypc:tgt1
192.0.2.1:3260,1 iqn.2011-12.com.example.mypc:tgt2
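To use static discovery instead, you can add a node record manually. The following sketch
reuses the target name and portal discovered above; the --op new operation creates the
record without contacting the server:
# iscsiadm -m node --op new --targetname iqn.2011-12.com.example.mypc:tgt1 --portal 192.0.2.1:3260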
After targets have been discovered by the initiator, the initiator needs to log in to access the
LUNs. Logging in establishes a session, which is a TCP connection between an initiator node
port and a target node port. Before logging in, no sessions are active. For example:
# iscsiadm -m session
iscsiadm: No active sessions
Use the following command to log in and establish a session. When a session is established,
iSCSI control, data, and status messages are communicated over the session.
# iscsiadm -m node -l
To log in to a specific target:
# iscsiadm -m node --targetname iqn.2011-12.com.example.mypc:tgt1 -p 192.0.2.1:3260 -l
Login to [iface: default, target: iqn.2011-12.com.example.mypc:tgt1, portal: 192.0.2.1:3260] successful.
To log off all targets:
# iscsiadm -m node -u
To log off from a specific target, use the same command as for logging in, but replace -l with -u.
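For example, to log off from the specific target shown earlier:
# iscsiadm -m node --targetname iqn.2011-12.com.example.mypc:tgt1 -p 192.0.2.1:3260 -u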
After establishing a session, the LUNs are represented as block devices in the /dev
directory. In an Oracle VM Server environment, disk devices appear as virtual block devices
(xvd). This example also shows LVM volumes for the root and swap partitions:
# fdisk -l | grep /dev
Disk /dev/xvda: 12.9 GB, 12884901888 bytes
/dev/xvda1 * 1 64 512000 83 Linux
/dev/xvda2 64 1567 12069888 8e Linux LVM
Disk /dev/xvdb: 10.7 GB, 10737418240 bytes
Disk /dev/xvdd: 10.7 GB, 10737418240 bytes
Disk /dev/mapper/vg_host01-lv-root: 9202 MB, 9202302976 bytes
Disk /dev/mapper/vg_host01-lv-swap: 3154 MB, 3154116608 bytes
After establishing a session, the iSCSI LUNs appear as SCSI (sd) devices:
# fdisk -l | grep /dev/sd
Disk /dev/sda: 10.7 GB, 10737418240 bytes
Disk /dev/sdb: 10.7 GB, 10737418240 bytes
Device-Mapper Multipath (DM-Multipath) is a native Linux multipath tool that allows you to
aggregate multiple I/O paths between a server and a storage device into a single logical
device. Multiple paths to storage devices provide redundancy and failover capabilities, as
well as improved performance and load balancing.
DM-Multipath can operate in either of the following configurations:
• Active/Passive (or Standby) – Only half the paths are used for I/O. If any component in
the active path fails, DM-Multipath switches I/O to the alternate path.
• Active/Active – DM-Multipath can be configured to spread I/O across all paths in a
round-robin fashion, or to balance the load dynamically.
The slide shows a simple DM-Multipath configuration through a storage area network (SAN).
There are two I/O paths in this example:
• Host bus adapter (hba1), through the SAN, to Controller (ctrl1)
• Host bus adapter (hba2), through the SAN, to Controller (ctrl2)
Without DM-Multipath, each I/O path is a separate device even though the path connects the
same server to the same storage device. DM-Multipath creates a single multipath device on
top of the underlying devices.
The main configuration file for DM-Multipath is /etc/multipath.conf. This file is not
created by the initial installation of the RPM package. However, the following sample files are
installed in the /usr/share/doc/device-mapper-multipath-<version> directory:
• multipath.conf – Basic configuration file with some examples for DM-Multipath. This
file is used to create the /etc/multipath.conf file.
• multipath.conf.defaults – A complete list of the default configuration values for
the storage arrays supported by DM-Multipath
• multipath.conf.annotated – A list of configuration options with descriptions
The configuration file contains entries that use the following form:
<section> {
<attribute> <value>
<subsection> {
<attribute> <value>
}
}
defaults {
        udev_dir                /dev
        polling_interval        10
        path_selector           "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        prio                    alua
        path_checker            readsector0
        ...
}
A partial list of attributes defined in the default section of the configuration file is as follows:
• udev_dir – Directory where udev creates device nodes. The default is /dev.
• polling_interval – Interval in seconds that paths are checked. The default is 5
seconds.
• path_selector – One of the following path selector algorithms to use:
- round-robin 0: Loop through every path sending the same amount of I/O to
each. This is the default.
- queue-length 0: Send I/O down a path with the least amount of outstanding I/O.
- service-time 0: Send I/O down a path based on the amount of outstanding I/O
and relative throughput.
• path_grouping_policy – Paths are grouped into path groups. The policy
determines how path groups are formed. There are five different policies.
- failover: One path per priority group
- multibus: All paths in one priority group. This is the default.
- group_by_serial: One priority group per storage controller (serial number)
- group_by_prio: One priority group per priority value
- group_by_node_name: One priority group per target node name
blacklist {
        wwid "*"                    # blacklist all devices by WWID
        devnode "^sd[a-z]"          # blacklist all SCSI devices
        device {                    # blacklist by device type
                vendor "COMPAQ  "
                product "HSV110 (C)COMPAQ"
        }
}
Use the blacklist section in the /etc/multipath.conf file to exclude devices from
being grouped into a multipath device. You can blacklist devices using any of the following
identifiers. Use the same identifiers in the blacklist_exceptions section.
• WWID
• Device Name: Use the devnode keyword.
• Device Type: Use the device subsection.
The following example uses all three identifiers to blacklist devices:
blacklist {
        wwid 360000970000292602744533030303730
        devnode "^sd[a-z]"          # blacklist all SCSI devices
        device {
                vendor "COMPAQ  "
                product "HSV110 (C)COMPAQ"
        }
}
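To exempt a device that would otherwise match a blacklist entry, list it in the
blacklist_exceptions section using the same identifier types. A minimal sketch, with a
placeholder for the WWID of the device to keep:
blacklist_exceptions {
        wwid <wwid-to-keep>
}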
multipaths {
        multipath {
                wwid            3600508b4000156d700012000000b0000
                alias           yellow
                failback        manual
                no_path_retry   5
        }
        multipath {
                wwid            360000970000292602744533032443941
                alias           <alias>
        }
}
Set attributes in the multipaths section of the configuration file for each individual multipath
device. These attributes apply to a specified multipath and override the attributes set in the
defaults and devices sections.
This slide shows the settings that override the failback and no_path_retry default
settings for the first WWID and set aliases for both WWIDs. Valid values for the
no_path_retry attribute are:
• <n> – The number of retries until multipath stops the queueing and fails the path
• fail – Specifies immediate failure (no queueing)
• queue – Never stop queueing (queue forever until the path comes alive)
devices {
        device {
                vendor            "SUN"
                product           "(StorEdge 3510|T4"
                getuid_callout    "/sbin/scsi_id --whitelisted --device=/dev/%n"
                features          "0"
                hardware_handler  "0"
                path_selector     "round-robin 0"
                ...
        }
}
DM-Multipath includes support for the most common storage arrays. These supported storage
arrays and their default configuration values are stored in the /usr/share/doc/device-
mapper-multipath-<version>/multipath.conf.defaults file. You can copy and
paste from this file into the /etc/multipath.conf file to add a storage device.
To add a storage device that is not supported by default, obtain the vendor, product, and
revision information from the sysfs file system for the storage device and add this to the
/etc/multipath.conf file. View the following files to obtain this information:
• /sys/block/device_name/device/vendor – Vendor information
• /sys/block/device_name/device/model – Product information
• /sys/block/device_name/device/rev – Revision information
Include additional device attributes as required. The slide shows sample device settings for
the Sun StorEdge 3510|T4 storage arrays.
By default, multipath devices are identified by a World Wide Identifier (WWID), which is
globally unique. As an alternative, you can set the user_friendly_names attribute to yes
in the /etc/multipath.conf file, which sets the multipath device to mpathN where N is
the multipath group number.
The DM-Multipath tool uses two different sets of file names:
• /dev/dm-N – Never use these device names because they are intended to be used
only by the DM-Multipath tool.
• /dev/mapper/mpathN – Always use these device names to access the multipath
devices. These names are persistent and are automatically created by device-mapper
early in the boot process.
You can also use the alias attribute in the multipaths section of the configuration file to
specify the name of a multipath device. The alias attribute overrides mpathN names.
Use either the /dev/mapper/mpathN name or the /dev/mapper/<alias> name when
creating a partition, creating an LVM physical volume, or making and mounting a
file system.
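For example, a minimal sketch that creates and mounts a file system on the multipath device
aliased yellow in the earlier example (the mount point /u02 is arbitrary):
# mkfs.ext4 /dev/mapper/yellow
# mkdir /u02
# mount /dev/mapper/yellow /u02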
Use the mpathconf utility to configure DM-Multipath. The mpathconf utility creates or
modifies the /etc/multipath.conf file, using a copy of the sample multipath.conf file
in the /usr/share/doc/device-mapper-multipath-<version> directory as a
template if necessary.
Run mpathconf --help to display usage:
# mpathconf --help
usage: /sbin/mpathconf <command>
Commands:
Enable: --enable
Disable: --disable
Set user_friendly_names (Default n): --user_friendly_names <y|n>
Set find_multipaths (Default n): --find_multipaths <y|n>
Load the dm-multipath modules on enable (Default y): --with ...
start/stop/reload multipathd (Default n): --with_multipathd ...
chkconfig on/off multipathd (Default y): --with_chkconfig <y|n>
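For example, the following invocation (a sketch, not the only valid form) enables
multipathing, turns on user-friendly names, and starts the multipathd daemon:
# mpathconf --enable --user_friendly_names y --with_multipathd y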
The multipath utility is the device mapper target auto-configurator, which is used to detect
and configure multiple paths to devices. Use the following command to display usage:
# multipath -h
Some of the available options are described as follows:
• -v <verbosity> – Specify the verbosity level when displaying paths and multipaths.
• -l – List the multipath topology.
• -ll – List the maximum multipath topology information.
• -f – Flush a multipath device map. Use -F to flush all multipath device maps.
• -c – Check if a device should be a path in a multipath device.
• -p failover | multibus | group_by_serial | group_by_prio
|group_by_node_name – Force all maps to the specified path grouping policy.
• -r – Force device map reload.
You can optionally specify a device name to update only the device map that contains the
specified device. Use the /dev/sd# format, or the major:minor format, or the multipath
map name, for example mpathN, or the WWID to specify a device.
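For example, assuming a map named mpath0:
# multipath -ll
# multipath -ll mpath0
The first command lists the full topology for all multipath devices; the second limits the
output to the single device map.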
The multipathd daemon checks for failed paths and reconfigures the multipath map. The
daemon runs the multipath utility when events occur that require a device map
reconfiguration. Use the -d option to run the daemon in the foreground, where it displays
all messages. Set the verbosity level with the -v <level> option. The multipathd
daemon also has an interactive mode, enabled with the -k option. Some of the
commands available in interactive mode include:
• help – List the available interactive commands.
• list|show paths – Show the paths that multipathd is monitoring and their state.
• list|show maps|multipaths – Show the multipath devices that multipathd is
monitoring.
• list|show topology – Show the multipath topology. This gives the same output as
using the multipath -ll command.
• list|show config – Show the current configuration that is derived from
/etc/multipath.conf.
• reconfigure – Reconfigures the multipaths. This is triggered automatically after any
hotplug event.
• quit|exit – End the interactive session.
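A brief interactive session might look like the following:
# multipathd -k
multipathd> show topology
multipathd> quit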
Udev is the device file naming mechanism in the 2.6 Linux kernel. Udev dynamically creates
or removes device node files at boot time, by default in the /dev directory, for all types of
devices. Udev is executed if a kernel device is added or removed from the system. On device
creation, Udev reads the sysfs directory of the given device to collect device attributes such
as label, serial number, or bus device number.
Note that device file names can change when disks are removed from the system due to
failure. For example, devices are named /dev/sda, /dev/sdb, and /dev/sdc at boot time.
If /dev/sdb fails, then on the next reboot the disk that was previously /dev/sdc is named
/dev/sdb, and any configuration that references /dev/sdb now points to the wrong disk.
To avoid this situation, guarantee consistent names for devices through reboots. You can
configure Udev to create persistent names and use these names in
an Oracle RAC configuration, for Oracle Cluster Registry (OCR), Voting Disks, and Automatic
Storage Management (ASM) disks. The main configuration file is /etc/udev/udev.conf.
Define the following variables in this file:
• udev_root – Location of the device nodes in the file system. The default is /dev.
• udev_log – The logging priority. Valid values are err, info, and debug.
Udev rules determine how to identify disks and how to assign a name that is persistent
through reboots or disk changes. When Udev receives a device event, it matches the
configured rules against the device attributes in sysfs to identify the device. Rules can also
specify additional programs to run as part of device event handling.
Udev rules files are located in the following directories:
• /lib/udev/rules.d/ – The default rules directory
• /etc/udev/rules.d/ – The custom rules directory. These rules take precedence.
• /dev/.udev/rules.d/ – The temporary rules directory
Rule files are required to have unique names. Files in the custom rules directory override files
of the same name in the default rules directory. Rule files are sorted and processed in lexical
order. The following is a partial listing of rule files from the default and custom rules
directories:
# ls /lib/udev/rules.d
10-console.rules 10-dm.rules 11-dm-lvm.rules 13-dm-disk.rules
# ls /etc/udev/rules.d
40-hplip.rules 56-hpmud_support.rules 60-pcmcia.rules
SUBSYSTEM=="block", SYMLINK{unique}+="block/%M:%m"
SUBSYSTEM!="block", SYMLINK{unique}+="char/%M:%m"
KERNEL=="tty[0123456789abcdef]", GROUP="tty", MODE="0660"
# mem
KERNEL=="mem|kmem|port|nvram", GROUP="kmem", MODE="0640"
# block
SUBSYSTEM=="block", GROUP="disk"
# network
The udevadm utility is a userspace management tool for Udev. Among other functions, you
can use udevadm to query sysfs and obtain device attributes to help in creating Udev rules
that match a device. To display udevadm usage:
# udevadm --help
Usage: udevadm [--help] [--version] [--debug] COMMAND [OPTIONS]
info query sysfs or the udev database
trigger requests events from the kernel
settle wait for the event queue to finish
control control the udev daemon
monitor listen to kernel and udev events
test simulation run
You can also obtain usage for each of the udevadm commands. For example, to get help on
using the info command:
# udevadm info --help
Usage: udevadm info OPTIONS ...
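For example, to collect the attributes of a device for use in rule matches, assuming the
device of interest is /dev/sdb:
# udevadm info --query=all --name=/dev/sdb
# udevadm info --attribute-walk --name=/dev/sdb
The --attribute-walk output lists the sysfs attributes of the device and its parents, which
you can copy directly into rule match keys.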
The order in which rules are evaluated is important. When creating your own rules, you want
these evaluated before the defaults. Because rules are processed in lexical order, create a
rule file with a file name such as /etc/udev/rules.d/10-local.rules for it to be
processed first. The following rule renames /dev/sdb to /dev/my_disk:
KERNEL=="sdb", SUBSYSTEM=="block", NAME="my_disk"
Before restarting Udev, the current /dev/sd* output is:
# ls /dev/sd*
/dev/sda /dev/sda1 /dev/sda2 /dev/sdb /dev/sdc /dev/sdd
Run start_udev to process the rules files:
# start_udev
Starting udev: [ OK ]
The file name has changed:
# ls /dev/sd* /dev/my*
/dev/my_disk /dev/sda /dev/sda1 /dev/sda2 /dev/sdc /dev/sdd
Remove the 10-local.rules file and run start_udev to return to default names.
OCFS2 is a general-purpose, POSIX-compliant, shared-disk cluster file system. It allows
multiple nodes to read from and write to files on the same disk at the same time. It behaves
on all nodes exactly like a local file system, and can be used in a non-clustered environment.
Files, directories, and their content are always in sync across all nodes regardless of which
node updates the files. For example, if you add a line to a file on node 1, it appears
immediately on node 2.
OCFS (Release 1) was released in December 2002 to enable Oracle Real Application Clusters
(RAC) users to run the clustered database without having to deal with raw devices. The file
system was designed to store database-related files, such as data files, control files, redo
logs, and archive logs. OCFS refers to the file system release for the 2.4 Linux kernel.
OCFS2 is the next generation of the Oracle Cluster File System. It is designed to be a general
purpose cluster file system. You can store any files on OCFS2 and you interact with OCFS2
the same way you do on a regular file system. OCFS2 is being used in middleware clusters
(Oracle E-Business Suite), appliances such as SAP’s Business Intelligence Accelerator, and
virtualization with Oracle VM Server.
OCFS2 was developed as a native Linux cluster file system at Oracle. It was submitted for
inclusion and accepted into the mainline kernel in January 2006. It was included in the 2.6.16
kernel and is the first native cluster file system to be included in the Linux kernel.
The high-level steps to install and use OCFS2 are summarized in this slide.
This slide illustrates a sample configuration with shared storage for cluster nodes, a private
network between the nodes (192.168.1.0), and the requirement to allow access on the
private network for TCP and UDP port 7777.
Each node in the cluster requires access to shared storage. For cluster file systems, typically
this shared storage device is a shared SAN disk, an iSCSI device, or a shared virtual device
on an NFS server.
In the example in the slide, each node (guest VM) shares /dev/xvdb as configured with the
following entry in the vm.cfg files for each VM:
[dom0]# grep hdb /OVS/running_pool/host*/vm.cfg
host01/vm.cfg: ‘file:/OVS/sharedDisk/physDisk1.img,hdb,w!’,
host02/vm.cfg: ‘file:/OVS/sharedDisk/physDisk1.img,hdb,w!’,
host03/vm.cfg: ‘file:/OVS/sharedDisk/physDisk1.img,hdb,w!’,
A shared disk is the same as a normal virtual disk, except that it uses w! in vm.cfg instead of
just w. The entries in all vm.cfg files point to the same .img file with the same hdb (which
translates to xvdb) identifier with w!.
It is also recommended that you configure a private interconnect between the nodes.
You must set two kernel settings for the O2CB cluster stack to function properly. The first
kernel setting is panic_on_oops, which you must enable to change a kernel oops into a
panic. If a kernel thread required for O2CB crashes, the system is reset to prevent a cluster
hang.
The other kernel setting is panic, which specifies the number of seconds after a panic that
the system is automatically reset. The default setting is zero, which disables auto-reset, in
which case the cluster requires manual intervention. The recommended setting is 30 seconds
but you can set it higher for large systems.
To manually enable panic_on_oops and set a 30-second timeout for reboot on panic:
# echo 1 > /proc/sys/kernel/panic_on_oops
# echo 30 > /proc/sys/kernel/panic
To make these settings persistent, add the following entries to the /etc/sysctl.conf file:
kernel.panic_on_oops = 1
kernel.panic = 30
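Then load the new settings without waiting for a reboot:
# sysctl -p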
node:
        name = host01
        cluster = mycluster
        number = 0
        ...
node:
        name = host02
        cluster = mycluster
        number = 1
        ip_address = 192.168.1.103
        ip_port = 7777
This slide shows a sample cluster layout configuration. By default, the cluster layout
configuration file for OCFS2 is /etc/ocfs2/cluster.conf. It is not necessary to configure
this file when mounting an OCFS2 volume on a stand-alone, non-clustered system.
After creating this configuration file on one node in the cluster, copy the file to all nodes in the
cluster. If you edit this file on any node, ensure that the other nodes are updated as well.
When adding a new node to the cluster, update this configuration file on all nodes before
mounting the OCFS2 file system from the new node.
The example configuration file in the slide has two sections (or stanzas):
• cluster – This stanza specifies the parameters for the cluster. The configuration file
typically contains only one cluster stanza. It can contain multiple cluster stanzas;
however, only one cluster can be active at any time.
• node – This stanza specifies the parameters for the individual nodes in the cluster. The
configuration file typically contains multiple node stanzas.
You can use a text editor to manually create the /etc/ocfs2/cluster.conf file or use the
o2cb utility to create and modify the configuration file. It is recommended that you use the
o2cb command to modify the configuration file. This command ensures that entries in the
configuration file are formatted correctly.
Use the o2cb utility to add, remove, and list information in the cluster layout configuration file.
You can also use this utility to register and unregister the cluster, and to start and stop the
global heartbeat. The syntax for the command is:
o2cb [--config-file=path] [-h|--help] [-v|--verbose] [-V|--version] COMMAND [ARGS]
Use the --config-file=path option to override the default configuration file,
/etc/ocfs2/cluster.conf.
Available COMMAND parameters include:
• add-cluster <cluster-name> – Adds the specified <cluster-name> to the
configuration file. The configuration file can specify multiple clusters but only one cluster
can be active.
• remove-cluster <cluster-name> – Removes the specified <cluster-name>
from the configuration file. This command also removes all nodes and heartbeat regions
associated with the cluster.
• add-node <cluster-name> <node-name> – Adds the specified <node-name> to
the specified <cluster-name> in the configuration file. You can optionally specify an
IP address [--ip <addr>], port number [--port <port>], and node number
[--number <node>].
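For example, a sketch that builds a configuration similar to the one shown earlier (the IP
address for host01 is illustrative):
# o2cb add-cluster mycluster
# o2cb add-node --ip 192.168.1.101 --port 7777 --number 0 mycluster host01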
Connectivity of nodes and storage devices must be assured to prevent file system corruption.
OCFS2 sends regular keepalive packets (heartbeat) to ensure that the nodes are alive. There
are two types of heartbeat modes: local and global.
Local Heartbeat
In local heartbeat mode, a heartbeat is established when a volume is mounted by a node. The
O2CB cluster stack does disk heartbeat on a per-mount basis on an area on disk, called the
heartbeat file, which is reserved during format. The heartbeat thread is started and stopped
automatically during mount and unmount. Each node that has a file system mounted writes
every two seconds to its block in the heartbeat file. Each node opens a TCP connection to
every node that establishes a heartbeat. If the TCP connection is lost for more than 10
seconds, the node is considered dead, even if the heartbeat is continuing.
If a node loses network connectivity to more than half of the heartbeating nodes, it has lost the
quorum and fences itself off from the cluster. A quorum is the group of nodes in a cluster that
are allowed to operate on the shared storage. Fencing is the act of forcefully removing a node
from a cluster. A node with OCFS2 mounted fences itself when it realizes that it does not
have quorum in a degraded cluster. It does this so that other nodes do not attempt to continue
trying to access its resources. OCFS2 panics the node that is fenced off. A surviving node
then replays the journal of the fenced node to ensure that all updates are on disk.
heartbeat:
        region = 5678675678ABCDFE309888C34A0DB6B2
        cluster = mycluster
cluster:
        node_count = 10
        heartbeat_mode = global
        name = mycluster
Refer to https://oss.oracle.com/projects/ocfs2-tools/src/branches/global-
heartbeat/documentation/o2cb/heartbeat-configuration.txt for additional information about
heartbeat configuration. Also see https://blogs.oracle.com/wim/entry/ocfs2_global_heartbeat
for an example of configuring global heartbeat.
Use the /etc/init.d/o2cb initialization script to configure the cluster stack, as well as to
configure the cluster timeout values. Run the following command on each node of the cluster:
# service o2cb configure
Configuring the O2CB driver.
This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is
loaded on boot. The current values will be shown in brackets
('[]'). Hitting <ENTER> without typing an answer will keep that
current value. Ctrl-C will abort.
Load O2CB driver on boot (y/n) [n]: y
Respond with y to load the cluster stack driver at boot time.
Cluster stack backing O2CB [o2cb]: ENTER
Provide the name of the cluster stack service. The default and correct response is o2cb.
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
mycluster
Provide the name of your cluster that you created in the cluster layout configuration file.
Use the mkfs.ocfs2 command to create an OCFS2 volume on a device. You cannot use
this utility to overwrite an existing volume that is mounted by another node in the cluster,
and the utility requires that the cluster service is online. Although not required, it is
recommended that you create OCFS2 volumes only on partitions: only partitioned volumes
can be mounted by label, and labels greatly ease management in a cluster environment.
Some of the available options for mkfs.ocfs2 are listed as follows. All the options, with the
exception of block size and cluster size, can be changed using tunefs.ocfs2.
• -b|--block-size <block-size> – Specifies the smallest unit of I/O performed by
the file system, and the size of inode and extent blocks. Supported block sizes are 512
bytes, 1 KB, 2 KB, and 4 KB. The default size is 4 KB.
• -C|--cluster-size <cluster-size> – Specifies the smallest unit of space
allocated for file data. Supported sizes are 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB,
512 KB, and 1 MB. The default size is 4 KB. When using the volume for storing
database files, do not use a cluster size value smaller than the database block size.
• -L|--label <volume-label> – Specifies a name for the volume. In a cluster,
nodes can detect devices in different order, which could result in the same device
having different names on different nodes. Labeling allows consistent naming for
OCFS2 volumes across a cluster.
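For example, a minimal sketch that labels and formats the shared device used later in this
lesson, using long options and assuming four node slots:
# mkfs.ocfs2 --label myvolume --node-slots 4 /dev/xvdb1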
Use the mount and umount commands to mount and unmount an OCFS2 volume, as you
would with any other type of file system. The following example creates a mount point, and
then mounts the OCFS2 volume by label:
# mkdir /u01
# mount -L myvolume /u01
There is a delay in the clustered mount operation because it must wait for the node to join the
DLM domain. A clustered unmount is also not instantaneous because it involves migrating all
mastered lock-resources to the other nodes in the cluster. If the mount fails, use the dmesg
command to view error messages.
The cluster stack must be online for the mount to succeed. Check the status of the cluster
stack with the following status command. Use the online command to bring the cluster
stack online if necessary.
# service o2cb status
# service o2cb online
Use the tunefs.ocfs2 command to change file system parameters. All parameters with the
exception of block size and cluster size can be changed. The options for tunefs.ocfs2 are
similar to the options for the mkfs.ocfs2 command. Some changes require the cluster
service to be online.
You can also use the tunefs.ocfs2 command with the -Q option to query the file system
for specific attributes. The following example queries the block size (%B), cluster size (%T),
number of node slots (%N), volume label (%V), and volume UUID (%U) for the OCFS2 volume
on /dev/xvdb1, prints a label for each attribute, and displays each attribute on a new line:
# tunefs.ocfs2 -Q "Block Size: %B\nCluster Size: %T\nNode Slots: %N\nVolume Label: %V\nVolume UUID: %U\n" /dev/xvdb1
Block Size: 4096
Cluster Size: 4096
Node Slots: 4
Volume Label: myvolume
Volume UUID: EC42AFE1CF614B0BB03F4ACAA4E5BC28
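For example, to change the volume label (the new label is illustrative):
# tunefs.ocfs2 -L newvolume /dev/xvdb1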
This lesson begins by discussing booting into rescue mode to repair boot problems. The
majority of this lesson describes some of the tools available to assist in diagnosing the cause
of poor performance on an Oracle Linux system. Performance issues can be caused by any
of the components on a system, both software and hardware, and by their interactions. A
multitude of performance diagnostics tools exist in Linux, ranging from tools that collect and
analyze system resource usage on the different hardware components, to tracing tools for
troubleshooting performance issues and debugging problems in multiple processes and
threads.
This lesson also addresses frequently encountered performance problems in Oracle Linux
and common performance tuning parameters. Some of the basic Linux tools for collecting
system resource usage are discussed next. These tools assist in identifying overloaded disks,
network, memory, or CPUs.
This lesson also discusses OS Watcher Black Box (OSWbb) and OS Watcher Black Box
Analyzer (OSWbba). These tools provide continuous data collection and analysis, which
yields details on any performance problems.
The lesson concludes by describing the latencytop and strace utilities.
This slide presents tips for solving some frequently encountered performance problems on an
Oracle Linux system. The first problem involves out-of-memory errors and poor performance
generally when running Oracle Database. The cause is likely that the System Global Area
(SGA) is mapped in 4 KB pages instead of 2 MB pages.
The solution to this problem is to use HugePages. With HugePages, page size can be
increased to anything between 2 MB and 1 GB, thereby reducing the total number of pages to
be managed by the kernel, and therefore reducing the amount of memory required to hold the
page table in memory. In addition to these changes, the memory associated with HugePages
cannot be swapped out, which forces the SGA to stay memory resident. The savings in
memory and the effort of page management make HugePages nearly mandatory for Oracle
database systems running on x86-64 architectures.
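For example, a sketch that reserves huge pages persistently through /etc/sysctl.conf
(the page count is illustrative and depends on the SGA size):
vm.nr_hugepages = 2048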
In the second problem, you need to preserve the data written to the console. Often syslog or
/var/log/messages are not complete due to fencing. Different options for console logging
exist, including serial console or virtual serial console and netconsole. Another option is to set
terminal attributes as follows (this disables blanking the screen during inactivity and disables
the powersave and powerdown features):
# setterm -blank 0 -powersave off -powerdown 0 /dev/console
It is even a good idea to keep a digital camera handy to take a picture of the console output.
Many performance issues are the result of configuration mistakes. These can be avoided by
using validated configurations and by referring to the release notes for recommendations on
settings for specific kernel parameters. Validated configurations are pre-tested and supported
Linux architectures, including software, hardware, storage, drivers, and networking
components. These incorporate best practices for Linux deployment and have undergone
real-world testing of the complete stack. Over 100 validated configurations are published and
freely available for download.
• The net.core.rmem_max and net.core.wmem_max parameters set the maximum
OS socket buffer sizes. To help minimize network packet loss, these buffers must be
large enough to handle incoming network traffic.
• The fs.file-max parameter denotes the maximum number of open files for all
processes. Increase this value when you get messages about running out of file
handles.
• The vm.swappiness parameter is a value between 0 and 100. When set to 0,
swapping occurs only to avoid an out-of-memory condition. When set to 100, Linux
swaps aggressively.
• The vm.panic_on_oom parameter enables or disables a kernel panic in an out-of-
memory (OOM) situation. When set to 0, the kernel’s OOM-killer attempts to kill some
rogue process to avoid a panic. When set to 2, the kernel always panics when an OOM
condition occurs. When set to 1, the kernel normally panics but can survive in certain
conditions.
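These parameters can be set at run time with sysctl for testing; the values below are
illustrative, not recommendations:
# sysctl -w net.core.rmem_max=4194304
# sysctl -w vm.swappiness=10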
System Resource Reporting Utilities
This slide lists some of the basic Linux tools for collecting system resource usage and errors.
Each of these tools is helpful in identifying performance problems caused by overloaded
disks, network, memory, or CPUs. Some of these tools provide the same information but in a
slightly different format. File system statistics are also reported from some of these tools.
Keep in mind that system resource statistics need to be collected and monitored regularly.
Continuous data collection gives you a time-based snapshot of the system. Establish a
baseline of acceptable measurements under typical operating conditions. Use the baseline as
a reference point to make it easier to identify memory leaks, periodic jobs, and other problems
when they occur. Monitoring system performance also allows you to plan for future growth
and how configuration changes can affect future performance.
Several of the utilities listed allow you to monitor CPU usage. When a CPU is occupied by a
process, it is unavailable for processing other requests. Pending requests must wait until the
CPU is free or idle. This becomes a bottleneck in the system. Many of the utilities listed in the
slide provide CPU percent idle statistics. A low percent idle statistic indicates a busy CPU.
The amount of time spent running system code should not exceed 30%, especially if idle time
is close to 0%. A large run queue combined with no idle CPU indicates that the system has
insufficient CPU capacity. When CPU usage is high, it is also important to determine which
processes are responsible.
The mpstat command helps you to identify CPU-related performance problems. The
following example provides statistics at one-second intervals for each individual processor
and globally for all processors. Observe the %idle value, which shows the percentage of
time that the CPUs were idle and the system did not have an outstanding disk I/O request.
# mpstat -P ALL 1
The sar command produces system activity information. The sar command, by default,
displays information from the current daily data file. Use the -f <filename> option to view
information for a different day. The daily data files are stored in /var/log/sa/sa<dd>,
where <dd> is the day’s two-digit date. The following example provides statistics for each
individual processor and globally for all processors. Observe the %idle value.
# sar -P ALL
The following example reports system-wide CPU utilization. Observe the %idle value.
# sar -u
The following example reports queue length and load averages. Observe the runq-sz value,
which is the run queue length, or the number of tasks waiting for run time. This field provides
an indication of CPU saturation and that your system does not have sufficient CPU capacity.
# sar -q
The sar command also reports memory statistics. The following example reports memory
utilization statistics. Observe the %memused value, which is the percentage of used memory.
# sar -r
The following example reports memory paging statistics. Observe pgscank, which is the
number of pages scanned by the kswapd daemon per second, and pgscand, which is the
number of pages scanned directly per second.
# sar -B
The following example reports swapping statistics. Observe pswpin/s, which is the number
of swap pages the system brought in per second, and pswpout/s, which is the number of
swap pages the system brought out per second.
# sar -W
The dmesg command prints the message buffer of the kernel. This command prints out boot
messages, including any physical memory failures detected.
# dmesg
The /proc/meminfo file reports a large amount of valuable information about your system’s
memory usage.
The dstat command generates system resource statistics. Use the -c option to display CPU
statistics. Observe the idl (idle) value, which shows the percentage of time that the CPUs
were idle and the system did not have an outstanding disk I/O request.
# dstat -c
The following example reports process statistics. Observe the run value, which is an
indication of CPU saturation.
# dstat -p
The following example reports memory statistics. Observe the free value.
# dstat -m
The following example reports file system statistics. Observe the files value, which is the
number of open files.
# dstat --fs
The following example uses the sar command to print file system information. This command
reports the status of inode, file, and other kernel tables. Observe the file-nr value, which is
the number of file handles used by the system.
# sar -v
The vmstat command is primarily used to monitor your system’s memory usage but the
command output contains more than just memory statistics. Output is divided into six
sections: procs, memory, swap, io, system, and cpu. The following example reports
statistics at one-second intervals.
# vmstat 1
Observe the id value in the cpu section. This is CPU time spent idle, including I/O wait time.
Also observe the r value in the procs section. This is the number of processes waiting for run
time.
Important memory statistics (seen in the memory column) include free, which is the amount
of idle memory, and si and so, which indicate the amount of memory swapped in from disk
per second and the amount of memory swapped to disk per second, respectively.
The free command also displays memory information. It displays the total amount of free
and used physical and swap memory in your system, as well as the buffers used by the
kernel. By default, units are displayed in kilobytes. The following example displays amounts in
megabytes:
# free -m
The top command provides an ongoing look at processor activity in real time. By default, it
displays a listing of the most CPU-intensive processes or tasks on the system. It provides a
limited interactive interface for manipulating processes. Press h to display help for the interactive commands.
The upper section displays general information such as the load averages during the last 1, 5,
and 15 minutes (same output as uptime command), number of running and sleeping tasks,
and overall CPU and memory usage.
The tasks line displays the total number of processes running at the time of the last update. It
also indicates how many processes exist, how many are sleeping (blocked on I/O or a system
call), how many are stopped (someone in a shell has suspended it), and how many are
actually assigned to a CPU. Like load average, the total number of processes on a healthy
machine usually varies just a small amount over time. Suddenly having a significantly larger
or smaller number of processes could be a warning sign.
The following keys change the output displayed in the upper section:
• l: Toggles load average and uptime on and off
• m: Toggles memory and swap usage on and off
• t: Toggles tasks and CPU states on and off
The dstat command can also report the “top” consumers of system resources. Following are
some of the available options for displaying the top resource consumers:
• --top-bio – Show the most expensive block I/O process.
• --top-cpu – Show the most expensive CPU process.
• --top-cputime – Show the process that is using the most CPU time in milliseconds
(ms).
• --top-io – Show the most expensive I/O process.
• --top-latency – Show the process with the highest total latency in ms.
• --top-mem – Show the process that is using the most memory.
• --top-oom – Show the process to be killed first by the OOM-killer.
The following example illustrates that these options can be combined:
# dstat --top-cpu --top-bio --top-mem
-most-expensive- ----most-expensive---- --most-expensive-
cpu process | block i/o process | memory process
nautilus 0.0|init 445B 186B|nautilus 24.0M
^C
The ip command can display network statistics using the -s option. The following example
displays network statistics for all network devices. Observe the bytes transmitted and received
(TX bytes and RX bytes) values as an indicator of network interface utilization. Network
interface errors are also reported. This information is obtained from the /proc/net/dev file.
# ip -s link
The ifconfig command displays overruns and dropped frames. These fields provide an
indicator of network interface saturation. Frames being received are dropped and the overrun
counter is incremented when the capacity of the interface is exceeded. This command also
reports network interface errors.
# ifconfig
The netstat command displays statistics using the -s option. The following example
displays summary statistics for each protocol. Observe the number of segments
retransmitted as an indicator of network interface saturation.
# netstat -s
This command also displays network interface information, including errors, using the -i
option. Observe the RX-ERR and TX-ERR values for any receive and transmit errors.
# netstat -i
The iostat command is used for monitoring system input/output device loading by
observing the time that the physical disks are active in relation to their average transfer rates.
This information can be used to change system configuration to better balance the
input/output load between physical disks and adapters. The following example reports
extended statistics (-x) at one-second intervals, and excludes devices with no activity (-z).
# iostat -xz 1
Observe the %util value, which is the percentage of CPU time during which I/O requests
were issued to the device. Device saturation occurs when this value is close to 100%. The
following example includes the -n option, which includes an NFS header row followed by
statistics for each network file system that is mounted.
# iostat -xnz 1
Observe the avgqu-sz value, which is the average queue length of the requests that were
issued to the device. Average queue length greater than 1 is an indication of device I/O
saturation.
The following sar command reports similar activity for each block device. Observe the %util
value for bandwidth utilization and avgqu-sz as an indicator of device I/O saturation.
# sar -d
The iotop utility is a Python program that is used for monitoring swap and disk I/O on a per-
process basis. If you are getting more disk activity on your system than you would like, iotop
can help identify the process or processes that are responsible for the excessive I/O.
The iotop utility has a user interface similar to top. The upper section of the output displays
the sum of the DISK READ and DISK WRITE bandwidth in B/s (bytes per second). The lower
section displays I/O bandwidth read (DISK READ) and written (DISK WRITE) by each
process or thread. It also displays the percentage of time the thread or process spent while
swapping in (SWAPIN) and while waiting on I/O (IO). The COMMAND column displays the
name of the process.
By default, iotop monitors all users on the system and all processes. Several options are
available. The following is a partial list of options to iotop:
• -h: Display help and a list of options.
• -o: Show only processes and threads that are actually performing I/O.
• -u USER: Show specific USER processes.
• -a: Show accumulated I/O instead of bandwidth.
Press the letter Q to exit.
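For example, to watch only the processes of one user that are actually performing I/O (the
user name is illustrative):
# iotop -o -u oracle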
The graphical system monitor shown in the slide is included with the GNOME desktop
environment, which requires X Windows. The system monitor allows you to display system
resource usage, basic system information, file system information, and monitor system
processes. Enter the following command to display the monitor:
# gnome-system-monitor
The monitor has four tabs, with the Resource tab shown in the slide. The top graph on the
Resource tab displays CPU usage history in graphical format, and also displays the current
CPU usage as a percentage. The middle graph displays memory usage and swap usage
history in graphical format. Displayed below the graph in the middle is the amount of used
memory out of the total memory and used swap out of the total swap. The graph at the bottom
displays the network history in graphical format. Below this graph is the received data per
second and total received, and the sent data per second and total sent.
Click the File Systems tab to display a list of mounted file systems and the associated device
name, mount point, file system type, total capacity, free space, available space, and used
space.
Click the Processes tab to display a list of processes and associated information, similar to
that which is displayed with the ps command. The System tab displays system properties
such as software versions and hardware specifications and status.
OS Watcher Black Box (OSWbb) is a collection of UNIX shell scripts that collect and archive
operating system and network metrics to aid in diagnosing performance issues. OSWbb
operates as a set of background processes on the server, and gathers OS data on a regular
basis, invoking such utilities as vmstat, netstat, and iostat.
OSWbb also includes an analysis tool, OS Watcher Black Box Analyzer (OSWbba), which
analyzes the log files produced by OSWbb. This tool allows OSWbb to be self-analyzing.
OSWbba also provides a graphing capability to graph the data and to produce an HTML
profile.
Installing OSWbb
OSWbb is available as a tar file through My Oracle Support (MOS). Download the tar file, and
then copy the file where OSWbb is to be installed. In a cluster environment, install OSWbb on
each node. Use the following command to extract the oswbb<version>.tar file. Extracting
the tar file creates and populates the oswbb directory, which completes the installation.
# tar xvf oswbb.tar
Starting OSWbb
A directory named oswbb is created, which stores all the files associated with OSWbb. Within
this directory is the startOSWbb.sh shell script, which starts OSWbb.
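For example, assuming the documented positional parameters of snapshot interval in
seconds and archive retention in hours, the following sketch collects data every 60 seconds
and keeps 10 hours of archives:
# cd oswbb
# ./startOSWbb.sh 60 10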
OS Watcher Black Box Analyzer (OSWbba) is a graphing and analysis utility that comes
bundled with OSWbb v4.0.0 and higher. OSWbba allows you to graphically display the data
that is collected, to generate reports containing these graphs, and provides a built in analyzer
to analyze the data and provide details on any performance problems that it detects. The
ability to create a graph and analyze this information relieves you of manually inspecting all
the files.
OSWbba replaces the OSWg utility. This was done to eliminate the confusion caused by
having multiple tools in support named OSWatcher. OSWbba is supported only for data
collected by OSWbb and no other tool.
OSWbba is written in Java and requires Java version 1.4.2 or higher. OSWbba
can run on any Unix X Windows or PC Windows platform. OSWbba uses Oracle Chartbuilder,
which requires an X Windows environment.
OSWbba parses all the OSWbb vmstat, iostat, and top utility log files contained in the
oswbb/archive directory. After the data is parsed, you are presented with a command-line
menu with options to display graphs, create binary GIF files of those graphs, generate an
HTML report containing all the graphs with a narrative on what to look for, and self-analyze
the files that OSWbb creates.
OSWbba requires no installation. It comes shipped as a stand-alone Java jar file with OSWbb
v4.0.0 and higher.
Select Option A from the OSWbba menu to analyze the files in the archive directory and
produce a report. You must be in the directory where OSWbba is installed to run the
analyzer. You can also run the analyzer directly from the command line by including the -A
option:
# java -jar oswbba.jar -i ~/oswbb/archive -A
New analysis file created in the analysis directory.
The following is a sample analysis file name:
# ls ~/oswbb/analysis
analysis_1359577132282.txt
The analyzer output is divided into sections for easy readability.
• Section 1: Overall Status
• Section 2: System Slowdown Summary Ordered By Impact
• Section 3: Other General Findings
• Section 4: CPU Detailed Findings
• Section 5: Memory Detailed Findings
• Section 6: Disk Detailed Findings
This slide shows latencytop running in a non-graphical interface mode. There is also a GUI
version of this tool. latencytop is a tool to visualize Linux kernel latency statistics. It is
primarily a tool for software developers, both kernel and user space, to identify where latency
is occurring in the system, and what kind of operation or action is causing the latency to
occur. It displays in real time what the kernel is doing. In the text user interface version, you
can walk through the processes using the cursor keys. Latencytop requires the Unbreakable
Enterprise Kernel.
The strace utility traces system calls and signals made by a program. This command is not
particularly useful for performance monitoring, but it is useful when a command or
application fails to start, or for finding out what a program is doing.
A system call is a controlled entry point into the kernel which allows a process to request the
kernel to perform some action on the process’s behalf. A system call changes the processor
state from user mode to kernel mode, so that the CPU can access protected kernel memory.
The kernel makes services accessible to programs through the system call API. See the
syscalls(2) manual page for a list of the Linux system calls.
A signal is a software interrupt. Signals inform a process that some event or exception
condition has occurred. Signals are sent to a process by the kernel, or by another process, or
by the process itself. Each signal type is identified by a different integer and defined with
symbolic names using the form SIGxxx. When a process receives a signal, it either ignores
the signal, is killed by the signal, or is suspended by the signal. The default action of some
signals is to terminate the process and produce a core dump file, which is a disk file
containing an image of the process’s memory at the time of termination. See the signal(7)
manual page for a list of the Linux signals.
The syntax of strace is as follows:
strace [options] <command> [arg ...]
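For example, the following sketches trace a simple command; -o writes the full trace to a
file, and -c summarizes system call counts and times:
# strace -o /tmp/date.trace date
# strace -c date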
kexec and kdump are used to ensure faster boot up and the creation of reliable kernel
vmcores for diagnostic purposes. kexec is a fast-boot mechanism, which allows booting a
Linux kernel from the context of an already running kernel without going through BIOS.
kdump is a kernel crash dumping mechanism and is reliable because the crash dump is
captured from the context of a freshly booted kernel and not from the context of the crashed
kernel. kdump uses kexec to boot into a second kernel whenever the system crashes. This
second kernel boots with very little memory and captures the dump image.
To enable and use kdump, install the following package:
# yum install kexec-tools
Enabling kdump requires you to reserve a portion of the system memory for the capture
kernel. This portion of memory is unavailable for other uses. The amount of memory that is
reserved for the kdump kernel is represented by the crashkernel boot parameter. This is
appended to the kernel line in the GRUB configuration file, /boot/grub/grub.conf. The
following example enables kdump and reserves 128 MB of memory:
kernel /vmlinuz-2.6.39-400.17.2.el6uek ... crashkernel=128M
If kdump fails to start, the following error appears in /var/log/messages:
kdump: No crashkernel parameter specified for running kernel
The configuration file for kdump is /etc/kdump.conf. The default target location for the
vmcore is the /var/crash directory on the local file system, which is represented as follows:
path /var/crash
To change the target location, edit this file and specify the new target. To write to a different
local directory, edit the path directive and provide the directory path. To write to a different
partition, specify the type of file system followed by an identifier.
Example:
path /
ext4 UUID=bff248...
To write directly to a device, edit the raw directive and specify the device name.
Example:
raw /dev/sda1
To write to a remote system by using NFS, use the net directive followed by the FQDN of the
remote system, then a colon (:), and then the directory path. Example:
net host01.example.com:/export/crash
The magic SysRq (system request) key is a key combination that is understood by the Linux
kernel, which allows you to execute low-level commands regardless of the system’s state. It is
often used to recover from, or diagnose, a system hang. The slide lists some of the common
uses and how to invoke them: the left column lists the magic SysRq keys, and the right
column lists the different methods of invoking them. For example, to cause a system crash of
an OVM guest, determine the domain ID of the guest as follows:
[dom0]# xm list
Assuming that the domain ID is 60, use the following command from dom0 to crash the VM
guest:
[dom0]# xm sysrq 60 c
The magic SysRq feature is controlled by the kernel.sysrq kernel parameter
(/proc/sys/kernel/sysrq). Set the value to 1 to enable, or to 0 (default) to disable.
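On a bare-metal system, you can invoke the same crash through the /proc interface after
enabling the feature:
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger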
The crash utility allows you to analyze the state of the Linux system while it is running or
after a kernel crash has occurred. The utility has been merged with gdb, the GNU Debugger,
so crash includes source-code-level debugging capabilities. The crash utility is used to
analyze core dumps created by the kdump, netdump, diskdump, xm dump-core, Linux
Kernel Crash Dumps (LKCD), and virsh dump facilities. The syntax for crash is as follows:
crash [options] [vmlinux] [vmcore]
Optional <vmlinux> and <vmcore> arguments:
• vmlinux – A vmlinux kernel object file, often referred to as the namelist. The
vmlinux file is part of the kernel-debuginfo package. This file is installed in the
/usr/lib/debug/lib/modules/<kernel> directory.
• vmcore – The memory image, which is a kernel crash dump file created by any of the
supported core dump facilities. Omit the vmcore argument to invoke the crash utility
on a live system.
The following is an example with both <vmlinux> and <vmcore> arguments:
# crash /usr/lib/debug/lib/modules/2.6.39-400.17.2.el6uek.x86_64/vmlinux /root/2013-0211-2358.45-host03.28.core
To use the crash utility, you must install the crash and kernel-debuginfo RPM
packages. The crash RPM is included in the Oracle Linux distribution. Download the
kernel-debuginfo RPM packages from https://oss.oracle.com/ as shown in the slide.
You need to determine the kernel version and architecture of the vmcore dump file to know
which kernel-debuginfo RPM packages to download and install. If you have access to
the system that produced the vmcore, use the uname -r command to determine the kernel
version and architecture as follows:
# uname -r
2.6.39-400.17.2.el6uek.x86_64
In this example, the kernel version is 2.6.39-400.17.2.el6uek and the architecture is
x86_64. If you have only the vmcore file, use the strings <vmcore> command as follows:
# strings <vmcore> | less
...
/vmlinuz-2.6.39-400.17.2.el6uek.x86_64 ro root=/dev/mapper...
...
Browse through the output until you see the kernel version and system architecture.
The slide displays an example of information that is initially displayed when crash is run. You
can learn a lot about the state of a system from this initial output. The number of CPUs and
the load average over the last 1 minute, last 5 minutes, and last 15 minutes are displayed.
The number of tasks running, the amount of memory, the panic string, the command
executing at the time the core dump was created, and additional information are displayed.
In this example, the system did not panic. The core dump file was created using the xm
dump-core command. The following example displays some of the initial crash information
for an actual panic:
DUMPFILE: tmp/vmcore
PANIC: "Oops: 0002" (check log for details)
PID: 1696
COMMAND: "insmod"
TASK: c74de000
CPU: 0
STATE: TASK_RUNNING (PANIC)
In this example, an insmod attempt to install a module resulted in an "oops" violation.
After the initial crash output is displayed, the crash utility provides an interactive prompt
from which crash commands are executed:
# crash /usr/lib/debug/lib/modules/kernel/vmlinux /root/2013-
0211-2358.45-host03.28.core
...
crash>
From the crash> prompt, you can enter help or ? to display the crash commands. You can
also enter help <command> to display usage information for a specific command. Each
crash command falls into one of the following categories:
• Symbolic display – Displays kernel text and data structures
• System state – Includes kernel-aware commands that examine various kernel
subsystems on a system-wide or per-task basis
• Utility functions – Include useful helper commands such as a calculator, translator, and
search function.
• Session control – Includes commands that control the crash session itself
The following symbolic display crash commands take advantage of the gdb integration to
display kernel data structures symbolically:
• struct – Displays a structure definition or provides a formatted display of the contents
of a structure at a specified address. For example:
crash> struct cpu
struct cpu {
int node;
int hotpluggable;
struct sys_device sysdev;
}
• union – Is the same as the struct command, but is used for kernel data types that
are defined as unions instead of structures
• * – Is the “Pointer-to” command, which can be used instead of entering struct or
union. The gdb module first determines whether the argument is a structure or a union,
and then calls the appropriate function.
The following system state crash commands are “kernel-aware” commands that display
various kernel subsystems on a system-wide or per-task basis:
• bt – Displays a kernel stack backtrace of the current context. This is probably the most
useful crash command. Because the initial context is the panic context, it shows the
function trace leading up to the kernel panic. For example:
crash> bt
PID: 0 TASK: ffffffff81781020 CPU: 0 COMMAND: "swapper"
#0 [ffffffff81777e38] __schedule at ffffffff815031d4
#1 [ffffffff81777e98] native_safe_halt at ffffffff8103d1db
#2 [ffffffff81777ed0] default_idle at ffffffff8101bbfd
#3 [ffffffff81777ef0] cpu_idle at ffffffff81013096
• dev – Displays character and block device data. For example:
crash> dev
CHRDEV NAME CDEV OPERATIONS
1 mem ffff88005bc31640 memory_fops
The following utility crash commands are helper commands that serve a variety of functions:
• ascii – Translates a hexadecimal value to ASCII. If no argument is entered, an ASCII
chart is displayed.
• btop – Translates a hexadecimal address to its page number
• eval – Evaluates an expression and displays the result in hexadecimal, decimal, octal,
and binary. This command can also function as a calculator.
• list – Displays the contents of a linked list of structures
• ptob – Translates a page number to its physical address (byte value)
• ptov – Translates a hexadecimal physical address into a kernel virtual address
• search – Searches for a specified value within the user virtual, kernel virtual, or
physical memory space. You can also provide a starting and ending point to search.
• rd – Reads or displays a specified amount of user, kernel, or physical memory in a
specified format
• wr – Writes or modifies the contents of memory. Use this command with great care.
The following session control crash commands control the crash session itself:
• alias – Creates an alias for a command string. Several built-in aliases are provided.
Enter the alias command with no arguments to display the current list of aliases.
• exit – Exits the crash session. This command is the same as the q command.
• extend – Extends the crash command set by dynamically loading crash extension
shared object libraries. Use the -u option to unload shared object libraries.
• foreach – Allows you to execute a crash command on multiple tasks in the system. It
can be used with the bt, vm, task, files, net, set, sig, and vtop commands.
• gdb – Passes arguments to the GNU Debugger for processing. Use the gdb help
command to list classes of commands.
• repeat – Repeats a command indefinitely until stopped with CTRL-C. This command is
useful only on a live system.
• set – Changes the crash context to a new task, or is used to display the current
context
• q – Exits the crash session. This command is the same as the exit command.
The steps to debug a kernel crash vary according to the problem. The following provides
some basic steps to follow. The bt command is often the first command you use after starting
a crash session. The bt command shows the function trace leading up to the kernel panic.
Use the bt -a command to show the stack trace of the active task on each CPU, because
there is often a relationship between the panicking task on one CPU and the running tasks
on the other CPUs. If you see cpu_idle and swapper, nothing is running on that
CPU. You can also pass bt as an argument to the foreach command to display backtraces
of all tasks. Use the bt -l command to see source file line numbers.
The kmem -i command provides a good summary of memory and swap use. Look for unusually
large SLAB usage (for example, greater than 500 MB) and for swap in use. Use the command
ps | grep UN to check for processes stuck in the D (uninterruptible) state. These
processes contribute to the load average.
You can also redirect output to a file as follows. Look through the output for a process that is
hung to set the PID. Use the set command to change the context to that PID:
crash> ps > ps.txt
crash> set <PID>
Commands that show open files (files), mount points (mount), and network devices with IP
addresses (net) are often helpful. Use help <command> within crash to see the options for each command.
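The following illustrative session ties these steps together (PID 1696 is borrowed from the earlier panic example):
crash> bt -a
crash> kmem -i
crash> ps | grep UN
crash> set 1696
crash> bt
crash> files
crash> mount
crash> net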
DTrace is a comprehensive dynamic tracing facility originally developed for Oracle Solaris,
and is now available with Oracle Linux. DTrace is a performance analysis and troubleshooting
tool designed to give operational insights that allow you to tune and troubleshoot the OS. Use
DTrace to explore your system and to understand how it works, to track down performance
problems across many layers of software, or to locate the cause of a multitude of abnormal
behavior problems. DTrace also provides Oracle Linux developers with a tool to analyze
performance and to better see how their systems work. DTrace enables higher-quality
application development, reduced downtime, lower cost, and greater utilization of existing
resources.
DTrace stands for Dynamic Tracing and provides an instrumentation technique that
dynamically patches live running instructions with instrumentation code. At specific points of
execution in the code, you can activate probes and designate actions, such as collecting and
displaying information. DTrace allows you to define probe points dynamically; they are not
precompiled into the kernel. DTrace has providers, which are essentially categories
of probes. The DTrace providers for Linux are listed in the slide.
DTrace provides a D programming language to enable probe points and associated actions.
Use the D language from the command line, or create D script files when performing complex
tracing. The D language is similar to the C programming language and awk.
• Many tracing and profiling tools exist for Linux, for varying
use cases:
– Inconsistent syntax, scripting languages, and output formats
– Lack of integration of both kernel and application tracing in a
single tool
• DTrace provides an integrated solution.
• Administrators and developers know DTrace from Oracle Solaris.
DTrace provides an integrated solution to the variety of tracing tools currently available for
Linux. Available tracing and profiling tools include:
• strace – This tool traces system calls and signals made by a program.
• pstack – This tool prints a stack trace of a running process.
• oprofile – This is a system profiling tool that is capable of profiling all running code,
including hardware and software interrupt handlers, kernel modules, the kernel, shared
libraries, and applications. It uses hardware counters and can also count cache activity.
• perf – This tool is part of the Performance Counters for Linux (PCL) kernel-based
subsystem. It tracks hardware events and can also measure software events.
• stap – This tool is the front end to SystemTap, which can collect information about a
running Linux system for use in diagnosing performance or functional problems.
• valgrind – This is a suite of tools for debugging and profiling Linux executables. The
tool suite includes memory leak checking, thread error detectors, and other profiler tools.
• blktrace – This tool is a block layer I/O tracing mechanism, which traces I/O traffic on
block devices.
Oracle makes DTrace on Linux available through the Unbreakable Enterprise Kernel. To use
DTrace on Linux, you must be on x86_64 architecture. Oracle is making all kernel changes
under the GNU GPL, whereas the DTrace kernel module itself is CDDL.
The DTrace packages are available on the Unbreakable Linux Network (ULN). Register your
server with ULN and subscribe your system to the ol6_x86_64_DTrace_latest channel.
After you have this channel registered with your server, you can install the RPMs using the
following command:
# yum install dtrace-utils
This command installs the dtrace-modules, dtrace-utils, kernel-uek-dtrace,
kernel-uek-dtrace-firmware, and libdtrace-ctf packages. If you want to be able to
link applications against DTrace, install the development packages as follows:
# yum install kernel-uek-dtrace-devel dtrace-utils-devel
This command installs the dtrace-utils-devel, kernel-uek-dtrace-devel, and
libdtrace-ctf-devel packages.
After the RPMs are installed, reboot the server into the newly installed kernel.
DTrace probes are tracing points, or instrumentation points, that enable you to record data at
various points of interest. Each probe is associated with an action. When the probe fires, the
action is executed. For example, you can define probes that fire upon entry into a kernel
function, when a specific file opens, when a particular process starts, or when a line of code
executes. Examples of data that you can display upon entry into a kernel function include:
• Any argument to the function
• Any global variable in the kernel
• A nanosecond timestamp of when the function was called
• A stack trace to indicate what code called this function
A probe:
• Is made available by a provider
• (For function-bound probes) Identifies the instrumented function and, in most cases, the
containing module
• Has a name
• Has a unique integer identifier
DTrace probes are implemented by providers. A provider is a kernel module that enables a
requested probe to fire when it is hit. A provider receives information from DTrace about when
a probe is to be enabled and transfers control to DTrace when an enabled probe is hit.
Use the dtrace -l command to list all probes and their respective providers:
# dtrace -l
ID PROVIDER MODULE FUNCTION NAME
1 dtrace BEGIN
2 dtrace END
3 dtrace ERROR
4 syscall read entry
5 syscall read return
...
Each line of output identifies a specific probe. Use the following syntax to uniquely identify
each probe:
provider:module:function:name
Use dtrace probes to initialize state (pre-processing) before tracing begins, process state
after tracing has completed (post-processing), and handle unexpected execution errors in
other probes.
Only three probes are provided by the dtrace provider:
• BEGIN – Fires each time you start a new tracing request. Use this probe to initialize any
state that is needed in other probes.
• END – Fires after all other probes. Use this probe to process state that has been
gathered or to format the output.
• ERROR – Fires when a runtime error occurs in executing a clause for a probe
The syntax to uniquely identify each probe is as follows:
provider:module:function:name
The dtrace probes are not bound to a function or module; therefore, these probes are
identified as follows:
• dtrace:::BEGIN
• dtrace:::END
• dtrace:::ERROR
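As an illustrative sketch (assuming the syscall provider described next is available), the following one-liner uses BEGIN to initialize state and END to post-process it:
# dtrace -n 'dtrace:::BEGIN {t = timestamp;}
syscall:::entry {@calls = count();}
dtrace:::END {printf("traced for %d ms\n", (timestamp - t) / 1000000);}'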
The syscall provider makes available a probe at the entry to and return from every system
call in the system. A system call changes the processor state from user mode to kernel mode,
so that the CPU can access protected kernel memory. Because system calls are the primary
interface between user-level applications and the operating system kernel, the syscall
provider offers tremendous insight into application behavior with respect to the system.
A majority of the probes are provided by the syscall provider. The syscall provider
provides a pair of probes for each system call:
• entry – Fires before the system call is entered
• return – Fires after the system call has completed but before control has transferred
back to user-level
For all syscall probes, the function name is set to the name of the instrumented system call
and the module name is undefined. As an example, the name for syscall probe ID 4 is:
syscall::read:entry
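For example, a minimal one-liner (illustrative) that fires on each read() entry and traces the requested byte count (the third argument to read()):
# dtrace -n 'syscall::read:entry {trace(arg2);}'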
The sched provider makes available probes related to CPU scheduling. The sched provider
dynamically traces key scheduling events. Because CPUs are the one resource that all
threads must consume, the sched provider is very useful for understanding systemic
behavior. For example, using the sched provider, you can understand when and why threads
sleep, run, change priority, or wake other threads.
There are currently 12 sched probes, including the following:
• on-cpu – Fires when a CPU begins to execute a thread
• off-cpu – Fires when a thread is about to be taken off of a CPU
• surrender – Fires when a CPU has been instructed by another CPU to make a
scheduling decision, often because a higher-priority thread has become runnable
• change-pri – Fires whenever a thread’s priority is about to be changed
• enqueue – Fires immediately before a runnable thread is enqueued to a run queue
• dequeue – Fires immediately before a runnable thread is dequeued from a run queue
• tick – Fires as a part of clock tick–based accounting
• wakeup – Fires immediately before the current thread wakes a thread sleeping on a
synchronization object
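For example, a brief sketch that counts how often each program is placed on a CPU (press Ctrl + C to print the aggregation):
# dtrace -n 'sched:::on-cpu {@[execname] = count();}'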
The sdt (Statically Defined Tracing) provider allows programmers to insert explicit probes in
their applications. Programmers can select locations of interest to users of DTrace and
provide some information about each location through the probe name.
The sdt probes are defined as follows:
• init_module – Fires if a loadable module entry is initialized
• delete_module – Fires if a loadable module entry is deleted
• signal_fault – Fires if a signal fault is raised
• oom_kill_process – Fires if a process is killed due to shortage of memory
• check_hung_task – Fires if a hung task is detected
• handle_sysrq – Fires if SysRq is pressed
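As an illustrative sketch (assuming the sdt provider is loaded), the following one-liner reports the process context in which a loadable module is initialized:
# dtrace -n 'sdt:::init_module {trace(execname);}'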
The io provider makes available probes related to device input and output (I/O). You can
explore behavior observed through I/O monitoring tools such as iostat.
The io probes are defined as follows:
• start – Fires when an I/O request is about to be made either to a peripheral device or
to an NFS server
• done – Fires after an I/O request has been fulfilled
• wait-start – Fires immediately before a thread begins to wait, pending completion of
a given I/O request
• wait-done – Fires when a thread finishes waiting for the completion of a given I/O
request
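For example, a minimal sketch that counts I/O requests by process name:
# dtrace -n 'io:::start {@[execname] = count();}'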
The proc provider makes available probes pertaining to the following activities: process
creation and termination, lightweight process (LWP) creation and termination, executing new
program images, and sending and handling signals.
There are currently 13 proc probes, including the following:
• start – Fires in the context of a newly created process, before any user-level
instructions are executed in the process
• create – Fires when a process (or process thread) is created using fork() or
vfork()
• exec – Fires whenever a process loads a new process image using a variant of the
execve() system call
• exec-failure – Fires when an exec() variant has failed
• exec-success – Fires when an exec() variant has succeeded
• exit – Fires when the current process is exiting
• lwp-create – Fires when a process thread is created
• lwp-start – Fires within the context of a newly created process or process thread
• lwp-exit – Fires when a process or process thread is exiting
The profile provider provides probes associated with a time-based interrupt that fires at a
fixed, specified time interval. These probes are not associated with any particular point of
execution. Use these probes to sample some aspect of system state such as the state of the
current thread, the state of the CPU, or the current machine instruction.
The tick-N probes fire at fixed intervals at a high interrupt level on only one CPU per
interval. The actual CPU may change over time.
The value of N defaults to rate-per-second. You can include an optional time suffix as follows:
• nsec|ns – Nanoseconds; for example, tick-1ns
• usec|us – Microseconds; for example, tick-10us
• msec|ms – Milliseconds; for example, tick-100ms
• sec|s – Seconds (this is the default)
• min|m – Minutes; for example, tick-1m
• hour|h – Hours; for example, tick-1h
• day|d – Days; for example, tick-1d
• hz – Hertz; the highest supported tick frequency is 5000 HZ (tick-5000hz).
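For example, an illustrative one-liner that prints the wall-clock time once per second:
# dtrace -n 'profile:::tick-1sec {printf("%Y\n", walltimestamp);}'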
Probes are enabled with the dtrace command by specifying them without the list (-l)
option. DTrace performs the associated action when the probe fires. The default action only
indicates that a probe fired. No other data is recorded. You can enable (and list) probes by
provider (-P), by name (-n), by function (-f), and by module (-m).
To enable all probes provided by the syscall provider:
# dtrace -P syscall
dtrace: description 'syscall' matched 592 probes
CPU ID FUNCTION:NAME
0 36 ioctl:entry
0 37 ioctl:return
0 36 ioctl:entry
0 37 ioctl:return
0 30 rt_sigaction:entry
^C
From the output, you can see that the default action displays the CPU where the probe fired,
the unique probe ID, the function where the probe fired, and the probe name.
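For comparison, probes can also be enabled individually; the following illustrative commands enable a single probe by name, and all probes for the read function:
# dtrace -n syscall::read:entry
# dtrace -f read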
DTrace actions are taken when a probe fires. The default action displays the CPU where the
probe fired, the unique probe ID, the function where the probe fired, and the probe name.
Use the D programming language to specify probes of interest and bind actions to those
probes. The D programming language is similar to the C programming language and includes
a set of functions and variables to help make tracing easier. The actions are listed as a series
of statements enclosed in braces {}, following the probe name.
The following example uses the trace() function as the action to trace the time of entry to
each system call:
# dtrace -n 'syscall:::entry {trace(timestamp)}'
dtrace: description 'syscall:::entry' matched 296 probes
CPU ID FUNCTION:NAME
0 6 write:entry 173635407378228
0 6 write:entry 173635407399180
0 36 ioctl:entry 173635407416780
^C
Surround the probe name and associated action with single quotation marks (' ').
There are many built-in functions that perform actions. Previous examples use the trace()
function. This function takes a D expression as its argument and traces the result to the
directed buffer. All the examples in the slide are valid trace() actions.
The last example, trace(`max_pfn), is an example of tracing the value of a system
variable named max_pfn. DTrace instrumentation executes inside the Oracle Linux operating
system kernel; so in addition to accessing special DTrace variables and probe arguments,
you can also access kernel data structures, symbols, and types. These capabilities enable
advanced DTrace users, administrators, service personnel, and driver developers to examine
the low-level behavior of the operating system kernel and device drivers.
D uses the backquote character (`) as a special scoping operator for accessing symbols that
are defined in the operating system and not in your D program. For example, the Oracle Linux
kernel contains a C declaration of a system variable named max_pfn. This variable is
declared in C in the kernel source code as follows:
unsigned long max_pfn
Use the following statement in a D program script to trace the value of this variable:
trace(`max_pfn)
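For example, an illustrative one-liner form:
# dtrace -n 'dtrace:::BEGIN {trace(`max_pfn); exit(0);}'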
Several examples are presented in subsequent slides. The following example displays
commands that are executing on your system:
# dtrace -n 'proc:::exec {trace(stringof(arg0));}'
dtrace: description 'proc:::exec ' matched 1 probe
CPU ID FUNCTION:NAME
0 617 do_execve_common:exec /bin/ls
0 617 do_execve_common:exec /bin/bash
^C
The following example displays new processes with arguments:
# dtrace -n 'proc:::exec-success {trace(curpsinfo->pr_psargs);}'
dtrace: description 'proc:::exec-success ' matched 1 probe
CPU ID FUNCTION:NAME
0 619 do_execve_common:exec-success /usr/sbin/sshd -R
0 619 do_execve_common:exec-success /sbin/unix_chkpwd root...
^C
The following example counts the number of write() system calls and the number of
read() system calls invoked by processes until you press Ctrl + C:
# dtrace -n 'syscall::write:entry,syscall::read:entry
{@[strjoin(probefunc,"() calls")] = count();}'
dtrace: description 'syscall::write:entry,syscall::read:entry '
matched 2 probes
^C
write() calls 238
read() calls 431
The following example displays the size, in bytes, of each disk I/O request by process:
# dtrace -n 'io:::start {printf("%d %s %d",pid,execname,args[0]->b_bcount);}'
CPU ID FUNCTION:NAME
0 602 submit_bh:start 342 jbd2/dm-0-8 4096
0 602 submit_bh:start 1184 flush-252:0 4096
^C
Enabling multiple probes and multiple actions becomes difficult to manage on the command
line. For complex tracing, DTrace supports D script files. Each D script file ends with a .d
suffix and consists of a series of clauses. Each clause describes one or more probes to
enable and an optional set of actions to perform when the probe fires. The actions are listed
as a series of statements enclosed in braces {}, following the probe name. Each statement
ends with a semicolon (;). The example D script in the slide provides the same functionality
as the following command-line entry:
# dtrace -n 'dtrace:::BEGIN {trace("hello, world");exit(0);}'
dtrace: description 'dtrace:::BEGIN' matched 1 probe
CPU ID FUNCTION:NAME
0 1 :BEGIN hello, world
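For reference, a hello.d script equivalent to this one-liner looks as follows (a minimal reconstruction; the slide's exact text is not reproduced here):
dtrace:::BEGIN
{
    trace("hello, world");
    exit(0);
}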
Use the dtrace command with the -s <script> option to execute the script:
# dtrace -s hello.d
dtrace: script 'hello.d' matched 1 probe
CPU ID FUNCTION:NAME
0 1 :BEGIN hello, world
The quiet mode option, dtrace -q, instructs DTrace to record only the actions that are
explicitly stated. This option suppresses the default output that is normally produced by the
dtrace command. The following example shows the use of the -q option with the hello.d
script:
# dtrace -q -s hello.d
hello, world
Alternatively, you can create executable DTrace interpreter files. Interpreter files have execute
permission and always begin with the following line:
#!/usr/sbin/dtrace -s
Add the preceding line as the first line in a script to invoke the interpreter. Give the script
execute permission and you can run the script by entering its name on the command line as
follows:
# vi hello.d
#!/usr/sbin/dtrace -qs
...
# chmod +x hello.d
# ./hello.d
hello, world
Using Predicates in a D Script
The example in the slide implements a predicate in a D script. When executed, the script
counts down from 10 once per second, then prints a message and exits. The script uses the
dtrace:::BEGIN probe to initialize an integer i to 10. The script then uses the
profile:::tick-1sec probe to implement a timer that fires once per second. The first
predicate, /i > 0/, checks whether the value of i is greater than 0. If it is, the
action uses the trace() function to display the value of i and decrement it by 1.
The script uses the profile:::tick-1sec probe a second time and the predicate
evaluates if i is equal to zero, /i == 0/. If this predicate is true, the action is to use the
trace() function to display the string “blastoff!”. A second action is to exit dtrace and
return to the shell prompt.
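A hedged reconstruction of the script described above (the slide's exact text is not reproduced):
#!/usr/sbin/dtrace -qs

dtrace:::BEGIN
{
    i = 10;
}

profile:::tick-1sec
/i > 0/
{
    trace(i--);
}

profile:::tick-1sec
/i == 0/
{
    trace("blastoff!");
    exit(0);
}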
The example script in the slide uses an array to trace the elapsed time for each system call.
The script instruments both the entry to and return from read() and write(), and samples
the time at each point. Then, on return from a given system call, the script computes the
difference between the first and second timestamp. You could use separate variables for each
system call, but it is easier to use an associative array indexed by the probe function name.
The first clause defines an array named ts and assigns the appropriate member the value of
the DTrace variable timestamp. This variable returns the value of an always-incrementing
nanosecond counter. When the entry timestamp is saved, the corresponding return probe
samples timestamp again and reports the difference between the current time and the
saved value.
The predicate on the return probe requires that DTrace is tracing the appropriate process and
that the corresponding entry probe has already fired and assigned ts[probefunc] a non-
zero value. This trick eliminates invalid output when DTrace first starts. If your shell is already
waiting in a read() system call for input when you execute dtrace, the read:return
probe fires without a preceding read:entry for this first read() and ts[probefunc]
evaluates to zero because it has not yet been assigned.
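A hedged reconstruction of such a script, assuming the process ID to trace is passed as the first script argument ($1):
syscall::read:entry,
syscall::write:entry
/pid == $1/
{
    ts[probefunc] = timestamp;
}

syscall::read:return,
syscall::write:return
/pid == $1 && ts[probefunc] != 0/
{
    printf("%d nsecs\n", timestamp - ts[probefunc]);
}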
The D script on the following page shows ongoing activity in terms of what program was
executing, what its parent is, and how long it ran. Use the following command to run this
script:
# dtrace -C -s activity.d
The -C option invokes the C preprocessor (see man cpp). You can use the C preprocessor in
conjunction with your D programs by specifying the dtrace -C option.
This script introduces special D compiler directives called pragmas. DTrace is tuned by
setting or enabling options and this is accomplished by using D pragmas. You can include D
pragmas anywhere in a D script, including outside probe clauses. Begin the line with # as
follows. This example sets the DTrace runtime quiet option (the same as dtrace -q):
#pragma D option quiet
This script also introduces clause-local variables. These are variables whose storage is
reused for each D program clause that relates to a probe. Clause-local variables are similar to
automatic variables in a C, C++, or Java language program that are active during each
invocation of a function. Like all D program variables, clause-local variables are created on
their first assignment. These variables are referenced and assigned by applying the ->
operator to the special identifier this.
#pragma D option quiet

/* Record creation time, parent PID, and parent name for each forked child.
   arg0 is the new task_struct; this-> variables are clause-local. */
proc::do_fork:create
{
    this->pid = ((struct task_struct *)arg0)->pid;
    time[this->pid] = timestamp;
    p_pid[this->pid] = pid;
    p_name[this->pid] = execname;
    p_exec[this->pid] = "";
}

/* Reconstructed clause (assumed from the predicates below): record the
   new image name when a tracked process successfully calls exec(). */
proc:::exec-success
/p_pid[pid] != 0/
{
    p_exec[pid] = execname;
}

/* The child ran a new program image. The divisor is 1000000 to convert
   nanoseconds to milliseconds (the original slide divided by 1000). */
proc::do_exit:exit
/p_pid[pid] != 0 && p_exec[pid] != ""/
{
    printf("%s (%d) executed %s (%d) for %d msecs\n",
        p_name[pid], p_pid[pid], p_exec[pid], pid,
        (timestamp - time[pid]) / 1000000);
}

/* The child exited without calling exec(): it ran as a fork of the parent. */
proc::do_exit:exit
/p_pid[pid] != 0 && p_exec[pid] == ""/
{
    printf("%s (%d) forked itself (as %d) for %d msecs\n",
        p_name[pid], p_pid[pid], pid,
        (timestamp - time[pid]) / 1000000);
}
# ls /cgroup/cpuset/cpuset*
cpuset.cpu_exclusive
cpuset.cpus
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_pressure_enabled
cpuset.mems
cpuset.cpus
Is a comma-separated list of CPU cores to which a cgroup has access
cpuset.mems
Is a comma-separated list of memory nodes to which a cgroup has access
cpuset.cpu_exclusive
Specifies whether the CPUs listed in cpuset.cpus are exclusively allocated to this CPU set.
The value must be either of the following:
• 0 – CPUs are not exclusively allocated. This is the default.
• 1 – This enables exclusive use of the CPUs by a CPU set.
cpuset.mem_exclusive
Specifies whether the memory nodes listed in cpuset.mems are exclusively allocated to this
CPU set. The value must be either of the following:
• 0 – Memory nodes are not exclusively allocated. This is the default.
• 1 – This enables exclusive use of the memory nodes by a CPU set.
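A brief configuration sketch, assuming the cpuset hierarchy is mounted at /cgroup/cpuset as shown above and using a hypothetical group1 cgroup (the last command moves the current shell into the cgroup):
# mkdir /cgroup/cpuset/group1
# echo 0-1 > /cgroup/cpuset/group1/cpuset.cpus
# echo 0 > /cgroup/cpuset/group1/cpuset.mems
# echo $$ > /cgroup/cpuset/group1/tasks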
# ls /cgroup/cpu/cpu*
cpu.rt_period_us
cpu.rt_runtime_us
cpu.shares
cpu.shares
Specifies the relative share of CPU time available to the tasks in a cgroup. Two cgroups with
the same value receive equal CPU time. One cgroup with a value of 2 receives twice the CPU
time of another cgroup with a value of 1.
cpu.rt_period_us
Specifies the period of time, in microseconds, over which a cgroup's access to CPU
resources is rescheduled. The default is 1000000 (1 second).
cpu.rt_runtime_us
Specifies the longest period of time, in microseconds, during which tasks in a cgroup have
CPU access. The default is 950000 (0.95 seconds).
The cpu.rt_period_us and cpu.rt_runtime_us settings work together. By default,
tasks in a cgroup access CPU resources 95% of each second. To give tasks less access, for
example 60%, set cpu.rt_runtime_us to 600000 and leave cpu.rt_period_us set to
1000000.
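For example (an illustrative sketch using a hypothetical group1 cgroup under /cgroup/cpu):
# echo 512 > /cgroup/cpu/group1/cpu.shares
# echo 600000 > /cgroup/cpu/group1/cpu.rt_runtime_us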
# ls /cgroup/cpuacct/cpuacct*
cpuacct.stat
cpuacct.usage
cpuacct.usage_percpu
cpuacct.usage
Reports the total CPU time in nanoseconds for all the tasks in the cgroup. Setting this
parameter to 0 resets its value and the value of the cpuacct.usage_percpu parameter.
cpuacct.stat
Reports the user and system (kernel) CPU time consumed by all the tasks in the cgroup, in
USER_HZ units
cpuacct.usage_percpu
Reports the CPU time in nanoseconds that is spent for each CPU by all the tasks in the
cgroup
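For example (illustrative, assuming a group1 cgroup under /cgroup/cpuacct): read the accumulated CPU time, then reset the counters:
# cat /cgroup/cpuacct/group1/cpuacct.usage
# echo 0 > /cgroup/cpuacct/group1/cpuacct.usage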
# ls /cgroup/memory/memory*
memory.failcnt
memory.force_empty
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.memsw.failcnt
memory.memsw.limit_in_bytes
memory.memsw.max_usage_in_bytes
memory.memsw.usage_in_bytes
memory.soft_limit_in_bytes
memory.limit_in_bytes
Specifies the hard, upper limit permitted for user memory, including the file cache, in bytes.
You cannot limit the root cgroup. You can only limit groups lower in the hierarchy.
memory.soft_limit_in_bytes
Specifies a soft, upper limit for user memory, including the file cache. Set the soft limit lower
than the hard limit value because the hard limit always takes precedence.
memory.memsw.limit_in_bytes
Specifies the upper limit for user memory, plus the swap space in bytes. To avoid an out-of-
memory error, set this value less than the amount of swap space, and set the value of the
memory.limit_in_bytes parameter less than memory.memsw.limit_in_bytes. You
must also set the memory.limit_in_bytes parameter before setting the
memory.memsw.limit_in_bytes parameter.
Each of these limit parameters defaults to bytes. You can also specify the limits in k or K, m or
M, and g or G. Set the parameters to -1 to remove the limits.
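A brief sketch, assuming a hypothetical group1 cgroup under /cgroup/memory (note the required ordering: set the hard limit before the memory-plus-swap limit):
# echo 512M > /cgroup/memory/group1/memory.limit_in_bytes
# echo 1G > /cgroup/memory/group1/memory.memsw.limit_in_bytes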
# ls /cgroup/devices/devices*
devices.allow
devices.deny
devices.list
devices.allow
Specifies a device that a cgroup is allowed to access. Devices are defined by the following:
• Type (a for any, b for block, or c for character)
• Major and minor numbers separated by a colon
• Access modes (m for create permission, r for read access, and w for write access)
You can use an asterisk, *, as a wildcard to represent any major or minor number. For
example, specify b 8:* rw to allow read and write access to any SCSI disk drive.
devices.deny
Specifies a device that a cgroup is not allowed to access. Use the same syntax as
devices.allow.
devices.list
Reports the devices with access controls. Output of a *:* rwm indicates that all devices are
available in all access modes.
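For example (an illustrative sketch using a hypothetical group1 cgroup): deny access to all devices, and then allow the cgroup to create, read, and write the character device /dev/null (major 1, minor 3):
# echo a > /cgroup/devices/group1/devices.deny
# echo 'c 1:3 mrw' > /cgroup/devices/group1/devices.allow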
# ls /cgroup/freezer/freezer*
freezer.state
freezer.state
The value of the parameter is one of the following:
• FROZEN – Tasks in the cgroup are suspended.
• FREEZING – Tasks in the cgroup are in the process of being suspended.
• THAWED – Tasks in the cgroup have resumed.
You perform the following steps to suspend a specific process:
1. Move the process to a cgroup in a hierarchy that has the freezer subsystem attached.
2. Freeze the particular cgroup to suspend the process contained in it.
You cannot move a process into a FROZEN cgroup. Only the FROZEN and THAWED values can
be written to freezer.state. The FREEZING value can only be read, not written.
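The following sketch illustrates these two steps (PID 1234 and the group1 cgroup are hypothetical):
# echo 1234 > /cgroup/freezer/group1/tasks
# echo FROZEN > /cgroup/freezer/group1/freezer.state
# cat /cgroup/freezer/group1/freezer.state
FROZEN
# echo THAWED > /cgroup/freezer/group1/freezer.state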
# ls /cgroup/net_cls/net_cls*
net_cls.classid
net_cls.classid
Specifies a single hexadecimal value that indicates a traffic control handle. The value is
presented in decimal format to the Linux traffic controller (tc). You can configure the traffic
controller to use the handles that the net_cls subsystem adds to network packets.
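For example (illustrative, using a hypothetical group1 cgroup): the value 0x100001 represents handle 10:1 in the format used by tc, and reading the parameter back returns its decimal form:
# echo 0x100001 > /cgroup/net_cls/group1/net_cls.classid
# cat /cgroup/net_cls/group1/net_cls.classid
1048577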
# ls /cgroup/blkio/blkio*
blkio.io_merged
blkio.io_queued
blkio.io_service_bytes
blkio.io_serviced
blkio.io_service_time
blkio.io_wait_time
blkio.reset_stats
blkio.sectors
blkio.time
blkio.weight
blkio.weight_device
blkio.weight
Specifies a cgroup’s share of access to block I/O. The range is from 100 to 1000, with a
default value of 1000.
blkio.weight_device
Specifies a cgroup’s share of access to block I/O on a specific device. The range is from 100
to 1000 and the device is specified by its major and minor numbers, separated by a colon. For
example, 8:16 100 specifies a value of 100 for /dev/sdb. The value of this parameter
overrides the default value set by the blkio.weight parameter.
blkio.time
Reports the time in milliseconds that I/O access was available to a device specified by its
major and minor numbers
blkio.sectors
Reports the number of disk sectors written to or read from the devices specified by their major
and minor numbers
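For example (an illustrative sketch using a hypothetical group1 cgroup): halve the group's default I/O share, then further reduce its share on /dev/sdb:
# echo 500 > /cgroup/blkio/group1/blkio.weight
# echo '8:16 100' > /cgroup/blkio/group1/blkio.weight_device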