Dynamic Performance Tuning and Troubleshooting With DTrace (SA-327-S10)

Student Guide
Sun Microsystems, Inc. UBRM05-104 500 Eldorado Blvd. Broomfield, CO 80021 U.S.A. Revision A
March 18, 2005 11:30 am
Copyright 2005 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Sun, Sun Microsystems, the Sun logo, Solaris, and OpenBoot are trademarks or registered trademarks of Sun Microsystems, Inc., in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc., in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. Federal Acquisitions: Commercial Software Government Users Subject to Standard License Terms and Conditions Export Laws. Products, Services, and technical data delivered by Sun may be subject to U.S. export controls or the trade laws of other countries. You will comply with all such laws and obtain all licenses to export, re-export, or import as may be required after delivery to You. You will not export or re-export to entities on the most current U.S. export exclusions lists or to any country subject to U.S. embargo or terrorist controls as specified in the U.S. export laws. You will not use or provide Products, Services, or technical data for nuclear, missile, or chemical biological weaponry end uses. 
DOCUMENTATION IS PROVIDED AS IS AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS, AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. THIS MANUAL IS DESIGNED TO SUPPORT AN INSTRUCTOR-LED TRAINING (ILT) COURSE AND IS INTENDED TO BE USED FOR REFERENCE PURPOSES IN CONJUNCTION WITH THE ILT COURSE. THE MANUAL IS NOT A STANDALONE TRAINING TOOL. USE OF THE MANUAL FOR SELF-STUDY WITHOUT CLASS ATTENDANCE IS NOT RECOMMENDED. Export Control Classification Number EAR99 assigned: 10 September 2004
Table of Contents

About This Course
    Course Goals
    Topics Not Covered
    How Prepared Are You?
    Introductions
    How to Use Course Materials
    Conventions
        Typographical Conventions

DTrace Fundamentals
    Objectives
    Relevance
    Additional Resources
    DTrace Features
        Transient Failures
        Debugging Transient Failures
        DTrace Capabilities
    DTrace Architecture
        Probes and Probe Providers
        DTrace Components
    DTrace Tour
        Listing Probes
        Writing D Scripts

Using DTrace
    Objectives
    Relevance
    Additional Resources
    DTrace Performance Monitoring Capabilities
        Features of the DTrace Performance Monitoring Capabilities
        Aggregations
    Examining Performance Problems Using the vminfo Provider
        The vminfo Probes
        Finding the Source of Page Faults Using vminfo Probes
    Examining Performance Problems Using the sysinfo Provider
        The sysinfo Probes
        Using the quantize Aggregation Function With the sysinfo Probes
        Finding the Source of Cross-Calls
    Examining Performance Problems Using the io Provider
        The io Probes
        Information Available When io Probes Fire
        Finding I/O Problems
    Obtaining System Call Information
        The syscall Provider
        D Language Variables
        Associative Arrays
        Thread-Local Variables
        Timing a System Call
        Following a System Call
    Creating D Scripts That Use Arguments
        Built-in Macro Variables
        PID Argument Example
        Executable Name Argument Example
        Custom Monitoring Tools

Debugging Applications With DTrace
    Objectives
    Relevance
    Additional Resources
    Application Profiling
        The pid Provider
        The profile Provider
    Application Variables
        Displaying Process Global Variables
        Displaying Library Global Variables
    The plockstat Provider
    Transient System Call Errors
        User Stack Traces on System Call Failures
        Processes Using a Lot of System Time
    Open Files
        Accessing System Call Pointer Arguments
        Displaying Names of Files Being Opened

Finding System Problems With DTrace
    Objectives
    Relevance
    Additional Resources
    Accessing Kernel Variables
        Using the D Language to Access Kernel Symbols
        Monitoring Kernel Variables
        Accessing Kernel Data Structures
        Accessing Lock Contention Information
        The proc Provider and the system() Function
    Displaying Read Call Information
        Tracing Read Calls System-Wide
        Tracing Read Calls Using the iosnoop.d D Script
        Aggregating Read Data
    Using the Anonymous Tracing Facility
        Creating an Anonymous Enabling
        Performing Anonymous Tracing
    Using the Speculative Tracing Facility
        Speculative Tracing Functions
        Speculative Tracing Example
        Application Debugging With Speculative Tracing
    DTrace Privileges
        Using the Least Privilege Facility
        Kernel-Destructive Actions
        Setting DTrace User Privileges
        Setting DTrace Process Privileges
        Summarizing the DTrace Privilege Levels

Troubleshooting DTrace Problems
    Objectives
    Relevance
    Additional Resources
    Minimizing DTrace Performance Impact
        Limiting Enabled Probes
        Using Aggregations
        Using Cacheable Predicates
    Using and Tuning DTrace Buffers
        Principal Buffers
        Principal Buffer Policies
        DTrace Option Settings
        The switch Buffer Policy
        The fill Buffer Policy
        The ring Buffer Policy
        Other Buffers
        Buffer Resizing Policy
    Debugging DTrace Scripts
        Avoiding Syntax Errors in D Scripts
        Avoiding Run-Time Errors in D Scripts

Actions and Subroutines
    Default Action
    Data Recording Actions
        The void trace(expression) Action
        The void tracemem(address, size_t nbytes) Action
        The void printf(string format, ...) Action
        The printa Action
        The stack() Action
        The ustack() Action
    Destructive Actions
        Process Destructive Actions
        Kernel Destructive Actions
    Special Actions
        Actions Associated With Speculative Tracing
        The void exit(int status) Action
    Subroutines
        The void *alloca(size_t size) Subroutine
        The string basename(char *str) Subroutine
        The void bcopy(void *src, void *dest, size_t size) Subroutine
        The string cleanpath(char *str) Subroutine
        The void *copyin(uintptr_t addr, size_t size) Subroutine
        The string copyinstr(uintptr_t addr) Subroutine
        The string dirname(char *str) Subroutine
        The size_t msgdsize(mblk_t *mp) Subroutine
        The size_t msgsize(mblk_t *mp) Subroutine
        The int mutex_owned(kmutex_t *mutex) Subroutine
        The kthread_t *mutex_owner(kmutex_t *mutex) Subroutine
        The int mutex_type_adaptive(kmutex_t *mutex) Subroutine
        The int progenyof(pid_t pid) Subroutine
        The int rand(void) Subroutine
        The int rw_iswriter(krwlock_t *rwlock) Subroutine
        The int rw_write_held(krwlock_t *rwlock) Subroutine
        The int speculation(void) Subroutine
        The string strjoin(char *str1, char *str2) Subroutine
        The size_t strlen(string str) Subroutine

D Built-in and Macro Variables
    Built-in Variables
    Macro Variables

D Operators
    Arithmetic Operators
    Relational Operators
    Logical Operators
    Bitwise Operators
    Assignment Operators
    Increment and Decrement Operators
    Conditional Expressions
Preface
About This Course

Course Goals
Upon completion of this course, you should be able to:
• Describe the features and architecture of the Solaris Dynamic Tracing (DTrace) facility
• Use the DTrace facility to find the source of intermittent problems
• Use DTrace to help debug applications
• Use DTrace to look at the cause of performance problems
• Troubleshoot DTrace script problems
Course Map
The following course map enables you to see what you have accomplished and where you are going in reference to the course goals.
Understanding and Using the DTrace Facility
• DTrace Fundamentals
• Using DTrace

Using DTrace to Debug Applications and Find System Problems
• Debugging Applications With DTrace
• Finding System Problems With DTrace

Troubleshooting DTrace
• Troubleshooting DTrace Problems

Topics Not Covered

This course does not cover the following topic, which is covered in other courses offered by Sun Educational Services:
• Performance management

Refer to the Sun Educational Services catalog for specific information and registration.
How Prepared Are You?

To be sure you are prepared to take this course, can you answer yes to the following questions?
• Do you have some previous programming experience?
• Can you use the truss command to diagnose application problems?
• Do you know the basics of the kernel structure?
• Are you familiar with basic troubleshooting concepts?

Introductions
Now that you have been introduced to the course, introduce yourself to the other students and the instructor, addressing the following items:
• Name
• Company affiliation
• Title, function, and job responsibility
• Experience related to topics presented in this course
• Reasons for enrolling in this course
• Expectations for this course
How to Use Course Materials

To enable you to succeed in this course, these course materials contain a learning module that is composed of the following components:
• Goals: You should be able to accomplish the goals after finishing this course and meeting all of its objectives.
• Objectives: You should be able to accomplish the objectives after completing a portion of instructional content. Objectives support goals and can support other higher-level objectives.
• Lecture: The instructor presents information specific to the objective of the module. This information helps you learn the knowledge and skills necessary to succeed with the activities.
• Activities: The activities take various forms, such as review questions, labs, discussion, and demonstration. Activities help facilitate the mastery of an objective.
• Visual aids: The instructor might use several visual aids to convey a concept, such as a process, in a visual form. Visual aids commonly contain graphics, animation, and video.

Conventions
The following conventions are used in this course to represent various training elements and alternative learning resources.
Icons
Additional resources: Indicates other references that provide additional information on the topics described in the module.

Discussion: Indicates a small-group or class discussion on the current topic is recommended at this time.

Note: Indicates additional information that can help students but is not crucial to their understanding of the concept being described. Students should be able to understand the concept or complete the task without this information. Examples of notational information include keyword shortcuts and minor system adjustments.

Caution: Indicates that there is a risk of personal injury from a nonelectrical hazard, or risk of irreversible damage to data, software, or the operating system. A caution indicates that a hazard is possible (as opposed to certain), depending on the action of the user.

Warning: Indicates that either personal injury or irreversible damage of data, software, or the operating system will occur if the user performs this action. A warning does not indicate potential events; if the action is performed, catastrophic events will occur.
Typographical Conventions
Courier is used for the names of commands, files, directories, programming code, and on-screen computer output; for example:
    Use ls -al to list all files.
    system% You have mail.

Courier is also used to indicate programming constructs, such as class names, methods, and keywords; for example:
    The getServletInfo method is used to get author information.
    The java.awt.Dialog class contains the Dialog constructor.

Courier bold is used for characters and numbers that you type; for example:
    To list the files in this directory, type:
    # ls

Courier bold is also used for each line of programming code that is referenced in a textual description; for example:
    1 import java.io.*;
    2 import javax.servlet.*;
    3 import javax.servlet.http.*;
    Notice the javax.servlet interface is imported to allow access to its life cycle methods (Line 2).

Courier italics is used for variables and command-line placeholders that are replaced with a real name or value; for example:
    To delete a file, use the rm filename command.

Courier italic bold is used to represent variables whose values are to be entered by the student as part of an activity; for example:
    Type chmod a+rwx filename to grant read, write, and execute rights for filename to world, group, and users.

Palatino italics is used for book titles, new words or terms, or words that you want to emphasize; for example:
    Read Chapter 6 in the User's Guide.
    These are called class options.

Module 1
DTrace Fundamentals
Objectives
Upon completion of this module, you should be able to:
• Describe the features of the Solaris Dynamic Tracing (DTrace) facility
• Describe the DTrace architecture
• List and enable probes, and create action statements and D scripts
Relevance
Discussion: The following questions are relevant to understanding DTrace:
• Would the ability to turn on trace points for any one of the majority of functions in the kernel be beneficial?
• Would it be useful to know who is issuing kill(2) system calls?

Additional Resources
The following references provide additional information on the topics described in this module:
• Sun Microsystems, Inc. Solaris Dynamic Tracing Guide, part number 817-6223-10.
• The /usr/demo/dtrace directory contains all of the sample scripts from the Solaris Dynamic Tracing Guide.
• Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. "Dynamic Instrumentation of Production Systems." Paper presented at 2004 USENIX Conference.
• BigAdmin System Administration Portal [http://www.sun.com/bigadmin/content/dtrace].
• The dtrace(1M) manual page.
DTrace Features
DTrace is a comprehensive dynamic tracing facility that is bundled into the Solaris 10 Operating System (Solaris 10 OS). It is intended for use by system administrators, service support personnel, kernel developers, application program developers, and users who are given explicit access permission to the DTrace facility.

DTrace has the following features:
• Enables dynamic modification of the system to record arbitrary data
• Promotes tracing on live systems
• Is completely safe: its use cannot induce fatal failure
• Allows tracing of both the kernel program and user-level programs
• Functions with low overhead when tracing is enabled and zero overhead when tracing is not being performed
Transient Failures
DTrace helps you find the causes of transient failures. A transient failure is any unacceptable behavior that does not result in fatal failure of the system. You might have a clear, specific failure, such as:
• read(2) is returning EIO errno values on a device that is not reporting any errors.
• An application occasionally does not receive its expected timer signal.
• A thread is missing a condition variable wakeup.

The transient failure can also be based on your own definition of unacceptable system operation:
• "We were expecting to accommodate 100 users per CPU, but we cannot support more than 60 users per CPU."
• "Why does system time go way up when I run application X?"
• "Every morning between 9:30 a.m. and 10:00 a.m. the system performs poorly."

In these situations, you must understand the problem and either eliminate the performance inhibitors or reset your expectations. Eliminating the performance inhibitors could involve:
• Adding more resources, such as memory or central processing units (CPUs)
• Reconfiguring existing resources, for example, tuning parameters or rewriting software
• Lessening the load
Debugging Transient Failures

DTrace was developed to provide a more efficient and cost-effective method of diagnosing transient failures. Historically, users have debugged transient failures using process-centric tools such as truss(1), pstack(1), or prstat(1M). These tools were not designed to debug systemic problems. The tools that were intended for debugging systemic problems, such as mdb(1) and Solaris Crash Analysis Tool (Solaris CAT), are designed for postmortem analysis.
Debugging Using Postmortem Analysis

You can use postmortem analysis to debug transient problems by inducing fatal failure during the period of transient failure. This technique has the following disadvantages:
• It requires inducing fatal failure, which nearly always results in more downtime than the transient failure
• It requires solving a dynamic problem from a static snapshot of the system's state
Debugging Using Invasive Techniques

If existing tools cannot find the root cause of a transient failure, then you must use more invasive techniques. Typically this means developing custom instrumentation for the failing user program, the kernel, or both. This can involve using the Trace Normal Form (TNF) facility. You then reproduce the problem using the instrumented binaries. This technique requires:
• Running the instrumented binaries in production
• Reproducing a transient problem in a development environment

Such invasive techniques are undesirable because they are slow, error-prone, and often ineffective. Relying on the existing static TNF trace points found in the kernel, which you can enable with the prex(1) command, is also unsatisfactory. The number of TNF trace points in the kernel is limited and the overhead is substantial.
DTrace Capabilities
The DTrace framework allows you to enable tens of thousands of tracing points called probes. When these instrumentation points are hit, you can display arbitrary data in the kernel (or user process). An example of a probe provided by the DTrace framework is entry into any kernel function. Information that you can display when this probe fires includes:
• Any argument to the function
• Any global variable in the kernel
• A nanosecond timestamp of when the function was called
• A stack trace to indicate what code called this function
• The process that was running when the function was called
• The thread that made the call to this function
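
As a minimal sketch of several items in this list, the following D clause reports context each time a kernel function is entered. The choice of kmem_alloc() as the probed function and the output format are assumptions made for illustration and are not taken from this guide. Because that function is called very often, the sketch exits after the first firing.

```d
/*
 * Illustrative sketch: report context when the kernel function
 * kmem_alloc() is entered (kmem_alloc is an assumed example target).
 */
fbt::kmem_alloc:entry
{
	/* arg0 is the first argument to the probed function. */
	printf("%s (pid %d, tid %d) arg0=%d at %d ns\n",
	    execname, pid, tid, arg0, timestamp);

	/* Kernel stack trace showing what code called this function. */
	stack();

	/* Stop after the first firing to keep the output short. */
	exit(0);
}
```

Saved in a file, a clause like this would typically be run by a suitably privileged user with dtrace -s followed by the file name.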
Using DTrace, you can explore all aspects of the Solaris 10 OS to:
• Understand how the software works
• Determine the root cause of performance problems
• Examine all layers of software sequentially from the user level to the kernel
• Track down the source of aberrant behavior
DTrace comes with powerful data management primitives to eliminate the need for postprocessing of gathered data. Unwanted data is pruned as close to the source as possible to avoid the overhead of generating and later filtering unwanted data. DTrace also provides a mechanism to trace during boot and to retrieve all traced data from a kernel crash dump.
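
The two ideas in this paragraph can be sketched in a few lines of D, assuming the syscall provider as the data source. The aggregation name calls is hypothetical. The predicate discards events at the probe itself, and the aggregation keeps only a running count per program instead of one record per event.

```d
/*
 * Illustrative sketch: count system calls per executable name.
 * The predicate drops events caused by the dtrace command itself
 * ($pid is the process ID of dtrace), so that data is never recorded.
 */
syscall:::entry
/pid != $pid/
{
	/* count() keeps one running total per key; no postprocessing is needed. */
	@calls[execname] = count();
}
```

When tracing ends (for example, after you press Control-C), dtrace prints the aggregation, one line per executable name.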
DTrace Architecture
DTrace helps you understand a software system by enabling you to dynamically modify the operating system kernel and user processes to record additional data that you specify at locations of interest called probes.
Probes and Probe Providers

A probe is a program location or activity (for example, every system clock tick) to which DTrace can bind a request to perform a set of actions, such as recording a stack trace, a timestamp, or the argument to a function.
How Probes Work

Probes are like programmable sensors inserted at strategic points of your Solaris 10 OS. You use DTrace to program the appropriate sensors to record the information that you want. As each probe fires, DTrace gathers the data from your probes and reports it back to you. If you do not specify any actions for a probe, DTrace simply records each time the probe fires and on what CPU.
DTrace provides tens of thousands of probes of various types. Probes are implemented by probe providers. A provider is a kernel module that enables a requested probe to fire when it is hit. An example of a provider is the function boundary tracing or fbt provider. It provides entry and return probes for almost every function in every kernel module.
How Probes Are Enabled

You define probes and actions using a programming language called D, which is based on the C programming language. Usually D programs are placed in script files ending in a .d suffix. The D programs are passed to a DTrace consumer. The primary, generic DTrace consumer is the dtrace(1M) command. The user-specified D program is compiled by the DTrace consumer into a form referred to as D Intermediate Format (DIF), which is then sent to the DTrace framework within the kernel for execution. There, the probes that are named within the D program are enabled, and the corresponding provider performs the instrumentation required to activate them.
DTrace Components
DTrace has the following components: probes, providers, consumers, and the D programming language. The entire DTrace framework resides in the kernel. Consumer programs access the DTrace framework through a well-defined application programming interface (API).
Probes
A probe has the following attributes:
• It is made available by a provider.
• It identifies the module and function that it instruments.
• It has a name.
These four attributes define a 4-tuple that uniquely identifies each probe:
provider:module:function:name
In addition, DTrace assigns a unique integer identifier to each probe.
Providers
A provider represents a methodology for instrumenting the system. Providers make probes available to the DTrace framework. A provider receives information from DTrace regarding when a probe is to be enabled and transfers control to DTrace when an enabled probe is hit. DTrace offers the following providers:
• The function boundary tracing (fbt) provider can dynamically trace the entry and return of every function in the kernel.
• The syscall provider can dynamically trace the entry and return of every Solaris system call.
• The lockstat provider can dynamically trace the kernel synchronization primitives to observe lock contention and hold times.
• The plockstat provider makes probes available for user-level synchronization primitives including lock contention and hold times.
• The sched provider can dynamically trace key scheduling events.
• The profile provider enables you to add a configurable-rate timer interrupt to the system.
• The dtrace provider enables pre-processing and post-processing (as well as D program error-processing) capabilities.
• The pid provider enables function boundary tracing within a process as well as tracing of any instruction in the virtual address space of the process.
• The statically defined tracing (sdt) provider creates probes at sites a programmer has explicitly designated in their own application.
• The vminfo provider makes available probes that correspond to the kernel's virtual memory statistics.
• The sysinfo provider makes available probes that correspond to the kernel's sys statistics.
• The proc provider makes available probes that pertain to process and thread creation and termination as well as signals.
• The mib provider makes available probes that correspond to counters in the Solaris management information bases (MIBs), which are used by the simple network management protocol (SNMP).
• The io provider makes available probes giving details related to disk input and output (I/O).
• The fpuinfo provider makes available probes that correspond to the simulation of floating point instructions on SPARC-based microprocessors.
Note: You should check the Solaris Dynamic Tracing Guide, part number 817-6223, regularly for the addition of any new DTrace providers.
Consumers
A DTrace consumer is a process that interacts with DTrace. There is one main DTrace consumer called dtrace(1M). It acts as a generic front-end to the DTrace facility. Most other consumers are rewrites of previously existing utilities such as lockstat(1M). There is no limit on the number of concurrent consumers. That is, many users can simultaneously run the dtrace(1M) command. DTrace handles the multiplexing.
D Programming Language
The D programming language enables you to specify probes of interest and bind actions to those probes. To do this, you construct scripts called D scripts. The nature of D scripts is similar to awk(1)'s pattern-action pairs. The D programming language also borrows heavily from the C programming language. Even if you have no experience with the C programming language or with awk(1), D programs are fairly easy to write and understand. Features of the D language include the following:
• Enables complete access to kernel C types, such as vnode_t
• Provides complete access to kernel static and global variables
• Provides complete support for American National Standards Institute (ANSI)-C operators
• Supports strings as a built-in type (unlike C, which uses the ambiguous char * or char[] types)
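As a minimal sketch of the language features listed above, the following D program touches a kernel global variable, a field of a kernel C structure, and a string value in a single clause. The backquote prefix is the D syntax for naming a kernel symbol; the choice of the kmem_flags variable is an assumption made only for illustration, and the BEGIN probe and the exit() action are used here simply to run the clause once and stop.

```d
#!/usr/sbin/dtrace -s

BEGIN
{
        /* Kernel global variable, named with a backquote prefix.
           kmem_flags is an assumed example; substitute any kernel global. */
        trace(`kmem_flags);

        /* Field of a kernel C structure: curthread is a kthread_t pointer. */
        trace(curthread->t_pri);

        /* Strings are a built-in type, so no char * handling is needed. */
        trace(execname);

        /* Stop tracing after this single probe firing. */
        exit(0);
}
```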
Architecture Summary
To summarize, the DTrace facility consists of user-level consumer programs such as dtrace(1M), providers packaged as kernel modules, and a library interface for the consumer programs to access the DTrace facility through the dtrace(7D) kernel driver.

Figure 1-1 shows the overall DTrace architecture.
[Figure 1-1 DTrace Architecture. Labels recoverable from the figure, from top to bottom: D program source files (a.d, b.d); DTrace consumers (intrstat(1M), plockstat(1M), dtrace(1M), lockstat(1M)); libdtrace(3LIB); dtrace(7D); the userland/kernel boundary; DTrace; DTrace providers (sysinfo, vminfo, syscall, fbt, io, profile, sched, ...).]
DTrace Tour
In this section you tour the DTrace facility and learn to perform the following tasks:
• List the available probes using various criteria:
    • Probes associated with a particular function
    • Probes associated with a particular module
    • Probes with a specific name
    • All probes from a specific provider
• Explain how to enable probes
• Explain default probe output
• Describe action statements
• Create a simple D script
Listing Probes
You can list all DTrace probes with the -l option of the dtrace(1M) command:
    # dtrace -l
       ID  PROVIDER   MODULE        FUNCTION NAME
        1    dtrace                          BEGIN
        2    dtrace                          END
        3    dtrace                          ERROR
        4   syscall                    nosys entry
        5   syscall                    nosys return
        6   syscall                    rexit entry
        7   syscall                    rexit return
        8   syscall                  forkall entry
        9   syscall                  forkall return
       10   syscall                     read entry
       11   syscall                     read return
       12   syscall                    write entry
       13   syscall                    write return
       14   syscall                     open entry
       15   syscall                     open return
    ...

You can use an additional option to list specific probes, as follows:
• In a specific function: -f function

    # dtrace -l -f cv_wait
       ID  PROVIDER   MODULE        FUNCTION NAME
    12921       fbt  genunix         cv_wait entry
    12922       fbt  genunix         cv_wait return

• In a specific module: -m module

    # dtrace -l -m sd
       ID  PROVIDER   MODULE        FUNCTION NAME
    17147       fbt       sd          sdopen entry
    17148       fbt       sd          sdopen return
    17149       fbt       sd         sdclose entry
    17150       fbt       sd         sdclose return
    17151       fbt       sd      sdstrategy entry
    17152       fbt       sd      sdstrategy return
    ...

• With a specific name: -n name

    # dtrace -l -n BEGIN
       ID  PROVIDER   MODULE        FUNCTION NAME
        1    dtrace                          BEGIN

• From a specific provider: -P provider

    # dtrace -l -P lockstat
       ID  PROVIDER   MODULE        FUNCTION NAME
      469  lockstat  genunix     mutex_enter adaptive-acquire
      470  lockstat  genunix     mutex_enter adaptive-block
      471  lockstat  genunix     mutex_enter adaptive-spin
      472  lockstat  genunix      mutex_exit adaptive-release
      473  lockstat  genunix   mutex_destroy adaptive-release
      474  lockstat  genunix  mutex_tryenter adaptive-acquire
    ...

• Realize that a specific function or module can be supported by many providers:

    # dtrace -l -f read
       ID  PROVIDER   MODULE        FUNCTION NAME
       10   syscall                     read entry
       11   syscall                     read return
     4036   sysinfo  genunix            read readch
     4040   sysinfo  genunix            read sysread
     7885       fbt  genunix            read entry
     7886       fbt  genunix            read return

The previous output shows that for each probe, the following is displayed:
• The probe's uniquely assigned probe ID (the probe ID is unique only within a given release or patch level of the Solaris OS)
• The provider name
• The module name (if applicable)
• The function name (if applicable)
• The probe name
Specifying Probes in DTrace

Probes are fully specified by separating each component of the 4-tuple with a colon:
provider:module:function:name
Empty components match anything. For example, fbt::alloc:entry specifies a probe with the following attributes:
• From the fbt provider
• In any module
• In the alloc function
• Named entry
Elements of the 4-tuple can be left off from the left-hand side. For example, open:entry matches probes from all providers and kernel modules that have a function name of open and a probe name of entry:

    # dtrace -l -n open:entry
       ID  PROVIDER   MODULE        FUNCTION NAME
       14   syscall                     open entry
     7386       fbt  genunix            open entry

Probe descriptions also support a pattern matching syntax similar to the shell File Name Generation syntax described in sh(1). The special characters *, ?, and [ ] are all supported. For example, the syscall::open*:entry probe description matches both the open and open64 system calls. The ? character represents any single character in the name, and the [ ] characters let you specify a choice of specific characters in the name.
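The three pattern characters can be sketched in one D program, shown here as a script file of the kind that dtrace(1M) accepts with its -s option. The fstat system call appears in the output elsewhere in this module; the assumption that an lstat system call also exists is made here only to give the ? and [ ] patterns more than one possible match.

```d
#!/usr/sbin/dtrace -s

/* The * character matches any string, including an empty one:
   open, open64, and any other system call whose name starts with "open". */
syscall::open*:entry
{
        trace(execname);
}

/* The ? character matches exactly one character:
   fstat (and lstat, if present), but not stat. */
syscall::?stat:entry
{
        trace(execname);
}

/* The [ ] characters match one character from the listed set:
   only fstat or lstat. */
syscall::[fl]stat:entry
{
        trace(execname);
}
```

When a pattern is given on the command line with the -n option instead, enclose it in single quotation marks so that the shell does not expand the special characters itself.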
Enabling Probes
Probes are enabled with the dtrace(1M) command by specifying them without the -l option. When enabled in this way, DTrace performs the default action when the probe fires. The default action indicates only that the probe fired. No other data is recorded. For example, the following code example enables every probe in the sd module:

    # dtrace -m sd
    CPU     ID  FUNCTION:NAME
      0  17329  sd_media_watch_cb:entry
      0  17330  sd_media_watch_cb:return
      0  17167  sdinfo:entry
      0  17168  sdinfo:return
      0  17151  sdstrategy:entry
      0  17152  sdstrategy:return
      0  17661  ddi_xbuf_qstrategy:entry
      0  17662  ddi_xbuf_qstrategy:return
      0  17649  xbuf_iostart:entry
      0  17341  sd_xbuf_strategy:entry
      0  17385  sd_xbuf_init:entry
      0  17386  sd_xbuf_init:return
      0  17342  sd_xbuf_strategy:return
      0  17177  sd_mapblockaddr_iostart:entry
      0  17178  sd_mapblockaddr_iostart:return
      0  17179  sd_pm_iostart:entry
      0  17365  sd_pm_entry:entry
      0  17366  sd_pm_entry:return
      0  17180  sd_pm_iostart:return
      0  17181  sd_core_iostart:entry
      0  17407  sd_add_buf_to_waitq:entry
    ...

As you can see from the output, the default action displays the CPU where the probe fired, the DTrace assigned probe ID, the function where the probe fired, and the probe name.
To enable probes provided by the syscall provider:

    # dtrace -P syscall
    dtrace: description 'syscall' matched 452 probes
    CPU     ID  FUNCTION:NAME
      0     99  ioctl:return
      0     98  ioctl:entry
      0     99  ioctl:return
      0     98  ioctl:entry
      0     99  ioctl:return
      0    234  sysconfig:entry
      0    235  sysconfig:return
      0    234  sysconfig:entry
      0    235  sysconfig:return
      0    168  sigaction:entry
      0    169  sigaction:return
      0    168  sigaction:entry
      0    169  sigaction:return
      0     98  ioctl:entry
      0     99  ioctl:return
      0    234  sysconfig:entry
      0    235  sysconfig:return
      0     38  brk:entry
      0     39  brk:return
    ...

To enable probes named zfod:

    # dtrace -n zfod
    dtrace: description 'zfod' matched 3 probes
    CPU     ID  FUNCTION:NAME
      0   4080  anon_zero:zfod
      0   4080  anon_zero:zfod
    ^C

To enable probes provided by the syscall provider in the open function, use the -n option with the fully specified 4-tuple syntax:

    # dtrace -n syscall::open:
    dtrace: description 'syscall::open:' matched 2 probes
    CPU     ID  FUNCTION:NAME
      0     14  open:entry
      0     15  open:return
      0     14  open:entry
      0     15  open:return
      0     14  open:entry
    ^C

To enable the entry probe in the clock function (which should fire every 1/100th second):

    # dtrace -n clock:entry
    dtrace: description 'clock:entry' matched 1 probe
    CPU     ID  FUNCTION:NAME
      0   4198  clock:entry
      0   4198  clock:entry
      0   4198  clock:entry
      0   4198  clock:entry
      0   4198  clock:entry
      0   4198  clock:entry
      0   4198  clock:entry
    ^C
DTrace Actions
Actions are user-programmable statements that are executed within the kernel by the DTrace virtual machine. The following are properties of actions:
• Actions are taken when a probe fires.
• Actions are completely programmable (in the D language).
• Most actions record some specified state in the system.
• Some actions can change the state of the system in a well-defined way.
    • These are called destructive actions.
    • Destructive actions are not allowed by default.
Many actions use expressions in the D language.
For now, you will use D expressions that consist only of built-in D variables. The following are some of the most useful built-in D variables. See Appendix B for a complete list of the D built-in variables.
• pid - The current process ID
• execname - The current executable name
• timestamp - The time since boot in nanoseconds
• curthread - A pointer to the kthread_t structure that represents the current thread
• probemod - The current probe's module name
• probefunc - The current probe's function name
• probename - The current probe's name
There are also many built-in functions that perform actions. Appendix A, "Actions and Subroutines," provides the complete list of D built-in functions. Start with the trace() function, which records the result of a D expression to the trace buffer. For example:
• trace(pid) traces the current process ID.
• trace(execname) traces the name of the current executable.
• trace(curthread->t_pri) traces the t_pri field of the current thread.
• trace(probefunc) traces the function name of the probe.
Actions are indicated by following a probe specification with { action }. For example:

    # dtrace -n 'readch {trace(pid)}'
    dtrace: description 'readch ' matched 4 probes
    CPU     ID  FUNCTION:NAME
      0   4036  read:readch      2040
      0   4036  read:readch      2177
      0   4036  read:readch      2177
      0   4036  read:readch      2040
      0   4036  read:readch      2181
      0   4036  read:readch      2181
      0   4036  read:readch         7
    ...

In the last example the process identification number (PID) appears in the last column of output.
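As a minimal sketch, several of the built-in variables listed earlier can be traced from one probe. The cv_wait function is taken from the probe listings earlier in this module purely as an illustration; any probe could be substituted. The program is written as a D script file, the form accepted by the -s option of dtrace(1M).

```d
#!/usr/sbin/dtrace -s

/* Illustrative probe only: the entry probe of the cv_wait kernel function. */
fbt::cv_wait:entry
{
        trace(execname);        /* name of the current executable */
        trace(pid);             /* current process ID */
        trace(probemod);        /* module name of the probe (genunix in the earlier listing) */
        trace(probefunc);       /* function name of the probe */
        trace(probename);       /* probe name (entry) */
        trace(timestamp);       /* nanoseconds since boot */
}
```

Each trace() call adds one more column to the line that is printed when the probe fires.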

The following example traces the executable name:

    # dtrace -m 'ufs {trace(execname)}'
    dtrace: description 'ufs ' matched 889 probes
    CPU     ID  FUNCTION:NAME
      0  14977  ufs_lookup:entry     ls
      0  15748  ufs_iaccess:entry    ls
      0  15749  ufs_iaccess:return   ls
      0  14978  ufs_lookup:return    ls
      0  14977  ufs_lookup:entry     ls
      0  15748  ufs_iaccess:entry    ls
      0  15749  ufs_iaccess:return   ls
      0  14978  ufs_lookup:return    ls
      0  14977  ufs_lookup:entry     ls
      0  15748  ufs_iaccess:entry    ls
      0  15749  ufs_iaccess:return   ls
      0  14978  ufs_lookup:return    ls
      0  14977  ufs_lookup:entry     ls
    ...
      0  15005  ufs_rwunlock:entry   utmpd
      0  15006  ufs_rwunlock:return  utmpd
      0  14963  ufs_close:entry      utmpd
      0  14964  ufs_close:return     utmpd
      0  15007  ufs_seek:entry       utmpd
      0  15008  ufs_seek:return      utmpd
      0  14963  ufs_close:entry      utmpd
    ^C
The next action example traces the time of entry to each system call:

    # dtrace -n 'syscall:::entry {trace(timestamp)}'
    dtrace: description 'syscall:::entry ' matched 226 probes
    CPU     ID  FUNCTION:NAME
      0    312  portfs:entry     157088479572713
      0     98  ioctl:entry      157088479637542
      0     98  ioctl:entry      157088479674339
      0    234  sysconfig:entry  157088479767243
      0    234  sysconfig:entry  157088479774432
      0    168  sigaction:entry  157088479993155
      0    168  sigaction:entry  157088480229390
      0     98  ioctl:entry      157088480318855
      0    234  sysconfig:entry  157088480398692
      0     38  brk:entry        157088480422525
      0     38  brk:entry        157088480438097
      0     98  ioctl:entry      157088480794819
      0     98  ioctl:entry      157088480959666
      0     98  ioctl:entry      157088480986498
      0     98  ioctl:entry      157088481033225
      0     60  fstat:entry      157088481050686
      0     60  fstat:entry      157088481074680
    ...

Multiple actions can be specified; they must be separated by semicolons:

    # dtrace -n 'zfod {trace(pid);trace(execname)}'
    dtrace: description 'zfod ' matched 3 probes
    CPU     ID  FUNCTION:NAME
      0   4080  anon_zero:zfod  2195  dtrace
      0   4080  anon_zero:zfod  2195  dtrace
      0   4080  anon_zero:zfod  2195  dtrace
      0   4080  anon_zero:zfod  2195  dtrace
      0   4080  anon_zero:zfod  2195  dtrace
      0   4080  anon_zero:zfod  2197  bash
      0   4080  anon_zero:zfod  2207  vi
      0   4080  anon_zero:zfod  2207  vi
    ...

The following example traces the executable name in every entry to the pagefault function:

    # dtrace -n 'fbt::pagefault:entry {trace(execname)}'
    dtrace: description 'fbt::pagefault:entry ' matched 1 probe
    CPU     ID  FUNCTION:NAME
      0   2407  pagefault:entry  dtrace
      0   2407  pagefault:entry  dtrace
      0   2407  pagefault:entry  dtrace
      0   2407  pagefault:entry  sh
      0   2407  pagefault:entry  sh
      0   2407  pagefault:entry  sh
      0   2407  pagefault:entry  sh
      0   2407  pagefault:entry  sh
    ...
Writing D Scripts
Complicated DTrace enablings become difficult to manage on the command line. The dtrace(1M) command supports scripts, specified with the -s option. Alternatively, you can create executable DTrace interpreter files. Interpreter files always begin with:

    #!/usr/sbin/dtrace -s
Executable D Scripts
For example, you can write a script to trace the executable name upon entry to each system call as follows:

    # cat syscall.d
    syscall:::entry
    {
            trace(execname);
    }
By convention, D scripts end with a .d suffix. You can run this D script as follows:

    # dtrace -s syscall.d
    dtrace: script 'syscall.d' matched 226 probes
    CPU     ID  FUNCTION:NAME
      0    312  pollsys:entry    java
      0     98  ioctl:entry      dtrace
      0     98  ioctl:entry      dtrace
      0    234  sysconfig:entry  dtrace
      0    234  sysconfig:entry  dtrace
      0    168  sigaction:entry  dtrace
      0    168  sigaction:entry  dtrace
      0     98  ioctl:entry      dtrace
      0    234  sysconfig:entry  dtrace
      0     38  brk:entry        dtrace
    ^C

If you give the syscall.d file execute permission and add a first line to invoke the interpreter, you can run the script by entering its name on the command line as follows:

    # cat syscall.d
    #!/usr/sbin/dtrace -s
    syscall:::entry
    {
            trace(execname);
    }
    # chmod +x syscall.d
    # ls -l syscall.d
    -rwxr-xr-x   1 root     other         62 May 12 11:30 syscall.d
    # ./syscall.d
    dtrace: script './syscall.d' matched 226 probes
    CPU     ID  FUNCTION:NAME
      0     98  ioctl:entry      java
      0     98  ioctl:entry      java
      0    312  pollsys:entry    java
      0    312  pollsys:entry    java
      0    312  pollsys:entry    java
      0     98  ioctl:entry      dtrace
      0     98  ioctl:entry      dtrace
      0    234  sysconfig:entry  dtrace
      0    234  sysconfig:entry  dtrace
D Literal Strings
The D language supports literal strings that you can use with the trace function as follows:

    # cat string.d
    #!/usr/sbin/dtrace -s
    fbt::bdev_strategy:entry
    {
            trace(execname);
            trace(" is initiating a disk I/O\n");
    }

The \n at the end of the literal string produces a new line. To run this script, enter the following:

    # dtrace -s string.d
    dtrace: script 'string.d' matched 1 probe
    CPU     ID  FUNCTION:NAME
      0   9215  bdev_strategy:entry  bash is initiating a disk I/O
      0   9215  bdev_strategy:entry  vi is initiating a disk I/O
      0   9215  bdev_strategy:entry  vi is initiating a disk I/O
      0   9215  bdev_strategy:entry  vi is initiating a disk I/O
      0   9215  bdev_strategy:entry  sched is initiating a disk I/O

The quiet mode option, -q, in dtrace(1M) tells DTrace to record only the actions explicitly stated. This option suppresses the default output normally produced by the dtrace command. The following example shows the use of the -q option on the string.d script:

    # dtrace -q -s string.d
    ls is initiating a disk I/O
    cat is initiating a disk I/O
    fsflush is initiating a disk I/O
    vi is initiating a disk I/O
    vi is initiating a disk I/O
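Quiet mode can also be requested from inside the script rather than on the command line. The following sketch is a variant of string.d that uses the #pragma D option quiet directive, which is part of the D language but is not otherwise covered in this module; the rest of the script is unchanged.

```d
#!/usr/sbin/dtrace -s

/* Equivalent to passing -q on the dtrace(1M) command line. */
#pragma D option quiet

fbt::bdev_strategy:entry
{
        trace(execname);
        trace(" is initiating a disk I/O\n");
}
```

With the directive in place, running the script as an executable file produces only the traced strings, without the CPU, ID, and FUNCTION:NAME columns.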
The BEGIN and END Probes

The simple dtrace provider has only three probes. They are BEGIN, END, and ERROR. The BEGIN probe fires before all others and performs pre-processing steps. For example, it enables you to initialize variables, as well as to display headings for output that is displayed by other actions that occur later. The END probe fires after all other probes have fired and enables you to perform post-processing. The ERROR probe fires when there are any runtime errors in your D programs. The following example shows a simple use of the BEGIN and END probes of the dtrace provider:

    # cat beginEnd.d
    #!/usr/sbin/dtrace -s
    BEGIN
    {
            trace("This is a heading\n");
    }
    END
    {
            trace("This should appear at the END\n");
    }
    # ./beginEnd.d
    dtrace: script './beginEnd.d' matched 2 probes
    CPU     ID  FUNCTION:NAME
      0      1  :BEGIN  This is a heading
    ^C
      0      2  :END    This should appear at the END
    # dtrace -qs beginEnd.d
    This is a heading
    ^C
    This should appear at the END

Note: The END probe does not fire until you interrupt (^C) the dtrace command.
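A script can also end tracing on its own, in which case the END probe fires without an interrupt. The following sketch assumes two features that are not described in this module: the tick-5sec probe of the profile provider, which fires once every five seconds, and the exit() action, which stops tracing. The five-second interval and the message text are arbitrary choices made for illustration.

```d
#!/usr/sbin/dtrace -qs

BEGIN
{
        trace("Tracing for five seconds\n");
}

/* Assumed timer probe from the profile provider. */
tick-5sec
{
        /* Stop tracing; this causes the END probe to fire. */
        exit(0);
}

END
{
        trace("This should appear at the END\n");
}
```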
Module 2
Using DTrace
Objectives
• Describe the DTrace performance monitoring capabilities
• Examine performance problems using the vminfo provider
• Examine performance problems using the sysinfo provider
• Examine performance problems using the io provider
• Use DTrace to obtain information about system calls
• Create D scripts that use arguments
Relevance
Discussion: The following questions are relevant to understanding how to use DTrace:
• What performance monitoring tools exist in the Solaris 10 OS?
• Would it be useful to know which process is making which system calls?
• What advantage does the ability to pass arguments to a D script provide?

• Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. "Dynamic Instrumentation of Production Systems." Paper presented at the 2004 USENIX Conference.
• BigAdmin System Administration Portal [http://www.sun.com/bigadmin/content/dtrace].
• Sun Microsystems, Inc. Solaris Dynamic Tracing Guide (Beta), part number 817-6223-10.
• The /usr/demo/dtrace directory contains all of the sample scripts from the Solaris Dynamic Tracing Guide.
• dtrace(1M) manual page in the Solaris 10 OS manual pages, Solaris 10 Reference Manual Collection.
DTrace Performance Monitoring Capabilities

A number of the DTrace providers implement probes that correspond to existing Solaris OS performance monitoring tools:
• The vminfo provider implements probes that correspond to the vmstat(1M) tool.
• The sysinfo provider implements probes that correspond to the mpstat(1M) tool.
• The io provider implements probes that correspond to the iostat(1M) tool.
In addition, the syscall provider implements probes that correspond to the truss(1) command.
Features of the DTrace Performance Monitoring Capabilities

Using the DTrace facility, you can extract the same information that the bundled tools provide, with significant added flexibility. DTrace enables you to gather only the specific information you need to diagnose the aberrant behavior. It also provides additional related information such as process and thread identification, stack traces, and other arbitrary kernel information available at the time the probes fire.
Aggregations
Aggregated data is more useful than individual data points in answering performance-related questions. For example, if you want to know the number of page faults by process, you do not necessarily care about each individual page fault. Rather, you want a table that lists the process names and the total number of page faults. DTrace provides several built-in aggregating functions. An aggregating function has this property: if it is applied to subsets of a collection of gathered data and then applied again to the results, it returns the same result as it does when applied to the whole collection. Examples of aggregating functions are count(), sum(), min(), and max(). A median function would not be considered an aggregating function because it lacks this property.
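The property stated above can be written compactly. In the following sketch, the symbols f (the aggregating function) and x_0 through x_n (the subsets of the gathered data) are introduced only for illustration:

```latex
f\bigl(f(x_0) \cup f(x_1) \cup \dots \cup f(x_n)\bigr) = f(x_0 \cup x_1 \cup \dots \cup x_n)
```

As a worked example, take the subsets {1, 2, 3} and {10}. Summing each subset gives 6 and 10, and summing those results gives 16, which equals the sum of all four values. The median fails the same test: the medians of the subsets are 2 and 10, and the median of those two results is 6, whereas the median of {1, 2, 3, 10} is 2.5.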

DTrace is not required to store the entire set of data items for aggregations; it keeps a running result, needing only the current intermediate result and the new element. Intermediate results are kept per central processing unit (CPU), which enables a scalable implementation because no locks are required.
DTrace Aggregation Syntax

The general form of a DTrace aggregation is:
@name[ keys ] = aggfunc( args );

These variables are defined as follows:
• name - The name of the aggregation, which is preceded by the @ character
• keys - A comma-separated list of D expressions
• aggfunc - One of the DTrace aggregating functions
• args - A comma-separated list of arguments appropriate to the aggregating function
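The general form above can be sketched with more than one key. In this illustrative script, the aggregation name calls is arbitrary, the keys are the built-in variables execname and probefunc, and the aggregating function is count(), which takes no arguments:

```d
#!/usr/sbin/dtrace -s

syscall:::entry
{
        /* One counter per (executable name, system call name) pair. */
        @calls[execname, probefunc] = count();
}
```

When the dtrace command is interrupted, each line of the summary shows one combination of executable and system call together with the number of times that combination was seen.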
DTrace Aggregating Functions

Table 2-1 lists the DTrace aggregating functions.

Table 2-1 DTrace Aggregating Functions

Function Name   Arguments                          Result
count           none                               The number of times called.
sum             scalar expression                  The total value of the specified expressions.
avg             scalar expression                  The arithmetic average (mean) of the specified expressions.
min             scalar expression                  The smallest value of the specified expressions.
max             scalar expression                  The largest value of the specified expressions.
lquantize       scalar expression, lower bound,    A linear frequency distribution, sized by the specified range, of the values of the specified expression. Increments the value in the highest bucket that is less than or equal to the specified expression.
                upper bound, step value
quantize        scalar expression                  A power-of-two frequency distribution of the values of the specified expression. Increments the value in the highest power-of-two bucket that is less than or equal to the specified expression.
Example Use of Aggregating Function

In the following example, the count aggregating function is used to count the number of write(2) system calls per process:

    # cat writes.d
    #!/usr/sbin/dtrace -s
    syscall::write:entry
    {
            @numWrites[execname] = count();
    }
    # ./writes.d
    dtrace: script 'writes.d' matched 1 probe
    ^C
      dtrace            1
      date              1
      bash              3
      grep             20
      file            197
      ls              201

Note: No data is output from the aggregation until dtrace(1M) is terminated. The output data is a summary up to that point.

Arguments Supplied by Providers

The syscall provider gives you access to a system call's arguments, using the syntax arg0, arg1, arg2, and so on, for the function's first, second, third, and subsequent arguments. These argument values are of type int64_t. You can also refer to the correctly typed arguments through the args[] array: args[0], args[1], and so on. The following example displays the average write size per process:

    # cat writes2.d
    #!/usr/sbin/dtrace -s
    syscall::write:entry
    {
            @avgSize[execname] = avg(arg2);
    }
    # ./writes2.d
    dtrace: script 'writes2.d' matched 1 probe
    ^C
      dtrace            1
      bash             27
      date             29
      file             37
      grep             60
      ls               68
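An average hides the shape of the data, so the same argument can also be fed to the other aggregating functions in Table 2-1. The following sketch records the total, a power-of-two distribution, and a linear distribution of the size requested in each write(2) call. The aggregation names and the lquantize() range of 0 to 1024 in steps of 128 are arbitrary choices made for illustration.

```d
#!/usr/sbin/dtrace -s

syscall::write:entry
{
        /* arg2 is the third argument to write(2): the requested byte count. */
        @total[execname] = sum(arg2);

        /* Power-of-two frequency distribution of the requested sizes. */
        @dist[execname] = quantize(arg2);

        /* Linear distribution: lower bound 0, upper bound 1024, step 128. */
        @lin[execname] = lquantize(arg2, 0, 1024, 128);
}
```

All three aggregations are printed, one after another, when the dtrace command is terminated.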
Examining Performance Problems Using the vminfo Provider

The vminfo provider makes available probes from the virtual memory (vm) kernel statistics (kstat) kept by the kernel kstat facility. You can examine any unexplainable behavior observed from the vm-specific output of the vmstat(1M) command using this DTrace provider. A probe provided by the vminfo provider fires immediately before the corresponding vm kstat value is incremented. To display both the names and the current values (counts) of the vm named kstat, you can use the kstat(1M) command as shown in the following command example.

    # kstat -n vm
    module: cpu                     instance: 0
    name:   vm                      class:    misc
            anonfree        0
            anonpgin        4
            anonpgout       0
            as_fault        157771
            cow_fault       34207
            crtime          0.178610697
            dfree           56
            execfree        0
            execpgin        3646
            execpgout       0
            fsfree          56
            fspgin          16257
            fspgout         57
            hat_fault       0
            kernel_asflt    0
            maj_fault       6743
            pgfrec          34215
            pgin            9188
            pgout           36
            pgpgin          19907
            pgpgout         57
            pgrec           34216
            pgrrun          4
            pgswapin        0
            pgswapout       0
            prot_fault      39794
            rev             0
            scan            28668
            snaptime        349429.087071013
            softlock        165
            swapin          0
            swapout         0
            zfod            12835
The vminfo Probes

Table 2-2 describes the vminfo probes.

Table 2-2 The vminfo Probes

Probe Name    Description
anonfree      Probe that fires when an unmodified anonymous page is freed as part of paging activity. Anonymous pages are those that are not associated with a file; memory containing such pages includes heap memory, stack memory, or memory obtained by explicitly mapping zero(7D).
anonpgin      Probe that fires when an anonymous page is paged in from a swap device.
anonpgout     Probe that fires when a modified anonymous page is paged out to a swap device.
as_fault      Probe that fires when a fault is taken on a page and the fault is neither a protection fault nor a copy-on-write fault.
cow_fault     Probe that fires when a copy-on-write fault is taken on a page. The arg0 argument contains the number of pages that are created as a result of the copy-on-write.
dfree         Probe that fires when a page is freed as a result of paging activity. When dfree fires, exactly one of the anonfree, execfree, or fsfree probes also subsequently fires.
execfree      Probe that fires when an unmodified executable page is freed as a result of paging activity.
execpgin      Probe that fires when an executable page is paged in from the backing store.
execpgout     Probe that fires when a modified executable page is paged out to the backing store. Most paging of executable pages occurs in terms of the execfree probe; the execpgout probe can only fire if an executable page is modified in memory, an uncommon occurrence in most systems.
fsfree        Probe that fires when an unmodified file system data page is freed as part of paging activity.
fspgin        Probe that fires when a file system page is paged in from the backing store.
fspgout       Probe that fires when a file system page is paged out to the backing store.
kernel_asflt  Probe that fires when a page fault is taken by the kernel on a page in its own address space. When the kernel_asflt probe fires, it is immediately preceded by a firing of the as_fault probe.
maj_fault     Probe that fires when a page fault is taken that results in input/output (I/O) from a backing store or swap device. Whenever maj_fault fires, it is immediately preceded by a firing of the pgin probe.
pgfrec        Probe that fires when a page is reclaimed from the free page list.
pgin          Probe that fires when a page is paged in from the backing store or from a swap device. This differs from the maj_fault probe in that the maj_fault probe only fires when a page is paged in as a result of a page fault; the pgin probe fires when a page is paged in, regardless of the reason.
pgout         Probe that fires when a page is paged out to the backing store or to a swap device.
pgpgin        Probe that fires when a page is paged in from the backing store or from a swap device. The only difference between the pgpgin probe and the pgin probe is that the pgpgin probe contains the number of pages paged in as the arg0 argument. (The pgin probe always contains 1 in the arg0 argument.)
pgpgout       Probe that fires when a page is paged out to the backing store or to a swap device. The only difference between the pgpgout probe and the pgout probe is that the pgpgout probe contains the number of pages paged out as the arg0 argument. (The pgout probe always contains 1 in the arg0 argument.)
pgrec         Probe that fires when a page is reclaimed.
pgrrun        Probe that fires when the pager is scheduled.
pgswapin      Probe that fires when a process is swapped in.
pgswapout     Probe that fires when a process is swapped out.
prot_fault    Probe that fires when a page fault is taken due to a protection violation.
rev           Probe that fires when the page daemon begins a new revolution through all pages.
scan          Probe that fires when the page daemon examines a page.
softlock      Probe that fires when a page is faulted as a part of placing a software lock on the page.
swapin        Probe that fires when a swapped-out process is swapped back in.
swapout       Probe that fires when a process is swapped out.
zfod          Probe that fires when a zero-filled page is created on demand.
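Because the pgpgin probe carries the number of pages paged in as its arg0 argument, it can be combined with the sum() aggregating function. The following sketch totals the pages paged in on behalf of each executable; the aggregation name pages is an arbitrary choice.

```d
#!/usr/sbin/dtrace -s

vminfo:::pgpgin
{
        /* arg0 is the number of pages paged in by this operation (see Table 2-2). */
        @pages[execname] = sum(arg0);
}
```

Counting firings of the pgin probe instead would report the number of page-in operations rather than the number of pages.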
Finding the Source of Page Faults Using vminfo Probes

Consider the following example output, obtained by running the vmstat command.
    # vmstat 5
     kthr      memory            page            disk          faults      cpu
     r b w   swap   free  re  mf   pi po fr de sr s0  s2 s1 --   in   sy   cs us sy  id
     0 0 0 648560 437016   3  11   13  0  0  0  8  0   1  0  0  406   42   50  0  0 100
     0 0 0 598912 396136   0  11   27  0  0  0  0  0   0  0  0  615  113   67  0  0 100
     0 0 0 598888 396112   0   1    0  0  0  0  0  0   0  0  0  604   69   47  0  0 100
     0 0 0 598864 396088   0   1    0  0  0  0  0  0   0  0  0  616   69   72  0  0 100
     0 0 0 598864 396088   0   0    0  0  0  0  0  0   0  0  0  619   73   89  0  0 100
     0 1 0 598104 393456   4  45 3588  0  0  0  0  0 474  0  0 2014 5138 1013  3 17  79
     0 0 0 595224 381544   0   2 5273  0  0  0  0  0 698  0  0 2593 7545 1448  3 31  66
     0 0 0 592024 368832   0   1 5509  0  0  0  0  0 725  0  0 2674 7840 1503  3 26  71
     0 0 0 588792 362640   1   3 3679  0  0  0  0  0 485  0  0 2009 5259 1027  3 20  77
     0 0 0 587984 361848   0   3    4  0  0  0  0  0   0  0  0  605   80   70  0  0 100
     0 0 0 587960 361800   0   4   20  0  0  0  0  0   2  0  0  624   74   91  0  0 100
     0 0 0 587944 361768   0   1    0  0  0  0  0  0   0  0  0  614   76   78  0  0 100
     0 0 0 587920 361744   0   1    0  0  0  0  0  0   0  0  0  616   69   80  0  0 100
     0 0 0 587848 361672   0   1    0  0  0  0  0  0  18  0  0  689   69   69  0  0 100
     0 0 0 587832 361656   0   1    0  0  0  0  0  0   0  0  0  611   74   67  0  0 100
     0 0 0 587808 361632   0   5    0  0  0  0  0  0   0  0  0  611   71   66  0  0 100
     0 0 0 587784 361608  40 193  844  0  0  0  0  0 107  0  0  953  905  260  3  5  92
     0 0 0 588184 362576   0   1    0  0  0  0  0  0   0  0  0  611   69   71  0  0 100

Here the pi column denotes the number of kilobytes paged in per second.
Executable Causing Page Faults

The vminfo provider makes it easy to discover more about the source of these page-ins. The following example uses an anonymous aggregation:

    # dtrace -n 'pgin {@[execname] = count()}'
    dtrace: description 'pgin ' matched 1 probe
    ^C

      utmpd                2
      in.routed            2
      init                 2
      snmpd                5
      automountd           5
      vi                   5
      vmstat              17
      sh                  23
      grep                33
      dtrace              35
      bash                62
      file               198
      find              4551

This output shows that the find command is responsible for most of the page-ins. For a more complete picture of the virtual memory behavior of the find command, you can enable all vminfo probes. Before doing this, however, you need a filtering capability of DTrace called a predicate.
Predicates
A D program consists of a set of probe clauses. A probe clause has the following general form:

    probe descriptions
    / predicate /
    {
        action statements
    }

Predicates are D expressions enclosed in slashes (/ /) that are evaluated at probe firing time to determine whether the associated actions should be executed. If the D expression evaluates to zero, it is false; if it evaluates to non-zero, it is true. Predicates are optional, but when present they must be placed between the probe description and the action statements.
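
As an illustration of the clause form described above, the following sketch (an example written for this discussion, with the aggregation name @nonzero chosen arbitrarily) shows a string comparison predicate and a purely numeric predicate on the same probe.

```d
#!/usr/sbin/dtrace -s

/* String predicate: count page-ins per executable, skipping dtrace itself. */
vminfo:::pgin
/execname != "dtrace"/
{
        @[execname] = count();
}

/* Numeric predicate: pid is non-zero (true) for every process except pid 0. */
vminfo:::pgin
/pid/
{
        @nonzero["page-ins outside pid 0"] = count();
}
```

The second clause relies only on the zero/non-zero rule: no comparison operator is needed because the value of pid itself is used as the truth value.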

Details About the Executable Causing Page Faults

The following example examines the system's detailed virtual memory behavior while the find command runs:

    # cat find.d
    #!/usr/sbin/dtrace -s

    vminfo:::
    /execname == "find"/
    {
            @[probename] = count();
    }

Before running this D program, run a find command in the background while another utility uses up a substantial portion of the system's memory, as shown in the following example.

    # (sleep 10 ; find / -name fubar & mkfile 300m /tmp/junk)&
    [1] 840
    # ps
       PID TTY      TIME CMD
       615 pts/2    0:00 sh
       841 pts/2    0:00 sleep
       625 pts/2    0:00 bash
       840 pts/2    0:00 bash
       842 pts/2    0:00 ps
    # ps
       PID TTY      TIME CMD
       615 pts/2    0:00 sh
       843 pts/2    0:02 find
       625 pts/2    0:00 bash
       840 pts/2    0:00 bash
       845 pts/2    0:00 ps
       844 pts/2    0:02 mkfile
    # ps
       PID TTY      TIME CMD
       615 pts/2    0:00 sh
       843 pts/2    0:08 find
       625 pts/2    0:00 bash
       846 pts/2    0:00 ps
    [1]+  Done    ( sleep 10 ; find / -name fubar & mkfile 300m /tmp/junk )
    # ps
       PID TTY      TIME CMD
       615 pts/2    0:00 sh
       847 pts/2    0:00 ps
       625 pts/2    0:00 bash

The following dtrace command was started in another terminal window immediately after the above command group was started in the background.

    # dtrace -s find.d
    dtrace: script 'find.d' matched 44 probes
    ^C

      prot_fault           2
      cow_fault            8
      softlock            11
      execpgin            15
      kernel_asflt        40
      zfod                52
      as_fault           170
      pgrec             5417
      pgfrec            5417
      maj_fault        18068
      fspgin           18103
      pgpgin           18118
      pgin             18118

You might wonder why, with such a large memory load, scans do not show up in the output of the dtrace command. This is because scans are performed by the pageout daemon, not by the find user process. The following example shows this behavior.

    # cat mem.d
    #!/usr/sbin/dtrace -s

    vminfo:::
    {
            @vm[execname, probename] = count();
    }

    END
    {
            printa("%16s\t%16s\t%@d\n", @vm);
    }

    # dtrace -qs mem.d
    ^C
               sleep        prot_fault  1
                  rm        prot_fault  1
             pageout               rev  1
              dtrace            pgfrec  1
                bash      kernel_asflt  1
           in.routed          anonpgin  1
              mkfile        prot_fault  1
                find        prot_fault  1
              dtrace             pgrec  1
              mkfile          execpgin  2
              mkfile      kernel_asflt  2
              vmstat        prot_fault  2
                  rm              zfod  3
                find          execpgin  3
               sleep              zfod  3
              mkfile              zfod  3
            sendmail          anonpgin  3
              mkfile         cow_fault  4
                  rm         cow_fault  4
                bash          anonpgin  4
                  rm         maj_fault  4
            sendmail            pgfrec  4
               sleep         cow_fault  4
                find         cow_fault  4
            sendmail             pgrec  4
    ...
                bash             pgrec  205
             pageout           fspgout  293
             pageout         anonpgout  293
             pageout           pgpgout  293
             pageout             pgout  293
             pageout         execpgout  293
             pageout             pgrec  293
             pageout          anonfree  360
             pageout          execfree  510
                bash          as_fault  519
             pageout            fsfree  519
               sched             dfree  523
               sched             pgrec  523
               sched             pgout  523
               sched           pgpgout  523
               sched         anonpgout  523
               sched          anonfree  523
               sched         execpgout  523
               sched          execfree  523
             pageout             dfree  803
                  rm             pgrec  1388
                  rm            pgfrec  1388
                find         maj_fault  5067
                find            fspgin  5085
                find              pgin  5088
                find            pgpgin  5088
             pageout              scan  78852

The printa() built-in formatting function gives you increased control over the output of an aggregation. For example, consider the following code line:

    { printa("%16s\t%16s\t%@d\n", @vm); }

It provides these formatting instructions:

- %16s\t%16s prints the first and second elements of the aggregation key, each in a 16-character-wide column (right justified).
- \t outputs a <Tab>.
- %@d prints the aggregation value as a decimal number.

Note: Appendix A provides more details on the format letters available to the printa() function and the more general printf() function (which resembles the printf(3C) function from the Standard C Library).
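
The same aggregation can be printed with a different layout by changing only the format string. The following is a sketch, assuming you prefer left-justified key columns and a column heading; the heading text is made up for this example.

```d
#!/usr/sbin/dtrace -qs

vminfo:::
{
        @vm[execname, probename] = count();
}

END
{
        /* Heading line uses the same widths as the printa() format below. */
        printf("%-16s %-16s %8s\n", "EXECNAME", "PROBE", "COUNT");

        /* %-16s left-justifies each key; %@8d prints the value 8 wide. */
        printa("%-16s %-16s %@8d\n", @vm);
}
```

The - flag and the field width behave as they do in printf(3C); the @ flag marks the conversion that receives the aggregation value.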

Examining Performance Problems Using the sysinfo Provider

The sysinfo provider makes available probes that correspond to the sys kernel statistics. Because these statistics provide the input for system monitoring utilities such as mpstat(1M), the sysinfo provider enables quick exploration of observed aberrant behavior. The sysinfo provider probes fire immediately before the sys named kstat is incremented. The following example displays the sys named kstat.

    # kstat -n sys
    module: cpu                             instance: 0
    name:   sys                             class:    misc
            bawrite                         112
            bread                           6359
            bwrite                          1401
            canch                           374
            cpu_ticks_idle                  2782331
            cpu_ticks_kernel                46571
            cpu_ticks_user                  12187
            cpu_ticks_wait                  30197
            cpumigrate                      0
    ...
            syscall                         3991217
            sysexec                         1088
            sysfork                         1043
            sysread                         131334
            sysvfork                        47
            syswrite                        676775
            trap                            266286
            ufsdirblk                       1027383
            ufsiget                         1086164
            ufsinopage                      873613
            ufsipage                        2
            wait_ticks_io                   30197
            writech                         5144172931
            xcalls                          0
            xmtint                          0

The sysinfo Probes

Table 2-3 describes the sysinfo probes.

Table 2-3 The sysinfo Probes

Probe Name        Description
bawrite           Probe that fires when a buffer is about to be asynchronously written out to a device.
bread             Probe that fires when a buffer is physically read from a device. The bread probe fires after the buffer has been requested from the device, but before blocking pending its completion.
bwrite            Probe that fires when a buffer is about to be written out to a device synchronously or asynchronously.
cpu_ticks_idle    Probe that fires when the periodic system clock has determined that a CPU is idle. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; you must examine the cpu_t argument (arg2) to determine the CPU that has been deemed idle.
cpu_ticks_kernel  Probe that fires when the periodic system clock has determined that a CPU is executing in the kernel. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; you must examine the cpu_t argument (arg2) to determine the CPU that has been deemed to be executing in the kernel.
cpu_ticks_user    Probe that fires when the periodic system clock has determined that a CPU is executing in user mode. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; you must examine the cpu_t argument (arg2) to determine the CPU that has been deemed to be running in user mode.
cpu_ticks_wait    Probe that fires when the periodic system clock has determined that a CPU is otherwise idle, but on which some threads are waiting for I/O. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; you must examine the cpu_t argument (arg2) to determine the CPU that has been deemed waiting on I/O.
idlethread        Probe that fires when a CPU enters the idle loop.
intrblk           Probe that fires when an interrupt thread blocks.

Table 2-3 The sysinfo Probes (Continued)

Probe Name        Description
inv_swtch         Probe that fires when a running thread is forced to involuntarily give up the CPU.
lread             Probe that fires when a buffer is logically read from a device.
lwrite            Probe that fires when a buffer is logically written to a device.
modload           Probe that fires when a kernel module is loaded.
modunload         Probe that fires when a kernel module is unloaded.
msg               Probe that fires when a msgsnd(2) or msgrcv(2) system call is made, but before the message queue operations have been performed.
mutex_adenters    Probe that fires when an attempt is made to acquire an owned adaptive lock. If this probe fires, one of the lockstat provider probes (adaptive-block or adaptive-spin) also fires.
namei             Probe that fires when a name lookup is attempted in the file system.
nthreads          Probe that fires when a thread is created.
phread            Probe that fires when a raw I/O read is about to be performed.
phwrite           Probe that fires when a raw I/O write is about to be performed.
procovf           Probe that fires when a new process cannot be created because the system is out of process table entries.
pswitch           Probe that fires when a CPU switches from executing one thread to executing another.
readch            Probe that fires after each successful read, but before control is returned to the thread performing the read. A read can occur through the read(2), readv(2), or pread(2) system calls. The arg0 argument contains the number of bytes that were successfully read.
rw_rdfails        Probe that fires when an attempt is made to read-lock a readers/writer lock when the lock is either held by a writer, or desired by a writer. If this probe fires, the lockstat provider's rw-block probe also fires.
rw_wrfails        Probe that fires when an attempt is made to write-lock a readers/writer lock when the lock is held either by some number of readers or by another writer. If this probe fires, the lockstat provider's rw-block probe also fires.
sema              Probe that fires when a semop(2) system call is made, but before any semaphore operations have been performed.
Table 2-3 The sysinfo Probes (Continued)

Probe Name        Description
sysexec           Probe that fires when an exec(2) system call is made.
sysfork           Probe that fires when a fork(2) system call is made.
sysread           Probe that fires when a read(2), readv(2), or pread(2) system call is made.
sysvfork          Probe that fires when a vfork(2) system call is made.
syswrite          Probe that fires when a write(2), writev(2), or pwrite(2) system call is made.
trap              Probe that fires when a processor trap occurs. Note that some processors (in particular, UltraSPARC variants) handle some lightweight traps through a mechanism that does not cause this probe to fire.
ufsdirblk         Probe that fires when a directory block is read from the UFS file system. See the ufs(7FS) man page for details on UFS.
ufsiget           Probe that fires when an inode is retrieved. See the ufs(7FS) man page for details on UFS.
ufsinopage        Probe that fires after an in-core inode without any associated data pages has been made available for reuse. See the ufs(7FS) man page for details on UFS.
ufsipage          Probe that fires after an in-core inode with associated data pages has been made available for reuse and therefore after the associated data pages have been flushed to disk. See the ufs(7FS) man page for details on UFS.
wait_ticks_io     Probe that fires when the periodic system clock has determined that a CPU is otherwise idle, but on which some threads are waiting for I/O. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; you must examine the cpu_t argument (arg2) to determine the CPU that has been deemed waiting on I/O. Note that there is no semantic difference between wait_ticks_io and cpu_ticks_wait; wait_ticks_io exists purely for historical reasons.
writech           Probe that fires after each successful write, but before control is returned to the thread performing the write. A write can occur through the write(2), writev(2), or pwrite(2) system calls. The arg0 argument contains the number of bytes that were successfully written.

Table 2-3 The sysinfo Probes (Continued)

Probe Name        Description
xcalls            Probe that fires when a cross-call is about to be made. A cross-call is the operating system's mechanism for one CPU to request immediate work from another.

Using the quantize Aggregation Function With the sysinfo Probes

The quantize aggregation function displays a power-of-two frequency distribution bar graph of its argument. The following example shows how you can determine the size of reads being performed by all processes over a 10-second period. The arg0 argument for the sysinfo probes states the amount by which the statistic is incremented; it is 1 for most sysinfo probes. Two exceptions are the readch and writech probes, for which the arg0 argument is set to the actual number of bytes read or written, respectively.

    # cat read.d
    #!/usr/sbin/dtrace -s
    sysinfo:::readch
    {
            @[execname] = quantize(arg0);
    }

    tick-10sec
    {
            exit(0);
    }
    # dtrace -s read.d
    dtrace: script 'read.d' matched 5 probes
    CPU     ID                    FUNCTION:NAME
      0  36754                      :tick-10sec

      bash
               value  ------------- Distribution ------------- count
                   0 |                                         0
                   1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13
                   2 |                                         0

      file
               value  ------------- Distribution ------------- count
                  -1 |                                         0
                   0 |                                         2
                   1 |                                         0
                   2 |                                         0
                   4 |                                         6
                   8 |                                         0
                  16 |                                         0
                  32 |                                         6
                  64 |                                         6
                 128 |@@                                       16
                 256 |@@@@                                     30
                 512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            199
                1024 |                                         0
                2048 |                                         0
                4096 |                                         1
                8192 |                                         1
               16384 |                                         0

      grep
               value  ------------- Distribution ------------- count
                  -1 |                                         0
                   0 |@@@@@@@@@@@@@@@@@@@                      99
                   1 |                                         0
                   2 |                                         0
                   4 |                                         0
                   8 |                                         0
                  16 |                                         0
                  32 |                                         0
                  64 |                                         0
                 128 |                                         1
                 256 |@@@@                                     25
                 512 |@@@@                                     23
                1024 |@@@@                                     24
                2048 |@@@@                                     22
                4096 |                                         4
                8192 |                                         3
               16384 |                                         0

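The same technique applies to writes, because the writech probe also carries a byte count in arg0. The following sketch mirrors read.d; the predicate that excludes dtrace and the aggregation name @bytes are additions made for this example, not part of the course script.

```d
#!/usr/sbin/dtrace -s

/* Distribution of write sizes per executable, ignoring dtrace's own output. */
sysinfo:::writech
/execname != "dtrace"/
{
        @bytes[execname] = quantize(arg0);
}

/* Stop tracing after 10 seconds; the aggregation is printed on exit. */
tick-10sec
{
        exit(0);
}
```

Each row of the resulting graph counts the writes whose size is at least the row's value and less than the next row's value.
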
Finding the Source of Cross-Calls

Consider the following output from the mpstat(1M) command:

    CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
      0 2189   0 1302   14    1 215   12   54   28   0 12995  13  14  0  73
      1 3385   0 1137  218  104 195   13   58   33   0 14486  19  15  0  66
      2 1918   0 1039   12    1 226   15   49   22   0 13251  13  12  0  75
      3 2430   0 1284  220  113 201   10   50   26   0 13926  10  15  0  75

The xcal and syscl columns display relatively high numbers, which might be affecting the system's performance. Yet the system is relatively idle, and is not spending time waiting on input/output (I/O). The xcal numbers are per-second and are read from the xcalls field of the sys kstat. To see which executables are responsible for the xcalls, enter the following dtrace(1M) command:

    # dtrace -n 'xcalls {@[execname] = count()}'
    dtrace: description 'xcalls ' matched 3 probes
    ^C

      find                 2
      cut                  2
      snmpd                2
      mpstat              22
      sendmail           101
      grep               123
      bash               175
      dtrace             435
      sched              784
      xargs            22308
      file             89889
    #

This output indicates the source of the cross-calls: some number of file(1) and xargs(1) processes are inducing the majority of them. You can find these processes using the pgrep(1) and ptree(1) commands:

    # pgrep xargs
    15973
    # ptree 15973
    204   /usr/sbin/inetd -s
      5650  in.telnetd
        5653  -sh
          5657  bash
            15970 /bin/sh ./findtxt configuration
              15971 cut -f1 -d:
              15973 xargs file
                16686 file /usr/bin/tbl /usr/bin/troff /usr/bin/ul /usr/bin/vgrind /usr/bin/catman

The xargs and file commands appear to be part of a custom user shell script. You can locate this script as follows:

    # find / -name findtxt
    /users1/james/findtxt
    # cat /users1/james/findtxt
    #!/bin/sh
    find / -type f | xargs file | grep text | cut -f1 -d: >/tmp/findtxt$$
    cat /tmp/findtxt$$ | xargs grep $1
    rm /tmp/findtxt$$
    #

The script is running many processes concurrently with much interprocess communication occurring through pipes. This script appears to be quite resource intensive: it is trying to find every text file in the system and is then searching each one for some specific text. You expect these processes to run concurrently on this system's four processors while they send data to each other.

Stack Trace xcall Details

You can gather more details on which kernel code is involved in all of the cross-calls while the file and xargs commands are running. The following example uses the stack() built-in DTrace function as the aggregation key to show which kernel code is requesting the cross-call. Each unique kernel stack trace is counted.

    # dtrace -n 'xcalls {@[stack()] = count()}'
    dtrace: description 'xcalls ' matched 3 probes
    ^C

                  SUNW,UltraSPARC-IIIi`send_mondo_set+0x9c
                  unix`xt_some+0xc4
                  unix`xt_sync+0x3c
                  unix`hat_unload_callback+0x6ec
                  unix`memscrub_scan+0x298
                  unix`memscrubber+0x308
                  unix`thread_start+0x4
                    2

                  SUNW,UltraSPARC-IIIi`send_mondo_set+0x9c
                  unix`xt_some+0xc4
                  unix`sfmmu_tlb_demap+0x118
                  unix`sfmmu_hblk_unload+0x368
                  unix`hat_unload_callback+0x534
                  unix`memscrub_scan+0x298
                  unix`memscrubber+0x308
                  unix`thread_start+0x4
                    2
    ...
                  SUNW,UltraSPARC-IIIi`send_mondo_set+0x9c
                  unix`xt_some+0xc4
                  unix`xt_sync+0x3c
                  unix`hat_unload_callback+0x6ec
                  genunix`anon_private+0x204
                  genunix`segvn_faultpage+0x778
                  genunix`segvn_fault+0x920
                  genunix`as_fault+0x4a0
                  unix`pagefault+0xac
                  unix`trap+0xc14
                  unix`utl0+0x4c
                 2303

                  SUNW,UltraSPARC-IIIi`send_mondo_set+0x9c
                  unix`xt_some+0xc4
                  unix`sfmmu_tlb_range_demap+0x190
                  unix`sfmmu_chgattr+0x2e8
                  genunix`segvn_dup+0x3d0
                  genunix`as_dup+0xd0
                  genunix`cfork+0x120
                  unix`syscall_trap32+0xa8
                 7175

                  SUNW,UltraSPARC-IIIi`send_mondo_set+0x9c
                  unix`xt_some+0xc4
                  unix`xt_sync+0x3c
                  unix`sfmmu_chgattr+0x2f0
                  genunix`segvn_dup+0x3d0
                  genunix`as_dup+0xd0
                  genunix`cfork+0x120
                  unix`syscall_trap32+0xa8
                11492

As this output shows, the majority of the cross-calls are the result of a significant number of fork(2) system calls. (Shell scripts are notorious for abusing their fork(2) privileges.) Page faults of anonymous memory are also involved, which probably accounts for the large number of minor page faults seen in the mpstat output.
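
One way to check the fork(2) explanation is to count forks per executable with the sysfork and sysvfork probes from Table 2-3. The following is a sketch written for this discussion rather than part of the original investigation; the aggregation name @forks is arbitrary.

```d
#!/usr/sbin/dtrace -s

/* Count fork(2) and vfork(2) calls, keyed by caller and by which probe fired. */
sysinfo:::sysfork,
sysinfo:::sysvfork
{
        @forks[execname, probename] = count();
}
```

If the shell script is the cause, you would expect the executables that belong to it, such as xargs and sh, to appear near the bottom of the output with the highest counts.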
Examining Performance Problems Using the io Provider

The io provider makes available probes related to disk input and output (I/O). The io provider is designed to enable quick exploration of behavior observed through I/O monitoring tools such as iostat(1M). The io provider describes the nature of the system's I/O by providing data such as the following:

- Device
- I/O type
- Process ID
- Application name
- File name
- File offset

The io Probes
Table 2-4 describes the io probes.

Table 2-4 The io Probes

Probe Name        Description
start             Probe that fires when an I/O request is about to be made to a disk device or to an NFS server. The buf(9S) structure corresponding to the I/O request is pointed to by the args[0] argument. The devinfo_t structure of the device to which the I/O is being issued is pointed to by the args[1] argument. The fileinfo_t structure of the file that corresponds to the I/O request is pointed to by the args[2] argument. Note that file information availability depends on the file system making the I/O request.
done              Probe that fires after an I/O request has been fulfilled. The buf(9S) structure corresponding to the I/O request is pointed to by the args[0] argument. The devinfo_t structure of the device to which the I/O was issued is pointed to by the args[1] argument. The fileinfo_t structure of the file that corresponds to the I/O request is pointed to by the args[2] argument.

Table 2-4 The io Probes (Continued)

Probe Name        Description
wait-start        Probe that fires immediately before a thread begins to wait pending completion of a given I/O request. The buf(9S) structure corresponding to the I/O request for which the thread will wait is pointed to by the args[0] argument. The devinfo_t structure of the device to which the I/O was issued is pointed to by the args[1] argument. The fileinfo_t structure of the file that corresponds to the I/O request is pointed to by the args[2] argument. Some time after the wait-start probe fires, the wait-done probe fires in the same thread.
wait-done         Probe that fires immediately after a thread wakes up from waiting for a pending completion of a given I/O request. The buf(9S) structure corresponding to the I/O request for which the thread was waiting is pointed to by the args[0] argument. The devinfo_t structure of the device to which the I/O was issued is pointed to by the args[1] argument. The fileinfo_t structure of the file that corresponds to the I/O request is pointed to by the args[2] argument. Some time after the wait-start probe fires, the wait-done probe fires in the same thread.

Information Available When io Probes Fire

The io probes fire for all I/O requests to disk devices, and for all file read and file write requests to an NFS server (except for metadata requests, such as readdir(3C)). The io provider uses three I/O structures: the buf(9S) structure, the devinfo_t structure, and the fileinfo_t structure. When the io probes fire, the following arguments are made available:

- args[0]: Set to point to the buf(9S) structure corresponding to the I/O request.
- args[1]: Set to point to the devinfo_t structure of the device to which the I/O was issued.
- args[2]: Set to point to the fileinfo_t structure containing file system related information regarding the issued I/O request.

The buf(9S) Structure

The buf(9S) structure is the abstraction that describes an I/O request. The address of this structure is made available to your D programs through the args[0] argument. Here is its definition:

    struct buf {
            int b_flags;            /* flags */
            size_t b_bcount;        /* number of bytes */
            caddr_t b_addr;         /* buffer address */
            uint64_t b_blkno;       /* expanded block # on device */
            uint64_t b_lblkno;      /* block # on device */
            size_t b_resid;         /* # of bytes not transferred */
            size_t b_bufsize;       /* size of allocated buffer */
            caddr_t b_iodone;       /* I/O completion routine */
            int b_error;            /* expanded error field */
            dev_t b_edev;           /* extended device */
    };

The b_flags member indicates the state of the I/O buffer and consists of a bitwise OR of different state values. Table 2-5 shows the valid state values for the b_flags field.

Table 2-5 The b_flags Field Values

Flag Value        Description
B_DONE            Indicates the data transfer has completed.
B_ERROR           Indicates an I/O transfer error. It is set in conjunction with the b_error field.
B_PAGEIO          Indicates the buffer is being used in a paged I/O request. See the description of the b_addr field (Table 2-6) for more information.
B_PHYS            Indicates the buffer is being used for physical (direct) I/O to a user data area.
B_READ            Indicates that data is to be read from the peripheral device into main memory.
B_WRITE           Indicates that the data is to be transferred from main memory to the peripheral device.
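
Because b_flags is a bit mask, a D program tests an individual state with the bitwise AND operator. The following sketch, written for illustration, counts read and write requests separately; the strings "read" and "write" are labels chosen for this example.

```d
#!/usr/sbin/dtrace -s

/* Key the aggregation on whether the B_READ bit is set in b_flags. */
io:::start
{
        @[args[0]->b_flags & B_READ ? "read" : "write"] = count();
}
```

The conditional expression yields a string, so the output contains at most two rows, one per direction of transfer.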

Table 2-6 shows the field descriptions for the buf(9S) structure.

Table 2-6 The buf(9S) Structure Field Descriptions

Field             Description
b_bcount          Indicates the number of bytes to be transferred as part of the I/O request.
b_addr            Indicates the virtual address of the I/O request, unless B_PAGEIO is set. The address is a kernel virtual address unless B_PHYS is set, in which case it is a user virtual address. If B_PAGEIO is set, the b_addr field contains kernel private data. Note that either B_PHYS or B_PAGEIO or neither can be set, but not both.
b_lblkno          Identifies which logical block on the device is to be accessed. The mapping from a logical block to a physical block (cylinder, track, and so on) is defined by the device.
b_resid           Indicates the number of bytes not transferred because of an error.
b_bufsize         Contains the size of the allocated buffer.
b_iodone          Identifies a specific routine in the kernel that is called when the I/O is complete.
b_error           Holds an error code returned from the driver in the event of an I/O error. b_error is set in conjunction with the B_ERROR bit set in the b_flags member.
b_edev            Contains the major and minor device numbers of the device accessed. Consumers can use the D built-in functions getmajor() and getminor() to extract the major and minor device numbers from the b_edev field.
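
The getmajor() and getminor() functions mentioned for the b_edev field can be used directly as aggregation keys. The following is a minimal sketch, assuming you want the number of bytes requested per device; it is an illustration rather than a script from this course.

```d
#!/usr/sbin/dtrace -s

/* Sum the requested bytes for each (major, minor) device number pair. */
io:::start
{
        @[getmajor(args[0]->b_edev), getminor(args[0]->b_edev)] =
            sum(args[0]->b_bcount);
}
```

Each output row shows a major number, a minor number, and the total of b_bcount for the I/O requests issued to that device.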
The devinfo_t Structure

The devinfo_t structure provides information about a device. A pointer to this structure is available to D programs through the args[1] argument. Its members are as follows:

    typedef struct devinfo {
            int dev_major;          /* major number */
            int dev_minor;          /* minor number */
            int dev_instance;       /* instance number */
            string dev_name;        /* name of device */
            string dev_statname;    /* name of device + instance/minor */
            string dev_pathname;    /* pathname of device */
    } devinfo_t;

Table 2-7 shows the field descriptions for the devinfo_t structure.

Table 2-7 The devinfo_t Structure Field Descriptions

Field             Description
dev_major         Indicates the major number of the device; see getmajor(9F).
dev_minor         Indicates the minor number of the device; see getminor(9F).
dev_instance      Indicates the instance number of the device. The instance of a device is different from the minor number: where the minor number is an abstraction managed by the device driver, the instance number is a property of the device node. Device node instance numbers can be displayed with the prtconf(1M) command.
dev_name          Indicates the name of the device driver that manages the device. (Device driver names can be viewed with the -D option to prtconf(1M).)
dev_statname      Indicates the name of the device as reported by the iostat(1M) command. This name also corresponds to the name of the device as reported by the kstat(1M) command. This field is provided to enable aberrant iostat or kstat output to be correlated to actual I/O activity.
dev_pathname      Indicates the complete path of the device.
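
Because dev_statname matches the names printed by iostat(1M), it makes a convenient aggregation key. The following sketch, added here as an illustration, counts I/O requests per device and also prints the device path so the two names can be compared.

```d
#!/usr/sbin/dtrace -s

/* Count I/O requests by iostat-style device name and full device path. */
io:::start
{
        @[args[1]->dev_statname, args[1]->dev_pathname] = count();
}
```
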

The fileinfo_t Structure

The fileinfo_t structure provides information about a file. The file to which an I/O corresponds is pointed to by the args[2] argument in the start, done, wait-start, and wait-done probes. Note that file information is contingent upon the file system providing this information when dispatching I/O requests; some file systems, especially third-party file systems, do not provide the information. Moreover, I/O requests for which there is no file information can emanate from the file system. For example, I/O to file system metadata is not associated with a specific file. Following is the definition of the fileinfo_t structure:

    typedef struct fileinfo {
            string fi_name;         /* name (basename of fi_pathname) */
            string fi_dirname;      /* directory (dirname of fi_pathname) */
            string fi_pathname;     /* full pathname */
            offset_t fi_offset;     /* offset within file */
            string fi_fs;           /* filesystem */
            string fi_mount;        /* mount point of file system */
    } fileinfo_t;

Table 2-8 shows the field descriptions for the fileinfo_t structure.

Table 2-8 The fileinfo_t Structure Field Descriptions

Field             Description
fi_name           Contains the name of the file without any directory components. If there is no file information associated with an I/O, the fi_name field is set to the string <none>. In rare cases, the pathname associated with a file is unknown; in this case, the fi_name field is set to the string <unknown>.
fi_dirname        Contains only the directory component of the file name. As with fi_name, this can be set to <none> if there is no file information present, or to <unknown> if the pathname associated with the file is not known.
fi_pathname       Contains the complete pathname to the file. As with fi_name, this can be set to <none> if there is no file information present, or to <unknown> if the pathname associated with the file is not known.
fi_offset         Contains the offset within the file, or -1 if file information is not present or if the offset is otherwise unspecified by the file system.
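
The <none> and <unknown> placeholder strings can be filtered out with a predicate. The following is a sketch, assuming you want the bytes requested per file for requests that carry file information; it was written for this discussion and is not a script from the course.

```d
#!/usr/sbin/dtrace -s

/* Sum requested bytes per file, skipping I/O without usable file information. */
io:::start
/args[2]->fi_pathname != "<none>" && args[2]->fi_pathname != "<unknown>"/
{
        @[args[2]->fi_pathname] = sum(args[0]->b_bcount);
}
```
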
Finding I/O Problems

Consider the following output from the iostat(1M) command.

                     extended device statistics
    device    r/s    w/s    kr/s     kw/s wait actv  svc_t  %w  %b
    fd0       0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    sd0       2.5  168.7    20.0  10937.7  0.0  3.7   21.7   0  75
    sd2     106.6    0.0  4319.9      0.0  0.0  0.7    6.5   0  54
    sd15      0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    nfs1      0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
                     extended device statistics
    device    r/s    w/s    kr/s     kw/s wait actv  svc_t  %w  %b
    fd0       0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    sd0       0.5  168.7     4.0  16162.5  0.0  9.6   56.9   0  72
    sd2      80.9    0.0  7570.5      0.0  0.0  1.1   13.2   0  68
    sd15      0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    nfs1      0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
                     extended device statistics
    device    r/s    w/s    kr/s     kw/s wait actv  svc_t  %w  %b
    fd0       0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    sd0       1.0  166.3     8.1  18973.0  0.0 24.5  146.5   1  88
    sd2      43.8    0.0 10949.6      0.0  0.0  0.9   20.4   0  62
    sd15      0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    nfs1      0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
                     extended device statistics
    device    r/s    w/s    kr/s     kw/s wait actv  svc_t  %w  %b
    fd0       0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    sd0       1.0  189.5     8.0  11047.6  0.0  2.7   14.4   0  67
    sd2     129.5    0.5  2836.3     14.5  0.0  0.7    5.6   0  59
    sd15      0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    nfs1      0.0    0.0     0.0      0.0  0.0  0.0    0.0   0   0
    ^C

This output indicates that a large amount of data is being read from disk drive sd2 and written to disk drive sd0. Someone appears to be transferring many megabytes of data between these two drives. Both disks are consistently over 50% busy. Is someone running a file transfer command such as tar(1), cpio(1), cp(1), or dd(1M)? The iosnoop.d D script enables you to determine who is performing this I/O.

The iosnoop.d D Script

The following D script displays data that enables you to determine which commands are running, what type of I/O those commands are performing, and which disk devices are involved.

    # cat iosnoop.d
    #!/usr/sbin/dtrace -qs
    BEGIN
    {
            printf("%16s %5s %40s %10s %2s %7s\n", "COMMAND", "PID", "FILE",
                "DEVICE", "RW", "MS");
    }

    io:::start
    {
            start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
            command[args[0]->b_edev, args[0]->b_blkno] = execname;
            mypid[args[0]->b_edev, args[0]->b_blkno] = pid;
    }

    io:::done
    /start[args[0]->b_edev, args[0]->b_blkno]/
    {
            elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
            printf("%16s %5d %40s %10s %2s %3d.%03d\n",
                command[args[0]->b_edev, args[0]->b_blkno],
                mypid[args[0]->b_edev, args[0]->b_blkno],
                args[2]->fi_pathname, args[1]->dev_statname,
                args[0]->b_flags & B_READ ? "R" : "W", elapsed / 1000000,
                (elapsed / 1000) % 1000);
            start[args[0]->b_edev, args[0]->b_blkno] = 0;   /* free memory */
            command[args[0]->b_edev, args[0]->b_blkno] = 0; /* free memory */
            mypid[args[0]->b_edev, args[0]->b_blkno] = 0;   /* free memory */
    }

You can decipher this D script as follows:

- You use the BEGIN probe to print out column headings.
- You use an associative array to store the nanosecond timestamp of when a particular I/O starts from a specific device. You must also store the executable name and PID of the command issuing the I/O request; this information is not available at I/O completion time because you are running in the context of an interrupt handler.
- When the I/O is done, you determine the elapsed time and then print out the relevant information.
- You retrieve the file undergoing the I/O from the fileinfo_t structure; the args[2] argument is set up to point to the fileinfo_t structure when the done probe fires.
- You retrieve the iostat-compatible device name from the devinfo_t structure, which is pointed to by the args[1] argument.
- You use a D conditional expression to display R or W based on testing the B_READ bit in the b_flags field of the buf structure, which is pointed to by the args[0] argument.
- You use the D modulo operator (%) to determine the fractional portion of the time in milliseconds.
- Finally, you set the associative array elements to zero. Setting an associative array element to zero de-allocates the underlying dynamic memory that was being used. This avoids potential dynamic variable drops.

The following output results from running the previous iosnoop.d script. It clearly shows who is performing the I/O operations. Someone is copying the shared object files from /usr/lib on drive sd2 to a backup directory on drive sd0.
# ./iosnoop.d
         COMMAND   PID                                     FILE     DEVICE RW      MS
            bash   725                            /usr/bin/bash        sd2  R   9.471
            bash   725                                 /usr/lib        sd2  R   7.128
            bash   725                                 /usr/lib        sd2  R   3.193
            bash   725                                 /usr/lib        sd2  R  11.283
            bash   725                           /lib/libc.so.1        sd2  R   7.696
            bash   725                         /lib/libnsl.so.1        sd2  R  10.293
            bash   768                         /lib/libnsl.so.1        sd2  R   0.582
              cp   768                           /lib/libc.so.1        sd2  R  10.154
              cp   768                           /lib/libc.so.1        sd2  R   7.262
              cp   768                           /lib/libc.so.1        sd2  R   9.914
              cp   768                        /usr/lib/0@0.so.1        sd2  R   9.270
              cp   768                        /usr/lib/0@0.so.1        sd2  R  13.654
              cp   768                 /mnt/lib.backup/0@0.so.1        sd0  W   2.431
              cp   768                           /usr/lib/ld.so        sd2  R   6.890
              cp   768                           /usr/lib/ld.so        sd2  R   7.085
              cp   768                           /usr/lib/ld.so        sd2  R   0.376
              cp   768                    /mnt/lib.backup/ld.so        sd0  W   6.698
              cp   768                    /mnt/lib.backup/ld.so        sd0  W   6.437
              cp   768                  /mnt/lib.backup/ld.so.1        sd0  W   4.394
              cp   768                                <unknown>        sd2  R   2.206
              cp   768                  /mnt/lib.backup/ld.so.1        sd0  W   8.479
              cp   768                  /mnt/lib.backup/ld.so.1        sd0  W   8.440
              cp   768                     /usr/lib/lib300.so.1        sd2  R   5.771
              cp   768                     /usr/lib/lib300.so.1        sd2  R   6.003
              cp   768                     /usr/lib/lib300.so.1        sd2  R   0.530
              cp   768                     /usr/lib/lib300.so.1        sd2  R   7.912
              cp   768                                <unknown>        sd2  R   3.014
              cp   768                /mnt/lib.backup/lib300.so        sd0  W   7.861
              cp   768              /mnt/lib.backup/lib300.so.1        sd0  W   6.794
              cp   768                    /usr/lib/lib300s.so.1        sd2  R   3.326
              cp   768                    /usr/lib/lib300s.so.1        sd2  R   3.525
              cp   768                    /usr/lib/lib300s.so.1        sd2  R   0.553
              cp   768                    /usr/lib/lib300s.so.1        sd2  R   7.397
              cp   768               /mnt/lib.backup/lib300s.so        sd0  W   2.996
              cp   768             /mnt/lib.backup/lib300s.so.1        sd0  W   1.970
...
              cp   768                   /usr/dt/lib/libXm.so.3        sd2  R  32.020
              cp   768                   /usr/dt/lib/libXm.so.3        sd2  R   6.471
              cp   768                   /usr/dt/lib/libXm.so.3        sd2  R  14.494
              cp   768                                   <none>        sd0  R  10.184
              cp   768                   /usr/dt/lib/libXm.so.3        sd2  R  22.211
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W   9.777
              cp   768                   /usr/dt/lib/libXm.so.3        sd2  R  28.813
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  26.279
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  24.141
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  22.075
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  19.989
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  21.710
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  39.809
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  37.459
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  32.631
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  30.378
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  28.308
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  29.701
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  28.327
              cp   768                                <unknown>        sd2  R  24.986
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  28.021
              cp   768             /mnt/lib.backup/libXm.so.1.2        sd0  W  26.601
              cp   768               /mnt/lib.backup/libXm.so.3        sd0  W   5.353
              cp   768               /mnt/lib.backup/libXm.so.3        sd0  W   4.603
              cp   768               /mnt/lib.backup/libXm.so.3        sd0  W  13.232
              cp   768               /mnt/lib.backup/libXm.so.3        sd0  W  11.242
              cp   768               /mnt/lib.backup/libXm.so.3        sd0  W  12.412
...
              cp   768       /usr/lib/libgtk-x11-2.0.so.0.100.0        sd2  R   2.374
              cp   768      /mnt/lib.backup/libgthread-2.0.so.0        sd0  W   7.732
              cp   768  /mnt/lib.backup/libgthread-2.0.so.0.7.0        sd0  W   7.605
              cp   768                                   <none>        sd2  R  10.678
              cp   768       /usr/lib/libgtk-x11-2.0.so.0.100.0        sd2  R   5.677
              cp   768       /usr/lib/libgtk-x11-2.0.so.0.100.0        sd2  R  39.864
              cp   768       /usr/lib/libgtk-x11-2.0.so.0.100.0        sd2  R  61.555
              cp   768       /usr/lib/libgtk-x11-2.0.so.0.100.0        sd2  R  17.175
              cp   768        /mnt/lib.backup/libgtk-x11-2.0.so        sd0  W  44.225
              cp   768        /mnt/lib.backup/libgtk-x11-2.0.so        sd0  W  42.075
^C
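The iosnoop.d script prints one line for every I/O. When a per-command summary is enough, the same io provider arguments can feed an aggregation instead. The following is a minimal sketch rather than one of the course scripts: the script name iobytes.d and the aggregation name @bytes are hypothetical, and it assumes that the b_bcount member of the structure passed in args[0] holds the size of each request in bytes.

```d
#!/usr/sbin/dtrace -qs
/* iobytes.d (hypothetical name): total bytes requested per command and device */

io:::start
{
        /* Key on the issuing command and the iostat-compatible device name. */
        @bytes[execname, args[1]->dev_statname] = sum(args[0]->b_bcount);
}

END
{
        printf("%-16s %-10s %12s\n", "COMMAND", "DEVICE", "BYTES");
        printa("%-16s %-10s %@12d\n", @bytes);
}
```

Run the script while the copy is in progress and press Control-C; the END clause then prints one line per command and device pair.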
Obtaining System Call Information

System calls serve as the main interface between user-level applications and the kernel. You can learn much about the system by knowing the system calls that are being issued by the set of running applications.

Note: System calls are documented in Section 2 of the Solaris 10 OS manual pages.

Traditionally, system calls of an application were determined using the truss(1) command. The DTrace syscall provider, however, enables you to quickly gather more detailed data with which to analyze aberrant behavior related to system calls. For example, not only can DTrace show you the system calls being issued by a given application, but it can also indicate which applications are issuing a given system call. In addition, you can time (in nanoseconds) how long a particular system call takes, such as a read(2). These operations cannot be performed with the truss(1) command.
The syscall Provider

The syscall provider makes available a probe at the entry and return of every system call in the system. An example of a fully specified probe description for the entry probe of the read(2) system call is:

syscall::read:entry

The probe for return from the read(2) system call is:

syscall::read:return

Note that the module name is undefined for the syscall provider probes.

System Call Names

The system call names are usually, but not always, the same as those documented in Section 2 of the Solaris 10 OS manual pages. The actual names are listed in the /etc/name_to_sysnum system file. Examples of system call names that do not match the manual pages are:

- rexit for exit(2)
- gtime for time(2)
- semsys for semctl(2), semget(2), semids(2), and semtimedop(2)
- signotify, which has no manual page, and is used for POSIX.4 message queues
- Large file system calls such as:
    - creat64 for creat(2)
    - lstat64 for lstat(2)
    - open64 for open(2)
    - mmap64 for mmap(2)
Arguments for entry and return Probes

For the entry probes, the arguments (arg0, arg1, ... argn) are the arguments to the system call. For return probes, both arg0 and arg1 contain the same value: the return value from the system call. You can check system call failure in the return probe by referencing the errno D variable. The following example shows which system calls are failing for which applications and with what errno value.

# cat errno.d
#!/usr/sbin/dtrace -qs
syscall:::return
/arg0 == -1 && execname != "dtrace"/
{
        printf("%-20s %-10s %d\n", execname, probefunc, errno);
}
# ./errno.d
sac                  read       4
ttymon               pause      4
ttymon               read       11
nscd                 lwp_kill   3
in.routed            ioctl      12
in.routed            ioctl      12
tty                  open       2
tty                  stat       2
bash                 setpgrp    13
bash                 waitsys    10
bash                 stat64     2
snmpd                ioctl      12
^C
The errno.d D program has a predicate that uses the AND operator: &&. The predicate states that the return from the system call must be -1, which is how all system calls indicate failure, and that the process executable name cannot be dtrace. The printf built-in D function uses the %-20s and %-10s format specifications to left-justify the strings in the given minimum column width.
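On a busy system the same failure can scroll past many times. A variation on errno.d, sketched below, counts each distinct combination of command, system call, and errno value instead of printing every occurrence. The script name errcount.d and the aggregation name @failures are hypothetical; the predicate is the one used in errno.d.

```d
#!/usr/sbin/dtrace -qs
/* errcount.d (hypothetical name): count failing system calls by errno */

syscall:::return
/arg0 == -1 && execname != "dtrace"/
{
        @failures[execname, probefunc, errno] = count();
}

END
{
        printf("%-20s %-10s %5s %8s\n", "COMMAND", "SYSCALL", "ERRNO", "COUNT");
        printa("%-20s %-10s %5d %@8d\n", @failures);
}
```

The summary is printed when you interrupt the script with Control-C.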
D Script Example Using the syscall Provider

The following simple D script counts the number of system calls being issued system wide.

# cat syscall.d
#!/usr/sbin/dtrace -qs
syscall:::entry
{
        @[probefunc] = count();
}
# ./syscall.d
^C
  mmap64                    1
  mkdir                     1
  umask                     1
  getloadavg                1
  getdents64                2
...
  stat                   1754
  ioctl                  1956
  close                  2708
  write                  2733
  mmap                   3006
  read                   3880
  sigaction              7886
  brk                   12695
The output indicates that the majority of the system calls are setting up signal handling (sigaction(2)) or growing the heap (brk(2)). The following D script enables you to discover who is making the brk(2) system calls.

# cat brk.d
#!/usr/sbin/dtrace -qs
syscall::brk:entry
{
        @[execname] = count();
}
# ./brk.d
^C
  dtrace                    6
  prstat                   22
  nroff                    48
  cat                      48
  tbl                     142
  eqn                     144
  rm                      166
  ln                      166
  col                     222
  expr                    332
  head                    492
  fgrep                   492
  dirname                 581
  grep                    722
  instant                 738
  sh                      917
  nawk                    984
  sgml2roff              1259
  nsgmls                13296
# ps -ef | grep nsgmls
    root   591   590  2 07:56:32 pts/2    0:00 /usr/lib/sgml/nsgmls -gl -m/usr/share/lib/sgml/locale/C/dtds/catalog -E0 /usr/s
# man nsgmls
No manual entry for nsgmls.
# man -k sgml
sgml            sgml (5)        - Standard Generalized Markup Language
solbook         sgml (5)        - Standard Generalized Markup Language

Apparently some process is working with the Standard Generalized Markup Language (SGML). Use the ptree command to see who is creating this process:

# ptree 591
#
The ptree command returns no results because the nsgmls process is too short-lived for the command to be run on it. You have learned, however, that the problem is not a long-lived process causing a memory leak. Now write a quick D script to print out the ancestry. The script must follow the parent pointers back one generation at a time, because many of the other processes involved are also short-lived.

Note: This particular D script fails if an ancestor does not exist, because the top ancestor, the sched process, has no parent. You cannot harm the kernel even if a D script uses a bad pointer. The intent of this example is to show how you can quickly create custom D scripts to answer questions about system behavior. Many of your D scripts will be throw-away scripts that you will not re-use. You can fix the script by testing each parent pointer with a predicate before printing. You will see this fix later with the ancestors3.d D script.
# cat ancestors.d
#!/usr/sbin/dtrace -qs
syscall::brk:entry
/execname == "nsgmls"/
{
        printf("process: %s\n",
            curthread->t_procp->p_user.u_psargs);
        printf("parent: %s\n",
            curthread->t_procp->p_parent->p_user.u_psargs);
        printf("grandparent: %s\n",
            curthread->t_procp->p_parent->p_parent->p_user.u_psargs);
        printf("greatgrandparent: %s\n",
            curthread->t_procp->p_parent->p_parent->p_parent->p_user.u_psargs);
        printf("greatgreatgrandparent: %s\n",
            curthread->t_procp->p_parent->p_parent->p_parent->p_parent->p_user.u_psargs);
        printf("greatgreatgreatgrandparent: %s\n",
            curthread->t_procp->p_parent->p_parent->p_parent->p_parent->p_parent->p_user.u_psargs);
}
# ./ancestors.d
process: /usr/lib/sgml/nsgmls -gl -m/usr/share/lib/sgml/locale/C/dtds/catalog -E0 /usr/s
parent: /usr/lib/sgml/instant -d -c/usr/share/lib/sgml/locale/C/transpec/roff.cmap -s/u
grandparent: /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman4/rt_dptbl.4
greatgrandparent: sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share/man/sman4/rt_dptbl.
greatgreatgrandparent: catman
greatgreatgreatgrandparent: bash
# ps -ef | grep catman
    root  2333  2332  1 08:26:05 pts/1    0:03 catman
    root 16984  2880  0 08:41:10 pts/2    0:00 grep catman
# ptree 2333
299   /usr/sbin/inetd -s
  2324  in.rlogind
    2326  -sh
      2332  bash
        2333  catman
          17232 sh -c cd /usr/share/man; rm -f /usr/share/man/cat4/variables.4; ln -s ../cat4/e
            17235 sh -c cd /usr/share/man; rm -f /usr/share/man/cat4/variables.4; ln -s ../cat4/e
The previous output indicates that all of the brk(2) system calls resulted from the catman(1M) command, which created many short-lived children that issued this system call. The curthread built-in D variable gives access to the address of the running kernel thread. Like the C language, the D language accesses members of a structure with the -> symbol when you have a pointer to that structure. Through this pointer to the kernel kthread_t structure, you can access the process name and arguments (kept in the proc_t structure's p_user structure) as well as those of any parent, grandparent, greatgrandparent, and so on. To do this you follow the parent pointers back. Refer to the <sys/thread.h>, <sys/proc.h>, and <sys/user.h> header files for details of these fields.
Figure 2-1 shows a diagram of the kernel data structures being accessed by this example.

[Figure 2-1 Thread and Process Data Structures. The diagram shows curthread pointing to a kthread_t structure (t_state, t_pri, t_lwp, t_procp; defined in /usr/include/sys/thread.h). The t_procp member points to a proc_t structure (p_exec, p_as, p_cred, p_parent, p_tlist, p_user; defined in /usr/include/sys/proc.h). Each p_parent member points to the proc_t structure of the parent process, and each p_user member is a user_t structure (u_start, u_ticks, u_psargs[ ], u_cdir; defined in /usr/include/sys/user.h).]
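When only the immediate parent is of interest, there is a lighter alternative to walking the kernel structures by hand. The sketch below is not one of the course scripts (the name brkwho.d is hypothetical); it relies on the ppid built-in D variable and on the pr_psargs member of the curpsinfo built-in variable, which describe the process that fired the probe.

```d
#!/usr/sbin/dtrace -qs
/* brkwho.d (hypothetical name): identify nsgmls processes and their parents */

syscall::brk:entry
/execname == "nsgmls"/
{
        /* pid and ppid are built-in variables; curpsinfo holds the process arguments. */
        printf("pid %d (parent pid %d): %s\n", pid, ppid, curpsinfo->pr_psargs);
}
```

This form does not reach grandparents and beyond; for those you still follow the p_parent pointers as ancestors.d does.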
New Approach to Analyzing Transient Failures

As the previous example demonstrates, each result obtained from using the DTrace facility can lead to further questions, which are answered with available commands or with new D programs that you can write quickly. In this way, the DTrace facility significantly shortens the diagnostic loop:

hypothesis -> instrumentation -> data gathering -> analysis -> hypothesis

This tightened loop introduces a new paradigm for diagnosing transient failures. It enables the emphasis to shift from instrumentation to hypothesis, which is less labor intensive.

D Language Variables
The D language has five basic variable types:

- Scalar variables: Have fixed-size values such as integers, structures, and pointers
- Associative arrays: Store values indexed by one or more keys, similar to aggregations
- Thread-local variables: Have one name, but storage is local to each separate kernel thread. These variables are prefixed with the self-> keyword.
- Clause-local variables: Appear when an action block is entered; storage is reclaimed after leaving the probe clause. These variables are prefixed with the this-> keyword.
- Kernel external variables: DTrace has access to all kernel global and static variables. These variables are prefixed with a backquote (`).

Associative arrays (start, command, and mypid) were used in the iosnoop.d script. Clause-local variables are similar to automatic or local variables in the C language. The elapsed variable in the iosnoop.d script was a global scalar variable, but could have been made into a clause-local variable, which is slightly more efficient. Clause-local variables come into existence when an action block (tied to a specific probe) is entered and their storage is reclaimed when the action block is left. They help save storage and are quicker to access than associative arrays.

Note: For more information on D variables, refer to the Solaris Dynamic Tracing Guide, part number 817-6223-10.

You can access kernel global and static variables within your D programs. To access these external variables, you prefix the global kernel variable with the ` (back quote or grave accent) character. For example, to reference the freemem kernel global variable use: `freemem. If the variable is part of a kernel module that conflicts with other module variable names, use the ` character between the module name and the variable name. For example, sd`sd_state references the sd_state variable within the sd kernel module.
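The two prefixes described above can be seen together in a few lines. The following sketch, which is not one of the course scripts (the name freemem.d is hypothetical), copies the freemem kernel global into a clause-local variable once a second and prints it.

```d
#!/usr/sbin/dtrace -qs
/* freemem.d (hypothetical name): sample a kernel global once a second */

tick-1sec
{
        /* The backquote reaches the kernel variable; this-> keeps the copy clause-local. */
        this->pages = `freemem;
        printf("free pages: %d\n", this->pages);
}
```

The this->pages variable needs no cleanup, because its storage is reclaimed as soon as the clause finishes.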
Associative Arrays
Associative arrays enable the storing of scalar values in elements of an array (or table) that are identified by one or more sequences of comma-separated key fields (an n-tuple). The keys can be any combination of strings or integers. The following code example shows the use of an associative array to track how often any command issues more than a given number of any single system call:

# cat assoc2.d
#!/usr/sbin/dtrace -qs
syscall:::entry
{
        ++namesys[pid,probefunc];
        x = namesys[pid,probefunc] > 5000 ? 1 : 0;
}
syscall:::entry
/x && execname != "dtrace"/
{
        printf("Process: %d %s has just made more than 5000 %s calls\n",
            pid, execname, probefunc);
        namesys[pid,probefunc] = 0;     /* reset the count */
}
# ./assoc2.d
Process: 14837 find has just made more than 5000 lstat64 calls
Process: 14837 find has just made more than 5000 lstat64 calls
Process: 14854 ls has just made more than 5000 lstat64 calls
Process: 14854 ls has just made more than 5000 acl calls
Process: 14854 ls has just made more than 5000 lstat64 calls
^C
The assoc2.d D program uses an associative array indexed by the unique combination of process ID (PID) and system call name. The ++ operator increments the array element by one each time a process with that PID makes that system call. The array element, like all variables (except clause-local variables), is initialized to 0. The second statement in the action block uses a conditional expression that has three parts:
expression ? value1 : value2

A conditional expression has the value of value1 when the D expression is nonzero (true), and has the value of value2 when the expression is zero (false). Therefore, in the assoc2.d D program, the global scalar variable x is 1 when that element of the associative array is greater than 5000, and 0 when it is not greater than 5000. The next action block is only executed if x is not 0 and the executable name is not dtrace. After printing the command that made more than 5000 of a given system call, you reset the array element to 0 to begin counting again. Note that a comment is used in this D program. Like comments in the C language, a comment in the D language is text that is enclosed between /* and */.
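Because x is a global variable, every CPU shares it, so on a multiprocessor system a probe firing on another CPU could change x between the two clauses. One way to avoid this, sketched below under the assumption that a clause-local variable keeps its value across the clauses run for a single probe firing, is to carry the flag in a this-> variable. The script name assoc3.d and the variable name this->over are hypothetical.

```d
#!/usr/sbin/dtrace -qs
/* assoc3.d (hypothetical name): assoc2.d with a clause-local flag */

syscall:::entry
{
        ++namesys[pid, probefunc];
        /* The flag is private to this probe firing. */
        this->over = namesys[pid, probefunc] > 5000 ? 1 : 0;
}

syscall:::entry
/this->over && execname != "dtrace"/
{
        printf("Process: %d %s has just made more than 5000 %s calls\n",
            pid, execname, probefunc);
        namesys[pid, probefunc] = 0;    /* reset the count */
}
```

The first clause has no predicate, so this->over is always assigned before the second clause tests it.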
Thread-Local Variables
Thread-local variables are useful when you wish to enable a probe and mark with a tag every thread that fires the probe. Thread-local variables share a common name but refer to separate data storage associated with each thread. Thread-local variables are referenced with the special keyword self followed by the two characters ->, as shown in the following example:

syscall::read:entry
{
        self->read = 1;
}

syscall::read:return
/self->read/
{
        printf("Same thread is returning from read\n");
}
Timing a System Call

Thread-local variables enable you to determine the amount of time a thread spends in any particular system call. The following example times how long the grep(1) command takes in each read(2) system call. It also displays the number of bytes read (arg0 is the return value of read).

# cat timegrep.d
#!/usr/sbin/dtrace -qs
BEGIN
{
        printf("size\ttime\n");
}
syscall::read:entry
/execname == "grep"/
{
        self->start = timestamp;
}
syscall::read:return
/self->start/
{
        printf("%d\t%d\n", arg0, timestamp - self->start);
        self->start = 0;
}
# ./timegrep.d
size    time
8192    7108972
319     1526616
0       12112
3293    5663329
0       18816
^C

The first read took 7,108,972 nanoseconds or 7.1 milliseconds, which is reasonable for an 8-Kbyte disk read. As you might expect, the first read of 0 bytes took only 12 microseconds.

The next example uses an associative array to time every system call performed by the grep command.

# cat timesys.d
#!/usr/sbin/dtrace -qs
BEGIN
{
        printf("System Call Times for grep:\n\n");
        printf("%20s\t%10s\n", "Syscall", "Microseconds");
}
syscall:::entry
/execname == "grep"/
{
        self->name[probefunc] = timestamp;
}
syscall:::return
/self->name[probefunc]/
{
        printf("%20s\t%10d\n", probefunc,
            (timestamp - self->name[probefunc])/1000);
        self->name[probefunc] = 0;      /* free memory */
}
# ./timesys.d
System Call Times for grep:

             Syscall    Microseconds
                mmap            50
         resolvepath            47
         resolvepath            67
                stat            37
                open            46
                stat            34
                open            32
...
                 brk            25
              open64            43
                read          8126
                 brk            20
                 brk            28
                read            24
               close            26
^C

Predictably, the system call that took the most time was read, because of the disk I/O wait time (the second read was of 0 bytes).
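The per-call listing grows quickly for a longer-running command. A possible variation, sketched here and not taken from the course, keeps the thread-local timestamp but accumulates the results in aggregations, so that each system call appears once with its total time and call count. The script name sumsys.d and the aggregation names @total and @calls are hypothetical. A single scalar self->ts is enough because a thread can be inside only one system call at a time.

```d
#!/usr/sbin/dtrace -qs
/* sumsys.d (hypothetical name): total time and call count per system call for grep */

syscall:::entry
/execname == "grep"/
{
        self->ts = timestamp;
}

syscall:::return
/self->ts/
{
        @total[probefunc] = sum(timestamp - self->ts);
        @calls[probefunc] = count();
        self->ts = 0;           /* free the thread-local storage */
}

END
{
        printf("%-20s %12s\n", "Syscall", "Total ns");
        printa("%-20s %@12d\n", @total);
        printf("\n%-20s %12s\n", "Syscall", "Calls");
        printa("%-20s %@12d\n", @calls);
}
```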
Following a System Call

You can follow a system call from entry into the kernel through all subsequent internal kernel function calls and returns back to the original point of entry of the system call function. You do this by using the syscall and fbt providers together with a thread-local variable. The following example traces all of the functions involved in the read(2) system call as issued by the grep(1) command:

# cat follow.d
#!/usr/sbin/dtrace -s
syscall::read:entry
/execname == "grep"/
{
        self->start = 1;
}

syscall::read:return
/self->start/
{
        exit(0);
}

fbt:::
/self->start/
{
}

The fbt provider probe clause has an empty action. Because the clause has no action, DTrace takes its default action and reports each probe firing, so you see every entry to and return from each kernel function involved in the read(2) system call until the script terminates. Option -F of the dtrace(1M) command indents the output of each nested function call and shows this with the -> symbol; it un-indents the output when that function returns back up the call tree and shows this with the <- symbol.

# dtrace -F -s follow.d
dtrace: script './follow.d' matched 38108 probes
CPU FUNCTION
  0  -> read32
  0  <- read32
  0  -> read
  0  -> getf
  0  -> set_active_fd
  0  <- set_active_fd
  0  <- getf
...
  0  <- ufs_rwlock
  0  -> fop_read
  0  <- fop_read
  0  -> ufs_read
  0  -> ufs_lockfs_begin
...
  0  -> rdip
  0  -> rw_write_held
  0  <- rw_write_held
  0  -> segmap_getmapflt
  0  -> get_free_smp
  0  -> grab_smp
  0  -> segmap_hashout
...
  0  <- sfmmu_kpme_lookup
  0  -> sfmmu_kpme_sub
...
  0  <- page_unlock
  0  <- grab_smp
  0  -> segmap_pagefree
  0  -> page_lookup_nowait
  0  -> page_trylock
...
  0  <- segmap_hashin
  0  -> segkpm_create_va
  0  <- segkpm_create_va
  0  -> fop_getpage
  0  -> ufs_getpage
  0  -> ufs_lockfs_begin_getpage
  0  -> tsd_get
...
  0  <- page_exists
  0  -> page_lookup
  0  <- page_lookup
  0  -> page_lookup_create
  0  <- page_lookup_create
  0  -> ufs_getpage_miss
  0  -> bmap_read
  0  -> findextent
  0  <- findextent
  0  <- bmap_read
  0  -> pvn_read_kluster
  0  -> page_create_va
  0  -> lgrp_mem_hand
...
  0  <- page_add
  0  <- page_create_va
  0  <- pvn_read_kluster
  0  -> pagezero
  0  -> ppmapin
  0  -> sfmmu_get_ppvcolor
  0  <- sfmmu_get_ppvcolor
  0  -> hat_memload
  0  -> sfmmu_memtte
  0  <- sfmmu_memtte
...
  0  -> xt_some
  0  <- xt_some
  0  <- xt_sync
...
  0  <- sema_init
  0  <- pageio_setup
  0  -> lufs_read_strategy
  0  -> logmap_list_get
  0  <- logmap_list_get
  0  -> bdev_strategy
  0  -> bdev_strategy_tnf_probe
  0  <- bdev_strategy_tnf_probe
  0  <- bdev_strategy
  0  -> sdstrategy
  0  -> getminor
...
  0  <- drv_usectohz
  0  -> timeout
  0  <- timeout
  0  -> timeout_common
...
  0  <- getminor
  0  -> scsi_transport
  0  <- scsi_transport
  0  -> glm_scsi_start
  0  -> ddi_get_devstate
...
  0  <- ddi_get_soft_state
  0  -> pci_pbm_dma_sync
  0  <- pci_pbm_dma_sync
  0  <- pci_dma_sync
  0  <- glm_start_cmd
  0  <- glm_accept_pkt
  0  <- glm_scsi_start
  0  <- sd_start_cmds
  0  <- sd_core_iostart
  0  <- xbuf_iostart
  0  <- lufs_read_strategy
  0  -> biowait
  0  -> sema_p
  0  -> disp_lock_enter
  0  <- disp_lock_enter
  0  -> thread_lock_high
  0  <- thread_lock_high
  0  -> ts_sleep
  0  <- ts_sleep
  0  -> disp_lock_exit_high
  0  <- disp_lock_exit_high
  0  -> disp_lock_exit_nopreempt
  0  <- disp_lock_exit_nopreempt
  0  -> swtch
  0  -> disp
  0  -> disp_lock_enter
  0  <- disp_lock_enter
  0  -> disp_lock_exit
  0  <- disp_lock_exit
  0  -> disp_getwork
  0  <- disp_getwork
  0  <- disp
  0  <- swtch
  0  -> resume
  0  <- resume
  0  -> disp_lock_enter
...
  0  <- hat_page_getattr
  0  <- segmap_getmapflt
  0  -> uiomove
  0  -> xcopyout
  0  <- xcopyout
  0  <- uiomove
  0  -> segmap_release
  0  -> get_smap_kpm
...
  0  <- ufs_imark
  0  <- ufs_itimes_nolock
  0  <- rdip
...
  0  <- cv_broadcast
  0  <- releasef
  0  <- read
  0  -> read
Although more than half of the functions were removed from the previous output, the example shows that a great many functions are required to perform a disk file read. Some of the key functions are described below:

- read: read(2) system call entered
- ufs_read: UFS file being read
- segmap_getmapflt: Find segmap page for the I/O
- segmap_pagefree: Free underlying previous physical page tied to this segmap virtual page onto the cachelist (this policy replaced the old priority paging)
- ufs_getpage: Ask UFS to retrieve the page
- page_lookup: First check to see if the page is in memory (it is not)
- page_create_va: Get new physical page for the I/O
- hat_memload: Map the virtual page to the physical page
- xt_some: Issue cross-trap call to some CPUs
- sdstrategy: Issue Small Computer System Interface (SCSI) command to read page from disk into segmap page
- timeout: Prepare for SCSI timeout of disk read request
- glm_scsi_start: In glm host bus adapter driver
- biowait: Wait for block I/O
- sema_p: Use semaphore to wait for I/O
- ts_sleep: Put timesharing (TS) thread on sleep queue
- swtch: Do a context switch (have thread give up the CPU while it waits for the I/O)
- disp_getwork: Find another thread to run while this thread waits for its I/O
- resume: I/O has completed and CPU is returned to resume running
- uiomove: Move data from kernel buffer (page) to user-land buffer
- segmap_release: Release segmap page for use by another I/O later
- read: Read operation ends
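The indentation produced by the -F option of follow.d can also be requested from inside the script with the flowindent option, so the script can be run directly without remembering the extra flag. The following sketch makes that one change and also clears the thread-local flag before exiting; the script name follow2.d and the variable name self->follow are hypothetical.

```d
#!/usr/sbin/dtrace -s
/* follow2.d (hypothetical name): follow.d with indentation enabled in the script */
#pragma D option flowindent

syscall::read:entry
/execname == "grep"/
{
        self->follow = 1;
}

/* Empty action: the default action reports each function entry and return. */
fbt:::
/self->follow/
{
}

syscall::read:return
/self->follow/
{
        self->follow = 0;
        exit(0);
}
```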

Creating D Scripts That Use Arguments

As with shell and other interpretive programming language commands such as the perl(1) command, you can use the dtrace(1M) command to create executable interpreter files. The file must start with the following line and must have execute permission:

#!/usr/sbin/dtrace -s

You can specify other options to the dtrace(1M) command on this line; be sure, however, to use only one dash (-) followed by the options, with s being last:

#!/usr/sbin/dtrace -qvs

You can also specify all options to the dtrace(1M) command by using #pragma lines inside the D script:

# cat mem2.d
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option verbose

vminfo:::
{
        @[execname,probename] = count();
}

END
{
        printa("%-20s %-15s %@d\n", @);
}

Note: For the list of option names used in #pragma lines, see the Solaris Dynamic Tracing Guide, part number 817-6223-10.
Built-in Macro Variables

The D compiler defines a set of built-in macro variables that you can refer to inside a D script. These macro variables include:

- $pid: Process ID of dtrace interpreter running script
- $ppid: Parent process ID of dtrace interpreter running script
- $uid: Real user ID of user running script
- $gid: Real group ID of user running script
- $0: Name of script
- $1, $2, $3, and so on: First, second, third command-line arguments passed to script
- $$1, $$2, $$3, and so on: First, second, third command-line arguments converted to double-quoted (" ") strings
The complete list of D macro variables is given in Appendix B. The following D script uses some of these D macro variables:

# cat -n params.d
     1  #!/usr/sbin/dtrace -s
     2  #pragma D option quiet
     3
     4  tick-2sec
     5  /$1 == $11 && $$3 == "fubar"/
     6  {
     7          printf("name of script: %s\n", $0);
     8          printf("pid of script: %d\n", $pid);
     9          printf("9th arg passed to script: %s\n", $$9);
    10          exit(0);
    11  }
# ./params.d 1 2 fubar 4 5 6 7 8 9 10 1
name of script: ./params.d
pid of script: 5363
9th arg passed to script: 9
# ./params.d 1 2 3 4 5 6 7 8 9 10 11
^C
The last invocation of the script did not output anything because the value of the first argument did not match the value of the eleventh argument. The following invocations show that the type and number of arguments must match those referenced inside the D script. This is an example of the error-checking capability of the DTrace facility:

# ./params.d 1 2 3 4 5 6 7 8 9
dtrace: failed to compile script ./params.d: line 5: macro argument $11 is not defined
# ./params.d 1 2 3 4 5 6 7 8 9 10 11 12 13
dtrace: failed to compile script ./params.d: line 12: extraneous argument '13' ($13 is not referenced)
# ./params.d a b c d e f g h i j k
dtrace: failed to compile script ./params.d: line 5: failed to resolve a: Unknown variable name

The defaultargs option to the dtrace(1M) command allows you to default the values of $1, $2, and so on to zero if the user does not type any arguments when invoking the dtrace(1M) command. The $$1, $$2, and so on references become empty strings when the user does not type any arguments. Options can be specified on the dtrace(1M) command line as an argument to the -x option. The following examples show these features:

# cat -n args.d
     1  #!/usr/sbin/dtrace -qs
     2  BEGIN
     3  {
     4          x = 5;
     5  }
     6
     7  tick-2sec
     8  {
     9          x = x + $1;
    10          name = $$2;
    11  }
    12
    13  tick-11sec
    14  {
    15          printf("x: %d\n", x);
    16          printf("name: %s\n", name);
    17          exit(0);
    18  }
# ./args.d 2 foo
x: 15
name: foo
# ./args.d
dtrace: failed to compile script args.d: line 10: macro argument $1 is not defined
# dtrace -x defaultargs -qs args.d
x: 5
name:
# dtrace -x defaultargs -qs args.d 2 3 4
dtrace: failed to compile script args.d: line 20: extraneous argument '4' ($3 is not referenced)
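Like other options, defaultargs can also be set with a #pragma line, which lets a script tolerate missing arguments without the -x option on the command line. The following is a minimal sketch, not one of the course scripts; the script name defargs.d is hypothetical.

```d
#!/usr/sbin/dtrace -qs
/* defargs.d (hypothetical name): print the arguments, supplied or defaulted */
#pragma D option defaultargs

BEGIN
{
        /* With defaultargs set, a missing $1 is 0 and a missing $$2 is an empty string. */
        printf("increment: %d\n", $1);
        printf("name: \"%s\"\n", $$2);
        exit(0);
}
```

The script can then be run either with no arguments or with arguments such as 2 foo.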
PID Argument Example

The following example passes the PID of a running vi process to the syscalls2.d D script. You use the pgrep command to determine the PID of the vi process. The D script terminates when the vi command exits; the predicate on the rexit clause ensures that the exit of some other process does not end the script early.

# cat syscalls2.d
#!/usr/sbin/dtrace -qs

syscall:::entry
/pid == $1/
{
        @[probefunc] = count();
}
syscall::rexit:entry
/pid == $1/
{
        exit(0);
}
# pgrep vi
2208
# ./syscalls2.d 2208
  rexit                     1
  setpgrp                   1
  creat                     1
  getpid                    1
  open                      1
  lstat64                   1
  stat64                    1
  fdsync                    1
  unlink                    2
  close                     2
  alarm                     2
  lseek                     3
  sigaction                 5
  ioctl                    45
  read                    143
  write                   178
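An alternative to passing the PID as $1 is the -p option of the dtrace(1M) command together with the $target macro variable, which expands to the PID of the process named by -p. The sketch below is not one of the course scripts (the name syscalls3.d is hypothetical). No rexit clause is needed, because dtrace stops tracing when the process it was attached to with -p exits.

```d
#!/usr/sbin/dtrace -qs
/* syscalls3.d (hypothetical name): count system calls made by the -p target */

syscall:::entry
/pid == $target/
{
        @[probefunc] = count();
}
```

For the vi process in the previous example, the invocation would be ./syscalls3.d -p 2208.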
Executable Name Argument Example

In the following example the ancestors.d D script is modied to make it more general. Remember that this script was created because the processes involved were too short-lived for a ptree command to be executed on them. The modied script can retrieve the ancestry back to the great-great-great-grandparent of any process you catch making any specied system call. The $$1 references the rst command line argument as a quoted string. # cat -n ancestors2.d
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 #!/usr/sbin/dtrace -qs syscall::$2:entry /execname == $$1/ { printf("process: %s\n", curthread->t_procp->p_user.u_psargs); printf("parent: %s\n", curthread->t_procp->p_parent->p_user.u_psargs); printf("grandparent: %s\n", curthread->t_procp->p_parent->p_parent->p_user.u_psargs); printf("greatgrandparent: %s\n", curthread->t_procp->p_parent->p_parent->p_parent->p_user.u_psargs); printf("greatgreatgrandparent: %s\n", curthread->t_procp->p_parent->p_parent->p_parent->p_parent->p_user.u_psargs); printf("greatgreatgreatgrandparent: %s\n", curthread->t_procp->p_parent->p_parent->p_parent->p_parent->p_parent->p_user.u_psargs); exit(0); }
# ./ancestors2.d nsgmls brk process: /usr/lib/sgml/nsgmls -gl m/usr/share/lib/sgml/locale/C/dtds/catalog -E0 /usr/s parent: /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman2/fork.2 grandparent: /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman2/fork.2 greatgrandparent: sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share/man/sman2/fork.2 greatgreatgrandparent: catman greatgreatgreatgrandparent: bash You can run the same script with a different process name and system call, which shows the power of being able to pass in arguments to a D script: # ./ancestors2.d vi sigaction process: vi /etc/system parent: bash
Using DTrace
2-57
Creating D Scripts That Use Arguments grandparent: -sh greatgrandparent: /usr/sbin/in.telnetd greatgreatgrandparent: /usr/lib/inet/inetd start greatgreatgreatgrandparent: /sbin/init The ancestors3.d D script xes the problem with trying to print nonexistent ancestry: # ./ancestors2.d dtrace: error on address (0x0) in dtrace: error on address (0x0) in dtrace: error on address (0x0) in cron read enabled probe ID 1 (ID 10: syscall::read:entry): invalid action #4 enabled probe ID 1 (ID 10: syscall::read:entry): invalid action #4 enabled probe ID 1 (ID 10: syscall::read:entry): invalid action #4
# cat ancestors3.d
#!/usr/sbin/dtrace -qs

syscall::$2:entry
/execname == $$1/
{
    printf("process: %s\n", curthread->t_procp->p_user.u_psargs);
    nextpaddr = curthread->t_procp->p_parent;
}

/* Each later clause fires only if the previous parent pointer is non-NULL. */
syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
    printf("parent: %s\n", nextpaddr->p_user.u_psargs);
    nextpaddr = curthread->t_procp->p_parent->p_parent;
}

syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
    printf("grandparent: %s\n", nextpaddr->p_user.u_psargs);
    nextpaddr = curthread->t_procp->p_parent->p_parent->p_parent;
}

syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
    printf("greatgrandparent: %s\n", nextpaddr->p_user.u_psargs);
    nextpaddr = curthread->t_procp->p_parent->p_parent->p_parent->p_parent;
}

syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
    printf("greatgreatgrandparent: %s\n", nextpaddr->p_user.u_psargs);
    nextpaddr = curthread->t_procp->p_parent->p_parent->p_parent->p_parent->p_parent;
}

syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
    printf("greatgreatgreatgrandparent: %s\n", nextpaddr->p_user.u_psargs);
    exit(0);
}
# ./ancestors3.d cron read
process: /usr/sbin/cron
parent: /sbin/init
grandparent: sched
process: /usr/sbin/cron
parent: /sbin/init
grandparent: sched
process: /usr/sbin/cron
parent: /sbin/init
grandparent: sched
process: /usr/sbin/cron
parent: /sbin/init
grandparent: sched
^C
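The guard that ancestors3.d adds (the nextpaddr test in each predicate) can be modeled outside DTrace. The following Python sketch is only an illustration of that idea: the PARENT table and the ancestry() function are hypothetical names, and the table simply mirrors the cron, init, and sched chain shown in the output.

```python
# Hypothetical process table: command -> parent command (None ends the chain).
PARENT = {"/usr/sbin/cron": "/sbin/init", "/sbin/init": "sched", "sched": None}

LABELS = ["process", "parent", "grandparent", "greatgrandparent",
          "greatgreatgrandparent", "greatgreatgreatgrandparent"]

def ancestry(proc, parent_of=PARENT):
    """Return labelled ancestors, stopping at the first missing parent."""
    lines = []
    for label in LABELS:
        if proc is None:          # plays the role of the /nextpaddr/ predicate
            break
        lines.append(f"{label}: {proc}")
        proc = parent_of.get(proc)
    return lines

for line in ancestry("/usr/sbin/cron"):
    print(line)
```

Without the None check, the loop would try to describe a parent that does not exist, which is the situation that produced the invalid address errors in ancestors2.d.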
Custom Monitoring Tools

The intended use of the vminfo, sysinfo, and io providers is to further investigate potential problems shown by the output of the existing Solaris monitoring tools such as vmstat(1M), sar(1), mpstat(1M), and iostat(1M). The following two examples show that you can also use these providers to create custom versions of the existing monitoring tools. They also show the arithmetic capabilities of the D language.
Example of a Custom Tool Resembling the sar -c Command

The following D script uses the sysinfo provider to implement a tool similar to the sar -c command.
# cat sar-c.d
#!/usr/sbin/dtrace -qs
/*
 * Usage: ./sar-c.d interval count
 */

BEGIN
{
    printf("%10s %10s %10s %10s %10s %10s %10s\n", "scall/s",
        "sread/s", "swrit/s", "fork/s", "exec/s", "rchar/s", "wchar/s");
    rchar = 0;
    wchar = 0;
}

syscall:::entry
{
    ++scall;
}

sysinfo:::sysread
{
    ++sread;
}

sysinfo:::syswrite
{
    ++swrit;
}

sysinfo:::sysfork
{
    ++fork;
}

sysinfo:::sysexec
{
    ++exec;
}

sysinfo:::readch
{
    rchar = rchar + arg0;
}

sysinfo:::writech
{
    wchar = wchar + arg0;
}

tick-1sec
{
    ++i;
}

/* At the end of each interval, print per-second averages and reset. */
tick-1sec
/i == $1/
{
    ++n;
    printf("%10d %10d %10d %10d %10d %10d %10d\n",
        scall/i, sread/i, swrit/i, fork/i, exec/i, rchar/i, wchar/i);
    i = 0;
    scall = 0;
    sread = 0;
    swrit = 0;
    fork = 0;
    exec = 0;
    rchar = 0;
    wchar = 0;
}

tick-1sec
/n == $2/
{
    exit(0);
}
# ./sar-c.d 5 6
   scall/s    sread/s    swrit/s     fork/s     exec/s    rchar/s    wchar/s
        43          0          0          0          0          0         15
        70          1          2          0          0          1         32
        42          2          2          0          0          2         17
        75          0          1          0          0        351         39
       436         26         34          3          3       3329        317
        38          0          0          0          0          0         15
Example of a Custom Tool Resembling the vmstat(1M) Command

The following D script uses the vminfo provider to implement a tool similar to the vmstat(1M) command. It displays three fields from the vmstat(1M) command:
- free field: Displays the system's average value of freemem in kilobytes
- re field: Displays the average page reclaims per second
- sr field: Displays the average page scans per second performed by the page daemon

# cat vm.d
#!/usr/sbin/dtrace -qs
/*
 * Usage: vm.d interval count
 */

BEGIN
{
    printf("%8s %8s %8s\n", "free", "re", "sr");
}

tick-1sec
{
    ++i;
    /* freemem is in pages; multiply by 8 to sum in kilobytes */
    @free["freemem"] = sum(8*`freemem);
}

vminfo:::pgrec
{
    ++re;
}

vminfo:::scan
{
    ++sr;
}

tick-1sec
/i == $1/
{
    normalize(@free, $1);
    printa("%8@d ", @free);
    printf("%8d %8d\n", re/i, sr/i);
    ++n;
    i = 0;
    re = 0;
    sr = 0;
    clear(@free);
}

tick-1sec
/n == $2/
{
    exit(0);
}
# ./vm.d 5 12
    free       re       sr
  385296        0        0
  385296        0        0
  385296        0        0
  385296        0        0
  316180        2        0
   22297        1    19040
    1976        2    31727
    1964        3    31727
    1971        2    31727
    1968        3    31727
    1964        3    31727
    1955        4    31728
Like the vmstat(1M) command, the vm.d script expects two arguments: the interval value and a count value. The i, re, sr, and n variables are D global scalar variables used for counting. Note the special reference to the kernel's freemem variable: `freemem. The script multiplies freemem by 8 because it sums in units of kilobytes, not pages, and the assumption is that a page is 8 Kbytes in size.

The script uses the sum() aggregation with the normalize() built-in function, which divides the current sum by the interval value to get per-second averages. The script also clears the running sum of freemem every interval with the clear() built-in function. The printa() built-in function, which is covered in detail in Appendix A, prints the value of the sum() aggregation.

Because you are using integer-truncated arithmetic, you can lose some data. This is also true when using the vmstat(1M) command. For example, if there are only four page reclaims in the five-second interval, then the average per second shows as 0.

This output shows that the system is experiencing sustained scanning of memory by the page daemon, as indicated by the consistently high number of scans per second. It also shows that someone has used most of the free memory within a short period of time, which explains the high scan rates.
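The arithmetic described above can be checked with a few lines of Python. This is a minimal sketch, not part of vm.d: per_second() and avg_free_kb() are hypothetical helper names, and the 48162-page sample is chosen only because 8 times that value matches the first free figure in the output.

```python
PAGE_KB = 8  # assumption carried over from vm.d: one page is 8 Kbytes

def per_second(total, interval):
    """Integer-truncated average, like re/i and sr/i in vm.d."""
    return total // interval

def avg_free_kb(freemem_pages_samples, interval):
    """Sum of 8*freemem over the interval, then divided by the interval."""
    return sum(PAGE_KB * pages for pages in freemem_pages_samples) // interval

print(per_second(4, 5))             # four reclaims in a five-second interval
print(avg_free_kb([48162] * 5, 5))  # five one-second samples of freemem
```

The first call shows the data loss the text mentions: four events spread over five seconds truncate to zero per second.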

Module 3
Debugging Applications With DTrace

Objectives
- Use DTrace to profile an application
- Use DTrace to access application variables
- Use DTrace to find transient system call errors in an application
- Use DTrace to determine the names of files being opened
Relevance
Discussion: The following questions are relevant to understanding how to use DTrace for application debugging:
- Would it be useful to follow the software stack sequentially from the application into the kernel?
- Would it be useful to display path names being passed to system calls while an application is running?
- Would it be useful to know where an application is spending the majority of its time?

- Sun Microsystems, Inc. Solaris Dynamic Tracing Guide, part number 817-6223-10.
- Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. Dynamic Instrumentation of Production Systems. Paper presented at 2004 USENIX Conference.
- BigAdmin System Administration Portal [http://www.sun.com/bigadmin/content/dtrace].
- dtrace(1M) manual page in the Solaris 10 OS manual pages, Solaris 10 Reference Manual Collection.
- The /usr/demo/dtrace directory contains all of the sample scripts from the Solaris Dynamic Tracing Guide.
Application Profiling
DTrace provides tools for understanding the behavior of user processes. It can help you to:
- Debug applications
- Analyze application performance problems
- Understand the behavior of a complex application
These tools can be used alone to determine the cause of problems with application program behavior, or as an adjunct to traditional debugging tools such as the mdb(1) debugger. This module describes the DTrace facilities used to trace user process activity. It also provides examples of how to use those facilities.
The pid Provider

The pid provider can trace the entry and return of any function in a user application. It can also trace any instruction of the running application as specified by its virtual address, which can be given numerically or as a function name plus offset. The pid provider has no probe effect overhead when probes are not enabled.

The pid provider defines a class of providers; any process can have its own associated pid provider. You trace a process with process identification number (PID) 1234, for example, by using the pid1234 provider.

Unlike most other providers, the pid provider creates probes on demand based on the probe descriptions found in your D programs. As a result, you do not see any pid probes listed in the output of the dtrace -l command until you have enabled them. This is shown in the following example:

# dtrace -l | awk '{print $2}' | sort -u
PROVIDER
dtrace
fasttrap
fbt
fpuinfo
io
lockstat
mib
proc
profile
sched
sdt
syscall
sysinfo
vminfo
#
Enabling pid Probes

In the following example, you enable all of the function entry probes for the shell:

# echo $$
8586
# dtrace -n 'pid8586:::entry'
dtrace: description 'pid8586:::entry' matched 6653 probes
^C
# dtrace -l | awk '{print $2}' | sort -u
PROVIDER
dtrace
fasttrap
fbt
fpuinfo
io
lockstat
mib
pid8586
proc
profile
sched
sdt
syscall
sysinfo
vminfo
Naming pid Probes

The module portion of the probe description refers to an object loaded in the address space of the corresponding process. You can list the objects using the mdb(1) debugger, as shown in the following example:

# mdb -p 8586
Loading modules: [ ld.so.1 libc.so.1 ]
> ::objects
    BASE    LIMIT     SIZE NAME
   10000    a4000    94000 /usr/bin/bash
ff3b0000 ff3da000    2a000 /lib/ld.so.1
ff350000 ff37a000    2a000 /lib/libcurses.so.1
ff320000 ff32c000     c000 /lib/libsocket.so.1
ff200000 ff290000    90000 /lib/libnsl.so.1
ff3a0000 ff3a2000     2000 /lib/libdl.so.1
ff100000 ff1d2000    d2000 /lib/libc.so.1
ff2d0000 ff2d4000     4000 /usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
> $q
#

You name the object using only the file name portion, not the complete path name. You can also omit the suffixes. The following names describe the same probe:

pid8586:libc.so.1:strcmp:entry
pid8586:libc.so:strcmp:entry
pid8586:libc:strcmp:entry

For the executable load object, use either the file name of the executable or a.out. The following two probe descriptions name the same probe:

pid8586:bash:main:return
pid8586:a.out:main:return
Tracing Library Functions

The following example shows that executing a simple date(1) command in the bash shell results in 14 strcmp function calls:

# ps -ef | grep bash
    root  8567  8561  0 07:36:26 pts/1    0:00 bash
    root  8577  8571  0 07:37:03 pts/2    0:00 bash
    root  8586  8580  0 07:37:31 pts/3    0:01 bash
    root  8888  8577  0 14:14:25 pts/2    0:00 grep bash
# echo $$
8577
# dtrace -n 'pid8567:libc:strcmp:entry'
dtrace: description 'pid8567:libc:strcmp:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
Tracing User Functions

The simplest mode of operation for the pid provider is as the user-level analogue to the fbt provider. The following example traces all function entries and returns made from a given function. The tracecalls.d D script takes two command-line arguments: $1 for the PID of the process being traced, and $2 for the function name from which you want to trace all function calls.

The simple C program that the script is going to trace is shown below. This C program calls one function after another, performing simple arithmetic operations:

# cat calls.c
#include <stdio.h>      /* printf() */
#include <unistd.h>     /* usleep() */

int f5(int a, int b)
{
    return (a+b);
}

int f4(int a, int b)
{
    int r;

    r = f5(a,b)+13;
    return(r);
}

int f3(int a)
{
    int r;

    usleep(650);
    r = f4(a-3, a+3);
    return(r);
}

int f2(int a)
{
    return(f3(5*a));
}

int f1(int a, int b)
{
    int r;

    usleep(90);
    r = f2(a-b);
    return(r);
}

int main(void)
{
    int x;

    x = f1(13,6);
    printf("%d\n", x);
    x = f1(17,5);
    printf("%d\n", x);
    return(0);
}
# gcc calls.c -o calls
# calls
83
133
# cat tracecalls.d
#!/usr/sbin/dtrace -s

/* Start tracing when the function named by $2 is entered. */
pid$1:calls:$2:entry
{
    self->trace = 1;
}

/* Stop tracing when that function returns. */
pid$1:calls:$2:return
/self->trace/
{
    self->trace = 0;
}

/* An empty clause records each entry and return while tracing is on. */
pid$1:calls::entry,
pid$1:calls::return
/self->trace/
{
}
You start the calls application in a second window through the mdb(1) debugger. This enables you to stop it as soon as possible in the start-up function that calls the main() function. The _start:b command sets a breakpoint in the _start function where the application starts running. The :r command starts the process running; it immediately hits the breakpoint and stops. You then escape from the debugger by using the !ps command to find the PID of the calls process:

# mdb calls
> _start:b
> :r
mdb: stop at _start
mdb: target stopped at:
_start:         clr       %fp
> !ps
   PID TTY      TIME CMD
  8916 pts/3    0:00 ps
  8914 pts/3    0:00 calls
  8586 pts/3    0:01 bash
  8915 pts/3    0:00 sh
  8580 pts/3    0:00 sh
  8913 pts/3    0:00 mdb

You can now run the dtrace command in the first terminal window to trace the function calls, starting with the f1 function. You must also continue the process with the :c mdb command after starting the dtrace command:

# dtrace -F -s tracecalls.d 8914 f1
dtrace: script 'tracecalls.d' matched 16 probes

In the second terminal window you continue the process:

> :c
83
133
mdb: target has terminated
> $q

The call sequence is shown in the first (dtrace) terminal window:

CPU FUNCTION
  0  -> f1
  0    -> f2
  0      -> f3
  0        -> f4
  0          -> f5
  0          <- f5
  0        <- f4
  0      <- f3
  0    <- f2
  0  -> f1
  0    -> f2
  0      -> f3
  0        -> f4
  0          -> f5
  0          <- f5
  0        <- f4
  0      <- f3
  0    <- f2
^C
Tracing Function Arguments

By adding a line to the tracecalls.d script, you can print the arguments to the functions as well as return value information. At an entry probe, the arguments to the function are represented with arg0, arg1, arg2, and so on. At a return probe, the function return value is placed in the arg1 argument, and the arg0 argument contains the offset within the function where the return occurred. The following D script example prints the arguments to functions:

# cat tracecalls2.d
#!/usr/sbin/dtrace -s

pid$1:calls:$2:entry
{
    self->trace = 1;
}

pid$1:calls:$2:return
/self->trace/
{
    self->trace = 0;
}

pid$1:calls::entry,
pid$1:calls::return
/self->trace/
{
    /* entry: first two arguments; return: offset and return value */
    printf("%d %d", arg0, arg1);
}
# dtrace -F -s tracecalls2.d 8944 f1
dtrace: script 'tracecalls2.d' matched 16 probes
CPU FUNCTION
  0  -> f1                                    13 6
  0    -> f2                                  7 7
  0      -> f3                                35 35
  0        -> f4                              32 38
  0          -> f5                            32 38
  0          <- f5                            40 70
  0        <- f4                              56 83
  0      <- f3                                68 83
  0    <- f2                                  52 83
  0  -> f1                                    17 5
  0    -> f2                                  12 12
  0      -> f3                                60 60
  0        -> f4                              57 63
  0          -> f5                            57 63
  0          <- f5                            40 120
  0        <- f4                              56 133
  0      <- f3                                68 133
  0    <- f2                                  52 133
^C

The following commands are entered in the mdb(1) window which started the calls program. On return from a function, the arg0 argument is the offset within the function where the restore instruction executed to leave the function, and the arg1 argument is the return value, as follows:

> f5+0t40/i
f5+0x28:
f5+0x28:        restore
> f5+0x24,2/i
f5+0x24:
f5+0x24:        ret
f5+0x28:        restore
> f2+0t48,2/i
f2+0x30:
f2+0x30:        ret
f2+0x34:        restore
>

The f5+0t40 address represents 40 decimal bytes into the f5 function, which the trace output shows was placed in the arg0 argument when the f5 function returned. For arg1, the return value from the f5 function on the first return was 70; on the second return it was 120. The f5+0x24,2/i command in the mdb(1) debugger displays two instructions starting at address f5+0x24. Functions typically return by using these two SPARC instructions. All SPARC instructions are four bytes in length. At address f2+0x34 is another restore instruction.
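The relationship between the decimal arg0 offsets in the trace and the hexadecimal addresses typed into mdb(1) is simple base conversion. The Python sketch below illustrates it; describe_return_offset() is a hypothetical helper, and the only facts it relies on are the offsets from the trace and the four-byte SPARC instruction length stated in the text.

```python
SPARC_INSN_BYTES = 4  # every SPARC instruction is four bytes long

def describe_return_offset(func, arg0):
    """Turn a return-probe arg0 (decimal byte offset) into mdb-style notation."""
    return {
        "mdb_decimal": f"{func}+0t{arg0}",            # 0t marks decimal in mdb
        "mdb_hex": f"{func}+{hex(arg0)}",
        "insn_index": arg0 // SPARC_INSN_BYTES,       # 0-based instruction slot
    }

print(describe_return_offset("f5", 40))
print(describe_return_offset("f2", 52))
```

For f5, offset 40 is 0x28, the restore instruction in the mdb output; for f2, offset 52 is 0x34, the second restore the text points out.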
Tracing Calls Into the Kernel

In the following example you trace a simpler version of the calls program into the kernel:

# cat calls2.c
#include <stdio.h>      /* printf() */

int f5(int a, int b)
{
    return (a+b);
}

int f4(int a, int b)
{
    int r;

    r = f5(a,b)+13;
    return(r);
}

int f3(int a)
{
    int r;

    r = f4(a-3, a+3);
    return(r);
}

int f2(int a)
{
    return(f3(5*a));
}

int f1(int a, int b)
{
    int r;

    r = f2(a-b);
    return(r);
}

int main(void)
{
    int x;

    x = f1(13,6);
    printf("%d\n", x);
    return(0);
}
# cat traceall.d
#!/usr/sbin/dtrace -qs
#pragma D option flowindent

pid$1::$2:entry
{
    self->trace = 1;
}

pid$1:::entry, pid$1:::return, fbt:::
/self->trace/
{
    /* pr_syscall is nonzero only while the thread is in a system call */
    printf("%s\n", curlwpsinfo->pr_syscall ? "K" : "U");
}

pid$1::$2:return
/self->trace/
{
    self->trace = 0;
}

The traceall.d D script uses a #pragma statement to set the equivalent of the -F option of the dtrace(1M) command to indent the function calls. The pr_syscall field of the lwp information data structure, to which the curlwpsinfo built-in variable points, is 0 when the thread is not in the kernel; otherwise it is the system call number. You use this to indicate whether you are tracing user code (U) or kernel code (K). The traced calls follow. Many of the function calls are for setting up the dynamic binding to the library functions on first call. The following example shows a portion of the output of this script:

# traceall.d 12861 main

CPU FUNCTION
  0  -> main  U
  0    -> f1  U
  0      -> f2  U
  0        -> f3  U
  0          -> f4  U
  0            -> f5  U
  0            <- f5  U
  0          <- f4  U
  0        <- f3  U
  0      <- f2  U
  0    <- f1  U
  0    -> elf_rtbndr  U
  0      -> elf_bndr  U
  0        -> enter  U
  0          -> rt_bind_guard  U
  0          <- rt_bind_guard  U
  0          -> _ti_bind_guard  U
  0          <- _ti_bind_guard  U
  0          -> rt_mutex_lock  U
  0          <- rt_mutex_lock  U
  0          -> _lwp_mutex_lock  U
  0          <- _lwp_mutex_lock  U
  0        <- enter  U
  0        -> lookup_sym  U
  0          -> elf_hash  U
  0          <- elf_hash  U
  0          -> callable  U
  0          <- callable  U
  0          -> elf_find_sym  U
  0            -> strcmp  U
...
  0      <- elf_bndr  U
  0    <- elf_rtbndr  U
  0    -> printf  U
  0      -> _flockget  U
  0        -> mutex_lock  U
  0        <- mutex_lock  U
  0        -> mutex_lock_impl  U
  0        <- mutex_lock_impl  U
  0      <- _flockget  U
  0      -> _setorientation  U
  0      <- _setorientation  U
  0      -> _ndoprnt  U
  0        -> elf_rtbndr  U
  0          -> elf_bndr  U
  0            -> enter  U
  0              -> rt_bind_guard  U
...
  0          -> _write  U
  0            -> pre_syscall  K
  0              -> syscall_mstate  K
  0              <- syscall_mstate  K
  0            <- pre_syscall  K
  0            -> write32  K
  0            <- write32  K
  0            -> write  K
  0              -> getf  K
  0                -> set_active_fd  K
...
  0                <- clear_active_fd  K
  0                -> cv_broadcast  K
  0                <- cv_broadcast  K
  0              <- releasef  K
  0            <- write  K
  0            -> post_syscall  K
  0              -> clear_stale_fd  U
  0              <- clear_stale_fd  U
  0              -> syscall_mstate  U
  0              <- syscall_mstate  U
  0            <- post_syscall  U
  0        <- _xflsbuf  U
  0        -> ferror_unlocked  U
  0        <- ferror_unlocked  U
  0      <- _ndoprnt  U
  0      -> ferror_unlocked  U
  0      <- ferror_unlocked  U
  0      -> mutex_unlock  U
  0      <- mutex_unlock  U
  0    <- printf  U
  0  <- main  U
^C
Tracing Arbitrary Instructions

You can use the pid provider to trace any instruction in any user function. Upon demand, the pid provider creates a probe for every instruction in a function. The name of each probe is the offset in hexadecimal of the corresponding instruction in the function. The following example traces the instruction 10 (hexadecimal) bytes into the strcmp function while the bash shell runs the date(1) command:

# dtrace -n 'pid28845:libc:strcmp:10'
dtrace: description 'pid28845:libc:strcmp:10' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
  0  39492                        strcmp:10
^C

You see this instruction near the beginning of the strcmp C library function, where it is called 14 times when the bash shell runs the date(1) command. You can see which instructions within the strcmp C library function are executed by tracing all of the function's instructions, as follows:

# dtrace -n 'pid28845:libc:strcmp:'
dtrace: description 'pid28845:libc:strcmp:' matched 128 probes
CPU     ID                    FUNCTION:NAME
  0  39494                     strcmp:entry
  0  39495                         strcmp:0
  0  39496                         strcmp:4
  0  39497                         strcmp:8
  0  39498                         strcmp:c
  0  39492                        strcmp:10
  0  39499                        strcmp:14
  0  39500                        strcmp:18
  0  39511                        strcmp:44
  0  39512                        strcmp:48
  0  39513                        strcmp:4c
  0  39582                       strcmp:160
  0  39583                       strcmp:164
  0  39584                       strcmp:168
  0  39585                       strcmp:16c
  0  39586                       strcmp:170
  0  39587                       strcmp:174
  0  39588                       strcmp:178
  0  39589                       strcmp:17c
  0  39597                       strcmp:19c
  0  39598                       strcmp:1a0
  0  39599                       strcmp:1a4
  0  39600                       strcmp:1a8
  0  39601                       strcmp:1ac
  0  39602                       strcmp:1b0
  0  39603                       strcmp:1b4
  0  39604                       strcmp:1b8
  0  39605                       strcmp:1bc
  0  39606                       strcmp:1c0
  0  39607                       strcmp:1c4
  0  39618                       strcmp:1f0
  0  39619                       strcmp:1f4
  0  39493                    strcmp:return
  0  39494                     strcmp:entry
  0  39495                         strcmp:0
  0  39496                         strcmp:4

The previous output shows the strcmp function executing each instruction sequentially until, after the instruction at strcmp+0x18, execution continues at strcmp+0x44. You can display some of the assembly instructions using the mdb(1) debugger:
# mdb -p 8567
Loading modules: [ ld.so.1 libc.so.1 ]
> libc`strcmp,14/ai
libc.so.1`strcmp:
libc.so.1`strcmp:       subcc   %o0, %o1, %o2
libc.so.1`strcmp+4:     be      +0xac   <libc.so.1`strcmp+0xb0>
libc.so.1`strcmp+8:     sethi   %hi(0x1010000), %o5
libc.so.1`strcmp+0xc:   andcc   %o0, 3, %o3
libc.so.1`strcmp+0x10:  or      %o5, 0x101, %o5
libc.so.1`strcmp+0x14:  be      +0x30   <libc.so.1`strcmp+0x44>
libc.so.1`strcmp+0x18:  sll     %o5, 7, %o4
libc.so.1`strcmp+0x1c:  sub     %o3, 4, %o3
libc.so.1`strcmp+0x20:  ldub    [%o1 + %o2], %o0
libc.so.1`strcmp+0x24:  ldub    [%o1], %g1
libc.so.1`strcmp+0x28:  subcc   %o0, %g1, %o0
libc.so.1`strcmp+0x2c:  bne     +0x1c4  <libc.so.1`strcmp+0x1f0>
libc.so.1`strcmp+0x30:  addcc   %o0, %g1, %g0
libc.so.1`strcmp+0x34:  be      +0x1bc  <libc.so.1`strcmp+0x1f0>
libc.so.1`strcmp+0x38:  addcc   %o3, 1, %o3
libc.so.1`strcmp+0x3c:  bne     -0x1c   <libc.so.1`strcmp+0x20>
libc.so.1`strcmp+0x40:  add     %o1, 1, %o1
libc.so.1`strcmp+0x44:  andcc   %o1, 3, %o3
libc.so.1`strcmp+0x48:  be      +0x118  <libc.so.1`strcmp+0x160>
libc.so.1`strcmp+0x4c:  cmp     %o3, 2
The instruction at the strcmp+0x18 address is a shift left logical (sll), which is in the delay slot after the conditional branch instruction (be) at strcmp+0x14. This instruction executes before the one at address strcmp+0x44 even when the branch is taken, which in this execution it was. Another conditional branch was taken at address strcmp+0x48.

DTrace enables you to trace, instruction by instruction, the actual execution flow through the logic of a program. This is an improvement over the traditional debugging techniques of inserting print statements in your application or of running the application under a debugger and setting breakpoints where appropriate.
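Taken branches can be spotted mechanically in such a trace: any step that is not exactly four bytes past the previous probe offset is a jump. The following Python sketch illustrates the idea on the first dozen offsets from the strcmp trace; find_jumps() is a hypothetical helper, not a DTrace facility.

```python
INSN_BYTES = 4  # SPARC instructions are four bytes long

def find_jumps(offsets):
    """Return (from, to) pairs where execution did not fall through."""
    return [(hex(a), hex(b))
            for a, b in zip(offsets, offsets[1:])
            if b != a + INSN_BYTES]

# Probe names (hexadecimal offsets) in the order they fired in the trace.
trace = [0x0, 0x4, 0x8, 0xc, 0x10, 0x14, 0x18, 0x44, 0x48, 0x4c, 0x160, 0x164]
print(find_jumps(trace))
```

Because of the SPARC delay slot, the "from" offset in each pair is the delay-slot instruction; the branch itself is the instruction four bytes earlier (0x14 and 0x48 in this trace).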
Determining Time Spent in Functions

Using an associative array and the quantize aggregation built-in function, you can determine the amount of time spent in every function of an application. The following D script displays a power-of-two distribution of how much time (in nanoseconds) is spent in every function of the calls application. A clause-local variable is used to calculate the elapsed time:

# cat timespent.d
#!/usr/sbin/dtrace -qs

pid$1:::entry
{
    self->t[probefunc] = timestamp;
}

pid$1:::return
/self->t[probefunc]/
{
    this->elapsed = timestamp - self->t[probefunc];
    @[probefunc] = quantize(this->elapsed);
    self->t[probefunc] = 0; /* frees memory */
}
# ./timespent.d 8950
^C
...
  usleep
           value  ------------- Distribution ------------- count
         1048576 |                                         0
         2097152 |@@@@@@@@@@                               1
         4194304 |@@@@@@@@@@                               1
         8388608 |@@@@@@@@@@@@@@@@@@@@                     2
        16777216 |                                         0
...
  f4
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |@@@@@@@@@@@@@@@@@@@@                     1
           65536 |@@@@@@@@@@@@@@@@@@@@                     1
          131072 |                                         0
...
  f1
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@                     1
        16777216 |@@@@@@@@@@@@@@@@@@@@                     1
        33554432 |                                         0
...
  main
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
        67108864 |                                         0
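Each row label in a quantize() distribution is the largest power of two that does not exceed the recorded value. The short Python sketch below models that rule for positive values only; quantize_bucket() is a hypothetical name, and the two sample elapsed times are made up for illustration.

```python
def quantize_bucket(value):
    """Largest power of two <= value, i.e. the quantize() row for value >= 1."""
    if value < 1:
        raise ValueError("this sketch handles positive values only")
    return 1 << (value.bit_length() - 1)

print(quantize_bucket(20_000_000))  # a hypothetical 20 ms elapsed time, in ns
print(quantize_bucket(650_000))     # a hypothetical 650 us elapsed time, in ns
```

A row labelled 16777216 therefore counts every elapsed time from 16777216 up to, but not including, 33554432 nanoseconds.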
The profile Provider

The profile provider provides unanchored probes: probes that are not associated with any particular point of execution. When you specify these probes, you leave off both the module and the function portion of the probe description. Instead of being tied to a specific program location, the profile probes are associated with an asynchronous, time-based interrupt that fires at a fixed, specified time interval. You can use these probes to sample an aspect of system state at the specified interval. For example, you can sample the state of the current thread, the state of a central processing unit (CPU), or the current machine instruction. You can then use the samples to infer system behavior.
Using the profile-n Probes

A profile-n probe fires at a fixed interval on every CPU at high interrupt level. These probes are used to profile the execution of an application because you do not know which CPU it may be running on at any instant in time. With no suffix, a profile-n probe fires n times per second. You can add one of the following suffixes to make n a time interval instead: ns for nanoseconds, us for microseconds, ms for milliseconds, s for seconds, m for minutes, h for hours, or d for days. For example, the following probes fire at the same rate:
- profile-200: Fires 200 times per second on every CPU
- profile-5ms: Fires every 5 milliseconds on every CPU
- profile-5000us: Fires every 5000 microseconds on every CPU
The following probes fire once per day:
- profile-1d
- profile-24h
The following script should output numbers that increase by approximately one million (nanoseconds):

# dtrace -q -n 'profile-1ms {printf("%d\n", timestamp)}'
274817618640560
274817619628282
274817620626998
274817621624780
274817622624686
^C

Currently you cannot specify a time interval less than 200 microseconds with the profile provider, as the following example shows:

# dtrace -q -n 'profile-199us {printf("%d\n", timestamp)}'
dtrace: invalid probe specifier profile-199us {printf("%d\n", timestamp)}: probe description :::profile-199us does not match any probes
# dtrace -q -n 'profile-200us {printf("%d\n", timestamp)}'
275328143837997
275328144030602
275328144229696
275328144431022
^C
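The equivalence between rate-style and interval-style probe names comes down to unit conversion. The Python sketch below works it out; interval_ns() and the UNIT_NS table are hypothetical names, and the s and sec entries are assumptions based on probe names such as tick-1sec used in the scripts in this guide.

```python
import re

# Nanoseconds per unit for each suffix (s/sec assumed from names like tick-1sec).
UNIT_NS = {"ns": 1, "us": 1_000, "ms": 1_000_000,
           "s": 10**9, "sec": 10**9,
           "m": 60 * 10**9, "h": 3600 * 10**9, "d": 86400 * 10**9}

def interval_ns(probe):
    """Nanoseconds between firings for a profile-n or tick-n probe name."""
    m = re.fullmatch(r"(?:profile|tick)-(\d+)([a-z]*)", probe)
    if not m:
        raise ValueError(f"not a profile/tick probe name: {probe}")
    n, suffix = int(m.group(1)), m.group(2)
    if suffix == "":                   # no suffix: n firings per second
        return 10**9 // n
    return n * UNIT_NS[suffix]         # suffix: n is a time interval

for name in ("profile-200", "profile-5ms", "profile-5000us"):
    print(name, interval_ns(name))

# Spacing of the first two timestamps printed by the profile-1ms example.
print(274817619628282 - 274817618640560)
```

The last line prints a difference just under one million nanoseconds, which is what "approximately one million" means for the profile-1ms output.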

Sampling Process Activity

The following D script samples 109 times per second to see which processes are running. The count indicates which processes have run the most often during the interval that the script runs:

# cat running.d
#!/usr/sbin/dtrace -qs

profile-109
/pid != 0/
{
    @[pid, execname] = count();
}

END
{
    printf("%-8s %-40s %s\n", "PID", "CMD", "COUNT");
    printa("%-8d %-40s %@d\n", @);
}
# ./running.d
^C
PID      CMD                                      COUNT
9190     grep                                     1
9191     bash                                     1
9190     bash                                     1
9189     bash                                     1
9188     uptime                                   2
8586     bash                                     2
9191     vi                                       12
3        fsflush                                  24
9192     find                                     80

You can use the profile-n probes to sample information about a specific process. The following script samples the priority of the shell thread 211 times per second (slightly more often than every 5 milliseconds) while the shell is running in an infinite loop:

# echo $$
8586
# while : ; do : ; done

In another window, run the following D script:

# cat profilepri.d
#!/usr/sbin/dtrace -qs
profile-211
/pid == $1/
{
    @[execname] = lquantize(curlwpsinfo->pr_pri, 0, 100, 10);
}
# ./profilepri.d 8586
^C
  bash
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@                 271
              10 |@@@@@@                                   63
              20 |@@@@                                     48
              30 |@@@                                      32
              40 |@                                        15
              50 |@@                                       24
              60 |                                         0
In the previous example, the curlwpsinfo built-in variable points to a structure containing lwp information. This structure is described in the proc(4) manual page. The output shows the Solaris timesharing scheduler's bias toward priority zero for compute-bound threads. The high counts indicate that this thread is running more frequently than other threads on the system.

In the following example, you see the results of running the next invocation of the script when the shell is running in its more normal mode of executing a few interactive commands:

# ./profilepri.d 8586
^C
  bash
           value  ------------- Distribution ------------- count
              30 |                                         0
              40 |@@@@@@@@@@@@@                            1
              50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
              60 |                                         0

This shows that the shell's priority is higher when run interactively, where it spends most of its time waiting on input; the small counts indicate that it was not running frequently.
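The lquantize(curlwpsinfo->pr_pri, 0, 100, 10) call in profilepri.d groups priorities into linear rows ten wide. The Python sketch below models how a single value is assigned to a row; lquantize_bucket() is a hypothetical name, and the sample priorities are made up for illustration.

```python
def lquantize_bucket(value, low, high, step):
    """Row label that lquantize(value, low, high, step) would count under."""
    if value < low:
        return f"< {low}"
    if value >= high:
        return f">= {high}"
    return str(low + ((value - low) // step) * step)

print(lquantize_bucket(7, 0, 100, 10))    # a low timesharing priority
print(lquantize_bucket(59, 0, 100, 10))   # a high timesharing priority
```

A priority of 7 lands in the row labelled 0 and a priority of 59 in the row labelled 50, which is how the compute-bound and interactive runs end up at opposite ends of the distribution.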
Using the tick-n Probes

Like profile-n probes, tick-n probes fire at a fixed interval at high interrupt level. However, the tick-n probes fire on only one CPU per interval, rather than on every CPU like the profile-n probes. These probes should not be used to profile an application because it may run on any CPU at any instant in time. You specify the n suffix just as you do for the profile-n probes. For example, tick-20ms fires every 20 milliseconds, but only on one CPU. One use of the tick-n probes is to provide periodic output or to take periodic action. You saw this usage in Module 2 with the custom monitoring tools.
Using Arguments to the profile Provider

You can use the arguments to the profile probes to determine if the executing thread is currently in kernel mode and, if it is not, where within its process address space it is executing when the probe fires. The program counter (PC) register's value is made available when the profile probes fire. The arguments are set as follows:
- The arg0 argument: The PC register value in the kernel at the time the probe fired, or 0 if the current thread was not executing in the kernel at the time that the probe fired
- The arg1 argument: The PC register value in the user-level process at the time the probe fired, or 0 if the current thread was executing in the kernel at the time the probe fired
Profiling an Application Using the profile Provider

You can learn whether your application is executing within its own process address space or within the kernel space by using the arg0 and arg1 arguments, which are set when the profile probes fire. The following D script samples the PC 1009 times per second (slightly faster than every millisecond). The script runs for 10 seconds on a compute-bound application. It also shows how many time intervals, out of the total that occurred in 10 seconds, the application used:

# cat profile.d
#!/usr/sbin/dtrace -qs

profile-1009
{
    ++t;
}

profile-1009
/pid == $1/
{
    @pc[arg1] = count();
    @mode[arg0 ? "kernel" : "user"] = count();
    ++n;
}

tick-10sec
/n/
{
    printa("%-10x\t%@u\n", @pc);
    printf("Total: %u out of %u\n", n, t);
    exit(0);
}
# ./profile.d 9240
ff3163ac        1
0               5
107f8           60
10810           60
10710           64
1084c           64
10754           65
10734           66
10824           69
1083c           69
10738           69
1081c           71
10820           73
106f4           75
10730           75
10728           76
10744           77
10814           77
1074c           79
1074c           79
106e4           79
106d8           79
1075c           80
10770           80
10828           80
1072c           82
10760           83
106f0           86
10758           86
106dc           87
106d4           88
106d0           92
ff2a11e8        132
10764           134
20ac8           137
20acc           141
ff2a11ec        142
10840           144
20ac4           147
10834           172
106cc           306
106e0           562
10714           611
107fc           623
ff2a11e4        716
ff2a11e0        3723
Total: 9887 out of 10002

  kernel                                                            5
  user                                                           9882

In the previous example, the high count in user mode versus kernel mode indicates that this process is compute-bound. By using the mdb(1) debugger as shown in the following example, you can tell where the process is spending most of its time:

> ff2a11e0/i
libc.so.1`.umul:
libc.so.1`.umul:        umul    %o0, %o1, %o0
> ff2a11e4/i
libc.so.1`.umul+4:      rd      %y, %o1
> 107fc/i
mod+0x34:       cmp     %o0, %o1
> 10714/i
prod+0x1c:      cmp     %o0, %o1
> 106e0/i
sum+0x14:       add     %o0, %o1, %o0

This output shows that this process spent most of its time in the C library multiply function: .umul. It spent most of the remaining time in its own mod, prod, and sum functions. The programmer should investigate compiler options to have the multiplication occur with hardware instructions instead of in software. This program was compiled with the gcc compiler with no optimizations.
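The sample counts from profile.d are easier to compare as percentages. The Python sketch below does that arithmetic; the constant and function names are hypothetical, the figures are copied from the output, and only the two program counter values that mdb(1) resolved to .umul are attributed to that function.

```python
# Sample counts copied from the profile.d output.
TOTAL_TICKS = 10002                  # t: every profile-1009 firing
ON_CPU = 9887                        # n: firings that found the target process
MODE = {"kernel": 5, "user": 9882}
UMUL = {0xff2a11e0: 3723, 0xff2a11e4: 716}  # PCs that mdb resolved to .umul

def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

print(percent(ON_CPU, TOTAL_TICKS))          # share of samples the process got
print(percent(MODE["user"], ON_CPU))         # share of those taken in user mode
print(percent(sum(UMUL.values()), ON_CPU))   # share at the two .umul addresses
```

The kernel count of 5 also matches the row for program counter 0 in the @pc output, because arg1 is 0 whenever the sample lands in the kernel.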

Determining Time Spent in Functions

You can use the timespent2.d D script to obtain a graph of the time spent in each function of this process. A special macro, $target, is set to the process ID of the application that is started for you with the -c option to the dtrace(1M) command. The command after the -c must be quoted if it contains arguments:

# cat timespent2.d
#!/usr/sbin/dtrace -qs

pid$target:::entry
{
    self->t[probefunc] = timestamp;
}

pid$target:::return
/self->t[probefunc]/
{
    this->elapsed = timestamp - self->t[probefunc];
    @[probefunc] = quantize(this->elapsed);
    self->t[probefunc] = 0; /* frees memory */
}
# dtrace -s timespent2.d -c ./pgm
dtrace: script 'timespent2.d' matched 5836 probes
^C
...
  .rem
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 15
           16384 |                                         0

  memchr
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@                            5
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             11
           32768 |                                         0

  .div
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@                                  5
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         24
           16384 |@                                        1
           32768 |                                         0

  mutex_lock
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16
           32768 |                                         0
...
  sum
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 13986319
           16384 |                                         15890
           32768 |                                         419
           65536 |                                         14174
          131072 |                                         426
          262144 |                                         282
          524288 |                                         59
         1048576 |                                         57
         2097152 |                                         24
...
  prod
           value  ------------- Distribution ------------- count
     17179869184 |                                         0
     34359738368 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    14
     68719476736 |@@@                                      1
    137438953472 |                                         0
...
  .umul
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 27699230
           16384 |                                         37144
           32768 |                                         943
           65536 |                                         30290
          131072 |                                         864
          262144 |                                         579
          524288 |                                         111
         1048576 |                                         157
         2097152 |                                         74
         4194304 |                                         3

This output shows that the process is spending an average of only 8 to 16 microseconds in both the sum and the .umul functions (the histogram values are in nanoseconds), but they are being called significantly more often than the other functions. The process spent between 34 and 68 seconds in the prod function on 14 of the times that it was called and between 68 and 137 seconds the other time it was called. Finally, the following command builds a table of which functions of an application are called the most frequently:

# dtrace -n 'pid$target:::entry {@[probefunc] = count()}' -c ./pgm
dtrace: description 'pid$target:::entry ' matched 2931 probes
^C
...
  main                                    1
  hdl_create                              1
  elf_entry_pt                            1
  unused                                  1
  rtld_db_postinit                        1
  call_init                               1
  munmap                                  1
...
  printf                                  3
  .rem                                    3
  mod                                     3
  free                                    4
  prod                                    4
  defrag                                  4
  strncpy                                 5
  plt_full_range                          5
  strlen                                  5
...
  strcmp                                 39
  rt_bind_clear                          42
  sum                               3549598
  .umul                             6249598
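To read the quantize() histograms in this section, it helps to remember that each row covers a power-of-two range: a positive value is counted in the row whose label is the largest power of two less than or equal to it. The following C sketch illustrates that rule for positive values only. The function name quantize_bucket is an assumption made for illustration; it is not part of DTrace.

```c
#include <stdint.h>

/*
 * Return the power-of-two bucket that a positive value falls into:
 * the largest power of two that is less than or equal to v.
 * Illustrative helper only; quantize_bucket is not a DTrace function.
 */
uint64_t quantize_bucket(uint64_t v)
{
    uint64_t bucket = 1;

    if (v == 0)
        return 0;
    /* Double the bucket while the next power of two still fits in v. */
    while (bucket <= v / 2)
        bucket *= 2;
    return bucket;
}
```

For example, an elapsed time of 12000 nanoseconds lands in the 8192 row, which is why the 8192 row for sum and .umul is read as 8 to 16 microseconds.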

Application Variables
Accessing process address space information is more difficult than accessing kernel information because DTrace actions run in the kernel. Therefore, to access process data such as application variables or system call argument strings (for example, path names), you must copy the information from the process address space to the kernel. DTrace provides two built-in functions to accomplish this:

• void *copyin(uintptr_t addr, size_t size)
  The copyin function copies the specified size in bytes from the specified user address into a DTrace scratch buffer and returns the address of this buffer. The user address is interpreted as being within the address space of the process associated with the currently running thread when the probe fires.
• string copyinstr(uintptr_t addr)
  The copyinstr function copies a null-terminated C string from the specified user address into a scratch buffer and returns its address.
Displaying Process Global Variables

The following example shows how to display global variables from an application when a probe fires. Two global variables have been added to the calls.c C program you saw previously:

# cat -n calls3.c
     1  int y = 15;
     2  int z = 8;
     3
     4  int f5(int a, int b)
     5  {
     6          ++z;
     7          return (a+b);
     8  }
     9
    10  int f4(int a, int b)
    11  {
    12          int r;
    13
    14          r = f5(a,b)+13;
    15          y = z+r;
    16          return(r);
    17  }
    18
    19  int f3(int a)
    20  {
    21          int r;
    22
    23          usleep(650);
    24          r = f4(a-3, a+3);
    25          z = r*y;
    26          return(r);
    27  }
    28
    29  int f2(int a)
    30  {
    31          return(f3(5*a));
    32  }
    33
    34  int f1(int a, int b)
    35  {
    36          int r;
    37
    38          usleep(90);
    39          r = f2(a-b);
    40          y = z*r;
    41          return(r);
    42  }
    43
    44  main()
    45  {
    46          int x;
    47
    48          x = f1(13,6);
    49          printf("x=%d y=%d z=%d\n", x, y, z);
    50          x = f1(17,5);
    51          printf("x=%d y=%d z=%d\n", x, y, z);
    52  }
# calls3
x=83 y=633788 z=7636
x=133 y=137443530 z=1033410

The following D script is passed three arguments:

• $1 The virtual address of a global variable
• $2 The global variable's size
• $$3 The name of the variable

You have dtrace(1M) start the process by using the -c option. dtrace(1M) sets the $target macro to the process PID. The script displays the value of a global variable on entry and return to every function in the program that is called after the main function.

# cat -n uservariables.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  pid$target:a.out:main:entry
     4  {
     5          started = 1;
     6  }
     7
     8  pid$target:a.out::entry
     9  /started/
    10  {
    11          v = (int *)copyin($1, $2);
    12          printf("On entry to %s: %s=%d\n", probefunc, $$3, *v);
    13  }
    14
    15  pid$target:a.out::return
    16  /started/
    17  {
    18          v = (int *)copyin($1, $2);
    19          printf("On return from %s: %s=%d\n", probefunc, $$3, *v);
    20  }
    21
    22  pid$target:a.out:main:return
    23  {
    24          exit(0);
    25  }

The (int *) in front of the copyin function is called a cast, which is a feature taken from the C language. A cast converts one data type into another data type. In this case, the data type is converted from void *, which is the type of the buffer address into which the variable is copied, to an integer pointer, because you are copying in an integer. You use a * in front of the v variable in the printf statements to dereference the pointer and obtain the value to which it points, namely the integer. The nm(1) command is used to display the symbol table entry for the z variable in the calls3 executable file.

# /usr/ccs/bin/nm calls3 | grep '|z$'
[70]    |    133952|       4|OBJT |GLOB |0    |16     |z

# dtrace -qs uservariables.d -c calls3 133952 4 z
x=83 y=633788 z=7636
x=133 y=137443530 z=1033410
On entry to main: z=8
On entry to f1: z=8
On entry to f2: z=8
On entry to f3: z=8
On entry to f4: z=8
On entry to f5: z=8
On return from f5: z=9
On return from f4: z=9
On return from f3: z=7636
On return from f2: z=7636
On return from f1: z=7636
On entry to f1: z=7636
On entry to f2: z=7636
On entry to f3: z=7636
On entry to f4: z=7636
On entry to f5: z=7636
On return from f5: z=7637
On return from f4: z=7637
On return from f3: z=1033410
On return from f2: z=1033410
On return from f1: z=1033410
On return from main: z=1033410

You can easily display the y variable, as follows:

# /usr/ccs/bin/nm calls3 | grep '|y$'
[67]    |    133948|       4|OBJT |GLOB |0    |16     |y
# dtrace -qs uservariables.d -c calls3 133948 4 y
x=83 y=633788 z=7636
x=133 y=137443530 z=1033410
On entry to main: y=15
On entry to f1: y=15
On entry to f2: y=15
On entry to f3: y=15
On entry to f4: y=15
On entry to f5: y=15
On return from f5: y=15
On return from f4: y=92
On return from f3: y=92
On return from f2: y=92
On return from f1: y=633788
On entry to f1: y=633788
On entry to f2: y=633788
On entry to f3: y=633788
On entry to f4: y=633788
On entry to f5: y=633788
On return from f5: y=633788
On return from f4: y=7770
On return from f3: y=7770
On return from f2: y=7770
On return from f1: y=137443530
On return from main: y=137443530
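The copy-then-cast pattern used in uservariables.d can be tried in ordinary C. The following sketch only imitates the shape of copyin(): it copies bytes into a scratch buffer, returns an untyped pointer, and leaves the caller to cast and dereference it. The names copy_to_scratch and read_int are hypothetical, and the real copyin() copies from the traced process into a kernel scratch buffer rather than calling malloc().

```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy size bytes from src into a freshly allocated scratch buffer and
 * return it as an untyped pointer. This only imitates the shape of
 * copyin(); it is not the DTrace implementation. The caller frees it.
 */
void *copy_to_scratch(const void *src, size_t size)
{
    void *scratch = malloc(size);

    if (scratch != NULL)
        memcpy(scratch, src, size);
    return scratch;
}

/* Read an int through the untyped scratch pointer: cast, then dereference. */
int read_int(const void *src)
{
    int *v = (int *)copy_to_scratch(src, sizeof (int));
    int value = 0;

    if (v != NULL) {
        value = *v;     /* same idea as *v in the D printf statements */
        free(v);
    }
    return value;
}
```

Passing the address of an int that holds 8 to read_int returns 8, which mirrors how the D script prints z=8 from the copied-in buffer.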
Displaying Library Global Variables

The following example displays various errno variables from libraries linked with the bash shell every 211 milliseconds. Run an infinite loop of cd commands that fail in the bash shell. The assumption is that errno should be set to 2 (No such file or directory) by the bash shell when the cd commands fail:

# cat -n libvars.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  tick-211ms
     4  /pid == $1/
     5  {
     6          v = (int *)copyin($2, $3);
     7          printf("The value of %s=%d\n", $$4, *v);
     8  }
# ps -ef | grep bash
    root  9593  9587  0 15:35:27 pts/2    0:00 bash
    root  9583  9577  0 15:35:04 pts/1    0:00 bash
# echo $$
9593
# mdb -p 9583
Loading modules: [ ld.so.1 libc.so.1 ]
> ::objects
    BASE    LIMIT     SIZE NAME
   10000    b2000    a2000 /usr/bin/bash
ff3b0000 ff3dc000    2c000 /lib/ld.so.1
ff350000 ff37a000    2a000 /lib/libcurses.so.1
ff320000 ff32c000     c000 /lib/libsocket.so.1
ff200000 ff292000    92000 /lib/libnsl.so.1
ff3a0000 ff3a2000     2000 /lib/libdl.so.1
ff100000 ff1d4000    d4000 /lib/libc.so.1
ff2d0000 ff2d4000     4000 /usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
> ::nm ! grep '|errno$'
0xff3ee670|0x00000004|OBJT |LOCL |0x2  |21      |errno
0xff1ec03c|0x00000004|OBJT |GLOB |0x0  |21      |errno
> $q
# ./libvars.d 9583 0xff3ee670 4 errno
The value of errno=2
The value of errno=2
The value of errno=2
The value of errno=2
The value of errno=2
^C
# ./libvars.d 9583 0xff1ec03c 4 errno
The value of errno=0
The value of errno=0
The value of errno=0
The value of errno=0
The value of errno=0
^C

The libvars.d D script was run while the bash shell performed the following loop:

# while :; do cd /fubar; done
bash: cd: /fubar: No such file or directory
bash: cd: /fubar: No such file or directory
bash: cd: /fubar: No such file or directory
bash: cd: /fubar: No such file or directory

This shows that the first errno at address 0xff3ee670 is the one set as a result of the cd command failing in the bash shell. The No such file or directory error message corresponds to an errno value of 2.

The plockstat Provider

The plockstat provider gives you details about user-level locking events. It is used similarly to the pid provider when identifying the process to be traced. For example, plockstat1234 would trace user-level lock events for the process with PID 1234. The three types of lock events are hold events, contention events, and error events. Hold events occur when a lock is acquired or released; contention events occur when the application thread must wait for a lock; error events are any detected errors when using the locks. The following example shows how to monitor all lock events for a particular process:

# pgrep sendmail
1196
# dtrace -n 'plockstat1196::: {trace(timestamp)}'
dtrace: description 'plockstat1196::: ' matched 39 probes
CPU     ID                    FUNCTION:NAME
  0  51440        lmutex_lock:mutex-acquire  1523449860253331
  0  51460      lmutex_unlock:mutex-release  1523449860271845
  0  51440        lmutex_lock:mutex-acquire  1523449860283483
  0  51460      lmutex_unlock:mutex-release  1523449860290833
  0  51440        lmutex_lock:mutex-acquire  1523449860325499
  0  51460      lmutex_unlock:mutex-release  1523449860332171
  0  51440        lmutex_lock:mutex-acquire  1523449860341438
  0  51460      lmutex_unlock:mutex-release  1523449860347632
  0  51440        lmutex_lock:mutex-acquire  1523449860378587
  0  51460      lmutex_unlock:mutex-release  1523449860385554
  0  51440        lmutex_lock:mutex-acquire  1523449860394887
  0  51460      lmutex_unlock:mutex-release  1523449860401081
  0  51440        lmutex_lock:mutex-acquire  1523449860447728
  0  51460      lmutex_unlock:mutex-release  1523449860455464
  0  51440        lmutex_lock:mutex-acquire  1523449860465297
  0  51460      lmutex_unlock:mutex-release  1523449860471519
^C

The next example monitors readers/writer lock activity for the vold process. The -p option to dtrace(1M) attaches to a running process and sets the $target macro to its PID:

# pgrep vold
1098
# dtrace -n 'plockstat$target:::rw* {trace(timestamp)}' -p 1098
dtrace: description 'plockstat$target:::rw* ' matched 11 probes
CPU     ID                    FUNCTION:NAME
  0  51474             rwlock_lock:rw-block  1529287107214473
  0  51494           rwlock_lock:rw-acquire  1529287107231728
  0  51496           __rw_unlock:rw-release  1529287107252733
  0  51474             rwlock_lock:rw-block  1529287107403403
  0  51494           rwlock_lock:rw-acquire  1529287107412819
  0  51496           __rw_unlock:rw-release  1529287107423097
  0  51474             rwlock_lock:rw-block  1529287107575211
  0  51494           rwlock_lock:rw-acquire  1529287107583872
  0  51496           __rw_unlock:rw-release  1529287107593238
  0  51474             rwlock_lock:rw-block  1529287107816907
  0  51494           rwlock_lock:rw-acquire  1529287107826079
  0  51496           __rw_unlock:rw-release  1529287107836362
  0  51474             rwlock_lock:rw-block  1529287107928393
  0  51494           rwlock_lock:rw-acquire  1529287107936277
  0  51496           __rw_unlock:rw-release  1529287107945832
  0  51474             rwlock_lock:rw-block  1529287108042880
  0  51494           rwlock_lock:rw-acquire  1529287108051591
  0  51496           __rw_unlock:rw-release  1529287108060852
  0  51474             rwlock_lock:rw-block  1529287108261326
  0  51494           rwlock_lock:rw-acquire  1529287108270476
  0  51496           __rw_unlock:rw-release  1529287108280748
The plockstat(1M) command is a DTrace consumer that uses the plockstat provider to show detailed application lock usage information. The plockstat(1M) command is comparable to the lockstat(1M) command, which shows detailed lock contention information for kernel locks.

Transient System Call Errors

The following D program displays pertinent information any time any process's system call fails. A failed system call returns a value of -1, which is placed in the arg0 argument when a syscall return probe fires. You exclude the system call errors of the dtrace command itself by comparing the PID of the process whose system call failed with that of the dtrace command. When a system call returns -1, the C library interface sets a global user variable named errno to a positive error code, as shown in the following example. These errno values are documented in the Intro(2) manual page and in the /usr/include/sys/errno.h header file.

# cat -n errno.d
     1  #!/usr/sbin/dtrace -qs
     2  syscall:::return
     3  /arg0 == -1 && pid != $pid/
     4  {
     5          printf("%-20s %-10s %d\n",execname,probefunc,errno);
     6  }
# ./errno.d
svc.startd           portfs     62
nscd                 lwp_park   62
fmd                  lwp_park   62
svc.startd           portfs     62
svc.startd           portfs     62
bash                 stat64     2
bash                 chdir      2
bash                 chdir      2
bash                 stat64     2
nscd                 lwp_kill   3
find                 open       2
find                 stat       2
bash                 setpgrp    13
bash                 waitsys    10
date                 open       2
date                 stat       2
ls                   open       2
ls                   stat       2
bash                 setpgrp    13
bash                 waitsys    10
nscd                 lwp_kill   3
^C
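From the application's side, the same convention looks like the following C sketch: the call returns -1 and the error code is then read from errno. The helper name open_errno and the path in the usage note are assumptions chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open path read-only. Return 0 if the open succeeds, otherwise
 * the errno value that the C library stored when open(2) returned -1.
 */
int open_errno(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return errno;   /* for example ENOENT for a missing file */
    close(fd);
    return 0;
}
```

Calling open_errno on a path that does not exist returns ENOENT, which is the value 2 that errno.d reports for the failed open, stat, and chdir calls above.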

User Stack Traces on System Call Failures

By using the ustack() built-in DTrace function, you can also display a stack trace of the application code that issued the failed system call:

# cat -n errno2.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  syscall:::return
     4  /arg0 == -1 && pid != $pid/
     5  {
     6          printf("\n%-20s %-10s %d", execname, probefunc, errno);
     7          ustack();
     8  }
# ./errno2.d

bash                 setpgrp    13
              libc.so.1`_syscall6+0x1c
              35c6c
              34fa8
              bash`execute_command_internal+0x414
              bash`execute_command+0x50
              bash`reader_loop+0x220
              bash`main+0x90c
              bash`_start+0x108

svc.startd           portfs     62
              libc.so.1`_portfs+0x4
              svc.startd`wait_thread+0x30
              libc.so.1`_lwp_start

svc.startd           portfs     62
              libc.so.1`_portfs+0x4
              svc.startd`wait_thread+0x30
              libc.so.1`_lwp_start

bash                 waitsys    10
              libc.so.1`_waitid+0x8
              libc.so.1`waitpid+0x60
              410a0
              41004
              libc.so.1`__sighndlr+0xc
              libc.so.1`call_user_handler+0x3b8
              libc.so.1`__lwp_sigmask+0x30
              libc.so.1`pthread_sigmask+0x1b4
              libc.so.1`sigprocmask+0x20
              bash`make_child+0x254
              35c6c
              34fa8
              bash`execute_command_internal+0x414
              bash`execute_command+0x50
              bash`reader_loop+0x220
              bash`main+0x90c
              bash`_start+0x108

bash                 stat64     2
              libc.so.1`stat64+0x4
              bash`sh_canonpath+0x258
              63638
              bash`cd_builtin+0x364
              352a0
              35a8c
              34fc8
              bash`execute_command_internal+0x414
              bash`execute_command+0x50
              bash`reader_loop+0x220
              bash`main+0x90c
              bash`_start+0x108

find                 open       2
              ld.so.1`__open+0x4
              ld.so.1`elf_config+0x120
              ld.so.1`setup+0xc20
              ld.so.1`_setup+0x37c
              ld.so.1`_rt_boot+0x88

Hexadecimal addresses are shown on the stack trace output when the dtrace command cannot resolve the PC value to a symbol. To find what transient system call errors are occurring in a specific application and where, you simply change the errno2.d script to pass in the PID of the application.

Processes Using a Lot of System Time

Suppose you saw the following prstat(1M) command output:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 12663 root     1104K  672K run      0    0   0:00:13  47% unknown/1
 12662 root     4736K 4392K cpu0    59    0   0:00:00 0.2% prstat/1
   278 root     2976K 1832K sleep   59    0   0:00:15 0.0% nscd/23
  9593 root     2840K 2096K sleep   59    0   0:00:01 0.0% bash/1
 12577 root     2808K 2056K sleep   59    0   0:00:00 0.0% bash/1
   478 root     4696K 1312K sleep   59    0   0:00:21 0.0% sendmail/1
   451 root       10M 5016K sleep   59    0   0:00:09 0.0% snmpd/1
   517 root     2016K  472K sleep   59    0   0:00:00 0.0% ttymon/1
   434 root     3624K 1464K sleep   59    0   0:00:00 0.0% snmpXdmid/2
  9584 root     4520K 2200K sleep   59    0   0:00:00 0.0% in.telnetd/1
   422 root     2280K  824K sleep   59    0   0:00:00 0.0% snmpdx/1
   426 root     4920K 1168K sleep   59    0   0:00:00 0.0% dtlogin/1
   439 root     2968K 1584K sleep   59    0   0:00:00 0.0% vold/3
   476 root     2032K  720K sleep   59    0   0:00:00 0.0% ttymon/1
   433 root     3048K 1032K sleep   59    0   0:00:00 0.0% dmispd/1
   353 root     1872K  136K sleep   59    0   0:00:00 0.0% smcboot/1
   351 root     1880K  168K sleep   59    0   0:00:00 0.0% smcboot/1
   352 root     1872K  152K sleep   59    0   0:00:00 0.0% smcboot/1
   339 root     1200K  472K sleep   59    0   0:00:01 0.0% utmpd/1
   329 root     1560K  488K sleep   59    0   0:00:00 0.0% powerd/2
   281 root     2616K 1200K sleep   59    0   0:00:00 0.0% inetd/1
   265 root     2520K  792K sleep   59    0   0:00:00 0.0% cron/1
   251 root     3800K 1432K sleep   59    0   0:00:01 0.0% automountd/3
   260 root     3784K 1568K sleep   59    0   0:00:00 0.0% syslogd/13
   171 root     2096K 1016K sleep   59    0   0:00:16 0.0% in.routed/1
   185 daemon   2424K  584K sleep   59    0   0:00:00 0.0% rpcbind/1
   189 root     2384K  352K sleep   59    0   0:00:00 0.0% keyserv/2
    68 root     3128K   56K sleep   59    0   0:00:00 0.0% picld/4
    65 daemon   3544K 1208K sleep   59    0   0:00:00 0.0% kcfd/3
    59 root     2368K  152K sleep   59    0   0:00:00 0.0% syseventd/14
Total: 38 processes, 109 lwps, load averages: 0.14, 0.11, 0.09

You can obtain more details on the unknown process by using the following command:

# prstat -m -p 12663
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 12663 root      43  57 0.0 0.0 0.0 0.0 0.0 0.3   0 129 .12   0 unknown/1

The unknown process is using a lot of system time. The following D program can determine what system calls are being made:

# dtrace -n 'syscall:::entry /pid == 12663/ { @syscalls[probefunc] = count();}'
dtrace: description 'syscall:::entry ' matched 226 probes
^C

  read                                 940592

This process appears to be stuck in an endless loop of read(2) system calls. The following truss(1) command confirms this, and shows that the reads are failing:

# truss -p 12663
read(3, 0xFFBFFD0B, 1)                          Err#89 ENOSYS
read(3, 0xFFBFFD0B, 1)                          Err#89 ENOSYS
read(3, 0xFFBFFD0B, 1)                          Err#89 ENOSYS
read(3, 0xFFBFFD0B, 1)                          Err#89 ENOSYS
...

The errno2.d D script shows further evidence of a runaway loop of failing read(2) system calls:

# ./errno2.d

unknown              read       89
              libc.so.1`_read+0x8
              unknown`main+0x134
              unknown`_start+0x5c

unknown              read       89
              libc.so.1`_read+0x8
              unknown`main+0x134
              unknown`_start+0x5c

unknown              read       89
              libc.so.1`_read+0x8
              unknown`main+0x134
              unknown`_start+0x5c
^C
# grep 89 /usr/include/sys/errno.h
/* Copyright (c) 1984, 1986, 1987, 1988, 1989 AT&T */
 * (c) 1983,1984,1985,1986,1987,1988,1989 AT&T.
#define ENOSYS 89 /* Unsupported file system operation */
# pkill unknown

Suppose you saw the following similar prstat(1M) command output:

# prstat -m -p 12745
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 12745 root      17  81 0.0 0.0 0.0 0.0 0.0 1.5   0 132 .5M   0 readchar/1

You can again get details on what system calls are being made, as follows:

# dtrace -n 'syscall:::entry /pid == 12745/ { @syscalls[probefunc] = count();}'
dtrace: description 'syscall:::entry ' matched 225 probes
^C

  stat                                      6
  open                                      6
  write                                     6
  close                                     6
  read                                 760747
# truss -p 12745
read(3, "\b", 1)                                = 1
read(3, "92", 1)                                = 1
read(3, "10", 1)                                = 1
read(3, "\0", 1)                                = 1
read(3, "14", 1)                                = 1
read(3, " @", 1)                                = 1
read(3, "\0", 1)                                = 1
read(3, "82", 1)                                = 1
read(3, " #", 1)                                = 1
read(3, "90", 1)                                = 1
^C
As its name implies, this readchar process is reading a single character at a time. Now run the iosnoop.d D script from Module 2 to get details on the disk input/output (I/O):

# ./iosnoop.d
COMMAND    PID    FILE                        DEVICE  RW  MS
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   6.492
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   6.492
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   6.492
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   6.638
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   2.264
readchar   12745  <none>                      sd2     R   6.398
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   0.696
readchar   12745  /usr/lib/passwdutil.so.1    sd2     R   0.729
readchar   12745  /usr/lib/passwdutil.so.1    sd2     R   1.133
readchar   12745  <none>                      sd2     R   6.646
readchar   12745  /usr/lib/passwdutil.so.1    sd2     R   5.656
readchar   12745  /usr/lib/watchmalloc.so.1   sd2     R   6.622
readchar   12745  /usr/lib/watchmalloc.so.1   sd2     R   6.842
readchar   12745  /usr/lib/watchmalloc.so.1   sd2     R   0.368
readchar   12745  /usr/lib/watchmalloc.so.1   sd2     R   6.488
readchar   12745  <none>                      sd2     R   6.315
readchar   12745  /usr/lib/cpp                sd2     R   7.896
readchar   12745  /usr/lib/cpp                sd2     R   8.128
readchar   12745  /usr/lib/cpp                sd2     R   1.637
readchar   12745  /usr/lib/cpp                sd2     R   1.744
readchar   12745  <unknown>                   sd2     R   5.968
readchar   12745  <unknown>                   sd2     R   0.309
readchar   12745  /usr/lib/libz.so.1          sd2     R   2.075
readchar   12745  /usr/lib/libz.so.1          sd2     R   5.438
readchar   12745  /usr/lib/llib-lz            sd2     R   7.249
readchar   12745  /usr/lib/llib-lz.ln         sd2     R   0.586
readchar   12745  /usr/lib/llib-lz.ln         sd2     R   0.796
readchar   12745  /usr/lib/llib-lz.ln         sd2     R   0.409
readchar   12745  <unknown>                   sd2     R   0.303
readchar   12745  /lib/libm.so.2              sd2     R   4.507
readchar   12745  /lib/libm.so.2              sd2     R   0.484
readchar   12745  /lib/libm.so.2              sd2     R   0.500
readchar   12745  <none>                      sd2     R   5.174
readchar   12745  /lib/libm.so.2              sd2     R   18.945
readchar   12745  /lib/libm.so.2              sd2     R   0.506
readchar   12745  /lib/libm.so.2              sd2     R   2.169
readchar   12745  <unknown>                   sd2     R   0.416
readchar   12745  /lib/libm.so.1              sd2     R   6.297
^C
This application appears to be reading all of the files under the /usr/lib directory one byte at a time. The programmer apparently did not realize that using the standard I/O library functions to buffer reads is more efficient than issuing system call reads of one character at a time. The OS is reading the disk in blocks, as the iosnoop.d D script output indicates, but the application is only extracting the information from the kernel buffers one byte at a time.
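The difference between the two approaches can be sketched in C as follows. The source of the readchar program is not shown in this guide, so both functions are hypothetical illustrations: the first issues one read(2) system call per byte, as the truss output suggests readchar does, and the second lets the standard I/O library fill a buffer and hand out bytes from it.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Count the bytes in a file with one read(2) system call per byte.
 * *syscalls receives the number of read(2) calls that were issued.
 * Returns the byte count, or -1 if the file cannot be opened.
 */
long count_bytes_unbuffered(const char *path, long *syscalls)
{
    char c;
    long bytes = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    *syscalls = 0;
    if (fd == -1)
        return -1;
    for (;;) {
        n = read(fd, &c, 1);    /* enters the kernel for every byte */
        (*syscalls)++;
        if (n != 1)
            break;
        bytes++;
    }
    close(fd);
    return bytes;
}

/*
 * Count the bytes in a file through the standard I/O library. fgetc()
 * returns bytes from a user-level buffer that stdio refills with far
 * fewer, larger read(2) calls.
 */
long count_bytes_stdio(const char *path)
{
    long bytes = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (fgetc(fp) != EOF)
        bytes++;
    fclose(fp);
    return bytes;
}
```

For an 11-byte file, count_bytes_unbuffered issues 12 read(2) calls (11 that return a byte and one that reports end of file), while the stdio version returns the same count with far fewer system calls.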
Open Files
In this section you learn how to display the path names of files being opened. Note that in DTrace it is more difficult to display pointer arguments passed to system calls than those passed as integer arguments. Examples of system calls that take pointer arguments are open(2), stat(2), unlink(2), and chmod(2), which each take path name string arguments. There are also system calls that pass the address of structures, for example, sigaction(2). You must use the appropriate copyinstr() and copyin() built-in functions to display the actual strings or structures being passed to the kernel.
Accessing System Call Pointer Arguments

Suppose you knew an application was writing out literal strings using the write(2) system call, as follows:

# cat -n writemsg.c
     1  main()
     2  {
     3          write(1, "This is some text being", 23);
     4          write(1, " written to standard output", 29);
     5          write(1, " to prove a point\n", 18);
     6  }
# gcc writemsg.c -o writemsg
# writemsg
This is some text being written to standard output to prove a point
#

You might try to display these strings using the following D script:

# cat -n write.d
     1  #!/usr/sbin/dtrace -s
     2
     3  syscall::write:entry
     4  /pid == $target/
     5  {
     6          printf("%s\n", stringof(arg1));
     7  }
# dtrace -s write.d -c writemsg
dtrace: script 'write.d' matched 1 probe
This is some text being written to standard output to prove a point
dtrace: pid 1532 exited with status 1
dtrace: error on enabled probe ID 1 (ID 12: syscall::write:entry): invalid address (0x10000) in action #1
dtrace: error on enabled probe ID 1 (ID 12: syscall::write:entry): invalid address (0x10000) in action #1
dtrace: error on enabled probe ID 1 (ID 12: syscall::write:entry): invalid address (0x10000) in action #1
^C

The arg1 argument used in the write.d D script is the second argument to the write(2) system call, which in this case is the address of the string you want to display. It is a process address, however, and DTrace is running the action statements in the kernel's address space. The stringof() built-in function converts the write(2) system call argument to the proper string type, but it does not copy the data out of the process. For the script to work, you must use the copyinstr() or copyin() built-in DTrace functions shown previously. The following example shows the correct way to access the process's string arguments:

# cat -n write2.d
     1  #!/usr/sbin/dtrace -s
     2
     3  syscall::write:entry
     4  /pid == $target/
     5  {
     6          printf("%s\n", copyinstr(arg1));
     7  }
# dtrace -s write2.d -c writemsg
dtrace: script 'write2.d' matched 1 probe
This is some text being written to standard output to prove a point
dtrace: pid 1537 exited with status 1
CPU     ID                    FUNCTION:NAME
  0     12                      write:entry This is some text being

  0     12                      write:entry  written to standard output

  0     12                      write:entry  to prove a point
The following changes to the D script enable it to work on all system-wide write(2) system calls (except those issued by the dtrace(1M) command):

# cat -n write3.d
     1  #!/usr/sbin/dtrace -s
     2
     3  syscall::write:entry
     4  /pid != $pid/
     5  {
     6          printf("%s\n", copyinstr(arg1));
     7  }
# ./write3.d
dtrace: script './write3.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
ore--ion, name) iption specifiers (provider, module, funce describes how to use 4maction]]
  0    914                      write:entry sys61# ./write2.ddwrite2.dted token `newline' ctory
_________________________________________________________________________
_________________________________________________________________________
_____________________________________________________
  0    914                      write:entry pys61# ./write2.ddwrite2.dted token `newline' ctory
_________________________________________________________________________
_________________________________________________________________________
_____________________________________________________

You received garbage output because the write(2) system call does not necessarily write out null-terminated strings. The copyin() built-in function is the more appropriate function to use because it lets you specify the size of the write:

# cat -n write4.d
     1  #!/usr/sbin/dtrace -s
     2
     3  syscall::write:entry
     4  /pid != $pid/
     5  {
     6          printf("%s\n", stringof(copyin(arg1, arg2)));
     7  }
# ./write4.d
dtrace: script './write4.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    914                      write:entry p
  0    914                      write:entry w
  0    914                      write:entry d
  0    914                      write:entry
  0    914                      write:entry /var/dtrace/mod3
  0    914                      write:entry sys61#
  0    914                      write:entry d
  0    914                      write:entry a
  0    914                      write:entry t
  0    914                      write:entry e
  0    914                      write:entry
  0    914                      write:entry Sun Jun 13 16:55:28 MDT 2004
  0    914                      write:entry sys61#
^C
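The reason an explicit length matters can be shown in plain C. The buffer passed to write(2) is a counted byte range, not a C string, so anything that treats it as a string must copy exactly the stated number of bytes and supply its own terminator. The following sketch assumes a hypothetical helper named counted_to_string; it is an analogy for using copyin() with the byte count rather than copyinstr(), not DTrace code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy exactly len bytes from buf into out and add a terminating NUL.
 * At most outsz - 1 bytes are copied so that the terminator always fits.
 * Returns the number of bytes copied.
 */
size_t counted_to_string(const char *buf, size_t len, char *out, size_t outsz)
{
    if (outsz == 0)
        return 0;
    if (len > outsz - 1)
        len = outsz - 1;
    memcpy(out, buf, len);  /* honor the caller's byte count */
    out[len] = '\0';        /* the terminator is added here, not found in buf */
    return len;
}
```

Given a six-byte buffer holding d, a, t, e, X, Y and a length of 4, the function produces the string "date" and ignores the bytes that follow, whereas scanning for a terminator would run on into unrelated data.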
Displaying Names of Files Being Opened

The following example shows how to display the names of files being opened systemwide:

# cat -n open.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  syscall::open*:entry
     4  {
     5          printf("%s opening %s\n", execname, copyinstr(arg0));
     6  }
# ./open.d
init opening /etc/inittab
init opening /etc/svc/volatile/init-next.state
init opening /etc/svc/volatile/init-next.state
init opening /etc/inittab
man opening /var/ld/ld.config
man opening /lib/libc.so.1
man opening /usr/share/man/man.cf
man opening /usr/share/man/windex
man opening /usr/share/man/sman1m/dtrace.1m
sh opening /var/ld/ld.config
sh opening /lib/libc.so.1
more opening /var/ld/ld.config
more opening /lib/libcurses.so.1
more opening /lib/libc.so.1
more opening /usr/share/lib/terminfo//x/xterm
utmpd opening /var/adm/utmpx
utmpd opening /var/adm/utmpx
utmpd opening /proc/12571/psinfo
utmpd opening /proc/9587/psinfo
date opening /var/ld/ld.config
date opening /lib/libc.so.1
date opening /usr/share/lib/zoneinfo/US/Mountain
vi opening /var/ld/ld.config
vi opening /usr/lib/libmapmalloc.so.1
vi opening /lib/libcurses.so.1
vi opening /lib/libc.so.1
vi opening /lib/libgen.so.1
vi opening /usr/share/lib/terminfo//x/xterm
vi opening //.exrc
vi opening /var/tmp/ExTcaqBz
vi opening /var/tmp/ExUcaqBz
vi opening /etc/system
^C
Displaying Path Names When open System Calls Fail

The following example shows how to know when an open(2) system call fails and how to display the pertinent information to determine the problem:

# cat -n failedopen.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  syscall::open*:entry
     4  /pid == $1/
     5  {
     6          self->path = copyinstr(arg0);
     7          self->entry = 1;
     8  }
     9
    10  syscall::open*:return
    11  /self->entry && arg0 == -1/
    12  {
    13          printf("open for '%s' failed, errno=%d", self->path, errno);
    14          ustack();
    15          self->entry = 0;
    16  }
# failedopen.d 13026
open for '/usr/openwin/lib/X11/XtErrorDB' failed, errno=2
              febbcf78
              febb05a0
              fec97b38
              fec97a78
              fedbbffc
              fedbbeac
              fedbbe40
              fedc0220
              fedc037c
              fed8fb6c
              fed8f2f8
              fed8f290
              cf3f8
              3f648
              d1c98
              5c658
^C
Displaying a Symbolic Stack Trace

The failedopen.d D script was run on the dtmail graphical user interface (GUI) utility as it was started up over a telnet session. The dtrace(1M) command could not determine the symbols at the place the functions were called. This may be due to the application exiting before the dtrace(1M) consumer has a chance to read its symbol table. You can use the mdb(1) debugger to display the PC locations symbolically:

# mdb /usr/dt/bin/dtmail
> _start:b
> :r
mdb: stop at _start
mdb: target stopped at:
_start:         clr     %fp
> !ps
   PID TTY      TIME CMD
 13025 pts/1    0:00 mdb
 13027 pts/1    0:00 sh
 12571 pts/1    0:00 sh
 13026 pts/1    0:00 dtmail
 13028 pts/1    0:00 ps
 12577 pts/1    0:06 bash
> :c
libSDtMail: Error: Xt Error: Can't open display: 129.150.33.103:0.0
mdb: target has terminated
> 5c658/i
_start+0x108:
_start+0x108:   call    +0x75618        <main>
> d1c98/i
main+0x28:
main+0x28:      jmpl    %i1, %o7
> 3f648/i
__0fHRoamAppKinitializePiPPc+0x310:
__0fHRoamAppKinitializePiPPc+0x310:     call    +0x8fd24 <__0fLApplicationKinitializePiPPc>
> cf3f8/i
__0fLApplicationKinitializePiPPc+0x8c:
__0fLApplicationKinitializePiPPc+0x8c:  call    +0x52718 <PLT:XtAppInitialize>
> fed8f290/i
libXt.so.4`XtAppInitialize+0x54:
libXt.so.4`XtAppInitialize+0x54:call    +0x56800        <PLT:XtOpenApplication>
> fed8f2f8/i
libXt.so.4`XtOpenApplication+0x48:      call    +0x56774        <PLT:_XtAppInit>
> fed8fb6c/i
libXt.so.4`_XtAppInit+0x138:    call    +0x553cc        <PLT:XtErrorMsg>
> febbcf78/i
libc.so.1`__open+4:     ta      8
> febbcf78:b
> :c
mdb: stop at libc.so.1`__open+4
mdb: target stopped at:
libc.so.1`__open+4:     ta      8
> $c
libc.so.1`__open+4(ff2893ec, 2000, 1b6, 38e70, ff3b3508, febe2264)
libc.so.1`open+0x64(ff2893ec, 2000, 1b6, ff3ea0f8, ff3ec46c, 0)
libnsl.so.1`__nsl_fopen+0x8c(ff2893ec, ff2893fc, ff24fbb4, ff3ea0f8, ff3ec46c, ff2893fc)
libnsl.so.1`getnetlist+0x20(0, 69bcc, ff292690, 0, 0, ff290f30)
libnsl.so.1`setnetconfig+0x38(0, ff294a58, ff292690, 0, 763a8, febea4c0)
libnsl.so.1`__rpc_getconfip+0xd8(ff296ea8, 0, 0, 0, 4144c, 0)
libnsl.so.1`getipnodebyname+0x1c(ffbfef50, 1a, 3, ffbfef3c, 1010101, 57f74)
libsocket.so.1`get_addr+0x158(0, fe8920a0, ffbff0c0, 17700000, 0, 0)
libsocket.so.1`_getaddrinfo+0x710(fe8920a0, 1770, ffbff168, 15950c, 0, 2)
libX11.so.4`_X11TransSocketINETConnect+0x178(1594d0, fe8920a0, ffbff188, ffbff32c, fed13100, 0)
libX11.so.4`_X11TransConnect+0x58(1594d0, ffbff3e8, 7ffffc00, fe892090, fed13104, fe982078)
libX11.so.4`_X11TransConnectDisplay+0x6e0(e, 1594d0, 1, ffbff3e8, 0, 0)
libX11.so.4`XOpenDisplay+0xe8(0, fed20bc4, 158f88, ffbffdec, 9ebc4, 0)
libXt.so.4`XtOpenDisplay+0xe4(158190, 0, ffbffdcc, fe982010, 0, 0)
libXt.so.4`_XtAppInit+0xfc(ffbff71c, fe982010, 0, 0, ffbffdcc, ffbff778)
libXt.so.4`XtOpenApplication+0x48(12bc78, fe982010, 0, 0, ffbffdcc, ffbffdec)
libXt.so.4`XtAppInitialize+0x54(1346f4, fede7638, fede4000, 120008, 54da8, 14e634)
__0fLApplicationKinitializePiPPc+0x8c(12bc68, ffbffdcc, ffbffdec, 0, 1346f4, 134400)
__0fHRoamAppKinitializePiPPc+0x310(12bc68, ffbffdcc, ffbffdec, 14d400, 0, 136000)
main+0x28(136000, 3f338, 12bc68, 12d0cc, 0, 136000)
_start+0x108(0, 0, 0, 0, 0, 0)
>

A breakpoint was set on the C library open function and the dtmail utility was continued in the debugger to hit the breakpoint. The $c mdb command was used to display the stack trace symbolically after the breakpoint hit.
Examining Another Failed open Example

The next example shows the failedopen2.d D script run on the cat(1) command while it opens a non-existent file. This script assumes that dtrace(1M) will start the command.

# cat -n failedopen2.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  syscall::open*:entry
     4  /pid == $target/
     5  {
     6          self->path = copyinstr(arg0);
     7          self->entry = 1;
     8  }
     9
    10  syscall::open*:return
    11  /self->entry && arg0 == -1/
    12  {
    13          printf("open for '%s' failed, errno=%d", self->path, errno);
    14          ustack();
    15          self->entry = 0;
    16  }
# dtrace -s failedopen2.d -c "cat /nothing"
dtrace: script 'failedopen2.d' matched 4 probes
cat: cannot open /nothing
dtrace: pid 1612 exited with status 2
CPU     ID                    FUNCTION:NAME
  0    397                    open64:return open for '/nothing' failed, errno=2
              libc.so.1`__open64+0x4
              libc.so.1`_endopen+0x88
              libc.so.1`fopen64+0x1c
              cat`main+0x318
              cat`_start+0x108

Accessing Structure Members in the sigaction(2) System Call

The sigaction(2) system call passes the kernel an address of a sigaction structure. In order to access its members you must first copy in the structure using the copyin() DTrace function. The following script shows how to do this. It uses a clause-local variable to point to the copied-in sigaction structure.

# cat -n sigaction.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  syscall::sigaction:entry
     4  {
     5          this->sa_struct = (struct sigaction *)copyin(arg1, sizeof(struct sigaction));
     6          printf("%s called sigaction on signal %d with flags: %x\n",
     7              execname, arg0, this->sa_struct->sa_flags);
     8  }
# ./sigaction.d
...
tcsh called sigaction on signal 2 with flags: 0
tcsh called sigaction on signal 15 with flags: 12
tcsh called sigaction on signal 15 with flags: 0
tcsh called sigaction on signal 3 with flags: 12
...
vi called sigaction on signal 8 with flags: 2
vi called sigaction on signal 10 with flags: 2
vi called sigaction on signal 11 with flags: 2
vi called sigaction on signal 13 with flags: 2
vi called sigaction on signal 24 with flags: 16
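For reference, the application side of this call can be sketched in C as follows. The program fills in a struct sigaction, including the sa_flags member that sigaction.d prints, and passes its address as the second argument, which is what the script sees as arg1. The function names on_signal and install_and_read_flags are hypothetical, and the exact flag bits reported back by the system can include platform-specific additions.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Minimal handler; it does nothing with the signal. */
static void on_signal(int sig)
{
    (void)sig;
}

/*
 * Install on_signal for sig with the given sa_flags, then ask the system
 * for the current disposition and return its sa_flags (or -1 on error).
 */
int install_and_read_flags(int sig, int flags)
{
    struct sigaction sa, current;

    memset(&sa, 0, sizeof (sa));
    sa.sa_handler = on_signal;
    sa.sa_flags = flags;            /* the member printed by sigaction.d */
    sigemptyset(&sa.sa_mask);

    /* The address of sa is the pointer argument copied in by the D script. */
    if (sigaction(sig, &sa, NULL) == -1)
        return -1;
    if (sigaction(sig, NULL, &current) == -1)
        return -1;
    return current.sa_flags;
}
```

Calling install_and_read_flags(SIGUSR1, SA_RESTART) returns a flags value with the SA_RESTART bit set; tracing such a program with sigaction.d would show the same bits in hexadecimal.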

Module 4
Finding System Problems With DTrace

Objectives
- Use DTrace to access kernel variables
- Use DTrace to obtain information about read calls
- Use DTrace to perform anonymous tracing
- Use DTrace to perform speculative tracing
- Explain the privileges necessary to run DTrace operations
Relevance
Discussion: The following questions are relevant to understanding how to use DTrace for finding system problems:
- Would the ability to access any kernel variable when a probe fires be beneficial?
- Would it be useful to know who is issuing which type of read calls?
- Would it be advantageous to trace device driver code during system boot?
- Would it be beneficial to give regular user accounts access to the DTrace facility that is limited to user-owned processes?
Accessing Kernel Variables

The DTrace instrumentation executes inside the Solaris Operating System (Solaris OS) kernel. This means that, in addition to accessing DTrace variables and probe arguments such as pid and arg1, you can also access kernel data structures, symbols, and types. These capabilities allow advanced DTrace users, experienced system administrators, support service personnel, and driver developers to examine low-level behavior of the operating system kernel and the device drivers.
Using the D Language to Access Kernel Symbols

The D language uses the backquote character (`) as a scoping operator for accessing symbols that are defined in the operating system and not in your D programs. For example, the Solaris kernel contains a C language declaration of a system tunable named kmem_flags for enabling memory allocator debugging features. This tunable is declared in C in the kernel source code as follows:

    int kmem_flags;

To display the value of this variable, you can write the D statement:

    printf("%x\n", `kmem_flags);
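As a minimal sketch of the statement above in a complete program, the following script prints the tunable once and exits. The BEGIN clause and the label text in the format string are illustrative choices, not part of the original example.

```d
#!/usr/sbin/dtrace -qs

/* Print the kernel's kmem_flags tunable once, then exit. */
BEGIN
{
	/* The backquote scopes kmem_flags to the kernel, not to this D program. */
	printf("kmem_flags = %x\n", `kmem_flags);
	exit(0);
}
```

Because the variable is read inside the kernel when the probe fires, no declaration of kmem_flags is needed in the script.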
Examining Naming Conflicts

DTrace associates each kernel symbol with the type used for it in the operating system C code, providing source-based access to the native operating system data structures. Because kernel symbol names are kept in a separate namespace from D variables and function identifiers, naming conflicts are not an issue. When you prefix a variable with a backquote, the D compiler searches the known kernel symbols in order, using the list of loaded modules to find a matching variable definition. Because the Solaris OS kernel supports dynamically loaded modules with separate symbol namespaces, the same variable name or function name can be used more than once in the kernel. You resolve this conflict by preceding the variable or function name with the kernel module name and the backquote character as a separator. For example, you refer to the _init(9E) function in the sd module as follows:

    sd`_init

You can apply any of the D operators to external kernel variables, except those that modify values. When you launch DTrace, the D compiler loads the set of variable names corresponding to active kernel modules, so declarations of these variables are not required.
Monitoring Kernel Variables

The following D script displays, every five seconds, the value of three global kernel variables:
- The nproc variable: Holds the current number of Solaris OS processes
- The nthread variable: Holds the current number of Solaris OS threads
- The freemem variable: Holds the current amount of system free memory not owned by the memory allocator

You must precede each reference to these kernel variables with a backquote character (`), as shown in the following example. The script converts freemem, which is a count of pages, to megabytes on the assumption of an 8-Kbyte page size.

    # cat monitor.d
    #!/usr/sbin/dtrace -qs

    BEGIN
    {
        printf("%-14s %-10s %10s\n", "Processes",
            "Threads", "Free Memory");
    }

    tick-5sec
    {
        printf("%-14d %-10d %9dmb\n", `nproc,
            `nthread, (`freemem*8)/1024);
    }
    # ./monitor.d
    Processes      Threads    Free Memory
    41             232              322mb
    42             232              306mb
    41             232              322mb
    53             242              320mb
    47             249              251mb
    41             232              252mb
    41             232              252mb
    41             232              232mb
    41             232              111mb
    47             235              110mb
    47             241              110mb
Accessing Kernel Data Structures

When a probe fires, DTrace sets many useful built-in variables. Three of these variables and their associated data structures are:
- The curpsinfo variable: Points to a process information structure
- The curlwpsinfo variable: Points to a lightweight process (LWP) information structure
- The curcpu variable: Points to a central processing unit (CPU) information structure

The first two structures are part of the proc(4) interface and are used by commands like ps(1) and prstat(1M). These variables provide access to kernel state information at the time any probe fires. The following examples define the data structures.
The psinfo Data Structure

The following shows the psinfo data structure:
    typedef struct psinfo {
        int pr_nlwp;              /* number of active lwps in the process */
        pid_t pr_pid;             /* unique process id */
        pid_t pr_ppid;            /* process id of parent */
        pid_t pr_pgid;            /* pid of process group leader */
        pid_t pr_sid;             /* session id */
        uid_t pr_uid;             /* real user id */
        uid_t pr_euid;            /* effective user id */
        gid_t pr_gid;             /* real group id */
        gid_t pr_egid;            /* effective group id */
        uintptr_t pr_addr;        /* address of process */
        dev_t pr_ttydev;          /* controlling tty device (or PRNODEV) */
        timestruc_t pr_start;     /* process start time, from the epoch */
        char pr_fname[PRFNSZ];    /* name of execed file */
        char pr_psargs[PRARGSZ];  /* initial characters of arg list */
        int pr_argc;              /* initial argument count */
        uintptr_t pr_argv;        /* address of initial argument vector */
        uintptr_t pr_envp;        /* address of initial environment vector */
        char pr_dmodel;           /* data model of the process */
        taskid_t pr_taskid;       /* task id */
        projid_t pr_projid;       /* project id */
        poolid_t pr_poolid;       /* pool id */
        zoneid_t pr_zoneid;       /* zone id */
    } psinfo_t;

The lwpsinfo Data Structure

The following shows the lwpsinfo data structure:
    typedef struct lwpsinfo {
        int pr_flag;               /* lwp flags */
        id_t pr_lwpid;             /* lwp id */
        uintptr_t pr_addr;         /* internal address of lwp */
        uintptr_t pr_wchan;        /* wait addr for sleeping lwp */
        char pr_stype;             /* synchronization event type */
        char pr_state;             /* numeric lwp state */
        char pr_sname;             /* printable character for pr_state */
        char pr_nice;              /* nice for cpu usage */
        short pr_syscall;          /* system call number (if in syscall) */
        int pr_pri;                /* priority, high value is high priority */
        char pr_clname[PRCLSZ];    /* scheduling class name */
        processorid_t pr_onpro;    /* processor which last ran this lwp */
        processorid_t pr_bindpro;  /* processor to which lwp is bound */
        psetid_t pr_bindpset;      /* processor set to which lwp is bound */
    } lwpsinfo_t;

The cpuinfo Data Structure

The following shows the cpuinfo data structure:
    typedef struct cpuinfo {
        processorid_t cpu_id;       /* CPU identifier */
        psetid_t cpu_pset;          /* processor set identifier */
        chipid_t cpu_chip;          /* chip identifier */
        lgrp_id_t cpu_lgrp;         /* locality group identifier */
        processor_info_t cpu_info;  /* CPU information */
    } cpuinfo_t;
The curthread Variable

Another built-in D variable that is set when a probe fires is the curthread variable, which you used in the ancestry.d D script in Module 2. The curthread variable points to the kthread_t kernel structure of the currently running thread. Using the curthread pointer to access information in the kthread_t structure (or most other kernel data structures) provides a less stable interface than using the lwpsinfo_t and psinfo_t structures. The reason is that the psinfo_t and lwpsinfo_t structures are abstractions of process and thread information as advertised by the proc(4) interface. In contrast, curthread reaches into the actual kernel implementation of this information, which may change between releases. For more details on the stability of DTrace interfaces, see the Solaris Dynamic Tracing Guide, part number 817-6223-10. The dtrace(1M) command has a -v option that reports the stability of a D program.
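The difference between the two routes can be sketched as follows. The script prints the same process ID twice: once through the psinfo_t abstraction and once by walking kernel structures from curthread. The member path t_procp->p_pidp->pid_id is an assumption about the Solaris 10 kernel implementation and is exactly the kind of detail that can change; the choice of the exec-success probe is only illustrative.

```d
#!/usr/sbin/dtrace -qs

proc:::exec-success
{
	/* Stable route: psinfo_t members advertised by proc(4). */
	printf("%-16s pid=%d ppid=%d", execname,
	    curpsinfo->pr_pid, curpsinfo->pr_ppid);

	/*
	 * Less stable route: kernel implementation structures.
	 * The member names below are assumed from the Solaris 10 kernel.
	 */
	printf(" (via curthread: pid=%d)\n",
	    curthread->t_procp->p_pidp->pid_id);
}
```

Running the script with dtrace -v -s should report weaker stability attributes than a version that uses only curpsinfo.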
Example D Script Using Data Structures

The following D script uses the psinfo_t and lwpsinfo_t structures to display thread and process information for any thread that calls a specific kernel function:

    # cat ps.d
    #!/usr/sbin/dtrace -qs

    BEGIN
    {
        printf("TID\tPID\tPPID\tUID\tPRI\tCOMMAND\n");
    }

    fbt::$1:entry
    /pid != $pid && pid != 0/
    {
        ++nlines;
        printf("%d\t%d\t%d\t%d\t%d\t%s\n", curlwpsinfo->pr_lwpid,
            curpsinfo->pr_pid, curpsinfo->pr_ppid, curpsinfo->pr_uid,
            curlwpsinfo->pr_pri, curpsinfo->pr_psargs);
    }

    fbt::$1:entry
    /nlines > 20/
    {
        printf("TID\tPID\tPPID\tUID\tPRI\tCOMMAND\n");
        nlines = 0;
    }

    # ./ps.d bdev_strategy
    TID     PID     PPID    UID     PRI     COMMAND
    1       4640    4639    0       55      find / -type f
    1       4640    4639    0       55      find / -type f
    1       4698    4641    0       51      file /var/sadm/pkg/SUNWfontconfig-root/save/pspool/SUNWfontconfig-root/install/
    1       4640    4639    0       55      find / -type f
    1       4698    4641    0       51      file /var/sadm/pkg/SUNWfontconfig-root/save/pspool/SUNWfontconfig-root/install/
    ^C
    # ps.d nanosleep
    TID     PID     PPID    UID     PRI     COMMAND
    11      279     1       0       59      /usr/sbin/nscd
    12      279     1       0       59      /usr/sbin/nscd
    21      279     1       0       59      /usr/sbin/nscd
    18      279     1       0       59      /usr/sbin/nscd
    17      279     1       0       59      /usr/sbin/nscd
    16      279     1       0       59      /usr/sbin/nscd
    13      279     1       0       59      /usr/sbin/nscd
    1       2120    2119    0       59      sleep 5
    12      279     1       0       59      /usr/sbin/nscd
    11      279     1       0       59      /usr/sbin/nscd
    13      279     1       0       59      /usr/sbin/nscd
    14      279     1       0       59      /usr/sbin/nscd
    15      279     1       0       59      /usr/sbin/nscd
    16      279     1       0       59      /usr/sbin/nscd
    17      279     1       0       59      /usr/sbin/nscd
    18      279     1       0       59      /usr/sbin/nscd
    21      279     1       0       59      /usr/sbin/nscd
    18      279     1       0       59      /usr/sbin/nscd
    TID     PID     PPID    UID     PRI     COMMAND
    17      279     1       0       59      /usr/sbin/nscd
    16      279     1       0       59      /usr/sbin/nscd
    ^C
The sched Provider

The sched DTrace provider enables probes related to thread scheduling. For example, the on-cpu probe fires when a CPU begins to execute a thread, and the off-cpu probe fires when a thread is about to be taken off a CPU.

Note: Refer to the Solaris Dynamic Tracing Guide for details on the probes provided by the sched provider.

To list the sched probes, use the following command:

    # dtrace -l -P sched | awk '{print $NF}' | sort -u
    NAME
    change-pri
    dequeue
    enqueue
    off-cpu
    on-cpu
    preempt
    remain-cpu
    schedctl-nopreempt
    schedctl-preempt
    schedctl-yield
    sleep
    surrender
    tick
    wakeup

The following D script uses the on-cpu sched probe to display the name of the executable process starting to run on a CPU and the priority of its thread:

    # cat start2run.d
    #!/usr/sbin/dtrace -qs

    sched:::on-cpu
    /pid != $pid && pid != 0/
    {
        printf("Thread %d from: %s starting on CPU %d at priority %d\n",
            curlwpsinfo->pr_lwpid, curpsinfo->pr_psargs, curcpu->cpu_id,
            curlwpsinfo->pr_pri);
    }

    # ./start2run.d
    Thread 1 from: fsflush starting on CPU 0 at priority 60
    Thread 1 from: bash starting on CPU 0 at priority 59
    Thread 1 from: bash starting on CPU 2 at priority 49
    Thread 1 from: pgm starting on CPU 1 at priority 49
    Thread 1 from: pgm starting on CPU 1 at priority 29
    Thread 1 from: pgm starting on CPU 1 at priority 29
    Thread 1 from: pgm starting on CPU 1 at priority 19
    Thread 1 from: pgm starting on CPU 1 at priority 9
    Thread 1 from: pgm starting on CPU 1 at priority 9
    Thread 1 from: pgm starting on CPU 1 at priority 0
    Thread 6 from: /lib/svc/bin/svc.startd starting on CPU 0 at priority 59
    Thread 1 from: fsflush starting on CPU 0 at priority 60
    Thread 1 from: /usr/sfw/sbin/snmpd starting on CPU 1 at priority 59
    Thread 1 from: /usr/sfw/sbin/snmpd starting on CPU 1 at priority 59
    Thread 4 from: /usr/lib/picl/picld starting on CPU 2 at priority 59
    Thread 1 from: fsflush starting on CPU 0 at priority 60
    Thread 18 from: /usr/sbin/nscd starting on CPU 0 at priority 59
    Thread 1 from: /usr/sfw/sbin/snmpd starting on CPU 1 at priority 59
    Thread 4 from: /usr/lib/picl/picld starting on CPU 2 at priority 59
    Thread 1 from: /usr/sfw/sbin/snmpd starting on CPU 2 at priority 59
    Thread 2 from: /usr/lib/autofs/automountd starting on CPU 2 at priority 59
    Thread 1 from: fsflush starting on CPU 0 at priority 60
    Thread 18 from: /usr/sbin/nscd starting on CPU 0 at priority 59
    Thread 1 from: /usr/lib/sendmail -bd -q15m starting on CPU 0 at priority 59
    Thread 1 from: bash starting on CPU 0 at priority 59
    Thread 1 from: /usr/sfw/sbin/snmpd starting on CPU 0 at priority 59
    Thread 1 from: fsflush starting on CPU 0 at priority 60
The following D script uses the on-cpu sched probe with an aggregation to display a summary of who has recently been running on what CPU:

    # cat whorun.d
    #!/usr/sbin/dtrace -qs

    sched:::on-cpu
    /pid != $pid && pid != 0/
    {
        @[curpsinfo->pr_psargs, curcpu->cpu_id] = count();
    }

    END
    {
        printf("%-30s %4s %6s\n", "Command", "CPU", "Count");
        printa("%-30s %4d %@6d\n", @);
    }
    # ./whorun.d
    ^C
    Command                         CPU  Count
    /usr/lib/fm/fmd/fmd               1      1
    uptime                            2      1
    find / -name fubar                3      2
    /usr/lib/autofs/automountd        2      3
    -sh                               2      3
    -sh                               1      3
    /usr/lib/picl/picld               1      4
    /usr/lib/fm/fmd/fmd               3      6
    /usr/sbin/nscd                    3      8
    /usr/lib/fm/fmd/fmd               2     11
    /usr/sbin/nscd                    2     14
    /usr/sbin/nscd                    0     15
    /usr/lib/sendmail -bd -q15m       0     16
    ls -lR /                          1     18
    /usr/sfw/sbin/snmpd               0     18
    -sh                               3     20
    /usr/lib/utmpd                    0     20
    /usr/lib/sendmail -bd -q15m       2     20
    /usr/lib/picl/picld               0     32
    /usr/sfw/sbin/snmpd               3     44
    /usr/sfw/sbin/snmpd               2     55
    fsflush                           0     72
    /usr/sbin/nscd                    1     77
    find / -name fubar                1    152
    /usr/sbin/vold                    2    152
    /usr/sfw/sbin/snmpd               1    237
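Counting on-cpu events shows how often a command was scheduled, not how long it ran. As a hedged sketch that is not part of the course scripts, pairing the on-cpu and off-cpu probes with the timestamp built-in variable gives the total time each command spent on a CPU. The aggregation name @oncpu and the column headings are illustrative.

```d
#!/usr/sbin/dtrace -qs

sched:::on-cpu
/pid != $pid && pid != 0/
{
	/* Remember when this thread started running. */
	self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
	/* Add the nanoseconds this thread just spent on the CPU. */
	@oncpu[curpsinfo->pr_psargs] = sum(timestamp - self->ts);
	self->ts = 0;
}

END
{
	printf("%-40s %s\n", "Command", "On-CPU ns");
	printa("%-40s %@d\n", @oncpu);
}
```

The thread-local variable self->ts ties each off-cpu event to the matching on-cpu event for the same thread.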
Accessing Lock Contention Information

The lockstat provider makes available probes that give information about locking behavior on the system. For example, when the adaptive-block probe fires, you know that a kernel thread had to wait for an adaptive mutex, and the arg1 argument tells you how long the thread slept waiting for the lock's release. This gives you a sense of how much contention there is for the data (or code) that the mutex is protecting.

Note: See the Solaris Dynamic Tracing Guide for details on other lockstat provider probes.

The lockstat Provider Probes

To list the lockstat provider probes, use the following command:
    # dtrace -l -P lockstat
       ID   PROVIDER   MODULE    FUNCTION           NAME
      467   lockstat   genunix   mutex_enter        adaptive-acquire
      468   lockstat   genunix   mutex_enter        adaptive-block
      469   lockstat   genunix   mutex_enter        adaptive-spin
      470   lockstat   genunix   mutex_exit         adaptive-release
      471   lockstat   genunix   mutex_destroy      adaptive-release
      472   lockstat   genunix   mutex_tryenter     adaptive-acquire
      473   lockstat   genunix   lock_set           spin-acquire
      474   lockstat   genunix   lock_set           spin-spin
      475   lockstat   genunix   lock_set_spl       spin-acquire
      476   lockstat   genunix   lock_set_spl       spin-spin
      477   lockstat   genunix   lock_try           spin-acquire
      478   lockstat   genunix   lock_clear         spin-release
      479   lockstat   genunix   lock_clear_splx    spin-release
      480   lockstat   genunix   CLOCK_UNLOCK       spin-release
      481   lockstat   genunix   rw_enter           rw-acquire
      482   lockstat   genunix   rw_enter           rw-block
      483   lockstat   genunix   rw_exit            rw-release
      484   lockstat   genunix   rw_tryenter        rw-acquire
      485   lockstat   genunix   rw_tryupgrade      rw-upgrade
      486   lockstat   genunix   rw_downgrade       rw-downgrade
      487   lockstat   genunix   thread_lock        thread-spin
      488   lockstat   genunix   thread_lock_high   thread-spin
The following D script displays CPU, thread, process, wait time, and stack trace information related to a thread blocking on an adaptive mutex:
    # cat mutex.d
    #!/usr/sbin/dtrace -qs

    lockstat:::adaptive-block
    {
        printf("\nCPU\tTID\tPID\tUID\tWAIT TIME\tCOMMAND\n");
        printf("%d\t%d\t%d\t%d\t%d\t\t%s\n", curcpu->cpu_id,
            curlwpsinfo->pr_lwpid, curpsinfo->pr_pid,
            curpsinfo->pr_uid, arg1, curpsinfo->pr_psargs);
        stack();
    }
Test the mutex.d D script by starting four instances of the readchar user application, which reads every file in the current directory one byte at a time using the read(2) system call:

    # (cd /usr/lib; /var/dtrace/readchar)& (cd /usr/lib; /var/dtrace/readchar)&
    [1] 2323
    [2] 2325
    # (cd /usr/lib; /var/dtrace/readchar)& (cd /usr/lib; /var/dtrace/readchar)&
    [3] 2327
    [4] 2329
    # ./mutex.d
    ^C
    # mpstat 2
    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw  syscl  usr sys  wt idl
      0    2   0    0   409  307   45    8    0    0    0  65534   13  14   0  73
      0    3   0    0   401  301   54   31    0    0    0 103605   21  79   0   0
      0    0   0    0   406  305   50   30    0    0    0 100905   20  80   0   0
      0    1   0    0   402  302   55   32    0    0    0 104497   21  79   0   0
    ^C
Lock Contention on a Single Processor System

The mpstat(1M) command output indicates that you are on a single processor system that is CPU-bound, primarily in system mode. The system call counts are high, which correlates with the high percentage of system time. You expect such numbers from running four instances of the readchar process. There is no mutex contention on a single processor system until you add more file system-intensive commands, as shown in the following example:
    # find / -name fubar & ls -lR / >/ll&
    [1] 2357
    [2] 2358
    # find / -name fubar & ls -lR / >/ll&
    [3] 2359
    [4] 2360
    # ./mutex.d

    CPU     TID     PID     UID     WAIT TIME       COMMAND
    0       0       0       0       56917           sched

                  genunix`clock+0x3f0
                  genunix`cyclic_softint+0xa4
                  unix`cbe_level10+0x8
                  unix`intr_thread+0x144

    CPU     TID     PID     UID     WAIT TIME       COMMAND
    0       0       0       0       41076           sched

                  sd`sdintr+0x14
                  glm`glm_doneq_empty+0x144
                  glm`glm_intr+0xf4
                  pcipsy`pci_intr_wrapper+0x9c
                  unix`intr_thread+0x144

    CPU     TID     PID     UID     WAIT TIME       COMMAND
    0       0       0       0       45321           sched

                  genunix`clock+0x3f0
                  genunix`cyclic_softint+0xa4
                  unix`cbe_level10+0x8
                  unix`intr_thread+0x144
    ^C

    CPU     TID     PID     UID     WAIT TIME       COMMAND
    0       0       0       0       43214           sched

                  genunix`kmem_cache_free+0x4c
                  uata`atapi_tran_destroy_pkt+0x58
                  scsi`scsi_destroy_pkt+0x14
                  sd`sd_return_command+0x16c
                  sd`sdintr+0x224
                  uata`ghd_doneq_process+0x64
                  unix`intr_thread+0x144
Lock Contention on a Multiprocessor Server

The following output results from running four instances of the readchar process on a four-processor server. In this case you do not run the extra find and ls -lR commands, as you did on the uniprocessor system. There is significantly more mutex contention, as indicated by the smtx column. (Always ignore the first set of numbers output by the mpstat(1M) command, because it reports averages since boot.) There is also significantly more frequent output from the mutex.d D script:
    # mpstat 2
    CPU minf mjf xcal  intr ithr  csw icsw migr   smtx  srw  syscl  usr sys  wt idl
      0    1   0    3     4    1   65    0    1      8    0     27    0   1   0  99
      1    1   0    3     7    4   30    0    1      8    0     29    0   0   0 100
      2    1   0    3     4    1   28    0    1      8    0     28    0   0   0 100
      3    1   0    3   214  111   15    0    0      9    0     28    0   0   0 100
    CPU minf mjf xcal  intr ithr  csw icsw migr   smtx  srw  syscl  usr sys  wt idl
      0    2   0    5    21    1   56   17    8  74478    0 225870   14  81   0   4
      1    0   0    5    29    4   67   22    8  76857    0 228291   11  83   0   6
      2    0   0    1    53    1  150   49    5  83973    0 224372   16  74   0   9
      3    3   0   90   216  113   12   12    1  86392    0 227446   13  87   0   0
    CPU minf mjf xcal  intr ithr  csw icsw migr   smtx  srw  syscl  usr sys  wt idl
      0    0   0    4    24    1   64   19    8 108269    0 189929   12  86   0   2
      1    0   0    2    39    2   99   34    9 107282    0 189818   13  82   0   4
      2    0   0    1    43    1  104   39    5 120189    0 173753   11  79   0  10
      3    0   0   95   216  112    7   10    0  96010    0 229465   17  83   0   0
    ^C
    # ./mutex.d

    CPU     TID     PID     UID     WAIT TIME       COMMAND
    0       1       12523   0       23500           /var/dtrace/readchar

                  ufs`rdip+0x150
                  ufs`ufs_read+0x208
                  genunix`read+0x274
                  genunix`read32+0x1c
                  unix`syscall_trap32+0xa8

    CPU     TID     PID     UID     WAIT TIME       COMMAND
    1       1       12527   0       22200           /var/dtrace/readchar

                  ufs`ufs_lockfs_end+0x70
                  ufs`ufs_read+0x25c
                  genunix`read+0x274
                  genunix`read32+0x1c
                  unix`syscall_trap32+0xa8

    CPU     TID     PID     UID     WAIT TIME       COMMAND
    3       1       12556   0       24800           /var/dtrace/readchar

                  ufs`rdip+0x488
                  ufs`ufs_read+0x208
                  genunix`read+0x274
                  genunix`read32+0x1c
                  unix`syscall_trap32+0xa8
The previous output shows that the mutex contention is in the UNIX File System (UFS) code. The sleep times are short, only between 21 and 29 microseconds.
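When the adaptive-block probe fires frequently, a line of output per event becomes hard to read. A minimal sketch, assuming you only need a summary, aggregates the events instead: it counts the blocks per command and builds a power-of-two distribution of the wait times that arg1 reports in nanoseconds. The aggregation names @blocks and @wait are illustrative.

```d
#!/usr/sbin/dtrace -qs

lockstat:::adaptive-block
{
	/* Number of times each command blocked on an adaptive mutex. */
	@blocks[execname] = count();

	/* Distribution of sleep times (arg1, in nanoseconds). */
	@wait[execname] = quantize(arg1);
}
```

Both aggregations are printed automatically when you end the script with Control-C.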

The proc Provider and the system() Function

The proc provider makes available probes related to process creation and termination as well as signal delivery. The signal-send probe fires when a signal is being sent to a process or thread. The args[2] argument is set to the signal number, which can be compared with symbolic names such as SIGINT that are listed in the signal(3head) manual page. The args[1] argument points to the psinfo_t structure of the receiving process.

The system() built-in function allows you to run shell commands any time a probe fires. This general capability provides great power in that any probe event can trigger the execution of any command. You can use format specifications similar to those of the printf() built-in function to parameterize the shell command you want to invoke. The system() function requires destructive actions to be enabled, either with the -w option to the dtrace(1M) command or with the #pragma D option destructive statement inside the script.

The following script uses the signal-send probe and the built-in system() function to display which user account is sending the SIGKILL signal and to which process:

    # cat whosend.d
    #!/usr/sbin/dtrace -s

    #pragma D option destructive
    #pragma D option quiet

    proc:::signal-send
    /args[2] == SIGKILL/
    {
        printf("SIGKILL was sent to %s by ", args[1]->pr_fname);
        system("getent passwd %d | cut -d: -f5", uid);
    }
    # ./whosend.d
    SIGKILL was sent to vi by Super-User
    SIGKILL was sent to bash by Mary Smith
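The same probe can also be used without destructive actions. The following sketch, which is not one of the course scripts, summarizes all signals by sender, receiver, and signal number. The column headings are illustrative.

```d
#!/usr/sbin/dtrace -qs

proc:::signal-send
{
	/* Key: sending command, receiving command, signal number. */
	@[execname, args[1]->pr_fname, args[2]] = count();
}

END
{
	printf("%-16s %-16s %6s %6s\n", "Sender", "Receiver", "Signal", "Count");
	printa("%-16s %-16s %6d %@6d\n", @);
}
```

Because the script only records data and never calls system(), it needs neither the -w option nor the destructive pragma.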

Displaying Read Call Information

DTrace provides several ways to display read information:
- You can trace system-wide activity or application-specific activity.
- You can show information about each individual read call or summarize the data with an aggregation function.
- You can monitor read activity at the driver level with the io provider or at the application level with the pid provider, the syscall provider, or the sysinfo provider.

This section demonstrates some of these methods.
Tracing Read Calls System-Wide

The first example traces, system-wide, each individual read(2) and pread(2) system call. The read size requested in a read(2) or pread(2) system call can differ from the number of bytes actually read, which is given in the return value from these system calls. A return value of 0 indicates an end-of-file condition; a return value of -1 indicates that the system call failed.

    # cat reads.d
    #!/usr/sbin/dtrace -qs

    BEGIN
    {
        printf("FD\tREQUEST\tACTUAL\tCOMMAND\n");
    }

    syscall::read:entry, syscall::pread*:entry
    /execname != "dtrace"/
    {
        self->started = 1;
        self->arg0 = arg0;
        self->arg2 = arg2;
    }

    syscall::read:return, syscall::pread*:return
    /self->started/
    {
        printf("%d\t%d\t%d\t%s\n", self->arg0, self->arg2, arg0, execname);
        self->started = 0;
        ++nlines;
    }

    syscall::read:return, syscall::pread*:return
    /nlines > 20/
    {
        printf("FD\tREQUEST\tACTUAL\tCOMMAND\n");
        nlines = 0;
    }
    # ./reads.d
    FD      REQUEST ACTUAL  COMMAND
    0       1       1       bash
    0       1       1       bash
    0       1       1       bash
    0       1       1       bash
    3       877     877     date
    0       1       1       bash
    0       1       1       bash
    ...
    0       1       1       bash
    3       152     152     uptime
    4       8192    4092    uptime
    4       8192    0       uptime
    3       877     877     uptime
    0       1       1       bash
    0       1       1       bash
    ...
    0       1       1       bash
    0       1       1       bash
    0       1       1       bash
    3       8192    8192    grep
    3       8192    200     grep
    3       8192    0       grep
    1       8192    1006    init
    1       8192    0       init
    1       8192    1006    init
    1       8192    0       init
    ...
    FD      REQUEST ACTUAL  COMMAND
    5       1024    61      nscd
    5       8192    4464    utmpd
    6       336     336     utmpd
    6       336     336     utmpd
    6       336     336     utmpd
    5       8192    0       utmpd
    1       24      -1      sac
    2       8       8       ttymon
    2       8       -1      ttymon
    1       24      24      sac
    ...
    FD      REQUEST ACTUAL  COMMAND
    0       1       1       bash
    0       1       1       bash
    0       128     4       sh
    0       128     3       sh
    0       128     4       sh
    ...
    4       416     416     ps
    4       416     416     ps
    11      336     336     svc.startd
    4       416     416     ps
    4       416     416     ps
    4       416     416     ps
    4       416     416     ps
    4       416     416     ps
    4       416     416     ps
    ^C
Using the previous output (and help from the truss(1) command), you can determine the following:
- The date(1) command reads a time zone (US/Mountain) configuration file of size 877 bytes when it starts.
- The ps(1) command reads the psinfo_t structure of size 416 bytes many times.
- The init(1M) command re-reads the /etc/inittab file periodically.
- The grep(1) command reads its file one page (8192 bytes) at a time.
- The sh(1) command reads a whole line from standard input into a 128-byte buffer.
- The bash(1) command reads standard input one byte at a time (probably to implement command line editing).
- The uptime(1) command reads the same time zone configuration file as the date(1) command.
- The sac(1M) and ttymon(1M) commands issued reads that failed.
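To follow up on observations such as the failed reads, you can narrow the trace to calls that return -1 and record the errno value. This is a minimal sketch under that assumption rather than one of the course scripts, and the column headings are illustrative.

```d
#!/usr/sbin/dtrace -qs

syscall::read:return, syscall::pread*:return
/arg0 == -1 && execname != "dtrace"/
{
	/* Count failed reads by command and error number. */
	@[execname, errno] = count();
}

END
{
	printf("%-16s %6s %6s\n", "Command", "errno", "Count");
	printa("%-16s %6d %@6d\n", @);
}
```

You can look up the reported error numbers in the Intro(2) manual page.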

Tracing Read Calls Using the iosnoop.d D Script

The following output results from running the iosnoop.d D script at the same time as the previous reads.d D script. It shows that only the grep(1) command performed actual disk reads. The other reads found the data cached in memory.
    # ./iosnoop.d
    COMMAND    PID    FILE                      DEVICE  RW  MS
    sched      0      <none>                    sd2     W   3.733
    sched      0      <none>                    sd2     W   4.796
    sched      0      <none>                    sd2     W   4.003
    sched      0      <none>                    sd2     W   10.259
    sched      0      <none>                    sd2     W   12.698
    sched      0      <none>                    sd2     W   15.843
    sched      0      <none>                    sd2     W   21.331
    sched      0      <none>                    sd2     W   28.134
    sched      0      <none>                    sd2     W   33.668
    sched      0      <none>                    sd2     W   39.575
    sched      0      <none>                    sd2     W   4.004
    grep       2691   /usr/include/sys/zone.h   sd2     R   4.817
    fsflush    3      <none>                    sd2     W   13.120
    ^C
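The iosnoop.d script itself is not listed here. As a much simpler sketch of driver-level tracing with the io provider (not a reconstruction of iosnoop.d), the following script counts physical I/O requests by command, device, and direction. It relies on the io provider's args[0] (a bufinfo_t) and args[1] (a devinfo_t) as documented in the Solaris Dynamic Tracing Guide; the column headings are illustrative.

```d
#!/usr/sbin/dtrace -qs

io:::start
{
	/* B_READ in b_flags distinguishes reads from writes. */
	@[execname, args[1]->dev_statname,
	    args[0]->b_flags & B_READ ? "R" : "W"] = count();
}

END
{
	printf("%-16s %-8s %2s %6s\n", "Command", "Device", "RW", "Count");
	printa("%-16s %-8s %2s %@6d\n", @);
}
```

Reads that are satisfied from memory never reach the driver, so they do not appear in this output.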
Aggregating Read Data

The following D script uses the avg() aggregation function to display the average number of bytes read by file descriptor and process name:

    # cat readsummary.d
    #!/usr/sbin/dtrace -qs

    syscall::read:entry, syscall::pread*:entry
    {
        self->started = 1;
        self->fd = arg0;
    }

    syscall::read:return, syscall::pread*:return
    /self->started && execname != "dtrace" && arg0 != -1/
    {
        @[self->fd, execname] = avg(arg0);
        self->started = 0;
    }

    END
    {
        printa("%d\t%-24s\t%@d\n", @);
    }

    # ./readsummary.d
    ^C
    4       instant         0
    2       more            1
    0       vi              1
    4       readchar        1
    0       bash            1
    2       ttymon          8
    4       rup             23
    1       sac             24
    5       nscd            59
    19      sgml2roff       119
    3       rup             413
    3       rpc.rstatd      413
    4       ps              416
    3       uptime          514
    1       init            540
    3       man             550
    3       ps              687
    5       rup             787
    4       vi              803
    3       grep            845
    3       date            877
    3       vi              1492
    4       nroff           2221
    4       uptime          2232
    0       nroff           3479
    0       tbl             3861
    3       cat             3861
    6       nsgmls          3894
    0       eqn             3914
    0       col             3979
    0       instant         4072
    3       instant         4442
    3       nsgmls          4459
    6       rpc.rstatd      4464
    3       more            4842
    5       man             5802
    4       nsgmls          6606
    0       grep            6815

By changing the aggregation function from avg() to sum(), you can obtain the total number of bytes read by file descriptor and process name:

    # ./totalread.d
    ^C
    4       instant         0
    0       vi              6
    2       more            8
    5       nscd            61
    0       bash            121
    11      svc.startd      336
    10      svc.startd      336
    3       man             550
    6       readchar        671
    3       date            877
    4       ls              877
    3       vi              2984
    19      sgml2roff       3214
    1       init            4324
    23      readchar        4572
    19      readchar        10276
    6       nsgmls          11684
    5       man             17408
    4       nroff           17771
    3       more            17876
    7       readchar        18064
    20      readchar        18116
    3       cat             18435
    0       tbl             18435
    4       vi              18880
    10      readchar        19500
    14      readchar        19500
    0       eqn             20356
    0       nroff           20356
    0       col             22496
    8       readchar        28252
    0       grep            30095
    3       nsgmls          33799
    3       instant         53314
    11      readchar        56616
    0       instant         160360
    4       nsgmls          171763
    21      readchar        192636
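The totalread.d script is not listed in this guide. A plausible sketch, assuming it differs from readsummary.d only in the aggregation function, is the following:

```d
#!/usr/sbin/dtrace -qs

syscall::read:entry, syscall::pread*:entry
{
	self->started = 1;
	self->fd = arg0;
}

syscall::read:return, syscall::pread*:return
/self->started && execname != "dtrace" && arg0 != -1/
{
	/* sum() accumulates total bytes, where avg() kept the mean. */
	@[self->fd, execname] = sum(arg0);
	self->started = 0;
}

END
{
	printa("%d\t%-24s\t%@d\n", @);
}
```

Only the aggregating function changes. The keys, the predicate, and the printa() format stay the same.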

Using the Anonymous Tracing Facility

Probes are usually enabled through a DTrace consumer process such as dtrace(1M). A DTrace consumer process cannot run, however, until the system has booted. Anonymous tracing allows you to enable tracing during boot. Anonymous tracing is not associated with any DTrace consumer. Any tracing that you can do interactively with the dtrace(1M) process you can also do anonymously. Only the super-user can create an anonymous enabling, and there can be only one anonymous enabling at any time. Most DTrace users do not need this feature, but because boot problems are particularly difficult to debug, anonymous tracing can prove valuable for kernel and device driver developers.
Creating an Anonymous Enabling

To create an anonymous enabling, use the -A option with a dtrace(1M) invocation that specifies the desired probes, predicates, actions, and options. The dtrace(1M) process modifies your /etc/system file to force the loading of the kernel modules that implement the needed DTrace providers. The dtrace process then adds a series of driver properties representing your request to the dtrace(7D) driver's configuration file, /kernel/drv/dtrace.conf. These properties are read by the dtrace(7D) driver when it is loaded. The driver then enables the specified probes with the specified actions, creating an anonymous state to associate with the new enabling.

Reboot your system. While the system is booting, messages appear on the console describing the anonymous enabling.

After the machine boots, claim the anonymous state by specifying the -a option to the dtrace(1M) command. By default, the -a option claims the anonymous state, processes the existing data, and continues to run. To process the anonymous state data and exit, add the -e option to the dtrace(1M) command.
Performing Anonymous Tracing

The following dtrace(1M) command performs anonymous tracing on the conskbd module, the console keyboard multiplexer driver:

    # dtrace -A -m conskbd
    dtrace: cleaned up old anonymous enabling in /kernel/drv/dtrace.conf
    dtrace: cleaned up forceload directives in /etc/system
    dtrace: saved anonymous enabling in /kernel/drv/dtrace.conf
    dtrace: added forceload directives to /etc/system
    dtrace: run update_drv(1M) or reboot to enable changes
    # tail /etc/system
    * chapter of the Solaris Dynamic Tracing Guide for details.
    *
    forceload: drv/systrace
    forceload: drv/sdt
    forceload: drv/profile
    forceload: drv/lockstat
    forceload: drv/fbt
    forceload: drv/fasttrap
    forceload: drv/dtrace
    * ^^^^ Added by DTrace
    # reboot
    ...
    # grep enabling /var/adm/messages
    Feb 27 07:34:22 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 0 (:kmdb::)
    Feb 27 07:34:22 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 1 (dtrace:::ERROR)
    Feb 27 07:45:27 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 0 (:conskbd::)
    Feb 27 07:45:27 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 1 (dtrace:::ERROR)
    # dtrace -ae
    CPU     ID                    FUNCTION:NAME
      0  25339             conskbd_attach:entry
      0  25340            conskbd_attach:return
      0  25327                conskbdopen:entry
      0  25328               conskbdopen:return
      0  25331               conskbduwput:entry
      0  25332              conskbduwput:return
      0  25345               conskbdioctl:entry
      0  25346              conskbdioctl:return
      0  25327                conskbdopen:entry
      0  25328               conskbdopen:return
      0  25331               conskbduwput:entry
      0  25332              conskbduwput:return
      0  25345               conskbdioctl:entry
      0  25346              conskbdioctl:return
      0  25329               conskbdclose:entry
      0  25330              conskbdclose:return
      0  25327                conskbdopen:entry
      0  25328               conskbdopen:return
      0  25329               conskbdclose:entry
      0  25330              conskbdclose:return

The forceload entries in the /etc/system file are not automatically removed after the reboot. Run the dtrace(1M) command with just the -A option to clean up these forceload entries:

    # tail -18 /etc/system
    * vvvv Added by DTrace
    *
    * The following forceload directives were added by dtrace(1M) to allow for
    * tracing during boot. If these directives are removed, the system will
    * continue to function, but tracing will not occur during boot as desired.
    * To remove these directives (and this block comment) automatically, run
    * "dtrace -A" without additional arguments. See the "Anonymous Tracing"
    * chapter of the Solaris Dynamic Tracing Guide for details.
    *
    forceload: drv/systrace
    forceload: drv/sdt
    forceload: drv/profile
    forceload: drv/lockstat
    forceload: drv/fbt
    forceload: drv/fasttrap
    forceload: drv/dtrace
    * ^^^^ Added by DTrace
    # dtrace -A
    dtrace: cleaned up old anonymous enabling in /kernel/drv/dtrace.conf
    dtrace: cleaned up forceload directives in /etc/system
    # tail /etc/system
    *
    * To set variables in 'unix':
    *
    * set nautopush=32
    * set maxusers=40
    *
    * To set a variable named 'debug' in the module named 'test_module'
    *
    * set test_module:debug = 0x13

The next example focuses only on those functions called from the conskbd_attach() function in the conskbd module:

    # cat cons.d
    #!/usr/sbin/dtrace -s

    fbt::conskbd_attach:entry
    {
        self->trace = 1;
    }

    fbt:::
    /self->trace/
    {
    }

    fbt::conskbd_attach:return
    {
        self->trace = 0;
    }
    # dtrace -AFs cons.d
    dtrace: saved anonymous enabling in /kernel/drv/dtrace.conf
    dtrace: added forceload directives to /etc/system
    dtrace: run update_drv(1M) or reboot to enable changes
    # reboot
    ...
    # grep enabling /var/adm/messages
    Feb 27 07:45:27 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 0 (:conskbd::)
    Feb 27 07:45:27 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 1 (dtrace:::ERROR)
    Feb 27 08:07:05 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 0 (fbt::conskbd_attach:entry)
    Feb 27 08:07:05 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 1 (fbt:::)
    Feb 27 08:07:05 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 2 (fbt::conskbd_attach:return)
    Feb 27 08:07:05 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 3 (dtrace:::ERROR)
    # dtrace -ae
    CPU FUNCTION
      0  -> conskbd_attach
      0    -> ddi_create_minor_node
      0      -> ddi_create_minor_common
      0        -> ddi_driver_major
      0        <- ddi_driver_major
      0        -> strcmp
      0        <- strcmp
      0        -> derive_devi_class
      0          -> i_ddi_devi_class
      0          <- i_ddi_devi_class
      0          -> strncmp
      0          <- strncmp
    ...
      0            <- kstat_compare_bykid
      0            -> kstat_zone_compare
      0            <- kstat_zone_compare
      0          <- avl_find
      0        <- kstat_hold
      0      <- kstat_hold_bykid
      0    <- kstat_install
      0    -> kstat_rele
      0      -> cv_broadcast
      0      <- cv_broadcast
      0    <- kstat_rele
      0  <- conskbd_attach

4-29
Using the Speculative Tracing Facility

Because of the comprehensive tracing coverage that DTrace provides, the challenge for the user can be deciding what not to trace. The primary mechanism for filtering out uninteresting events is the predicate mechanism. Predicates are useful when you know at the time a probe fires whether the probe event is interesting. For example, you might want to know when the read(2) system call is entered only if a particular process issued the call.

There are situations, however, in which you can determine that a given probe event is interesting only some time after the probe has fired. For example, if a read(2) system call is failing sporadically with an EIO errno code value, you might want to see the total code path leading to the error (not just the current stack trace). Tracing every code path is possible with the fbt provider, but doing this while waiting for the failure to reappear results in too much recorded data. This causes one of two problems:
- Unwanted data that must be filtered afterwards
- Data loss caused by running out of buffer space in DTrace
To address this problem, DTrace provides a facility called speculative tracing. Speculative tracing allows you to tentatively trace data. Later, you can decide that the traced data is interesting and commit it, or you can decide that the traced data is uninteresting and discard it.

Speculative Tracing Functions

The D functions shown in Table 4-1 compose the DTrace speculative tracing facility:

Table 4-1 DTrace Speculative Tracing Functions

Function Name   Args   Description
-------------   ----   -----------
speculation     None   Returns an identifier (ID) for a new speculative buffer
speculate       ID     Denotes that the remainder of the probe clause should be
                       traced to the speculative buffer specified by the ID
commit          ID     Commits the speculative buffer associated with the ID
discard         ID     Discards the speculative buffer associated with the ID

The speculation() function allocates a speculative buffer and returns a speculation identifier (ID). You use this ID in subsequent calls to the speculate() function. You must place the speculate() call before any data recording action statement in the same clause. All such data recording action statements are then speculatively traced. Probe clauses can contain speculative tracing or regular tracing, but not both. Aggregating actions, destructive actions, and exit actions can never be speculative.

By default (without tuning), there is only one speculative buffer. Therefore, you must be careful not to start a new speculation before committing or discarding an existing one.

You use the commit() function to commit a speculation. When you commit a speculative buffer, its data is copied into the one (per-CPU) principal buffer of DTrace. You cannot have any data recording actions in a clause containing a commit() function.

You use the discard() function to discard a speculation. When a speculative buffer is discarded, its contents are thrown away.

Speculative Tracing Example

You can use speculations to highlight a particular code path. The following example displays the entire code path for the open(2) system call only when it fails. The script explicitly ignores failed opens of the /var/ld/ld.config file, which are common on this system:

# cat spec.d
#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::open*:entry
/stringof(copyinstr(arg0)) != "/var/ld/ld.config"/
{
        self->spec = speculation();
        speculate(self->spec);

        /* The following will only appear if later committed */
        printf("%s was opening: %s\n", execname, copyinstr(arg0));
}

fbt:::
/self->spec/
{
        speculate(self->spec); /* default action */
}

syscall::open*:return
/self->spec && arg0 == -1/
{
        printf("Open failed with errno: %d\n", errno);
}

syscall::open*:return
/self->spec && arg0 == -1/
{
        /*
         * Move data recorded in speculative buffer
         * to principal buffer, freeing speculative buffer
         * for a new speculation()
         */
        commit(self->spec);
        self->spec = 0;
}

syscall::open*:return
/self->spec && arg0 != -1/
{
        /* Throw away data recorded in speculative buffer */
        discard(self->spec);
        self->spec = 0;
}
# ./spec.d
dtrace: script './spec.d' matched 40768 probes
CPU FUNCTION
  0  <= open64                                Open failed with errno: 2
  0  => open64                                grep was opening: /etc/sytem
  0    -> open64
  0    <- open64
  0    -> copen
  0      -> falloc
  0        -> ufalloc
  0        <- ufalloc
  0        -> ufalloc_file
  0          -> fd_find
...
  0        <- cv_broadcast
  0      <- setf
  0      -> unfalloc
  0        -> crfree
  0        <- crfree
  0      <- unfalloc
  0      -> kmem_cache_free
  0      <- kmem_cache_free
  0      -> set_errno
  0      <- set_errno
  0    <- copen
^C
It appears that the spec.d D script never starts a new open speculation until the current open returns and the current speculation is either committed or discarded. This is not the case, however, if an open blocks and does not return before another open is started. You learn in a lab exercise how to tune the number of speculative buffers.
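
As a minimal sketch of that tuning, the nspec option (one of the option settings shown in the module on DTrace buffers) raises the number of speculative buffers. The value 8 below is an assumption chosen for illustration, not a recommendation; placed at the top of spec.d, it would let several opens be in flight at once before a speculation() call fails to obtain a buffer:

```
#!/usr/sbin/dtrace -s

#pragma D option flowindent

/*
 * Assumed value for illustration: allow up to eight speculations
 * to be active at once instead of the default of one.
 */
#pragma D option nspec=8
```

The same setting can also be given on the command line with the -x option, for example -x nspec=8.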

Application Debugging With Speculative Tracing

The next example shows how to use speculative tracing for application debugging. Infrequent errors can be difficult to debug because they can be difficult to reproduce. Often you can identify a problem after a failure occurs, but at that point it is too late to reconstruct the code path that led to the failure condition. You can use the pid provider with speculative tracing to solve this common problem. The following script shows how to trace every instruction in a function only when it fails.

# cat appspec.d
#!/usr/sbin/dtrace -s

pid$target::malloc:entry
{
        self->spec = speculation();
        speculate(self->spec);
        printf("( %d )", arg0);
}

pid$target::malloc: /* trace all instructions */
/self->spec/
{
        speculate(self->spec);
}

pid$target::malloc:return
/self->spec && arg1 == 0/
{
        commit(self->spec);
        self->spec = 0;
}

pid$target::malloc:return
/self->spec && arg1 != 0/
{
        discard(self->spec);
        self->spec = 0;
}
# dtrace -s appspec.d -c myapp
dtrace: script 'appspec.d' matched 106 probes
...
CPU     ID                    FUNCTION:NAME
  0  42239                     malloc:entry ( 1000000000 )
  0  42239                     malloc:entry
  0  42311                         malloc:4
  0  42312                         malloc:8
  0  42313                         malloc:c
  0  42314                        malloc:10
  0  42315                        malloc:14
  0  42316                        malloc:18
  0  42317                        malloc:1c
  0  42318                        malloc:20
  0  42319                        malloc:24
  0  42320                        malloc:28
  0  42321                        malloc:2c
  0  42327                        malloc:44
  0  42328                        malloc:48
  0  42329                        malloc:4c
  0  42330                        malloc:50
  0  42331                        malloc:54
  0  42332                        malloc:58
  0  42333                        malloc:5c
  0  42334                        malloc:60
  0  42335                        malloc:64
  0  42336                        malloc:68
  0  42337                        malloc:6c
  0  42309                    malloc:return
...
# mdb myapp
> _start:b
> :r
mdb: stop at _start
mdb: target stopped at:
_start:         clr     %fp
> malloc::nm
Value      Size       Type  Bind  Other Shndx Name
0xff2d1cf0|0x00000070|FUNC |GLOB |0x0  |9    |libc.so.1`malloc
> 70%4=x
                1c
> malloc,1c/ai
libc.so.1`malloc:
libc.so.1`malloc:       save    %sp, -0x60, %sp
libc.so.1`malloc+4:     mov     %o7, %i3
libc.so.1`malloc+8:     call    +8 <libc.so.1`malloc+0x10>
libc.so.1`malloc+0xc:   sethi   %hi(0x92400), %i2
libc.so.1`malloc+0x10:  add     %i2, 0x180, %i2
libc.so.1`malloc+0x14:  add     %i2, %o7, %i4
libc.so.1`malloc+0x18:  mov     %i3, %o7
libc.so.1`malloc+0x1c:  ld      [%i4 + 0xec8], %i5
libc.so.1`malloc+0x20:  ld      [%i5], %i1
libc.so.1`malloc+0x24:  cmp     %i1, 0
libc.so.1`malloc+0x28:  bne     +0x1c <libc.so.1`malloc+0x44>
libc.so.1`malloc+0x2c:  nop
libc.so.1`malloc+0x30:  call    +0x93624 <PLT:___errno>
libc.so.1`malloc+0x34:  mov     0x30, %l7
libc.so.1`malloc+0x38:  st      %l7, [%o0]
libc.so.1`malloc+0x3c:  ret
libc.so.1`malloc+0x40:  restore %g0, 0, %o0
libc.so.1`malloc+0x44:  call    +0x657d4 <libc.so.1`assert_no_libc_locks_held>
libc.so.1`malloc+0x48:  nop
libc.so.1`malloc+0x4c:  call    +0x6437c <libc.so.1`lmutex_lock>
libc.so.1`malloc+0x50:  ld      [%i4 + 0xec0], %o0
libc.so.1`malloc+0x54:  call    +0x1c <libc.so.1`_malloc_unlocked>
libc.so.1`malloc+0x58:  mov     %i0, %o0
libc.so.1`malloc+0x5c:  mov     %o0, %i0
libc.so.1`malloc+0x60:  call    +0x6446c <libc.so.1`lmutex_unlock>
libc.so.1`malloc+0x64:  ld      [%i4 + 0xec0], %o0
libc.so.1`malloc+0x68:  ret
libc.so.1`malloc+0x6c:  restore
>

DTrace Privileges
By default, only the super-user can use DTrace. This is because DTrace enables visibility into all aspects of the system, including:
- User-level functions
- System calls
- Kernel functions
- Kernel data

In addition, some DTrace actions can modify a program's state by stopping a process or even inducing a breakpoint in the kernel. Just as it is inappropriate to allow one user to stop another user's process or access another user's files, so it is inappropriate to grant a user full access to all of the DTrace facilities. The traditional UNIX "all or none" approach to user privileges is not suitable for managing the use of the DTrace capabilities.
Using the Least Privilege Facility

The Least Privilege facility in the Solaris operating system enables a Solaris system administrator to grant particular users or processes specific privileges that permit access to individual DTrace capabilities. Three specific privileges control access by a user or process to the DTrace features:
- The dtrace_proc privilege: Permits use of only the pid and plockstat providers for process-level tracing of processes owned by the user.
- The dtrace_user privilege: Permits use of only the profile and syscall providers on processes owned by the user.
- The dtrace_kernel privilege: Permits the use of every provider except the pid and plockstat providers, unless the dtrace_proc privilege is also granted. Does not allow kernel-destructive actions.

In addition to the above DTrace-specific privileges, a user who has both the dtrace_proc and proc_owner privileges is allowed to trace other users' processes.

Kernel-Destructive Actions
Only the super-user can perform kernel-destructive actions. You enable such actions by running the dtrace(1M) command with the -w option. Three built-in DTrace functions cause kernel-destructive actions:
- The breakpoint() function: Action that induces a kernel breakpoint, causing the system to stop, with control passing to OpenBoot PROM or kmdb(1), depending on how the system was booted.
- The panic() function: Action that induces a kernel panic, with crash files normally being created for postmortem analysis.
- The chill() function: Action that causes DTrace to spin for the specified number of nanoseconds. Intended for dealing with race condition situations.
Setting DTrace User Privileges

The Solaris Least Privilege facility enables system administrators to grant specific privileges to specific Solaris users. To give a user a privilege at login, insert a line into the /etc/user_attr file, as follows:
username::::defaultpriv=basic,privilege,...
The following examples show the effect of setting the three DTrace-specific privileges.
No Specified DTrace Privileges

The following example shows a user with no DTrace privileges specified:
$ cat /etc/user_attr
#
# Copyright (c) 2003 by Sun Microsystems, Inc. All rights reserved.
#
# /etc/user_attr
#
# user attributes. see user_attr(4)
#
#pragma ident "@(#)user_attr 1.1 03/07/09 SMI"
#
adm::::profiles=Log Management
lp::::profiles=Printer Management
root::::auths=solaris.*,solaris.grant;profiles=Web Console Management,All;lock_after_retries=no
user2::::defaultpriv=basic,dtrace_proc
user3::::defaultpriv=basic,dtrace_user
user4::::defaultpriv=basic,dtrace_kernel
user5::::defaultpriv=basic,dtrace_kernel,dtrace_proc
user6::::defaultpriv=basic,dtrace_proc,proc_owner
$ id
uid=1001(user1) gid=101(users)
$ /usr/sbin/dtrace -l
dtrace: failed to initialize dtrace: DTrace requires additional privileges
$ echo $$
919
$ /usr/sbin/dtrace -n pid919:::
dtrace: failed to initialize dtrace: DTrace requires additional privileges
$
The dtrace_proc Privilege

This example shows the DTrace features available to a user with the dtrace_proc privilege:
$ id
uid=1002(user2) gid=101(users)
$ dtrace -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
$ echo $$
9447
$ dtrace -n pid9447:::entry
dtrace: description 'pid9447:::entry' matched 3179 probes
^C
$ dtrace -qn 'pid$target:libc:memcpy:entry {printf("size: %d\n",arg2)}' -c date
Sun Feb 27 10:02:01 MST 2005
size: 16
size: 15
size: 1
size: 15
size: 5
size: 521
size: 44
size: 28
size: 28
size: 48
size: 48
size: 308
size: 56
size: 36
size: 29
$ ps -ef | grep vi
   user2  1534  1528  0 09:48:20 pts/1    0:00 grep vi
   user5  1531  1452  0 09:47:55 pts/2    0:00 vi resume
$ dtrace -n pid1531:::
dtrace: invalid probe specifier pid1531:::: failed to grab pid 1531: permission denied
$ dtrace -n syscall::read:
dtrace: invalid probe specifier syscall::read:: probe description syscall::read: does not match any probes
$
The dtrace_proc and proc_owner Privileges

$ id
uid=1006(user6) gid=101(users)
$ grep user6 /etc/user_attr
user6::::defaultpriv=basic,dtrace_proc,proc_owner
$ ps -ef | grep vi
   user6   650   637  0 09:41:30 pts/1    0:00 grep vi
   user5   649   630  0 09:41:16 pts/2    0:00 vi resume
$ /usr/sbin/dtrace -n pid649:::entry
dtrace: description 'pid649:::entry' matched 3951 probes
CPU     ID                    FUNCTION:NAME
  0  42548                    peekkey:entry
  0  42544                     getkey:entry
  0  42546                      getbr:entry
  0  42548                    peekkey:entry
  0  42544                     getkey:entry
  0  42546                      getbr:entry
...
The dtrace_user Privilege

This example shows the DTrace features available to a user with the dtrace_user privilege:
$ id
uid=1003(user3) gid=101(users)
$ grep user3 /etc/user_attr
user3::::defaultpriv=basic,dtrace_user
$ echo $$
1171
$ dtrace -n pid1171:::entry
dtrace: invalid probe specifier pid1171::: probe description pid1171::: does not match any probes
$ pgm
f: 13 p: 0 q: -1952257862 m: -10
f: 640001883 p: -2056615 q: -929109794 m: -7
f: -1660723204 p: -1529159 q: 94444073 m: 25
f: 2041630813 p: 749994 q: -42775360 m: -23
f: -1255556994 p: 1065403 q: 309691762 m: 14
^C
$ dtrace -qn 'syscall::write:entry /arg0 == 1/ {printf("T: %d\n",timestamp)}' -c pgm
f: 13 p: 0 q: -1952257862 m: -10
f: 640001883 p: -2056615 q: -929109794 m: -7
f: -1660723204 p: -1529159 q: 94444073 m: 25
f: 2041630813 p: 749994 q: -42775360 m: -23
f: -1255556994 p: 1065403 q: 309691762 m: 14
f: -1207459745 p: 1769677 q: -8640714 m: -35
T: 150116053418082
T: 150116222152140
T: 150116388881669
T: 150116558431666
T: 150116728255203
...
$ dtrace -n 'pid$target:::entry' -c pgm
dtrace: invalid probe specifier pid$target:::entry: probe description pid1208:::entry does not match any probes
$ dtrace -qn 'profile-109 {@[arg1] = count()}' -c pgm
f: 13 p: 0 q: -1952257862 m: -10
f: 640001883 p: -2056615 q: -929109794 m: -7
f: -1660723204 p: -1529159 q: 94444073 m: 25
...
^C
           133476               49
       4280947012              226
       4280947008             1094
$ mdb pgm
> _start:b
> :r
mdb: stop at pgm`_start
mdb: target stopped at:
mypgm`_start:   clr     %fp
> 0t4280947008/ai
libc.so.1`.umul:
libc.so.1`.umul:umul    %o0, %o1, %o0
> $q
$ (sleep 33; pwd)&
1680
$ dtrace -n 'syscall:::entry /pid != $pid/ {}'
dtrace: description 'syscall:::entry ' matched 225 probes
/export/home/user3
CPU     ID                    FUNCTION:NAME
  0  18832                      rexit:entry
  0  18922                      ioctl:entry
  0  18908                    setpgrp:entry
  0  18922                      ioctl:entry
  0  19004                    waitsys:entry
  0  19214                     getcwd:entry
  0  18838                      write:entry
  0  18832                      rexit:entry
^C
$

The dtrace_user privilege only allows the use of the syscall and profile providers on processes owned by the user. Even though many system calls are occurring in the system, the above output shows only the system calls made by the sh, sleep, and pwd commands.
The dtrace_kernel Privilege

This example shows the DTrace features available to a user with the dtrace_kernel privilege:
$ id
uid=1004(user4) gid=101(users)
$ grep user4 /etc/user_attr
user4::::defaultpriv=basic,dtrace_kernel
$ dtrace -qn 'sched:::on-cpu {printf("Starting to run: %s\n", execname)}'
Starting to run: sched
Starting to run: sched
Starting to run: fsflush
Starting to run: svc.configd
Starting to run: inetd
Starting to run: svc.startd
Starting to run: fmd
Starting to run: dtrace
Starting to run: sched
Starting to run: sched
^C
$ dtrace -qn 'io:::start {printf("Starting an I/O: %s\n", execname)}'
Starting an I/O: bash
Starting an I/O: bash
Starting an I/O: bash
Starting an I/O: fsflush
Starting an I/O: find
Starting an I/O: find
Starting an I/O: find
Starting an I/O: find
^C
$ echo $$
6711
$ dtrace -n pid6711:a.out::entry
dtrace: invalid probe specifier pid6711:a.out::entry: probe description pid6711:bash::entry does not match any probes

The preceding example demonstrates that you must have the dtrace_proc privilege to trace your own processes. The dtrace_kernel privilege by itself is not sufficient.

$ id
uid=1005(user5) gid=101(users)
$ grep user5 /etc/user_attr
user5::::defaultpriv=basic,dtrace_kernel,dtrace_proc
$ echo $$
6736
$ dtrace -n 'pid6736:a.out::entry'
dtrace: description 'pid6736:a.out::entry' matched 211 probes
^C
$ dtrace -l | awk '{print $2}' | sort -u
PROVIDER
dtrace
fasttrap
fbt
fpuinfo
io
lockstat
mib
pid6736
proc
profile
sched
sdt
syscall
sysinfo
vminfo
$
Privilege Needed for Kernel-Destructive Actions

Only the super-user can invoke kernel-destructive actions:
$ dtrace -wn 'syscall::fork1:entry {chill(2000); printf("OK, lets start: %s\n", execname);}'
dtrace: description 'syscall::fork1:entry ' matched 1 probe
dtrace: allowing destructive actions
dtrace: error on enabled probe ID 2 (ID 18246: syscall::fork1:entry): invalid kernel access in action #1
dtrace: error on enabled probe ID 2 (ID 18246: syscall::fork1:entry): invalid kernel access in action #1
^C
$ su
Password:
# dtrace -wn 'syscall::fork1:entry {chill(2000); printf("OK, lets start: %s\n", execname);}'
dtrace: description 'syscall::fork1:entry ' matched 1 probe
dtrace: allowing destructive actions
CPU     ID                    FUNCTION:NAME
  0  18246                      fork1:entry OK, lets start: bash
  0  18246                      fork1:entry OK, lets start: bash
^C
Setting DTrace Process Privileges

The Least Privilege facility also enables a Solaris system administrator to grant privileges to specific processes. To give a running process an additional privilege, use the ppriv(1) command:

# ppriv -s A+privilege process-ID

The following interactive session shows the use of the ppriv(1) command to give a shell specific DTrace privileges. See privileges(5) for details:

$ id
uid=1001(user1) gid=101(users)
$ /usr/sbin/dtrace -l
dtrace: failed to initialize dtrace: DTrace requires additional privileges
$ echo $$
1774
$ ppriv -s A+dtrace_proc 1774
1774: ppriv: Not owner
$ su
Password:
# ppriv -s A+dtrace_proc 1774
# exit
$ /usr/sbin/dtrace -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
$ /usr/sbin/dtrace -n 'pid$target:calls::entry' -c calls
dtrace: description 'pid$target:calls::entry' matched 7 probes
83 133
dtrace: pid 1787 exited with status 1
CPU     ID                    FUNCTION:NAME
  0  28355                     _start:entry
  0  28362                      _init:entry
  0  28361                       main:entry
  0  28360                         f1:entry
  0  28359                         f2:entry
  0  28358                         f3:entry
  0  28357                         f4:entry
  0  28356                         f5:entry
  0  28360                         f1:entry
  0  28359                         f2:entry
  0  28358                         f3:entry
  0  28357                         f4:entry
  0  28356                         f5:entry
  0  28363                      _fini:entry
$ ppriv $$
1774:   -sh
flags = <none>
        E: basic,dtrace_proc
        I: basic,dtrace_proc
        P: basic,dtrace_proc
        L: all
$ bash
bash-2.05b$ ppriv $$
1789:   bash
flags = <none>
        E: basic,dtrace_proc
        I: basic,dtrace_proc
        P: basic,dtrace_proc
        L: all
bash-2.05b$ /usr/sbin/dtrace -n 'pid$target:calls::entry' -c calls
dtrace: description 'pid$target:calls::entry' matched 7 probes
83 133
dtrace: pid 1850 exited with status 1
CPU     ID                    FUNCTION:NAME
  0  28355                     _start:entry
  0  28362                      _init:entry
  0  28361                       main:entry
  0  28360                         f1:entry
...
bash-2.05b$ echo $$
1789
bash-2.05b$ su
Password:
# ppriv -s A+dtrace_kernel 1789
# ppriv $$
1854:   sh
flags = <none>
        E: all
        I: basic
        P: all
        L: all
# exit
bash-2.05b$ ppriv $$
1789:   bash
flags = <none>
        E: basic,dtrace_kernel,dtrace_proc
        I: basic,dtrace_kernel,dtrace_proc
        P: basic,dtrace_kernel,dtrace_proc
        L: all
bash-2.05b$ /usr/sbin/dtrace -qn 'fbt::cv_wait_sig:entry
> {trace(execname);ustack();stack();exit(0);}'
more
              ff2bcb58
              15684
              149a4
              13ad8
              12780
              1201c
              115cc

              genunix`str_cv_wait+0x28
              genunix`strwaitq+0x238
              genunix`strread+0x174
              genunix`read+0x274
              unix`syscall_trap32+0xcc
Summarizing the DTrace Privilege Levels

Table 4-2 describes the DTrace privilege levels.

Table 4-2 DTrace Privilege Levels

Privilege Level        Providers          Actions                     Variables                    Address Spaces
--------------------   ----------------   -------------------------   --------------------------   --------------
Any DTrace privilege   dtrace             exit, printf, tracemem,     args, probemod, this,        None
                                          discard, speculate,         epid, probename, timestamp,
                                          printa, trace               id, probeprov, vtimestamp,
                                                                      probefunc, self
dtrace_proc            pid, plockstat     copyin, copyout, stop,      execname, pid, uregs         User
privilege                                 copyinstr, raise, ustack
dtrace_user            profile, syscall   copyin, copyout, stop,      execname, pid, uregs         User
privilege                                 copyinstr, raise, ustack
dtrace_kernel          All except the     All but destructive         All                          User, Kernel
privilege              pid and            actions
                       plockstat
                       providers

Module 5

Objectives
- Describe how to lessen the performance impact of DTrace
- Describe how to use and tune DTrace buffers
- Debug DTrace scripts
Relevance
Discussion: The following questions are relevant to understanding how to troubleshoot DTrace problems:
- Would the ability to write your D scripts with minimal performance impact be beneficial?
- Would it be useful to have control over buffer management policies when DTrace buffer space is exhausted?
- Would it be useful to detect common mistakes made in D scripts?

Minimizing DTrace Performance Impact

Enabling DTrace in any manner affects system performance in some way. Often, this effect is negligible, but it can be substantial if many probes are enabled with costly enablings. You can minimize the performance impact of DTrace by:
- Limiting enabled probes
- Using aggregations
- Using cacheable predicates
Limiting Enabled Probes

DTrace provides comprehensive tracing coverage of both kernel and user processes. This coverage allows for a major probe effect if tens of thousands of probes are enabled. In general, you should only enable as many probes as needed to solve your problem. Do not, for example, enable all fbt probes if a more concise enabling can answer your question. When possible, limit enabled probes to a specific module or function of interest. The more concisely you can formulate the problem statement, the better you will be at limiting your probe effect.

You should also be careful when using the pid provider, because it can instrument every instruction of an application. This can result in millions of probes being enabled in the application, slowing the target process to a crawl.

Nevertheless, there are many conditions in which you must enable a large number of probes to answer a question. DTrace has been designed with this in mind. Enabling a large number of probes can slow down the system substantially, but it can never induce fatal failure of the machine. You should therefore not hesitate to enable many probes if necessary.
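
As a minimal sketch of a more concise enabling, the following script instruments only the entry probes of a single kernel module rather than every fbt probe. The module name ufs and the ten-second duration are assumptions chosen for illustration; substitute the module that is relevant to your problem:

```
#!/usr/sbin/dtrace -s

/*
 * Count kernel function calls in one module only.
 * "ufs" is an assumed module name chosen for illustration.
 */
fbt:ufs::entry
{
        @calls[probefunc] = count();
}

/* Stop after ten seconds so that the enabling is short-lived. */
tick-10sec
{
        exit(0);
}
```

Because the probe description names a module and only the entry probes, the number of enabled probes is a small fraction of what fbt::: would enable.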

Using Aggregations
DTrace aggregations provide a scalable method of aggregating data. Although associative arrays appear to offer similar functionality, they are global, general-purpose variables that cannot provide the linear scalability of aggregations. Aggregating functions allow for intermediate results to be kept per-CPU instead of in a shared global data structure. When a system-wide result is required, the aggregating function may then be applied to the set consisting of the per-CPU intermediate results. You should therefore use aggregations rather than associative arrays whenever possible. For example, you should avoid performing the action shown in the following script:

syscall:::entry
{
        ++totals[execname];
}

syscall::rexit:entry
{
        printf("%40s %d\n", execname, totals[execname]);
        totals[execname] = 0;
}

You should instead perform the following:

syscall:::entry
{
        @totals[execname] = count();
}

END
{
        printa("%40s %@d\n", @totals);
}
Using Cacheable Predicates

A tracing framework that offers comprehensive coverage must provide a mechanism that enables you not to trace events; otherwise you are flooded with unwanted data. DTrace does this with predicates, which enable you to trace data only when a specified condition is found to be true.

When enabling many probes, you tend to use predicates of a form that identifies a specific thread or threads of interest, such as /self->traceme/ or /pid == 12345/. Many of these predicates evaluate to the same (false) value for most threads in most probes, but the evaluation itself can become costly when done for every function entry and return point in the kernel. To reduce this cost, DTrace caches the evaluation of a predicate if it includes only thread-local variables (as in the first example), only immutable variables (as in the second), or both. The cost of evaluating a cached predicate is much smaller than the cost of evaluating a non-cached predicate, especially if the predicate involves thread-local variables, string comparisons, or other relatively costly operations.
Examining Cacheable and Uncacheable Predicates

Predicate caching is transparent to the user (cache coherency is maintained by DTrace). It does, however, require you to follow some guidelines to construct optimal predicates. Table 5-1 shows some examples of cacheable as opposed to uncacheable predicate expressions.

Table 5-1 Cacheable and Uncacheable Predicates

Cacheable           Uncacheable
-----------------   -------------------------------------------------------
self->mumble        mumble[curthread] or mumble[pid, tid]
execname == "pgm"   curpsinfo->pr_fname or curthread->t_procp->p_user.u_comm
pid == 1234         curpsinfo->pr_pid or curthread->t_procp->p_pidp->pid_id
tid == 17           curlwpsinfo->pr_lwpid or curthread->t_tid
Constructing Optimal Predicates

You should avoid constructing uncacheable predicates, such as that shown in the following example:

syscall::read:entry
{
        follow[pid, tid] = 1;
}

fbt:::
/follow[pid, tid]/
{}

syscall::read:return
/follow[pid, tid]/
{
        follow[pid, tid] = 0;
}

You should instead use thread-local variables, as in the following example:

syscall::read:entry
{
        self->follow = 1;
}

fbt:::
/self->follow/
{}

syscall::read:return
/self->follow/
{
        self->follow = 0;
}

To be cacheable, a predicate must consist exclusively of cacheable expressions. The following predicates are all cacheable:

/execname == "myprogram"/
/execname == $$1/
/pid == 12345/
/pid == $1/
/self->traceme == 1/

Because of the use of global variables, these predicates are all not cacheable:

/execname == one_to_watch/
/traceme[execname]/
/pid == pid_i_care_about/
/self->traceme == my_global/
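
One way to apply these guidelines is to pass the process ID as a macro argument instead of storing it in a global variable. The sketch below assumes a hypothetical script name, follow.d, invoked as ./follow.d 12345; every predicate in it uses only the pid variable, a macro argument, or a thread-local variable:

```
#!/usr/sbin/dtrace -s

/*
 * Cacheable: $1 is a macro argument replaced by a constant when the
 * script is compiled, so the predicate contains no global variable.
 * Hypothetical invocation: ./follow.d 12345
 */
syscall::read:entry
/pid == $1/
{
        self->follow = 1;
}

fbt:::
/self->follow/
{
}

syscall::read:return
/self->follow/
{
        self->follow = 0;
}
```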

Using and Tuning DTrace Buffers

Data buffering and management is an essential service provided by the DTrace framework for its clients. In previous modules you used DTrace without examining how traced data is transported from the DTrace framework to clients such as the dtrace(1M) utility. In this section, you explore data buffering in detail and learn about options you can tune to change the DTrace buffer management policies.
Principal Buffers
The buffer most fundamental to DTrace operation is the principal buffer. The principal buffer is present in every DTrace invocation, and is the buffer to which tracing actions record their data by default. These actions include:
- exit()
- printf()
- trace()
- ustack()
- printa()
- stack()
The principal buffers are always allocated on a per-CPU basis, although tracing (and thus buffer allocation) can be restricted to a single CPU by using the cpu option.
Principal Buffer Policies

DTrace enables tracing in highly constrained contexts in the kernel. In particular, DTrace enables tracing in contexts in which you cannot reliably allocate memory. The consequence of this flexibility of context is that there always exists a possibility that you want to trace data when there is no space available. DTrace must have policies to deal with such situations when they arise. Which policy you choose is dictated by the specifics of how you are using DTrace: sometimes it is best to discard the new data, while at other times it is desirable to reuse the space containing the oldest recorded data to trace the new data. Usually, however, the best policy is the one that minimizes the likelihood of running out of available space in the first place.

To accommodate these varying demands, DTrace supports the following buffer policies:
- The switch policy
- The fill policy
- The ring policy
This support is implemented with the bufpolicy option, and can be set on a per-consumer basis.
DTrace Option Settings

You can set options in a D script by using the #pragma D option statement and the option name. If the option takes a value, the option name should be followed by an equals sign (=) and the option value. For example, all of the following are valid option settings:

#pragma D option nspec=4
#pragma D option grabanon
#pragma D option bufsize=2g
#pragma D option switchrate=64
#pragma D option aggrate=100
#pragma D option bufresize=manual

The dtrace(1M) command also accepts option settings on the command line as an argument to the -x option. For example:

# dtrace -x nspec=4 -x bufsize=2g -x switchrate=60 \
-x aggrate=10ms -x bufpolicy=switch -n zfod

You can also specify the bufsize option with the -b flag to the dtrace(1M) command:

# dtrace -b 2g -n zfod

Note: This section describes only those options relevant to buffer management. For details on the other DTrace options, see the Solaris Dynamic Tracing Guide.

The switch Buffer Policy

By default, the principal buffer has a switch buffer policy. Under this policy, per-CPU buffers are allocated in pairs: one buffer is active, the other is inactive. When a DTrace consumer asks to read its buffer out of the kernel, the kernel first switches the inactive and active buffers. Buffer switching is done in such a manner that there is no window in which tracing data can be lost. When the buffers are switched, the newly inactive buffer is copied out to the DTrace consumer. This policy ensures that the consumer always sees a self-consistent buffer (that is, a buffer is never simultaneously traced to and copied out), and that no window is introduced in which tracing is paused or otherwise prevented.

The consumer controls the rate at which the buffer is read out (and thus switched) by using the switchrate option. As with any rate option, switchrate can be specified with any time suffix, but defaults to rate-per-second.
Dropped Data
Under the switch policy, if a given enabled probe would trace more data than there is space available in the active principal buffer, the data is dropped and a per-CPU drop count is incremented. In the event of one or more drops, the dtrace(1M) command displays this message or a similar one:

dtrace: 11 drops on CPU 0

You can reduce or eliminate drops by:
- Increasing the size of the principal buffer with the bufsize option
- Increasing the switching rate with the switchrate option

The switch policy allocates scratch space for the copyin(), copyinstr(), and alloca() subroutines out of the active buffer.
Example of Tuning Buffers to Alleviate Drops

The following D script causes significant drops:

    # cat stress.d
    #!/usr/sbin/dtrace -s

    fbt:::
    {
        trace(timestamp);
    }

    tick-5sec
    {
        exit(0);
    }
    # ./stress.d >/var/tmp/stress.d.out
    dtrace: script './stress.d' matched 38665 probes
    dtrace: 451660 drops on CPU 0
    dtrace: 1100596 drops on CPU 0
    dtrace: 1028767 drops on CPU 0
    dtrace: 1103521 drops on CPU 0
    # ls -l /var/tmp/stress.d.out
    -rw-r--r--   1 root     root     86004878 Mar 13 14:58 /var/tmp/stress.d.out
The drops result from the limited buffer space, the low switchrate value, or both. The default buffer size for the principal buffer is 4 Mbytes and the default switchrate is once per second. In the next invocation of the script, you increase the buffer size significantly:

    # dtrace -x bufsize=300m -s stress.d >/var/tmp/stress.d.out
    dtrace: script 'stress.d' matched 38665 probes
    dtrace: buffer size lowered to 150m
    # ls -l /var/tmp/stress.d.out
    -rw-r--r--   1 root     root     18177752 Mar 13 15:03 /var/tmp/stress.d.out
Note that DTrace lowers the setting for buffer size because there is not enough memory. By increasing the buffer size, you eliminated all drops and created 18 Mbytes of trace data. In the next example, you use a smaller buffer size, but with an increased switchrate value:

    # dtrace -x bufsize=64m -x switchrate=16 -s stress.d >/var/tmp/stress.d.out
    dtrace: script 'stress.d' matched 38665 probes
    ^C
    # ls -l /var/tmp/stress.d.out
    -rw-r--r--   1 root     root     33052791 Mar 13 15:06 /var/tmp/stress.d.out
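The same tuning can be kept inside the script instead of being typed on the command line. The following is a minimal sketch of stress.d with the options set by pragmas; the file name stress2.d and the specific values (64 Mbytes, 16 switches per second) are assumptions for illustration, not recommended settings.

```d
#!/usr/sbin/dtrace -s

/* stress2.d (hypothetical name): stress.d with buffer tuning built in. */
/* The values below are assumed starting points; adjust them per system. */
#pragma D option bufsize=64m
#pragma D option switchrate=16hz

fbt:::
{
	trace(timestamp);
}

tick-5sec
{
	exit(0);
}
```

If drops are still reported, raise bufsize or switchrate further, or enable fewer probes.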

The fill Buffer Policy

For some problems it is useful to have a single, large in-kernel buffer and to continue tracing until one or more of the per-CPU buffers has filled. You can implement this solution using the fill buffer policy. The fill buffer policy is beneficial in helping to avoid drops that result in the loss of trace data. Kernel buffer space is also saved since there is only one buffer per CPU.

Under the fill buffer policy, tracing continues until an enabled probe is about to trace more data than there is space in the principal buffer. At this time, the buffer is marked as filled and the consumer is notified that at least one of its per-CPU buffers has filled. When the dtrace(1M) utility detects a single filled buffer, tracing is stopped, all buffers are processed, and dtrace exits. Note that no further data is traced to a filled buffer, even if the data would fit in the buffer.

To use the fill policy, set the bufpolicy option to fill. For example, the following invocation of DTrace traces every system call entry into a per-CPU 2-Kbyte buffer with the buffer policy set to fill:

    # dtrace -n syscall:::entry -b 2k -x bufpolicy=fill

To allow for END tracing in fill buffers, DTrace calculates beforehand the amount of space potentially consumed by END probes and subtracts this from the size of the principal buffer. If the net size is negative, DTrace refuses to start, and the dtrace(1M) utility outputs a corresponding error message:

    dtrace: END enablings exceed size of principal buffer

Reserving space beforehand ensures that a full buffer always has sufficient space for any and all END probes.
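As a sketch of how the fill policy looks inside a script, the following sets the same options with pragmas and adds an END clause, whose space DTrace reserves beforehand as described above. The file name fill.d and the message text are assumptions for illustration.

```d
#!/usr/sbin/dtrace -s

/* fill.d (hypothetical name): trace until one per-CPU buffer fills. */
#pragma D option bufpolicy=fill
#pragma D option bufsize=2k

syscall:::entry
{
	trace(timestamp);
}

/* Space for this record is subtracted from the principal buffer up front. */
END
{
	printf("tracing has ended\n");
}
```

With a 2-Kbyte buffer the script normally ends on its own within moments, because dtrace stops as soon as it detects one filled buffer.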

The ring Buffer Policy

When using DTrace to help diagnose failure (as opposed to understanding non-failing behavior), you often want to track the events leading to failure. Moreover, in cases where reproducing failure can take hours or days, you might want to keep only the most recent data. To support such situations, DTrace provides the ring buffer policy. Under this policy, when a principal buffer has filled, tracing wraps around to the first entry, thereby overwriting older tracing data. You establish a ring buffer by setting the bufpolicy option to ring:

    # dtrace -s stress.d -x bufpolicy=ring -b 16k
    dtrace: script 'stress.d' matched 38665 probes
    CPU     ID                    FUNCTION:NAME
      0   9808       disp_lock_enter_high:entry   810424080584641
      0   9809      disp_lock_enter_high:return   810424080586093
      0   2288                setfrontdq:return   810424080588595
      0    668         generic_enq_thread:entry   810424080590727
      0    669        generic_enq_thread:return   810424080592504
      0  14298                ts_preempt:return   810424080594241
    ...

With the ring buffer policy, the dtrace(1M) utility does not display any output until the process terminates; at that time the ring buffer is consumed and processed. Note that if a given record cannot fit in the buffer (that is, if the record is larger than the buffer size), the record is dropped regardless of buffer policy.

By adding the following two lines to a D script, you can enable ring buffering with a specific buffer size:

    #pragma D option bufpolicy=ring
    #pragma D option bufsize=16k

Other Buffers
Principal buffers exist in every DTrace enabling. In addition to principal buffers, some DTrace consumers have additional in-kernel data buffers: an aggregation buffer, a number of speculative buffers, or both. You tune the aggregation buffer size with the aggsize option, and you tune the speculative buffer size with the specsize option. You can tune the size of each buffer on a per-consumer basis. Note that setting the buffer sizes denotes the sizes of the buffers on each CPU. Moreover, for the switch buffer policy, bufsize denotes the individual sizes of the active and inactive buffers on each CPU.
Buffer Resizing Policy

In some cases there is not adequate free kernel memory to allocate a buffer of the desired size. There might be insufficient memory available, or the DTrace consumer might have exceeded a tunable limit. DTrace provides a configurable policy for when a buffer cannot be allocated. The policy is set with the bufresize option, and defaults to auto. Under the auto buffer resize policy, the size of a buffer is halved until a successful allocation occurs. The dtrace(1M) utility emits a message if a buffer as allocated is smaller than the requested size:

    # dtrace -P syscall -b 4g
    dtrace: description 'syscall' matched 450 probes
    dtrace: buffer size lowered to 128m
    # dtrace -n 'fbt:::entry {@a[probefunc] = count()}' -x aggsize=1g
    dtrace: description 'fbt:::entry ' matched 16250 probes
    dtrace: aggregation size lowered to 128m

Alternatively, you can set the buffer resize policy to be manual by setting bufresize to manual. Under this policy, a failure to allocate causes DTrace to fail to start:

    # dtrace -P syscall -x bufsize=500m -x bufresize=manual
    dtrace: description 'syscall' matched 450 probes
    dtrace: could not enable tracing: Not enough space

The bufresize option dictates the buffer resizing policy of all buffers: principal, speculative, and aggregation.
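The aggregation buffer size and the resize policy can also be fixed inside a script. The following sketch assumes a 1-Mbyte aggregation buffer is enough for the workload; the file name aggbuf.d, the size, and the 10-second run time are illustrative assumptions.

```d
#!/usr/sbin/dtrace -s

/* aggbuf.d (hypothetical name): explicit aggregation buffer sizing. */
#pragma D option aggsize=1m
/* With manual resizing, dtrace fails to start rather than shrink buffers. */
#pragma D option bufresize=manual

syscall:::entry
{
	@calls[probefunc] = count();
}

tick-10sec
{
	exit(0);
}
```

Choosing manual here is a design decision: it trades convenience for the certainty that the script never runs with a smaller buffer than the one requested.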

Debugging DTrace Scripts

As with any programming language, you can experience a multitude of errors in the D language. As you write more D scripts, you will find it easier to diagnose errors, whether they are syntax errors or run-time errors. This section provides requirements and recommendations for writing correct D scripts.
Avoiding Syntax Errors in D Scripts

This section describes requirements that help you to avoid common D script syntax errors.

Start your scripts with the following first line:

    #!/usr/sbin/dtrace -s

Without it, the shell tries to interpret the script itself:

    # ./badstart.d
    ./badstart.d: line 1: BEGIN: command not found
    ./badstart.d: line 8: tick-1sec: command not found
    ./badstart.d: line 10: syntax error near unexpected token `0'
    ./badstart.d: line 10: ` exit(0);'
The interpreter line must be the very first line of the file; do not place a comment before it:

    # cat comments.d
    /* This D script counts the number of read system calls */
    #!/usr/sbin/dtrace -s
    syscall::read:entry
    {
        @["Number of reads:"] = count();
    }
    # ./comments.d
    ./comments.d: line 1: /bin: is a directory
    ./comments.d: line 3: syscall::read:entry: command not found
    ./comments.d: line 5: syntax error near unexpected token `('
    ./comments.d: line 5: ` @["Number of reads:"] = count();'

You must match up /* with an ending */ for comments in D scripts:

    # cat comments2.d
    #!/usr/sbin/dtrace -s

    /* This D script counts the number of read system calls
    syscall::read:entry
    {
        @["Number of reads:"] = count();
    }
    # ./comments2.d
    dtrace: failed to compile script ./comments2.d: line 7: end-of-file encountered before matching */

If you have more than one statement in a probe clause, make sure you end each one with a semicolon:

    ...
    BEGIN
    {
        a=$1
        b=$2
        c=$3
    }
    ...
    # ./badstart2.d 1 2 3
    dtrace: failed to compile script ./badstart2.d: line 6: syntax error near "b"

When comparing values, make sure that you use the == relational operator and not =:

    # cat test5.d
    #!/usr/sbin/dtrace -s

    fbt::sema_init:entry
    /arg1 = 1/
    {
        trace(timestamp);
    }
    # ./test5.d
    dtrace: failed to compile script ./test5.d: line 4: operator = can only be applied to a writable variable

The first assignment to a variable determines its type. As in the C language, you cannot mix types in the D language:

    # cat test8.d
    #!/usr/sbin/dtrace -s

    BEGIN
    {
        vp = `rootdir;
        i = 5;
    }

    tick-1sec
    {
        i = *vp;
    }
    # ./test8.d
    dtrace: failed to compile script ./test8.d: line 11: operands have incompatible types: "int" = "vnode_t"

Remember that even with the -w dtrace(1M) option, which enables destructive actions, you cannot modify kernel variables:

    # cat test6.d
    #!/usr/sbin/dtrace -ws

    tick-5sec
    /`freemem < `lotsfree/
    {
        `lotsfree = `lotsfree*2;
    }
    # ./test6.d
    dtrace: failed to compile script ./test6.d: line 6: operator = can only be applied to a writable variable
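For contrast with the failing scripts in this section, here is a minimal sketch that follows the same rules: the interpreter line comes first, the comment is closed, every statement ends with a semicolon, and the predicate uses == for comparison. The file name goodstart.d and the variable name wanted are hypothetical.

```d
#!/usr/sbin/dtrace -s

/* goodstart.d (hypothetical name): syntactically clean counterpart to test5.d. */

BEGIN
{
	wanted = 1;
}

fbt::sema_init:entry
/arg1 == wanted/
{
	trace(timestamp);
}
```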

Avoiding Run-Time Errors in D Scripts

This section describes requirements that help you to avoid common D script run-time errors.

Make sure your D script file has execute permission:

    # ./badstart.d
    ./badstart.d: : Permission denied
    # chmod +x badstart.d

If you specify other options on the first line of a D script, be sure the -s option is last:

    # head badstart3.d
    #!/usr/sbin/dtrace -sq

    BEGIN
    {
        a=$1
        b=$2
        c=$3
    }

    tick-1sec
    # ./badstart3.d
    dtrace: failed to open q: No such file or directory

Make sure that you pass the correct number of arguments expected by the script (unless you explicitly set the defaultargs option). For example, the badstart4.d script expects three command-line arguments:

    # ./badstart4.d
    dtrace: failed to compile script ./badstart4.d: line 5: macro argument $1 is not defined
    # dtrace -x defaultargs -s badstart4.d
    dtrace: script 'badstart4.d' matched 2 probes
    CPU     ID                    FUNCTION:NAME
      0  36401                       :tick-1sec

If an argument is a string, make sure that you either reference the argument in the script with $$3 (if it is the third argument) or type it on the command line as a quoted string:

    # head badstart5.d
    #!/usr/sbin/dtrace -qs

    BEGIN
    {
        a=$1;
        b=$2;
    }

    tick-1sec
    /execname == $3/
    # ./badstart5.d 1 2 init
    dtrace: failed to compile script ./badstart5.d: line 10: failed to resolve init: Unknown variable name
    # ./badstart5.d 1 2 '"init"'
    ^C
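The other way to handle a string argument is to reference it as $$3 inside the script, so that the caller does not need the extra quoting. The following is a sketch of that variant; the file name strarg.d and the printf() message are assumptions for illustration.

```d
#!/usr/sbin/dtrace -qs

/* strarg.d (hypothetical name): $$3 expands the third argument as a string. */

BEGIN
{
	a = $1;
	b = $2;
}

tick-1sec
/execname == $$3/
{
	printf("%s was on CPU when the tick fired\n", execname);
}
```

A script written this way would be invoked as ./strarg.d 1 2 init, with no embedded double quotes around init.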
Avoid misspelled words, which are a common problem in writing D scripts:

    # ./test1.d
    dtrace: failed to compile script ./test1.d: line 3: probe description syscall::opn:entry does not match any probes

The following script uses an improper probe description:

    # cat test2.d
    #!/usr/sbin/dtrace -s

    syscall
    {
        trace(timestamp);
    }
    # ./test2.d
    dtrace: failed to compile script ./test2.d: line 3: probe description :::syscall does not match any probes

When using the printf() and printa() built-in functions, make sure that the arguments match the format specifiers in type and number:

    # cat -n test3.d
         1  #!/usr/sbin/dtrace -qs
         2
         3  sched:::on-cpu
         4  /pid != $pid && pid != 0/
         5  {
         6      @[curpsinfo->pr_psargs, curcpu->cpu_id] = count();
         7  }
         8
         9  END
        10  {
        11      printf("%-30s %4s %6s\n", "Command", "CPU");
        12      printa("%-30s %4d %@6d\n", @);
        13  }
    # ./test3.d
    dtrace: failed to compile script ./test3.d: line 11: printf( ) prototype mismatch: conversion #3 (%s) is missing a corresponding value argument

    # cat -n test3a.d
         1  #!/usr/sbin/dtrace -qs
         2
         3  sched:::on-cpu
         4  /pid != $pid && pid != 0/
         5  {
         6      @[curpsinfo->pr_psargs, curcpu->cpu_id] = count();
         7  }
         8
         9  END
        10  {
        11      printf("%-30s %4s %6s\n", "Command", "CPU", "Count");
        12      printa("%-30s %4s %@6d\n", @);
        13  }
    # ./test3a.d
    dtrace: failed to compile script ./test3a.d: line 12: printa( ) argument #3 is incompatible with conversion #2 prototype:
            conversion: %s
             prototype: char [] or string (or use stringof)
              argument: processorid_t

    # cat test4.d
    #!/usr/sbin/dtrace -s

    syscall::open:entry
    {
        printf("%s was opening: %s\n", execname, arg0);
    }
    # ./test4.d
    dtrace: failed to compile script ./test4.d: line 5: printf( ) argument #3 is incompatible with conversion #2 prototype:
            conversion: %s
             prototype: char [] or string (or use stringof)
              argument: int64_t
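Combining the two fixes suggested by the test3.d and test3a.d error messages gives a version in which both calls agree with their format strings: printf() receives three string arguments, and printa() uses %d for the integer CPU ID. This is a sketch; the file name test3b.d is hypothetical.

```d
#!/usr/sbin/dtrace -qs

/* test3b.d (hypothetical name): arguments now match the conversions. */

sched:::on-cpu
/pid != $pid && pid != 0/
{
	@[curpsinfo->pr_psargs, curcpu->cpu_id] = count();
}

END
{
	/* Three %s conversions, three string arguments. */
	printf("%-30s %4s %6s\n", "Command", "CPU", "Count");
	/* %s for the string key, %d for the integer key, %@6d for the count. */
	printa("%-30s %4d %@6d\n", @);
}
```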
Remember that pointer arguments to system calls are user addresses, not kernel addresses. You must use the copyinstr() built-in function to retrieve the strings:
    # cat test4a.d
    #!/usr/sbin/dtrace -s

    syscall::open:entry
    {
        printf("%s was opening: %s\n", execname, stringof(arg0));
    }
    # ./test4a.d
    dtrace: script './test4a.d' matched 1 probe
    dtrace: error on enabled probe ID 1 (ID 37: syscall::open:entry): invalid address (0xff3d79d3) in action #2
    dtrace: error on enabled probe ID 1 (ID 37: syscall::open:entry): invalid address (0xff3ed570) in action #2
    dtrace: error on enabled probe ID 1 (ID 37: syscall::open:entry): invalid address (0xff3ef6d0) in action #2
    ^C
    # cat test4b.d
    #!/usr/sbin/dtrace -s

    syscall::open:entry
    {
        printf("%s was opening: %s\n", execname, copyinstr(arg0));
    }
    # ./test4b.d
    dtrace: script './test4b.d' matched 1 probe
    CPU     ID                    FUNCTION:NAME
      0     37                       open:entry ls was opening: /var/ld/ld.config
      0     37                       open:entry ls was opening: /lib/libc.so.1
      0     37                       open:entry ls was opening: /usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
      0     37                       open:entry cat was opening: /var/ld/ld.config
      0     37                       open:entry cat was opening: /lib/libc.so.1
      0     37                       open:entry cat was opening: /usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
    ^C

Numbering of Action Statements

The run-time error shown for test4a.d references action #2 although there is only one action statement. Action statements are numbered as follows: there is one action for each non-printf() expression, and one for each data argument to printf(). Therefore the stringof(arg0) data argument to printf() is action #2.

Avoid enabling probes that generate too much data, causing drops:

    # cat drop.d
    #!/usr/sbin/dtrace -s

    entry
    {
        printf("%s %s %s\n", probeprov, probemod, probefunc);
    }
    # ./drop.d > /tmp/drop.out
    dtrace: script './drop.d' matched 19579 probes
    dtrace: 29569 drops on CPU 0
    dtrace: 903839 drops on CPU 0
    ^Cdtrace: 448991 drops on CPU 0

If you cause any run-time exceptions in your D scripts, such as divide-by-zero, DTrace gives you run-time errors, but continues to run:

    # cat test9.d
    #!/usr/sbin/dtrace -s

    BEGIN
    {
        x = 5*1024*1024;
    }

    tick-3sec
    {
        x = x/(`pagesize-8192);
    }
    # ./test9.d
    dtrace: script './test9.d' matched 2 probes
    dtrace: error on enabled probe ID 2 (ID 36402: profile:::tick-3sec): divide-by-zero in action #1 at DIF offset 20
    dtrace: error on enabled probe ID 2 (ID 36402: profile:::tick-3sec): divide-by-zero in action #1 at DIF offset 20
    ^C
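One way to avoid the divide-by-zero error in test9.d is to test the divisor in a predicate, so that the division only runs when it is safe. The following is a sketch under the assumption that the `pagesize kernel symbol used in test9.d is available on your system; the file name test9a.d and the message text are hypothetical.

```d
#!/usr/sbin/dtrace -s

/* test9a.d (hypothetical name): guard the division with a predicate. */

BEGIN
{
	x = 5*1024*1024;
}

/* Divide only when the divisor is non-zero. */
tick-3sec
/(`pagesize - 8192) != 0/
{
	x = x / (`pagesize - 8192);
}

/* Otherwise, report the condition instead of raising a run-time error. */
tick-3sec
/(`pagesize - 8192) == 0/
{
	printf("divisor is zero; division skipped\n");
}
```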

Appendix A
Actions and Subroutines

You have seen function calls used in D program examples. D function calls allow you to invoke two kinds of services provided by DTrace:
- Actions that trace data or modify state external to DTrace
- Subroutines that only affect internal DTrace state
This appendix formally defines the set of actions and subroutines available in DTrace, along with their syntax and semantics. This appendix enables you to:
- Describe the default action
- Describe and use data recording actions
- Describe and use destructive actions
- Describe and use special actions
- Describe and use subroutines
Default Action
A clause need not contain an action; it may instead consist simply of manipulation of variable state, or of any combination of actions and manipulations of variable state. If a clause contains no actions and no D manipulation (that is, if a clause is empty), the default action is taken. The default action is to trace the enabled probe identifier (EPID) to the principal buffer. The EPID identifies a particular enabling of a particular probe with a particular predicate and actions. From the EPID, DTrace consumers can determine which probe induced the action. Indeed, whenever data is traced, it must be accompanied by the EPID to allow the consumer to make sense of the data; hence the default action is to trace the EPID and nothing else.

Using the default action allows for simple use of the dtrace(1M) command. For example, you can enable all probes in the TS module with the default action by using:

    # dtrace -m TS

(The TS module implements the timesharing scheduling class; see dispadmin(1M) for more information.) The above command results in output similar to the following:

    # dtrace -m TS
    dtrace: description 'TS' matched 93 probes
    CPU     ID                    FUNCTION:NAME
      0  14297                 ts_preempt:entry
      0  14298                ts_preempt:return
      0  14301                   ts_sleep:entry
      0  14302                  ts_sleep:return
      0  14301                   ts_sleep:entry
      0  14302                  ts_sleep:return
      0  14301                   ts_sleep:entry
      0  14302                  ts_sleep:return
      0  14329                  ts_update:entry
      0  14331             ts_update_list:entry
      0  14327         ts_change_priority:entry
      0  14328        ts_change_priority:return
      0  14332            ts_update_list:return
      0  14331             ts_update_list:entry
      0  14332            ts_update_list:return
      0  14331             ts_update_list:entry
    ...
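The default action can also be requested from a D script by leaving a clause empty. The sketch below narrows the TS example to entry probes only; the file name tsentry.d is hypothetical, and the probe description assumes the fbt provider instruments the TS module on your system.

```d
#!/usr/sbin/dtrace -s

/* tsentry.d (hypothetical name): an empty clause takes the default action, */
/* so only the EPID is traced each time one of these probes fires.          */
fbt:TS::entry
{
}
```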

Data Recording Actions

Data recording actions compose the core DTrace actions. Each of these actions records data to the principal buffer by default, but each can also record data to speculative buffers. The descriptions below refer to the buffer where actions are being recorded as the directed buffer.
The void trace(expression) Action

The most basic action is the trace() action, which takes a D expression as its argument and traces the result to the directed buffer. All of the following are valid trace() actions:

    trace(execname);
    trace(curlwpsinfo->pr_pri);
    trace(timestamp / 1000);
    trace(`lbolt);
    trace("somehow managed to get here");
The void tracemem(address, size_t nbytes) Action

A cousin to trace() is the tracemem() action, which takes a D expression as its first argument, address, and a constant as its second argument, nbytes. The tracemem() action copies the memory from the address specified by address into the directed buffer for the length specified by nbytes.
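To show trace() and tracemem() side by side, the following sketch records a string, a scaled timestamp, and a fixed number of raw bytes each time a read system call is entered. The file name rec.d, the choice of curthread as the address, and the 32-byte length are assumptions for illustration only.

```d
#!/usr/sbin/dtrace -s

/* rec.d (hypothetical name): basic data recording actions. */
syscall::read:entry
{
	trace(execname);		/* a string expression */
	trace(timestamp / 1000);	/* an integer expression */
	tracemem(curthread, 32);	/* 32 bytes starting at a kernel address; */
					/* the length must be a constant */
}
```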
The void printf(string format, ...) Action

Like trace(), the printf() action traces D expressions, but printf() allows for elaborate printf(3C)-style formatting. Like printf(3C), the parameters consist of a format string followed by a variable number of arguments. The following action traces a string and an integer argument with appropriate labels:

    printf("execname is %s; priority is %d", execname, curlwpsinfo->pr_pri);

The printf() action tells DTrace to trace the data associated with each argument after the first argument, and then to format the results using the rules described by the first printf() argument, known as a format string. The format string is a regular string that contains any number of format conversions, each beginning with the % character, which describe how to format the corresponding argument. The first conversion in the format string corresponds to the second printf() argument, the second conversion to the third argument, and so on. All of the text between conversions is printed verbatim. The character following the % conversion character describes the format to use for the corresponding argument.

Unlike the printf(3C) function, DTrace printf() is implemented as a built-in function that is recognized by the D compiler. The D compiler provides several useful services for the DTrace printf() action that are not found in the C library printf():
- The D compiler compares the arguments to the conversions in the format string. If an argument's type is incompatible with the format conversion, the D compiler produces an error message explaining the problem.
- The D compiler does not require the use of size prefixes with printf() format conversions. The C printf() routine requires that you indicate the size of arguments by adding prefixes, such as %ld for long or %lld for long long. The D compiler knows the size and type of your arguments, so these prefixes are not required in your D printf() statements.
- DTrace provides additional format characters that are useful for debugging and observability; for example, the %a format conversion can be used to print a pointer as a symbol name and offset.
In order to implement these features, the format string in the DTrace printf() function must be specified as a string constant in your D program; format strings cannot be dynamic variables of type string.
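The compiler services listed above can be seen in a short sketch. The first clause prints 64-bit system call arguments with plain %d and no size prefix; the second uses %a to print the calling address symbolically. The file name fmt.d is hypothetical, and the use of the fbt probe for kmem_alloc() and of the caller built-in variable are assumptions about what your system exposes.

```d
#!/usr/sbin/dtrace -qs

/* fmt.d (hypothetical name): printf() without size prefixes, plus %a. */

syscall::write:entry
{
	/* arg0 and arg2 are 64-bit values, yet %d needs no l or ll prefix. */
	printf("%s (pid %d) wrote %d bytes to fd %d\n",
	    execname, pid, arg2, arg0);
}

fbt::kmem_alloc:entry
{
	/* %a prints the caller's address as a symbol name plus offset. */
	printf("kmem_alloc(%d) called from %a\n", arg0, caller);
}
```

Note that both format strings are string constants, as the compiler requires.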
Conversion Specifications
Each conversion specification in the format string is introduced by the % character, after which the following appear in sequence:
- Zero or more flags (in any order), which modify the meaning of the conversion specification as described in the following subsection.
- An optional minimum field width. If the converted value has fewer bytes than the field width, it is padded with spaces on the left by default, or on the right if the left-adjustment flag (-) is specified. The field width can also be specified as an asterisk (*), in which case the field width is set dynamically based on the value of an additional argument of type int.
- An optional precision that provides one of the following:
    - The minimum number of digits to appear for the d, i, o, u, x, and X conversions (the field is padded with leading zeroes)
    - The number of digits to appear after the radix character for the e, E, and f conversions
    - The maximum number of significant digits for the g and G conversions
    - The maximum number of bytes to be printed from a string by the s conversion
  The precision takes the form of a period (.) followed by either an asterisk (*), as described in the Width and Precision Specifiers subsection, or by a decimal digit string.
- An optional sequence of size prefixes that indicate the size of the corresponding argument (described in the Size Prefixes subsection). The size prefixes are not necessary in D and are provided solely for compatibility with the C printf() function.
- A conversion specifier (described in the Conversion Formats subsection) that indicates the type of conversion to be applied to the argument.

The printf(3C) function also supports conversion specifications of the form %n$ where n is a decimal integer; DTrace printf() does not support this type of conversion specification.
Flag Specifiers
You enable the printf() conversion flags by specifying one or more of the following characters, which can appear in any order:
- (') The integer portion of the result of a decimal conversion (%i, %d, %u, %f, %g, or %G) is formatted with thousands grouping characters using the non-monetary grouping character. Some locales, including the POSIX C locale, do not provide non-monetary grouping characters for use with this flag.
- (-) The result of the conversion is left-justified within the field. The conversion is right-justified if this flag is not specified.
- (+) The result of a signed conversion always begins with a sign (+ or -). If this flag is not specified, the conversion begins with a sign only when a negative value is converted.
- (space) If the first character of a signed conversion is not a sign or if a signed conversion results in no characters, a space is placed before the result. If the space and + flags both appear, the space flag is ignored.
- (#) The value is converted to an alternate form if one is defined for the selected conversion. The alternate formats for conversions are described below in the text corresponding to each conversion.
- (0) For d, i, o, u, x, X, e, E, f, g, and G conversions, leading zeroes (following any indication of sign or base) are used to pad to the field width; no space padding is performed. If the 0 and - flags both appear, the 0 flag is ignored. For d, i, o, u, x, and X conversions, if a precision is specified, the 0 flag is ignored. If the 0 and ' flags both appear, the grouping characters are inserted before the zero padding.
Width and Precision Specifiers

You can specify the minimum field width as a decimal digit string following any flag specifier, as described previously, in which case the field width is set to the specified number of columns. You can also specify the field width as an asterisk (*), in which case an additional argument of type int is accessed to determine the field width. For example, to print an integer x in a field width determined by the value of the int variable w, you write this D statement:

    printf("%*d", w, x);

Additionally, you can specify the field width using a ? character to indicate that the field width should be set based on the number of characters required to format an address in hexadecimal in the data model of the operating system kernel. The width is set to 8 if the kernel is using the 32-bit data model, or to 16 if the kernel is using the 64-bit data model.

The precision for the conversion can be specified as a decimal digit string following a period (.) or by an asterisk (*) following a period. If an asterisk is used to specify the precision, an additional argument of type int prior to the conversion argument is accessed to determine the precision. If both width and precision are specified as asterisks, the arguments to printf() for the conversion should appear in the order: width, precision, value.
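The width and precision rules above can be tried in a BEGIN clause without enabling any other probes. This is a sketch; the file name width.d, the variable names, and the sample values are assumptions, and the brackets are only there to make the padding visible.

```d
#!/usr/sbin/dtrace -qs

/* width.d (hypothetical name): dynamic width, precision, and the ? width. */
BEGIN
{
	w = 8;
	x = 42;

	printf("[%*d]\n", w, x);		/* width taken from w, right-justified */
	printf("[%-*d]\n", w, x);		/* same width, left-justified */
	printf("[%.*s]\n", 3, "abcdef");	/* precision limits the string bytes */
	printf("[%?p]\n", curthread);		/* width sized for a kernel address */
	exit(0);
}
```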

Size Prefixes
Size prefixes are required in ANSI-C programs that use printf(3C) in order to indicate the size and type of the conversion argument. The D compiler performs this processing for your printf() calls automatically, so size prefixes are not required. Although size prefixes are provided for C compatibility, their use is explicitly discouraged in D programs because they tend to bind your code to a particular data model when using derived types. For example, if a typedef is redefined to different integer base types depending on the data model, it is not possible to use a single C conversion that works in both data models without explicitly knowing the two underlying types and including a cast expression, or defining multiple format strings. The D compiler solves this problem by allowing you to omit size prefixes and automatically determining the argument size.

The size prefixes can be placed just before the format conversion name and after any flags, widths, and precision specifiers. The size prefixes are:
- Optional h specifies that a following d, i, o, u, x, or X conversion applies to a short or unsigned short
- Optional l specifies that a following d, i, o, u, x, or X conversion applies to a long or unsigned long
- Optional ll specifies that a following d, i, o, u, x, or X conversion applies to a long long or unsigned long long
- Optional L specifies that a following e, E, f, g, or G conversion applies to a long double
- Optional l specifies that a following c conversion applies to a wint_t argument; an optional l specifies that a following s conversion character applies to a pointer to a wchar_t argument
Conversion Formats
Each conversion character sequence results in fetching zero or more arguments. If you do not provide sufficient arguments for the format string, or if the format string is exhausted and arguments remain, the D compiler issues an error message. If you specify an undefined conversion format, the D compiler issues an error message. The conversion character sequences and their meanings are:


- a: The pointer or uintptr_t argument is printed as a kernel symbol name in the form module`symbol-name plus an optional hexadecimal byte offset. If the value does not fall within the range defined by a known kernel symbol, the value is printed as a hexadecimal integer.
- c: The char, short, or int argument is printed as an ASCII character.
- d: The char, short, int, long, or long long argument is printed as a decimal (base 10) integer. If the argument is signed, it is printed as a signed value. If the argument is unsigned, it is printed as an unsigned value. This conversion has the same meaning as i.
- e, E: The float, double, or long double argument is converted to the style [-]d.ddde±dd, where there is one digit before the radix character (which is non-zero if the argument is non-zero) and the number of digits after it is equal to the precision. If you do not specify the precision, the default precision value is 6. If the precision is 0 and the # flag is not specified, no radix character appears. The E conversion format produces a number with E instead of e introducing the exponent. The exponent always contains at least two digits. The value is rounded up to the appropriate number of digits.
- f: The float, double, or long double argument is converted to the style [-]ddd.ddd, where the number of digits after the radix character is equal to the precision specification. If you do not specify the precision, the default precision value is 6. If the precision is 0 and the # flag is not specified, no radix character appears. If a radix character appears, at least one digit appears before it. The value is rounded up to the appropriate number of digits.
- g, G: The float, double, or long double argument is printed in the style f or e (or in style E in the case of a G conversion character), with the precision specifying the number of significant digits. If an explicit precision is 0, it is taken as 1. The style used depends on the value converted: style e (or E) is used only if the exponent resulting from the conversion is less than -4 or greater than or equal to the precision. Trailing zeroes are removed from the fractional part of the result. A radix character appears only if it is followed by a digit. If the # flag is specified, trailing zeroes are not removed from the result as they normally are.
- i: The char, short, int, long, or long long argument is printed as a decimal (base 10) integer. If the argument is signed, it is printed as a signed value. If the argument is unsigned, it is printed as an unsigned value. This conversion has the same meaning as d.


- o: The char, short, int, long, or long long argument is printed as an unsigned octal (base 8) integer. Arguments that are signed or unsigned can be used with this conversion. If the # flag is specified, the precision of the result is increased if necessary to force the first digit of the result to be a zero.
- p: The pointer or uintptr_t argument is printed as a hexadecimal (base 16) integer. D accepts pointer arguments of any type. If the # flag is specified, a non-zero result has 0x prepended to it.
- s: The argument must be an array of char or a string. Bytes from the array or string are read up to a terminating null character or to the end of the data and are interpreted and printed as ASCII characters. If the precision is not specified, it is taken to be infinite, so all characters up to the first null character are printed. If the precision is specified, only that portion of the character array that displays in the corresponding number of screen columns is printed. If an argument of type char * is to be formatted, it should be cast to string or prefixed with the D stringof operator to indicate that DTrace should trace the bytes of the string and format them.
- u: The char, short, int, long, or long long argument is printed as an unsigned decimal (base 10) integer. Arguments that are signed or unsigned can be used with this conversion, and the result is always formatted as unsigned.
- wc: The int argument is converted to a wide character (wchar_t) and the resulting wide character is printed.
- ws: The argument must be an array of wchar_t. Bytes from the array are read up to a terminating null character or to the end of the data and are interpreted and printed as wide characters. If the precision is not specified, it is taken to be infinite, so all wide characters up to the first null character are printed. If the precision is specified, only that portion of the wide character array that displays in the corresponding number of screen columns is printed.
- x, X: The char, short, int, long, or long long argument is printed as an unsigned hexadecimal (base 16) integer. Arguments that are signed or unsigned can be used with this conversion. If the x form of the conversion is used, the letter digits abcdef are used. If the X form of the conversion is used, the letter digits ABCDEF are used. If the # flag is specified, a non-zero result has 0x (for %x) or 0X (for %X) prepended to it.
- %: Print a literal % character; no argument is converted. The entire conversion specification must be %%.
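A few of the integer, character, and string conversions described above can be exercised from a BEGIN clause. The following is a sketch with arbitrary sample values; the file name conv.d is hypothetical.

```d
#!/usr/sbin/dtrace -qs

/* conv.d (hypothetical name): a sampler of conversion formats. */
BEGIN
{
	printf("%d %i %u\n", -1, -1, 255);	/* signed, signed, unsigned decimal */
	printf("%o %#o\n", 8, 8);		/* octal, then octal with a leading zero */
	printf("%x %X %#x\n", 255, 255, 255);	/* lowercase, uppercase, 0x-prefixed hex */
	printf("%c %s\n", 65, "string");	/* an ASCII character and a string */
	printf("100%%\n");			/* a literal percent sign */
	exit(0);
}
```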

The printa Action

There are two forms of the printa action:
q q
void printa(aggregation) void printa(string format, aggregation)
The printa() action is used to format the results of aggregations in a D program. If the first form of the action is used, the dtrace(1M) command takes a consistent snapshot of the aggregation data and produces output equivalent to the default output format used for aggregations. If the second form of the action is used, the dtrace(1M) command takes a consistent snapshot of the aggregation data and produces output based on the conversions specified in the format string, according to the rules described in the following subsection.
Rules for Specifying Conversions in the format String

The rules for specifying conversions in the format string are as follows:
The format conversions must match the tuple signature used to create the aggregation. Each tuple element can only appear once. For example, suppose you aggregate a count using the following D statements:

@a["hello", 123] = count();
@a["goodbye", 456] = count();

If you then add the D statement printa(format-string, @a) to a probe clause, the dtrace utility snapshots the aggregation data and produces output as if you had entered the following statements, one for each tuple defined in the aggregation:

printf(format-string, "hello", 123);
printf(format-string, "goodbye", 456);
Unlike printf(), the format string you use for printa() need not include all elements of the tuple (that is, you can have a tuple of length 3 and only one format conversion). Therefore you can omit any tuple keys from your printa() output by changing your aggregation declaration to move the ones you want to omit to the end of the tuple and then omitting corresponding conversion specifiers for them from the printa() format string.

The aggregation result itself can be included in the output by using the additional @ format flag character, which is only valid when used with printa(). The @ flag can be combined with any appropriate format conversion specifier, and can appear more than once in a format string. This means that your tuple result can appear anywhere in the output and can appear more than once. The set of conversion specifiers that can be used with each aggregating function is implied by the aggregating function's result type, listed below:
uint64_t avg()
uint64_t count()
int64_t lquantize()
uint64_t max()
uint64_t min()
int64_t quantize()
uint64_t sum()
For example, to format the results of avg(), you can apply the %d, %i, %o, %u, or %x format conversions. The quantize() and lquantize() functions format their results as an ASCII table rather than as a single value.
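The tuple rules above can be sketched with a short script that is not part of the course material. It keys an aggregation on two elements but prints only the first. The aggregation name @calls, the column widths, and the choice of the syscall provider are assumptions made for illustration.

```d
/* Count system calls per process; pid is deliberately the last tuple key. */
syscall:::entry
{
        @calls[execname, pid] = count();
}

END
{
        /*
         * %-20s consumes the first key (execname) and %@8u prints the
         * count() result. The trailing pid key has no conversion, so it
         * is left out of the output, as the rules above allow.
         */
        printa("%-20s %@8u\n", @calls);
}
```

Because pid still distinguishes the tuples, a program name can appear on several output lines, one per process.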
Example of the printa() Action

The following D program shows a complete example of the printa() action, using the profile provider to sample the value of caller and then formatting the results as a simple table:

profile:::profile-997
{
        @a[caller] = count();
}

END
{
        printa("%@8u %a\n", @a);
}

If you use the dtrace command to execute this program, wait a few seconds, and then type Control-C, you see output similar to the following:

# dtrace -s printa.d
^C
CPU     ID                    FUNCTION:NAME
  1      2                             :END        1 0x1
       1 ohci`ohci_handle_root_hub_status_change+0x148
       1 specfs`spec_write+0xe0
       1 0xff14f950
       1 genunix`cyclic_softint+0x588
       1 0xfef2280c
       1 genunix`getf+0xdc
       1 ufs`ufs_icheck+0x50
       1 genunix`infpollinfo+0x80
       1 genunix`kmem_log_enter+0x1e8
       ...
The stack() Action

There are two forms of the stack() action:
void stack(int nframes)
void stack(void)
The stack() action records a kernel stack trace to the directed buffer. The kernel stack is nframes in depth. If you do not provide nframes, the number of stack frames recorded is the number specified by the stackframes option. For example:

# dtrace -n 'uiomove:entry{stack()}'
CPU     ID                    FUNCTION:NAME
  0  12200                    uiomove:entry
              ufs`rdip+0x338
              ufs`ufs_read+0x208
              genunix`vn_rdwr+0x1c0
              elfexec`getelfphdr+0xa4
              elfexec`elf32exec+0x7a0
              genunix`gexec+0x324
              genunix`exec_common+0x278
              genunix`exece+0xc
              unix`syscall_trap32+0xcc
  0  12200                    uiomove:entry
              ufs`ufs_readlink+0x11c
              genunix`pn_getsymlink+0x40
              genunix`lookuppnvp+0x414
              genunix`lookuppnat+0x120
              genunix`resolvepath+0x50
              unix`syscall_trap32+0xcc
...

The stack() action differs from other actions in that it can also be used as a key to an aggregation:

# dtrace -n 'kmem_alloc:entry {@[stack()] = count()}'
dtrace: description 'kmem_alloc:entry ' matched 1 probe
^C

              genunix`installctx+0xc
              genunix`schedctl+0x5c
              unix`syscall_trap+0xac
                1

              genunix`schedctl_shared_alloc+0xc0
              genunix`schedctl+0x18
              unix`syscall_trap+0xac
                1

              unix`lgrp_shm_policy_set+0x168
              genunix`segvn_create+0x82c
              genunix`as_map+0xf0
              genunix`schedctl_map+0x98
              genunix`schedctl_shared_alloc+0x8c
              genunix`schedctl+0x18
              unix`syscall_trap+0xac
                1
...
              sd`xbuf_iostart+0x7c
              ufs`log_roll_write_bufs+0x100
              ufs`log_roll_write+0xe4
              ufs`trans_roll+0x2f8
              unix`thread_start+0x4
               16
The ustack() Action

There are two forms of the ustack() action:
void ustack(int nframes)
void ustack(void)

The ustack() action records a user stack trace to the directed buffer. The user stack is nframes in depth. If you do not specify nframes, the number of stack frames recorded is the number specified by the ustackframes option. Although ustack() can determine the address of the calling frames when the probe fires, the stack frames are not translated into symbols until the ustack() action is processed at user-level by the DTrace consumer. Note that some functions are static and therefore do not have entries in the symbol table; call sites in these functions are displayed with their hexadecimal address. Also, because ustack() symbol translation does not occur until after the data is recorded, there exists a possibility that the process in question has exited, making stack frame translation impossible. In this case, the dtrace utility emits a warning, followed by the hexadecimal stack frames. For example:

dtrace: failed to grab process 100941: no such process
              c7b834d4
              c7bca95d
              c7bca1a4
              c7bd4374
              c7bc2528
              8047efc

Finally, because the postmortem DTrace debugger commands cannot perform the frame translation, using ustack() with a ring buffer policy always results in raw ustack() data.
Example of the ustack() Action

The following D program shows an example of the ustack() action:

syscall::brk:entry
/execname == $1/
{
        @a[ustack(40)] = count();
}

# dtrace -s brk.d '"vi"'
dtrace: script 'brk.d' matched 1 probe
^C

              libc.so.1`_brk_unlocked+0x4
              libc.so.1`sbrk+0x24
              vi`morelines+0x4
              vi`append+0xc4
              vi`vdoappend+0x2c
              vi`fixzero+0x28
              vi`ovbeg+0x30
              vi`vop+0x158
              vi`commands+0x13d0
              vi`main+0xf24
              vi`_start+0x108
                1
...
              libc.so.1`_brk_unlocked+0x4
              libc.so.1`sbrk+0x24
              vi`morelines+0x4
              vi`append+0xc4
              vi`put+0xe4
              vi`vremote+0x64
              vi`vmain+0x1670
              vi`vop+0x25c
              vi`commands+0x13d0
              vi`main+0xf24
              vi`_start+0x108
               35

Destructive Actions
Some actions are destructive in that they change the state of the system. Although they change the system in a well-defined way, they change it nonetheless. You cannot use destructive actions unless you have explicitly enabled them. In the dtrace(1M) command, you enable destructive actions with the -w option. If you attempt to use destructive actions in the dtrace(1M) command without explicitly enabling them, dtrace fails, returning an error message similar to:

dtrace: could not enable tracing: Destructive actions not allowed
Process Destructive Actions

Some destructive actions are destructive only to a process; the system itself remains intact. These actions are available to those with the dtrace_proc or dtrace_user privileges.
The void stop(void) Action

The stop() action forces the process that hit the enabled probe to stop when it next leaves the kernel, as if stopped by a proc(4) action. You can use the prun(1) utility to resume a process that has been stopped by the stop() action. You can use the stop() action to stop a process at any DTrace probe point; this allows you to capture a program in a very particular state (which is difficult to achieve with a simple breakpoint). You can then attach a traditional debugger (such as mdb(1)) to examine the program's state, or use the gcore(1) utility to capture that state in a core file for later analysis.
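As a minimal sketch of this workflow, the following script stops a process at the moment it calls brk(2). The file name stopbrk.d, the choice of probe, and the use of $1 to pass the program name are assumptions made for illustration, not part of the course material.

```d
#pragma D option destructive

/* Stop the process named by the first macro argument when it calls brk(2). */
syscall::brk:entry
/execname == $1/
{
        printf("stopped %s (pid %d); resume it with prun %d\n",
            execname, pid, pid);
        stop();
}
```

You might run it as dtrace -s stopbrk.d '"vi"', then use gcore(1) on the reported process ID to capture its state, and finally prun(1) to let it continue.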
The void raise(int signal) Action

The raise() action sends the specified signal to the currently running process. This is similar to using the kill(1) command to send a process a signal; however, you can use the raise() action to send a signal at a precise point in a process's execution.
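The following sketch shows one way raise() might be used to signal a process at a precise point. The 64-Kbyte threshold, the choice of SIGUSR1, and the use of $1 to pass the target process ID are assumptions made for illustration.

```d
#pragma D option destructive

/* Send SIGUSR1 to process $1 at the moment it issues a large write(2). */
syscall::write:entry
/pid == $1 && arg2 > 65536/
{
        raise(SIGUSR1);
}
```

Note that a process that does not handle SIGUSR1 is terminated by it, so choose a signal that the target is prepared to receive.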
The void copyout(void *buf, uintptr_t addr, size_t nbytes) Action

The copyout() action copies nbytes from the buffer specified by buf to the address specified by addr in the address space of the process associated with the current thread. If the user-space address does not correspond to a valid, faulted-in page in the current address space, an error is generated.
The void copyoutstr(string str, uintptr_t addr, size_t maxlen) Action

The copyoutstr() action copies the string specified by str to the address specified by addr in the address space of the process associated with the current thread. If the user-space address does not correspond to a valid, faulted-in page in the current address space, an error is generated. The string length is limited to the value set by the strsize option.
The void system(string program ...) Action

The system() action causes the program to be executed as if it were given to a shell as input. The program string can contain any of the printf() format conversions with corresponding arguments that follow.
Example of the system() Action

#pragma D option destructive
#pragma D option quiet

proc:::signal-send
/args[2] == SIGINT/
{
        printf("SIGINT sent to %s by ", args[1]->pr_fname);
        system("getent passwd %d | cut -d: -f5", uid);
}

# ./whosend.d
SIGINT sent to run-mozilla.sh by Mary Smith
^C

Kernel Destructive Actions

Some destructive actions are destructive to the entire system. These must be used with extreme care, as they can affect any process on the system (and any other systems dependent upon your network services).
The void breakpoint(void) Action

The breakpoint() action induces a kernel breakpoint, causing the system to stop and control to transfer to the kernel debugger. The kernel debugger then emits a string denoting the DTrace probe that triggered the action. For example, suppose you performed the following action:

# dtrace -w -n clock:entry'{breakpoint()}'
dtrace: description 'clock:entry' matched 1 probe
dtrace: allowing destructive actions

On the Solaris Operating System running on SPARC, you might see the following on the console:

dtrace: breakpoint action at probe fbt:genunix:clock:entry (ecb 30002765700)
Type 'go' to resume
ok

On Solaris running on x86, you might see the following on the console:

dtrace: breakpoint action at probe fbt:genunix:clock:entry (ecb d2b97060)
stopped at      int20+0xb:      ret
kadb[0]:

The address following the probe description is the address of the enabling control block (ECB) within DTrace. You can use it to learn more details about the probe enabling that induced the breakpoint action.

Note that a mistake with the breakpoint() action can cause it to be called far more often than intended. This can in turn prevent you from even terminating the DTrace consumer that is inducing the breakpoint actions. If you find yourself in this situation, set the kernel integer variable dtrace_destructive_disallow to 1. This disallows all destructive actions on the machine. This setting should be used only if you find yourself in this particular situation.

The exact method for setting dtrace_destructive_disallow depends on the kernel debugger that you are using. If you are using OpenBoot PROM on SPARC, follow these steps:

1. Use w! as follows:
   ok 1 dtrace_destructive_disallow w!
   ok
2. Confirm that this has been set using w?:
   ok dtrace_destructive_disallow w?
   1
   ok
3. Continue by using go:
   ok go

If you are using the kadb(1M) debugger on x86, follow these steps:

1. Use the 4-byte write modifier (W) with the / formatting dcmd:
   kadb[0]: dtrace_destructive_disallow/W 1
   dtrace_destructive_disallow:    0x0     =       0x1
   kadb[0]:
2. Continue by entering :c:
   kadb[0]: :c

If you wish to re-enable destructive actions after continuing, you must explicitly reset dtrace_destructive_disallow back to 0. You do this using the mdb(1) debugger:

# echo "dtrace_destructive_disallow/W 0" | mdb -kw
dtrace_destructive_disallow:    0x1     =       0x0
#
The void panic(void) Action

The panic() action induces a kernel panic when triggered. Use this action to force a system crash dump at a time of interest. The panic() action can be used together with ring buffering and postmortem analysis to understand a problem. When you use the panic() action, you see a panic message that denotes the probe inducing the panic. For example:

panic[cpu0]/thread=30001830b80: dtrace: panic action at probe syscall::mmap:entry (ecb 300000acfc8)

000002a10050b840 dtrace:dtrace_probe+518 (fffe, 0, 1830f88, 1830f88, 30002fb8040, 300000acfc8)
  %l0-3: 0000000000000000 00000300030e4d80 0000030003418000 00000300018c0800
  %l4-7: 000002a10050b980 0000000000000500 0000000000000000 0000000000000502
000002a10050ba30 genunix:dtrace_systrace_syscall32+44 (0, 2000, 5, 80000002, 3, 1898400)
  %l0-3: 00000300030de730 0000000002200008 00000000000000e0 000000000184d928
  %l4-7: 00000300030de000 0000000000000730 0000000000000073 0000000000000010

syncing file systems... 2 done
dumping to /dev/dsk/c0t0d0s1, offset 214827008, content: kernel
100% done: 11837 pages dumped, compression ratio 4.66, dump succeeded
rebooting...

In addition, syslogd(1M) emits a message upon reboot:

Jun 10 16:56:31 machine1 savecore: [ID 570001 auth.error] reboot after panic: dtrace: panic action at probe syscall::mmap:entry (ecb 300000acfc8)

The message buffer of the crash dump will also contain the probe and ECB responsible for the panic() action.
The void chill(int nanoseconds) Action

The chill() action causes DTrace to spin for the specified number of nanoseconds. This action is primarily useful for exploring problems that might be timing related. For example, you can use it to open race condition windows, or to bring periodic events into or out of phase with one another.

Because interrupts are disabled while in DTrace probe context, any use of the chill() action induces interrupt latency, scheduling latency, dispatch latency, and so on. The chill() action can, therefore, cause strange systemic effects, and should not be used indiscriminately. Moreover, because the liveness of the system relies on being able to periodically handle interrupts, DTrace refuses to implement the chill() action for longer than 500 milliseconds within any given one-second interval, and instead reports an illegal operation error:

# dtrace -w -n 'syscall::open:entry {chill(500000001)}'
dtrace: description 'syscall::open:entry ' matched 1 probe
dtrace: allowing destructive actions
dtrace: error on enabled probe ID 2 (ID 18022: syscall::open:entry): illegal operation in action #1

The cap is enforced even if the time is spread across multiple calls to chill(), or if the time is spread across multiple DTrace consumers for a single probe.
Special Actions
Some actions do not fall into either the data recording action or the destructive action category. These other special actions fall into one of two sets. The first set contains those actions associated with speculative tracing. The second set contains the exit() action.
Actions Associated With Speculative Tracing

Three actions are associated with speculative tracing:
speculate(int id)
The speculate() action denotes that the remainder of the probe clause should be traced to the speculative buffer specified by id.

commit(int id)
The commit() action commits the speculative buffer associated with id.

discard(int id)
The discard() action discards the speculative buffer associated with id.
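The three actions are normally used together with the speculation() subroutine, which supplies the buffer identifier. The following is a minimal sketch, assuming you want to see the path names of open(2) calls only when the call fails. The thread-local variable self->spec and the choice of probes are assumptions made for illustration.

```d
syscall::open:entry
{
        /* Reserve a speculative buffer and trace the path into it. */
        self->spec = speculation();
        speculate(self->spec);
        printf("%s", copyinstr(arg0));
}

/* The call failed: move the speculative data to the principal buffer. */
syscall::open:return
/self->spec && errno != 0/
{
        commit(self->spec);
        self->spec = 0;
}

/* The call succeeded: throw the speculative data away. */
syscall::open:return
/self->spec && errno == 0/
{
        discard(self->spec);
        self->spec = 0;
}
```

The speculate() call comes before the data recording action in its clause, and the commit() and discard() clauses record no data of their own.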

The void exit(int status) Action

You use the exit() action to immediately stop tracing, and to inform the DTrace consumer that it should cease tracing, perform any final processing, and call exit(3C) with the status specified. Because exit() does return a status to user-level, it is a data-storing action. Unlike other data-storing actions, however, it cannot be speculatively traced. The exit() action causes the DTrace consumer to exit regardless of buffer policy. Note that the data-storing nature of the exit() action means that it can be dropped. When the exit() action is called, only DTrace actions already underway on other CPUs are taken; no subsequent actions are taken on any CPU. The only exception to this is the END probe, which is called after the DTrace consumer has processed the exit() action and has indicated that tracing should stop.
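A common use of exit() is to end a tracing session after a fixed interval. The following sketch assumes a 10-second window and an aggregation named @counts; both are illustrative choices rather than anything prescribed by the course material.

```d
syscall:::entry
{
        @counts[probefunc] = count();
}

/* tick-10sec fires every 10 seconds; the first firing ends tracing. */
tick-10sec
{
        exit(0);
}

/* END still runs after exit(), so final output can be produced here. */
END
{
        printa("%-24s %@8u\n", @counts);
}
```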
Subroutines
Subroutines differ from actions in that they generally only affect internal DTrace state. There is therefore no such thing as a destructive subroutine, and subroutines never trace data into buffers. Many subroutines have analogs in Section 9F or Section 3C of the manual pages; see Intro(9F) and Intro(3), respectively.
The void *alloca(size_t size) Subroutine

The alloca() subroutine allocates size bytes out of scratch space, and returns a pointer to the allocated memory. The returned pointer is guaranteed to have 8-byte alignment. Scratch space is only valid for the duration of a clause; memory allocated with alloca() is deallocated when the clause completes. If insufficient scratch space is available, no memory is allocated and an error is generated.
The string basename(char *str) Subroutine

The basename() subroutine is a D analogue for basename(1); it creates a string that consists of a copy of the specified string, but without any prefix that ends in /. The returned string is allocated out of scratch memory, and is therefore valid only for the duration of the clause. If insufficient scratch space is available, basename() aborts and an error is generated.
The void bcopy(void *src, void *dest, size_t size) Subroutine

The bcopy() subroutine copies the number of bytes specified by the size variable from the memory pointed to by the src variable to the memory pointed to by the dest variable. All of the source memory must lie outside of scratch memory and all of the destination memory must lie within it; if this is not the case, no copying takes place and an error is generated.
The string cleanpath(char *str) Subroutine

The cleanpath() subroutine creates a string that consists of a copy of the path indicated by the str variable, but with certain redundant elements eliminated. In particular, /./ elements in the path are removed, and /../ elements are collapsed. Note that the collapsing of /../ elements is naive in that the parent component is collapsed without regard to symbolic links. As a result, the cleanpath() subroutine might take a valid path and return a shorter, invalid one. For example, if the path specified by str were /foo/../bar, and /foo were a symbolic link to /net/foo/export, then cleanpath() would return the string /bar even though bar might only be in /net/foo, not in /. This limitation is due to the fact that cleanpath() is called in the context of a firing probe, where full symbolic link resolution or arbitrary names are not possible. The returned string is allocated out of scratch memory, and is therefore valid only for the duration of the clause. If insufficient scratch space is available, cleanpath() aborts and an error is generated.

The void *copyin(uintptr_t addr, size_t size) Subroutine

The copyin() subroutine copies the specified size in bytes from the specified user address into a DTrace scratch buffer, and returns the address of this buffer. The user address is interpreted as an address in the space of the process associated with the current thread. The resulting buffer pointer is guaranteed to have 8-byte alignment. The address in question must correspond to a faulted-in page in the current process. If the address does not correspond to a faulted-in page, or if insufficient scratch space is available, NULL is returned, and an error is generated.
The string copyinstr(uintptr_t addr) Subroutine

The copyinstr() subroutine copies a null-terminated C string from the specified user address into a DTrace scratch buffer, and returns the address of this buffer. The user address is interpreted as an address in the space of the process associated with the current thread. The string length is limited to the value set by the strsize option. As with the copyin() subroutine, the specified address must correspond to a faulted-in page in the current process. If the address does not correspond to a faulted-in page, or if insufficient scratch space is available, NULL is returned, and an error is generated.
The void copyinto(uintptr_t addr, size_t size, void *dest) Subroutine

The copyinto() subroutine copies the specified size in bytes from the specified user address into the DTrace scratch buffer specified by the dest variable. The user address is interpreted as an address in the space of the process associated with the current thread. The address in question must correspond to a faulted-in page in the current process. If the address does not correspond to a faulted-in page, or if any of the destination memory lies outside scratch space, no copying takes place, and an error is generated.
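The following sketch combines alloca() and copyinto() to look at the first four bytes of each buffer passed to write(2). The clause-local variable this->word, the 4-byte size, and the use of $1 to pass a process ID are assumptions made for illustration.

```d
/* Peek at the first four bytes of each buffer passed to write(2). */
syscall::write:entry
/pid == $1 && arg2 >= 4/
{
        /* Scratch allocation; released automatically when the clause ends. */
        this->word = (uint32_t *)alloca(sizeof (uint32_t));

        /* Copy from the user address in arg1 into the scratch buffer. */
        copyinto(arg1, sizeof (uint32_t), this->word);

        printf("fd %d: first word 0x%08x\n", arg0, *this->word);
}
```

Because the destination comes from alloca(), it lies within scratch space as copyinto() requires, and the pointer must not be used after the clause completes.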
The string dirname(char *str) Subroutine

The dirname() subroutine is a D analogue for dirname(1); it creates a string that consists of all but the last level of the path name specified by str. The returned string is allocated out of scratch memory, and is therefore valid only for the duration of the clause. If insufficient scratch space is available, dirname() aborts and an error is generated.
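The path-related subroutines can be compared side by side with a short sketch such as the following. The clause-local variable this->path, the output layout, and the use of $1 to pass a process ID are assumptions made for illustration.

```d
/* Show how each path subroutine transforms the argument to open(2). */
syscall::open:entry
/pid == $1/
{
        /* Copy the user-level path string into scratch space first. */
        this->path = copyinstr(arg0);

        printf("path=%s dir=%s base=%s clean=%s\n", this->path,
            dirname(this->path), basename(this->path),
            cleanpath(this->path));
}
```

All four strings live in scratch memory, so they are consumed by printf() within the same clause rather than saved for later use.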
The size_t msgdsize(mblk_t *mp) Subroutine

The msgdsize() subroutine returns the number of bytes in the data message pointed to by the mp variable. See msgdsize(9F) for details. Note that msgdsize() only includes data blocks of type M_DATA in the count.
The size_t msgsize(mblk_t *mp) Subroutine

The msgsize() subroutine returns the number of bytes in the message pointed to by the mp variable. Unlike the msgdsize() subroutine, which returns only the number of data bytes, msgsize() returns the total number of bytes in the message.
The int mutex_owned(kmutex_t *mutex) Subroutine

The mutex_owned() subroutine is an implementation of the mutex_owned(9F) function. The mutex_owned() subroutine returns non-zero if the calling thread currently holds the specified kernel mutex, or zero if the specified adaptive mutex is currently unowned.
The kthread_t *mutex_owner(kmutex_t *mutex) Subroutine

The mutex_owner() subroutine returns the thread pointer of the current owner of the specified adaptive kernel mutex. The mutex_owner() subroutine returns NULL if the specified adaptive mutex is currently unowned, or if the specified mutex is a spin mutex. See mutex_owned(9F).

The int mutex_type_adaptive(kmutex_t *mutex) Subroutine

The mutex_type_adaptive() subroutine returns non-zero if the specified kernel mutex is of type MUTEX_ADAPTIVE, or zero if it is not. Mutexes are adaptive if they are:
Declared statically
Created with an interrupt block cookie of NULL, or
Created with an interrupt block cookie that does not correspond to a high-level interrupt.
See mutex_init(9F) for more details on mutexes. The great majority of mutexes in the Solaris kernel are adaptive.
The int progenyof(pid_t pid) Subroutine

The progenyof() subroutine returns non-zero if the calling process (the process associated with the thread that is currently triggering the matched probe) is among the progeny of the specified process ID.
The int rand(void) Subroutine

The rand() subroutine returns a pseudo-random integer. The number returned is a weak pseudo-random number, and should not be used for any cryptographic application.
The int rw_iswriter(krwlock_t *rwlock) Subroutine

The rw_iswriter() subroutine returns non-zero if the specified reader-writer lock is either held or desired by a writer. If the lock is neither held nor desired by any writers (that is, it is held only by readers and no writer is blocked, or it is not held at all), rw_iswriter() returns zero. Refer to rw_init(9F).
The int rw_write_held(krwlock_t *rwlock) Subroutine

The rw_write_held() subroutine returns non-zero if the specified reader-writer lock is currently held by a writer. If the lock is held only by readers or not held at all, rw_write_held() returns zero. See rw_init(9F).
The int speculation(void) Subroutine

The speculation() subroutine reserves a speculative trace buffer for use with the speculate() action, and returns an identifier for this buffer.
The string strjoin(char *str1, char *str2) Subroutine

The strjoin() subroutine creates a string that consists of the str1 variable concatenated with the str2 variable. The returned string is allocated out of scratch memory, and is therefore valid only for the duration of the clause. If insufficient scratch space is available, strjoin() aborts and an error is generated.
The size_t strlen(string str) Subroutine

The strlen() subroutine returns the length of the specified string in bytes, excluding the terminating null byte.

Appendix B
D Built-in and Macro Variables

This appendix describes and lists:
Built-in variables provided by the D language
Macro variables provided by the D language
Built-in Variables
You have seen a number of special built-in D variables in the example programs, including timestamp, pid, and others. All of these variables are scalar global variables; currently D does not define thread-local variables, clause-local variables, or built-in associative arrays. Table B-1 shows the complete list of D built-in variables.

Table B-1 DTrace Built-in Variables

Type and Name              Description
int64_t arg0, ..., arg9    The first ten input arguments to a probe represented as raw 64-bit integers. If fewer than ten arguments are passed to the current probe, the remaining variables return zero.
args[]                     The typed arguments to the current probe, if any. The args[] array is accessed using an integer index, but each element is defined to be the type corresponding to the given probe argument. For example, if args[] is referenced by a read(2) system call probe, args[0] is of type int, args[1] is of type void *, and args[2] is of type size_t.
uintptr_t caller           The program counter location of the current thread just before entering the current probe.
lwpsinfo_t *curlwpsinfo    The lightweight process (LWP) state of the LWP associated with the current thread. This structure is described in further detail in proc(4).
psinfo_t *curpsinfo        The process state of the process associated with the current thread. This structure is described in further detail in proc(4).
kthread_t *curthread       The address of the operating system kernel's internal data structure for the current thread, the kthread_t structure. The kthread_t is defined in <sys/thread.h>.
string cwd                 The name of the current working directory of the process associated with the current thread.
uint_t epid                The enabled probe ID (EPID) for the current probe. This integer uniquely identifies a particular probe that is enabled with a specific predicate and set of actions.
int errno                  The error value returned by the last system call executed by this thread.
string execname            The name that was passed to exec(2) to execute the current process.
uint_t id                  The probe ID for the current probe. This is the system-wide unique identifier for the probe as published by DTrace and listed in the output of dtrace -l.
uint_t ipl                 The interrupt priority level (IPL) on the current CPU at probe firing time.
pid_t pid                  The process ID of the current process.
string probefunc           The function name portion of the current probe's description.
string probemod            The module name portion of the current probe's description.
string probename           The name portion of the current probe's description.
string probeprov           The provider name portion of the current probe's description.
string root                The name of the root directory of the process associated with the current thread.
uint_t stackdepth          The current thread's stack frame depth at probe firing time.
id_t tid                   The thread ID of the current thread. For threads associated with user processes, this value is equal to the result of a call to pthread_self(3C).
uint64_t timestamp         The current value of a nanosecond timestamp counter. This counter increments from an arbitrary point in the past and should only be used for relative computations.
uint64_t uregs[]           The current thread's saved user-mode register values at probe firing time.
uint64_t vtimestamp        The current value of a nanosecond timestamp counter that is virtualized to the amount of time that the current thread has been running on a CPU, minus the time spent in DTrace predicates and actions. This counter increments from an arbitrary point in the past and should only be used for relative time computations.
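Because timestamp and vtimestamp count from an arbitrary origin, they are useful only as differences. The following sketch contrasts elapsed time with on-CPU time for read(2) calls; the variable and aggregation names are assumptions made for illustration.

```d
syscall::read:entry
{
        /* Remember both counters for this thread at entry. */
        self->start = timestamp;
        self->vstart = vtimestamp;
}

syscall::read:return
/self->start/
{
        /* Wall-clock nanoseconds versus nanoseconds spent on a CPU. */
        @elapsed[execname] = sum(timestamp - self->start);
        @oncpu[execname] = sum(vtimestamp - self->vstart);
        self->start = 0;
        self->vstart = 0;
}
```

A large gap between the two sums for a program suggests that its reads spend most of their time blocked rather than running.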
Macro Variables
The D compiler defines a set of built-in macro variables that you can use when writing D programs or interpreter files. Macro variables are identifiers that are prefixed with a dollar sign ($) and are expanded once by the D compiler when processing your input file. Table B-2 shows the complete list of D macro variables.

Table B-2 D Macro Variables

Name        Description          Reference
$[0-9]+     Macro arguments      See Module 2, Built-in Macro Variables
$egid       Effective group ID   getegid(2)
$euid       Effective user ID    geteuid(2)
$gid        Real group ID        getgid(2)
$pid        Process ID           getpid(2)
$pgid       Process group ID     getpgid(2)
$ppid       Parent process ID    getppid(2)
$projid     Project ID           getprojid(2)
$sid        Session ID           getsid(2)
$taskid     Task ID              gettaskid(2)
$uid        Real user ID         getuid(2)
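As a brief sketch of how macro variables can appear in a predicate, the following clause counts system calls made by your own processes while excluding the dtrace command itself. The comparison against the built-in uid and pid variables is an illustrative choice.

```d
/*
 * $uid and $pid are expanded once at compile time to the real user ID
 * and process ID of the dtrace command that is processing this file.
 */
syscall:::entry
/uid == $uid && pid != $pid/
{
        @[execname] = count();
}
```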

Appendix C
D Operators
This appendix defines and describes the following D operators:
Arithmetic operators
Relational operators
Logical operators
Bitwise operators
Assignment operators
Increment and decrement operators
This appendix also describes conditional expressions.
Arithmetic Operators
D provides the standard arithmetic operators for use in your programs. These operators all have the same meaning as they do in ANSI-C for integer operands. Table C-1 shows the D binary arithmetic operators.

Table C-1 D Binary Arithmetic Operators

Operator    Meaning
+           Integer addition
-           Integer subtraction
*           Integer multiplication
/           Integer division
%           Integer modulus
Arithmetic in D can only be performed on integer operands or on pointers. Arithmetic cannot be performed on floating-point operands in D programs. The DTrace execution environment does not take any action on integer overflow or underflow; you must check for these conditions yourself in situations where they are applicable.

The DTrace execution environment does automatically check for and report division by zero errors resulting from improper use of the / and % operators. If a D program executes an invalid division operation, DTrace automatically disables the affected instrumentation and reports the error to you. Errors detected by DTrace have no effect on other DTrace users or on the operating system kernel, so you do not need to worry about causing any damage if your D program inadvertently contains one of these errors.

In addition to these binary operators, the + and - operators can also be used as unary operators; these have higher precedence than any of the binary arithmetic operators. The order of precedence and associativity properties for all the D operators is summarized at the end of this Appendix. You can control precedence by grouping expressions in parentheses ( ).
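Although DTrace reports division by zero safely, a predicate can keep the error from occurring at all. The following sketch computes a per-thread average read size when a process exits; the thread-local variable names and the choice of the proc:::exit probe are assumptions made for illustration.

```d
syscall::read:return
/arg0 > 0/
{
        /* arg0 is the return value of read(2): bytes actually read. */
        self->bytes += arg0;
        self->reads++;
}

/* The predicate keeps the divisor non-zero, so the / below cannot fault. */
proc:::exit
/self->reads != 0/
{
        printf("%s: %d reads, average %d bytes\n", execname,
            self->reads, self->bytes / self->reads);
}
```

The division is integer division, so any fractional part of the average is discarded. The totals cover only the thread that calls exit, because the variables are thread-local.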
Relational Operators
D provides binary relational operators for use in your programs. These operators all have the same meaning as they do in ANSI-C. Table C-2 shows the D relational operators.

Table C-2 D Relational Operators

Operator    Meaning
<           Left-hand operand is less than right-hand operand
<=          Left-hand operand is less than or equal to right-hand operand
>           Left-hand operand is greater than right-hand operand
>=          Left-hand operand is greater than or equal to right-hand operand
==          Left-hand operand is equal to right-hand operand
!=          Left-hand operand is not equal to right-hand operand
Relational operators are most frequently used to write D predicates. Each operator evaluates to a value of type int, which is equal to 1 if the condition is true, and 0 if it is false.
Relational operators can be applied to pairs of integers, pointers, or strings. If pointers are compared, the result is equivalent to an integer comparison of the two pointers interpreted as unsigned integers. If strings are compared, the result is determined as if by performing a strcmp(3C) on the two operands. Here are some example D string comparisons and their results:
    "coffee" < "espresso"     ... returns 1 (true)
    "coffee" == "coffee"      ... returns 1 (true)
    "coffee" >= "mocha"       ... returns 0 (false)
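The strcmp(3C) rule for string operands can be modeled in C as follows. This is a sketch only: the helper names str_lt, str_eq, and str_ge are hypothetical, and D itself lets you write the relational operators directly on strings.

```c
#include <string.h>

/* Hypothetical helpers that model D's string relational operators.
 * Each returns 1 (true) or 0 (false), matching the int result that
 * a D relational operator produces. */
int str_lt(const char *a, const char *b) { return strcmp(a, b) < 0; }
int str_eq(const char *a, const char *b) { return strcmp(a, b) == 0; }
int str_ge(const char *a, const char *b) { return strcmp(a, b) >= 0; }
```

With these helpers, str_lt("coffee", "espresso") and str_eq("coffee", "coffee") are true, and str_ge("coffee", "mocha") is false, which agrees with the three D comparisons listed above.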
Relational operators can also be used to compare a data object associated with an enumeration type with any of the enumerator tags defined by the enumeration. Enumerations are a facility for creating named integer constants.
Logical Operators
D provides binary logical operators for use in your programs. Table C-3 shows the D logical operators. The first two are equivalent to the corresponding ANSI-C operators.
Table C-3 D Logical Operators
Operator    Meaning
&&          Logical AND: true if both operands are true
||          Logical OR: true if one or both operands are true
^^          Logical XOR: true if exactly one operand is true
Logical operators are most frequently used in writing D predicates. The logical AND operator performs short-circuit evaluation: if the left-hand operand is false, the right-hand expression is not evaluated. The logical OR operator also performs short-circuit evaluation: if the left-hand operand is true, the right-hand expression is not evaluated. The logical XOR operator does not short-circuit: both expression operands are always evaluated.
In addition to the binary logical operators, the unary ! operator can be used to perform a logical negation of a single operand: it converts a zero operand into a 1 and a non-zero operand into a 0. By convention, D programmers use ! when working with integers that are meant to represent Boolean values and == 0 when working with non-Boolean integers, although both expressions are equivalent in meaning.
The logical operators can be applied to operands of integer type or pointer type. The logical operators interpret pointer operands as unsigned integer values. As with all logical and relational operators in D, operands are true if they have a non-zero integer value and false if they have a zero integer value.
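The short-circuit rules and the meaning of ^^ can be illustrated with a C analogue. ANSI-C has no ^^ operator, so the hypothetical helper logical_xor below only models it; the counter calls and the other helper names are likewise assumptions made for this sketch.

```c
/* Models D's ^^ operator: true if exactly one operand is non-zero.
 * Both operands are always evaluated, as the text describes. */
int logical_xor(long a, long b)
{
    return (a != 0) != (b != 0);
}

static int calls;   /* counts how often the right-hand side runs */

static int rhs(int v)
{
    calls++;
    return v;
}

/* Returns how many times the right-hand operand of && was evaluated. */
int and_rhs_evaluations(int lhs)
{
    calls = 0;
    (void)(lhs && rhs(1));
    return calls;
}

/* Returns how many times the right-hand operand of || was evaluated. */
int or_rhs_evaluations(int lhs)
{
    calls = 0;
    (void)(lhs || rhs(1));
    return calls;
}
```

A false left-hand operand skips the right-hand side of &&, and a true left-hand operand skips the right-hand side of ||, so and_rhs_evaluations(0) and or_rhs_evaluations(1) both report zero evaluations.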
Bitwise Operators
D provides binary operators for manipulating individual bits inside of integer operands. These operators all have the same meaning as they do in ANSI-C. Table C-4 shows the D bitwise operators.
Table C-4 D Bitwise Operators
Operator    Meaning
&           Bitwise AND
|           Bitwise OR
^           Bitwise XOR
<<          Shift the left-hand operand left by the number of bits specified by the right-hand operand
>>          Shift the left-hand operand right by the number of bits specified by the right-hand operand
You use the binary & operator to clear bits from an integer operand. You use the binary | operator to set bits in an integer operand. The binary ^ operator returns 1 in each bit position where exactly one of the corresponding operand bits is set.
You use the shift operators to move bits left or right in a given integer operand. Shifting left fills empty bit positions on the right-hand side of the result with zeroes. Shifting right using an unsigned integer operand fills empty bit positions on the left-hand side of the result with zeroes. Shifting right using a signed integer operand (an action known as an arithmetic shift operation) fills empty bit positions on the left-hand side with the value of the sign bit.
Shifting an integer value by a negative number of bits or by a number of bits larger than the number of bits in the left-hand operand itself produces an undefined result. The D compiler produces an error message if it detects this condition when you compile your D program.
In addition to the binary bitwise operators, you can use the unary ~ operator to perform a bitwise negation of a single operand: it converts each 0 bit in the operand into a 1 bit, and each 1 bit in the operand into a 0 bit.
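The common idioms for setting, clearing, and toggling bits can be sketched in C, which the text identifies as having the same operator semantics. The flag values FLAG_READ and FLAG_WRITE and the helper names are hypothetical and exist only for this illustration. The sketch uses unsigned operands throughout, so every right shift fills with zeroes.

```c
#include <stdint.h>

#define FLAG_READ  0x1u   /* hypothetical flag bits, illustration only */
#define FLAG_WRITE 0x2u

/* Set every bit of mask in v using bitwise OR. */
uint32_t set_bits(uint32_t v, uint32_t mask)    { return v | mask; }

/* Clear every bit of mask in v using bitwise AND with the negated mask. */
uint32_t clear_bits(uint32_t v, uint32_t mask)  { return v & ~mask; }

/* Flip every bit of mask in v using bitwise XOR. */
uint32_t toggle_bits(uint32_t v, uint32_t mask) { return v ^ mask; }

/* Right shift of an unsigned operand: vacated high bits become zero.
 * The shift count n must be smaller than 32 to stay well defined. */
uint32_t shr_unsigned(uint32_t v, unsigned n)   { return v >> n; }
```

Note that clearing bits combines & with the unary ~ operator: the mask is negated first so that only the named bits are removed.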
Assignment Operators
D provides the following binary assignment operators for modifying D variables. Remember that you can only modify D variables and arrays: kernel data objects and constants cannot be modified using the D assignment operators. The assignment operators have the same meaning as they do in ANSI-C. Table C-5 shows the D assignment operators.
Table C-5 D Assignment Operators
Operator    Meaning
=           Set the left-hand operand equal to the right-hand expression value
+=          Increment the left-hand operand by the right-hand expression value
-=          Decrement the left-hand operand by the right-hand expression value
*=          Multiply the left-hand operand by the right-hand expression value
/=          Divide the left-hand operand by the right-hand expression value
%=          Modulo the left-hand operand by the right-hand expression value
|=          Bitwise OR the left-hand operand with the right-hand expression value
&=          Bitwise AND the left-hand operand with the right-hand expression value
^=          Bitwise XOR the left-hand operand with the right-hand expression value
<<=         Shift the left-hand operand left by the number of bits specified by the right-hand expression value
>>=         Shift the left-hand operand right by the number of bits specified by the right-hand expression value
With the exception of the assignment operator =, the assignment operators are provided as shorthand for using the = operator with one of the other operators described previously. For example, the expression x = x + 1 is equivalent to the expression x += 1, except that the expression x is evaluated once. These assignment operators obey the same rules for operand types as the binary forms described previously.
The result of any assignment operator is an expression equal to the new value of the left-hand expression. You can use the assignment operators, or any of the operators described so far, in combination to form expressions of arbitrary complexity. You can use parentheses ( ) to group terms in complex expressions.
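The rule that an assignment expression yields the new value of its left-hand operand can be sketched in C, under the stated assumption that the D operators match their ANSI-C counterparts. The function names below are hypothetical.

```c
/* The value of the expression (x += 3) is the new value of x,
 * so it can be assigned to another variable directly. */
long add_assign_result(long start)
{
    long x = start;
    long y = (x += 3);
    return y;
}

/* Several compound assignments applied in sequence to one variable. */
long chained(long start)
{
    long x = start;
    x <<= 2;    /* same as x = x << 2 */
    x |= 1;     /* same as x = x | 1  */
    x %= 10;    /* same as x = x % 10 */
    return x;
}
```

Starting from 3, chained computes 12, then 13, then 3.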
Increment and Decrement Operators

D provides the special unary ++ and -- operators for incrementing and decrementing pointers and integers. These operators have the same meaning as they do in ANSI-C. They can only be applied to variables, and can be applied either before or after the variable name. If the operator appears before the variable name, the variable is first modified and the resulting expression is equal to the new value of the variable. For example, the following two code fragments, shown one per line, produce identical results:
    x += 1; y = x;
    y = ++x;
If the operator appears after the variable name, the variable is modified after its current value is returned for use in the expression. For example, the following two code fragments, shown one per line, produce identical results:
    y = x; x -= 1;
    y = x--;
You can use the increment and decrement operators to create new variables without declaring them. If you omit a variable declaration and apply the increment or decrement operator to a variable, the variable is implicitly declared to be of type int64_t. You can apply the increment and decrement operators to integer or pointer variables. When applied to integer variables, the operators increment or decrement the corresponding value by one. When applied to pointer variables, the operators increment or decrement the pointer address by the size of the data type referenced by the pointer.
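The prefix and postfix forms, and the way a pointer advances by the size of its referenced type, can be modeled in C. This is an analogue only: the function names are hypothetical, and C, unlike D, requires every variable to be declared before use.

```c
#include <stddef.h>
#include <stdint.h>

/* Prefix form: the variable changes first, so y gets the new value. */
int pre_increment(int x)
{
    int y = ++x;
    return y;
}

/* Postfix form: y gets the current value, then the variable changes. */
int post_decrement(int x)
{
    int y = x--;
    return y;
}

/* Returns how many bytes a pointer to int64_t moves when it is
 * incremented once. */
size_t int64_ptr_step(void)
{
    int64_t buf[2] = { 0, 0 };
    int64_t *p = buf;
    char *before = (char *)p;

    p++;
    return (size_t)((char *)p - before);
}
```

The step reported by int64_ptr_step equals sizeof (int64_t), which is 8 bytes, rather than 1.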
Conditional Expressions
Although D does not provide support for if-then-else constructs, it does provide support for simple conditional expressions using the ? and : operators. These operators permit a triplet of expressions to be associated where the first expression is used to conditionally evaluate one of the other two. For example, the following D statement can be used to set a variable x to one of two strings, depending on the value of i:
    x = i == 0 ? "zero" : "non-zero";
In this example, the expression i == 0 is first evaluated to determine if it is true or false. If the first expression is true, the second expression is evaluated and the ?: expression returns its value. If the first expression is false, the third expression is evaluated and the ?: expression returns its value.
As with any D operator, you can use multiple ?: operators in a single expression to create more complex expressions. For example, the following expression takes a char variable c containing one of the characters 0-9, a-z, or A-Z and returns the value of this character when interpreted as a digit in a hexadecimal (base 16) integer:
    hexval = (c >= '0' && c <= '9') ? c - '0' :
        (c >= 'a' && c <= 'z') ? c + 10 - 'a' : c + 10 - 'A';
The first expression used with ?: must be a pointer or integer in order to be evaluated for its truth value. The second and third expressions can be of any compatible types. You cannot construct a conditional expression in which, for example, one path returns a string and another an integer. The second and third expressions also cannot invoke a tracing function, such as trace() or printf(). If you want to trace data conditionally, you should use a predicate instead.
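Both conditional expressions from this section can be checked with a C analogue, because the ?: operator is written the same way in ANSI-C. Wrapping the expressions in the functions label and hexval is an assumption made here so that they can be called with sample inputs.

```c
/* Selects one of two strings, as in the first example. */
const char *label(int i)
{
    return i == 0 ? "zero" : "non-zero";
}

/* Nested ?: expression from the second example. It assumes that c is
 * one of the characters 0-9, a-z, or A-Z, as the text requires. */
int hexval(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' :
           (c >= 'a' && c <= 'z') ? c + 10 - 'a' :
                                    c + 10 - 'A';
}
```

The nested form reads as a chain: the second test runs only when the first is false, and the final expression is the fallback when both tests fail.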