Student Guide
Sun Microsystems, Inc. UBRM05-104 500 Eldorado Blvd. Broomfield, CO 80021 U.S.A. Revision A
Copyright 2005 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Sun, Sun Microsystems, the Sun logo, Solaris, and OpenBoot are trademarks or registered trademarks of Sun Microsystems, Inc., in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc., in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. Federal Acquisitions: Commercial Software - Government Users Subject to Standard License Terms and Conditions. Export Laws. Products, Services, and technical data delivered by Sun may be subject to U.S. export controls or the trade laws of other countries. You will comply with all such laws and obtain all licenses to export, re-export, or import as may be required after delivery to You. You will not export or re-export to entities on the most current U.S. export exclusions lists or to any country subject to U.S. embargo or terrorist controls as specified in the U.S. export laws. You will not use or provide Products, Services, or technical data for nuclear, missile, or chemical biological weaponry end uses.
DOCUMENTATION IS PROVIDED AS IS AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS, AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. THIS MANUAL IS DESIGNED TO SUPPORT AN INSTRUCTOR-LED TRAINING (ILT) COURSE AND IS INTENDED TO BE USED FOR REFERENCE PURPOSES IN CONJUNCTION WITH THE ILT COURSE. THE MANUAL IS NOT A STANDALONE TRAINING TOOL. USE OF THE MANUAL FOR SELF-STUDY WITHOUT CLASS ATTENDANCE IS NOT RECOMMENDED. Export Control Classification Number EAR99 assigned: 10 September 2004
Table of Contents
About This Course
  Course Goals
  Topics Not Covered
  How Prepared Are You?
  Introductions
  How to Use Course Materials
  Conventions
    Typographical Conventions
DTrace Fundamentals
  Objectives
  Relevance
  Additional Resources
  DTrace Features
    Transient Failures
    Debugging Transient Failures
    DTrace Capabilities
  DTrace Architecture
    Probes and Probe Providers
    DTrace Components
  DTrace Tour
    Listing Probes
    Writing D Scripts
Using DTrace
  Objectives
  Relevance
  Additional Resources
  DTrace Performance Monitoring Capabilities
    Features of the DTrace Performance Monitoring Capabilities
    Aggregations
  Examining Performance Problems Using the vminfo Provider
    The vminfo Probes
    Finding the Source of Page Faults Using vminfo Probes
  Examining Performance Problems Using the sysinfo Provider
    The sysinfo Probes
    Using the quantize Aggregation Function With the sysinfo Probes
    Finding the Source of Cross-Calls
  Examining Performance Problems Using the io Provider
    The io Probes
    Information Available When io Probes Fire
    Finding I/O Problems
  Obtaining System Call Information
    The syscall Provider
    D Language Variables
    Associative Arrays
    Thread-Local Variables
    Timing a System Call
    Following a System Call
  Creating D Scripts That Use Arguments
    Built-in Macro Variables
    PID Argument Example
    Executable Name Argument Example
    Custom Monitoring Tools
Debugging Applications With DTrace
  Objectives
  Relevance
  Additional Resources
  Application Profiling
    The pid Provider
    The profile Provider
  Application Variables
    Displaying Process Global Variables
    Displaying Library Global Variables
  The plockstat Provider
  Transient System Call Errors
    User Stack Traces on System Call Failures
    Processes Using a Lot of System Time
  Open Files
    Accessing System Call Pointer Arguments
    Displaying Names of Files Being Opened
Finding System Problems With DTrace
  Objectives
  Relevance
  Additional Resources
  Accessing Kernel Variables
    Using the D Language to Access Kernel Symbols
    Monitoring Kernel Variables
    Accessing Kernel Data Structures
    Accessing Lock Contention Information
    The proc Provider and the system() Function
  Displaying Read Call Information
    Tracing Read Calls System-Wide
    Tracing Read Calls Using the iosnoop.d D Script
    Aggregating Read Data
  Using the Anonymous Tracing Facility
    Creating an Anonymous Enabling
    Performing Anonymous Tracing
  Using the Speculative Tracing Facility
    Speculative Tracing Functions
    Speculative Tracing Example
    Application Debugging With Speculative Tracing
  DTrace Privileges
    Using the Least Privilege Facility
    Kernel-Destructive Actions
    Setting DTrace User Privileges
    Setting DTrace Process Privileges
    Summarizing the DTrace Privilege Levels
Troubleshooting DTrace Problems
  Objectives
  Relevance
  Additional Resources
  Minimizing DTrace Performance Impact
    Limiting Enabled Probes
    Using Aggregations
    Using Cacheable Predicates
  Using and Tuning DTrace Buffers
    Principal Buffers
    Principal Buffer Policies
    DTrace Option Settings
    The switch Buffer Policy
    The fill Buffer Policy
    The ring Buffer Policy
    Other Buffers
    Buffer Resizing Policy
  Debugging DTrace Scripts
    Avoiding Syntax Errors in D Scripts
    Avoiding Run-Time Errors in D Scripts
Actions and Subroutines
  Default Action
  Data Recording Actions
    The void trace(expression) Action
    The void tracemem(address, size_t nbytes) Action
    The void printf(string format, ...) Action
    The printa Action
    The stack() Action
    The ustack() Action
  Destructive Actions
    Process Destructive Actions
    Kernel Destructive Actions
  Special Actions
    Actions Associated With Speculative Tracing
    The void exit(int status) Action
  Subroutines
    The void *alloca(size_t size) Subroutine
    The string basename(char *str) Subroutine
    The void bcopy(void *src, void *dest, size_t size) Subroutine
    The string cleanpath(char *str) Subroutine
    The void *copyin(uintptr_t addr, size_t size) Subroutine
    The string copyinstr(uintptr_t addr) Subroutine
    The string dirname(char *str) Subroutine
    The size_t msgdsize(mblk_t *mp) Subroutine
    The size_t msgsize(mblk_t *mp) Subroutine
    The int mutex_owned(kmutex_t *mutex) Subroutine
    The kthread_t *mutex_owner(kmutex_t *mutex) Subroutine
    The int mutex_type_adaptive(kmutex_t *mutex) Subroutine
    The int progenyof(pid_t pid) Subroutine
    The int rand(void) Subroutine
    The int rw_iswriter(krwlock_t *rwlock) Subroutine
    The int rw_write_held(krwlock_t *rwlock) Subroutine
    The int speculation(void) Subroutine
    The string strjoin(char *str1, char *str2) Subroutine
    The size_t strlen(string str) Subroutine
D Built-in and Macro Variables
  Built-in Variables
  Macro Variables
D Operators
  Arithmetic Operators
  Relational Operators
  Logical Operators
  Bitwise Operators
  Assignment Operators
  Increment and Decrement Operators
  Conditional Expressions
Copyright 2005 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Preface
About This Course
Course Goals
Upon completion of this course, you should be able to:
- Describe the features and architecture of the Solaris Dynamic Tracing (DTrace) facility
- Use the DTrace facility to find the source of intermittent problems
- Use DTrace to help debug applications
- Use DTrace to look at the cause of performance problems
- Troubleshoot DTrace script problems
Course Map
The following course map enables you to see what you have accomplished and where you are going in reference to the course goals.
[Figure: course map. Recoverable labels: Using DTrace; Troubleshooting DTrace; Troubleshooting DTrace Problems.]
How Prepared Are You?
- Do you have some previous programming experience?
- Can you use the truss command to diagnose application problems?
- Do you know the basics of the kernel structure?
- Are you familiar with basic troubleshooting concepts?
Introductions
Now that you have been introduced to the course, introduce yourself to the other students and the instructor, addressing the following items:
- Name
- Company affiliation
- Title, function, and job responsibility
- Experience related to topics presented in this course
- Reasons for enrolling in this course
- Expectations for this course
How to Use Course Materials
- Goals: You should be able to accomplish the goals after finishing this course and meeting all of its objectives.
- Objectives: You should be able to accomplish the objectives after completing a portion of instructional content. Objectives support goals and can support other higher-level objectives.
- Lecture: The instructor presents information specific to the objective of the module. This information helps you learn the knowledge and skills necessary to succeed with the activities.
- Activities: The activities take various forms, such as review questions, labs, discussion, and demonstration. Activities help facilitate the mastery of an objective.
- Visual aids: The instructor might use several visual aids to convey a concept, such as a process, in a visual form. Visual aids commonly contain graphics, animation, and video.
Conventions
The following conventions are used in this course to represent various training elements and alternative learning resources.
Icons
Additional resources: Indicates other references that provide additional information on the topics described in the module.
Discussion: Indicates that a small-group or class discussion on the current topic is recommended at this time.
Note: Indicates additional information that can help students but is not crucial to their understanding of the concept being described. Students should be able to understand the concept or complete the task without this information. Examples of notational information include keyword shortcuts and minor system adjustments.
Caution: Indicates that there is a risk of personal injury from a nonelectrical hazard, or risk of irreversible damage to data, software, or the operating system. A caution indicates the possibility of a hazard (as opposed to a certainty), depending on the action of the user.
Warning: Indicates that either personal injury or irreversible damage of data, software, or the operating system will occur if the user performs this action. A warning does not indicate potential events; if the action is performed, catastrophic events will occur.
Typographical Conventions
Courier is used for the names of commands, files, directories, programming code, and on-screen computer output; for example:
  Use ls -al to list all files.
  system% You have mail.
Courier is also used to indicate programming constructs, such as class names, methods, and keywords; for example:
  The getServletInfo method is used to get author information.
  The java.awt.Dialog class contains Dialog constructor.
Courier bold is used for characters and numbers that you type; for example:
  To list the files in this directory, type:
  # ls
Courier bold is also used for each line of programming code that is referenced in a textual description; for example:
  1 import java.io.*;
  2 import javax.servlet.*;
  3 import javax.servlet.http.*;
  Notice the javax.servlet interface is imported to allow access to its life cycle methods (Line 2).
Courier italics is used for variables and command-line placeholders that are replaced with a real name or value; for example:
  To delete a file, use the rm filename command.
Courier italic bold is used to represent variables whose values are to be entered by the student as part of an activity; for example:
  Type chmod a+rwx filename to grant read, write, and execute rights for filename to world, group, and users.
Palatino italics is used for book titles, new words or terms, or words that you want to emphasize; for example:
  Read Chapter 6 in the User's Guide.
  These are called class options.
Module 1
DTrace Fundamentals
Objectives
Upon completion of this module, you should be able to:
- Describe the features of the Solaris Dynamic Tracing (DTrace) facility
- Describe the DTrace architecture
- List and enable probes, and create action statements and D scripts
Relevance
Discussion: The following questions are relevant to understanding DTrace:
- Would the ability to turn on trace points for any one of the majority of functions in the kernel be beneficial?
- Would it be useful to know who is issuing kill(2) system calls?
Additional Resources
Additional resources: The following references provide additional information on the topics described in this module:
- Sun Microsystems, Inc. Solaris Dynamic Tracing Guide, part number 817-6223-10.
- The /usr/demo/dtrace directory contains all of the sample scripts from the Solaris Dynamic Tracing Guide.
- Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. "Dynamic Instrumentation of Production Systems." Paper presented at 2004 USENIX Conference.
- BigAdmin System Administration Portal [http://www.sun.com/bigadmin/content/dtrace].
- The dtrace(1M) manual page.
DTrace Features
DTrace is a comprehensive dynamic tracing facility that is bundled into the Solaris 10 Operating System (Solaris 10 OS). It is intended for use by system administrators, service support personnel, kernel developers, application program developers, and users who are given explicit access permission to the DTrace facility.
DTrace has the following features:
- Enables dynamic modification of the system to record arbitrary data
- Promotes tracing on live systems
- Is completely safe; its use cannot induce fatal failure
- Allows tracing of both the kernel program and user-level programs
- Functions with low overhead when tracing is enabled and zero overhead when tracing is not being performed
Transient Failures
DTrace provides answers to the causes of transient failures. A transient failure is any unacceptable behavior that does not result in fatal failure of the system. You might have a clear, specific failure, such as:
- read(2) is returning EIO errno values on a device that is not reporting any errors.
- An application occasionally does not receive its expected timer signal.
- A thread is missing a condition variable wakeup.
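As an illustration of the first failure above, a D script along the following lines could report which processes see EIO from read(2). This is a minimal sketch and not part of the original course text. It assumes the syscall provider and the errno built-in variable, and it uses the numeric value 5 for EIO as defined in <sys/errno.h>.

```d
#pragma D option quiet

/* Fire when any read(2) call returns; keep only calls that failed with EIO (5). */
syscall::read:return
/errno == 5/
{
    printf("%s (pid %d): read(2) returned EIO\n", execname, pid);
    ustack();   /* user-level stack of the thread that made the call */
}
```

Saved in a file (the name eio.d is hypothetical), such a script would be run with dtrace -s eio.d and stopped with Control-C.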
The transient failure can be based on your own definition of unacceptable system operation:
- "We were expecting to accommodate 100 users per CPU, but we cannot support more than 60 users per CPU."
- "Why does system time go way up when I run application X?"
- "Every morning between 9:30 a.m. and 10:00 a.m. the system performs poorly."
In these situations, you must understand the problem and either eliminate the performance inhibitors or reset your expectations. Eliminating the performance inhibitors could involve:
- Adding more resources, such as memory or central processing units (CPUs)
- Reconfiguring existing resources, for example, tuning parameters or rewriting software
- Lessening the load
Debugging Transient Failures
- It requires inducing fatal failure, which nearly always results in more downtime than the transient failure.
- It requires solving a dynamic problem from a static snapshot of the system's state.

- Running the instrumented binaries in production
- Reproducing a transient problem in a development environment
Such invasive techniques are undesirable because they are slow, error-prone, and often ineffective. Relying on the existing static TNF trace points found in the kernel, which you can enable with the prex(1) command, is also unsatisfactory. The number of TNF trace points in the kernel is limited and the overhead is substantial.
DTrace Capabilities
The DTrace framework allows you to enable tens of thousands of tracing points called probes. When these instrumentation points are hit, you can display arbitrary data in the kernel (or user process). An example of a probe provided by the DTrace framework is entry into any kernel function. Information that you can display when this probe fires includes:
- Any argument to the function
- Any global variable in the kernel
- A nanosecond timestamp of when the function was called
- A stack trace to indicate what code called this function
- The process that was running when the function was called
- The thread that made the call to this function
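Most of the items in this list map onto D built-in variables and actions. The following sketch is not taken from the course text. It assumes the fbt provider and picks the kernel function kmem_alloc purely as an example target; another kernel function name can be substituted. Kernel global variables are omitted here.

```d
#pragma D option quiet

/* kmem_alloc is only an example target chosen for this sketch. */
fbt::kmem_alloc:entry
{
    printf("called at %d ns\n", timestamp);     /* nanosecond timestamp */
    printf("first argument: %d\n", arg0);       /* an argument to the function */
    printf("process %s (pid %d), thread %d\n", execname, pid, tid);
    stack();                                    /* kernel code path that led here */
    exit(0);                                    /* request that tracing stop */
}
```

The exit(0) action keeps the output short. Without it, a frequently called function would produce a report on every call.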
Using DTrace, you can explore all aspects of the Solaris 10 OS to:
- Understand how the software works
- Determine the root cause of performance problems
- Examine all layers of software sequentially from the user level to the kernel
- Track down the source of aberrant behavior
DTrace comes with powerful data management primitives to eliminate the need for postprocessing of gathered data. Unwanted data is pruned as close to the source as possible to avoid the overhead of generating and later filtering unwanted data. DTrace also provides a mechanism to trace during boot and to retrieve all traced data from a kernel crash dump.
DTrace Architecture
DTrace helps you understand a software system by enabling you to dynamically modify the operating system kernel and user processes to record additional data that you specify at locations of interest called probes.
DTrace Components
DTrace has the following components: probes, providers, consumers, and the D programming language. The entire DTrace framework resides in the kernel. Consumer programs access the DTrace framework through a well-defined application programming interface (API).
Probes
A probe has the following attributes:
- It is made available by a provider.
- It identifies the module and function that it instruments.
- It has a name.
These four attributes (provider, module, function, and name) define a 4-tuple that uniquely identifies each probe:
provider:module:function:name
In addition, DTrace assigns a unique integer identifier to each probe.
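The four fields and the integer identifier are visible from inside a D program through built-in variables. The sketch below is an illustration rather than course material. It assumes the BEGIN probe of the dtrace provider and the built-in variables probeprov, probemod, probefunc, probename, and id.

```d
#pragma D option quiet

/* BEGIN belongs to the dtrace provider; its module and function fields are empty. */
dtrace:::BEGIN
{
    printf("%s:%s:%s:%s (id %d)\n",
        probeprov, probemod, probefunc, probename, id);
    exit(0);
}
```

In a probe description, a field that is left empty matches any value, which is why dtrace:::BEGIN can be written without naming a module or function.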
Providers
A provider represents a methodology for instrumenting the system. Providers make probes available to the DTrace framework. A provider receives information from DTrace regarding when a probe is to be enabled and transfers control to DTrace when an enabled probe is hit. DTrace offers the following providers:
- The function boundary tracing (fbt) provider can dynamically trace the entry and return of every function in the kernel.
- The syscall provider can dynamically trace the entry and return of every Solaris system call.
- The lockstat provider can dynamically trace the kernel synchronization primitives to observe lock contention and hold times.
- The plockstat provider makes probes available for user-level synchronization primitives including lock contention and hold times.
- The sched provider can dynamically trace key scheduling events.
- The profile provider enables you to add a configurable-rate timer interrupt to the system.
- The dtrace provider enables pre-processing and post-processing (as well as D program error-processing) capabilities.
- The pid provider enables function boundary tracing within a process as well as tracing of any instruction in the virtual address space of the process.
- The statically defined tracing (sdt) provider creates probes at sites a programmer has explicitly designated in their own application.
- The vminfo provider makes available probes that correspond to the kernel's virtual memory statistics.
- The sysinfo provider makes available probes that correspond to the kernel's sys statistics.
- The proc provider makes available probes that pertain to process and thread creation and termination as well as signals.
- The mib provider makes available probes that correspond to counters in the Solaris management information bases (MIBs), which are used by the simple network management protocol (SNMP).
- The io provider makes available probes giving details related to disk input and output (I/O).
- The fpuinfo provider makes available probes that correspond to the simulation of floating point instructions on SPARC-based microprocessors.
Note: You should check the Solaris Dynamic Tracing Guide, part number 817-6223, regularly for the addition of any new DTrace providers.
Consumers
A DTrace consumer is a process that interacts with DTrace. There is one main DTrace consumer called dtrace(1M). It acts as a generic front-end to the DTrace facility. Most other consumers are rewrites of previously existing utilities such as lockstat(1M). There is no limit on the number of concurrent consumers. That is, many users can simultaneously run the dtrace(1M) command. DTrace handles the multiplexing.
DTrace Fundamentals
Copyright 2005 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
D Programming Language
The D programming language enables you to specify probes of interest and bind actions to those probes. To do this, you construct scripts called D scripts. D scripts are structured much like the pattern-action pairs of awk(1). The D programming language also borrows heavily from the C programming language. Even if you have no experience with the C programming language or with awk(1), D programs are fairly easy to write and understand. Features of the D language include the following:
• Enables complete access to kernel C types, such as vnode_t
• Provides complete access to kernel static and global variables
• Provides complete support for American National Standards Institute (ANSI)-C operators
• Supports strings as a built-in type (unlike C, which uses the ambiguous char * or char[] types)
Architecture Summary
To summarize, the DTrace facility consists of user-level consumer programs such as dtrace(1M), providers packaged as kernel modules, and a library interface for the consumer programs to access the DTrace facility through the dtrace(7D) kernel driver.
Figure 1-1 DTrace Architecture
[Figure: D scripts (a.d, b.d) feed the DTrace consumers dtrace(1M) and lockstat(1M) in userland. The consumers call libdtrace(3LIB), which reaches the kernel through dtrace(7D). Inside the kernel, the DTrace framework sits above the DTrace providers: sysinfo, syscall, vminfo, fbt, io, profile, sched, and others.]
DTrace Tour
In this section you tour the DTrace facility and learn to perform the following tasks:
• List probes, including the following:
    • Probes associated with a particular function
    • Probes associated with a particular module
    • Probes with a specific name
    • All probes from a specific provider
• Explain how to enable probes
• Explain default probe output
• Describe action statements
• Create a simple D script
Listing Probes
You can list all DTrace probes with the -l option of the dtrace(1M) command:
# dtrace -l
   ID   PROVIDER   MODULE   FUNCTION   NAME
    1   dtrace                         BEGIN
    2   dtrace                         END
    3   dtrace                         ERROR
    4   syscall             nosys      entry
    5   syscall             nosys      return
    6   syscall             rexit      entry
    7   syscall             rexit      return
    8   syscall             forkall    entry
    9   syscall             forkall    return
   10   syscall             read       entry
   11   syscall             read       return
   12   syscall             write      entry
   13   syscall             write      return
   14   syscall             open       entry
   15   syscall             open       return
...
You can use an additional option to list specific probes, as follows:
• Probes in a specific module (-m option):
# dtrace -l -m sd
   ID   PROVIDER
17147   fbt
17148   fbt
17149   fbt
17150   fbt
17151   fbt
17152   fbt
...
• Probes from a specific provider (-P option):
# dtrace -l -P lockstat
   ID   PROVIDER
  469   lockstat
  470   lockstat
  471   lockstat
  472   lockstat
  473   lockstat
  474   lockstat
...
• Probes in a specific function (-f option):
# dtrace -l -f read
   ID   PROVIDER
   10   syscall
   11   syscall
 4036   sysinfo
 4040   sysinfo
 7885   fbt
 7886   fbt
The previous output shows that for each probe, the following is displayed:
• The probe's uniquely assigned probe ID (the probe ID is unique only within a given release or patch level of Solaris)
• The provider name
• The module name (if applicable)
• The function name (if applicable)
• The probe name

A probe description is a 4-tuple of the following form:
provider:module:function:name
Empty components match anything. For example, fbt::alloc:entry specifies a probe with the following attributes:
• From the fbt provider
• In any module
• In the alloc function
• Named entry
Elements of the 4-tuple can be left off from the left-hand side. For example, open:entry matches probes from all providers and kernel modules that have a function name of open and a probe name of entry:
# dtrace -l -n open:entry
   ID   PROVIDER   MODULE    FUNCTION   NAME
   14   syscall              open       entry
 7386   fbt        genunix   open       entry
Probe descriptions also support a pattern matching syntax similar to the shell File Name Generation syntax described in sh(1). The special characters *, ?, and [ ] are all supported. For example, the syscall::open*:entry probe description matches both the open and open64 system calls. The ? character represents any single character in the name, and the [ ] characters let you specify a choice of specific characters in the name.
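The matching rules above can be modeled outside of DTrace. The following Python sketch is not DTrace code and does not call any DTrace interface; it assumes that Python's fnmatch patterns are close enough to the sh(1) syntax for illustration. The function names and the PROBES list are hypothetical.

```python
from fnmatch import fnmatchcase

FIELDS = ("provider", "module", "function", "name")


def parse_description(desc):
    """Split a probe description into its four fields, aligned to the right."""
    parts = desc.split(":")
    if len(parts) > len(FIELDS):
        raise ValueError("too many fields in probe description: %r" % desc)
    # Fields omitted on the left become empty strings, which match anything.
    parts = [""] * (len(FIELDS) - len(parts)) + parts
    return dict(zip(FIELDS, parts))


def matches(desc, probe):
    """Return True if probe (a dict with the four fields) matches desc."""
    pattern = parse_description(desc)
    return all(
        pattern[field] == "" or fnmatchcase(probe[field], pattern[field])
        for field in FIELDS
    )


# Hypothetical probe list; on a real system the list comes from "dtrace -l".
PROBES = [
    {"provider": "syscall", "module": "", "function": "open", "name": "entry"},
    {"provider": "syscall", "module": "", "function": "open64", "name": "entry"},
    {"provider": "fbt", "module": "genunix", "function": "open", "name": "entry"},
    {"provider": "syscall", "module": "", "function": "read", "name": "return"},
]

if __name__ == "__main__":
    for desc in ("open:entry", "syscall::open*:entry", "fbt::alloc:entry"):
        hits = [p for p in PROBES if matches(desc, p)]
        print(desc, "->", len(hits), "probe(s)")
```

In this model, open:entry leaves the provider and module fields empty, so it matches both the syscall and the fbt probe for open, which mirrors the dtrace -l -n open:entry listing shown earlier.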
Enabling Probes
Probes are enabled with the dtrace(1M) command by specifying them without the -l option. When enabled in this way, DTrace performs the default action when the probe fires. The default action indicates only that the probe fired. No other data is recorded. For example, the following code example enables every probe in the sd module:
# dtrace -m sd
CPU     ID  FUNCTION:NAME
  0  17329  sd_media_watch_cb:entry
  0  17330  sd_media_watch_cb:return
  0  17167  sdinfo:entry
  0  17168  sdinfo:return
  0  17151  sdstrategy:entry
  0  17152  sdstrategy:return
  0  17661  ddi_xbuf_qstrategy:entry
  0  17662  ddi_xbuf_qstrategy:return
  0  17649  xbuf_iostart:entry
  0  17341  sd_xbuf_strategy:entry
  0  17385  sd_xbuf_init:entry
  0  17386  sd_xbuf_init:return
  0  17342  sd_xbuf_strategy:return
  0  17177  sd_mapblockaddr_iostart:entry
  0  17178  sd_mapblockaddr_iostart:return
  0  17179  sd_pm_iostart:entry
  0  17365  sd_pm_entry:entry
  0  17366  sd_pm_entry:return
  0  17180  sd_pm_iostart:return
  0  17181  sd_core_iostart:entry
  0  17407  sd_add_buf_to_waitq:entry
...
As you can see from the output, the default action displays the CPU where the probe fired, the DTrace-assigned probe ID, the function where the probe fired, and the probe name.
To enable probes provided by the syscall provider:
# dtrace -P syscall
dtrace: description 'syscall' matched 452 probes
CPU     ID  FUNCTION:NAME
  0     99  ioctl:return
  0     98  ioctl:entry
  0     99  ioctl:return
  0     98  ioctl:entry
  0     99  ioctl:return
  0    234  sysconfig:entry
  0    235  sysconfig:return
  0    234  sysconfig:entry
  0    235  sysconfig:return
  0    168  sigaction:entry
  0    169  sigaction:return
  0    168  sigaction:entry
  0    169  sigaction:return
  0     98  ioctl:entry
  0     99  ioctl:return
  0    234  sysconfig:entry
  0    235  sysconfig:return
  0     38  brk:entry
  0     39  brk:return
...
To enable probes named zfod:
# dtrace -n zfod
dtrace: description 'zfod' matched 3 probes
CPU     ID  FUNCTION:NAME
  0   4080  anon_zero:zfod
  0   4080  anon_zero:zfod
^C
To enable probes provided by the syscall provider in the open function, use the -n option with the fully specified 4-tuple syntax:
# dtrace -n syscall::open:
dtrace: description 'syscall::open:' matched 2 probes
CPU     ID  FUNCTION:NAME
  0     14  open:entry
  0     15  open:return
  0     14  open:entry
  0     15  open:return
  0     14  open:entry
^C
To enable the entry probe in the clock function (which should fire every 1/100th of a second):
# dtrace -n clock:entry
dtrace: description 'clock:entry' matched 1 probe
CPU     ID  FUNCTION:NAME
  0   4198  clock:entry
  0   4198  clock:entry
  0   4198  clock:entry
  0   4198  clock:entry
  0   4198  clock:entry
  0   4198  clock:entry
  0   4198  clock:entry
^C
DTrace Actions
Actions are user-programmable statements that are executed within the kernel by the DTrace virtual machine. The following are properties of actions:
• Actions are taken when a probe fires.
• Actions are completely programmable (in the D language).
• Most actions record some specified state in the system.
• Some actions can change the state of the system in a well-defined way.
    • These are called destructive actions.
    • Destructive actions are not allowed by default.
For now, you will use D expressions that consist only of built-in D variables. The following are some of the most useful built-in D variables. See Appendix B for a complete list of the D built-in variables.
• pid: The current process ID
• execname: The current executable name
• timestamp: The time since boot in nanoseconds
• curthread: A pointer to the kthread_t structure that represents the current thread
• probemod: The current probe's module name
• probefunc: The current probe's function name
There are also many built-in functions that perform actions. Appendix A, "Actions and Subroutines," provides the complete list of D built-in functions. Start with the trace() function, which records the result of a D expression to the trace buffer. For example:
• trace(pid) traces the current process ID.
• trace(execname) traces the name of the current executable.
• trace(curthread->t_pri) traces the t_pri field of the current thread.
• trace(probefunc) traces the function name of the probe.
Actions are indicated by following a probe specification with { action }. For example:
# dtrace -n 'readch {trace(pid)}'
dtrace: description 'readch ' matched 4 probes
CPU     ID  FUNCTION:NAME
  0   4036  read:readch
  0   4036  read:readch
  0   4036  read:readch
  0   4036  read:readch
  0   4036  read:readch
  0   4036  read:readch
  0   4036  read:readch
...
In the last example, the process identification number (PID) appears in the last column of output.
The following example traces the executable name:
# dtrace -m 'ufs {trace(execname)}'
dtrace: description 'ufs ' matched 889 probes
CPU     ID  FUNCTION:NAME
  0  14977  ufs_lookup:entry      ls
  0  15748  ufs_iaccess:entry     ls
  0  15749  ufs_iaccess:return    ls
  0  14978  ufs_lookup:return     ls
  0  14977  ufs_lookup:entry      ls
  0  15748  ufs_iaccess:entry     ls
  0  15749  ufs_iaccess:return    ls
  0  14978  ufs_lookup:return     ls
  0  14977  ufs_lookup:entry      ls
  0  15748  ufs_iaccess:entry     ls
  0  15749  ufs_iaccess:return    ls
  0  14978  ufs_lookup:return     ls
  0  14977  ufs_lookup:entry      ls
...
  0  15005  ufs_rwunlock:entry    utmpd
  0  15006  ufs_rwunlock:return   utmpd
  0  14963  ufs_close:entry       utmpd
  0  14964  ufs_close:return      utmpd
  0  15007  ufs_seek:entry        utmpd
  0  15008  ufs_seek:return       utmpd
  0  14963  ufs_close:entry       utmpd
^C
The next action example traces the time of entry to each system call:
# dtrace -n 'syscall:::entry {trace(timestamp)}'
dtrace: description 'syscall:::entry ' matched 226 probes
CPU     ID  FUNCTION:NAME
  0    312  portfs:entry      157088479572713
  0     98  ioctl:entry       157088479637542
  0     98  ioctl:entry       157088479674339
  0    234  sysconfig:entry   157088479767243
  0    234  sysconfig:entry   157088479774432
  0    168  sigaction:entry   157088479993155
  0    168  sigaction:entry   157088480229390
  0     98  ioctl:entry       157088480318855
  0    234  sysconfig:entry   157088480398692
  0     38  brk:entry         157088480422525
  0     38  brk:entry         157088480438097
  0     98  ioctl:entry       157088480794819
  0     98  ioctl:entry       157088480959666
  0     98  ioctl:entry       157088480986498
  0     98  ioctl:entry       157088481033225
  0     60  fstat:entry       157088481050686
  0     60  fstat:entry       157088481074680
...
Multiple actions can be specified; they must be separated by semicolons:
# dtrace -n 'zfod {trace(pid);trace(execname)}'
dtrace: description 'zfod ' matched 3 probes
CPU     ID  FUNCTION:NAME
  0   4080  anon_zero:zfod
  0   4080  anon_zero:zfod
  0   4080  anon_zero:zfod
  0   4080  anon_zero:zfod
  0   4080  anon_zero:zfod
  0   4080  anon_zero:zfod
  0   4080  anon_zero:zfod
  0   4080  anon_zero:zfod
...
The following example traces the executable name in every entry to the pagefault function:
# dtrace -n 'fbt::pagefault:entry {trace(execname)}'
dtrace: description 'fbt::pagefault:entry ' matched 1 probe
CPU     ID  FUNCTION:NAME
  0   2407  pagefault:entry   dtrace
  0   2407  pagefault:entry   dtrace
  0   2407  pagefault:entry   dtrace
  0   2407  pagefault:entry   sh
  0   2407  pagefault:entry   sh
  0   2407  pagefault:entry   sh
  0   2407  pagefault:entry   sh
  0   2407  pagefault:entry   sh
...
Writing D Scripts
Complicated DTrace enablings become difficult to manage on the command line. The dtrace(1M) command supports scripts, specified with the -s option. Alternatively, you can create executable DTrace interpreter files. Interpreter files always begin with:
#!/usr/sbin/dtrace -s
Executable D Scripts
For example, you can write a script to trace the executable name upon entry to each system call as follows:
# cat syscall.d
syscall:::entry
{
        trace(execname);
}
By convention, D scripts end with a .d suffix. You can run this D script as follows:
# dtrace -s syscall.d
dtrace: script 'syscall.d' matched 226 probes
CPU     ID  FUNCTION:NAME
  0    312  pollsys:entry     java
  0     98  ioctl:entry       dtrace
  0     98  ioctl:entry       dtrace
  0    234  sysconfig:entry   dtrace
  0    234  sysconfig:entry   dtrace
  0    168  sigaction:entry   dtrace
  0    168  sigaction:entry   dtrace
  0     98  ioctl:entry       dtrace
  0    234  sysconfig:entry   dtrace
  0     38  brk:entry         dtrace
^C
If you give the syscall.d file execute permission and add a first line to invoke the interpreter, you can run the script by entering its name on the command line as follows:
# cat syscall.d
#!/usr/sbin/dtrace -s
syscall:::entry
{
        trace(execname);
}
# chmod +x syscall.d
# ls -l syscall.d
-rwxr-xr-x   1 root     other         62 May 12 11:30 syscall.d
# ./syscall.d
dtrace: script './syscall.d' matched 226 probes
CPU     ID  FUNCTION:NAME
  0     98  ioctl:entry       java
  0     98  ioctl:entry       java
  0    312  pollsys:entry     java
  0    312  pollsys:entry     java
  0    312  pollsys:entry     java
  0     98  ioctl:entry       dtrace
  0     98  ioctl:entry       dtrace
  0    234  sysconfig:entry   dtrace
  0    234  sysconfig:entry   dtrace
D Literal Strings
The D language supports literal strings that you can use with the trace function as follows:
# cat string.d
#!/usr/sbin/dtrace -s
fbt::bdev_strategy:entry
{
        trace(execname);
        trace(" is initiating a disk I/O\n");
}
The \n at the end of the literal string produces a new line. To run this script, enter the following:
# dtrace -s string.d
dtrace: script 'string.d' matched 1 probe
CPU     ID  FUNCTION:NAME
  0   9215  bdev_strategy:entry   bash is initiating a disk I/O

  0   9215  bdev_strategy:entry   vi is initiating a disk I/O

  0   9215  bdev_strategy:entry   vi is initiating a disk I/O

  0   9215  bdev_strategy:entry   vi is initiating a disk I/O

  0   9215  bdev_strategy:entry   sched is initiating a disk I/O
The quiet mode option, -q, in dtrace(1M) tells DTrace to record only the actions explicitly stated. This option suppresses the default output normally produced by the dtrace command. The following example shows the use of the -q option on the string.d script:
# dtrace -q -s string.d
ls is initiating a disk I/O
cat is initiating a disk I/O
fsflush is initiating a disk I/O
vi is initiating a disk I/O
vi is initiating a disk I/O
:END
# dtrace -qs beginEnd.d
This is a heading
^C
This should appear at the END
Note: The END probe does not fire until you interrupt (^C) the dtrace command.
Module 2
Using DTrace
Objectives
Upon completion of this module, you should be able to:
• Describe the DTrace performance monitoring capabilities
• Examine performance problems using the vminfo provider
• Examine performance problems using the sysinfo provider
• Examine performance problems using the io provider
• Use DTrace to obtain information about system calls
• Create D scripts that use arguments
Relevance
Discussion: The following questions are relevant to understanding how to use DTrace:
• What performance monitoring tools exist in the Solaris 10 OS?
• Would it be useful to know which process is making which system calls?
• What advantage does the ability to pass arguments to a D script provide?
Additional Resources
The following references provide additional information on the topics described in this module:
• Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. "Dynamic Instrumentation of Production Systems." Paper presented at the 2004 USENIX Conference.
• BigAdmin System Administration Portal [http://www.sun.com/bigadmin/content/dtrace].
• Sun Microsystems, Inc. Solaris Dynamic Tracing Guide (Beta), part number 817-6223-10.
• The /usr/demo/dtrace directory contains all of the sample scripts from the Solaris Dynamic Tracing Guide.
• dtrace(1M) manual page in the Solaris 10 OS manual pages, Solaris 10 Reference Manual Collection.
DTrace Performance Monitoring Capabilities
• The vminfo provider: Implements probes that correspond to the vmstat(1M) tool
• The sysinfo provider: Implements probes that correspond to the mpstat(1M) tool
• The io provider: Implements probes that correspond to the iostat(1M) tool
In addition, the syscall provider implements probes that correspond to the truss(1) command.
Aggregations
Aggregated data is more useful than individual data points in answering performance-related questions. For example, if you want to know the number of page faults by process, you do not necessarily care about each individual page fault. Rather, you want a table that lists the process names and the total number of page faults. DTrace provides several built-in aggregating functions. An aggregating function has this property: if it is applied to subsets of a collection of gathered data and then applied again to the results, it returns the same result as it does when applied to the whole collection. Examples of aggregating functions are count(), sum(), min(), and max(). A median function is not an aggregating function because it lacks this property.
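The property can be checked with a few lines of code. The following Python sketch is an illustration only, not part of DTrace; the data values and the recombined() helper are hypothetical. It splits a small data set into two subsets, as if gathered on two CPUs, and compares the recombined partial results with the result over the whole collection. Note one assumption in the sketch: partial counts are recombined by adding them together.

```python
from statistics import median

# Hypothetical data points, split as if they were gathered on two CPUs.
DATA = [3, 9, 4, 1, 7, 8]
SUBSETS = [DATA[:2], DATA[2:]]


def recombined(func, combine=None):
    """Apply func to each subset, then combine the partial results."""
    combine = combine or func
    return combine([func(subset) for subset in SUBSETS])


# Each entry holds (result from the subsets, result from the whole set).
RESULTS = {
    "sum": (recombined(sum), sum(DATA)),
    "min": (recombined(min), min(DATA)),
    "max": (recombined(max), max(DATA)),
    # Partial counts are combined by adding them together.
    "count": (recombined(len, combine=sum), len(DATA)),
    "median": (recombined(median), median(DATA)),
}

if __name__ == "__main__":
    for name, (from_subsets, from_whole) in RESULTS.items():
        verdict = "same" if from_subsets == from_whole else "DIFFERENT"
        print("%-6s subsets=%s whole=%s %s" % (name, from_subsets, from_whole, verdict))
```

In this run only the median disagrees, which is why a median cannot be computed from per-subset intermediate results alone.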
DTrace is not required to store the entire set of data items for aggregations; it keeps a running count, needing only the current intermediate result and the new element. Intermediate results are kept per central processing unit (CPU), enabling a scalable implementation (because no locks are required).
An aggregation has the following general form:
@name[keys] = aggfunc(args);
• name: The name of the aggregation, preceded by the @ character
• keys: A comma-separated list of D expressions
• aggfunc: One of the DTrace aggregating functions
• args: A comma-separated list of arguments appropriate to the aggregating function
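To make the roles of the keys and of the per-CPU intermediate results concrete, the following Python sketch models a count() aggregation keyed by one or more expressions. It is an assumption-laden illustration rather than a description of the DTrace implementation: the CountAggregation class, its methods, and the sample events are all hypothetical.

```python
from collections import Counter, defaultdict


class CountAggregation:
    """Model of a count() aggregation keyed by a tuple of expressions."""

    def __init__(self):
        # One table of intermediate results per CPU, so no lock is shared.
        self._per_cpu = defaultdict(Counter)

    def record(self, cpu, *keys):
        # Only the running count is kept; individual events are not stored.
        self._per_cpu[cpu][keys] += 1

    def snapshot(self):
        """Merge the per-CPU tables into one result for display."""
        total = Counter()
        for table in self._per_cpu.values():
            total.update(table)
        return total


if __name__ == "__main__":
    agg = CountAggregation()
    # Hypothetical probe firings as (cpu, execname) pairs.
    for cpu, execname in [(0, "find"), (1, "find"), (0, "bash"), (1, "find")]:
        agg.record(cpu, execname)
    for keys, value in sorted(agg.snapshot().items(), key=lambda item: item[1]):
        print("%-10s %d" % (keys[0], value))
```

Because count() is an aggregating function, adding the per-CPU tables together gives the same totals as counting every event in one shared table.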
Table 2-1 DTrace Aggregating Functions (Continued)

Function Name: lquantize
Arguments: scalar expression, lower bound, upper bound, step value
Result: A linear frequency distribution, sized by the specified range, of the values of the specified expression. Increments the value in the highest bucket that is less than or equal to the specified expression.

Function Name: quantize
Arguments: scalar expression
Result: A power-of-two frequency distribution of the values of the specified expression. Increments the value in the highest power-of-two bucket that is less than or equal to the specified expression.
1 1 3 20 197 201
Note: No data is output from the aggregation until dtrace(1M) is terminated. The output data is a summary up to that point.
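The bucket rule stated for quantize() and lquantize() in Table 2-1 can be sketched as follows. This Python model is illustrative only. It assumes non-negative integer values for quantize(), and it assumes that lquantize() values below the lower bound or at or above the upper bound fall into separate underflow and overflow buckets; the function names and bucket labels are hypothetical.

```python
from collections import Counter


def quantize_bucket(value):
    """Return the highest power-of-two bucket that is <= value (value >= 0)."""
    if value < 0:
        raise ValueError("this sketch handles only non-negative values")
    if value == 0:
        return 0
    return 1 << (value.bit_length() - 1)


def lquantize_bucket(value, lower, upper, step):
    """Return the linear bucket for value, or an underflow/overflow label."""
    if value < lower:
        return "< %d" % lower      # assumed underflow bucket
    if value >= upper:
        return ">= %d" % upper     # assumed overflow bucket
    return lower + ((value - lower) // step) * step


if __name__ == "__main__":
    # Hypothetical read sizes in bytes.
    sizes = [0, 1, 3, 512, 700, 1024, 1500, 8192]
    histogram = Counter(quantize_bucket(size) for size in sizes)
    for bucket in sorted(histogram):
        print("%8d | %d" % (bucket, histogram[bucket]))
```

For example, a value of 1500 lands in the 1024 bucket under quantize(), and a value of 37 lands in the 30 bucket under lquantize() with a lower bound of 0, an upper bound of 100, and a step of 10.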
1 27 29 37 60 68
Examining Performance Problems Using the vminfo Provider
# kstat -n vm
module: cpu                             instance: 0
name:   vm                              class:    misc
        anonfree                        0
        anonpgin                        4
        anonpgout                       0
        as_fault                        157771
        cow_fault                       34207
        crtime                          0.178610697
        dfree                           56
        execfree                        0
        execpgin                        3646
        execpgout                       0
        fsfree                          56
        fspgin                          16257
        fspgout                         57
        hat_fault                       0
        kernel_asflt                    0
        maj_fault                       6743
        pgfrec                          34215
        pgin                            9188
        pgout                           36
        pgpgin                          19907
        pgpgout                         57
        pgrec                           34216
        pgrrun                          4
        pgswapin                        0
        pgswapout                       0
        prot_fault                      39794
        rev                             0
        scan                            28668
        snaptime                        349429.087071013
        softlock                        165
        swapin                          0
        swapout                         0
        zfod                            12835
dfree
fsfree
Table 2-2 The vminfo Probes (Continued)

Probe Name    Description
fspgin        Probe that fires when a file system page is paged in from the backing store.
fspgout       Probe that fires when a file system page is paged out to the backing store.
kernel_asflt  Probe that fires when a page fault is taken by the kernel on a page in its own address space. When the kernel_asflt probe fires, it is immediately preceded by a firing of the as_fault probe.
maj_fault     Probe that fires when a page fault is taken that results in input/output (I/O) from a backing store or swap device. Whenever maj_fault fires, it is immediately preceded by a firing of the pgin probe.
pgfrec        Probe that fires when a page is reclaimed from the free page list.
pgin          Probe that fires when a page is paged in from the backing store or from a swap device. This differs from the maj_fault probe in that the maj_fault probe only fires when a page is paged in as a result of a page fault; the pgin probe fires when a page is paged in, regardless of the reason.
pgout         Probe that fires when a page is paged out to the backing store or to a swap device.
pgpgin        Probe that fires when a page is paged in from the backing store or from a swap device. The only difference between the pgpgin probe and the pgin probe is that the pgpgin probe contains the number of pages paged in as the arg0 argument. (The pgin probe always contains 1 in the arg0 argument.)
pgpgout       Probe that fires when a page is paged out to the backing store or to a swap device. The only difference between the pgpgout probe and the pgout probe is that the pgpgout probe contains the number of pages paged out as the arg0 argument. (The pgout probe always contains 1 in the arg0 argument.)
pgrec         Probe that fires when a page is reclaimed.
pgrrun        Probe that fires when the pager is scheduled.
pgswapin      Probe that fires when a process is swapped in.
pgswapout     Probe that fires when a process is swapped out.
prot_fault    Probe that fires when a page fault is taken due to a protection violation.
rev           Probe that fires when the page daemon begins a new revolution through all pages.
scan          Probe that fires when the page daemon examines a page.
softlock      Probe that fires when a page is faulted as a part of placing a software lock on the page.
swapin        Probe that fires when a swapped-out process is swapped back in.
swapout       Probe that fires when a process is swapped out.
zfod          Probe that fires when a zero-filled page is created on demand.
Here the pi column denotes the number of kilobytes paged in per second.
# dtrace -n 'pgin {@[execname] = count()}'
dtrace: description 'pgin ' matched 1 probe
^C
  utmpd              2
  in.routed          2
  init               2
  snmpd              5
  automountd         5
  vi                 5
  vmstat            17
  sh                23
  grep              33
  dtrace            35
  bash              62
  file             198
  find            4551
This output shows that the find command is responsible for most of the page-ins. For a more complete picture of the virtual memory behavior of the find command, you can enable all vminfo probes. Before doing this, however, you need a filtering capability of DTrace called a predicate.
Predicates
A D program consists of a set of probe clauses. A probe clause has the following general form:
probe descriptions
/ predicate /
{
        action statements
}
Predicates are D expressions enclosed in slashes / / that are evaluated at probe firing time to determine whether the associated actions should be executed. If the D expression evaluates to zero it is false; if it evaluates to non-zero it is true. Predicates are optional, but you must place them between the probe description and the action statements.
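The evaluation order of a probe clause can be modeled in a few lines. The following Python sketch is a simplified illustration, not the DTrace implementation; the fire() function, the clause dictionary, and the sample events are hypothetical. The predicate runs first, and the action runs only when the predicate result is non-zero.

```python
from collections import Counter


def fire(clause, context):
    """Run the clause's action if its predicate, when present, is non-zero."""
    predicate = clause.get("predicate")
    if predicate is not None and not predicate(context):
        return False          # predicate evaluated to zero: skip the action
    clause["action"](context)
    return True


COUNTS = Counter()

# Models a clause that counts probe firings only for the find command.
CLAUSE = {
    "predicate": lambda ctx: ctx["execname"] == "find",
    "action": lambda ctx: COUNTS.update([ctx["probename"]]),
}

# Hypothetical probe firings.
EVENTS = [
    {"execname": "find", "probename": "pgin"},
    {"execname": "bash", "probename": "pgin"},
    {"execname": "find", "probename": "zfod"},
    {"execname": "find", "probename": "pgin"},
]

if __name__ == "__main__":
    for event in EVENTS:
        fire(CLAUSE, event)
    print(dict(COUNTS))
```

A clause without a predicate always runs its action, which corresponds to the examples earlier in this guide that have only a probe description and an action.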
The following dtrace command was started in another terminal window immediately after the above command group was started in the background.
# dtrace -s find.d
dtrace: script 'find.d' matched 44 probes
^C
  prot_fault
  cow_fault
  softlock
  execpgin
  kernel_asflt
  zfod
  as_fault
  pgrec
  pgfrec
  maj_fault
  fspgin
  pgpgin
  pgin
You might wonder why, with such a large memory load, scans do not show up in the output of the dtrace command. This is because the pageout daemon is running during scans, not the find user process. The following example shows this behavior.
# cat mem.d
#!/usr/sbin/dtrace -s
vminfo:::
{
        @vm[execname,probename] = count();
}
END
{
        printa("%16s\t%16s\t%@d\n", @vm);
}
1 1 1 1 1 1
          mkfile        prot_fault  1
            find        prot_fault  1
          dtrace             pgrec  1
          mkfile          execpgin  2
          mkfile      kernel_asflt  2
          vmstat        prot_fault  2
              rm              zfod  3
            find          execpgin  3
           sleep              zfod  3
          mkfile              zfod  3
        sendmail          anonpgin  3
          mkfile         cow_fault  4
              rm         cow_fault  4
            bash          anonpgin  4
              rm         maj_fault  4
        sendmail            pgfrec  4
           sleep         cow_fault  4
            find         cow_fault  4
        sendmail             pgrec  4
...
            bash             pgrec  205
         pageout           fspgout  293
         pageout         anonpgout  293
         pageout           pgpgout  293
         pageout             pgout  293
         pageout         execpgout  293
         pageout             pgrec  293
         pageout          anonfree  360
         pageout          execfree  510
            bash          as_fault  519
         pageout            fsfree  519
           sched             dfree  523
           sched             pgrec  523
           sched             pgout  523
           sched           pgpgout  523
           sched         anonpgout  523
           sched          anonfree  523
           sched         execpgout  523
           sched          execfree  523
         pageout             dfree  803
              rm             pgrec  1388
              rm            pgfrec  1388
            find         maj_fault  5067
            find            fspgin  5085
            find              pgin  5088
            find            pgpgin  5088
         pageout              scan  78852
The printa() built-in formatting function gives you increased control over the output of an aggregation. For example, consider the following code line:
{ printa("%16s\t%16s\t%@d\n", @vm); }
It provides these formatting instructions:
• %16s\t%16s prints the first and second elements of the aggregation keys in a 16-character-wide column (right justified).
• \t outputs a <Tab>.
• %@d prints the aggregation value as a decimal number.
Note: Appendix A provides more details on the format letters available to the printa() function and the more general printf() function (which resembles the printf(3C) function from the Standard C Library).
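The effect of the format string can be previewed with ordinary printf-style formatting. The following Python sketch is an approximation under stated assumptions: Python has no %@d conversion, so %d stands in for it, and the sketch sorts rows by ascending value because the sample output in this module appears in that order. The printa() function here is a hypothetical stand-in for the D built-in of the same name, and VM holds three rows taken from the sample output.

```python
def printa(fmt, aggregation):
    """Format each (keys, value) pair of an aggregation, lowest value first."""
    lines = []
    for keys, value in sorted(aggregation.items(), key=lambda item: item[1]):
        lines.append(fmt % (keys + (value,)))
    return "".join(lines)


# Three rows from the sample output, keyed by (execname, probename).
VM = {
    ("find", "pgin"): 5088,
    ("rm", "pgrec"): 1388,
    ("pageout", "scan"): 78852,
}

# %d replaces %@d, which exists only in D.
FORMAT = "%16s\t%16s\t%d\n"

if __name__ == "__main__":
    print(printa(FORMAT, VM), end="")
```

Each key is padded on the left to 16 characters, so the two key columns line up on their right edges and a tab separates the columns.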
bwrite cpu_ticks_idle
Examining Performance Problems Using the sysinfo Provider
Table 2-3 The sysinfo Probes (Continued)

Probe Name        Description
cpu_ticks_kernel  Probe that fires when the periodic system clock has determined that a CPU is executing in the kernel. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; one must examine the cpu_t argument (arg2) to determine the CPU that has been deemed to be executing in the kernel.
cpu_ticks_user    Probe that fires when the periodic system clock has determined that a CPU is executing in user mode. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; one must examine the cpu_t argument (arg2) to determine the CPU that has been deemed to be running in user mode.
cpu_ticks_wait    Probe that fires when the periodic system clock has determined that a CPU is otherwise idle, but on which some threads are waiting for I/O. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; one must examine the cpu_t argument (arg2) to determine the CPU that has been deemed waiting on I/O.
idlethread        Probe that fires when a CPU enters the idle loop.
intrblk           Probe that fires when an interrupt thread blocks.
inv_swtch         Probe that fires when a running thread is forced to involuntarily give up the CPU.
lread             Probe that fires when a buffer is logically read from a device.
lwrite            Probe that fires when a buffer is logically written to a device.
modload           Probe that fires when a kernel module is loaded.
modunload         Probe that fires when a kernel module is unloaded.
msg               Probe that fires when a msgsnd(2) or msgrcv(2) system call is made, but before the message queue operations have been performed.
mutex_adenters    Probe that fires when an attempt is made to acquire an owned adaptive lock. If this probe fires, one of the lockstat provider probes (adaptive-block or adaptive-spin) also fires.
namei             Probe that fires when a name lookup is attempted in the file system.
nthreads          Probe that fires when a thread is created.
phread            Probe that fires when a raw I/O read is about to be performed.
phwrite           Probe that fires when a raw I/O write is about to be performed.
procovf           Probe that fires when a new process cannot be created because the system is out of process table entries.
pswitch           Probe that fires when a CPU switches from executing one thread to executing another.
readch            Probe that fires after each successful read, but before control is returned to the thread performing the read. A read can occur through the read(2), readv(2), or pread(2) system calls. The arg0 argument contains the number of bytes that were successfully read.
rw_rdfails        Probe that fires when an attempt is made to read-lock a readers/writer lock when the lock is either held by a writer, or desired by a writer. If this probe fires, the lockstat provider's rw-block probe also fires.
rw_wrfails        Probe that fires when an attempt is made to write-lock a readers/writer lock when the lock is held either by some number of readers or by another writer. If this probe fires, the lockstat provider's rw-block probe also fires.
sema              Probe that fires when a semop(2) system call is made, but before any semaphore operations have been performed.
sysexec           Probe that fires when an exec(2) system call is made.
sysfork           Probe that fires when a fork(2) system call is made.
sysread           Probe that fires when a read(2), readv(2), or pread(2) system call is made.
sysvfork          Probe that fires when a vfork(2) system call is made.
syswrite          Probe that fires when a write(2), writev(2), or pwrite(2) system call is made.
trap              Probe that fires when a processor trap occurs. Note that some processors (in particular, UltraSPARC variants) handle some lightweight traps through a mechanism that does not cause this probe to fire.
ufsdirblk         Probe that fires when a directory block is read from the UFS file system. See the ufs(7FS) man page for details on UFS.
ufsiget           Probe that fires when an inode is retrieved. See the ufs(7FS) man page for details on UFS.
ufsinopage        Probe that fires after an in-core inode without any associated data pages has been made available for reuse. See the ufs(7FS) man page for details on UFS.
ufsipage          Probe that fires after an in-core inode with associated data pages has been made available for reuse and therefore after the associated data pages have been flushed to disk. See the ufs(7FS) man page for details on UFS.
wait_ticks_io     Probe that fires when the periodic system clock has determined that a CPU is otherwise idle, but on which some threads are waiting for I/O. Note that this probe fires in the context of the system clock and therefore fires on the CPU running the system clock; one must examine the cpu_t argument (arg2) to determine the CPU that has been deemed waiting on I/O. Note that there is no semantic difference between wait_ticks_io and cpu_ticks_wait; wait_ticks_io exists purely for historical reasons.
writech           Probe that fires after each successful write, but before control is returned to the thread performing the write. A write can occur through the write(2), writev(2), or pwrite(2) system calls. The arg0 argument contains the number of bytes that were successfully written.
xcalls            Probe that fires when a cross-call is about to be made. A cross-call is the operating system's mechanism for one CPU to request immediate work from another.
Using DTrace
Copyright 2005 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
[Aggregation output: power-of-two distributions (value, Distribution, count) with buckets from -1 through 16384, the last one labeled grep. The row alignment did not survive extraction.]
The xcal and syscl columns display relatively high numbers, which might be affecting the system's performance. Yet the system is relatively idle, and is not spending time waiting on input/output (I/O). The xcal numbers are per-second and are read from the xcalls field of the sys kstat. To see which executables are responsible for the xcalls, enter the following dtrace(1M) command:

# dtrace -n 'xcalls {@[execname] = count()}'
dtrace: description 'xcalls ' matched 3 probes
^C
find
cut
snmpd
mpstat
sendmail
grep
bash
dtrace
sched
xargs
file
#
This output indicates the source of the cross-calls: some number of file(1) and xargs(1) processes are inducing the majority of them. You can find these processes using the pgrep(1) and ptree(1) commands:

# pgrep xargs
15973
# ptree 15973
204 /usr/sbin/inetd -s
5650 in.telnetd
5653 -sh
5657 bash
15970 /bin/sh ./findtxt configuration
15971 cut -f1 -d:
15973 xargs file
16686 file /usr/bin/tbl /usr/bin/troff /usr/bin/ul /usr/bin/vgrind /usr/bin/catman
The xargs and file commands appear to be part of a custom user shell script. You can locate this script as follows:

# find / -name findtxt
/users1/james/findtxt
# cat /users1/james/findtxt
#!/bin/sh
find / -type f | xargs file | grep text | cut -f1 -d: >/tmp/findtxt$$
cat /tmp/findtxt$$ | xargs grep $1
rm /tmp/findtxt$$
#

The script is running many processes concurrently with much interprocess communication occurring through pipes. This script appears to be quite resource intensive: it is trying to find every text file in the system and is then searching each one for some specific text. You expect these processes to run concurrently on this system's four processors while they send data to each other.
    unix`thread_start+0x4
2
...
    SUNW,UltraSPARC-IIIi`send_mondo_set+0x9c
    unix`xt_some+0xc4
    unix`xt_sync+0x3c
    unix`hat_unload_callback+0x6ec
    genunix`anon_private+0x204
    genunix`segvn_faultpage+0x778
    genunix`segvn_fault+0x920
    genunix`as_fault+0x4a0
    unix`pagefault+0xac
    unix`trap+0xc14
    unix`utl0+0x4c
2303

    SUNW,UltraSPARC-IIIi`send_mondo_set+0x9c
    unix`xt_some+0xc4
    unix`sfmmu_tlb_range_demap+0x190
    unix`sfmmu_chgattr+0x2e8
    genunix`segvn_dup+0x3d0
    genunix`as_dup+0xd0
    genunix`cfork+0x120
    unix`syscall_trap32+0xa8
7175

    SUNW,UltraSPARC-IIIi`send_mondo_set+0x9c
    unix`xt_some+0xc4
    unix`xt_sync+0x3c
    unix`sfmmu_chgattr+0x2f0
    genunix`segvn_dup+0x3d0
    genunix`as_dup+0xd0
    genunix`cfork+0x120
    unix`syscall_trap32+0xa8
11492

As this output shows, the majority of the cross-calls are the result of a significant number of fork(2) system calls. (Shell scripts are notorious for abusing their fork(2) privileges.) Page faults of anonymous memory are also involved, which probably accounts for the large number of minor page faults seen in the mpstat output.
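Stack-keyed counts like those above can be gathered by aggregating on the kernel stack when the xcalls probe fires. The following is a minimal sketch of one way to do that; restricting the probe to the file executable is an assumption based on the earlier per-executable results, and this is not necessarily the exact script that produced the output shown.

```d
#!/usr/sbin/dtrace -s

/*
 * Count cross-calls by kernel stack trace.
 * The execname filter is an assumption for this example.
 */
sysinfo:::xcalls
/execname == "file"/
{
	@[stack()] = count();
}
```

When you press Control-C, dtrace prints each distinct stack followed by its count, sorted so that the most frequent stack appears last.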
Device
I/O type
Process ID
Application name
File name
File offset
The io Probes
Table 2-4 describes the io probes.

Table 2-4 The io Probes

Probe Name
    Description
start
    Probe that fires when an I/O request is about to be made to a disk device or to an NFS server. The buf(9S) structure corresponding to the I/O request is pointed to by the args[0] argument. The devinfo_t structure of the device to which the I/O is being issued is pointed to by the args[1] argument. The fileinfo_t structure of the file that corresponds to the I/O request is pointed to by the args[2] argument. Note that file information availability depends on the file system making the I/O request.
done
    Probe that fires after an I/O request has been fulfilled. The buf(9S) structure corresponding to the I/O request is pointed to by the args[0] argument. The devinfo_t structure of the device to which the I/O was issued is pointed to by the args[1] argument. The fileinfo_t structure of the file that corresponds to the I/O request is pointed to by the args[2] argument.
Examining Performance Problems Using the io Provider

Table 2-4 The io Probes (Continued)

Probe Name
    Description
wait-start
    Probe that fires immediately before a thread begins to wait pending completion of a given I/O request. The buf(9S) structure corresponding to the I/O request for which the thread will wait is pointed to by the args[0] argument. The devinfo_t structure of the device to which the I/O was issued is pointed to by the args[1] argument. The fileinfo_t structure of the file that corresponds to the I/O request is pointed to by the args[2] argument. Some time after the wait-start probe fires, the wait-done probe fires in the same thread.
wait-done
    Probe that fires immediately after a thread wakes up from waiting for a pending completion of a given I/O request. The buf(9S) structure corresponding to the I/O request for which the thread was waiting is pointed to by the args[0] argument. The devinfo_t structure of the device to which the I/O was issued is pointed to by the args[1] argument. The fileinfo_t structure of the file that corresponds to the I/O request is pointed to by the args[2] argument. Some time after the wait-start probe fires, the wait-done probe fires in the same thread.
args[0] - Set to point to the buf(9S) structure corresponding to the I/O request.
args[1] - Set to point to the devinfo_t structure of the device to which the I/O was issued.
args[2] - Set to point to the fileinfo_t structure containing file system related information regarding the issued I/O request.
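The three arguments can be read together in a single clause. The following minimal sketch prints one line per I/O request as it is issued; the column widths and the choice of fields are assumptions made for illustration.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

/*
 * args[0] -> buf(9S) data, args[1] -> devinfo_t, args[2] -> fileinfo_t.
 * The fields printed here are an arbitrary selection.
 */
io:::start
{
	printf("%-10s %-12s %8d bytes  %s\n", execname,
	    args[1]->dev_statname, args[0]->b_bcount, args[2]->fi_pathname);
}
```

Because the start probe fires in the context of the thread issuing the request, execname identifies the requester here; that is not generally true for the done probe.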
The b_flags member indicates the state of the I/O buffer and consists of a bitwise OR of different state values. Table 2-5 shows the valid state values for the b_flags field.

Table 2-5 The b_flags Field Values

Flag Value
    Description
B_DONE
    Indicates the data transfer has completed.
B_ERROR
    Indicates an I/O transfer error. It is set in conjunction with the b_error field.
B_PAGEIO
    Indicates the buffer is being used in a paged I/O request. See the description of the b_addr field (Table 2-6) for more information.
B_PHYS
    Indicates the buffer is being used for physical (direct) I/O to a user data area.
B_READ
    Indicates that data is to be read from the peripheral device into main memory.
B_WRITE
    Indicates that the data is to be transferred from main memory to the peripheral device.
Table 2-6 shows the field descriptions for the buf(9S) structure.

Table 2-6 The buf(9S) Structure Field Descriptions

Field
    Description
b_bcount
    Indicates the number of bytes to be transferred as part of the I/O request.
b_addr
    Indicates the virtual address of the I/O request, unless B_PAGEIO is set. The address is a kernel virtual address unless B_PHYS is set, in which case it is a user virtual address. If B_PAGEIO is set, the b_addr field contains kernel private data. Note that either B_PHYS or B_PAGEIO or neither can be set, but not both.
b_lblkno
    Identifies which logical block on the device is to be accessed. The mapping from a logical block to a physical block (cylinder, track, and so on) is defined by the device.
b_resid
    Indicates the number of bytes not transferred because of an error.
b_bufsize
    Contains the size of the allocated buffer.
b_iodone
    Identifies a specific routine in the kernel that is called when the I/O is complete.
b_error
    Holds an error code returned from the driver in the event of an I/O error. b_error is set in conjunction with the B_ERROR bit set in the b_flags member.
b_edev
    Contains the major and minor device numbers of the device accessed. Consumers can use the D built-in functions getmajor() and getminor() to extract the major and minor device numbers from the b_edev field.
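The flag and field descriptions above can be exercised with a short script. This minimal sketch, offered as an illustration rather than course material, classifies each request by testing the B_READ bit in b_flags and totals the requested bytes per device using getmajor() and getminor() on b_edev. The aggregation names @dir and @bytes are assumptions.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

io:::start
{
	/* Direction of the transfer, decoded from the B_READ bit. */
	@dir[args[0]->b_flags & B_READ ? "read" : "write"] = count();

	/* Bytes requested, keyed by major and minor device number. */
	@bytes[getmajor(args[0]->b_edev), getminor(args[0]->b_edev)] =
	    sum(args[0]->b_bcount);
}

END
{
	printa("%-6s %@d requests\n", @dir);
	printa("major %d minor %d: %@d bytes\n", @bytes);
}
```

Press Control-C to end tracing; the END clause then prints both aggregations.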
Table 2-7 shows the field descriptions for the devinfo_t structure.

Table 2-7 The devinfo_t Structure Field Descriptions

Field
    Description
dev_major
    Indicates the major number of the device; see getmajor(9F).
dev_minor
    Indicates the minor number of the device; see getminor(9F).
dev_instance
    Indicates the instance number of the device. The instance of a device is different from the minor number: where the minor number is an abstraction managed by the device driver, the instance number is a property of the device node. Device node instance numbers can be displayed with the prtconf(1M) command.
dev_name
    Indicates the name of the device driver that manages the device. (Device driver names can be viewed with the -D option to prtconf(1M).)
dev_statname
    Indicates the name of the device as reported by the iostat(1M) command. This name also corresponds to the name of the device as reported by the kstat(1M) command. This field is provided to enable aberrant iostat or kstat output to be correlated to actual I/O activity.
dev_pathname
    Indicates the complete path of the device.
Table 2-8 shows the field descriptions for the fileinfo_t structure.

Table 2-8 The fileinfo_t Structure Field Descriptions

Field
    Description
fi_name
    Contains the name of the file without any directory components. If there is no file information associated with an I/O, the fi_name field is set to the string <none>. In rare cases, the pathname associated with a file is unknown; in this case, the fi_name field is set to the string <unknown>.
fi_dirname
    Contains only the directory component of the file name. As with fi_name, this can be set to <none> if there is no file information present, or to <unknown> if the pathname associated with the file is not known.
fi_pathname
    Contains the complete pathname to the file. As with fi_name, this can be set to <none> if there is no file information present, or to <unknown> if the pathname associated with the file is not known.
fi_offset
    Contains the offset within the file, or -1 if file information is not present or if the offset is otherwise unspecified by the file system.
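Because fi_pathname can hold the placeholder strings described in Table 2-8, a script that reports I/O by file usually needs to filter them. The following sketch, which is an illustration and not part of the course scripts, totals requested bytes per file and skips requests that carry no file information.

```d
#!/usr/sbin/dtrace -s

/* Skip I/O requests that have no associated file information. */
io:::start
/args[2]->fi_pathname != "<none>"/
{
	@[args[2]->fi_pathname] = sum(args[0]->b_bcount);
}
```

Requests whose pathname is <unknown> are still counted under that string, which keeps the byte totals complete.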
This output indicates that a large amount of data is being read from disk drive sd2 and written to disk drive sd0. Someone appears to be transferring many megabytes of data between these two drives. Both disks are consistently over 50% busy. Is someone running a file transfer command such as tar(1), cpio(1), cp(1), or dd(1M)? The iosnoop.d D script enables you to determine who is performing this I/O.
You use the BEGIN probe to print out column headings. You use an associative array to store the nanosecond timestamp of when a particular I/O starts from a specific device. You must also store the executable name and PID of the command issuing the I/O request; this information is not available at I/O completion time because you are running in the context of an interrupt handler. When the I/O is done you determine the elapsed time and then print out the relevant information. You retrieve the file undergoing the I/O from the fileinfo_t structure; the args[2] argument is set up to point to the fileinfo_t structure when the done probe fires.
You retrieve the iostat-compatible device name from the devinfo_t structure, which is pointed to by the args[1] argument. You use a D conditional expression to display R or W based on testing the B_READ bit in the b_flags field of the buf structure, which is pointed to by the args[0] argument. You use the D modulo operator (%) to determine the fractional portion of the time in milliseconds. Finally, you set the associative array elements to zero. Setting an associative array element to zero de-allocates the underlying dynamic memory that was being used. This avoids potential dynamic variable drops.
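The steps described above can be sketched as follows. This is a reconstruction from the description, not the course's iosnoop.d listing: the array names start, command, and mypid and the scalar elapsed come from the text, while the array keys (b_edev and b_blkno) and the column widths are assumptions.

```d
#!/usr/sbin/dtrace -qs

BEGIN
{
	printf("%10s %5s %30s %8s %2s %7s\n",
	    "COMMAND", "PID", "FILE", "DEVICE", "RW", "MS");
}

/* Remember when, and on whose behalf, each I/O was issued. */
io:::start
{
	start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
	command[args[0]->b_edev, args[0]->b_blkno] = execname;
	mypid[args[0]->b_edev, args[0]->b_blkno] = pid;
}

/* On completion, report the elapsed time in milliseconds. */
io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
	elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
	printf("%10s %5d %30s %8s %2s %3d.%03d\n",
	    command[args[0]->b_edev, args[0]->b_blkno],
	    mypid[args[0]->b_edev, args[0]->b_blkno],
	    args[2]->fi_pathname, args[1]->dev_statname,
	    args[0]->b_flags & B_READ ? "R" : "W",
	    elapsed / 1000000, (elapsed / 1000) % 1000);

	/* Assigning zero frees the dynamic variable storage. */
	start[args[0]->b_edev, args[0]->b_blkno] = 0;
	command[args[0]->b_edev, args[0]->b_blkno] = 0;
	mypid[args[0]->b_edev, args[0]->b_blkno] = 0;
}
```

The predicate on the done probe ignores completions for requests that were issued before tracing began, for which no start time was recorded.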
The following output results from running the previous iosnoop.d script. It clearly shows who is performing the I/O operations. Someone is copying the shared object files from /usr/lib on drive sd2 to a backup directory on drive sd0.

PID  FILE                       DEVICE RW     MS
725  /usr/bin/bash              sd2    R   9.471
725  /usr/lib                   sd2    R   7.128
725  /usr/lib                   sd2    R   3.193
725  /usr/lib                   sd2    R  11.283
725  /lib/libc.so.1             sd2    R   7.696
725  /lib/libnsl.so.1           sd2    R  10.293
768  /lib/libnsl.so.1           sd2    R   0.582
768  /lib/libc.so.1             sd2    R  10.154
768  /lib/libc.so.1             sd2    R   7.262
768  /lib/libc.so.1             sd2    R   9.914
768  /usr/lib/0@0.so.1          sd2    R   9.270
768  /usr/lib/0@0.so.1          sd2    R  13.654
768  /mnt/lib.backup/0@0.so.1   sd0    W   2.431
768  /usr/lib/ld.so             sd2    R   6.890
768  /usr/lib/ld.so             sd2    R   7.085
768  /usr/lib/ld.so             sd2    R   0.376
768  /mnt/lib.backup/ld.so      sd0    W   6.698
768  /mnt/lib.backup/ld.so      sd0    W   6.437
768  /mnt/lib.backup/ld.so.1    sd0    W   4.394
768  <unknown>                  sd2    R   2.206
768  /mnt/lib.backup/ld.so.1    sd0    W   8.479
768  /mnt/lib.backup/ld.so.1    sd0    W   8.440
768  /usr/lib/lib300.so.1       sd2    R   5.771
768  /usr/lib/lib300.so.1       sd2    R   6.003
768  /usr/lib/lib300.so.1       sd2    R   0.530
768  /usr/lib/lib300.so.1       sd2    R   7.912
768  <unknown>                  sd2    R   3.014
768  /mnt/lib.backup/lib300.so  sd0    W   7.861
rexit for exit(2)
gtime for time(2)
semsys for semctl(2), semget(2), semids(2), and semtimedop(2)
signotify, which has no manual page, and is used for POSIX.4 message queues
Large file system calls such as:
    creat64 for creat(2)
    lstat64 for lstat(2)
    open64 for open(2)
    mmap64 for mmap(2)
4 4 11 3 12 12
Obtaining System Call Information

tty     open      2
tty     stat      2
bash    setpgrp   13
bash    waitsys   10
bash    stat64    2
snmpd   ioctl     12
^C
The errno.d D program has a predicate that uses the AND operator: &&. The predicate states that the return from the system call must be -1, which is how all system calls indicate failure, and that the process executable name cannot be dtrace. The printf built-in D function uses the %-20s and %-10s format specifications to left-justify the strings in the given minimum column width.
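A script matching this description could look like the following minimal sketch. It is a reconstruction from the prose above, not the course's errno.d listing; in particular, printing the errno built-in variable as the third column is an assumption based on the sample output.

```d
#!/usr/sbin/dtrace -qs

/* arg0 on a syscall return probe is the system call's return value. */
syscall:::return
/arg0 == -1 && execname != "dtrace"/
{
	printf("%-20s %-10s %d\n", execname, probefunc, errno);
}
```

The two format specifications pad the executable name to 20 characters and the system call name to 10, which produces left-aligned columns.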
The output indicates that the majority of the system calls are setting up signal handling (sigaction(2)) or growing the heap (brk(2)). The following D script enables you to discover who is making the brk(2) system calls.

# cat brk.d
#!/usr/sbin/dtrace -qs
syscall::brk:entry
{
    @[execname] = count();
}
# ./brk.d
^C
dtrace       6
prstat       22
nroff        48
cat          48
tbl          142
eqn          144
rm           166
ln           166
col          222
expr         332
head         492
fgrep        492
dirname      581
grep         722
instant      738
sh           917
nawk         984
sgml2roff    1259
nsgmls       13296
# ps -ef | grep nsgmls
root 591 590 2 07:56:32 pts/2 0:00 /usr/lib/sgml/nsgmls -gl -m/usr/share/lib/sgml/locale/C/dtds/catalog -E0 /usr/s
# man nsgmls
No manual entry for nsgmls.
# man -k sgml
sgml sgml (5) - Standard Generalized Markup Language
solbook sgml (5) - Standard Generalized Markup Language

Apparently some process is working with the Standard Generalized Markup Language (SGML). Use the ptree command to see who is creating this process:

# ptree 591
#
The ptree command returns no results because the nsgmls process is too short-lived for the command to be run on it. You have learned, however, that the problem is not a long-lived process causing a memory leak. Now write a quick D script to print out the ancestry. You must follow the parent pointers back one generation at a time inside the script, because many of the other processes involved are also short-lived.

Note: This particular D script fails if an ancestor does not exist, because the top ancestor, the sched process, has no parent. You cannot harm the kernel even if a D script uses a bad pointer. The intent of this example is to show how you can quickly create custom D scripts to answer questions about system behavior. Many of your D scripts will be throw-away scripts that you will not re-use. You can fix the script by testing each parent pointer with a predicate before printing. You will see this fix later with the ancestors3.d D script.
# cat ancestors.d
#!/usr/sbin/dtrace -qs
syscall::brk:entry
/execname == "nsgmls"/
{
    printf("process: %s\n",
        curthread->t_procp->p_user.u_psargs);
    printf("parent: %s\n",
        curthread->t_procp->p_parent->p_user.u_psargs);
    printf("grandparent: %s\n",
        curthread->t_procp->p_parent->p_parent->p_user.u_psargs);
    printf("greatgrandparent: %s\n",
        curthread->t_procp->p_parent->p_parent->p_parent->p_user.u_psargs);
    printf("greatgreatgrandparent: %s\n",
        curthread->t_procp->p_parent->p_parent->p_parent->p_parent->p_user.u_psargs);
    printf("greatgreatgreatgrandparent: %s\n",
        curthread->t_procp->p_parent->p_parent->p_parent->p_parent->p_parent->p_user.u_psargs);
}
# ./ancestors.d
process: /usr/lib/sgml/nsgmls -gl -m/usr/share/lib/sgml/locale/C/dtds/catalog -E0 /usr/s
parent: /usr/lib/sgml/instant -d -c/usr/share/lib/sgml/locale/C/transpec/roff.cmap -s/u
grandparent: /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman4/rt_dptbl.4
greatgrandparent: sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share/man/sman4/rt_dptbl.
greatgreatgrandparent: catman
greatgreatgreatgrandparent: bash
# ps -ef | grep catman
root 2333 2332 1 08:26:05 pts/1
root 16984 2880 0 08:41:10 pts/2
# ptree 2333
299 /usr/sbin/inetd -s
2324 in.rlogind
The previous output indicates that all of the brk(2) system calls resulted from the catman(1M) command, which creates many short-lived children that issue this system call. The curthread built-in D variable gives access to the address of the running kernel thread. Like the C language, the D language accesses members of a structure with the -> symbol when you have a pointer to that structure. Through this pointer to the kernel kthread_t structure, you can access the process name and arguments (kept in the proc_t structure's p_user structure) as well as those of any parent, grandparent, greatgrandparent, and so on. To do this you follow the parent pointers back. Refer to the <sys/thread.h>, <sys/proc.h>, and <sys/user.h> header files for details of these fields.
Figure 2-1 shows a diagram of the kernel data structures being accessed by this example.

[Figure 2-1: The curthread pointer references a kthread_t structure (/usr/include/sys/thread.h) with members t_state, t_pri, t_lwp, and t_procp. The t_procp member points to a proc_t structure (/usr/include/sys/proc.h) with members p_exec, p_as, p_cred, p_parent, p_tlist, and p_user. Each p_parent member points to the proc_t structure of the parent process. Each p_user member is a user_t structure (/usr/include/sys/user.h) with members u_start, u_ticks, u_psargs[ ], and u_cdir.]
D Language Variables
The D language has five basic variable types:

Scalar variables - Have fixed-size values such as integers, structures, and pointers
Associative arrays - Store values indexed by one or more keys, similar to aggregations
Thread-local variables - Have one name, but storage is local to each separate kernel thread. These variables are prefixed with the self-> keyword.
Clause-local variables - Appear when an action block is entered; storage is reclaimed after leaving the probe clause. These variables are prefixed with the this-> keyword.
Kernel external variables - DTrace has access to all kernel global and static variables. These variables are prefixed with a backquote (`).
Associative arrays (start, command, and mypid) were used in the iosnoop.d script. Clause-local variables are similar to automatic or local variables in the C language. The elapsed variable in the iosnoop.d script was a global scalar variable, but could have been made into a clause-local variable, which is slightly more efficient. Clause-local variables come into existence when an action block (tied to a specific probe) is entered and their storage is reclaimed when the action block is left. They help save storage and are quicker to access than associative arrays.

Note: For more information on D variables, refer to the Solaris Dynamic Tracing Guide, part number 817-6223-10.

You can access kernel global and static variables within your D programs. To access these external variables, you prefix the global kernel variable with the ` (back quote or grave accent) character. For example, to reference the freemem kernel global variable use: `freemem. If the variable is part of a kernel module that conflicts with other module variable names, use the ` character between the module name and the variable name. For example, sd`sd_state references the sd_state variable within the sd kernel module.
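The following minimal sketch combines a kernel external variable with a clause-local variable. The clause-local name this->pages and the one-second interval are assumptions made for this example.

```d
#!/usr/sbin/dtrace -qs

tick-1sec
{
	/* `freemem is a kernel global; this->pages lives only in this clause. */
	this->pages = `freemem;
	printf("free pages: %d\n", this->pages);
}
```

Because this->pages is reclaimed when the clause finishes, nothing needs to be reset between firings.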
Associative Arrays
Associative arrays enable the storing of scalar values in elements of an array (or table) that are identified by a sequence of one or more comma-separated key fields (an n-tuple). The keys can be any combination of strings or integers. The following code example shows the use of an associative array to track how often any command issues more than a given number of any single system call:

# cat assoc2.d
#!/usr/sbin/dtrace -qs
syscall:::entry
{
    ++namesys[pid,probefunc];
    x = namesys[pid,probefunc] > 5000 ? 1 : 0;
}
syscall:::entry
/x && execname != "dtrace"/
{
    printf("Process: %d %s has just made more than 5000 %s calls\n",
        pid, execname, probefunc);
    namesys[pid,probefunc] = 0; /* reset the count */
}
# ./assoc2.d
Process: 14837 find has just made more than 5000 lstat64 calls
Process: 14837 find has just made more than 5000 lstat64 calls
Process: 14854 ls has just made more than 5000 lstat64 calls
Process: 14854 ls has just made more than 5000 acl calls
Process: 14854 ls has just made more than 5000 lstat64 calls
^C
The assoc2.d D program uses an associative array indexed by the unique combination of process ID (PID) and system call name. The ++ operator increments the array element by one each time a process with that PID makes that system call. The array element, like all variables (except clause-local variables), is initialized to 0. The second statement in the action block uses a conditional expression that has three parts: the test (namesys[pid,probefunc] > 5000), the value used when the test is true (1), and the value used when the test is false (0).
Thread-Local Variables
Thread-local variables are useful when you wish to enable a probe and mark with a tag every thread that fires the probe. Thread-local variables share a common name but refer to separate data storage associated with each thread. Thread-local variables are referenced with the special keyword self followed by the two characters ->, as shown in the following example:

syscall::read:entry
{
    self->read = 1;
}
syscall::read:return
/self->read/
{
    printf("Same thread is returning from read\n");
}
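A common use of this tagging pattern is to carry a timestamp from an entry probe to the matching return probe in the same thread. The sketch below is an illustration under that assumption; the variable name self->ts is hypothetical.

```d
#!/usr/sbin/dtrace -s

/* Record the entry time in storage private to this thread. */
syscall::read:entry
{
	self->ts = timestamp;
}

/* Only threads that recorded an entry time are measured. */
syscall::read:return
/self->ts/
{
	@[execname] = quantize(timestamp - self->ts);
	self->ts = 0;	/* release the thread-local storage */
}
```

Each thread sees its own copy of self->ts, so concurrent reads in different threads do not overwrite one another's start times.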
    printf("%20s\t%10s\n", "Syscall", "Microseconds");
}
syscall:::entry
/execname == "grep"/
{
    self->name[probefunc] = timestamp;
}
syscall:::return
/self->name[probefunc]/
{
    printf("%20s\t%10d\n", probefunc,
        (timestamp - self->name[probefunc])/1000);
    self->name[probefunc] = 0; /* free memory */
}
# ./timesys.d
System Call Times for grep:
Syscall        Microseconds
mmap           50
resolvepath    47
resolvepath    67
stat           37
open           46
stat           34
open           32
...
brk            25
open64         43
read           8126
brk            20
brk            28
read           24
close          26
^C

Predictably, the system call that took the most time was read, because of the disk I/O wait time (the second read was of 0 bytes).
  0  <- ufs_rwlock
  0  -> fop_read
  0  <- fop_read
  0  -> ufs_read
  0  -> ufs_lockfs_begin
...
  0  -> rdip
  0  -> rw_write_held
  0  <- rw_write_held
  0  -> segmap_getmapflt
  0  -> get_free_smp
  0  -> grab_smp
  0  -> segmap_hashout
...
  0  <- sfmmu_kpme_lookup
  0  -> sfmmu_kpme_sub
...
  0  <- page_unlock
  0  <- grab_smp
  0  -> segmap_pagefree
  0  -> page_lookup_nowait
  0  -> page_trylock
...
  0  <- segmap_hashin
  0  -> segkpm_create_va
  0  <- segkpm_create_va
  0  -> fop_getpage
  0  -> ufs_getpage
  0  -> ufs_lockfs_begin_getpage
  0  -> tsd_get
...
  0  <- page_exists
  0  -> page_lookup
  0  <- page_lookup
  0  -> page_lookup_create
  0  <- page_lookup_create
  0  -> ufs_getpage_miss
  0  -> bmap_read
  0  -> findextent
  0  <- findextent
  0  <- bmap_read
  0  -> pvn_read_kluster
  0  -> page_create_va
  0  -> lgrp_mem_hand
...
  0  <- page_add
Obtaining System Call Information 0 0 0 0 0 0 0 0 0 ... 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 ... 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 <- page_create_va <- pvn_read_kluster -> pagezero -> ppmapin -> sfmmu_get_ppvcolor <- sfmmu_get_ppvcolor -> hat_memload -> sfmmu_memtte <- sfmmu_memtte -> xt_some <- xt_some <- xt_sync <- sema_init <- pageio_setup -> lufs_read_strategy -> logmap_list_get <- logmap_list_get -> bdev_strategy -> bdev_strategy_tnf_probe <- bdev_strategy_tnf_probe <- bdev_strategy -> sdstrategy -> getminor <-> <-> drv_usectohz timeout timeout timeout_common
<- getminor -> scsi_transport <- scsi_transport -> glm_scsi_start -> ddi_get_devstate <- ddi_get_soft_state -> pci_pbm_dma_sync <- pci_pbm_dma_sync <- pci_dma_sync <- glm_start_cmd <- glm_accept_pkt <- glm_scsi_start <- sd_start_cmds <- sd_core_iostart
  0  <- xbuf_iostart
  0  <- lufs_read_strategy
  0  -> biowait
  0  -> sema_p
  0  -> disp_lock_enter
  0  <- disp_lock_enter
  0  -> thread_lock_high
  0  <- thread_lock_high
  0  -> ts_sleep
  0  <- ts_sleep
  0  -> disp_lock_exit_high
  0  <- disp_lock_exit_high
  0  -> disp_lock_exit_nopreempt
  0  <- disp_lock_exit_nopreempt
  0  -> swtch
  0  -> disp
  0  -> disp_lock_enter
  0  <- disp_lock_enter
  0  -> disp_lock_exit
  0  <- disp_lock_exit
  0  -> disp_getwork
  0  <- disp_getwork
  0  <- disp
  0  <- swtch
  0  -> resume
  0  <- resume
  0  -> disp_lock_enter
...
  0  <- hat_page_getattr
  0  <- segmap_getmapflt
  0  -> uiomove
  0  -> xcopyout
  0  <- xcopyout
  0  <- uiomove
  0  -> segmap_release
  0  -> get_smap_kpm
...
  0  <- ufs_imark
  0  <- ufs_itimes_nolock
  0  <- rdip
...
  0  <- cv_broadcast
  0  <- releasef
  0  <- read
  0  -> read
Although more than half of the functions were removed from the previous output, the example shows that a great many functions are required to perform a disk file read. Some of the key functions are described below:

read - read(2) system call entered
ufs_read - UFS file being read
segmap_getmapflt - Find segmap page for the I/O
segmap_pagefree - Free underlying previous physical page tied to this segmap virtual page onto the cachelist (this policy replaced the old priority paging)
ufs_getpage - Ask UFS to retrieve the page
page_lookup - First check to see if the page is in memory (it is not)
page_create_va - Get new physical page for the I/O
hat_memload - Map the virtual page to the physical page
xt_some - Issue cross-trap call to some CPUs
sdstrategy - Issue Small Computer System Interface (SCSI) command to read page from disk into segmap page
timeout - Prepare for SCSI timeout of disk read request
glm_scsi_start - In glm host bus adapter driver
biowait - Wait for block I/O
sema_p - Use semaphore to wait for I/O
ts_sleep - Put timesharing (TS) thread on sleep queue
swtch - Do a context switch (have thread give up the CPU while it waits for the I/O)
disp_getwork - Find another thread to run while this thread waits for its I/O
resume - I/O has completed and CPU is returned to resume running
uiomove - Move data from kernel buffer (page) to user-land buffer
segmap_release - Release segmap page for use by another I/O later
read - Read operation ends
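A function-flow trace of this kind can be produced by following one thread from system call entry to return with the fbt provider and the flowindent option. The following is a minimal sketch of that approach, not the course's own script; the thread-local name self->follow and the choice of grep as the traced executable are assumptions.

```d
#!/usr/sbin/dtrace -s
#pragma D option flowindent

/* Start following a thread when it enters read(2). */
syscall::read:entry
/execname == "grep"/
{
	self->follow = 1;
}

/* Every kernel function entry and return in the followed thread. */
fbt:::
/self->follow/
{
}

/* Stop after the first traced read(2) returns. */
syscall::read:return
/self->follow/
{
	self->follow = 0;
	exit(0);
}
```

Enabling every fbt probe is expensive, so a script like this is best run briefly and against a single, well-chosen system call.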
Note For the list of option names used in #pragma lines, see the Solaris Dynamic Tracing Guide, part number 817-6223-10.
$pid - Process ID of dtrace interpreter running script
$ppid - Parent process ID of dtrace interpreter running script
$uid - Real user ID of user running script
$gid - Real group ID of user running script
$0 - Name of script
$1, $2, $3, and so on - First, second, third command-line arguments passed to script
$$1, $$2, $$3, and so on - First, second, third command-line arguments converted to double quoted (" ") strings
The complete list of D macro variables is given in Appendix B. The following D script uses some of these D macro variables:

# cat -n params.d
     1  #!/usr/sbin/dtrace -s
     2  #pragma D option quiet
     3
     4  tick-2sec
     5  /$1 == $11 && $$3 == "fubar"/
     6  {
     7      printf("name of script: %s\n", $0);
     8      printf("pid of script: %d\n", $pid);
     9      printf("9th arg passed to script: %s\n", $$9);
    10      exit(0);
    11  }
# ./params.d 1 2 fubar 4 5 6 7 8 9 10 1
name of script: ./params.d
pid of script: 5363
9th arg passed to script: 9
# ./params.d 1 2 3 4 5 6 7 8 9 10 11
^C
Creating D Scripts That Use Arguments

The last invocation of the script did not output anything because the value of the first argument did not match the value of the eleventh argument. The following invocations show that the type and number of arguments must match those referenced inside the D script. This is an example of the error-checking capability of the DTrace facility:

# ./params.d 1 2 3 4 5 6 7 8 9
dtrace: failed to compile script ./params.d: line 5: macro argument $11 is not defined
# ./params.d 1 2 3 4 5 6 7 8 9 10 11 12 13
dtrace: failed to compile script ./params.d: line 12: extraneous argument '13' ($13 is not referenced)
# ./params.d a b c d e f g h i j k
dtrace: failed to compile script ./params.d: line 5: failed to resolve a: Unknown variable name

The defaultargs option to the dtrace(1M) command allows you to default the values of $1, $2, and so on to zero if the user does not type any arguments when invoking the dtrace(1M) command. The $$1, $$2, and so on references become NULL strings when the user does not type any arguments. Options can be specified on the dtrace(1M) command line as an argument to the -x option. The following examples show these features:

# cat -n args.d
     1  #!/usr/sbin/dtrace -qs
     2  BEGIN
     3  {
     4      x = 5;
     5  }
     6
     7  tick-2sec
     8  {
     9      x = x + $1;
    10      name = $$2
    11  }
    12
    13  tick-11sec
    14  {
    15      printf("x: %d\n", x);
    16      printf("name: %s\n", name);
    17      exit(0);
    18  }
# ./args.d 2 foo
x: 15
name: foo
# ./args.d
dtrace: failed to compile script args.d: line 10: macro argument $1 is not defined
# dtrace -x defaultargs -qs args.d
x: 5
name:
# dtrace -x defaultargs -qs args.d 2 3 4
dtrace: failed to compile script args.d: line 20: extraneous argument '4' ($3 is not referenced)
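The defaultargs option can also be set inside the script with a #pragma line, so that the script tolerates missing arguments however it is invoked. The following sketch illustrates this; the script itself is hypothetical and is not one of the course examples.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs

/*
 * With defaultargs set, an omitted $1 becomes 0 and an omitted $$2
 * becomes an empty string, so this compiles with no arguments.
 */
BEGIN
{
	printf("first argument: %d\n", $1);
	printf("second argument: \"%s\"\n", $$2);
	exit(0);
}
```

Extraneous arguments are still rejected at compile time, as the last invocation above shows.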
lseek        3
sigaction    5
ioctl        45
read         143
write        178
# ./ancestors2.d nsgmls brk
process: /usr/lib/sgml/nsgmls -gl -m/usr/share/lib/sgml/locale/C/dtds/catalog -E0 /usr/s
parent: /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman2/fork.2
grandparent: /bin/sh /usr/lib/sgml/sgml2roff /usr/share/man/sman2/fork.2
greatgrandparent: sh -c cd /usr/share/man; /usr/lib/sgml/sgml2roff /usr/share/man/sman2/fork.2
greatgreatgrandparent: catman
greatgreatgreatgrandparent: bash

You can run the same script with a different process name and system call, which shows the power of being able to pass in arguments to a D script:

# ./ancestors2.d vi sigaction
process: vi /etc/system
parent: bash
grandparent: -sh
greatgrandparent: /usr/sbin/in.telnetd
greatgreatgrandparent: /usr/lib/inet/inetd start
greatgreatgreatgrandparent: /sbin/init

The ancestors3.d D script fixes the problem with trying to print nonexistent ancestry:

# ./ancestors2.d cron read
dtrace: error on enabled probe ID 1 (ID 10: syscall::read:entry): invalid address (0x0) in action #4
dtrace: error on enabled probe ID 1 (ID 10: syscall::read:entry): invalid address (0x0) in action #4
dtrace: error on enabled probe ID 1 (ID 10: syscall::read:entry): invalid address (0x0) in action #4
# cat ancestors3.d
#!/usr/sbin/dtrace -qs

syscall::$2:entry
/execname == $$1/
{
        printf("process: %s\n", curthread->t_procp->p_user.u_psargs);
        nextpaddr = curthread->t_procp->p_parent;
}

syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
        printf("parent: %s\n", nextpaddr->p_user.u_psargs);
        nextpaddr = curthread->t_procp->p_parent->p_parent;
}

syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
        printf("grandparent: %s\n", nextpaddr->p_user.u_psargs);
        nextpaddr = curthread->t_procp->p_parent->p_parent->p_parent;
}

syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
        printf("greatgrandparent: %s\n", nextpaddr->p_user.u_psargs);
        nextpaddr = curthread->t_procp->p_parent->p_parent->p_parent->p_parent;
}

syscall::$2:entry
/(execname == $$1) && nextpaddr/
{
        printf("greatgreatgrandparent: %s\n", nextpaddr->p_user.u_psargs);
        nextpaddr = curthread->t_procp->p_parent->p_parent->p_parent->p_parent->p_parent;
}
# ./ancestors3.d cron read
process: /usr/sbin/cron
parent: /sbin/init
grandparent: sched
process: /usr/sbin/cron
parent: /sbin/init
grandparent: sched
process: /usr/sbin/cron
parent: /sbin/init
grandparent: sched
process: /usr/sbin/cron
parent: /sbin/init
grandparent: sched
^C
swrit/s  fork/s  exec/s  wchar/s
      0       0       0       15
      2       0       0       32
      2       0       0       17
      1       0       0       39
     34       3       3      317
      0       0       0       15
- free field: Displays the system's average value of freemem in kilobytes
- re field: Displays the average page reclaims per second
- sr field: Displays the average page scans per second performed by the page daemon
# cat vm.d
#!/usr/sbin/dtrace -qs
/*
 * Usage: vm.d interval count
 */

BEGIN
{
        printf("%8s %8s %8s\n", "free", "re", "sr");
}

tick-1sec
{
        ++i;
        @free["freemem"] = sum(8*`freemem);
}

vminfo:::pgrec
{
        ++re;
}

vminfo:::scan
{
        ++sr;
}

tick-1sec
/i == $1/
{
        normalize(@free, $1);
        printa("%8@d ", @free);
        printf("%8d %8d\n", re/i, sr/i);
        ++n;
        i = 0;
        re = 0;
        sr = 0;
        clear(@free);
}

tick-1sec
/n == $2/
{
        exit(0);
}
# ./vm.d 5 12
    free       re
  385296        0
  385296        0
  385296        0
  385296        0
  316180        2
   22297        1
    1976        2
    1964        3
    1971        2
    1968        3
    1964        3
    1955        4
Like the vmstat(1M) command, the vm.d script expects two arguments: an interval value and a count value. The i, re, sr, and n variables are D global scalar variables used for counting. Note the special reference to the kernel's freemem variable: `freemem. The script multiplies `freemem by 8 because it sums in units of kilobytes, not pages, and it assumes that a page is 8 Kbytes in size. The script uses the sum() aggregation with the normalize() built-in function, which divides the current sum by the interval value to get per-second averages. The script also clears the running sum of freemem every interval with the clear() built-in function. The printa() built-in function, which is covered in detail in Appendix A, prints the value of the sum() aggregation.

Because the script uses integer-truncated arithmetic, you can lose some data. This is also true when using the vmstat(1M) command. For example, if there are only four page reclaims in the five-second interval, then the average per second shows as 0.

This output shows that the system is experiencing sustained scanning of memory by the page daemon, as indicated by the consistently high number of scans per second. It also shows that most of the free memory was consumed within a short period of time, which explains the high scan rates.
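The integer truncation and the page-to-kilobyte conversion described above can be reproduced outside DTrace. The following C fragment is a minimal sketch, not part of vm.d: the function names per_second_average() and pages_to_kbytes() are hypothetical, and the 8-Kbyte page size is the same assumption that the script makes.

```c
#include <stdint.h>

/*
 * Hypothetical helper: average number of events per second over an
 * interval, using integer division as the D script does. Any
 * fractional part of the result is discarded.
 */
int64_t per_second_average(int64_t total_events, int64_t interval_secs)
{
    return total_events / interval_secs;
}

/*
 * Hypothetical helper: convert a count of pages to Kbytes, assuming
 * 8-Kbyte pages (the same assumption vm.d makes).
 */
int64_t pages_to_kbytes(int64_t pages)
{
    return pages * 8;
}
```

With four page reclaims in a five-second interval, per_second_average(4, 5) returns 0, which is the loss of data that the text describes.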
Module 3
- Use DTrace to profile an application
- Use DTrace to access application variables
- Use DTrace to find transient system call errors in an application
- Use DTrace to determine the names of files being opened
Relevance
Discussion: The following questions are relevant to understanding how to use DTrace for application debugging:

- Would it be useful to follow the software stack sequentially from the application into the kernel?
- Would it be useful to display path names being passed to system calls while an application is running?
- Would it be useful to know where an application is spending the majority of its time?
Additional Resources
Additional resources: The following references provide additional information on the topics described in this module:

- Sun Microsystems, Inc. Solaris Dynamic Tracing Guide, part number 817-6223-10.
- Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. "Dynamic Instrumentation of Production Systems." Paper presented at 2004 USENIX Conference.
- BigAdmin System Administration Portal [http://www.sun.com/bigadmin/content/dtrace].
- dtrace(1M) manual page in the Solaris 10 OS manual pages, Solaris 10 Reference Manual Collection.
- The /usr/demo/dtrace directory contains all of the sample scripts from the Solaris Dynamic Tracing Guide.
Application Profiling
DTrace provides tools for understanding the behavior of user processes. It can help you to:

- Debug applications
- Analyze application performance problems
- Understand the behavior of a complex application
These tools can be used alone to determine the cause of problems with application program behavior, or as an adjunct to traditional debugging tools such as the mdb(1) debugger. This module describes the DTrace facilities used to trace user process activity. It also provides examples of how to use those facilities.
proc profile sched sdt syscall sysinfo vminfo #
# echo $$
8577
# dtrace -n 'pid8567:libc:strcmp:entry'
dtrace: description 'pid8567:libc:strcmp:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
  0  45136                     strcmp:entry
{
        int r;

        usleep(650);
        r = f4(a-3, a+3);
        return(r);
}

int f2(int a)
{
        return(f3(5*a));
}

int f1(int a, int b)
{
        int r;

        usleep(90);
        r = f2(a-b);
        return(r);
}

main()
{
        int x;

        x = f1(13,6);
        printf("%d\n", x);
        x = f1(17,5);
        printf("%d\n", x);
}
# gcc calls.c -o calls
# calls
83
133
# cat tracecalls.d
#!/usr/sbin/dtrace -s

pid$1:calls:$2:entry
{
        self->trace = 1;
}

pid$1:calls:$2:return
/self->trace/
{
You start the calls application in a second window through the mdb(1) debugger. This enables you to stop it as soon as possible in the start-up function that calls the main() function. The _start:b command sets a breakpoint in the _start function where the application starts running. The :r command starts the process running; it immediately hits the breakpoint and stops. You then escape from the debugger by using the !ps command to find the PID of the calls process:

# mdb calls
> _start:b
> :r
mdb: stop at _start
mdb: target stopped at:
_start:         clr       %fp
> !ps
   PID TTY      TIME CMD
  8916 pts/3    0:00 ps
  8914 pts/3    0:00 calls
  8586 pts/3    0:01 bash
  8915 pts/3    0:00 sh
  8580 pts/3    0:00 sh
  8913 pts/3    0:00 mdb

You can now run the dtrace command in the first terminal window to trace the function calls, starting with the f1 function. You must also continue the process with the :c mdb command after starting the dtrace command:

# dtrace -F -s tracecalls.d 8914 f1
dtrace: script 'tracecalls.d' matched 16 probes

In the second terminal window you continue the process:

> :c
83
133
mdb: target has terminated
> $q
The call sequence is shown in the first (dtrace) terminal window:

CPU FUNCTION
  0 -> f1
  0 -> f2
  0 -> f3
  0 -> f4
  0 -> f5
  0 <- f5
  0 <- f4
  0 <- f3
  0 <- f2
  0 -> f1
  0 -> f2
  0 -> f3
  0 -> f4
  0 -> f5
  0 <- f5
  0 <- f4
  0 <- f3
  0 <- f2
^C
# dtrace -F -s tracecalls2.d 8944 f1
dtrace: script 'tracecalls2.d' matched 16 probes
CPU FUNCTION
  0 -> f1        13        6
  0 -> f2         7        7
  0 -> f3        35       35
  0 -> f4        32       38
  0 -> f5        32       38
  0 <- f5        40       70
  0 <- f4        56       83
  0 <- f3        68       83
  0 <- f2        52       83
  0 -> f1        17        5
  0 -> f2        12       12
  0 -> f3        60       60
  0 -> f4        57       63
  0 -> f5        57       63
  0 <- f5        40      120
  0 <- f4        56      133
  0 <- f3        68      133
  0 <- f2        52      133
^C

The following commands are entered in the mdb(1) window that started the calls program. On return from a function, the arg0 argument is the offset within the function where the restore instruction executed to leave the function, and the arg1 argument is the return value, as follows:

> f5+0t40/i
f5+0x28:
f5+0x28:        restore
> f5+0x24,2/i
f5+0x24:
f5+0x24:        ret
f5+0x28:        restore
> f2+0t48,2/i
f2+0x30:
f2+0x30:        ret
f2+0x34:        restore
The f5+0t40 address represents 40 decimal bytes into the f5 function, which the trace output shows was placed in the arg0 argument when the f5 function returned. For arg1, the return value from the f5 function on the first return was 70; on the second return it was 120. The f5+0x24,2/i command in the mdb(1) debugger displays two instructions starting at address f5+0x24. Functions typically return by using these two SPARC instructions. All SPARC instructions are four bytes in length. At address f2+0x34 is another restore instruction.
        r = f2(a-b);
        return(r);
}

main()
{
        int x;

        x = f1(13,6);
        printf("%d\n", x);
}
# cat traceall.d
#!/usr/sbin/dtrace -qs
#pragma D option flowindent

pid$1::$2:entry
{
        self->trace = 1;
}

pid$1:::entry, pid$1:::return, fbt:::
/self->trace/
{
        printf("%s\n", curlwpsinfo->pr_syscall ? "K" : "U");
}

pid$1::$2:return
/self->trace/
{
        self->trace = 0;
}

The traceall.d D script uses a #pragma statement to set the equivalent of the -F option of the dtrace(1M) command, which indents the function calls. The pr_syscall field of the lwp information data structure, to which the curlwpsinfo built-in variable points, is 0 when the thread is not in the kernel; otherwise, it is the system call number. You use this field to indicate whether you are tracing user code or kernel code. The traced calls follow. Many of the function calls are for setting up the dynamic binding to the library functions on first call. The following example shows a portion of the output of this script:

# traceall.d 12861 main
CPU FUNCTION
  0 -> main                 U
  0 -> f1                   U
  0 -> f2                   U
  0 -> f3                   U
  0 -> f4                   U
  0 -> f5                   U
  0 <- f5                   U
  0 <- f4                   U
  0 <- f3                   U
  0 <- f2                   U
  0 <- f1                   U
  0 -> elf_rtbndr           U
  0 -> elf_bndr             U
  0 -> enter                U
  0 -> rt_bind_guard        U
  0 <- rt_bind_guard        U
  0 -> _ti_bind_guard       U
  0 <- _ti_bind_guard       U
  0 -> rt_mutex_lock        U
  0 <- rt_mutex_lock        U
  0 -> _lwp_mutex_lock      U
  0 <- _lwp_mutex_lock      U
  0 <- enter                U
  0 -> lookup_sym           U
  0 -> elf_hash             U
  0 <- elf_hash             U
  0 -> callable             U
  0 <- callable             U
  0 -> elf_find_sym         U
  0 -> strcmp               U
...
  0 <- elf_bndr             U
  0 <- elf_rtbndr           U
  0 -> printf               U
  0 -> _flockget            U
  0 -> mutex_lock           U
  0 <- mutex_lock           U
  0 -> mutex_lock_impl      U
  0 <- mutex_lock_impl      U
  0 <- _flockget            U
  0 -> _setorientation      U
  0 <- _setorientation      U
  0 -> _ndoprnt             U
  0 -> elf_rtbndr           U
  0 -> elf_bndr             U
  0 -> enter                U
  0 -> rt_bind_guard        U
...
  0 -> _write               U
  0 -> pre_syscall          K
  0 -> syscall_mstate       K
  0 <- syscall_mstate       K
  0 <- pre_syscall          K
  0 -> write32              K
  0 <- write32              K
  0 -> write                K
  0 -> getf                 K
  0 -> set_active_fd        K
...
  0 <- clear_active_fd      K
  0 -> cv_broadcast         K
  0 <- cv_broadcast         K
  0 <- releasef             K
  0 <- write                K
  0 -> post_syscall         K
  0 -> clear_stale_fd       U
  0 <- clear_stale_fd       U
  0 -> syscall_mstate       U
  0 <- syscall_mstate       U
  0 <- post_syscall         U
  0 <- _xflsbuf             U
  0 -> ferror_unlocked      U
  0 <- ferror_unlocked      U
  0 <- _ndoprnt             U
  0 -> ferror_unlocked      U
  0 <- ferror_unlocked      U
  0 -> mutex_unlock         U
  0 <- mutex_unlock         U
  0 <- printf               U
  0 <- main                 U
^C
  0  39513                     strcmp:4c
  0  39582                     strcmp:160
  0  39583                     strcmp:164
  0  39584                     strcmp:168
  0  39585                     strcmp:16c
  0  39586                     strcmp:170
  0  39587                     strcmp:174
  0  39588                     strcmp:178
  0  39589                     strcmp:17c
  0  39597                     strcmp:19c
  0  39598                     strcmp:1a0
  0  39599                     strcmp:1a4
  0  39600                     strcmp:1a8
  0  39601                     strcmp:1ac
  0  39602                     strcmp:1b0
  0  39603                     strcmp:1b4
  0  39604                     strcmp:1b8
  0  39605                     strcmp:1bc
  0  39606                     strcmp:1c0
  0  39607                     strcmp:1c4
  0  39618                     strcmp:1f0
  0  39619                     strcmp:1f4
  0  39493                     strcmp:return
  0  39494                     strcmp:entry
  0  39495                     strcmp:0
  0  39496                     strcmp:4

The previous output shows the strcmp function executing each instruction sequentially until the instruction at strcmp+0x18 branches to strcmp+0x44. You can display some of the assembly instructions using the mdb(1) debugger:

# mdb -p 8567
Loading modules: [ ld.so.1 libc.so.1 ]
> libc`strcmp,14/ai
libc.so.1`strcmp:
libc.so.1`strcmp:       subcc     %o0, %o1, %o2
libc.so.1`strcmp+4:     be        +0xac         <libc.so.1`strcmp+0xb0>
libc.so.1`strcmp+8:     sethi     %hi(0x1010000), %o5
libc.so.1`strcmp+0xc:   andcc     %o0, 3, %o3
libc.so.1`strcmp+0x10:  or        %o5, 0x101, %o5
libc.so.1`strcmp+0x14:  be        +0x30         <libc.so.1`strcmp+0x44>
libc.so.1`strcmp+0x18:  sll       %o5, 7, %o4
libc.so.1`strcmp+0x1c:  sub       %o3, 4, %o3
libc.so.1`strcmp+0x20:  ldub      [%o1 + %o2], %o0
libc.so.1`strcmp+0x24:  ldub      [%o1], %g1
libc.so.1`strcmp+0x28:  subcc     %o0, %g1, %o0
libc.so.1`strcmp+0x2c:  bne       +0x1c4        <libc.so.1`strcmp+0x1f0>
libc.so.1`strcmp+0x30:  addcc     %o0, %g1, %g0
libc.so.1`strcmp+0x34:  be        +0x1bc        <libc.so.1`strcmp+0x1f0>
libc.so.1`strcmp+0x38:  addcc     %o3, 1, %o3
libc.so.1`strcmp+0x3c:  bne       -0x1c         <libc.so.1`strcmp+0x20>
libc.so.1`strcmp+0x40:  add       %o1, 1, %o1
libc.so.1`strcmp+0x44:  andcc     %o1, 3, %o3
libc.so.1`strcmp+0x48:  be        +0x118        <libc.so.1`strcmp+0x160>
libc.so.1`strcmp+0x4c:  cmp       %o3, 2
The instruction at the strcmp+0x18 address is a shift left logical (sll), which is in the delay slot after the conditional branch instruction (be). This instruction executes before the one at address strcmp+0x44 even when the branch is taken, which in this execution it was. Another conditional branch was taken at address strcmp+0x48.

DTrace enables you to trace, instruction by instruction, the actual execution flow through the logic of a program. This is an improvement over the traditional debugging techniques of inserting print statements in your application or of running the application under a debugger and setting breakpoints where appropriate.
...
  usleep
           value  ------------- Distribution ------------- count
         1048576 |                                         0
         2097152 |@@@@@@@@@@                               1
         4194304 |@@@@@@@@@@                               1
         8388608 |@@@@@@@@@@@@@@@@@@@@                     2
        16777216 |                                         0
...
  f4
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |@@@@@@@@@@@@@@@@@@@@                     1
           65536 |@@@@@@@@@@@@@@@@@@@@                     1
          131072 |                                         0
...
  f1
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@                     1
        16777216 |@@@@@@@@@@@@@@@@@@@@                     1
        33554432 |                                         0
...
  main
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
        67108864 |                                         0
- profile-200: Fires 200 times per second on every CPU
- profile-5ms: Fires every 5 milliseconds on every CPU
- profile-5000us: Fires every 5000 microseconds on every CPU
profile-1d profile-24h
The following script should output numbers that increase by approximately one million (nanoseconds):

# dtrace -q -n 'profile-1ms {printf("%d\n", timestamp)}'
274817618640560
274817619628282
274817620626998
274817621624780
274817622624686
^C

Currently you cannot specify a time interval less than 200 microseconds with the profile provider, as the following example shows:

# dtrace -q -n 'profile-199us {printf("%d\n", timestamp)}'
dtrace: invalid probe specifier profile-199us {printf("%d\n", timestamp)}: probe description :::profile-199us does not match any probes
# dtrace -q -n 'profile-200us {printf("%d\n", timestamp)}'
275328143837997
275328144030602
275328144229696
275328144431022
^C
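The relationship between the probe-name suffixes and the firing interval can be checked with simple arithmetic. The following C fragment is an illustrative sketch only; the function names are hypothetical, and the 200-microsecond minimum is the limit stated in the text above, not a value read from the system.

```c
#include <stdint.h>

#define NANOSEC 1000000000LL

/* Interval in nanoseconds for a bare numeric suffix, which is a rate
 * in firings per second (for example, profile-200). */
int64_t rate_to_interval_ns(int64_t hz)
{
    return NANOSEC / hz;
}

/* Interval in nanoseconds for a "us" suffix (for example, profile-5000us). */
int64_t us_to_ns(int64_t us)
{
    return us * 1000;
}

/* Interval in nanoseconds for an "ms" suffix (for example, profile-5ms). */
int64_t ms_to_ns(int64_t ms)
{
    return ms * 1000000;
}

/* Returns 1 if the interval satisfies the 200-microsecond minimum
 * described in the text, otherwise 0. */
int interval_allowed(int64_t interval_ns)
{
    return interval_ns >= us_to_ns(200);
}
```

Under these definitions, profile-200, profile-5ms, and profile-5000us all correspond to the same 5,000,000-nanosecond interval, and profile-1ms corresponds to the one-million-nanosecond step seen in the timestamp output.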
# cat running.d
#!/usr/sbin/dtrace -qs

profile-109
/pid != 0/
{
        @[pid, execname] = count();
}

END
{
        printf("%-8s %-40s %s\n", "PID", "CMD", "COUNT");
        printa("%-8d %-40s %@d\n", @);
}
# ./running.d
^C
PID      CMD                                      COUNT
9190     grep                                     1
9191     bash                                     1
9190     bash                                     1
9189     bash                                     1
9188     uptime                                   2
8586     bash                                     2
9191     vi                                       12
3        fsflush                                  24
9192     find                                     80

You can use a profile-n probe to sample information about a specific process. The following script samples, slightly more often than every 5 milliseconds, the priority of the shell thread while it is running in an infinite loop:

# echo $$
8586
# while : ; do : ; done
In another window, run the following D script:

# cat profilepri.d
#!/usr/sbin/dtrace -qs
profile-211
/pid == $1/
{
        @[execname] = lquantize(curlwpsinfo->pr_pri, 0, 100, 10);
}
# ./profilepri.d 8586
^C

  bash
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@                 271
              10 |@@@@@@                                   63
              20 |@@@@                                     48
              30 |@@@                                      32
              40 |@                                        15
              50 |@@                                       24
              60 |                                         0
In the previous example, the curlwpsinfo built-in variable points to a structure containing lwp information. This structure is described in the proc(4) manual page. The output shows the Solaris timesharing scheduler's bias toward priority zero for compute-bound threads. The high counts indicate that this thread is running more frequently than other threads on the system. In the following example, you see the results of running the script again when the shell is in its more normal mode of executing a few interactive commands:

# ./profilepri.d 8586
^C

  bash
           value  ------------- Distribution ------------- count
              30 |                                         0
              40 |@@@@@@@@@@@@@                            1
              50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
              60 |                                         0
This shows that the shell's priority is higher when it is run interactively, where it spends most of its time waiting on input; the small counts indicate that it was not running frequently.
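The rows in the priority distributions above come from the linear buckets that lquantize() builds from its lower bound, upper bound, and step arguments. The following C fragment is a minimal sketch of that bucketing, written for illustration; it is not DTrace source code, the function names are hypothetical, and it assumes that the range is an exact multiple of the step.

```c
/*
 * Hypothetical helper: index of the linear bucket that a value falls
 * into. Index 0 is the "< base" row, indexes 1..levels are the linear
 * rows, and index levels + 1 is the ">= limit" row.
 */
int lquantize_bucket(int value, int base, int limit, int step)
{
    int levels = (limit - base) / step;

    if (value < base)
        return 0;
    if (value >= limit)
        return levels + 1;
    return (value - base) / step + 1;
}

/* Hypothetical helper: the value printed for a linear row (index 1..levels). */
int lquantize_row_value(int index, int base, int step)
{
    return base + (index - 1) * step;
}
```

With the arguments used in profilepri.d (0, 100, 10), a sampled priority of 59 is counted in the row labeled 50, and a priority of 15 is counted in the row labeled 10.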
- The arg0 argument: The PC register value in the kernel at the time the probe fired, or 0 if the current thread was not executing in the kernel at the time that the probe fired
- The arg1 argument: The PC register value in the user-level process at the time the probe fired, or 0 if the current thread was executing in the kernel at the time the probe fired
profile-1009
{
        ++t;
}

profile-1009
/pid == $1/
{
        @pc[arg1] = count();
        @mode[arg0 ? "kernel" : "user"] = count();
        ++n;
}

tick-10sec
/n/
{
        printa("%-10x\t%@u\n", @pc);
        printf("Total: %u out of %u\n", n, t);
        exit(0);
}
# ./profile.d 9240
ff3163ac        1
0               5
107f8           60
10810           60
10710           64
1084c           64
10754           65
10734           66
10824           69
1083c           69
10738           69
1081c           71
10820           73
106f4           75
10730           75
10728           76
10744           77
10814           77
1074c           79
1074c           79
106e4           79
106d8           79
1075c           80
10770           80
10828           80
1072c           82
10760           83
106f0           86
10758           86
106dc           87
106d4           88
106d0           92
ff2a11e8        132
10764           134
20ac8           137
20acc           141
ff2a11ec        142
10840           144
20ac4           147
10834           172
106cc           306
106e0           562
10714           611
107fc           623
ff2a11e4        716
ff2a11e0        3723
Total: 9887 out of 10002

  kernel                                                            5
  user                                                           9882

In the previous example, the high count in user mode versus kernel mode indicates that this process is compute-bound. By using the mdb(1) debugger as shown in the following example, you can tell where the process is spending most of its time:

> ff2a11e0/i
libc.so.1`.umul:
libc.so.1`.umul:umul      %o0, %o1, %o0
> ff2a11e4/i
libc.so.1`.umul+4:      rd        %y, %o1
> 107fc/i
mod+0x34:       cmp       %o0, %o1
> 10714/i
prod+0x1c:      cmp       %o0, %o1
> 106e0/i
sum+0x14:       add       %o0, %o1, %o0
This output shows that this process spent most of its time in the C library multiply function, .umul. It spent most of the remaining time in its own mod, prod, and sum functions. The programmer should investigate compiler options that cause the multiplication to be performed with hardware instructions instead of in software. This program was compiled with the gcc compiler with no optimizations.
count 0 15 0
           32768 |

  mutex_lock
           value  count
            8192      0
           16384     16
           32768      0
...
  sum
           value
            4096
            8192
           16384
           32768
           65536
          131072
          262144
          524288
         1048576
         2097152
...
  prod
           value  count
     17179869184      0
     34359738368     14
     68719476736      1
    137438953472      0
...
  .umul
           value
            4096
            8192
           16384
           32768
           65536
          131072
          262144
          524288
         1048576
         2097152
         4194304
This output shows that the process is spending an average of only 8 to 16 microseconds in both the sum and the .umul functions, but they are being called significantly more often than the other functions. The process spent between 34 and 68 seconds in the prod function on each of 14 of the times that it was called, and between 68 and 137 seconds the other time it was called.

Finally, the following command builds a table of which functions of an application are called the most frequently:

# dtrace -n 'pid$target:::entry {@[probefunc] = count()}' -c ./pgm
dtrace: description 'pid$target:::entry ' matched 2931 probes
^C
...
  main                                                              1
  hdl_create                                                        1
  elf_entry_pt                                                      1
  unused                                                            1
  rtld_db_postinit                                                  1
  call_init                                                         1
  munmap                                                            1
...
  printf                                                            3
  .rem                                                              3
  mod                                                               3
  free                                                              4
  prod                                                              4
  defrag                                                            4
  strncpy                                                           5
  plt_full_range                                                    5
  strlen                                                            5
...
  strcmp                                                           39
  rt_bind_clear                                                    42
  sum                                                         3549598
  .umul                                                       6249598
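The time ranges quoted for the sum, .umul, and prod functions (for example, 8 to 16 microseconds) follow from the power-of-two rows that the quantize() aggregation prints: each row counts values from its label up to, but not including, the next power of two. The following C fragment is a minimal sketch of that row selection for non-negative values; the function name is hypothetical and the code is not taken from DTrace.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the row label (largest power of two that is
 * less than or equal to the value) under which a non-negative value
 * is counted. A value of 0 is counted in the row labeled 0.
 */
uint64_t quantize_row_value(uint64_t value)
{
    uint64_t row = 1;

    if (value == 0)
        return 0;
    while (row <= value / 2)
        row *= 2;
    return row;
}
```

An elapsed time of 10,000 nanoseconds is therefore counted in the 8192 row, which covers roughly 8 to 16 microseconds, and an elapsed time of 40 seconds is counted in the 34359738368 row.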
Application Variables
Accessing process address space information is more difficult than accessing kernel information because DTrace actions run in the kernel. Therefore, to access process data such as application variables or system call argument strings (for example, path names), you must copy the information from the process address space to the kernel. DTrace provides two built-in functions to accomplish this:

- void *copyin(uintptr_t addr, size_t size)
  The copyin function copies the specified size in bytes from the specified user address into a DTrace scratch buffer and returns the address of this buffer. The user address is interpreted as being within the address space of the process associated with the currently running thread when the probe fires.
- string copyinstr(uintptr_t addr)
  The copyinstr function copies a null-terminated C string from the specified user address into a scratch buffer and returns its address.
int f3(int a)
{
        int r;

        usleep(650);
        r = f4(a-3, a+3);
        z = r*y;
        return(r);
}

int f2(int a)
{
        return(f3(5*a));
}

int f1(int a, int b)
{
        int r;

        usleep(90);
        r = f2(a-b);
        y = z*r;
        return(r);
}

main()
{
        int x;

        x = f1(13,6);
        printf("x=%d y=%d z=%d\n", x, y, z);
        x = f1(17,5);
        printf("x=%d y=%d z=%d\n", x, y, z);
}
# calls3
x=83 y=633788 z=7636
x=133 y=137443530 z=1033410

The following D script is passed three arguments:

- $1: The virtual address of a global variable
- $2: The global variable's size
- $$3: The name of the variable
You have dtrace(1M) start the process by using the -c option. dtrace(1M) sets the $target macro to the process PID. The script displays the value of a global variable on entry to and return from every function in the program that is called after the main function.

# cat uservariables.d
#!/usr/sbin/dtrace -qs

pid$target:a.out:main:entry
{
        started = 1;
}

pid$target:a.out::entry
/started/
{
        v = (int *)copyin($1, $2);
        printf("On entry to %s: %s=%d\n", probefunc, $$3, *v);
}

pid$target:a.out::return
/started/
{
        v = (int *)copyin($1, $2);
        printf("On return from %s: %s=%d\n", probefunc, $$3, *v);
}

pid$target:a.out:main:return
{
        exit(0);
}
The (int *) in front of the copyin function is called a cast, which is a feature taken from the C language. A cast converts one data type into another data type. In this case, the data type is converted from void *, which is the type of the buffer address into which the variable is copied, to an integer pointer, because you are copying in an integer. You use a * in front of the v variable in the printf statements to dereference the pointer and obtain the value to which it points, namely the integer. The nm(1) command is used to display the symbol table entry for the z variable in the calls3 executable file.

# /usr/ccs/bin/nm calls3 | grep '|z$'
[70]    |    133952|       4|OBJT |GLOB |0    |16     |z
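The cast and dereference described above behave the same way in plain C. The following fragment is a minimal sketch written for illustration: copy_to_scratch() is a hypothetical stand-in for copyin() that copies within a single process using malloc(3C) and memcpy(3C), so it does not cross the user and kernel boundary as the DTrace function does.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for copyin(): copy size bytes from addr into
 * a newly allocated scratch buffer and return it as a generic
 * (void *) pointer. Returns NULL if the allocation fails.
 */
void *copy_to_scratch(const void *addr, size_t size)
{
    void *scratch = malloc(size);

    if (scratch != NULL)
        memcpy(scratch, addr, size);
    return scratch;
}

/*
 * Copy an integer into a scratch buffer, cast the generic pointer to
 * (int *), and dereference it with * to obtain the integer value.
 * Returns 0 if the scratch buffer could not be allocated.
 */
int read_int_via_cast(const int *source)
{
    void *buf = copy_to_scratch(source, sizeof (int));
    int *v;
    int value;

    if (buf == NULL)
        return 0;
    v = (int *)buf;     /* the cast: void * becomes int * */
    value = *v;         /* the dereference */
    free(buf);
    return value;
}
```

Without the cast, the scratch buffer address has no element type, so the * operator cannot be applied to it.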
# dtrace -qs uservariables.d -c calls3 133952 4 z
x=83 y=633788 z=7636
x=133 y=137443530 z=1033410
On entry to main: z=8
On entry to f1: z=8
On entry to f2: z=8
On entry to f3: z=8
On entry to f4: z=8
On entry to f5: z=8
On return from f5: z=9
On return from f4: z=9
On return from f3: z=7636
On return from f2: z=7636
On return from f1: z=7636
On entry to f1: z=7636
On entry to f2: z=7636
On entry to f3: z=7636
On entry to f4: z=7636
On entry to f5: z=7636
On return from f5: z=7637
On return from f4: z=7637
On return from f3: z=1033410
On return from f2: z=1033410
On return from f1: z=1033410
On return from main: z=1033410

You can easily display the y variable, as follows:

# /usr/ccs/bin/nm calls3 | grep '|y$'
[67]    |    133948|       4|OBJT |GLOB |0    |16     |y
# dtrace -qs uservariables.d -c calls3 133948 4 y
x=83 y=633788 z=7636
x=133 y=137443530 z=1033410
On entry to main: y=15
On entry to f1: y=15
On entry to f2: y=15
On entry to f3: y=15
On entry to f4: y=15
On entry to f5: y=15
On return from f5: y=15
On return from f4: y=92
On return from f3: y=92
On return from f2: y=92
On return from f1: y=633788
On entry to f1: y=633788
On entry to f2: y=633788
On entry to f3: y=633788
On entry to f4: y=633788
On entry to f5: y=633788
On return from f5: y=633788
On return from f4: y=7770
On return from f3: y=7770
On return from f2: y=7770
On return from f1: y=137443530
On return from main: y=137443530
> ::nm ! grep '|errno$'
0xff3ee670|0x00000004|OBJT |LOCL |0x2  |21      |errno
0xff1ec03c|0x00000004|OBJT |GLOB |0x0  |21      |errno
> $q
# ./libvars.d 9583 0xff3ee670 4 errno
The value of errno=2
The value of errno=2
The value of errno=2
The value of errno=2
The value of errno=2
^C
# ./libvars.d 9583 0xff1ec03c 4 errno
The value of errno=0
The value of errno=0
The value of errno=0
The value of errno=0
The value of errno=0
^C

The libvars.d D script was run while the bash shell performed the following loop:

# while :; do cd /fubar; done
bash: cd: /fubar: No such file or directory
bash: cd: /fubar: No such file or directory
bash: cd: /fubar: No such file or directory
bash: cd: /fubar: No such file or directory

This shows that the first errno, at address 0xff3ee670, is the one set as a result of the cd command failing in the bash shell. The No such file or directory error message corresponds to an errno value of 2.
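The same errno value can be observed directly from a C program. The following fragment is a small illustrative sketch; the function name chdir_errno() is hypothetical. It attempts the chdir(2) call that the shell's cd command makes and returns the resulting errno value, which is ENOENT (2 on the Solaris OS and on most UNIX systems) when the directory does not exist.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to change to the given directory.
 * Returns 0 on success, or the errno value set by the failed
 * chdir(2) call.
 */
int chdir_errno(const char *path)
{
    if (chdir(path) == 0)
        return 0;
    return errno;
}
```

Calling chdir_errno() with a path that does not exist returns ENOENT, the value for which the shell prints No such file or directory.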
The plockstat Provider

  0  51496     __rw_unlock:rw-release  1529287107252733
  0  51474      rwlock_lock:rw-block   1529287107403403
  0  51494      rwlock_lock:rw-acquire 1529287107412819
  0  51496     __rw_unlock:rw-release  1529287107423097
  0  51474      rwlock_lock:rw-block   1529287107575211
  0  51494      rwlock_lock:rw-acquire 1529287107583872
  0  51496     __rw_unlock:rw-release  1529287107593238
  0  51474      rwlock_lock:rw-block   1529287107816907
  0  51494      rwlock_lock:rw-acquire 1529287107826079
  0  51496     __rw_unlock:rw-release  1529287107836362
  0  51474      rwlock_lock:rw-block   1529287107928393
  0  51494      rwlock_lock:rw-acquire 1529287107936277
  0  51496     __rw_unlock:rw-release  1529287107945832
  0  51474      rwlock_lock:rw-block   1529287108042880
  0  51494      rwlock_lock:rw-acquire 1529287108051591
  0  51496     __rw_unlock:rw-release  1529287108060852
  0  51474      rwlock_lock:rw-block   1529287108261326
  0  51494      rwlock_lock:rw-acquire 1529287108270476
  0  51496     __rw_unlock:rw-release  1529287108280748
The plockstat(1M) command is a DTrace consumer that uses the plockstat provider to show detailed application lock usage information. The plockstat(1M) command is comparable to the lockstat(1M) command, which shows lock contention details for kernel locks.
portfs         62
lwp_park       62
lwp_park       62
portfs         62
portfs         62
stat64          2
chdir           2
chdir           2
stat64          2
lwp_kill        3
open            2
stat            2
setpgrp        13
waitsys        10
open            2
stat            2
open            2
stat            2
setpgrp        13
waitsys        10
lwp_kill        3
setpgrp 13
        libc.so.1`_syscall6+0x1c
        35c6c
        34fa8
        bash`execute_command_internal+0x414
        bash`execute_command+0x50
        bash`reader_loop+0x220
        bash`main+0x90c
        bash`_start+0x108

svc.startd portfs 62
        libc.so.1`_portfs+0x4
        svc.startd`wait_thread+0x30
        libc.so.1`_lwp_start

svc.startd portfs 62
        libc.so.1`_portfs+0x4
        svc.startd`wait_thread+0x30
        libc.so.1`_lwp_start

bash waitsys 10
        libc.so.1`_waitid+0x8
        libc.so.1`waitpid+0x60
        410a0
        41004
        libc.so.1`__sighndlr+0xc
        libc.so.1`call_user_handler+0x3b8
        libc.so.1`__lwp_sigmask+0x30
        libc.so.1`pthread_sigmask+0x1b4
        libc.so.1`sigprocmask+0x20
        bash`make_child+0x254
        35c6c
        34fa8
        bash`execute_command_internal+0x414
        bash`execute_command+0x50
        bash`reader_loop+0x220
        bash`main+0x90c
        bash`_start+0x108

bash stat64 2
        libc.so.1`stat64+0x4
        bash`sh_canonpath+0x258
        63638
        bash`cd_builtin+0x364
        352a0
        35a8c
        34fc8
        bash`execute_command_internal+0x414
        bash`execute_command+0x50
        bash`reader_loop+0x220
        bash`main+0x90c
        bash`_start+0x108

find open 2
        ld.so.1`__open+0x4
        ld.so.1`elf_config+0x120
        ld.so.1`setup+0xc20
        ld.so.1`_setup+0x37c
        ld.so.1`_rt_boot+0x88
Hexadecimal addresses are shown in the stack trace output when the dtrace command cannot resolve the PC value to a symbol. To find which transient system call errors are occurring in a specific application, and where, you simply change the errno2.d script to pass in the PID of the application.
Transient System Call Errors

The unknown process is using a lot of system time. The following D program can determine what system calls are being made:

# dtrace -n 'syscall:::entry /pid == 12663/ { @syscalls[probefunc] = count();}'
dtrace: description 'syscall:::entry ' matched 226 probes
^C

  read                                                         940592

This process appears to be stuck in an endless loop of read(2) system calls. The following truss(1) command confirms this, and shows that the reads are failing:

# truss -p 12663
read(3, 0xFFBFFD0B, 1)                          Err#89 ENOSYS
read(3, 0xFFBFFD0B, 1)                          Err#89 ENOSYS
read(3, 0xFFBFFD0B, 1)                          Err#89 ENOSYS
read(3, 0xFFBFFD0B, 1)                          Err#89 ENOSYS
...

The errno2.d D script shows further evidence of a runaway loop of failing read(2) system calls:

# ./errno2.d
unknown read 89
        libc.so.1`_read+0x8
        unknown`main+0x134
        unknown`_start+0x5c

unknown read 89
        libc.so.1`_read+0x8
        unknown`main+0x134
        unknown`_start+0x5c

unknown read 89
        libc.so.1`_read+0x8
        unknown`main+0x134
        unknown`_start+0x5c
^C
# grep 89 /usr/include/sys/errno.h
/* Copyright (c) 1984, 1986, 1987, 1988, 1989 AT&T */
 * (c) 1983,1984,1985,1986,1987,1988,1989 AT&T.
#define ENOSYS 89 /* Unsupported file system operation */
# pkill unknown
Transient System Call Errors Suppose you saw the following similar prstat(1M) command output:
# prstat -m -p 12745 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 12745 root 17 81 0.0 0.0 0.0 0.0 0.0 1.5 0 132 .5M 0 readchar/1
You can again get details on what system calls are being made, as follows:
# dtrace -n 'syscall:::entry /pid == 12745/ { @syscalls[probefunc] = count();}' dtrace: description 'syscall:::entry ' matched 225 probes
^C
stat open write close read # truss -p 12745 read(3, "\b", 1) read(3, "92", 1) read(3, "10", 1) read(3, "\0", 1) read(3, "14", 1) read(3, " @", 1) read(3, "\0", 1) read(3, "82", 1) read(3, " #", 1) read(3, "90", 1) ^C = = = = = = = = = = 1 1 1 1 1 1 1 1 1 1 6 6 6 6 760747
As its name implies, this readchar process is reading a single character at a time. Now run the iosnoop.d D script from Module 2 to get details on the disk input/output (I/O):
# ./iosnoop.d
COMMAND    PID    FILE                        DEVICE  RW  MS
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   6.492
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   6.492
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   6.492
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   6.638
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   2.264
readchar   12745  <none>                      sd2     R   6.398
readchar   12745  /usr/lib/nss_ldap.so.1      sd2     R   0.696
readchar   12745  /usr/lib/passwdutil.so.1    sd2     R   0.729
readchar   12745  /usr/lib/passwdutil.so.1    sd2     R   1.133
readchar   12745  <none>                      sd2     R   6.646
readchar   12745  /usr/lib/passwdutil.so.1    sd2     R   5.656
readchar   12745  /usr/lib/watchmalloc.so.1   sd2     R   6.622
readchar   12745  /usr/lib/watchmalloc.so.1   sd2     R   6.842
readchar   12745  /usr/lib/watchmalloc.so.1   sd2     R   0.368
readchar   12745  /usr/lib/watchmalloc.so.1   sd2     R   6.488
readchar   12745  <none>                      sd2     R   6.315
readchar   12745  /usr/lib/cpp                sd2     R   7.896
This application appears to be reading all of the files under the /usr/lib directory one byte at a time. The programmer evidently did not realize that using the standard I/O library functions to buffer reads is more efficient than issuing a read system call for each character. The OS is reading the disk in blocks, as the iosnoop.d D script output indicates, but the application is extracting the information from the kernel buffers only one byte at a time.
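The cost difference can be illustrated outside DTrace. The following is a minimal sketch in Python, not part of the course material: a temporary file stands in for a file under /usr/lib, and the helper names count_reads and make_sample_file are hypothetical. It counts how many read calls are needed to consume the same data one byte at a time and in 8192-byte blocks.

```python
import os
import tempfile


def make_sample_file(size):
    """Create a temporary file of `size` bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"x" * size)
    return path


def count_reads(path, request_size):
    """Read the whole file with os.read() calls of `request_size` bytes.

    Returns the number of calls issued, including the final call that
    returns no data (end of file).
    """
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, request_size)
            calls += 1
            if not data:
                break
    finally:
        os.close(fd)
    return calls


if __name__ == "__main__":
    sample = make_sample_file(20000)
    try:
        print("1-byte requests:   ", count_reads(sample, 1))
        print("8192-byte requests:", count_reads(sample, 8192))
    finally:
        os.remove(sample)
```

For a 20000-byte file the one-byte loop issues 20001 calls, while the block loop issues 4. A buffered library layer, such as standard I/O in C, gives the application byte-level access while issuing only the larger requests.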
Open Files
In this section you learn how to display the path names of files being opened. Note that in DTrace it is more difficult to display pointer arguments passed to system calls than integer arguments. Examples of system calls that take pointer arguments are open(2), stat(2), unlink(2), and chmod(2), each of which takes a path name string argument. There are also system calls that pass the address of a structure, for example, sigaction(2). You must use the appropriate copyinstr() and copyin() built-in functions to display the actual strings or structures being passed to the kernel.
dtrace: error on enabled probe ID 1 (ID 12: syscall::write:entry): invalid address (0x10000) in action #1
dtrace: error on enabled probe ID 1 (ID 12: syscall::write:entry): invalid address (0x10000) in action #1
dtrace: error on enabled probe ID 1 (ID 12: syscall::write:entry): invalid address (0x10000) in action #1
^C
The arg1 argument used in the write.d D script is the second argument to the write(2) system call, which in this case is the address of the string you want to display. It is a process address, however, and DTrace runs the action statements in the kernel's address space. The stringof() built-in function converts the write(2) system call argument to the proper string type. For the script to work, you must use the copyinstr() or copyin() built-in DTrace functions shown previously. The following example shows the correct way to access the process's string arguments:

# cat -n write2.d
     1  #!/usr/sbin/dtrace -s
     2
     3  syscall::write:entry
     4  /pid == $target/
     5  {
     6      printf("%s\n", copyinstr(arg1));
     7  }
# dtrace -s write2.d -c writemsg
dtrace: script 'write2.d' matched 1 probe
This is some text being
written to standard output
to prove a point
dtrace: pid 1537 exited with status 1
CPU     ID                    FUNCTION:NAME
  0     12                      write:entry This is some text being
  0     12                      write:entry written to standard output
  0     12                      write:entry to prove a point
The following changes to the D script enable it to work on all system-wide write(2) system calls (except those issued by the dtrace(1M) command):

# cat -n write3.d
     1  #!/usr/sbin/dtrace -s
     2
     3  syscall::write:entry
     4  /pid != $pid/
     5  {
     6      printf("%s\n", copyinstr(arg1));
     7  }
# ./write3.d
dtrace: script './write3.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
ore--ion, name) iption specifiers (provider, module, funce describes how to use 4maction]]
  0    914                      write:entry sys61# ./write2.ddwrite2.dted token `newline' ctory
_________________________________________________________________________
_________________________________________________________________________
_____________________________________________________
  0    914                      write:entry pys61# ./write2.ddwrite2.dted token `newline' ctory
_________________________________________________________________________
_________________________________________________________________________
_____________________________________________________

You received garbage output because the write(2) system call does not necessarily write null-terminated strings. The copyin() built-in function is more appropriate because it lets you specify the size of the write:

# cat -n write4.d
     1  #!/usr/sbin/dtrace -s
     2
     3  syscall::write:entry
     4  /pid != $pid/
     5  {
     6      printf("%s\n", stringof(copyin(arg1, arg2)));
     7  }
# ./write4.d
dtrace: script './write4.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    914                      write:entry p
  0    914                      write:entry w
...
  0    914                      write:entry /var/dtrace/mod3
  0    914                      write:entry sys61#
^C
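The difference between the two copy styles can be modeled in a few lines. The following is a minimal sketch in Python, not DTrace code: the bytes object named memory is a hypothetical stand-in for process memory, and the two helper functions only imitate the behavior described above (copy up to the first null byte versus copy exactly the requested number of bytes).

```python
def copy_nul_terminated(memory, addr):
    """Model of a copyinstr()-style copy: read bytes until the first NUL."""
    end = memory.index(b"\0", addr)
    return memory[addr:end]


def copy_fixed_length(memory, addr, size):
    """Model of a copyin()-style copy: read exactly `size` bytes."""
    return memory[addr:addr + size]


# Hypothetical process memory: a 3-byte write buffer that is NOT
# null-terminated, followed by unrelated leftover bytes and then a NUL.
memory = b"pwd" + b"leftover junk" + b"\0"

if __name__ == "__main__":
    # The write was write(fd, buf, 3), so only 3 bytes are meaningful.
    print(copy_nul_terminated(memory, 0))   # runs on into the leftover bytes
    print(copy_fixed_length(memory, 0, 3))  # stops at the size of the write
```

The first call returns the write buffer plus whatever follows it in memory, which corresponds to the garbage seen with write3.d. The second returns only the bytes that were actually written, which is what write4.d achieves by passing arg2 as the size.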
man opening /var/ld/ld.config
man opening /lib/libc.so.1
man opening /usr/share/man/man.cf
man opening /usr/share/man/windex
man opening /usr/share/man/sman1m/dtrace.1m
sh opening /var/ld/ld.config
sh opening /lib/libc.so.1
more opening /var/ld/ld.config
more opening /lib/libcurses.so.1
more opening /lib/libc.so.1
more opening /usr/share/lib/terminfo//x/xterm
utmpd opening /var/adm/utmpx
utmpd opening /var/adm/utmpx
utmpd opening /proc/12571/psinfo
utmpd opening /proc/9587/psinfo
date opening /var/ld/ld.config
date opening /lib/libc.so.1
date opening /usr/share/lib/zoneinfo/US/Mountain
vi opening /var/ld/ld.config
vi opening /usr/lib/libmapmalloc.so.1
vi opening /lib/libcurses.so.1
vi opening /lib/libc.so.1
vi opening /lib/libgen.so.1
vi opening /usr/share/lib/terminfo//x/xterm
vi opening //.exrc
vi opening /var/tmp/ExTcaqBz
vi opening /var/tmp/ExUcaqBz
vi opening /etc/system
^C
    10  syscall::open*:return
    11  /self->entry && arg0 == -1/
    12  {
    13      printf("open for '%s' failed, errno=%d", self->path, errno);
    14      ustack();
    15      self->entry = 0;
    16  }

# failedopen.d 13026
open for '/usr/openwin/lib/X11/XtErrorDB' failed, errno=2
              febbcf78
              febb05a0
              fec97b38
              fec97a78
              fedbbffc
              fedbbeac
              fedbbe40
              fedc0220
              fedc037c
              fed8fb6c
              fed8f2f8
              fed8f290
              cf3f8
              3f648
              d1c98
              5c658
^C
 13027 pts/1       0:00 sh
 12571 pts/1       0:00 sh
 13026 pts/1       0:00 dtmail
 13028 pts/1       0:00 ps
 12577 pts/1       0:06 bash
> :c
libSDtMail: Error: Xt Error: Can't open display: 129.150.33.103:0.0
mdb: target has terminated
> 5c658/i
_start+0x108:
_start+0x108:   call    +0x75618        <main>
> d1c98/i
main+0x28:
main+0x28:      jmpl    %i1, %o7
> 3f648/i
__0fHRoamAppKinitializePiPPc+0x310:
__0fHRoamAppKinitializePiPPc+0x310:     call    +0x8fd24        <__0fLApplicationKinitializePiPPc>
> cf3f8/i
__0fLApplicationKinitializePiPPc+0x8c:
__0fLApplicationKinitializePiPPc+0x8c:  call    +0x52718        <PLT:XtAppInitialize>
> fed8f290/i
libXt.so.4`XtAppInitialize+0x54:
libXt.so.4`XtAppInitialize+0x54:        call    +0x56800        <PLT:XtOpenApplication>
> fed8f2f8/i
libXt.so.4`XtOpenApplication+0x48:      call    +0x56774        <PLT:_XtAppInit>
> fed8fb6c/i
libXt.so.4`_XtAppInit+0x138:    call    +0x553cc        <PLT:XtErrorMsg>
> febbcf78/i
libc.so.1`__open+4:     ta      8
> febbcf78:b
> :c
mdb: stop at libc.so.1`__open+4
mdb: target stopped at:
libc.so.1`__open+4:     ta      8
> $c
libc.so.1`__open+4(ff2893ec, 2000, 1b6, 38e70, ff3b3508, febe2264)
libc.so.1`open+0x64(ff2893ec, 2000, 1b6, ff3ea0f8, ff3ec46c, 0)
libnsl.so.1`__nsl_fopen+0x8c(ff2893ec, ff2893fc, ff24fbb4, ff3ea0f8, ff3ec46c, ff2893fc)
libnsl.so.1`getnetlist+0x20(0, 69bcc, ff292690, 0, 0, ff290f30)
libnsl.so.1`setnetconfig+0x38(0, ff294a58, ff292690, 0, 763a8, febea4c0)
libnsl.so.1`__rpc_getconfip+0xd8(ff296ea8, 0, 0, 0, 4144c, 0)
libnsl.so.1`getipnodebyname+0x1c(ffbfef50, 1a, 3, ffbfef3c, 1010101, 57f74)
libsocket.so.1`get_addr+0x158(0, fe8920a0, ffbff0c0, 17700000, 0, 0)
libsocket.so.1`_getaddrinfo+0x710(fe8920a0, 1770, ffbff168, 15950c, 0, 2)
libX11.so.4`_X11TransSocketINETConnect+0x178(1594d0, fe8920a0, ffbff188, ffbff32c, fed13100, 0)
libX11.so.4`_X11TransConnect+0x58(1594d0, ffbff3e8, 7ffffc00, fe892090, fed13104, fe982078)
libX11.so.4`_X11TransConnectDisplay+0x6e0(e, 1594d0, 1, ffbff3e8, 0, 0)
libX11.so.4`XOpenDisplay+0xe8(0, fed20bc4, 158f88, ffbffdec, 9ebc4, 0)
libXt.so.4`XtOpenDisplay+0xe4(158190, 0, ffbffdcc, fe982010, 0, 0)
libXt.so.4`_XtAppInit+0xfc(ffbff71c, fe982010, 0, 0, ffbffdcc, ffbff778)
libXt.so.4`XtOpenApplication+0x48(12bc78, fe982010, 0, 0, ffbffdcc, ffbffdec)
libXt.so.4`XtAppInitialize+0x54(1346f4, fede7638, fede4000, 120008, 54da8, 14e634)
__0fLApplicationKinitializePiPPc+0x8c(12bc68, ffbffdcc, ffbffdec, 0, 1346f4, 134400)
__0fHRoamAppKinitializePiPPc+0x310(12bc68, ffbffdcc, ffbffdec, 14d400, 0, 136000)
main+0x28(136000, 3f338, 12bc68, 12d0cc, 0, 136000)
_start+0x108(0, 0, 0, 0, 0, 0)
>

A breakpoint was set on the C library open function, and the dtmail utility was continued in the debugger until it hit the breakpoint. The $c mdb command was then used to display the stack trace symbolically.
    11  /self->entry && arg0 == -1/
    12  {
    13      printf("open for '%s' failed, errno=%d", self->path, errno);
    14      ustack();
    15      self->entry = 0;
    16  }
# dtrace -s failedopen2.d -c "cat /nothing"
dtrace: script 'failedopen2.d' matched 4 probes
cat: cannot open /nothing
dtrace: pid 1612 exited with status 2
CPU     ID                    FUNCTION:NAME
  0    397                    open64:return open for '/nothing' failed, errno=2
              libc.so.1`__open64+0x4
              libc.so.1`_endopen+0x88
              libc.so.1`fopen64+0x1c
              cat`main+0x318
              cat`_start+0x108
Module 4
- Use DTrace to access kernel variables
- Use DTrace to obtain information about read calls
- Use DTrace to perform anonymous tracing
- Use DTrace to perform speculative tracing
- Explain the privileges necessary to run DTrace operations
Relevance
Discussion: The following questions are relevant to understanding how to use DTrace for finding system problems:

- Would the ability to access any kernel variable when a probe fires be beneficial?
- Would it be useful to know who is issuing which type of read calls?
- Would it be advantageous to trace device driver code during system boot?
- Would it be beneficial to give regular user accounts access to the DTrace facility that is limited to user-owned processes?
Additional Resources
Additional resources: The following references provide additional information on the topics described in this module:

- Sun Microsystems, Inc. Solaris Dynamic Tracing Guide, part number 817-6223-10.
- Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. Dynamic Instrumentation of Production Systems. Paper presented at 2004 USENIX Conference.
- BigAdmin System Administration Portal [http://www.sun.com/bigadmin/content/dtrace].
- dtrace(1M) manual page in the Solaris 10 OS manual pages, Solaris 10 Reference Manual Collection.
- The /usr/demo/dtrace directory contains all of the sample scripts from the Solaris Dynamic Tracing Guide.
Accessing Kernel Variables

You can apply any of the D operators to external kernel variables, except those that modify values. When you launch DTrace, the D compiler loads the set of variable names corresponding to active kernel modules, so declarations of these variables are not required.

- The nproc variable: Holds the current number of Solaris OS processes
- The nthread variable: Holds the current number of Solaris OS threads
- The freemem variable: Holds the current amount of system free memory not owned by the memory allocator
You must precede each reference to these kernel variables with a backquote character (`), as shown in the following example:

# cat -n monitor.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  BEGIN
     4  {
     5      printf("%-14s %-10s %10s\n", "Processes",
     6          "Threads", "Free Memory");
     7  }
     8
     9  tick-5sec
    10  {
    11      printf("%-14d %-10d %9dmb\n", `nproc,
    12          `nthread, (`freemem*8)/1024);
    13  }
# ./monitor.d
Processes      Threads    Free Memory
41             232            322mb
42             232            306mb
41             232            322mb
53             242            320mb
47             249            251mb
41             232            252mb
41             232            252mb
41             232            232mb
41             232            111mb
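The expression (`freemem*8)/1024 in monitor.d converts a page count to megabytes. The following minimal Python sketch reproduces that arithmetic; it assumes, as the factor of 8 in the script implies, that freemem is counted in 8-Kbyte pages. The function name and the sample page count are hypothetical.

```python
PAGE_SIZE_KB = 8  # assumption: 8-Kbyte pages, matching the *8 in monitor.d


def free_memory_mb(freemem_pages, page_size_kb=PAGE_SIZE_KB):
    """Convert a free-page count to whole megabytes (integer division)."""
    return (freemem_pages * page_size_kb) // 1024


if __name__ == "__main__":
    # 41216 pages * 8 Kbytes = 329728 Kbytes = 322 Mbytes
    print(free_memory_mb(41216))
```

On a system with a different page size, the multiplier in the D script would need to change accordingly.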
- The curpsinfo variable: Points to a process information structure
- The curlwpsinfo variable: Points to a lightweight process (LWP) information structure
- The curcpu variable: Points to a central processing unit (CPU) information structure

The first two structures are part of the proc(4) interface and are used by commands like ps(1) and prstat(1M). These variables provide access to kernel state information at the time any probe fires. The following examples define the data structures.
    13          curpsinfo->pr_pid, curpsinfo->pr_ppid, curpsinfo->pr_uid,
    14          curlwpsinfo->pr_pri, curpsinfo->pr_psargs);
    15  }
    16
    17  fbt::$1:entry
    18  /nlines > 20/
    19  {
    20      printf("TID\tPID\tPPID\tUID\tPRI\tCOMMAND\n");
    21      nlines = 0;
    22  }

# ./ps.d bdev_strategy
TID     PID     PPID    UID     PRI     COMMAND
1       4640    4639    0       55      find / -type f
1       4640    4639    0       55      find / -type f
1       4698    4641    0       51      file /var/sadm/pkg/SUNWfontconfig-root/save/pspool/SUNWfontconfig-root/install/
1       4640    4639    0       55      find / -type f
1       4698    4641    0       51      file /var/sadm/pkg/SUNWfontconfig-root/save/pspool/SUNWfontconfig-root/install/
^C
# ps.d nanosleep
TID     PID     PPID    UID     PRI     COMMAND
11      279     1       0       59      /usr/sbin/nscd
12      279     1       0       59      /usr/sbin/nscd
21      279     1       0       59      /usr/sbin/nscd
18      279     1       0       59      /usr/sbin/nscd
17      279     1       0       59      /usr/sbin/nscd
16      279     1       0       59      /usr/sbin/nscd
13      279     1       0       59      /usr/sbin/nscd
1       2120    2119    0       59      sleep 5
12      279     1       0       59      /usr/sbin/nscd
11      279     1       0       59      /usr/sbin/nscd
13      279     1       0       59      /usr/sbin/nscd
14      279     1       0       59      /usr/sbin/nscd
15      279     1       0       59      /usr/sbin/nscd
16      279     1       0       59      /usr/sbin/nscd
17      279     1       0       59      /usr/sbin/nscd
18      279     1       0       59      /usr/sbin/nscd
21      279     1       0       59      /usr/sbin/nscd
18      279     1       0       59      /usr/sbin/nscd
TID     PID     PPID    UID     PRI     COMMAND
17      279     1       0       59      /usr/sbin/nscd
16      279     1       0       59      /usr/sbin/nscd
^C
The following D script uses the on-cpu sched probe with an aggregation to display a summary of who has recently been running on what CPU:

# cat -n whorun.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  sched:::on-cpu
     4  /pid != $pid && pid != 0/
     5  {
     6      @[curpsinfo->pr_psargs, curcpu->cpu_id] = count();
     7  }
     8
     9  END
    10  {
    11      printf("%-30s %4s %6s\n", "Command", "CPU", "Count");
    12      printa("%-30s %4d %@6d\n", @);
    13  }
# ./whorun.d
^C
Command                         CPU  Count
/usr/lib/fm/fmd/fmd               1      1
uptime                            2      1
find / -name fubar                3      2
/usr/lib/autofs/automountd        2      3
-sh                               2      3
-sh                               1      3
/usr/lib/picl/picld               1      4
/usr/lib/fm/fmd/fmd               3      6
/usr/sbin/nscd                    3      8
/usr/lib/fm/fmd/fmd               2     11
/usr/sbin/nscd                    2     14
/usr/sbin/nscd                    0     15
/usr/lib/sendmail -bd -q15m       0     16
ls -lR /                          1     18
/usr/sfw/sbin/snmpd               0     18
-sh                               3     20
/usr/lib/utmpd                    0     20
/usr/lib/sendmail -bd -q15m       2     20
/usr/lib/picl/picld               0     32
/usr/sfw/sbin/snmpd               3     44
/usr/sfw/sbin/snmpd               2     55
fsflush                           0     72
/usr/sbin/nscd                    1     77
find / -name fubar                1    152
/usr/sbin/vold                    2    152
/usr/sfw/sbin/snmpd               1    237
The following D script displays CPU, thread, process, wait time, and stack trace information related to a thread blocking on an adaptive mutex:
# cat -n mutex.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  lockstat:::adaptive-block
     4  {
     5      printf("\nCPU\tTID\tPID\tUID\tWAIT TIME\tCOMMAND\n");
     6      printf("%d\t%d\t%d\t%d\t%d\t\t%s\n", curcpu->cpu_id,
     7          curlwpsinfo->pr_lwpid, curpsinfo->pr_pid,
     8          curpsinfo->pr_uid, arg1, curpsinfo->pr_psargs);
     9      stack();
    10  }
Test the mutex.d D script by starting four instances of the readchar user application, which reads every file in the current directory one byte at a time using the read(2) system call:

# (cd /usr/lib; /var/dtrace/readchar)& (cd /usr/lib; /var/dtrace/readchar)&
[1] 2323
[2] 2325
# (cd /usr/lib; /var/dtrace/readchar)& (cd /usr/lib; /var/dtrace/readchar)&
intr ithr ...  srw
 409  307 ...    0
 401  301 ...    0
 406  305 ...    0
 402  302 ...    0
CPU     TID     PID     UID     WAIT TIME       COMMAND
0       0       0       0       41076           sched

              genunix`clock+0x3f0
              genunix`cyclic_softint+0xa4
              unix`cbe_level10+0x8
              unix`intr_thread+0x144

CPU     TID     PID     UID     WAIT TIME       COMMAND
0       0       0       0       50424           sched

              genunix`clock+0x3f0
              genunix`cyclic_softint+0xa4
              unix`cbe_level10+0x8
              unix`intr_thread+0x144

              sd`sdintr+0x14
              glm`glm_doneq_empty+0x144

CPU     TID     PID     UID     WAIT TIME       COMMAND
0       0       0       0       41184           sched

              genunix`clock+0x3f0
              genunix`cyclic_softint+0xa4
              unix`cbe_level10+0x8
              unix`intr_thread+0x144

CPU     TID     PID     UID     WAIT TIME       COMMAND
0       0       0       0       43214           sched

              genunix`clock+0x3f0
              genunix`cyclic_softint+0xa4
              unix`cbe_level10+0x8
              unix`intr_thread+0x144
^C

CPU     TID     PID     UID     WAIT TIME       COMMAND
1       1       12527   0       22200           /var/dtrace/readchar

              ufs`rdip+0x150
              ufs`ufs_read+0x208
              genunix`read+0x274
              genunix`read32+0x1c
              unix`syscall_trap32+0xa8

CPU     TID     PID     UID     WAIT TIME       COMMAND
1       1       12527   0       20700           /var/dtrace/readchar

              ufs`rdip+0x488
              ufs`ufs_read+0x208
              genunix`read+0x274
              genunix`read32+0x1c
              unix`syscall_trap32+0xa8

CPU     TID     PID     UID     WAIT TIME       COMMAND
2       1       12528   0       24400           /var/dtrace/readchar

              ufs`rdip+0x150
              ufs`ufs_read+0x208
              genunix`read+0x274
              genunix`read32+0x1c
              unix`syscall_trap32+0xa8

CPU     TID     PID     UID     WAIT TIME       COMMAND
2       1       12552   0       28900           /var/dtrace/readchar

              ufs`rdip+0x488
              ufs`ufs_read+0x208
              genunix`read+0x274
              genunix`read32+0x1c
              unix`syscall_trap32+0xa8
The previous output shows that the mutex contention is in the UNIX File System (UFS) code. The sleep times are only between 21 and 29 microseconds.
You can trace system-wide activity or application-specific activity. You can show information about each individual read call or summarize the data with an aggregation function. You can monitor read activity at the driver level with the io provider or at the application level with the pid provider, the syscall provider, or the sysinfo provider.
Displaying Read Call Information

    20      self->started = 0;
    21      ++nlines;
    22  }
    23
    24  syscall::read:return, syscall::pread*:return
    25  /nlines > 20/
    26  {
    27      printf("FD\tREQUEST\tACTUAL\tCOMMAND\n");
    28      nlines = 0;
    29  }
# ./reads.d
FD      REQUEST ACTUAL  COMMAND
0       1       1       bash
0       1       1       bash
0       1       1       bash
0       1       1       bash
3       877     877     date
0       1       1       bash
0       1       1       bash
...
0       1       1       bash
3       152     152     uptime
4       8192    4092    uptime
4       8192    0       uptime
3       877     877     uptime
0       1       1       bash
0       1       1       bash
...
0       1       1       bash
0       1       1       bash
0       1       1       bash
3       8192    8192    grep
3       8192    200     grep
3       8192    0       grep
1       8192    1006    init
1       8192    0       init
1       8192    1006    init
1       8192    0       init
...
FD      REQUEST ACTUAL  COMMAND
5       1024    61      nscd
5       8192    4464    utmpd
6       336     336     utmpd
6       336     336     utmpd
6       336     336     utmpd
5       8192    0       utmpd
1       24      -1      sac
2       8       8       ttymon
2       8       -1      ttymon
1       24      24      sac
...
FD      REQUEST ACTUAL  COMMAND
0       1       1       bash
0       1       1       bash
0       128     4       sh
0       128     3       sh
0       128     4       sh
...
4       416     416     ps
4       416     416     ps
11      336     336     svc.startd
4       416     416     ps
4       416     416     ps
4       416     416     ps
4       416     416     ps
4       416     416     ps
4       416     416     ps
^C
Using the previous output (and help from the truss(1) command), you can determine the following:
- The date(1) command reads a time zone (US/Mountain) configuration file of size 877 bytes when it starts.
- The ps(1) command reads the psinfo_t structure of size 416 bytes many times.
- The init(1M) command re-reads the /etc/inittab file periodically.
- The grep(1) command reads its file one page (8192 bytes) at a time.
- The sh(1) command reads a whole line from standard input into a 128-byte buffer.
- The bash(1) command reads standard input one byte at a time (probably to implement command line editing).
- The uptime(1) command reads the same time zone configuration file as the date(1) command.
- The sac(1M) and ttymon(1M) commands issued reads that failed.
# ./readsummary.d
^C
        4  instant                0
        2  more                   1
        0  vi                     1
        4  readchar               1
        0  bash                   1
        2  ttymon                 8
        4  rup                   23
        1  sac                   24
        5  nscd                  59
       19  sgml2roff            119
        3  rup                  413
        3  rpc.rstatd           413
        4  ps                   416
        3  uptime               514
        1  init                 540
        3  man                  550
        3  ps                   687
        5  rup                  787
        4  vi                   803
        3  grep                 845
        3  date                 877
        3  vi                  1492
        4  nroff               2221
        4  uptime              2232
        0  nroff               3479
        0  tbl                 3861
        3  cat                 3861
        6  nsgmls              3894
        0  eqn                 3914
        0  col                 3979
        0  instant             4072
        3  instant             4442
        3  nsgmls              4459
        6  rpc.rstatd          4464
        3  more                4842
        5  man                 5802
        4  nsgmls              6606
        0  grep                6815
By changing the aggregation function from avg() to sum(), you can obtain the total number of bytes read by file descriptor and process name:

# ./totalread.d
^C
        4  instant                0
        0  vi                     6
        2  more                   8
        5  nscd                  61
        0  bash                 121
       11  svc.startd           336
       10  svc.startd           336
        3  man                  550
        6  readchar             671
        3  date                 877
        4  ls                   877
        3  vi                  2984
       19  sgml2roff           3214
        1  init                4324
       23  readchar            4572
       19  readchar           10276
        6  nsgmls             11684
        5  man                17408
        4  nroff              17771
        3  more               17876
        7  readchar           18064
       20  readchar           18116
        3  cat                18435
        0  tbl                18435
        4  vi                 18880
       10  readchar           19500
       14  readchar           19500
        0  eqn                20356
        0  nroff              20356
        0  col                22496
        8  readchar           28252
        0  grep               30095
        3  nsgmls             33799
        3  instant            53314
       11  readchar           56616
        0  instant           160360
        4  nsgmls            171763
       21  readchar          192636
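The effect of switching from avg() to sum() can be modeled outside DTrace. The following minimal Python sketch groups read sizes by (file descriptor, command) and applies either function; the aggregate helper is hypothetical, the integer average is an assumption made to match the whole-number values shown in the output, and the sample records are taken from the reads.d output shown earlier in this module.

```python
from collections import defaultdict


def aggregate(records, func):
    """Group (fd, command, nbytes) records by (fd, command).

    func is "sum" for the total bytes read or "avg" for the integer
    average read size, loosely modeling the D aggregating functions.
    """
    groups = defaultdict(list)
    for fd, command, nbytes in records:
        groups[(fd, command)].append(nbytes)
    if func == "sum":
        return {key: sum(values) for key, values in groups.items()}
    if func == "avg":
        return {key: sum(values) // len(values) for key, values in groups.items()}
    raise ValueError("unsupported aggregating function: " + func)


# Successful reads taken from the reads.d output above.
records = [
    (3, "date", 877),
    (3, "uptime", 152),
    (3, "uptime", 877),
    (4, "ps", 416),
    (4, "ps", 416),
]

if __name__ == "__main__":
    print(aggregate(records, "avg"))
    print(aggregate(records, "sum"))
```

With these records the average for file descriptor 3 of uptime is 514 bytes and the total is 1029 bytes. The average shows the typical request size, while the total shows which process and descriptor account for the most data.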
Using the Anonymous Tracing Facility

# dtrace -A -m conskbd
dtrace: cleaned up old anonymous enabling in /kernel/drv/dtrace.conf
dtrace: cleaned up forceload directives in /etc/system
dtrace: saved anonymous enabling in /kernel/drv/dtrace.conf
dtrace: added forceload directives to /etc/system
dtrace: run update_drv(1M) or reboot to enable changes
# tail /etc/system
* chapter of the Solaris Dynamic Tracing Guide for details.
*
forceload: drv/systrace
forceload: drv/sdt
forceload: drv/profile
forceload: drv/lockstat
forceload: drv/fbt
forceload: drv/fasttrap
forceload: drv/dtrace
* ^^^^ Added by DTrace
# reboot
...
# grep enabling /var/adm/messages
Feb 27 07:34:22 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 0 (:kmdb::)
Feb 27 07:34:22 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 1 (dtrace:::ERROR)
Feb 27 07:45:27 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 0 (:conskbd::)
Feb 27 07:45:27 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 1 (dtrace:::ERROR)
# dtrace -ae
CPU     ID                    FUNCTION:NAME
  0  25339             conskbd_attach:entry
  0  25340            conskbd_attach:return
  0  25327                conskbdopen:entry
  0  25328               conskbdopen:return
  0  25331               conskbduwput:entry
  0  25332              conskbduwput:return
  0  25345               conskbdioctl:entry
  0  25346              conskbdioctl:return
  0  25327                conskbdopen:entry
  0  25328               conskbdopen:return
  0  25331               conskbduwput:entry
  0  25332              conskbduwput:return
  0  25345               conskbdioctl:entry
  0  25346              conskbdioctl:return
  0  25329               conskbdclose:entry
  0  25330              conskbdclose:return
  0  25327                conskbdopen:entry
  0  25328               conskbdopen:return
  0  25329               conskbdclose:entry
  0  25330              conskbdclose:return

The forceload entries in the /etc/system file are not automatically removed after the reboot. Run the dtrace(1M) command with just the -A option to clean up these forceload entries:

# tail -18 /etc/system
* vvvv Added by DTrace
*
* The following forceload directives were added by dtrace(1M) to allow for
* tracing during boot. If these directives are removed, the system will
* continue to function, but tracing will not occur during boot as desired.
* To remove these directives (and this block comment) automatically, run
* "dtrace -A" without additional arguments. See the "Anonymous Tracing"
* chapter of the Solaris Dynamic Tracing Guide for details.
*
forceload: drv/systrace
forceload: drv/sdt
forceload: drv/profile
forceload: drv/lockstat
forceload: drv/fbt
forceload: drv/fasttrap
forceload: drv/dtrace
* ^^^^ Added by DTrace
# dtrace -A
dtrace: cleaned up old anonymous enabling in /kernel/drv/dtrace.conf
dtrace: cleaned up forceload directives in /etc/system
# tail /etc/system
*
*       To set variables in 'unix':
*
*               set nautopush=32
*               set maxusers=40
*
*       To set a variable named 'debug' in the module named 'test_module'
*
*               set test_module:debug = 0x13
The next example focuses only on those functions called from the conskbd_attach() function in the conskbd module:

# cat -n cons.d
     1  #!/usr/sbin/dtrace -s
     2
     3  fbt::conskbd_attach:entry
     4  {
     5      self->trace = 1;
     6  }
     7
     8  fbt:::
     9  /self->trace/
    10  {
    11  }
    12
    13  fbt::conskbd_attach:return
    14  {
    15      self->trace = 0;
    16  }
# dtrace -AFs cons.d
dtrace: saved anonymous enabling in /kernel/drv/dtrace.conf
dtrace: added forceload directives to /etc/system
dtrace: run update_drv(1M) or reboot to enable changes
# reboot
...
# grep enabling /var/adm/messages
Feb 27 07:45:27 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 0 (:conskbd::)
Feb 27 07:45:27 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 1 (dtrace:::ERROR)
Feb 27 08:07:05 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 0 (fbt::conskbd_attach:entry)
Feb 27 08:07:05 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 1 (fbt:::)
Feb 27 08:07:05 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 2 (fbt::conskbd_attach:return)
Feb 27 08:07:05 sys63 dtrace: [ID 566105 kern.notice] NOTICE: enabling probe 3 (dtrace:::ERROR)
# dtrace -ae
CPU FUNCTION
  0  -> conskbd_attach
  0  -> ddi_create_minor_node
  0  -> ddi_create_minor_common
  0  -> ddi_driver_major
  0  <- ddi_driver_major
  0  -> strcmp
  0  <- strcmp
  0  -> derive_devi_class
  0  -> i_ddi_devi_class
  0  <- i_ddi_devi_class
  0  -> strncmp
  0  <- strncmp
...
  0  <- kstat_compare_bykid
  0  -> kstat_zone_compare
  0  <- kstat_zone_compare
  0  <- avl_find
  0  <- kstat_hold
  0  <- kstat_hold_bykid
  0  <- kstat_install
  0  -> kstat_rele
  0  -> cv_broadcast
  0  <- cv_broadcast
  0  <- kstat_rele
  0  <- conskbd_attach
- Unwanted data that must be filtered afterwards
- Data loss caused by running out of buffer space in DTrace
To address this problem, DTrace provides a facility called speculative tracing. Speculative tracing allows you to tentatively trace data. Later, you can decide that the traced data is interesting and commit it, or you can decide that the traced data is uninteresting and discard it.
commit     ID
discard    ID
The speculation() function allocates a speculative buffer and returns a speculation identifier (ID). You use this ID in subsequent calls to the speculate() function. You must place the speculate() call before any data recording action statement in the same clause. All such data recording action statements are then speculatively traced. Probe clauses can contain speculative tracing or regular tracing, but not both. Aggregating actions, destructive actions, and exit actions can never be speculative.

By default (without tuning), there is only one speculative buffer. Therefore you must be careful not to start a new speculation before committing or discarding an existing one.

You use the commit() function to commit a speculation. When you commit a speculative buffer, its data is copied into the one (per CPU) principal buffer of DTrace. You cannot have any data recording actions in a clause containing a commit() function.

You use the discard() function to discard a speculation. When a speculative buffer is discarded, its contents are thrown away.
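The life cycle of a speculation can be pictured with a small model. The following is a minimal Python sketch, not DTrace code and not how DTrace is implemented: the class SpeculationModel and its nspec parameter are hypothetical, per-CPU buffering is ignored, and returning 0 when no buffer is free is simply how this sketch represents the single-buffer default described above.

```python
class SpeculationModel:
    """Toy model of speculative buffers and one principal buffer."""

    def __init__(self, nspec=1):
        self.nspec = nspec      # number of speculative buffers available
        self.buffers = {}       # speculation ID -> tentatively traced records
        self.principal = []     # records that were committed
        self._next_id = 1

    def speculation(self):
        """Allocate a speculative buffer; return its ID, or 0 if none is free."""
        if len(self.buffers) >= self.nspec:
            return 0
        spec_id = self._next_id
        self._next_id += 1
        self.buffers[spec_id] = []
        return spec_id

    def speculate(self, spec_id, record):
        """Tentatively trace a record into the given speculation."""
        if spec_id in self.buffers:
            self.buffers[spec_id].append(record)

    def commit(self, spec_id):
        """Copy the speculative data to the principal buffer and free the buffer."""
        self.principal.extend(self.buffers.pop(spec_id, []))

    def discard(self, spec_id):
        """Throw the speculative data away and free the buffer."""
        self.buffers.pop(spec_id, None)


if __name__ == "__main__":
    model = SpeculationModel()
    first = model.speculation()
    model.speculate(first, "open of an existing file")
    model.discard(first)            # the open succeeded: not interesting
    second = model.speculation()
    model.speculate(second, "open of a missing file")
    model.commit(second)            # the open failed: keep the trace
    print(model.principal)
```

Only the committed record reaches the principal buffer. The model also shows why a second speculation cannot start while the only buffer is still in use: speculation() has nothing to hand out until commit() or discard() frees it.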
Using the Speculative Tracing Facility

    39  syscall::open*:return
    40  /self->spec && arg0 != -1/
    41  {
    42      /* Throw away data recorded in speculative buffer */
    43      discard(self->spec);
    44      self->spec = 0;
    45  }
# ./spec.d
dtrace: script './spec.d' matched 40768 probes
CPU FUNCTION
  0  <= open64                 Open failed with errno: 2
  0  => open64                 grep was opening: /etc/sytem
  0  -> open64
  0  <- open64
  0  -> copen
  0  -> falloc
  0  -> ufalloc
  0  <- ufalloc
  0  -> ufalloc_file
  0  -> fd_find
...
  0  <- cv_broadcast
  0  <- setf
  0  -> unfalloc
  0  -> crfree
  0  <- crfree
  0  <- unfalloc
  0  -> kmem_cache_free
  0  <- kmem_cache_free
  0  -> set_errno
  0  <- set_errno
  0  <- copen
^C
It appears that the spec.d D script never starts a new open speculation until the current open returns and the current speculation is either committed or discarded. This is not the case, however, if an open blocks and does not return before another open is started. You learn in a lab exercise how to tune the number of speculative buffers.
DTrace Privileges
By default, only the super-user can use DTrace. This is because DTrace enables visibility into all aspects of the system.
In addition, some DTrace actions can modify a program's state by stopping a process or even inducing a breakpoint in the kernel. Just as it is inappropriate to allow one user to stop another user's process or access another user's files, so it is inappropriate to grant a user full access to all of the DTrace facilities. The traditional UNIX "all or none" approach to user privileges is not suitable for managing the use of the DTrace capabilities.
- The dtrace_proc privilege: Permits use of only the pid and plockstat providers for process-level tracing of processes owned by the user.
- The dtrace_user privilege: Permits use of only the profile and syscall providers on processes owned by the user.
- The dtrace_kernel privilege: Permits the use of every provider except the pid and plockstat providers, unless the dtrace_proc privilege is also granted. Does not allow kernel-destructive actions.
In addition to the above DTrace-specific privileges, a user who has both the dtrace_proc and proc_owner privileges is allowed to trace other users' processes.
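The three rules above can be restated as a small decision function. The following minimal Python sketch encodes only what the list says; the function name provider_allowed is hypothetical, and the sketch deliberately ignores process ownership, the proc_owner privilege, and kernel-destructive actions, so it is a reading aid rather than a description of how Solaris enforces privileges.

```python
# Providers that require process-level tracing rights, per the list above.
PROCESS_PROVIDERS = {"pid", "plockstat"}

# Providers that the dtrace_user privilege permits, per the list above.
USER_PROVIDERS = {"profile", "syscall"}


def provider_allowed(provider, privileges):
    """Return True if the stated rules permit `provider` for `privileges`."""
    if provider in PROCESS_PROVIDERS:
        # pid and plockstat need dtrace_proc, even with dtrace_kernel.
        return "dtrace_proc" in privileges
    if "dtrace_kernel" in privileges:
        # Every other provider is permitted with dtrace_kernel.
        return True
    return provider in USER_PROVIDERS and "dtrace_user" in privileges


if __name__ == "__main__":
    print(provider_allowed("pid", {"basic", "dtrace_kernel"}))
    print(provider_allowed("pid", {"basic", "dtrace_kernel", "dtrace_proc"}))
    print(provider_allowed("syscall", {"basic", "dtrace_user"}))
    print(provider_allowed("fbt", {"basic", "dtrace_user"}))
```

The first call returns False because dtrace_kernel alone does not cover the pid provider; adding dtrace_proc, as in the second call, makes it True.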
Kernel-Destructive Actions
Only the super-user can perform kernel-destructive actions. You enable such actions by running the dtrace(1M) command with the -w option. Three built-in DTrace functions cause kernel-destructive actions:
- The breakpoint() function: Action that induces a kernel breakpoint, causing the system to stop, with control passing to OpenBoot PROM or kmdb(1), depending on how the system was booted.
- The panic() function: Action that induces a kernel panic, with crash files normally being created for postmortem analysis.
- The chill() function: Action that causes DTrace to spin for the specified number of nanoseconds. Intended for dealing with race condition situations.
username::::defaultpriv=basic,privilege,...
The following examples show the effect of setting the three DTrace-specific privileges.
user2::::defaultpriv=basic,dtrace_proc
user3::::defaultpriv=basic,dtrace_user
user4::::defaultpriv=basic,dtrace_kernel
user5::::defaultpriv=basic,dtrace_kernel,dtrace_proc
user6::::defaultpriv=basic,dtrace_proc,proc_owner

$ id
uid=1001(user1) gid=101(users)
$ /usr/sbin/dtrace -l
dtrace: failed to initialize dtrace: DTrace requires additional privileges
$ echo $$
919
$ /usr/sbin/dtrace -n pid919:::
dtrace: failed to initialize dtrace: DTrace requires additional privileges
$

$ dtrace -qn 'pid$target:libc:memcpy:entry {printf("size: %d\n",arg2)}' -c date
Sun Feb 27 10:02:01 MST 2005
size: 16
size: 15
size: 1
size: 15
size: 5
size: 521
size: 44
size: 28
size: 28
size: 48
size: 48
size: 308
size: 56
size: 36
size: 29
$
$ ps -ef | grep vi
   user2  1534  1528  0 09:48:20 pts/1    0:00 grep vi
   user5  1531  1452  0 09:47:55 pts/2    0:00 vi resume
$ dtrace -n pid1531:::
dtrace: invalid probe specifier pid1531:::: failed to grab pid 1531: permission denied
$ dtrace -n syscall::read:
dtrace: invalid probe specifier syscall::read:: probe description syscall::read: does not match any probes
$
f: -1255556994 p: 1065403 q: 309691762 m: 14
^C
$ dtrace -qn 'syscall::write:entry /arg0 == 1/ {printf("T: %d\n",timestamp)}' -c pgm
f: 13 p: 0 q: -1952257862 m: -10
f: 640001883 p: -2056615 q: -929109794 m: -7
f: -1660723204 p: -1529159 q: 94444073 m: 25
f: 2041630813 p: 749994 q: -42775360 m: -23
f: -1255556994 p: 1065403 q: 309691762 m: 14
f: -1207459745 p: 1769677 q: -8640714 m: -35
T: 150116053418082
T: 150116222152140
T: 150116388881669
T: 150116558431666
T: 150116728255203
...
$ dtrace -n 'pid$target:::entry' -c pgm
dtrace: invalid probe specifier pid$target:::entry: probe description pid1208:::entry does not match any probes
$ dtrace -qn 'profile-109 {@[arg1] = count()}' -c pgm
f: 13 p: 0 q: -1952257862 m: -10
f: 640001883 p: -2056615 q: -929109794 m: -7
f: -1660723204 p: -1529159 q: 94444073 m: 25
...
^C

           133476               49
       4280947012              226
       4280947008             1094
$ mdb pgm
> _start:b
> :r
mdb: stop at pgm`_start
mdb: target stopped at:
mypgm`_start:   clr     %fp
> 0t4280947008/ai
libc.so.1`.umul:
libc.so.1`.umul:umul    %o0, %o1, %o0
> $q
$ (sleep 33; pwd)&
1680
$ dtrace -n 'syscall:::entry /pid != $pid/ {}'
dtrace: description 'syscall:::entry ' matched 225 probes
/export/home/user3
CPU     ID                    FUNCTION:NAME
  0  18832                      rexit:entry
  0  18922                      ioctl:entry
  0  18908                    setpgrp:entry
  0  18922                      ioctl:entry
  0  19004                    waitsys:entry
  0  19214                     getcwd:entry
  0  18838                      write:entry
  0  18832                      rexit:entry
^C
$
The dtrace_user privilege allows the use of the syscall and profile providers only on processes owned by the user. Even though many system calls are occurring in the system, the above output shows only the system calls of the sh, sleep, and pwd commands.
The preceding example demonstrates that you must have the dtrace_proc privilege to trace your own processes. The dtrace_kernel privilege by itself is not sufficient.
$ id
uid=1005(user5) gid=101(users)
$ grep user5 /etc/user_attr
user5::::defaultpriv=basic,dtrace_kernel,dtrace_proc
$ echo $$
6736
$ dtrace -n 'pid6736:a.out::entry'
dtrace: description 'pid6736:a.out::entry' matched 211 probes
^C
$ dtrace -l | awk '{print $2}' | sort -u
PROVIDER
dtrace
fasttrap
fbt
fpuinfo
io
lockstat
mib
pid6736
proc
profile
sched
sdt
syscall
sysinfo
vminfo
$
$ /usr/sbin/dtrace -l
   ID   PROVIDER            MODULE                          FUNCTION
    1     dtrace
    2     dtrace
    3     dtrace
$ /usr/sbin/dtrace -n 'pid$target:calls::entry' -c calls
dtrace: description 'pid$target:calls::entry' matched 7 probes
83 133
dtrace: pid 1787 exited with status 1
CPU     ID                    FUNCTION:NAME
  0  28355                     _start:entry
  0  28362                      _init:entry
  0  28361                       main:entry
  0  28360                         f1:entry
  0  28359                         f2:entry
  0  28358                         f3:entry
  0  28357                         f4:entry
  0  28356                         f5:entry
  0  28360                         f1:entry
  0  28359                         f2:entry
  0  28358                         f3:entry
  0  28357                         f4:entry
  0  28356                         f5:entry
  0  28363                      _fini:entry
$ ppriv $$
1774:   -sh
flags = <none>
        E: basic,dtrace_proc
        I: basic,dtrace_proc
        P: basic,dtrace_proc
        L: all
$ bash
bash-2.05b$ ppriv $$
1789:   bash
flags = <none>
        E: basic,dtrace_proc
        I: basic,dtrace_proc
        P: basic,dtrace_proc
        L: all
bash-2.05b$ /usr/sbin/dtrace -n 'pid$target:calls::entry' -c calls
dtrace: description 'pid$target:calls::entry' matched 7 probes
83 133
dtrace: pid 1850 exited with status 1
CPU     ID                    FUNCTION:NAME
  0  28355                     _start:entry
  0  28362                      _init:entry
  0  28361                       main:entry
  0  28360                         f1:entry
...
bash-2.05b$ echo $$
1789
bash-2.05b$ su
Password:
# ppriv -s A+dtrace_kernel 1789
# ppriv $$
1854:   sh
flags = <none>
        E: all
        I: basic
        P: all
        L: all
# exit
bash-2.05b$ ppriv $$
1789:   bash
flags = <none>
        E: basic,dtrace_kernel,dtrace_proc
        I: basic,dtrace_kernel,dtrace_proc
        P: basic,dtrace_kernel,dtrace_proc
        L: all
bash-2.05b$ /usr/sbin/dtrace -qn 'fbt::cv_wait_sig:entry
> {trace(execname);ustack();stack();exit(0);}'
more
              ff2bcb58
              15684
              149a4
              13ad8
              12780
              1201c
              115cc

              genunix`str_cv_wait+0x28
              genunix`strwaitq+0x238
              genunix`strread+0x174
              genunix`read+0x274
              unix`syscall_trap32+0xcc
Providers listed under each DTrace privilege:

    dtrace_proc privilege      pid, plockstat
    dtrace_user privilege      profile, syscall
    dtrace_kernel privilege    All

(Figure labels: User, Kernel)
Module 5
- Describe how to lessen the performance impact of DTrace
- Describe how to use and tune DTrace buffers
- Debug DTrace scripts
Copyright 2005 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Relevance

Discussion: The following questions are relevant to understanding how to troubleshoot DTrace problems:

- Would the ability to write your D scripts with minimal performance impact be beneficial?
- Would it be useful to have control over buffer management policies when DTrace buffer space is exhausted?
- Would it be useful to detect common mistakes made in D scripts?
Additional Resources

The following references provide additional information on the topics described in this module:

- Sun Microsystems, Inc. Solaris Dynamic Tracing Guide, part number 817-6223-10.
- Cantrill, Bryan M., Michael W. Shapiro, and Adam H. Leventhal. "Dynamic Instrumentation of Production Systems." Paper presented at 2004 USENIX Conference.
- BigAdmin System Administration Portal [http://www.sun.com/bigadmin/content/dtrace].
- dtrace(1M) manual page in the Solaris 10 OS manual pages, Solaris 10 Reference Manual Collection.
- The /usr/demo/dtrace directory contains all of the sample scripts from the Solaris Dynamic Tracing Guide.
Using Aggregations
DTrace aggregations provide a scalable method of aggregating data. Although associative arrays appear to offer similar functionality, they are global, general-purpose variables that cannot provide the linear scalability of aggregations. Aggregating functions allow intermediate results to be kept per-CPU instead of in a shared global data structure. When a system-wide result is required, the aggregating function is applied to the set of per-CPU intermediate results. You should therefore use aggregations rather than associative arrays whenever possible. For example, avoid the approach shown in the following script:

    syscall:::entry
    {
            ++totals[execname];
    }

    syscall::rexit:entry
    {
            printf("%40s %d\n", execname, totals[execname]);
            totals[execname] = 0;
    }

Instead, use an aggregation:

    syscall:::entry
    {
            @totals[execname] = count();
    }

    END
    {
            printa("%40s %@d\n", @totals);
    }
Minimizing DTrace Performance Impact

When enabling many probes, you tend to use predicates of a form that identifies a specific thread or threads of interest, such as /self->traceme/ or /pid == 12345/. Many of these predicates evaluate to the same (false) value for most threads in most probes, but the evaluation itself can become costly when done for every function entry and return point in the kernel. To reduce this cost, DTrace caches the evaluation of a predicate if it includes only thread-local variables (as in the first example), only immutable variables (as in the second), or both. The cost of evaluating a cached predicate is much smaller than the cost of evaluating a non-cached predicate, especially if the predicate involves thread-local variables, string comparisons, or other relatively costly operations.
    Cacheable              Non-cacheable equivalent
    execname == "pgm"      curpsinfo->pr_fname or curthread->t_procp->p_user.u_comm
    pid == 1234            curpsinfo->pr_pid or curthread->t_procp->p_pidp->pid_id
    tid == 17              curlwpsinfo->pr_lwpid or curthread->t_tid
    {}

    syscall::read:return
    /follow[pid, tid]/
    {
            follow[pid, tid] = 0;
    }
You should instead use thread-local variables, as in the following example:

    syscall::read:entry
    {
            self->follow = 1;
    }

    fbt:::
    /self->follow/
    {}

    syscall::read:return
    /self->follow/
    {
            self->follow = 0;
    }

To be cacheable, a predicate must consist exclusively of cacheable expressions. The following predicates are all cacheable:

    /execname == "myprogram"/
    /execname == $$1/
    /pid == 12345/
    /pid == $1/
    /self->traceme == 1/

Because they use global variables, none of the following predicates is cacheable:

    /execname == one_to_watch/
    /traceme[execname]/
    /pid == pid_i_care_about/
    /self->traceme == my_global/
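As a minimal sketch of the cacheable form, the following script selects one process with a macro argument instead of a global variable. The file name bypid.d and the choice of aggregation key are illustrative assumptions, not part of the course material.

```d
#!/usr/sbin/dtrace -s

/*
 * Count system calls made by a single process.
 * Hypothetical usage: ./bypid.d 12345
 *
 * $1 is replaced by the first command-line argument when the script
 * is compiled, so the predicate compares pid against a constant and
 * uses no global variables.
 */
syscall:::entry
/pid == $1/
{
        @calls[probefunc] = count();
}
```

Storing the process ID in a global variable and testing /pid == that_global/ would select the same process, but it would fall into the non-cacheable group listed above.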
Principal Buffers
The buffer most fundamental to DTrace operation is the principal buffer. The principal buffer is present in every DTrace invocation, and is the buffer to which tracing actions record their data by default. These actions include:

- exit()
- printa()
- printf()
- stack()
- trace()
- ustack()
The principal buffers are always allocated on a per-CPU basis, although tracing (and thus buffer allocation) can be restricted to a single CPU by using the cpu option.
Using and Tuning DTrace Buffers

To accommodate these varying demands, DTrace supports the following buffer policies:

- switch
- fill
- ring

You select the policy with the bufpolicy option, which can be set on a per-consumer basis.
The dtrace(1M) command also accepts option settings on the command line as an argument to the -x option. For example:

    # dtrace -x nspec=4 -x bufsize=2g -x switchrate=60 \
      -x aggrate=10ms -x bufpolicy=switch -n zfod

You can also specify the bufsize option with the -b flag to the dtrace(1M) command:

    # dtrace -b 2g -n zfod
Note This section describes only those options relevant to buffer management. For details on the other DTrace options, see the Solaris Dynamic Tracing Guide.
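The same options can be set inside a script with #pragma D option lines, the form this module uses for the ring buffer policy. The sketch below is illustrative only: the option values and the traced probe are assumptions chosen for the example, not recommended settings.

```d
#!/usr/sbin/dtrace -s

/* Principal buffer size for each CPU. */
#pragma D option bufsize=8m
/* How often the consumer switches and reads the buffers. */
#pragma D option switchrate=10hz
/* switch is the default policy; it is stated here only for clarity. */
#pragma D option bufpolicy=switch

syscall::read:entry
{
        trace(execname);
}
```

Keeping the options in the script means every invocation uses the same buffer settings.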
Dropped Data
Under the switch policy, if a given enabled probe would trace more data than there is space available in the active principal buffer, the data is dropped and a per-CPU drop count is incremented. In the event of one or more drops, the dtrace(1M) command displays this message or a similar one:

    dtrace: 11 drops on CPU 0

You can reduce or eliminate drops by:

- increasing the size of the principal buffer with the bufsize option, or
- increasing the switching rate with the switchrate option
The switch policy allocates scratch space for the copyin(), copyinstr(), and alloca() commands out of the active buffer.
# ./stress.d >/var/tmp/stress.d.out
dtrace: script './stress.d' matched 38665 probes
dtrace: 451660 drops on CPU 0
dtrace: 1100596 drops on CPU 0
dtrace: 1028767 drops on CPU 0
dtrace: 1103521 drops on CPU 0
# ls -l /var/tmp/stress.d.out
-rw-r--r--   1 root     root     /var/tmp/stress.d.out
The drops result from the limited buffer space, the low switchrate value, or both. The default buffer size for the principal buffer is 4 Mbytes and the default switchrate is one second. In the next invocation of the script you increase the buffer size significantly:

# dtrace -x bufsize=300m -s stress.d >/var/tmp/stress.d.out
dtrace: script 'stress.d' matched 38665 probes
dtrace: buffer size lowered to 150m
# ls -l /var/tmp/stress.d.out
-rw-r--r--   1 root     root     /var/tmp/stress.d.out
Note that DTrace lowers the setting for buffer size because there is not enough memory. By increasing the buffer size, you eliminated all drops and created 18 Mbytes of trace data. In the next example you use a smaller buffer size, but with an increased switchrate value:

# dtrace -x bufsize=64m -x switchrate=16 -s stress.d \
> >/var/tmp/stress.d.out
dtrace: script 'stress.d' matched 38665 probes
^C
# ls -l /var/tmp/stress.d.out
-rw-r--r--   1 root     root     33052791 Mar 13 15:06 /var/tmp/stress.d.out
With the ring buffer policy, the dtrace(1M) utility does not display any output until the process terminates; at that time the ring buffer is consumed and processed. Note that if a given record cannot fit in the buffer (that is, if the record is larger than the buffer size), the record is dropped regardless of buffer policy. By adding the following two lines to a D script, you can enable ring buffering with a specific buffer size:

    #pragma D option bufpolicy=ring
    #pragma D option bufsize=16k
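A complete script built around those two pragmas might look like the following sketch. The traced probe and the ten-second stop condition are assumptions made for illustration.

```d
#!/usr/sbin/dtrace -s

#pragma D option bufpolicy=ring
#pragma D option bufsize=16k

/* Record the name of every program that enters a system call. */
syscall:::entry
{
        trace(execname);
}

/* Stop after ten seconds; only then is the ring buffer processed. */
tick-10sec
{
        exit(0);
}
```

Because older records are overwritten as the ring wraps, the output holds only the most recent records that fit in each per-CPU buffer, which suits questions of the form "what happened just before the end?"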
Other Buffers
Principal buffers exist in every DTrace enabling. In addition to principal buffers, some DTrace consumers have additional in-kernel data buffers: an aggregation buffer, a number of speculative buffers, or both. You tune the aggregation buffer size with the aggsize option, and you tune the speculative buffer size with the specsize option. You can tune the size of each buffer on a per-consumer basis. Note that setting the buffer sizes denotes the sizes of the buffers on each CPU. Moreover, for the switch buffer policy, bufsize denotes the individual sizes of the active and inactive buffers on each CPU.
Debugging DTrace Scripts

# cat comments.d
/* This D script counts the number of read system calls */
#!/usr/sbin/dtrace -s
syscall::read:entry
{
        @["Number of reads:"] = count();
}
# ./comments.d
./comments.d: line 1: /bin: is a directory
./comments.d: line 3: syscall::read:entry: command not found
./comments.d: line 5: syntax error near unexpected token `('
./comments.d: line 5: ` @["Number of reads:"] = count();'

You must match up /* with an ending */ for comments in D scripts:

# cat comments2.d
#!/usr/sbin/dtrace -s
/* This D script counts the number of read system calls
syscall::read:entry
{
        @["Number of reads:"] = count();
}
# ./comments2.d
dtrace: failed to compile script ./comments2.d: line 7: end-of-file encountered before matching */

If you have more than one statement in a probe clause, make sure you end each one with a semicolon:

...
BEGIN
{
        a=$1
        b=$2
        c=$3
}
...
# ./badstart2.d 1 2 3
dtrace: failed to compile script ./badstart2.d: line 6: syntax error near "b"

When comparing values, make sure that you use the == relational operator and not =:

# cat test5.d
#!/usr/sbin/dtrace -s

fbt::sema_init:entry
/arg1 = 1/
{
        trace(timestamp);
}
# ./test5.d
dtrace: failed to compile script ./test5.d: line 4: operator = can only be applied to a writable variable

The first assignment to a variable determines its type. As in the C language, you cannot mix types in the D language:

# cat test8.d
#!/usr/sbin/dtrace -s

BEGIN
{
        vp = `rootdir;
        i = 5;
}

tick-1sec
{
        i = *vp;
}
# ./test8.d
dtrace: failed to compile script ./test8.d: line 11: operands have incompatible types: "int" = "vnode_t"

Remember that even with the -w dtrace(1M) option, which enables destructive actions, you cannot modify kernel variables:

# cat test6.d
#!/usr/sbin/dtrace -ws

tick-5sec
/`freemem < `lotsfree/
{
        `lotsfree = `lotsfree*2;
}
# ./test6.d
dtrace: failed to compile script ./test6.d: line 6: operator = can only be applied to a writable variable
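For comparison, here is a sketch of how the semicolon and comparison mistakes shown above could be corrected. The file name fixed.d is hypothetical, and combining both fixes in one script is an assumption made for brevity.

```d
#!/usr/sbin/dtrace -s

/*
 * Hypothetical usage: ./fixed.d 1 2 3
 * Every statement in the clause ends with a semicolon.
 */
BEGIN
{
        a = $1;
        b = $2;
        c = $3;
}

/* The predicate compares with ==; a single = would be an assignment. */
fbt::sema_init:entry
/arg1 == 1/
{
        trace(timestamp);
}
```

The kernel-variable case has no equivalent fix: as the test6.d output shows, D can read `lotsfree but cannot assign to it.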
BEGIN
{
        a=$1;
        b=$2;
}

tick-1sec
/execname == $3/

# ./badstart5.d 1 2 init
dtrace: failed to compile script ./badstart5.d: line 10: failed to resolve init: Unknown variable name
# ./badstart5.d 1 2 '"init"'
^C
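An alternative to quoting the argument in the shell is the $$ form of a macro argument, which this module already lists among the cacheable predicates (/execname == $$1/). The sketch below assumes a hypothetical script name, byname.d, and an action chosen only for illustration.

```d
#!/usr/sbin/dtrace -s

/*
 * Hypothetical usage: ./byname.d init
 *
 * $$1 expands the first argument as a quoted string, so the program
 * name is compared as a string and is not parsed as a D identifier.
 */
tick-1sec
/execname == $$1/
{
        trace(pid);
}
```

With $$1 the caller can pass the program name directly, without the nested '"init"' quoting.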
Avoid misspelled words, which are a common problem in writing D scripts:

# ./test1.d
dtrace: failed to compile script ./test1.d: line 3: probe description syscall::opn:entry does not match any probes

The following script uses an improper probe description:

# cat test2.d
#!/usr/sbin/dtrace -s

syscall
{
        trace(timestamp);
}
# ./test2.d
dtrace: failed to compile script ./test2.d: line 3: probe description :::syscall does not match any probes

When using the printf() and printa() built-in functions, make sure that the arguments match the format specifiers in type and number:

# cat -n test3.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  sched:::on-cpu
     4  /pid != $pid && pid != 0/
     5  {
     6          @[curpsinfo->pr_psargs, curcpu->cpu_id] = count();
     7  }
     8
     9  END
    10  {
    11          printf("%-30s %4s %6s\n", "Command", "CPU");
    12          printa("%-30s %4d %@6d\n", @);
    13  }
# ./test3.d
dtrace: failed to compile script ./test3.d: line 11: printf( ) prototype mismatch: conversion #3 (%s) is missing a corresponding value argument
# cat -n test3a.d
     1  #!/usr/sbin/dtrace -qs
     2
     3  sched:::on-cpu
     4  /pid != $pid && pid != 0/
     5  {
     6          @[curpsinfo->pr_psargs, curcpu->cpu_id] = count();
     7  }
     8
     9  END
    10  {
    11          printf("%-30s %4s %6s\n", "Command", "CPU", "Count");
    12          printa("%-30s %4s %@6d\n", @);
    13  }
# ./test3a.d
dtrace: failed to compile script ./test3a.d: line 12: printa( ) argument #3 is incompatible with conversion #2 prototype:
        conversion: %s
         prototype: char [] or string (or use stringof)
          argument: processorid_t
# cat test4.d
#!/usr/sbin/dtrace -s

syscall::open:entry
{
        printf("%s was opening: %s\n", execname, arg0);
}
# ./test4.d
dtrace: failed to compile script ./test4.d: line 5: printf( ) argument #3 is incompatible with conversion #2 prototype:
        conversion: %s
         prototype: char [] or string (or use stringof)
          argument: int64_t
Remember that pointer arguments to system calls are user addresses, not kernel addresses. You must use the copyinstr() built-in function to retrieve the strings:
# cat test4a.d
#!/usr/sbin/dtrace -s

syscall::open:entry
{
        printf("%s was opening: %s\n", execname, stringof(arg0));
}
# ./test4a.d
dtrace: script './test4a.d' matched 1 probe
dtrace: error on enabled probe ID 1 (ID 37: syscall::open:entry): invalid address (0xff3d79d3) in action #2
dtrace: error on enabled probe ID 1 (ID 37: syscall::open:entry): invalid address (0xff3ed570) in action #2
dtrace: error on enabled probe ID 1 (ID 37: syscall::open:entry): invalid address (0xff3ef6d0) in action #2
^C
# cat test4b.d
#!/usr/sbin/dtrace -s

syscall::open:entry
{
        printf("%s was opening: %s\n", execname, copyinstr(arg0));
}
# ./test4b.d
dtrace: script './test4b.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0     37                       open:entry ls was opening: /var/ld/ld.config
  0     37                       open:entry ls was opening: /lib/libc.so.1
  0     37                       open:entry ls was opening: /usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
  0     37                       open:entry cat was opening: /var/ld/ld.config
  0     37                       open:entry cat was opening: /lib/libc.so.1
dtrace: error on enabled probe ID 2 (ID 36402: profile:::tick-3sec): divide-by-zero in action #1 at DIF offset 20
^C
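The script that produced this error is not reproduced here, so the following is only a sketch of one way to avoid a divide-by-zero at run time: test the divisor in the predicate. The variable names bytes and reads and the statistic being computed are assumptions made for illustration.

```d
#!/usr/sbin/dtrace -qs

/* Accumulate the size of each successful read. */
syscall::read:return
/arg0 > 0/
{
        bytes += arg0;
        reads++;
}

/* The predicate skips the division while no reads have been seen. */
tick-3sec
/reads != 0/
{
        printf("average read size: %d bytes\n", bytes / reads);
}
```

The global counters are used here only to keep the sketch short; as the aggregation discussion earlier in this module explains, aggregations are the scalable way to gather such totals.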
Appendix A
- Actions that trace data or modify state external to DTrace
- Subroutines that only affect internal DTrace state

This appendix formally defines the set of actions and subroutines available in DTrace, along with their syntax and semantics. This appendix enables you to:

- Describe the default action
- Describe and use data recording actions
- Describe and use destructive actions
- Describe and use special actions
- Describe and use subroutines
Default Action
A clause need not contain an action; it may instead consist simply of manipulation of variable state, or of any combination of actions and manipulations of variable state. If a clause contains no actions and no D manipulation (that is, if a clause is empty), the default action is taken. The default action is to trace the enabled probe identifier (EPID) to the principal buffer. The EPID identifies a particular enabling of a particular probe with a particular predicate and actions. From the EPID, DTrace consumers can determine which probe induced the action. Indeed, whenever data is traced, it must be accompanied by the EPID to allow the consumer to make sense of the data; hence the default action is to trace the EPID and nothing else. Using the default action allows for simple use of the dtrace(1M) command. For example, you can enable all probes in the TS module with the default action by using:

# dtrace -m TS

(The TS module implements the timesharing scheduling class; see dispadmin(1M) for more information.) The above command results in output similar to the following:

# dtrace -m TS
dtrace: description 'TS' matched 93 probes
CPU     ID                    FUNCTION:NAME
  0  14297                 ts_preempt:entry
  0  14298                ts_preempt:return
  0  14301                   ts_sleep:entry
  0  14302                  ts_sleep:return
  0  14301                   ts_sleep:entry
  0  14302                  ts_sleep:return
  0  14301                   ts_sleep:entry
  0  14302                  ts_sleep:return
  0  14329                  ts_update:entry
  0  14331             ts_update_list:entry
  0  14327         ts_change_priority:entry
  0  14328        ts_change_priority:return
  0  14332            ts_update_list:return
  0  14331             ts_update_list:entry
  0  14332            ts_update_list:return
  0  14331             ts_update_list:entry
...
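In a script, the default action is what you get from an empty clause. The sketch below assumes you want only the fbt probes in the TS module, which is narrower than the -m TS command above because that command matches the module across all providers.

```d
#!/usr/sbin/dtrace -s

/*
 * An empty clause: no actions and no variable manipulation.
 * DTrace therefore takes the default action and records only the
 * EPID, which dtrace(1M) prints as the CPU, ID, and FUNCTION:NAME
 * columns.
 */
fbt:TS::
{
}
```

Adding any action to the clause, such as trace(timestamp), replaces the default action rather than supplementing it.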
Data Recording Actions

The printf() action tells DTrace to trace the data associated with each argument after the first argument, and then to format the results using the rules described by the first printf() argument, known as a format string. The format string is a regular string that contains any number of format conversions, each beginning with the % character, which describe how to format the corresponding argument. The first conversion in the format string corresponds to the second printf() argument, the second conversion to the third argument, and so on. All of the text between conversions is printed verbatim. The character following the % conversion character describes the format to use for the corresponding argument. Unlike printf(3C), DTrace printf() is implemented as a built-in function that is recognized by the D compiler. The D compiler provides several useful services for the DTrace printf() action that are not found in the C library printf():

- The D compiler compares the arguments to the conversions in the format string. If an argument's type is incompatible with the format conversion, the D compiler produces an error message explaining the problem.
- The D compiler does not require the use of size prefixes with printf() format conversions. The C printf() routine requires that you indicate the size of arguments by adding prefixes, such as %ld for long or %lld for long long. The D compiler knows the size and type of your arguments, so these prefixes are not required in your D printf() statements.
- DTrace provides additional format characters that are useful for debugging and observability; for example, the %a format conversion can be used to print a pointer as a symbol name and offset.

To implement these features, the format string in the DTrace printf() function must be specified as a string constant in your D program; format strings cannot be dynamic variables of type string.
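The following sketch exercises two of those services together: a plain %d conversion with no size prefix, and the %a conversion. The choice of probe and the use of the built-in caller variable are assumptions made for illustration.

```d
#!/usr/sbin/dtrace -qs

/*
 * arg0 is a 64-bit value, yet plain %d is enough because the D
 * compiler already knows the argument's size. caller holds the
 * program counter of the calling function, which %a prints as a
 * kernel symbol name plus offset.
 */
fbt::kmem_alloc:entry
{
        printf("%s requested %d bytes from %a\n", execname, arg0, caller);
}
```

If the argument types did not suit the conversions (for example, passing arg0 to %s), the script would fail at compile time rather than print garbage at run time.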
Conversion Specifications

Each conversion specification in the format string is introduced by the % character, after which the following appear in sequence:

- Zero or more flags (in any order), which modify the meaning of the conversion specification as described in the following subsection.
- An optional minimum field width. If the converted value has fewer bytes than the field width, it is padded with spaces on the left by default, or on the right if the left-adjustment flag (-) is specified. The field width can also be specified as an asterisk (*), in which case the field width is set dynamically based on the value of an additional argument of type int.
- An optional precision that indicates one of the following:
    - The minimum number of digits to appear for the d, i, o, u, x, and X conversions (the field is padded with leading zeroes)
    - The number of digits to appear after the radix character for the e, E, and f conversions
    - The maximum number of significant digits for the g and G conversions
    - The maximum number of bytes to be printed from a string by the s conversion
  The precision takes the form of a period (.) followed by either an asterisk (*), as described in the Width and Precision Specifiers subsection, or by a decimal digit string.
- An optional sequence of size prefixes that indicate the size of the corresponding argument (described in the Size Prefixes subsection). The size prefixes are not necessary in D and are provided solely for compatibility with the C printf() function.
- A conversion specifier (described in the following subsection) that indicates the type of conversion to be applied to the argument.

The printf(3C) function also supports conversion specifications of the form %n$ where n is a decimal integer; DTrace printf() does not support this type of conversion specification.
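A short sketch of field width and precision in use follows. The literal values and the bracket characters, which only make the padding visible, are arbitrary choices for illustration.

```d
#!/usr/sbin/dtrace -qs

BEGIN
{
        /* Minimum field width of 10: padded with spaces on the left. */
        printf("[%10d]\n", 42);

        /* Left-adjustment flag: the padding moves to the right. */
        printf("[%-10d]\n", 42);

        /* Width given as *: it is taken from the extra int argument. */
        printf("[%*d]\n", 6, 42);

        /* Precision on %s limits how much of the string is printed. */
        printf("[%.3s]\n", "dtrace");

        exit(0);
}
```

Because the format string must be a constant, the * form is the way to choose a width at run time.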
Flag Specifiers

You enable the printf() conversion flags by specifying one or more of the following characters, which can appear in any order:

- (') The integer portion of the result of a decimal conversion (%i, %d, %u, %f, %g, or %G) is formatted with thousands grouping characters using the non-monetary grouping character. Not all locales, including the POSIX C locale, provide non-monetary grouping characters for use with this flag.
- (-) The result of the conversion is left-justified within the field. The conversion is right-justified if this flag is not specified.
- (+) The result of a signed conversion always begins with a sign (+ or -). If this flag is not specified, the conversion begins with a sign only when a negative value is converted.
- (space) If the first character of a signed conversion is not a sign or if a signed conversion results in no characters, a space is placed before the result. If the space and + flags both appear, the space flag is ignored.
- (#) The value is converted to an alternate form if one is defined for the selected conversion. The alternate formats for conversions are described below in the text corresponding to each conversion.
- (0) For d, i, o, u, x, X, e, E, f, g, and G conversions, leading zeroes (following any indication of sign or base) are used to pad to the field width; no space padding is performed. If the 0 and - flags both appear, the 0 flag is ignored. For d, i, o, u, x, and X conversions, if a precision is specified, the 0 flag is ignored. If the 0 and ' flags both appear, the grouping characters are inserted before the zero padding.
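The sketch below applies three of these flags to literal values. It is illustrative only; the ' flag is left out because its effect depends on the locale.

```d
#!/usr/sbin/dtrace -qs

BEGIN
{
        /* + flag: a positive value is printed with an explicit sign. */
        printf("%+d\n", 42);

        /* # flag: alternate form, so the hexadecimal value gets 0x. */
        printf("%#x\n", 255);

        /* 0 flag: pad to the field width with zeroes, not spaces. */
        printf("%08d\n", 42);

        exit(0);
}
```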
Size Prefixes

Size prefixes are required in ANSI-C programs that use printf(3C) in order to indicate the size and type of the conversion argument. The D compiler performs this processing for your printf() calls automatically, so size prefixes are not required. Although size prefixes are provided for C compatibility, their use is explicitly discouraged in D programs because they also tend to bind your code to a particular data model when using derived types. For example, if a typedef is redefined to different integer base types depending on the data model, it is not possible to use a single C conversion that works in both data models without explicitly knowing the two underlying types and including a cast expression, or defining multiple format strings. The D compiler solves this problem by allowing you to omit size prefixes and automatically determining the argument size. The size prefixes can be placed just before the format conversion name and after any flags, widths, and precision specifiers. The size prefixes are:

- Optional h specifies that a following d, i, o, u, x, or X conversion applies to a short or unsigned short
- Optional l specifies that a following d, i, o, u, x, or X conversion applies to a long or unsigned long
- Optional ll specifies that a following d, i, o, u, x, or X conversion applies to a long long or unsigned long long
- Optional L specifies that a following e, E, f, g, or G conversion applies to a long double
- Optional l specifies that a following c conversion applies to a wint_t argument; an optional l specifies that a following s conversion character applies to a pointer to a wchar_t argument
Conversion Formats

Each conversion character sequence results in fetching zero or more arguments. If you do not provide sufficient arguments for the format string, or if the format string is exhausted and arguments remain, the D compiler issues an error message. If you specify an undefined conversion format, the D compiler issues an error message. The conversion character sequences and their meanings are:

a       The pointer or uintptr_t argument is printed as a kernel symbol name in the form module`symbol-name plus an optional hexadecimal byte offset. If the value does not fall within the range defined by a known kernel symbol, the value is printed as a hexadecimal integer.

c       The char, short, or int argument is printed as an ASCII character.

d       The char, short, int, long, or long long argument is printed as a decimal (base 10) integer. If the argument is signed, it is printed as a signed value. If the argument is unsigned, it is printed as an unsigned value. This conversion has the same meaning as i.

e, E    The float, double, or long double argument is converted to the style [-]d.ddde±dd, where there is one digit before the radix character (which is non-zero if the argument is non-zero) and the number of digits after it is equal to the precision. If you do not specify the precision, the default precision value is 6. If the precision is 0 and the # flag is not specified, no radix character appears. The E conversion format produces a number with E instead of e introducing the exponent. The exponent always contains at least two digits. The value is rounded up to the appropriate number of digits.

f       The float, double, or long double argument is converted to the style [-]ddd.ddd, where the number of digits after the radix character is equal to the precision specification. If you do not specify the precision, the default precision value is 6. If the precision is 0 and the # flag is not specified, no radix character appears. If a radix character appears, at least one digit appears before it. The value is rounded up to the appropriate number of digits.

g, G    The float, double, or long double argument is printed in the style f or e (or in style E in the case of a G conversion character), with the precision specifying the number of significant digits. If an explicit precision is 0, it is taken as 1. The style used depends on the value converted: style e (or E) is used only if the exponent resulting from the conversion is less than -4 or greater than or equal to the precision. Trailing zeroes are removed from the fractional part of the result. A radix character appears only if it is followed by a digit. If the # flag is specified, trailing zeroes are not removed from the result as they normally are.

i       The char, short, int, long, or long long argument is printed as a decimal (base 10) integer. If the argument is signed, it is printed as a signed value. If the argument is unsigned, it is printed as an unsigned value. This conversion has the same meaning as d.

o       The char, short, int, long, or long long argument is printed as an unsigned octal (base 8) integer. Arguments that are signed or unsigned can be used with this conversion. If the # flag is specified, the precision of the result is increased if necessary to force the first digit of the result to be a zero.

p       The pointer or uintptr_t argument is printed as a hexadecimal (base 16) integer. D accepts pointer arguments of any type. If the # flag is specified, a non-zero result has 0x prepended to it.

s       The argument must be an array of char or a string. Bytes from the array or string are read up to a terminating null character or to the end of the data and are interpreted and printed as ASCII characters. If the precision is not specified, it is taken to be infinite, so all characters up to the first null character are printed. If the precision is specified, only that portion of the character array that displays in the corresponding number of screen columns is printed. If an argument of type char * is to be formatted, it should be cast to string or prefixed with the D stringof operator to indicate that DTrace should trace the bytes of the string and format them.

u       The char, short, int, long, or long long argument is printed as an unsigned decimal (base 10) integer. Arguments that are signed or unsigned can be used with this conversion, and the result is always formatted as unsigned.

wc      The int argument is converted to a wide character (wchar_t) and the resulting wide character is printed.

ws      The argument must be an array of wchar_t. Bytes from the array are read up to a terminating null character or to the end of the data and are interpreted and printed as wide characters. If the precision is not specified, it is taken to be infinite, so all wide characters up to the first null character are printed. If the precision is specified, only that portion of the wide character array that displays in the corresponding number of screen columns is printed.

x, X    The char, short, int, long, or long long argument is printed as an unsigned hexadecimal (base 16) integer. Arguments that are signed or unsigned can be used with this conversion. If the x form of the conversion is used, the letter digits abcdef are used. If the X form of the conversion is used, the letter digits ABCDEF are used. If the # flag is specified, a non-zero result has 0x (for %x) or 0X (for %X) prepended to it.

%       Print a literal % character; no argument is converted. The entire conversion specification must be %%.
The printa() action is used to format the results of aggregations in a D program. If the first form of the action is used, the dtrace(1M) command takes a consistent snapshot of the aggregation data and produces output equivalent to the default output format used for aggregations. If the second form of the function is used, the dtrace(1M) command takes a consistent snapshot of the aggregation data and produces output based on the conversions specified in the format string, according to the rules described in the following subsection.
The format conversions must match the tuple signature used to create the aggregation. Each tuple element can only appear once. For example, suppose you aggregate a count using the following D statements:

    @a["hello", 123] = count();
    @a["goodbye", 456] = count();

If you then add the D statement printa(format-string, @a) to a probe clause, the dtrace utility snapshots the aggregation data and produces output as if you had entered the following statements, one for each tuple defined in the aggregation:

    printf(format-string, "hello", 123);
    printf(format-string, "goodbye", 456);
Unlike printf(), the format string you use for printa() need not include all elements of the tuple (that is, you can have a tuple of length 3 and only one format conversion). Therefore, you can omit any tuple keys from your printa() output by changing your aggregation declaration to move the ones you want to omit to the end of the tuple and then omitting the corresponding conversion specifiers for them from the printa() format string.
The aggregation result itself can be included in the output by using the additional @ format flag character, which is only valid when used with printa(). The @ flag can be combined with any appropriate format conversion specifier, and can appear more than once in a format string. This means that your tuple result can appear anywhere in the output and can appear more than once. The set of conversion specifiers that can be used with each aggregating function is implied by the aggregating function's result type, listed below:
- uint64_t avg()
- uint64_t count()
- int64_t lquantize()
- uint64_t max()
- uint64_t min()
- int64_t quantize()
- uint64_t sum()
For example, to format the results of avg(), you can apply the %d, %i, %o, %u, or %x format conversions. The quantize() and lquantize() functions format their results as an ASCII table rather than as a single value.
    1 1 1 1 1 1 1 ...
    0xff14f950 genunix`cyclic_softint+0x588
    0xfef2280c genunix`getf+0xdc
    ufs`ufs_icheck+0x50
    genunix`infpollinfo+0x80
    genunix`kmem_log_enter+0x1e8
The stack() action records a kernel stack trace to the directed buffer. The kernel stack is nframes in depth. If you do not provide nframes, the number of stack frames recorded is the number specified by the stackframes option. For example:
    # dtrace -n 'uiomove:entry{stack()}'
    CPU     ID                    FUNCTION:NAME
      0  12200                    uiomove:entry
                  ufs`rdip+0x338
                  ufs`ufs_read+0x208
                  genunix`vn_rdwr+0x1c0
                  elfexec`getelfphdr+0xa4
                  elfexec`elf32exec+0x7a0
                  genunix`gexec+0x324
                  genunix`exec_common+0x278
                  genunix`exece+0xc
                  unix`syscall_trap32+0xcc
      0  12200                    uiomove:entry
                  ufs`ufs_readlink+0x11c
                  genunix`pn_getsymlink+0x40
                  genunix`lookuppnvp+0x414
                  genunix`lookuppnat+0x120
                  genunix`resolvepath+0x50
                  unix`syscall_trap32+0xcc
    ...
The stack() action differs from other actions in that it can also be used as a key to an aggregation:
    # dtrace -n 'kmem_alloc:entry {@[stack()] = count()}'
    dtrace: description 'kmem_alloc:entry ' matched 1 probe
    ^C
                  genunix`installctx+0xc
                  genunix`schedctl+0x5c
                  unix`syscall_trap+0xac
                    1
                  genunix`schedctl_shared_alloc+0xc0
                  genunix`schedctl+0x18
                  unix`syscall_trap+0xac
                    1
                  unix`lgrp_shm_policy_set+0x168
                  genunix`segvn_create+0x82c
                  genunix`as_map+0xf0
                  genunix`schedctl_map+0x98
                  genunix`schedctl_shared_alloc+0x8c
                  genunix`schedctl+0x18
                  unix`syscall_trap+0xac
                    1
    ...
                  sd`xbuf_iostart+0x7c
                  ufs`log_roll_write_bufs+0x100
                  ufs`log_roll_write+0xe4
                  ufs`trans_roll+0x2f8
                  unix`thread_start+0x4
                   16
The ustack() action records a user stack trace to the directed buffer. The user stack is nframes in depth. If you do not specify nframes, the number of stack frames recorded is the number specified by the ustackframes option. Although ustack() can determine the address of the calling frames when the probe fires, the stack frames are not translated into symbols until the ustack() action is processed at user-level by the DTrace consumer.
Note that some functions are static and therefore do not have entries in the symbol table; call sites in these functions are displayed with their hexadecimal address. Also, because ustack() symbol translation does not occur until after the data is recorded, there exists a possibility that the process in question has exited, making stack frame translation impossible. In this case, the dtrace utility emits a warning, followed by the hexadecimal stack frames. For example:
    dtrace: failed to grab process 100941: no such process
                  c7b834d4
                  c7bca95d
                  c7bca1a4
                  c7bd4374
                  c7bc2528
                  8047efc
Finally, because the postmortem DTrace debugger commands cannot perform the frame translation, using ustack() with a ring buffer policy always results in raw ustack() data.
                  vi`ovbeg+0x30
                  vi`vop+0x158
                  vi`commands+0x13d0
                  vi`main+0xf24
                  vi`_start+0x108
                    1
    ...
                  libc.so.1`_brk_unlocked+0x4
                  libc.so.1`sbrk+0x24
                  vi`morelines+0x4
                  vi`append+0xc4
                  vi`put+0xe4
                  vi`vremote+0x64
                  vi`vmain+0x1670
                  vi`vop+0x25c
                  vi`commands+0x13d0
                  vi`main+0xf24
                  vi`_start+0x108
                   35
Destructive Actions
Some actions are destructive in that they change the state of the system. Although they change the system in a well-defined way, they change it nonetheless. You cannot use destructive actions unless you have explicitly enabled them. In the dtrace(1M) command, you enable destructive actions with the -w option. If you attempt to use destructive actions in the dtrace(1M) command without explicitly enabling them, dtrace fails, returning an error message similar to:
    dtrace: could not enable tracing: Destructive actions not allowed
The exact method for setting dtrace_destructive_disallow depends on the kernel debugger that you are using. If you are using OpenBoot PROM on SPARC, follow these steps:
1. Use w! as follows:
       ok 1 dtrace_destructive_disallow w!
       ok
2. Confirm that this has been set using w?:
       ok dtrace_destructive_disallow w?
       1
       ok
3. Continue by using go:
       ok go
If you are using the kadb(1M) debugger on x86, follow these steps:
1. Use the 4-byte write modifier (W) with the / formatting dcmd:
       kadb[0]: dtrace_destructive_disallow/W 1
       dtrace_destructive_disallow:    0x0     =       0x1
       kadb[0]:
2. Continue by entering :c:
       kadb[0]: :c
If you wish to re-enable destructive actions after continuing, you must explicitly reset dtrace_destructive_disallow back to 0. You do this using the mdb(1) debugger:
    # echo "dtrace_destructive_disallow/W 0" | mdb -kw
    dtrace_destructive_disallow:    0x1     =       0x0
    #
    000002a10050b840 dtrace:dtrace_probe+518 (fffe, 0, 1830f88, 1830f88, 30002fb8040, 300000acfc8)
      %l0-3: 0000000000000000 00000300030e4d80 0000030003418000 00000300018c0800
      %l4-7: 000002a10050b980 0000000000000500 0000000000000000 0000000000000502
    000002a10050ba30 genunix:dtrace_systrace_syscall32+44 (0, 2000, 5, 80000002, 3, 1898400)
      %l0-3: 00000300030de730 0000000002200008 00000000000000e0 000000000184d928
      %l4-7: 00000300030de000 0000000000000730 0000000000000073 0000000000000010
    syncing file systems... 2 done
    dumping to /dev/dsk/c0t0d0s1, offset 214827008, content: kernel
    100% done: 11837 pages dumped, compression ratio 4.66, dump succeeded
    rebooting...
In addition, syslogd(1M) emits a message upon reboot:
    Jun 10 16:56:31 machine1 savecore: [ID 570001 auth.error] reboot after panic: dtrace: panic action at probe syscall::mmap:entry (ecb 300000acfc8)
The message buffer of the crash dump also contains the probe and ECB responsible for the panic() action.
Because interrupts are disabled while in DTrace probe context, any use of the chill() action induces interrupt latency, scheduling latency, dispatch latency, and so on. The chill() action can, therefore, cause strange systemic effects, and should not be used indiscriminately. Moreover, because the liveness of the system relies on being able to periodically handle interrupts, DTrace refuses to implement the chill() action for longer than 500 milliseconds within any given one-second interval, and instead reports an illegal operation error:
    # dtrace -w -n 'syscall::open:entry {chill(500000001)}'
    dtrace: description 'syscall::open:entry ' matched 1 probe
    dtrace: allowing destructive actions
    dtrace: error on enabled probe ID 2 (ID 18022: syscall::open:entry): illegal operation in action #1
The cap is enforced even if the time is spread across multiple calls to chill(), or if the time is spread across multiple DTrace consumers for a single probe.
Special Actions
Some actions do not fall into either the data recording action or the destructive action category. These other special actions fall into one of two sets. The first set contains those actions associated with speculative tracing. The second set contains the exit() action.
speculate(int id) The speculate() action denotes that the remainder of the probe clause should be traced to the speculative buffer specified by id.
commit(int id) The commit() action commits the speculative buffer associated with id.
discard(int id) The discard() action discards the speculative buffer associated with id.
Subroutines
Subroutines differ from actions in that they generally only affect internal DTrace state. There is therefore no such thing as a destructive subroutine, and subroutines never trace data into buffers. Many subroutines have analogs in Section 9F or Section 3C of the manual pages; see Intro(9F) and Intro(3), respectively.
- Declared statically
- Created with an interrupt block cookie of NULL, or
- Created with an interrupt block cookie that does not correspond to a high-level interrupt.
See mutex_init(9F) for more details on mutexes. The great majority of mutexes in the Solaris kernel are adaptive.
Appendix B
- Built-in variables provided by the D language
- Macro variables provided by the D language
Copyright 2005 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Built-in Variables
You have seen a number of special built-in D variables in the example programs, including timestamp, pid, and others. All of these variables are scalar global variables; currently D does not define thread-local variables, clause-local variables, or built-in associative arrays. Table B-1 shows the complete list of D built-in variables.
Table B-1 DTrace Built-in Variables
Type and Name / Description
int64_t arg0, ..., arg9
    The first ten input arguments to a probe represented as raw 64-bit integers. If fewer than ten arguments are passed to the current probe, the remaining variables return zero.
args[]
    The typed arguments to the current probe, if any. The args[] array is accessed using an integer index, but each element is defined to be the type corresponding to the given probe argument. For example, if args[] is referenced by a read(2) system call probe, args[0] is of type int, args[1] is of type void *, and args[2] is of type size_t.
uintptr_t caller
    The program counter location of the current thread just before entering the current probe.
lwpsinfo_t *curlwpsinfo
    The lightweight process (LWP) state of the LWP associated with the current thread. This structure is described in further detail in proc(4).
psinfo_t *curpsinfo
    The process state of the process associated with the current thread. This structure is described in further detail in proc(4).
kthread_t *curthread
    The address of the operating system kernel's internal data structure for the current thread, the kthread_t structure. The kthread_t is defined in <sys/thread.h>.
string cwd
    The name of the current working directory of the process associated with the current thread.
uint_t epid
    The enabled probe ID (EPID) for the current probe. This integer uniquely identifies a particular probe that is enabled with a specific predicate and set of actions.
int errno
    The error value returned by the last system call executed by this thread.
string execname
    The name that was passed to exec(2) to execute the current process.
uint_t id
    The probe ID for the current probe. This is the system-wide unique identifier for the probe as published by DTrace and listed in the output of dtrace -l.
uint_t ipl
    The interrupt priority level (IPL) on the current CPU at probe firing time.
pid_t pid
    The process ID of the current process.
string probefunc
    The function name portion of the current probe's description.
string probemod
    The module name portion of the current probe's description.
string probename
    The name portion of the current probe's description.
string probeprov
    The provider name portion of the current probe's description.
string root
    The name of the root directory of the process associated with the current thread.
uint_t stackdepth
    The current thread's stack frame depth at probe firing time.
id_t tid
    The thread ID of the current thread. For threads associated with user processes, this value is equal to the result of a call to pthread_self(3C).
uint64_t timestamp
    The current value of a nanosecond timestamp counter. This counter increments from an arbitrary point in the past and should only be used for relative computations.
uregs[]
    The current thread's saved user-mode register values at probe firing time.
uint64_t vtimestamp
    The current value of a nanosecond timestamp counter that is virtualized to the amount of time that the current thread has been running on a CPU, minus the time spent in DTrace predicates and actions. This counter increments from an arbitrary point in the past and should only be used for relative time computations.
Macro Variables
The D compiler defines a set of built-in macro variables that you can use when writing D programs or interpreter files. Macro variables are identifiers that are prefixed with a dollar sign ($) and are expanded once by the D compiler when processing your input file. Table B-2 shows the complete list of D macro variables.
Table B-2 D Macro Variables
    Name       Description           Reference
    $[0-9]+    Macro arguments       See Module 2, Built-in Macro Variables
    $egid      Effective group ID    getegid(2)
    $euid      Effective user ID     geteuid(2)
    $gid       Real group ID         getgid(2)
    $pid       Process ID            getpid(2)
    $pgid      Process group ID      getpgid(2)
    $ppid      Parent process ID     getppid(2)
    $projid    Project ID            getprojid(2)
    $sid       Session ID            getsid(2)
    $taskid    Task ID               gettaskid(2)
    $uid       Real user ID          getuid(2)
Appendix C
D Operators
This appendix defines and describes the following D operators:
- Arithmetic operators
- Relational operators
- Logical operators
- Bitwise operators
- Assignment operators
- Increment and decrement operators
Arithmetic Operators
D provides the standard arithmetic operators for use in your programs. These operators all have the same meaning as they do in ANSI-C for integer operands. Table C-1 shows the D binary arithmetic operators.
Table C-1 D Binary Arithmetic Operators
    Operator    Meaning
    +           Integer addition
    -           Integer subtraction
    *           Integer multiplication
    /           Integer division
    %           Integer modulus
Arithmetic in D can only be performed on integer operands or on pointers. Arithmetic cannot be performed on floating-point operands in D programs. The DTrace execution environment does not take any action on integer overflow or underflow; you must check for these conditions yourself in situations where they are applicable.
The DTrace execution environment does automatically check for and report division by zero errors resulting from improper use of the / and % operators. If a D program executes an invalid division operation, DTrace automatically disables the affected instrumentation and reports the error to you. Errors detected by DTrace have no effect on other DTrace users or on the operating system kernel, so you do not need to worry about causing any damage if your D program inadvertently contains one of these errors.
In addition to these binary operators, the + and - operators can also be used as unary operators; these have higher precedence than any of the binary arithmetic operators. The order of precedence and associativity properties for all the D operators is summarized at the end of this appendix. You can control precedence by grouping expressions in parentheses ( ).
Relational Operators
D provides binary relational operators for use in your programs. These operators all have the same meaning as they do in ANSI-C. Table C-2 shows the D relational operators.
Table C-2 D Relational Operators
    Operator    Meaning
    <           Left-hand operand is less than right-hand operand
    <=          Left-hand operand is less than or equal to right-hand operand
    >           Left-hand operand is greater than right-hand operand
    >=          Left-hand operand is greater than or equal to right-hand operand
    ==          Left-hand operand is equal to right-hand operand
    !=          Left-hand operand is not equal to right-hand operand
Relational operators are most frequently used to write D predicates. Each operator evaluates to a value of type int, which is equal to 1 if the condition is true, and 0 if it is false. Relational operators can be applied to pairs of integers, pointers, or strings. If pointers are compared, the result is equivalent to an integer comparison of the two pointers interpreted as unsigned integers. If strings are compared, the result is determined as if by performing a strcmp(3C) on the two operands. Here are some example D string comparisons and their results:
    "coffee" < "espresso"     ... returns 1 (true)
    "coffee" == "coffee"      ... returns 1 (true)
    "coffee" >= "mocha"       ... returns 0 (false)
Relational operators can also be used to compare a data object associated with an enumeration type with any of the enumerator tags defined by the enumeration. Enumerations are a facility for creating named integer constants.
Logical Operators
D provides binary logical operators for use in your programs. Table C-3 shows the D logical operators. The first two are equivalent to the corresponding ANSI-C operators.
Table C-3 D Logical Operators
    Operator    Meaning
    &&          Logical AND: true if both operands are true
    ||          Logical OR: true if one or both operands are true
    ^^          Logical XOR: true if exactly one operand is true
Logical operators are most frequently used in writing D predicates. The logical AND operator performs short-circuit evaluation: if the left-hand operand is false, the right-hand expression is not evaluated. The logical OR operator also performs short-circuit evaluation: if the left-hand operand is true, the right-hand expression is not evaluated. The logical XOR operator does not short-circuit: both expression operands are always evaluated. In addition to the binary logical operators, the unary ! operator can be used to perform a logical negation of a single operand: it converts a zero operand into a 1 and a non-zero operand into a 0. By convention, D programmers use ! when working with integers that are meant to represent Boolean values and == 0 when working with non-Boolean integers, although both expressions are equivalent in meaning. The logical operators can be applied to operands of integer type or pointer type. The logical operators interpret pointer operands as unsigned integer values. As with all logical and relational operators in D, operands are true if they have a non-zero integer value and false if they have a zero integer value.
Bitwise Operators
D provides binary operators for manipulating individual bits inside of integer operands. These operators all have the same meaning as they do in ANSI-C. Table C-4 shows the D bitwise operators.
Table C-4 D Bitwise Operators
    Operator    Meaning
    &           Bitwise AND
    |           Bitwise OR
    ^           Bitwise XOR
    <<          Shift the left-hand operand left by the number of bits specified by the right-hand operand
    >>          Shift the left-hand operand right by the number of bits specified by the right-hand operand
You use the binary & operator to clear bits from an integer operand. You use the binary | operator to set bits in an integer operand. The binary ^ operator returns 1 in each bit position where exactly one of the corresponding operand bits is set.
You use the shift operators to move bits left or right in a given integer operand. Shifting left fills empty bit positions on the right-hand side of the result with zeroes. Shifting right using an unsigned integer operand fills empty bit positions on the left-hand side of the result with zeroes. Shifting right using a signed integer operand (an action known as an arithmetic shift operation) fills empty bit positions on the left-hand side with the value of the sign bit.
Shifting an integer value by a negative number of bits or by a number of bits larger than the number of bits in the left-hand operand itself produces an undefined result. The D compiler produces an error message if it detects this condition when you compile your D program.
In addition to the binary bitwise operators, you can use the unary ~ operator to perform a bitwise negation of a single operand: it converts each 0 bit in the operand into a 1 bit, and each 1 bit in the operand into a 0 bit.
Assignment Operators
D provides the following binary assignment operators for modifying D variables. Remember that you can only modify D variables and arrays: kernel data objects and constants cannot be modified using the D assignment operators. The assignment operators have the same meaning as they do in ANSI-C. Table C-5 shows the D assignment operators.
Table C-5 D Assignment Operators
    Operator    Meaning
    =           Set the left-hand operand equal to the right-hand expression value
    +=          Increment the left-hand operand by the right-hand expression value
    -=          Decrement the left-hand operand by the right-hand expression value
    *=          Multiply the left-hand operand by the right-hand expression value
    /=          Divide the left-hand operand by the right-hand expression value
    %=          Modulo the left-hand operand by the right-hand expression value
    |=          Bitwise OR the left-hand operand with the right-hand expression value
    &=          Bitwise AND the left-hand operand with the right-hand expression value
    ^=          Bitwise XOR the left-hand operand with the right-hand expression value
    <<=         Shift the left-hand operand left by the number of bits specified by the right-hand expression value
    >>=         Shift the left-hand operand right by the number of bits specified by the right-hand expression value
With the exception of the assignment operator =, the assignment operators are provided as shorthand for using the = operator with one of the other operators described previously. For example, the expression x = x + 1 is equivalent to the expression x += 1, except that the expression x is evaluated once. These assignment operators obey the same rules for operand types as the binary forms described previously. The result of any assignment operator is an expression equal to the new value of the left-hand expression. You can use the assignment operators, or any of the operators described so far, in combination to form expressions of arbitrary complexity. You can use parentheses ( ) to group terms in complex expressions.
If the operator appears after the variable name, the variable is modified after its current value is returned for use in the expression. For example, the following two expressions produce identical results:
    y = x; x -= 1;
    y = x--;
You can use the increment and decrement operators to create new variables without declaring them. If you omit a variable declaration and apply the increment or decrement operator to a variable, the variable is implicitly declared to be of type int64_t. You can apply the increment and decrement operators to integer or pointer variables. When applied to integer variables, the operators increment or decrement the corresponding value by one. When applied to pointer variables, the operators increment or decrement the pointer address by the size of the data type referenced by the pointer.
Conditional Expressions
Although D does not provide support for if-then-else constructs, it does provide support for simple conditional expressions using the ? and : operators. These operators permit a triplet of expressions to be associated where the first expression is used to conditionally evaluate one of the other two. For example, the following D statement can be used to set a variable x to one of two strings, depending on the value of i:
    x = i == 0 ? "zero" : "non-zero";
In this example, the expression i == 0 is first evaluated to determine if it is true or false. If the first expression is true, the second expression is evaluated and the ?: expression returns its value. If the first expression is false, the third expression is evaluated and the ?: expression returns its value.
As with any D operator, you can use multiple ?: operators in a single expression to create more complex expressions. For example, the following expression takes a char variable c containing one of the characters 0-9, a-z, or A-Z and returns the value of this character when interpreted as a digit in a hexadecimal (base 16) integer:
    hexval = (c >= '0' && c <= '9') ? c - '0' :
        (c >= 'a' && c <= 'z') ? c + 10 - 'a' : c + 10 - 'A';
The first expression used with ?: must be a pointer or integer in order to be evaluated for its truth value. The second and third expressions can be of any compatible types. You cannot construct a conditional expression in which, for example, one path returns a string and another an integer. The second and third expressions also cannot invoke a tracing function, such as trace() or printf(). If you want to trace data conditionally, you should use a predicate instead.