Processor Performance Counter Monitoring

Dr. Roman Dementiev


Senior Application Engineer, Software and Services Group
roman.dementiev@intel.com
14 July 2010

Software & Services Group 1

Legal Disclaimer
Intel may make changes to specifications and product descriptions at any time, without notice. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details. Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations. Virtualization technology-enabled BIOS and VMM applications are currently in development. 
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information. Lead-free: 45nm product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). Some EU RoHS exemptions for lead may apply to other components used in the product package. Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below 900 PPM bromine and 900 PPM chlorine. Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. 2009 Standard Performance Evaluation Corporation (SPEC) logo is reprinted with permission


Agenda
CPU Utilization Monitoring
Performance Monitoring Units (PMU) in Processors
Offline analysis with PMU: Intel VTune Performance Analyzer
Online Dynamic Processor Monitoring (NEW!)


Operating System CPU Utilization Meter


The best-known meter; it exists on almost every OS
Shows how long the OS was in the idle/sleep loop
Worked well with the CPUs of the 1980s

But OS CPU meters ignore:
memory access stalls
synchronisation/locking
CPU I/O
simultaneous multithreading (SMT), e.g. Intel Hyper-Threading
etc.
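A minimal sketch of how such a meter arrives at its number, assuming the OS exposes cumulative idle and total tick counts per CPU. The `OsCpuTicks` struct and `osUtilization` function are hypothetical names used only for illustration. Note that a core stalled on memory or spinning on a lock is not in the idle loop, so it is reported as fully busy.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of OS accounting for one CPU (illustrative names).
struct OsCpuTicks {
    std::uint64_t idle;   // cumulative ticks spent in the idle/sleep loop
    std::uint64_t total;  // cumulative ticks elapsed
};

// Utilization between two snapshots, in the range [0, 1].
// Everything that is not idle time counts as "busy", including stalls.
inline double osUtilization(const OsCpuTicks& before, const OsCpuTicks& after) {
    assert(after.total >= before.total && after.idle >= before.idle);
    const std::uint64_t total = after.total - before.total;
    const std::uint64_t idle = after.idle - before.idle;
    if (total == 0) return 0.0;  // empty interval: nothing to report
    return 1.0 - static_cast<double>(idle) / static_cast<double>(total);
}
```

For example, 50 idle ticks out of 200 elapsed ticks gives a utilization of 0.75, regardless of how many of the remaining 150 ticks were spent waiting for memory.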

How do I find out what keeps the processor busy? Or is my software just wasting compute cycles?
Existing OS CPU meters cannot predict the capacity of modern hardware

CPU Utilization Meter in Hardware?


Modern CPU systems are very complex and consist of many units/resources that influence computation speed

[Figure: block diagram of the hierarchy SYSTEM, SOCKET (CPU), CORE]


Performance Monitoring Units (PMUs)


Intel processors have Performance Monitoring Units (PMUs) that can be programmed to count many performance-related events
One PMU per logical core (elapsed cycles, L1 and L2 cache events, TLB events, processed instructions; there are hundreds of events)
One PMU in the uncore (L3 cache, memory controller, and Intel QPI events)


Programming PMUs
PMUs are programmed by reading and writing Model Specific Registers (MSRs)
Much of the hardware and many of the events are platform specific
The core PMU is enumerated in CPUID leaf 0xA:
  Number of fully programmable counters (4 per logical core); each counter is assigned to count a certain event
  Number of fixed-function counters (3 per logical core): core clock counter, reference clock counter, instruction counter

Some uncore and core programmable counters can be programmed only with certain types of events
Other tricky restrictions apply; the restrictions are documented in the event list
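The counter counts mentioned for CPUID leaf 0xA can be decoded from the EAX and EDX values that the instruction returns. The sketch below only decodes register values passed in by the caller; it does not execute CPUID itself, so it stays compiler and platform neutral. The field layout follows the architectural performance monitoring description in the Software Developer's Manual; `PmuCapabilities` and `decodeCpuidLeafA` are hypothetical names.

```cpp
#include <cstdint>

// Decoded view of CPUID leaf 0xA (hypothetical struct for illustration).
struct PmuCapabilities {
    unsigned version;                  // EAX[7:0]   architectural PMU version
    unsigned programmableCounters;     // EAX[15:8]  programmable counters per logical core
    unsigned programmableCounterBits;  // EAX[23:16] width of each programmable counter
    unsigned fixedCounters;            // EDX[4:0]   fixed-function counters
    unsigned fixedCounterBits;         // EDX[12:5]  width of each fixed-function counter
};

// Decode the EAX and EDX values returned by CPUID with EAX = 0xA.
inline PmuCapabilities decodeCpuidLeafA(std::uint32_t eax, std::uint32_t edx) {
    PmuCapabilities caps{};
    caps.version = eax & 0xFFu;
    caps.programmableCounters = (eax >> 8) & 0xFFu;
    caps.programmableCounterBits = (eax >> 16) & 0xFFu;
    caps.fixedCounters = edx & 0x1Fu;
    caps.fixedCounterBits = (edx >> 5) & 0xFFu;
    return caps;
}
```

As an illustration with made-up register values (not read from real hardware), `decodeCpuidLeafA(0x07300403, 0x00000603)` yields 4 programmable and 3 fixed-function counters, matching the counts on this slide. Obtaining the real values requires a compiler intrinsic or inline assembly, which is left out here.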


Processor Performance Counters


Publicly documented on intel.com
David Levinthal: Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors
http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2
http://www.intel.com/products/processor/manuals/

Intel Xeon Processor 7500 Series Uncore Programming Guide
http://www.intel.com/Assets/en_US/PDF/designguide/323535.pdf
http://software.intel.com/file/20476

Peggy Irelan and Shihjong Kuo: Performance Monitoring Unit Sharing Guide

Intel Hyper-Threading Technology-specific:


Drysdale, Gillespie, Valles: Performance Insights to Intel Hyper-Threading Technology
http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/

Gillespie, Drysdale: Intel Hyper-Threading Technology: Analysis of the HT Effects on a Server Transactional Workload
http://software.intel.com/en-us/articles/intel-hyper-threadingtechnology-analysis-of-the-ht-effects-on-a-server-transactional-workload/


PMU Sampling Mode: The Statistical Method of Finding Hotspots


A sampling collector (such as VTune Performance Analyzer or Intel Performance Tuning Utility) works as follows:
The PMU periodically interrupts the processor
  Triggered by the occurrence of a certain number of events
The collector records the execution context:
  Execution address in memory (CS:IP)
  Operating system process and thread ID
  Executable module loaded at that address
If you have symbols for the module, post-processing can identify the function or method at the memory address. Line numbers from the symbol file can direct you to the relevant line of source code.
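The post-processing step can be sketched as a lookup of each sampled address in a sorted symbol table, followed by a per-function count. This is an illustration of the statistical idea only, not how VTune is implemented; `Symbol` and `countHotspots` are hypothetical names, and the symbol table is assumed to be sorted by start address.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol-table entry: a function name and its start address.
struct Symbol {
    std::uint64_t start;
    std::string name;
};

// Attribute each sampled instruction pointer to the function whose start
// address is the closest one at or below it, and count samples per function.
// 'symbols' must be sorted by ascending start address.
inline std::map<std::string, std::size_t> countHotspots(
    const std::vector<Symbol>& symbols,
    const std::vector<std::uint64_t>& sampledIps) {
    assert(std::is_sorted(symbols.begin(), symbols.end(),
                          [](const Symbol& a, const Symbol& b) { return a.start < b.start; }));
    std::map<std::string, std::size_t> hits;
    for (std::uint64_t ip : sampledIps) {
        // First symbol that starts strictly above the sampled address.
        auto next = std::upper_bound(
            symbols.begin(), symbols.end(), ip,
            [](std::uint64_t addr, const Symbol& s) { return addr < s.start; });
        if (next == symbols.begin()) {
            ++hits["<unknown>"];  // address lies below every known symbol
        } else {
            ++hits[std::prev(next)->name];
        }
    }
    return hits;
}
```

Functions that collect the most samples are the hotspots for the sampled event. With clock ticks as the event, the counts approximate where time is spent.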


Introducing Intel VTune Performance Analyzer


Helps identify and characterize performance issues by:
Collecting performance data from the system running your application
Organizing and displaying the data in a variety of interactive views, from system-wide down to the source code or processor instruction perspective
Identifying potential performance issues and suggesting improvements
Providing application profiling information
Providing a tuning assistant and a great help system

Besides PMU-based sampling analysis, it can also produce a call graph (not covered here)


Just a few things you can do with processor performance events


Check if your software:
Is NUMA-optimized (local/remote memory accesses)
Is cache-local or not
Is memory bandwidth bound or not
Is branchy or not (branch mispredictions)
Has bad long-latency instructions on the critical path
Has performance bugs in multithreaded programs (false sharing, ...)
Exploits instruction parallelism well or not

See also the article: Using Intel VTune Performance Analyzer to Optimize Software for the Intel Core i7 Processor Family
http://software.intel.com/enus/articles/using-intel-vtune-performance-analyzer-to-optimize-software-for-the-intelr-coretm-i7-processorfamily/


DEMO
Intel VTune Performance Analyzer in action!


Offline Analysis: VTune Performance Analyzer Sampling Collector

Select event: clock ticks, L2/L3 cache misses, branch mispredictions, etc.

Offline Analysis: Intel VTune Performance Analyzer Sampling Collector


Hotspot view of one module for all OS processes and threads grouped by function (or method).


Sampling Source View Displays Source Code Annotated with Performance Data


PMU Counting Mode


No interrupts are generated
The application (periodically) reads the number of occurred events from the PMU counters
Very small overhead
Advanced online use cases become possible (see the next slides)
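In counting mode the application works with differences between two raw counter readings. Hardware counters have a finite width and wrap around, so the subtraction should be done modulo the counter width. The sketch below assumes the width is known (for example from CPUID) and that the counter wrapped at most once between the two readings; `counterDelta` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Number of events between two raw readings of a counter that is 'bits'
// wide and wraps to zero after reaching its maximum value.
// Assumes at most one wrap-around happened between the readings.
inline std::uint64_t counterDelta(std::uint64_t before, std::uint64_t after, unsigned bits) {
    assert(bits >= 1 && bits <= 64);
    const std::uint64_t mask =
        (bits == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << bits) - 1);
    assert(before <= mask && after <= mask);
    // Unsigned subtraction wraps modulo 2^64; masking reduces it modulo 2^bits.
    return (after - before) & mask;
}
```

Reading often enough that the counter cannot wrap more than once keeps the result unambiguous; the slide's point about very small overhead is what makes such frequent reads affordable.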


Online Performance Counter Monitoring: Access Intel CPU Counters* in Your Program
Terminology:
A system consists of several sockets (= CPUs)
A socket has a number of (logical) cores

Usage pattern:
1. Save the counter state for {core, socket, system} into state object 1
2. Run the user code or experiment
3. Save the counter state for {core, socket, system} into state object 2
4. Using state objects 1 and 2, compute performance/utilization metrics
Caution: the OS may schedule different user threads on the same core (context switches)

NEW! Access not only core counters (clock ticks, L2 cache misses, etc.) but also uncore counters (Intel memory controllers, Intel QPI, etc.)*

* Implemented for Intel Core i7, Xeon 5500, 5600 and 7500 Processor Series (based on the microarchitecture codenamed Nehalem/Westmere)

Example C++ code


Monitor * m = Monitor::getInstance();
if (m->good()) m->program(); // program counters

SystemCounterState before_sstate, after_sstate;
before_sstate = getSystemCounterState();

[run your code here]

after_sstate = getSystemCounterState();

cout << "IPC: " << getIPC(before_sstate, after_sstate)
     << " L3 cache hit ratio: " << getL3CacheHitRatio(before_sstate, after_sstate)
     << " Bytes read: " << getBytesReadFromMC(before_sstate, after_sstate)
     << [and so on]
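To show what a metric function could compute from two state objects, here is a self-contained sketch. It does not use the library above: `CounterSnapshot`, `instructionsPerCycle` and `l3HitRatio` are hypothetical stand-ins, and the actual library may define its metrics differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical state object: cumulative event counts captured at one instant.
struct CounterSnapshot {
    std::uint64_t instructionsRetired;
    std::uint64_t coreCycles;
    std::uint64_t l3Hits;
    std::uint64_t l3Misses;
};

// Instructions per cycle over the interval between two snapshots.
inline double instructionsPerCycle(const CounterSnapshot& before, const CounterSnapshot& after) {
    assert(after.coreCycles >= before.coreCycles);
    assert(after.instructionsRetired >= before.instructionsRetired);
    const std::uint64_t cycles = after.coreCycles - before.coreCycles;
    if (cycles == 0) return 0.0;  // nothing ran in the interval
    const std::uint64_t instructions = after.instructionsRetired - before.instructionsRetired;
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Fraction of L3 accesses in the interval that hit in the cache.
inline double l3HitRatio(const CounterSnapshot& before, const CounterSnapshot& after) {
    assert(after.l3Hits >= before.l3Hits && after.l3Misses >= before.l3Misses);
    const std::uint64_t hits = after.l3Hits - before.l3Hits;
    const std::uint64_t misses = after.l3Misses - before.l3Misses;
    if (hits + misses == 0) return 0.0;  // no L3 accesses observed
    return static_cast<double>(hits) / static_cast<double>(hits + misses);
}
```

Because only differences are used, the counters never need to be reset, and several independent measurements can share the same running counters.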


Example 1
Compare traversal/searching in an STL list vs. an STL vector (4-byte records)
C++ code to measure:
std::find( ds.begin(), ds.end(), ds.size());
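A runnable sketch of this experiment is shown below. It assumes the container holds the values 0 to n-1, so that searching for `ds.size()` never succeeds and `std::find` has to walk every element. Wall-clock time from `std::chrono` stands in for the PMU metrics of the original demo, and `TraversalResult` and `timeFullTraversal` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <list>
#include <numeric>
#include <vector>

struct TraversalResult {
    bool reachedEnd;           // true when the searched value was not found
    std::int64_t nanoseconds;  // wall-clock time of the std::find call
};

// Fill a container with 0..n-1, then search for n, which is never present,
// so std::find traverses the whole container.
template <typename Container>
TraversalResult timeFullTraversal(std::size_t n) {
    using Value = typename Container::value_type;
    // The searched value n must be representable in the record type.
    assert(n <= static_cast<std::size_t>(std::numeric_limits<Value>::max()));
    Container ds(n);
    std::iota(ds.begin(), ds.end(), Value{0});
    const auto start = std::chrono::steady_clock::now();
    const auto it = std::find(ds.begin(), ds.end(), static_cast<Value>(ds.size()));
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return TraversalResult{it == ds.end(), static_cast<std::int64_t>(elapsed.count())};
}
```

Calling `timeFullTraversal<std::vector<std::int32_t>>(n)` and `timeFullTraversal<std::list<std::int32_t>>(n)` with the same `n` gives the two timings to compare. Elapsed time alone does not explain a difference; that is where counters such as cache misses come in.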

Get CPU performance insights in real time



Intel Performance Counter Monitor* (Linux*/Windows*)

Easily collect CPU performance data


* The name might change in the future

Linux* KDE* plug-in

Visualize CPU performance in real time



Advanced Examples
NEW!

Software reads data from the PMUs online, while it runs

Self-tuning software!


Example 2: CPU resource-aware scheduling


Problem (a simplified one):
Schedule 1000 compute-intensive and 1000 memory-bandwidth-intensive jobs on a single core
Jobs are equal in size
Unknown background activity exists

Goal: minimize total completion time


CPU Monitoring Unaware Scheduler


[Figure: timeline (axis: time) showing memory-bandwidth-intensive background activity, compute-intensive jobs, and memory-bandwidth-intensive jobs]

CPU Monitoring Aware Scheduler


[Figure: timeline (axis: time) showing memory-bandwidth-intensive background activity, compute-intensive jobs, and memory-bandwidth-intensive jobs]

In an experiment with 2000 jobs we measured 16% faster completion time*


* Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
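One plausible policy for a monitoring-aware scheduler in this simplified problem is sketched below: start a compute-intensive job while the background activity is using a lot of memory bandwidth, and a memory-bandwidth-intensive job otherwise. This is an assumption made for illustration, not the scheduler used in the experiment; `JobKind`, `pickNextJob` and the 0.5 threshold are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

enum class JobKind { Compute, MemoryBandwidth, None };

// Choose which kind of job to start next.
// backgroundMemoryLoad: fraction (0..1) of memory bandwidth currently used by
// other activity, e.g. derived from uncore counters (hypothetical input).
// busyThreshold: load above which memory is treated as contended (assumed value).
inline JobKind pickNextJob(double backgroundMemoryLoad,
                           std::size_t computeJobsLeft,
                           std::size_t memoryJobsLeft,
                           double busyThreshold = 0.5) {
    assert(backgroundMemoryLoad >= 0.0 && backgroundMemoryLoad <= 1.0);
    if (computeJobsLeft == 0 && memoryJobsLeft == 0) return JobKind::None;
    if (computeJobsLeft == 0) return JobKind::MemoryBandwidth;  // only one kind left
    if (memoryJobsLeft == 0) return JobKind::Compute;
    // Avoid competing with the background activity for the same resource.
    return backgroundMemoryLoad >= busyThreshold ? JobKind::Compute
                                                 : JobKind::MemoryBandwidth;
}
```

A monitoring-unaware scheduler has no `backgroundMemoryLoad` input and must pick jobs in a fixed order, which is why it can end up running memory-bound jobs exactly when the memory system is already saturated.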

Advanced Use-Cases I
Extend the problem (to be closer to reality):
Schedule to all Hyper-Threaded cores in the system
The remaining capacities are not known a priori because the exact resource utilization of the jobs is not predictable
Do we have room to put another job on this HT core?
Should it be a compute-intensive or rather a memory-intensive job?

CPU Performance Monitoring can provide more insights and help to answer these questions
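As a sketch of how these questions could be answered from counter-derived metrics, the function below picks the core with the most spare capacity for the resource a job needs, or reports that no core has room. The per-core utilization fractions, the `CoreLoad` and `Demand` types, `pickCore`, and the 0.2 headroom threshold are all assumptions for illustration; how to derive such fractions from raw events is outside this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-core metrics derived from PMU readings, each in 0..1.
struct CoreLoad {
    double executionUse;  // fraction of execution resources in use
    double memoryUse;     // fraction of memory bandwidth in use
};

enum class Demand { Compute, Memory };

// Index of the core with the most spare capacity for the demanded resource,
// or -1 when no core has at least 'minHeadroom' spare.
inline int pickCore(const std::vector<CoreLoad>& cores, Demand demand,
                    double minHeadroom = 0.2) {
    assert(minHeadroom >= 0.0 && minHeadroom <= 1.0);
    int best = -1;
    double bestHeadroom = 0.0;
    for (std::size_t i = 0; i < cores.size(); ++i) {
        const double used =
            (demand == Demand::Compute) ? cores[i].executionUse : cores[i].memoryUse;
        const double headroom = 1.0 - used;
        if (headroom >= minHeadroom && (best < 0 || headroom > bestHeadroom)) {
            best = static_cast<int>(i);
            bestHeadroom = headroom;
        }
    }
    return best;
}
```

A return value of -1 corresponds to "no room for another job of this kind"; the caller can then try the other job kind or wait.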


Advanced Use-Cases II
Depending on the remaining resource capacities, choose the best algorithm to compute the result
  memory-intensive or compute-intensive

Choose between implementations
  single-threaded, multithreaded (all cores), or with limited threading

and so on
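A minimal sketch of such a self-tuning decision, assuming the program has already turned counter readings into free-capacity fractions and an idle-core count. The `Plan` struct, `choosePlan`, and the simple comparison rule are hypothetical and only illustrate the shape of the decision.

```cpp
#include <cassert>

enum class Algorithm { MemoryIntensive, ComputeIntensive };

// Hypothetical execution plan: which algorithm variant and how many threads.
struct Plan {
    Algorithm algorithm;
    unsigned threads;
};

// Pick the variant that stresses the less loaded resource and use as many
// threads as there are idle cores (at least one).
inline Plan choosePlan(double freeComputeFraction,
                       double freeMemoryBandwidthFraction,
                       unsigned idleCores) {
    assert(freeComputeFraction >= 0.0 && freeComputeFraction <= 1.0);
    assert(freeMemoryBandwidthFraction >= 0.0 && freeMemoryBandwidthFraction <= 1.0);
    Plan plan{};
    plan.algorithm = (freeMemoryBandwidthFraction > freeComputeFraction)
                         ? Algorithm::MemoryIntensive
                         : Algorithm::ComputeIntensive;
    plan.threads = (idleCores == 0) ? 1u : idleCores;
    return plan;
}
```

In a real program the plan would be re-evaluated periodically, since the remaining capacities change as other work starts and finishes.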


Conclusions and Takeaways


Current OS CPU utilization meters are not suited for modern hardware
Modern processor PMUs provide metrics that give deep insight into processor performance and resource utilization

Processor performance counters are heavily used in established performance tools like Intel VTune Performance Analyzer
New advanced use cases for PMUs become possible: dynamic online optimization
  A new kind of intelligent, CPU-monitoring-aware software
