M. D. Jones, Ph.D.
Center for Computational Research, University at Buffalo, State University of New York
1 / 108
Performance Fundamentals
Performance Foundations
Three pillars of performance optimization:
- Algorithmic: choose the most effective algorithm that you can for the problem of interest
- Serial Efficiency: optimize the code to run efficiently in a non-parallel environment
- Parallel Efficiency: effectively use multiple processors to achieve a reduction in execution time, or equivalently, to solve a proportionately larger problem
3 / 108
Performance Fundamentals
Algorithmic Efficiency
Choose the best algorithm before you start coding (recall that good planning is an essential part of writing good software):
- Running on a large number of processors? Choose an algorithm that scales well with increasing processor count
- Running a large system (mesh points, particle count, etc.)? Choose an algorithm that scales well with system size
- If you are going to run on a massively parallel machine, plan from the beginning on how you intend to decompose the problem (it may save you a lot of time later)
4 / 108
Performance Baseline
Serial Efficiency
Getting efficient code in parallel is made much more difficult if you have not optimized the sequential code, and in fact can lead to a misleading picture of parallel performance. Recall that our definition of parallel speedup, S(N_p) = t_s / t_p(N_p), involves the time t_s for an optimal sequential implementation (not just t_p(1)!).
5 / 108
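A quick illustration of why the serial baseline matters (the numbers here are invented for illustration, they are not taken from the slides): suppose an unoptimized serial code runs in 200 s, a tuned serial version of the same code runs in 100 s, and the parallel version runs in 25 s on 8 processors. Then

  S(8) = t_s / t_p(8) = 100 / 25 = 4

rather than the 200 / 25 = 8 that a comparison against the unoptimized serial run would (misleadingly) suggest.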
Performance Baseline
Steps to establishing a baseline for your own performance expectations:
- Choose a representative problem (or better still a suite of problems) that can be run under identical circumstances on a multitude of platforms/compilers (requires portable code!)
- How fast is fast? You can utilize hardware performance counters to measure actual code performance
- Profile, profile, and then profile some more ... to find bottlenecks and spend your time more effectively in optimizing code
6 / 108
Pitfalls when measuring the performance of parallel codes: For many, speedup or linear scalability is the ultimate goal. This goal is incomplete - a terribly inefficient code can scale well, but actually deliver poor efficiency. For example, consider a simple Monte Carlo based code that uses the most rudimentary uniform sampling (i.e. no importance sampling) - this can be made to scale perfectly in parallel, but the algorithmic efficiency (measured perhaps by the delivered variance per cpu-hour) is quite low.
7 / 108
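To make the variance-per-cpu-hour point concrete (a rough illustration with invented numbers, not from the slides): the statistical error of a Monte Carlo estimate falls as σ/sqrt(N), so reaching a target error ε costs

  N = (σ / ε)^2  samples,  i.e. cpu-hours proportional to σ^2 at a fixed target error.

If importance sampling reduces σ^2 by a factor of 100 relative to uniform sampling, the uniform-sampling code needs 100 times as many samples; even with perfect parallel scaling on 100 processors its time to a given error only matches a single-processor run of the importance-sampled code.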
Timers
time Command
Note that this is not the time built-in function in many shells (bash and tcsh included), but instead the one located in /usr/bin. This command is quite useful for getting an overall picture of code performance. The default output format:
%Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
%Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps
9 / 108
Timers
time Example
[bono:~/d_laplace]$ /usr/bin/time ./laplace_s
Max value in sol: 0.999992327961218
Min value in sol: 8.742278000372475E-008
75.82user 0.00system 1:17.72elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+913minor)pagefaults 0swaps
[bono:~/d_laplace]$ /usr/bin/time -p ./laplace_s
Max value in sol: 0.999992327961218
Min value in sol: 8.742278000372475E-008
real 75.73
user 74.68
sys 0.00
10 / 108
Timers
OSC's mpiexec will report accurate timings in the (optional) email report, as it does not rely on rsh/ssh to launch tasks (but Intel MPI does, so in that case you will see the result of timing the mpiexec shell script, not the MPI code).
11 / 108
Generally I prefer the MPI and OpenMP timing calls whenever I can use them (the MPI and OpenMP specifications call for their intrinsic timers to be high precision).
12 / 108
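As a minimal sketch of what using those intrinsic timers looks like (my own example following the standard MPI/OpenMP APIs, not code from the slides; compile with an MPI wrapper compiler and OpenMP enabled):

/* Time a code section with the MPI and OpenMP intrinsic wall-clock timers,
 * both of which return seconds as a double.                               */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* hypothetical stand-in for the code section to be timed */
static void work(void)
{
    volatile double s = 0.0;
    long i;
    for (i = 1; i < 50000000L; i++) s += 1.0 / i;
}

int main(int argc, char **argv)
{
    double t0, t1, w0, w1;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();              /* MPI wall-clock timer */
    work();
    t1 = MPI_Wtime();
    printf("MPI_Wtime: %f s elapsed (resolution %g s)\n", t1 - t0, MPI_Wtick());

    w0 = omp_get_wtime();          /* OpenMP wall-clock timer */
    work();
    w1 = omp_get_wtime();
    printf("omp_get_wtime: %f s elapsed\n", w1 - w0);

    MPI_Finalize();
    return 0;
}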
More information on code section timers (and code for doing so): LLNL Performance Tools:
https://computing.llnl.gov/tutorials/performance_tools/#gettimeofday
Stopwatch (nice F90 module, but you need to supply a low-level function for accessing a timer): http://math.nist.gov/StopWatch
13 / 108
gprof
14 / 108
gprof
gprof Shortcomings
Shortcomings of gprof (which apply also to any statistical profiling tool):
- Need to recompile to instrument the code
- Instrumentation can affect the statistics in the profile
- Overhead can significantly increase the running time
- Compiler optimization can be affected by instrumentation
15 / 108
gprof
- Flat Profile: shows how much time your program spent in each function, and how many times that function was called
- Call Graph: for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function
- Basic-block: requires compilation with the -a flag (supported only by GNU?) - enables gprof to construct an annotated source code listing showing how many times each line of code was executed
16 / 108
gprof
gprof example
g77 -I. -O3 -ffast-math -g -pg -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o
[jonesm@bono ~/d_bench]$ qsub -q debug -l nodes=1:ppn=2,walltime=00:30:00 -I
[jonesm@c16n30 ~/d_bench]$ ./rp
Enter runid: (<=9 chars) short
...
... skip copious amount of standard output ...
...
[jonesm@bono ~/d_bench]$ gprof rp gmon.out >& out.gprof
[jonesm@bono ~/d_bench]$ less out.gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self
 time    seconds  seconds     calls
89.74     123.23   123.23    204008
 6.96     132.79     9.56         1
 1.18     134.41     1.62    200004
 1.05     135.86     1.44    204002
 0.71     136.83     0.97  14790551
 0.27     137.20     0.37    204008
...
17 / 108
gprof
[jonesm@bono ~/d_bench]$ gprof -line rp gmon.out >& out.gprof.line
[jonesm@bono ~/d_bench]$ less out.gprof.line
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time    seconds  seconds   calls  ns/call  ns/call  name
17.45      23.96    23.96                            triwfns_ (adwfns.
14.46      43.82    19.86                            triwfns_ (adwfns.
12.87      61.50    17.68                            triwfns_ (adwfns.
12.31      78.41    16.91                            triwfns_ (adwfns.
 0.67      79.33     0.92                            triwfns_ (adwfns.
 0.59      80.14     0.82                            MAIN__ (cc4WTuQH.
 0.51      80.84     0.70                            MAIN__ (cc4WTuQH.
18 / 108
gprof
19 / 108
pgprof
- PGI tools also have profiling capabilities (c.f. man pgf95)
- Graphical profiler, pgprof
20 / 108
pgprof
pgprof example
pgf77 -tp p7-64 -fastsse -g77libs -Mprof=lines -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o
[jonesm@bono ~/d_bench]$ ./rp
Enter runid: (<=9 chars) short
...
... skip copious amount of standard output ...
...
[jonesm@bono ~/d_bench]$ pgprof -exe ./rp pgprof.out
21 / 108
pgprof
22 / 108
pgprof
N.B. you can also use the -text option to pgprof to make it behave more like gprof. See the PGI Tools Guide for more information (there should be a PDF copy in $PGI/doc).
23 / 108
http://mpip.sourceforge.net
- Not a tracing tool, but a lightweight interface to accumulate statistics using the MPI profiling interface
- Quite useful in conjunction with a tracefile analysis (e.g. using jumpshot)
- Installed on CCR systems - see module avail for mpiP availability and location
25 / 108
mpiP Compilation
To use mpiP you need to:
- Add a -g flag to add symbols (this will allow mpiP to access the source code symbols and line numbers)
- Link in the necessary mpiP profiling library and the binary utility libraries for actually decoding symbols (there is a trick that you can use most of the time to avoid having to link with mpiP, though)
Compilation examples (from U2) follow ...
26 / 108
27 / 108
From an older version of mpiP, but still almost entirely the same - this one links directly with mpiP; first compile:
mpicc -g -o atlas aij2_basis.o analyze.o atlas.o barrier.o byteflip.o \
  chordgn2.o cstrings.o io2.o map.o mutils.o numrec.o paramods.o proj.o \
  projAtlas.o sym2.o util.o -lm -L/Projects/CCR/jonesm/mpiP-2.8.2/gnu/ch_gm/lib \
  -lmpiP -lbfd -liberty -lm
then run (in this case using 16 processors) and examine the output file:
[jonesm@joplin d_derenzo]$ ls *.mpiP
atlas.gcc3mpipapimpiP.16.20578.1.mpiP
[jonesm@joplin d_derenzo]$ less atlas.gcc3mpipapimpiP.16.20578.1.mpiP
:
:
28 / 108
@ mpiP
@ Command : ./atlas.gcc3mpipapimpiP study.ini 0
@ Version : 2.8.2
@ MPIP Build date : Jun 29 2005, 14:53:41
@ Start time : 2005 06 29 15:18:52
@ Stop time : 2005 06 29 15:28:34
@ Timer Used : gettimeofday
@ MPIP env var : [null]
@ Collector Rank : 0
@ Collector PID : 20578
@ Final Output Dir : .
@ MPI Task Assignment : 0 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 1 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 2 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 3 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 4 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 5 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 6 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 7 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 8 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 9 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 10 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 11 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 12 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 13 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 14 bb18n10.ccr.buffalo.edu
@ MPI Task Assignment : 15 bb18n10.ccr.buffalo.edu
29 / 108
---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        582       44.7     7.69
   1        579       41.9     7.24
   2        579       40.7     7.03
   3        579       36.9     6.37
   4        579       22.3     3.84
   5        579       16.6     2.87
   6        579         32     5.53
   7        579       35.9     6.20
   8        579       28.6     4.93
   9        579       25.9     4.48
  10        579       39.2     6.76
  11        579       33.8     5.84
  12        579       35.3     6.10
  13        579         41     7.07
  14        579       29.9     5.16
  15        579       41.4     7.16
   *   9.27e+03        546     5.89
30 / 108
---------------------------------------------------------------------------
@--- Callsites: 13 --------------------------------------------------------
---------------------------------------------------------------------------
 ID Lev File/Address      Line Parent_Funct          MPI_Call
  1   0 util.c             833 gsync                 Barrier
  2   0 atlas.c           1531 readProjData          Allreduce
  3   0 projAtlas.c        745 backProjAtlas         Allreduce
  4   0 atlas.c           1545 readProjData          Allreduce
  5   0 atlas.c           1525 readProjData          Allreduce
  6   0 atlas.c           1541 readProjData          Allreduce
  7   0 atlas.c           1589 readProjData          Allreduce
  8   0 atlas.c           1519 readProjData          Allreduce
  9   0 util.c             789 mygcast               Bcast
 10   0 projAtlas.c       1100 computeLoglikeAtlas   Allreduce
 11   0 atlas.c           1514 readProjData          Allreduce
 12   0 atlas.c           1537 readProjData          Allreduce
 13   0 projAtlas.c        425 fwdBackProjAtlas2     Allreduce
31 / 108
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call        Site      Time   App%   MPI%    COV
Allreduce     13  3.09e+05   3.33  56.50   0.46
Barrier        1  2.13e+05   2.30  38.97   0.35
Bcast          9  1.69e+04   0.18   3.10   0.37
Allreduce      3  7.78e+03   0.08   1.42   0.11
Allreduce     10      62.7   0.00   0.01   0.20
Allreduce     11      2.42   0.00   0.00   0.09
Allreduce      7      2.17   0.00   0.00   0.26
Allreduce     12      1.15   0.00   0.00   0.20
Allreduce      6      1.14   0.00   0.00   0.19
Allreduce      5      1.13   0.00   0.00   0.15
Allreduce      8      1.12   0.00   0.00   0.18
Allreduce      2       1.1   0.00   0.00   0.13
Allreduce      4       1.1   0.00   0.00   0.12
32 / 108
---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call        Site    Count      Total      Avrg  Sent%
Allreduce     13    65536   2.28e+09  3.48e+04  83.69
Allreduce      3     8192   2.85e+08  3.48e+04  10.46
Bcast          9   490784   1.59e+08       325   5.85
Allreduce     11       16   2.07e+04   1.3e+03   0.00
Allreduce     10      512    4.1e+03         8   0.00
Allreduce      7       16        256        16   0.00
Allreduce      2       16         64         4   0.00
Allreduce      6       16         64         4   0.00
Allreduce      5       16         64         4   0.00
Allreduce      4       16         64         4   0.00
Allreduce      8       16         64         4   0.00
Allreduce     12       16         64         4   0.00
...
33 / 108
Now let's examine an example using mpiP at runtime. This example solves a simple Laplace equation with Dirichlet boundary conditions using finite differences.
34 / 108
#PBS -S /bin/bash #PBS -q debug #PBS -l walltime=00:20:00 #PBS -l nodes=2:MEM24GB:ppn=8 #PBS -M jonesm@ccr.buffalo.edu #PBS -m e #PBS -N test #PBS -o subMPIP.out #PBS -j oe module load intel module load intel-mpi module load mpip module list cd $PBS_O_WORKDIR which mpiexec NNODES=`cat $PBS_NODEFILE | uniq | wc -l` NPROCS=`cat $PBS_NODEFILE | wc -l` export I_MPI_DEBUG=5 # Use LD_PRELOAD trick to load mpiP wrappers at runtime export LD_PRELOAD=$MPIPDIR/lib/libmpiP.so mpdboot -n $NNODES -f $PBS_NODEFILE -v mpiexec -np $NPROCS -envall ./laplace_mpi <<EOF 2000 EOF mpdallexit
35 / 108
... and then run it and examine the resulting mpiP output file:
[k07n14:~/d_laplace/d_mpip]$ ls -l laplace_mpi.16.8618.1.mpiP
-rw-r--r-- 1 jonesm ccrstaff 17919 Oct 14 16:28 laplace_mpi.16.8618.1.mpiP
36 / 108
@ mpiP
@ Command : ./laplace_mpi
@ Version : 3.3.0
@ MPIP Build date : Oct 14 2011, 16:16:34
@ Start time : 2011 10 14 16:26:15
@ Stop time : 2011 10 14 16:28:11
@ Timer Used : PMPI_Wtime
@ MPIP env var : [null]
@ Collector Rank : 0
@ Collector PID : 8618
@ Final Output Dir : .
@ Report generation : Single collector task
@ MPI Task Assignment : 0 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 1 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 2 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 3 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 4 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 5 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 6 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 7 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 8 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 9 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 10 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 11 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 12 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 13 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 14 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 15 d15n23.ccr.buffalo.edu
37 / 108
---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        116       16.9    14.57
   1        115       16.4    14.18
   2        115       16.8    14.53
   3        115       17.7    15.34
   4        115       16.6    14.39
   5        115       14.3    12.37
   6        115       13.7    11.90
   7        115       11.1     9.65
   8        115       12.5    10.87
   9        115       13.8    12.00
  10        115       14.5    12.60
  11        115       16.9    14.67
  12        115       17.5    15.15
  13        115       19.4    16.82
  14        115       16.4    14.22
  15        115       17.5    15.13
   *   1.85e+03        252    13.65
38 / 108
---------------------------------------------------------------------------
@--- Callsites: 6 ---------------------------------------------------------
---------------------------------------------------------------------------
 ID Lev File/Address      Line Parent_Funct            MPI_Call
  1   0 laplace_mpi.f90    118 MAIN__                  Allreduce
  2   0 laplace_mpi.f90    143 MAIN__                  Recv
  3   0 laplace_mpi.f90     48 __paramod_MOD_xchange   Sendrecv
  4   0 laplace_mpi.f90     80 MAIN__                  Bcast
  5   0 laplace_mpi.f90     46 __paramod_MOD_xchange   Sendrecv
  6   0 laplace_mpi.f90    138 MAIN__                  Send
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call        Site      Time   App%   MPI%    COV
Allreduce      1  2.34e+05  12.69  92.94   0.15
Sendrecv       3  8.88e+03   0.48   3.52   0.19
Sendrecv       5  8.47e+03   0.46   3.36   0.16
Send           6       321   0.02   0.13   0.51
Bcast          4      97.7   0.01   0.04   0.26
Recv           2      14.4   0.00   0.01   0.00
---------------------------------------------------------------------------
39 / 108
---------------------------------------------------------------------------
@--- Callsite Time statistics (all, milliseconds): 80 ---------------------
---------------------------------------------------------------------------
Name       Site Rank    Count     Max    Mean     Min   App%   MPI%
Allreduce     1    0    23845    28.5   0.666   0.036  13.69  94.00
Allreduce     1    1    23845    28.6   0.644   0.035  13.30  93.79
Allreduce     1    2    23845    28.6   0.661   0.037  13.66  93.98
Allreduce     1    3    23845    28.6   0.695   0.038  14.36  93.63
Allreduce     1    4    23845    29.2   0.651   0.038  13.46  93.54
Allreduce     1    5    23845    29.2   0.555   0.035  11.46  92.71
Allreduce     1    6    23845    29     0.528   0.037  10.91  91.64
Allreduce     1    7    23845    26.3   0.405   0.038   8.37  86.74
Allreduce     1    8    23845    26.8   0.468   0.034   9.67  88.92
Allreduce     1    9    23845    28.4   0.538   0.033  11.11  92.59
Allreduce     1   10    23845    29.7   0.566   0.033  11.70  92.90
Allreduce     1   11    23845    29.7   0.661   0.033  13.65  93.10
Allreduce     1   12    23845    28.6   0.686   0.039  14.17  93.49
Allreduce     1   13    23845    28.8   0.768   0.041  15.87  94.35
Allreduce     1   14    23845    28.6   0.641   0.036  13.25  93.19
Allreduce     1   15    23845    28.7   0.694   0.038  14.33  94.76
Allreduce     1    *   381520    29.7   0.614   0.033  12.69  92.94
40 / 108
---------------------------------------------------------------------------
@--- Callsite Message Sent statistics (all, sent bytes) -------------------
---------------------------------------------------------------------------
Name       Site Rank    Count   Max  Mean   Min        Sum
Allreduce     1    0    23845     8     8     8  1.908e+05
Allreduce     1    1    23845     8     8     8  1.908e+05
Allreduce     1    2    23845     8     8     8  1.908e+05
Allreduce     1    3    23845     8     8     8  1.908e+05
Allreduce     1    4    23845     8     8     8  1.908e+05
Allreduce     1    5    23845     8     8     8  1.908e+05
Allreduce     1    6    23845     8     8     8  1.908e+05
Allreduce     1    7    23845     8     8     8  1.908e+05
Allreduce     1    8    23845     8     8     8  1.908e+05
Allreduce     1    9    23845     8     8     8  1.908e+05
Allreduce     1   10    23845     8     8     8  1.908e+05
Allreduce     1   11    23845     8     8     8  1.908e+05
Allreduce     1   12    23845     8     8     8  1.908e+05
Allreduce     1   13    23845     8     8     8  1.908e+05
Allreduce     1   14    23845     8     8     8  1.908e+05
Allreduce     1   15    23845     8     8     8  1.908e+05
Allreduce     1    *   381520     8     8     8  3.052e+06
41 / 108
A commercial product for performing MPI trace analysis that has enjoyed a long history is Vampir/Vampirtrace, originally developed and sold by Pallas GmbH. Now owned by Intel and available as the Intel Trace Analyzer and Collector. We have a license on U2 if someone wants to give it a try. Note that Vampir/Vampirtrace has since been reborn as an entirely new product.
42 / 108
ITAC Example
Note that you do not have to recompile your application to use ITAC (unless you are building it statically); you can just build it as usual, using Intel MPI:
[bono:~/d_laplace/d_itac]$ module load intel-mpi
[bono:~/d_laplace/d_itac]$ make laplace_mpi
mpiifort -ipo -O3 -Vaxlib -g -c laplace_mpi.f90
mpiifort -ipo -O3 -Vaxlib -g -o laplace_mpi laplace_mpi.o
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_ifortjOzYAn.o
[bono:~/d_laplace/d_itac]$
43 / 108
[bono:~/d_laplace]$ cat subICT
#PBS -S /bin/bash
#PBS -q debug
#PBS -l walltime=00:20:00
#PBS -l nodes=1:ppn=8
#PBS -M jonesm@ccr.buffalo.edu
#PBS -m e
#PBS -N ITAC
#PBS -o subITAC.out
#PBS -j oe
. $MODULESHOME/init/bash
module load intel-mpi
module list
cd $PBS_O_WORKDIR
which mpiexec
NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
NPROCS=`cat $PBS_NODEFILE | wc -l`
UNIQ_HOSTS=tmp.hosts
cat $PBS_NODEFILE | uniq > $UNIQ_HOSTS
export I_MPI_DEBUG=5
mpdboot -n $NNODES -f "$UNIQ_HOSTS" -v
mpiexec -trace -np $NPROCS ./laplace_mpi <<EOF
2000
EOF
mpdallexit
[ -e "$UNIQ_HOSTS" ] && \
rm "$UNIQ_HOSTS"
44 / 108
Running the preceding batch job on U2 produces a bunch (many!) of profiling output files, the most important of which is the one named after your binary with a .stf suffix, in this case laplace_mpi.stf. We feed it to the Intel Trace Analyzer using the traceanalyzer command ... and we should see a profile that looks very much like what you can see using jumpshot.
45 / 108
46 / 108
Generally a good idea to refer to the documentation for the same version that you are using (you can check with module show intel-mpi).
47 / 108
Introduction to PAPI
Introduction
Performance Application Programming Interface
- Implement a portable(!) and efficient API to access existing hardware performance counters
- Ease the optimization of code by providing base infrastructure for cross-platform tools
49 / 108
Introduction to PAPI
Pre-PAPI
Before PAPI came along, there were hardware performance counters, of course - but access to them was limited to proprietary tools and APIs. Some examples were SGI's perfex and Cray's hpm. Now, as long as PAPI has been ported to a particular hardware substrate, the end-programmer (or tool developer) can just use the PAPI interface.
50 / 108
Introduction to PAPI
PAPI Schematic
51 / 108
Introduction to PAPI
- Headed for inclusion in the mainstream Linux kernel (was a custom patch applied to CCR systems prior to Linux kernel 2.6.32); low overhead
- IA64 - uses PFM (perfmon), developed by HP and included in the Linux kernel (for x86_64):
  - Full use of available IA64 monitoring capabilities
  - Quite a bit slower than perfctr, at least according to the PAPI developers
  - http://www.hpl.hp.com/research/linux/perfmon
  - libpfm lives on using perf events, but perfmon apparently ceased development for Linux as of kernel 2.6.30 or so
- "Perf Events" added to the Linux kernel in 2.6.31, replacing both of the above, c.f.:
  http://web.eecs.utk.edu/~vweaver1/projects/perf-events/
52 / 108
Reminder: U2 Hardware
Reconsider the block diagram for the (older nodes) Intel architecture in CCR's U2 cluster:
Look familiar?
53 / 108
Reminder: U2 Hardware
Consider the block diagram for the Intel architecture in CCR's U2 cluster:
54 / 108
Reminder: U2 Hardware
[Figure: block diagram of the Intel Nehalem microarchitecture, showing the instruction cache and TLBs, decoders, decoded instruction queue, register allocation tables, reorder buffer, execution units, and uncore]
55 / 108
Reminder: U2 Hardware
Characteristics of the Irwindale Xeons that form (part of) CCR's U2 cluster:

Clock Cycle              3.2 GHz
TPP                      6.4 GFlop/s
Pipeline                 31 stages
L2 Cache Size            2 MByte
L2 Bandwidth             102.4 GByte/s
CPU-Memory Bandwidth     6.4 GByte/s (shared)
56 / 108
Reminder: U2 Hardware
Consider the penalties in lost computation for not reusing data in the various caches (using U2's Intel Irwindale Xeon processors):

Memory          Miss Penalty (cycles)
L1 Cache                  3
L2 Cache                 28
Main                    400
http://www.bitmover.com/lmbench
57 / 108
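To translate those penalties into lost floating-point work (a back-of-the-envelope estimate of my own, not from the slides): at 3.2 GHz a 400-cycle miss to main memory costs

  400 cycles / 3.2e9 cycles/s = 125 ns,

and at the 6.4 GFlop/s theoretical peak quoted above that is 6.4e9 Flop/s x 125e-9 s = 800 floating-point operations that could have been issued in the time of a single miss.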
Reminder: U2 Hardware
Westmere Xeons
Characteristics of the Westmere E5645 Xeons that form (part of) CCR's U2 cluster:

Clock Cycle              2.4 GHz
TPP                      9.6 GFlop/s (per core)
Pipeline                 14 stages
L2 Cache Size            256 kByte
L3 Cache Size            12 MByte
CPU-Memory Bandwidth     32 GByte/s (nonuniform!)
58 / 108
Reminder: U2 Hardware
Consider the penalties in lost computation for not reusing data in the various caches (using U2's Intel Westmere Xeon E5645 processors):

Memory          Miss Penalty (cycles)
L1 Cache                  4
L2 Cache                 15
Main                    110
http://www.bitmover.com/lmbench
59 / 108
- Cycle count
- Instruction count (including integer, floating point, load/store)
- Branches (including taken/not taken, mispredictions)
- Pipeline stalls (due to memory, resource conflicts)
- Cache (misses for different levels, invalidation)
- TLB (misses, invalidation)
60 / 108
High-level PAPI
- Intended for coarse-grained measurements
- Requires little (or no) setup code
- Allows only PAPI preset events
- Allows only aggregate counting (no statistical profiling)
61 / 108
Low-level PAPI
- More efficient (and functional) than the high-level API
- About 60 functions
- Thread-safe
- Supports presets and native events
62 / 108
preset or pre-defined events are those which have been considered useful by the PAPI community and developers:
http://icl.cs.utk.edu/projects/papi/presets.html
native events are those countable by the CPU's hardware. These events are highly platform specific, and you would need to consult the processor architecture manuals for the relevant native event lists
63 / 108
- Hardware counter multiplexing (time sharing hardware counters to allow more events to be monitored than can be conventionally supported)
- Processor information
- Address space information
- Memory information (static and dynamic)
- Timing functions
- Hardware event inquiry
- ... and many more
64 / 108
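As a small illustration of the processor-information side of that list (a sketch of my own using the PAPI_get_hardware_info call; the struct field names assume the PAPI 4.x header, not something shown on the slides):

/* Query basic processor information through the low-level API. */
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    const PAPI_hw_info_t *hw;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        printf("PAPI_library_init failed\n");
        exit(1);
    }
    if ((hw = PAPI_get_hardware_info()) == NULL) {
        printf("PAPI_get_hardware_info failed\n");
        exit(1);
    }
    printf("%s %s: %d total CPUs at %f MHz\n",
           hw->vendor_string, hw->model_string, hw->totalcpus, hw->mhz);
    printf("%d hardware counters available\n", PAPI_num_counters());
    return 0;
}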
For more on PAPI, including source code, documentation, presentations, and links to third-party tools that utilize PAPI, see http://icl.cs.utk.edu/projects/papi
65 / 108
Consider a simple example code to measure Flop/s using the high-level PAPI API:
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main()
{
  float real_time, proc_time, mflops;
  long_long flpops;
  float ireal_time, iproc_time, imflops;
  long_long iflpops;
  int retval;
66 / 108
  if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops)) < PAPI_OK) {
    printf("Could not initialise PAPI_flops\n");
    printf("Your platform may not support floating point operation event.\n");
    printf("retval: %d\n", retval);
    exit(1);
  }

  your_slow_code();

  if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops)) < PAPI_OK) {
    printf("retval: %d\n", retval);
    exit(1);
  }

  printf("Real_time: %f Proc_time: %f Total flpops: %lld MFLOPS: %f\n",
         real_time, proc_time, flpops, mflops);
  exit(0);
}

int your_slow_code()
{
  int i;
  double tmp = 1.1;

  for (i = 1; i < 2000; i++) {
    tmp = (tmp + 100) / i;
  }
  return 0;
}
67 / 108
N.B., PAPI needs to support the underlying hardware, and this version does not support the 32-core Intel nodes (including the front end).
68 / 108
N.B., The Altix is dead, but this is a good example of the cross-platform portability of PAPI accessing the hardware performance counters.
69 / 108
papi_avail Command
You can use papi_avail to check event availability (different CPUs support various events):
[k16n01b:~/d_papi]$ papi_avail Available events and hardware information. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PAPI Version : 4.1.4.0 Vendor string and code : GenuineIntel (1) Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44) CPU Revision : 2.000000 CPUID Info : Family: 6 Model: 44 Stepping: 2 CPU Megahertz : 2400.391113 CPU Clock Megahertz : 2400 Hdw Threads per core : 1 Cores per Socket : 6 NUMA Nodes : 2 CPU's per Node : 6 Total CPU's : 12 Number Hardware Counters : 16 Max Multiplex Counters : 512 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - The following correspond to fields in the PAPI_event_info_t structure. Name PAPI_L1_DCM PAPI_L1_ICM Code Avail Deriv Description (Note) 0x80000000 Yes No Level 1 data cache misses 0x80000001 Yes No Level 1 instruction cache misses
70 / 108
Performance API (PAPI) PAPI_L2_DCM PAPI_L2_ICM PAPI_L3_DCM PAPI_L3_ICM PAPI_L1_TCM PAPI_L2_TCM PAPI_L3_TCM PAPI_CA_SNP PAPI_CA_SHR PAPI_CA_CLN PAPI_CA_INV PAPI_CA_ITV PAPI_L3_LDM PAPI_L3_STM PAPI_BRU_IDL PAPI_FXU_IDL PAPI_FPU_IDL PAPI_LSU_IDL PAPI_TLB_DM PAPI_TLB_IM PAPI_TLB_TL PAPI_L1_LDM PAPI_L1_STM PAPI_L2_LDM PAPI_L2_STM PAPI_BTAC_M PAPI_PRF_DM PAPI_L3_DCH PAPI_TLB_SD PAPI_CSR_FAL PAPI_CSR_SUC PAPI_CSR_TOT 0x80000002 0x80000003 0x80000004 0x80000005 0x80000006 0x80000007 0x80000008 0x80000009 0x8000000a 0x8000000b 0x8000000c 0x8000000d 0x8000000e 0x8000000f 0x80000010 0x80000011 0x80000012 0x80000013 0x80000014 0x80000015 0x80000016 0x80000017 0x80000018 0x80000019 0x8000001a 0x8000001b 0x8000001c 0x8000001d 0x8000001e 0x8000001f 0x80000020 0x80000021 Yes Yes No No Yes Yes Yes No No No No No Yes No No No No No Yes Yes Yes Yes Yes Yes Yes No No No No No No No Yes No No No Yes No No No No No No No No No No No No No No No Yes No No No No No No No No No No No
Level 2 data cache misses Level 2 instruction cache misses Level 3 data cache misses Level 3 instruction cache misses Level 1 cache misses Level 2 cache misses Level 3 cache misses Requests for a snoop Requests for exclusive access to shared cache line Requests for exclusive access to clean cache line Requests for cache line invalidation Requests for cache line intervention Level 3 load misses Level 3 store misses Cycles branch units are idle Cycles integer units are idle Cycles floating point units are idle Cycles load/store units are idle Data translation lookaside buffer misses Instruction translation lookaside buffer misses Total translation lookaside buffer misses Level 1 load misses Level 1 store misses Level 2 load misses Level 2 store misses Branch target address cache misses Data prefetch cache misses Level 3 data cache hits Translation lookaside buffer shootdowns Failed store conditional instructions Successful store conditional instructions Total store conditional instructions
71 / 108
Performance API (PAPI) PAPI_MEM_SCY PAPI_MEM_RCY PAPI_MEM_WCY PAPI_STL_ICY PAPI_FUL_ICY PAPI_STL_CCY PAPI_FUL_CCY PAPI_HW_INT PAPI_BR_UCN PAPI_BR_CN PAPI_BR_TKN PAPI_BR_NTK PAPI_BR_MSP PAPI_BR_PRC PAPI_FMA_INS PAPI_TOT_IIS PAPI_TOT_INS PAPI_INT_INS PAPI_FP_INS PAPI_LD_INS PAPI_SR_INS PAPI_BR_INS PAPI_VEC_INS PAPI_RES_STL PAPI_FP_STAL PAPI_TOT_CYC PAPI_LST_INS PAPI_SYC_INS PAPI_L1_DCH PAPI_L2_DCH PAPI_L1_DCA PAPI_L2_DCA PAPI_L3_DCA 0x80000022 0x80000023 0x80000024 0x80000025 0x80000026 0x80000027 0x80000028 0x80000029 0x8000002a 0x8000002b 0x8000002c 0x8000002d 0x8000002e 0x8000002f 0x80000030 0x80000031 0x80000032 0x80000033 0x80000034 0x80000035 0x80000036 0x80000037 0x80000038 0x80000039 0x8000003a 0x8000003b 0x8000003c 0x8000003d 0x8000003e 0x8000003f 0x80000040 0x80000041 0x80000042 No No No No No No No No Yes Yes Yes Yes Yes Yes No Yes Yes No Yes Yes Yes Yes No Yes No Yes Yes No No Yes No Yes Yes No No No No No No No No No No No Yes No Yes No No No No No No No No No No No No Yes No No Yes No No Yes
Cycles Stalled Waiting for memory accesses Cycles Stalled Waiting for memory Reads Cycles Stalled Waiting for memory writes Cycles with no instruction issue Cycles with maximum instruction issue Cycles with no instructions completed Cycles with maximum instructions completed Hardware interrupts Unconditional branch instructions Conditional branch instructions Conditional branch instructions taken Conditional branch instructions not taken Conditional branch instructions mispredicted Conditional branch instructions correctly predicted FMA instructions completed Instructions issued Instructions completed Integer instructions Floating point instructions Load instructions Store instructions Branch instructions Vector/SIMD instructions (could include integer) Cycles stalled on any resource Cycles the FP unit(s) are stalled Total cycles Load/store instructions completed Synchronization instructions completed Level 1 data cache hits Level 2 data cache hits Level 1 data cache accesses Level 2 data cache accesses Level 3 data cache accesses
72 / 108
PAPI_L1_DCR PAPI_L2_DCR PAPI_L3_DCR PAPI_L1_DCW PAPI_L2_DCW PAPI_L3_DCW PAPI_L1_ICH PAPI_L2_ICH PAPI_L3_ICH PAPI_L1_ICA PAPI_L2_ICA PAPI_L3_ICA PAPI_L1_ICR PAPI_L2_ICR PAPI_L3_ICR PAPI_L1_ICW PAPI_L2_ICW PAPI_L3_ICW PAPI_L1_TCH PAPI_L2_TCH PAPI_L3_TCH PAPI_L1_TCA PAPI_L2_TCA PAPI_L3_TCA PAPI_L1_TCR PAPI_L2_TCR PAPI_L3_TCR PAPI_L1_TCW PAPI_L2_TCW PAPI_L3_TCW
0x80000043 0x80000044 0x80000045 0x80000046 0x80000047 0x80000048 0x80000049 0x8000004a 0x8000004b 0x8000004c 0x8000004d 0x8000004e 0x8000004f 0x80000050 0x80000051 0x80000052 0x80000053 0x80000054 0x80000055 0x80000056 0x80000057 0x80000058 0x80000059 0x8000005a 0x8000005b 0x8000005c 0x8000005d 0x8000005e 0x8000005f 0x80000060
No Yes Yes No Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes No No No No Yes No No Yes Yes No Yes Yes No Yes Yes
Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level
1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
data cache reads data cache reads data cache reads data cache writes data cache writes data cache writes instruction cache hits instruction cache hits instruction cache hits instruction cache accesses instruction cache accesses instruction cache accesses instruction cache reads instruction cache reads instruction cache reads instruction cache writes instruction cache writes instruction cache writes total cache hits total cache hits total cache hits total cache accesses total cache accesses total cache accesses total cache reads total cache reads total cache reads total cache writes total cache writes total cache writes
73 / 108
PAPI_FML_INS 0x80000061 No PAPI_FAD_INS 0x80000062 No PAPI_FDV_INS 0x80000063 No PAPI_FSQ_INS 0x80000064 No PAPI_FNV_INS 0x80000065 No PAPI_FP_OPS 0x80000066 Yes PAPI_SP_OPS 0x80000067 Yes scaled single precision vector PAPI_DP_OPS 0x80000068 Yes scaled double precision vector PAPI_VEC_SP 0x80000069 Yes PAPI_VEC_DP 0x8000006a Yes - - - - - - - - - - - - - - Of 107 possible events, 57 are avail.c
No Floating point multiply instructions No Floating point add instructions No Floating point divide instructions No Floating point square root instructions No Floating point inverse instructions Yes Floating point operations Yes Floating point operations; optimized to count operations Yes Floating point operations; optimized to count operations No Single precision vector/SIMD instructions No Double precision vector/SIMD instructions - - - - - - - - - - - - - - - - - - - - -available, of which 14 are derived. PASSED
74 / 108
papi_event_chooser Command
Not all events can be simultaneously monitored (at least not without multiplexing):
[k16n01b:~]$ papi_event_chooser PRESET PAPI_FP_OPS Event Chooser: Available events which can be added with given events. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PAPI Version : 4.1.4.0 Vendor string and code : GenuineIntel (1) Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44) CPU Revision : 2.000000 CPUID Info : Family: 6 Model: 44 Stepping: 2 CPU Megahertz : 2400.391113 CPU Clock Megahertz : 2400 Hdw Threads per core : 1 Cores per Socket : 6 NUMA Nodes : 2 CPU's per Node : 6 Total CPU's : 12 Number Hardware Counters : 16 Max Multiplex Counters : 512 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Name PAPI_L1_DCM PAPI_L1_ICM PAPI_L2_DCM PAPI_L2_ICM Code Deriv Description (Note) 0x80000000 No Level 1 data cache misses 0x80000001 No Level 1 instruction cache misses 0x80000002 Yes Level 2 data cache misses 0x80000003 No Level 2 instruction cache misses
75 / 108
Performance API (PAPI) PAPI_L1_TCM PAPI_L2_TCM PAPI_L3_TCM PAPI_L3_LDM PAPI_TLB_DM PAPI_TLB_IM PAPI_TLB_TL PAPI_L1_LDM PAPI_L1_STM PAPI_L2_LDM PAPI_L2_STM PAPI_BR_UCN PAPI_BR_CN PAPI_BR_TKN PAPI_BR_NTK PAPI_BR_MSP PAPI_BR_PRC PAPI_TOT_IIS PAPI_TOT_INS PAPI_FP_INS PAPI_LD_INS PAPI_SR_INS PAPI_BR_INS PAPI_RES_STL PAPI_TOT_CYC PAPI_LST_INS PAPI_L2_DCH PAPI_L2_DCA PAPI_L3_DCA PAPI_L2_DCR PAPI_L3_DCR PAPI_L2_DCW PAPI_L3_DCW 0x80000006 0x80000007 0x80000008 0x8000000e 0x80000014 0x80000015 0x80000016 0x80000017 0x80000018 0x80000019 0x8000001a 0x8000002a 0x8000002b 0x8000002c 0x8000002d 0x8000002e 0x8000002f 0x80000031 0x80000032 0x80000034 0x80000035 0x80000036 0x80000037 0x80000039 0x8000003b 0x8000003c 0x8000003f 0x80000041 0x80000042 0x80000044 0x80000045 0x80000047 0x80000048 Yes No No No No No Yes No No No No No No No Yes No Yes No No No No No No No No Yes Yes No Yes No No No No
Level 1 cache misses Level 2 cache misses Level 3 cache misses Level 3 load misses Data translation lookaside buffer misses Instruction translation lookaside buffer misses Total translation lookaside buffer misses Level 1 load misses Level 1 store misses Level 2 load misses Level 2 store misses Unconditional branch instructions Conditional branch instructions Conditional branch instructions taken Conditional branch instructions not taken Conditional branch instructions mispredicted Conditional branch instructions correctly predicted Instructions issued Instructions completed Floating point instructions Load instructions Store instructions Branch instructions Cycles stalled on any resource Total cycles Load/store instructions completed Level 2 data cache hits Level 2 data cache accesses Level 3 data cache accesses Level 2 data cache reads Level 3 data cache reads Level 2 data cache writes Level 3 data cache writes
76 / 108
PAPI_L1_ICH 0x80000049 No Level 1 instruction cache hits PAPI_L2_ICH 0x8000004a No Level 2 instruction cache hits PAPI_L1_ICA 0x8000004c No Level 1 instruction cache accesses PAPI_L2_ICA 0x8000004d No Level 2 instruction cache accesses PAPI_L3_ICA 0x8000004e No Level 3 instruction cache accesses PAPI_L1_ICR 0x8000004f No Level 1 instruction cache reads PAPI_L2_ICR 0x80000050 No Level 2 instruction cache reads PAPI_L3_ICR 0x80000051 No Level 3 instruction cache reads PAPI_L2_TCH 0x80000056 Yes Level 2 total cache hits PAPI_L2_TCA 0x80000059 No Level 2 total cache accesses PAPI_L3_TCA 0x8000005a No Level 3 total cache accesses PAPI_L2_TCR 0x8000005c Yes Level 2 total cache reads PAPI_L3_TCR 0x8000005d Yes Level 3 total cache reads PAPI_L2_TCW 0x8000005f No Level 2 total cache writes PAPI_L3_TCW 0x80000060 No Level 3 total cache writes PAPI_SP_OPS 0x80000067 Yes Floating point operations; optimized to count scaled single precision vector operations PAPI_DP_OPS 0x80000068 Yes Floating point operations; optimized to count scaled double precision vector operations PAPI_VEC_SP 0x80000069 No Single precision vector/SIMD instructions PAPI_VEC_DP 0x8000006a No Double precision vector/SIMD instructions - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Total events reported: 56 event_chooser.c PASSED
77 / 108
PAPI Examples
In this section we will work through a few simple examples of using the PAPI API, mostly focused on using the high-level API. And we will steer clear of native events, and leave those to tool developers.
79 / 108
Include files for constants and routine interfaces:
- C: papi.h
- F77: f77papi.h
- F90: f90papi.h
80 / 108
The C interfaces have the form:

  (return type) PAPI_function_name(arg1, arg2, ...)

and the Fortran interfaces:

  PAPIF_function_name(arg1, arg2, ..., check)

Note that the check parameter is the same type and value as the C return value.
81 / 108
The following table shows the relation between the C and Fortran types used in PAPI:
Pseudo-type       Fortran type                      Description
C_INT             INTEGER                           Default Integer type
C_FLOAT           REAL                              Default Real type
C_LONG_LONG       INTEGER*8                         Extended size integer
C_STRING          CHARACTER*(PAPI_MAX_STR_LEN)      Fortran string
C_INT FUNCTION    EXTERNAL INTEGER FUNCTION         Fortran function returning integer result
82 / 108
83 / 108
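The listing below relies on a few definitions that are not reproduced here; a sketch of what they might look like (my reconstruction, following the conventions of the standard PAPI high-level example, so treat the details as assumptions):

/* Definitions assumed by the high-level counting example that follows. */
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

#define NUM_EVENTS 2
#define THRESHOLD 10000
#define ERROR_RETURN(retval) { fprintf(stderr, "Error %d %s:line %d\n", \
                               retval, __FILE__, __LINE__); exit(retval); }

/* hypothetical stand-in work routines counted in the example below */
void computation_add(void)
{
    volatile double a = 0.0;
    int i;
    for (i = 0; i < THRESHOLD; i++) a = a + (double)i;
}

void computation_mult(void)
{
    volatile double m = 1.0;
    int i;
    for (i = 1; i < THRESHOLD; i++) m = m * 1.000001;
}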
int main()
{
  /* Declaring and initializing the event set with the presets */
  int Events[2] = {PAPI_TOT_INS, PAPI_TOT_CYC};
  /* The length of the events array should be no longer than the value
     returned by PAPI_num_counters. */

  /* declaring placeholder for no of hardware counters */
  int num_hwcntrs = 0;
  int retval;
  char errstring[PAPI_MAX_STR_LEN];

  /* This is going to store our list of results */
  long_long values[NUM_EVENTS];

  /***************************************************************************
  * This part initializes the library and compares the version number of the *
  * header file, to the version of the library, if these don't match then it *
  * is likely that PAPI won't work correctly. If there is an error, retval   *
  * keeps track of the version number.                                       *
  ***************************************************************************/
  if ((retval = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT) {
    fprintf(stderr, "Error: %d %s\n", retval, errstring);
    exit(1);
  }
84 / 108
  /***************************************************************************
  * PAPI_num_counters returns the number of hardware counters the platform   *
  * has or a negative number if there is an error                            *
  ***************************************************************************/
  if ((num_hwcntrs = PAPI_num_counters()) < PAPI_OK) {
    printf("There are no counters available.\n");
    exit(1);
  }

  printf("There are %d counters in this system\n", num_hwcntrs);

  /***************************************************************************
  * PAPI_start_counters initializes the PAPI library (if necessary) and      *
  * starts counting the events named in the events array. This function      *
  * implicitly stops and initializes any counters running as a result of     *
  * a previous call to PAPI_start_counters.                                  *
  ***************************************************************************/
  if ((retval = PAPI_start_counters(Events, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("\nCounter Started:\n");

  /* Your code goes here */
  computation_add();
85 / 108
  /***************************************************************************
  * PAPI_read_counters reads the counter values into values array            *
  ***************************************************************************/
  if ((retval = PAPI_read_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("Read successfully\n");
  printf("The total instructions executed for addition are %lld\n", values[0]);
  printf("The total cycles used are %lld\n", values[1]);

  printf("\nNow we try to use PAPI_accum to accumulate values\n");

  /* Do some computation here */
  computation_add();

  /***************************************************************************
  * What PAPI_accum_counters does is it adds the running counter values      *
  * to what is in the values array. The hardware counters are reset and      *
  * left running after the call.                                             *
  ***************************************************************************/
  if ((retval = PAPI_accum_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("We did an additional %d times addition!\n", THRESHOLD);
  printf("The total instructions executed for addition are %lld\n", values[0]);
  printf("The total cycles used are %lld\n", values[1]);
86 / 108
  /***************************************************************************
  * Stop counting events (this reads the counters as well as stops them)     *
  ***************************************************************************/
  printf("\nNow we try to do some multiplications\n");
  computation_mult();

  /******************** PAPI_stop_counters ***********************************/
  if ((retval = PAPI_stop_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("The total instruction executed for multiplication are %lld\n",
         values[0]);
  printf("The total cycles used are %lld\n", values[1]);

  exit(0);
}
87 / 108
88 / 108
PAPI Initialization
The preceding example used PAPI_library_init to initialize PAPI, which is also used for the low-level API, but you can also use PAPI_num_counters, PAPI_start_counters, or one of the rate calls, PAPI_flips, PAPI_flops, or PAPI_ipc. Events are counted, as we saw in the example, using PAPI_accum_counters, PAPI_read_counters, and PAPI_stop_counters. Let's look at an even simpler example just using one of the rate counters.
89 / 108
For something a little different we can look at our old friend, matrix multiplication, this time in Fortran:
! A simple example for the use of PAPI, the number of flops you should
! get is about INDEX^3 on machines that consider add and multiply one flop
! such as SGI, and 2*(INDEX^3) that don't consider it 1 flop such as INTEL
! Kevin London

program flops
  implicit none
  include "f90papi.h"

  integer, parameter :: i8=SELECTED_INT_KIND(16)  ! integer*8
  integer, parameter :: index=1000
  real :: matrixa(index,index), matrixb(index,index), mres(index,index)
  real :: proc_time, mflops, real_time
  integer(kind=i8) :: flpins
  integer :: i, j, k, retval
90 / 108
  retval = PAPI_VER_CURRENT
  CALL PAPIf_library_init(retval)
  if (retval .NE. PAPI_VER_CURRENT) then
    print *, 'Failure in PAPI_library_init: ', retval
  end if

  CALL PAPIf_query_event(PAPI_FP_OPS, retval)
  if (retval .NE. PAPI_OK) then
    print *, 'Sorry, no PAPI_FP_OPS event: ', PAPI_ENOEVNT
  end if

  ! Initialize the Matrix arrays
  do i=1,index
    do j=1,index
      matrixa(i,j) = i+j
      matrixb(i,j) = j-i
      mres(i,j) = 0.0
    end do
  end do

  ! Setup PAPI library and begin collecting data from the counters
  call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
  if (retval .NE. PAPI_OK) then
    print *, 'Failure on PAPIf_flops: ', retval
  end if
91 / 108
  ! Matrix-Matrix Multiply
  do i=1,index
    do j=1,index
      do k=1,index
        mres(i,j) = mres(i,j) + matrixa(i,k)*matrixb(k,j)
      end do
    end do
  end do

  ! Collect the data into the Variables passed in
  call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
  if (retval .NE. PAPI_OK) then
    print *, 'Failure on PAPIf_flops: ', retval
  end if

  print *, 'Real_time: ', real_time
  print *, 'Proc_time: ', proc_time
  print *, 'Total flpins: ', flpins
  print *, 'MFLOPS: ', mflops

end program flops
92 / 108
93 / 108
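The other rate calls follow the same start/read pattern as PAPIf_flops above; a minimal C sketch of my own (not taken from the slides) using PAPI_ipc, which reports instructions completed and instructions per cycle:

/* Measure a code section with the PAPI_ipc rate call. */
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    float rtime, ptime, ipc;
    long_long ins;
    volatile double s = 0.0;
    int i, retval;

    /* the first call starts the counters */
    if ((retval = PAPI_ipc(&rtime, &ptime, &ins, &ipc)) < PAPI_OK) {
        printf("PAPI_ipc failed: %d\n", retval);
        exit(1);
    }

    for (i = 1; i < 10000000; i++)   /* code section to be measured */
        s += 1.0 / i;

    /* the second call reads elapsed real/process time, instructions, IPC */
    if ((retval = PAPI_ipc(&rtime, &ptime, &ins, &ipc)) < PAPI_OK) {
        printf("PAPI_ipc failed: %d\n", retval);
        exit(1);
    }
    printf("real %f s  proc %f s  instructions %lld  IPC %f\n",
           rtime, ptime, ins, ipc);
    return 0;
}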
Low-level API
Low-level API
The low-level API is primarily intended for experienced application programmers and tool developers. It manages hardware events in user-defined groups called event sets, and can use both preset and native events. The low-level API can also interrogate the hardware and determine memory sizes of the executable itself. The low-level API can also be used for multiplexing, in which more (virtual) counters can be used than the underlying hardware supports, by timesharing the available (physical) hardware counters.
94 / 108
Low-level API
95 / 108
Low-level API
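The code that follows picks up partway through a counting example; a sketch of the setup it assumes (my reconstruction following standard PAPI usage, not the original slide content) might look like:

/* Initialize the library, create an event set, and add one preset event.
 * The names NUM_FLOPS, handle_error, and do_flops match those used in the
 * continuation below; do_flops lives in tests/do_loops.c in the PAPI
 * source distribution.                                                    */
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

#define NUM_FLOPS 10000

extern void do_flops(int n);          /* from tests/do_loops.c */

void handle_error(int retval)
{
    printf("PAPI error %d: %s\n", retval, PAPI_strerror(retval));
    exit(1);
}

int main()
{
    int EventSet = PAPI_NULL;
    long_long values[1];

    /* Initialize the PAPI library */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        handle_error(1);

    /* Create an empty Event Set */
    if (PAPI_create_eventset(&EventSet) != PAPI_OK) handle_error(1);

    /* Add Total Instructions Executed to the Event Set */
    if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK) handle_error(1);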
    /* Start counting events in the Event Set */
    if (PAPI_start(EventSet) != PAPI_OK) handle_error(1);

    /* Defined in tests/do_loops.c in the PAPI source distribution */
    do_flops(NUM_FLOPS);

    /* Read the counting events in the Event Set */
    if (PAPI_read(EventSet, values) != PAPI_OK) handle_error(1);
    printf("After reading the counters: %lld\n", values[0]);

    /* Reset the counting events in the Event Set */
    if (PAPI_reset(EventSet) != PAPI_OK) handle_error(1);

    do_flops(NUM_FLOPS);

    /* Add the counters in the Event Set */
    if (PAPI_accum(EventSet, values) != PAPI_OK) handle_error(1);
    printf("After adding the counters: %lld\n", values[0]);

    do_flops(NUM_FLOPS);

    /* Stop the counting of events in the Event Set */
    if (PAPI_stop(EventSet, values) != PAPI_OK) handle_error(1);
    printf("After stopping the counters: %lld\n", values[0]);
}
96 / 108
PAPI in Parallel
PAPI in Parallel
Threads: PAPI_thread_init enables PAPI's thread support, and should be called immediately after PAPI_library_init. MPI codes are treated very simply - each process has its own address space, and potentially its own hardware counters.
97 / 108
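A minimal sketch of enabling thread support in an OpenMP code (my own example, not from the slides; PAPI_thread_init takes a function returning a unique id for the calling thread, and omp_get_thread_num is the usual choice for OpenMP):

/* Per-thread counting with the (thread-safe) low-level API. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "papi.h"

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);

    /* called immediately after PAPI_library_init, per the slide above */
    if (PAPI_thread_init((unsigned long (*)(void)) omp_get_thread_num) != PAPI_OK)
        exit(1);

    #pragma omp parallel
    {
        /* each thread counts its own events with a private event set */
        int EventSet = PAPI_NULL;
        long_long count;
        volatile double s = 0.0;
        int i;

        if (PAPI_create_eventset(&EventSet) == PAPI_OK &&
            PAPI_add_event(EventSet, PAPI_TOT_INS) == PAPI_OK &&
            PAPI_start(EventSet) == PAPI_OK) {

            for (i = 1; i < 1000000; i++) s += 1.0 / i;   /* work to measure */

            PAPI_stop(EventSet, &count);
            printf("thread %d: %lld instructions\n",
                   omp_get_thread_num(), count);
        }
    }
    return 0;
}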
High-level Tools
High-level Tools
There are a bunch of open-source high-level tools that build on some of the simple approaches that we have been talking about. General characteristics found in most (not necessarily all):
- Ability to generate and view MPI trace files, leveraging MPI's built-in profiling interface
- Ability to do statistical profiling (à la gprof) and code viewing for identifying hotspots
- Ability to access performance counters, leveraging PAPI
99 / 108
High-level Tools
Popular Examples
Tool Examples
A list of such high-level tool examples (not exhaustive): TAU, Tuning and Analysis Utility,
http://www.cs.uoregon.edu/Research/tau/home.php
100 / 108
High-level Tools
Example: IPM
IPM is relatively simple to install and use, so we can easily walk through our favorite example. Note that IPM does:
- MPI
- PAPI
- I/O profiling
- Memory
- Timings: wall, user, and system
101 / 108
High-level Tools
102 / 108
High-level Tools
... and the output is a big XML file plus some useful output to standard output:
[k07n14:~/d_laplace/d_ipm]$ file jonesm.1318862245.001449.0
jonesm.1318862245.001449.0: XML document text
[k07n14:~/d_laplace/d_ipm]$ less subMPIP.out
...
##IPMv0.983####################################################################
#
# command : ./laplace_mpi (completed)
# host    : d16n03/x86_64_Linux       mpi_tasks : 16 on 2 nodes
# start   : 10/17/11/10:37:25         wallclock : 116.170005 sec
# stop    : 10/17/11/10:39:21         %comm     : 13.94
# gbytes  : 2.24606e+00 total         gflop/sec : 5.02520e+00 total
#
##############################################################################
# region  : *    [ntasks] = 16
#
103 / 108
High-level Tools
#                   [total]        <avg>          min          max
# entries                16            1            1            1
# wallclock         1853.71      115.857      115.816       116.17
# user              1853.09      115.818      115.707      115.936
# system            2.18066     0.136291     0.071989     0.198969
# mpi               259.152       16.197      11.3859      19.1157
# %comm                          13.9425      9.82914      16.5048
# gflop/sec          5.0252     0.314075     0.311741     0.319497
# gbytes            2.24606     0.140379     0.138138     0.170021
#
# PAPI_FP_OPS   5.83778e+11  3.64861e+10  3.62149e+10   3.7116e+10
# PAPI_FP_INS    5.8276e+11  3.64225e+10  3.62144e+10  3.69079e+10
# PAPI_DP_OPS   5.82764e+11  3.64228e+10  3.62144e+10  3.69079e+10
# PAPI_VEC_DP   4.00803e+06       250501            0  4.00803e+06
#
#                    [time]      [calls]       <%mpi>      <%wall>
# MPI_Allreduce     243.838       381520        94.09        13.15
# MPI_Sendrecv      14.9598       763040         5.77         0.81
# MPI_Send         0.339084           15         0.13         0.02
# MPI_Recv        0.0143731           15         0.01         0.00
# MPI_Bcast      0.00124932           16         0.00         0.00
# MPI_Comm_rank 1.58967e-05           16         0.00         0.00
# MPI_Comm_size 8.01496e-06           16         0.00         0.00
###############################################################################
104 / 108
High-level Tools
#!/bin/bash
if [ $# -ne 1 ]; then
  echo "Usage: $0 xml_filename"
  exit
fi
XMLFILE=$1
export IPM_KEYFILE=/projects/jonesm/ipm/src/ipm/ipm_key
export PATH=${PATH}:/projects/jonesm/ipm/src/ipm/bin
/projects/jonesm/ipm/src/ipm/bin/ipm_parse -html $XMLFILE
105 / 108
High-level Tools
[u2:~/d_laplace/d_ipm]$ ./genhtml.sh jonesm.1318862245.001449.0
# data_acquire = 0 sec
# data_workup = 0 sec
# mpi_pie = 1 sec
# task_data = 0 sec
# load_bal = 0 sec
# time_stack = 0 sec
# mpi_stack = 1 sec
# mpi_buff = 0 sec
# switch+mem = 0 sec
# topo_tables = 0 sec
# topo_data = 0 sec
# topo_time = 0 sec
# html_all = 2 sec
# html_regions = 0 sec
# html_nonregion = 1 sec
[u2:~/d_laplace/d_ipm]$ ls -l \
laplace_mpi_16_jonesm.1318862245.001449.0_ipm_1159896.d15n41.ccr.buffalo.edu/
total 346
-rw-r--r-- 1 jonesm ccrstaff   994 Oct 17 16:07 dev.html
-rw-r--r-- 1 jonesm ccrstaff   104 Oct 17 16:07 env.html
-rw-r--r-- 1 jonesm ccrstaff   347 Oct 17 16:07 exec.html
-rw-r--r-- 1 jonesm ccrstaff   451 Oct 17 16:07 hostlist.html
drwxr-xr-x 2 jonesm ccrstaff   930 Oct 17 16:07 img
-rw-r--r-- 1 jonesm ccrstaff 10550 Oct 17 16:07 index.html
-rw-r--r-- 1 jonesm ccrstaff   387 Oct 17 16:07 map_adjacency.txt
-rw-r--r-- 1 jonesm ccrstaff  8961 Oct 17 16:07 map_calls.txt
-rw-r--r-- 1 jonesm ccrstaff  1452 Oct 17 16:07 map_data.txt
drwxr-xr-x 2 jonesm ccrstaff   803 Oct 17 16:07 pl
-rw-r--r-- 1 jonesm ccrstaff  2620 Oct 17 16:07 task_data
[k07n14:~/d_laplace/d_ipm]$ tar czf my-ipm-files.tgz \
laplace_mpi_16_jonesm.1318862245.001449.0_ipm_1159896.d15n41.ccr.buffalo.edu/
[k07n14:~/d_laplace/d_ipm]$ ls -l my-ipm-files.tgz
-rw-r--r-- 1 jonesm ccrstaff 71509 Oct 17 16:48 my-ipm-files.tgz
106 / 108
High-level Tools
107 / 108
High-level Tools
Summary
Summary of high-level tools:
- IPM is pretty easy to use, provides some good functionality
- TAU and Open|SpeedShop have steeper learning curves, much more functionality
108 / 108