M. D. Jones, Ph.D.
Center for Computational Research, University at Buffalo, State University of New York
1 / 108
Performance Fundamentals
Performance Foundations
Three pillars of performance optimization:
- Algorithmic: choose the most effective algorithm that you can for the problem of interest
- Serial Efficiency: optimize the code to run efficiently in a non-parallel environment
- Parallel Efficiency: effectively use multiple processors to achieve a reduction in execution time, or equivalently, to solve a proportionately larger problem
3 / 108
Performance Fundamentals
Algorithmic Efficiency
Choose the best algorithm before you start coding (recall that good planning is an essential part of writing good software):
- Running on a large number of processors? Choose an algorithm that scales well with increasing processor count
- Running a large system (mesh points, particle count, etc.)? Choose an algorithm that scales well with system size
- If you are going to run on a massively parallel machine, plan from the beginning on how you intend to decompose the problem (it may save you a lot of time later)
4 / 108
Performance Baseline
Serial Efficiency
Getting efficient code in parallel is made much more difficult if you have not optimized the sequential code, and in fact can lead to a misleading picture of parallel performance. Recall that our definition of parallel speedup, S(N_p) = t_s / t_p(N_p), involves the time t_s for an optimal sequential implementation (not just t_p(1)!).
5 / 108
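A quick illustration of why the serial baseline matters (the numbers here are invented for illustration, they are not taken from the slides): suppose an unoptimized serial code runs in 200 s, a tuned serial version of the same code runs in 100 s, and the parallel version runs in 25 s on 8 processors. Then

  S(8) = t_s / t_p(8) = 100 / 25 = 4

rather than the 200 / 25 = 8 that a comparison against the unoptimized serial run would (misleadingly) suggest.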
Performance Baseline
Steps to establishing a baseline for your own performance expectations:
- Choose a representative problem (or better still a suite of problems) that can be run under identical circumstances on a multitude of platforms/compilers (requires portable code!)
- How fast is fast? You can utilize hardware performance counters to measure actual code performance
- Profile, profile, and then profile some more ... to find bottlenecks and spend your time more effectively in optimizing code
6 / 108
Pitfalls when measuring the performance of parallel codes: For many, speedup or linear scalability is the ultimate goal. This goal is incomplete - a terribly inefficient code can scale well, but actually deliver poor efficiency. For example, consider a simple Monte Carlo based code that uses the most rudimentary uniform sampling (i.e. no importance sampling) - this can be made to scale perfectly in parallel, but the algorithmic efficiency (measured perhaps by the delivered variance per cpu-hour) is quite low.
7 / 108
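To make the variance-per-cpu-hour point concrete (a rough illustration with invented numbers, not from the slides): the statistical error of a Monte Carlo estimate falls as σ/sqrt(N), so reaching a target error ε costs

  N = (σ / ε)^2  samples,  i.e. cpu-hours proportional to σ^2 at a fixed target error.

If importance sampling reduces σ^2 by a factor of 100 relative to uniform sampling, the uniform-sampling code needs 100 times as many samples; even with perfect parallel scaling on 100 processors its time to a given error only matches a single-processor run of the importance-sampled code.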
Timers
time Command
Note that this is not the time built-in function in many shells (bash and tcsh included), but instead the one located in /usr/bin. This command is quite useful for getting an overall picture of code performance. The default output format:
%Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
%Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps
9 / 108
Timers
time Example
[bono:~/d_laplace]$ /usr/bin/time ./laplace_s
Max value in sol: 0.999992327961218
Min value in sol: 8.742278000372475E-008
75.82user 0.00system 1:17.72elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+913minor)pagefaults 0swaps
[bono:~/d_laplace]$ /usr/bin/time -p ./laplace_s
Max value in sol: 0.999992327961218
Min value in sol: 8.742278000372475E-008
real 75.73
user 74.68
sys 0.00
10 / 108
Timers
OSC's mpiexec will report accurate timings in the (optional) email report, as it does not rely on rsh/ssh to launch tasks (but Intel MPI does, so in that case you will see the result of timing the mpiexec shell script, not the MPI code).
11 / 108
Generally I prefer the MPI and OpenMP timing calls whenever I can use them (the MPI and OpenMP specifications call for their intrinsic timers to be high precision).
12 / 108
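As a minimal sketch of what using those intrinsic timers looks like (my own example following the standard MPI/OpenMP APIs, not code from the slides; compile with an MPI wrapper compiler and OpenMP enabled):

/* Time a code section with the MPI and OpenMP intrinsic wall-clock timers,
 * both of which return seconds as a double.                               */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* hypothetical stand-in for the code section to be timed */
static void work(void)
{
    volatile double s = 0.0;
    long i;
    for (i = 1; i < 50000000L; i++) s += 1.0 / i;
}

int main(int argc, char **argv)
{
    double t0, t1, w0, w1;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();              /* MPI wall-clock timer */
    work();
    t1 = MPI_Wtime();
    printf("MPI_Wtime: %f s elapsed (resolution %g s)\n", t1 - t0, MPI_Wtick());

    w0 = omp_get_wtime();          /* OpenMP wall-clock timer */
    work();
    w1 = omp_get_wtime();
    printf("omp_get_wtime: %f s elapsed\n", w1 - w0);

    MPI_Finalize();
    return 0;
}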
More information on code section timers (and code for doing so): LLNL Performance Tools:
https://computing.llnl.gov/tutorials/performance_tools/#gettimeofday
Stopwatch (nice F90 module, but you need to supply a low-level function for accessing a timer): http://math.nist.gov/StopWatch
13 / 108
gprof
14 / 108
gprof
gprof Shortcomings
Shortcomings of gprof (which apply also to any statistical profiling tool):
- Need to recompile to instrument the code
- Instrumentation can affect the statistics in the profile
- Overhead can significantly increase the running time
- Compiler optimization can be affected by instrumentation
15 / 108
gprof
- Flat Profile: shows how much time your program spent in each function, and how many times that function was called
- Call Graph: for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function
- Basic-block: requires compilation with the -a flag (supported only by GNU?) - enables gprof to construct an annotated source code listing showing how many times each line of code was executed
16 / 108
gprof
gprof example
g77 -I. -O3 -ffast-math -g -pg -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o
[jonesm@bono ~/d_bench]$ qsub -q debug -l nodes=1:ppn=2,walltime=00:30:00 -I
[jonesm@c16n30 ~/d_bench]$ ./rp
Enter runid: (<=9 chars) short
...
... skip copious amount of standard output ...
...
[jonesm@bono ~/d_bench]$ gprof rp gmon.out >& out.gprof
[jonesm@bono ~/d_bench]$ less out.gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self
 time    seconds  seconds     calls
89.74     123.23   123.23    204008
 6.96     132.79     9.56         1
 1.18     134.41     1.62    200004
 1.05     135.86     1.44    204002
 0.71     136.83     0.97  14790551
 0.27     137.20     0.37    204008
...
17 / 108
gprof
[jonesm@bono ~/d_bench]$ gprof -line rp gmon.out >& out.gprof.line
[jonesm@bono ~/d_bench]$ less out.gprof.line
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time    seconds  seconds   calls  ns/call  ns/call  name
17.45      23.96    23.96                            triwfns_ (adwfns.
14.46      43.82    19.86                            triwfns_ (adwfns.
12.87      61.50    17.68                            triwfns_ (adwfns.
12.31      78.41    16.91                            triwfns_ (adwfns.
 0.67      79.33     0.92                            triwfns_ (adwfns.
 0.59      80.14     0.82                            MAIN__ (cc4WTuQH.
 0.51      80.84     0.70                            MAIN__ (cc4WTuQH.
18 / 108
gprof
19 / 108
pgprof
- PGI tools also have profiling capabilities (c.f. man pgf95)
- Graphical profiler, pgprof
20 / 108
pgprof
pgprof example
pgf77 -tp p7-64 -fastsse -g77libs -Mprof=lines -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o
[jonesm@bono ~/d_bench]$ ./rp
Enter runid: (<=9 chars) short
...
... skip copious amount of standard output ...
...
[jonesm@bono ~/d_bench]$ pgprof -exe ./rp pgprof.out
21 / 108
pgprof
22 / 108
pgprof
N.B. you can also use the -text option to pgprof to make it behave more like gprof. See the PGI Tools Guide for more information (there should be a PDF copy in $PGI/doc).
23 / 108
http://mpip.sourceforge.net
- Not a tracing tool, but a lightweight interface to accumulate statistics using the MPI profiling interface
- Quite useful in conjunction with a tracefile analysis (e.g. using jumpshot)
- Installed on CCR systems - see module avail for mpiP availability and location
25 / 108
mpiP Compilation
To use mpiP you need to:
- Add a -g flag to add symbols (this will allow mpiP to access the source code symbols and line numbers)
- Link in the necessary mpiP profiling library and the binary utility libraries for actually decoding symbols (there is a trick that you can use most of the time to avoid having to link with mpiP, though)
Compilation examples (from U2) follow ...
26 / 108
27 / 108
From an older version of mpiP, but still almost entirely the same - this one links directly with mpiP; first compile:
mpicc -g -o atlas aij2_basis.o analyze.o atlas.o barrier.o byteflip.o \
  chordgn2.o cstrings.o io2.o map.o mutils.o numrec.o paramods.o proj.o \
  projAtlas.o sym2.o util.o -lm -L/Projects/CCR/jonesm/mpiP-2.8.2/gnu/ch_gm/lib \
  -lmpiP -lbfd -liberty -lm
then run (in this case using 16 processors) and examine the output file:
[jonesm@joplin d_derenzo]$ ls *.mpiP
atlas.gcc3mpipapimpiP.16.20578.1.mpiP
[jonesm@joplin d_derenzo]$ less atlas.gcc3mpipapimpiP.16.20578.1.mpiP
:
:
28 / 108
@ mpiP
@ Command : ./atlas.gcc3mpipapimpiP study.ini 0
@ Version : 2.8.2
@ MPIP Build date : Jun 29 2005, 14:53:41
@ Start time : 2005 06 29 15:18:52
@ Stop time : 2005 06 29 15:28:34
@ Timer Used : gettimeofday
@ MPIP env var : [null]
@ Collector Rank : 0
@ Collector PID : 20578
@ Final Output Dir : .
@ MPI Task Assignment : 0 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 1 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 2 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 3 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 4 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 5 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 6 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 7 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 8 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 9 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 10 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 11 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 12 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 13 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 14 bb18n10.ccr.buffalo.edu
@ MPI Task Assignment : 15 bb18n10.ccr.buffalo.edu
29 / 108
---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        582       44.7     7.69
   1        579       41.9     7.24
   2        579       40.7     7.03
   3        579       36.9     6.37
   4        579       22.3     3.84
   5        579       16.6     2.87
   6        579         32     5.53
   7        579       35.9     6.20
   8        579       28.6     4.93
   9        579       25.9     4.48
  10        579       39.2     6.76
  11        579       33.8     5.84
  12        579       35.3     6.10
  13        579         41     7.07
  14        579       29.9     5.16
  15        579       41.4     7.16
   *   9.27e+03        546     5.89
30 / 108
---------------------------------------------------------------------------
@--- Callsites: 13 --------------------------------------------------------
---------------------------------------------------------------------------
 ID Lev File/Address      Line Parent_Funct          MPI_Call
  1   0 util.c             833 gsync                 Barrier
  2   0 atlas.c           1531 readProjData          Allreduce
  3   0 projAtlas.c        745 backProjAtlas         Allreduce
  4   0 atlas.c           1545 readProjData          Allreduce
  5   0 atlas.c           1525 readProjData          Allreduce
  6   0 atlas.c           1541 readProjData          Allreduce
  7   0 atlas.c           1589 readProjData          Allreduce
  8   0 atlas.c           1519 readProjData          Allreduce
  9   0 util.c             789 mygcast               Bcast
 10   0 projAtlas.c       1100 computeLoglikeAtlas   Allreduce
 11   0 atlas.c           1514 readProjData          Allreduce
 12   0 atlas.c           1537 readProjData          Allreduce
 13   0 projAtlas.c        425 fwdBackProjAtlas2     Allreduce
31 / 108
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call        Site      Time   App%   MPI%    COV
Allreduce     13  3.09e+05   3.33  56.50   0.46
Barrier        1  2.13e+05   2.30  38.97   0.35
Bcast          9  1.69e+04   0.18   3.10   0.37
Allreduce      3  7.78e+03   0.08   1.42   0.11
Allreduce     10      62.7   0.00   0.01   0.20
Allreduce     11      2.42   0.00   0.00   0.09
Allreduce      7      2.17   0.00   0.00   0.26
Allreduce     12      1.15   0.00   0.00   0.20
Allreduce      6      1.14   0.00   0.00   0.19
Allreduce      5      1.13   0.00   0.00   0.15
Allreduce      8      1.12   0.00   0.00   0.18
Allreduce      2       1.1   0.00   0.00   0.13
Allreduce      4       1.1   0.00   0.00   0.12
32 / 108
---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call        Site    Count      Total      Avrg  Sent%
Allreduce     13    65536   2.28e+09  3.48e+04  83.69
Allreduce      3     8192   2.85e+08  3.48e+04  10.46
Bcast          9   490784   1.59e+08       325   5.85
Allreduce     11       16   2.07e+04   1.3e+03   0.00
Allreduce     10      512    4.1e+03         8   0.00
Allreduce      7       16        256        16   0.00
Allreduce      2       16         64         4   0.00
Allreduce      6       16         64         4   0.00
Allreduce      5       16         64         4   0.00
Allreduce      4       16         64         4   0.00
Allreduce      8       16         64         4   0.00
Allreduce     12       16         64         4   0.00
...
33 / 108
Now let's examine an example using mpiP at runtime. This example solves a simple Laplace equation with Dirichlet boundary conditions using finite differences.
34 / 108
#PBS -S /bin/bash #PBS -q debug #PBS -l walltime=00:20:00 #PBS -l nodes=2:MEM24GB:ppn=8 #PBS -M jonesm@ccr.buffalo.edu #PBS -m e #PBS -N test #PBS -o subMPIP.out #PBS -j oe module load intel module load intel-mpi module load mpip module list cd $PBS_O_WORKDIR which mpiexec NNODES=`cat $PBS_NODEFILE | uniq | wc -l` NPROCS=`cat $PBS_NODEFILE | wc -l` export I_MPI_DEBUG=5 # Use LD_PRELOAD trick to load mpiP wrappers at runtime export LD_PRELOAD=$MPIPDIR/lib/libmpiP.so mpdboot -n $NNODES -f $PBS_NODEFILE -v mpiexec -np $NPROCS -envall ./laplace_mpi <<EOF 2000 EOF mpdallexit
35 / 108
... and then run it and examine the resulting mpiP output file:
[k07n14:~/d_laplace/d_mpip]$ ls -l laplace_mpi.16.8618.1.mpiP
-rw-r--r-- 1 jonesm ccrstaff 17919 Oct 14 16:28 laplace_mpi.16.8618.1.mpiP
36 / 108
@ mpiP
@ Command : ./laplace_mpi
@ Version : 3.3.0
@ MPIP Build date : Oct 14 2011, 16:16:34
@ Start time : 2011 10 14 16:26:15
@ Stop time : 2011 10 14 16:28:11
@ Timer Used : PMPI_Wtime
@ MPIP env var : [null]
@ Collector Rank : 0
@ Collector PID : 8618
@ Final Output Dir : .
@ Report generation : Single collector task
@ MPI Task Assignment : 0 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 1 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 2 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 3 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 4 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 5 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 6 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 7 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 8 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 9 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 10 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 11 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 12 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 13 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 14 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 15 d15n23.ccr.buffalo.edu
37 / 108
---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        116       16.9    14.57
   1        115       16.4    14.18
   2        115       16.8    14.53
   3        115       17.7    15.34
   4        115       16.6    14.39
   5        115       14.3    12.37
   6        115       13.7    11.90
   7        115       11.1     9.65
   8        115       12.5    10.87
   9        115       13.8    12.00
  10        115       14.5    12.60
  11        115       16.9    14.67
  12        115       17.5    15.15
  13        115       19.4    16.82
  14        115       16.4    14.22
  15        115       17.5    15.13
   *   1.85e+03        252    13.65
38 / 108
---------------------------------------------------------------------------
@--- Callsites: 6 ---------------------------------------------------------
---------------------------------------------------------------------------
 ID Lev File/Address      Line Parent_Funct            MPI_Call
  1   0 laplace_mpi.f90    118 MAIN__                  Allreduce
  2   0 laplace_mpi.f90    143 MAIN__                  Recv
  3   0 laplace_mpi.f90     48 __paramod_MOD_xchange   Sendrecv
  4   0 laplace_mpi.f90     80 MAIN__                  Bcast
  5   0 laplace_mpi.f90     46 __paramod_MOD_xchange   Sendrecv
  6   0 laplace_mpi.f90    138 MAIN__                  Send
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call        Site      Time   App%   MPI%    COV
Allreduce      1  2.34e+05  12.69  92.94   0.15
Sendrecv       3  8.88e+03   0.48   3.52   0.19
Sendrecv       5  8.47e+03   0.46   3.36   0.16
Send           6       321   0.02   0.13   0.51
Bcast          4      97.7   0.01   0.04   0.26
Recv           2      14.4   0.00   0.01   0.00
---------------------------------------------------------------------------
39 / 108
---------------------------------------------------------------------------
@--- Callsite Time statistics (all, milliseconds): 80 ---------------------
---------------------------------------------------------------------------
Name       Site Rank    Count     Max    Mean     Min   App%   MPI%
Allreduce     1    0    23845    28.5   0.666   0.036  13.69  94.00
Allreduce     1    1    23845    28.6   0.644   0.035  13.30  93.79
Allreduce     1    2    23845    28.6   0.661   0.037  13.66  93.98
Allreduce     1    3    23845    28.6   0.695   0.038  14.36  93.63
Allreduce     1    4    23845    29.2   0.651   0.038  13.46  93.54
Allreduce     1    5    23845    29.2   0.555   0.035  11.46  92.71
Allreduce     1    6    23845    29     0.528   0.037  10.91  91.64
Allreduce     1    7    23845    26.3   0.405   0.038   8.37  86.74
Allreduce     1    8    23845    26.8   0.468   0.034   9.67  88.92
Allreduce     1    9    23845    28.4   0.538   0.033  11.11  92.59
Allreduce     1   10    23845    29.7   0.566   0.033  11.70  92.90
Allreduce     1   11    23845    29.7   0.661   0.033  13.65  93.10
Allreduce     1   12    23845    28.6   0.686   0.039  14.17  93.49
Allreduce     1   13    23845    28.8   0.768   0.041  15.87  94.35
Allreduce     1   14    23845    28.6   0.641   0.036  13.25  93.19
Allreduce     1   15    23845    28.7   0.694   0.038  14.33  94.76
Allreduce     1    *   381520    29.7   0.614   0.033  12.69  92.94
40 / 108
---------------------------------------------------------------------------
@--- Callsite Message Sent statistics (all, sent bytes) -------------------
---------------------------------------------------------------------------
Name       Site Rank    Count   Max  Mean   Min        Sum
Allreduce     1    0    23845     8     8     8  1.908e+05
Allreduce     1    1    23845     8     8     8  1.908e+05
Allreduce     1    2    23845     8     8     8  1.908e+05
Allreduce     1    3    23845     8     8     8  1.908e+05
Allreduce     1    4    23845     8     8     8  1.908e+05
Allreduce     1    5    23845     8     8     8  1.908e+05
Allreduce     1    6    23845     8     8     8  1.908e+05
Allreduce     1    7    23845     8     8     8  1.908e+05
Allreduce     1    8    23845     8     8     8  1.908e+05
Allreduce     1    9    23845     8     8     8  1.908e+05
Allreduce     1   10    23845     8     8     8  1.908e+05
Allreduce     1   11    23845     8     8     8  1.908e+05
Allreduce     1   12    23845     8     8     8  1.908e+05
Allreduce     1   13    23845     8     8     8  1.908e+05
Allreduce     1   14    23845     8     8     8  1.908e+05
Allreduce     1   15    23845     8     8     8  1.908e+05
Allreduce     1    *   381520     8     8     8  3.052e+06
41 / 108
A commercial product for performing MPI trace analysis that has enjoyed a long history is Vampir/Vampirtrace, originally developed and sold by Pallas GmbH. Now owned by Intel and available as the Intel Trace Analyzer and Collector. We have a license on U2 if someone wants to give it a try. Note that Vampir/Vampirtrace has since been reborn as an entirely new product.
42 / 108
ITAC Example
Note that you do not have to recompile your application to use ITAC (unless you are building it statically); you can just build it as usual, using Intel MPI:
[bono:~/d_laplace/d_itac]$ module load intel-mpi
[bono:~/d_laplace/d_itac]$ make laplace_mpi
mpiifort -ipo -O3 -Vaxlib -g -c laplace_mpi.f90
mpiifort -ipo -O3 -Vaxlib -g -o laplace_mpi laplace_mpi.o
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_ifortjOzYAn.o
[bono:~/d_laplace/d_itac]$
43 / 108
[bono:~/d_laplace]$ cat subICT
#PBS -S /bin/bash
#PBS -q debug
#PBS -l walltime=00:20:00
#PBS -l nodes=1:ppn=8
#PBS -M jonesm@ccr.buffalo.edu
#PBS -m e
#PBS -N ITAC
#PBS -o subITAC.out
#PBS -j oe
. $MODULESHOME/init/bash
module load intel-mpi
module list
cd $PBS_O_WORKDIR
which mpiexec
NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
NPROCS=`cat $PBS_NODEFILE | wc -l`
UNIQ_HOSTS=tmp.hosts
cat $PBS_NODEFILE | uniq > $UNIQ_HOSTS
export I_MPI_DEBUG=5
mpdboot -n $NNODES -f "$UNIQ_HOSTS" -v
mpiexec -trace -np $NPROCS ./laplace_mpi <<EOF
2000
EOF
mpdallexit
[ -e "$UNIQ_HOSTS" ] && \
rm "$UNIQ_HOSTS"
44 / 108
Running the preceding batch job on U2 produces a bunch (many!) of profiling output files, the most important of which is the one named after your binary with a .stf suffix, in this case laplace_mpi.stf. We feed it to the Intel Trace Analyzer using the traceanalyzer command ... and we should see a profile that looks very much like what you can see using jumpshot.
45 / 108
46 / 108
Generally a good idea to refer to the documentation for the same version that you are using (you can check with module show intel-mpi).
47 / 108
Introduction to PAPI
Introduction
Performance Application Programming Interface
- Implement a portable(!) and efficient API to access existing hardware performance counters
- Ease the optimization of code by providing base infrastructure for cross-platform tools
49 / 108
Introduction to PAPI
Pre-PAPI
Before PAPI came along, there were hardware performance counters, of course - but access to them was limited to proprietary tools and APIs. Some examples were SGI's perfex and Cray's hpm. Now, as long as PAPI has been ported to a particular hardware substrate, the end-programmer (or tool developer) can just use the PAPI interface.
50 / 108
Introduction to PAPI
PAPI Schematic
51 / 108
Introduction to PAPI
- Headed for inclusion in the mainstream Linux kernel (was a custom patch applied to CCR systems prior to Linux kernel 2.6.32); low overhead
- IA64 - uses PFM (perfmon), developed by HP and included in the Linux kernel (for x86_64):
  - Full use of available IA64 monitoring capabilities
  - Quite a bit slower than perfctr, at least according to the PAPI developers
  - http://www.hpl.hp.com/research/linux/perfmon
  - libpfm lives on using perf events, but perfmon apparently ceased development for Linux as of kernel 2.6.30 or so
- "Perf Events" added to the Linux kernel in 2.6.31, replacing both of the above, c.f.:
  http://web.eecs.utk.edu/~vweaver1/projects/perf-events/
52 / 108
Reminder: U2 Hardware
Reconsider the block diagram for the (older nodes) Intel architecture in CCR's U2 cluster:
Look familiar?
53 / 108
Reminder: U2 Hardware
Consider the block diagram for the Intel architecture in CCR's U2 cluster:
54 / 108
Reminder: U2 Hardware
[Figure: block diagram of the Intel Nehalem microarchitecture, showing the instruction cache and TLBs, decoders, decoded instruction queue, register allocation tables, reorder buffer, execution units, and uncore]
55 / 108
Reminder: U2 Hardware
Characteristics of the Irwindale Xeons that form (part of) CCR's U2 cluster:

Clock Cycle              3.2 GHz
TPP                      6.4 GFlop/s
Pipeline                 31 stages
L2 Cache Size            2 MByte
L2 Bandwidth             102.4 GByte/s
CPU-Memory Bandwidth     6.4 GByte/s (shared)
56 / 108
Reminder: U2 Hardware
Consider the penalties in lost computation for not reusing data in the various caches (using U2's Intel Irwindale Xeon processors):

Memory          Miss Penalty (cycles)
L1 Cache                  3
L2 Cache                 28
Main                    400
http://www.bitmover.com/lmbench
57 / 108
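To translate those penalties into lost floating-point work (a back-of-the-envelope estimate of my own, not from the slides): at 3.2 GHz a 400-cycle miss to main memory costs

  400 cycles / 3.2e9 cycles/s = 125 ns,

and at the 6.4 GFlop/s theoretical peak quoted above that is 6.4e9 Flop/s x 125e-9 s = 800 floating-point operations that could have been issued in the time of a single miss.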
Reminder: U2 Hardware
Westmere Xeons
Characteristics of the Westmere E5645 Xeons that form (part of) CCR's U2 cluster:

Clock Cycle              2.4 GHz
TPP                      9.6 GFlop/s (per core)
Pipeline                 14 stages
L2 Cache Size            256 kByte
L3 Cache Size            12 MByte
CPU-Memory Bandwidth     32 GByte/s (nonuniform!)
58 / 108
Reminder: U2 Hardware
Consider the penalties in lost computation for not reusing data in the various caches (using U2's Intel Westmere Xeon E5645 processors):

Memory          Miss Penalty (cycles)
L1 Cache                  4
L2 Cache                 15
Main                    110
http://www.bitmover.com/lmbench
59 / 108
- Cycle count
- Instruction count (including integer, floating point, load/store)
- Branches (including taken/not taken, mispredictions)
- Pipeline stalls (due to memory, resource conflicts)
- Cache (misses for different levels, invalidation)
- TLB (misses, invalidation)
60 / 108
High-level PAPI
- Intended for coarse-grained measurements
- Requires little (or no) setup code
- Allows only PAPI preset events
- Allows only aggregate counting (no statistical profiling)
61 / 108
Low-level PAPI
- More efficient (and functional) than the high-level API
- About 60 functions
- Thread-safe
- Supports presets and native events
62 / 108
preset or pre-defined events are those which have been considered useful by the PAPI community and developers:
http://icl.cs.utk.edu/projects/papi/presets.html
native events are those countable by the CPU's hardware. These events are highly platform specific, and you would need to consult the processor architecture manuals for the relevant native event lists
63 / 108
- Hardware counter multiplexing (time sharing hardware counters to allow more events to be monitored than can be conventionally supported)
- Processor information
- Address space information
- Memory information (static and dynamic)
- Timing functions
- Hardware event inquiry
- ... and many more
64 / 108
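As a small illustration of the processor-information side of that list (a sketch of my own using the PAPI_get_hardware_info call; the struct field names assume the PAPI 4.x header, not something shown on the slides):

/* Query basic processor information through the low-level API. */
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    const PAPI_hw_info_t *hw;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        printf("PAPI_library_init failed\n");
        exit(1);
    }
    if ((hw = PAPI_get_hardware_info()) == NULL) {
        printf("PAPI_get_hardware_info failed\n");
        exit(1);
    }
    printf("%s %s: %d total CPUs at %f MHz\n",
           hw->vendor_string, hw->model_string, hw->totalcpus, hw->mhz);
    printf("%d hardware counters available\n", PAPI_num_counters());
    return 0;
}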
For more on PAPI, including source code, documentation, presentations, and links to third-party tools that utilize PAPI, see http://icl.cs.utk.edu/projects/papi
65 / 108
Consider a simple example code to measure Flop/s using the high-level PAPI API:
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main()
{
  float real_time, proc_time, mflops;
  long_long flpops;
  float ireal_time, iproc_time, imflops;
  long_long iflpops;
  int retval;
66 / 108
  if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops)) < PAPI_OK) {
    printf("Could not initialise PAPI_flops\n");
    printf("Your platform may not support floating point operation event.\n");
    printf("retval: %d\n", retval);
    exit(1);
  }

  your_slow_code();

  if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops)) < PAPI_OK) {
    printf("retval: %d\n", retval);
    exit(1);
  }

  printf("Real_time: %f Proc_time: %f Total flpops: %lld MFLOPS: %f\n",
         real_time, proc_time, flpops, mflops);
  exit(0);
}

int your_slow_code()
{
  int i;
  double tmp = 1.1;

  for (i = 1; i < 2000; i++) {
    tmp = (tmp + 100) / i;
  }
  return 0;
}
67 / 108
N.B., PAPI needs to support the underlying hardware, and this version does not support the 32-core Intel nodes (including the front end).
68 / 108
N.B., The Altix is dead, but this is a good example of the cross-platform portability of PAPI accessing the hardware performance counters.
69 / 108
papi_avail Command
You can use papi_avail to check event availability (different CPUs support various events):
[k16n01b:~/d_papi]$ papi_avail Available events and hardware information. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PAPI Version : 4.1.4.0 Vendor string and code : GenuineIntel (1) Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44) CPU Revision : 2.000000 CPUID Info : Family: 6 Model: 44 Stepping: 2 CPU Megahertz : 2400.391113 CPU Clock Megahertz : 2400 Hdw Threads per core : 1 Cores per Socket : 6 NUMA Nodes : 2 CPU's per Node : 6 Total CPU's : 12 Number Hardware Counters : 16 Max Multiplex Counters : 512 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - The following correspond to fields in the PAPI_event_info_t structure. Name PAPI_L1_DCM PAPI_L1_ICM Code Avail Deriv Description (Note) 0x80000000 Yes No Level 1 data cache misses 0x80000001 Yes No Level 1 instruction cache misses
70 / 108
Performance API (PAPI) PAPI_L2_DCM PAPI_L2_ICM PAPI_L3_DCM PAPI_L3_ICM PAPI_L1_TCM PAPI_L2_TCM PAPI_L3_TCM PAPI_CA_SNP PAPI_CA_SHR PAPI_CA_CLN PAPI_CA_INV PAPI_CA_ITV PAPI_L3_LDM PAPI_L3_STM PAPI_BRU_IDL PAPI_FXU_IDL PAPI_FPU_IDL PAPI_LSU_IDL PAPI_TLB_DM PAPI_TLB_IM PAPI_TLB_TL PAPI_L1_LDM PAPI_L1_STM PAPI_L2_LDM PAPI_L2_STM PAPI_BTAC_M PAPI_PRF_DM PAPI_L3_DCH PAPI_TLB_SD PAPI_CSR_FAL PAPI_CSR_SUC PAPI_CSR_TOT 0x80000002 0x80000003 0x80000004 0x80000005 0x80000006 0x80000007 0x80000008 0x80000009 0x8000000a 0x8000000b 0x8000000c 0x8000000d 0x8000000e 0x8000000f 0x80000010 0x80000011 0x80000012 0x80000013 0x80000014 0x80000015 0x80000016 0x80000017 0x80000018 0x80000019 0x8000001a 0x8000001b 0x8000001c 0x8000001d 0x8000001e 0x8000001f 0x80000020 0x80000021 Yes Yes No No Yes Yes Yes No No No No No Yes No No No No No Yes Yes Yes Yes Yes Yes Yes No No No No No No No Yes No No No Yes No No No No No No No No No No No No No No No Yes No No No No No No No No No No No
Level 2 data cache misses Level 2 instruction cache misses Level 3 data cache misses Level 3 instruction cache misses Level 1 cache misses Level 2 cache misses Level 3 cache misses Requests for a snoop Requests for exclusive access to shared cache line Requests for exclusive access to clean cache line Requests for cache line invalidation Requests for cache line intervention Level 3 load misses Level 3 store misses Cycles branch units are idle Cycles integer units are idle Cycles floating point units are idle Cycles load/store units are idle Data translation lookaside buffer misses Instruction translation lookaside buffer misses Total translation lookaside buffer misses Level 1 load misses Level 1 store misses Level 2 load misses Level 2 store misses Branch target address cache misses Data prefetch cache misses Level 3 data cache hits Translation lookaside buffer shootdowns Failed store conditional instructions Successful store conditional instructions Total store conditional instructions
71 / 108
Performance API (PAPI) PAPI_MEM_SCY PAPI_MEM_RCY PAPI_MEM_WCY PAPI_STL_ICY PAPI_FUL_ICY PAPI_STL_CCY PAPI_FUL_CCY PAPI_HW_INT PAPI_BR_UCN PAPI_BR_CN PAPI_BR_TKN PAPI_BR_NTK PAPI_BR_MSP PAPI_BR_PRC PAPI_FMA_INS PAPI_TOT_IIS PAPI_TOT_INS PAPI_INT_INS PAPI_FP_INS PAPI_LD_INS PAPI_SR_INS PAPI_BR_INS PAPI_VEC_INS PAPI_RES_STL PAPI_FP_STAL PAPI_TOT_CYC PAPI_LST_INS PAPI_SYC_INS PAPI_L1_DCH PAPI_L2_DCH PAPI_L1_DCA PAPI_L2_DCA PAPI_L3_DCA 0x80000022 0x80000023 0x80000024 0x80000025 0x80000026 0x80000027 0x80000028 0x80000029 0x8000002a 0x8000002b 0x8000002c 0x8000002d 0x8000002e 0x8000002f 0x80000030 0x80000031 0x80000032 0x80000033 0x80000034 0x80000035 0x80000036 0x80000037 0x80000038 0x80000039 0x8000003a 0x8000003b 0x8000003c 0x8000003d 0x8000003e 0x8000003f 0x80000040 0x80000041 0x80000042 No No No No No No No No Yes Yes Yes Yes Yes Yes No Yes Yes No Yes Yes Yes Yes No Yes No Yes Yes No No Yes No Yes Yes No No No No No No No No No No No Yes No Yes No No No No No No No No No No No No Yes No No Yes No No Yes
Cycles Stalled Waiting for memory accesses Cycles Stalled Waiting for memory Reads Cycles Stalled Waiting for memory writes Cycles with no instruction issue Cycles with maximum instruction issue Cycles with no instructions completed Cycles with maximum instructions completed Hardware interrupts Unconditional branch instructions Conditional branch instructions Conditional branch instructions taken Conditional branch instructions not taken Conditional branch instructions mispredicted Conditional branch instructions correctly predicted FMA instructions completed Instructions issued Instructions completed Integer instructions Floating point instructions Load instructions Store instructions Branch instructions Vector/SIMD instructions (could include integer) Cycles stalled on any resource Cycles the FP unit(s) are stalled Total cycles Load/store instructions completed Synchronization instructions completed Level 1 data cache hits Level 2 data cache hits Level 1 data cache accesses Level 2 data cache accesses Level 3 data cache accesses
72 / 108
PAPI_L1_DCR PAPI_L2_DCR PAPI_L3_DCR PAPI_L1_DCW PAPI_L2_DCW PAPI_L3_DCW PAPI_L1_ICH PAPI_L2_ICH PAPI_L3_ICH PAPI_L1_ICA PAPI_L2_ICA PAPI_L3_ICA PAPI_L1_ICR PAPI_L2_ICR PAPI_L3_ICR PAPI_L1_ICW PAPI_L2_ICW PAPI_L3_ICW PAPI_L1_TCH PAPI_L2_TCH PAPI_L3_TCH PAPI_L1_TCA PAPI_L2_TCA PAPI_L3_TCA PAPI_L1_TCR PAPI_L2_TCR PAPI_L3_TCR PAPI_L1_TCW PAPI_L2_TCW PAPI_L3_TCW
0x80000043 0x80000044 0x80000045 0x80000046 0x80000047 0x80000048 0x80000049 0x8000004a 0x8000004b 0x8000004c 0x8000004d 0x8000004e 0x8000004f 0x80000050 0x80000051 0x80000052 0x80000053 0x80000054 0x80000055 0x80000056 0x80000057 0x80000058 0x80000059 0x8000005a 0x8000005b 0x8000005c 0x8000005d 0x8000005e 0x8000005f 0x80000060
No Yes Yes No Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes No No No No Yes No No Yes Yes No Yes Yes No Yes Yes
Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level
1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
data cache reads data cache reads data cache reads data cache writes data cache writes data cache writes instruction cache hits instruction cache hits instruction cache hits instruction cache accesses instruction cache accesses instruction cache accesses instruction cache reads instruction cache reads instruction cache reads instruction cache writes instruction cache writes instruction cache writes total cache hits total cache hits total cache hits total cache accesses total cache accesses total cache accesses total cache reads total cache reads total cache reads total cache writes total cache writes total cache writes
73 / 108
PAPI_FML_INS 0x80000061 No PAPI_FAD_INS 0x80000062 No PAPI_FDV_INS 0x80000063 No PAPI_FSQ_INS 0x80000064 No PAPI_FNV_INS 0x80000065 No PAPI_FP_OPS 0x80000066 Yes PAPI_SP_OPS 0x80000067 Yes scaled single precision vector PAPI_DP_OPS 0x80000068 Yes scaled double precision vector PAPI_VEC_SP 0x80000069 Yes PAPI_VEC_DP 0x8000006a Yes - - - - - - - - - - - - - - Of 107 possible events, 57 are avail.c
No Floating point multiply instructions No Floating point add instructions No Floating point divide instructions No Floating point square root instructions No Floating point inverse instructions Yes Floating point operations Yes Floating point operations; optimized to count operations Yes Floating point operations; optimized to count operations No Single precision vector/SIMD instructions No Double precision vector/SIMD instructions - - - - - - - - - - - - - - - - - - - - -available, of which 14 are derived. PASSED
74 / 108
papi_event_chooser Command
Not all events can be simultaneously monitored (at least not without multiplexing):
[k16n01b:~]$ papi_event_chooser PRESET PAPI_FP_OPS Event Chooser: Available events which can be added with given events. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PAPI Version : 4.1.4.0 Vendor string and code : GenuineIntel (1) Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44) CPU Revision : 2.000000 CPUID Info : Family: 6 Model: 44 Stepping: 2 CPU Megahertz : 2400.391113 CPU Clock Megahertz : 2400 Hdw Threads per core : 1 Cores per Socket : 6 NUMA Nodes : 2 CPU's per Node : 6 Total CPU's : 12 Number Hardware Counters : 16 Max Multiplex Counters : 512 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Name PAPI_L1_DCM PAPI_L1_ICM PAPI_L2_DCM PAPI_L2_ICM Code Deriv Description (Note) 0x80000000 No Level 1 data cache misses 0x80000001 No Level 1 instruction cache misses 0x80000002 Yes Level 2 data cache misses 0x80000003 No Level 2 instruction cache misses
75 / 108
Performance API (PAPI) PAPI_L1_TCM PAPI_L2_TCM PAPI_L3_TCM PAPI_L3_LDM PAPI_TLB_DM PAPI_TLB_IM PAPI_TLB_TL PAPI_L1_LDM PAPI_L1_STM PAPI_L2_LDM PAPI_L2_STM PAPI_BR_UCN PAPI_BR_CN PAPI_BR_TKN PAPI_BR_NTK PAPI_BR_MSP PAPI_BR_PRC PAPI_TOT_IIS PAPI_TOT_INS PAPI_FP_INS PAPI_LD_INS PAPI_SR_INS PAPI_BR_INS PAPI_RES_STL PAPI_TOT_CYC PAPI_LST_INS PAPI_L2_DCH PAPI_L2_DCA PAPI_L3_DCA PAPI_L2_DCR PAPI_L3_DCR PAPI_L2_DCW PAPI_L3_DCW 0x80000006 0x80000007 0x80000008 0x8000000e 0x80000014 0x80000015 0x80000016 0x80000017 0x80000018 0x80000019 0x8000001a 0x8000002a 0x8000002b 0x8000002c 0x8000002d 0x8000002e 0x8000002f 0x80000031 0x80000032 0x80000034 0x80000035 0x80000036 0x80000037 0x80000039 0x8000003b 0x8000003c 0x8000003f 0x80000041 0x80000042 0x80000044 0x80000045 0x80000047 0x80000048 Yes No No No No No Yes No No No No No No No Yes No Yes No No No No No No No No Yes Yes No Yes No No No No
Level 1 cache misses Level 2 cache misses Level 3 cache misses Level 3 load misses Data translation lookaside buffer misses Instruction translation lookaside buffer misses Total translation lookaside buffer misses Level 1 load misses Level 1 store misses Level 2 load misses Level 2 store misses Unconditional branch instructions Conditional branch instructions Conditional branch instructions taken Conditional branch instructions not taken Conditional branch instructions mispredicted Conditional branch instructions correctly predicted Instructions issued Instructions completed Floating point instructions Load instructions Store instructions Branch instructions Cycles stalled on any resource Total cycles Load/store instructions completed Level 2 data cache hits Level 2 data cache accesses Level 3 data cache accesses Level 2 data cache reads Level 3 data cache reads Level 2 data cache writes Level 3 data cache writes
76 / 108
PAPI_L1_ICH 0x80000049 No Level 1 instruction cache hits PAPI_L2_ICH 0x8000004a No Level 2 instruction cache hits PAPI_L1_ICA 0x8000004c No Level 1 instruction cache accesses PAPI_L2_ICA 0x8000004d No Level 2 instruction cache accesses PAPI_L3_ICA 0x8000004e No Level 3 instruction cache accesses PAPI_L1_ICR 0x8000004f No Level 1 instruction cache reads PAPI_L2_ICR 0x80000050 No Level 2 instruction cache reads PAPI_L3_ICR 0x80000051 No Level 3 instruction cache reads PAPI_L2_TCH 0x80000056 Yes Level 2 total cache hits PAPI_L2_TCA 0x80000059 No Level 2 total cache accesses PAPI_L3_TCA 0x8000005a No Level 3 total cache accesses PAPI_L2_TCR 0x8000005c Yes Level 2 total cache reads PAPI_L3_TCR 0x8000005d Yes Level 3 total cache reads PAPI_L2_TCW 0x8000005f No Level 2 total cache writes PAPI_L3_TCW 0x80000060 No Level 3 total cache writes PAPI_SP_OPS 0x80000067 Yes Floating point operations; optimized to count scaled single precision vector operations PAPI_DP_OPS 0x80000068 Yes Floating point operations; optimized to count scaled double precision vector operations PAPI_VEC_SP 0x80000069 No Single precision vector/SIMD instructions PAPI_VEC_DP 0x8000006a No Double precision vector/SIMD instructions - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Total events reported: 56 event_chooser.c PASSED
77 / 108
PAPI Examples
In this section we will work through a few simple examples of using the PAPI API, mostly focused on using the high-level API. And we will steer clear of native events, and leave those to tool developers.
79 / 108
Include files for constants and routine interfaces:
- C: papi.h
- F77: f77papi.h
- F90: f90papi.h
80 / 108
The C interfaces have the form:

  (return type) PAPI_function_name(arg1, arg2, ...)

and the Fortran interfaces:

  PAPIF_function_name(arg1, arg2, ..., check)

Note that the check parameter is the same type and value as the C return value.
81 / 108
The following table shows the relation between the C and Fortran types used in PAPI:
Pseudo-type       Fortran type                      Description
C_INT             INTEGER                           Default Integer type
C_FLOAT           REAL                              Default Real type
C_LONG_LONG       INTEGER*8                         Extended size integer
C_STRING          CHARACTER*(PAPI_MAX_STR_LEN)      Fortran string
C_INT FUNCTION    EXTERNAL INTEGER FUNCTION         Fortran function returning integer result
82 / 108
83 / 108
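The listing below relies on a few definitions that are not reproduced here; a sketch of what they might look like (my reconstruction, following the conventions of the standard PAPI high-level example, so treat the details as assumptions):

/* Definitions assumed by the high-level counting example that follows. */
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

#define NUM_EVENTS 2
#define THRESHOLD 10000
#define ERROR_RETURN(retval) { fprintf(stderr, "Error %d %s:line %d\n", \
                               retval, __FILE__, __LINE__); exit(retval); }

/* hypothetical stand-in work routines counted in the example below */
void computation_add(void)
{
    volatile double a = 0.0;
    int i;
    for (i = 0; i < THRESHOLD; i++) a = a + (double)i;
}

void computation_mult(void)
{
    volatile double m = 1.0;
    int i;
    for (i = 1; i < THRESHOLD; i++) m = m * 1.000001;
}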
int main()
{
  /* Declaring and initializing the event set with the presets */
  int Events[2] = {PAPI_TOT_INS, PAPI_TOT_CYC};
  /* The length of the events array should be no longer than the value
     returned by PAPI_num_counters. */

  /* declaring placeholder for no of hardware counters */
  int num_hwcntrs = 0;
  int retval;
  char errstring[PAPI_MAX_STR_LEN];

  /* This is going to store our list of results */
  long_long values[NUM_EVENTS];

  /***************************************************************************
  * This part initializes the library and compares the version number of the *
  * header file, to the version of the library, if these don't match then it *
  * is likely that PAPI won't work correctly. If there is an error, retval   *
  * keeps track of the version number.                                       *
  ***************************************************************************/
  if ((retval = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT) {
    fprintf(stderr, "Error: %d %s\n", retval, errstring);
    exit(1);
  }
84 / 108
  /***************************************************************************
  * PAPI_num_counters returns the number of hardware counters the platform   *
  * has or a negative number if there is an error                            *
  ***************************************************************************/
  if ((num_hwcntrs = PAPI_num_counters()) < PAPI_OK) {
    printf("There are no counters available.\n");
    exit(1);
  }

  printf("There are %d counters in this system\n", num_hwcntrs);

  /***************************************************************************
  * PAPI_start_counters initializes the PAPI library (if necessary) and      *
  * starts counting the events named in the events array. This function      *
  * implicitly stops and initializes any counters running as a result of     *
  * a previous call to PAPI_start_counters.                                  *
  ***************************************************************************/
  if ((retval = PAPI_start_counters(Events, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("\nCounter Started:\n");

  /* Your code goes here */
  computation_add();
85 / 108
  /***************************************************************************
  * PAPI_read_counters reads the counter values into values array            *
  ***************************************************************************/
  if ((retval = PAPI_read_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("Read successfully\n");
  printf("The total instructions executed for addition are %lld\n", values[0]);
  printf("The total cycles used are %lld\n", values[1]);

  printf("\nNow we try to use PAPI_accum to accumulate values\n");

  /* Do some computation here */
  computation_add();

  /***************************************************************************
  * What PAPI_accum_counters does is it adds the running counter values      *
  * to what is in the values array. The hardware counters are reset and      *
  * left running after the call.                                             *
  ***************************************************************************/
  if ((retval = PAPI_accum_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("We did an additional %d times addition!\n", THRESHOLD);
  printf("The total instructions executed for addition are %lld\n", values[0]);
  printf("The total cycles used are %lld\n", values[1]);
86 / 108
  /***************************************************************************
  * Stop counting events (this reads the counters as well as stops them)     *
  ***************************************************************************/
  printf("\nNow we try to do some multiplications\n");
  computation_mult();

  /******************** PAPI_stop_counters ***********************************/
  if ((retval = PAPI_stop_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("The total instruction executed for multiplication are %lld\n",
         values[0]);
  printf("The total cycles used are %lld\n", values[1]);

  exit(0);
}
87 / 108
88 / 108
PAPI Initialization
The preceding example used PAPI_library_init to initialize PAPI, which is also used for the low-level API, but you can also use PAPI_num_counters, PAPI_start_counters, or one of the rate calls, PAPI_flips, PAPI_flops, or PAPI_ipc. Events are counted, as we saw in the example, using PAPI_accum_counters, PAPI_read_counters, and PAPI_stop_counters. Let's look at an even simpler example just using one of the rate counters.
89 / 108
For something a little different we can look at our old friend, matrix multiplication, this time in Fortran:
! A simple example for the use of PAPI, the number of flops you should
! get is about INDEX^3 on machines that consider add and multiply one flop
! such as SGI, and 2*(INDEX^3) that don't consider it 1 flop such as INTEL
! Kevin London

program flops
  implicit none
  include "f90papi.h"

  integer, parameter :: i8=SELECTED_INT_KIND(16)  ! integer*8
  integer, parameter :: index=1000
  real :: matrixa(index,index), matrixb(index,index), mres(index,index)
  real :: proc_time, mflops, real_time
  integer(kind=i8) :: flpins
  integer :: i, j, k, retval
90 / 108
  retval = PAPI_VER_CURRENT
  CALL PAPIf_library_init(retval)
  if (retval .NE. PAPI_VER_CURRENT) then
    print *, 'Failure in PAPI_library_init: ', retval
  end if

  CALL PAPIf_query_event(PAPI_FP_OPS, retval)
  if (retval .NE. PAPI_OK) then
    print *, 'Sorry, no PAPI_FP_OPS event: ', PAPI_ENOEVNT
  end if

  ! Initialize the Matrix arrays
  do i=1,index
    do j=1,index
      matrixa(i,j) = i+j
      matrixb(i,j) = j-i
      mres(i,j) = 0.0
    end do
  end do

  ! Setup PAPI library and begin collecting data from the counters
  call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
  if (retval .NE. PAPI_OK) then
    print *, 'Failure on PAPIf_flops: ', retval
  end if
91 / 108
  ! Matrix-Matrix Multiply
  do i=1,index
    do j=1,index
      do k=1,index
        mres(i,j) = mres(i,j) + matrixa(i,k)*matrixb(k,j)
      end do
    end do
  end do

  ! Collect the data into the Variables passed in
  call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
  if (retval .NE. PAPI_OK) then
    print *, 'Failure on PAPIf_flops: ', retval
  end if

  print *, 'Real_time: ', real_time
  print *, 'Proc_time: ', proc_time
  print *, 'Total flpins: ', flpins
  print *, 'MFLOPS: ', mflops

end program flops
92 / 108
93 / 108
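The other rate calls follow the same start/read pattern as PAPIf_flops above; a minimal C sketch of my own (not taken from the slides) using PAPI_ipc, which reports instructions completed and instructions per cycle:

/* Measure a code section with the PAPI_ipc rate call. */
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    float rtime, ptime, ipc;
    long_long ins;
    volatile double s = 0.0;
    int i, retval;

    /* the first call starts the counters */
    if ((retval = PAPI_ipc(&rtime, &ptime, &ins, &ipc)) < PAPI_OK) {
        printf("PAPI_ipc failed: %d\n", retval);
        exit(1);
    }

    for (i = 1; i < 10000000; i++)   /* code section to be measured */
        s += 1.0 / i;

    /* the second call reads elapsed real/process time, instructions, IPC */
    if ((retval = PAPI_ipc(&rtime, &ptime, &ins, &ipc)) < PAPI_OK) {
        printf("PAPI_ipc failed: %d\n", retval);
        exit(1);
    }
    printf("real %f s  proc %f s  instructions %lld  IPC %f\n",
           rtime, ptime, ins, ipc);
    return 0;
}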
Low-level API
Low-level API
The low-level API is primarily intended for experienced application programmers and tool developers. It manages hardware events in user-defined groups called event sets, and can use both preset and native events. The low-level API can also interrogate the hardware and determine memory sizes of the executable itself. The low-level API can also be used for multiplexing, in which more (virtual) counters can be used than the underlying hardware supports, by timesharing the available (physical) hardware counters.
94 / 108
Low-level API
95 / 108
Low-level API
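The code that follows picks up partway through a counting example; a sketch of the setup it assumes (my reconstruction following standard PAPI usage, not the original slide content) might look like:

/* Initialize the library, create an event set, and add one preset event.
 * The names NUM_FLOPS, handle_error, and do_flops match those used in the
 * continuation below; do_flops lives in tests/do_loops.c in the PAPI
 * source distribution.                                                    */
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

#define NUM_FLOPS 10000

extern void do_flops(int n);          /* from tests/do_loops.c */

void handle_error(int retval)
{
    printf("PAPI error %d: %s\n", retval, PAPI_strerror(retval));
    exit(1);
}

int main()
{
    int EventSet = PAPI_NULL;
    long_long values[1];

    /* Initialize the PAPI library */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        handle_error(1);

    /* Create an empty Event Set */
    if (PAPI_create_eventset(&EventSet) != PAPI_OK) handle_error(1);

    /* Add Total Instructions Executed to the Event Set */
    if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK) handle_error(1);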
    /* Start counting events in the Event Set */
    if (PAPI_start(EventSet) != PAPI_OK) handle_error(1);

    /* Defined in tests/do_loops.c in the PAPI source distribution */
    do_flops(NUM_FLOPS);

    /* Read the counting events in the Event Set */
    if (PAPI_read(EventSet, values) != PAPI_OK) handle_error(1);
    printf("After reading the counters: %lld\n", values[0]);

    /* Reset the counting events in the Event Set */
    if (PAPI_reset(EventSet) != PAPI_OK) handle_error(1);

    do_flops(NUM_FLOPS);

    /* Add the counters in the Event Set */
    if (PAPI_accum(EventSet, values) != PAPI_OK) handle_error(1);
    printf("After adding the counters: %lld\n", values[0]);

    do_flops(NUM_FLOPS);

    /* Stop the counting of events in the Event Set */
    if (PAPI_stop(EventSet, values) != PAPI_OK) handle_error(1);
    printf("After stopping the counters: %lld\n", values[0]);
}
96 / 108
PAPI in Parallel
PAPI in Parallel
Threads: PAPI_thread_init enables PAPI's thread support, and should be called immediately after PAPI_library_init. MPI codes are treated very simply - each process has its own address space, and potentially its own hardware counters.
97 / 108
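A minimal sketch of enabling thread support in an OpenMP code (my own example, not from the slides; PAPI_thread_init takes a function returning a unique id for the calling thread, and omp_get_thread_num is the usual choice for OpenMP):

/* Per-thread counting with the (thread-safe) low-level API. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "papi.h"

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);

    /* called immediately after PAPI_library_init, per the slide above */
    if (PAPI_thread_init((unsigned long (*)(void)) omp_get_thread_num) != PAPI_OK)
        exit(1);

    #pragma omp parallel
    {
        /* each thread counts its own events with a private event set */
        int EventSet = PAPI_NULL;
        long_long count;
        volatile double s = 0.0;
        int i;

        if (PAPI_create_eventset(&EventSet) == PAPI_OK &&
            PAPI_add_event(EventSet, PAPI_TOT_INS) == PAPI_OK &&
            PAPI_start(EventSet) == PAPI_OK) {

            for (i = 1; i < 1000000; i++) s += 1.0 / i;   /* work to measure */

            PAPI_stop(EventSet, &count);
            printf("thread %d: %lld instructions\n",
                   omp_get_thread_num(), count);
        }
    }
    return 0;
}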
High-level Tools
High-level Tools
There are a bunch of open-source high-level tools that build on some of the simple approaches that we have been talking about. General characteristics found in most (not necessarily all):
- Ability to generate and view MPI trace files, leveraging MPI's built-in profiling interface
- Ability to do statistical profiling (à la gprof) and code viewing for identifying hotspots
- Ability to access performance counters, leveraging PAPI
99 / 108
High-level Tools
Popular Examples
Tool Examples
A list of such high-level tool examples (not exhaustive): TAU, Tuning and Analysis Utility,
http://www.cs.uoregon.edu/Research/tau/home.php
100 / 108
High-level Tools
Example: IPM
IPM is relatively simple to install and use, so we can easily walk through our favorite example. Note that IPM does:
- MPI
- PAPI
- I/O profiling
- Memory
- Timings: wall, user, and system
101 / 108
High-level Tools
102 / 108
High-level Tools
... and the output is a big XML file plus some useful output to standard output:
[k07n14:~/d_laplace/d_ipm]$ file jonesm.1318862245.001449.0
jonesm.1318862245.001449.0: XML document text
[k07n14:~/d_laplace/d_ipm]$ less subMPIP.out
...
##IPMv0.983####################################################################
#
# command : ./laplace_mpi (completed)
# host    : d16n03/x86_64_Linux       mpi_tasks : 16 on 2 nodes
# start   : 10/17/11/10:37:25         wallclock : 116.170005 sec
# stop    : 10/17/11/10:39:21         %comm     : 13.94
# gbytes  : 2.24606e+00 total         gflop/sec : 5.02520e+00 total
#
##############################################################################
# region  : *    [ntasks] = 16
#
103 / 108
High-level Tools
#                   [total]        <avg>          min          max
# entries                16            1            1            1
# wallclock         1853.71      115.857      115.816       116.17
# user              1853.09      115.818      115.707      115.936
# system            2.18066     0.136291     0.071989     0.198969
# mpi               259.152       16.197      11.3859      19.1157
# %comm                          13.9425      9.82914      16.5048
# gflop/sec          5.0252     0.314075     0.311741     0.319497
# gbytes            2.24606     0.140379     0.138138     0.170021
#
# PAPI_FP_OPS   5.83778e+11  3.64861e+10  3.62149e+10   3.7116e+10
# PAPI_FP_INS    5.8276e+11  3.64225e+10  3.62144e+10  3.69079e+10
# PAPI_DP_OPS   5.82764e+11  3.64228e+10  3.62144e+10  3.69079e+10
# PAPI_VEC_DP   4.00803e+06       250501            0  4.00803e+06
#
#                    [time]      [calls]       <%mpi>      <%wall>
# MPI_Allreduce     243.838       381520        94.09        13.15
# MPI_Sendrecv      14.9598       763040         5.77         0.81
# MPI_Send         0.339084           15         0.13         0.02
# MPI_Recv        0.0143731           15         0.01         0.00
# MPI_Bcast      0.00124932           16         0.00         0.00
# MPI_Comm_rank 1.58967e-05           16         0.00         0.00
# MPI_Comm_size 8.01496e-06           16         0.00         0.00
###############################################################################
104 / 108
High-level Tools
#!/bin/bash
if [ $# -ne 1 ]; then
  echo "Usage: $0 xml_filename"
  exit
fi
XMLFILE=$1
export IPM_KEYFILE=/projects/jonesm/ipm/src/ipm/ipm_key
export PATH=${PATH}:/projects/jonesm/ipm/src/ipm/bin
/projects/jonesm/ipm/src/ipm/bin/ipm_parse -html $XMLFILE
105 / 108
High-level Tools
[u2:~/d_laplace/d_ipm]$ ./genhtml.sh jonesm.1318862245.001449.0
# data_acquire = 0 sec
# data_workup = 0 sec
# mpi_pie = 1 sec
# task_data = 0 sec
# load_bal = 0 sec
# time_stack = 0 sec
# mpi_stack = 1 sec
# mpi_buff = 0 sec
# switch+mem = 0 sec
# topo_tables = 0 sec
# topo_data = 0 sec
# topo_time = 0 sec
# html_all = 2 sec
# html_regions = 0 sec
# html_nonregion = 1 sec
[u2:~/d_laplace/d_ipm]$ ls -l \
laplace_mpi_16_jonesm.1318862245.001449.0_ipm_1159896.d15n41.ccr.buffalo.edu/
total 346
-rw-r--r-- 1 jonesm ccrstaff   994 Oct 17 16:07 dev.html
-rw-r--r-- 1 jonesm ccrstaff   104 Oct 17 16:07 env.html
-rw-r--r-- 1 jonesm ccrstaff   347 Oct 17 16:07 exec.html
-rw-r--r-- 1 jonesm ccrstaff   451 Oct 17 16:07 hostlist.html
drwxr-xr-x 2 jonesm ccrstaff   930 Oct 17 16:07 img
-rw-r--r-- 1 jonesm ccrstaff 10550 Oct 17 16:07 index.html
-rw-r--r-- 1 jonesm ccrstaff   387 Oct 17 16:07 map_adjacency.txt
-rw-r--r-- 1 jonesm ccrstaff  8961 Oct 17 16:07 map_calls.txt
-rw-r--r-- 1 jonesm ccrstaff  1452 Oct 17 16:07 map_data.txt
drwxr-xr-x 2 jonesm ccrstaff   803 Oct 17 16:07 pl
-rw-r--r-- 1 jonesm ccrstaff  2620 Oct 17 16:07 task_data
[k07n14:~/d_laplace/d_ipm]$ tar czf my-ipm-files.tgz \
laplace_mpi_16_jonesm.1318862245.001449.0_ipm_1159896.d15n41.ccr.buffalo.edu/
[k07n14:~/d_laplace/d_ipm]$ ls -l my-ipm-files.tgz
-rw-r--r-- 1 jonesm ccrstaff 71509 Oct 17 16:48 my-ipm-files.tgz
106 / 108
High-level Tools
107 / 108
High-level Tools
Summary
Summary of high-level tools:
- IPM is pretty easy to use, provides some good functionality
- TAU and Open|SpeedShop have steeper learning curves, much more functionality
108 / 108