
Application Performance Tuning

M. D. Jones, Ph.D.
Center for Computational Research University at Buffalo State University of New York

High Performance Computing I, 2012


Introduction to Performance Engineering in HPC

Performance Fundamentals

Performance Foundations

Three pillars of performance optimization:

Algorithmic - choose the most effective algorithm that you can for the problem of interest

Serial Efficiency - optimize the code to run efficiently in a non-parallel environment

Parallel Efficiency - effectively use multiple processors to achieve a reduction in execution time, or equivalently, to solve a proportionately larger problem


Algorithmic Efficiency

Choose the best algorithm before you start coding (recall that good planning is an essential part of writing good software):

Running on a large number of processors? Choose an algorithm that scales well with increasing processor count

Running a large system (mesh points, particle count, etc.)? Choose an algorithm that scales well with system size

If you are going to run on a massively parallel machine, plan from the beginning on how you intend to decompose the problem (it may save you a lot of time later)


Performance Baseline

Serial Efficiency

Getting efficient code in parallel is made much more difficult if you have not optimized the sequential code, and in fact this can lead to a misleading picture of parallel performance. Recall that our definition of parallel speedup,

S(N_p) = T_s / T_p(N_p),

involves the time T_s for an optimal sequential implementation (not just T_p(1)!).
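A quick numerical illustration of why the optimal serial time matters (the numbers here are invented for the example): suppose an untuned serial code takes 200 s, a tuned serial code takes T_s = 100 s, and the parallel code on N_p = 16 processes takes T_p(16) = 10 s. Then

S(16) = T_s / T_p(16) = 100 / 10 = 10,

whereas quoting speedup against the untuned serial time would report 200 / 10 = 20, overstating the parallel performance by a factor of two.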


Establish a Performance Baseline

Steps to establishing a baseline for your own performance expectations:

Choose a representative problem (or better still a suite of problems) that can be run under identical circumstances on a multitude of platforms/compilers - requires portable code!

How fast is fast? You can utilize hardware performance counters to measure actual code performance

Profile, profile, and then profile some more ... to find bottlenecks and spend your time more effectively in optimizing code


Pitfalls in Parallel Performance

Parallel Performance Trap

Pitfalls when measuring the performance of parallel codes: for many, speedup or linear scalability is the ultimate goal. This goal is incomplete - a terribly inefficient code can scale well, but actually deliver poor efficiency. For example, consider a simple Monte Carlo based code that uses the most rudimentary uniform sampling (i.e. no importance sampling) - this can be made to scale perfectly in parallel, but the algorithmic efficiency (measured perhaps by the delivered variance per cpu-hour) is quite low.
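To put a number on the Monte Carlo example (a standard result, not specific to any particular code): with uniform sampling the statistical error falls off only as sigma/sqrt(N), so doubling the processor count doubles the samples N per hour and yields perfect speedup, yet shrinks the error by only a factor of sqrt(2) ~ 1.4. An importance-sampled algorithm that reduces sigma itself delivers far more accuracy per cpu-hour at exactly the same parallel scalability.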


Simple Profiling Tools

Timers

time Command

Note that this is not the time built-in function in many shells (bash and tcsh included), but instead the one located in /usr/bin. This command is quite useful for getting an overall picture of code performance. The default output format:
%Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
%Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps

and using the -p option:


real %e
user %U
sys %S


time Example

[bono:~/d_laplace]$ /usr/bin/time ./laplace_s
 Max value in sol :   0.999992327961218
 Min value in sol :   8.742278000372475E-008
75.82user 0.00system 1:17.72elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+913minor)pagefaults 0swaps
[bono:~/d_laplace]$ /usr/bin/time -p ./laplace_s
 Max value in sol :   0.999992327961218
 Min value in sol :   8.742278000372475E-008
real 75.73
user 74.68
sys 0.00


time MPI Example


[bono:~/d_laplace]$ /usr/bin/time mpirun -np 2 ./laplace_mpi
 Max value in sol :   0.999992327961218
 Min value in sol :   8.742278000372475E-008
Writing logfile....
Finished writing logfile.
28.43user 1.54system 0:31.95elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+14920minor)pagefaults 0swaps
[bono:~/d_laplace]$ qsub -q debug -lnodes=2:ppn=2,walltime=00:30:00 -I
qsub: waiting for job 577255.bono.ccr.buffalo.edu to start
#############################PBS Prologue##############################
PBS prologue script run on host c15n32 at Fri Sep 28 15:03:05 EDT 2007
PBSTMPDIR is /scratch/577255.bono.ccr.buffalo.edu
/usr/bin/time mpiexec ./laplace_mpi
 Max value in sol :   0.999992327961218
 Min value in sol :   8.742278000372475E-008
0.01user 0.01system 0:30.82elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1416minor)pagefaults 0swaps

OSC's mpiexec will report accurate timings in the (optional) email report, as it does not rely on rsh/ssh to launch tasks (but Intel MPI does, so in that case you will see the result of timing the mpiexec shell script, not the MPI code).

Code Section Timing (Calipers)


Timing sections of code requires a bit more work on the part of the programmer, but there are reasonably portable means of doing so:

Routine              Type       Resolution
times                user/sys   ms
gettimeofday         wall       µs
clock_gettime        wall       ns
system_clock (f90)   wall       system-dependent
cpu_time (f95)       cpu        compiler-dependent
MPI_Wtime*           wall       system-dependent
OMP_GET_WTIME*       wall       system-dependent

Generally I prefer the MPI and OpenMP timing calls whenever I can use them (*the MPI and OpenMP specifications call for their intrinsic timers to be high precision).
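A minimal caliper sketch in C using clock_gettime (assumes a POSIX system; the loop is just a stand-in for the section you want to time):

#include <stdio.h>
#include <time.h>

/* wall-clock seconds from the POSIX monotonic clock */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9*(double)ts.tv_nsec;
}

int main(void)
{
    double t0, t1, sum = 0.0;
    long i;

    t0 = wall_seconds();
    for (i = 1; i < 10000000L; i++)    /* section of code being timed */
        sum += 1.0/(double)i;
    t1 = wall_seconds();

    printf("section took %.6f s (sum=%f)\n", t1 - t0, sum);
    return 0;
}

(Older glibc versions may need -lrt at link time for clock_gettime; in an MPI or OpenMP code the same pattern applies with MPI_Wtime() or omp_get_wtime() in place of wall_seconds().)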

More information on code section timers (and code for doing so): LLNL Performance Tools:
https://computing.llnl.gov/tutorials/performance_tools/#gettimeofday

Stopwatch (nice F90 module, but you need to supply a low-level function for accessing a timer): http://math.nist.gov/StopWatch


gprof

GNU Tools: gprof


Tool that we used briefly before:

Generic GNU profiler

Requires recompiling code with the -pg option

Running the subsequently instrumented code produces gmon.out, to be read by gprof

Use the environment variable GMON_OUT_PREFIX to specify a new gmon.out prefix to which the process ID will be appended (especially useful for parallel runs) - this is a largely undocumented feature ...

Line-level profiling is possible, as we will see in the following example


gprof Shortcomings

Shortcomings of gprof (which apply also to any statistical profiling tool):

Need to recompile to instrument the code

Instrumentation can affect the statistics in the profile

Overhead can significantly increase the running time

Compiler optimization can be affected by instrumentation


Types of gprof Profiles

gprof profiles come in three types:

1 Flat Profile: shows how much time your program spent in each function, and how many times that function was called

2 Call Graph: for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function

3 Basic-block: requires compilation with the -a flag (supported only by GNU?) - enables gprof to construct an annotated source code listing showing how many times each line of code was executed


gprof example
g77 -I. -O3 -ffast-math -g -pg -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o
[jonesm@bono ~/d_bench]$ qsub -q debug -lnodes=1:ppn=2,walltime=00:30:00 -I
[jonesm@c16n30 ~/d_bench]$ ./rp
 Enter runid:(<=9 chars) short
...
... skip copious amount of standard output ...
...
[jonesm@bono ~/d_bench]$ gprof rp gmon.out >& out.gprof
[jonesm@bono ~/d_bench]$ less out.gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative    self               self     total
 time    seconds  seconds      calls  s/call   s/call  name
89.74     123.23   123.23     204008    0.00     0.00  triwfns_
 6.96     132.79     9.56          1    9.56   137.22  MAIN__
 1.18     134.41     1.62     200004    0.00     0.00  en_gde__
 1.05     135.86     1.44     204002    0.00     0.00  evol_
 0.71     136.83     0.97   14790551    0.00     0.00  ranf_
 0.27     137.20     0.37     204008    0.00     0.00  gdewfn_
...


[jonesm@bono ~/d_bench]$ gprof -line rp gmon.out >& out.gprof.line
[jonesm@bono ~/d_bench]$ less out.gprof.line
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative    self            self     total
 time    seconds  seconds  calls  ns/call  ns/call  name
17.45      23.96    23.96                           triwfns_ (adwfns.f:129 @ 403c94)
14.46      43.82    19.86                           triwfns_ (adwfns.f:130 @ 403ce6)
12.87      61.50    17.68                           triwfns_ (adwfns.f:171 @ 404755)
12.31      78.41    16.91                           triwfns_ (adwfns.f:172 @ 4047e2)
 0.67      79.33     0.92                           triwfns_ (adwfns.f:130 @ 403cd6)
 0.59      80.14     0.82                           MAIN__ (cc4WTuQH.f:308 @ 4070c6)
 0.51      80.84     0.70                           MAIN__ (cc4WTuQH.f:304 @ 4072da)


More gprof Information

More gprof documentation: gprof GNU Manual:


http://www.gnu.org/software/binutils/manual/gprof-2.9.1/gprof.html

gprof man page: man gprof


pgprof

PGI Tools: pgprof

PGI tools also have profiling capabilities (c.f. man pgf95)

Graphical profiler: pgprof


pgprof example

pgf77 -tp p7-64 -fastsse -g77libs -Mprof=lines -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o
[jonesm@bono ~/d_bench]$ ./rp
 Enter runid:(<=9 chars) short
...
... skip copious amount of standard output ...
...
[jonesm@bono ~/d_bench]$ pgprof -exe ./rp pgprof.out


pgprof example [screenshot]


N.B. you can also use the -text option to pgprof to make it behave more like gprof

See the PGI Tools Guide for more information (should be a PDF copy in $PGI/doc)


Parallel Profiling Tools

Statistical MPI Profiling With mpiP

mpiP: Statistical MPI Profiling

http://mpip.sourceforge.net

Not a tracing tool, but a lightweight interface to accumulate statistics using the MPI profiling interface

Quite useful in conjunction with a tracefile analysis (e.g. using jumpshot)

Installed on CCR systems - see module avail for mpiP availability and location


mpiP Compilation

To use mpiP you need to:

Add a -g flag to add symbols (this will allow mpiP to access the source code symbols and line numbers)

Link in the necessary mpiP profiling library and the binary utility libraries for actually decoding symbols (there is a trick that you can use most of the time to avoid having to link with mpiP, though)

Compilation examples (from U2) follow ...


mpiP Runtime Flags


You can set various mpiP runtime flags (e.g. export MPIP="-t 10.0 -k 2"):

Option   Description                                                         Default
-c       Generate concise version of report, omitting callsite
         process-specific detail.
-e       Print report data using floating-point format
-f dir   Record output file in directory <dir>                               .
-g       Enable mpiP debug mode                                              disabled
-k n     Sets callsite stack traceback depth to <n>                          1
-n       Do not truncate full pathname of filename in callsites
-o       Disable profiling at initialization. Application must enable
         profiling with MPI_Pcontrol()
-s n     Set hash table size to <n>                                          256
-t x     Set print threshold for report, where <x> is the MPI percentage
         of time for each callsite                                           0.0
-v       Generates both concise and verbose report output
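For the -o option above, a small sketch of the usual pattern (the phases named in the comments are placeholders; MPI_Pcontrol() is standard MPI, and mpiP treats an argument of 1 as "enable" and 0 as "disable"):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* ... setup and I/O we do not want in the mpiP report ... */

    MPI_Pcontrol(1);   /* start accumulating mpiP statistics here    */
    /* ... communication-heavy solver phase to be profiled ...       */
    MPI_Pcontrol(0);   /* stop accumulating before the cleanup phase */

    MPI_Finalize();
    return 0;
}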


mpiP Example Output

From an older version of mpiP, but still almost entirely the same - this one links directly with mpiP; first compile:

mpicc -g -o atlas aij2_basis.o analyze.o atlas.o barrier.o byteflip.o \
  chordgn2.o cstrings.o io2.o map.o mutils.o numrec.o paramods.o proj.o \
  projAtlas.o sym2.o util.o -lm -L/Projects/CCR/jonesm/mpiP-2.8.2/gnu/ch_gm/lib \
  -lmpiP -lbfd -liberty -lm

then run (in this case using 16 processors) and examine the output file:

[jonesm@joplin d_derenzo]$ ls *.mpiP
atlas.gcc3mpipapimpiP.16.20578.1.mpiP
[jonesm@joplin d_derenzo]$ less atlas.gcc3mpipapimpiP.16.20578.1.mpiP
:
:


@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @

mpiP Command : . / a t l a s . gcc3mpipapimpiP s t u d y . i n i 0 Version : 2.8.2 MPIP B u i l d date : Jun 29 2005 , 1 4: 5 3 : 4 1 S t a r t time : 2005 06 29 15 : 1 8 : 5 2 Stop t i m e : 2005 06 29 15 : 2 8 : 3 4 Timer Used : gettimeofday MPIP env v a r : [ null ] C o l l e c t o r Rank : 0 C o l l e c t o r PID : 20578 F i n a l Output D i r : . MPI Task Assignment : 0 bb18n17 . c c r . b u f f a l o . edu MPI Task Assignment : 1 bb18n17 . c c r . b u f f a l o . edu MPI Task Assignment : 2 bb18n16 . c c r . b u f f a l o . edu MPI Task Assignment : 3 bb18n16 . c c r . b u f f a l o . edu MPI Task Assignment : 4 bb18n15 . c c r . b u f f a l o . edu MPI Task Assignment : 5 bb18n15 . c c r . b u f f a l o . edu MPI Task Assignment : 6 bb18n14 . c c r . b u f f a l o . edu MPI Task Assignment : 7 bb18n14 . c c r . b u f f a l o . edu MPI Task Assignment : 8 bb18n13 . c c r . b u f f a l o . edu MPI Task Assignment : 9 bb18n13 . c c r . b u f f a l o . edu MPI Task Assignment : 10 bb18n12 . c c r . b u f f a l o . edu MPI Task Assignment : 11 bb18n12 . c c r . b u f f a l o . edu MPI Task Assignment : 12 bb18n11 . c c r . b u f f a l o . edu MPI Task Assignment : 13 bb18n11 . c c r . b u f f a l o . edu MPI Task Assignment : 14 bb18n10 . c c r . b u f f a l o . edu MPI Task Assignment : 15 bb18n10 . c c r . b u f f a l o . edu


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - @- MPI Time ( seconds ) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Task AppTime MPITime MPI% 0 582 44.7 7.69 1 579 41.9 7.24 2 579 40.7 7.03 3 579 36.9 6.37 4 579 22.3 3.84 5 579 16.6 2.87 6 579 32 5.53 7 579 35.9 6.20 8 579 28.6 4.93 9 579 25.9 4.48 10 579 39.2 6.76 11 579 33.8 5.84 12 579 35.3 6.10 13 579 41 7.07 14 579 29.9 5.16 15 579 41.4 7.16 9.27 e+03 546 5.89 *


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - @- C a l l s i t e s : 13 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ID Lev F i l e / Address L i n e Parent_Funct MPI_Call 1 0 util .c 833 gsync Barrier 2 0 atlas . c 1531 r ea d Pr o jD a t a Allreduce 3 0 projAtlas . c 745 b a c k P r o j A t l a s Allreduce 4 0 atlas . c 1545 r ea d Pr o jD a t a Allreduce 5 0 atlas . c 1525 r ea d Pr o jD a t a Allreduce 6 0 atlas . c 1541 r ea d Pr o jD a t a Allreduce 7 0 atlas . c 1589 r ea d Pr o jD a t a Allreduce 8 0 atlas . c 1519 r ea d Pr o jD a t a Allreduce 9 0 util .c 789 mygcast Bcast 10 0 projAtlas . c 1100 c o m p u t e L o g l i k e A t l a s Allreduce 11 0 atlas . c 1514 re a dP r oj D a ta Allreduce 12 0 atlas . c 1537 re a dP r oj D a ta Allreduce 13 0 projAtlas . c 425 f w d B a c k P r o j A t l a s 2 Allreduce


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - @- Aggregate Time ( t o p twenty , descending , m i l l i s e c o n d s ) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Call Site Time App% MPI% COV Allreduce 13 3.09 e+05 3.33 56.50 0.46 Barrier 1 2.13 e+05 2.30 38.97 0.35 Bcast 9 1.69 e+04 0.18 3.10 0.37 Allreduce 3 7.78 e+03 0.08 1.42 0.11 Allreduce 10 62.7 0.00 0.01 0.20 Allreduce 11 2.42 0.00 0.00 0.09 Allreduce 7 2.17 0.00 0.00 0.26 Allreduce 12 1.15 0.00 0.00 0.20 Allreduce 6 1.14 0.00 0.00 0.19 Allreduce 5 1.13 0.00 0.00 0.15 Allreduce 8 1.12 0.00 0.00 0.18 Allreduce 2 1.1 0.00 0.00 0.13 Allreduce 4 1.1 0.00 0.00 0.12


- - - - - - - - - - - - - - - - - - @- Aggregate Sent Message Size ( t o p - - - - - - - - - - - - - - - - - - Call Site Count Allreduce 13 65536 Allreduce 3 8192 Bcast 9 490784 Allreduce 11 16 Allreduce 10 512 Allreduce 7 16 Allreduce 2 16 Allreduce 6 16 Allreduce 5 16 Allreduce 4 16 Allreduce 8 16 Allreduce 12 16 ... ...

- - - - - - - - - - - - - - - - - - twenty , descending , b y t e s ) - - - - - - - - - - - - - - - - - - - - - - Total Avrg Sent% 2.28 e+09 3.48 e+04 83.69 2.85 e+08 3.48 e+04 10.46 1.59 e+08 325 5.85 2.07 e+04 1 . 3 e+03 0.00 4 . 1 e+03 8 0.00 256 16 0.00 64 4 0.00 64 4 0.00 64 4 0.00 64 4 0.00 64 4 0.00 64 4 0.00


Using mpiP at Runtime

Now let's examine an example of using mpiP at runtime. This example solves a simple Laplace equation with Dirichlet boundary conditions using finite differences.
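For orientation, here is a minimal serial sketch of the Jacobi-style finite-difference update that such a solver performs on the interior points (sizes and the convergence test are simplified, and the grid dimension below is assumed to be what the "2000" fed to laplace_mpi controls; the actual laplace_mpi code distributes the grid across MPI ranks and exchanges boundary rows, as the Sendrecv callsites in the report will show):

#include <math.h>

#define N 2000   /* interior grid points per dimension (assumed) */

/* One Jacobi sweep over the interior: each new value is the average of its
   four neighbours; returns the largest pointwise change for convergence tests. */
double jacobi_sweep(double u[N+2][N+2], double unew[N+2][N+2])
{
    double diff, maxdiff = 0.0;
    int i, j;

    for (i = 1; i <= N; i++) {
        for (j = 1; j <= N; j++) {
            unew[i][j] = 0.25*(u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
            diff = fabs(unew[i][j] - u[i][j]);
            if (diff > maxdiff) maxdiff = diff;
        }
    }
    return maxdiff;
}

The batch script below then runs the MPI version of this solver with the mpiP wrappers preloaded at runtime.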


#PBS -S /bin/bash #PBS -q debug #PBS -l walltime=00:20:00 #PBS -l nodes=2:MEM24GB:ppn=8 #PBS -M jonesm@ccr.buffalo.edu #PBS -m e #PBS -N test #PBS -o subMPIP.out #PBS -j oe module load intel module load intel-mpi module load mpip module list cd $PBS_O_WORKDIR which mpiexec NNODES=`cat $PBS_NODEFILE | uniq | wc -l` NPROCS=`cat $PBS_NODEFILE | wc -l` export I_MPI_DEBUG=5 # Use LD_PRELOAD trick to load mpiP wrappers at runtime export LD_PRELOAD=$MPIPDIR/lib/libmpiP.so mpdboot -n $NNODES -f $PBS_NODEFILE -v mpiexec -np $NPROCS -envall ./laplace_mpi <<EOF 2000 EOF mpdallexit


... and then run it and examine the resulting mpiP output file:

[k07n14:~/d_laplace/d_mpip]$ ls -l laplace_mpi.16.8618.1.mpiP
-rw-r--r-- 1 jonesm ccrstaff 17919 Oct 14 16:28 laplace_mpi.16.8618.1.mpiP

and again we will break it down by section:


@ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @

mpiP Command : ./laplace_mpi Version MPIP Build date Start time Stop time Timer Used MPIP env var Collector Rank Collector PID Final Output Dir Report generation MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment MPI Task Assignment

: : : : : : : : : : : : : : : : : : : : : : : : : :

3.3.0 Oct 14 2011, 16:16:34 2011 10 14 16:26:15 2011 10 14 16:28:11 PMPI_Wtime [null] 0 8618 . Single collector task 0 d15n33.ccr.buffalo.edu 1 d15n33.ccr.buffalo.edu 2 d15n33.ccr.buffalo.edu 3 d15n33.ccr.buffalo.edu 4 d15n33.ccr.buffalo.edu 5 d15n33.ccr.buffalo.edu 6 d15n33.ccr.buffalo.edu 7 d15n33.ccr.buffalo.edu 8 d15n23.ccr.buffalo.edu 9 d15n23.ccr.buffalo.edu 10 d15n23.ccr.buffalo.edu 11 d15n23.ccr.buffalo.edu 12 d15n23.ccr.buffalo.edu 13 d15n23.ccr.buffalo.edu 14 d15n23.ccr.buffalo.edu 15 d15n23.ccr.buffalo.edu


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -@ - - MPI Time (seconds) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Task AppTime MPITime MPI% 0 116 16.9 14.57 1 115 16.4 14.18 2 115 16.8 14.53 3 115 17.7 15.34 4 115 16.6 14.39 5 115 14.3 12.37 6 115 13.7 11.90 7 115 11.1 9.65 8 115 12.5 10.87 9 115 13.8 12.00 10 115 14.5 12.60 11 115 16.9 14.67 12 115 17.5 15.15 13 115 19.4 16.82 14 115 16.4 14.22 15 115 17.5 15.13 1.85e+03 252 13.65 *


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -@ - - Callsites: 6 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -ID Lev File/Address Line Parent_Funct MPI_Call 1 0 laplace_mpi.f90 118 MAIN__ Allreduce 2 0 laplace_mpi.f90 143 MAIN__ Recv 3 0 laplace_mpi.f90 48 __paramod_MOD_xchange Sendrecv 4 0 laplace_mpi.f90 80 MAIN__ Bcast 5 0 laplace_mpi.f90 46 __paramod_MOD_xchange Sendrecv 6 0 laplace_mpi.f90 138 MAIN__ Send - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -@ - - Aggregate Time (top twenty, descending, milliseconds) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Call Site Time App% MPI% COV Allreduce 1 2.34e+05 12.69 92.94 0.15 Sendrecv 3 8.88e+03 0.48 3.52 0.19 Sendrecv 5 8.47e+03 0.46 3.36 0.16 Send 6 321 0.02 0.13 0.51 Bcast 4 97.7 0.01 0.04 0.26 Recv 2 14.4 0.00 0.01 0.00 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - @ - - Callsite Time statistics (all, milliseconds): 80 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Name Site Rank Count Max Mean Min Allreduce 1 0 23845 28.5 0.666 0.036 Allreduce 1 1 23845 28.6 0.644 0.035 Allreduce 1 2 23845 28.6 0.661 0.037 Allreduce 1 3 23845 28.6 0.695 0.038 Allreduce 1 4 23845 29.2 0.651 0.038 Allreduce 1 5 23845 29.2 0.555 0.035 Allreduce 1 6 23845 29 0.528 0.037 Allreduce 1 7 23845 26.3 0.405 0.038 Allreduce 1 8 23845 26.8 0.468 0.034 Allreduce 1 9 23845 28.4 0.538 0.033 Allreduce 1 10 23845 29.7 0.566 0.033 Allreduce 1 11 23845 29.7 0.661 0.033 Allreduce 1 12 23845 28.6 0.686 0.039 Allreduce 1 13 23845 28.8 0.768 0.041 Allreduce 1 14 23845 28.6 0.641 0.036 Allreduce 1 15 23845 28.7 0.694 0.038 Allreduce 1 29.7 0.614 0.033 * 381520

- - - - - -- - - - - -- - - - - -App% MPI% 13.69 94.00 13.30 93.79 13.66 93.98 14.36 93.63 13.46 93.54 11.46 92.71 10.91 91.64 8.37 86.74 9.67 88.92 11.11 92.59 11.70 92.90 13.65 93.10 14.17 93.49 15.87 94.35 13.25 93.19 14.33 94.76 12.69 92.94


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -@ - - Callsite Message Sent statistics (all, sent bytes) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Name Site Rank Count Max Mean Min Sum Allreduce 1 0 23845 8 8 8 1.908e+05 Allreduce 1 1 23845 8 8 8 1.908e+05 Allreduce 1 2 23845 8 8 8 1.908e+05 Allreduce 1 3 23845 8 8 8 1.908e+05 Allreduce 1 4 23845 8 8 8 1.908e+05 Allreduce 1 5 23845 8 8 8 1.908e+05 Allreduce 1 6 23845 8 8 8 1.908e+05 Allreduce 1 7 23845 8 8 8 1.908e+05 Allreduce 1 8 23845 8 8 8 1.908e+05 Allreduce 1 9 23845 8 8 8 1.908e+05 Allreduce 1 10 23845 8 8 8 1.908e+05 Allreduce 1 11 23845 8 8 8 1.908e+05 Allreduce 1 12 23845 8 8 8 1.908e+05 Allreduce 1 13 23845 8 8 8 1.908e+05 Allreduce 1 14 23845 8 8 8 1.908e+05 Allreduce 1 15 23845 8 8 8 1.908e+05 Allreduce 1 8 8 8 3.052e+06 * 381520


Intel Trace Analyzer/Collector

Intel Trace Analyzer/Collector (ITAC)

A commercial product for performing MPI trace analysis that has enjoyed a long history is Vampir/Vampirtrace, originally developed and sold by Pallas GmbH. Now owned by Intel and available as the Intel Trace Analyzer and Collector. We have a license on U2 if someone wants to give it a try. Note that Vampir/Vampirtrace has since been reborn as an entirely new product.


ITAC Example

Note that you do not have to recompile your application to use ITAC (unless you are building it statically); you can just build it as usual, using Intel MPI:

[bono:~/d_laplace/d_itac]$ module load intel-mpi
[bono:~/d_laplace/d_itac]$ make laplace_mpi
mpiifort -ipo -O3 -Vaxlib -g -c laplace_mpi.f90
mpiifort -ipo -O3 -Vaxlib -g -o laplace_mpi laplace_mpi.o
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_ifortjOzYAn.o
[bono:~/d_laplace/d_itac]$

You can turn trace collection on at run-time ...


[bono:~/d_laplace]$ cat subICT
#PBS -S /bin/bash
#PBS -q debug
#PBS -l walltime=00:20:00
#PBS -l nodes=1:ppn=8
#PBS -M jonesm@ccr.buffalo.edu
#PBS -m e
#PBS -N ITAC
#PBS -o subITAC.out
#PBS -j oe
. $MODULESHOME/init/bash
module load intel-mpi
module list
cd $PBS_O_WORKDIR
which mpiexec
NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
NPROCS=`cat $PBS_NODEFILE | wc -l`
UNIQ_HOSTS=tmp.hosts
cat $PBS_NODEFILE | uniq > $UNIQ_HOSTS
export I_MPI_DEBUG=5
mpdboot -n $NNODES -f "$UNIQ_HOSTS" -v
mpiexec -trace -np $NPROCS ./laplace_mpi <<EOF
2000
EOF
mpdallexit
[ -e "$UNIQ_HOSTS" ] && \
  rm "$UNIQ_HOSTS"


Running the preceding batch job on U2 produces a bunch (many!) of profiling output files, the most important of which is named after your binary with a .stf suffix, in this case laplace_mpi.stf. We feed this to the Intel Trace Analyzer using the traceanalyzer command ... and we should see a profile that looks very much like what you can see using jumpshot.


More ITAC Documentation

Some helpful pointers to more ITAC documentation:


[k07n14:~]$ which traceanalyzer
/util/intel/itac/8.0.3.007/bin/traceanalyzer
[k07n14:~]$ ls -l /util/intel/itac/8.0.3.007/doc/
total 4211
-r--r--r-- 1 root root   91050 Aug 25 06:02 FAQ.pdf
-r--r--r-- 1 root root   15566 Aug 25 06:02 Getting_Started.html
drwxr-xr-x 3 root root      57 Oct  4 08:02 html
-r--r--r-- 1 root root   61598 Aug 25 06:02 INSTALL.html
-r--r--r-- 1 root root 2051029 Aug 25 06:02 ITA_Reference_Guide.pdf
-r--r--r-- 1 root root 1050276 Aug 25 06:02 ITC_Reference_Guide.pdf
-r--r--r-- 1 root root   20681 Aug 25 06:02 Release_Notes.txt

Generally a good idea to refer to the documentation for the same version that you are using (you can check with module show intel-mpi).


Performance API (PAPI)

Introduction to PAPI

Introduction

Performance Application Programming Interface

Implement a portable(!) and efficient API to access existing hardware performance counters

Ease the optimization of code by providing base infrastructure for cross-platform tools


Pre-PAPI

Before PAPI came along, there were hardware performance counters, of course - but access to them was limited to proprietary tools and APIs. Some examples were SGI's perfex and Cray's hpm. Now, as long as PAPI has been ported to a particular hardware substrate, the end-programmer (or tool developer) can just use the PAPI interface.


PAPI Schematic

Best summarized by the following schematic picture:


Behind the PAPI Curtain


Linux - x86/x86_64 uses the perfctr kernel patches by Mikael Pettersson:
http://user.it.uu.se/~mikpe/linux/perfctr/2.6

Headed for inclusion in mainstream Linux kernel (was a custom patch applied to CCR systems prior to Linux kernel 2.6.32); low overhead

IA64 - uses PFM, developed by HP and included in the linux kernel (for x86_64):

Full use of available IA64 monitoring capabilities

Quite a bit slower than perfctr, at least according to the PAPI developers
http://www.hpl.hp.com/research/linux/perfmon

libpfm lives on using perf events, but perfmon apparently ceased development for Linux as of kernel 2.6.30 or so

"Perf Events" added to Linux kernel in 2.6.31, replacing both of the above, c.f.:
http://web.eecs.utk.edu/~vweaver1/projects/perf-events/

Reminder: U2 Hardware

Xeon Block Diagram

Reconsider the block diagram for the (older nodes) Intel architecture in CCR's U2 cluster:

Look familiar?


Xeon Block Diagram (cont'd)

Consider the block diagram for the Intel architecture in CCR's U2 cluster:

Look familiar? (compare with von Neumann's original sketch)


Nehalem Xeon Block Diagram


Block diagram for Nehalem architecture:
[Figure: Intel Nehalem microarchitecture block diagram - instruction and data caches with TLBs, decoders and the out-of-order engine with its execution ports, the private 256 KByte L2 cache, shared 8 MByte L3 cache, QuickPath Interconnect, and the integrated DDR3 memory controller.]


Superscalar - EM64T Irwindale

Characteristics of the Irwindale Xeons that form (part of) CCR's U2 cluster:

Clock Cycle             3.2 GHz
TPP                     6.4 GFlop/s
Pipeline                31 stages
L2 Cache Size           2 MByte
L2 Bandwidth            102.4 GByte/s
CPU-Memory Bandwidth    6.4 GByte/s (shared)
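The TPP (theoretical peak performance) entry is just clock rate times flops per cycle: 6.4 GFlop/s = 3.2 GHz x 2 double-precision flops/cycle (the NetBurst core effectively retires one two-wide SSE2 floating-point operation per cycle - a standard figure for this family rather than something stated on the slide).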


EM64T Irwindale Memory Hierarchy Penalties

Consider the penalties in lost computation for not reusing data in the various caches (using U2's Intel Irwindale Xeon processors), as determined by the lmbench benchmark (http://www.bitmover.com/lmbench):

Memory      Miss Penalty (cycles)
L1 Cache    3
L2 Cache    28
Main        400

Westmere Xeons

Characteristics of the Westmere E5645 Xeons that form (part of) CCR's U2 cluster:

Clock Cycle             2.4 GHz
TPP                     9.6 GFlop/s (per core)
Pipeline                14 stages
L2 Cache Size           256 kByte
L3 Cache Size           12 MByte
CPU-Memory Bandwidth    32 GByte/s (nonuniform!)
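Again TPP is clock rate times flops per cycle, but here each core sustains 4 double-precision flops per cycle (one two-wide SSE add plus one two-wide SSE multiply per cycle - a standard figure for this family, not stated on the slide):

TPP = 2.4 GHz x 4 flops/cycle = 9.6 GFlop/s per core,

or roughly 115 GFlop/s for a 12-core node.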


Westmere Xeon Memory Hierarchy Penalties

Consider the penalties in lost computation for not reusing data in the various caches (using U2's Intel Westmere Xeon E5645 processors), as determined by the lmbench benchmark (http://www.bitmover.com/lmbench):

Memory      Miss Penalty (cycles)
L1 Cache    4
L2 Cache    15
Main        110
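Combining this with the peak rate above gives a rough feel for the cost (a back-of-the-envelope estimate, not a measurement): at 4 flops/cycle, a single stall to main memory forfeits on the order of 110 x 4 = 440 floating-point operations, which is why structuring loops to reuse data while it is still in cache pays off so handsomely.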

Available Counters in PAPI

Available PAPI Performance Data

Cycle count

Instruction count (including integer, floating point, load/store)

Branches (including taken/not taken, mispredictions)

Pipeline stalls (due to memory, resource conflicts)

Cache (misses for different levels, invalidation)

TLB (misses, invalidation)


High vs. Low-level PAPI

High-level PAPI

Intended for coarse-grained measurements

Requires little (or no) setup code

Allows only PAPI preset events

Allows only aggregate counting (no statistical profiling)


Low-level PAPI

More efficient (and functional) than the high-level API

About 60 functions

Thread-safe

Supports presets and native events
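Since the worked examples later in these slides use only the high-level calls, here is a minimal low-level sketch for comparison (assuming the PAPI_TOT_CYC and PAPI_TOT_INS presets are available on the node - check with papi_avail; the loop is just a placeholder for real work):

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];
    volatile double tmp = 1.0;
    int i;

    /* initialize the library, build an event set, and add two presets */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        exit(1);
    }
    if (PAPI_create_eventset(&eventset)        != PAPI_OK ||
        PAPI_add_event(eventset, PAPI_TOT_CYC) != PAPI_OK ||
        PAPI_add_event(eventset, PAPI_TOT_INS) != PAPI_OK) {
        fprintf(stderr, "could not build event set\n");
        exit(1);
    }

    PAPI_start(eventset);
    for (i = 1; i < 1000000; i++)     /* section being measured */
        tmp += 1.0/(double)i;
    PAPI_stop(eventset, counts);      /* counts[0]=cycles, counts[1]=instructions */

    printf("cycles=%lld  instructions=%lld\n", counts[0], counts[1]);
    return 0;
}

The same event set can be reused around other sections with PAPI_reset()/PAPI_read(), which is what makes the low-level interface more flexible than the one-shot high-level calls.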


Preset vs. Native Events

preset or pre-defined events are those which have been considered useful by the PAPI community and developers:
http://icl.cs.utk.edu/projects/papi/presets.html

native events are those countable by the CPU's hardware. These events are highly platform specific, and you would need to consult the processor architecture manuals for the relevant native event lists


Low-level PAPI Functions

Hardware counter multiplexing (time sharing hardware counters to allow more events to be monitored than can be conventionally supported)

Processor information

Address space information

Memory information (static and dynamic)

Timing functions

Hardware event inquiry

... and many more


More PAPI Information

For more on PAPI, including source code, documentation, presentations, and links to third-party tools that utilize PAPI, see http://icl.cs.utk.edu/projects/papi


How to Access PAPI at CCR

Consider a simple example code to measure Flop/s using the high-level PAPI API:
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main()
{
  float real_time, proc_time, mflops;
  long_long flpops;
  float ireal_time, iproc_time, imflops;
  long_long iflpops;
  int retval;


13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43

i f ( ( r e t v a l = PAPI_flops (& i r e a l _ t i m e ,& i p r o c _ t i m e ,& i f l p o p s ,& i m f l o p s ) ) < PAPI_OK ) { p r i n t f ( " Could n o t i n i t i a l i s e PAPI_flops \ n " ) ; p r i n t f ( " Your p l a t f o r m may n o t s u p p o r t f l o a t i n g p o i n t o p e r a t i o n event . \ n " ) ; p r i n t f ( " r e t v a l : %d \ n " , r e t v a l ) ; exit (1); } your_slow_code ( ) ; i f ( ( r e t v a l = PAPI_flops ( &r e a l _ t i m e , &proc_time , &f l p o p s , &mflops ) ) <PAPI_OK ) { p r i n t f ( " r e t v a l : %d \ n " , r e t v a l ) ; exit (1); } p r i n t f ( " Real_time : %f Proc_time : %f T o t a l f l p o p s : %l l d MFLOPS: %f \ n " , r e a l _ t i m e , proc_time , f l p o p s , mflops ) ; exit (0); } i n t your_slow_code ( ) { int i ; double tmp = 1 . 1 ; f o r ( i =1; i <2000; i ++) { tmp =( tmp + 1 0 0 ) / i ; } return 0; }


How to Access PAPI at CCR

On U2, you access the papi module and compile accordingly:


[k07n14:~/d_papi]$ qsub -q debug -lnodes=1:MEM48GB:ppn=12,walltime=01:00:00 -I
qsub: waiting for job 1144485.d15n41.ccr.buffalo.edu to start
qsub: job 1144485.d15n41.ccr.buffalo.edu ready
Job 1144485.d15n41.ccr.buffalo.edu has requested 12 cores/processors per node.
PBSTMPDIR is /scratch/1144485.d15n41.ccr.buffalo.edu
[k16n01b:~]$ cd $PBS_O_WORKDIR
[k16n01b:~/d_papi]$ module load papi
'papi/v4.1.4' load complete.
[k16n01b:~/d_papi]$ gcc -I$PAPI/include -o PAPI_flops PAPI_flops.c -L$PAPI/lib -lpapi
[k16n01b:~/d_papi]$ ./PAPI_flops
Real_time: 0.000042 Proc_time: 0.000029 Total flpops: 7983 MFLOPS: 276.572449

N.B., PAPI needs to support the underlying hardware, and this version does not support the 32-core Intel nodes (including the front end).


... and on Lennon (SGI Altix):


[jonesm@lennon ~/d_papi]$ module load papi
'papi/v3.2.1' load complete.
[jonesm@lennon ~/d_papi]$ echo $PAPI
/util/perftools/papi-3.2.1
[jonesm@lennon ~/d_papi]$ gcc -I$PAPI/include -o PAPI_flops PAPI_flops.c \
  -L$PAPI/lib -lpapi
[jonesm@lennon ~/d_papi]$ ldd PAPI_flops
        libpapi.so => /util/perftools/papi-3.2.1/lib/libpapi.so (0x2000000000040000)
        libc.so.6.1 => /lib/tls/libc.so.6.1 (0x20000000000ac000)
        libpfm.so.2 => /usr/lib/libpfm.so.2 (0x2000000000318000)
        /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
[jonesm@lennon ~/d_papi]$ ./PAPI_flops
Real_time: 0.000143 Proc_time: 0.000132 Total flpops: 48056 MFLOPS: 363.964111

N.B., the Altix is dead, but this is a good example of the cross-platform portability of PAPI accessing the hardware performance counters.


papi_avail Command
You can use papi_avail to check event availability (different CPUs support various events):
[k16n01b:~/d_papi]$ papi_avail Available events and hardware information. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PAPI Version : 4.1.4.0 Vendor string and code : GenuineIntel (1) Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44) CPU Revision : 2.000000 CPUID Info : Family: 6 Model: 44 Stepping: 2 CPU Megahertz : 2400.391113 CPU Clock Megahertz : 2400 Hdw Threads per core : 1 Cores per Socket : 6 NUMA Nodes : 2 CPU's per Node : 6 Total CPU's : 12 Number Hardware Counters : 16 Max Multiplex Counters : 512 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - The following correspond to fields in the PAPI_event_info_t structure. Name PAPI_L1_DCM PAPI_L1_ICM Code Avail Deriv Description (Note) 0x80000000 Yes No Level 1 data cache misses 0x80000001 Yes No Level 1 instruction cache misses


Performance API (PAPI) PAPI_L2_DCM PAPI_L2_ICM PAPI_L3_DCM PAPI_L3_ICM PAPI_L1_TCM PAPI_L2_TCM PAPI_L3_TCM PAPI_CA_SNP PAPI_CA_SHR PAPI_CA_CLN PAPI_CA_INV PAPI_CA_ITV PAPI_L3_LDM PAPI_L3_STM PAPI_BRU_IDL PAPI_FXU_IDL PAPI_FPU_IDL PAPI_LSU_IDL PAPI_TLB_DM PAPI_TLB_IM PAPI_TLB_TL PAPI_L1_LDM PAPI_L1_STM PAPI_L2_LDM PAPI_L2_STM PAPI_BTAC_M PAPI_PRF_DM PAPI_L3_DCH PAPI_TLB_SD PAPI_CSR_FAL PAPI_CSR_SUC PAPI_CSR_TOT 0x80000002 0x80000003 0x80000004 0x80000005 0x80000006 0x80000007 0x80000008 0x80000009 0x8000000a 0x8000000b 0x8000000c 0x8000000d 0x8000000e 0x8000000f 0x80000010 0x80000011 0x80000012 0x80000013 0x80000014 0x80000015 0x80000016 0x80000017 0x80000018 0x80000019 0x8000001a 0x8000001b 0x8000001c 0x8000001d 0x8000001e 0x8000001f 0x80000020 0x80000021 Yes Yes No No Yes Yes Yes No No No No No Yes No No No No No Yes Yes Yes Yes Yes Yes Yes No No No No No No No Yes No No No Yes No No No No No No No No No No No No No No No Yes No No No No No No No No No No No


Level 2 data cache misses Level 2 instruction cache misses Level 3 data cache misses Level 3 instruction cache misses Level 1 cache misses Level 2 cache misses Level 3 cache misses Requests for a snoop Requests for exclusive access to shared cache line Requests for exclusive access to clean cache line Requests for cache line invalidation Requests for cache line intervention Level 3 load misses Level 3 store misses Cycles branch units are idle Cycles integer units are idle Cycles floating point units are idle Cycles load/store units are idle Data translation lookaside buffer misses Instruction translation lookaside buffer misses Total translation lookaside buffer misses Level 1 load misses Level 1 store misses Level 2 load misses Level 2 store misses Branch target address cache misses Data prefetch cache misses Level 3 data cache hits Translation lookaside buffer shootdowns Failed store conditional instructions Successful store conditional instructions Total store conditional instructions


Performance API (PAPI) PAPI_MEM_SCY PAPI_MEM_RCY PAPI_MEM_WCY PAPI_STL_ICY PAPI_FUL_ICY PAPI_STL_CCY PAPI_FUL_CCY PAPI_HW_INT PAPI_BR_UCN PAPI_BR_CN PAPI_BR_TKN PAPI_BR_NTK PAPI_BR_MSP PAPI_BR_PRC PAPI_FMA_INS PAPI_TOT_IIS PAPI_TOT_INS PAPI_INT_INS PAPI_FP_INS PAPI_LD_INS PAPI_SR_INS PAPI_BR_INS PAPI_VEC_INS PAPI_RES_STL PAPI_FP_STAL PAPI_TOT_CYC PAPI_LST_INS PAPI_SYC_INS PAPI_L1_DCH PAPI_L2_DCH PAPI_L1_DCA PAPI_L2_DCA PAPI_L3_DCA 0x80000022 0x80000023 0x80000024 0x80000025 0x80000026 0x80000027 0x80000028 0x80000029 0x8000002a 0x8000002b 0x8000002c 0x8000002d 0x8000002e 0x8000002f 0x80000030 0x80000031 0x80000032 0x80000033 0x80000034 0x80000035 0x80000036 0x80000037 0x80000038 0x80000039 0x8000003a 0x8000003b 0x8000003c 0x8000003d 0x8000003e 0x8000003f 0x80000040 0x80000041 0x80000042 No No No No No No No No Yes Yes Yes Yes Yes Yes No Yes Yes No Yes Yes Yes Yes No Yes No Yes Yes No No Yes No Yes Yes No No No No No No No No No No No Yes No Yes No No No No No No No No No No No No Yes No No Yes No No Yes


Cycles Stalled Waiting for memory accesses Cycles Stalled Waiting for memory Reads Cycles Stalled Waiting for memory writes Cycles with no instruction issue Cycles with maximum instruction issue Cycles with no instructions completed Cycles with maximum instructions completed Hardware interrupts Unconditional branch instructions Conditional branch instructions Conditional branch instructions taken Conditional branch instructions not taken Conditional branch instructions mispredicted Conditional branch instructions correctly predicted FMA instructions completed Instructions issued Instructions completed Integer instructions Floating point instructions Load instructions Store instructions Branch instructions Vector/SIMD instructions (could include integer) Cycles stalled on any resource Cycles the FP unit(s) are stalled Total cycles Load/store instructions completed Synchronization instructions completed Level 1 data cache hits Level 2 data cache hits Level 1 data cache accesses Level 2 data cache accesses Level 3 data cache accesses


PAPI_L1_DCR PAPI_L2_DCR PAPI_L3_DCR PAPI_L1_DCW PAPI_L2_DCW PAPI_L3_DCW PAPI_L1_ICH PAPI_L2_ICH PAPI_L3_ICH PAPI_L1_ICA PAPI_L2_ICA PAPI_L3_ICA PAPI_L1_ICR PAPI_L2_ICR PAPI_L3_ICR PAPI_L1_ICW PAPI_L2_ICW PAPI_L3_ICW PAPI_L1_TCH PAPI_L2_TCH PAPI_L3_TCH PAPI_L1_TCA PAPI_L2_TCA PAPI_L3_TCA PAPI_L1_TCR PAPI_L2_TCR PAPI_L3_TCR PAPI_L1_TCW PAPI_L2_TCW PAPI_L3_TCW

0x80000043 0x80000044 0x80000045 0x80000046 0x80000047 0x80000048 0x80000049 0x8000004a 0x8000004b 0x8000004c 0x8000004d 0x8000004e 0x8000004f 0x80000050 0x80000051 0x80000052 0x80000053 0x80000054 0x80000055 0x80000056 0x80000057 0x80000058 0x80000059 0x8000005a 0x8000005b 0x8000005c 0x8000005d 0x8000005e 0x8000005f 0x80000060

No Yes Yes No Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes No No No No Yes No No Yes Yes No Yes Yes No Yes Yes

No No No No No No No No No No No No No No No No No No No Yes No No No No No Yes Yes No No No

Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level Level

1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

data cache reads data cache reads data cache reads data cache writes data cache writes data cache writes instruction cache hits instruction cache hits instruction cache hits instruction cache accesses instruction cache accesses instruction cache accesses instruction cache reads instruction cache reads instruction cache reads instruction cache writes instruction cache writes instruction cache writes total cache hits total cache hits total cache hits total cache accesses total cache accesses total cache accesses total cache reads total cache reads total cache reads total cache writes total cache writes total cache writes


PAPI_FML_INS 0x80000061 No PAPI_FAD_INS 0x80000062 No PAPI_FDV_INS 0x80000063 No PAPI_FSQ_INS 0x80000064 No PAPI_FNV_INS 0x80000065 No PAPI_FP_OPS 0x80000066 Yes PAPI_SP_OPS 0x80000067 Yes scaled single precision vector PAPI_DP_OPS 0x80000068 Yes scaled double precision vector PAPI_VEC_SP 0x80000069 Yes PAPI_VEC_DP 0x8000006a Yes - - - - - - - - - - - - - - Of 107 possible events, 57 are avail.c

No Floating point multiply instructions No Floating point add instructions No Floating point divide instructions No Floating point square root instructions No Floating point inverse instructions Yes Floating point operations Yes Floating point operations; optimized to count operations Yes Floating point operations; optimized to count operations No Single precision vector/SIMD instructions No Double precision vector/SIMD instructions - - - - - - - - - - - - - - - - - - - - -available, of which 14 are derived. PASSED


papi_event_chooser Command
Not all events can be simultaneously monitored (at least not without multiplexing):
[k16n01b:~]$ papi_event_chooser PRESET PAPI_FP_OPS Event Chooser: Available events which can be added with given events. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PAPI Version : 4.1.4.0 Vendor string and code : GenuineIntel (1) Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44) CPU Revision : 2.000000 CPUID Info : Family: 6 Model: 44 Stepping: 2 CPU Megahertz : 2400.391113 CPU Clock Megahertz : 2400 Hdw Threads per core : 1 Cores per Socket : 6 NUMA Nodes : 2 CPU's per Node : 6 Total CPU's : 12 Number Hardware Counters : 16 Max Multiplex Counters : 512 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Name PAPI_L1_DCM PAPI_L1_ICM PAPI_L2_DCM PAPI_L2_ICM Code Deriv Description (Note) 0x80000000 No Level 1 data cache misses 0x80000001 No Level 1 instruction cache misses 0x80000002 Yes Level 2 data cache misses 0x80000003 No Level 2 instruction cache misses


Performance API (PAPI) PAPI_L1_TCM PAPI_L2_TCM PAPI_L3_TCM PAPI_L3_LDM PAPI_TLB_DM PAPI_TLB_IM PAPI_TLB_TL PAPI_L1_LDM PAPI_L1_STM PAPI_L2_LDM PAPI_L2_STM PAPI_BR_UCN PAPI_BR_CN PAPI_BR_TKN PAPI_BR_NTK PAPI_BR_MSP PAPI_BR_PRC PAPI_TOT_IIS PAPI_TOT_INS PAPI_FP_INS PAPI_LD_INS PAPI_SR_INS PAPI_BR_INS PAPI_RES_STL PAPI_TOT_CYC PAPI_LST_INS PAPI_L2_DCH PAPI_L2_DCA PAPI_L3_DCA PAPI_L2_DCR PAPI_L3_DCR PAPI_L2_DCW PAPI_L3_DCW 0x80000006 0x80000007 0x80000008 0x8000000e 0x80000014 0x80000015 0x80000016 0x80000017 0x80000018 0x80000019 0x8000001a 0x8000002a 0x8000002b 0x8000002c 0x8000002d 0x8000002e 0x8000002f 0x80000031 0x80000032 0x80000034 0x80000035 0x80000036 0x80000037 0x80000039 0x8000003b 0x8000003c 0x8000003f 0x80000041 0x80000042 0x80000044 0x80000045 0x80000047 0x80000048 Yes No No No No No Yes No No No No No No No Yes No Yes No No No No No No No No Yes Yes No Yes No No No No


Level 1 cache misses Level 2 cache misses Level 3 cache misses Level 3 load misses Data translation lookaside buffer misses Instruction translation lookaside buffer misses Total translation lookaside buffer misses Level 1 load misses Level 1 store misses Level 2 load misses Level 2 store misses Unconditional branch instructions Conditional branch instructions Conditional branch instructions taken Conditional branch instructions not taken Conditional branch instructions mispredicted Conditional branch instructions correctly predicted Instructions issued Instructions completed Floating point instructions Load instructions Store instructions Branch instructions Cycles stalled on any resource Total cycles Load/store instructions completed Level 2 data cache hits Level 2 data cache accesses Level 3 data cache accesses Level 2 data cache reads Level 3 data cache reads Level 2 data cache writes Level 3 data cache writes


PAPI_L1_ICH 0x80000049 No Level 1 instruction cache hits PAPI_L2_ICH 0x8000004a No Level 2 instruction cache hits PAPI_L1_ICA 0x8000004c No Level 1 instruction cache accesses PAPI_L2_ICA 0x8000004d No Level 2 instruction cache accesses PAPI_L3_ICA 0x8000004e No Level 3 instruction cache accesses PAPI_L1_ICR 0x8000004f No Level 1 instruction cache reads PAPI_L2_ICR 0x80000050 No Level 2 instruction cache reads PAPI_L3_ICR 0x80000051 No Level 3 instruction cache reads PAPI_L2_TCH 0x80000056 Yes Level 2 total cache hits PAPI_L2_TCA 0x80000059 No Level 2 total cache accesses PAPI_L3_TCA 0x8000005a No Level 3 total cache accesses PAPI_L2_TCR 0x8000005c Yes Level 2 total cache reads PAPI_L3_TCR 0x8000005d Yes Level 3 total cache reads PAPI_L2_TCW 0x8000005f No Level 2 total cache writes PAPI_L3_TCW 0x80000060 No Level 3 total cache writes PAPI_SP_OPS 0x80000067 Yes Floating point operations; optimized to count scaled single precision vector operations PAPI_DP_OPS 0x80000068 Yes Floating point operations; optimized to count scaled double precision vector operations PAPI_VEC_SP 0x80000069 No Single precision vector/SIMD instructions PAPI_VEC_DP 0x8000006a No Double precision vector/SIMD instructions - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Total events reported: 56 event_chooser.c PASSED


PAPI API Examples

PAPI Examples

In this section we will work through a few simple examples of using the PAPI API, mostly focused on using the high-level API. And we will steer clear of native events, and leave those to tool developers.


Accessing Counters Through PAPI

Include files for constants and routine interfaces:

C: papi.h
F77: f77papi.h
F90: f90papi.h


PAPI Naming Scheme

The C interfaces:

PAPI C interface
(return type) PAPI_function_name(arg1, arg2, ...)

and Fortran interfaces:

PAPI Fortran interfaces
PAPIF_function_name(arg1, arg2, ..., check)

Note that the check parameter is the same type and value as the C return value.

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

81 / 108

PAPI API Examples

Relation Between C and Fortran Types in PAPI

The following table shows the relation between the C and Fortran types used in PAPI:
Pseudo-type       Fortran type                      Description
C_INT             INTEGER                           Default integer type
C_FLOAT           REAL                              Default real type
C_LONG_LONG       INTEGER*8                         Extended size integer
C_STRING          CHARACTER*(PAPI_MAX_STR_LEN)      Fortran string
C_INT FUNCTION    EXTERNAL INTEGER FUNCTION         Fortran function returning integer result

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

82 / 108

PAPI API Examples

High-level API Example in C

High-level API Example in C


Let's consider the following example code using the high-level API in C.
# include " p a p i . h " # include < s t d i o . h> # define NUM_EVENTS 2 # define THRESHOLD 10000 # define ERROR_RETURN( r e t v a l ) { f p r i n t f ( s t d e r r , " E r r o r %d %s : l i n e %d : \ n " , r e t v a l , __FILE__ , __LINE__ ) ; exit ( retval ); } void c o m p ut at io n_ mu lt ( ) { / * s t u p i d codes t o be monitored * / double tmp = 1 . 0 ; i n t i =1; f o r ( i = 1 ; i < THRESHOLD; i ++ ) { tmp = tmp * i ; } } void computation_add ( ) { / * s t u p i d codes t o be monitored * / i n t tmp = 0 ; i n t i =0; f o r ( i = 0 ; i < THRESHOLD; i ++ ) { tmp = tmp + i ; } }

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

83 / 108

PAPI API Examples

High-level API Example in C

int main()
{
  /* Declaring and initializing the event set with the presets */
  int Events[2] = { PAPI_TOT_INS, PAPI_TOT_CYC };
  /* The length of the events array should be no longer than the
     value returned by PAPI_num_counters. */

  /* declaring placeholder for number of hardware counters */
  int num_hwcntrs = 0;
  int retval;
  char errstring[PAPI_MAX_STR_LEN];

  /* This is going to store our list of results */
  long_long values[NUM_EVENTS];

  /***************************************************************************
   * This part initializes the library and compares the version number of the
   * header file to the version of the library; if these don't match then it
   * is likely that PAPI won't work correctly. If there is an error, retval
   * keeps track of the version number.
   ***************************************************************************/
  if ((retval = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT) {
    fprintf(stderr, "Error: %d %s\n", retval, errstring);
    exit(1);
  }

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

84 / 108

PAPI API Examples

High-level API Example in C

  /**************************************************************************
   * PAPI_num_counters returns the number of hardware counters the platform
   * has, or a negative number if there is an error
   **************************************************************************/
  if ((num_hwcntrs = PAPI_num_counters()) < PAPI_OK) {
    printf("There are no counters available. \n");
    exit(1);
  }

  printf("There are %d counters in this system\n", num_hwcntrs);

  /**************************************************************************
   * PAPI_start_counters initializes the PAPI library (if necessary) and
   * starts counting the events named in the events array. This function
   * implicitly stops and initializes any counters running as a result of
   * a previous call to PAPI_start_counters.
   **************************************************************************/
  if ((retval = PAPI_start_counters(Events, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("\nCounter Started: \n");

  /* Your code goes here */
  computation_add();

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

85 / 108

PAPI API Examples

High-level API Example in C

  /**********************************************************************
   * PAPI_read_counters reads the counter values into the values array
   **********************************************************************/
  if ((retval = PAPI_read_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("Read successfully\n");
  printf("The total instructions executed for addition are %lld \n", values[0]);
  printf("The total cycles used are %lld \n", values[1]);

  printf("\nNow we try to use PAPI_accum to accumulate values\n");

  /* Do some computation here */
  computation_add();

  /************************************************************************
   * What PAPI_accum_counters does is it adds the running counter values
   * to what is in the values array. The hardware counters are reset and
   * left running after the call.
   ************************************************************************/
  if ((retval = PAPI_accum_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("We did an additional %d times addition!\n", THRESHOLD);
  printf("The total instructions executed for addition are %lld \n", values[0]);
  printf("The total cycles used are %lld \n", values[1]);

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

86 / 108

PAPI API Examples

High-level API Example in C

  /***********************************************************************
   * Stop counting events (this reads the counters as well as stops them)
   ***********************************************************************/
  printf("\nNow we try to do some multiplications\n");
  computation_mult();

  /******************* PAPI_stop_counters ********************************/
  if ((retval = PAPI_stop_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

  printf("The total instruction executed for multiplication are %lld \n", values[0]);
  printf("The total cycles used are %lld \n", values[1]);

  exit(0);
}

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

87 / 108

PAPI API Examples

High-level API Example in C

Running on E5645 U2 nodes:


[k11n30b:~/d_papi]$ module load papi
'papi/v4.1.4' load complete.
[k11n30b:~/d_papi]$ gcc -I$PAPI/include -o highlev highlev.c -L$PAPI/lib -lpapi
[k11n30b:~/d_papi]$ ./highlev
There are 16 counters in this system
Counter Started:
Read successfully
The total instructions executed for addition are 54977
The total cycles used are 73894
Now we try to use PAPI_accum to accumulate values
We did an additional 10000 times addition!
The total instructions executed for addition are 112352
The total cycles used are 147814
Now we try to do some multiplications
The total instruction executed for multiplication are 77459
The total cycles used are 126981

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

88 / 108

PAPI API Examples

High-level API Example in C

PAPI Initialization

The preceding example used PAPI_library_init to initialize PAPI, which is also used by the low-level API, but the high-level interface can also be initialized implicitly by PAPI_num_counters, PAPI_start_counters, or one of the rate calls: PAPI_flips, PAPI_flops, or PAPI_ipc. Events are counted, as we saw in the example, using PAPI_read_counters, PAPI_accum_counters, and PAPI_stop_counters. Let's look at an even simpler example using just one of the rate counters (a brief PAPI_ipc sketch appears below, followed by the Fortran example).
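Before turning to the Fortran example, here is a minimal C sketch (added for illustration; the measured loop is just a placeholder, not part of the original slides) of the PAPI_ipc rate call, which reports real and process time, instructions completed, and instructions per cycle:

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    float rtime, ptime, ipc;
    long long ins;
    volatile double x = 1.0;
    int i;

    /* first call starts the counters (and initializes PAPI if needed) */
    if (PAPI_ipc(&rtime, &ptime, &ins, &ipc) != PAPI_OK) exit(1);

    for (i = 0; i < 1000000; i++) x *= 1.0000001;   /* placeholder work */

    /* second call reports what happened since the first call */
    if (PAPI_ipc(&rtime, &ptime, &ins, &ipc) != PAPI_OK) exit(1);

    printf("real %f s  proc %f s  ins %lld  IPC %f\n", rtime, ptime, ins, ipc);
    return 0;
}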

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

89 / 108

PAPI API Examples

High-level Example in F90

High-level Example in F90

For something a little different we can look at our old friend, matrix multiplication, this time in Fortran:
! A simple example for the use of PAPI, the number of flops you should
! get is about INDEX^3 on machines that consider add and multiply one flop
! such as SGI, and 2*(INDEX^3) on those that don't consider it 1 flop such as INTEL
! Kevin London

program flops
  implicit none
  include "f90papi.h"

  integer, parameter :: i8 = SELECTED_INT_KIND(16)   ! integer*8
  integer, parameter :: index = 1000
  real :: matrixa(index,index), matrixb(index,index), mres(index,index)
  real :: proc_time, mflops, real_time
  integer(kind=i8) :: flpins
  integer :: i, j, k, retval

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

90 / 108

PAPI API Examples

High-level Example in F90

  retval = PAPI_VER_CURRENT
  CALL PAPIf_library_init(retval)
  if (retval .NE. PAPI_VER_CURRENT) then
     print *, 'Failure in PAPI_library_init: ', retval
  end if

  CALL PAPIf_query_event(PAPI_FP_OPS, retval)
  if (retval .NE. PAPI_OK) then
     print *, 'Sorry, no PAPI_FP_OPS event: ', PAPI_ENOEVNT
  end if

  ! Initialize the matrix arrays
  do i = 1, index
     do j = 1, index
        matrixa(i,j) = i + j
        matrixb(i,j) = j - i
        mres(i,j)    = 0.0
     end do
  end do

  ! Setup PAPI library and begin collecting data from the counters
  call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
  if (retval .NE. PAPI_OK) then
     print *, 'Failure on PAPIf_flops: ', retval
  end if

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

91 / 108

PAPI API Examples

High-level Example in F90

  ! Matrix-Matrix Multiply
  do i = 1, index
     do j = 1, index
        do k = 1, index
           mres(i,j) = mres(i,j) + matrixa(i,k) * matrixb(k,j)
        end do
     end do
  end do

  ! Collect the data into the variables passed in
  call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
  if (retval .NE. PAPI_OK) then
     print *, 'Failure on PAPIf_flops: ', retval
  end if

  print *, 'Real_time:    ', real_time
  print *, 'Proc_time:    ', proc_time
  print *, 'Total flpins: ', flpins
  print *, 'MFLOPS:       ', mflops

end program flops

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

92 / 108

PAPI API Examples

High-level Example in F90

Compile and run on E5645 U2:


[k11n30b:~/d_papi]$ ifort -o flops -I$PAPI/include flops.f90 -L$PAPI/lib -lpapi
[k11n30b:~/d_papi]$ ./flops
 Real_time:     0.3325460
 Proc_time:     0.3311317
 Total flpins:  500893184
 MFLOPS:        1512.671

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

93 / 108

PAPI API Examples

Low-level API

Low-level API

The low-level API is primarily intended for experienced application programmers and tool developers. It manages hardware events in user-defined groups called event sets, and can use both preset and native events. The low-level API can also interrogate the hardware and report information such as the memory usage of the executable itself. Finally, the low-level API supports multiplexing, in which more (virtual) counters can be used than the underlying hardware supports, by time-sharing the available (physical) hardware counters (see the sketch below).
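As a hedged sketch of multiplexing (not from the original slides; the pair of events chosen here is arbitrary, and in practice you would add more events than there are physical counters), the initialization sequence looks roughly like this:

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int EventSet = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);

    /* multiplexing must be initialized once, after the library */
    if (PAPI_multiplex_init() != PAPI_OK) exit(1);

    if (PAPI_create_eventset(&EventSet) != PAPI_OK) exit(1);

    /* the event set must be bound to a component (0 = CPU) before
       multiplexing can be enabled for it */
    if (PAPI_assign_eventset_component(EventSet, 0) != PAPI_OK) exit(1);
    if (PAPI_set_multiplex(EventSet) != PAPI_OK) exit(1);

    /* now add events, possibly more than there are physical counters */
    if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK) exit(1);
    if (PAPI_add_event(EventSet, PAPI_TOT_CYC) != PAPI_OK) exit(1);

    printf("multiplexed event set ready\n");
    return 0;
}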

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

94 / 108

PAPI API Examples

Low-level API

PAPI Low-level Example


A simple example using the low-level API:
# include < p a p i . h> # include < s t d i o . h> # define NUM_FLOPS 10000 main ( ) { i n t r e t v a l , EventSet=PAPI_NULL ; long_long values [ 1 ] ; / * I n i t i a l i z e t h e PAPI l i b r a r y * / r e t v a l = P A P I _ l i b r a r y _ i n i t (PAPI_VER_CURRENT ) ; i f ( r e t v a l ! = PAPI_VER_CURRENT) { f p r i n t f ( s t d e r r , " PAPI l i b r a r y i n i t e r r o r ! \ n " ) ; exit (1); } / * Create t h e Event Set * / i f ( PAPI_create_eventset (& EventSet ) ! = PAPI_OK ) h a n d l e _ e r r o r ( 1 ) ; / * Add T o t a l I n s t r u c t i o n s Executed t o our Event Set * / i f ( PAPI_add_event ( EventSet , PAPI_TOT_INS ) ! = PAPI_OK ) h a n d l e _ e r r o r ( 1 ) ;

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

95 / 108

PAPI API Examples

Low-level API

  /* Start counting events in the Event Set */
  if (PAPI_start(EventSet) != PAPI_OK) handle_error(1);

  /* Defined in tests/do_loops.c in the PAPI source distribution */
  do_flops(NUM_FLOPS);

  /* Read the counting events in the Event Set */
  if (PAPI_read(EventSet, values) != PAPI_OK) handle_error(1);
  printf("After reading the counters: %lld\n", values[0]);

  /* Reset the counting events in the Event Set */
  if (PAPI_reset(EventSet) != PAPI_OK) handle_error(1);

  do_flops(NUM_FLOPS);

  /* Add the counters in the Event Set */
  if (PAPI_accum(EventSet, values) != PAPI_OK) handle_error(1);
  printf("After adding the counters: %lld\n", values[0]);

  do_flops(NUM_FLOPS);

  /* Stop the counting of events in the Event Set */
  if (PAPI_stop(EventSet, values) != PAPI_OK) handle_error(1);
  printf("After stopping the counters: %lld\n", values[0]);
}

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

96 / 108

PAPI API Examples

PAPI in Parallel

PAPI in Parallel

Threads: PAPI_thread_init enables PAPI's thread support, and should be called immediately after PAPI_library_init (a minimal sketch follows below).
MPI: MPI codes are treated very simply - each process has its own address space, and potentially its own hardware counters.
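A minimal sketch of enabling thread support, assuming POSIX threads (the cast of pthread_self to PAPI's expected id-function type is the conventional idiom; the trivial main is only for illustration):

#include <papi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);

    /* register a function that returns a unique id for the calling thread;
       pthread_self serves that purpose for POSIX threads */
    if (PAPI_thread_init((unsigned long (*)(void)) pthread_self) != PAPI_OK) exit(1);

    printf("PAPI thread support enabled\n");
    return 0;
}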

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

97 / 108

High-level Tools

High-level Tools

There are a number of open-source high-level tools that build on the simple approaches we have been discussing. General characteristics found in most (not necessarily all):
Ability to generate and view MPI trace files, leveraging MPI's built-in profiling interface (a minimal wrapper is sketched after this list),
Ability to do statistical profiling (à la gprof) and code viewing for identifying hotspots,
Ability to access performance counters, leveraging PAPI
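To see how such tools hook into MPI's profiling interface, consider this minimal, illustrative wrapper for MPI_Send: the tool provides its own MPI_Send, does its bookkeeping, and forwards to the name-shifted PMPI_Send entry point. The counters here are placeholders, not any particular tool's implementation, and the const qualifier on buf follows MPI-3 (older mpi.h headers declare it as void *).

#include <mpi.h>

static long   send_count = 0;     /* placeholder bookkeeping */
static double send_time  = 0.0;

/* This definition overrides the library's MPI_Send; the real work is
   done by the name-shifted PMPI_Send required by the MPI standard. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_count++;
    return rc;
}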

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

99 / 108

High-level Tools

Popular Examples

Tool Examples

A list of such high-level tool examples (not exhaustive):

TAU, Tuning and Analysis Utilities,
http://www.cs.uoregon.edu/Research/tau/home.php

Open|SpeedShop, funded by the U.S. DOE,
http://www.openspeedshop.org

IPM, Integrated Performance Monitoring,
http://ipm-hpc.sourceforge.net

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

100 / 108

High-level Tools

Specific Example: IPM

Example: IPM

IPM is relatively simple to install and use, so we can easily walk through our favorite example. Note that IPM profiles:
MPI
PAPI
I/O
Memory
Timings: wall, user, and system

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

101 / 108

High-level Tools

Specific Example: IPM

Run and Gather


#PBS -S /bin/bash
#PBS -q debug
#PBS -l walltime=00:20:00
#PBS -l nodes=2:MEM24GB:ppn=8
#PBS -M jonesm@ccr.buffalo.edu
#PBS -m e
#PBS -N test
#PBS -o subMPIP.out
#PBS -j oe
module load intel
module load intel-mpi
module load papi
module list
cd $PBS_O_WORKDIR
which mpiexec
NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
NPROCS=`cat $PBS_NODEFILE | wc -l`
export I_MPI_DEBUG=5
# Use LD_PRELOAD trick to load IPM wrappers at runtime
export LD_PRELOAD=/projects/jonesm/ipm/src/ipm/lib/libipm.so
mpdboot -n $NNODES -f "$UNIQ_HOSTS" -v
mpiexec -np $NPROCS -envall ./laplace_mpi <<EOF
2000
EOF
mpdallexit

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

102 / 108

High-level Tools

Specific Example: IPM

... and the output is a big XML file plus some useful output to standard output:
[k07n14:~/d_laplace/d_ipm]$ file jonesm.1318862245.001449.0
jonesm.1318862245.001449.0: XML document text
[k07n14:~/d_laplace/d_ipm]$ less subMPIP.out
...
##IPMv0.983####################################################################
#
# command   : ./laplace_mpi (completed)
# host      : d16n03/x86_64_Linux        mpi_tasks : 16 on 2 nodes
# start     : 10/17/11/10:37:25          wallclock : 116.170005 sec
# stop      : 10/17/11/10:39:21          %comm     : 13.94
# gbytes    : 2.24606e+00 total          gflop/sec : 5.02520e+00 total
#
##############################################################################
# region    : *    [ntasks] = 16
#

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

103 / 108

High-level Tools

Specific Example: IPM

#                     [total]        <avg>          min            max
# entries             16             1              1              1
# wallclock           1853.71        115.857        115.816        116.17
# user                1853.09        115.818        115.707        115.936
# system              2.18066        0.136291       0.071989       0.198969
# mpi                 259.152        16.197         11.3859        19.1157
# %comm                              13.9425        9.82914        16.5048
# gflop/sec           5.0252         0.314075       0.311741       0.319497
# gbytes              2.24606        0.140379       0.138138       0.170021
#
# PAPI_FP_OPS         5.83778e+11    3.64861e+10    3.62149e+10    3.7116e+10
# PAPI_FP_INS         5.8276e+11     3.64225e+10    3.62144e+10    3.69079e+10
# PAPI_DP_OPS         5.82764e+11    3.64228e+10    3.62144e+10    3.69079e+10
# PAPI_VEC_DP         4.00803e+06    250501         0              4.00803e+06
#
#                     [time]         [calls]        <%mpi>         <%wall>
# MPI_Allreduce       243.838        381520         94.09          13.15
# MPI_Sendrecv        14.9598        763040         5.77           0.81
# MPI_Send            0.339084       15             0.13           0.02
# MPI_Recv            0.0143731      15             0.01           0.00
# MPI_Bcast           0.00124932     16             0.00           0.00
# MPI_Comm_rank       1.58967e-05    16             0.00           0.00
# MPI_Comm_size       8.01496e-06    16             0.00           0.00
###############################################################################

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

104 / 108

High-level Tools

Specific Example: IPM

Script to Generate HTML from XML Results

#!/bin/bash
if [ $# -ne 1 ]; then
  echo "Usage: $0 xml_filename"
  exit
fi
XMLFILE=$1

export IPM_KEYFILE=/projects/jonesm/ipm/src/ipm/ipm_key
export PATH=${PATH}:/projects/jonesm/ipm/src/ipm/bin

/projects/jonesm/ipm/src/ipm/bin/ipm_parse -html $XMLFILE

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

105 / 108

High-level Tools

Specific Example: IPM

[u2:~/d_laplace/d_ipm]$ ./genhtml.sh jonesm.1318862245.001449.0
# data_acquire    = 0 sec
# data_workup     = 0 sec
# mpi_pie         = 1 sec
# task_data       = 0 sec
# load_bal        = 0 sec
# time_stack      = 0 sec
# mpi_stack       = 1 sec
# mpi_buff        = 0 sec
# switch+mem      = 0 sec
# topo_tables     = 0 sec
# topo_data       = 0 sec
# topo_time       = 0 sec
# html_all        = 2 sec
# html_regions    = 0 sec
# html_nonregion  = 1 sec
[u2:~/d_laplace/d_ipm]$ ls -l \
  laplace_mpi_16_jonesm.1318862245.001449.0_ipm_1159896.d15n41.ccr.buffalo.edu/
total 346
-rw-r--r-- 1 jonesm ccrstaff   994 Oct 17 16:07 dev.html
-rw-r--r-- 1 jonesm ccrstaff   104 Oct 17 16:07 env.html
-rw-r--r-- 1 jonesm ccrstaff   347 Oct 17 16:07 exec.html
-rw-r--r-- 1 jonesm ccrstaff   451 Oct 17 16:07 hostlist.html
drwxr-xr-x 2 jonesm ccrstaff   930 Oct 17 16:07 img
-rw-r--r-- 1 jonesm ccrstaff 10550 Oct 17 16:07 index.html
-rw-r--r-- 1 jonesm ccrstaff   387 Oct 17 16:07 map_adjacency.txt
-rw-r--r-- 1 jonesm ccrstaff  8961 Oct 17 16:07 map_calls.txt
-rw-r--r-- 1 jonesm ccrstaff  1452 Oct 17 16:07 map_data.txt
drwxr-xr-x 2 jonesm ccrstaff   803 Oct 17 16:07 pl
-rw-r--r-- 1 jonesm ccrstaff  2620 Oct 17 16:07 task_data
[k07n14:~/d_laplace/d_ipm]$ tar czf my-ipm-files.tgz \
  laplace_mpi_16_jonesm.1318862245.001449.0_ipm_1159896.d15n41.ccr.buffalo.edu/
[k07n14:~/d_laplace/d_ipm]$ ls -l my-ipm-files.tgz
-rw-r--r-- 1 jonesm ccrstaff 71509 Oct 17 16:48 my-ipm-files.tgz

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

106 / 108

High-level Tools

Specific Example: IPM

Visualize Results in Browser


Transfer the compressed tar file to your local machine, unpack it, and browse the IPM-generated HTML pages (starting from index.html) in your browser.

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

107 / 108

High-level Tools

Specific Example: IPM

Summary

Summary of high-level tools:
IPM is pretty easy to use and provides some good functionality
TAU and Open|SpeedShop have steeper learning curves, but much more functionality

M. D. Jones, Ph.D. (CCR/UB)

Application Performance Tuning

HPC-I Fall 2012

108 / 108
