wjb19@psu.edu
Outline
Introduction
Hardware
Definitions
UNIX
Kernel & shell
Files
Permissions
Utilities
Bash Scripting
C programming
HPC systems are composed of:
Software
Hardware — Devices (e.g., disks), Compute elements (e.g., CPU), Shared and/or distributed memory, Communication (e.g., Infiniband network)

An HPC system ...isn't... unless hardware is configured correctly and software leverages all resources made available to it, in an optimal manner. An operating system controls the execution of software on the hardware; HPC clusters almost exclusively use UNIX/Linux

In the computational sciences, we pass data and/or abstractions through a pipeline workflow; UNIX is the natural analogue to this solving/discovery process
UNIX
UNIX is a multi-user/tasking OS created by Dennis Ritchie and Ken Thompson at AT&T Bell Labs 1969-1973, written primarily in the C language (also developed by Ritchie)

UNIX is composed of:
Kernel — the OS itself, which handles scheduling, memory management, I/O etc
Shell (e.g., Bash) — interacts with kernel; command line interpreter
Utilities — programs run by the shell; tools for file manipulation, interaction with the system
Files — everything but process(es), composed of data...
Data-Related Definitions

Binary — most fundamental data representation in computing; base 2 number system (others: hex = base 16, oct = base 8)
Byte — 8 bits = 8b = 1 Byte = 1B; 1 kB = 1024 B; 1 MB = 1024 kB etc
ASCII — American Standard Code for Information Interchange; character encoding scheme, 7 bits (traditional) or 8 bits (UTF-8, a Unicode encoding) per character
Stream — a flow of bytes; source = stdout (& stderr), sink = stdin
Bus — communication channel over which data flows; connects elements within a machine
Process — fundamental unit of computational work performed by a processor; CPU executes application or OS instructions
Node — single computer, composed of many elements; various architectures for CPU, e.g., x86, RISC
More Definitions

Cluster — many nodes connected together via network
Network — communication channel, inter-node; connects machines
Shared Memory — memory region shared within a node
Distributed Memory — memory region across two or more nodes
Direct Memory Access (DMA) — access memory independently of programmed I/O, ie., independent of the CPU
Bandwidth — rate of data transfer across a serial or parallel communication channel, expressed as bits (b) or Bytes (B) per second (s). Beware quotations of bandwidth; many factors e.g., simplex/duplex, peak/sustained, no. of lanes etc. Latency, or the time to create a communication channel, is often more important
Bandwidths

Devices — USB 2: 60 MB/s (version 2.0); Hard disk: 100 MB/s-500 MB/s; PCIe: 32 GB/s (x8, version 2.0)
Networks — 10/100Base-T: 10/100 Mbit/s; 1000Base-T (1 GigE): 1000 Mbit/s; 10 GigE: 10 Gbit/s; Infiniband QDR 4X: 40 Gbit/s
Memory — CPU: ~35 GB/s (Nehalem, 3x 1.3 GHz DIMM/socket); GPU: ~180 GB/s (GeForce GTX 480)

AVOID devices; keep data resident in memory, minimize communication between processes. MANY subtleties to CPU memory management e.g., with 8x CPU cores, total bandwidth may be > 300 GB/s or as little as 10 GB/s; will discuss further
Outline
Introduction
HPC hardware
Definitions
UNIX
Kernel & shell
Files
Permissions
Utilities
Bash Scripting
C programming
[Figure: anatomy of `ls -l` output — permission bits, User ID, Group ID, filename]
We can manipulate files using myriad utilities; these utilities are commands interpreted by the shell and executed by the kernel. To learn more, check man pages ie., from the command line 'man <command>'
File Manipulation I

Working from the command line in a Bash shell:
List directory foo_dir contents, human readable:
Change ownership of foo.xyz to wjb19, group and user:
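The commands themselves did not survive in this copy of the slide; a plausible reconstruction of the two tasks (assuming a directory foo_dir and a file foo.xyz exist, and that you have the privileges chown requires):

```shell
# list contents of foo_dir with human-readable sizes
ls -lh foo_dir

# change user and group ownership of foo.xyz to wjb19
chown wjb19:wjb19 foo.xyz
```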
File Manipulation II

Copy foo.txt from lionga to file /home/bill/foo.txt on dirac :
[wjb19@lionga scratch] $ scp foo.txt \ wjb19@dirac.rcc.psu.edu:/home/bill/foo.txt

Create gzip compressed file archive of directory foo and contents :
Create bzip2 compressed file archive of directory foo and contents :
Unpack compressed file archive :
Edit a text file using VIM :
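The archive commands themselves are missing from this copy; one common form, using GNU tar flags:

```shell
# gzip-compressed archive of directory foo and contents
tar czvf foo.tar.gz foo

# bzip2-compressed archive of directory foo and contents
tar cjvf foo.tar.bz2 foo

# unpack a gzip-compressed archive
tar xzvf foo.tar.gz

# edit a text file (interactive, so shown commented out):
# vim foo.txt
```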
VIM is a venerable and powerful command line editor with a rich set of commands
Save w/o quitting
Save and quit (ie., <shift> AND 'Z' AND 'Z')
Quit w/o saving
Delete x lines e.g., x=10 (also stored)
Yank (copy) x lines e.g., x=10
Split screen/buffer
Switch window/buffer
Go to line x e.g., x=10
Find matching construct (e.g., from { to })

Paste: 'p'; undo: 'u'; redo: '<CNTRL>-r'
Move up/down one screen line: '-' and '+'
Search for expression exp forward '/exp<ENTER>' or backward '?exp<ENTER>' ('n' or 'N' navigate up/down highlighted matches)

Push text from right to left (when right window active and cursor in relevant region) using command 'dp'
Pull text from right to left (when left window active and cursor in relevant region) using command 'do'
Bash Scripting

File and other utilities can be assembled into scripts, interpreted by the shell e.g., Bash. The scripts can be collections of commands/utilities & fundamental programming constructs
Code — Comment
#this is a comment — Code comment
procA | procB — Pipe stdout of procA to stdin of procB
procA > foo.txt — Redirect stdout of procA to file foo.txt*
procA; procB — Command separator
if [condition] then procA fi — If block

Further constructs: display on stdout (echo), variable assignment & literal values, string concatenation, text processing utilities (sed, gawk), search utilities (e.g., grep)
*Streams have file descriptors (numbers) associated with them; e.g., to redirect stderr from procA to foo.txt: procA 2> foo.txt
Text Processing

Text documents are composed of records (roughly speaking, lines separated by carriage returns) and fields (separated by spaces)
Text processing using sed & gawk involves coupling patterns with actions e.g., print field 1 in document foo.txt when encountering word image:
Parse, without case sensitivity, change from default space field separator (FS) to equals sign, print field 2:
[wjb19@lionga scratch] $ gawk 'BEGIN{IGNORECASE=1; FS="=";} \ /image/ {print $2;}' foo.txt
Putting it all together: create a Bash script w/ VIM or other (e.g., Pico)...
Bash Example I
#!/bin/bash #set source and destination paths DIR_PATH=~/scratch/espresso-PRACE/PW BAK_PATH=~/scratch/PW_BAK declare -a file_list #filenames to array file_list=$(ls -l ${BAK_PATH} | gawk '/f90/ {print $9}') cnt=0; #parse files & pretty up for x in $file_list do let "cnt+=1" sed 's/\,\&/\,\ \&/g' $BAK_PATH/$x | \ sed 's/)/)\ /g' | \ sed 's/call/\ call\ /g' | \ sed 's/CALL/\ call\ /g' > $DIR_PATH/$x echo cleaned file no. $cnt $x done exit
Search & replace
Bash Example II
#!/bin/bash if [ $# -lt 6 ] then echo usage: fitCPCPMG.sh '[/path/and/filename.csv] \ [desired number of gaussians in mixture (2-10)] \ [no. random samples (1000-10000)]\ [mcmc steps (1000-30000)]\ [percent noise level (0-10)]\ [percent step size (0.01-20)]\ [/path/to/restart/filename.csv; optional]' exit fi ext=${1##*.} if [ "$ext" != "csv" ] then echo ERROR: file must be *.csv exit fi base=$(basename $1 .csv) if [[ $2 -lt 2 ]] || [[ $2 -gt 10 ]] then echo "ERROR: must specify 2<=x<=10 gaussians in mixture" exit fi
Total arguments
File extension
File basename
Outline
Introduction
HPC hardware
Definitions
UNIX
Kernel & shell
Files
Permissions
Utilities
Bash Scripting
C programming
The C Language

Utilities, user applications and indeed the UNIX OS itself are executed by the CPU, when expressed as machine code e.g., store/load from memory, addition etc. Fundamental operations like memory allocation, I/O etc are laborious to express at this level; most frequently we begin from a high-level language like C. The process of creating an executable consists of at least 3 fundamental steps: creation of a source code text file containing all desired objects and operations, compilation and linking e.g., using the GNU tool gcc to create executable foo.x from source file foo.c: [wjb19@tesla2 scratch]$ gcc -std=c99 foo.c -o foo.x
C Code Elements I

Composed of primitive datatypes (e.g., int, float, long), which have different sizes in memory, multiples of 1 byte

May be composed of statically allocated memory (compile time), dynamically allocated memory (runtime), or both

Pointers (e.g., float *) are primitives with 4 or 8 byte lengths (32-bit or 64-bit machines) which contain an address to a contiguous region of dynamically allocated memory

More complicated objects can be constructed from primitives and arrays e.g., a struct
C Code Elements II

Common operations are gathered into functions, the most common being main(), which must be present in an executable

Functions have a distinct name, take arguments, and return output; this information comprises the prototype, expressed separately to the implementation details, the former often in a header file

The operating system executes compiled code; a running program is a process (more next time)
C Code Example
#include <stdio.h> #include <stdlib.h> #include "allDefines.h" //Kirchoff Migration function in psktmCPU.c void ktmMigrationCPU(struct imageGrid* imageX, struct imageGrid* imageY, struct imageGrid* imageZ, struct jobParams* config, float* midX, float* midY, float* offX, float* offY, float* traces, float* slowness, float* image);
Function prototype; must give arguments, their types and return type; implementation elsewhere
int main() {
  int IMAGE_SIZE = 10;
  float* image = (float*) malloc (IMAGE_SIZE*sizeof(float));
  printf("size of image = %i\n",IMAGE_SIZE);
  for (int i=0; i<IMAGE_SIZE; i++)
    printf("image point %i = %f\n",i,image[i]);
  free(image);
  return 0;
}
Only hand simple command line options to main() using argc, argv[]; in general we wish to handle short and long options (e.g., see GNU coding standards) and the use of getopt_long() is preferable.

Utilize the environment variables of the host shell, particularly in setting runtime conditions in executed code via getenv() e.g., in Bash set in .bashrc config file or via command line:

[wjb19@lionga scratch] $ export MY_STRING=hello

If your project/program requires a) sophisticated objects b) many developers c) would benefit from object oriented design principles, you should consider writing in C++ (although being a higher-level language it is harder to optimize)
Use assert to test validity of function arguments, statements etc; will introduce a performance hit, but asserts can be removed at compile time with the NDEBUG macro (C standard)
(e"ug with gdb& pro#ile with gprof& valgrind0 target most e/pensive #unctions #or optimi0ation
BLAS/LAPACK/ScaLAPACK — Original basic and extended linear algebra routines — http://www.netlib.org/
Intel Math Kernel Library (MKL) — implementation of above routines, w/ solvers, fft etc — http://software.intel.com/en-us/articles/intel-mkl/
AMD Core Math Library (ACML) — Ditto — http://developer.amd.com/libraries/acml/pages/default.aspx
OpenMPI — Open source MPI implementation — http://www.open-mpi.org/
PETSc — Data structures and routines for parallel scientific applications based on PDEs — http://www.mcs.anl.gov/petsc/petsc-as/
UNIX C Compilation I

In general the creation and use of shared libraries (*.so) is preferable to static (*.a), for space reasons and ease of software updates

Use the -fPIC flag in shared library compilation; PIC == position independent, code in the shared object does not depend on the address/location at which it is loaded.

Don't forget to update your PATH and LD_LIBRARY_PATH env vars w/ your binary executable path & any libraries you need/create for the application, respectively
UNIX C Compilation II

Remember in compilation steps to -I/set/header/paths and keep interface (in headers) separate from implementation as much as possible

Remember in linking steps for shared libs to: -L/set/path/to/library AND set flag -lmyLib, where /set/path/to/library/libmyLib.so must exist, otherwise you will have undefined references and/or 'can't find -lmyLib' etc
Compile with -Wall or similar and fix all warnings. Read the manual :)
Conclusions
High Performance Computing systems are an assembly of hardware and software working together, usually based on the UNIX OS; multiple compute nodes are connected together

The UNIX kernel is surrounded by a shell e.g., Bash; commands and constructs may be assembled into scripts

UNIX, associated utilities and user applications are traditionally written in high-level languages like C

HPC user applications may take advantage of shared or distributed memory compute models, or both

Regardless, good code minimizes I/O, keeps data resident in memory for as long as possible and minimizes communication between processes

User applications should take advantage of existing high performance libraries, and tools like gdb, gprof and valgrind
References
Dennis Ritchie, RIP — http://en.wikipedia.org/wiki/Dennis_Ritchie
Advanced Bash scripting guide — http://tldp.org/LDP/abs/html/
Text processing w/ GAWK — http://www.gnu.org/s/gawk/manual/gawk.html
Advanced Linux programming — http://www.advancedlinuxprogramming.com/alp-folder/
Excellent optimization tips — http://www.lri.fr/~bastoul/local_copies/lee.html
GNU compiler collection documents — http://gcc.gnu.org/onlinedocs/
Original RISC design paper — http://www.eecs.berkeley.edu/Pubs/TechRpts/1982/CSD-82-106.pdf
C++ FAQ — http://www.parashift.com/c++-faq-lite/
VIM Wiki — http://vim.wikia.com/wiki/Vim_Tips_Wiki
Exercises
Take supplied code and compile using gcc, creating executable foo.x; attempt to run as './foo.x'. Code has a segmentation fault, an error in memory allocation which is handled via the malloc function. Recompile with debug flag -g, run through gdb and correct the source of the segmentation fault. Load the valgrind module ie., 'module load valgrind' and then run as 'valgrind ./foo.x'; this powerful profiling tool will help identify memory leaks, or memory on the heap which has not been freed

Write a Bash script that stores your home directory file contents in an array and:
Uses sed to swap vowels (e.g., 'a' and 'e') in names
Parses the array of names and returns only a single match, if it exists, else echo NO-MATCH
Launch :

Run w/ command line argument '100' :
Set breakpoint at line 10 in source file :
(gdb) b foo.c:10 Breakpoint 1 at 0x400594: file foo.c, line 10. (gdb) run Starting program: /gpfs/scratch/wjb19/foo.x Breakpoint 1, main () at foo.c:22 22 int IMAGE_SIZE = 10;
(gdb) p image[2] $4 = 0
HPC Essentials
Part II : Elements of Parallelism
Outline
Introduction
Motivation
HPC operations
Multiprocessors
Processes
Memory Digression
Virtual Memory
Cache
Threads
POSIX
OpenMP
Affinity
Motivation
3he pro"lems in science we see- to solve are "ecoming increasingl! large& as we go own in scale $eg,& Auantum chemistr!% or up $eg,& astroph!sics%
3here#ore we want to increase #loating point operations per#orme an memor! "an wi th an thus see- paralleli0ation as we run out o# resources using a single processor
Qe are limite "! .m ahlPs law& an e/pression o# the ma/imum improvement o# parallel co e over serial:
4/$$42P% J P$3% where P is the portion o# application co e we paralleliEe& an 3 is the num"er o# processors ie,& as 3 increases& the portion o# remaining serial co e "ecomes increasingl! e/pensive& relativel! spea-ing
Motivation

Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors
[Figure: improvement factor vs. number of processors (0-256) for P=90%; the curve saturates near 10x]
Nonetheless, for many applications we have a good chance of parallelizing the vast majority of the code...
Seismic trace data acquired over a 2D geometry is integrated to give an image of the earth's interior, using a Green's method

Input is generally 10^4 - 10^6 traces, 10^3 - 10^4 data points each, ie., lots of data to process; the output image is also very large

This is an integral technique (ie., summation, easy to parallelize), just one of many popular algorithms performed in HPC
[Equation: migration sum — the image at a point x (image space) accumulates weight x trace data evaluated at traveltime t over seismic space]
Integration — load/store, add & multiply e.g., transforms
Derivatives (Finite differences) — load/store, subtract & divide e.g., PDE
Linear Algebra — load/store, subtract/add/multiply/divide — chemistry & physics solvers, sparse (classical physics) & dense (quantum)

Regardless of the operations performed, after compilation into machine code, when executed by the CPU, instructions are clocked through a pipeline into registers for execution

Instruction execution generally takes place in four steps, and multiple instruction groups are concurrent within the pipeline; execution rate is a direct function of the clock rate
Execution Pipeline

This is the most fine-grained form of parallelism; its efficiency is a strong function of branch prediction hardware, or the prediction of which instruction in a program is the next to execute*

At a similar level, present in more recent devices are so-called streaming SIMD extension (SSE) registers and associated compute hardware
[Figure: instructions advancing through the pipeline by clock cycle — pending, executing, completed]
SSE

Streaming SIMD (Single Instruction, Multiple Data) computation exploits special registers and instructions to increase computation many-fold in certain cases, since several data elements are operated on simultaneously

Each of 8 SSE registers (labeled xmm0 through xmm7) is 128 bits long, storing 4 x 32-bit floating-point numbers; SSE2 and SSE3 specifications have expanded the allowed datatypes to include doubles, ints etc
[Figure: one xmm register holding four packed 32-bit floats, float3 down to float0]
Operations ma! "e PscalarP or Ppac-P $ie,& vector%& e/presse using intrinsics in __asm "loc- within C co e eg,& addps
operation xmm0, xmm1 — destination operand, source operand
One can either code the intrinsics explicitly, or rely on the compiler, e.g., icc with optimization (-O3)
Multiprocessor Overview

Multiprocessors or multiple core CPUs are becoming ubiquitous; better scaling (cf Moore's law) but limited by contention for shared resources, especially memory
Most commonly we deal with Symmetric Multiprocessors (SMP), with unique cache and registers, as well as shared memory region(s); more on cache in a moment. Memory not necessarily next to processors = Non-uniform Memory Access (NUMA); try to ensure memory access is as local to CPU core(s) as possible

[Figure: two CPUs (CPU0, CPU1), each with its own registers and cache, connected to shared main memory]
The /proc directory on UNIX machines is a special directory written and updated by the kernel, containing information on CPU (/proc/cpuinfo) and memory (/proc/meminfo)

Processes
Application processes are launched on the CPU by the kernel using the fork() system call; every process has a process ID pid, available on UNIX systems via the getpid() system call

The kernel manages many processes concurrently; all information required to run a process is contained in the process control block (PCB) data structure, containing (among other things):

The pid
The address space
I/O information e.g., open files/streams
Pointer to next PCB

Processes may spawn children using the fork() system call; children are initially a copy of the parent, but may take on different attributes via the exec() call
Processes
A child process takes the id of the parent (ppid), and a pid e.g., output from the ps command, describing itself :
[wjb19@tesla1 ~]$ ps -eHo "%P %p %c %t %C" PPID PID COMMAND ELAPSED %CPU 12608 1719 sshd 01:07:54 0.0 1719 1724 sshd 01:07:49 0.0 1724 1725 bash 01:07:48 0.0 1725 1986 ps 00:00 0.0
During a context switch, the kernel will swap one process control block for another; context switches are detrimental to HPC and have one or more triggers, including:
I/O requests
Timer interrupts

Context switching is a very fine-grained form of scheduling; on compute clusters we also have coarse grained scheduling in the form of job scheduling software (more next time)

The unique address space from the perspective of the process is referred to as virtual memory
Virtual Memory
A running process is given memory by the kernel, referred to as virtual memory (VM); this address space does not correspond to physical memory address space

The Memory Management Unit (MMU) on the CPU translates between the two address spaces, for requests made between process and OS

Virtual memory for every process has the same structure, below left; virtual address space is divided into units called pages
[Figure: per-process virtual memory layout, from High Address down to Low Address, with instructions at the bottom]
The MMU is assisted in address translation by the Translation Lookaside Buffer (TLB), which stores page details in a cache

Cache is high speed memory immediately adjacent to the CPU and its registers, connected via bus(es)
In the former case, we are limited by the rate at which instructions can be executed by the CPU. In the latter, we are limited by the rate at which data can be processed by the CPU; data are loaded into cache.

Cache memory is intermediate in the overall hierarchy, lying between CPU registers and main memory. If the executing process requests an address corresponding to data or instructions in cache, we have a 'hit', else a 'miss', and a much slower retrieval of the instruction or data from main memory must take place
... It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines. However, some modern machines have three levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and third-level caches. The reason for this choice is that the L3 cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this cache (e.g. traversing a matrix column-wise with the row length being a power of 2)
Cache Example

The distribution of data to cache levels is largely set by compiler, hardware and kernel; however the programmer is still responsible for the best data access patterns in his/her code possible. Use cachegrind to optimize data alignment & cache usage e.g.,
#include <stdlib.h> #include <stdio.h> int main(){ int SIZE_X,SIZE_Y; SIZE_X=2048; SIZE_Y=2048; float * data = (float*) malloc(SIZE_X*SIZE_Y*sizeof(float)); for (int i=0; i<SIZE_X; i++) for (int j=0; j<SIZE_Y; j++) data[j+SIZE_Y*i] = 10.0f * 3.14f; //bad data access //data[i+SIZE_Y*j] = 10.0f * 3.14f; free(data); return 0; }
Cache : Bad Access
bill@bill-HP-EliteBook-6930p:~$ valgrind --tool=cachegrind ./foo.x
==3088== Cachegrind, a cache and branch-prediction profiler
==3088== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==3088== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==3088== Command: ./foo.x
==3088==
==3088==
==3088== I refs: 50,503,275
==3088== I1 misses: 734
==3088== LLi misses: 733                                        <- instructions
==3088== I1 miss rate: 0.00%
==3088== LLi miss rate: 0.00%
==3088==                    READ Ops       WRITE Ops
==3088== D refs: 33,617,678 (29,410,213 rd + 4,207,465 wr)
==3088== D1 misses: 4,197,161 ( 2,335 rd + 4,194,826 wr)
==3088== LLd misses: 4,196,772 ( 1,985 rd + 4,194,787 wr)       <- data
==3088== D1 miss rate: 12.4% ( 0.0% + 99.6% )
==3088== LLd miss rate: 12.4% ( 0.0% + 99.6% )
==3088==
==3088== LL refs: 4,197,895 ( 3,069 rd + 4,194,826 wr)
==3088== LL misses: 4,197,505 ( 2,718 rd + 4,194,787 wr)        <- lowest level
==3088== LL miss rate: 4.9% ( 0.0% + 99.6% )
Cache Performance

For large data problems, any speedup introduced by parallelization can easily be negated by poor cache utilization

In this case, memory bandwidth is an order of magnitude worse for problem size (2^14)^2 (cf earlier note on widely variable memory bandwidths; we have to work hard to approach peak)
[Figure: execution time (s) vs. problem size — the high-%-miss access pattern is markedly slower]
Outline
Introduction
Motivation
Computational operations
Multiprocessors
Processes
Memory Digression
Virtual Memory
Cache
Threads
POSIX
OpenMP
Affinity
POSIX Threads I

A process may spawn one or more threads; on a multiprocessor, the OS can schedule these threads across a variety of cores, providing parallelism in the form of 'light-weight processes' (LWP)

Whereas a child process receives a copy of the parent's virtual memory and executes independently thereafter, a thread shares the memory of the parent including instructions, and also has private data

Using threads we perform shared memory processing (cf distributed memory, next time)

We are at liberty to launch as many threads as we wish, although as you might expect, performance takes a hit as more threads are launched than can be scheduled simultaneously across available cores
POSIX Threads II

Pthreads refers to the POSIX standard, which is just a specification; implementations exist for various systems

Much like processes, we can monitor thread execution using utilities such as top and ps

The memory shared among threads must be used carefully in order to prevent race conditions, or threads seeing incorrect data during execution, due to more than one thread performing operations on said data in an uncoordinated fashion
Multi-threaded programs must also avoid deadlock, a highly undesirable state where one or more threads await resources, and in turn are unable to offer up resources required by others

Deadlocks can also be avoided through good coding, as well as the use of communication techniques based around semaphores, for example

Threads awaiting resources may sleep (context switch by kernel, slow, saves cycles) or busy wait (executes a while loop or similar checking a semaphore, fast, wastes cycles)
Pthreads Example
#include <pthread.h> #include <stdio.h> #include <stdlib.h> int sum; void *worker(void *param); int main(int argc, char *argv[]){ pthread_t tid; pthread_attr_t attr; if (argc!=2 || atoi(argv[1])<0){ printf("usage : a.out <int value>, where int value > 0\n"); return -1; } pthread_attr_init(&attr); pthread_create(&tid,&attr,worker,argv[1]); pthread_join(tid,NULL); printf("sum = %d\n",sum); } void * worker(void *total){ int upper=atoi(total); sum = 0; for (int i=0; i<upper; i++) sum += i; pthread_exit(0); }
We are primarily concerned with 'simply' dividing a workload among available cores; OpenMP proves much less unwieldy to use
The OpenMP standard is managed by a review board, and is defined by a large number of hardware vendors

Applications written using OpenMP employ pragmas, or statements interpreted by the preprocessor (before compilation), representing functionality like fork & join that would take considerably more effort and care to implement otherwise

OpenMP pragmas or directives indicate parallel sections of code ie., after compilation, at runtime, threads are each given a portion of work e.g., in this case, loop iterations will be divided evenly among running threads :
#pragma omp parallel for for (int i=0; i<SIZE; i++) y[i]=x[i]*10.0f;
OpenMP Clauses I

The number of threads launched during parallel blocks may be set via function calls or by setting the OMP_NUM_THREADS environment variable

Data objects are generally shared by default (loop counters are private by default); a number of pragma clauses are available, which are valid for the scope of the parallel section e.g.:
private
shared
firstprivate — initialize to value before parallel block
lastprivate — variable keeps value after parallel block
reduction — thread-safe way of combining data at conclusion of parallel block
Thread synchronization is implicit to parallel sections; there are a variety of clauses available for controlling this behavior also, including:
critical — one thread at a time works in this section e.g., in order to avoid a race (expensive, design your code to avoid at all costs)
atomic — safe memory updates performed using e.g., mutual exclusion (cost)
barrier — threads wait at this point for others to arrive
OpenMP Clauses II

OpenMP has default thread scheduling behavior handled via the runtime library, which may be modified through use of the schedule(type,chunk) clause, with types :

static — loop iterations are divided among threads equally by default; specifying an integer for the parameter chunk will allocate a number of contiguous iterations to a thread
dynamic — total iterations form a pool, from which threads work on small contiguous subsets until all are complete, with subset size given again by chunk
guided — a large section of contiguous iterations is allocated to each thread dynamically. The section size decreases exponentially with each successive allocation, to a minimum size specified by chunk
An OpenMP pragma :
#pragma omp parallel for //loop over trace records for (int k=0; k<config->traceNo; k++){ //loop over imageX for(int i=0; i<Li; i++){ tempC = ( midX[k] - imageXX[i]-offX[k]) * (midX[k]- imageXX[i]-offX[k]); tempD = ( midX[k] - imageXX[i]+offX[k]) * (midX[k]- imageXX[i]+offX[k]); //loop over imageY for(int j=0; j<Lj; j++){ tempA = tempC + ( midY[k] - imageYY[j]-offY[k]) * (midY[k]- imageYY[j]-offY[k]); tempB = tempD + ( midY[k] - imageYY[j]+offY[k]) * (midY[k]- imageYY[j]+offY[k]); //loop over imageZ for (int l=0; l<Ll; l++){ temp = sqrtf(tauS[l] + tempA * slownessS[l]); temp += sqrtf(tauS[l] + tempB * slownessS[l]); timeIndex = (int) (temp / sRate); if ((timeIndex < config->tracePts) && (timeIndex > 0)){ image[i*Lj*Ll + j*Ll + l] += traces[timeIndex + k * config->tracePts] * temp *sqrtf(tauS[l] / temp); } } //imageZ } //imageY } //imageX }//input trace records
Coverage (Amdahl's law) — as we increase processors, relative cost of the serial code portion increases
Hardware limitations
Locality...
[Figure: execution time (s) vs. number of threads (1-16)]
We can change this by restricting threads to execute on a subset of processors, by setting processor affinity

Simplest approach is to set the environment variable KMP_AFFINITY to: determine the machine topology, assign threads to processors. Usage:
KMP_AFFINITY=[<modifier>]<type>[<permute>][<offset>]
The type settings refer to the nature of the affinity, and may take values :
compact — try to assign thread n+1 context as close as possible to thread n
disabled
explicit — force assignment of threads to processors in proclist
none — just return the topology w/ verbose modifier
scatter — distribute as evenly as possible

fine & thread refer to the same thing, namely that threads only resume in the same context; the core modifier implies that they may resume within a different context, but on the same physical core

CPU affinity can affect application performance significantly and is worth tuning, based on your application and the machine topology...
Without hyperthreading, there is only a single context per core ie., the modifiers thread/fine and core are indistinguishable
[wjb19@hammer16 scratch] $ export KMP_AFFINITY=verbose,none [wjb19@hammer16 scratch] $ ./psktm.x OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids. OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #156: KMP_AFFINITY: 12 available OS procs OMP: Info #157: KMP_AFFINITY: Uniform topology OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores) OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
[wjb19@hammer5 scratch]$ export KMP_AFFINITY=verbose,granularity=fine,compact [wjb19@hammer5 scratch]$ ./psktm.x OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids. OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #156: KMP_AFFINITY: 12 available OS procs OMP: Info #157: KMP_AFFINITY: Uniform topology OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores) OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map: OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 1 OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 8 OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 9 OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 10 OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1 OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2 OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 8 OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 9 OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 10 OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {2} OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {10} OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {6} OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {1} OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {9} OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5} OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {3} OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {11}
Conclusions
Scientific research is supported by computational scaling and performance, both provided by parallelism, limited to some extent by Amdahl's law
Parallelism has various levels of granularity; at the finest level is the instruction pipeline and vectorized registers e.g., SSE
The next level up in parallel granularity is the multiprocessor; we may run many concurrent threads using the pthreads API or the OpenMP standard, for instance
Threads must be coded and handled with care, to avoid race conditions and deadlock
Performance is a strong function of cache utilization; benefits introduced through parallelization can easily be negated by sloppy use of memory bandwidth. Scaling across cores is limited by hardware and Amdahl's law, but also locality; we have some control over the latter using KMP_AFFINITY, for instance
References
Galgrin $"u! the manual& worth ever! penn!% http://valgrin ,org/ Open9P http://openmp,org/wp/ @NU Open9P http://gcc,gnu,org/proOects/gomp/ 'ummar! o# Open9P C,8 C/CJJ '!nta/ http://openmp,org/mp2 ocuments/Open9PC,42CCar ,p # 'ummar! o# Open9P C,8 +ortran '!nta/ http://www,openmp,org/mp2 ocuments/Open9PC,82+ortranCar ,p # Nice ''E tutorial http://neil-emp,us/src/sseStutorial/sseStutorial,html Intel Nehalem http://en,wi-ipe ia,org/wi-i/NehalemST:<microarchitectureT:5 @NU 9a-e http://www,gnu,org/s/ma-e/ Intel h!perthrea ing http://en,wi-ipe ia,org/wi-i/H!per2threa ing
Exercises
Take the supplied code and parallelize using an OpenMP pragma around the worker function. Create a makefile which builds the code, compare timings between serial & parallel by varying OMP_NUM_THREADS. Examine the effect of various settings for KMP_AFFINITY
HPC Essentials
Part III : Message Passing Interface
Outline
Motivation; Interprocess Communication: Signals, Sockets & Networks, procfs Digression; Message Passing Interface: Send/Receive Communication, Parallel Constructs, Grouping Data, Communicators & Topologies
Motivation
We saw last time that Amdahl's law implies an asymptotic limit to performance gains from parallelism, where parallel P and serial (1-P) code portions have fixed relative cost
We looked at threads ("light-weight processes") and also saw that performance depends on a variety of things, including good cache utilization and affinity
+or the pro"lem siEe investigate & ultimatel! the limiting #actor was isI/O& there was no sense going "e!on a single compute no e0 in a machine with 46 cores or more& there is no point when P R 68T& shoul the process have su##icient memor!
However, as we increase our problem size, the relative parallel/serial cost changes and P can approach 1
Motivation
In the limit as processors N → ∞ we find the maximum performance improvement: 1/(1-P). It is helpful to see the 3 dB points for this limit i.e., the number of processors N_1/2 required to achieve (1/2)·max = 1/(2(1-P)); equating with Amdahl's law & after some algebra: N_1/2 = P/(1-P)
[Figure: N_1/2 (0 to 150) versus parallel code fraction P (0.9 to 0.99), rising steeply as P approaches 1]
Motivation
Points to note from the graph: for P > 0.90, we can benefit from ~20 cores; for P > 0.99, we can benefit from a cluster size of ~256 cores; as P → 1, we approach the "embarrassingly parallel" limit; as P → 1, performance improvement is directly proportional to cores; P = 1 implies independent or batch processes
Quite aside from considerations of Amdahl's law, as the problem size grows, we may simply exceed the memory available on a single node
In this case, we must move to a distributed memory processing model/multiple nodes (unless P = 1 of course)
Profiling with Valgrind
[wjb19@lionxf scratch]$ valgrind --tool=callgrind ./psktm.x
[wjb19@lionxf scratch]$ callgrind_annotate --inclusive=yes callgrind.out.3853
Profile data file 'callgrind.out.3853' (creator: callgrind-3.5.0)
Timerange: Basic block 0 - 2628034011
Trigger: Program termination
Profiled target: ./psktm.x (PID 3853, part 1)
20,043,133,545 PROGRAM TOTALS
Ir file:function
20,043,133,545 ???:0x0000003128400a70 [/lib64/ld-2.5.so]
20,042,523,959 ???:0x0000000000401330 [/gpfs/scratch/wjb19/psktm.x]
20,042,522,144 ???:(below main) [/lib64/libc-2.5.so]
20,042,473,687 /gpfs/scratch/wjb19/demoA.c:main
20,042,473,687 demoA.c:main [/gpfs/scratch/wjb19/psktm.x]
19,934,044,644 psktmCPU.c:ktmMigrationCPU [/gpfs/scratch/wjb19/psktm.x]
19,934,044,644 /gpfs/scratch/wjb19/psktmCPU.c:ktmMigrationCPU
6,359,083,826 ???:sqrtf [/gpfs/scratch/wjb19/psktm.x]
4,402,442,574 ???:sqrtf.L [/gpfs/scratch/wjb19/psktm.x]
104,966,265 demoA.c:fileSizeFourBytes [/gpfs/scratch/wjb19/psktm.x]

The parallelizable worker function (ktmMigrationCPU) is 99.5% of total instructions executed
If we wish to scale outside a single node, we must use some form of interprocess communication
Inter-Process Communication
There are a variety of ways for processes to exchange information, including: memory (see last week), files, pipes (named/anonymous), signals, sockets, message passing. File I/O is too slow, and reads/writes are liable to race conditions
Anonymous & named pipes are highly efficient but FIFO (first in, first out) buffers, allowing only unidirectional communication, and between processes on the same node
Signals are a very limited form of communication, sent to the process after an interrupt by the kernel, and handled using a default handler or one specified using the signal() system call
Signals may come from a variety of sources e.g., segmentation fault (SIGSEGV), keyboard interrupt Ctrl-C (SIGINT) etc
Signals
strace is a powerful utility in UNIX which shows the interaction between a running process and the kernel in the form of system calls and signals; here, a partial output showing the mapping of signals to defaults with the system call sigaction(), from ./psktm.x :
UNIX signals
rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGILL, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGABRT, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGFPE, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGSEGV, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0
Signals are crude and restricted to local communication; to communicate remotely, we can establish a socket between processes, and communicate over the network
Sockets & Networks
Davies/Baran first devised packet switching, an efficient means of communication over a channel; a computer was conceived to realize their design and ARPANET went online Oct 1969 between UCLA and Stanford
3CP/IP "ecame the communication protocol o# .RP.NE3 4 Nan 45<C& which was retire in 4558 an N+'NE3 esta"lishe 0 universit! networ-s in the U' an Europe Ooin
TCP/IP is just one of many protocols, which describes the format of data packets, and the nature of the communication; an analogous connection method is used by InfiniBand networks in conjunction with Remote Direct Memory Access (RDMA)
Unrelia"le (atagram Protocol $U(P% is analogous to a connectionless metho o# communication use "! In#ini"an high per#ormance networ-s
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    //creates an endpoint & returns file descriptor
    //uses IPv4 domain, datagram type, UDP transport
    int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);

    //socket address object (sa) and memory buffer
    struct sockaddr_in sa;
    char buffer[1024];
    ssize_t recsize;
    socklen_t fromlen;

    //specify same domain type, any input address and port 7654 to listen on
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = INADDR_ANY;
    sa.sin_port = htons(7654);
    fromlen = sizeof(sa);

    //bind to the port and block on receive (completing the fragment)
    bind(sock, (struct sockaddr *)&sa, sizeof sa);
    recsize = recvfrom(sock, buffer, sizeof buffer, 0,
                       (struct sockaddr *)&sa, &fromlen);
    printf("received %zd bytes\n", recsize);
    close(sock);
    return 0;
}
You can monitor sockets by using the netstat facility, which takes its data from /proc/net
Outline
Motivation; Interprocess Communication: Signals, Sockets & Networks, procfs Digression; Message Passing: Send/Receive Communication, Parallel Constructs, Grouping Data, Communicators & Topologies
procfs
We mentioned the /proc directory previously in the context of cpu and memory information; it is frequently referred to as the proc filesystem or procfs
It is a veritable treasure trove of information, written periodically by the kernel, and is used by a variety of tools e.g., ps
Each directory contains text files and subdirectories with every detail of a running process, including context switching statistics, memory management, open file descriptors and much more
Much like the ptrace() system call, procfs also gives user applications the ability to directly manipulate running processes, given sufficient permission; you can explore that on your own :)
procfs : examples
/proc/PID/cmdline : command used to launch process
/proc/PID/cwd : current working directory
/proc/PID/environ : environment variables for the process
/proc/PID/fd : directory with a symbolic link for each open file descriptor e.g., streams
/proc/PID/status : information including signals, state, memory usage
/proc/PID/maps : mapped memory regions of the process address space
[wjb19@hammer1 fd]$ ls -lah total 0 dr-x------ 2 wjb19 wjb19 0 Dec 7 12:13 . dr-xr-xr-x 6 wjb19 wjb19 0 Dec 7 12:10 .. lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 0 -> /dev/pts/28 lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 1 -> /dev/pts/28 lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 2 -> /dev/pts/28 lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 3 -> /gpfs/scratch/wjb19/inputDataSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 4 -> /gpfs/scratch/wjb19/inputSrcXSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 5 -> /gpfs/scratch/wjb19/inputSrcYSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 6 -> /gpfs/scratch/wjb19/inputRecXSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 7 -> /gpfs/scratch/wjb19/inputRecYSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 8 -> /gpfs/scratch/wjb19/velModel.bin
Outline
Motivation; Interprocess Communication: Signals, Sockets & Networks, procfs Digression; Message Passing Interface: Send/Receive Communication, Parallel Constructs, Grouping Data, Communicators & Topologies
Multiple Instruction, Multiple Data (MIMD) system = connected processes are asynchronous, generally distributed memory (may also be shared where processes are on a single node)
MIMD processors are connected in some network topology; we don't have to worry about the details, MPI abstracts this away
MPI is a standard for parallel programming first established in 1991, updated occasionally, by academics and industry
It comprises routines for point-to-point and collective communication, with bindings to C/C++ and Fortran
Depending on the underlying network fabric, communication may be TCP or UDP-like in InfiniBand networks
One may send information referencing process rank e.g.,:

MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

The arguments are the buffer address, count, datatype, rank of receiver, tag, and communicator
This function has a receive analogue; both routines are blocking by default
Send/receive statements generally occur in the same code; processors execute the appropriate statement according to rank & code branch
Non2"loc-ing #unctions availa"le& allows communicating processes to continue with e/ecution where a"le
MPI_Init(&argc, &argv); //Start up
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); //My rank
MPI_Comm_size(MPI_COMM_WORLD, &p); //No. processors
MPI_Finalize(); //close up shop

MPI_COMM_WORLD is a communicator parameter, a collection of processes that can send messages to each other.
Messages are sent with tags to identify them, allowing specificity beyond using just a source/destination parameter
MPI : Datatypes
MPI_CHAR : signed char
MPI_SHORT : signed short int
MPI_INT : signed int
MPI_LONG : signed long int
MPI_UNSIGNED_CHAR : unsigned char
MPI_UNSIGNED_SHORT : unsigned short int
MPI_UNSIGNED : unsigned int
MPI_UNSIGNED_LONG : unsigned long int
MPI_FLOAT : float
MPI_DOUBLE : double
MPI_LONG_DOUBLE : long double
MPI_BYTE
MPI_PACKED
A communication pattern involving all processes in a communicator is a collective communication e.g., a broadcast. The same data is sent to every process in the communicator, more efficient than using multiple p2p routines, and optimized:
MPI_Bcast(void* message, int count, MPI_Datatype type, int root, MPI_Comm comm)
Sends a copy of the data in message from the root process to all in comm, a one-to-many (map) operation. Collective communication is at the heart of efficient parallel operations
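A broadcast sketch under the same assumptions (an MPI installation, launched via mpirun with any number of ranks):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double params[3] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                /* only root has the data initially */
        params[0] = 1.0; params[1] = 2.0; params[2] = 3.0;
    }
    /* every rank passes the same buffer; root's contents arrive everywhere */
    MPI_Bcast(params, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("rank %d has params[2] = %f\n", rank, params[2]);
    MPI_Finalize();
    return 0;
}
```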
MPI_Reduce(void* operand, void* result, int count, MPI_Datatype type, MPI_Op operator, int root, MPI_Comm comm)
Com"ines all operand, using operator an stores result on process root, in result . tree2structure re uce at all no es == MPI_Allreduce,ie,& ever! process in comm gets a cop! o# the result
[Figure: tree reduction; operands from processes 1, 2, 3, ..., p-1 are combined pairwise down to the root]
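A reduction sketch under the same assumptions: each rank contributes rank+1 and root 0 receives the sum; swapping MPI_Reduce for MPI_Allreduce (and dropping the root argument) would leave the total on every rank:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1, total = 0;
    /* combine the local operands with MPI_SUM onto root 0 */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of 1..%d = %d\n", size, total);
    MPI_Finalize();
    return 0;
}
```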
Reduction Ops
MPI_MAX : maximum
MPI_MIN : minimum
MPI_SUM : sum
MPI_PROD : product
MPI_LAND : logical and
MPI_BAND : bitwise and
MPI_LOR : logical or
MPI_BOR : bitwise or
MPI_LXOR : logical XOR
MPI_BXOR : bitwise XOR
MPI_MAXLOC : max w/ location
MPI_MINLOC : min w/ location
Bulk transfers of many-to-one and one-to-many are accomplished by gather and scatter operations respectively. These operations form the kernel of matrix/vector operations for example; they are useful for distributing and reassembling arrays
[Figure: gather collects x0, x1, x2, x3 from individual processes onto one; scatter distributes row elements a00, a01, a02, a03 from one process among all]
Scatter/Gather Syntax
MPI_Gather(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)

Collects data referenced by send_data from each process in comm and stores the data in process rank order on the process with rank root, in memory referenced by recv_data.

MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_type, void* recv_data, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)

Splits data referenced by send_data on the process with rank root into segments of send_count elements each, with send_type, & distributes them in order to the processes. For gathering the result to ALL processes: MPI_Allgather
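A scatter/gather round trip under the same assumptions (MPI installed, launched via mpirun): root distributes one element per rank, each rank works on its element, root reassembles:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *row = NULL, *out = NULL, elem;
    if (rank == 0) {                       /* only root owns the full arrays */
        row = malloc(size * sizeof *row);
        out = malloc(size * sizeof *out);
        for (int i = 0; i < size; i++) row[i] = (float)i;
    }
    /* one element per process, distributed in rank order */
    MPI_Scatter(row, 1, MPI_FLOAT, &elem, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
    elem *= 2.0f;                          /* local work on the segment */
    MPI_Gather(&elem, 1, MPI_FLOAT, out, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("out[%d] = %f\n", size - 1, out[size - 1]);
        free(row); free(out);
    }
    MPI_Finalize();
    return 0;
}
```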
Grouping Data I
Communication is expensive, so bundle variables into a single message. We must define a derived type that can describe the heterogeneous contents of a message using type and displacement pairs. There are several ways to build this MPI_Datatype e.g.,
MPI_Type_struct(int count,
    int block_lengths[],       //contains no. entries in each block
    MPI_Aint displacements[],  //element offset from msg start
    MPI_Datatype typelist[],   //exactly that
    MPI_Datatype* new_mpi_t)   //a pointer to this new type

MPI_Aint allows for addresses wider than int
A very general derived type, although arrays of structs must be constructed explicitly using other MPI commands. Simpler types suffice when data is less heterogeneous e.g., MPI_Type_vector, MPI_Type_contiguous, MPI_Type_indexed
Grouping Data II
Before these derived types can be used by a communication function, they must be committed with an MPI_Type_commit function call. In order for a message to be received, type signatures at send and receive must be compatible; in a collective communication, signatures must be identical. MPI_Pack & MPI_Unpack are useful when messages of heterogeneous data are infrequent, and the cost of constructing a derived type outweighs the benefit. These methods also allow buffering in user versus system memory, and the number of items transmitted is in the message itself. Grouped data allows for sophisticated objects; we can also create more fine-grained communication objects
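A sketch of the commit requirement using the simplest derived type, MPI_Type_contiguous (same assumptions: MPI installed, launched via mpirun):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    float xyz[3] = {0};
    MPI_Datatype triple;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_contiguous(3, MPI_FLOAT, &triple); /* three floats as one unit */
    MPI_Type_commit(&triple);                   /* required before any use  */

    if (rank == 0) { xyz[0] = 1; xyz[1] = 2; xyz[2] = 3; }
    /* count is 1: one 'triple', not three floats */
    MPI_Bcast(xyz, 1, triple, 0, MPI_COMM_WORLD);
    printf("rank %d: %.0f %.0f %.0f\n", rank, xyz[0], xyz[1], xyz[2]);

    MPI_Type_free(&triple);
    MPI_Finalize();
    return 0;
}
```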
Communicators
Process su"sets or groups e/pan communication "e!on simple p:p an "roa cast communication& to create :
Communicators/groups are opaque, their internals not directly accessible; these objects are referenced by a handle
Communicators Cont.
Internal contents are manipulated by methods, much like private data in C++ class objects e.g.,

int MPI_Group_incl(MPI_Group old_group, int new_group_size, int ranks_in_old_group[], MPI_Group* new_group)

creates a new_group from old_group, using ranks_in_old_group[] etc
int MPI_Comm_create(MPI_Comm old_comm, MPI_Group new_group, MPI_Comm* new_comm)

creates a new communicator from the old, with context
MPI_Comm_group and MPI_Group_incl are local methods without communication; MPI_Comm_create is a collective communication implying synchronization i.e., to establish a single context. Multiple communicators may be created simultaneously using MPI_Comm_split
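A sketch of MPI_Comm_split (same assumptions as the previous examples; the color rank/4, grouping processes in fours, is an illustrative choice):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int world_rank, row_rank;
    MPI_Comm row_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* ranks sharing a color land in the same new communicator;
       the key (world_rank) sets their ordering within it */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / 4, world_rank, &row_comm);
    MPI_Comm_rank(row_comm, &row_rank);
    printf("world rank %d -> row rank %d\n", world_rank, row_rank);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
```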
Topologies I
MPI allows one to associate different addressing schemes with processes within a group. This is a virtual versus real or physical topology, and is either a graph structure or a (Cartesian) grid; properties: dimensions, with size of each; period of each; option to have processes reordered optimally within the grid. Method to establish the Cartesian grid cart_comm:
int MPI_Cart_create(MPI_Comm old_comm, int number_of_dims, int dim_sizes[], int wrap_around[], int reorder, MPI_Comm* cart_comm)
Topologies II
cart_comm will contain the processes from old_comm with associated coordinates, available from MPI_Cart_coords:

int coordinates[2];
int my_grid_rank;
MPI_Comm_rank(cart_comm, &my_grid_rank);
MPI_Cart_coords(cart_comm, my_grid_rank, 2, coordinates);
The call to MPI_Comm_rank is necessary because of process rank reordering (optimization). Processes in cart_comm are stored in row major order. Can also partition into sub-grid(s) using MPI_Cart_sub e.g., for a row:
int free_coords[2]; MPI_Comm row_comm; //new sub-grid free_coords[0]=0; //bool; first coordinate fixed free_coords[1]=1; //bool; second coordinate free MPI_Cart_sub(cart_comm,free_coords,&row_comm);
Writing Parallel Code
Assuming we've profiled our code and decided to parallelize, equipped with MPI routines, we must decide whether to take a domain/data parallel (divide data, similar tasks) or task parallel (divide tasks, similar data) approach
Data parallel in general scales much better, and implies lower communication overhead. Regardless, it is easiest to begin by selecting or designing data structures, and subsequently their distribution using a constructed topology or scatter/gather routines, for example. Program in modules, beginning with the easiest/essential functions (e.g., I/O), relegating 'hard' functionality to stubs initially. Time code sections, look at targets for optimization & redesign. Only concern yourself with the highest levels of abstraction germane to your problem, use parallel constructs wherever possible
As application developers and/or scientists, we need only be concerned with layers 4 and above
OSI layers and functions:
7 Application : process accessing network
6 Presentation : encrypt/decrypt, data conversion
5 Session : management
4 Transport : reliability & flow control
3 Network : path
2 Data link : addressing
1 Physical : signals/electrical
Conclusions
We can determine the parallel portion of our code through profiling; as a rule of thumb a code with P ~ 99% can effectively utilize about 256 cores, a code with P ~ 90% about 20 cores
When the parallel portion of code approaches 90%, we can justify going outside the multi-core node and using some form of inter-process communication (IPC)
IPC comes in a variety of forms e.g., sockets connected over networks, signals between processes on a single machine
The Message Passing Interface (MPI) abstracts away details of IPC used over networks, providing language bindings to C, Fortran etc
9PI has a num"er o# highl! optimiEe collective communication an parallel constructs& sophisticate means o# grouping o"Oects& as well as computational topologies
The OSI model assigns various communication entities to one of seven layers; we need only be concerned with layer four and above
http://www.cs.usfca.edu/~peter/ppmpi/
Valgrind (no really, buy the manual) http://valgrind.org/
UNIX signals http://www.cs.pitt.edu/~alanjawi/cs449/code/shell/UnixSignals.htm
Open MPI http://open-mpi.org/
procfs http://www.kernel.org/doc/man-pages/online/pages/man5/proc.5.html
Excellent article on ptrace http://linuxgazette.net/81/sandeep.html
Kernel vulnerabilities associated with ptrace/procfs http://www.kb.cert.org/vuls/id/176888
MPI tutorials http://www.mcs.anl.gov/research/projects/mpi/learning.html
Linux Gazette articles http://linuxgazette.net
Open Systems Interconnection http://en.wikipedia.org/wiki/OSI_Reference_Model
PBS reference http://rcc.its.psu.edu/user_guides/system_utilities/pbs/
Exercises
Build the supplied MPI code via 'make -f Makefile_' and submit to the cluster of your choice using the following PBS script. Compare scaling with the OpenMP example from last week, by varying both nodes and procs per node (ppn); differences? (NUMA vs good locality with MPI)
Sketch how the gather function is collecting data, and the root process subsequently writes out to disk. Similarly, sketch how the image data exists in memory; are the two pictures commensurate? (hint: no :)) Re-assign image grid tiles to processes such that no file manipulation is required after program completion
cd /gpfs/home/wjb19/scratch
module load openmpi/gnu/1.4.2
mpirun ./psktm.x

Submit to cluster:

[wjb19@lionxf scratch]$ qsub foo.pbs
MPI programs are readily debugged using serial tools like gdb, once you have compiled with -g, submitted your job and been assigned nodes
Once attached and working with gdb, you can set some breakpoints and alter the parameter i (e.g., set var i=7) to move out of the loop