HPC Essentials

Part I : UNIX/C Overview

Bill Brouwer, Research Computing and Cyberinfrastructure (RCC), PSU

wjb19@psu.edu

Outline
Introduction; Hardware; Definitions; UNIX: Kernel & shell, Files, Permissions, Utilities; Bash Scripting; C programming

HPC Introduction

HPC systems are composed of: Software; Hardware: Devices (e.g., disks), Compute elements (e.g., CPU), Shared and/or distributed memory, Communication (e.g., Infiniband network)

A HPC system ...isn't... unless hardware is configured correctly and software leverages all resources made available to it, in an optimal manner. An operating system controls the execution of software on the hardware; HPC clusters almost exclusively use UNIX/Linux

In the computational sciences, we pass data and/or abstractions through a pipeline workflow; UNIX is the natural analogue to this solving/discovery process

UNIX

UNIX is a multi-user/tasking OS created by Dennis Ritchie and Ken Thompson at AT&T Bell Labs 1969-1970, written primarily in the C language (also developed by Ritchie)

UNIX is composed of:
Kernel - the OS itself, which handles scheduling, memory management, I/O etc.
Shell (e.g., Bash) - interacts with the kernel; command line interpreter
Utilities - programs run by the shell, tools for file manipulation, interaction with the system
Files - everything but process(es), composed of data...

Data-Related Definitions

Binary - most fundamental data representation in computing, base 2 number system (others: hex = base 16, oct = base 8)
Byte - 8 bits = 8b = 1 Byte = 1B; 1kB = 1024 B; 1MB = 1024 kB etc
ASCII - American Standard Code for Information Interchange; character encoding scheme, 7 bits (traditional) or 8 bits (UTF-8) per character, a Unicode encoding
Stream - a flow of bytes; source = stdout (& stderr), sink = stdin
Bus - communication channel over which data flows, connects elements within a machine
Process - fundamental unit of computational work performed by a processor; CPU executes application or OS instructions
Node - single computer, composed of many elements, various architectures for CPU e.g., x86, RISC

Typical Compute Node (Intel i7)

[Diagram: CPU connected via QuickPath Interconnect to the IOH (PCI-express cards, GPU); IOH connected via Direct Media Interface to the ICH (ethernet, PCI-e cards); memory bus to volatile storage; SATA/USB devices and BIOS as non-volatile storage]

More Definitions

Cluster - many nodes connected together via network
Network - communication channel, inter-node; connects machines
Shared Memory - memory region shared within a node
Distributed Memory - memory region across two or more nodes
Direct Memory Access (DMA) - access memory independently of programmed I/O i.e., independent of the CPU
Bandwidth - rate of data transfer across a serial or parallel communication channel, expressed as bits (b) or Bytes (B) per second (s). Beware quotations of bandwidth; many factors e.g., simplex/duplex, peak/sustained, no. of lanes etc. Latency, or the time to create a communication channel, is often more important

Bandwidths

Devices - USB: 60 MB/s (version 2.0); Hard Disk: 100-500 MB/s; PCIe: ~32 Gb/s (x8, version 2.0)
Networks - 10/100Base-T: 10/100 Mbit/s; 1000Base-T (1GigE): 1000 Mbit/s; 10 GigE: 10 Gbit/s; Infiniband QDR 4X: 40 Gbit/s
Memory - CPU: ~35 GB/s (Nehalem, 3x 1.3GHz DIMM/socket)*; GPU: ~180 GB/s (GeForce GTX 480)

AVOID devices, keep data resident in memory, minimize communication btwn processes

MANY subtleties to CPU memory management e.g., with 8x CPU cores, total bandwidth may be > 300 GB/s or as little as 10 GB/s; will discuss further

*http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations

Outline

Introduction; HPC hardware; Definitions; UNIX: Kernel & shell, Files, Permissions, Utilities; Bash Scripting; C programming

UNIX Permissions & Files

At the highest level, UNIX objects are either files or processes, and both are protected by permissions (processes next time). Every file object has two ID's, the user and group, both assigned on creation; only the root user has unrestricted access to everything. Files also have bits which specify read (r), write (w) and execute (x) permissions for the user, group and others e.g., output of the ls command:

-rw-r--r-- 1 root root 0 Jun 11 1976 /usr/local/foo.txt


(fields: permission bits for user/group/others; User ID; Group ID; filename)

We can manipulate files using myriad utilities; these utilities are commands interpreted by the shell and executed by the kernel. To learn more, check the man pages i.e., from the command line 'man <command>'

File Manipulation I

Working from the command line in a Bash shell. List directory foo_dir contents, human readable; change ownership of foo.xyz to wjb19, group and user:

[wjb19@lionga scratch] $ ls -lah foo_dir

[wjb19@lionga scratch] $ chown wjb19:wjb19 foo.xyz

Add execute permission to foo.xyz:

[wjb19@lionga scratch] $ chmod +x foo.xyz

Determine filetype for foo.xyz:

[wjb19@lionga scratch] $ file foo.xyz

Peruse text file foo.xyz:

[wjb19@lionga scratch] $ more foo.xyz

File Manipulation II

Copy foo.txt from lionga to file /home/bill/foo.txt on dirac:

[wjb19@lionga scratch] $ scp foo.txt \
wjb19@dirac.rcc.psu.edu:/home/bill/foo.txt

Create a gzip compressed file archive of directory foo and contents; create a bzip2 compressed file archive of directory foo and contents; unpack a compressed file archive; edit a text file using VIM:

[wjb19@lionga scratch] $ tar -czf foo_archive.tgz foo/*

[wjb19@lionga scratch] $ tar -cjf foo_archive.tbz foo/*

[wjb19@lionga scratch] $ tar -xvf foo_archive.tgz

[wjb19@lionga scratch] $ vim foo.txt

VIM is a venerable and powerful command line editor with a rich set of commands

Text File Edit w/ VIM

Two main modes of operation: editing or command. From command, switch to edit by issuing 'a' (insert after cursor) or 'i' (before); switch back to command via <ESC>

Save w/o quitting: :w<ENTER>
Save and quit (i.e., <shift> AND 'Z' AND 'Z'): :wq<ENTER>
Quit w/o saving: :q!<ENTER>
Delete x lines e.g., x=10 (also stored in clipboard): d10d
Yank (copy) x lines e.g., x=10: y10y
Split screen/buffer: :split<ENTER>
Switch window/buffer: <CNTRL>-w-w
Go to line x e.g., x=10: :10<ENTER>
Find matching construct (e.g., from { to }): %
Paste: 'p'; undo: 'u'; redo: '<CNTRL>-r'
Move up/down one screen line: '-' and '+'
Search for expression exp forward '/exp<ENTER>' or backward '?exp<ENTER>' ('n' or 'N' navigate up/down highlighted matches)

Text File Compare w/ VIMDIFF

Same commands as VIM, but highlights differences in files, and allows transfer of text btwn buffers/files; launch with 'vimdiff foo.txt foo2.txt'

Push text from right to left (when right window active and cursor in relevant region) using command 'dp'. Pull text from right to left (when left window active and cursor in relevant region) using command 'do'

Bash Scripting

File and other utilities can be assembled into scripts, interpreted by the shell e.g., Bash. The scripts can be collections of commands/utilities & fundamental programming constructs

Code comment: #this is a comment
Pipe stdout of procA to stdin of procB: procA | procB
Redirect stdout of procA to file foo.txt*: procA > foo.txt
Command separator: procA; procB
If block: if [condition] then procA fi
Display on stdout: echo hello
Variable assignment & literal value: a=foo; echo $a
Concatenate strings: b=$a.foo2
Text processing utilities: sed, gawk
Search utilities: find, grep

*Streams have file descriptors (numbers) associated with them; e.g., to redirect stderr from procA to foo.txt: procA 2> foo.txt

Text Processing

Text documents are composed of records (roughly speaking, lines separated by carriage returns) and fields (separated by spaces)

Text processing using sed & gawk involves coupling patterns with actions e.g., print field 1 in document foo.txt when encountering the word image:

[wjb19@lionga scratch] $ gawk '/image/ {print $1;}' foo.txt

(pattern: /image/; action: {print $1;}; input: foo.txt)

Parse, without case sensitivity, change from the default space field separator (FS) to the equals sign, print field 2:

[wjb19@lionga scratch] $ gawk 'BEGIN{IGNORECASE=1; FS="=";} \
/image/ {print $2;}' foo.txt

Putting it all together - create a Bash script w/ VIM or other (e.g., Pico)...

Bash Example I

#!/bin/bash
#set source and destination paths
DIR_PATH=~/scratch/espresso-PRACE/PW
BAK_PATH=~/scratch/PW_BAK
declare -a file_list
#filenames to array
file_list=$(ls -l ${BAK_PATH} | gawk '/f90/ {print $9}')
cnt=0;
#parse files & pretty up
for x in $file_list
do
    let "cnt+=1"
    sed 's/\,\&/\,\ \&/g' $BAK_PATH/$x | \
    sed 's/)/)\ /g' | \
    sed 's/call/\ call\ /g' | \
    sed 's/CALL/\ call\ /g' > $DIR_PATH/$x
    echo cleaned file no. $cnt $x
done
exit

Run using "ash

(eclare an arra! Comman output

'earch * replace

wjb19@psu.edu

Bash Example II

#!/bin/bash
if [ $# -lt 6 ]
then
    echo usage: fitCPCPMG.sh '[/path/and/filename.csv] \
    [desired number of gaussians in mixture (2-10)] \
    [no. random samples (1000-10000)] \
    [mcmc steps (1000-30000)] \
    [percent noise level (0-10)] \
    [percent step size (0.01-20)] \
    [/path/to/restart/filename.csv; optional]'
    exit
fi
ext=${1##*.}
if [ "$ext" != "csv" ]
then
    echo ERROR: file must be *.csv
    exit
fi
base=$(basename $1 .csv)
if [[ $2 -lt 2 ]] || [[ $2 -gt 10 ]]
then
    echo "ERROR: must specify 2<=x<=10 gaussians in mixture"
    exit
fi

(Slide annotations: total arguments $#; file extension ${1##*.}; file basename.)

Outline

Introduction; HPC hardware; Definitions; UNIX: Kernel & shell, Files, Permissions, Utilities; Bash Scripting; C programming

The C Language

Utilities, user applications and indeed the UNIX OS itself are executed by the CPU, when expressed as machine code e.g., store/load from memory, addition etc. Fundamental operations like memory allocation, I/O etc. are laborious to express at this level; most frequently we begin from a high-level language like C. The process of creating an executable consists of at least 3 fundamental steps: creation of a source code text file containing all desired objects and operations, compilation and linking e.g., using the GNU tool gcc to create executable foo.x from source file foo.c:

[wjb19@tesla2 scratch]$ gcc -std=c99 foo.c -o foo.x

(*C99 standard; pipeline: source .c file - compile - object .o code - link with library objects - executable)

C Code Elements I

Composed of primitive datatypes (e.g., int, float, long), which have different sizes in memory, multiples of 1 byte

May be composed of statically allocated memory (compile time), dynamically allocated memory (runtime), or both

Pointers (e.g., float *) are primitives with 4 or 8 byte lengths (32-bit or 64-bit machines) which contain an address to a contiguous region of dynamically allocated memory

More complicated objects can be constructed from primitives and arrays e.g., a struct
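A brief sketch of such a composite type (the struct and its fields are invented for illustration):

#include <stdio.h>

//a struct groups primitives and arrays into one object
struct imagePoint {
    int   ix, iy, iz;    //grid indices
    float amplitude;     //sample value
    float weights[4];    //per-neighbor weights
};

int main(void) {
    struct imagePoint p = { 0, 1, 2, 3.14f, { 0.25f, 0.25f, 0.25f, 0.25f } };
    printf("point (%d,%d,%d) amplitude %f\n", p.ix, p.iy, p.iz, p.amplitude);
    return 0;
}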


C Code Elements II

Common operations are gathered into functions, the most common being main(), which must be present in an executable

Functions have a distinct name, take arguments, and return output; this information comprises the prototype, expressed separately to the implementation details, the former often in a header file

Important system functions include read, write, printf (I/O) and malloc, free (Memory)

The operating system executes compiled code; a running program is a process (more next time)

C Code Example

#include <stdio.h>
#include <stdlib.h>
#include "allDefines.h"

//Kirchoff Migration function in psktmCPU.c
void ktmMigrationCPU(struct imageGrid* imageX,
                     struct imageGrid* imageY,
                     struct imageGrid* imageZ,
                     struct jobParams* config,
                     float* midX, float* midY,
                     float* offX, float* offY,
                     float* traces, float* slowness,
                     float* image);

Tells the preprocessor to include these headers; system functions etc.

Function prototype; must give arguments, their types and return type; implementation elsewhere

int main() {
    int IMAGE_SIZE = 10;
    float* image = (float*) malloc (IMAGE_SIZE*sizeof(float));
    printf("size of image = %i\n",IMAGE_SIZE);
    for (int i=0; i<IMAGE_SIZE; i++)
        printf("image point %i = %f\n",i,image[i]);
    free(image);
    return 0;
}

UNIX C Good Practice I

Use the three streams, with file descriptors 0, 1, 2 respectively; this allows assembly of operations into pipelines, and these data streams are 'cheap' to use

Only hand simple command line options to main() using argc, argv[]; in general we wish to handle short and long options (e.g., see the GNU coding standards) and the use of getopt_long() is preferable.
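A minimal sketch of that pattern (the --verbose/--size options are invented for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <getopt.h>

int main(int argc, char *argv[]) {
    int verbose = 0, size = 0, c;
    //long options mapped to short equivalents
    static struct option longopts[] = {
        { "verbose", no_argument,       NULL, 'v' },
        { "size",    required_argument, NULL, 's' },
        { NULL, 0, NULL, 0 }
    };
    while ((c = getopt_long(argc, argv, "vs:", longopts, NULL)) != -1) {
        switch (c) {
        case 'v': verbose = 1;         break;
        case 's': size = atoi(optarg); break;
        default:
            fprintf(stderr, "usage: %s [--verbose] [--size <n>]\n", argv[0]);
            return EXIT_FAILURE;
        }
    }
    if (verbose) printf("size = %d\n", size);
    return EXIT_SUCCESS;
}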

Utilize the environment variables of the host shell, particularly in setting runtime conditions in executed code via getenv() e.g., in Bash, set in the .bashrc config file or via the command line:

[wjb19@lionga scratch] $ export MY_STRING=hello

If your project/program requires a) sophisticated objects b) many developers c) would benefit from object oriented design principles, you should consider writing in C++ (although being a higher-level language it is harder to optimize)
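Returning to getenv(): a minimal sketch reading the MY_STRING variable exported above:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    //getenv() returns NULL if the variable is not set in the environment
    const char *s = getenv("MY_STRING");
    printf("MY_STRING = %s\n", s ? s : "(unset)");
    return 0;
}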

UNIX C Good Practice II

In high performance applications, avoid system calls e.g., read/write, where control is given over to the kernel and processes can be blocked until the resource is ready e.g., disk. IF system calls must be used, handle errors and report to stderr. IF temporary files must be written, use mkstemp, which sets permissions, followed by unlink; the file descriptor is closed by the kernel when the program exits and the file removed.
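A minimal sketch of the mkstemp/unlink pattern (the template path is invented):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    //template must end in XXXXXX; mkstemp fills it in and opens the file mode 0600
    char tmpl[] = "/tmp/fooXXXXXX";
    int fd = mkstemp(tmpl);
    if (fd == -1) { perror("mkstemp"); return EXIT_FAILURE; }
    //unlink now: the name disappears, the descriptor stays valid until close/exit
    unlink(tmpl);
    write(fd, "scratch data\n", 13);
    close(fd);
    return EXIT_SUCCESS;
}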

Use assert to test validity of function arguments, statements etc; will introduce a performance hit, but asserts can be removed at compile time with the NDEBUG macro (C standard)
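A brief sketch (the function and its arguments are invented for illustration):

#include <assert.h>
#include <stddef.h>

//compiling with -DNDEBUG removes the assert entirely
float firstSample(const float *traces, size_t n) {
    assert(traces != NULL && n > 0);
    return traces[0];
}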

(e"ug with gdb& pro#ile with gprof& valgrind0 target most e/pensive #unctions #or optimi0ation

Put common #unctions in/use li"raries wherever possi"le,,,,


wjb19@psu.edu

)e! HPC 1i"raries

B1.'/1.P.C)/'ca1.P.C) Original "asic an e/ten e linear alge"ra routines http://www,netli",org/ Intel 9ath )ernel 1i"rar! $9)1% implementation o# a"ove routines& w/ solvers& ##t etc http://so#tware,intel,com/en2us/articles/intel2m-l/ .9( Core 9ath 1i"rar! $.C91% (itto http:// eveloper,am ,com/li"raries/acml/pages/ e#ault,asp/ Open9PI Open source 9PI implementation http://www,open2mpi,org/ PE3'c (ata structures an routines #or parallel scienti#ic applications "ase on P(EPs http://www,mcs,anl,gov/petsc/petsc2as/ wjb19@psu.edu

UNIX C Compilation I

In general the creation and use of shared libraries (*.so) is preferable to static (*.a), for space reasons and ease of software updates

Program in modules and link separate objects

Use the -fPIC flag in shared library compilation; PIC == position independent, code in the shared object does not depend on the address/location at which it is loaded.

Use the make utility to manage builds (more next time)

Don't forget to update your PATH and LD_LIBRARY_PATH env vars w/ your binary executable path & any libraries you need/create for the application, respectively

UNIX C Compilation II

Remember in compilation steps to -I/set/header/paths and keep interface (in headers) separate from implementation as much as possible

Remember in linking steps for shared libs to: -L/set/path/to/library AND set flag -lmyLib, where /set/path/to/library/libmyLib.so must exist, otherwise you will have undefined references and/or 'can't find -lmyLib' etc; a build sketch follows

Compile with -Wall or similar and fix all warnings. Read the manual :)

Conclusions
High Performance Computing systems are an assembly of hardware and software working together, usually based on the UNIX OS; multiple compute nodes are connected together

The UNIX kernel is surrounded by a shell e.g., Bash; commands and constructs may be assembled into scripts

UNIX, associated utilities and user applications are traditionally written in high-level languages like C

HPC user applications may take advantage of shared or distributed memory compute models, or both

Regardless, good code minimizes I/O, keeps data resident in memory for as long as possible and minimizes communication between processes

User applications should take advantage of existing high performance libraries, and tools like gdb, gprof and valgrind

References

Dennis Ritchie, RIP - http://en.wikipedia.org/wiki/Dennis_Ritchie
Advanced bash scripting guide - http://tldp.org/LDP/abs/html/
Text processing w/ GAWK - http://www.gnu.org/s/gawk/manual/gawk.html
Advanced Linux programming - http://www.advancedlinuxprogramming.com/alp-folder/
Excellent optimization tips - http://www.lri.fr/~bastoul/local_copies/lee.html
GNU compiler collection documents - http://gcc.gnu.org/onlinedocs/
Original RISC design paper - http://www.eecs.berkeley.edu/Pubs/TechRpts/1982/CSD-82-106.pdf
C++ FAQ - http://www.parashift.com/c++-faq-lite/
VIM Wiki - http://vim.wikia.com/wiki/Vim_Tips_Wiki

Exercises

Take the supplied code and compile using gcc, creating executable foo.x; attempt to run as './foo.x'. The code has a segmentation fault, an error in memory allocation which is handled via the malloc function. Recompile with debug flag -g, run through gdb and correct the source of the segmentation fault. Load the valgrind module i.e., 'module load valgrind' and then run as 'valgrind ./foo.x'; this powerful profiling tool will help identify memory leaks, or memory on the heap* which has not been freed

Write a Bash script that stores your home directory file contents in an array and: uses sed to swap vowels (e.g., 'a' and 'e') in names; parses the array of names and returns only a single match, if it exists, else echo NO-MATCH

*heap == region of dynamically allocated memory

GDB Quick Start

Launch:

[wjb19@tesla1 scratch]$ gdb ./foo.x

Run w/ command line argument '100':

(gdb) run 100

Set breakpoint at line 10 in source file:

(gdb) b foo.c:10
Breakpoint 1 at 0x400594: file foo.c, line 10.
(gdb) run
Starting program: /gpfs/scratch/wjb19/foo.x
Breakpoint 1, main () at foo.c:22
22        int IMAGE_SIZE = 10;

Step to next instruction (issuing 'continue' will resume execution):

(gdb) step
23        float * image = (float*) malloc (IMAGE_SIZE*sizeof(float));

Print second value in array 'image':

(gdb) p image[2]
$4 = 0

Display full backtrace:

(gdb) bt full
#0 main () at foo.c:27
        i = 0
        IMAGE_SIZE = 10
        image = 0x601010

HPC Essentials
Part II : Elements of Parallelism

Bill Brouwer, Research Computing and Cyberinfrastructure (RCC), PSU

wjb19@psu.edu

Outline

Introduction: Motivation, HPC operations; Multiprocessors; Processes; Memory Digression: Virtual Memory, Cache; Threads: POSIX, OpenMP, Affinity

Motivation

The problems in science we seek to solve are becoming increasingly large, as we go down in scale (e.g., quantum chemistry) or up (e.g., astrophysics)

As a natural consequence, we seek both performance and scaling in our scientific applications

Therefore we want to increase floating point operations performed and memory bandwidth, and thus seek parallelization as we run out of resources using a single processor

We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:

1/((1-P) + P/N)

where P is the portion of application code we parallelize, and N is the number of processors i.e., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking
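A quick worked instance (illustrative numbers): with P = 0.9 and N = 16, the improvement is 1/(0.1 + 0.9/16) ~ 6.4; letting N grow without bound, the same code saturates at 1/(1-P) = 10, no matter how many processors are added.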

Motivation

Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors

[Figure: improvement factor (0-12) vs. number of processors (0-256), curves for P = 90%, 60%, 30% and 10%]

Nonetheless, for many applications we have a good chance of parallelizing the vast majority of the code...

Example : Kirchhoff Time Migration

KTM is a technique used widely in oil & gas exploration, providing images of the earth's interior, used to identify resources

Seismic trace data acquired over a 2D geometry is integrated to give an image of the earth's interior, using a Green's method

Input is generally 10^4 - 10^6 traces, 10^3 - 10^4 data points each, i.e., lots of data to process; the output image is also very large

This is an integral technique (i.e., summation, easy to parallelize), just one of many popular algorithms performed in HPC

[Equation figure: image point value = weighted sum of trace data over seismic space; x == image space, t == traveltime]

Common Operations in HPC

Integration - load/store, add & multiply e.g., transforms
Derivatives (Finite differences) - load/store, subtract & divide e.g., PDE
Linear Algebra - load/store, subtract/add/multiply/divide; chemistry & physics, solvers; sparse (classical physics) & dense (quantum)

Regardless of the operations performed, after compilation into machine code, when executed by the CPU, instructions are clocked through a pipeline into registers for execution

Instruction execution generally takes place in four steps, and multiple instruction groups are concurrent within the pipeline; execution rate is a direct function of the clock rate

Execution Pipeline

This is the most fine-grained form of parallelism; its efficiency is a strong function of branch prediction hardware*, or the prediction of which instruction in a program is the next to execute

At a similar level, present in more recent devices are so-called streaming SIMD extension (SSE) registers and associated compute hardware

[Diagram: instructions clocked through pipeline stages 1.Fetch 2.Decode 3.Execute 4.Write-back; across successive clock cycles instructions move from pending to executing to complete]

*assisted by compiler hints

SSE

Streaming SIMD (Single Instruction, Multiple Data) computation exploits special registers and instructions to increase computation many-fold in certain cases, since several data elements are operated on simultaneously

Each of 8 SSE registers (labeled xmm0 through xmm7) is 128 bits long, storing 4 x 32-bit floating-point numbers; the SSE2 and SSE3 specifications have expanded the allowed datatypes to include doubles, ints etc

[Diagram: a 128-bit xmm register holding packed floats float3 | float2 | float1 | float0, bit 127 down to bit 0]

Operations may be 'scalar' or 'packed' (i.e., vector), expressed using intrinsics in an __asm block within C code e.g., addps xmm0,xmm1 (dst operand, src operand)

One can either code the intrinsics explicitly, or rely on the compiler e.g., icc with optimization (-O3)
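As a sketch, the same packed add written with compiler intrinsics rather than an __asm block (compile with e.g., gcc -msse):

#include <stdio.h>
#include <xmmintrin.h>  //SSE intrinsics

int main(void) {
    float a[4] __attribute__((aligned(16))) = {  1.f,  2.f,  3.f,  4.f };
    float b[4] __attribute__((aligned(16))) = { 10.f, 20.f, 30.f, 40.f };
    float c[4] __attribute__((aligned(16)));
    __m128 va = _mm_load_ps(a);      //load 4 packed floats
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  //one packed add (addps)
    _mm_store_ps(c, vc);
    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}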

The next level up of parallelization is the multiprocessor...

Multiprocessor Overview

Multiprocessors or multiple core CPU's are becoming ubiquitous; better scaling (cf. Moore's law) but limited by contention for shared resources, especially memory

Most commonly we deal with Symmetric Multiprocessors (SMP), with unique cache and registers, as well as shared memory region(s); more on cache in a moment. Memory is not necessarily next to processors = Non-uniform Memory Access (NUMA); try to ensure memory access is as local to CPU core(s) as possible

[Diagram: CPU0 and CPU1, each with private registers and cache, connected to shared main memory]

The proc directory on UNIX machines is a special directory written and updated by the kernel, containing information on CPU (/proc/cpuinfo) and memory (/proc/meminfo)

The fundamental unit of work on the cores is a process...

Processes
Application processes are launched on the CPU by the kernel using the fork() system call; every process has a process ID pid, available on UNIX systems via the getpid() system call

The kernel manages many processes concurrently; all information required to run a process is contained in the process control block (PCB) data structure, containing (among other things): the pid; the address space; I/O information e.g., open files/streams; pointer to next PCB

Processes may spawn children using the fork() system call; children are initially a copy of the parent, but may take on different attributes via the exec() call; a minimal sketch follows
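A minimal sketch of the fork()/exec() pattern (the program run by the child, ls, is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();               //child starts as a copy of the parent
    if (pid == 0) {
        printf("child pid %d\n", (int)getpid());
        execlp("ls", "ls", "-lah", (char *)NULL);  //replace the child's image
        perror("execlp");             //only reached if exec fails
        exit(EXIT_FAILURE);
    } else if (pid > 0) {
        waitpid(pid, NULL, 0);        //parent waits for the child
        printf("parent pid %d done\n", (int)getpid());
    } else {
        perror("fork");
    }
    return 0;
}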


Processes
A child process takes the id of the parent (ppid), and additionally has a unique pid e.g., output from the ps command, describing itself:

[wjb19@tesla1 ~]$ ps -eHo "%P %p %c %t %C"
PPID   PID  COMMAND  ELAPSED  %CPU
12608  1719 sshd     01:07:54 0.0
1719   1724 sshd     01:07:49 0.0
1724   1725 bash     01:07:48 0.0
1725   1986 ps       00:00    0.0

During a context switch, the kernel will swap one process control block for another; context switches are detrimental to HPC and have one or more triggers, including: I/O requests; timer interrupts

Context switching is a very fine-grained form of scheduling; on compute clusters we also have coarse grained scheduling in the form of job scheduling software (more next time)

The unique address space from the perspective of the process is referred to as virtual memory

Virtual Memory

A running process is given memory by the kernel, referred to as virtual memory (VM); the address space does not correspond to physical memory address space

The Memory Management Unit (MMU) on the CPU translates between the two address spaces, for requests made between process and OS

Virtual Memory for every process has the same structure; the virtual address space is divided into units called pages

[Diagram: virtual address space layout; high address: environment variables, function arguments, stack; unused region; heap; low address: instructions]

The MMU is assisted in address translation by the Translation Lookaside Buffer (TLB), which stores page details in a cache

Cache is high speed memory immediately adjacent to the CPU and its registers, connected via bus(es)

Cache : Introduction

In HPC, we talk about problems being compute or memory bound

In the former case, we are limited by the rate at which instructions can be executed by the CPU; in the latter, we are limited by the rate at which data can be processed by the CPU. Both instructions and data are loaded into cache; cache memory is laid out in lines

Cache memory is intermediate in the overall hierarchy, lying between CPU registers and main memory. If the executing process requests an address corresponding to data or instructions in cache, we have a 'hit', else a 'miss', and a much slower retrieval of the instruction or data from main memory must take place

Cache : Introduction

Modern architectures have various levels of cache and divisions of responsibilities; we will follow the valgrind-cachegrind convention, from the manual:

... It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines. However, some modern machines have three levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and third-level caches. The reason for this choice is that the L3 cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this cache (e.g. traversing a matrix column-wise with the row length being a power of 2)

Cache Example

The distribution of data to cache levels is largely set by compiler, hardware and kernel; however the programmer is still responsible for the best data access patterns in his/her code possible. Use cachegrind to optimize data alignment & cache usage e.g.,

#include <stdlib.h>
#include <stdio.h>

int main(){
    int SIZE_X,SIZE_Y;
    SIZE_X=2048;
    SIZE_Y=2048;
    float * data = (float*) malloc(SIZE_X*SIZE_Y*sizeof(float));

    for (int i=0; i<SIZE_X; i++)
        for (int j=0; j<SIZE_Y; j++)
            data[j+SIZE_Y*i] = 10.0f * 3.14f;
            //bad data access
            //data[i+SIZE_Y*j] = 10.0f * 3.14f;

    free(data);
    return 0;
}

Cache : Bad Access

bill@bill-HP-EliteBook-6930p:~$ valgrind --tool=cachegrind ./foo.x
==3088== Cachegrind, a cache and branch-prediction profiler
==3088== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==3088== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==3088== Command: ./foo.x
==3088==
==3088== I refs:        50,503,275                                  (instructions)
==3088== I1  misses:           734
==3088== LLi misses:           733
==3088== I1  miss rate:       0.00%
==3088== LLi miss rate:       0.00%
==3088==
==3088== D refs:        33,617,678 (29,410,213 rd + 4,207,465 wr)   (data: READ ops + WRITE ops)
==3088== D1  misses:     4,197,161 (     2,335 rd + 4,194,826 wr)
==3088== LLd misses:     4,196,772 (     1,985 rd + 4,194,787 wr)
==3088== D1  miss rate:       12.4% (       0.0% +       99.6% )
==3088== LLd miss rate:       12.4% (       0.0% +       99.6% )
==3088==
==3088== LL refs:        4,197,895 (     3,069 rd + 4,194,826 wr)   (LL == lowest level)
==3088== LL misses:      4,197,505 (     2,718 rd + 4,194,787 wr)
==3088== LL miss rate:         4.9% (       0.0% +       99.6% )

Cache : Good Access

bill@bill-HP-EliteBook-6930p:~$ valgrind --tool=cachegrind ./foo.x
==4410== Cachegrind, a cache and branch-prediction profiler
==4410== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==4410== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==4410== Command: ./foo.x
==4410==
==4410== I refs:        50,503,275
==4410== I1  misses:           734
==4410== LLi misses:           733
==4410== I1  miss rate:       0.00%
==4410== LLi miss rate:       0.00%
==4410==
==4410== D refs:        33,617,678 (29,410,213 rd + 4,207,465 wr)
==4410== D1  misses:       265,002 (     2,335 rd +   262,667 wr)
==4410== LLd misses:       264,613 (     1,985 rd +   262,628 wr)
==4410== D1  miss rate:        0.7% (       0.0% +        6.2% )
==4410== LLd miss rate:        0.7% (       0.0% +        6.2% )
==4410==
==4410== LL refs:          265,736 (     3,069 rd +   262,667 wr)
==4410== LL misses:        265,346 (     2,718 rd +   262,628 wr)
==4410== LL miss rate:         0.3% (       0.0% +        6.2% )

Cache Performance

For large data problems, any speedup introduced by parallelization can easily be negated by poor cache utilization

In this case, memory bandwidth is an order of magnitude worse for problem size (2^14)^2 (cf. the earlier note on widely variable memory bandwidths; we have to work hard to approach peak)

In many cases we are limited also by random access patterns

[Figure: execution time (s, 0-12) vs. log2 SIZE_X (10-14), for the high-%-miss and low-%-miss access patterns]

Outline
Introduction: Motivation, Computational operations; Multiprocessors; Processes; Memory Digression: Virtual Memory, Cache; Threads: POSIX, OpenMP, Affinity

POSIX Threads I

A process may spawn one or more threads; on a multiprocessor, the OS can schedule these threads across a variety of cores, providing parallelism in the form of 'light-weight processes' (LWP)

Whereas a child process receives a copy of the parent's virtual memory and executes independently thereafter, a thread shares the memory of the parent including instructions, and also has private data

Using threads we perform shared memory processing (cf. distributed memory, next time)

We are at liberty to launch as many threads as we wish, although as you might expect, performance takes a hit as more threads are launched than can be scheduled simultaneously across available cores

POSIX Threads II

Pthreads refers to the POSIX standard, which is just a specification; implementations exist for various systems

Each pthread has: an ID; attributes: stack size, schedule information

Much like processes, we can monitor thread execution using utilities such as top and ps

The memory shared among threads must be used carefully in order to prevent race conditions, or threads seeing incorrect data during execution, due to more than one thread performing operations on said data, in an uncoordinated fashion

POSIX Threads III

Race conditions may be ameliorated through careful coding, but also through explicit constructs e.g., locks, whereby a single thread gains and relinquishes control; implies serialization and computational overhead (a minimal sketch follows)
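A minimal sketch of a lock (a pthread mutex) guarding a shared counter; without the lock the final value would be unpredictable:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

//each worker increments the shared counter under the lock
void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    //one thread at a time...
        counter++;                    //...updates the shared data
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  //always 200000 with the lock held
    return 0;
}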

Multi-threaded programs must also avoid deadlock, a highly undesirous state where one or more threads await resources, and in turn are unable to offer up resources required by others

Deadlocks can also be avoided through good coding, as well as the use of communication techniques based around semaphores, for example

Threads awaiting resources may sleep (context switch by kernel, slow, saves cycles) or busy wait (executes a while loop or similar checking a semaphore, fast, wastes cycles)

Pthreads Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;                              //global (shared) variable
void *worker(void *param);

int main(int argc, char *argv[]){     //main thread
    pthread_t tid;                    //thread id & attributes
    pthread_attr_t attr;
    if (argc!=2 || atoi(argv[1])<0){
        printf("usage : a.out <int value>, where int value > 0\n");
        return -1;
    }
    pthread_attr_init(&attr);
    pthread_create(&tid,&attr,worker,argv[1]);  //worker thread creation
    pthread_join(tid,NULL);                     //& join after completion
    printf("sum = %d\n",sum);
}

void * worker(void *total){
    int upper=atoi(total);            //local (private) variable
    sum = 0;
    for (int i=0; i<upper; i++)
        sum += i;
    pthread_exit(0);
}

Valgrind-helgrind output

[wjb19@hammer16 scratch]$ valgrind --tool=helgrind -v ./foo.x 100
==5185== Helgrind, a thread error detector
==5185== Copyright (C) 2007-2009, and GNU GPL'd, by OpenWorks LLP et al.
==5185== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==5185== Command: ./foo.x 100
==5185==
--5185-- Valgrind options:
--5185--    --tool=helgrind
--5185--    -v
--5185-- Contents of /proc/version:
--5185--   Linux version 2.6.18-274.7.1.el5 (mockbuild@x86-004.build.bos.redhat.com) (gcc version ...
--5185-- REDIR: 0x3a97e7c240 (memcpy) redirected to 0x4a09e3c (memcpy)
--5185-- REDIR: 0x3a97e79420 (index) redirected to 0x4a09bc9 (index)
--5185-- REDIR: 0x3a98a069a0 (pthread_create@@GLIBC_2.2.5) redirected to 0x4a0b2a5 (pthread_create@*)
--5185-- REDIR: 0x3a97e749e0 (calloc) redirected to 0x4a05942 (calloc)
--5185-- REDIR: 0x3a98a08ca0 (pthread_mutex_lock) redirected to 0x4a076c2 (pthread_mutex_lock)
--5185-- REDIR: 0x3a97e74dc0 (malloc) redirected to 0x4a0664a (malloc)
--5185-- REDIR: 0x3a98a0a020 (pthread_mutex_unlock) redirected to 0x4a07b66 (pthread_mutex_unlock)
--5185-- REDIR: 0x3a97e79b50 (strlen) redirected to 0x4a09cbb (strlen)
--5185-- REDIR: 0x3a98a07a10 (pthread_join) redirected to 0x4a07431 (pthread_join)
sum = 4950
==5185==
==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)
--5185-- used_suppression: 1 helgrind-glibc2X-101
--5185-- used_suppression: 1 helgrind-glibc2X-112
--5185-- used_suppression: 1 helgrind-glibc2X-102
==5185==
==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)

(Note the system calls establishing the thread i.e., there is a COST to create and destroy threads)

Pthreads: Race Condition

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;
void *worker(void *param);

int main(int argc, char *argv[]){
    pthread_t tid;
    pthread_attr_t attr;
    if (argc!=2 || atoi(argv[1])<0){
        printf("usage : a.out <int value>, where int value > 0\n");
        return -1;
    }
    pthread_attr_init(&attr);
    pthread_create(&tid,&attr,worker,argv[1]);
    //main thread works on the global variable as well,
    //without synchronization/coordination
    int upper=atoi(argv[1]);
    sum=0;
    for (int i=0; i<upper; i++)
        sum+=i;
    pthread_join(tid,NULL);
    printf("sum = %d\n",sum);
}

Helgrind output w/ race

[wjb19@hammer16 scratch]$ valgrind --tool=helgrind ./foo.x 100
==5384== Helgrind, a thread error detector
==5384== Copyright (C) 2007-2009, and GNU GPL'd, by OpenWorks LLP et al.
==5384== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==5384== Command: ./foo.x 100
==5384==
==5384== Thread #1 is the program's root thread
==5384==
==5384== Thread #2 was created
==5384==    at 0x3A97ED447E: clone (in /lib64/libc-2.5.so)
==5384==    by 0x3A98A06D87: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
==5384==    by 0x4A0B206: pthread_create_WRK (hg_intercepts.c:229)
==5384==    by 0x4A0B2AD: pthread_create@* (hg_intercepts.c:256)
==5384==    by 0x400748: main (fooThread2.c:18)
==5384==
==5384== Possible data race during write of size 4 at 0x600cdc by thread #1
==5384==    at 0x400764: main (fooThread2.c:20)
==5384==  This conflicts with a previous write of size 4 by thread #2
==5384==    at 0x4007E3: worker (fooThread2.c:31)
==5384==    by 0x4A0B330: mythread_wrapper (hg_intercepts.c:201)
==5384==    by 0x3A98A0673C: start_thread (in /lib64/libpthread-2.5.so)
==5384==    by 0x3A97ED44BC: clone (in /lib64/libc-2.5.so)
==5384==

(Build foo.x with debug on (-g) to map the error(s) to source file line(s))

Pthreads is a versatile albeit large and inherently complicated interface

We are primarily concerned with 'simply' dividing a workload among available cores; OpenMP proves much less unwieldy to use

OpenMP Introduction

OpenMP is a set of multi-platform/OS compiler directives, libraries and environment variables for readily creating multi-threaded applications

The OpenMP standard is managed by a review board, and is defined by a large number of hardware vendors

Applications written using OpenMP employ pragmas, or statements interpreted by the preprocessor (before compilation), representing functionality like fork & join that would take considerably more effort and care to implement otherwise

OpenMP pragmas or directives indicate parallel sections of code i.e., after compilation, at runtime, threads are each given a portion of work e.g., in this case, loop iterations will be divided evenly among running threads:

#pragma omp parallel for
for (int i=0; i<SIZE; i++)
    y[i]=x[i]*10.0f;

OpenMP Clauses I

The number of threads launched during parallel blocks may be set via function calls or by setting the OMP_NUM_THREADS environment variable

Data objects are generally shared by default (loop counters are private by default); a number of pragma clauses are available, which are valid for the scope of the parallel section e.g.,:
private
shared
firstprivate - initialize to value before parallel block
lastprivate - variable keeps value after parallel block
reduction - threadsafe way of combining data at the conclusion of a parallel block (a sketch follows after the synchronization clauses below)

Thread synchronization is implicit to parallel sections; there are a variety of clauses available for controlling this behavior also, including:
critical - one thread at a time works in this section e.g., in order to avoid a race (expensive, design your code to avoid at all costs)
atomic - safe memory updates performed using e.g., mutual exclusion (cost)
barrier - threads wait at this point for others to arrive
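A minimal sketch of the scoping and reduction clauses above (values invented; compile with e.g., gcc -fopenmp -std=c99):

#include <stdio.h>

int main(void) {
    const int N = 1000000;
    double sum = 0.0;
    //each thread accumulates a private partial sum; reduction(+:sum)
    //combines the partial sums threadsafely at the end of the block
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += 1.0 / (i + 1.0);
    printf("sum = %f\n", sum);
    return 0;
}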

OpenMP Clauses II

OpenMP has default thread scheduling behavior handled via the runtime library, which may be modified through use of the schedule(type,chunk) clause, with types:

static - loop iterations are divided among threads equally by default; specifying an integer for the parameter chunk will allocate a number of contiguous iterations to a thread
dynamic - total iterations form a pool, from which threads work on small contiguous subsets until all are complete, with subset size given again by chunk
guided - a large section of contiguous iterations is allocated to each thread dynamically; the section size decreases exponentially with each successive allocation, to a minimum size specified by chunk
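A brief sketch of the clause (illustrative values only):

#include <stdio.h>
#include <math.h>

int main(void) {
    double total = 0.0;
    //threads pull chunks of 64 contiguous iterations from a shared pool,
    //useful when per-iteration cost varies
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int k = 0; k < 100000; k++)
        total += sqrt((double)k);
    printf("total = %f\n", total);
    return 0;
}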


OpenMP Example : KTM

In our first attempt at parallelization shortly, we simply add an OpenMP pragma before the computational loops in the worker function:

#pragma omp parallel for
//loop over trace records
for (int k=0; k<config->traceNo; k++){
    //loop over imageX
    for(int i=0; i<Li; i++){
        tempC = (midX[k] - imageXX[i]-offX[k]) * (midX[k] - imageXX[i]-offX[k]);
        tempD = (midX[k] - imageXX[i]+offX[k]) * (midX[k] - imageXX[i]+offX[k]);
        //loop over imageY
        for(int j=0; j<Lj; j++){
            tempA = tempC + (midY[k] - imageYY[j]-offY[k]) * (midY[k] - imageYY[j]-offY[k]);
            tempB = tempD + (midY[k] - imageYY[j]+offY[k]) * (midY[k] - imageYY[j]+offY[k]);
            //loop over imageZ
            for (int l=0; l<Ll; l++){
                temp  = sqrtf(tauS[l] + tempA * slownessS[l]);
                temp += sqrtf(tauS[l] + tempB * slownessS[l]);
                timeIndex = (int) (temp / sRate);
                if ((timeIndex < config->tracePts) && (timeIndex > 0)){
                    image[i*Lj*Ll + j*Ll + l] += traces[timeIndex + k * config->tracePts] * temp * sqrtf(tauS[l] / temp);
                }
            } //imageZ
        } //imageY
    } //imageX
}//input trace records

OpenMP KTM Results

Scales well up to eight cores, then drops off; the SMP model has deficiencies due to a number of factors, including:

Coverage (Amdahl's law); as we increase processors, relative cost of the serial code portion increases
Hardware limitations
Locality...

[Figure: execution time (s, 0-5) vs. CPU cores (1, 2, 4, 8, 16)]

CPU Affinity (Intel*)

Recall that the OS schedules processes and threads using context switches; this can be detrimental = threads may resume on a different core, destroying locality

We can change this by restricting threads to execute on a subset of processors, by setting processor affinity

The simplest approach is to set the environment variable KMP_AFFINITY to determine the machine topology and assign threads to processors. Usage:

KMP_AFFINITY=[<modifier>]<type>[<permute>][<offset>]

*For GNU, the ~equivalent env var is GOMP_CPU_AFFINITY

CPU Affinity Settings

The modifier may take settings corresponding to granularity (with specifiers: fine, thread, and core), as well as a processor list (proclist={<proclist>}), verbose, warnings and others

The type settings refer to the nature of the affinity, and may take values:
compact - try to assign thread n+1 context as close as possible to n
disabled
explicit - force assignment of threads to processors in proclist
none - just return the topology w/ verbose modifier
scatter - distribute as evenly as possible

fine & thread refer to the same thing, namely that threads only resume in the same context; the core modifier implies that they may resume within a different context, but on the same physical core

CPU Affinity can affect application performance significantly and is worth tuning, based on your application and the machine topology...

CPU Topology Map

For any given computational node, we have several different physical devices (packages in sockets), comprised of cores (e.g., two here), which run one or two thread contexts

Without hyperthreading, there is only a single context per core i.e., the modifiers thread/fine and core are indistinguishable

[Diagram: node with packageA and packageB, each containing core0 and core1, each core holding thread contexts 0 and 1]

CPU Affinity Examples

Display machine topology map e.g., Hammer:

[wjb19@hammer16 scratch] $ export KMP_AFFINITY=verbose,none [wjb19@hammer16 scratch] $ ./psktm.x OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids. OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #156: KMP_AFFINITY: 12 available OS procs OMP: Info #157: KMP_AFFINITY: Uniform topology OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores) OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}


CPU Affinity Examples

Set affinity with compact setting, fine granularity:

[wjb19@hammer5 scratch]$ export KMP_AFFINITY=verbose,granularity=fine,compact [wjb19@hammer5 scratch]$ ./psktm.x OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids. OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #156: KMP_AFFINITY: 12 available OS procs OMP: Info #157: KMP_AFFINITY: Uniform topology OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores) OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map: OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 1 OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 8 OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 9 OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 10 OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1 OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2 OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 8 OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 9 OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 10 OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {2} OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {10} OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {6} OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {1} OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {9} OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5} OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {3} OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {11}


Conclusions
Scientific research is supported by computational scaling and performance, both provided by parallelism, limited to some extent by Amdahl's law

Parallelism has various levels of granularity; at the finest level is the instruction pipeline and vectorized registers e.g., SSE

The next level up in parallel granularity is the multiprocessor; we may run many concurrent threads using the pthreads API or the OpenMP standard for instance

Threads must be coded and handled with care, to avoid race and deadlock conditions

Performance is a strong function of cache utilization; benefits introduced through parallelization can easily be negated by sloppy use of memory bandwidth. Scaling across cores is limited by hardware, Amdahl's law but also locality; we have some control over the latter using KMP_AFFINITY for instance

References

Valgrind (buy the manual, worth every penny) - http://valgrind.org/
OpenMP - http://openmp.org/wp/
GNU OpenMP - http://gcc.gnu.org/projects/gomp/
Summary of OpenMP 3.0 C/C++ Syntax - http://openmp.org/mp-documents/OpenMP3.1-CCard.pdf
Summary of OpenMP 3.0 Fortran Syntax - http://www.openmp.org/mp-documents/OpenMP3.0-FortranCard.pdf
Nice SSE tutorial - http://neilkemp.us/src/sse_tutorial/sse_tutorial.html
Intel Nehalem - http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29
GNU Make - http://www.gnu.org/s/make/
Intel hyperthreading - http://en.wikipedia.org/wiki/Hyper-threading

Exercises

Take the supplied code and parallelize using an OpenMP pragma around the worker function. Create a makefile which builds the code; compare timings btwn serial & parallel by varying OMP_NUM_THREADS. Examine the effect of various settings for KMP_AFFINITY

Build w/ Confidence : make

#Makefile for basic Kirchhoff Time Migration example
#set compiler
CC=icc -openmp
#set build options
CFLAGS=-std=c99 -c
#main executable
all: psktm.x
#objects and dependencies
psktm.x: psktmCPU.o demoA.o
	$(CC) psktmCPU.o demoA.o -o psktm.x
psktmCPU.o: psktmCPU.c
	$(CC) $(CFLAGS) psktmCPU.c
demoA.o: demoA.c
	$(CC) $(CFLAGS) demoA.c
clean:
	rm -rf *o psktm.x

(indent with tab only!)

HPC Essentials
Part III : Message Passing Interface

Bill Brouwer, Research Computing and Cyberinfrastructure (RCC), PSU

wjb19@psu.edu

Outline
Motivation; Interprocess Communication: Signals, Sockets & Networks, procfs Digression; Message Passing Interface: Send/Receive Communication, Parallel Constructs, Grouping Data, Communicators & Topologies

Motivation

We saw last time that Amdahl's law implies an asymptotic limit to performance gains from parallelism, where the parallel P and serial code (1-P) portions have fixed relative cost

We looked at threads ('light-weight processes') and also saw that performance depends on a variety of things, including good cache utilization and affinity

For the problem size investigated, ultimately the limiting factor was disk I/O, and there was no sense going beyond a single compute node; in a machine with 16 cores or more, there is no point when P < 60%, should the process have sufficient memory

However, as we increase our problem size, the relative parallel/serial cost changes and P can approach 1

Motivation

In the limit as the number of processors N → ∞, we find the maximum performance improvement: 1/(1-P). It is helpful to see the 3dB points for this limit i.e., the number of processors N_1/2 required to achieve (1/2)·max = 1/(2(1-P)); equating with Amdahl's law and after some algebra: N_1/2 = P/(1-P)

[Figure: N_1/2 (0-300) vs. parallel code fraction P (0.90-0.99)]

Motivation

Points to note from the graph:
P ~ 0.90, we can benefit from ~20 cores
P ~ 0.99, we can benefit from a cluster size of ~256 cores
P → 1, we approach the 'embarrassingly parallel' limit
P → 1, performance improvement directly proportional to cores
P ~ 1 implies independent or batch processes

Quite aside from considerations of Amdahl's law, as the problem size grows, we may simply exceed the memory available on a single node

In this case, we must move to a distributed memory processing model/multiple nodes (unless P ~ 1 of course)

How do we determine P? - PROFILING

Profiling w/ Valgrind

[wjb19@lionxf scratch]$ valgrind --tool=callgrind ./psktm.x
[wjb19@lionxf scratch]$ callgrind_annotate --inclusive=yes callgrind.out.3853
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.3853' (creator: callgrind-3.5.0)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
L2 cache:
Timerange: Basic block 0 - 2628034011
Trigger: Program termination
Profiled target: ./psktm.x (PID 3853, part 1)
--------------------------------------------------------------------------------
20,043,133,545 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
20,043,133,545 ???:0x0000003128400a70 [/lib64/ld-2.5.so]
20,042,523,959 ???:0x0000000000401330 [/gpfs/scratch/wjb19/psktm.x]
20,042,522,144 ???:(below main) [/lib64/libc-2.5.so]
20,042,473,687 /gpfs/scratch/wjb19/demoA.c:main
20,042,473,687 demoA.c:main [/gpfs/scratch/wjb19/psktm.x]
19,934,044,644 psktmCPU.c:ktmMigrationCPU [/gpfs/scratch/wjb19/psktm.x]
19,934,044,644 /gpfs/scratch/wjb19/psktmCPU.c:ktmMigrationCPU
 6,359,083,826 ???:sqrtf [/gpfs/scratch/wjb19/psktm.x]
 4,402,442,574 ???:sqrtf.L [/gpfs/scratch/wjb19/psktm.x]
   104,966,265 demoA.c:fileSizeFourBytes [/gpfs/scratch/wjb19/psktm.x]

(The parallelizable worker function is 99.5% of total instructions executed)

If we wish to scale outside a single node, we must use some form of interprocess communication

Inter-Process Communication

There are a variety of ways for processes to exchange information, including: Memory (~last week); Files; Pipes (named/anonymous); Signals; Sockets; Message Passing. File I/O is too slow, and read/writes are liable to race conditions

Anonymous & named pipes are highly efficient but are FIFO (first in, first out) buffers, allowing only unidirectional communication, and between processes on the same node

Signals are a very limited form of communication, sent to the process after an interrupt by the kernel, and handled using a default handler or one specified using the signal() system call

Signals may come from a variety of sources e.g., segmentation fault (SIGSEGV), keyboard interrupt Ctrl-C (SIGINT) etc; a minimal handler sketch follows
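A minimal sketch of installing a handler with signal() (the handler body is invented for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>

//handler for Ctrl-C; only async-signal-safe calls (e.g., write) belong here
void handler(int sig) {
    const char msg[] = "caught SIGINT, exiting\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(EXIT_FAILURE);
}

int main(void) {
    signal(SIGINT, handler);
    for (;;) pause();   //sleep until a signal arrives
    return 0;
}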


Signals

strace is a powerful utility in UNIX which shows the interaction between a running process and the kernel in the form of system calls and signals; here, a partial output showing the mapping of UNIX signals to defaults with system call sigaction(), from ./psktm.x:

rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGILL, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGABRT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGFPE, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSEGV, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0

Signals are crude and restricted to local communication; to communicate remotely, we can establish a socket between processes, and communicate over the network

Sockets & Networks

Davies/Baran first devised packet switching, an efficient means of communication over a channel; a computer was conceived to realize their design and ARPANET went online Oct 1969 between UCLA and Stanford

TCP/IP became the communication protocol of ARPANET 1 Jan 1983, which was retired in 1990 and NSFNET established; university networks in the US and Europe joined

TCP/IP is just one of many protocols, which describes the format of data packets, and the nature of the communication; an analogous connection method is used by Infiniband networks in conjunction with Remote Direct Memory Access (RDMA)

User Datagram Protocol (UDP) is analogous to a connectionless method of communication used by Infiniband high performance networks

Sockets : UDP host example

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h> /* for close() for socket */
#include <stdlib.h>

int main(void)
{
    //creates an endpoint & returns file descriptor
    //uses IPv4 domain, datagram type, UDP transport
    int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);

    //socket address object (sa) and memory buffer
    struct sockaddr_in sa;
    char buffer[1024];
    ssize_t recsize;
    socklen_t fromlen;

    //specify same domain type, any input address and port 7654 to listen on
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = INADDR_ANY;
    sa.sin_port = htons(7654);
    fromlen = sizeof(sa);

Sockets : host example cont.

    //we bind an address (sa) to the socket using fd sock
    if (-1 == bind(sock,(struct sockaddr *)&sa, sizeof(sa)))
    {
        perror("error bind failed");
        close(sock);
        exit(EXIT_FAILURE);
    }

    for (;;)
    {
        //listen and dump buffer to stdout where applicable
        printf("recv test....\n");
        recsize = recvfrom(sock, (void *)buffer, 1024, 0, (struct sockaddr *)&sa, &fromlen);
        if (recsize < 0)
        {
            fprintf(stderr, "%s\n", strerror(errno));
            exit(EXIT_FAILURE);
        }
        printf("recsize: %zd\n", recsize);
        sleep(1);
        printf("datagram: %.*s\n", (int)recsize, buffer);
    }
}

Sockets : client example

int main(int argc, char *argv[])
{
    //create a buffer with character data
    int sock;
    struct sockaddr_in sa;
    int bytes_sent;
    char buffer[200];
    strcpy(buffer, "hello world!");

    //create a socket, same IP and transport as before, address of host 127.0.0.1
    sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (-1 == sock) /* if socket failed to initialize, exit */
    {
        printf("Error Creating Socket");
        exit(EXIT_FAILURE);
    }
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = inet_addr("127.0.0.1");
    sa.sin_port = htons(7654);

    bytes_sent = sendto(sock, buffer, strlen(buffer), 0,(struct sockaddr*)&sa, sizeof sa);
    if (bytes_sent < 0)
    {
        printf("Error sending packet: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    close(sock); /* close the socket */
    return 0;
}

You can monitor sockets by using the netstat facility, which takes its data from /proc/net

Outline
Motivation; Interprocess Communication: Signals, Sockets & Networks, procfs Digression; Message Passing: Send/Receive Communication, Parallel Constructs, Grouping Data, Communicators & Topologies

procfs

We mentioned the /proc directory previously in the context of cpu and memory information; it is frequently referred to as the proc filesystem or procfs

It is a veritable treasure trove of information, written periodically by the kernel, and is used by a variety of tools e.g., ps

Each running process is assigned a directory, whose name is the process id

Each directory contains text files and subdirectories with every detail of a running process, including context switching statistics, memory management, open file descriptors and much more

Much like the ptrace() system call, procfs also gives user applications the ability to directly manipulate running processes, given sufficient permission; you can explore that on your own :)

proc#s : e/amples

'ome o# the more use#ul #iles :


/proc/PID/cmdline : comman use to launch process /proc/PID/cwd : current wor-ing irector! /proc/PID/environ : environment varia"les #or the process /proc/PID/fd : irector! w/ s!m"olic lin- #or each open #ile escriptor eg,& streams /proc/PID/status : in#ormation inclu ing signals& state& memor! usage /proc/PID/maps : memor! map "etween virtual an ph!sical a resses

e.g., contents of the fd directory for running process ./psktm.x :

[wjb19@hammer1 fd]$ ls -lah
total 0
dr-x------ 2 wjb19 wjb19  0 Dec 7 12:13 .
dr-xr-xr-x 6 wjb19 wjb19  0 Dec 7 12:10 ..
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 0 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 1 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 2 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 3 -> /gpfs/scratch/wjb19/inputDataSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 4 -> /gpfs/scratch/wjb19/inputSrcXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 5 -> /gpfs/scratch/wjb19/inputSrcYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 6 -> /gpfs/scratch/wjb19/inputRecXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 7 -> /gpfs/scratch/wjb19/inputRecYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 8 -> /gpfs/scratch/wjb19/velModel.bin


procfs : status file extract


[wjb19@hammer1 30769]$ more status
Name:     psktm.x
State:    R (running)
SleepAVG: 0%
Tgid:     30769
Pid:      30769
PPid:     30687
TracerPid: 0
Uid:    2511 2511 2511 2511
Gid:    2530 2530 2530 2530
FDSize: 256
Groups: 2472 2530 3835 4933 5505 5732
VmPeak:   65520 kB
VmSize:   65520 kB
VmLck:        0 kB
VmHWM:    37016 kB
VmRSS:    37016 kB
VmData:   51072 kB
VmStk:       88 kB
VmExe:       64 kB
VmLib:     2944 kB
VmPTE:      164 kB
StaBrk: 1289a000 kB
Brk:    128bb000 kB
StaStk: 7fffbd0a0300 kB
Threads: 5
SigQ:   0/398335
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000180000000

Virtual memory usage (Vm* fields)

Signals (Sig* fields)

Outline
Motivation
Interprocess Communication
Signals
Sockets & Networks
procfs Digression
Message Passing Interface
Send/Receive Communication
Parallel Constructs
Grouping Data
Communicators & Topologies


Message Passing Interface (MPI)


Classical von Neumann machine has single instruction/data stream (SISD) = single process & memory

Multiple instruction, multiple data (MIMD) system = connected processes are asynchronous, generally distributed memory (may also be shared where processes reside on a single node)

MIMD processors are connected in some network topology; we don't have to worry about the details, MPI abstracts this away

MPI is a standard for parallel programming first established in 1994, updated occasionally by academics and industry

It comprises routines for point-to-point and collective communication, with bindings to C/C++ and Fortran

Depending on the underlying network fabric, communication may be TCP, or UDP-like in InfiniBand networks


MPI : Basic communication


9ultiple& istri"ute processes are spawne at initialiEation& each process assigne a uniAue ran! 8&4&,,,&p24

One may send information referencing process rank e.g.,:

MPI_Send(&x,             //buffer address
         1, MPI_FLOAT,   //count & datatype
         1,              //rank of receiving process
         0,              //tag
         MPI_COMM_WORLD);

This function has a receive analogue; both routines are blocking by default

Send/receive statements generally occur in the same code; processes execute the appropriate statement according to rank & code branch

Non2"loc-ing #unctions availa"le& allows communicating processes to continue with e/ecution where a"le


MPI : Requisite functions


Bare minimum = initialize, get rank for process, total processes, and finalize when done

MPI_Init(&argc, &argv);                   //Start up
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);  //My rank
MPI_Comm_size(MPI_COMM_WORLD, &p);        //No. processes
MPI_Finalize();                           //close up shop

MPI_COMM_WORLD is a communicator parameter, a collection of processes that can send messages to each other.

Messages are sent with tags to identify them, allowing specificity beyond using just a source/destination parameter


MPI : Datatypes
MPI_CHAR            signed char
MPI_SHORT           signed short int
MPI_INT             signed int
MPI_LONG            signed long int
MPI_UNSIGNED_CHAR   unsigned char
MPI_UNSIGNED_SHORT  unsigned short int
MPI_UNSIGNED        unsigned int
MPI_UNSIGNED_LONG   unsigned long int
MPI_FLOAT           float
MPI_DOUBLE          double
MPI_LONG_DOUBLE     long double
MPI_BYTE
MPI_PACKED


Minimal MPI example


#include "mpi.h" #include <stdio.h> int main(int argc, char *argv[]) { int rank, size, i; int buffer[10]; MPI_Status status; MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank(MPI_COMM_WORLD, &rank); if (rank > 0) { for (int i =0; i<10; i++) buffer[i]=i * rank; MPI_Send(buffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD); } else { for (int i=1; i<size; i++){ MPI_Recv(buffer, 10, MPI_INT, i, 0, MPI_COMM_WORLD, &status); printf("buffer element 0 : %i from proc : %i \n",buffer[0],i); } } MPI_Finalize(); return 0; }


MPI : Collective Communication

A communication pattern involving all processes in a communicator is a collective communication e.g., a broadcast

The same data is sent to every process in the communicator; more efficient than using multiple p2p routines, and optimized :

MPI_Bcast(void* message, int count, MPI_Datatype type, int root, MPI_Comm comm)

Sends a copy of the data in message from the root process to all processes in comm, a one-to-many map operation

Collective communication is at the heart of efficient parallel operations
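e.g., a sketch in which the root obtains a parameter and broadcasts it (read_input is a hypothetical routine; my_rank as before):

int n;
if (my_rank == 0)
    n = read_input();  //hypothetical; only root has the value so far

//every process makes the identical call; on ranks != 0, n is filled in
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);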


Parallel Operations : Reduction

(ata ma!"e gathere /re uce a#ter computation via :

MPI_Reduce(void* operand, void* result, int count, MPI_Datatype type, MPI_Op operator, int root, MPI_Comm comm)

Com"ines all operand, using operator an stores result on process root, in result . tree2structure re uce at all no es == MPI_Allreduce,ie,& ever! process in comm gets a cop! o# the result
[Diagram: tree reduction; processes 1, 2, 3, ..., p-1 combine values down to root]
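e.g., a sketch summing one float per process onto rank 0 (local_sum stands in for a real partial result):

float local_sum = my_rank * 1.0f;  //stand-in for a real per-process result
float total;

//combine local_sum across all processes with MPI_SUM; rank 0 receives total
MPI_Reduce(&local_sum, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

//MPI_Allreduce takes the same arguments minus root; every rank then gets total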


Reduction Ops
MPI_MAX     Maximum
MPI_MIN     Minimum
MPI_SUM     Sum
MPI_PROD    Product
MPI_LAND    Logical and
MPI_BAND    Bitwise and
MPI_LOR     Logical or
MPI_BOR     Bitwise or
MPI_LXOR    Logical XOR
MPI_BXOR    Bitwise XOR
MPI_MAXLOC  Max w/ location
MPI_MINLOC  Min w/ location


Parallel Operations : Scatter/Gather

Bulk transfers of many-to-one and one-to-many are accomplished by gather and scatter operations respectively

These operations form the kernel of matrix/vector operations for example; they are useful for distributing and reassembling arrays

[Diagram: gather collects x0, x1, x2, x3 from processes 0-3 into one array; scatter distributes a00, a01, a02, a03 from the root out to processes 0-3]

Scatter/Gather Syntax

MPI_Gather(void* send_data, int send_count, MPI_Datatype send_type,
           void* recv_data, int recv_count, MPI_Datatype recv_type,
           int root, MPI_Comm comm)

Collects data referenced by send_data from each process in comm and stores the data in process rank order on the process w/ rank root, in memory referenced by recv_data

MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_type,
            void* recv_data, int recv_count, MPI_Datatype recv_type,
            int root, MPI_Comm comm)

Splits data referenced by send_data on the process w/ rank root into segments of send_count elements each, w/ send_type, & distributes them in order to processes

For gather of the result to ALL processes = MPI_Allgather
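Putting the pair together, a sketch (N and buffer names assumed; the full array is only significant on the root; p and my_rank as before):

#define N 100
float chunk[N];      //this process's segment
float* all = NULL;   //root's full array
if (my_rank == 0)
    all = malloc(p * N * sizeof(float));  //assume root fills this with input

//root deals out N floats to each rank, in rank order
MPI_Scatter(all, N, MPI_FLOAT, chunk, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

//...each process works on chunk...

//root collects the segments back, again in rank order
MPI_Gather(chunk, N, MPI_FLOAT, all, N, MPI_FLOAT, 0, MPI_COMM_WORLD);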


Grouping Data I

Communication is expensive = bundle variables into a single message

We must define a derived type that can describe the heterogeneous contents of a message using type and displacement pairs

Several ways to build this MPI_Datatype e.g.,

MPI_Type_struct(int count,
                int block_lengths[],      //contains no. entries in each block
                MPI_Aint displacements[], //element offset from msg start
                MPI_Datatype typelist[],  //exactly that
                MPI_Datatype* new_mpi_t)  //a pointer to this new type

(MPI_Aint allows for addresses larger than an int)

A very general derived type, although arrays of structs must be constructed explicitly using other MPI commands

Simpler routines exist when data is less heterogeneous e.g., MPI_Type_vector, MPI_Type_contiguous, MPI_Type_indexed


Grouping Data II

Before these derived types can be used by a communication function, they must be committed with an MPI_Type_commit function call

In order for a message to be received, type signatures at send and receive must be compatible; if a collective communication, signatures must be identical

MPI_Pack & MPI_Unpack are useful when messages of heterogeneous data are infrequent, and the cost of constructing a derived type outweighs the benefit

These methods also allow buffering in user versus system memory, and the number of items transmitted is in the message itself

Grouped data allows for sophisticated objects; we can also create more finer-grained communication objects
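e.g., a sketch bundling an int and a float from a C struct into one derived type and committing it (MPI-1 style calls as above; MPI-2 renames these MPI_Type_create_struct and MPI_Get_address):

struct { int n; float x; } msg;
int block_lengths[2] = {1, 1};
MPI_Aint displacements[2], start, addr;
MPI_Datatype typelist[2] = {MPI_INT, MPI_FLOAT};
MPI_Datatype msg_mpi_t;

//displacements are measured from the start of the struct
MPI_Address(&msg, &start);
MPI_Address(&msg.n, &addr); displacements[0] = addr - start;
MPI_Address(&msg.x, &addr); displacements[1] = addr - start;

MPI_Type_struct(2, block_lengths, displacements, typelist, &msg_mpi_t);
MPI_Type_commit(&msg_mpi_t);  //required before use in communication
MPI_Bcast(&msg, 1, msg_mpi_t, 0, MPI_COMM_WORLD);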

Communicators

Process su"sets or groups e/pan communication "e!on simple p:p an "roa cast communication& to create :

Intra-communicators = communicate among one another and participate in collective communication, composed of :


an ordered collection of processes (group)
a context

Inter-communicators = communicate between different groups

Communicators/groups are opaque, internals not directly accessible; these objects are referenced by a handle


Communicators Cont.

Internal contents are manipulated by methods, much like private data in C++ class objects e.g.,

int MPI_Group_incl(MPI_Group old_group, int new_group_size,
                   int ranks_in_old_group[], MPI_Group* new_group)

creates a new_group from old_group, using ranks_in_old_group[] etc

int MPI_Comm_create(MPI_Comm old_comm, MPI_Group new_group,
                    MPI_Comm* new_comm)

creates a new communicator from the old, with context

MPI_Comm_group and MPI_Group_incl are local methods without communication; MPI_Comm_create is a collective communication implying synchronization i.e., to establish a single context

Multiple communicators may be created simultaneously using MPI_Comm_split
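e.g., a sketch building a communicator over the even ranks of MPI_COMM_WORLD (p assumed set via MPI_Comm_size):

MPI_Group world_group, even_group;
MPI_Comm even_comm;
int n_even = (p + 1) / 2;
int* ranks = malloc(n_even * sizeof(int));
for (int i = 0; i < n_even; i++)
    ranks[i] = 2 * i;  //ranks 0, 2, 4, ...

MPI_Comm_group(MPI_COMM_WORLD, &world_group);             //local call
MPI_Group_incl(world_group, n_even, ranks, &even_group);  //local call
//collective over MPI_COMM_WORLD; odd ranks receive MPI_COMM_NULL
MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);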

Topologies I

MPI allows one to associate different addressing schemes to processes within a group

This is a virtual versus real or physical topology, and is either a graph structure or a (Cartesian) grid; properties :

Dimensions, w/ size of each
Period of each
Option to have processes reordered optimally within grid

Method to establish Cartesian grid cart_comm :

int MPI_Cart_create(MPI_Comm old_comm, int number_of_dims,
                    int dim_sizes[], int wrap_around[],
                    int reorder, MPI_Comm* cart_comm)

old_comm is typically just MPI_COMM_WORLD created at init
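e.g., a sketch of a non-periodic 2-D grid (q assumed chosen so q*q == p), letting MPI reorder ranks for the hardware:

MPI_Comm cart_comm;
int dim_sizes[2]   = {q, q};  //q x q grid
int wrap_around[2] = {0, 0};  //no periodicity in either dimension
int reorder = 1;              //permit optimal rank reordering
MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, wrap_around, reorder, &cart_comm);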


Topologies II
cart_comm will contain the processes from old_comm with associated coordinates, available from MPI_Cart_coords :

int coordinates[2];
int my_grid_rank;
MPI_Comm_rank(cart_comm, &my_grid_rank);
MPI_Cart_coords(cart_comm, my_grid_rank, 2, coordinates);

The call to MPI_Comm_rank is necessary because of process rank reordering (optimization)

Processes in cart_comm are stored in row-major order

Can also partition into sub-grid(s) using MPI_Cart_sub e.g., for a row :

int free_coords[2];
MPI_Comm row_comm;    //new sub-grid
free_coords[0] = 0;   //bool; first coordinate fixed
free_coords[1] = 1;   //bool; second coordinate free
MPI_Cart_sub(cart_comm, free_coords, &row_comm);

Writing Parallel Code

Assuming we've profiled our code and decided to parallelize, equipped with MPI routines, we must decide whether to take a :

Domain Parallel (divide tasks, similar data) or

Data Parallel (divide data, similar tasks) approach

Data parallel in general scales much better, and implies lower communication overhead

Regardless, it is easiest to begin by selecting or designing data structures, and subsequently their distribution using a constructed topology or scatter/gather routines, for example

Program in modules, beginning with the easiest/essential functions (e.g., I/O), relegating 'hard' functionality to stubs initially

Time code sections, look at targets for optimization & redesign

Only concern yourself with the highest levels of abstraction germane to your problem; use parallel constructs wherever possible

A Note on the OSI Model


QePve "een pla!ing #ast an loose with a variet! o# communication entities0 soc-ets& networ-s& protocols li-e U(P& 3CP etc 3he Open '!stems Interconnection mo el separates these entities into 7 la!ers o# a"straction& each la!er provi ing services to the la!er imme iatel! a"ove (ata "ecomes increasingl! #ine graine going own #rom la!er 7 to 4

As application developers and/or scientists, we need only be concerned with layers 4 and above

Layer            Granularity  Function                          Example
7. Application   data         process accessing network         MPI
6. Presentation  data         encrypt/decrypt, data conversion  MPI
5. Session       data         management                        MPI
4. Transport     segments     reliability & flow control        IB verbs
3. Network       packets      path addressing                   Infiniband
2. Data Link     frames       framing                           Infiniband
1. Physical      bits         signals/electrical                Infiniband

Conclusions
We can determine the parallel portion of our code through profiling; as a rule of thumb, a code with P ~ 99% can effectively utilize about 256 cores, and a code with P ~ 90% about 20 cores
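The rule of thumb follows from Amdahl's law; a quick sketch of the arithmetic :

S(n) = 1 / ((1 - P) + P/n)   //speedup on n cores with parallel fraction P
S(max) = 1 / (1 - P)         //ceiling set by the serial fraction
P = 99% : ceiling 100x (S(256) ~ 72)
P = 90% : ceiling 10x  (S(20) ~ 6.9)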

When the parallel portion of code approaches 90%, we can justify going outside the multi-core node and using some form of inter-process communication (IPC)

IPC comes in a variety of forms e.g., sockets connected over networks, signals between processes on a single machine

The Message Passing Interface (MPI) abstracts away details of IPC used over networks, providing language bindings to C, Fortran etc

9PI has a num"er o# highl! optimiEe collective communication an parallel constructs& sophisticate means o# grouping o"Oects& as well as computational topologies

The OSI Model assigns various communication entities to one of seven layers; we need only be concerned with layer four and above


References
Pacheco's excellent MPI text: http://www.cs.usfca.edu/~peter/ppmpi/
Valgrind (no really, buy the manual): http://valgrind.org/
UNIX signals: http://www.cs.pitt.edu/~alanjawi/cs449/code/shell/UnixSignals.htm
OpenMPI: http://open-mpi.org/
procfs: http://www.kernel.org/doc/man-pages/online/pages/man5/proc.5.html
Excellent article on ptrace: http://linuxgazette.net/81/sandeep.html
Kernel vulnerabilities associated with ptrace/procfs: http://www.kb.cert.org/vuls/id/176888
MPI tutorials: http://www.mcs.anl.gov/research/projects/mpi/learning.html
Linux Gazette articles: http://linuxgazette.net
Open Systems Interconnection: http://en.wikipedia.org/wiki/OSI_Reference_Model
PBS reference: http://rcc.its.psu.edu/user_guides/system_utilities/pbs/

Exercises
Build the supplied MPI code via 'make -f Makefile_' and submit to a cluster of your choice using the following PBS script

Compare scaling with the OpenMP example from last week, by varying both nodes and procs per node (ppn); differences? (NUMA vs good locality w/ MPI)

Sketch how the gather function is collecting data, and the root process subsequently writes out to disk

Similarly, sketch how the image data exists in memory; are the two pictures commensurate? (hint: no :)

Re-assign image grid tiles to processes such that no file manipulation is required after program completion


Scheduling on clusters : PBS

Basic su"mission script foo.pbs: -l -l -l -o -e -V nodes=4:ppn=4 mem=1Gb walltime=00:20:00 /gpfs/home/wjb19/scratch/ktm_stdout.txt /gpfs/home/wjb19/scratch/ktm_stderr.txt

#PBS #PBS #PBS #PBS #PBS #PBS

cd /gpfs/home/wjb19/scratch module load openmpi/gnu/1.4.2 mpirun ./psktm.x 'u"mit to cluster : [wjb19@lionxf scratch]$ qsub foo.pbs

Check status : [wjb19@lionxf scratch]$ qstat -u wjb19

List nodes for running jobs : [wjb19@lionxf scratch]$ qstat -n


(e"ug 9PI applications

MPI programs are readily debugged using serial tools like gdb, once you have :

compiled with -g & submitted your job

the assigned node from a qstat query

the process id on that node i.e., ssh to the node and run gdb --pid=<pid_for_the_proc>

OpenMPI.org gives a useful code block to use in conjunction with this technique :


int i = 0;
char hostname[256];
gethostname(hostname, sizeof(hostname));
printf("PID %d on %s ready for attach\n", getpid(), hostname);
fflush(stdout);
while (0 == i)
    sleep(5);

Once attached and working with gdb, you can set some breakpoints and alter the parameter i (e.g., set var i=7) to move out of the loop
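A hypothetical session (node name and PID illustrative; i as in the block above):

[wjb19@node1 ~]$ gdb --pid=30769
(gdb) set var i=7
(gdb) continue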

