OpenMP API Overview
- The API is a set of compiler directives inserted into the source program (plus some library functions).
- Ideally, compiler directives do not affect sequential code.
- pragmas in C/C++; (special) comments in Fortran code.

API Semantics
- The master thread executes sequential code.
- Master and slaves execute parallel code.
- Note: very similar to the fork-join semantics of the Pthreads create/join primitives.

OpenMP Directives
- Parallelization directives: parallel region, parallel for.
- Data environment directives: shared, private, threadprivate, reduction, etc.
- Synchronization directives: barrier, critical.

General Rules about Directives
They always apply to the next statement, which must be a structured block.
Examples:

    #pragma omp …
    statement

    #pragma omp …
    {
        statement1;
        statement2;
        statement3;
    }

OpenMP Parallel Region

    #pragma omp parallel

- A team of threads is spawned at entry.
- Each thread executes the same code.
- Each thread waits at the end.
- Very similar to a number of create/joins with the same function in Pthreads.

Getting Threads to Do Different Things
- Through explicit thread identification (as in Pthreads).
- Through work-sharing directives.

Thread Identification

    int omp_get_thread_num()    /* gets the thread id */
    int omp_get_num_threads()   /* gets the total number of threads */

Example:

    #pragma omp parallel
    {
        if( !omp_get_thread_num() )
            master();
        else
            slave();
    }

Work-Sharing Directives
- Always occur within a parallel region directive.
- The two principal ones are parallel for and parallel sections.

OpenMP Parallel For

    #pragma omp parallel
    #pragma omp for
    for( … ) { … }

- Each thread executes a subset of the iterations.
- All threads wait at the end of the parallel for.

Multiple Work-Sharing Directives
May occur within a single parallel region:

    #pragma omp parallel
    {
        #pragma omp for
        for( ; ; ) { … }
        #pragma omp for
        for( ; ; ) { … }
    }

All threads wait at the end of the first for.

The NoWait Qualifier

    #pragma omp parallel
    {
        #pragma omp for nowait
        for( ; ; ) { … }
        #pragma omp for
        for( ; ; ) { … }
    }

Threads proceed to the second for without waiting.

Sections
- A parallel loop is an example of independent work units that are numbered.
- If you have a pre-determined number of independent work units, the sections construct is more appropriate.
- A sections construct can contain any number of section constructs, and each should be independent.
- They can be executed by any available thread in the current team.

Parallel Sections Directive
    #pragma omp parallel
    {
        #pragma omp sections
        {
            { … }
            #pragma omp section    /* a delimiter */
            { … }
            #pragma omp section
            { … }
            …
        }
    }

Example: y = f(x) + g(x)

    double y1, y2;
    #pragma omp sections
    {
        #pragma omp section
        y1 = f(x);
        #pragma omp section
        y2 = g(x);
    }
    y = y1 + y2;

Single Directive
- Limits the execution of a block to a single thread.
- Useful if the computation needs to be done only once.
- Helpful for initializing shared variables.

    #pragma omp parallel
    {
        #pragma omp single
        printf("Inside single section!\n");
        /* Try to get thread numbers using omp_get_thread_num() */
        // parallel code
    }

Exercise 1:
- Matrix multiplication using the sections primitive; observe the time taken.
- Matrix multiplication using serial programming; observe the time taken.

Exercise 2:

Data Environment Directives (2 of 2)
- Private
- Threadprivate
- Reduction

Private Variables

    #pragma omp parallel for private( list )
- Makes a private copy of each variable in the list for each thread.
- This and all further examples use parallel for, but the same applies to the other region and work-sharing directives.

Private Variables: Example (1 of 2)

    for( i=0; i<n; i++ ) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }

- Swaps the values in a and b.
- Loop-carried dependence on tmp.
- Easily fixed by privatizing tmp.

Private Variables: Example (2 of 2)

    #pragma omp parallel for private( tmp )
    for( i=0; i<n; i++ ) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }

- Removes the dependence on tmp.
- Would be more difficult to do in Pthreads.

Threadprivate
- Private variables are private on a parallel-region basis.
- Threadprivate variables are global variables that are private throughout the execution of the program.

    #pragma omp threadprivate( list )

Example:

    #pragma omp threadprivate( x )

- Requires a program change in Pthreads: an array of size p, accessed as x[pthread_self()].
- Costly if accessed frequently. Not cheap in OpenMP either.

Reduction Variables

    #pragma omp parallel for reduction( op:list )

- op is one of +, *, -, &, ^, |, &&, or ||.
- The variables in list must be used with this operator in the loop.
- The variables are automatically initialized to sensible values (the operator's identity).

Reduction Variables: Example

    #pragma omp parallel for reduction( +:sum )
    for( i=0; i<n; i++ )
        sum += a[i];
Sum is automatically initialized to zero.
Shared Variables: Example

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x;
        x = 2;
        #pragma omp parallel num_threads(2) shared(x)
        {
            if (omp_get_thread_num() == 0) {
                x = 5;
            } else {
                /* Print 1: the following read of x has a race */
                printf("1: Thread# %d: x = %d\n", omp_get_thread_num(), x);
            }
            #pragma omp barrier
            if (omp_get_thread_num() == 0) {
                /* Print 2 */
                printf("2: Thread# %d: x = %d\n", omp_get_thread_num(), x);
            } else {
                /* Print 3 */
                printf("3: Thread# %d: x = %d\n", omp_get_thread_num(), x);
            }
        }
        return 0;
    }

Synchronization Primitives

Critical

    #pragma omp critical [(name)]

- Implements critical sections, optionally by name.
- Similar to Pthreads mutex locks (name ~ lock).

Barrier

    #pragma omp barrier

- Implements a global barrier.

Reduction

    #pragma omp parallel for reduction(+:sum)
    for( i=0, sum=0; i<n; i++ )
        sum += a[i];
Dependence on sum is removed.
Exercises
1. Use OpenMP to implement a producer-consumer program in which some of the threads are producers and the others are consumers. The producers read text from a collection of files, one per producer, and insert lines of text into a single shared queue. The consumers take the lines of text and tokenize them; tokens are "words".
2. A search engine can be implemented using a farm of servers, each containing a subset of the data that can be searched. Assume this server farm has a single front end that interacts with clients who submit queries. Implement the server farm using the master-worker pattern.